In this tutorial, I will describe the 3 Join Operators SQL Server uses and the reason why it comes to the individual selections by the Optimizer.
I will focus on the Inner Join and will stick to using 2 tables. We basically have 3 scenarios:
1) None of the tables have an index on the joining column
2) One of the two tables has an index on the joining column
3) Both columns have an index on the joining column
In the samples below I will use 2 tables from the AdventureWorks Database and modify indexes to demonstrate why the Optimizer decides to vary on the join operator. Following 2 tables will be used: Sales.SalesOrderHeader
Sales.SalesTerritory
Following query will be used throughout all samples:
,st.Name
,so.OrderDate
,so.DueDate
,so.ShipDate
,so.SalesOrderNumber
,so.PurchaseOrderNumber
,so.AccountNumber
,so.CurrencyRateID
,so.TotalDue
FROM Sales.SalesOrderHeader so
INNER JOIN Sales.SalesTerritory st ON so.TerritoryID = st.TerritoryID
NESTED LOOP JOIN
Tablename | Row Count | Indexed Column |
Sales.SalesOrderHeader | 31465 | Clustered Index on joining column |
Sales.SalesTerritory | 10 | No index |
Nested Loop Join will consist out of inner (lower object in Execution Plan) and outer (upper object in Execution Plan) inputs.
The outer input gets scanned row by row and an index seek is processed on the inner table (the larger indexed table).
This is exactly the advantage of Nested Loop Joins compared to Hash and Merge Joins. The outer, smaller input, is placed in memory and is compared with an indexed outer input.
Let's have a closer look at the Execution plan by placing the cursor above the object in question. The Clustered Index Scan shows us that there was 1 Execution returning 10 rows (all rows in that table).
The Clustered Index Seek provides information that 10 Executions took place returning 31465 rows. Therefor keep in mind that Nested Loop Joins will always work well with a small outer input and a indexed inner input.
MERGE JOIN
Tablename | Row Count | Indexed Column |
Sales.SalesOrderHeader | 31465 | Clustered Index on joining column |
Sales.SalesTerritory | 10 | Clustered Index on joining column |
The requirements for a Merge Join is that both tables are sorted on the joining column. Assuming you have an index on both columns, then the sort is already covered here.
If you force a Merge Join by passing a hint and an index is missing on one or both of the columns, the Sort Operator will be used here, which is quite expensive due to high memory and I/O resources.
Merge Join can be very fast when columns are ordered by the index. In this case matching rows are created while the sorted columns are compared for equality.
The big advantage with Merge Joins is that both outputs will be only executed once.
Although I have an index on both tables, the Optimizer will probably still go for a Nested Loop Join here, since the outer input is quite small. Merge Joins are perfect for indexed and larger tables.
HASH JOIN
Tablename | Row Count | Indexed Column |
Sales.SalesOrderHeader | 31465 | No index |
Sales.SalesTerritory | 10 | No Index |
Hash Joins are usually selected when there is no other option, meaning the data is not sorted and nonindexed. A Hash Join consists of 2 inputs.
1) Build Input: the upper object in the execution plan and usually the smaller table since it is saved on the system and to keep used memory low
2) Probe Input: the lower object in the execution plan
Hash joins are used by the Optimizer to process large and unsorted as well nonindexed values.
In most cases the Optimizer will make the right choice, but in cases where statistics are not up to date, it may be necessary to enforce the right option.
You can do this, by either adding a OPTION (x JOIN) at the end of the query
OPTION (LOOP JOIN)
OPTION (MERGE JOIN)
OPTION (HASH JOIN)
or add the hint in the inner join part of the query
INNER LOOP JOIN
INNER MERGE JOIN
INNER HASH JOIN
Keine Kommentare:
Kommentar veröffentlichen