DaveWentzel.com            All Things Data

Predicate Implication

 

(work in progress)
 
Predicate implication...what's that?  It is the application of the rule of transitivity.  That's a mouthful...but it may help your query performance in certain cases.  The rule of transitivity in the database world can be best expressed with a simple example. 
 
Sometimes this is called "predicate pushdown".  Predicate pushdown literally means to apply a predicate as early and often in a given query.  It is a heuristic.  Think of SQL Server 2005 DMVs.  Most of the DMVs that show session level data (non-aggregated) will carry a session_id.  Assume you wanted to join all of these views together but you only wanted to see one session_id...assume 100.  You could apply a predicate like WHERE dmv_a.session_id = 50 and the query plan will naturally so you that the session_id = 50 restriction was applied to each DMV...not just dmv_a.  This is predicate pushdown.  The optimizer is applying the WHERE condition before as many JOINs are performed as possible.  But predicate pushdown does not always work, especially when the query is a little complex. 
 
A common question that is often asked is 'Should non-JOIN conditions be placed on the JOIN/ON clause or the WHERE clause?'
 
Assume the following predicate:
 
WHERE t1.col_a = 'foo'
AND t1.col_a = t1.col_b
 
and the following is clearly redundant and not necessary...or is it?
 
WHERE t1.col_a = 'foo'
AND t1.col_a = t1.col_b
AND t1.col_b = 'foo''
 
It could actually make a difference because you are giving the query optimizer a little more information that can help it generate a more efficient query execution plan.  Now SQL Server (or any RDBMS probably) can use its statistics to evaluate the new statement and maybe attack the query differently.  The optimizer may actually do this for you under some circumstances and you may actually see this in your graphical execution plan...redundant predicates showing up in multiple steps. 
 
The optimizer considers predicate implication for many different equality operations, especially on JOINs and correlated subqueries (including those using EXISTS and aggregate functions). 
 
Consider the following example...I used the old join syntax to make things maybe a little easier to visualize.
 
(work in progress)
SELECT 
FROM tbl_a, tbl_b, tbl_c
WHERE tbl_a.col1 <=
AND tbl_a.col2 = @v1  --variable1
AND tbl_a.joincol_a = @v2  --variable2
AND tbl_a.col3 = @v3  --variable3
AND tbl_a.joincol_a = tbl_b.joincol_a  --join tbl_a to tbl_b
AND tbl_a.joincol_a = tbl_c.joincol_a  --join tbl_a to tbl_c
AND tbl_b.joincol_b = tbl_c.joincol_b  --join tbl_b to tbl_c
--AND tbl_b.joincol_b = tbl_c.joincol_b  --join tbl_b
 
Imagine @v2 is assigned the value ABC.  It is assumed that tbl_c.joincol_a will also be equal to ABC.  The redundant predicate that we could add to this may enable the optimizer to assume a more optimal execution plan. 
 
Need to test this but if you look at the execution plan you should now see a redundant statement added to handle predicate implication.  As mentioned above, this occurs automatically because it is an equality operation.  Let's assume an example where the optimizer can't do this because we aren't using an equality operation. 
 
A simple example is using a range operator. 
 
SELECT
FROM tbl_a, tbl_b
WHERE tbl_a.joincol_a BETWEEN @v1 AND @v2  --range
AND tbl_a.col3 = @v3  --parameter
AND tbl_a.joincol_a = tbl_b.joincol_a  --join condition
AND NOT EXISTS (SELECT FROM tbl_c
                            WHERE tbl_a.joincol_a = tbl_c.joincol_a)
 
This example takes tbl_a and tbl_b and joins them and on joincol_a and I'm passing in a parameter to joincol_a as well.  Later we join to tbl_c using a NOT EXISTS clause.  If you look at the query plan it is highly likely that we are scanning tbl_c for joincol_a...but we shouldn't because we passed in a range to tbl_a.joincol_a.  Let's add that to the predicate manually and see what happens.  Remember that the query optimizer will not use predicate implication in this case (probably) because we are using a range operation.  Assume very large tables and you can see how adding the redundant clause will help performance. 
 
The queries are identical in meaning, but performance is radically better. 
 
You can see the problem with aggregate functions in subqueries as well. 
 
SELECT
FROM tbl_a, tbl_b
WHERE tbl_a.joincol_a = tbl_b.joincol_a
AND tbl_a.joincol_a = (
                                  SELECT MAX(joincol_a)
                                  FROM tbl_c
                                  WHERE tbl_a.joincol_a = tbl_c.joincol_a
                                 )
 
Here we see (really need to test this) a suboptimal plan because the optimizer seemingly can't determine how many rows the subquery should return.  In this case we should only need one row returned...so we can help the optimizer a little by specifying the following instead. 
 
SELECT
FROM tbl_a, tbl_b
WHERE tbl_a.joincol_a = tbl_b.joincol_a
AND tbl_a.joincol_a = (
                                  SELECT TOP 1 MAX(joincol_a)
                                  FROM tbl_c
                                  WHERE tbl_a.joincol_a = tbl_c.joincol_a
                                 )
 
 
This may not be considered predicate implication in the strictest sense, but we are giving the optimizer a hint as to how it should process the query.   

Add new comment