Carl's Oracle

Carl's thougts about the Oracle Database Server

2006/1/14

CBO - Cardinality oh Cardinaltiy!

Tags:
@ 01:28 PM (31 months, 18 days ago)

Hi, Applications do not live as data islands. The more an application integrates data of other applications of a company the better is. Even we live in the Year 2006 the most common way to import data from other systems is the flat file. Via SQL*LOADER the data is loaded into an Import table which is a table presentation of the flat file. Then the Processing of data starts to apply the imported data to the applications tables.

In this context was busy to load customer data in the current system. Customers very connected to each other in hierarchical structure. Only Main Customers were interesting for further processing.

So the main customer where these which fulfilled following predicate on IMP_CUSTDAT table:

 

MAINCUSTOMER = SUBCUSTOMER

Now first I explained the following SQL for further processing: 

SELECT Kun2.Maincustomer
  FROM Imp_Custdat Kun2
 WHERE Kun2.Maincustomer = Kun2.Subcustomer
;

 

SELECT STATEMENT, GOAL = CHOOSE                                      Cost=2177   Cardinality=1   Bytes=22  CPU cost=221398000  

 TABLE ACCESS FULL   Object owner=OASIS   Object name=IMP_CUSTDAT     Cost=2177   Cardinality=1   Bytes=22  CPU cost=221398000    

The strange thing is the estimated Cardinality of the FULL Table Scan. The optimizer believed that only one Compare between the MainCustomer Column and the SubCustomer Column would meet the Predicate. Cardinality in this context means not the Number rows of the table but the estimated number rows of the result set.

The problem was that in reality the effective cardinality was not 1 but 269546!

You can see this in the following SQL;

SQL> SELECT COUNT(1)

  2    FROM (SELECT Kun2.MAINCUSTOMER

  3            FROM Imp_CUSTDAT Kun2

  4           WHERE Kun2.MAINCUSTOMER = Kun2.SUBCUSTOMER);

 

  COUNT(1)

----------

    269546

 

SQL>

This is the most dramatically miss estimation of cardinality I every seen! How could CBO so mistaken!

 

Following I worked out :

 

CBO estimates it’s costs in for a relational database (You may never forget that). It believes that all important relational rules are kept. Specially the rule of normalization. The Relationship designed in the import table is flat and not expressible via a foreign key constraint. For the CBO MAINCUSTOMER and SUBCUSTOMER are totally independent columns. The dependency we may see via naming is not recognized by CBO at all.

 

Both NumberDistinctValues (NDV) of the Columns were near to the number rows of the table.

 

SQL>

SQL> SELECT Utcs.Column_Name,

  2         Utcs.Num_Distinct AS Ndv,

  3         Utcs.Density,

  4         Utcs.Sample_Size

  5    FROM User_Tab_Col_Statistics Utcs

  6   WHERE Column_Name IN (' SUBCUSTOMER ', ' MAINCUSTOMER ')

  7     AND Utcs.Table_Name = 'IMP_CUSTDAT';

 

COLUMN_NAME                           NDV    DENSITY SAMPLE_SIZE

------------------------------ ---------- ---------- -----------

SUBCUSTOMER                        301364 3,31824637      301373

MAINCUSTOMER                       269899 3,70508968      301373

 

So CBO estimated that the two independent Columns could met the join with just one row! You might think – what consequences would this have with the shown example? If the explain plan show the Cardinality of 1 or 269546 the time to retrieve the data would be the same. And this is indeed fact if the query addresses only the one table. But what will happen if we start to join?

  

Impact of Cardinality on joins ...

 

The way Oracle joins depends very strong on the expected Cardinality of the joined table.

 

Low Cardinality -> Nested Loops

 

SQL> SET LINESIZE 120

SQL> SET AUTOTRACE TRACEONLY

SQL> SET TIMING ON

SQL>

SQL>

SQL> SELECT Com.*

  2    FROM Company      Com,

  3         Imp_CUSTDAT Kun

  4   WHERE Com.CUSTID = Kun.MAINCUSTOMER

  5     AND Kun.MAINCUSTOMER = Kun.SUBCUSTOMER

  6  ;

 

269546 Zeilen ausgewõhlt.

 

Abgelaufen: 00:01:40.08

 

Ausf³hrungsplan

----------------------------------------------------------

   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=2179 Card=1 Bytes=339)

 

   1    0   NESTED LOOPS (Cost=2179 Card=1 Bytes=339)

   2    1     TABLE ACCESS (FULL) OF 'IMP_CUSTDAT' (Cost=2177 Card=1 Bytes=22) 

   3    1     TABLE ACCESS (BY INDEX ROWID) OF 'COMPANY' (Cost=3 Card=1 Bytes=317) 

   4    3       INDEX (UNIQUE SCAN) OF 'COMPANY_UK1' (UNIQUE) (Cost=2 Card=1)

 

Statistiken

----------------------------------------------------------

          0  recursive calls

          0  db block gets

     889654  consistent gets

      15076  physical reads

          0  redo size

   50383049  bytes sent via SQL*Net to client

     198158  bytes received via SQL*Net from client

      17971  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

     269546  rows processed

 

You see a lot of consistent gets here. The reason for this is that every row of the 269546 causes a unique key index lookup. Usually an index lookup needs a minimum of 3 the Logical I/O’s :

 

    1. the index root block
    2. the branch block
    3. the leave block

Estimated number logical reads (LIO’s) for index lookup is : 808638 = 269546 * 3

 

High Cardinality -> HASH JOIN

 

In the following example the true expected cardinatlity of the operation was set via the CARDINALITY Hint. This hint makes only sense if you now that the Cardinality is often the same which is sometimes given for external application data or for static data.

 

SQL> SELECT /*+ CARDINALITY ( KUN 300000 )*/

  2         Com.*

  3    FROM Company      Com,

  4         Imp_CUSTDAT Kun

  5   WHERE Com.CUSTID = Kun.MAINCUSTOMER

  6     AND Kun.MAINCUSTOMER = Kun.SUBCUSTOMER

  7  ;

 

269546 Zeilen ausgewõhlt.

 

Abgelaufen: 00:01:10.08

 

Ausf³hrungsplan

----------------------------------------------------------

   0      SELECT STATEMENT Optimizer=CHOOSE (Cost=9223 Card=300000 Bytes=101700000) 

   1    0   HASH JOIN (Cost=9223 Card=300000 Bytes=101700000)

   2    1     TABLE ACCESS (FULL) OF 'IMP_CUSTDAT' (Cost=2177 Card=300000 Bytes=6600000) 

   3    1     TABLE ACCESS (FULL) OF 'COMPANY' (Cost=2329 Card=272847 Bytes=86492499)

  

Statistiken

----------------------------------------------------------

          0  recursive calls

          0  db block gets

      29537  consistent gets

      14851  physical reads

          0  redo size

   53441829  bytes sent via SQL*Net to client

     198158  bytes received via SQL*Net from client

      17971  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

     269546  rows processed

 

SQL>

As you clearly see the Full table of both the tables combined with a HASH Join is the much cheaper operation!

 

Response time Comparison

 

Nested Loop join:  00:01:40.08/ LIO: 889654

Hash Join       :  00:01:10.08/ LIO:  29537

Even the has join scenario had less LIO’s it was only 30 seconds, about 30% faster. You may not forget that the system was a test instance with few connections. More sessions accessing the buffer cache would increase the Response Time of the Nested Loop Join dramatically.

 

Conclusion

 

If you look at explain try to concentrate on the cardinality. It has a very big impact on the design of the whole execution plan. Miss estimations can lead to SQL’s with dramatically increased response times.

 

Karl

 

PS.: Very valuable Content you will find in an article from Wolfgang Breitling : http://www.centrexcc.com/Fallacies%20of%20the%20Cost%20Based%20Optimizer.pdf