Issue description

In a data warehousing database, a MERGE statement between a dimension table and a staging table was taking a long time and never completed, even after 4 hours. This MERGE was also bringing overall database performance to its knees through high CPU and other resource usage. The database is hosted on an Exadata machine, so expectations for the MERGE response time were high. The dimension table had about 500 million rows and the staging table about 8 million rows.

The MERGE statement in question is shown below:

MERGE /*+ PARALLEL(t1,8)*/
INTO    DIMENSION t1
USING   (
          SELECT /*+ PARALLEL(t3,8)*/ *
          FROM   stage t3
        ) t2
ON ( t1.dim_key = t2.dim_key )
WHEN MATCHED THEN
  UPDATE
  SET     t1.col1 = t2.col1,
      t1.col2 = t2.col2,
      t1.col3 = t2.col3,
      t1.col4 = t2.col4
WHEN NOT MATCHED THEN
  INSERT (
      t1.dim_key,
      t1.col1,
      t1.col2,
      t1.col3,
      t1.col4 )
  VALUES (
      t2.dim_key,
      t2.col1,
      t2.col2,
      t2.col3,
      t2.col4 )
/

Troubleshooting / Investigation

Since the MERGE statement ran for more than 4 hours, there is a good chance it was captured in AWR. That makes it easy to extract the execution plan and break down the response time by wait event at each plan operation using the ASH data stored in AWR. Although ASH data in AWR is a sample of samples (roughly one out of every ten ASH rows, i.e. one sample every 10 seconds), it is still well worth using in this investigation. We also had the option of asking the application team to re-execute the MERGE so a DBA could monitor it closely and collect all the necessary information. But before proceeding, it is better to gather as much historical background on this MERGE statement as possible by mining the AWR base tables.
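
As a starting point, a sketch along these lines (column names as found in DBA_HIST_SQLSTAT; output not shown) gives the per-snapshot execution statistics recorded for the statement:

SQL> select   snap_id, plan_hash_value, executions_delta,
              round(elapsed_time_delta/1e6) as elapsed_sec,
              round(cpu_time_delta/1e6)     as cpu_sec,
              buffer_gets_delta
     from     dba_hist_sqlstat
     where    sql_id = 'b4k66afzzbbmw'
     order by snap_id;

The execution plan captured in AWR can then be displayed as follows.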

SQL> select * from table(dbms_xplan.display_awr('b4k66afzzbbmw'));

Plan hash value: 452146253

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name                       | Rows  |   TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                       |                            |       |       |      |            |
|   1 |  PX COORDINATOR                       |                            |       |       |      |            |
|   2 |   PX SEND QC (RANDOM)                 | :TQ10002                   |     8M| Q1,02 | P->S | QC (RAND)  |
|   3 |    INDEX MAINTENANCE                  | DIMENSION                  |       | Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                        |                            |     8M| Q1,02 | PCWP |            |
|   5 |      PX SEND RANGE                    | :TQ10001                   |     8M| Q1,01 | P->P | RANGE      |
|   6 |       MERGE                           | DIMENSION                  |       | Q1,01 | PCWP |            |
|   7 |        PX RECEIVE                     |                            |     8M| Q1,01 | PCWP |            |
|   8 |         PX SEND HYBRID (ROWID PKEY)   | :TQ10000                   |     8M| Q1,00 | P->P | HYBRID (ROW|
|   9 |          VIEW                         |                            |       | Q1,00 | PCWP |            |
|  10 |           NESTED LOOPS OUTER          |                            |     8M| Q1,00 | PCWP |            |
|  11 |            PX BLOCK ITERATOR          |                            |       | Q1,00 | PCWC |            |
|  12 |             TABLE ACCESS STORAGE FULL | STAGE                      |     8M| Q1,00 | PCWP |            |
|  13 |            TABLE ACCESS BY INDEX ROWID| DIMENSION                  |     1 | Q1,00 | PCWP |            |
|  14 |             INDEX UNIQUE SCAN         | DIMENSION_PK               |     1 | Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

The output has been edited for brevity. According to the execution plan extracted from AWR, the DIMENSION and STAGE tables are joined with a NESTED LOOPS OUTER join; STAGE is the driving table, so for each row in STAGE the corresponding DIMENSION row is looked up through the primary key index DIMENSION_PK. The parallel operations in the plan confirm that both the DML and the SELECT part were executed in parallel. Now let's check how much time was spent on each execution of this MERGE. This is easy to find by segregating the data by sql_exec_id, which uniquely identifies each execution of the same SQL. Since the ASH data in AWR keeps only 1 out of every 10 samples, we multiply the sample count by 10 to approximate the elapsed time of each distinct execution.

SQL> select   sql_id, sql_plan_hash_value, sql_exec_id, round((count(*) * 10)/60/60, 2) as Hours
     from     dba_hist_active_sess_history
     where    sql_id = 'b4k66afzzbbmw'
     group by sql_id, sql_plan_hash_value, sql_exec_id
     order by 4;

SQL_ID        SQL_PLAN_HASH_VALUE SQL_EXEC_ID      Hours
------------- ------------------- ----------- ----------
b4k66afzzbbmw           452146253    67108864       5.31
b4k66afzzbbmw           452146253    83886080       9.92

This MERGE statement has indeed run for more than 4 hours. Now let's check where the MERGE spent most of its elapsed time by segregating the ASH data by plan operation and object for each distinct sql_exec_id.

SQL> l
  1  select dash.sql_exec_id,dash.SQL_PLAN_OPERATION||' '||dash.SQL_PLAN_OPTIONS as Operation,dsp.OBJECT_NAME,count(*)
  2  from dba_hist_active_sess_history dash, dba_hist_sql_plan dsp
  3  where dash.sql_id='b4k66afzzbbmw'
  4  and   dash.SQL_PLAN_LINE_ID=dsp.id
  5  and   dash.sql_id=dsp.sql_id
  6  and   dash.sql_plan_hash_value=dsp.plan_hash_value
  7* group by dash.sql_exec_id,dash.SQL_PLAN_OPERATION,dash.SQL_PLAN_OPTIONS,dsp.OBJECT_NAME order by dash.sql_exec_id,count(*)
SQL> /

SQL_EXEC_ID OPERATION                      OBJECT_NAME                       COUNT(*)
----------- ------------------------------ ------------------------------- ----------
   67108864 PX RECEIVE                                                              1
            INDEX UNIQUE SCAN              DIMENSION_PK                             3
            TABLE ACCESS STORAGE FULL      STAGE                                    4
            PX SEND HYBRID (ROWID PKEY)    :TQ10000                                 8
            TABLE ACCESS BY INDEX ROWID    DIMENSION                              218
            MERGE                          DIMENSION                             1677

   83886080 PX BLOCK ITERATOR                                                       1
            VIEW                                                                    1
            INDEX UNIQUE SCAN              DIMENSION_PK                             2
            TABLE ACCESS STORAGE FULL      STAGE                                    3
            PX RECEIVE                                                              8
            PX SEND HYBRID (ROWID PKEY)    :TQ10000                                11
            TABLE ACCESS BY INDEX ROWID    DIMENSION                              666
            MERGE                          DIMENSION                             2874

This distribution confirms that the index lookups on the dimension table consume most of the time and, as a result, drive the high CPU usage. So the question is why a NESTED LOOPS join is being used instead of a HASH JOIN when such large volumes of DIMENSION and STAGE data are being joined. This made us curious enough to have a DBA execute the MERGE statement manually and confirm the hypothesis that the optimizer was not considering a HASH JOIN.

Upon executing the MERGE statement manually, the execution plan shows a HASH JOIN instead of a NESTED LOOPS join, as shown below. So why does the MERGE pick the NESTED LOOPS plan when executed from the application? It is also obvious that a NESTED LOOPS join probing the primary key index of the roughly 485 million row DIMENSION table once per staged row is not practical, as it completely bypasses smart scan.
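
The plan below was pulled from the cursor cache right after the manual run; a call along these lines (the format string is illustrative) produces the E-Rows and memory columns shown:

SQL> select * from table(dbms_xplan.display_cursor('b4k66afzzbbmw', null, 'ALLSTATS LAST'));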

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name                       | E-Rows |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                        |                            |        |       |       |          |
|   1 |  PX COORDINATOR                        |                            |        |       |       |          |
|   2 |   PX SEND QC (RANDOM)                  | :TQ10004                   |   8433K|       |       |          |
|   3 |    INDEX MAINTENANCE                   | DIMENSION                  |        |       |       |          |
|   4 |     PX RECEIVE                         |                            |   8433K|       |       |          |
|   5 |      PX SEND RANGE                     | :TQ10003                   |   8433K|       |       |          |
|   6 |       MERGE                            | DIMENSION                  |        |   256K|   256K|          |
|   7 |        PX RECEIVE                      |                            |   8433K|       |       |          |
|   8 |         PX SEND HYBRID (ROWID PKEY)    | :TQ10002                   |   8433K|       |       |          |
|   9 |          VIEW                          |                            |        |       |       |          |
|* 10 |           HASH JOIN OUTER BUFFERED     |                            |   8433K|  2047M|    63M|          |
|  11 |            PX RECEIVE                  |                            |   8433K|       |       |          |
|  12 |             PX SEND HASH               | :TQ10000                   |   8433K|       |       |          |
|  13 |              PX BLOCK ITERATOR         |                            |   8433K|       |       |          |
|* 14 |               TABLE ACCESS STORAGE FULL| STAGE                      |   8433K|  1025K|  1025K|          |
|  15 |            PX RECEIVE                  |                            |    478M|       |       |          |
|  16 |             PX SEND HASH               | :TQ10001                   |    478M|       |       |          |
|  17 |              PX BLOCK ITERATOR         |                            |    478M|       |       |          |
|* 18 |               TABLE ACCESS STORAGE FULL| DIMENSION                  |    478M|  1025K|  1025K|          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  10 - access("T1"."DIM_KEY"="T3"."DIM_KEY")
  14 - storage(:Z>=:Z AND :Z<=:Z)
  18 - storage(:Z>=:Z AND :Z<=:Z)

After collecting all these details we asked the application team to provide all the information about the job that performs this MERGE, but in vain. So the only option was to re-execute the MERGE and gather the data needed for further troubleshooting. Our first step was to collect the session-level optimizer settings for that session by querying V$SES_OPTIMIZER_ENV, as shown below.

SQL> select * from V$SES_OPTIMIZER_ENV where sid=1283 and ISDEFAULT='NO';

       SID    ID NAME                             SQL_FEATURE      ISD VALUE        
---------- ----- -------------------------------- ---------------- --- -------------
      1283     4 parallel_dml_forced_dop          QKSFM_CBO        NO  default
      1283    13 parallel_threads_per_cpu         QKSFM_CBO        NO  1
      1283    25 _pga_max_size                    QKSFM_ALL        NO  838860 KB
      1283    36 parallel_dml_mode                QKSFM_ALL        NO  forced
      1283   247 parallel_min_time_threshold      QKSFM_PQ         NO  10
      1283   258 optimizer_use_invisible_indexes  QKSFM_INDEX      NO  true
      1283   268 cell_offload_plan_display        QKSFM_EXECUTION  NO  ALWAYS
      1283   274 parallel_max_degree              QKSFM_PQ         NO  128
      1283   291 _parallel_cluster_cache_policy   QKSFM_PQ         NO  adaptive
      1283   324 total_processor_group_count      QKSFM_ALL        NO  8

Bingo!! The session has forced DML parallelism. We immediately stopped the MERGE job and tried to reproduce the NESTED LOOPS execution plan by running the MERGE manually with FORCE parallel DML set at the session level, and got exactly the NESTED LOOPS plan used by the application-initiated MERGE job. With parallel DML merely enabled, without the FORCE option, the optimizer chose the HASH JOIN, as shown below.

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name                       | E-Rows |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | MERGE STATEMENT                        |                            |        |       |       |          |
|   1 |  PX COORDINATOR                        |                            |        |       |       |          |
|   2 |   PX SEND QC (RANDOM)                  | :TQ10004                   |   8433K|       |       |          |
|   3 |    INDEX MAINTENANCE                   | DIMENSION                  |        |       |       |          |
|   4 |     PX RECEIVE                         |                            |   8433K|       |       |          |
|   5 |      PX SEND RANGE                     | :TQ10003                   |   8433K|       |       |          |
|   6 |       MERGE                            | DIMENSION                  |        |   256K|   256K|          |
|   7 |        PX RECEIVE                      |                            |   8433K|       |       |          |
|   8 |         PX SEND HYBRID (ROWID PKEY)    | :TQ10002                   |   8433K|       |       |          |
|   9 |          VIEW                          |                            |        |       |       |          |
|* 10 |           HASH JOIN OUTER BUFFERED     |                            |   8433K|  2047M|    63M|          |
|  11 |            PX RECEIVE                  |                            |   8433K|       |       |          |
|  12 |             PX SEND HASH               | :TQ10000                   |   8433K|       |       |          |
|  13 |              PX BLOCK ITERATOR         |                            |   8433K|       |       |          |
|* 14 |               TABLE ACCESS STORAGE FULL| STAGE                      |   8433K|  1025K|  1025K|          |
|  15 |            PX RECEIVE                  |                            |    478M|       |       |          |
|  16 |             PX SEND HASH               | :TQ10001                   |    478M|       |       |          |
|  17 |              PX BLOCK ITERATOR         |                            |    478M|       |       |          |
|* 18 |               TABLE ACCESS STORAGE FULL| DIMENSION                  |    478M|  1025K|  1025K|          |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  10 - access("T1"."DIM_KEY"="T3"."DIM_KEY")
  14 - storage(:Z>=:Z AND :Z<=:Z)
  18 - storage(:Z>=:Z AND :Z<=:Z)

This is because, when parallelism is forced at the session level, the optimizer changes its cost calculation: full table scan cost estimates are scaled down in proportion to the degree of parallelism. The application team were not aware of this and assumed the FORCE option simply guarantees that the requested degree of parallelism is obeyed, which is not the case.
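
For reference, these are the two session-level settings that were compared; a minimal sketch (the SID in the verification query is the one reported above):

-- Setting in effect for the application job: forces parallel DML and alters optimizer costing
SQL> alter session force parallel dml;

-- Suggested setting: parallel DML is merely enabled, costing is left unchanged
SQL> alter session enable parallel dml;

-- Either way, the resulting parallel DML status of the session can be verified with:
SQL> select pdml_status, pddl_status, pq_status from v$session where sid = 1283;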

At this point we suggested using "alter session enable parallel dml" instead of "alter session force parallel dml" to resolve the issue, so that the optimizer would use the HASH JOIN and smart scan could kick in to improve the MERGE response time. But to our surprise the application team came back saying their job was still slow and had already been running for more than 30 minutes. Below is the SQL monitoring report for the currently running MERGE job.
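
The report below was taken while the job was still in flight; a report of this kind can be generated with a call such as the following (the identifiers are those shown in the report):

SQL> select dbms_sqltune.report_sql_monitor(
            sql_id      => 'b4k66afzzbbmw',
            sql_exec_id => 16777216,
            type        => 'TEXT') as report
     from   dual;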

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  DW (1092:45235)
 SQL ID              :  b4k66afzzbbmw
 SQL Execution ID    :  16777216
 Execution Started   :  02/19/2016 10:58:48
 First Refresh Time  :  02/19/2016 10:58:48
 Last Refresh Time   :  02/19/2016 11:38:52
 Duration            :  2612s
 Module/Action       :  sqlplus@db10 (TNS V1-/-
 Service             :  stage
 Program             :  sqlplus@db10 (TNS V1-

Global Stats
=========================================================================================================================
| Elapsed |   Cpu   |    IO    | Application | Concurrency | Cluster  | Buffer | Read | Read  | Write | Write |  Cell   |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Waits(s)   | Waits(s) |  Gets  | Reqs | Bytes | Reqs  | Bytes | Offload |
=========================================================================================================================
|   22271 |    7530 |    14261 |        0.00 |        0.33 |      480 |   119M |  24M | 274GB | 34515 |   8GB |  -9.89% |
=========================================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=2664465517)
==================================================================================================================================================
| Id    |                Operation                 |    Name     | Rows   |  Rows   | Cell  | Activity |             Activity Detail             |
|       |                                          |             |(Estim) |(Actual) |Offload|   (%)    |               (# samples)               |
==================================================================================================================================================
|     0 | MERGE STATEMENT                          |             |        |       0 |       |     0.04 | Cpu (7)                                 |
|     1 |   PX COORDINATOR                         |             |        |       0 |       |     0.01 | name-service call wait (1)              |
|     2 |    PX SEND QC (RANDOM)                   | :TQ10004    |     8M |         |       |          |                                         |
|     3 |     INDEX MAINTENANCE                    | DIMENSION   |        |         |       |          |                                         |
|     4 |      PX RECEIVE                          |             |     8M |         |       |          |                                         |
|     5 |       PX SEND RANGE                      | :TQ10003    |     8M |         |       |          |                                         |
|  -> 6 |        MERGE                             | DIMENSION   |        |       0 |       |     6.70 | gc buffer busy release (14)             |
|       |                                          |             |        |         |       |          | gc cr block 2-way (7)                   |
|       |                                          |             |        |         |       |          | gc cr block busy (6)                    |
|       |                                          |             |        |         |       |          | gc cr failure (1)                       |
|       |                                          |             |        |         |       |          | gc cr request (1)                       |
|       |                                          |             |        |         |       |          | gc current block 2-way (4)              |
|       |                                          |             |        |         |       |          | gc current block 3-way (39)             |
|       |                                          |             |        |         |       |          | gc current block busy (274)             |
|       |                                          |             |        |         |       |          | gc current grant busy (120)             |
|       |                                          |             |        |         |       |          | gc current request (5)                  |
|       |                                          |             |        |         |       |          | gc current retry (3)                    |
|       |                                          |             |        |         |       |          | gc remaster (4)                         |
|       |                                          |             |        |         |       |          | Cpu (465)                               |
|       |                                          |             |        |         |       |          | gcs drm freeze in enter server mode (2) |
|       |                                          |             |        |         |       |          | cell single block physical read (372)   |
|  -> 7 |         PX RECEIVE                       |             |     8M |      1M |       |     0.04 | Cpu (7)                                 |
|  -> 8 |          PX SEND HYBRID (ROWID PKEY)     | :TQ10002    |     8M |      1M |       |     0.06 | Cpu (9)                                 |
|       |                                          |             |        |         |       |          | PX Deq: reap credit (2)                 |
|  -> 9 |           VIEW                           |             |        |      1M |       |     0.01 | Cpu (1)                                 |
| -> 10 |            HASH JOIN OUTER BUFFERED      |             |     8M |      1M |       |     1.26 | Cpu (187)                               |
|       |                                          |             |        |         |       |          | direct path read temp (46)              |
|       |                                          |             |        |         |       |          | direct path write temp (14)             |
|    11 |             PX RECEIVE                   |             |     8M |      8M |       |     0.08 | Cpu (6)                                 |
|       |                                          |             |        |         |       |          | PX Deq: reap credit (9)                 |
|    12 |              PX SEND HASH                | :TQ10000    |     8M |      8M |       |     0.08 | Cpu (12)                                |
|       |                                          |             |        |         |       |          | PX Deq: reap credit (3)                 |
|    13 |               PX BLOCK ITERATOR          |             |     8M |      8M |       |          |                                         |
|    14 |                TABLE ACCESS STORAGE FULL | STAGE       |     8M |      8M |       |     0.11 | gc cr grant 2-way (1)                   |
|       |                                          |             |        |         |       |          | gc cr multi block request (1)           |
|       |                                          |             |        |         |       |          | gc current grant 2-way (1)              |
|       |                                          |             |        |         |       |          | Cpu (15)                                |
|       |                                          |             |        |         |       |          | cell list of blocks physical read (2)   |
|       |                                          |             |        |         |       |          | cell multiblock physical read (1)       |
|    15 |             PX RECEIVE                   |             |   478M |    478M |       |     2.66 | Cpu (415)                               |
|       |                                          |             |        |         |       |          | PX Deq: Table Q Normal (2)              |
|       |                                          |             |        |         |       |          | IPC send completion sync (1)            |
|       |                                          |             |        |         |       |          | PX Deq: reap credit (105)               |
|    16 |              PX SEND HASH                | :TQ10001    |   478M |    478M |       |     6.04 | Cpu (974)                               |
|       |                                          |             |        |         |       |          | PX Deq Credit: need buffer (1)          |
|       |                                          |             |        |         |       |          | PX Deq: reap credit (212)               |
|    17 |               PX BLOCK ITERATOR          |             |   478M |    478M |       |          |                                         |
|    18 |                TABLE ACCESS STORAGE FULL | DIMENSION   |   478M |    478M | -7.53%|    82.94 | gc current block 2-way (9)              |
|       |                                          |             |        |         |       |          | gc current block congested (1)          |
|       |                                          |             |        |         |       |          | Cpu (2676)                              |
|       |                                          |             |        |         |       |          | cell single block physical read (13607) |
|       |                                          |             |        |         |       |          | cell smart table scan (9)               |
==================================================================================================================================================

The SQL monitoring report above has been edited for brevity. According to it, more than 80% of the time is spent on Id 18, the full scan of the DIMENSION table. Although that full scan did use smart scan, it also waited on the event "cell single block physical read" about 13K times. This raises the question of why a full table scan has to perform 13K single block reads. These reads erode the benefit of smart scan, because the blocks are shipped back from the cell nodes to the compute nodes, which then burn CPU decompressing the HCC compression units; the DIMENSION table is HCC compressed with the Query High option.

Updates on an HCC table have an adverse impact: when a row in an HCC table is updated, the row is migrated within the segment into blocks managed with OLTP compression. As a result, even a single migrated row in an HCC compression unit causes that block to be accessed via "cell single block physical read". Based on this, our hypothesis was that the MERGE statement was suffering from rows migrated by the updates on the HCC-compressed DIMENSION table.

So how do we find out how many rows in a table have been migrated, and whether the migrated rows use OLTP compression? Let's build a demo that simulates row migration on an HCC table and study its behaviour.

Create a table with one million rows, compressed as HCC with the Query High option.

SQL> create table t1
     compress for query high
     as
     select
             trunc((rownum-1)/10)    n1,
             trunc((rownum-1)/10)    n2,
             rpad(rownum,200)        v1
     from
             dual
     connect by
             level <= 1e6
/
 
SQL> exec dbms_stats.gather_table_stats(user,'t1',method_opt => 'for all columns size 1');

To find the compression type used by each row in a block we can use the dbms_compression package, specifically its get_compression_type function. The function returns a small set of known values, whose meaning can be inferred from the constant definitions in the dbmscomp.sql script that creates the package.

SQL> !cat $ORACLE_HOME/rdbms/admin/dbmscomp.sql | grep ^COMP_[NFB]
COMP_NOCOMPRESS               CONSTANT NUMBER := 1;
COMP_FOR_OLTP                 CONSTANT NUMBER := 2;
COMP_FOR_QUERY_HIGH           CONSTANT NUMBER := 4;
COMP_FOR_QUERY_LOW            CONSTANT NUMBER := 8;
COMP_FOR_ARCHIVE_HIGH         CONSTANT NUMBER := 16;
COMP_FOR_ARCHIVE_LOW          CONSTANT NUMBER := 32;
COMP_BLOCK                    CONSTANT NUMBER := 64;

Using these values we can construct a SQL statement that decodes each value into something meaningful. To keep the demo stable and to check smart scan efficiency, we force smart scan by setting "_serial_direct_read" to ALWAYS at the session level.

SQL> alter session set "_serial_direct_read"=ALWAYS;

SQL> SELECT compression_type, n_rows as num_rows
     FROM  (
        SELECT compression_type, sum(cnt) n_rows
        FROM  (
           SELECT decode(dbms_compression.get_compression_type(USER, 'T1', min_rowid),
                         1, 'No Compression',
                         2, 'Basic/OLTP Compression',
                         4, 'HCC Query High',
                         8, 'HCC Query Low',
                        16, 'HCC Archive High',
                        32, 'HCC Archive Low',
                        64, 'From HCC to OLTP',
                            'Unknown Compression Level' ) AS compression_type, cnt
           FROM  ( SELECT   min(rowid) as min_rowid, count(*) as cnt
                   FROM     t1
                   GROUP BY dbms_rowid.rowid_block_number(rowid)
                 )
        )
        GROUP BY compression_type
     )
/

COMPRESSION_TYPE            NUM_ROWS
------------------------- ----------
HCC Query High               1000000

Elapsed: 00:00:06.19

This SQL ran for 6 seconds and reports that we have one million rows compressed with the HCC Query High option. Now let's update the rows and check the compression type of each row again.

SQL> update t1
        set v1 = rpad(rownum,190)
        where n1  between 1 and 100000
/
999990 rows updated.

SQL> commit;
Commit complete.

SQL> SELECT compression_type, n_rows as num_rows
     FROM  (
        SELECT compression_type, sum(cnt) n_rows
        FROM  (
           SELECT decode(dbms_compression.get_compression_type(USER, 'T1', min_rowid),
                         1, 'No Compression',
                         2, 'Basic/OLTP Compression',
                         4, 'HCC Query High',
                         8, 'HCC Query Low',
                        16, 'HCC Archive High',
                        32, 'HCC Archive Low',
                        64, 'From HCC to OLTP',
                            'Unknown Compression Level' ) AS compression_type, cnt
           FROM  ( SELECT   min(rowid) as min_rowid, count(*) as cnt
                   FROM     t1
                   GROUP BY dbms_rowid.rowid_block_number(rowid)
                 )
        )
        GROUP BY compression_type
     )
/

COMPRESSION_TYPE            NUM_ROWS
------------------------- ----------
No Compression                    70
HCC Query High                    10
From HCC to OLTP              999920

Elapsed: 00:03:35.92

Now the same SQL took 201 seconds to produce the report!! According to the report we now have 70 rows with no compression at all: after the update, rows are migrated from HCC blocks into OLTP-managed blocks, and migrated rows are not OLTP compressed until their block is full, so these 70 rows sit in blocks that are not yet full enough to be OLTP compressed. We also see 10 rows still reported as HCC Query High and 999,920 migrated rows compressed as OLTP. These migrated rows correspond to return value 64, which is not documented anywhere as meaning migrated rows but simply appears as COMP_BLOCK in the dbmscomp.sql script. Based on this test case we can treat return value 64 as rows migrated from HCC to OLTP.
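
As an additional cross-check, which was not part of the original investigation, migrated rows can in principle also be listed with ANALYZE ... LIST CHAINED ROWS, which reports both chained and migrated rows; this assumes the CHAINED_ROWS table has first been created with the utlchain.sql script shipped under $ORACLE_HOME/rdbms/admin, and on a narrow table like T1 (no row is wider than a block) any rows reported are effectively migrated rather than chained:

SQL> @?/rdbms/admin/utlchain.sql

SQL> analyze table t1 list chained rows into chained_rows;

SQL> select count(*) from chained_rows where table_name = 'T1';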

Since this reporting SQL took 201 seconds after the update, it is interesting to look at its SQL monitoring report to see where most of that time went.

SQL Monitoring Report
Global Stats
==============================================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency | Cluster  | PL/SQL  |  Other   | Fetch | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Time(s) | Waits(s) | Calls |  Gets  | Reqs | Bytes |
==============================================================================================================
|     201 |     198 |     0.99 |        0.00 |     0.28 |    4.29 |     1.26 |     2 |    55M | 1205 |  22MB |
==============================================================================================================

SQL Plan Monitoring Details (Plan Hash Value=4120307800)
===============================================================================================================================================================================
| Id |           Operation            | Name |  Rows   | Cost |   Time    | Start  | Execs |   Rows   | Read | Read  |  Mem  | Activity |           Activity Detail           |
|    |                                |      | (Estim) |      | Active(s) | Active |       | (Actual) | Reqs | Bytes | (Max) |   (%)    |             (# samples)             |
===============================================================================================================================================================================
|  0 | SELECT STATEMENT               |      |         |      |       193 |     +8 |     1 |        3 |      |       |       |          |                                     |
|  1 |   HASH GROUP BY                |      |      1M |  116 |       193 |     +8 |     1 |        3 |      |       |  962K |          |                                     |
|  2 |    VIEW                        |      |      1M |   77 |       193 |     +8 |     1 |    25001 |      |       |       |     1.00 | Cpu (2)                             |
|  3 |     HASH GROUP BY              |      |      1M |   77 |       201 |     +0 |     1 |    25001 |      |       |    3M |     3.00 | Cpu (6)                             |
|  4 |      TABLE ACCESS STORAGE FULL | T1   |      1M |   38 |         7 |     +2 |     1 |       1M |   89 | 776KB |       |     1.00 | cell single block physical read (8) |
===============================================================================================================================================================================

Even though we forced smart scan by setting "_serial_direct_read", this SQL performed single block physical reads because of the migrated rows. The resulting degradation is huge: from 6 seconds to 201 seconds after updating the rows. This shows that any access to a migrated row is done through "cell single block physical read", meaning Exadata falls back to block shipping mode, which in turn increases CPU usage on the compute nodes for decompressing the blocks and greatly reduces the efficiency of smart scan. There are various reasons why cellsrv reverts to block shipping instead of smart scan, and migrated rows in HCC-compressed tables is one of them: when cellsrv encounters a migrated row there is no guarantee that the block containing it resides on the same cell, so cellsrv simply ships the entire block back to the compute node to process the request.
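
One way to observe this effect from a session's own counters is to snapshot a few statistics before and after running the reporting query; a sketch, with statistic names as they appear in V$STATNAME:

SQL> select n.name, s.value
     from   v$mystat s, v$statname n
     where  s.statistic# = n.statistic#
     and    n.name in ('cell scans',
                       'physical reads',
                       'table fetch continued row',
                       'cell physical IO interconnect bytes returned by smart scan');

A run dominated by block shipping will typically show physical reads growing while the bytes returned by smart scan stay comparatively low.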

Conclusion

This article demonstrates the methodology used to tune the MERGE statement by drilling down into the issue. Initially it used the wrong join method because of the FORCE parallel setting at the session level; later, even after the plan was converted to a HASH JOIN by disabling FORCE parallelism, it still took longer than expected because migrated rows were accessed through "cell single block physical reads" while the remaining rows were read via smart scan. The adverse impact was huge, and choosing the right feature for the right requirement is crucial: using HCC compression on frequently updated tables is a poor design choice. HCC provides a tremendous compression ratio but is not suitable for tables subject to frequent conventional DML, especially updates.
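
If HCC has to be retained on such a table, one option worth evaluating (illustrative statements only; the maintenance window and index rebuilds make this a deliberate design decision) is to re-compress the segment after each batch of updates so that migrated rows are packed back into compression units:

-- Re-create the segment with HCC; the move invalidates indexes, so rebuild them afterwards
SQL> alter table dimension move compress for query high parallel 8;

SQL> alter index dimension_pk rebuild parallel 8;

Otherwise, keeping frequently updated dimensions OLTP compressed (or uncompressed) and reserving HCC for cold, read-mostly data avoids the migrated-row penalty described above.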