This was fun, and I’ve been trying to remember to post about the things that are fun.
I was working with a customer using a purchased application from a relatively over-bearing vendor. You know the type: they have very specific (and somewhat out-of-date) demands for the Oracle database environment on which their application is hosted.
My customer complained that some reports were taking 15-20 minutes to complete, when everyone who knew what the reports were doing were certain that they shouldn’t take more than a few seconds at most. The search parameters to the report were very specific, the report didn’t do a whole lot, only supposed to bring back a few bits n pieces, not the whole kit n kaboodle.
So they extracted the text of the SQL statement from the application and emailed it. Using that text, I searched for the text in the V$SQL view and found the SQL_ID. Now we can have some fun…
First step is get some context of the problem, and that means looking at the recent “history” of executions (as displayed by my “sqlhistory.sql” script) which looked like…
+--------------------------------------------------------------------------------------------------+ |Plan HV Min Snap Max Snap Execs LIO/Exec PIO/Exec CPU/Exec Elapsd/Exec | +--------------------------------------------------------------------------------------------------+ |1834398219 30594 30595 1 15,244,622 0 1,277.87 1,633.08 | +--------------------------------------------------------------------------------------------------+ . ========== PHV = 1834398219========== First seen from "06/13/13 10:30:22" (snap #30594) Last seen from "06/13/13 11:00:23" (snap #30595) . Execs LIO PIO CPU Elapsed ===== === === === ======= 1 15,244,622 0 1,277.87 1,633.08 . Plan hash value: 1834398219 ------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | | | 124K(100)| | | 1 | COUNT STOPKEY | | | | | | | 2 | VIEW | | 1 | 95 | 124K (2)| 00:10:15 | | 3 | SORT ORDER BY STOPKEY | | 1 | 94 | 124K (2)| 00:10:15 | | 4 | NESTED LOOPS SEMI | | 1 | 94 | 124K (2)| 00:10:15 | | 5 | NESTED LOOPS SEMI | | 1 | 88 | 35 (0)| 00:00:01 | | 6 | NESTED LOOPS | | 1 | 64 | 33 (0)| 00:00:01 | | 7 | TABLE ACCESS BY INDEX ROWID| CC_CLAIM | 1 | 55 | 32 (0)| 00:00:01 | | 8 | INDEX RANGE SCAN | CC000000D0U3 | 11 | | 22 (0)| 00:00:01 | | 9 | INDEX UNIQUE SCAN | CC000000DKU3 | 1 | 9 | 1 (0)| 00:00:01 | | 10 | INDEX RANGE SCAN | CC000000D2U6 | 346K| 8110K| 2 (0)| 00:00:01 | | 11 | VIEW | VW_NSO_1 | 20 | 120 | 124K (2)| 00:10:15 | | 12 | NESTED LOOPS | | 20 | 680 | 124K (2)| 00:10:15 | | 13 | INDEX FULL SCAN | PERSONLASTFIRSTN | 20 | 380 | 124K (2)| 00:10:14 | | 14 | INDEX RANGE SCAN | CC000000D3U3 | 1 | 15 | 2 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------ Summary Execution Statistics Over Time Avg Avg Snapshot Avg LIO Avg PIO CPU (secs) Elapsed (secs) Time Execs Per Exec Per Exec Per Exec Per Exec ------------ -------- ------------------- ------------------- ------------------- ------------------- 13-JUN 10:30 1 3,362,907.00 0.00 284.33 358.08 13-JUN 11:00 0 11,881,715.00 0.00 993.54 1,275.00 -------- ------------------- ------------------- ------------------- ------------------- avg 7,622,311.00 0.00 638.94 816.54 sum 1
So only one execution which started sometime between 10:30am and 11:00am, and continued into the 11:00 – 11:30am period, racking up a little over 27 minutes (1633.08 seconds) of run-time.
Please note the plan_hash_value of 1834398219; my next task is to create a SQL*Plus script to provide a reproducible test case using the exact same execution plan ID. Because I’m a command-line kind of guy…
I copied the text of the SQL statement, and prettied it up manually, a process which is time-consuming but also allows me to learn how the query is written intimately. In order to find the data values to plug into the bind-variables used by the SQL statement, I queried the V$SQL_BIND_CAPTURE view as follows…
SQL> select child_number, name, position, datatype_string, value_string 2 from v$sql_bind_capture 3 where sql_id = '&&V_SQL_ID' 4 order by 1, 3; old 3: where sql_id = '&&V_SQL_ID' new 3: where sql_id = '4xmb4k9uhqn90' CHILD_NUMBER Table Name POSITION DATATYPE_STRING VALUE_STRING ------------ ---------- ---------- ------------------------------ ------------------------------ 0 :1 1 NUMBER 5220 :2 2 NUMBER 2 :3 3 NUMBER 1 :4 4 NUMBER 4 :5 5 NUMBER 6 :6 6 NUMBER 8 :7 7 NUMBER 15 :8 8 NUMBER 20 :9 9 NUMBER 32 :10 10 VARCHAR2(32) lena :11 11 NUMBER 10002 :12 12 NUMBER 92 :13 13 NUMBER 124 :14 14 NUMBER 4 :15 15 NUMBER 5221 :16 16 NUMBER 15
So, now I have a SQL*Plus script ready to try to replicate the plan hash value from real-life…
set echo on feedback on timing on linesize 300 tab off set autotrace traceonly explain /* AUTOTRACE TRACEONLY EXPLAIN just parses, doesn't execute the query */ col plan_plus_exp format a295 /* permits output from AUTOTRACE to display wide, and not wrap */ variable b1 number variable b2 number variable b3 number variable b4 number variable b5 number variable b6 number variable b7 number variable b8 number variable b9 number variable b10 varchar2(32) variable b11 number variable b12 number variable b13 number variable b14 number variable b15 number variable b16 number exec :b1 := 5220; /* plugging in the bind-variable values captured from V$SQL_BIND_CAPTURE */ exec :b2 := 2; exec :b3 := 1; exec :b4 := 4; exec :b5 := 6; exec :b6 := 8; exec :b7 := 15; exec :b8 := 20; exec :b9 := 32; exec :b10 := 'lena'; exec :b11 := 10002; exec :b12 := 92; exec :b13 := 124; exec :b14 := 4; exec :b15 := 5221; exec :b16 := 15; spool x10 (...SQL statement itself edited out, because it might be recognizable...) spool off set echo off feedback 6 timing off linesize 130 tab on autotrace off
So now, I can use the SQL*Plus AUTOTRACE TRACEONLY EXPLAIN command to determine if I have created a reproducible test case…
SQL> SELECT * 2 FROM (SELECT qRoot.ID col0, (...SQL statement body edited out, because it might be recognizable...) 38 WHERE rownum <= :b16; Elapsed: 00:00:00.00 Execution Plan ---------------------------------------------------------- Plan hash value: 605492508 <== not the same Plan Hash Value -------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 95 | 44 (5)| 00:00:01 | |* 1 | COUNT STOPKEY | | | | | | | 2 | VIEW | | 1 | 95 | 44 (5)| 00:00:01 | |* 3 | SORT ORDER BY STOPKEY | | 1 | 88 | 44 (5)| 00:00:01 | | 4 | NESTED LOOPS SEMI | | 1 | 88 | 30 (0)| 00:00:01 | | 5 | NESTED LOOPS | | 1 | 64 | 28 (0)| 00:00:01 | |* 6 | TABLE ACCESS BY INDEX ROWID | TABLE_A1 | 1 | 55 | 27 (0)| 00:00:01 | |* 7 | INDEX RANGE SCAN | XX000000D0U3 | 1 | | 26 (0)| 00:00:01 | | 8 | NESTED LOOPS | | | | | | | 9 | NESTED LOOPS | | 2 | 68 | 12 (0)| 00:00:01 | | 10 | TABLE ACCESS BY INDEX ROWID| TABLE_B2 | 7 | 105 | 7 (0)| 00:00:01 | |* 11 | INDEX RANGE SCAN | XX000000D3N2 | 7 | | 4 (0)| 00:00:01 | |* 12 | INDEX UNIQUE SCAN | SYS_C002390384 | 1 | | 1 (0)| 00:00:01 | |* 13 | TABLE ACCESS BY INDEX ROWID | TABLE_B3 | 1 | 19 | 2 (0)| 00:00:01 | |* 14 | INDEX UNIQUE SCAN | XX000000DKU3 | 1 | 9 | 1 (0)| 00:00:01 | |* 15 | INDEX RANGE SCAN | XX000000D2U6 | 508K| 11M| 2 (0)| 00:00:01 | --------------------------------------------------------------------------------------------------------
Sadly, it appeared that my SQL*Plus test script could not reproduce the exact same execution plan. I looked at the CBO statistics on the underlying tables (and their associated indexes and columns) to see if a stats-regather since the most recent executions had changed the input values provided to the Oracle cost-based optimizer (CBO), but that was not the case.
Another area to check are the session-level parameter values used when the SQL statement was parsed, which is stored in the V$SQL_OPTIMIZER_ENV view, and here I hit the jack-pot…
SQL> select name, value, isdefault from v$sql_optimizer_env where sql_id = '4xmb4k9uhqn90' order by name; NAME VALUE ISD ---------------------------------------- ------------------------- --- _always_semi_join nested_loops NO _enable_row_shipping false NO _fast_full_scan_enabled false NO _hash_join_enabled false NO _optimizer_cost_based_transformation off NO _optimizer_sortmerge_join_enabled false NO _pga_max_size 1677720 KB NO active_instance_count 1 YES bitmap_merge_area_size 1048576 YES cell_offload_compaction ADAPTIVE YES cell_offload_plan_display AUTO YES cell_offload_processing true YES cpu_count 12 YES cursor_sharing exact YES db_file_multiblock_read_count 128 YES hash_area_size 131072 YES is_recur_flags 0 YES optimizer_capture_sql_plan_baselines false YES optimizer_dynamic_sampling 2 YES optimizer_features_enable 11.1.0.7 YES optimizer_index_caching 0 YES optimizer_index_cost_adj 100 YES optimizer_mode all_rows YES optimizer_secure_view_merging true YES optimizer_use_invisible_indexes false YES optimizer_use_pending_statistics false YES optimizer_use_sql_plan_baselines false NO parallel_ddl_mode enabled YES parallel_degree 0 YES parallel_dml_mode disabled YES parallel_execution_enabled true YES parallel_query_default_dop 0 YES parallel_query_mode enabled YES parallel_threads_per_cpu 2 YES pga_aggregate_target 8388608 KB YES query_rewrite_enabled true YES query_rewrite_integrity enforced YES result_cache_mode MANUAL YES skip_unusable_indexes true YES sort_area_retained_size 0 YES sort_area_size 65536 YES star_transformation_enabled false YES statistics_level typical YES transaction_isolation_level read_commited YES workarea_size_policy auto YES 45 rows selected.
It turns out that the session running the SQL statement had eight (8) database parameters set by ALTER SESSION SET commands within the session; seven (7) of these parameters are undocumented “underscore” parameters. It turns out that two (2) of these eight changed parameters have the same value as the global database instance (i.e. “_pga_max_size” and “optimizer_use_sql_plan_baselines”), but six (6) of the underscore parameters do not. So, to understand whether these changes were the determining factor, I re-ran my test case with these six parameters changed to the values shown above at the session level…
Plan hash value: 1834398219
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 95 | 804K (1)| 01:06:08 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 95 | 804K (1)| 01:06:08 |
|* 3 | SORT ORDER BY STOPKEY | | 1 | 94 | 804K (1)| 01:06:08 |
| 4 | NESTED LOOPS SEMI | | 1 | 94 | 804K (1)| 01:06:08 |
| 5 | NESTED LOOPS SEMI | | 1 | 88 | 41 (0)| 00:00:01 |
| 6 | NESTED LOOPS | | 1 | 64 | 39 (0)| 00:00:01 |
|* 7 | TABLE ACCESS BY INDEX ROWID| CC_CLAIM | 1 | 55 | 38 (0)| 00:00:01 |
|* 8 | INDEX RANGE SCAN | CC000000D0U3 | 14 | | 26 (0)| 00:00:01 |
|* 9 | INDEX UNIQUE SCAN | CC000000DKU3 | 1 | 9 | 1 (0)| 00:00:01 |
|* 10 | INDEX RANGE SCAN | CC000000D2U6 | 508K| 11M| 2 (0)| 00:00:01 |
|* 11 | VIEW | VW_NSO_1 | 338K| 1983K| 804K (1)| 01:06:07 |
| 12 | NESTED LOOPS | | 338K| 10M| 804K (1)| 01:06:07 |
|* 13 | INDEX FULL SCAN | PERSONLASTFIRSTN | 338K| 6274K| 126K (4)| 00:10:27 |
|* 14 | INDEX RANGE SCAN | CC000000D3U3 | 1 | 15 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
…and that, my friends, is a direct hit. I hadn’t expected the solution to be this easy. Usually, a problem fights tooth-and-nail before yielding a solution.
So now the big question becomes: which is truly the better plan? The Oracle optimizer clearly state that plan hash value = 1834398219 (i.e. the existing execution plan) has an estimated cost of 894k and an estimated elapsed time of 1 hour, 6 minutes, 8 seconds whereas plan hash value = 605492508 (i.e. the execution plan using the default parameter values) has an estimated cost of 44 and an estimated elapsed time of 1 second.
And those estimates seem to be largely correct: when I run the new execution plan with the same data value inputs, it returns no rows but does so in 0.3 seconds. Subsequently changing the session parameters and re-running the same query took at least 5 mins before I killed it.
So the better execution for this query uses the default database level parameter settings, not the session-level settings.
Now the question becomes: where are the session parameters being changed?
For that answer, I ran another query, first finding a list of all the recently-executed SQL statements whose session-level parameters have been changed, and then scanning the Oracle Active Session History (ASH) data for client information…
SQL> select distinct module, action, client_id, sql_id 2 from v$active_session_history 3 where (sql_id,sql_child_number) in 4 (select sql_id, child_number 5 from v$sql_optimizer_env 6 where name = '_fast_full_scan_enabled' and value = 'false'); MODULE ACTION CLIENT_ID SQL_ID --------------- -------------------------------- --------------- ------------- AppCenterXX WebReq:SimpleXYZSearch aconnell 1rbd5kmgy25fz AppCenterXX WebReq:XYZProcess scalderon 6w7zt7rgjrsnv sqlplus@prod01 1wadmdpy49qg9 (...167 rows edited out for brevity...) AppCenterXX WebReq:SimpleXYZSearch jwilson 5239t67sg662v AppCenterXX WebReq:SimpleXYZSearch jlacaille d9qtthk5000sd 171 rows selected.
I highlighted my own SQL*Plus session in red typeface, which goes toward corroborating that it is ALTER SESSION SET commands that are changing the environment. One other thing to note is that all of the sessions connecting as “ClaimCenter” are connecting to the same account on Oracle, so I checked if there were any AFTER LOGON database triggers on that (or other) accounts in the database, and while there were AFTER LOGON triggers, they did not cause these parameters to be re-set.
So, I can only conclude that the application code itself is changing these eight (8) database parameters, and we need to find out why. It is quite possible that they provide some form of AFTER LOGON functionality within their application, permitting administrators to specify parameter changes if desired, but “seeded” with pre-configured values? In that case, we just have to locate the capability to change those and change them. The worst-case scenario would be having these ALTER SESSION SET commands hard-coded right into their application code somewhere; let’s keep fingers crossed that it is not that! 🙂
At any rate, it was fun finding such a clear example of the pitfalls of fiddling with undocumented “underscore” parameters. I’ll update this post if we find anything interesting in the application code.
SQL_ID and/or PLAN_HASH_VALUE are used almost globally across Oracle database to identify query. This is not a case for SPM mechanism using different identifiers – SQL_HANDLE for query and PLAN_NAME for execution plan. SQL_HANDLE contain hexadecimal representation of EXACT_MATCHING_SIGNATURE from V$SQL but we don’t have any information in DBA_SQL_PLAN_BASELINES about SQL_ID and PLAN_HASH_VALUE. After some research I found two ways to find SQL_ID matched to SQL baseline (if there is more please let me know).