This was fun, and I’ve been trying to remember to post about the things that are fun.
I was working with a customer using a purchased application from a relatively over-bearing vendor. You know the type: they have very specific (and somewhat out-of-date) demands for the Oracle database environment on which their application is hosted.
My customer complained that some reports were taking 15-20 minutes to complete, when everyone who knew what the reports were doing were certain that they shouldn’t take more than a few seconds at most. The search parameters to the report were very specific, the report didn’t do a whole lot, only supposed to bring back a few bits n pieces, not the whole kit n kaboodle.
So they extracted the text of the SQL statement from the application and emailed it. Using that text, I searched for the text in the V$SQL view and found the SQL_ID. Now we can have some fun…
First step is get some context of the problem, and that means looking at the recent “history” of executions (as displayed by my “sqlhistory.sql” script) which looked like…
+--------------------------------------------------------------------------------------------------+
|Plan HV Min Snap Max Snap Execs LIO/Exec PIO/Exec CPU/Exec Elapsd/Exec |
+--------------------------------------------------------------------------------------------------+
|1834398219 30594 30595 1 15,244,622 0 1,277.87 1,633.08 |
+--------------------------------------------------------------------------------------------------+
.
========== PHV = 1834398219==========
First seen from "06/13/13 10:30:22" (snap #30594)
Last seen from "06/13/13 11:00:23" (snap #30595)
.
Execs LIO PIO CPU Elapsed
===== === === === =======
1 15,244,622 0 1,277.87 1,633.08
.
Plan hash value: 1834398219
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 124K(100)| |
| 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 95 | 124K (2)| 00:10:15 |
| 3 | SORT ORDER BY STOPKEY | | 1 | 94 | 124K (2)| 00:10:15 |
| 4 | NESTED LOOPS SEMI | | 1 | 94 | 124K (2)| 00:10:15 |
| 5 | NESTED LOOPS SEMI | | 1 | 88 | 35 (0)| 00:00:01 |
| 6 | NESTED LOOPS | | 1 | 64 | 33 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID| CC_CLAIM | 1 | 55 | 32 (0)| 00:00:01 |
| 8 | INDEX RANGE SCAN | CC000000D0U3 | 11 | | 22 (0)| 00:00:01 |
| 9 | INDEX UNIQUE SCAN | CC000000DKU3 | 1 | 9 | 1 (0)| 00:00:01 |
| 10 | INDEX RANGE SCAN | CC000000D2U6 | 346K| 8110K| 2 (0)| 00:00:01 |
| 11 | VIEW | VW_NSO_1 | 20 | 120 | 124K (2)| 00:10:15 |
| 12 | NESTED LOOPS | | 20 | 680 | 124K (2)| 00:10:15 |
| 13 | INDEX FULL SCAN | PERSONLASTFIRSTN | 20 | 380 | 124K (2)| 00:10:14 |
| 14 | INDEX RANGE SCAN | CC000000D3U3 | 1 | 15 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Summary Execution Statistics Over Time
Avg Avg
Snapshot Avg LIO Avg PIO CPU (secs) Elapsed (secs)
Time Execs Per Exec Per Exec Per Exec Per Exec
------------ -------- ------------------- ------------------- ------------------- -------------------
13-JUN 10:30 1 3,362,907.00 0.00 284.33 358.08
13-JUN 11:00 0 11,881,715.00 0.00 993.54 1,275.00
-------- ------------------- ------------------- ------------------- -------------------
avg 7,622,311.00 0.00 638.94 816.54
sum 1
So only one execution which started sometime between 10:30am and 11:00am, and continued into the 11:00 – 11:30am period, racking up a little over 27 minutes (1633.08 seconds) of run-time.
Please note the plan_hash_value of 1834398219; my next task is to create a SQL*Plus script to provide a reproducible test case using the exact same execution plan ID. Because I’m a command-line kind of guy…
I copied the text of the SQL statement, and prettied it up manually, a process which is time-consuming but also allows me to learn how the query is written intimately. In order to find the data values to plug into the bind-variables used by the SQL statement, I queried the V$SQL_BIND_CAPTURE view as follows…
SQL> select child_number, name, position, datatype_string, value_string
2 from v$sql_bind_capture
3 where sql_id = '&&V_SQL_ID'
4 order by 1, 3;
old 3: where sql_id = '&&V_SQL_ID'
new 3: where sql_id = '4xmb4k9uhqn90'
CHILD_NUMBER Table Name POSITION DATATYPE_STRING VALUE_STRING
------------ ---------- ---------- ------------------------------ ------------------------------
0 :1 1 NUMBER 5220
:2 2 NUMBER 2
:3 3 NUMBER 1
:4 4 NUMBER 4
:5 5 NUMBER 6
:6 6 NUMBER 8
:7 7 NUMBER 15
:8 8 NUMBER 20
:9 9 NUMBER 32
:10 10 VARCHAR2(32) lena
:11 11 NUMBER 10002
:12 12 NUMBER 92
:13 13 NUMBER 124
:14 14 NUMBER 4
:15 15 NUMBER 5221
:16 16 NUMBER 15
So, now I have a SQL*Plus script ready to try to replicate the plan hash value from real-life…
set echo on feedback on timing on linesize 300 tab off
set autotrace traceonly explain /* AUTOTRACE TRACEONLY EXPLAIN just parses, doesn't execute the query */
col plan_plus_exp format a295 /* permits output from AUTOTRACE to display wide, and not wrap */
variable b1 number
variable b2 number
variable b3 number
variable b4 number
variable b5 number
variable b6 number
variable b7 number
variable b8 number
variable b9 number
variable b10 varchar2(32)
variable b11 number
variable b12 number
variable b13 number
variable b14 number
variable b15 number
variable b16 number
exec :b1 := 5220; /* plugging in the bind-variable values captured from V$SQL_BIND_CAPTURE */
exec :b2 := 2;
exec :b3 := 1;
exec :b4 := 4;
exec :b5 := 6;
exec :b6 := 8;
exec :b7 := 15;
exec :b8 := 20;
exec :b9 := 32;
exec :b10 := 'lena';
exec :b11 := 10002;
exec :b12 := 92;
exec :b13 := 124;
exec :b14 := 4;
exec :b15 := 5221;
exec :b16 := 15;
spool x10
(...SQL statement itself edited out, because it might be recognizable...)
spool off
set echo off feedback 6 timing off linesize 130 tab on autotrace off
So now, I can use the SQL*Plus AUTOTRACE TRACEONLY EXPLAIN command to determine if I have created a reproducible test case…
SQL> SELECT *
2 FROM (SELECT qRoot.ID col0,
(...SQL statement body edited out, because it might be recognizable...)
38 WHERE rownum <= :b16;
Elapsed: 00:00:00.00
Execution Plan
----------------------------------------------------------
Plan hash value: 605492508 <== not the same Plan Hash Value
--------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 95 | 44 (5)| 00:00:01 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 95 | 44 (5)| 00:00:01 |
|* 3 | SORT ORDER BY STOPKEY | | 1 | 88 | 44 (5)| 00:00:01 |
| 4 | NESTED LOOPS SEMI | | 1 | 88 | 30 (0)| 00:00:01 |
| 5 | NESTED LOOPS | | 1 | 64 | 28 (0)| 00:00:01 |
|* 6 | TABLE ACCESS BY INDEX ROWID | TABLE_A1 | 1 | 55 | 27 (0)| 00:00:01 |
|* 7 | INDEX RANGE SCAN | XX000000D0U3 | 1 | | 26 (0)| 00:00:01 |
| 8 | NESTED LOOPS | | | | | |
| 9 | NESTED LOOPS | | 2 | 68 | 12 (0)| 00:00:01 |
| 10 | TABLE ACCESS BY INDEX ROWID| TABLE_B2 | 7 | 105 | 7 (0)| 00:00:01 |
|* 11 | INDEX RANGE SCAN | XX000000D3N2 | 7 | | 4 (0)| 00:00:01 |
|* 12 | INDEX UNIQUE SCAN | SYS_C002390384 | 1 | | 1 (0)| 00:00:01 |
|* 13 | TABLE ACCESS BY INDEX ROWID | TABLE_B3 | 1 | 19 | 2 (0)| 00:00:01 |
|* 14 | INDEX UNIQUE SCAN | XX000000DKU3 | 1 | 9 | 1 (0)| 00:00:01 |
|* 15 | INDEX RANGE SCAN | XX000000D2U6 | 508K| 11M| 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------------
Sadly, it appeared that my SQL*Plus test script could not reproduce the exact same execution plan. I looked at the CBO statistics on the underlying tables (and their associated indexes and columns) to see if a stats-regather since the most recent executions had changed the input values provided to the Oracle cost-based optimizer (CBO), but that was not the case.
Another area to check are the session-level parameter values used when the SQL statement was parsed, which is stored in the V$SQL_OPTIMIZER_ENV view, and here I hit the jack-pot…
SQL> select name, value, isdefault from v$sql_optimizer_env where sql_id = '4xmb4k9uhqn90' order by name;
NAME VALUE ISD
---------------------------------------- ------------------------- ---
_always_semi_join nested_loops NO
_enable_row_shipping false NO
_fast_full_scan_enabled false NO
_hash_join_enabled false NO
_optimizer_cost_based_transformation off NO
_optimizer_sortmerge_join_enabled false NO
_pga_max_size 1677720 KB NO
active_instance_count 1 YES
bitmap_merge_area_size 1048576 YES
cell_offload_compaction ADAPTIVE YES
cell_offload_plan_display AUTO YES
cell_offload_processing true YES
cpu_count 12 YES
cursor_sharing exact YES
db_file_multiblock_read_count 128 YES
hash_area_size 131072 YES
is_recur_flags 0 YES
optimizer_capture_sql_plan_baselines false YES
optimizer_dynamic_sampling 2 YES
optimizer_features_enable 11.1.0.7 YES
optimizer_index_caching 0 YES
optimizer_index_cost_adj 100 YES
optimizer_mode all_rows YES
optimizer_secure_view_merging true YES
optimizer_use_invisible_indexes false YES
optimizer_use_pending_statistics false YES
optimizer_use_sql_plan_baselines false NO
parallel_ddl_mode enabled YES
parallel_degree 0 YES
parallel_dml_mode disabled YES
parallel_execution_enabled true YES
parallel_query_default_dop 0 YES
parallel_query_mode enabled YES
parallel_threads_per_cpu 2 YES
pga_aggregate_target 8388608 KB YES
query_rewrite_enabled true YES
query_rewrite_integrity enforced YES
result_cache_mode MANUAL YES
skip_unusable_indexes true YES
sort_area_retained_size 0 YES
sort_area_size 65536 YES
star_transformation_enabled false YES
statistics_level typical YES
transaction_isolation_level read_commited YES
workarea_size_policy auto YES
45 rows selected.
It turns out that the session running the SQL statement had eight (8) database parameters set by ALTER SESSION SET commands within the session; seven (7) of these parameters are undocumented “underscore” parameters. It turns out that two (2) of these eight changed parameters have the same value as the global database instance (i.e. “_pga_max_size” and “optimizer_use_sql_plan_baselines”), but six (6) of the underscore parameters do not. So, to understand whether these changes were the determining factor, I re-ran my test case with these six parameters changed to the values shown above at the session level…
Plan hash value: 1834398219
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 95 | 804K (1)| 01:06:08 |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | VIEW | | 1 | 95 | 804K (1)| 01:06:08 |
|* 3 | SORT ORDER BY STOPKEY | | 1 | 94 | 804K (1)| 01:06:08 |
| 4 | NESTED LOOPS SEMI | | 1 | 94 | 804K (1)| 01:06:08 |
| 5 | NESTED LOOPS SEMI | | 1 | 88 | 41 (0)| 00:00:01 |
| 6 | NESTED LOOPS | | 1 | 64 | 39 (0)| 00:00:01 |
|* 7 | TABLE ACCESS BY INDEX ROWID| CC_CLAIM | 1 | 55 | 38 (0)| 00:00:01 |
|* 8 | INDEX RANGE SCAN | CC000000D0U3 | 14 | | 26 (0)| 00:00:01 |
|* 9 | INDEX UNIQUE SCAN | CC000000DKU3 | 1 | 9 | 1 (0)| 00:00:01 |
|* 10 | INDEX RANGE SCAN | CC000000D2U6 | 508K| 11M| 2 (0)| 00:00:01 |
|* 11 | VIEW | VW_NSO_1 | 338K| 1983K| 804K (1)| 01:06:07 |
| 12 | NESTED LOOPS | | 338K| 10M| 804K (1)| 01:06:07 |
|* 13 | INDEX FULL SCAN | PERSONLASTFIRSTN | 338K| 6274K| 126K (4)| 00:10:27 |
|* 14 | INDEX RANGE SCAN | CC000000D3U3 | 1 | 15 | 2 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
…and that, my friends, is a direct hit. I hadn’t expected the solution to be this easy. Usually, a problem fights tooth-and-nail before yielding a solution.
So now the big question becomes: which is truly the better plan? The Oracle optimizer clearly state that plan hash value = 1834398219 (i.e. the existing execution plan) has an estimated cost of 894k and an estimated elapsed time of 1 hour, 6 minutes, 8 seconds whereas plan hash value = 605492508 (i.e. the execution plan using the default parameter values) has an estimated cost of 44 and an estimated elapsed time of 1 second.
And those estimates seem to be largely correct: when I run the new execution plan with the same data value inputs, it returns no rows but does so in 0.3 seconds. Subsequently changing the session parameters and re-running the same query took at least 5 mins before I killed it.
So the better execution for this query uses the default database level parameter settings, not the session-level settings.
Now the question becomes: where are the session parameters being changed?
For that answer, I ran another query, first finding a list of all the recently-executed SQL statements whose session-level parameters have been changed, and then scanning the Oracle Active Session History (ASH) data for client information…
SQL> select distinct module, action, client_id, sql_id
2 from v$active_session_history
3 where (sql_id,sql_child_number) in
4 (select sql_id, child_number
5 from v$sql_optimizer_env
6 where name = '_fast_full_scan_enabled' and value = 'false');
MODULE ACTION CLIENT_ID SQL_ID
--------------- -------------------------------- --------------- -------------
AppCenterXX WebReq:SimpleXYZSearch aconnell 1rbd5kmgy25fz
AppCenterXX WebReq:XYZProcess scalderon 6w7zt7rgjrsnv
sqlplus@prod01 1wadmdpy49qg9
(...167 rows edited out for brevity...)
AppCenterXX WebReq:SimpleXYZSearch jwilson 5239t67sg662v
AppCenterXX WebReq:SimpleXYZSearch jlacaille d9qtthk5000sd
171 rows selected.
I highlighted my own SQL*Plus session in red typeface, which goes toward corroborating that it is ALTER SESSION SET commands that are changing the environment. One other thing to note is that all of the sessions connecting as “ClaimCenter” are connecting to the same account on Oracle, so I checked if there were any AFTER LOGON database triggers on that (or other) accounts in the database, and while there were AFTER LOGON triggers, they did not cause these parameters to be re-set.
So, I can only conclude that the application code itself is changing these eight (8) database parameters, and we need to find out why. It is quite possible that they provide some form of AFTER LOGON functionality within their application, permitting administrators to specify parameter changes if desired, but “seeded” with pre-configured values? In that case, we just have to locate the capability to change those and change them. The worst-case scenario would be having these ALTER SESSION SET commands hard-coded right into their application code somewhere; let’s keep fingers crossed that it is not that! 🙂
At any rate, it was fun finding such a clear example of the pitfalls of fiddling with undocumented “underscore” parameters. I’ll update this post if we find anything interesting in the application code.