AWR Warehouse and Data Guard (Part Four)

After a brief interlude I wanted to follow up on something I noticed on John Hallas’s blog.

He mentioned that specifying a shared location for the extracted dump files did not work, and that he was waiting on a fix for bug 21826657. I checked that this bug is listed as fixed in database plugin version 13.2.2, and that my test hosts dga and dgb are both running this version.

To test whether the fix was effective, I first updated the extract properties table to specify a location shared between both hosts, /home/oracle/awrw, generated a couple of snapshots (the current primary is DGA), and then ran an extract.

SQL> update dbsnmp.caw_extract_properties
  2     set property_value = '/home/oracle/awrw'
  3     where property_name = 'dump_dir_1';

1 row updated.

SQL> commit;

Commit complete.
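
Before generating the snapshots, the change can be sanity-checked with a quick query against the same properties table (a minimal sketch; it uses only the table and column names from the update above):

-- confirm the extract dump directory now points at the shared location
select property_name, property_value
  from dbsnmp.caw_extract_properties
 where property_name like 'dump_dir%';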

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> exec dbsnmp.mgmt_caw_extract.run_job_now;

PL/SQL procedure successfully completed.

I double-checked that the files were extracted to /home/oracle/awrw as expected. You can see from the filename that this corresponds to snapshots with IDs between 42 and 44.

[oracle@dgb awrw]$ ls -alrt /home/oracle/awrw
total 12796
drwx------. 20 oracle oinstall     4096 Apr  8 12:21 ..
drwxr-xr-x.  1 oracle oinstall      136 Apr  9  2017 .
-rw-r-----.  1 oracle oinstall 13066240 Apr  9  2017 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.dmp
-rw-r--r--.  1 oracle oinstall    31656 Apr  9  2017 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.log
[oracle@dgb awrw]$

I then switched over to DGB and performed “Upload Snapshots Now”.
Reviewing the steps of the Transfer Task looked promising; below are the outputs of a couple of the transfer subtasks, first of all createArchive:

conn_url: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=dgb.localdomain)(PORT=1521)))(CONNECT_DATA=(SID=DG))), user: system, role: normal
Connected to database
Number of files ready for transfer: 1
Maximum total size of dump files: 5368709120
file: 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.dmp size(bytes): 13066240
Number of files picked up for transfer: 1
Updated status of dump files, ready to transfer
Creating archive file using [/app/oracle/agent/agent_13.2.0.0.0/bin/zip -v 4CA58C2B03FF5F03E0530201A8C0052E.zip 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.dmp]
  adding: 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.dmp .	(in=13066240) (out=1116898) (deflated 91%)
total bytes=13066240, compressed=1116898 -> 91% savings
Created archive 4CA58C2B03FF5F03E0530201A8C0052E.zip
Disconnected from database

Next prepareDestinationFiles:

Extracting dump files from /app/oracle/agent/agent_inst/awr_t/4CA58C2B03FF5F03E0530201A8C0052E.zip
Extracted 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.dmp
extract complete
Deleted archive 4CA58C2B03FF5F03E0530201A8C0052E.zip
conn_url: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.2)(PORT=1521)))(CONNECT_DATA=(SID=OMR))), user: sys, role: sysdba
Connected to database
dumpfile: 1_4CB89422459F0DF3E0530401A8C065D5_1852644603_42_44.dmp, emid: 1, dbid: 1852644603, bid: 42, eid: 44, size: 13066240
source target: DGB, type: oracle_database
Added dump files to the import queue
Disconnected from database

The output of the loadAWR task reveals that 3 snapshots have been successfully loaded to AWRW:

conn_url: (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.1.2)(PORT=1521)))(CONNECT_DATA=(SID=OMR))), user: sys, role: sysdba
Connected to database
Loaded 3 snapshots in the AWR Warehouse.
Disconnected from database

Reviewing the load metadata, you can see that the snapshots that were extracted from DGA have been successfully loaded from DGB.

SQL> select begin_snap_id, end_snap_id, target_name, status from dbsnmp.caw_load_metadata order by dump_id;

BEGIN_SNAP_ID END_SNAP_ID TARGET_NAM     STATUS
------------- ----------- ---------- ----------
           12          38 DGA             3
           39          41 DGB             3
           42          44 DGB             3
SQL>

So it seems that it is now OK to update the location that the dump files are extracted to. If this is a shared location accessible by both hosts in a Data Guard configuration, then, as we have demonstrated, data extracted from one host can subsequently be transferred and loaded from the other host without requiring manual intervention.
This may be of use with AWRW in a RAC configuration (something I haven’t yet investigated).

AWR Warehouse and Data Guard (Interlude)

Note: in preparing the next entry, I reviewed the results from previous entries and found that, although extracts were being transferred successfully, they had all failed to load; I should have noticed that they had status 4 (IMPORT_FAILED) in the load metadata table.

Reviewing dba_scheduler_job_run_details for the load task revealed the following error:

"move error: 
ORA-20105: Unable to move AWR data to SYS
1852644603 1_4CB614B101EE2E5EE0530401A8C066F0_1852644603_12_37.dmp 1 DGA oracle_database 1852644603"
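
For reference, the error text above can be pulled out of the scheduler history with something like the following (a sketch; the job name filter is an assumption, so adjust it to match the CAW load job on your repository):

-- recent runs of the AWRW load job, newest first; the LIKE pattern is a guess
select log_date, job_name, status, additional_info
  from dba_scheduler_job_run_details
 where job_name like '%CAW%LOAD%'
 order by log_date desc;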

I removed both databases from AWRW, and playing around with the internal procedures used to register and unregister databases revealed that the partitions had got into a mess.

SQL> exec dbms_swrf_internal.unregister_database(1852644603);
SQL> exec dbms_swrf_internal.register_database(1852644603);
BEGIN dbms_swrf_internal.register_database(1852644603); END;

*
ERROR at line 1:
ORA-13509: error encountered during updates to a AWR table
ORA-14012: resulting partition name conflicts with that of an existing partition
ORA-06512: at "SYS.DBMS_SWRF_INTERNAL", line 105
ORA-06512: at "SYS.DBMS_SWRF_INTERNAL", line 184
ORA-06512: at line 1

To resolve this, I unregistered the database again using the internal procedure, then identified the ‘orphan’ partitions causing problems using the following SQL (1852644603 is the DBID of the problem database):

SELECT * FROM DBA_TAB_PARTITIONS WHERE PARTITION_NAME LIKE '%1852644603%'

Partitions for the unregistered database still existed for the following tables:

  • WRH$_CON_SYSMETRIC_HISTORY
  • WRH$_CON_SYSSTAT
  • WRH$_CON_SYSTEM_EVENT
  • WRH$_CON_SYS_TIME_MODEL
  • WRM$_PDB_IN_SNAP

I then dropped these problem partitions. (This is just a play database, which is why I can be so reckless modifying internal Oracle tables; do your own research and testing before making such changes, and make sure you understand the risks.)
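
For anyone in a similar situation, a rough sketch of the approach: generate the DDL from the same DBA_TAB_PARTITIONS query, review it carefully, and only then run it (these are SYS-owned AWR tables, so the earlier warning applies doubly):

-- generate (do not blindly execute) DROP PARTITION statements for the orphan partitions
select 'alter table ' || table_owner || '.' || table_name ||
       ' drop partition ' || partition_name || ' update indexes;' as drop_ddl
  from dba_tab_partitions
 where partition_name like '%1852644603%';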

After I did this, I was able to re-add DGA to AWRW and upload some snapshots successfully; this time I actually checked that the records existed in AWRW using the following query:

SQL> SELECT DISTINCT DBID FROM DBA_HIST_SNAPSHOT;

      DBID
----------
1585153544
1852644603
SQL>

I performed the same steps as before (back up the extract metadata table, add DGB to AWRW, switch over to DGB, change the mapping record for DGA, restore the extract record, “Upload Snapshots Now”, then restore the mapping record for DGA). Snapshots from DGB also uploaded successfully with the original DBID.

It is distinctly possible that I messed up the partitioning of the AWR tables with my hacks; however, there are a few things that make me think this is not the case. The original upload from DGA failed before I had made any non-standard changes. Also, after manually fixing the partitions and repeating my steps, everything worked fine. It is also worth noting that the tables causing the problem are all new in 12.2 (DGA and DGB are freshly installed 12.2 instances, while the AWR Warehouse repository is 12.2 upgraded from 12.1). We also had some similar issues on our production AWR Warehouse after we upgraded it from 12.1 to 12.2 (a different error, but also related to AWRW partitions). So I suspect a bug where AWRW is not properly handling partitions on the new 12.2 AWR tables.

Messing around with the register/unregister procedures also reminds me of another thing to be aware of regarding my ‘hack’ on the CAW_DBID_MAPPING table: if you subsequently remove either database from AWRW (via the GUI), it will also remove all AWR history records for both databases. This may or may not be what you want, so be aware of it.
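
If you are unsure what would be affected by such a removal, here is a sketch of a query that shows which warehouse snapshot ranges currently belong to each registered target (it uses only the mapping table and snapshot view already seen in these posts):

-- snapshot ranges held in the warehouse per registered target
select m.target_name, m.old_dbid, m.new_dbid,
       min(s.snap_id) as first_snap, max(s.snap_id) as last_snap, count(*) as snap_count
  from dbsnmp.caw_dbid_mapping m
  join dba_hist_snapshot s
    on s.dbid = m.new_dbid
 group by m.target_name, m.old_dbid, m.new_dbid
 order by m.target_name;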

TNS-12599: TNS:cryptographic checksum mismatch

Checking the alert log of one of the instances we have recently upgraded to 12.2, I noticed the following error message occurring frequently:

NI cryptographic checksum mismatch error: 12599.

VERSION INFORMATION:
TNS for Linux: Version 12.2.0.1.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 12.2.0.1.0 - Production
Time: 06-APR-2017 11:38:05
Tracing to file:
Tns error struct:
ns main err code: 12599

TNS-12599: TNS:cryptographic checksum mismatch
ns secondary err code: 12656
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0

Comparing with entries in the listener.log file, I identified that these corresponded to connections from Enterprise Manager. Reproducing the issue was easy: I navigated to the database in Enterprise Manager, went through the menus “Oracle Database->Target Setup->Monitoring Configuration”, and clicked the “Test Connection” button.

Every time the test succeeded, but the message was appended to the alert log.

MOS note 1927120.1 goes into some detail about the error message, including the fact that it is safe to ignore. It should be resolved when Enterprise Manager is updated to include the latest JDBC driver.

One option listed to prevent the issue appearing in the alert log is to set DIAG_ADR_ENABLED=OFF in sqlnet.ora (the message will then be logged to sqlnet.log instead).

I found that setting the following two parameters in sqlnet.ora also stopped the warning appearing (setting either parameter by itself was not sufficient); a sample sqlnet.ora fragment follows the list.

  • SQLNET.CRYPTO_CHECKSUM_SERVER=rejected
  • SQLNET.ENCRYPTION_SERVER=rejected
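
For clarity, this is nothing more than the two parameters above expressed as a sqlnet.ora fragment; check the impact on your own network encryption requirements before using it:

# sqlnet.ora on the database server: just the two parameters discussed above
SQLNET.CRYPTO_CHECKSUM_SERVER = rejected
SQLNET.ENCRYPTION_SERVER = rejected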

AWR Warehouse and Data Guard (Part Three)

This entry continues with the setup configured in parts one and two. I stated that ideally you should perform “Upload Snapshots Now” on the primary database before performing a switchover, to remove the possibility of missing snapshots in AWRW. We will see what happens if you don’t, either because you forgot or because it was not possible due to a failover, and review some options to resolve the situation.

Do your own due diligence before updating any internal AWR data, be prepared to accept the consequences if you break things.

The primary database is currently DGA. I generate a couple of snapshots and perform a manual extract (note that during the normal course of operations this extract runs independently of the transfer and load components, so it is possible for snapshots to have been extracted but not yet transferred to the AWRW host). The extract metadata table shows one extraction with status 2: the data has been extracted but has not yet been transferred to AWRW.

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> exec dbsnmp.mgmt_caw_extract.run_extract;

PL/SQL procedure successfully completed.

SQL> select extract_id, begin_snap_id, end_snap_id, status from dbsnmp.caw_extract_metadata order by extract_id;

EXTRACT_ID BEGIN_SNAP_ID END_SNAP_ID      STATUS
---------- ------------- ----------- ----------
      1            1       10           5
     22           11       13           5
     41           14       15           5
     61           16       18           5
     81           19       24           2
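
As an aside, the status column can be made a little friendlier; a sketch that decodes the values observed across these posts (2 = extracted but not yet transferred, 3 = error/retry, 5 = transferred successfully):

-- the same extract metadata with the status values seen in these posts decoded
select extract_id, begin_snap_id, end_snap_id,
       decode(status, 2, 'extracted, awaiting transfer',
                      3, 'error / retry',
                      5, 'transferred',
                      to_char(status)) as status_desc
  from dbsnmp.caw_extract_metadata
 order by extract_id;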

I perform a switchover so that DGB is primary. Because the extract metadata references a file on host dga, the transfer task on dga will fail (as the database is now in standby mode), and the one on dgb will also fail because the extract file is physically on dga. As no additional extracts have taken place, this ‘orphan’ extract record can simply be deleted (back up the table first if you are playing it safe). The next extract on DGB will re-extract the missing snapshots.

SQL> delete from dbsnmp.caw_extract_metadata where extract_id = 81;

1 row deleted.

SQL> commit;

Commit complete.

Sure enough, when I run “Upload Snapshots Now” against DGB, the job succeeds, and reviewing the load metadata table reveals that the snapshots initially extracted from DGA have been re-extracted successfully from DGB (along with some other snapshots that have been taken since).

SQL> select dump_id, target_name, begin_snap_id, end_snap_id, status from dbsnmp.caw_load_metadata order by dump_id;

   DUMP_ID TARGET_NAM BEGIN_SNAP_ID END_SNAP_ID     STATUS
---------- ---------- ------------- ----------- ----------
	 4 DGA			  1	     10 	 4
	 5 DGB			 11	     13 	 4
	 6 DGB			 14	     15 	 4
	 7 DGA			 16	     18 	 4
	 8 DGB			 19	     28 	 4

However, if you don’t do this in time, subsequent snapshots may get extracted, transferred and loaded successfully, creating a gap in the snapshots held in AWRW. To reproduce this situation, I create some more snapshots and kick off the extract process.
Notice that this time run_extract does not actually extract anything; the output shows that it has run recently, so it quits. This is part of the design of the ETL process, to keep the impact on the source database to a minimum. I override this behavior by calling the procedure run_job_now.

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> exec dbms_workload_repository.create_snapshot;

PL/SQL procedure successfully completed.

SQL> set serveroutput on size unlimited
SQL> exec dbsnmp.mgmt_caw_extract.run_extract;
Starting CAW Extract...
Fetched CAW properties.
Retrying most recent failure...
No recent failure to retry.
Preparing for new Extract...
Upload interval is 24 hr.
Time elapsed since last successful extract: 0 hr.
Not enough time elapsed. Quitting.
No non-recoverable extracts to retry.

SQL> exec dbsnmp.mgmt_caw_extract.run_job_now;
run_job_now completed.

PL/SQL procedure successfully completed.

SQL> select extract_id, begin_snap_id, end_snap_id, status from dbsnmp.caw_extract_metadata order by extract_id;

EXTRACT_ID BEGIN_SNAP_ID END_SNAP_ID	 STATUS
---------- ------------- ----------- ----------
	 1	       1	  10	      5
	22	      11	  13	      5
	41	      14	  15	      5
	61	      16	  18	      5
       101	      19	  28	      5
       121	      29	  30	      2

6 rows selected.

I then switch back (the primary reverts to DGA), create a couple more snapshots and perform “Upload Snapshots Now”. The process completes successfully; however, checking the load metadata reveals that some snapshots have indeed been missed.

SQL> select target_name, dbid, begin_snap_id, end_snap_id from dbsnmp.caw_load_metadata order by dump_id;

TARGET_NAM	 DBID BEGIN_SNAP_ID END_SNAP_ID
---------- ---------- ------------- -----------
DGA	   1852644603		  1	     10
DGB	   1852644603		 11	     13
DGB	   1852644603		 14	     15
DGA	   1852644603		 16	     18
DGB	   1852644603		 19	     28
DGA	   1852644603		 31	     32

6 rows selected.
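
Spotting the gap by eye is easy with six rows; on a busier system something like the following would flag any missing ranges for a given source DBID (a sketch using analytic LAG over the same load metadata table):

-- report gaps between consecutive loaded snapshot ranges for DBID 1852644603
select prev_end + 1 as missing_from, begin_snap_id - 1 as missing_to
  from (select begin_snap_id,
               lag(end_snap_id) over (order by begin_snap_id) as prev_end
          from dbsnmp.caw_load_metadata
         where dbid = 1852644603)
 where begin_snap_id > prev_end + 1;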

Checking the extract metadata reveals that the status of this extract is 5 (transferred successfully). (I don’t know why; the load metadata shows that it has not been received into AWRW.)

SQL> select extract_id, BEGIN_SNAP_ID, END_SNAP_ID, STATUS from dbsnmp.caw_extract_metadata order by extract_id;

EXTRACT_ID BEGIN_SNAP_ID END_SNAP_ID	 STATUS
---------- ------------- ----------- ----------
	 1	       1	  10	      5
	22	      11	  13	      5
	41	      14	  15	      5
	61	      16	  18	      5
       101	      19	  28	      5
       121	      29	  30	      5
       141	      31	  32	      5

7 rows selected.

To resolve this situation, I update the extract metadata record to error status, to force these snapshots to be re-extracted.

SQL> update dbsnmp.caw_extract_metadata set status = 3, e_retry_count = 0 where extract_id = 121;

1 row updated.

SQL> commit;

Commit complete.

After running Upload Snapshots Now, reviewing the extract metadata reveals that the missing snapshots have been re-extracted (the record has been updated to status 5).

SQL> select extract_id, BEGIN_SNAP_ID, END_SNAP_ID, STATUS from dbsnmp.caw_extract_metadata order by extract_id;

EXTRACT_ID BEGIN_SNAP_ID END_SNAP_ID	 STATUS
---------- ------------- ----------- ----------
	 1	       1	  10	      5
	22	      11	  13	      5
	41	      14	  15	      5
	61	      16	  18	      5
       101	      19	  28	      5
       121	      29	  30	      5
       141	      31	  32	      5

7 rows selected.

Reviewing the load metadata reveals that the missing snapshots have finally been loaded successfully into AWRW; the gap has been eliminated.

SQL> select target_name, dbid, begin_snap_id, end_snap_id from dbsnmp.caw_load_metadata order by dump_id;

TARGET_NAM	 DBID BEGIN_SNAP_ID END_SNAP_ID
---------- ---------- ------------- -----------
DGA	   1852644603		  1	     10
DGB	   1852644603		 11	     13
DGB	   1852644603		 14	     15
DGA	   1852644603		 16	     18
DGB	   1852644603		 19	     28
DGA	   1852644603		 31	     32
DGA	   1852644603		 29	     30

7 rows selected.

In summary, even though AWR Warehouse does not support a Data Guard configuration out of the box, with a little bit of hacking it is possible to get it to work reasonably well.

Here’s hoping that in a later release these issues will have been worked out, and it will work out of the box.

Reasons to love AWR Warehouse

I realize that the majority of the blog posts I have made are about problems I have encountered. I think this is only natural, as most of my time is spent investigating the few occasions where things don’t work as expected rather than the 99.99% of the time when things just work. Even when there are problems, the code is normally very well instrumented, and there is such a wealth of great technical information available that troubleshooting issues can be almost enjoyable.

With a view to rebalancing this state of affairs somewhat, I thought I would put down a few thoughts on AWR Warehouse. I am continually surprised that so few people seem to be using it, so I will try to do my part to address this, and hopefully persuade you to give it a shot. If you are not using it, feel free to share via a comment why not; if you are using it, also feel free to share any thoughts, good or bad, about the product.

Licensing

One of the main reasons to consider using AWRW can be found in the Database Licensing Information User Manual. I quote the relevant section:

A separate single instance Oracle Database can be installed and used as an infrastructure repository for RMAN, Oracle Enterprise Manager Cloud Control, Automatic Workload Repository (AWR) Warehouse, Global Data Services Catalog, and Grid Infrastructure Management Repository without additional license requirements, provided that all the targets are correctly licensed. It may not be used or deployed for other uses.

The infrastructure repositories for RMAN, Oracle Enterprise Manager Cloud Control, AWR Warehouse, and Global Data Services Catalog can be shared in one database, or deployed in separate databases. Enterprise Edition must be used for the infrastructure repository database(s).

If Oracle RAC is used for the infrastructure database, then Oracle RAC licenses for all servers, and Enterprise Edition licenses for all but the first server, are required.

If Data Guard is used for the infrastructure database, then an Enterprise Edition license for the standby server is required.

So, paraphrasing (note I am neither a lawyer nor a licensing expert): as long as the source databases are correctly licensed, you don’t use RAC or Data Guard for the repository, and you don’t use it for anything else, you don’t need to worry about licensing for the AWR Warehouse repository. This can be (in fact must be) Enterprise Edition, as it makes extensive use of the partitioning option.

There are not many occasions when Oracle Corporation gives something away for free; when it does, I suggest taking advantage.

The licensing advantages become even more attractive when you realize that AWRW actually saves you licensing on source databases by allowing you to offload AWR data storage and analysis from the source database.

Note: don’t even think about using an existing database for this purpose, especially not the OEM repository database; trust me on this, you’ll thank me later.

Tight Integration into Enterprise Manager

Database administrators who have not embraced Enterprise Manager (OEM) for managing their databases may consider this a disadvantage; however, I believe that any company managing more than a handful of databases should be fully utilizing the power of OEM; its benefits far outweigh any administration overhead.

AWRW initial configuration and the addition of database sources is basically a case of making sure they are discovered within OEM, defining preferred credentials, and then clicking a few buttons.

Once databases are contributing to AWRW, making use of the historical data is again just a case of clicking a few buttons.

AWR reports can easily be run against data in AWRW rather than just local AWR data, and comparisons can be made between different databases over the same, or different, periods of time.

ASH Analytics is another area where AWRW and OEM combine to provide some powerful tools for in-depth analysis.

Database Versions

The AWRW documentation states that:

(The AWR Warehouse Repository) must be an equal or higher database version of the source databases it accommodates.

You might consider this a disadvantage; however, I see it differently. It is an opportunity to try out upgrading to, and using, new database versions with non-trivial (actually substantial) volumes of data before rolling them out to databases where developers or users will complain if things go wrong.

We recently upgraded our AWRW repository to 12.2. We had a few issues (as expected), but we could take our time resolving them, because we (the DBAs) were the only ones who noticed any outage. After this upgrade completed, we felt brave enough to upgrade our OMR repository database, which went much more smoothly.

Of course, perform due diligence, practice on a test system, and make sure you have backups that will allow you to roll back in case things go wrong.

Summary

It’s license free, extends the power of Enterprise Manager, and gives you low-risk exposure to new database versions.

Why aren’t you using it?  Seriously;  let me know via comments.

AWR Warehouse and Data Guard (Part Two)

We will continue with the setup configured for the previous blog entry.
This post includes details about updating some internal AWRW tables. These steps are provided for educational purposes only; if you do this on a real system, make sure you know exactly what you are doing and understand the risks. If you break anything, Oracle Support or I may not be able or willing to help you.

First I remove the two instances DGA and DGB previously configured from AWR Warehouse, with a view to re-configuring them to resolve some of the problems we encountered last time.

I re-add DGA, perform “Upload Snapshots Now”, and make sure an entry is created in the DBID mapping table.

SQL> SELECT target_name, old_dbid, new_dbid FROM DBSNMP.CAW_DBID_MAPPING order by mapping_id;

TARGET_NAME   OLD_DBID	 NEW_DBID
----------- ---------- ----------
cawr	    1585153544 1585153544
DGA	    1852644603 1852644603

I switch over to instance DGB. Before re-adding DGB to AWR Warehouse, I back up the extract metadata table.

SQL> create table dbsnmp.caw_extract_metadata_backup as select * from dbsnmp.caw_extract_metadata;

Table created.

I then re-add DGB to AWRW, but before uploading snapshots I temporarily update the mapping for DGA, setting the new_dbid column to a different value, 1.

SQL> update DBSNMP.CAW_DBID_MAPPING set new_dbid=1 where target_name='DGA';
1 row updated.
SQL> COMMIT;
Commit complete.

I then verify that caw_extract_metadata is empty (it was recreated when adding DGB to AWRW) and restore the record from my backup table. This lets the AWRW extract process on DGB know which snapshots have already been uploaded from DGA. I also advance the relevant sequence so that there is no error when creating a new extract metadata record. Lastly, I create a couple of snapshots for testing.

SQL> select count(*) from dbsnmp.caw_extract_metadata;
  COUNT(*)
----------
	 0
SQL> insert into dbsnmp.caw_extract_metadata select * from dbsnmp.caw_extract_metadata_backup;
1 row created.
SQL> commit;
Commit complete.
SQL> select dbsnmp.caw_extract_id_seq.nextval from dual;
21
SQL> exec dbms_workload_repository.create_snapshot;
PL/SQL procedure successfully completed.
SQL> exec dbms_workload_repository.create_snapshot;
PL/SQL procedure successfully completed.

Finally I perform “Upload Snapshots Now” on DGB.  The process completes successfully.  Let’s check the contents of the dbid mapping table:

SQL> select mapping_id, target_name, old_dbid, new_dbid from DBSNMP.CAW_DBID_MAPPING;
MAPPING_ID TARGET_NAME	 OLD_DBID   NEW_DBID
---------- ----------- ---------- ----------
	 1 cawr        1585153544 1585153544
	 4 DGA	       1852644603	   1
	 5 DGB	       1852644603 1852644603

Because we had temporarily re-mapped DGA, OEM did not have to remap the DBID for DGB. I now revert the change we made to the DGA mapping and review the mappings:

SQL> update DBSNMP.CAW_DBID_MAPPING set new_dbid = 1852644603 where target_name = 'DGA';
1 row updated.
SQL> commit;
Commit complete.
SQL> select mapping_id, target_name, old_dbid, new_dbid from DBSNMP.CAW_DBID_MAPPING;
MAPPING_ID TARGET_NAME	 OLD_DBID   NEW_DBID
---------- ----------- ---------- ----------
	 1 cawr        1585153544 1585153544
	 4 DGA	       1852644603 1852644603
	 5 DGB	       1852644603 1852644603

To double check everything is OK after this change, I create some new snapshots, and perform “Upload Snapshots Now” again on DGB.  I then switch back to DGA, create some more snapshots, and again perform “Upload Snapshots Now”.
Both tasks complete successfully, and I can check everything is as hoped from the load metadata table:

SQL> select target_name, dbid, begin_snap_id, end_snap_id from dbsnmp.caw_load_metadata order by dump_id;

TARGET_NAME	  DBID BEGIN_SNAP_ID END_SNAP_ID
----------- ---------- ------------- -----------
DGA	    1852644603		   1	      10
DGB	    1852644603		  11	      13
DGB	    1852644603		  14	      15
DGA	    1852644603		  16	      18

One thing to note with this ‘hack’ is that it is best to perform “Upload Snapshots Now” on the active database before a switchover is performed, to make sure there are no extracts ‘left behind’ when the database becomes the standby. Obviously this may not be possible in failover situations. I will try to cover workarounds for this in a follow-up blog post.
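
A quick pre-switchover check that can help here is to look for extract records still at status 2 (extracted but not yet transferred); if any exist, run “Upload Snapshots Now” before switching over. A minimal sketch:

-- extracts that have been taken but not yet transferred to the AWRW host
select extract_id, begin_snap_id, end_snap_id, status
  from dbsnmp.caw_extract_metadata
 where status = 2
 order by extract_id;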

Do not set job_queue_processes at PDB level in 12.1.0.2

MOS Doc ID 2175443.1 goes into more detail, including the following quote:

It is possible to set job_queue_processes at pdb level on 12.1.0.2 multitenant database, but it does not work correctly.

I have managed to reproduce the actual behavior. Of particular concern is that this allows a local user on one PDB to disrupt things on other PDBs.

My test system is running 12.1.0.2, multi-tenant with two PDBs, PDB1 and PDB2. Each PDB has a local user with DBA privileges, dba1, and dba2 respectively.

[oracle@awrw admin]$ sqlplus dba1/oracle@pdb1

SQL*Plus: Release 12.1.0.2.0 Production on Sun Apr 2 13:34:41 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Sun Apr 02 2017 13:33:46 +08:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> show parameter job_queue_processes

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes		     integer	 1000
SQL> alter system set job_queue_processes=0 container=current;

System altered.

SQL> show parameter job_queue_processes

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes		     integer	 0
SQL>

So dba1 seems to have changed job_queue_processes, in their own PDB, pdb1.

The alert log reveals some cause for concern:

CJQ0 started with pid=34, OS id=31546
Stopping background process CJQ0
Sun Apr 02 13:35:01 2017
ALTER SYSTEM SET job_queue_processes=0 SCOPE=BOTH;

Sure enough, in PDB2, even though the job_queue_processes parameter seems to be set to its default value, we are unable to use the scheduler:

[oracle@awrw admin]$ sqlplus dba2/oracle@pdb2

SQL*Plus: Release 12.1.0.2.0 Production on Sun Apr 2 13:44:19 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Sun Apr 02 2017 13:43:25 +08:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> show parameter job_queue_processes

NAME				     TYPE	 VALUE
------------------------------------ ----------- ------------------------------
job_queue_processes		     integer	 1000
SQL> exec sys.dbms_scheduler.create_job( 'DUMMY_JOB', job_type=>'PLSQL_BLOCK', job_action=>'BEGIN NULL; END;');

PL/SQL procedure successfully completed.

SQL> exec sys.dbms_scheduler.run_job( 'DUMMY_JOB', false);
BEGIN sys.dbms_scheduler.run_job( 'DUMMY_JOB', false); END;

*
ERROR at line 1:
ORA-27492: unable to run job "DBA2"."DUMMY_JOB": scheduler unavailable
ORA-06512: at "SYS.DBMS_ISCHED", line 209
ORA-06512: at "SYS.DBMS_SCHEDULER", line 594
ORA-06512: at line 1

I connect back to PDB1 and reset the parameter:

[oracle@awrw admin]$ sqlplus dba1/oracle@pdb1

SQL*Plus: Release 12.1.0.2.0 Production on Sun Apr 2 13:46:40 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Sun Apr 02 2017 13:34:41 +08:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter system set job_queue_processes=1000 container=current;

System altered.

In the alert log I can see that the Job Queue Coordinator process (CJQ0) is restarted:

Starting background process CJQ0
Sun Apr 02 13:47:08 2017
ALTER SYSTEM SET job_queue_processes=1000 SCOPE=BOTH;
Sun Apr 02 13:47:08 2017
CJQ0 started with pid=34, OS id=2227

And sure enough, I am now able to run scheduler jobs in PDB2.

[oracle@awrw ~]$ sqlplus dba2/oracle@pdb2

SQL*Plus: Release 12.1.0.2.0 Production on Sun Apr 2 13:48:51 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Last Successful login time: Sun Apr 02 2017 13:44:19 +08:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> exec sys.dbms_scheduler.run_job( 'DUMMY_JOB', false);

PL/SQL procedure successfully completed.
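
For completeness, the test job can be dropped afterwards (connected as dba2 in PDB2, where it was created):

-- remove the dummy job used for this test
exec sys.dbms_scheduler.drop_job('DUMMY_JOB');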

Bind Sensitivity and PL/SQL cursor caching

I have been investigating a PL/SQL process which was failing because a non-optimal plan was being used, due to a combination of bind variable peeking and data skew.
I put together a reproducible test case executing the problem SQL from SQL*Plus, and found that adding a /*+ bind_aware */ hint, to inform the CBO that this query was bind sensitive, seemed to resolve the issue.

However, retesting via PL/SQL showed that the plan being used was still the one generated for the first bind variable encountered.

I have simplified the test case to make it easier to reproduce the behavior I was seeing (database version 12.1.0.2):

create table skew_table nologging as
   select rownum id,
        case mod(rownum, 10000) when 0 then 0 else 1 end c10000,
        rpad('X', 255, 'X') padding
        from dual
        connect by level <= 1e6;

create index skew_index on skew_table(c10000);

exec dbms_stats.gather_table_stats(NULL, 'SKEW_TABLE', METHOD_OPT=>'FOR ALL COLUMNS SIZE SKEWONLY');

create or replace procedure get_skew(p in number)
is 
   dummy number;
begin
   select /*+ bind_aware sktest */ count(*) INTO dummy FROM skew_table where c10000 = p;
end;
/

declare
   dummy number;
begin
   get_skew(0);
   get_skew(1);
end;   
/

Checking v$sql we can see only one plan was generated:

select child_number, executions, parse_calls, is_bind_sensitive, is_bind_aware  from v$sql where sql_id = '1rg2w46daksr4';  

CHILD_NUMBER EXECUTIONS PARSE_CALLS I I                                                                                                                             
------------ ---------- ----------- - -                                                                                                                             
           0          2           1 Y Y    

I guessed that PL/SQL cursor caching was resulting in even the soft parse being skipped, so I disabled this optimization in the calling code:

declare
   dummy number;
begin
   execute immediate 'ALTER SESSION SET session_cached_cursors = 0';
   get_skew(0);
   get_skew(1);
   execute immediate 'ALTER SESSION RESET session_cached_cursors';
end;   
/

You can see now that the cursor is re-parsed, and that plan selection based on the bind variables is taking place correctly:

SQL> select child_number, executions, parse_calls, is_bind_sensitive, is_bind_aware  from v$sql where sql_id = '1rg2w46daksr4';                                     
                                                                                                                                                                    
CHILD_NUMBER EXECUTIONS PARSE_CALLS I I                                                                                                                             
------------ ---------- ----------- - -                                                                                                                             
           0          3           3 Y Y                                                                                                                             
           1          1           0 Y Y                                                                                                                             

Update: while this was interesting to investigate, for the live issue I have simply locked in a plan that is acceptable (if not optimal) for all values.
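
For anyone curious about the “lock in a plan” route, one common mechanism is a SQL plan baseline loaded from the cursor cache; below is a minimal sketch using the test statement’s sql_id from above (purely illustrative, not necessarily what was done for the live issue):

-- load the cached plan(s) for the test statement as accepted SQL plan baselines
declare
   l_plans pls_integer;
begin
   l_plans := dbms_spm.load_plans_from_cursor_cache(sql_id => '1rg2w46daksr4');
   dbms_output.put_line('plans loaded: ' || l_plans);
end;
/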

AWR Warehouse and Data Guard (Part One)

Note: if you are not interested in the nitty-gritty of my testing and results, you may just want to skip ahead to the blog entry I have made that summarizes my findings and makes some recommendations.

One limitation of AWR Warehouse (AWRW) is that it doesn’t play nicely with Data Guard. I have been spending some time recently reproducing issues that we have seen on our production systems. With luck, some of the workarounds we have implemented to minimize these problems will be discussed in a follow-up post.

First here are some details about my testing environment; note in particular that it does not follow the best practice of separating the OEM and AWR Warehouse repositories:

  • Host dga holds database instance DGA
  • Host dgb holds database instance DGB
  • DGA and DGB are configured as primary and standby via Oracle Data Guard
  • Host oem hosts Enterprise Manager (OEM) and database instance OMR
  • OMR holds both OEM and AWRW repositories
  • All hosts are running Oracle Linux 7.3
  • All databases are Oracle 12.2.0.1
  • All hosts and database instances are configured as targets in OEM
  • Preferred credentials have been defined for all databases.

Let’s add DGA as a target in AWRW: from the AWR Warehouse home page in OEM, we click the Add button and select DGA. After a few seconds we can see DGA in our list of targets:

DGA_Uploading

I then highlight this target and select “Upload Snapshots Now” from Actions drop-down list.

Success. We can see more information in the CAW_DBID_MAPPING table in the AWRW repository:

SQL> SELECT mapping_id, target_name, old_dbid, new_dbid FROM DBSNMP.CAW_DBID_MAPPING;

MAPPING_ID TARG   OLD_DBID   NEW_DBID
---------- ---- ---------- ----------
	 1 cawr 1585153544 1585153544
	 2 DGA	1852644603 1852644603

SQL>

Let’s try to add DGB (currently running as standby) as a target:

Screen Shot 2017-03-26 at 5.08.32 PM

As you can see, adding the standby instance fails.

Next, a switchover is performed so that DGB is primary and DGA is standby.

One thing to note is that AWR tables now exist in DGB (they were created on DGA when it was primary and have been copied as part of normal standby redo apply).

Addition of DGB to AWRW now completes successfully:

Screen Shot 2017-03-26 at 5.12.35 PM

“Upload Snapshots Now” also runs successfully. However, looking closer shows some problems. The CAW_DBID_MAPPING table reveals that during the addition OEM detected that it already had an instance in AWRW with the same DBID, so it re-mapped DGB to a new DBID, 1.

SQL> select mapping_id, target_name, old_dbid, new_dbid from dbsnmp.caw_dbid_mapping order by mapping_id;

MAPPING_ID TARG   OLD_DBID   NEW_DBID
---------- ---- ---------- ----------
	 1 cawr 1585153544 1585153544
	 2 DGA	1852644603 1852644603
	 3 DGB	1852644603	    1
SQL> 

When performing analysis of AWRW data, this mapping would need to be considered.

Analysis of table CAW_LOAD_METADATA reveals further problems:

SQL> select target_name, begin_snap_id, end_snap_id from dbsnmp.caw_load_metadata;

TARG BEGIN_SNAP_ID END_SNAP_ID
---- ------------- -----------
DGA		 1	     6
DGB		 1	     7

AWR history for snapshot IDs 1 through 6 has been uploaded twice, once for DGA and once for DGB. In a real environment, where the volume of AWR data on the source database is large, this could be a nasty surprise. Also, any analysis of AWRW data will likely be invalid, with some snapshots counted twice under different DBIDs.
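
If you suspect this kind of duplication, a rough check for overlapping loads in the metadata looks like this (a sketch; it assumes the dbid column in caw_load_metadata holds the original source DBID for both targets):

-- pairs of loads whose source snapshot ranges overlap for the same source DBID
select a.dump_id, a.target_name, a.begin_snap_id, a.end_snap_id,
       b.dump_id as dup_dump_id, b.target_name as dup_target
  from dbsnmp.caw_load_metadata a
  join dbsnmp.caw_load_metadata b
    on a.dbid = b.dbid
   and a.dump_id < b.dump_id
   and a.begin_snap_id <= b.end_snap_id
   and b.begin_snap_id <= a.end_snap_id;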

A switchover is then performed so that DGA returns to the primary role, and “Upload Snapshots Now” is run.

The task completes successfully, and this time, checking CAW_LOAD_METADATA we can see that we don’t have any additional duplicates.

SQL> select target_name, begin_snap_id, end_snap_id from dbsnmp.caw_load_metadata order by dump_id;

TARG BEGIN_SNAP_ID END_SNAP_ID
---- ------------- -----------
DGA		 1	     6
DGB		 1	     7
DGA		 8	    10

SQL>

It seems the main problem occurs during the initial configuration of the target. Once this has been done, uploads in general take place with no problems after a switchover, with the caveat that uploads from DGB will be mapped to the new DBID, 1.

Note in particular that the mapping is determined by the instance that performed the upload, which may or may not be the instance where the snapshot was taken.

There are exceptions, however; in particular, note that during normal operations the Extract, Transfer and Load operations are independent tasks, so if the extract completes successfully on one instance and a switchover occurs before the transfer runs, the transfer tasks (on both the primary and standby instances) will fail to upload the relevant snapshots.

More details about this case, and, as promised, some of the workarounds we have implemented, will follow.

Oracle Connection Manager (cman) registration_invited_nodes and registration_excluded_nodes

We make use of Oracle Connection Manager to control client access to our database servers; it’s pretty cool technology, and I’m always surprised how few people are aware of its existence.

Our installed version is 12.1, and this is unable to communicate with our newly upgraded 12.2 instance, so I have been working through installing the latest version.  On starting the instance up, I noticed that no services were being registered, and the log contained multiple entries like the following:

Listener(VNCR option 1) rejected Registration request from destination
23-MAR-2017 14:37:37 * service_register_NSGR * 1182
TNS-01182: Listener rejected registration of service ""

This functionality is described in the documentation:

Starting with this release, valid node checking for registration is enabled by default in Oracle Connection Manager. By default, only the databases on local node are allowed to register with Oracle Connection Manager. The databases on remote node can be permitted or denied to register with Oracle Connection Manager through the use of parameters REGISTRATION_INVITED_NODES and REGISTRATION_EXCLUDED_NODES.

Check out the examples for these parameters in the documentation:

REGISTRATION_EXCLUDED_NODES = (10.1.26.*, 10.16.40.0/24, \
                                       2001:DB8:3eff:fe38, node2)
REGISTRATION_INVITED_NODES = (10.1.35.*, 10.1.34.0/24, \
                                       2001:DB8:fe38:7303, node1)

Based on that example I updated the cman.ora configuration to include a valid node. Note that 255.255.255.255 is just an example for testing. Note also that I have configured cman to listen on a non-default port, 999, rather than the default of 1521.

cman_vmhklftdbocmdv2.alpha.lftltd.net =
(configuration=
  (address=(protocol=tcp)(host=vmhklftdbocmdv2.alpha.lftltd.net)(port=999))
  (parameter_list =
    (registration_invited_nodes = (255.255.255.255))
    (connection_statistics=yes)
    (log_level=user)
    (trace_level=user)
    (trace_filelen=1000)
    (trace_fileno=3)
    (idle_timeout=14400)
  )
  (rule_list=
    (rule= (src=vmhklftdbocmdv2.alpha.lftltd.net)(dst=::1)(srv=cmon)(act=accept))
     )
)

Note that when I run Connection Manager, it has picked up the default port of 1521, rather than the 999 which I specified:

-bash-4.1$ cmctl

CMCTL for Linux: Version 12.2.0.1.0 - Production on 23-MAR-2017 16:08:03

Copyright (c) 1996, 2016, Oracle.  All rights reserved.

Welcome to CMCTL, type "help" for information.

CMCTL> administer
Current instance CMAN_vmhklftdbocmdv2.alpha.lftltd.net is not yet started
Connections refer to (ADDRESS=(PROTOCOL=TCP)(HOST=vmhklftdbocmdv2)(PORT=1521)).
The command completed successfully.
CMCTL:CMAN_vmhklftdbocmdv2.alpha.lftltd.net> 

Basically what has happened is that cman has parsed the cman.ora file, but it doesn’t like the invited nodes list being enclosed in parentheses. Rather than throwing an error, it just continues with default values.
I remove the parentheses and add an extra IP address just for fun:

    (registration_invited_nodes = 1.1.1.1, 255.255.255.255)

When connecting now, observe that the settings from the parameter file are being used (note the non-default port, 999):

-bash-4.1$ cmctl

CMCTL for Linux: Version 12.2.0.1.0 - Production on 23-MAR-2017 16:12:16

Copyright (c) 1996, 2016, Oracle.  All rights reserved.

Welcome to CMCTL, type "help" for information.

CMCTL> administer
Current instance CMAN_vmhklftdbocmdv2.alpha.lftltd.net is not yet started
Connections refer to (DESCRIPTION=(address=(protocol=tcp)(host=vmhklftdbocmdv2.alpha.lftltd.net)(port=999))).
The command completed successfully.
CMCTL:CMAN_vmhklftdbocmdv2.alpha.lftltd.net> 

I observed the same behavior with REGISTRATION_EXCLUDED_NODES.
While I was in the area, I also noticed that the documented default value for VALID_NODE_CHECKING_REGISTRATION appears to be wrong. The documentation specifies that the default value is off, which contradicts the previously quoted section of the documentation, as well as my own testing.

I’m off to file a couple of documentation bugs…