Instance crash when not restarted through clusterware after patching

We had some issues, fortunately on a test instance, after applying a database
patch. With the help of a colleague (thanks Kei) I was able to identify what had
gone wrong and set up a test case to demonstrate the issue. I think it is worth
sharing as it reveals some of the complexities of the relationships between
the components of the Oracle stack.

My setup consists of two database instances on the same host, configured in Oracle Restart. To start the patching process I stop both instances. Note in particular that the group owner of the oracle binary is asmadmin.

[oracle@hkexdb01 ~]$ srvctl stop database -db PVJA -o abort
[oracle@hkexdb01 ~]$ srvctl stop database -db PVJB -o abort
[oracle@hkexdb01 ~]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle asmadmin 327575894 Jun 13 14:52 /u01/app/oracle/product/

Next I apply the database patch (I have omitted some details from the
patch session as they are not relevant). Observe that after the patch completes,
the oracle binary has been re-linked, and its group owner is now oinstall.

[oracle@hkexdb01 ~]$ cd ~/25397136
[oracle@hkexdb01 25397136]$ $ORACLE_HOME/OPatch/opatch apply
Oracle Interim Patch Installer version
Copyright (c) 2017, Oracle Corporation.  All rights reserved.

Oracle Home       : /u01/app/oracle/product/
Central Inventory : /u01/app/oraInventory
   from           : /u01/app/oracle/product/
OPatch version    :
OUI version       :
Log file location : /u01/app/oracle/product/

Verifying environment and performing prerequisite checks...
OPatch continues with these patches:   24732088  25397136  

Do you want to proceed? [y|n]
User Responded with: Y
All checks passed.

Please shutdown Oracle instances running out of this ORACLE_HOME on the local system.
(Oracle Home = '/u01/app/oracle/product/')

Is the local system ready for patching? [y|n]
User Responded with: Y
Backing up files...
Applying sub-patch '24732088' to OH '/u01/app/oracle/product/'


Patching component oracle.rdbms.install.plugins,
Composite patch 25397136 successfully applied.
Log file location: /u01/app/oracle/product/

OPatch succeeded.
[oracle@hkexdb01 25397136]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle oinstall 327813870 Jun 13 14:59 /u01/app/oracle/product/

After patching is complete, I restart PVJA outside of clusterware
(note that the group owner of the oracle binary is still oinstall).

[oracle@hkexdb01 25397136]$ export ORACLE_SID=PVJA
[oracle@hkexdb01 25397136]$ sqlplus / as sysdba

SQL*Plus: Release Production on Tue Jun 13 15:00:54 2017

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to an idle instance.

SYS@PVJA> startup
ORACLE instance started.

Total System Global Area 2147483648 bytes
Fixed Size                  2926472 bytes
Variable Size            1564297336 bytes
Database Buffers          570425344 bytes
Redo Buffers                9834496 bytes
Database mounted.
Database opened.
SYS@PVJA> exit
Disconnected from Oracle Database 12c Enterprise Edition Release - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
[oracle@hkexdb01 25397136]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle oinstall 327813870 Jun 13 14:59 /u01/app/oracle/product/

I then restart the PVJB instance through clusterware. Note that clusterware
updates the group owner of the oracle binary back to asmadmin.

[oracle@hkexdb01 25397136]$ srvctl start database -db PVJB 
[oracle@hkexdb01 25397136]$ ls -alrt $ORACLE_HOME/bin/oracle
-rwsr-s--x 1 oracle asmadmin 327813870 Jun 13 14:59 /u01/app/oracle/product/
[oracle@hkexdb01 25397136]$ 

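This change in group ownership ‘breaks’ the PVJA instance. PVJA was started while the setgid oracle binary had group oinstall, so processes spawned after the change (such as the job slave j000 below) run with effective group asmadmin and cannot attach to the running instance. The alert log fills up with errors like these: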

Errors in file /u01/app/oracle/diag/rdbms/pvja/PVJA/trace/PVJA_j000_64968.trc:
ORA-27140: attach to post/wait facility failed
ORA-27300: OS system dependent operation:invalid_egid failed with status: 1
ORA-27301: OS failure message: Operation not permitted
ORA-27302: failure occurred at: skgpwinit6
ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1006 (asmadmin)
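
The ORA-27303 line carries both group names, so the mismatch can be pulled out of an alert log mechanically. The following is only a minimal sketch: the function name parse_egid_mismatch is my own invention, not an Oracle utility, and it assumes the message format is exactly as shown above.

```shell
#!/bin/sh
# Hypothetical helper (not an Oracle tool): read alert-log text on stdin and
# print "startup_group current_group" for every ORA-27303 egid line found.
parse_egid_mismatch() {
    sed -n 's/.*startup egid = [0-9]* (\([^)]*\)), current egid = [0-9]* (\([^)]*\)).*/\1 \2/p'
}

# Sample line copied from the alert log excerpt above.
line='ORA-27303: additional information: startup egid = 1001 (oinstall), current egid = 1006 (asmadmin)'
printf '%s\n' "$line" | parse_egid_mismatch
```

For the sample line this prints the startup group followed by the current group, which makes it easy to see that the instance was started under one group and is now running binaries owned by another.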

Note the clusterware log seems to indicate the point at which this action took place:

2017-06-13 14:55:47.233925 :CLSDYNAM:4045649664: [ora.pvjb.db]{1:31716:24755} [start] Utils:execCmd action = 1 flags = 6 ohome = /u01/app/ cmdname = setasmgidwrap

This behaviour seems to match Bug 9784037 : SETASMGID CAUSING ORA-27303, which
was closed as “Not a Bug”. The provided workaround is “always remember to execute setasmgidwrap after doing relink all / patch apply in RDBMS Oracle Home”.
However, I think making sure that all instances are restarted through
clusterware is probably a better way to handle this.
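
Either way, a quick check of the binary's group before starting anything by hand would have caught this. Here is a rough sketch of such a check. The function name binary_group_ok is hypothetical, the expected group asmadmin is simply what my listing showed before patching, and it relies on GNU stat (-c %G), so treat it as an illustration rather than a supported procedure.

```shell
#!/bin/sh
# Hypothetical pre-start check (not an Oracle tool).
# $1 = path to a binary, $2 = expected group name.
# Returns 0 on match, 1 on mismatch, 2 if stat fails.
binary_group_ok() {
    actual=$(stat -c %G "$1") || return 2   # GNU coreutils stat assumed
    if [ "$actual" = "$2" ]; then
        return 0
    fi
    echo "group of $1 is $actual, expected $2" >&2
    return 1
}

# Guarded usage: only runs when ORACLE_HOME points at a real oracle binary.
if [ -n "${ORACLE_HOME:-}" ] && [ -e "$ORACLE_HOME/bin/oracle" ]; then
    binary_group_ok "$ORACLE_HOME/bin/oracle" asmadmin ||
        echo "group differs: start the instance with srvctl rather than sqlplus"
fi
```

Run after opatch and before any manual startup, a non-zero result here is the cue to go through srvctl (or to apply the setasmgidwrap workaround from the bug note) instead of starting the instance from sqlplus.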

Update: it seems Frits Hoogland has run into this issue too, so I now feel a bit better about my ‘mistake’.
