Resolving RMC connection error for AIX – DLPAR (Dynamic Logical Partition)
We used to get the error below while performing a DLPAR operation. It is painful whenever we need to add CPU or memory to a business-critical partition where a reboot is not affordable at all. In my experience, this issue is caused by network connectivity in most cases; RSCT and the CSM agents are much less likely to be the cause of DLPAR problems.
Until the DLPAR issue is resolved, you cannot add resources to this logical partition without a reboot. Once it is resolved, of course, we can add resources without rebooting the server.
Before starting any troubleshooting, always validate the basics: ping the HMC from the logical partition (in many places ping is disabled, so telnet to the HMC IP address on the RMC port, 657, instead) and ensure that all necessary communication is in place.
First, check the HMC configuration by validating the HMC network and firewall settings.
In the HMC, click HMC Management in the left-hand panel and open Customize Network Settings, then validate your network information as in the screenshot below.
On the screen shown above, click Change Network Settings to open the HMC network configuration, and make sure the HMC IP address and subnet mask are correct.
The second most important step is to validate the routing on the HMC and ping from the HMC to test network connectivity.
If communication is fine after this validation, check the RMC subsystem on the LPAR next.
rmcctrl -z stops the RMC subsystem and all resource managers (the rsct_rm group subsystems).
rmcctrl -A adds and starts the RMC subsystem.
rmcctrl -p enables remote client connections.
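The three rmcctrl steps above can be wrapped in a small shell function so they always run in the same order and stop at the first failure. This is only a minimal sketch, assuming rmcctrl lives in /usr/sbin/rsct/bin (its usual RSCT location) and that it is run as root on the LPAR. The function name restart_rmc and the RMCCTRL variable are illustrative, not part of RSCT.

```shell
# Path to rmcctrl; /usr/sbin/rsct/bin is the usual RSCT location
# (assumption: override RMCCTRL if your installation differs).
RMCCTRL=${RMCCTRL:-/usr/sbin/rsct/bin/rmcctrl}

# restart_rmc: hypothetical helper that runs the stop/start/enable sequence.
restart_rmc() {
    # -z  stop the RMC subsystem and all resource managers
    # -A  add and start the RMC subsystem
    # -p  enable remote client connections
    for flag in -z -A -p; do
        echo "running: $RMCCTRL $flag"
        "$RMCCTRL" "$flag" || {
            echo "rmcctrl $flag failed" >&2
            return 1
        }
    done
}

# Usage (as root on the LPAR):
#   restart_rmc
```

Stopping at the first failing step keeps a failed stop from being hidden by a later start.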
Now check whether you are able to perform the DLPAR operation. If yes, everything is fine. Otherwise, run lspartition -dlpar on the HMC and check whether your LPAR is listed there and what its DCaps value is.
If the lspartition command returns no output for the LPAR, the LPAR-to-HMC connection has not been established.
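To pick the DCaps value out of the lspartition -dlpar listing, a small awk filter can help. The sketch below assumes the usual two-line layout per partition (a Partition:<...> line followed by a line containing DCaps:<...>) and treats DCaps:<0x0> as "no DLPAR capability reported". The function name dlpar_caps and the HMC host name in the usage comment are made up for illustration.

```shell
# dlpar_caps: hypothetical filter; reads `lspartition -dlpar` output on stdin
# and prints "<host> <ip> <DCaps> <ready|NOT_READY>" for each partition.
dlpar_caps() {
    awk '
    /Partition:</ {
        p = $0
        sub(/^.*Partition:</, "", p)   # keep the text after "Partition:<"
        sub(/>.*$/, "", p)             # drop the closing ">"
        split(p, f, /, */)             # fields: id*type*serial, host, ip
        host = f[2]; ip = f[3]
    }
    /DCaps:</ {
        d = $0
        sub(/^.*DCaps:</, "", d)       # keep the text after "DCaps:<"
        sub(/>.*$/, "", d)             # drop everything from the closing ">"
        # Assumption: 0x0 means the LPAR reports no DLPAR capabilities.
        print host, ip, d, (d == "0x0" ? "NOT_READY" : "ready")
    }'
}

# Usage (from an admin workstation; hmc.example.com is a placeholder):
#   ssh hscroot@hmc.example.com lspartition -dlpar | dlpar_caps
```

An LPAR that does not appear in the result at all corresponds to the "no output" case described above.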
Now we can take the next troubleshooting step and run the recfgct command (/usr/sbin/rsct/install/bin/recfgct), which removes all RSCT data under the /var/ct directory and regenerates the RSCT node ID data. During regeneration, the RSCT node ID information is copied from /etc/ct_node_id to /var/ct/cfg/ct_node_id, and the IBM.DRM subsystem is activated.
Check the rsct_rm group services with the lssrc -g rsct_rm command.
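The lssrc output can be scanned for anything in the rsct_rm group that is not active, and for IBM.DRM in particular, since that is the subsystem mentioned above for DLPAR. A minimal sketch, assuming the standard lssrc column layout (Subsystem, Group, PID, Status, with the PID column empty for inoperative subsystems). Both function names are illustrative.

```shell
# inactive_rsct_rm: hypothetical filter; reads `lssrc -g rsct_rm` output on
# stdin and prints "<subsystem> <status>" for each subsystem that is not active.
inactive_rsct_rm() {
    awk '$1 != "Subsystem" && NF >= 3 && $NF != "active" { print $1, $NF }'
}

# drm_is_active: succeeds only if IBM.DRM is listed with status "active".
drm_is_active() {
    awk '$1 == "IBM.DRM" && $NF == "active" { ok = 1 } END { exit !ok }'
}

# Usage (on the LPAR):
#   lssrc -g rsct_rm | inactive_rsct_rm
#   lssrc -g rsct_rm | drm_is_active || echo "IBM.DRM is not active"
```

Reading the status from the last field avoids depending on whether the PID column is filled in.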
After restarting the rsct_rm group services, we have to wait for some time for the connection between the LPAR and the HMC to be established. The partition status can then be checked with the lspartition command on the HMC.
The same can also be validated on the HMC with the lssyscfg command:
lssyscfg -r lpar -m <managed system name> -F rmc_ipaddr,lpar_id,name,state,rmc_state
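Because the -F option produces one comma-separated line per LPAR, the result is easy to filter for partitions whose RMC connection is not up. The sketch below assumes exactly the five fields requested above, in that order, and LPAR names without embedded commas. The function name rmc_not_active and the host name in the usage comment are hypothetical.

```shell
# rmc_not_active: hypothetical filter; reads lssyscfg output produced with
# -F rmc_ipaddr,lpar_id,name,state,rmc_state and lists the LPARs whose
# rmc_state is anything other than "active".
rmc_not_active() {
    awk -F, 'NF >= 5 && $5 != "active" {
        printf "%s (id %s, %s): rmc_state=%s\n", $3, $2, $4, $5
    }'
}

# Usage (from an admin workstation; hmc.example.com is a placeholder):
#   ssh hscroot@hmc.example.com \
#     'lssyscfg -r lpar -m <managed system name> -F rmc_ipaddr,lpar_id,name,state,rmc_state' \
#     | rmc_not_active
```

An empty result means every listed LPAR reports an active RMC state.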
Many people, as soon as they see a problem with the rsct_rm services, use the stopsrc or startsrc command to restart the rsct_rm group subsystems. To resolve RMC-related errors, it is always recommended to use the rmcctrl command instead of stopping and starting the RSCT services directly. Also keep in mind that an RMC error concerns the connection between the LPAR and the HMC; do not confuse it with the FSP connection to the HMC.