ProMon: The IBM strategic proactive monitoring infrastructure
Chip Layton, Senior IT Consultant
February 2016

ProMon: Definition of terms
• Critical Monitoring
  • Events
  • Outages
• Strategic Monitoring
  • Utilization
  • Redundancy
  • Standardization

© Copyright IBM Corporation 2016

ProMon: The case for Proactive Monitoring
• All systems tend toward chaos.
• Even when properly defined and documented build and change procedures are in place, people make mistakes.
• Redundancy that is built into an environment and properly verified during initial installation protects only against the first failure.
• Most monitoring systems report catastrophic failures in the environment but underreport failures of standby systems.

ProMon: Do you need Strategic Monitoring?
• Are your system administrators responsible for fewer than 15 LPARs each?
• Does your admin team routinely audit the current configuration of each LPAR?
• Does the storage and network configuration of your LPARs remain static over their lifetime?
• Do the system administrators, network administrators, and storage administrators keep each other informed of the details of upcoming changes?
• Is adding a new LPAR to your environment a rare event?
If you can honestly answer “yes” to all of the above questions, congratulations: you have good command and control of your environment.

ProMon: Items checked
• Disk attributes
• Active processes
• Path validation
• Monitoring and backup software installation
• Fibre Channel errors and attributes
• Shared Ethernet configuration

Use Case I for ProMon
Time to upgrade the VIOS servers on all the frames. Are you sure none of the LPARs will lose access to one or more of their disks? Maybe we should wait until we have had a chance to verify the system better.
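The question in Use Case I (will any LPAR lose a disk when one VIOS goes down?) comes down to counting healthy paths per disk. The following is a minimal sketch of that idea, not ProMon's actual implementation. It assumes `lspath`-style text with three whitespace-separated columns (status, device, parent adapter); the function names, the `min_paths` threshold, and the sample text are hypothetical and for illustration only.

```python
from collections import defaultdict


def count_enabled_paths(lspath_output):
    """Count Enabled paths per disk from lspath-style text.

    Assumed line format (an assumption, not taken from the slides):
        <status> <device> <parent>
    """
    counts = defaultdict(int)
    for line in lspath_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, disk = fields[0], fields[1]
        counts[disk] += 0  # register the disk even if no path is Enabled
        if status == "Enabled":
            counts[disk] += 1
    return dict(counts)


def disks_at_risk(lspath_output, min_paths=2):
    """Return the sorted names of disks with fewer than min_paths Enabled paths."""
    counts = count_enabled_paths(lspath_output)
    return sorted(disk for disk, n in counts.items() if n < min_paths)


# Hypothetical sample input, not real output from any system in the deck.
SAMPLE = """\
Enabled hdisk0 vscsi0
Enabled hdisk0 vscsi1
Enabled hdisk1 vscsi0
Failed  hdisk1 vscsi1
"""

print(disks_at_risk(SAMPLE))
```

With the sample above, only hdisk1 is reported, because one of its two paths is in the Failed state. The threshold is a site policy choice and is kept as a parameter for that reason.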
Use Case II for ProMon
Regular use can alert the system administration team that changes made in the SAN or network subsystem have had an unanticipated effect on LPAR redundancy.

Use Case III for ProMon
ProMon provides a regular method of checking the systems to ensure that normal system maintenance has not adversely impacted overall system performance.

Typical ProMon Environment
[Diagram: HMC, System x3550 M3 servers, a Centralized Monitoring Server, and Power 780, Power 795, and 520 frames]

ProMon: Commands used
• lsdev
• lsattr
• lspv
• lsvg
• fcstat
• lspath
• powermt
• ioscli
• entstat
• ps
• grep
• ssh
• lppchk

Sample LPAR Report
    LPAR Status Report Jul 21 2016 at 02:01:00 AM
    Processing HMC unxhmcpa002
    Collecting LPAR list from HMC unxhmcpa002
    Processing HMC unxhmcpa003
    Collecting LPAR list from HMC unxhmcpa003
    Processing Frame florida
    Processing Frame mississippi
    Processing Frame texas
    Processing Frame lousiana
    Processing Frame california
    Processing Frame georgia
    Processing Frame Alabama
    lpar001-NoLPM, Unable to audit LPAR due to failure of ssh
    …
    lpar2, Check rootvg on lpar2 for closed/stale partitions
    lpar2, MPIO policy error. Fewer than 4 path to disk hdisk0
    lpar2, MPIO policy error. Fewer than 4 path to disk hdisk1
    lpar2, Shortage of DMA Resources found for fcs2
    lpar2, Consider changing command elements on fcs2 from 500 to 1024
    lpar2, Consider changing transfer size on fcs2 from 0x100000 to 0x200000
    lpar2, OS inconsistencies found on lpar2
    ….
    hatesttlpar1-NoLPM, Unable to audit LPAR due to failure of ssh
    ….
    lpar3, Check rootvg on lpar3 for closed/stale partitions
    lpar3, Health Check interval policy error on hdisk12
    lpar3, Queue depth policy error for hdisk12
    lpar3, Health Check interval policy error on hdisk13
    lpar3, Queue depth policy error for hdisk13
    ….
    lpar4, TSM may not be running
           fcs and vscsi adapters both exist on lpar4
    …..
    lpar5, Check rootvg on lpar5 for closed/stale partitions
    lpar5, Reserve lock set to yes for hdiskpower0
    lpar5, Reserve lock set to yes for hdiskpower2
    lpar5, Reserve lock set to yes for hdiskpower3
    lpar5, Reserve lock set to yes for hdiskpower4
    lpar5, Shortage of DMA Resources found for fcs0
    lpar5, Consider changing command elements on fcs0 from 500 to 1024
    lpar5, Consider changing transfer size on fcs0 from 0x100000 to 0x200000
    ……
    A total of 253 LPAR were audited
    Errors were found in 206 LPAR or 81 %

Sample VIOS report
    Starting VIOS audits at 0201 on 072116
    florida-vio2 There is/are 4 configured Shared Ethernet Adapters on florida-vio2 functioning normally
    florida-vio2 For ent12 there were 0 MB sent and 1454 MB received with 0 xmit errors and 0 receive errors
    florida-vio2 Maximum Transmit Queue was 0 packets with no overflow errors
    florida-vio2 For ent20 there were 0 MB sent and 26 MB received with 0 xmit errors and 0 receive errors
    florida-vio2 Maximum Transmit Queue was 0 packets with no overflow errors
    florida-vio2, For ent32 there were 9096 MB sent and 10124 MB received with 18922929 xmit errors and 568 receive errors
    florida-vio2 Maximum Transmit Queue was 4 packets with no overflow errors
    florida-vio2, Dynamic tracking not set on fscsi0
    florida-vio2, Fast_Fail attribute not set on fscsi0
    florida-vio2, Dynamic tracking not set on fscsi1
    florida-vio2, Fast_Fail attribute not set on fscsi1
    florida-vio2, Dynamic tracking not set on fscsi2
    florida-vio2, Fast_Fail attribute not set on fscsi2
    florida-vio2, Dynamic tracking not set on fscsi3
    florida-vio2, Fast_Fail attribute not set on fscsi3
    florida-vio2, Dynamic tracking not set on fscsi4
    florida-vio2, Fast_Fail attribute not set on fscsi4
    florida-vio2, Dynamic tracking not set on fscsi5
    florida-vio2, Fast_Fail attribute not set on fscsi5
    florida-vio2, Dynamic tracking not set on fscsi6
    florida-vio2, Fast_Fail attribute not set on fscsi6
    florida-vio2, Dynamic tracking not set on fscsi7
    florida-vio2, Fast_Fail attribute not set on fscsi7
    florida-vio2, Check for failed adapter in SEA ent16
    florida-vio2, Unmirrored LV found rootvg audit_lv jfs2 /audit
    florida-vio1 There is/are 4 configured Shared Ethernet Adapters on florida-vio1 functioning normally
    florida-vio1 For ent12 there were 1451 MB sent and 1451 MB received with 0 xmit errors and 0 receive errors
    florida-vio1 Maximum Transmit Queue was 2 packets with no overflow errors
    florida-vio1 For ent16 there were 0 MB sent and 44 MB received with 0 xmit errors and 284 receive errors
    florida-vio1 Maximum Transmit Queue was 0 packets with no overflow errors
    florida-vio1 For ent20 there were 9250 MB sent and 9250 MB received with 0 xmit errors and 0 receive errors
    florida-vio1 Maximum Transmit Queue was 3 packets with no overflow errors
    florida-vio1, For ent32 there were 0 MB sent and 1218 MB received with 0 xmit errors and 568 receive errors
    florida-vio1 Maximum Transmit Queue was 0 packets with no overflow errors
    florida-vio1, Dynamic tracking not set on fscsi0
    florida-vio1, Fast_Fail attribute not set on fscsi0
    florida-vio1, Dynamic tracking not set on fscsi1
    florida-vio1, Fast_Fail attribute not set on fscsi1
    florida-vio1, Dynamic tracking not set on fscsi2
    florida-vio1, Fast_Fail attribute not set on fscsi2
    florida-vio1, Dynamic tracking not set on fscsi3
    florida-vio1, Fast_Fail attribute not set on fscsi3
    florida-vio1, Dynamic tracking not set on fscsi4
    florida-vio1, Fast_Fail attribute not set on fscsi4
    florida-vio1, Dynamic tracking not set on fscsi6
    florida-vio1, Fast_Fail attribute not set on fscsi6
    florida-vio1, Unmirrored LV found rootvg audit_lv jfs2 /audit
    VIOS audits complete at 0222 on 072116

Sample VIOS report
    Starting VIOS audits at 0257 on 052013
    atl750dvio01
    There is/are 2 configured Shared Ethernet Adapters on atl750dvio01 functioning normally
    For ent10 there were 2 MB sent and 6 MB received with 36447 xmit errors and 0 receive errors
    Maximum Queue Depth was 4 packets with no overflow errors
    For ent11 there were 0 MB sent and 9 MB received with 0 xmit errors and 0 receive errors
    Maximum Queue Depth was 2 packets with no overflow errors
    Unmirrored LV found on atl750dvio01 rootvg hd5 boot N/A
    Unmirrored LV found on atl750dvio01 rootvg hd6 paging N/A
    Unmirrored LV found on atl750dvio01 rootvg paging00 paging N/A
    Unmirrored LV found on atl750dvio01 rootvg hd8 jfs2log N/A
    Unmirrored LV found on atl750dvio01 rootvg hd4 jfs2 /
    Unmirrored LV found on atl750dvio01 rootvg hd2 jfs2 /usr
    Unmirrored LV found on atl750dvio01 rootvg hd9var jfs2 /var
    Unmirrored LV found on atl750dvio01 rootvg hd3 jfs2 /tmp
    Unmirrored LV found on atl750dvio01 rootvg hd1 jfs2 /home
    Unmirrored LV found on atl750dvio01 rootvg hd10opt jfs2 /opt
    Unmirrored LV found on atl750dvio01 rootvg hd11admin jfs2 /admin
    Unmirrored LV found on atl750dvio01 rootvg livedump jfs2 /var/adm/ras/livedump
    atl750dvio02
    Rootvg properly mirrored on atl750dvio02
    There is/are 2 configured Shared Ethernet Adapters on atl750dvio02 functioning normally
    For ent10 there were 0 MB sent and 8 MB received with 0 xmit errors and 0 receive errors
    Maximum Queue Depth was 4 packets with no overflow errors
    For ent11 there were 4 MB sent and 6 MB received with 11928 xmit errors and 0 receive errors
    Maximum Queue Depth was 2 packets with no overflow errors
    VIOS audits complete at 0307 on 052013

Once a day against the entire environment? You’re kidding me, right?
You can always run an audit against a single LPAR. Read the documentation to find out all of the available options.

hdisk214 thru hdisk217. What about 0 – 213?
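The Shared Ethernet Adapter statistics lines in the sample VIOS reports follow a fixed wording, which makes them easy to post-process when a daily report covers many frames. The following is a minimal sketch of that idea in Python. It is not part of ProMon: the regular expression is inferred only from the sample lines shown in this deck, and the names `SEA_LINE` and `adapters_with_errors` are hypothetical.

```python
import re

# Pattern inferred from the sample report lines; an optional host prefix
# such as "florida-vio2, " may precede the matched text.
SEA_LINE = re.compile(
    r"For (?P<adapter>ent\d+) there were (?P<sent>\d+) MB sent and "
    r"(?P<received>\d+) MB received with (?P<xmit>\d+) xmit errors and "
    r"(?P<recv>\d+) receive errors"
)


def adapters_with_errors(report_text):
    """Return (adapter, xmit_errors, receive_errors) for each statistics
    line that reports a nonzero error count, in report order."""
    flagged = []
    for line in report_text.splitlines():
        match = SEA_LINE.search(line)
        if match is None:
            continue  # not an SEA statistics line
        xmit = int(match["xmit"])
        recv = int(match["recv"])
        if xmit or recv:
            flagged.append((match["adapter"], xmit, recv))
    return flagged


# Three lines copied from the sample VIOS reports.
SAMPLE = """\
florida-vio2 For ent12 there were 0 MB sent and 1454 MB received with 0 xmit errors and 0 receive errors
florida-vio2, For ent32 there were 9096 MB sent and 10124 MB received with 18922929 xmit errors and 568 receive errors
For ent10 there were 2 MB sent and 6 MB received with 36447 xmit errors and 0 receive errors
"""

for adapter, xmit, recv in adapters_with_errors(SAMPLE):
    print(f"{adapter}: {xmit} xmit errors, {recv} receive errors")
```

A filter like this is one way to reduce a full-environment report to the handful of adapters that need attention. It depends on the report wording staying stable, so it should be treated as a convenience rather than a contract.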