2016-07 IBM LBS ProMon

ProMon: The IBM strategic
proactive monitoring
infrastructure
Chip Layton
Senior IT Consultant
February 2016
ProMon
Definition of terms
•Critical Monitoring
  •Events
  •Outages
•Strategic Monitoring
  •Utilization
  •Redundancy
  •Standardization
© Copyright IBM Corporation 2016
ProMon
The case for Proactive Monitoring
•All systems tend toward chaos
•Even when properly defined and documented build and change procedures are in place, people make mistakes
•Redundancy that is built into an environment and verified during initial installation only protects against the first failure
•Most monitoring systems report catastrophic failures in the environment but underreport failures of standby systems
ProMon
•Do you need Strategic Monitoring?
  •Are your system administrators responsible for fewer than 15 LPARs each?
  •Does your admin team routinely audit the current configuration of each LPAR?
  •Do the storage and network configurations of your LPARs remain static over their lifetimes?
  •Do the system administrators, network administrators and storage administrators keep each other informed of the details of upcoming changes?
  •Is adding a new LPAR to your environment a rare event?
•If you can honestly answer “yes” to all of the above questions, congratulations: you have good command and control of your environment.
ProMon
•Items checked
•Disk Attributes
•Active processes
•Path validation
•Monitoring and backup software installation
•Fibre Channel errors and attributes
•Shared Ethernet configuration
Use Case I for ProMon
Time to upgrade the VIOS servers on all the frames. Are you sure none of the LPARs will lose access to one or more of their disks?
Maybe we should wait until we have had a chance to verify the systems more thoroughly.
Use Case II for ProMon
Regular use can alert the system administration team that changes made in the SAN or network subsystem have had an unanticipated effect on LPAR redundancy.
Use Case III for ProMon
ProMon provides a regular method of checking the systems to ensure that normal system maintenance has not adversely impacted overall system performance.
Typical ProMon Environment
[Diagram: HMC, System x3550 M3 servers, Power 780, Power 795 and 520 frames, and a Centralized Monitoring Server]
ProMon
•Commands used
  • lsdev
  • lsattr
  • lspv
  • lsvg
  • fcstat
  • lspath
  • powermt
  • ioscli
  • entstat
  • ps
  • grep
  • ssh
  • lppchk
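As a rough illustration of how a central server might gather the output of these commands, the following Python sketch runs them over ssh and reports an unreachable LPAR. It is not ProMon's actual implementation: the function names, the timeout values, the use of key-based ssh, and the `true` reachability probe are all assumptions made for this example.

```python
import subprocess

def build_ssh_command(host, remote_cmd, connect_timeout=10):
    """Return the argv for running one command on a remote LPAR without prompting."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # fail rather than ask for a password
        "-o", f"ConnectTimeout={connect_timeout}",  # give up quickly on dead hosts
        host,
        remote_cmd,
    ]

def run_remote(host, remote_cmd, runner=subprocess.run):
    """Run remote_cmd on host and return (ok, stdout). runner is injectable for testing."""
    try:
        result = runner(build_ssh_command(host, remote_cmd),
                        capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return False, ""
    return result.returncode == 0, result.stdout

def audit_host(host, commands, runner=subprocess.run):
    """Collect the output of each command, or record that ssh to the host failed."""
    ok, _ = run_remote(host, "true", runner)  # hypothetical reachability probe
    if not ok:
        return {"error": f"{host}, Unable to audit LPAR due to failure of ssh"}
    return {cmd: run_remote(host, cmd, runner)[1] for cmd in commands}
```

A call such as `audit_host("lpar2", ["lspath", "lsvg -l rootvg"])` would return a dictionary of raw command output that later checks can parse.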
Sample LPAR Report

LPAR Status Report Jul 21 2016 at 02:01:00 AM
Processing HMC unxhmcpa002
Collecting LPAR list from HMC unxhmcpa002
Processing HMC unxhmcpa003
Collecting LPAR list from HMC unxhmcpa003
Processing Frame florida
Processing Frame mississippi
Processing Frame texas
Processing Frame lousiana
Processing Frame california
Processing Frame georgia
Processing Frame Alabama

lpar001-NoLPM, Unable to audit LPAR due to failure of ssh

…
lpar2, Check rootvg on lpar2 for closed/stale partitions
lpar2, MPIO policy error. Fewer than 4 path to disk hdisk0
lpar2, MPIO policy error. Fewer than 4 path to disk hdisk1
lpar2, Shortage of DMA Resources found for fcs2
lpar2, Consider changing command elements on fcs2 from 500 to 1024
lpar2, Consider changing transfer size on fcs2 from 0x100000 to 0x200000
lpar2, OS inconsistencies found on lpar2

….
hatesttlpar1-NoLPM, Unable to audit LPAR due to failure of ssh

….
lpar3, Check rootvg on lpar3 for closed/stale partitions
lpar3, Health Check interval policy error on hdisk12
lpar3, Queue depth policy error for hdisk12
lpar3, Health Check interval policy error on hdisk13
lpar3, Queue depth policy error for hdisk13

….
lpar4, TSM may not be running
fcs and vscsi adapters both exist on lpar4

…..
lpar5, Check rootvg on lpar5 for closed/stale partitions
lpar5, Reserve lock set to yes for hdiskpower0
lpar5, Reserve lock set to yes for hdiskpower2
lpar5, Reserve lock set to yes for hdiskpower3
lpar5, Reserve lock set to yes for hdiskpower4
lpar5, Shortage of DMA Resources found for fcs0
lpar5, Consider changing command elements on fcs0 from 500 to 1024
lpar5, Consider changing transfer size on fcs0 from 0x100000 to 0x200000

……
A total of 253 LPAR were audited
Errors were found in 206 LPAR or 81 %
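
The "MPIO policy error" lines in the report above suggest a check that counts the paths to each disk. The Python sketch below shows one way such a check could work on the default three-column `lspath` output (status, device, parent adapter). The function name is hypothetical, the minimum of 4 paths is taken from the sample messages, and counting only `Enabled` paths is an assumption for this example rather than ProMon's documented rule.

```python
def check_path_counts(lspath_output, min_paths=4):
    """Return one message per disk with fewer than min_paths Enabled paths."""
    enabled = {}  # disk name -> number of Enabled paths, in order first seen
    for line in lspath_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, disk = fields[0], fields[1]
        enabled.setdefault(disk, 0)
        if status == "Enabled":  # assumption: Failed/Missing paths do not count
            enabled[disk] += 1
    return [
        f"MPIO policy error. Fewer than {min_paths} path to disk {disk}"
        for disk, count in enabled.items()
        if count < min_paths
    ]
```

The threshold is a parameter because the right number of paths depends on how many VIOS and fabrics serve the LPAR.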
Sample VIOS report

Starting VIOS audits at 0201 on 072116
florida-vio2
There is/are 4 configured Shared Ethernet Adapters on florida-vio2 functioning normally
florida-vio2 For ent12 there were 0 MB sent and 1454 MB received with 0 xmit errors and 0 receive errors
florida-vio2 Maximum Transmit Queue was 0 packets with no overflow errors
florida-vio2 For ent20 there were 0 MB sent and 26 MB received with 0 xmit errors and 0 receive errors
florida-vio2 Maximum Transmit Queue was 0 packets with no overflow errors
florida-vio2, For ent32 there were 9096 MB sent and 10124 MB received with 18922929 xmit errors and 568 receive errors
florida-vio2 Maximum Transmit Queue was 4 packets with no overflow errors
florida-vio2, Dynamic tracking not set on fscsi0
florida-vio2, Fast_Fail attribute not set on fscsi0
florida-vio2, Dynamic tracking not set on fscsi1
florida-vio2, Fast_Fail attribute not set on fscsi1
florida-vio2, Dynamic tracking not set on fscsi2
florida-vio2, Fast_Fail attribute not set on fscsi2
florida-vio2, Dynamic tracking not set on fscsi3
florida-vio2, Fast_Fail attribute not set on fscsi3
florida-vio2, Dynamic tracking not set on fscsi4
florida-vio2, Fast_Fail attribute not set on fscsi4
florida-vio2, Dynamic tracking not set on fscsi5
florida-vio2, Fast_Fail attribute not set on fscsi5
florida-vio2, Dynamic tracking not set on fscsi6
florida-vio2, Fast_Fail attribute not set on fscsi6
florida-vio2, Dynamic tracking not set on fscsi7
florida-vio2, Fast_Fail attribute not set on fscsi7
florida-vio2, Check for failed adapter in SEA ent16
florida-vio2, Unmirrored LV found rootvg audit_lv jfs2 /audit

florida-vio1
There is/are 4 configured Shared Ethernet Adapters on florida-vio1 functioning normally
florida-vio1 For ent12 there were 1451 MB sent and 1451 MB received with 0 xmit errors and 0 receive errors
florida-vio1 Maximum Transmit Queue was 2 packets with no overflow errors
florida-vio1 For ent16 there were 0 MB sent and 44 MB received with 0 xmit errors and 284 receive errors
florida-vio1 Maximum Transmit Queue was 0 packets with no overflow errors
florida-vio1 For ent20 there were 9250 MB sent and 9250 MB received with 0 xmit errors and 0 receive errors
florida-vio1 Maximum Transmit Queue was 3 packets with no overflow errors
florida-vio1, For ent32 there were 0 MB sent and 1218 MB received with 0 xmit errors and 568 receive errors
florida-vio1 Maximum Transmit Queue was 0 packets with no overflow errors
florida-vio1, Dynamic tracking not set on fscsi0
florida-vio1, Fast_Fail attribute not set on fscsi0
florida-vio1, Dynamic tracking not set on fscsi1
florida-vio1, Fast_Fail attribute not set on fscsi1
florida-vio1, Dynamic tracking not set on fscsi2
florida-vio1, Fast_Fail attribute not set on fscsi2
florida-vio1, Dynamic tracking not set on fscsi3
florida-vio1, Fast_Fail attribute not set on fscsi3
florida-vio1, Dynamic tracking not set on fscsi4
florida-vio1, Fast_Fail attribute not set on fscsi4
florida-vio1, Dynamic tracking not set on fscsi6
florida-vio1, Fast_Fail attribute not set on fscsi6
florida-vio1, Unmirrored LV found rootvg audit_lv jfs2 /audit

VIOS audits complete at 0222 on 072116
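
The "Dynamic tracking" and "Fast_Fail" findings above correspond to the `dyntrk` and `fc_err_recov` attributes of an fscsi device as listed by `lsattr -El`. The following Python sketch shows how such findings could be derived from that output. The function names are hypothetical, and treating `dyntrk=yes` and `fc_err_recov=fast_fail` as the required values is an assumption inferred from the report wording, not a statement of ProMon's actual policy.

```python
def parse_lsattr(lsattr_output):
    """Map attribute name to value from 'lsattr -El <device>' style output."""
    attrs = {}
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]  # first column is the name, second the value
    return attrs

def check_fscsi(host, device, lsattr_output):
    """Return report lines for an fscsi device that misses the assumed settings."""
    attrs = parse_lsattr(lsattr_output)
    findings = []
    if attrs.get("dyntrk") != "yes":
        findings.append(f"{host}, Dynamic tracking not set on {device}")
    if attrs.get("fc_err_recov") != "fast_fail":
        findings.append(f"{host}, Fast_Fail attribute not set on {device}")
    return findings
```

A missing attribute is reported the same way as a wrong value, so a failed or empty `lsattr` run does not silently pass the check.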
Sample VIOS report
 Starting VIOS audits at 0257 on 052013
atl750dvio01
There is/are 2 configured Shared Ethernet Adapters on atl750dvio01 functioning normally
For ent10 there were 2 MB sent and 6 MB received with 36447 xmit errors and 0 receive errors
Maximum Queue Depth was 4 packets with no overflow errors
For ent11 there were 0 MB sent and 9 MB received with 0 xmit errors and 0 receive errors
Maximum Queue Depth was 2 packets with no overflow errors
Unmirrored LV found on atl750dvio01 rootvg hd5 boot N/A
Unmirrored LV found on atl750dvio01 rootvg hd6 paging N/A
Unmirrored LV found on atl750dvio01 rootvg paging00 paging N/A
Unmirrored LV found on atl750dvio01 rootvg hd8 jfs2log N/A
Unmirrored LV found on atl750dvio01 rootvg hd4 jfs2 /
Unmirrored LV found on atl750dvio01 rootvg hd2 jfs2 /usr
Unmirrored LV found on atl750dvio01 rootvg hd9var jfs2 /var
Unmirrored LV found on atl750dvio01 rootvg hd3 jfs2 /tmp
Unmirrored LV found on atl750dvio01 rootvg hd1 jfs2 /home
Unmirrored LV found on atl750dvio01 rootvg hd10opt jfs2 /opt
Unmirrored LV found on atl750dvio01 rootvg hd11admin jfs2 /admin
Unmirrored LV found on atl750dvio01 rootvg livedump jfs2 /var/adm/ras/livedump

atl750dvio02
Rootvg properly mirrored on atl750dvio02
There is/are 2 configured Shared Ethernet Adapters on atl750dvio02 functioning normally
For ent10 there were 0 MB sent and 8 MB received with 0 xmit errors and 0 receive errors
Maximum Queue Depth was 4 packets with no overflow errors
For ent11 there were 4 MB sent and 6 MB received with 11928 xmit errors and 0 receive errors
Maximum Queue Depth was 2 packets with no overflow errors
VIOS audits complete at 0307 on 052013
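
The "Unmirrored LV found" lines above could be produced from `lsvg -l rootvg` output, which lists logical partitions (LPs) and physical partitions (PPs) for each logical volume. The Python sketch below is one possible approach, not ProMon's actual code: the function name is hypothetical, and treating any logical volume with fewer than two PPs per LP as unmirrored is an assumption made for this example.

```python
def find_unmirrored_lvs(host, lsvg_output):
    """Return a report line for each LV that has fewer than 2 PPs per LP."""
    findings = []
    vg = ""
    for line in lsvg_output.splitlines():
        fields = line.split()
        if len(fields) == 1 and fields[0].endswith(":"):
            vg = fields[0][:-1]  # the volume group header line, e.g. "rootvg:"
            continue
        # Data rows: LV NAME, TYPE, LPs, PPs, PVs, LV STATE, MOUNT POINT
        if len(fields) < 7 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue  # skips the column header and anything malformed
        name, lv_type, mount = fields[0], fields[1], fields[6]
        lps, pps = int(fields[2]), int(fields[3])
        if pps < 2 * lps:  # assumption: one copy only means unmirrored
            findings.append(
                f"Unmirrored LV found on {host} {vg} {name} {lv_type} {mount}"
            )
    return findings
```

An empty result for a host corresponds to the "Rootvg properly mirrored" line in the sample report.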
Once a day against the entire environment? You’re kidding me, right?
You can always run an audit against a single LPAR. Read the documentation to find out all of the available options.
hdisk214 through hdisk217, but what about hdisk0 through hdisk213?