DB2 Best Practices

IBM® DB2® for Linux®, UNIX®, and Windows®
Best practices
DB2 pureScale performance and
monitoring
Webcast transcript
Steve Rees
Senior Technical Staff Member
DB2 Performance
Issued: January 2013
Contents
DB2 pureScale performance and monitoring
Webcast transcript
Introduction
Slide 2: Agenda
Slide 3: Helpful high level stuff to remember about pureScale
Slide 4: Configuring pureScale for 'pureFormance'
Slide 5: How many cores does the CF need?
Slide 6: How much memory does the CF need?
Slide 7: What about the cluster interconnect?
Slide 8: Infiniband vs. Ethernet?
Slide 9: What about disk storage?
Slide 10: Potential tuning for cluster scale-out
Slide 11: Sizing up the initial DB2 configuration
Slide 12: Sizing up the initial DB2 configuration (continued)
Slide 13: Agenda – part 2
Slide 14: A primer on two-level page buffering in pureScale
Slide 15: New LBP / GBP bufferpool metrics in pureScale
Slide 16: Accounting for pureScale bufferpool operations
Slide 17: pureScale bufferpool monitoring
Slide 18: pureScale bufferpool monitoring (cont.)
Slide 19: pureScale bufferpool monitoring (cont.)
Slide 20: pureScale bufferpool tuning
Slide 21: pureScale bufferpool tuning (cont.)
Slide 22: pureScale bufferpool tuning (cont.)
Slide 23: pureScale page negotiation (or 'reclaims')
Slide 24: Monitoring page reclaims
Slide 25: Reducing page reclaims
Slide 26: CURRENT MEMBER default column reduces contention
Slide 27: Monitoring CF CPU utilization
Slide 28: AUTOMATIC CF memory: simple case – 1 active database
Slide 29: AUTOMATIC CF memory and multiple active databases
Slide 30: Detecting an interconnect bottleneck
Slide 31: Drilling down on interconnection traffic
Slide 32: Finding interconnect bottlenecks with MON_GET_CF_CMD
Slide 33: Interconnect bottleneck example
Slide 34: Add another CF HCA
Slide 35: Low-level interconnect diagnostics
Slide 36: pureScale disk IO
Slide 37: Castout configuration
Slide 38: Castout monitoring
Slide 39: Optim Performance Manager and DB2 pureScale monitoring
Slide 40: Summary
Slide 41: Summary (continued)
Notices
Trademarks
Contacting IBM
Introduction
This document contains the transcript of the webcast available on the IBM DB2 for Linux, UNIX, and
Windows best practices community on IBM developerWorks at the following URL:
https://ibm.biz/BdxMvG
Hi, my name is Steve Rees.
I lead the DB2 pureScale performance team at the IBM Lab up in Toronto. I am very glad to present today
DB2 pureScale best practices for performance and monitoring. The material in this presentation is a combination of things we have learned at the Lab in our internal performance tests, as well as from performance benchmarks and customer engagements. It's really a pretty good summary of all the best practices that we've got on DB2 pureScale performance.
Slide 2: Agenda
I'd like to start off with a quick introduction and concepts. I am assuming that most people that are
listening to this are fairly familiar with DB2 pureScale. They know the basics of it so I'm not going to
spend a lot of time on it. We are then going to go on to the configuration angle on DB2 pureScale performance. We will look at the shape of the cluster and the components that make it up - what we would call cluster geometry - and the scaling of the cluster.
That's about the first third of the presentation. The remaining two thirds is dedicated to monitoring and tuning in pureScale, looking at bufferpools, locking, the cluster caching facility (CF), and so on.
Slide 3: Helpful high level stuff to remember about pureScale
I am not going to spend a lot of time getting into details about the background of DB2 pureScale. But
there are a few things I wanted to mention that I find are helpful in understanding the differences
between DB2 pureScale and non pureScale DB2.
The first one is that the CF or cluster caching facility is the hub of the cluster. Its performance is
extremely important when it comes to the performance of the overall cluster. The CF is the center of the
communication and coordination between the members. All significant communication between
members really goes through the CF. So good CF performance is very important to the overall
performance of the cluster.
Likewise the high speed, low latency interconnect, such as Infiniband, that is used to connect members to
the CF, is very important in maintaining high performance in pureScale.
The second thing is kind of an obvious one. DB2 pureScale is shared data technology. What that means is
that there is really only one copy of the database. Different members in pureScale share and sometimes
contend for access to different rows on the same page. What that means is that pureScale brings along the concept of page locks. That is a different kind of lock than we had to deal with in regular DB2 for Linux, UNIX, and Windows. This concept is very familiar to people who have experience with DB2 on System z Parallel Sysplex. Page locks are new in pureScale, and that has an impact on how the system is tuned.
The third thing is that inserts, updates, and deletes as SQL operations tend to drive more cluster activity than selects. When we talk about tuning and sizing a pureScale cluster, what comes into it is the nature of the workload: what fraction is reads, i.e. selects, versus inserts, updates, and deletes.
The last point is that pureScale introduces a kind of two tier bufferpool at the members and the CF.
Regular DB2 ESE (Enterprise Server Edition) non pureScale has a single level bufferpool. DB2 pureScale
introduces a multi level bufferpool, where each member has its own local bufferpool similar to ESE. The
CF has what is called a group bufferpool, or GBP for short, that caches all modified pages from all
members of the cluster. We have this two layer concept that makes tuning the bufferpools in pureScale
just a little different than regular DB2.
Slide 4: Configuring pureScale for 'pureFormance'
When configuring a pureScale cluster we can choose from a number of different cluster geometries or
shapes. They can be made up of large machines or small machines or medium sized machines in
different numbers, etc... and there are different ones shown on the page here. Usually what ends up happening when a customer deploys a pureScale cluster is that they choose the cluster geometry based on factors other than performance; they would choose based on skills or available hardware, etc...
The important thing to remember when configuring a cluster from the ground up is that a balance of resources needs to be maintained. We have to have a balance of CPU, memory, disk, and interconnect. If they are out of balance we are going to end up wasting resources and getting sub-optimal performance out of the cluster as a whole. We need to maintain that balance. That theme will continue to recur throughout the presentation.
At the bottom I have something marked “BP” for best practice. All the clusters shown on this page include a secondary CF (cluster caching facility). pureScale can run with a single CF. That is not a problem, but it introduces a single point of failure. The best practice is to include a secondary CF.
Slide 5: How many cores does the CF need?
How many cores does a CF need? That is probably one of the first questions that comes up in pureScale sizing discussions. As a rule of thumb, the sum of cores across all pureScale members is typically about 6 to 12 times the number of cores in each CF. That range depends on the read/write ratio of the workload. For a very write heavy workload it would probably be around the 6-times end of the range. For example, if I have 12 total cores on all the members then I would probably have 2 cores in each of the CFs. If I have a very read heavy workload with lots of selects and very few inserts, updates, and deletes, it would probably be in the range of up to 12 times as many member cores, so 24 total member cores and 2 in each of the CFs.
A good thing to remember is you don't have to pay to license the CF functionality, only the members.
Obviously the CFs are not free, but at least you don't have to pay for the software that is running on
those.
An important point is that the CF can get extremely busy. The only way that pureScale can achieve a response time that is in the tens of microseconds is if the cores on the CF are allowed to focus on what they have to do; they have exclusive use of the CPU. If you go to one of the CFs and run a vmstat command or another tool that shows CPU utilization, you will typically see 100% utilization, and that is normal even on an idle cluster.
We strongly advise dedicated cores for the CF. If you are in a virtualized environment then [shared
processors] can be fine for the member. For the CF, because it is so busy, and because the response time of
the activity going on on the CF is so critical, we strongly advise dedicated cores. We also advise at least one physical core for the CF. Sometimes when a very small pureScale cluster is set up, 1 or 2 virtual CPUs might seem to be enough, but we would recommend a full physical core. On Power, a POWER7 machine, one physical core would typically mean 4 logical threads; on Intel, it probably means two hardware threads. Having at least one physical core is a good recommendation. You can co-locate the CF and a member within 1 LPAR or 1 Linux machine. Basically what that means is the CF and member are running in the same operating system and sharing the same cores, etc... If you are going to do that, then the CF and the member should be pinned to separate sets of cores. The CF gets one set of cores and the member gets the other set, so that they don't end up fighting for processor cores or cache space.
Slide 6: How much memory does the CF need?
The general rule of thumb for a cluster of three or more members is that the Group Buffer Pool (GBP)
would be about 35-40% of the sum of all the local bufferpools. Suppose I had a 4-member cluster with a local bufferpool size of 1 million 4K pages on each member; our best-practice rule of thumb would put the GBP size at about 1.5 million pages. If I had a higher read workload with more selects, the size of the GBP can come down a little bit; consider 25 percent as a minimum. We don't recommend you go any smaller than that because of other things that need to be stored in the GBP, even for read-only workloads. If you have a 2-member cluster then we would probably recommend about 40-50 percent, depending on the read/write ratio.
GBP is an important factor here in sizing the CF. The CF memory is dominated by that, and GBP is
definitely the biggest memory consumer.
CF_DB_MEM_SZ, the memory for the entire active database [on the CF], should be about 25% bigger than the GBP size. Generally you would size the GBP and then add another 25% for the overall database memory size. That database memory is used for things like the lock list, the shared communications area, etc., and other data structures that exist in the CF besides the GBP. The GBP only stores modified pages. Unlike Parallel Sysplex, where the coupling facility can store read-only pages as well, on pureScale we are only storing modified pages. So the higher the read ratio, the smaller the CF and the GBP can be.
A good thing to remember is that the GBP is always sized in 4K pages, regardless of what is going on in the members. If you have a 1000-page 8K bufferpool on a member, that represents 2000 4K pages at the GBP, and we have to do the GBP sizing in 4K pages regardless of what we are doing on the members.
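To make that arithmetic concrete, here is a sketch of the sizing for a hypothetical 4-member cluster. The database name SAMPLE and the exact values are illustrative only, and this assumes you are setting the sizes explicitly rather than leaving them AUTOMATIC (AUTOMATIC is discussed later in the presentation):

    -- 4 members x 1,000,000 local 4K pages = 4,000,000 pages total
    -- GBP rule of thumb: ~35-40% of that, so roughly 1,500,000 4K pages
    -- CF_DB_MEM_SZ: GBP plus ~25% for the lock list, SCA and other CF structures
    UPDATE DB CFG FOR SAMPLE USING CF_GBP_SZ 1500000
    UPDATE DB CFG FOR SAMPLE USING CF_DB_MEM_SZ 1875000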
Slide 7: What about the cluster interconnect?
The cluster interconnect is a very important piece of technology. It provides remote direct memory access
between the members and the CF, at extremely low latency, sort of 10 microseconds to 20 microseconds; a
very small amount of time. Typical configurations would use 1 Infiniband interconnect adapter (we call that a host channel adapter, or HCA). We use one of those per CF, and one per member. If you are
running on a Power system, those HCAs can be shared between multiple LPARs if needed. It is divided
up by the Hypervisor.
The CF HCA handles all the combined message traffic from all the members. That means that the HCA at the CF is going to be the one that “heats up” the fastest - it's going to be the one that gets busiest fastest. The
good news is that DB2 pureScale can support multiple host communication adapters, for Infiniband and
for RoCE Ethernet, on the CF. Then we can add additional adapters to eliminate any bottlenecks that
might occur there.
In very round figures we are probably looking at 1 Infiniband HCA supporting about 6 to 8 CF cores, depending on the workload. Most of the adapters that pureScale uses have 2 ports each, and you might look at those and say “aha, I have 2 ports here and I can get double the performance by simply
using both ports”. Unfortunately those ports share circuitry inside. In our experience in the lab we have
not seen performance improvements by using both ports.
Can an HCA be shared between member and CF partitions residing on one machine? Yes it can. I
mentioned that the AIX Hypervisor can do that. One thing to be aware of is not to overload that HCA. If
we have multiple members or members and a CF sharing an HCA, we could overload that if we were not
careful. There are some suggestions here on how to calculate whether or not a given configuration would
be suitable for one HCA.
Slide 8: Infiniband vs. Ethernet?
DB2 pureScale on Linux has supported RoCE Ethernet for quite a while, as well as Infiniband. AIX in DB2 10 now supports RoCE Ethernet as well.
There are many customers who prefer to look at Ethernet because the technology is more familiar to
them. They probably already have ethernet infrastructure in their IT department, and Infiniband might be
new to them. So many customers prefer to look at Ethernet.
In terms of raw bandwidth, from a performance perspective, Infiniband beats RoCE hands down. The raw bandwidth we are looking at here is QDR Infiniband at 40 Gb/sec; DDR, the one we use on AIX, is most commonly 20 Gb/sec; whereas RoCE Ethernet for both Linux and AIX is 10 Gb/sec.
You might look at those and say “well, Ethernet is going to be one quarter the speed of QDR Infiniband.”
Fortunately it's not that bleak a story.
DB2 pureScale performance depends significantly on small message response time. That's the most
important thing. If we look at the message response time comparing RoCE and QDR Infiniband on the
graph, the tall pink bars represent Ethernet. Lower is better in this graph. We notice that the QDR
Infiniband bars are about half the height of the Ethernet ones, meaning the response time of Infiniband is
about half that of Ethernet. So okay, half [the performance for Ethernet] is better than a quarter. We were expecting a 4 to 1 performance difference between these, and there is only a 2 to 1 difference in response time. That's good. The story gets a little bit better still. There are a lot of things going on on a pureScale cluster, not just messages going back and forth. If we put together the whole picture and look at throughput, comparing an Ethernet configuration and an Infiniband configuration, we see about a 5-15 percent performance difference between them. It's not 2 to 1 and it's
certainly not 4 to 1. It's about a 5-15 percent difference; 5-15 percent better performance on Infiniband than on RoCE Ethernet. In many cases that is not going to be noticeable. Of course your mileage will vary; it depends how performance sensitive your environment is. But it's certainly much better performance for Ethernet than would be indicated by the bandwidth figures.
Slide 9: What about disk storage?
What about disk storage?
Disk storage for DB2 pureScale is not all that different from DB2 ESE. We want to make sure we have adequate IO bandwidth to keep response times low, particularly for the logs. Each pureScale member has its own log, and it needs to flush its log more frequently than regular DB2 ESE does. That makes us even more focused on log performance in pureScale than we are in regular ESE. That's why I want to mention solid state disks (SSDs). SSD is not normally something that
would be associated with a transaction log device, but for pureScale what we are looking for is the
absolute best log flush performance we can get. A relatively small number of solid state disks could
make all the difference between adequate log performance and really good log performance, and might
remove a bottleneck. That's something to keep in mind.
Another aspect of disk storage is the notion of SCSI-3 persistent reserves (PR). This helps with the
recovery times on the storage area network, or SAN. If one of the members goes down then that member
needs to be fenced off from the storage before recovery can proceed. And that fencing off of the failed
member from the storage can happen very quickly, in just a matter of seconds, if the system has SCSI-3
PR. It is supported at the storage level, and depends on the storage that you choose. There is IBM storage
and non IBM storage that would support it. For some of the IBM ones like V7000 models shown here, it
makes quite a bit of difference. A typical recovery time might be in the order of 20 to 30 seconds if you
have SCSI-3 PR. Without SCSI-3 PR we have to rely on lease expiry. It could take 60 or 90 seconds or
even a little more.
For GPFS configuration, a best practice is to use a separate GPFS file system for the log and for the table
spaces. Other than that pureScale has basically built into it many of the GPFS tuning steps that are
required for best performance. For example, things like enabling direct IO and setting larger 1 megabyte
block sizes - these steps help in the GPFS performance and are done automatically by the db2cluster
command.
Slide 10: Potential tuning for cluster scale-out
One of the major design points of DB2 pureScale is the ability to scale out. We can transparently add new members and we don't have to re-partition data. Everything can grow in a very transparent manner. But back at the beginning we talked about the notion of balance. The balance of resources in the cluster is very important. If I double the number of members in my cluster but don't do anything about the memory at the CF, or the cores in the CF, or the number of Infiniband adapters in the CF, then I have not maintained that balance and I might not get the benefit that I should from, say, doubling my cluster size. So the kinds of things that you want to look at are: can the disk storage keep up, am I creating a bottleneck in the interconnect, and so on. The monitoring and tuning steps required to detect and fix these issues are covered in the second half of the presentation.
Slide 11: Sizing up the initial DB2 configuration
In terms of the initial DB2 configuration for pureScale, a couple of things come to mind. The first one is that larger extent sizes tend to perform better than small extent sizes.
The reason why is that some operations require communication with the CF and some other processing
every time a new extent is allocated. If my extent size is 2 pages, very small, I am going to do a lot of that
kind of chatty communication with the CF as I am growing my table. If I am at the default 32 pages or 64,
or 128, etc... or even larger extent sizes, then I am going to do many fewer interactions with the CF as the
table grows, and that is going to save overhead in the cluster. The default of 32 pages usually works pretty well, but feel free to go beyond that if your data and the way you use it allow for it.
We also want to talk about smaller DB2 page sizes. Smaller DB2 page sizes tend to perform better than
larger ones. The reason why is first of all a typical pureScale workload tends to drive a lot of random IO.
We don't usually see a lot of scanning. We are looking at reading a row here, a row there, and each time
we read one, if we are reading from disk and sending it to the cluster, then it will be wrapped in a page,
so to speak. It comes along with other rows that maybe we are not interested in at the moment. The
smaller that wrapping, that packaging, is, the fewer resources it is going to take to move around; so a smaller page means less data flow between the member and the CF. In general we would
recommend using the smallest page size that will hold the rows you are going to use. If you have a 6000-byte row, that is probably going to be an 8K page; an 8K page would likely be better than a 32K page from a performance perspective.
If you are moving applications that use sequences or identity columns to DB2 pureScale we would
recommend you use a large cache size. There's an ORDER keyword when creating sequences and identity columns; just avoid using it. The reason why is that on pureScale obtaining a new batch of numbers in sequence requires a chat with the CF and a log flush. Those are not all that expensive, but they are not free. So, depending on how frequently we are going back for new identity and sequence numbers, we want to make that as efficient as possible. A large cache size enables each member to have a set of numbers to draw on without having to go back to the CF or do the log flush. It generally results in a fair bit better performance. The thing to do is to tune it: test to see what cache size yields the kind of performance that you need in your system.
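As a sketch of what that looks like (the sequence name and cache size here are hypothetical; tune the CACHE value for your own workload):

    -- NO ORDER and a large cache: each member caches its own batch of values,
    -- so most new values need no trip to the CF and no extra log flush
    CREATE SEQUENCE ORDER_SEQ AS BIGINT
      START WITH 1 INCREMENT BY 1
      CACHE 1000 NO ORDER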
Slide 12: Sizing up the initial DB2 configuration (continued)
Another thing to remember is that DB2 pureScale can have a greater lock list requirement than DB2 ESE. If you are moving from a DB2 ESE system, the lock list might have to get bigger. Why is that? I mentioned at the beginning that we have this thing called a page lock. The lock list in pureScale holds the same kinds of locks that we used to have, like row locks and table locks and package locks, and variation locks and other things. Now in pureScale we also have whole-page locks. Those other kinds of locks, like row locks for example, can be escalated; we can use LOCKSIZE and other controls to manage how many of them there are. Physical page locks don't escalate. They don't collapse into escalated replacements. And so we want to make sure that we won't run out of lock list space because of those page locks. The Information Center recommends 3% of the local bufferpool size as a good rule of thumb for the lock list. I've seen cases of up to 6%; we don't have to go too much higher than that. Depending on how many locks would be held over a long period of time in your environment, 3-6% is probably going to be needed to make sure that you do not run out of lock list.
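As a rough illustration of that rule of thumb (the database name SAMPLE and the numbers are hypothetical): with local bufferpools totalling 1,000,000 4K pages on a member, 3% works out to about 30,000 4K pages:

    -- LOCKLIST is specified in 4K pages
    UPDATE DB CFG FOR SAMPLE USING LOCKLIST 30000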
A good piece of news in DB2 10 for pureScale users is that DB2 pureScale now supports range partitioned tables. These were not supported in DB2 pureScale versions before DB2 Version 10. This is one
of the things that many customers really wanted, and range partitioned tables are a great fit for a model where data flows in as a chunk, gets processed, and then flows out. The range of time can be days, or weeks, or months, that kind of idea. It is also useful for breaking up a range of keys in a large and highly concurrent table. By being able to do this we can have a local index on each partition, which can help reduce contention as well. Related to this is the feature that came along in DB2 10 called CURRENT MEMBER, and we are going to talk a little bit more about that later as another way to help reduce contention, if it happens to crop up.
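Here is a sketch of that kind of range partitioned table with a local index; the table, column, and index names are hypothetical:

    -- One range per month; data flows into a partition, gets processed, then ages out
    CREATE TABLE SALES (
      SALE_DATE DATE NOT NULL,
      AMOUNT    DECIMAL(10,2)
    )
    PARTITION BY RANGE (SALE_DATE)
      (STARTING FROM ('2013-01-01') ENDING ('2013-12-31') EVERY (1 MONTH))

    -- A PARTITIONED (local) index keeps each partition's index pages separate,
    -- which helps reduce cross-member contention
    CREATE INDEX SALES_DATE_IX ON SALES (SALE_DATE) PARTITIONED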
Slide 13: Agenda – part 2
That wraps up the first part. Let's move on to Monitoring and Tuning.
Slide 14: A primer on two-level page buffering in pureScale
Let's start with a little bit of background about the bufferpool in DB2 pureScale and the two-level
metaphor that we use for it. As was said at the beginning, we have the local bufferpool at each member,
called the LBP for short. And that is a lot like the DB2 EE bufferpool. It caches both read only and
updated pages for that member. The group bufferpool is at the CF. It contains 2 main kinds of things.
First of all it contains a reference to every page in all LBPs across the cluster. It's very important that the
CF knows which members have which pages and which members don't have those pages. The GBP also
contains copies of all modified pages. This is the part that most people know about, the modified pages
being stored in the GBP. When a page is modified at a member, for example if I do an update statement
and then commit that, at commit time the page is flushed from the member up to the CF and it goes in the
group bufferpool. It can then be used by other agents, other applications running on other members.
Eventually it will get flushed to disk as well.
The efficiency advantage of this is quite strong. If we have a hot page which is being modified or
referenced fairly frequently, it's in the GBP. It's very efficient to access and it's much quicker and better
than going straight from there to disk. It can be an order of magnitude, or more, more expensive to go to
disk. For example, page read requests over Infiniband could cost in the 20-40 microseconds range and
that could be typically a hundred times faster than going to disk. The question is: How do I size these
bufferpools properly? The good news is that DB2 provides a whole group of monitoring metrics about
the local bufferpool and the group bufferpool. So we can use those to tune the sizes to fit our system.
Slide 15: New LBP / GBP bufferpool metrics in pureScale
Let's look at some of those metrics.
The first one is pool_data_lbp_pages_found (LBP pages found). This is basically a count of the number of page references that were resolved in the LBP: we needed a page and it was present, either valid or invalid, in the LBP.
What is an invalid page? Consider if I have a cluster with two members and I have the same page resident
in both members, Member A and Member B. Member A updates that page. Now member B has a stale
copy of the page. It doesn't have the updated version. We can't let B use that page thinking that it's
current and thinking it has the latest data. Because it doesn't. When member A committed its change to
that page, the CF knew that member B had a copy of this page and was able to reach out through RDMA
and mark that page on member B as invalid or stale.
Getting back to this particular counter, LBP pages found, we increment this counter if the page was there,
either valid or invalid. Either “fresh” or “stale”, if you will.
pool_data_gbp_l_reads (GBP logical reads) is the number of times we went to the group bufferpool to read a page. For example, if a page wasn't found locally: the first time I read it I check my local bufferpool, and if it's not there I go to the GBP to see if it's there. That's a GBP logical read.
An interesting thing or kind of trivia about pureScale: it supports prefetching. All DB2 versions support
prefetching. We support prefetching from disk. DB2 pureScale also supports prefetching from the group
bufferpool up in the CF down into the local bufferpool on the member. If data access patterns suggest
that this would be a good thing to do, DB2 will start shipping down pages in advance of you needing
them. In other words, prefetching. This GBP L reads counter includes prefetching. And so when we are
doing hit ratio calculation we have to take that into account.
pool_data_gbp_p_reads (GBP physical reads). This is a bit of a misnomer because it kind of implies that the GBP or the CF is going to be doing reads from disk. The members in a pureScale cluster do all the disk IO: they do the reads and they do the writes. The CF and the GBP don't do any IO.
So why do we have a counter called GBP physical reads? This is a count of the number of physical reads
that were done because the page was not present in either the local bufferpool or the group bufferpool.
pool_data_gbp_invalid_pages (GBP invalid pages). This sounds pretty sinister but all it means is the
number of times we went to the GBP because a page that I have locally at the member was present and
marked invalid. It was stale. Consider the Member A and B scenario we talked about before. If member B
went to read that page and it was marked invalid, then went to the GBP in the CF to see if it was there,
that would increment this counter. It was a trip to the GBP because it was an invalid page.
pool_async_data_gbp_l_reads (async data GBP logical reads - pages prefetched). This is the count of the number of pages we brought down from the GBP into the local bufferpool through prefetching. It is not extremely common, but it comes in handy to help us adjust our calculations and determine how much successful access there was to the group bufferpool in a non-prefetching, direct-access kind of way.
Slide 16: Accounting for pureScale bufferpool operations
Let's look at some scenarios.
We have some pictures here that make things a little bit clearer.
We have four different common scenarios, and how they affect the counters down on the left-hand side.
First case is very simple. The agent needs to find a page. It looks in the LBP. Is it there? Yes it's there and
it is valid. We did a logical read and it was a local bufferpool page found.
Second scenario. I have a page marked stale (marked invalid) in the local bufferpool. I can't use that and have to go to the GBP to see if it is there; in this case it is. The counters incremented are the same as last time, but now also include some of the GBP counters, including the invalid pages one.
In the third case, the page is not in the LBP but found in the GBP. Almost the same as the second case. It's
not here at all. We go to the GBP and it's there and we have a slightly different pattern of counters
updated.
Last but not least, the scenario where the page hasn't been touched at all. It's not in the local bufferpool and it's not in the GBP; it's found on disk instead. We look locally, we look at the CF, and then we get it from disk. We end up with yet another set of increments to our counters.
Slide 17: pureScale bufferpool monitoring
We can take those metrics and put them together into some basic monitoring techniques and formulas for
looking at bufferpool monitoring in pureScale.
First of all is our old friend the bufferpool hit ratio, just (logical reads - physical reads) / logical reads. You will notice that we factor out the prefetched reads when looking at the physical reads. Also, none of these metrics have LBP or GBP in their names, because this is the same formula we would use in non-pureScale DB2. The kinds of values we would look for are 95% for indexes and 90% for data. Slightly lower than that would be okay; if it got much lower than the “good values” then we would consider tuning up the bufferpools.
That was the overall hit ratio. We can also look at the local LBP hit ratio.
Here all we do is we take logical bufferpool pages found and we subtract off any that might have been
prefetched, and we divide that by the total number of logical reads we did. And that gives us the number
that were satisfied locally.
The interesting thing is that the LBP hit ratio is typically lower than the overall hit ratio because it doesn't include anything that was found in the group bufferpool. Note that we still count invalid pages as a hit; they are in LBP pages found. The reason why is that if we counted an invalid page as a “miss”, we might be tempted to increase the size of the local bufferpool to improve our hit ratio - but that kind of miss isn't because our bufferpool is too small. It's a miss because the page was invalid. It was present, just marked stale. We don't need to increase the size of the bufferpool, and that's why we want to include it as a hit in the calculation.
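Here is a sketch of how you might pull those two ratios from MON_GET_BUFFERPOOL (data pages only; a real script would normally add the corresponding POOL_INDEX_* elements and might aggregate across members):

    SELECT BP_NAME, MEMBER,
           -- overall hit ratio: (logical reads - physical reads) / logical reads
           DEC((POOL_DATA_L_READS - POOL_DATA_P_READS) * 100.0
               / NULLIF(POOL_DATA_L_READS, 0), 5, 2) AS OVERALL_HIT_PCT,
           -- LBP hit ratio: (LBP pages found - prefetched pages found) / logical reads
           DEC((POOL_DATA_LBP_PAGES_FOUND - POOL_ASYNC_DATA_LBP_PAGES_FOUND) * 100.0
               / NULLIF(POOL_DATA_L_READS, 0), 5, 2) AS LBP_HIT_PCT
    FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
    WHERE BP_NAME NOT LIKE 'IBMSYSTEMBP%'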
Slide 18: pureScale bufferpool monitoring (cont.)
The GBP hit ratio is very similar to the overall one: we take GBP logical reads minus GBP physical reads, divided by GBP logical reads.
The interesting thing here is that the hit ratios for the GBP are typically quite low, particularly for a very read heavy environment. Imagine one out of one thousand references had to go to the group bufferpool.
Chances are in a case like that the page we're looking for isn't in the group bufferpool either. We are
going to have to go to disk. We can end up with really low GBP hit ratios for high read ratio
environments. But that's not necessarily a problem. Increasing the GBP sizes is not always necessarily the
best thing to do. We have a little bit of information coming up on how to know when that's the case.
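Along the same lines, a sketch of the GBP hit ratio calculation (again, data pages only):

    SELECT BP_NAME, MEMBER,
           DEC((POOL_DATA_GBP_L_READS - POOL_DATA_GBP_P_READS) * 100.0
               / NULLIF(POOL_DATA_GBP_L_READS, 0), 5, 2) AS GBP_HIT_PCT
    FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS T
    WHERE BP_NAME NOT LIKE 'IBMSYSTEMBP%'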
Slide 19: pureScale bufferpool monitoring (cont.)
Group bufferpool full conditions are an important thing to be aware of. This is kind of like, if you are
familiar with non-pureScale DB2, the concept of a dirty steal. Think of it like this: you come into a cafeteria.
You have your lunch with you on a tray and all the tables are dirty and you want to find a place to sit
down. You have to stop and clean the table - clean someone else's dishes - before you can sit down and
use that table. That's kind of what's happening here. We want to bring a newly modified page up to the
group bufferpool, and it's full, with no place to put the new page. So we have to arrange for some castout, which is the pureScale equivalent of page cleaning, to free up some space. Those are expensive operations. We don't want to block bringing in a new page behind a forced cleaning of some of the GBP pages. So we have this calculation for group bufferpool full conditions.
There is a table function called mon_get_group_bufferpool. We simply divide the number of group bufferpool full conditions by the number of transactions and multiply by 10,000. We are basically asking “how many group bufferpool full conditions have we really seen per 10,000 transactions?”. Ideally we
want zero. But if it's only one or two it's probably fine. If it's 10 or 20 then chances are that we are seeing
a condition that is really impacting our pureScale performance. We want to do something about it.
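A sketch of that calculation, using the NUM_GBP_FULL counter from MON_GET_GROUP_BUFFERPOOL together with commit counts from MON_GET_WORKLOAD (the framing is mine, not the slide's; both counters are cumulative since activation, so in practice you would take the difference over a measurement interval):

    SELECT DEC((SELECT SUM(NUM_GBP_FULL)
                FROM TABLE(MON_GET_GROUP_BUFFERPOOL(-2)) AS G) * 10000.0
               / NULLIF((SELECT SUM(TOTAL_APP_COMMITS)
                         FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS W), 0), 9, 2)
             AS GBP_FULL_PER_10K_TXNS
    FROM SYSIBM.SYSDUMMY1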
Slide 20: pureScale bufferpool tuning
In terms of tuning, we're just going to take some of these simple calculations, and go through them
looking for what needs to be worked on. So, step 1, look at the overall bufferpool hit ratio. Does it meet
the goals? If it does, then we're done.
If not, then in step 2 you can check the local bufferpool hit ratio. We looked earlier at the target values that would be desirable. The important thing to remember, though, is that the group bufferpool does two things: it pulls modified pages in, and it also holds references to every page in every local bufferpool across the cluster.
So suppose I decide that my local bufferpools are too small, and I need to double the size. I have now
doubled the potential number of pointers or references that the GBP has to keep track of, because I have
doubled the number of pages in my local bufferpools across my cluster. This is going to use up more
space to do that, and it might actually end up squeezing out space that the GBP could use for modified
pages.
So, good rule of thumb, for every extra eight local bufferpool pages I add, the GBP needs one more page
for registration for those page references. So if I add 4000 new pages to the local bufferpools, I want to
add 500 new pages to the group bufferpool. That's just to keep everything balanced. We don't have to be
super accurate about it, but we don't want to go in there and just add a whole bunch of memory to local
bufferpools and not consider the group bufferpool.
Slide 21: pureScale bufferpool tuning (cont.)
That was the local bufferpool hit ratio and the group bufferpool hit ratio; we looked at these before. A couple of important points: I mentioned the group bufferpool hit ratio can be really low, so how do you tell whether or not low is OK in your environment?
A couple of extra rules of thumb. First of all, there's a calculation here: is pool_data_l_reads greater than 10 times pool_data_gbp_l_reads? In other words, are less than 10% of the page reads going to the GBP? In a very high read environment, if I'm not going to the GBP that much, I don't really care about the
ratio of success, you know, the hit ratio, at the GBP. If it's less than 10% then I'm not going to worry about
it too much. If it's more than 10%, chances are that it is actually helping me out, or it could be helping me
out, and I should pay attention to the GBP hit ratio.
Something else we can look at, is did I go to the GBP more than 25% of the time due to an invalid page?
Or was it really mostly going to the GBP for missing pages? If it was for invalid pages, well, then chances
are that the GBP is doing a lot more for the cluster in terms of performance, because if I have invalid
pages, chances are the GBP has the good copy of the page, and I don't have to go to disk for it. Whereas, if
I just don't have a page at all in the local bufferpool, the odds of it being in the GBP are much less. So, if it
is more than 25% due to invalid pages, chances are that the GBP is really helping me out and it could
benefit from extra pages above what it's got already.
Slide 22: pureScale bufferpool tuning (cont.)
The last one you want to check is GBP full. Like I said before, a great value for this - the number of GBP full conditions per 10,000 transactions - is zero. Good values are less than 5 per 10,000 transactions. If it's higher than that, there are a couple of things that we can look at.
The group bufferpool might be too small. There are castout engines in pureScale, and like I said, these are the pureScale equivalent of page cleaners. They might not be keeping up; do I have enough castout engines configured, which I set with NUM_IOCLEANERS? Or maybe SOFTMAX is set too high.
Slide 23: pureScale page negotiation (or 'reclaims')
Page negotiation, or something that's more commonly called a reclaim, is the idea that I mentioned earlier
on in the scenario where I could have two members each having a copy of the same page and each
wanting to modify different rows on the same page.
So let's look at a scenario here. Member A acquires a page P, and modifies a row on it and continues with
its transaction. So, member A gets page P, it modifies it and gets an exclusive lock from the global lock
manager, and does some other stuff for its transaction. Now, it hasn't committed yet. The guy goes for a
coffee, and is holding all these locks. Member B wants to modify a different row on the same page. Now,
it's a different row. If it was the same row, all the regular DB2 row locking stuff would kick in and it
would be no different than what we're normally used to. But here it's a different row. Now, the thing is
that the page can really only be modified in one place at a time, and B wants to modify it.
Does it have to wait? Does B have to wait until A commits so that it can get a hold of that modified page?
The answer is no. We have a mechanism in pureScale called a page reclaim. Page negotiation is another
name for it. What happens here is that member B makes a request to the CF as usual. The CF sends a
request to member A to release the page. Member A writes its modified copy of the page to the log. The
yellow copy of P is now off in the log for member A. The CF then reclaims the page, and passes page P off
to member B, which can then proceed with its transaction. Notice that member A has not committed its
transaction yet, and member B is still able to proceed.
This is pretty great technology really, because if we didn't have this, we would have a real contention or
concurrency problem between these two members. But instead they are both able to proceed. Now an interesting bit of trivia here: what happens if member A rolls back? Remember that member B now holds the modified page. Well, if member A rolls back, it simply tries to get that exclusive page lock back, the page is reclaimed from member B back to member A, and then A can make its modifications to the page.
Slide 24: Monitoring page reclaims
Now these are powerful and helpful for concurrency, but they're not free. They're not even really very
cheap. So we don't really want to see too many page reclaims going on.
We have a table function called mon_get_page_access_info that gives us some good information about
this. Here's an example of some of the numbers that might come up. Basically, in the bottom table here, it's going to give us information for each table - the table itself, or the indexes on the table - about how many reclaims have been going on. The different columns represent page reclaims for exclusive access, page reclaims for shared access, etc... Basically we're just looking for small numbers here. If we see one table or index that has particularly high numbers, that's an indication that maybe the members are fighting over the pages, and having to reclaim the page back and forth.
Now, how much is too much? Here it's 12,641. Is that excessive? Well, if this was taken over, say, a week, it wouldn't be excessive at all. If this was 12,641 in a minute, yes, that would be excessive. So, as a good rule of thumb, more than one reclaim per ten transactions is probably worth looking into.
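Here is a sketch of pulling those counters for one table (the schema and table name are placeholders, and the reclaim column names are as I recall them from MON_GET_PAGE_ACCESS_INFO, so verify them against the documentation for your level):

    SELECT OBJTYPE, TABNAME, MEMBER,
           PAGE_RECLAIMS_X,          -- reclaims for exclusive access
           PAGE_RECLAIMS_S,          -- reclaims for shared access
           RECLAIM_WAIT_TIME         -- time spent waiting on reclaims
    FROM TABLE(MON_GET_PAGE_ACCESS_INFO('MYSCHEMA', 'ORDERS', -2)) AS T
    ORDER BY PAGE_RECLAIMS_X DESC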
Slide 25: Reducing page reclaims
There are a couple of ways that we can reduce page reclaims. One is to use smaller page sizes, because, as
I mentioned earlier on, a large page size fits more rows on there, which might sound good. But that can
encourage “false sharing”: two different rows needed by two different members on the same page. The
bigger the pages, the more likely that is to happen. So small page sizes can reduce false sharing conflicts,
etc...
Another sort of paradigm that tends to come up a fair bit, is this idea of a tiny but hot table, with frequent
updates. You can have a great big complicated schema, but it turns out that there's one very important
small linchpin kind of table that has a small number of rows but is frequently updated, and as a result,
those pages are getting yanked back and forth between members and can cause lots of reclaims. What we
would recommend there is to use PCTFREE. So if I had a table with, say, 100 rows, and it was being
frequently updated, those 100 rows might all fit on one page. If I set PCTFREE to 99 and REORGed,
chances are those 100 rows are going to end up on 100 different pages. It's now 100 times bigger than it
was, but we're still only talking about 100 pages, which is really small, and now in a case like that where I
can get down to one row per page, I've completely eliminated this false sharing and things like that.
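As a sketch of that PCTFREE approach (the table name is hypothetical, and 99 is the extreme one-row-per-page setting from the example; it is the REORG that actually spreads the existing rows out):

    -- Leave ~99% of each page free so each row effectively lands on its own page
    ALTER TABLE APP.CONTROL_TABLE PCTFREE 99
    -- REORG (a CLP command, or via ADMIN_CMD) redistributes the existing rows
    REORG TABLE APP.CONTROL_TABLE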
Slide 26: CURRENT MEMBER default column reduces contention
A new feature that was added in DB2 10 was the CURRENT MEMBER special register and hidden
column. This helps deal with probably the most common causes of reclaims. One of them, case number one, is frequent inserts of increasing numeric values - call this a high key insert. So we're inserting rows into a table that has an index on a column where the new value is always bigger than all the previous ones. It might be a timestamp, it might be an order number, something like that.
You can picture an index in your mind, all the inserts, all the activity, all the new key insertion, is kind of
down at one side. It's at the high end of the index. There's a lot of contention there for those pages at the
high end of the index. And if I've got, since this is pureScale, multiple members who are all wanting to
insert rows into those pages at the top end of the index, that can create contention.
What we can do here, is we can add a CURRENT MEMBER column, and some SQL here is going to
show us what we're going to do. We're going to ALTER TABLE. The table here in this example is called
ORDERS. We're going to add a column called CURMEM, type small integer. We're going to say it's 'default current member implicitly hidden'. OK, so what's going to happen now is every time a row is
inserted, that CURRENT MEMBER column is going to be populated with the number of the member
where the agent was running when the insertion happened. So it could be member zero or one or two.
Now it's implicitly hidden, which is good, just in case there's any bad programming practice out there
where we say SELECT *, you won't get the current member back. But we're not talking about any
application changes here. This is just a DDL change. And then we're going to create an index. Instead of the index just being on the sequence number column of table ORDERS, we're going to make it on CURRENT MEMBER and sequence number. We're going to stick CURRENT MEMBER out front. Now, if I had a query that then said, you know, “SELECT stuff FROM ORDERS WHERE sequence number = some value”, you
might think that this index wouldn't be used because I have to go specify CURRENT MEMBER, and how
do I know what CURRENT MEMBER should be – it's just not very good.
The really great thing that goes along with this is that DB2 10 introduced a feature called jump scan that
actually makes this index work the way you want it to work. In jump scan terms we call it a gap column.
Even if CURRENT MEMBER is not specified in the query, we can still use the index in a very efficient
way. And that means that my index is now not just on sequence number, but CURRENT MEMBER and
sequence number. And so each member's insertions are in a different part of the index. Member 0 could
be down the left hand side, member 1 might be in the middle, member 2 might be on the right hand side.
And they're staying out of each other's way and reducing page reclaims.
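Putting the DDL just described into a sketch (ORDERS and CURMEM are the names from the talk; SEQNUM stands in for the sequence-number column, and the index name is my own placeholder):

    -- Hidden column, populated automatically with the member number at insert time
    ALTER TABLE ORDERS
      ADD COLUMN CURMEM SMALLINT DEFAULT CURRENT MEMBER IMPLICITLY HIDDEN

    -- Leading the index with CURMEM keeps each member's high-key inserts in its own
    -- part of the index; jump scan lets queries on SEQNUM alone still use the index
    CREATE INDEX ORDERS_SEQ_IX ON ORDERS (CURMEM, SEQNUM)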
Case two for this is a low cardinality index. So suppose I'm inserting rows into a table of people and
there's a column called GENDER that's just got two values. And this index on gender just has two keys:
male and female. And then it's got pointers to all the records, all the male records, all the female records,
etc. You can imagine the contention in trying to update those pointers. So CURRENT MEMBER comes to
the rescue again. We do a similar ALTER TABLE. We're going to add CURMEM. In this case here, we
don't have to add it in front – we can add it behind. But the point is that now we've taken these, in this
example I've got STATE, but the same thing would happen for GENDER. By adding CURRENT MEMBER to the index key I've made the key a finer granularity, and I'm helping to spread out the keys that are
being updated with pointers that point to rows with duplicate values. In doing that I also reduce page
reclaims between the members. Very cool stuff.
Now, the studious people that are listening to this will notice that this technique, in particular case one,
works for non-unique indexes. If this was a unique index this wouldn't do the trick. But it works great for
non-unique indexes.
Slide 27: Monitoring CF CPU utilization
Let's look at monitoring CF CPU utilization – how do I tell how busy my CF is? Like I said at the
beginning, if I go onto a CF, and I run vmstat, which tells me how busy the CPUs are, it's going to show
about 100% busy, even when the cluster is completely idle. And that's because there are threads running on the CF whose whole job is to spin looking for new messages coming in from members. We want to
get those messages in and processed and out in 10s of microseconds. And the only way to do that is to be
right on the job. We can't afford to take interrupts, we can't afford to do context switches. As a result, the
architecture of the CF looks really busy.
How do you tell how busy it really is? How close to saturated is it? Well, we have a new [admin view]
called env_cf_sys_resources, and this reaches into the CF and gets me some good information in this
example output on the right here. It shows the two CFs I've got, the primary and the secondary, and it shows me memory and lots of good stuff there. But the one I'm really interested in is the CPU utilization. In this particular example, it's showing that both CFs are 93% busy. To me that's busier than I'd want it to be. I would see this and say, “you know what, my CF is overextended”. Is this just a momentary spike, or is it something that I need to deal with? If it is sustained, then I probably want to add some more cores to my CF so that my CF utilization comes down.
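A sketch of pulling that information; ENV_CF_SYS_RESOURCES is an administrative view in the SYSIBMADM schema that returns name/value rows for each CF, so the simplest approach is to select everything and look for the CPU utilization entries:

    SELECT * FROM SYSIBMADM.ENV_CF_SYS_RESOURCES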
Slide 28: AUTOMATIC CF memory: simple case – 1 active database
In terms of CF memory, the important CF memory configuration parameters can be set to AUTOMATIC,
and they generally do a fine job. It's not like self-tuning memory manager; it's not like constantly moving
memory around or anything like that. These are calculated automatics. And most of the time they do a
great job.
Let's talk a little bit about the calculations that they do. So at the top most level, there's CF_MEM_SZ.
That's the total amount of memory available to the CF. And if we set everything to AUTOMATIC, the kinds of things that we get are: CF_MEM_SZ set to between 70 and 90% of the physical memory on the box.
CF_DB_MEM_SZ, that's the amount of memory for one database to use. It defaults to CF_MEM_SZ,
meaning that basically I'm going to take all the memory that I've got in the CF, and I'm going to allow it to
be used by one database. This is the 'one active database' scenario.
CF_SCA_SZ is the shared communication area. It's going to be calculated to be somewhere in the 5 to
20% range of CF_DB_MEM_SZ. This is holding things like table control blocks, and other metadata. Not
particularly interesting, but we've got to have some memory for stuff like that.
CF_LOCK_SZ is 15% of CF_DB_MEM_SZ, and this is for the global lock list. And then the rest of it, the bulk of the storage, goes to the group bufferpool. On the right here we can see an approximate breakdown of how the memory is doled out to the different consumers.
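For reference, a sketch of leaving these to the calculated automatics (the database name SAMPLE is assumed; CF_MEM_SZ is an instance-level parameter, the others are database-level):

    UPDATE DBM CFG USING CF_MEM_SZ AUTOMATIC
    UPDATE DB CFG FOR SAMPLE USING CF_DB_MEM_SZ AUTOMATIC
    UPDATE DB CFG FOR SAMPLE USING CF_GBP_SZ AUTOMATIC
    UPDATE DB CFG FOR SAMPLE USING CF_LOCK_SZ AUTOMATIC
    UPDATE DB CFG FOR SAMPLE USING CF_SCA_SZ AUTOMATIC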
Slide 29: AUTOMATIC CF memory and multiple active databases
Now what if I've got multiple active databases and I still want to use AUTOMATIC? You saw the scenario we just went through, where we gave all the memory to the first active database. What if I've got two? How does that work out?
Well, if I'm going to be having multiple active databases, there's a registry variable I should set called
DB2_DATABASE_CF_MEMORY. That basically gives the CF a heads up on how many databases you
want to have running concurrently. So if I set this registry variable to -1, it basically divides up the
memory by NUMDB. So the NUMDB configuration parameter would be used to divide it out. In this
case it's set to three, and each of these three databases, one, two, and three, would get a third of the
memory. Makes complete sense.
If I don't want to depend on NUMDB, I can also set it to an explicit percentage. I can say in this case 33,
so everybody gets 33%. You get the general idea.
Now if I just fire up pureScale, the defaults are DB2_DATABASE_CF_MEMORY is set to 100, in other
words, one database gets 100% of the memory, and NUMDB is 32. So if I want to have multiple active
databases, I want to pay some attention to these settings, NUMDB and this registry variable
DB2_DATABASE_CF_MEMORY so that I'm able to start up those multiple databases without running
out of memory.
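As a sketch of the corresponding commands (three databases assumed; pick whichever of the two settings fits your situation):

    # Let the CF divide its memory by NUMDB ...
    db2set DB2_DATABASE_CF_MEMORY=-1
    db2 update dbm cfg using NUMDB 3

    # ... or give each database an explicit percentage
    db2set DB2_DATABASE_CF_MEMORY=33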
Slide 30: Detecting an interconnect bottleneck
The interconnect, which carries messages between the members and the CF, is very important to pureScale performance, and is something that we need to make sure doesn't become a bottleneck. You know, Infiniband is something that sounds kind of infinite, but it definitely isn't. And we need to make sure that there isn't a bottleneck occurring on the host channel adapters.
I mentioned the typical ratio at the beginning - 6 to 8 CF cores per HCA - but there are cases where even 4 cores, if the CF is really busy and there are lots of messages going back and forth, could be creating a bottleneck. So the kinds of symptoms that we would see are: poor cluster throughput overall with the CF not running out of CPU - in other words, there's free CPU on the CF, yet cluster performance overall is pretty poor; high CF response time, and I'll get to how you measure that; and increased member CPU time.
Now, why would the CPU on the member go up if I have an interconnect bottleneck? Ordinarily the CF
responds so quickly that when an agent makes a request to the CF (to get a lock, to get a page, all these kinds
of things) it spins and waits to get an answer. It doesn't go back into the run queue and wait to be
rescheduled. It's counting on the fact that these requests are going to come back in a few tens of
microseconds. And so it's spinning, waiting, spinning, waiting; usually the response comes back really
fast, and spinning is the right thing to do. But if I'm in a cluster with a very busy CF, response time is
poor, and the member can end up spinning and soaking up more CPU time [on the member].
How to measure? Probably the best, simplest way: we have two metrics that are very useful in this case.
We have CF_WAITS, which is roughly the number of calls to the CF, give or take. And the number of calls is
sort of the number of locks, the number of pages requested, all kinds of things like that; it's very workload
dependent. We're going to divide that into CF_WAIT_TIME, which is the total amount of time spent
waiting for those requests to get done. If we use the two of those together, we can get the average CF wait
time: we take the total message time divided by the total number of messages, and we get the average time
for a single message.
We also have a separate one that's a little more specialized, RECLAIM_WAIT_TIME. It's the amount of
time spent waiting on reclaims, because the time spent waiting on reclaims is not included in
CF_WAIT_TIME.
The good thing is that these metrics are available in a number of different places: at a per-statement level
in the package cache via mon_get_pkg_cache_stmt, or at the agent level through mon_get_workload,
mon_get_connection, etc. So we can tune at multiple different levels of the pureScale system: at the top
level, the overall system level; at the statement level; at the agent level; and at the application level. We have
all these different ways of looking at who is sending messages to the CF, and how long they are taking.
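As a minimal sketch of that calculation, and not a query taken from the slides, something like this computes the average CF wait time per member from MON_GET_WORKLOAD. The assumption here is that CF_WAIT_TIME and RECLAIM_WAIT_TIME are reported in milliseconds, which is the usual convention for these time metrics; double-check that before relying on the microsecond conversion.

-- Average CF wait time per member; CF_WAIT_TIME assumed to be in milliseconds
SELECT MEMBER,
       SUM(CF_WAITS) AS CF_WAITS,
       SUM(CF_WAIT_TIME) AS CF_WAIT_TIME_MS,
       CASE WHEN SUM(CF_WAITS) > 0
            THEN (SUM(CF_WAIT_TIME) * 1000.0) / SUM(CF_WAITS)
       END AS AVG_CF_WAIT_MICROSEC,
       SUM(RECLAIM_WAIT_TIME) AS RECLAIM_WAIT_TIME_MS
FROM TABLE(MON_GET_WORKLOAD(NULL, -2))   -- -2 = all members
GROUP BY MEMBER;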
Slide 31: Drilling down on interconnection traffic
We can drill down a little bit. This first example here, at the top of slide 31, is the example of using CF
average wait times. So, CF_WAITS and CF_WAIT_TIME include totals for all message types: lock requests,
reads, writes. They include the time to send, the time to process the command on the CF, and the time to
receive. The average time is a good overall metric, but we can get more specific than that.
New in DB2 10, we have a table function called MON_GET_CF_WAIT_TIME, and it gives us
breakdowns by message type. So if you're really interested, you can look at different message response
times and processing times for lock requests, write requests, read requests, which do have different
average times. Lock requests are typically really fast. Writes can be really heavy depending on how many
pages are moving up, and things like that.
Here's an example of some of the output for that. We show the three different message types:
SetLockState, WriteAndRegisterMultiple, ReadAndRegister. And by the way, there are about 65 other
message types that you can look at if you're really interested, but these are the core ones: lock, write and
read. We get the number of requests and the total amount of time, so we can get the average request time
for individual message types.
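A sketch of that kind of breakdown, rather than the slide's exact query, might look like the following. MON_GET_CF_WAIT_TIME is the DB2 10 table function named above, but the column names (CF_CMD_NAME, TOTAL_CF_REQUESTS, TOTAL_CF_WAIT_TIME), the member argument, and the time unit are all assumptions to verify against the documentation for your level.

-- Per-message-type averages as seen from the members (column names assumed)
SELECT CF_CMD_NAME,
       SUM(TOTAL_CF_REQUESTS) AS REQUESTS,
       SUM(TOTAL_CF_WAIT_TIME) AS TOTAL_WAIT,
       CASE WHEN SUM(TOTAL_CF_REQUESTS) > 0
            THEN (SUM(TOTAL_CF_WAIT_TIME) * 1.0) / SUM(TOTAL_CF_REQUESTS)
       END AS AVG_WAIT_PER_REQUEST
FROM TABLE(MON_GET_CF_WAIT_TIME(-2))     -- -2 = all members (assumption)
WHERE CF_CMD_NAME IN ('SetLockState', 'WriteAndRegisterMultiple', 'ReadAndRegister')
GROUP BY CF_CMD_NAME;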
There's even another table function called MON_GET_CF_CMD that reaches over to the CF and tells
us how much time is actually spent on the CF. And this is really useful, because I can then look at the
average SetLockState time as seen by the member, versus as seen by the CF. It can be quite different
depending on where I'm spending my time. Am I spending my time in the network, or am I spending
my time on the CF? Where is the time going? Using these, I can break that down.
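Under the same assumptions about column names, a hypothetical side-by-side comparison for one message type could look like this. Comparing the two averages directly also assumes both functions report time in the same unit, which is worth confirming first.

-- Member-side average wait vs. CF-side average processing time for SetLockState
WITH member_side AS (
  SELECT CF_CMD_NAME,
         (SUM(TOTAL_CF_WAIT_TIME) * 1.0) / NULLIF(SUM(TOTAL_CF_REQUESTS), 0)
           AS AVG_MEMBER_WAIT
  FROM TABLE(MON_GET_CF_WAIT_TIME(-2))
  GROUP BY CF_CMD_NAME
),
cf_side AS (
  SELECT CF_CMD_NAME,
         (SUM(TOTAL_CF_CMD_TIME) * 1.0) / NULLIF(SUM(TOTAL_CF_REQUESTS), 0)
           AS AVG_CF_TIME
  FROM TABLE(MON_GET_CF_CMD(NULL))       -- NULL = all CFs (assumption)
  GROUP BY CF_CMD_NAME
)
SELECT m.CF_CMD_NAME, m.AVG_MEMBER_WAIT, c.AVG_CF_TIME
FROM member_side m
JOIN cf_side c ON m.CF_CMD_NAME = c.CF_CMD_NAME
WHERE m.CF_CMD_NAME = 'SetLockState';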
Slide 32: Finding interconnect bottlenecks with
MON_GET_CF_CMD
This is another example, here on slide 32. MON_GET_CF_CMD gives us one extra thing that we can't
really get with an average CF wait time, and this is pretty handy. This is a message type called
CrossInvalidate. Now why is a CrossInvalidate message special? It is the message that the CF uses to
invalidate pages at members.
So, remember the example we talked about at the beginning, where member A modified a page and
committed, and the CF knew that member B also had a copy of that page. In that case, we know the CF is
going to send a CrossInvalidate message to member B to mark that page as invalid.
CrossInvalidate is useful because it's a very small message, it's one byte long, it doesn't take any member
processing, and it takes very little CF processing. So looking at CrossInvalidate times is just about the best
thing we have as a way of measuring network performance in pureScale.
So, on average, even if all kinds of other stuff is going on in the cluster and it's very busy, CrossInvalidate
(XI) times are pretty stable. You would expect them to be less than 10 microseconds. And if they're up over
20 microseconds, that's a pretty good, reliable sign that you've got a network bottleneck to deal with.
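Again as a sketch under the same column-name assumptions as before, a query like this pulls out the CrossInvalidate numbers so you can compare the average against that 10 to 20 microsecond ballpark; using the ID column as the CF identifier is also an assumption.

-- Average CrossInvalidate (XI) time per CF (column names and unit assumed)
SELECT ID AS CF_ID,
       SUM(TOTAL_CF_REQUESTS) AS XI_REQUESTS,
       (SUM(TOTAL_CF_CMD_TIME) * 1.0) / NULLIF(SUM(TOTAL_CF_REQUESTS), 0)
         AS AVG_XI_TIME
FROM TABLE(MON_GET_CF_CMD(NULL))
WHERE CF_CMD_NAME = 'CrossInvalidate'
GROUP BY ID;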
Slide 33: Interconnect bottleneck example
Here's a great example of an interconnect bottleneck. So, in this situation we're running a Linux pureScale
cluster. It was really busy, it was running an SAP workload. We already had a CF with two Infiniband
HCAs. So we'd already given it extra bandwidth into the CF by having two cards instead of one.
We divided CF_WAIT_TIME by CF_WAITS to get the average CF wait time. In this case it was 630
microseconds. Now on pureScale, we're always talking about tens of microseconds. You know, 100
microseconds is pretty high, 200 microseconds is really pretty high, 600 is crazy high. Six hundred was a
sign of a really serious problem here.
So the first thing we did was check over on the CF. Was it really busy? Well, it was about 75% busy. That's
high, but not so high that it could explain a 630 microsecond request time. RECLAIM_WAIT_TIME was
also pretty high.
Slide 34: Add another CF HCA
So what we did is we decided to add another HCA. So we went from two CF HCAs to three, and there's
what it looks like. And the results were pretty spectacular. I wish all performance problems lay down and
died like this one, because this one was great. We went from 630 microseconds down to 145. We basically
ran four times faster by adding 50% more CF HCA capacity. This is an excellent indication that this was
the bottleneck.
So we had a huge drop in average CF_WAIT_TIME. The activity time of a really important INSERT
statement dropped from 15 milliseconds down to four. So this is not just some theoretical change in
microseconds at some low level of networking; the time to run an INSERT statement, which the
application would see, dropped by a factor of almost four. Lots of performance improvement thanks to
that in this particular case. Definitely a bottleneck worth keeping an eye out for.
Slide 35: Low-level interconnect diagnostics
We can go lower level than CF_WAIT_TIME. There are some low-level interconnect diagnostics. The bad
news is that our friend netstat, which is a very common Linux/UNIX network utility, does not provide useful
information about Infiniband throughput. It basically says it's there, but it won't tell us anything about
packets.
The good news is that there are other ways of doing that. If you're on Linux, there's a tool called
perfquery, which will tell us how many packets have been sent and received, so by sampling it over an
interval we can get packet rates.
There are some instructions here about how to do that, and a target of, say, 300,000 to 400,000 packets per
second inbound or outbound is a good upper bound for a single HCA. If you're getting up to that, you're
probably hitting the wall, and you want to consider adding another HCA.
If you are not on Linux, you can also get packet counts directly from the IB switch management port, and
this actually applies to Ethernet as well. You can go to the switch and ask it how many packets have been
sent or received on the port that is connected to the CF. And so there are different instructions that we can
use depending on the make of the switch, whether it's QLogic or Mellanox, etc...
These are not everyday monitoring steps by any means. These are the kind of things that you would do in
an exception case. You have a problem, you really want to know what's going on. You can go down to
these low level statistics at the switch if need be, and figure out how many packets are going back and
forth.
Slide 36: pureScale disk IO
I mentioned at the beginning that pureScale disk IO is not really all that different from EE. Really we
want to see random reads in the five to ten millisecond range; asynchronous writes via castout, like
page cleaning, in the one to five millisecond range; and, because we're sensitive to log writes, one to three
milliseconds for your average log write time.
As I said at the beginning, we're sensitive to that, so we want to track transaction log write performance.
We can use mon_get_workload, a database snapshot, or something new in DB2 10,
mon_get_transaction_log, which is a new table function that gives us lots of really good, precise
information about the performance of the transaction log.
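As a minimal sketch (and not a query from the presentation), something like the following checks the average log write time per member against that one to three millisecond target. The column names NUM_LOG_WRITE_IO and LOG_WRITE_TIME, and the millisecond unit, are assumptions to confirm against the documentation.

-- Average log write time per member (assumed columns and millisecond unit)
SELECT MEMBER,
       NUM_LOG_WRITE_IO,
       LOG_WRITE_TIME,
       CASE WHEN NUM_LOG_WRITE_IO > 0
            THEN (LOG_WRITE_TIME * 1.0) / NUM_LOG_WRITE_IO
       END AS AVG_LOG_WRITE_MS
FROM TABLE(MON_GET_TRANSACTION_LOG(-2));  -- -2 = all members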
Also in terms of disk IO on pureScale, I mentioned earlier on that db2cluster sets good initial values.
Probably the one thing that we might suggest tuning for GPFS in version 9.8 would be worker1threads.
It is a GPFS parameter that determines how many concurrent IOs can be going on at the same time; it's
essentially a set of internal threads inside GPFS. We recommend setting it to 256 to get greater
concurrency. In DB2 10, we've already taken that over and we set it to 256 automatically. But if you're
running on one of the earlier versions of pureScale, then you might want to set that explicitly yourself.
Slide 37: Castout configuration
We talked about castout being like page cleaning. If you've heard about Alternate Page Cleaning in EE,
castout in pureScale is quite similar to that. And that means that it depends on a single database
parameter instead of two database parameters to control it.
Castout engines run on the members, and like I said earlier on, the CF doesn't do any disk IO itself; the
members do all the IO. And so the castout engines run on the members, and write the modified pages that
they obtain from the CF out to disk. There are still page cleaners in a pureScale cluster, by the way.
There are two kinds of write threads: castout engines and page cleaners. Page cleaners have a slightly
reduced job in pureScale; they just write what we call GBP-independent modified pages. There are
certain pages that are produced, or modified, in the local bufferpool that don't make any sense to send to
the group bufferpool, because who else is going to need them? Think about something like a CREATE
INDEX. The pages of an index under construction are of no use to anybody else until the entire index is
created. And so there's no point in sending those up to the GBP on the CF in case anybody else needs
them; nobody will. That is an example of a GBP-independent page that just gets written to disk directly
from the member without having to go up to the CF.
A couple of things influence castout activity. First of all, it's impacted by SOFTMAX, the soft checkpoint
value. If I reduce the value of SOFTMAX, that is going to have two implications. I'm going to get faster
group crash recovery. That's if there's an entire cluster outage, for example a power outage. You know, if
the power goes out in the building, when it comes back on you have to do crash recovery. Group crash
recovery will be faster if you have been running with a lower value of SOFTMAX. But of course, it does
more aggressive cleaning and may have a performance implication at run time.
If you are migrating to pureScale, migration tip number one: you might want to consider setting
SOFTMAX a bit higher than you had it in EE. The reason is that pureScale is a highly available
architecture. If there was, for example, a machine problem that took down a member, that's only one
member in the cluster; DB2 pureScale is able to keep on going even without that member, whereas a
single machine failure in a DB2 EE configuration takes out the whole system. In the EE case, we want to be
able to come back up as quickly as possible, so there's a reason to have a low SOFTMAX. But in pureScale,
because we have this extra layer of availability insurance, if you will, in the cluster configuration, where
we can tolerate a member or even multiple members going away, you can consider having SOFTMAX set
a little bit higher, and get better run time performance.
Migration tip number two: because castout is based on alternate page cleaning, the configuration
parameter that you might be used to, CHNGPGS_THRESH, has no effect on castout in pureScale. So if
you happen to be coming from a DB2 EE system where all the page cleaning is happening due to
CHNGPGS_THRESH, and maybe SOFTMAX was set a little bit too high, when you go to pureScale
CHNGPGS_THRESH no longer has any impact, and the cleaning will all be done based on SOFTMAX.
So, migration tip number two is to make sure that SOFTMAX is set to a reasonable value before starting
up on pureScale, so that you're not surprised when cleaning doesn't kick in where you thought it should.
Something else that impacts castout activity: the group bufferpool size relative to the database size. For
example, if you had a 1 GB group bufferpool and a 10 TB database, the group bufferpool can obviously
only hold a very tiny fraction, about one one-hundredth of 1%, of the total database size in that example.
And so it's going to have to free up pages for new pages; it's going to have to be victimizing pages and
writing pages out at a very high rate. A much higher rate than if the database were only ten times the size
of the group bufferpool instead of ten thousand times.
The last thing that impacts castout activity is the number of castout engines, set with
NUM_IOCLEANERS. The automatic setting is generally fine. But in DB2 10, one of the things that
changed with setting NUM_IOCLEANERS to AUTOMATIC is that it is now based on the number of
physical cores in the system, and not the number of threads. So if you are on a large machine with
DB2 9.8, running with multi-threading at, say, 32 or 64 threads on 16 cores, you probably don't want to
use AUTOMATIC: instead of having 16 page cleaners you would end up with maybe 32 or maybe 64
page cleaners, which could be too many. So we would recommend sticking to the number of cores. You
need to do that explicitly on version 9.8, but in version 10, AUTOMATIC will do that for you.
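As a quick hypothetical sanity check, and not something from the slides, you can look at the castout-related parameters discussed here through the DBCFG administrative view; as before, the lowercase parameter names are an assumption.

-- Castout-related database configuration parameters
SELECT NAME, VALUE, VALUE_FLAGS
FROM SYSIBMADM.DBCFG
WHERE NAME IN ('softmax', 'chngpgs_thresh', 'num_iocleaners');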
Slide 38: Castout monitoring
How to monitor castout? It's easy! The basics are the same as monitoring page cleaning in EE. So the
same things that one would do on EE, like calculating the frequency of disk writes and the amount of
time that disk writes take, can be done the same way, either from snapshots or from the table functions.
Here's an example of a query that just pulls out some information from MON_GET_WORKLOAD and
MON_GET_BUFFERPOOL to calculate index writes and data writes per transaction.
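The query shown on the slide isn't reproduced in this transcript, but a sketch in the same spirit, cluster-wide, might look like this (the ratio calculation and formatting are mine, not the slide's).

-- Data and index writes per transaction across the cluster
WITH writes AS (
  SELECT SUM(POOL_DATA_WRITES) AS DATA_WRITES,
         SUM(POOL_INDEX_WRITES) AS INDEX_WRITES
  FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2))
),
commits AS (
  SELECT SUM(TOTAL_APP_COMMITS) AS COMMITS
  FROM TABLE(MON_GET_WORKLOAD(NULL, -2))
)
SELECT w.DATA_WRITES, w.INDEX_WRITES, c.COMMITS,
       (w.DATA_WRITES * 1.0) / NULLIF(c.COMMITS, 0) AS DATA_WRITES_PER_TX,
       (w.INDEX_WRITES * 1.0) / NULLIF(c.COMMITS, 0) AS INDEX_WRITES_PER_TX
FROM writes w, commits c;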
It's also a good idea to keep track of the performance of this at the OS level, the operating system level,
using tools like iostat and nmon to track the IO activity from the database. A good tip: if the write activity
is coming in smooth, that's great. If it's bursty, if it's choppy, if it's got lots of peak values, it could be that
SOFTMAX is a little too high. You might want to reduce SOFTMAX a little bit so that we smooth out the
writes and achieve better overall performance.
We also want to keep track of whether or not the write times at the OS level are reasonable. Write times of
more than about ten milliseconds or so are a sign that the IO subsystem is not keeping up.
Slide 39: Optim Performance Manager and DB2
pureScale monitoring
I just want to mention Optim Performance Manager and DB2 pureScale. OPM has been able to monitor
DB2 pureScale for quite a while now. OPM 4.1.1 could do global monitoring of DB2 pureScale, so it can
do per-member and cluster-wide monitoring, CF CPU, a number of things, but at a pretty high level.
OPM 5.1, which has a huge number of features and improvements relative to 4.1.1, also improves in the
area of pureScale by tracking the GBP hit ratio per connection and all these kinds of things: CF requests
per unit time, page reclaim rate and time, global lock manager information, etc.
And with DB2 10 and OPM 5.1.1, it even tracks the average CrossInvalidate time, the number of
CrossInvalidate requests, and a whole bunch of other things that make it very suitable for monitoring
and tuning performance in DB2 pureScale.
Slide 40: Summary
OK, and that brings me to the end. So just to wrap up: I think you'll notice, after you've listened to this
presentation, that there are a lot of things in pureScale that are very, very similar to how they were in EE.
If you are a database administrator, or developer, or other interested party coming from a DB2 EE
environment, or even DPF for that matter, and you're going to pureScale, then very many of the processes
and queries and tricks you have in your back pocket, in your kit bag, are going to be equally applicable on
pureScale, in terms of configuration parameters, monitoring techniques, desired or problematic metric
ranges, etc.
The important thing, I think, if you're coming from an EE environment to a pureScale environment, is to
keep those four key architectural differences I mentioned right at the very beginning in mind, as a way of
helping identify some of the new, potential performance areas that you want to keep track of. First, the
CF is the hub of cooperation and communication between the members. Second, we've got a very key
piece of technology in the low-latency interconnect between the members and the CF. Third, we've got
this two-layer bufferpool: the group bufferpool and a local bufferpool at each member. And fourth,
we've got the notion of page locks and page negotiation, which make lock processing in pureScale just a
little bit different than in earlier versions of DB2.
Slide 41: Summary (continued)
So in terms of a way to get going, I'd suggest just starting off with EE-based monitoring and tuning
techniques, simple stuff, looking at the core tools and techniques that you have, using AUTOMATIC in
most cases, and tuning from there. So don't get caught up in trying to figure out exactly what number
should be set for each configuration parameter; let DB2 do some of the work there with AUTOMATIC
settings. Use the standard bufferpool hit ratio methods, applying them to the system as a whole at the
LBP and then at the GBP, and keep track of IO tuning, using our target ranges for read and write times,
and just make sure that IO bottlenecks don't develop.
From there, from the core EE, the core DB2 stuff, we can progress to the pureScale areas. For example, CF
resource allocation, CF response time, CPU utilization on the CF, etc., and the behaviour of page
negotiations, i.e. reclaims. How often are they happening and how much impact are they having on the
system?
And of course, if you are considering moving to DB2 pureScale, DB2 10 is a great place to go, because
we've got a number of new improvements in DB2 10 in the area of pureScale that enhance performance:
for example, CURRENT MEMBER, better monitoring information, the jump scan, and other core DB2
engine improvements. Plus there's broader support for pureScale in OPM 5.1.1.
Thank you very much!
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program, or
service is not intended to state or imply that only that IBM product, program, or service may
be used. Any functionally equivalent product, program, or service that does not infringe
any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described
in this document. The furnishing of this document does not grant you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where
such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES
CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore, this
statement may not apply to you.
Without limiting the above disclaimers, IBM provides no representations or warranties
regarding the accuracy, reliability or serviceability of any information or recommendations
provided in this publication, or with respect to any results that may be obtained by the use
of the information or observance of any recommendations provided herein. The
information contained in this document has not been submitted to any formal IBM test and
is distributed AS IS. The use of this information or the implementation of any
recommendations or techniques herein is a customer responsibility and depends on the
customer’s ability to evaluate and integrate them into the customer’s operational
environment. While each item may have been reviewed by IBM for accuracy in a specific
situation, there is no guarantee that the same or similar results will be obtained elsewhere.
Anyone attempting to adapt these techniques to their own environment does so at their own
risk.
This document and the information contained herein may be used solely in connection
with the IBM products discussed in this document.
This information could include technical inaccuracies or typographical errors. Changes are
periodically made to the information herein; these changes will be incorporated in new
editions of the publication. IBM may make improvements and/or changes in the product(s)
and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only
and do not in any manner serve as an endorsement of those Web sites. The materials at
those Web sites are not part of the materials for this IBM product and use of those Web sites
is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.
Any performance data contained herein was determined in a controlled environment.
Therefore, the results obtained in other operating environments may vary significantly. Some
measurements may have been made on development-level systems and there is no
guarantee that these measurements will be the same on generally available systems.
Furthermore, some measurements may have been estimated through extrapolation. Actual
results may vary. Users of this document should verify the applicable data for their specific
environment.
Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM has not
tested those products and cannot confirm the accuracy of performance, compatibility or
any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal
without notice, and represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To
illustrate them as completely as possible, the examples include the names of individuals,
companies, brands, and products. All of these names are fictitious and any similarity to the
names and addresses used by an actual business enterprise is entirely coincidental.
COPYRIGHT LICENSE: © Copyright IBM Corporation 2013. All Rights Reserved.
This information contains sample application programs in source language, which illustrate
programming techniques on various operating platforms. You may copy, modify, and
distribute these sample programs in any form without payment to IBM, for the purposes of
developing, using, marketing or distributing application programs conforming to the
application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of
these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. If these and
other IBM trademarked terms are marked on their first occurrence in this information with a
trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may
also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
Contacting IBM
To provide feedback about this transcript, write to [email protected]
To contact IBM in your country or region, check the IBM Directory of Worldwide
Contacts at http://www.ibm.com/planetwide
To learn more about IBM Information Management products, go to
http://www.ibm.com/software/data/