IBM® DB2® for Linux®, UNIX®, and Windows®

Best practices
DB2 pureScale performance and monitoring
Webcast transcript

Steve Rees
Senior Technical Staff Member, DB2 Performance

Issued: January 2013

Contents

DB2 pureScale performance and monitoring
Webcast transcript
Introduction
Slide 2: Agenda
Slide 3: Helpful high level stuff to remember about pureScale
Slide 4: Configuring pureScale for 'pureFormance'
Slide 5: How many cores does the CF need?
Slide 6: How much memory does the CF need?
Slide 7: What about the cluster interconnect?
Slide 8: Infiniband VS. Ethernet?
Slide 9: What about disk storage?
Slide 10: Potential tuning for cluster scale-out
Slide 11: Sizing up the initial DB2 configuration
Slide 12: Sizing up the initial DB2 configuration (continued)
Slide 13: Agenda – part 2
Slide 14: A primer on two-level page buffering in pureScale
Slide 15: New LBP / GBP bufferpool metrics in pureScale
Slide 16: Accounting for pureScale bufferpool operations
Slide 17: pureScale bufferpool monitoring
Slide 18: pureScale bufferpool monitoring (cont.)
Slide 19: pureScale bufferpool monitoring (cont.)
Slide 20: pureScale bufferpool tuning
Slide 21: pureScale bufferpool tuning (cont.)
Slide 22: pureScale bufferpool tuning (cont.)
Slide 23: pureScale page negotiation (or 'reclaims')
Slide 24: Monitoring page reclaims
Slide 25: Reducing page reclaims
Slide 26: CURRENT MEMBER default column reduces contention
Slide 27: Monitoring CF CPU utilization
Slide 28: AUTOMATIC CF memory: simple case – 1 active database
Slide 29: AUTOMATIC CF memory and multiple active databases
Slide 30: Detecting an interconnect bottleneck
Slide 31: Drilling down on interconnection traffic
Slide 32: Finding interconnect bottlenecks with MON_GET_CF_CMD
Slide 33: Interconnect bottleneck example
Slide 34: Add another CF HCA
Slide 35: Low-level interconnect diagnostics
Slide 36: pureScale disk IO
Slide 37: Castout configuration
Slide 38: Castout monitoring
Slide 39: Optim Performance Manager and DB2 pureScale monitoring
Slide 40: Summary
Slide 41: Summary (continued)
Notices
Trademarks
Contacting IBM

Introduction

This document contains the transcript of the webcast available on the IBM DB2 for Linux, UNIX, and Windows best practices community on IBM developerWorks at the following URL: https://ibm.biz/BdxMvG

Hi, my name is Steve Rees. I lead the DB2 pureScale performance team at the IBM Lab up in Toronto. I am very glad to present today DB2 pureScale best practices for performance and monitoring. The material in this presentation is a combination of things we have learned at the Lab in our internal performance tests, as well as things coming from performance benchmarks and customer engagements. It's really a pretty good summary of all the best practices that we've got on DB2 pureScale performance.

Slide 2: Agenda

I'd like to start off with a quick introduction and concepts. I am assuming that most people that are listening to this are fairly familiar with DB2 pureScale. They know the basics of it, so I'm not going to spend a lot of time on it. We are then going to go on to the configuration angle on DB2 pureScale performance. We will look at the shape of the cluster and the components that make it up - what we would call cluster geometry - and the components and scaling of the cluster. That's about the first third of the presentation. The last portion of the presentation, about two thirds, is dedicated to monitoring and tuning in pureScale - looking at bufferpools, locking, the cluster caching facility (CF), etc.

Slide 3: Helpful high level stuff to remember about pureScale

I am not going to spend a lot of time getting into details about the background of DB2 pureScale. But there are a few things I wanted to mention that I find are helpful in understanding the differences between DB2 pureScale and non-pureScale DB2. The first one is that the CF, or cluster caching facility, is the hub of the cluster. Its performance is extremely important when it comes to the performance of the overall cluster. The CF is the center of the communication and coordination between the members. All significant communication between members really goes through the CF. So good CF performance is very important to the overall performance of the cluster.
Likewise the high speed, low latency interconnect, such as Infiniband, that is used to connect members to the CF, is very important in maintaining high performance in pureScale.

The second thing is kind of an obvious one. DB2 pureScale is shared data technology. What that means is that there is really only one copy of the database. Different members in pureScale share, and sometimes contend for, access to different rows on the same page. What that means is that as we introduce the concept of pureScale it brings along the concept of page locks. That is a different kind of lock than we had to deal with in regular DB2 for Linux, UNIX, and Windows. This concept is very familiar to people who have experience with DB2 on System z Parallel Sysplex. Page locks are new in pureScale. That has an impact on how the system is tuned.

The third thing is that inserts, updates, and deletes as SQL operations tend to drive more cluster activity than selects. When we talk about tuning and sizing a pureScale cluster, what comes into it is the nature of the workload: what would be the fraction of the reads, i.e. selects, versus inserts, updates, and deletes.

The last point is that pureScale introduces a kind of two-tier bufferpool at the members and the CF. Regular DB2 ESE (Enterprise Server Edition), non-pureScale, has a single level bufferpool. DB2 pureScale introduces a multi-level bufferpool, where each member has its own local bufferpool similar to ESE. The CF has what is called a group bufferpool, or GBP for short, that caches all modified pages from all members of the cluster. We have this two-layer concept that makes tuning the bufferpools in pureScale just a little different than regular DB2.

Slide 4: Configuring pureScale for 'pureFormance'

When configuring a pureScale cluster we can choose from a number of different cluster geometries or shapes. They can be made up of large machines or small machines or medium sized machines, in different numbers, etc., and there are different ones shown on the page here. Usually what ends up happening when a customer deploys pureScale clusters is that they choose the cluster geometry based on factors other than performance. They might choose based on skills or available hardware, etc. The important thing to remember when configuring a cluster from the ground up is that a balance of resources needs to be maintained. We have to have a balance of CPU, memory, disk, and interconnect. If they are out of balance we are going to end up wasting resources and getting sub-optimal performance out of the cluster as a whole. We need to maintain that balance. That theme will continue to recur throughout the presentation.

At the bottom I have something marked "BP" for best practice. All the clusters shown on this page include a secondary CF (cluster caching facility). pureScale can run with a single CF. That is not a problem, but it introduces a single point of failure. Best practice is to include a secondary CF.

Slide 5: How many cores does the CF need?

How many cores does a CF need? That is probably one of the first questions that comes up in a pureScale sizing discussion. As a rule of thumb, typically, the sum of cores across all pureScale members is about 6 to 12 times larger than the number of cores in the CF. That range depends on the read/write ratio of the workload. For a very write heavy workload it would probably be around the range of 6 times more.
For example, if I have 12 total cores on all the members then I would probably have 2 cores in each of the CFs. If I have a very read heavy workload with lots of selects and very few inserts, updates, and deletes, it would probably be in the range of up to 12 times as many cores, so 24 total member cores and 2 in each of the CFs. A good thing to remember is you don't have to pay to license the CF functionality, only the members. Obviously the CFs are not free, but at least you don't have to pay for the software that is running on those.

An important point is the CF can get extremely busy. The only way that pureScale can achieve a response time that is in the tens of microseconds is if the cores on the CF are allowed to focus on what they have to do. They have exclusive use of the CPU. If you go to one of the CFs and you run a vmstat command or another tool that shows the CPU utilization, typically you will see 100% utilization, and that's normal even on an idle cluster. We strongly advise dedicated cores for the CF. If you are in a virtualized environment then [shared processors] can be fine for the member. For the CF, because it is so busy, and because the response time of the activity going on on the CF is so critical, we strongly advise dedicated cores. We also advise at least one physical core for the CF. Sometimes when a very small pureScale cluster is set up, you might find that 1 or 2 virtual CPUs seem to be enough. We would recommend a physical core. On Power Systems, a POWER7 machine, that would typically mean 4 logical threads. On Intel, that probably means two logical threads. Having at least one physical core is a good recommendation.

You can co-locate the CF and a member within 1 LPAR or 1 Linux machine. Basically what that means is the CF and member are running in the same operating system and sharing the same cores, etc. If you are going to do that, then the CF and the member should be pinned to separate sets of cores. The CF gets one set of cores and the member gets the other set, so that they don't end up fighting for processor cores or cache space.

Slide 6: How much memory does the CF need?

The general rule of thumb for a cluster of three or more members is that the group bufferpool (GBP) would be about 35-40% of the sum of all the local bufferpools. Suppose I had a 4 member cluster where each member has a local bufferpool of 1 million 4K pages. Our best practice rule of thumb would put the GBP size at about 1.5 million pages. If I had a higher read workload with more selects, the size of the GBP can come down a little bit. We should consider 25 percent as a minimum. You can come down as low as 25 percent, but we don't recommend you go any smaller than that, because of other things that need to be stored in the GBP, even for read only workloads. If you have a 2 member cluster then we probably recommend about 40-50 percent, depending on the read/write ratio. The GBP is an important factor here in sizing the CF. The CF memory is dominated by it, and the GBP is definitely the biggest memory consumer.

CF_DB_MEM_SZ, the memory of the entire active database [on the CF], should be about 25% bigger than the GBP size. Generally you would size the GBP and add another 25% for the overall database memory size. That database memory size is used for things like the lock list, the shared communications area, and other data structures that exist in the CF besides the GBP.
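Going back to the GBP rule of thumb for a moment: as a starting point you can total up the defined local bufferpool sizes in 4K-page equivalents and apply the 35-40% factor. The query below is just an illustrative sketch of mine against the SYSCAT.BUFFERPOOLS catalog view, not something from the slides; bufferpools defined with NPAGES of -2 are AUTOMATIC, so for those you would check the current size at run time instead.

  -- Total up local bufferpool definitions in 4K-page equivalents, as input to
  -- the 35-40% GBP sizing rule of thumb. Skips AUTOMATIC bufferpools (NPAGES < 0).
  SELECT bpname,
         npages,
         pagesize,
         (npages * pagesize) / 4096 AS pages_4k_equivalent
  FROM   syscat.bufferpools
  WHERE  npages > 0;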
The GBP only stores the modified pages. Unlike DB2 for z/OS in a Parallel Sysplex, where the coupling facility can store read only pages as well, on pureScale we are only storing modified pages. So the higher the read ratio, the smaller the CF and the GBP can be. A good thing to remember is that the GBP is always sized in 4K pages, regardless of what is going on in the members. If you have a 1,000-page 8K bufferpool on a member, that is going to represent 2,000 4K pages at the GBP, and we have to do the GBP sizing in 4K pages regardless of what we are doing on the members.

Slide 7: What about the cluster interconnect?

The cluster interconnect is a very important piece of technology. It provides remote direct memory access between the members and the CF at extremely low latency, something like 10 to 20 microseconds; a very small amount of time. Typical configurations would use 1 interconnect adapter for Infiniband (we call that a host channel adapter, or HCA), one per CF and one per member. If you are running on a Power system, those HCAs can be shared between multiple LPARs if needed; it is divided up by the hypervisor. The CF HCA handles all the combined message traffic from all the members. That means that the HCA at the CF is going to be the one that "heats up" the fastest - it's going to be the one that gets busiest fastest. The good news is that DB2 pureScale can support multiple host communication adapters, for Infiniband and for RoCE Ethernet, on the CF. Then we can add additional adapters to eliminate any bottlenecks that might occur there. In very round figures we are probably looking at 1 Infiniband HCA supporting about 6 to 8 CF cores, depending on the workload. Most of the adapters that pureScale uses have 2 ports each, and you might look at those and say "aha, I have 2 ports here and I can get double the performance by simply using both ports". Unfortunately those ports share circuitry inside. In our experience in the lab we have not seen performance improvements by using both ports.

Can an HCA be shared between member and CF partitions residing on one machine? Yes it can. I mentioned that the AIX hypervisor can do that. One thing to be aware of is not to overload that HCA. If we have multiple members, or members and a CF, sharing an HCA, we could overload it if we were not careful. There are some suggestions here on how to calculate whether or not a given configuration would be suitable for one HCA.

Slide 8: Infiniband VS. Ethernet?

DB2 pureScale on Linux has supported RoCE Ethernet for quite a while, as well as Infiniband. AIX in DB2 10 now supports RoCE Ethernet as well. There are many customers who prefer to look at Ethernet because the technology is more familiar to them. They probably already have Ethernet infrastructure in their IT department, and Infiniband might be new to them. So many customers prefer to look at Ethernet. In terms of raw bandwidth, from a performance perspective, Infiniband beats RoCE hands down. What we are looking at here is QDR Infiniband at 40 Gb/sec. DDR, which is the one we use most commonly on AIX, is 20 Gb/sec, whereas RoCE Ethernet for both Linux and AIX is 10 Gb/sec. You might look at those and say "well, Ethernet is going to be one quarter the speed of QDR Infiniband." Fortunately it's not that bleak a story.
DB2 pureScale performance depends significantly on small message response time. That's the most important thing. If we look at the message response time comparing RoCE and QDR Infiniband on the graph, the tall pink bars represent Ethernet. Lower is better in this graph. We notice that the QDR Infiniband bars are about half the height of the Ethernet ones, meaning the response time of Infiniband is about half that of Ethernet. So okay, half [the performance for Ethernet] is better than a quarter. We were expecting a 4 to 1 performance difference between these, and there is only a 2 to 1 difference in response time. That's good. The story gets a little bit better still. There are a lot of things going on on a pureScale cluster, not just messages going back and forth. If we put together the whole picture and we look at throughput, comparing an Ethernet configuration and an Infiniband configuration, we see about a 5-15 percent performance difference between them. It's not 2 to 1 and it's certainly not 4 to 1. It's about a 5-15 percent difference; 5-15 percent better performance on Infiniband than on RoCE Ethernet. In many cases that is not going to be noticeable. Of course your mileage will vary. It depends how performance sensitive your environment would be. But it's certainly much better performance for Ethernet than would be indicated by the bandwidth figures.

Slide 9: What about disk storage?

What about disk storage? Disk storage for DB2 pureScale is not all that different from DB2 ESE (Enterprise Server Edition). We want to make sure we have adequate IO bandwidth to keep response times low, particularly for the logs. Each pureScale member has its own log, and it needs to flush that log more frequently than regular DB2 ESE does. That makes us even more focused on log performance in pureScale than we are in regular ESE. That's why I want to mention solid state disks (SSD). SSD is not normally something that would be associated with a transaction log device, but for pureScale what we are looking for is the absolute best log flush performance we can get. A relatively small number of solid state disks could make all the difference between adequate log performance and really good log performance, and might remove a bottleneck. That's something to keep in mind.

Another aspect of disk storage is the notion of SCSI-3 persistent reserves (PR). This helps with the recovery times on the storage area network, or SAN. If one of the members goes down, then that member needs to be fenced off from the storage before recovery can proceed. And that fencing off of the failed member from the storage can happen very quickly, in just a matter of seconds, if the system has SCSI-3 PR. It is supported at the storage level, and depends on the storage that you choose. There is IBM storage and non-IBM storage that would support it. For some of the IBM ones, like the V7000 models shown here, it makes quite a bit of difference. A typical recovery time might be in the order of 20 to 30 seconds if you have SCSI-3 PR. Without SCSI-3 PR we have to rely on lease expiry, and it could take 60 or 90 seconds or even a little more.

For GPFS configuration, a best practice is to use a separate GPFS file system for the log and for the table spaces. Other than that, pureScale has basically built into it many of the GPFS tuning steps that are required for best performance.
For example, things like enabling direct IO and setting larger 1 megabyte block sizes - these steps help GPFS performance and are done automatically by the db2cluster command.

Slide 10: Potential tuning for cluster scale-out

One of the major design points of DB2 pureScale is the ability to scale out. We can transparently add new members and we don't have to re-partition data. Everything can grow in a very transparent manner. But back at the beginning we talked about the notion of balance. The balance of resources in the cluster is very important. If I double the number of members in my cluster but don't do anything about, say, the memory at the CF, or the cores in the CF, or maybe the number of Infiniband adapters in the CF - if I don't maintain that balance - I might not get the benefit that I should get by growing the cluster by, say, doubling my cluster size. So the kinds of things that you want to look at are: can the disk storage keep up, am I creating a bottleneck in the interconnect, and all those kinds of things. The monitoring and tuning steps required to determine and fix this are going to be talked about in the second half of the presentation.

Slide 11: Sizing up the initial DB2 configuration

In terms of the initial things in the DB2 configuration you should think about for pureScale, a couple of things come to mind. The first one is that larger extent sizes tend to perform better than small extent sizes. The reason why is that some operations require communication with the CF and some other processing every time a new extent is allocated. If my extent size is 2 pages, very small, I am going to do a lot of that kind of chatty communication with the CF as I am growing my table. If I am at the default 32 pages, or 64, or 128, or even larger extent sizes, then I am going to do many fewer interactions with the CF as the table grows, and that is going to save overhead in the cluster. The default of 32 pages usually works pretty well, but you can feel free to go beyond that if the data that you have and the use you are going to put it to allow for it.

We also want to talk about smaller DB2 page sizes. Smaller DB2 page sizes tend to perform better than larger ones. The reason why is, first of all, a typical pureScale workload tends to drive a lot of random IO. We don't usually see a lot of scanning. We are looking at reading a row here, a row there, and each time we read one, if we are reading from disk and sending it to the cluster, then it will be wrapped in a page, so to speak. It comes along with other rows that maybe we are not interested in at the moment. The smaller that wrapping is, the smaller that packaging is, the less resource it is going to take to move around, so a smaller page means less data flow between the member and the CF. In general we would recommend using the smallest page size that would hold the row that you are going to use. If you have a 6000 byte row then that is going to be probably an 8K page. Probably not a 32K page; an 8K page would probably be better from a performance perspective.

If you are moving applications that use sequences or identity columns to DB2 pureScale, we would recommend you use a large cache size. There's a keyword on creating sequences and identity columns called the ORDER keyword. Just avoid using that ORDER keyword. The reason why is that on pureScale, obtaining a new batch of numbers in sequence requires a chat with the CF and a log flush. Those are not all that expensive, but they are not free. So, depending on how frequently we are going back for new identity and sequence numbers, we would want to make that as efficient as possible. A large cache size enables each member to have a set of numbers to draw on without having to go back to the CF or do the log flush. It generally results in a fair bit better performance. The thing to do is to tune it: test to see what cache size yields the kind of performance that you need in your system.
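As a hedged illustration of that recommendation (the object names and the cache value of 1000 here are made up, not from the slides), the DDL would look something like this:

  -- A big cache and NO ORDER let each member hand out values from its own
  -- cached range without a trip to the CF and a log flush for every new value.
  -- Tune the CACHE value for your own workload.
  CREATE SEQUENCE order_seq AS BIGINT
    START WITH 1
    INCREMENT BY 1
    CACHE 1000
    NO ORDER;

  -- The same idea applied to an identity column:
  CREATE TABLE orders_example (
    order_id BIGINT GENERATED ALWAYS AS IDENTITY (CACHE 1000 NO ORDER),
    order_ts TIMESTAMP
  );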
Slide 12: Sizing up the initial DB2 configuration (continued)

Another thing to remember is that DB2 pureScale can have a greater lock list requirement than DB2 ESE. If you are moving from a DB2 ESE system, then the lock list might have to get bigger. Why is that? I mentioned in the beginning we have this thing called a page lock. A lock list in pureScale holds the same kinds of locks that we used to have, like row locks and table locks and package locks and variation locks and other things. Now in pureScale we also have whole page locks. Those other kinds of locks, like row locks for example, can be escalated. We can use lock size settings and things to kind of control how many of them there are. Physical page locks don't escalate. They don't collapse into escalated replacements. And so we want to make sure that we won't run out of lock list space because of those page locks. The information center recommends 3% of local bufferpool size, which is a good rule of thumb for the lock list. I've seen cases of up to 6%. We don't have to go too much higher than that. Depending on how many locks would be held over a long period of time in your environment, 3-6% is probably going to be needed to make sure that you do not run out of lock list.

A good piece of news in DB2 10 for pureScale users is that DB2 pureScale now supports range partitioned tables. These were not supported in DB2 pureScale versions before DB2 Version 10. This is one of the things that many customers really wanted to see, and range partitioned tables are a great fit for a model where data flows in as a chunk, gets processed, and then flows out. The range of time can be days, or weeks, or months, that kind of idea. It is also useful for breaking up a range of keys in a large and highly concurrent table. By being able to do this we can have a local index on each partition, which can help reduce contention as well. That complements another feature that came along in DB2 10, the CURRENT MEMBER default column, and we are going to talk a little bit more about that later as another way to help reduce contention, if it happens to crop up.

Slide 13: Agenda – part 2

That wraps up the first part. Let's move on to monitoring and tuning.

Slide 14: A primer on two-level page buffering in pureScale

Let's start with a little bit of background about the bufferpool in DB2 pureScale and the two-level metaphor that we use for it. As was said at the beginning, we have the local bufferpool at each member, called the LBP for short. And that is a lot like the DB2 ESE bufferpool. It caches both read-only and updated pages for that member. The group bufferpool is at the CF. It contains 2 main kinds of things. First of all, it contains a reference to every page in all LBPs across the cluster.
It's very important that the CF knows which members have which pages and which members don't have those pages. The GBP also contains copies of all modified pages. This is the part that most people know about, the modified pages being stored in the GBP. When a page is modified at a member, for example if I do an update statement and then commit that, at commit time the page is flushed from the member up to the CF and it goes in the group bufferpool. It can then be used by other agents, other applications running on other members. Eventually it will get flushed to disk as well. The efficiency advantage of this is quite strong. If we have a hot page which is being modified or referenced fairly frequently, it's in the GBP. It's very efficient to access, and it's much quicker and better than going straight to disk. It can be an order of magnitude, or more, more expensive to go to disk. For example, page read requests over Infiniband could cost in the 20-40 microsecond range, and that could be typically a hundred times faster than going to disk. The question is: how do I size these bufferpools properly? The good news is that DB2 provides a whole group of monitoring metrics about the local bufferpool and the group bufferpool. So we can use those to tune the sizes to fit our system.

Slide 15: New LBP / GBP bufferpool metrics in pureScale

Let's look at some of those metrics. The first one is pool_data_lbp_pages_found (LBP pages found). This is basically a count of the number of page references that resolve to the LBP. For example, we needed a page and it was present, either valid or invalid, in the LBP. What is an invalid page? Consider if I have a cluster with two members and I have the same page resident in both members, member A and member B. Member A updates that page. Now member B has a stale copy of the page. It doesn't have the updated version. We can't let B use that page thinking that it's current and thinking it has the latest data, because it doesn't. When member A committed its change to that page, the CF knew that member B had a copy of this page and was able to reach out through RDMA and mark that page on member B as invalid, or stale. Getting back to this particular counter, LBP pages found: we increment this counter if the page was there, either valid or invalid. Either "fresh" or "stale", if you will.

pool_data_gbp_l_reads (GBP logical reads) is the number of times we went to the group bufferpool to read a page. For example, if a page wasn't found locally: the first time I read it I check my local bufferpool. If it's not there I will go to the GBP to see if it's there. That's a GBP logical read. An interesting bit of trivia about pureScale: it supports prefetching. All DB2 versions support prefetching from disk. DB2 pureScale also supports prefetching from the group bufferpool up in the CF down into the local bufferpool on the member. If data access patterns suggest that this would be a good thing to do, DB2 will start shipping down pages in advance of you needing them. In other words, prefetching. This GBP logical reads counter includes prefetching, and so when we are doing hit ratio calculations we have to take that into account.

pool_data_gbp_p_reads (GBP physical reads). This is a bit of a misnomer, because it kind of implies that the GBP or the CF is going to be doing reads from disk.
The members in a pureScale cluster do all the disk IO. They do the reads and they do the writes. The CF and the GBP don't do any IO. So why do we have a counter called GBP physical reads? This is a count of the number of physical reads that were done because the page was not present in either the local bufferpool or the group bufferpool.

pool_data_gbp_invalid_pages (GBP invalid pages). This sounds pretty sinister, but all it means is the number of times we went to the GBP because a page that I have locally at the member was present but marked invalid. It was stale. Consider the member A and B scenario we talked about before. If member B went to read that page and it was marked invalid, then went to the GBP in the CF to see if it was there, that would increment this counter. It was a trip to the GBP because of an invalid page.

pool_async_data_gbp_l_reads (async data GBP logical reads, i.e. pages prefetched from the GBP). This is the count of the number of pages we brought down from the GBP into the local bufferpool through prefetching. It is not extremely common, but it comes in handy to help us adjust our calculations and determine how much successful access there was to the group bufferpool in a non-prefetching, or direct access, kind of way.

Slide 16: Accounting for pureScale bufferpool operations

Let's look at some scenarios. We have some pictures here that make things a little bit clearer. We have four different common scenarios, and how they affect the counters down on the left-hand side. The first case is very simple. The agent needs to find a page. It looks in the LBP. Is it there? Yes, it's there and it is valid. We did a logical read and it was a local bufferpool page found.

Second scenario. I have a page marked stale (marked invalid) in the local bufferpool. I can't use that and have to go to the GBP to see if it was there. In this case it was there. We will see which counters are incremented there - the same as the last time, but also including some of the GBP counters, including the invalid pages one.

In the third case, the page is not in the LBP but is found in the GBP. Almost the same as the second case. It's not here at all. We go to the GBP and it's there, and we have a slightly different pattern of counters updated.

Last but not least, the scenario where the page hasn't been touched at all. It's not in the local bufferpool. It's not in the GBP. It's found on disk instead: we look locally, we check the GBP at the CF, and then we read it from disk. We end up with yet another set of different increments to our counters.

Slide 17: pureScale bufferpool monitoring

We can take those metrics and put them together into some basic monitoring techniques and formulas for looking at bufferpool monitoring in pureScale. First of all is our old friend the bufferpool hit ratio, just (logical reads - physical reads) / logical reads. You notice here we are just subtracting the physical reads from the logical reads and dividing by the logical reads. Also, none of these metrics have LBP or GBP in them, because this is the same formula we would use in non-pureScale DB2. The kinds of values we would look for are 95% for indexes, 90% for data. Slightly lower than that would be okay. If it got much lower than the "good values" then we would consider tuning up the bufferpools. That was the overall hit ratio.
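Here is a hedged sketch of that overall calculation using the MON_GET_BUFFERPOOL table function (this query is mine, not from the slides); it uses the data-page counters discussed above, and the same pattern applies to the index counterparts.

  -- Overall data-page hit ratio per bufferpool and member:
  -- (logical reads - physical reads) / logical reads, as a percentage.
  SELECT bp_name,
         member,
         pool_data_l_reads,
         pool_data_p_reads,
         CASE WHEN pool_data_l_reads > 0
              THEN DEC((pool_data_l_reads - pool_data_p_reads) * 100.0
                       / pool_data_l_reads, 5, 2)
         END AS data_hit_ratio_pct
  FROM   TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t
  ORDER  BY bp_name, member;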
We can also look at the local LBP hit ratio. Here all we do is take the logical bufferpool pages found, subtract off any that might have been prefetched, and divide that by the total number of logical reads we did. That gives us the number that were satisfied locally. The interesting thing is that the LBP hit ratio is typically lower than the overall hit ratio, because it doesn't include anything that was found in the group bufferpool. Now, we still count invalid pages as a hit. They are in LBP pages found. The reason why is that if we counted it as a "miss", we might be tempted to increase the size of the local bufferpool to increase our hit ratio - but the fact is the miss isn't because our bufferpool is too small. It's a miss because the page was invalid. It was present, just marked stale. We don't need to increase the size of the bufferpool. And so that's why we want to include it as a hit for the calculation.

Slide 18: pureScale bufferpool monitoring (cont.)

The GBP hit ratio is very similar to the overall one. We take logical reads minus physical reads, divided by logical reads. The interesting thing here is that the hit ratios for the GBP are typically quite low, particularly for a very read heavy environment. Imagine one out of one thousand references had to go to the group bufferpool. Chances are, in a case like that, the page we're looking for isn't in the group bufferpool either. We are going to have to go to disk. We can end up with really low GBP hit ratios for high read ratio environments. But that's not necessarily a problem. Increasing the GBP size is not always necessarily the best thing to do. We have a little bit of information coming up on how to know when that's the case.

Slide 19: pureScale bufferpool monitoring (cont.)

Group bufferpool full conditions are an important thing to be aware of. This is kind of like the concept of a dirty steal, if you are familiar with that in non-pureScale DB2. Think of it like this: you come into a cafeteria. You have your lunch with you on a tray, and all the tables are dirty, and you want to find a place to sit down. You have to stop and clean the table - clean someone else's dishes - before you can sit down and use that table. That's kind of what's happening here. We want to bring a newly modified page up to the group bufferpool, and it's full, with no place to put the new page. So we have to arrange for some castout, which is the pureScale equivalent of page cleaning, to free up some space. Those are expensive operations. We don't want to block bringing in a new page with a forced cleaning of some of the GBP pages. We have this calculation on group bufferpool full conditions. There is a table function called mon_get_group_bufferpool. We are just dividing the number of group bufferpool full conditions by the number of transactions, and multiplying by 10,000. We are basically asking "how many group bufferpool full conditions have we really seen per 10,000 transactions?". Ideally we want zero. But if it's only one or two it's probably fine. If it's 10 or 20, then chances are that we are seeing a condition that is really impacting our pureScale performance, and we want to do something about it.
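Before moving on to tuning, here is a hedged sketch of the LBP and GBP hit ratio formulas from the last two slides, rolled up across the cluster for data pages. This is my query, not the deck's, and the async LBP pages-found element used to back out prefetched pages is an assumption to verify for your DB2 level.

  -- LBP hit ratio: (LBP pages found - prefetched pages found) / logical reads.
  -- GBP hit ratio: (GBP logical reads - GBP physical reads) / GBP logical reads.
  SELECT DEC((SUM(pool_data_lbp_pages_found)
              - SUM(pool_async_data_lbp_pages_found)) * 100.0
             / NULLIF(SUM(pool_data_l_reads), 0), 5, 2)      AS lbp_hit_ratio_pct,
         DEC((SUM(pool_data_gbp_l_reads)
              - SUM(pool_data_gbp_p_reads)) * 100.0
             / NULLIF(SUM(pool_data_gbp_l_reads), 0), 5, 2)  AS gbp_hit_ratio_pct
  FROM   TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t;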
Slide 20: pureScale bufferpool tuning

In terms of tuning, we're just going to take some of these simple calculations and go through them, looking for what needs to be worked on. So, step 1, look at the overall bufferpool hit ratio. Does it meet the goals? If it does, then we're done. If not, then in step 2 you can check out the local bufferpool hit ratio. We looked earlier at the target values that would be desirable. An important thing to remember, though, is that the group bufferpool does two things. It pulls modified pages in, and it also holds references to every page in every local bufferpool across the cluster. So suppose I decide that my local bufferpools are too small, and I need to double the size. I have now doubled the potential number of pointers or references that the GBP has to keep track of, because I have doubled the number of pages in my local bufferpools across my cluster. The GBP is going to use up more space to do that, and it might actually end up squeezing out space that the GBP could use for modified pages. So, a good rule of thumb: for every extra eight local bufferpool pages I add, the GBP needs one more page for registration of those page references. So if I add 4000 new pages to the local bufferpools, I want to add 500 new pages to the group bufferpool. That's just to keep everything balanced. We don't have to be super accurate about it, but we don't want to go in there and just add a whole bunch of memory to local bufferpools and not consider the group bufferpool.

Slide 21: pureScale bufferpool tuning (cont.)

That was the local bufferpool hit ratio and the group bufferpool hit ratio; we looked at these before. A couple of important ones. I mentioned the group bufferpool hit ratio can be really low - so how do you tell whether or not low is OK in your environment? A couple of extra rules of thumb. First of all, there's a calculation here: is pool_data_l_reads greater than 10 times pool_data_gbp_l_reads? In other words, are less than 10% of the page reads going to the GBP? In a very high read environment, if I'm not going to the GBP that much, I don't really care about the ratio of success - you know, the hit ratio - at the GBP. If it's less than 10% then I'm not going to worry about it too much. If it's more than 10%, chances are that it is actually helping me out, or it could be helping me out, and I should pay attention to the GBP hit ratio. Something else we can look at is: did I go to the GBP more than 25% of the time due to an invalid page? Or was I really mostly going to the GBP for missing pages? If it was for invalid pages, well, then chances are that the GBP is doing a lot more for the cluster in terms of performance, because if I have invalid pages, chances are the GBP has the good copy of the page, and I don't have to go to disk for it. Whereas, if I just don't have a page at all in the local bufferpool, the odds of it being in the GBP are much less. So, if it is more than 25% due to invalid pages, chances are that the GBP is really helping me out and it could benefit from extra pages above what it's got already.

Slide 22: pureScale bufferpool tuning (cont.)

The last one you want to check is GBP full. Like I said before, a great value for this - the number of GBP full conditions per 10,000 transactions - is zero. Good values are less than 5 per 10,000 transactions. If it's higher than that, there are a couple of things that we can look at. The group bufferpool might be too small. There are castout engines in pureScale, and like I said, these are the pureScale equivalent of page cleaners. They might not be keeping up. Or, do I have enough castout engines configured? That's set with num_iocleaners. Or maybe SOFTMAX is set too high.
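To pull those last few checks together, here is a hedged sketch (my queries, not the deck's). The first query looks at how often data page reads go to the GBP and how often the trip is due to an invalid page; the second computes GBP full conditions per 10,000 transactions. The num_gbp_full and total_app_commits column names are assumptions to verify against your level's documentation.

  -- 1) What share of data page reads go to the GBP, and what share of those
  --    GBP trips were caused by invalid (stale) pages rather than missing ones?
  SELECT DEC(SUM(pool_data_gbp_l_reads) * 100.0
             / NULLIF(SUM(pool_data_l_reads), 0), 5, 2)     AS pct_reads_going_to_gbp,
         DEC(SUM(pool_data_gbp_invalid_pages) * 100.0
             / NULLIF(SUM(pool_data_gbp_l_reads), 0), 5, 2) AS pct_gbp_reads_for_invalid
  FROM   TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t;

  -- 2) GBP full conditions per 10,000 transactions (target: 0, good: < 5).
  SELECT DEC(g.gbp_full * 10000.0 / NULLIF(w.commits, 0), 9, 2) AS gbp_full_per_10k_tx
  FROM   (SELECT SUM(num_gbp_full) AS gbp_full
          FROM TABLE(MON_GET_GROUP_BUFFERPOOL(-2)) AS gb) AS g,
         (SELECT SUM(total_app_commits) AS commits
          FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS wl) AS w;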
Slide 23: pureScale page negotiation (or 'reclaims')

Page negotiation, or something that's more commonly called a reclaim, is the idea I mentioned earlier in the scenario where I could have two members, each having a copy of the same page, and each wanting to modify different rows on that page. So let's look at a scenario here. Member A acquires a page P, modifies a row on it, and continues with its transaction. So, member A gets page P, it modifies it and gets an exclusive lock from the global lock manager, and does some other stuff for its transaction. Now, it hasn't committed yet. The guy goes for a coffee, and is holding all these locks. Member B wants to modify a different row on the same page. Now, it's a different row. If it was the same row, all the regular DB2 row locking stuff would kick in and it would be no different than what we're normally used to. But here it's a different row. Now, the thing is that the page can really only be modified in one place at a time, and B wants to modify it. Does it have to wait? Does B have to wait until A commits so that it can get a hold of that modified page? The answer is no. We have a mechanism in pureScale called a page reclaim. Page negotiation is another name for it. What happens here is that member B makes a request to the CF as usual. The CF sends a request to member A to release the page. Member A writes its modified copy of the page to the log - the yellow copy of P on the slide is now off in the log for member A. The CF then reclaims the page and passes page P off to member B, which can then proceed with its transaction. Notice that member A has not committed its transaction yet, and member B is still able to proceed. This is pretty great technology really, because if we didn't have this, we would have a real contention or concurrency problem between these two members. But instead they are both able to proceed. Now, an interesting bit of trivia here: what happens if member A rolls back? Because remember that member B now holds the modified page. Well, if member A rolls back, it is simply going to try to get that exclusive page lock back, and the page will be reclaimed from member B, etc., etc., back to member A, and then it can make its modifications to the page.

Slide 24: Monitoring page reclaims

Now, these are powerful and helpful for concurrency, but they're not free. They're not even really very cheap. So we don't really want to see too many page reclaims going on. We have a table function called mon_get_page_access_info that gives us some good information about this. Here's an example of some of the numbers that might come up. Basically we can see in the bottom table here, it's going to give us information for each table - the table itself, or the indexes on the table - and how many reclaims have been going on. These different columns represent page reclaims for exclusive lock, page reclaims for sharing, etc. Basically we're just looking for small numbers here. If we see one table, or an index of a table, that has particularly high numbers, that's an indication that maybe the members are fighting over the pages, and having to reclaim the page back and forth. Now, how much is too much? Here it's 12,641. Is that excessive? Well, if this was taken over, say, a week, it wouldn't be excessive at all. If this was 12,641 in a minute, yes, that would be excessive. So, a good rule of thumb: more than one reclaim per ten transactions is probably worth looking into.
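As a hedged sketch of that check (my query; the reclaim column names are assumptions based on the metrics described above, so verify them for your DB2 level), you could rank objects by reclaim activity like this:

  -- List the tables and indexes with the most page reclaim activity.
  SELECT tabschema,
         tabname,
         objtype,
         page_reclaims_x,        -- reclaims for exclusive access
         page_reclaims_s,        -- reclaims for shared access
         reclaim_wait_time       -- time spent waiting on reclaims
  FROM   TABLE(MON_GET_PAGE_ACCESS_INFO(NULL, NULL, -2)) AS t
  ORDER  BY page_reclaims_x + page_reclaims_s DESC
  FETCH FIRST 20 ROWS ONLY;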
Slide 25: Reducing page reclaims

There are a couple of ways that we can reduce page reclaims. One is to use smaller page sizes because, as I mentioned earlier on, a large page size fits more rows on the page, which might sound good. But that can encourage "false sharing": two different rows needed by two different members on the same page. The bigger the pages, the more likely that is to happen. So small page sizes can reduce false sharing conflicts.

Another sort of paradigm that tends to come up a fair bit is this idea of a tiny but hot table with frequent updates. You can have a great big complicated schema, but it turns out that there's one very important, small, linchpin kind of table that has a small number of rows but is frequently updated, and as a result those pages are getting yanked back and forth between members and can cause lots of reclaims. What we would recommend there is to use PCTFREE. So if I had a table with, say, 100 rows, and it was being frequently updated, those 100 rows might all fit on one page. If I set PCTFREE to 99 and REORGed, chances are those 100 rows are going to end up on 100 different pages. It's now 100 times bigger than it was, but we're still only talking about 100 pages, which is really small. And in a case like that, where I can get down to one row per page, I've completely eliminated this false sharing and things like that.

Slide 26: CURRENT MEMBER default column reduces contention

A new feature that was added in DB2 10 was the CURRENT MEMBER special register and hidden column. This helps me deal with probably the most common causes of reclaims. One of them is, case number one, frequent inserts of increasing numeric values. Call this a high key insert. So we're inserting rows into a table that has an index on a column where the new value is always bigger than all the previous ones. It might be a timestamp, it might be an order number, it might be something like that. You can picture an index in your mind: all the inserts, all the activity, all the new key insertion is kind of down at one side. It's at the high end of the index. There's a lot of contention for those pages at the high end of the index. And since this is pureScale, if I've got multiple members who all want to insert rows into those pages at the top end of the index, that can create contention. What we can do here is add a CURRENT MEMBER column, and some SQL here is going to show us what we're going to do. We're going to ALTER TABLE. The table in this example is called ORDERS. We're going to add a column called CURMEM, type small integer. We're going to say it's 'default current member implicitly hidden'. OK, so what's going to happen now is that every time a row is inserted, that CURRENT MEMBER column is going to be populated with the number of the member where the agent was running when the insertion happened. So it could be member zero or one or two. Now, it's implicitly hidden, which is good, just in case there's any bad programming practice out there where we say SELECT *: you won't get the current member column back. But we're not talking about any application changes here. This is just a DDL change. And then we're going to create an index. Instead of the index just being on the sequence number column of table ORDERS, we're going to make it on CURRENT MEMBER and sequence number. We're going to stick CURRENT MEMBER out front.
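Here is a sketch of the DDL being described. The ORDERS table and CURMEM column names come from the transcript, while the sequence-number column (SEQNUM) and the index name are made up for illustration:

  -- Hidden member-number column, populated automatically on insert.
  ALTER TABLE orders
    ADD COLUMN curmem SMALLINT
    DEFAULT CURRENT MEMBER
    IMPLICITLY HIDDEN;

  -- Put the member number ahead of the "hot" ascending key so that each
  -- member's inserts land in its own part of the index.
  CREATE INDEX orders_curmem_seq_ix
    ON orders (curmem, seqnum);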
Now, if I had a query that then said, you know, "SELECT stuff FROM ORDERS WHERE sequence number = some value", you might think that this index wouldn't be used, because I would have to go specify CURRENT MEMBER, and how do I know what CURRENT MEMBER should be - it's just not very good. The really great thing that goes along with this is that DB2 10 introduced a feature called jump scan that actually makes this index work the way you want it to work. In jump scan terms we call CURRENT MEMBER a gap column. Even if CURRENT MEMBER is not specified in the query, we can still use the index in a very efficient way. And that means that my index is now not just on sequence number, but on CURRENT MEMBER and sequence number. And so each member's insertions are in a different part of the index. Member 0 could be down the left hand side, member 1 might be in the middle, member 2 might be on the right hand side. And they're staying out of each other's way and reducing page reclaims.

Case two for this is a low cardinality index. So suppose I'm inserting rows into a table of people and there's a column called GENDER that's just got two values. And this index on gender just has two keys: male and female. And then it's got pointers to all the records - all the male records, all the female records, etc. You can imagine the contention in trying to update those pointers. So CURRENT MEMBER comes to the rescue again. We do a similar ALTER TABLE. We're going to add CURMEM. In this case, we don't have to add it in front - we can add it behind. But the point is that now, in this example I've got STATE, and the same thing would happen for GENDER: by adding CURRENT MEMBER to the key I've made it a finer granularity, and I'm helping to spread out the keys that are being updated with pointers that point to rows with duplicate values. In doing that I also reduce page reclaims between the members. Very cool stuff. Now, the studious people that are listening to this will notice that this technique, in particular case one, works for non-unique indexes. If this was a unique index, this wouldn't do the trick. But it works great for non-unique indexes.

Slide 27: Monitoring CF CPU utilization

Let's look at monitoring CF CPU utilization - how do I tell how busy my CF is? Like I said at the beginning, if I go onto a CF and I run vmstat, which tells me how busy the CPUs are, it's going to show about 100% busy, even when the cluster is completely idle. And that's because there are threads running on the CF whose whole job is to spin looking for new messages coming in from members. We want to get those messages in, processed, and out in tens of microseconds. And the only way to do that is to be right on the job. We can't afford to take interrupts, we can't afford to do context switches. As a result, the architecture of the CF looks really busy. How do you tell how busy it really is? How close to saturated is it? Well, we have a new [admin view] called env_cf_sys_resources, and this reaches into the CF and gets me some good information, in this example output on the right here. It shows the two CFs I've got, the primary and the secondary, and it shows me memory and lots of good stuff there. But the one I'm really interested in is the CPU utilization. In this particular example, it's showing that both CFs are 93% busy. To me that's busier than I'd want it to be. I would see this and say, "you know what, my CF is overextended". Is this something that I need to deal with, or is it just a momentary spike? If it's sustained, then I probably want to add some more cores to my CF so that my CF utilization comes down.
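A minimal sketch of pulling that information follows. The admin view name comes from the transcript; the view returns name/value style rows per CF, so check your level's documentation for the exact entry that reports CPU utilization.

  -- Show host resource information for the primary and secondary CF,
  -- including the CPU utilization entries discussed above.
  SELECT *
  FROM   SYSIBMADM.ENV_CF_SYS_RESOURCES;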
Slide 28: AUTOMATIC CF memory: simple case – 1 active database

In terms of CF memory, the important CF memory configuration parameters can be set to AUTOMATIC, and they generally do a fine job. It's not like the self-tuning memory manager; it's not constantly moving memory around or anything like that. These are calculated automatics. And most of the time they do a great job. Let's talk a little bit about the calculations that they do. At the top-most level, there's CF_MEM_SZ. That's the total amount of memory available to the CF. And if we set everything to AUTOMATIC, the kinds of things that we get are: CF_MEM_SZ set to between 70 and 90% of the physical memory on the box. CF_DB_MEM_SZ is the amount of memory for one database to use. It defaults to CF_MEM_SZ, meaning that basically I'm going to take all the memory that I've got in the CF, and I'm going to allow it to be used by one database. This is the 'one active database' scenario. CF_SCA_SZ is the shared communication area. It's going to be calculated to be somewhere in the 5 to 20% range of CF_DB_MEM_SZ. This holds things like table control blocks and other metadata. Not particularly interesting, but we've got to have some memory for stuff like that. CF_LOCK_SZ is 15% of CF_DB_MEM_SZ, and this is obviously for the global lock list. And then the rest of it, the bulk of the storage, goes to the group bufferpool. And on the right here, we can kind of see an approximate breakdown of how the memory is doled out to different consumers.

Slide 29: AUTOMATIC CF memory and multiple active databases

Now, what if I've got multiple active databases and I still want to use AUTOMATIC? You saw the scenario we just went through, where we gave all the memory to the first active database. What if I've got two? How does that work out? Well, if I'm going to have multiple active databases, there's a registry variable I should set called DB2_DATABASE_CF_MEMORY. That basically gives the CF a heads up on how many databases you want to have running concurrently. If I set this registry variable to -1, it basically divides up the memory by NUMDB. So the NUMDB configuration parameter would be used to divide it out. In this case it's set to three, and each of these three databases - one, two, and three - would get a third of the memory. Makes complete sense. If I don't want to depend on NUMDB, I can also set it to an explicit percentage. I can say in this case 33, so everybody gets 33%. You get the general idea. Now, if I just fire up pureScale, the defaults are that DB2_DATABASE_CF_MEMORY is set to 100 - in other words, one database gets 100% of the memory - and NUMDB is 32. So if I want to have multiple active databases, I want to pay some attention to these settings, NUMDB and the registry variable DB2_DATABASE_CF_MEMORY, so that I'm able to start up those multiple databases without running out of memory.

Slide 30: Detecting an interconnect bottleneck

The interconnect, which carries messages between the members and the CF, is very important to pureScale performance, and is something that we need to make sure doesn't create a bottleneck.
You know, Infiniband is something that sounds kind of infinite, but it definitely isn't. And we need to make sure that there isn't a bottleneck occurring on the host communication adapters. I mentioned the typical ratio at the beginning - 6 to 8 CF cores per HCA - but there are cases where even at 4 cores per HCA, if the cluster was really busy and there were lots of messages going back and forth, we could be creating a bottleneck. So the kinds of symptoms that we would see are: poor cluster throughput overall, with the CF not running out of CPU - in other words, there's free CPU on the CF, but cluster performance overall is pretty poor; high CF response time, and I'll get to how you measure that; and increased member CPU time. Now, why would the CPU on the member go up if I have an interconnect bottleneck? Ordinarily the CF responds so quickly that when an agent makes a request to the CF - to get a lock, to get a page, all these kinds of things - it spins and waits to get an answer. It doesn't go back into the run queue and wait to be rescheduled. It's counting on the fact that these requests are going to come back in a few tens of microseconds. And so it's spinning, waiting, spinning, waiting - usually the response comes back really fast, and spinning is the right thing to do. But if I'm in a cluster with a very busy CF, response time is poor, and the member can end up spinning and soaking up more CPU time [on the member].

How to measure? Probably the best, simplest way: we have two metrics that are very useful in this case. We have CF_WAITS, which is roughly the number of calls to the CF, give or take. And the number of calls is sort of the number of locks, the number of pages requested, all kinds of things like that - it's very workload dependent. We're going to divide that into CF_WAIT_TIME, which is the total amount of time spent waiting for those requests to get done. If we use the two of those together, we can get the average CF wait time. We take the total message time divided by the total number of messages, and we get the average time for a single message. We also have a separate one that's a little more specialized, RECLAIM_WAIT_TIME. It's the amount of time spent waiting on reclaims, because the time spent waiting on reclaims is not included in CF_WAIT_TIME. The good thing is that these metrics are available in a number of different places: at a per-statement level, in the package cache via mon_get_pkg_cache_stmt, or at the agent level through mon_get_workload, mon_get_connection, etc. So we can tune at multiple different levels of the pureScale system: at the top level, the overall system level, at the statement level, at the agent level, and at the application level. We have all these different ways of looking at who is sending messages to the CF, and how long they are taking.
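As a hedged sketch of that calculation (my query, not the deck's), here is one way to get the average CF wait time per workload with mon_get_workload. CF_WAIT_TIME and RECLAIM_WAIT_TIME are reported in milliseconds as far as I know, so the conversion to microseconds below is an assumption to verify.

  -- Average CF wait time per workload and member.
  SELECT workload_name,
         member,
         cf_waits,
         cf_wait_time,
         reclaim_wait_time,
         CASE WHEN cf_waits > 0
              THEN DEC(cf_wait_time * 1000.0 / cf_waits, 10, 1)
         END AS avg_cf_wait_microsec
  FROM   TABLE(MON_GET_WORKLOAD(NULL, -2)) AS t
  ORDER  BY cf_wait_time DESC;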
So if you're really interested, you can look at the different response times and processing times for lock requests, write requests, and read requests, which do have different average times. Lock requests are typically really fast. Writes can be quite heavy, depending on how many pages are moving up, and things like that.

Here's an example of some of the output for that. We show three different message types: SetLockState, WriteAndRegisterMultiple, and ReadAndRegister. (By the way, there are about 65 other message types you can look at if you're really interested, but these are the core ones: lock, write, and read.) For each one we get the number of requests and the total amount of time, so we can compute the average request time for individual message types.

There's also another table function called MON_GET_CF_CMD that reaches over to the CF and tells us how much time is actually spent on the CF. This is really useful, because I can then look at the average SetLockState time as seen by the member versus as seen by the CF, and those can be quite different depending on where the time is being spent. Am I spending my time in the network, or am I spending my time on the CF? Where is the time going? Using these, I can break that down.

Slide 32: Finding interconnect bottlenecks with MON_GET_CF_CMD

This is another example, here on slide 32. MON_GET_CF_CMD gives us one extra thing that we can't really get from an average CF wait time, and it's pretty handy: a message type called CrossInvalidate. Why is a CrossInvalidate message special? It's the message the CF uses to invalidate pages at members. Remember the example we talked about at the beginning, where member A modified a page and committed. The CF also knows that member B has a copy of that page, so it's going to send a CrossInvalidate message to member B to mark that page as invalid. CrossInvalidate is useful because it's a very small message - it's one byte long, it doesn't take any member processing, and it takes very little CF processing. So looking at CrossInvalidate times is just about the best way we have of measuring network performance in pureScale. Even if all kinds of other stuff is going on in a very busy cluster, CrossInvalidate (XI) times are pretty stable. You would expect them to be less than 10 microseconds, and if they're up over 20 microseconds, that's a pretty good, reliable sign that you've got a network bottleneck to deal with.

Slide 33: Interconnect bottleneck example

Here's a great example of an interconnect bottleneck. In this situation we were running a Linux pureScale cluster. It was really busy, running an SAP workload, and we already had a CF with two Infiniband HCAs - so we'd already given it extra bandwidth into the CF by having two cards instead of one. We divided CF_WAIT_TIME by CF_WAITS to get the average CF wait time, and in this case it was 630 microseconds. Now on pureScale, we're always talking about tens of microseconds: 100 microseconds is pretty high, 200 microseconds is really pretty high, and 600 is crazy high. Six hundred was a sign of a really serious problem here. So the first thing we did was check over on the CF. Was it really busy? Well, it was about 75% busy - high, but not so high that it could explain a 630 microsecond request time. (A CrossInvalidate-style check, like the sketch below, is a good way to confirm that the time is being lost in the network rather than on the CF.)
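Here is a hedged sketch of that kind of CrossInvalidate check. MON_GET_CF_CMD is the table function named on slide 32, but the argument and the column names used below (ID, CF_CMD_NAME, TOTAL_CF_REQUESTS, TOTAL_CF_CMD_TIME), along with the exact command-name string, are assumptions on my part - verify them against the MON_GET_CF_CMD entry in the DB2 10 documentation before relying on this.

    -- Average time per CrossInvalidate request, as measured on the CF itself.
    -- Column names and the NULL argument (all CFs) are assumptions; check the docs.
    SELECT ID            AS cf_id,
           CF_CMD_NAME,
           TOTAL_CF_REQUESTS,
           TOTAL_CF_CMD_TIME,
           CASE WHEN TOTAL_CF_REQUESTS > 0
                THEN (TOTAL_CF_CMD_TIME * 1.0) / TOTAL_CF_REQUESTS
           END           AS avg_time_per_request
      FROM TABLE(MON_GET_CF_CMD(NULL)) AS t
     WHERE CF_CMD_NAME LIKE '%Invalidate%';

Comparing the per-command times reported here with the member-side times from MON_GET_CF_WAIT_TIME is what lets you split "time spent on the CF" from "time spent on the wire".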
RECLAIM_WAIT_TIME was also pretty high.

Slide 34: Add another CF HCA

So what we did was add another HCA: we went from two CF HCAs to three, and that's what it looks like on the slide. The results were pretty spectacular. I wish all performance problems lay down and died like this one, because this one was great. We went from 630 microseconds down to 145, and we basically ran four times faster by adding 50% more CF HCA capacity - an excellent indication that this was indeed the bottleneck. So we had a huge drop in the average CF wait time, and the activity time of a really important INSERT statement dropped from 15 milliseconds down to four. This is not just some theoretical change of a few microseconds at some low level of networking; the time to run an INSERT statement, as the application would see it, dropped by a factor of almost four. Lots of performance improvement in this particular case, and definitely a bottleneck worth keeping an eye out for.

Slide 35: Low-level interconnect diagnostics

We can go lower level than CF_WAIT_TIME; there are some low-level interconnect diagnostics. The bad news is that our friend netstat, a very common Linux/UNIX network utility, does not provide useful information about Infiniband throughput. It basically says the interface is there, but it won't tell us anything about packets. The good news is that there are other ways of getting at that. If you're on Linux, there's a tool called perfquery, which will tell us how many packets have been sent and received; sample it over an interval and you can work out the packet rates. There are some instructions on the slide about how to do that, and roughly 300,000 to 400,000 packets per second, inbound or outbound, is a good upper bound for a single HCA. If you're getting up to that, you're probably hitting the wall, and you want to consider adding another HCA.

If you are not on Linux, you can also get packet counts directly from the IB switch management port - and this actually applies to Ethernet as well. You can go to the switch and ask it how many packets have been sent or received on the port that is connected to the CF. There are different instructions depending on the make of the switch - whether it's QLogic or Mellanox, etc. These are not everyday monitoring steps by any means; they're the kind of thing you would do in an exception case, when you have a problem and you really want to know what's going on. You can go down to these low-level statistics at the switch if need be, and figure out how many packets are going back and forth.

Slide 36: pureScale disk IO

As I mentioned at the beginning, pureScale disk IO is not all that different from EE. We want to see random reads in the five to ten millisecond range, asynchronous writes via castout - pureScale's page cleaning - in the one to five millisecond range, and, because we're sensitive to log writes, one to three milliseconds for the average log write time. As I said at the beginning, we're sensitive to that, so we want to track transaction log write performance. We can use mon_get_workload, a database snapshot, or something new in DB2 10, mon_get_transaction_log, a new table function which gives us lots of really good, precise information about the performance of the transaction log.
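As an example of that last point, here is a hedged sketch of pulling an average log write time out of mon_get_transaction_log. LOG_WRITE_TIME and NUM_LOG_WRITE_IO are the monitor elements I would expect to use for this, and the time element is reported in milliseconds, but confirm the column names and units in the MON_GET_TRANSACTION_LOG documentation for your level.

    -- Average log write time per member; -2 asks for all members.
    -- Column names are assumptions; verify against the documentation.
    SELECT MEMBER,
           NUM_LOG_WRITE_IO,
           LOG_WRITE_TIME AS log_write_time_ms,
           CASE WHEN NUM_LOG_WRITE_IO > 0
                THEN (LOG_WRITE_TIME * 1.0) / NUM_LOG_WRITE_IO
           END            AS avg_log_write_ms
      FROM TABLE(MON_GET_TRANSACTION_LOG(-2)) AS t
     ORDER BY MEMBER;

Measured against the targets above, an average much beyond the one to three millisecond range is the signal to start looking at the log device.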
Also in terms of disk IO on pureScale, I mentioned earlier that db2cluster sets good initial values. Probably the one piece of GPFS tuning we might suggest in version 9.8 is setting worker1Threads, a GPFS parameter that determines how many concurrent IOs can be in flight at the same time - essentially internal threads inside GPFS. We recommend setting it to 256 to get greater concurrency. In DB2 10 we've already taken that over and set it to 256 automatically, but if you're running one of the earlier versions of pureScale, you might want to set it explicitly yourself.

Slide 37: Castout configuration

We talked about castout being like page cleaning. If you've heard about alternate page cleaning in EE, castout in pureScale is quite similar to that, and that means it depends on a single database parameter to control it instead of two. Castout engines run on the members, and like I said earlier, the CF doesn't do any disk IO itself - the members do all the IO. So the castout engines run on the members and write to disk the modified pages that they obtain from the CF.

There are still page cleaners in a pureScale cluster, by the way. There are two kinds of write threads - castout engines and page cleaners - and page cleaners have a somewhat reduced job in pureScale. They just write what we call GBP-independent modified pages. There are certain pages that are produced - modified - in the local bufferpool that it doesn't make any sense to send to the group bufferpool, because who else is going to need them? Think about something like a CREATE INDEX: the pages of an index under construction are of no use to anybody else until the entire index is created, so there's no point in sending them up to the GBP on the CF in case anybody else needs them. Nobody will. That's an example of a GBP-independent page that just gets written to disk directly from the member without having to go up to the CF.

A couple of things influence castout activity. First of all, it's affected by SOFTMAX, the soft checkpoint value. If I reduce the value of SOFTMAX, that has two implications. I'm going to get faster group crash recovery - that's what happens after an entire cluster outage, for example a power outage. If the power goes out in the building, when it comes back on you have to do crash recovery, and group crash recovery will be faster if you have been running with a lower value of SOFTMAX. But of course a lower SOFTMAX also means more aggressive cleaning, which may have a performance implication at run time.

If you are migrating to pureScale, migration tip number one: you might want to consider setting SOFTMAX a bit higher than you had it in EE. The reason is that pureScale is a highly available architecture. If there is, for example, a machine problem that takes down a member, that's only one member in the cluster, and DB2 pureScale is able to keep going without it, whereas a single machine failure in a DB2 EE configuration takes out the whole system. In the EE case we want to be able to come back up as quickly as possible, so there's a reason to have a low SOFTMAX.
But in pureScale, because we have this extra layer of availability insurance, if you will, in the cluster configuration - where we can tolerate a member, or even multiple members, going away - you can consider setting SOFTMAX a little bit higher and getting better run time performance.

Migration tip number two: because castout is based on alternate page cleaning, the configuration parameter you might be used to, CHNGPGS_THRESH, has no effect on castout in pureScale. So if you happen to be coming from a DB2 EE system where all the page cleaning was being driven by CHNGPGS_THRESH, and maybe SOFTMAX was set a little too high, when you go to pureScale CHNGPGS_THRESH no longer has any impact and the cleaning will all be driven by SOFTMAX. So, migration tip number two: make sure SOFTMAX is set to a reasonable value before starting up on pureScale, so that you're not surprised when cleaning doesn't kick in when you thought it should.

Something else that affects castout activity is the size of the group bufferpool relative to the size of the database. For example, if you had a 1 GB group bufferpool and a 10 TB database, the group bufferpool can obviously hold only a tiny fraction - about one hundredth of 1% - of the total database in that example. So it's going to have to free up space for new pages, victimizing pages and writing them out at a very high rate - a much higher rate than if the database were only ten times bigger than the group bufferpool instead of ten thousand times bigger.

The last thing that affects castout activity is the number of castout engines, set with NUM_IOCLEANERS. The AUTOMATIC setting is generally fine, but in DB2 10 one of the things that changed with NUM_IOCLEANERS set to AUTOMATIC is that it is now based on the number of physical cores in the system rather than the number of threads. So if you're on a large machine with DB2 9.8, you probably don't want to use AUTOMATIC: with multi-threading the machine might be running, say, 32 or 64 threads on 16 cores, and instead of 16 castout engines you would end up with maybe 32 or 64, which could be too many. We would recommend sticking to the number of cores - you need to do that explicitly on version 9.8, but in version 10, AUTOMATIC will do it for you.

Slide 38: Castout monitoring

How do you monitor castout? It's easy - the basics are the same as monitoring page cleaning in EE. The same things one would do on EE, like calculating the frequency of disk writes and the amount of time disk writes take, can be done the same way, either from snapshots or from the table functions. The slide shows an example of a query that pulls information from MON_GET_WORKLOAD and MON_GET_BUFFERPOOL to calculate index writes and data writes per transaction (a sketch of that kind of query appears below).

It's also a good idea to keep track of this at the operating system level, using tools like iostat and nmon to track the IO activity coming from the database. A good tip: if the write activity is smooth, that's great; if it's bursty or choppy, with lots of peaks, it could be that SOFTMAX is a little too high. You might want to reduce SOFTMAX a little so that the writes are smoothed out and we achieve better overall performance.
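The slide-38 query itself isn't reproduced in this transcript, so the following is a hedged reconstruction of what such a query might look like, using documented elements: POOL_DATA_WRITES and POOL_INDEX_WRITES from MON_GET_BUFFERPOOL, and TOTAL_APP_COMMITS from MON_GET_WORKLOAD as the transaction count.

    -- Data and index page writes per committed transaction, cluster-wide.
    -- Both table functions are called with -2 to cover all members.
    WITH bp AS (
      SELECT SUM(POOL_DATA_WRITES)  AS data_writes,
             SUM(POOL_INDEX_WRITES) AS index_writes
        FROM TABLE(MON_GET_BUFFERPOOL(NULL, -2)) AS t
    ),
    wl AS (
      SELECT SUM(TOTAL_APP_COMMITS) AS commits
        FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS t
    )
    SELECT bp.data_writes,
           bp.index_writes,
           wl.commits,
           CASE WHEN wl.commits > 0
                THEN (bp.data_writes  * 1.0) / wl.commits END AS data_writes_per_tx,
           CASE WHEN wl.commits > 0
                THEN (bp.index_writes * 1.0) / wl.commits END AS index_writes_per_tx
      FROM bp, wl;

Tracked over time, or as a delta between two samples, a jump in writes per transaction is the kind of signal that would send you back to the SOFTMAX and group-bufferpool-size discussion above.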
You also want to keep track of whether the write times at the OS level are reasonable. Anything much more than about ten milliseconds is a sign that the IO subsystem is not keeping up.

Slide 39: Optim Performance Manager and DB2 pureScale monitoring

I just want to mention Optim Performance Manager (OPM) and DB2 pureScale. OPM has been able to monitor DB2 pureScale for quite a while now. OPM 4.1.1 could do global monitoring of DB2 pureScale - per-member and cluster-wide monitoring, CF CPU, a number of things - but at a pretty high level. OPM 5.1, which has a huge number of features and improvements relative to 4.1.1, also improves in the area of pureScale by tracking GBP hit ratio per connection, CF requests over time, page reclaim rate and time, global lock manager information, and so on. And with DB2 10 and OPM 5.1.1, it even tracks the average CrossInvalidate time, the number of CrossInvalidate requests, and a whole bunch of other things that make it very suitable for monitoring and tuning performance in DB2 pureScale.

Slide 40: Summary

OK, and that brings me to the end. Just to wrap up: I think you'll notice, after listening to this presentation, that a lot of things in pureScale are very, very similar to how they were in EE. If you are a database administrator, developer, or other interested party coming from a DB2 EE environment - or even DPF, for that matter - and you're going to pureScale, very many of the processes, queries, and tricks you carry around in your back pocket are going to be equally applicable on pureScale, in terms of configuration parameters, monitoring techniques, desired or problematic metric ranges, and so on.

The important thing, if you're coming from an EE environment to a pureScale environment, is to keep in mind the four key architectural differences I mentioned right at the beginning, as a way of identifying the new potential performance areas you want to keep track of. First, the CF is the hub of cooperation and communication between the members. Second, there's a very key piece of technology in the low-latency interconnect between the members and the CF. Third, we've got a two-level bufferpool: the group bufferpool plus a local bufferpool at each member. And fourth, we've got the notion of page locks and page negotiation, which make lock processing in pureScale just a little bit different than in earlier versions of DB2.

Slide 41: Summary (continued)

In terms of a way to get going, I'd suggest, if you're just starting out, beginning with EE-based monitoring and tuning techniques - simple stuff, the core tools and techniques you already have - using AUTOMATIC in most cases and tuning from there. Don't get caught up in trying to figure out exactly what number should be set for each configuration parameter; let DB2 do some of the work with AUTOMATIC. Use the standard bufferpool hit ratio methods, applying them to the system as a whole at the LBP and then at the GBP, and keep track of IO tuning, using our target ranges for read and write times and making sure that IO bottlenecks don't develop. From there - from the core EE, core DB2 stuff - we can progress to the pureScale-specific areas.
For example: CF resource allocation, CF response time, CPU utilization on the CF, and the behaviour of page negotiations, i.e. reclaims - how often are they happening, and how much impact are they having on the system? And of course, if you are considering moving to DB2 pureScale, DB2 10 is a great place to go, because we've got a number of new improvements in DB2 10 in the area of pureScale that enhance performance: for example, CURRENT MEMBER, better monitoring information, the jump scan, and other core DB2 engine improvements, plus broader support for pureScale in OPM 5.1.1.

Thank you very much!

Notices

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any recommendations or techniques herein is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment does so at their own risk. This document and the information contained herein may be used solely in connection with the IBM products discussed in this document. This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: © Copyright IBM Corporation 2013. All Rights Reserved.

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries.
A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.

Contacting IBM

To provide feedback about this transcript, write to [email protected]

To contact IBM in your country or region, check the IBM Directory of Worldwide Contacts at http://www.ibm.com/planetwide

To learn more about IBM Information Management products, go to http://www.ibm.com/software/data/