ITCAM for Transactions: updating Web Response Time agent configuration to address Transaction Tracking overload
Preface
This document records the technical challenges encountered during a particular Agentless Monitoring
deployment and the techniques and strategies used to overcome them.
Authors:
William Lanny Short, Certified Process Specialist, IBM
Robert Cheung, ITCAM for Transactions Developer, IBM
The problem
A business has a mission-critical application that spans many systems and components, and creates a lot
of network traffic. When the deployment of ITCAM for Transactions: Web Response Time (KT5) was
planned, the intention was to configure each agent to run in appliance mode. This deployment strategy
would mean that only a small number of agents would be required and they could be deployed to
monitor the application-generated web traffic.
To allow a KT5 agent to monitor many hosts at once, the network switches need to be put into port
spanning mode so that the KT5 agent host receives a copy of the network traffic. In this case, the network
team could not support these hardware changes and port spanning could not be implemented. This
resulted in the need to install a KT5 agent on every host where monitoring was desired. For example, if
the application consists of six IBM HTTP servers and six WebSphere Application Servers, a KT5 agent
is needed on each of those servers.
The KT5 agent, by default, monitors all network traffic seen by its host's network interfaces. This has
the potential for generating a lot of noise that can significantly impact the performance of the KT5
agent and the ITCAM infrastructure that it points to. The Application Management Console agent
(KT3), which is critical for the management and maintenance of the ITCAM for Transactions solution,
could use up a lot of CPU cycles or crash. This can happen because the KT3 agent is configured, by
default, to report all of the applications that each KT5 agent finds. The CPU cycle consumption or
crashing of the KT3 agent could be the result of seeing duplicate applications or other unnecessary
network noise.
So how do you mitigate this problem and prevent such a problem from occurring in future application
on-boarding exercises? This white paper addresses these questions.
What are the underlying issues?
The underlying issue is the performance of the Transaction Reporter (KTO) and Application
Management Console (KT3) agents. Because a large number of KT5 agents are deployed, these
agents receive a lot of incoming data, which causes each agent to consume a lot of CPU cycles or to crash.
The KT3 agent has the primary responsibility for configuring all the ITCAM for Transactions data and
filtering it based on applications that have been created as part of its configuration settings. It also
consolidates all the applications detected from the various ITCAM for Transactions agents, such as the
KT5 agent, the Robotic Response Tracking agents, and other ITCAM for Transactions agents, into one
pane.
The KTO agents have the primary responsibility for displaying the ITCAM for Transactions data as it
has been collected. They read through all the collected data and create transaction overlays based on
the configurations created in the KT3 and the server and client sources and destinations of the collected
data.
If the data collected is extremely noisy, the agents spend a significant amount of time trying to sort
through the noise and that is the primary cause of CPU cycle consumption or agent crashes.
What is the solution?
The solution to this problem has two phases.
The first phase is to correct the immediate problem. The steps for this phase address the current
problem with the KTO and KT3 agents getting bogged down in all the noisy ITCAM for Transactions
data and help get the agents back to doing their jobs with no performance problems.
The second phase is to prevent future problems when on-boarding applications. The steps for this phase
include creating a sand-box that can be used for initial testing of the KT5 agents deployed onto the new
application's components so that filtering configuration can be done as part of the on-boarding process.
Phase 1: Correcting the immediate problem
The aim of this phase is to limit the amount of data that each KT5 agent creates by configuring it to
monitor only traffic of interest. Using the example above, the KT5 agents are configured to monitor only
HTTP(S) and WebSphere Application Server traffic, and to ignore other traffic. Furthermore, the KTO is
checked to ensure that the number of entities it is monitoring has not exceeded its capacity. If it has,
additional KTO agents have to be deployed to spread the load. This second step is particularly important
in scenarios where dozens or even hundreds of KT5 agents have been deployed.
First, stop the KTO agent from retrieving data from every KT5 agent that is connected to the TEMS,
and restrict it to the relevant subset instead. The KTO agent should be talking only to the KT5 agents
that are deployed onto the application in question. This is done by setting the Aggregation Agent List
configuration parameter of the KTO agent, which is documented in the knowledge center.
Second, using the Application Manager Console Editor (in the Tivoli Enterprise Portal), configure both
the KT5 data sources and the Transaction Collector (KTU) data sources so that the KT5
agents report only on traffic related to the application at hand. Detailed instructions can be
found in the best practices guide ITCAM for Transactions V7.3 Customization: Transaction Tracking
Filtering and reporter.
Third, confirm that the number of nodes and edges seen in the KTO transaction overlay diagrams is
manageable. To get an idea of the number of nodes, edges, and interactions, in the Tivoli Enterprise
Portal, complete the following steps:
1. In the navigator, select Transaction Reporter.
2. Right-click Transactions and select the Transaction Aggregate Topology workspace.
3. Click the Table/Topology view toggle button to switch the view to a table.
4. Check the number of rows returned.
Figure 1: TEP Transaction Aggregate Topology Workspace
Figure 1 shows what is seen in the TEP Transaction Aggregate Topology workspace.
Alternative Methods:
- Check the Total displayed (circled in Figure 1). This is a coarser estimate because some of the
  nodes and interactions can be hidden by various conditions.
- Perform a more detailed investigation of the Transaction Reporter logs. During each collection
  interval (configured to be 2 minutes by default), the KTO gathers all aggregates and interactions
  and logs how many of each were gathered and from which Transaction Tracking agents. To
  find this information, search for "collectionPeriods()" in the latest KTO log.
- Count the number of RecordIdentityxxx.xml files that are contained in the
  <ITMHOME>/todata directory for the KTO. Each file represents a node that the KTO has
  seen at some stage. For example, on UNIX or Linux run the following command:
  > find <todata directory> -name "RecordIdentity*.xml" | wc -l
  Example output: 5628
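The last two checks can also be scripted where a shell is not convenient, for example on Windows hosts. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes nothing about the KTO log format beyond the presence of the "collectionPeriods()" search string mentioned above.

```python
import os
from fnmatch import fnmatchcase


def count_record_identity_files(todata_dir):
    """Count RecordIdentity*.xml files under todata_dir, including subdirectories.

    Mirrors: find <todata directory> -name "RecordIdentity*.xml" | wc -l
    """
    count = 0
    for _root, _dirs, files in os.walk(todata_dir):
        # fnmatchcase keeps the match case-sensitive on every platform,
        # which is how find -name behaves.
        count += sum(1 for name in files if fnmatchcase(name, "RecordIdentity*.xml"))
    return count


def collection_period_lines(log_path):
    """Return the lines of a KTO log that mention collectionPeriods()."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        return [line.rstrip("\n") for line in log if "collectionPeriods()" in line]
```

Pass the KTO's todata directory to the first function and the path of the latest KTO log to the second. Both paths depend on the installation and are not assumed here.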
A manageable overlay diagram should have fewer than 5,000 nodes and edges. If there are more
than that, add more KTO agents to reduce the load, and then test the outcome. Repeat this
process iteratively. That is, activate only a small number of KT5 agents at a time, and complete the
filtering and reporting steps for those agents before activating additional KT5 agents.
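As a rough planning aid, the 5,000 guideline can be turned into simple arithmetic. The sketch below is not part of the product. It assumes that the load can be split evenly across KTO agents, which real deployments may not achieve, so the result is only a lower bound. The helper names and the batch size are hypothetical.

```python
MANAGEABLE_LIMIT = 5000  # guideline from this paper: stay below this many nodes and edges


def min_kto_agents(total_nodes_and_edges, limit=MANAGEABLE_LIMIT):
    """Lower bound on the number of KTO agents needed.

    Assumption: nodes and edges divide evenly across agents, and each
    agent must stay strictly below the limit (at most limit - 1).
    """
    per_agent = limit - 1
    # Ceiling division; at least one KTO agent is always needed.
    return max(1, -(-total_nodes_and_edges // per_agent))


def batches(agents, size):
    """Split a list of KT5 agents into small groups to activate one group at a time."""
    return [agents[i:i + size] for i in range(0, len(agents), size)]


print(min_kto_agents(5628))  # 2
```

With the example count of 5628, at least two KTO agents would be needed. The batches helper only orders the work: filter and report on one group of KT5 agents, check the counts again, and then move on to the next group.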
Phase 2: Preventing future problems when on-boarding applications
After phase 1, the KTO agent began working again, but the introduction of additional KT5 agents caused it to
become overloaded once more. So how can future applications be on-boarded problem-free?
Ideally, create a sand-box in your Production environment where you can ensure that the on-boarding
steps for the Production version of the new application do not cause any problems for the current
deployment. More importantly, in the event that overload occurs, you can reset the KTO by flushing
filtered transaction tracking nodes and edges.
The only required component of the sand-box is a KTO agent that is used as part of the on-boarding
process. The KTO agent's Aggregation Agent List should contain only those KT5 agents that have
been deployed onto the new application's components.
Perform the same reporting and filtering steps that you used when correcting the original problem.
After you have completed those steps, note how many nodes and edges are in the sand-box KTO
transaction overlay diagram.
If the current main KTO's number of nodes and edges plus the sand-box KTO's number of nodes and
edges is more than 5,000, complete the following steps:
1. Make sure that the main KTO's Aggregation Agent list includes only those KT5 agents that
were already deployed.
2. Deploy and configure an additional main KTO agent and add the KT5 agents that were
pointing to the sandbox KTO to the new main KTO's Aggregation Agent list.
3. Remove the new KT5 agents from the sandbox KTO's Aggregation Agent list.
4. Reset the sandbox KTO agent by deleting the <agent_home>\todata directory and restarting that
agent.
5. Repeat steps 1 - 4 until all the new application's KT5 agents have been added.
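The decision that triggers these steps, and the list changes in steps 2 and 3, can be sketched as follows. This is an illustration only: the Aggregation Agent lists are modelled as plain Python lists, and the function names are hypothetical rather than any product API.

```python
LIMIT = 5000  # combined nodes and edges above which an additional main KTO is deployed


def needs_additional_kto(main_count, sandbox_count, limit=LIMIT):
    """True when the main KTO plus the sandbox KTO would exceed the limit."""
    return main_count + sandbox_count > limit


def hand_over(sandbox_list, new_main_list):
    """Steps 2 and 3: move the sandbox KT5 agents to the new main KTO's list.

    Returns (updated new main list, emptied sandbox list) without
    modifying the input lists.
    """
    moved = [agent for agent in sandbox_list if agent not in new_main_list]
    return new_main_list + moved, []
```

For example, a main KTO with 3,000 nodes and edges and a sandbox KTO with 2,001 exceed the limit together, so the new application's KT5 agents would go to an additional main KTO instead of the existing one.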
Note: Do not turn on data warehousing for the new main KTO until the filtering is complete. Never
turn on data warehousing for the sandbox KTO; there is no benefit and the agent's performance will be
impacted.
Conclusion
This white paper discussed addressing Transaction Tracking overload problems that occur when there
is a need to deploy a significant number of ITCAM for Transactions: Web Response Time (KT5) agents
to monitor web traffic for an application.
Document History
Date            Revision   Notes
June 2017       1.0        Initial version
June 2017       1.1        Added sandbox concept, added diagram
17 June 2017    1.2        Added Preface, more edits. Thanks to Alexander Thornton for
                           editorial review. First published