ATLAS FroNTier cache consistency stress testing
David Front
Weizmann Institute
September 2009
Outline
• Goals
• ??
• Parameters to test
Handling cache consistency

[Diagram: cache consistency data flow. A client machine runs ATHENA / COOL / CORAL over the FroNTier client, which talks through squids to the server site. There, a Tomcat FroNTier servlet consults table modification times to decide whether or not to invalidate cached results; the Oracle DB server memorizes the modification time of each table. A sketch of the invalidation check follows below.]
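The servlet's invalidation decision reduces to a timestamp comparison. A minimal sketch of the idea (the function and argument names here are hypothetical; the real servlet logic is Java running inside Tomcat):

```python
from datetime import datetime

def cache_still_valid(cached_at: datetime, table_last_modified: datetime) -> bool:
    """Decide whether a cached response may still be served.

    `table_last_modified` is the per-table modification time that the
    Oracle DB server memorizes; in practice the comparison is subject to
    the configured propagation delays (5 minutes at each stage).
    """
    # Serve from cache only if the table has not changed since the
    # response was cached; otherwise invalidate.
    return table_last_modified <= cached_at
```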
Goals
1) Verify that the table modification time trigger does not consume too many resources
   (I expect this to be trivial)
2) Show that the performance of FroNTier with cache consistency matches expectations
3) Compare the performance of FroNTier to a direct Oracle connection
Testing is to be done at CERN, and possibly also by sending requests from remote sites
Testing
• To create a high load on the caching system, multiple instances of a job created by Richard Hawkings will be submitted
• Jobs will be submitted as grid jobs
• Two ATLAS FroNTier server machines that are currently being configured will be used for this testing
Testing challenge
• According to Dave Dykstra, loading a FroNTier server requires many tens of squids, since squid is a single-threaded application
• A reasonable number of squids to install on a machine is its number of cores
• Allocating dedicated servers to function as pools of squids does not seem to be the right approach
• An alternative is to have a squid available on each client machine
A squid per client machine
In order to have a squid per client machine:
– Tens of worker nodes should be pre-allocated
– As suggested by Johannes Elmsheuser, HammerCloud cannot accommodate such a setting efficiently
– Rather, jobs may be submitted locally to the scheduler, using a dedicated job queue
– For this, the appropriate resources should be allocated (Dario Barberis?)
– In addition, a way (a script) is needed to set up a squid at the beginning of a test and tear it down at the end; a sketch follows below
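A minimal sketch of such a setup/tear-down helper, assuming squid is already installed on each worker node; the binary path, config file, and settings below are hypothetical placeholders:

```python
import subprocess

SQUID = "/usr/sbin/squid"           # assumed location of the squid binary
CONF = "/tmp/frontier_squid.conf"   # hypothetical per-test config file

def setup_squid(port: int = 3128) -> None:
    """Write a minimal config and start a squid for the test."""
    with open(CONF, "w") as f:
        f.write(f"http_port {port}\n")
        f.write("cache_mem 256 MB\n")
    subprocess.run([SQUID, "-f", CONF, "-z"], check=True)  # initialize cache dirs
    subprocess.run([SQUID, "-f", CONF], check=True)        # start the daemon

def teardown_squid() -> None:
    """Shut the squid down at the end of the test."""
    subprocess.run([SQUID, "-f", CONF, "-k", "shutdown"], check=True)
```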
Testing elements
• Squid setup and tear-down
• Spawning writer jobs, one every 15 minutes (or 30, or 60); a sketch of the spawning loop follows below
• Spawning reader jobs via the grid
• Monitoring
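A minimal sketch of the writer-spawning loop in the testing manager, assuming a hypothetical `submit_writer_job` command; the real jobs write fresh data to Oracle:

```python
import subprocess
import time

def spawn_writer_jobs(period_minutes: int = 15, duration_hours: float = 1.0) -> None:
    """Spawn one writer job every `period_minutes` for `duration_hours`.

    Each writer job writes fresh data to Oracle, exercising the cache
    consistency machinery. The submit command is a hypothetical placeholder.
    """
    deadline = time.time() + duration_hours * 3600
    while time.time() < deadline:
        subprocess.run(["submit_writer_job"], check=True)  # hypothetical command
        time.sleep(period_minutes * 60)
```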
FroNTier stress testing (at CERN)

[Diagram: test setup. A testing manager sets up and tears down the squids, spawns writer jobs, and spawns reader jobs. Athena reader jobs are spawned via the grid onto tens of pre-allocated worker nodes, using a dedicated job queue. Once in a while (every 15 minutes) a writer job writes fresh data to Oracle. Reader requests flow through the squids to the Tomcat FroNTier servlet and on to the Oracle DB server. Monitoring covers the DB server, the FroNTier machines, and the squids.]
Parameters to test with
Parameter: arrival rate of reading client jobs
– I do not know what the expected rate is
– Taking into account the number of queries per Athena job (~1-3k), and comparing with the capability of CMS FroNTier to answer queries, I expect the whole system to be able to handle 2 to 300 jobs per minute (roughly 100 to 20,000 jobs per hour); see the conversion sketch below
– Hence, testing may be repeated with the following rates of reader jobs per minute: 10, 100, 1000
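A back-of-the-envelope conversion of these job rates into aggregate query rates, using only the ~1-3k queries per Athena job figure above:

```python
# Convert reader-job arrival rates into aggregate query rates, using the
# ~1-3k queries per Athena job figure from the slide.
QUERIES_PER_JOB = (1_000, 3_000)  # low and high estimates

for jobs_per_minute in (10, 100, 1000):
    low = jobs_per_minute * QUERIES_PER_JOB[0] / 60   # queries per second
    high = jobs_per_minute * QUERIES_PER_JOB[1] / 60  # queries per second
    print(f"{jobs_per_minute:5d} jobs/min -> {low:7.0f} to {high:7.0f} queries/s")
```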
Parameters to test with
• The rate at which ATLAS COOL tables are updated:
  – Using: once every 15 minutes
    • This is the rate at which PVSS2COOL operates
    • According to Fred Luehring, other COOL tables are updated at a slower rate
  – Testing may be repeated with slower periods:
    • Once every 30 minutes
    • Once every 60 minutes
Compare performance of FroNTier to Oracle
Network latency is an important issue, hence test:
– The reading client runs from CERN or from a remote near/far location
– TBD: candidate remote locations
– For each of these 3(4) sites, run the example workload via FroNTier or directly against Oracle at CERN; a timing sketch follows below
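A minimal sketch of the timing comparison, with hypothetical query back-ends; the actual workload is Richard Hawkings' reference job:

```python
import time

def time_workload(run_query, queries) -> float:
    """Return the wall-clock seconds needed to run all `queries`."""
    start = time.time()
    for q in queries:
        run_query(q)
    return time.time() - start

# Hypothetical back-ends, to be implemented against the real services:
# one issues queries through the local squid/FroNTier chain, the other
# goes directly to Oracle at CERN.
#
# frontier_secs = time_workload(run_via_frontier, workload)
# oracle_secs = time_workload(run_via_oracle, workload)
# print(f"FroNTier: {frontier_secs:.1f}s, direct Oracle: {oracle_secs:.1f}s")
```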
Cache consistency policy
• A comparison between the performance with and without cache consistency may be expected.
• Test with both compared cases using the same FroNTier delays (5 minutes).
FroNTier delays
• We are currently using 5 minutes for each of the three delays (squid / FroNTier server / DB trigger); presumably the worst-case data staleness is roughly the sum of the three
• I do not see the need to test with shorter delays
• Yet, it may be expected to compare performance between different delay times, in minutes: [5, 10, 15]
Monitoring
Testing time frame
• Each test will be run for a given period (one hour) at steady load, after ramping up the load
• Ramp-up may take on the order of 1-2 hours, according to Johannes Elmsheuser
• After job spawning stops, the load will wind down (perhaps within an hour or two as well)
• Hence each test is expected to last about 3-5 hours
Links
• David Dykstra: poster of the CMS solution
http://frontier.cern.ch/dist/Poster_CHEP09_Frontier-newcaching.pdf
• Richard Hawkings: ATLAS COOL reference workloads and tests
https://twiki.cern.ch/twiki/bin/view/Atlas/CoolRefWork
• David Front: Previous related presentations
http://indico.cern.ch/getFile.py/access?contribId=5&resId=1&materialId=slides&confId=59928
http://indico.cern.ch/getFile.py/access?subContId=2&contribId=2&resId=1&materialId=slides&confId=62120