
Log analysis
and user traceability
Eygene Ryabinkin, [email protected],
Russian Research Centre «Kurchatov Institute»
March 12th, 2009, OSCT-7 meeting, Madrid
lcg-CE logs: general ideas



CE logs link Grid jobs to the local jobs, so they
are the most logical point to start from.
Jobmap logs are available and they have
almost all information: user DN, VOMS FQAN,
Grid (EDG) and LRMS job IDs, local user
mapping and gatekeeper contact.
With the LRMS ID we can trace the job down to
the execution nodes. For Torque, one can use
either accounting logs or plain job logs. I don't
currently know how this looks for SGE in its
YAIM flavour.
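To make the jobmap point concrete, here is a minimal Perl
sketch for pulling fields out of such records. It assumes the
usual layout of quoted "key=value" pairs and the key names
(userDN, userFQAN, jobID, lrmsID, localUser) as seen on our CE;
treat both as assumptions to verify locally.

  #!/usr/bin/perl
  # Pull the interesting fields out of grid-jobmap records fed on stdin.
  use strict;
  use warnings;

  while (my $line = <>) {
      # Every record is a series of quoted key=value pairs.
      my %f = $line =~ /"([^="]+)=([^"]*)"/g;
      next unless $f{userDN};    # skip records without a user DN
      printf "%s  grid=%s  lrms=%s  local=%s\n",
          $f{userDN}, $f{jobID} || '-', $f{lrmsID} || '-', $f{localUser} || '-';
  }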
lcg-CE logs: additional details


Jobmap logs are laid out by date: file names are
grid-jobmap_YYYYMMDD. This is very handy.
Jobmap logs are missing the IP address of the
client, so one should also parse the gatekeeper
logs – oops! GK logs are huge and ugly; the
only unique identifier that links a jobmap entry
to the GK entries is the Grid (EDG) job ID. The
IP address lookup therefore means finding the
GK jobmanager ID and then searching the
preceding entries for the IP.
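A hedged sketch of that lookup. The 'PID: <pid>' prefix, the
'Got connection <ip>' message and the Grid job ID appearing
verbatim in later messages of the same gatekeeper child are all
assumptions on my side; the exact wording differs between
Globus versions.

  #!/usr/bin/perl
  # Map a Grid (EDG) job ID to the client IP via the gatekeeper log.
  use strict;
  use warnings;

  my $edg_id = shift or die "usage: $0 <grid-job-id> < globus-gatekeeper.log\n";
  my %ip_of_pid;    # last connection IP seen per gatekeeper child PID

  while (my $line = <>) {
      my ($pid) = $line =~ /PID:\s*(\d+)/ or next;
      $ip_of_pid{$pid} = $1 if $line =~ /Got connection\s+(\S+)/;
      if (index($line, $edg_id) >= 0) {
          print "client IP for $edg_id: ", $ip_of_pid{$pid} || 'unknown', "\n";
          last;
      }
  }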
lcg-CE logs: file locations
•
/var/log/globus-gatekeeper.*: the most verbose
logs about the jobs that the gatekeeper processes.
•
/opt/edg/var/gatekeeper/grid-jobmap_*:
summaries of jobs run by lcgpbs and friends.
•
/var/spool/pbs/server_priv/accounting/*:
Torque accounting logs that carry most of the
activity traces; we are mainly interested in the
start/end events (see the sketch after this list).
•
/var/spool/pbs/server_logs/*: more verbose
Torque logs, but they exist only on the Torque
server, which is not necessarily the CE.
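For the Torque accounting files, a small sketch that picks the
start (S) and end (E) records of one LRMS job; it assumes the
standard 'date;type;jobid;key=value ...' record layout.

  #!/usr/bin/perl
  # Print start (S) and end (E) records for one LRMS job ID from
  # Torque accounting files given on the command line.
  use strict;
  use warnings;

  my $lrms_id = shift or die "usage: $0 <lrms-job-id> accounting-files...\n";

  while (my $line = <>) {
      chomp $line;
      my ($stamp, $type, $jobid, $rest) = split /;/, $line, 4;
      next unless defined $jobid and $jobid eq $lrms_id;
      next unless $type eq 'S' || $type eq 'E';
      print "$type  $stamp  ", $rest || '', "\n";
  }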
lcg-CE logs: parsing modes – 1


Typical problem 1: some user with a known DN
executed some jobs on the local farm in a given
interval of time. Find these jobs and, possibly,
dig out their details.
Solution: use the 'job-search' parsing mode,
providing the user's DN (a regex, really) and the
time interval. This gives the list of jobs for this
user. The '--dig-lrms' modifier instructs the tool
to look up job statistics from the LRMS records
(currently Torque-only, using accounting logs).
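Roughly what this mode boils down to – a sketch only, assuming
the grid-jobmap_YYYYMMDD naming shown earlier; the real tool
has richer option handling.

  #!/usr/bin/perl
  # Select jobmap files by the date in their names, then keep records
  # whose userDN matches the given regular expression.
  use strict;
  use warnings;

  my ($dn_re, $from, $to) = @ARGV;    # dates as YYYYMMDD strings
  die "usage: $0 <dn-regex> <YYYYMMDD-from> <YYYYMMDD-to>\n"
      unless defined $to;

  for my $file (glob '/opt/edg/var/gatekeeper/grid-jobmap_*') {
      my ($date) = $file =~ /grid-jobmap_(\d{8})$/ or next;
      next if $date lt $from || $date gt $to;    # string compare is enough here
      open my $fh, '<', $file or next;
      while (my $line = <$fh>) {
          my %f = $line =~ /"([^="]+)=([^"]*)"/g;
          next unless defined $f{userDN} and $f{userDN} =~ /$dn_re/;
          print "$file: ", $f{jobID} || '?', "  lrms=", $f{lrmsID} || '-', "\n";
      }
      close $fh;
  }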
lcg-CE logs: parsing modes – 2


Typical problem 2: we want to trace the job by
its Grid ID.
Solution: use the 'job-search' parsing mode,
providing the Grid job ID. Jobmap logs are
parsed in time-reversed order and the search
terminates on the first hit (Grid job IDs are
unique), so recent jobs are found rather quickly.
'--dig-lrms' can be used to get the LRMS job
particulars.
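A sketch of the time-reversed search, assuming the file
locations given earlier; the real tool also honours the time
range and the '--dig-lrms' modifier.

  #!/usr/bin/perl
  # Walk grid-jobmap_YYYYMMDD files newest-first (the date in the name
  # sorts lexicographically) and stop on the first record with the ID.
  use strict;
  use warnings;

  my $grid_id = shift or die "usage: $0 <grid-job-id>\n";
  my @files = reverse sort glob '/opt/edg/var/gatekeeper/grid-jobmap_*';

  FILE: for my $file (@files) {
      open my $fh, '<', $file or next;
      my @lines = <$fh>;
      close $fh;
      for my $line (reverse @lines) {    # newest records are last in a file
          next unless index($line, $grid_id) >= 0;
          print "$file: $line";
          last FILE;                     # Grid job IDs are unique
      }
  }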
lcg-CE logs: parsing modes – 3


Typical problem 3: find jobs that were submitted
using pure Globus (not LCG/gLite) methods in
a given time frame. The rationale is to see
who is submitting jobs directly to our CE.
Solution: use the 'job-search' parsing mode,
providing the time range and specifying the
'--only-direct' switch. This mode will catch only
LRMS jobs: uses of the fork jobmanager won't
be caught.
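One plausible heuristic for telling direct submissions apart –
my assumption, not necessarily what the tool actually keys off:
LCG/gLite-submitted jobs carry an LB-style job ID
(https://<lb-host>:9000/...), while directly submitted LRMS
jobs do not.

  #!/usr/bin/perl
  # Flag jobmap records that ran through the LRMS but lack an LB-style ID.
  use strict;
  use warnings;

  while (my $line = <>) {
      my %f = $line =~ /"([^="]+)=([^"]*)"/g;
      next unless %f and defined $f{lrmsID};    # LRMS jobs only
      next if defined $f{jobID} and $f{jobID} =~ m{^https://[^/]+:9000/};
      print "direct? ", $f{userDN} || 'unknown DN', "  lrms=$f{lrmsID}\n";
  }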
lcg-CE logs: parsing modes – 4


Typical problem 4: find all jobs that were using
the 'fork' jobmanager (direct execution on the
CE host). This parsing mode is not finished, but
'job-search' with the '--only-fork' modifier and a
time range will do the work. One problem is that
here we need to parse the full gatekeeper logs
and extract records that don't correspond to
regular non-fork jobs. Since normal jobs also
use the fork jobmanager to spawn the grid
monitor/Condor-C, the problem isn't entirely trivial.
lcg-CE logs: GridFTP




We also have GridFTP logs on the CE. Do we
really need to parse them too?
The best request we can easily process is the
following one: please find all GridFTP activity
for a given user in a given time frame.
We can try to relate various GridFTP sessions
and even tie them to the jobs, but this will
involve heuristics, and the checks won't be easy.
So, the question is: do we need this?
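If we decide we do, the simplest form of that request is just a
DN grep over the GridFTP logs. Log location and line format
differ between GridFTP flavours, so both are left to the caller
here, and time-frame filtering is omitted.

  #!/usr/bin/perl
  # Print every line of the given GridFTP log files mentioning the DN.
  use strict;
  use warnings;

  my $dn = shift or die "usage: $0 <user-DN-substring> gridftp-logs...\n";
  while (my $line = <>) {
      print "$ARGV: $line" if index($line, $dn) >= 0;
  }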
lcg-CE logs: current status
•
We have a toolset to trace jobs by their Grid
(EDG) ID or user DN and to find pure Globus
jobs.
•
The toolset is currently being refactored to
provide a framework for doing log lookups on
other node types and to abstract the file parsers
from the analysis core.
•
The current language is Perl, but I am thinking
about a Python variant – it could be faster and cleaner.
•
I will show the tools to the public after some
refactoring and polishing.
lcg-CE logs: roadmap



Finish 'fork' job detection.
SGE support: Sun Grid Engine is now also
supported by gLite, although its user base isn't
very large yet.
Add more bells and whistles to the current tools:
limit the number of job records, provide a
command to find the most active users, etc.

Probably implement parsing of GridFTP logs.

Anything else I may have missed.
RB/LB logs: ideas and questions



No real code written, only research/planning.
RBs are now slightly out of fashion – people
prefer the WMS – but we still have some working RBs.
The LB has a database where the bookkeeping
information is stored, and we can use good old
SQL to interrogate it (see the sketch below).
But Daniel said that
–
we shouldn't use pure SQL, because of possible
schema changes;
–
the database doesn't contain all the useful
information.
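For illustration – and to show exactly why Daniel's warning
matters – a plain-SQL lookup of a user's job IDs via Perl DBI.
The database, table and column names (lbserver, jobs, users,
userid, cert_subj, dg_jobid) are taken from one LB release and
may well change between versions.

  #!/usr/bin/perl
  # List Grid job IDs registered in the LB database for one user DN.
  use strict;
  use warnings;
  use DBI;

  my $dn = shift or die "usage: $0 <user-DN>\n";
  my $dbh = DBI->connect('DBI:mysql:database=lbserver;host=localhost',
                         'lbserver', '', { RaiseError => 1 });

  my $sth = $dbh->prepare(q{
      SELECT j.dg_jobid
        FROM jobs j JOIN users u ON j.userid = u.userid
       WHERE u.cert_subj = ?
  });
  $sth->execute($dn);
  while (my ($jobid) = $sth->fetchrow_array) {
      print "$jobid\n";
  }
  $dbh->disconnect;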
LB logs: ideas and questions
•
Daniel also said that there should be a better
way to interrogate the LB database, but up to
now I have always used plain SQL for it.
•
The gathered data will be the same as that
provided by 'edg-job-logging-info'. One
distinction is that the use of 'edg-job-logging-info'
is subject to ACLs, while direct usage of the
SQL DB isn't.
•
In the case of a combined LB/RB (or LB/WMS)
one can also extract some information from the
sandbox directory.
RB logs: GridFTP


GridFTP logs on RBs are minimal: no session
traces, just accounting data in
/var/log/edg-wl-in.ftpd.log. No user DNs, only
pool-account user names. Some path names
carry job IDs, so we can identify user sessions
and relate them to the jobs – this could be
handy (see the sketch below).
In principle, it is sometimes interesting to know
who retrieved a user's output sandbox, so we
should probably try to parse these logs.
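A sketch of pulling those job IDs back out of the logged path
names, assuming the usual escaping of ':' and '/' as _3a and
_2f in sandbox directory names – verify the exact scheme on
your RB.

  #!/usr/bin/perl
  # Recover Grid job IDs embedded (escaped) in ftpd-logged path names.
  use strict;
  use warnings;

  while (my $line = <>) {
      # Look for an escaped https job ID somewhere in the logged path.
      next unless $line =~ /(https(?:_[0-9a-f]{2}|[A-Za-z0-9.\-])+)/;
      my $jobid = $1;
      $jobid =~ s/_([0-9a-f]{2})/chr(hex($1))/ge;   # undo the _XX escaping
      print "$jobid\n";
  }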
WMS/Cream CE



No real work has been done to date, only
planning.
I have a WMS instance, so I plan to research
what data can be collected from this node type.
I expect that job traces similar to the RB ones
and download/upload records (both GridFTP
and HTTP) will be available.
A CREAM CE instance is going to be deployed
in a couple of months. Once it is up, I'll analyze
it too.
Data management: SE logs
•
Only in the plans; no real work has been done yet.
•
I can only speak about the DPM SE for now: we
have no dCache instance.
•
As I recall from SSC2, DPNS and DPM logs
have some shared identifiers that can be used
to relate the records in the various log files.
•
Needs more analysis: I haven't concentrated on
the DM logs yet.
Thanks!
Thanks to Daniel Kouril for presenting this stuff
and for discussing/advising on the presentation.
Thanks to everyone who listened to this session.
Questions? Suggestions?
Feel free to ask ;))