Group meeting

Discussions on group meeting
2013.5
Site Monitoring
• Two kinds of monitoring are proposed
• “SAM test” monitoring
– Just like SAM tests
– Send regular tests, collect and filter the results, and publish them
– Makes it easy to see the status of critical services, e.g. CVMFS, PBS, SE
• Ganglia-based monitoring
– Similar to ATLAS T3 monitoring
– Set up local and global Ganglia monitoring, collect info and publish it
– Makes it easy to see server status and total job numbers
• Site info from both kinds of monitoring will be collected into one database and
summarized on one dashboard-like web page
• We need to decide what kind of information is necessary
– Service status: ce, se, cvmfs
– Transfer status: channel, fts
– Job number: production, analysis, tests
– CPU consumption, CPU efficiency
Similar to LCG one
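A minimal sketch of how the job-related items in the list above could be summarized per site. The record fields (`site`, `type`, `cpu_seconds`, `wall_seconds`) are hypothetical and not taken from any existing DIRAC schema; CPU efficiency is computed in the usual way, as CPU time divided by wall-clock time.

```python
from collections import defaultdict


def summarize_jobs(jobs):
    """Aggregate job counts, CPU consumption and CPU efficiency per site.

    `jobs` is a list of dicts with hypothetical keys:
    site, type, cpu_seconds, wall_seconds.
    """
    totals = defaultdict(lambda: {"jobs": defaultdict(int), "cpu": 0.0, "wall": 0.0})
    for job in jobs:
        entry = totals[job["site"]]
        entry["jobs"][job["type"]] += 1      # production / analysis / tests
        entry["cpu"] += job["cpu_seconds"]
        entry["wall"] += job["wall_seconds"]

    summary = {}
    for site, entry in totals.items():
        # Efficiency is undefined when no wall-clock time was recorded.
        efficiency = entry["cpu"] / entry["wall"] if entry["wall"] > 0 else None
        summary[site] = {
            "jobs": dict(entry["jobs"]),
            "cpu_seconds": entry["cpu"],
            "cpu_efficiency": efficiency,
        }
    return summary
```

A dashboard page could call something like this on the records collected in the common database and render one row per site.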
Further thoughts about “SAM tests” monitoring
• DIRAC resource status system
– Similar functions, but not exactly what we want
– Does not send tests; only collects info from existing
jobs
– Partly in development, partly still planned
• Propose to build our own
– It does not seem too difficult, if someone can
spend time on it
Preliminary designs of site monitoring
• Develop based on DIRAC framework
[Diagram: Monitor Agent, Configuration Service, Resources, MonitorDB, Command Line, Web Page; arrows labeled "get site info", "send tests to sites", "record test results to DB"]
Preliminary designs of site monitoring
• Tests design
– CE, SE, CVMFS, etc.
– CE and CVMFS are tested by submitting jobs
– SE is tested by issuing gLite commands
• Agents
– Monitor Agent is responsible for getting site info
from DIRAC configuration service, sending tests,
retrieving and filtering results, updating DB
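The agent cycle described above can be sketched roughly as follows. This is an illustration only, not DIRAC code: `get_sites`, `send_tests` and `store` are hypothetical callables standing in for the configuration service lookup, the test submission and the DB update, and the "OK"/"Bad" labels are assumed for the example.

```python
from datetime import datetime, timezone


def filter_results(raw_results):
    """Reduce raw test results to one status per (site, service).

    A service is reported "OK" only if every test for it passed.
    Each raw result is a dict with hypothetical keys: site, service, passed.
    """
    passed = {}
    for result in raw_results:
        key = (result["site"], result["service"])
        passed[key] = passed.get(key, True) and result["passed"]
    return {key: ("OK" if ok else "Bad") for key, ok in passed.items()}


def run_cycle(get_sites, send_tests, store):
    """One monitoring cycle: get site info, send tests, filter, update DB."""
    sites = get_sites()                  # e.g. from the configuration service
    raw_results = send_tests(sites)      # jobs for CE/CVMFS, commands for SE
    statuses = filter_results(raw_results)
    checked_at = datetime.now(timezone.utc)
    for (site, service), status in sorted(statuses.items()):
        store(site, service, status, checked_at)
    return statuses
```

Passing the three steps in as callables keeps the cycle testable without a grid connection; a real agent would plug in its own implementations.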
Preliminary designs of site monitoring
• DB
– MonitorDB and table SiteStatus to record site status
• Commands
– bes-dirac-site-monitor --sitename --timerange
– By default, prints the latest site status
– Interacts with the DB interface to get the status
• Web
– DiracWeb is being migrated to Tornado; better to
consider the web page later
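As a rough illustration of the SiteStatus table and the "latest status" default of the command, here is a sketch using Python's built-in sqlite3 module so that it runs standalone. SQLite is swapped in only for the example, and the column names (`site`, `service`, `status`, `checked_at`) are assumptions, not the actual MonitorDB schema.

```python
import sqlite3


def create_table(conn):
    """Create a hypothetical SiteStatus table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS SiteStatus (
               site TEXT NOT NULL,
               service TEXT NOT NULL,
               status TEXT NOT NULL,
               checked_at TEXT NOT NULL  -- ISO 8601 timestamp, sorts as text
           )"""
    )


def record_status(conn, site, service, status, checked_at):
    """Append one test result for a site service."""
    conn.execute(
        "INSERT INTO SiteStatus VALUES (?, ?, ?, ?)",
        (site, service, status, checked_at),
    )


def latest_status(conn, site):
    """Return {service: status} from the most recent row of each service."""
    rows = conn.execute(
        """SELECT s.service, s.status
           FROM SiteStatus AS s
           WHERE s.site = ?
             AND s.checked_at = (
                 SELECT MAX(checked_at) FROM SiteStatus
                 WHERE site = s.site AND service = s.service
             )""",
        (site,),
    ).fetchall()
    return dict(rows)
```

A time-range option would add a `checked_at BETWEEN ? AND ?` condition instead of the `MAX` subquery.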
BESIII data transfer
• Two transfer protocols are added
– DIRACFTS (dirac-dms-fts-submit)
• dirac-dms-fts-submit is not well coded and not easy to
debug; it needs to be fixed if we have time
– DIRACDMS (dirac-dms-replicate-lfn)
• Testing
– Preliminary tests are successful with both modes
• Dataset created -> transfer request created -> transfer
status can be followed -> transfer errors are shown
• Error logs still need to be improved
BESIII data transfer
• Currently no good channels are available. The Dubna SE is in
downtime; the USTC and IHEP SEs need to be tuned
• Going to use IHEP and IHEPD for testing; a certain volume of
transfer tests needs to be done
• Accounting
• Update transfer info to the central DIRAC accounting system
• DIRACFTS accounting is available but not correct; it needs to be
fixed
• DIRACDMS accounting is not available and needs to be added. Do we
do it ourselves, or ask DIRAC to fix it?
• More and more small fixes need to be made inside DIRAC; we need to
find a regular procedure for doing that
BESIII data transfer
• More functions are needed
– Options need to be introduced to switch
between the two transfer types
– The transfer type used needs to be recorded for each
request in the DB
– Functions such as cancelling requests need to be
introduced
– Need to consider using datasets defined in
badger
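One possible shape for a request record that covers the first three items, sketched under assumptions: the class, field and status names are hypothetical, and only the two command names come from the slides. Command arguments are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class TransferType(Enum):
    DIRACFTS = "DIRACFTS"
    DIRACDMS = "DIRACDMS"


class RequestStatus(Enum):
    NEW = "new"
    TRANSFERRING = "transferring"
    FINISHED = "finished"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Which command each transfer type would invoke.
COMMANDS = {
    TransferType.DIRACFTS: "dirac-dms-fts-submit",
    TransferType.DIRACDMS: "dirac-dms-replicate-lfn",
}

# States in which a request can no longer be cancelled.
FINAL_STATES = (RequestStatus.FINISHED, RequestStatus.FAILED, RequestStatus.CANCELLED)


@dataclass
class TransferRequest:
    dataset: str
    src_se: str
    dst_se: str
    transfer_type: TransferType = TransferType.DIRACFTS  # recorded per request
    status: RequestStatus = RequestStatus.NEW

    def command(self):
        """Name of the command used for this request's transfer type."""
        return COMMANDS[self.transfer_type]

    def cancel(self):
        """Cancel the request unless it already reached a final state."""
        if self.status in FINAL_STATES:
            return False
        self.status = RequestStatus.CANCELLED
        return True
```

Storing `transfer_type.value` in the request table would make it possible to tell later which mode handled each request.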
BESDIRAC
• An extension to DIRAC
– More and more BESIII-specific extensions are coming
– We definitely need an extension
• How to manage and maintain the extension
– Needs a new release for server and client
– Someone needs to look into it
– If we simply add extra packages locally, there will be
problems with pilot jobs during software download
• We have set up a development environment on bager01
– Dubna needs one too
• Use Git for code management?
– To be consistent with the DIRAC development environment
UMN site
• UMN site is going to have a SE for BES
– Good news!
– They are currently working on joining the SE to the BES VO
• Their SE type is BestMan
– We do not seem to have tried adding a BestMan SE to the BES
VO before
• Documentation for that is not available
– Someone needs to look into it if they need help
Virtual sites
• PBS clusters are set up on virtual resources
– WHU is using VirtualBox
– NSCCSZ is using KVM
• It is easy to add new nodes and extend the cluster
– Use images generated from an existing node
– Light configuration and checks can be applied to a VM after
booting to bring all the necessary services up and
running
• Virtual sites work well as normal DIRAC
cluster sites
Virtual sites(2)
Virtual sites
• Advantage:
– Sites don’t need to change their base OS
– Clusters are easy to set up and extend with virtual images
• Expected improvements:
– Virtual sites are expected to provide a cloud resource management
platform (e.g. OpenStack) with an API for creating and
deleting VMs
• DIRAC has good support for some well-known resource
management platforms such as OpenStack, CloudStack, and OpenNebula
– The size of the virtual resources can then vary with the
number of jobs in real time
• In this way resource usage is more flexible and efficient
• Currently resources are relatively static and VM set-up is done by
hand
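To make "size varies with the number of jobs" concrete, here is a small sketch of one possible scaling rule. It is an assumption for illustration, not how DIRAC or any cloud platform decides: the target is the number of VMs needed to hold all waiting and running jobs, clamped to site limits, and the parameter names are hypothetical.

```python
import math


def desired_vm_count(waiting_jobs, running_jobs, slots_per_vm, min_vms=0, max_vms=10):
    """Number of VMs needed for the current job load, within site limits."""
    if slots_per_vm <= 0:
        raise ValueError("slots_per_vm must be positive")
    needed = math.ceil((waiting_jobs + running_jobs) / slots_per_vm)
    return max(min_vms, min(max_vms, needed))


def scaling_action(current_vms, target_vms):
    """Translate the difference into a create/delete request for the cloud API."""
    delta = target_vms - current_vms
    if delta > 0:
        return ("create", delta)
    if delta < 0:
        return ("delete", -delta)
    return ("none", 0)
```

For example, with 7 waiting jobs, 1 running job and 4 job slots per VM, the target is 2 VMs; a site currently running 5 VMs would then be asked to delete 3.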