Discussions at group meeting, 2013.5

Site Monitoring
• Two kinds of monitoring are proposed
• “SAM test” monitoring
  – Works just like the SAM tests
  – Send regular tests, collect and filter the results, and publish them
  – Makes it easy to know the status of critical services, e.g. CVMFS, PBS, SE, ...
• Ganglia-based monitoring
  – Similar to the ATLAS T3 monitoring
  – Set up local and global Ganglia monitoring, collect the info, and publish it
  – Makes it easy to know server status and total job numbers
• Site info from the two kinds of monitoring will be collected into one database and summarized on one dashboard-like web page
• We need to decide what kinds of information are necessary
  – Service status: CE, SE, CVMFS
  – Transfer status: channel, FTS
  – Job number: production, analysis, tests
  – CPU consumption, CPU efficiency
  (Similar to the LCG one)

Further thoughts about “SAM tests” monitoring
• DIRAC resource status system
  – Similar functions, but not exactly what we want
  – Does not send tests; only collects info from existing jobs
  – Still in development and planning
• Propose to establish our own
  – It does not seem too difficult, if someone can spend time on it

Preliminary designs of site monitoring
• Develop based on the DIRAC framework
[Diagram: Monitor Agent, Configuration Service, Resources, MonitorDB, Command Line, Web Page; arrow labels: "get site info", "send tests to sites", "record test results to DB"]

Preliminary designs of site monitoring
• Tests design
  – CE, SE, CVMFS, ...
  – CE and CVMFS tests are run as jobs
  – SE tests are run by issuing gLite commands
• Agents
  – The Monitor Agent is responsible for getting site info from the DIRAC configuration service, sending tests, retrieving and filtering results, and updating the DB

Preliminary designs of site monitoring
• DB
  – MonitorDB, with a table SiteStatus to record site status
• Commands
  – bes-dirac-site-monitor --sitename --timerange
  – By default, prints the latest site status
  – Interacts with the DB interface to get the status
• Web
  – DiracWeb is in a migration period to Tornado; better to consider this later

BESIII data transfer
• Two transfer protocols are added
  – DIRACFTS (dirac-dms-fts-submit)
    • dirac-dms-fts-submit is not well coded and not easy to debug; it needs to be fixed if we have time
  – DIRACDMS (dirac-dms-replicate-lfn)
• Testing
  – Preliminary tests are successful with both modes
    • Dataset created -> transfer request created -> transfer status can be followed -> transfer errors are shown
    • Error logs still need to be improved

BESIII data transfer
• Currently no good channels are available. The Dubna SE is in downtime; the USTC and IHEP SEs need to be tuned
• Going to use IHEP and IHEPD for testing; a certain volume of transfer tests needs to be done
• Accounting
  – Update transfer info to the central DIRAC accounting system
  – DIRACFTS accounting is available but not correct; it needs to be fixed
  – DIRACDMS accounting is not available and needs to be added. Do we do it ourselves, or ask DIRAC to fix it?
• More and more small fixes need to be made inside DIRAC; we need to find a regular procedure for doing that

BESIII data transfer
• More functions are needed
  – Options need to be introduced to switch between the two transfer types
  – The transfer type used needs to be recorded in the DB for each request
  – Functions such as cancelling requests need to be introduced
  – Need to consider using datasets defined from badger

BESDIRAC
• An extension to DIRAC
  – More and more BESIII-specific extensions are coming
  – We definitely need an extension
• How to manage and maintain the extension
  – Needs a new release for server and client
  – Needs someone to look into it
  – If we simply add extra packages locally, there will be problems with pilot jobs during software download
• We have set up a development environment on bager01
  – Dubna needs one too
• Use Git for code management?
  – To be consistent with the DIRAC development environment

UMN site
• The UMN site is going to have an SE for BES
  – Good news!
  – Currently they are working on joining the SE to the BES VO
• Their SE type is BeStMan
  – It seems we have not tried to add a BeStMan SE to the BES VO before
• Documentation for that is not available
  – Someone needs to look into it if they need help

Virtual sites
• PBS clusters are set up over virtual resources
  – WHU is using VirtualBox
  – NSCCSZ is using KVM
• It is easy to add new nodes and extend a cluster
  – Use images generated from an existing node
  – Light configuration and checks can be done on the VM after booting to bring all the necessary services up and running
• Virtual sites are working well as normal DIRAC cluster sites

Virtual sites (2)
• Advantages:
  – Sites don't need to change their base OS
  – Clusters are easy to set up and extend with virtual images
• Expected improvements:
  – Virtual sites are expected to provide a cloud resource management platform (e.g. OpenStack) and an API for creating and deleting VMs
    • DIRAC has good support for some well-known resource management platforms such as OpenStack, CloudStack, and OpenNebula
  – The size of the virtual resources can then vary with the number of jobs in real time
    • In this way resource usage is more flexible and efficient
    • Currently resources are relatively static and VM set-up is done by hand
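
Sketch: sizing virtual resources from the job count

The idea of letting the number of VMs follow the number of jobs can be illustrated with a small Python sketch. This is only an assumption about how such a policy might look, not the DIRAC or OpenStack implementation: the function names, the slots_per_vm parameter, and the min_vms/max_vms limits are all hypothetical, and the actual create/delete calls to the cloud platform are left out.

```python
import math


def target_vm_count(waiting_jobs, running_jobs, slots_per_vm,
                    min_vms=0, max_vms=10):
    """Return how many VMs are needed to hold all jobs, within site limits.

    All parameters are hypothetical illustration values, not DIRAC settings.
    """
    if slots_per_vm <= 0:
        raise ValueError("slots_per_vm must be positive")
    # One job slot per job; round up so a partly filled VM still counts.
    needed = math.ceil((waiting_jobs + running_jobs) / slots_per_vm)
    # Clamp to the range the site is willing to provide.
    return max(min_vms, min(max_vms, needed))


def plan_scaling(current_vms, waiting_jobs, running_jobs, slots_per_vm,
                 min_vms=0, max_vms=10):
    """Decide whether VMs should be created, deleted, or left alone.

    Returns a tuple (action, count), where action is "create", "delete"
    or "keep". Executing the action through the cloud API is not shown.
    """
    target = target_vm_count(waiting_jobs, running_jobs, slots_per_vm,
                             min_vms, max_vms)
    delta = target - current_vms
    if delta > 0:
        return ("create", delta)
    if delta < 0:
        return ("delete", -delta)
    return ("keep", 0)


if __name__ == "__main__":
    # 10 waiting + 4 running jobs, 4 slots per VM, 2 VMs currently up.
    print(plan_scaling(current_vms=2, waiting_jobs=10,
                       running_jobs=4, slots_per_vm=4))
```

With 14 jobs and 4 slots per VM, four VMs are needed, so a site currently running two VMs would be asked to create two more. A real policy would probably also wait before deleting VMs so that short dips in the queue do not cause churn.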