
Building a Massive Virtual Screening using Grid Infrastructure
Chak Sangma
Centre for Cheminformatics, Kasetsart University
Putchong Uthayopas
High Performance Computing and Networking Center, Kasetsart University
Motivation
• Thailand's medicinal plants are important to Thai society
– Over 1,000 species
– Over 200,000 compounds
– Multiple disease targets
• Problem
– No complete compound database has been collected
– Practice still relies mostly on local knowledge and conventional wisdom
– Lack of systematic verification by scientific methods
[Photo: Asiatic pennywort]
[Photo: Barleria lupulina Lindl.]
Kasetsart University Thai Medicinal
Plants Effort
• Led by Center for Cheminformatics,
Kasetsart University (Dr. Chak Sangma)
• Goal
– Establish a Thai medicinal plant knowledge base by building a 3D molecular database
– Employ virtual screening to verify active compounds known from conventional knowledge
Reports and literature
→ 2D structures
→ approximate 3D structures
→ 3D structures optimized with GAMESS
→ binding energies calculated with AutoDock 3.0
→ structures within 0.5 Å of the binding site
→ SOM neural network map
→ Results
Compute intensive!
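A minimal sketch of this per-compound pipeline, assuming command-line wrappers `rungms` (GAMESS) and `autodock3` (AutoDock 3.0) are available on the compute node; the file names and flags are illustrative, not the actual portal scripts:

```python
import subprocess
from pathlib import Path

def screen_compound(compound_id: str, workdir: Path) -> Path:
    """Optimize one structure with GAMESS, then dock it with AutoDock 3.0."""
    gamess_input = workdir / f"{compound_id}.inp"     # approximate 3D structure
    gamess_log = workdir / f"{compound_id}.gamess.log"
    dock_params = workdir / f"{compound_id}.dpf"      # docking parameter file
    dock_log = workdir / f"{compound_id}.dlg"         # binding energies end up here

    # Step 1: quantum-chemical geometry optimization (GAMESS)
    with gamess_log.open("w") as log:
        subprocess.run(["rungms", str(gamess_input)], stdout=log, check=True)

    # Step 2: dock the optimized structure against the target enzyme (AutoDock 3.0)
    subprocess.run(["autodock3", "-p", str(dock_params), "-l", str(dock_log)],
                   check=True)

    # Poses within 0.5 Å of the binding site are kept for the SOM analysis later
    return dock_log
```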
ThaiGrid Drug Design Portal
• Partners
– High Performance Computing and Networking Center, KU
– Center for Cheminformatics, KU
– IBM Thailand
• Goal
– Build a virtual screening infrastructure on the ThaiGrid system
– Start from the KU campus grid and later extend to other ThaiGrid partner universities
• Link
– http://tgcc.cpe.ku.ac.th
– http://www.thaigrid.net
Challenge
• Recent project for the National Center for Genetic Engineering and Biotechnology, Thailand
– Screen 3,000 compounds in 3 months
• Computation time on a 2.4 GHz Pentium 4 system
– Over 30 min per optimized structure
– Over 30 min per docking
• Estimated computing time on a single processor (see the sketch below)
– (3,000 × 30 min) + (3,000 × 30 min) = 180,000 min
– = 3,000 hours
– ≈ 125 days, i.e. over 4 months
• Not fast enough!
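A quick back-of-the-envelope check of this estimate, and of the ideal speedup from the 158 CPUs listed on the infrastructure slide (perfect scaling is of course optimistic):

```python
compounds = 3000
minutes_per_optimization = 30      # GAMESS, per structure
minutes_per_docking = 30           # AutoDock, per docking

total_minutes = compounds * (minutes_per_optimization + minutes_per_docking)
total_hours = total_minutes / 60   # 3,000 hours
total_days = total_hours / 24      # 125 days on a single CPU

cpus = 158                         # ThaiGrid CPUs (see the infrastructure slide)
ideal_days = total_days / cpus     # under a day, ignoring scheduling and I/O overhead
print(total_hours, total_days, ideal_days)
```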
Key Technologies
• Three key technologies must be combined
to provide the solution
– Cluster Computing
– Grid Computing
– Portal Technology
What do we want to do?
Hide the complexity of the Grid and the computational chemistry software from scientists, while providing the massive computational power needed.
Infrastructure
• The ThaiGrid infrastructure is used
• 10 clusters from 6 organizations
– AMATA – KU
– GASS – KU
– MAEKA – KU
– WARINE – KU
– CAMETA – SUT
– OPTIMA – AIT
– ENQUEUE – KMUTNB
– PALM – KMUTNB
– SPIRIT – CU
– INCA – KMUTT
[Diagram: ThaiGrid topology with clusters at KU, SUT, AIT, KMUTNB, CU, and KMUTT connected over the network, with grid job scheduling across sites]
• 158 CPUs on 110 nodes
[ThaiGrid users submit jobs through the ThaiGrid portal at tgcc.cpe.ku.ac.th]
Software Architecture
• Each cluster has a local scheduler
– SGE, OpenPBS, or Condor can be used
– We use our own SQMS scheduler
• Globus 2.4 is used as the middleware
– Resource control and security (GSI)
• A grid-level scheduler controls multi-cluster job submission (see the sketch below)
– KU's own SQMS/G is used
[Diagram: the portal and SCMSWeb sit on top of SQMS/G and Globus 2.4, which drive the local SQMS schedulers on AMATA, WARINE, GASS, and MAEKA over the KU gigabit campus network]
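For illustration, a single task could be handed to Globus 2.4 GRAM roughly as below; in the real system SQMS/G drives this step, and the gatekeeper contact string and job-manager name here are assumptions rather than the production configuration.

```python
import subprocess

def submit_to_gram(gatekeeper: str, executable: str, arguments: str) -> str:
    """Submit one job through a GT2 gatekeeper and return its output."""
    rsl = f"&(executable={executable})(arguments={arguments})(count=1)"
    result = subprocess.run(
        ["globusrun", "-o", "-r", gatekeeper, rsl],   # -o streams stdout back via GASS
        capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical example; host and job-manager name are not the real ones:
# submit_to_gram("amata.example.ku.ac.th/jobmanager-sqms",
#                "/usr/local/bin/autodock3", "-p lig042.dpf -l lig042.dlg")
```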
The Portal
• Roles
– User interface
– Automate execution flow
– File access and management
• Features
– Create project
– Add ligand, enzyme
– Submit screening job, monitor job
status
– Download output
• The current portal is built using Plone
– http://www.plone.org/
– Python-based web content management system
– Flexible and extensible
How things work!
[Diagram: the portal hands tasks to the resource broker (SQMS/G), which submits them through the Globus 2.4 grid middleware to compute resources on the KU campus network and monitors them as they run]
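A toy version of the broker's dispatch loop, only to illustrate the flow above; it is not the actual SQMS/G implementation:

```python
from collections import deque

class Broker:
    """Toy broker: portal-submitted tasks go to the least-loaded resource."""

    def __init__(self, resources):
        self.load = {r: 0 for r in resources}   # resource -> currently running tasks
        self.queue = deque()

    def submit(self, task):
        self.queue.append(task)                 # called by the portal

    def dispatch(self, run_on):
        """Drain the queue, handing each task to the least-loaded resource."""
        while self.queue:
            task = self.queue.popleft()
            resource = min(self.load, key=self.load.get)
            self.load[resource] += 1
            run_on(resource, task)              # e.g. a GRAM submission

    def task_finished(self, resource):
        self.load[resource] -= 1                # reported back by the monitor
```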
Results
• The first version of the compound database (around 3,000 compounds)
• 3,000 compounds screened (30 high-potential compounds found)
– 4 drug targets (influenza, HIV-RT, HIV-PR, HIV-IN)
[Figure: XK-263]
Experiences
• Some files, such as enzyme structures and outputs, are very large
– Require good bandwidth between sites
– Some simple optimization techniques can help
• Caching of enzyme structure files is implemented at target hosts (see the sketch below)
– Substantially reduces the number of transfers needed
• The batch scheduling approach works well if the systems are very homogeneous
– Allows dynamic staging of execution code to the target host without installation/recompilation
• Many script tools had to be developed to
– Streamline the execution
– Handle data and code staging
– Clean up after the execution
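A sketch of the enzyme-structure caching idea under simple assumptions (ssh/scp for staging, an MD5 checksum as the cache key); the real staging mechanism and cache location may differ:

```python
import hashlib
import subprocess
from pathlib import Path

CACHE_DIR = "/var/cache/screening/enzymes"      # assumed cache location on the target host

def stage_enzyme(enzyme: Path, host: str) -> str:
    """Copy the enzyme file to `host` only if an identical copy is not cached there."""
    digest = hashlib.md5(enzyme.read_bytes()).hexdigest()
    cached = f"{CACHE_DIR}/{digest}_{enzyme.name}"

    probe = subprocess.run(["ssh", host, "test", "-f", cached])
    if probe.returncode != 0:                   # cache miss: do the (expensive) transfer
        subprocess.run(["ssh", host, "mkdir", "-p", CACHE_DIR], check=True)
        subprocess.run(["scp", str(enzyme), f"{host}:{cached}"], check=True)
    return cached
```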
Next Generation Massive
Screening on Grid
• Move to a service-oriented Grid
– Use Grid and Web services to encapsulate key applications
– Build broker and service discovery infrastructure
– Rely heavily on OGSA and GT 3.x/4.x
• Portlet-based portal
– JSR 168 (Portlet Specification) compliant
– More modular, customizable, and flexible
– Plan to adopt GridSphere from GridLab (www.gridlab.org)
• Use a database as the backend instead of files
– OGSA-DAI might be used for data access
Progress
• We are working on
– New portal using GridSphere technology (done, testing)
– Service wrappers for legacy code
• GAMESS, AutoDock (done, testing; a simplified wrapper sketch follows below)
– MMJFS interface (in progress)
– OGSA-DAI integration (in progress)
– Service registration and discovery (partial)
– Broker system (design)
– New monitoring (done)
• Schedule
– Finish and test Jan to Feb 2005
– Deploy in March 2005
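The production wrappers are OGSA/GT3 Grid services; the stripped-down XML-RPC sketch below only illustrates the idea of hiding a legacy binary (here AutoDock) behind a service interface, with invented paths and method names:

```python
from xmlrpc.server import SimpleXMLRPCServer
import os
import subprocess
import tempfile

def dock(dpf_text: str) -> str:
    """Run AutoDock on an uploaded parameter file and return the log text."""
    with tempfile.TemporaryDirectory() as tmp:
        dpf = os.path.join(tmp, "job.dpf")
        dlg = os.path.join(tmp, "job.dlg")
        with open(dpf, "w") as f:
            f.write(dpf_text)
        subprocess.run(["autodock3", "-p", dpf, "-l", dlg], check=True, cwd=tmp)
        with open(dlg) as f:
            return f.read()

server = SimpleXMLRPCServer(("0.0.0.0", 8000))  # port chosen arbitrarily for the example
server.register_function(dock, "dock")
server.serve_forever()
```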
[Diagram: portal with a GAMESS portlet, broker server, and registration server backed by a database; the GAMESS service reaches the GAMESS scheduler through MMJFS, with the molecular DB held on a file server]
Design Choices
• Mass data transport across sites (see the sketch below)
– A central FTP server is used to store data/databases
– Each compute node can pull the required data from this FTP server
• Ad hoc: ftp, wget/http (firewall friendly)
• Next: GridFTP
• Cluster / single server
– Gridify by using a service wrapper to expose the legacy application as a grid service
– This does not work for clusters, since compute nodes are hidden behind the head node
• Fall back to an MMJFS interface that talks to the local scheduler
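A sketch of how a compute node could pull required data from the central server, a plain FTP/HTTP pull today with GridFTP (globus-url-copy) as the planned next step; the server name is a placeholder:

```python
import os
import subprocess
from urllib.request import urlretrieve

DATA_SERVER = "data.example.ku.ac.th"           # hypothetical central server name

def pull(filename: str, use_gridftp: bool = False) -> str:
    """Fetch one input file from the central server onto this compute node."""
    if use_gridftp:
        # Planned: GridFTP through the Globus toolkit
        subprocess.run(["globus-url-copy",
                        f"gsiftp://{DATA_SERVER}/screening/{filename}",
                        f"file://{os.path.abspath(filename)}"], check=True)
    else:
        # Ad hoc, firewall-friendly pull over FTP/HTTP
        urlretrieve(f"ftp://{DATA_SERVER}/screening/{filename}", filename)
    return filename
```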
Design Choices
• Service Discovery Mechanism
– Publish/subscribe model
• Service advertising interface/protocol
• A backend database shared between the registration service component and the broker component (see the sketch below)
• Adoption of a Grid notification service and model
– Available from the myGrid project; seems useful for a more dynamic environment
– Scalability…
[Diagram: broker and registration services share a backend database, with discovery done via SQL; related functions shown: job submission, job status, result visualization, performance record, system status, job queue, and the monitoring service]
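A minimal sketch of the shared-database discovery idea: the registration service writes rows and the broker queries them with SQL. SQLite and the schema below are stand-ins for whatever backend database is actually used:

```python
import sqlite3

db = sqlite3.connect("registry.db")
db.execute("""CREATE TABLE IF NOT EXISTS services
              (name TEXT, kind TEXT, endpoint TEXT, updated TIMESTAMP)""")

def register(name: str, kind: str, endpoint: str) -> None:
    """Called through the registration service when a service advertises itself."""
    db.execute("INSERT INTO services VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
               (name, kind, endpoint))
    db.commit()

def discover(kind: str):
    """Called by the broker to find services of a given kind."""
    return db.execute("SELECT name, endpoint FROM services WHERE kind = ?",
                      (kind,)).fetchall()

# e.g. register("gamess-amata", "optimization", "http://amata.example/gamess")
#      discover("docking")
```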
Conclusion
• Grid and cluster computing are key technologies that can give us the power we need. The Grid works if used wisely!
• Challenges
– Grid standards are still evolving rapidly
• Things change before you can finish!
– Difficult to configure and maintain; some parts are still unstable
– Firewall and security concerns
– Lack of manpower with expertise
• Opportunities
– Secure infrastructure
– Cost reduction through on-demand integration of networked resources
Acknowledgement
• HPCNC Team
– Somsak Sriprayoonsakul
– Nuttaphon Thangkittisuwan
– Thanakit Petchprasan
– Isiriya Paireepairit
The End
Backup
Process
[Diagram: screening process on the Grid. 2D structures from the molecular structure database are turned into 3D structures, optimized with GAMESS and docked against the enzyme with AutoDock on the grid clusters (MAEKA, WARINE, AMATA, GASS); the results feed the SOM neural network analysis]
[Diagram: next-generation grid portal. Portlets sit on a workflow engine over the grid middleware (OGSA), which hosts broker, docking, optimizing, and monitoring services plus OGSA-DAI in front of the molecule database, all running on the underlying resources (computers, network)]