Self Organising Hypothesis Networks: Organising (Q)SAR knowledge

Thierry Hanser, Chris Barber, Edward Rosser, Jonathan Vessey, Samuel Webb, Stéphane Werner
Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS

Background

In the context of regulatory risk management, QSAR models are expected to go beyond their role of classifier/estimator and provide wider assistance to human experts facing a difficult decision process. This support requires not only good predictive performance but also a high degree of transparency. QSAR transparency covers the interpretability of predictions (why did the model come to this conclusion?), the prediction confidence level (is the model confident in this individual prediction?), supporting evidence (are there known examples of similar compounds whose behaviour we already know?) and an interpretable applicability domain (why is my compound in or out of the domain?).

[Figure labels: Risk assessment; Good model; Performance; Transparency]

[Figure labels: Supporting examples; Most relevant part of the knowledge; Interpretation; Prediction; Unseen instance]

Figure 7: Excerpt of a SOHN built using 400 structural hypotheses extracted from a training set of 8000 compounds (50% mutagens and 50% non-mutagens) using fragmentation and recursive partitioning. The hypotheses near the root capture general patterns of toxicophores, whereas the intermediate hypotheses model relevant refinements all the way up to specific examples.

To address this challenge, we have designed a new knowledge representation based on the concept of a hypothesis: a simple and interpretable knowledge unit. In this method, knowledge is broken down into a collection of hypotheses which are organised into a hierarchical network called a Self Organising Hypothesis Network (SOHN) [1]. The resulting model is transparent and agnostic of its source of knowledge (learning process).
When applied to building QSAR models, this new approach provides accurate yet transparent predictions along with individual prediction confidence levels, and hence is well suited to risk assessment.

Methodology: Self Organising Hypothesis Network

The approach relies on 3 main steps:

1. Extract knowledge from one or more sources of information (Learn)
2. Translate this knowledge into a unified representation independent of its origin (Unify)
3. Organise the knowledge according to its generalisation hierarchy (Organise)

Figure 8: Each path from the root to the most specific hypothesis provides rich information to assist the interpretation of a prediction, and the supporting examples are used to perform a local weighted kNN evaluation.

[Figure 9 labels: Chemical space; query compounds x1, x2, x3; examples e; hypothesis h]

Figure 9: The weighted local kNN evaluation takes into account the nature of the SAR landscape and the similarity of the supporting examples to the query (xi). For instance, x1 will be predicted positive with high confidence, x2 positive with less confidence, and x3 negative.

Figure 10: The weight of each example is linked to its similarity to the query compound, and the global signal (s = -1 for non-mutagens and s = 1 for mutagens) depends on the average similarity of the k nearest neighbours.

Once a prediction and a confidence level have been calculated for each of the relevant hypotheses, we can construct an overall call based on these values. Different reasoning heuristics can be used depending on the context. The default reasoning algorithm simply weights each hypothesis according to its confidence level (figure 11).

When comparing the mutagenicity SOHN model to other machine learning techniques using the same structural descriptors and a 5-fold internal cross-validation, we observe comparable performance whilst SOHN offers high transparency.
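The local weighted kNN evaluation (figures 9 and 10) and the default confidence-weighted reasoning (figure 11) can be illustrated with a minimal Python sketch. The poster does not give the exact formulas, so everything below is an assumption made for illustration: the local signal is taken as the similarity-weighted class signal averaged over the k nearest supporting examples, the confidence as the absolute value of that signal, and the overall signal as the confidence-weighted mean of the local signals. The function names and the toy similarity values are hypothetical and are not taken from Sarah Nexus.

```python
from typing import List, Tuple

# (similarity to the query in [0, 1], class signal: +1 mutagen, -1 non-mutagen)
Neighbour = Tuple[float, int]


def knn_signal(examples: List[Neighbour], k: int = 3) -> float:
    """Assumed local signal: mean of similarity * class signal over the k most similar examples."""
    nearest = sorted(examples, key=lambda n: n[0], reverse=True)[:k]
    if not nearest:
        return 0.0
    return sum(sim * s for sim, s in nearest) / len(nearest)


def call(signal: float) -> Tuple[str, float]:
    """Turn a signal in [-1, 1] into a (prediction, confidence) pair."""
    if signal > 0:
        label = "mutagen"
    elif signal < 0:
        label = "non-mutagen"
    else:
        label = "equivocal"
    return label, abs(signal)


def overall_signal(local_signals: List[float]) -> float:
    """Assumed default reasoning: each local signal is weighted by its own confidence."""
    total = sum(abs(s) for s in local_signals)
    if total == 0:
        return 0.0
    return sum(abs(s) * s for s in local_signals) / total


# Toy supporting examples mirroring the three situations of figure 9 (values are invented).
x1 = knn_signal([(0.9, 1), (0.8, 1), (0.7, 1), (0.2, -1)])  # close, consistent positives
x2 = knn_signal([(0.6, 1), (0.5, -1), (0.4, 1)])            # mixed, more distant neighbours
x3 = knn_signal([(0.9, -1), (0.8, -1), (0.3, 1)])           # close negatives dominate

for name, signal in (("x1", x1), ("x2", x2), ("x3", x3)):
    print(name, call(signal))

# Combining two hypotheses that apply to the same query compound.
print(call(overall_signal([x1, x2])))
```

Under these assumptions, distant or conflicting neighbours pull the signal towards zero, which is one simple way to obtain a lower confidence in sparse or rugged regions of the SAR landscape.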
External validation on a proprietary dataset of 800 compounds, representing a different chemical domain from the training set, has shown that SOHN is relatively robust to domain variations (figure 12).

[Figure 12 labels: Decision Tree; Random Forest; SVM; SOHN; Internal validation; External validation; axis ticks 0.60 to 0.85]

Figure 12: Comparison with other machine learning techniques shows that SOHN can be used as a prediction model with comparable accuracy whilst providing a high level of transparency. SOHN also demonstrates relatively good external domain adaptability compared to other algorithms (less overfitting).

Figure 2: Building a unified and hierarchical knowledge representation.

For the final model to be transparent, it must be composed of interpretable elements of knowledge. For that purpose we introduce the concept of a hypothesis. A hypothesis is a simple and interpretable knowledge unit that captures an elementary pattern of SAR behaviour. A hypothesis can be seen as a very local model; it defines a class of compounds that exhibit a SAR trend. In practice, hypotheses are either learned by a machine learning algorithm or elaborated by a human expert; they can take different forms depending on the information that they take into account (figure 3).

Figure 3: Different types of hypotheses. A hypothesis represents a class of compounds exhibiting a similar SAR trend.

[Figure 11 labels: Relevant hypotheses, each with a prediction and a confidence (local calls); kNN; Reasoning; Overall call with a prediction and a confidence]

Figure 11: Combination of individual hypotheses into an overall classification call using a reasoning module. By default the hypotheses are weighted using their confidence level.

[Figure 3 and 4 labels: Hypothesis A; Hypothesis B; Hypothesis C; "more general than"; 2 < logP < 3]

Figure 4: Comparing the level of generalisation of hypotheses of identical type is generally trivial (top). Comparing hypotheses of different types (bottom) is, however, more challenging and cannot be evaluated directly on the hypotheses themselves.

Different hypotheses can be compared for their degree of generalisation (figure 4, top); however, comparing hypotheses of different types is more challenging (figure 4, bottom). To solve this issue, the SOHN approach uses a reference dataset to define a generalisation order based on the coverage of the hypotheses within this dataset. The method is illustrated in figure 5 with a reference dataset containing 10 examples (e1 to e10). If a hypothesis h1 covers examples e1 to e5 and a hypothesis h2 applies to examples e1, e2 and e3, then it is intuitive to infer that h1 is more general than h2, since all the examples covered by h2 are also covered by h1.

Application: Sarah Nexus

To improve safety in the context of mutagenicity risk assessment, the international regulators introduced the ICH-M7 [2] guidelines on genotoxic impurities, recommending the combined use of an expert system and a (Q)SAR model. At Lhasa Limited, transparency in prediction has always been a priority and was our principle for developing Derek Nexus [3], an expert system for toxicity prediction. In response to ICH-M7, we developed Sarah Nexus, a user-friendly and fully featured QSAR prediction application (figures 13 and 14), to complement Derek Nexus. Sarah Nexus uses the SOHN methodology and has been trained for the Ames mutagenicity endpoint; it is compliant with the OECD QSAR principles [4], allowing for accurate and transparent predictions. Sarah Nexus has been favourably evaluated by the FDA [5]. Each prediction is supported by one or more self-explanatory hypotheses along with a confidence level and supporting evidence. This rich set of knowledge assists the toxicology expert in forming a well-informed opinion and facilitates a safe decision process.
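The coverage-based generalisation order used to compare hypotheses of different types (h1 covering e1 to e5, h2 covering e1 to e3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each hypothesis is represented by the set of reference examples it covers, "more general than" is taken to mean strict superset of coverage, and the helper names `is_more_general` and `direct_parents` are hypothetical.

```python
from typing import Dict, FrozenSet, List

Coverage = FrozenSet[str]

# Reference dataset of 10 examples, e1 to e10.
reference: Coverage = frozenset(f"e{i}" for i in range(1, 11))

# Coverage of each hypothesis within the reference dataset.
# h0 is the root hypothesis, which covers the whole reference set.
coverage: Dict[str, Coverage] = {
    "h0": reference,
    "h1": frozenset({"e1", "e2", "e3", "e4", "e5"}),
    "h2": frozenset({"e1", "e2", "e3"}),
}


def is_more_general(cover_a: Coverage, cover_b: Coverage) -> bool:
    """A is more general than B if every example covered by B is also covered by A,
    and A covers at least one more example (strict superset)."""
    return cover_b < cover_a


def direct_parents(name: str, cov: Dict[str, Coverage]) -> List[str]:
    """Most specific hypotheses that are more general than `name`."""
    ancestors = [h for h, c in cov.items() if is_more_general(c, cov[name])]
    return sorted(
        a for a in ancestors
        # Keep an ancestor only if no other ancestor is more specific than it.
        if not any(is_more_general(cov[a], cov[b]) for b in ancestors)
    )


print(is_more_general(coverage["h1"], coverage["h2"]))  # h1 versus h2
print(direct_parents("h2", coverage))                   # where h2 would attach
```

Because the comparison only looks at which reference examples are covered, it does not matter whether a hypothesis is a structural pattern, a property range or an expert rule.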
[Figure 5 labels: Example hypotheses (most specific, facts) e1 to e10; Hypotheses (knowledge units) h1 to h7; Root hypothesis h0 (most generic, null hypothesis); Generalisation; Instantiation; Unseen instance]

Figure 5: Hypothesis hierarchy induced by the coverage relationship. Note that h0 is a special hypothesis (the root) that covers the whole chemical space, and each example ei is ultimately a specific hypothesis that covers only itself.

Figure 6: Self Organised Hypothesis Network. Highlighted is a virtual search path for the most specific hypotheses (h6, h7) applying to a given query compound. The search is performed using recursive analysis starting from the root.

Once hypotheses have been defined for a given endpoint, they can be organised into a hierarchical structure that captures the different levels of generalisation of the knowledge. This structure is automatically updated each time a new hypothesis is inserted and is called a Self-Organising Hypothesis Network or SOHN (figure 6). An example of a SOHN for the mutagenicity endpoint is shown in figure 7.

To predict the class of a new compound, the algorithm identifies the most specific hypotheses that apply to the query (figure 6). The path from the root to each selected hypothesis is used as supporting knowledge to interpret the prediction's outcome (figure 8), and the actual prediction is built using a local weighted kNN model on the examples covered by the hypothesis (figures 9 and 10).

Figure 13: Screenshot of the prediction for 2-(4-Nitrophenyl)aziridine. Sarah Nexus has discovered two toxicophores (the aromatic nitro and aziridine motifs). The query compound is shown with both hypotheses highlighted when selected in the hypothesis summary. In this example, both hypotheses suggest a mutagenicity risk, with confidences of 47% and 66% respectively (0% = random), leading to the query compound being classified as a mutagen with 57% confidence.
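The recursive search from the root for the most specific hypotheses that apply to a query (figure 6) can be sketched as follows. This is a hypothetical toy, not the Sarah Nexus implementation: the network topology, the feature names, and the representation of a hypothesis as a set of required features are all assumptions chosen so that the search ends on h6 and h7, as in the figure.

```python
from typing import Dict, FrozenSet, List, Tuple

Features = FrozenSet[str]

# Hypothetical hypotheses: each applies when all of its required features are present.
required: Dict[str, Features] = {
    "h0": frozenset(),  # root: applies to every compound
    "h1": frozenset({"nitro"}),
    "h6": frozenset({"nitro", "aromatic ring"}),
    "h3": frozenset({"three-membered ring"}),
    "h7": frozenset({"three-membered ring", "ring nitrogen"}),
    "h4": frozenset({"sulfonic acid"}),
}

# Hypothetical generalisation hierarchy (parent -> more specific children).
children: Dict[str, List[str]] = {
    "h0": ["h1", "h3", "h4"],
    "h1": ["h6"],
    "h3": ["h7"],
}


def applies(hypothesis: str, query: Features) -> bool:
    """A hypothesis applies if the query has all of its required features."""
    return required[hypothesis] <= query


def most_specific_paths(
    node: str, query: Features, path: Tuple[str, ...] = ()
) -> List[Tuple[str, ...]]:
    """Return every root-to-hypothesis path ending on a hypothesis that applies
    to the query and has no applicable child (a most specific hypothesis)."""
    path = path + (node,)
    matching = [c for c in children.get(node, []) if applies(c, query)]
    if not matching:
        return [path]
    paths: List[Tuple[str, ...]] = []
    for child in matching:
        paths.extend(most_specific_paths(child, query, path))
    return paths


query = frozenset({"nitro", "aromatic ring", "three-membered ring", "ring nitrogen"})
for p in most_specific_paths("h0", query):
    print(" -> ".join(p))
```

Each returned path doubles as the supporting knowledge for interpretation: it lists the successive refinements from the root down to the hypothesis whose covered examples would feed the local kNN evaluation. A query matching no child of the root falls back to the root hypothesis alone.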
Conclusion

We have introduced a new paradigm in QSAR that decouples the learning methodology from the final knowledge representation using a collection of hypotheses organised in a network. Using the most relevant part of this network as ad hoc local models leads to accurate and highly interpretable predictions. Additionally, this methodology provides access to a meaningful confidence metric at an individual prediction level, which is critical in a regulatory context. We have successfully integrated the SOHN methodology into a new prediction application, Sarah Nexus.

Ultimately, the human expert remains the key asset in a risk assessment process. By providing transparent predictions, Sarah and Derek Nexus combined offer valuable assistance in this decision process, in line with the ICH-M7 and OECD guidelines.

References

[1] Thierry Hanser, Chris Barber, Edward Rosser, Jonathan D Vessey, Samuel J Webb and Stéphane Werner. Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge. Journal of Cheminformatics 2014, 6:21. doi:10.1186/1758-2946-6-21
[2] ICH: International Conference on Harmonisation, http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Multidisciplinary/M7/M7_Step_4.pdf
[3] Derek Nexus: http://www.lhasalimited.org/Public/Library/Lhasa%20Library/Brochures/Derek%20Nexus.pdf
[4] OECD: Organisation for Economic Co-operation and Development, http://www.oecd.org/env/ehs/risk-assessment/37849783.pdf
[5] FDA: Food and Drug Administration. Lidiya Stavitskaya, Barbara L. Minnier and Naomi L. Kruhlak. Benchmarking Assessment of Open Source and Newly Released Salmonella Mutagenicity (Q)SAR Models for Potential Use Under ICH M7. FDA Center for Drug Evaluation and Research (CDER), SOT conference, Phoenix, Arizona, 2014.