Self Organising Hypothesis Networks: Organising (Q)SAR knowledge
Thierry Hanser, Chris Barber, Edward Rosser, Jonathan Vessey, Samuel Webb, Stéphane Werner
Lhasa Limited, Granary Wharf House, 2 Canal Wharf, Leeds, LS11 5PS
 Background
In the context of regulatory risk management, QSAR models are expected to go beyond their role of
classifier/estimator and provide wider assistance to human experts facing a difficult decision process. This
support requires not only good predictive performance but also a high degree of transparency. QSAR
transparency covers the interpretability of the predictions (why did the model come to this conclusion?), the
prediction confidence level (is the model confident for this individual prediction?), the supporting evidence (are
there known examples of similar compounds for which we already know the behaviour?) and an interpretable
applicability domain (why is my compound in/out of the domain?).
Figure 7: Excerpt of a SOHN built using 400 structural hypotheses extracted
from a training set of 8000 compounds (50% mutagens and 50%
non-mutagens) using fragmentation and recursive partitioning. The hypotheses
near the root capture general toxicophore patterns, whereas the
intermediate hypotheses model relevant refinements, all the way down to
specific examples.
To address this challenge, we have designed a new knowledge representation based on the concept of a
hypothesis: a simple and interpretable knowledge unit. In this method, knowledge is broken down into a
collection of hypotheses which are organised into a hierarchical network called a Self Organising Hypothesis
Network (SOHN)[1]. The resulting model is transparent and agnostic of its source of knowledge (learning process).
When applied to building QSAR models, this new approach provides accurate yet transparent predictions
along with individual prediction confidence levels, and hence is well suited for risk assessment.
Once a prediction and a confidence level have been calculated for each of the relevant hypotheses, we can
construct an overall call based on these values. Different reasoning heuristics can be used depending on the
context. The default reasoning algorithm simply weights each hypothesis according to its confidence level
(figure 11).
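The confidence-weighted reasoning step can be illustrated with a minimal sketch. The `LocalCall` structure, the +1/-1 encoding and the weighted-mean formula are assumptions made for illustration; the poster does not give the exact formula used by the default reasoning algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalCall:
    """Outcome of one relevant hypothesis (hypothetical structure)."""
    prediction: int    # +1 = mutagen, -1 = non-mutagen
    confidence: float  # 0.0 (random) to 1.0 (certain)


def overall_call(calls):
    """Combine local calls with a confidence-weighted vote (assumed heuristic).

    Returns (prediction, score), where score is the confidence-weighted mean
    of the signed predictions, in [-1, 1]. A prediction of 0 means undecided.
    """
    total_weight = sum(c.confidence for c in calls)
    if total_weight == 0:
        return 0, 0.0
    score = sum(c.confidence * c.prediction for c in calls) / total_weight
    prediction = 1 if score > 0 else -1 if score < 0 else 0
    return prediction, score


# Illustrative values only, not taken from the poster.
calls = [LocalCall(+1, 0.8), LocalCall(+1, 0.5), LocalCall(-1, 0.2)]
prediction, score = overall_call(calls)
```

In this sketch a confident hypothesis outweighs several weak ones, which is one simple way to realise "weighting by confidence level"; other reasoning heuristics could be substituted behind the same function signature.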
 Methodology: Self Organising Hypothesis Network
The approach relies on three main steps:
1. Extract knowledge from one or more sources of information (Learn)
2. Translate this knowledge into a unified representation independent of its origin (Unify)
3. Organise the knowledge according to its generalization hierarchy (Organise)
Figure 8 : Each path from the root to the most specific hypothesis
provides rich information to assist the interpretation of a prediction and
the supporting examples are used to perform a local weighted kNN
evaluation.
Figure 9: The weighted local kNN evaluation takes into account the
nature of the SAR landscape and the similarity of the supporting examples to
the query (xi). For instance, x1 will be predicted positive with high confidence, x2
positive with less confidence, and x3 will be predicted negative.
Figure 10: The weight of each example is linked to its similarity to the
query compound, and the global signal (s=-1 for non-mutagens and s=1
for mutagens) depends on the average similarity of the k nearest
neighbours.
When comparing the mutagenicity SOHN model to other machine learning techniques using the same structural
descriptors and a 5-fold internal cross-validation, we observe comparable performance, whilst SOHN offers high
transparency. External validation on a proprietary dataset of 800 compounds, representing a different chemical
domain from the training set, has shown that SOHN is relatively robust to domain variations (figure 12).
Figure 2 : Building a unified and hierarchical knowledge representation
For the final model to be transparent it must be composed of interpretable elements of knowledge. For that
purpose we introduce the concept of a hypothesis. A hypothesis is a simple and interpretable knowledge unit
that captures an elementary pattern of SAR behaviour. A hypothesis can be seen as a very local model; it defines
a class of compounds that exhibit a SAR trend. In practice, hypotheses are either learned by a machine learning
algorithm or elaborated by a human expert; they can take different forms depending on the information that
they take into account (figure 3).
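As an illustration of the idea that a hypothesis defines a class of compounds, the sketch below models a hypothesis as a named predicate over compounds. The `Compound` record layout, the helper names `structural` and `property_range`, and the fragment labels are hypothetical and chosen only for this example; they are not the representation used in SOHN itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical compound record: {"logP": float, "fragments": set of str}
Compound = Dict[str, Any]


@dataclass(frozen=True)
class Hypothesis:
    """A knowledge unit: a name plus a test that says which compounds it covers."""
    name: str
    covers: Callable[[Compound], bool]


def structural(name, fragment):
    """Hypothesis covering compounds that contain a given structural fragment."""
    return Hypothesis(name, lambda c: fragment in c["fragments"])


def property_range(name, key, low, high):
    """Hypothesis covering compounds whose property lies strictly in (low, high)."""
    return Hypothesis(name, lambda c: low < c[key] < high)


aromatic_nitro = structural("aromatic nitro", "aromatic_nitro")
logp_2_3 = property_range("2 < logP < 3", "logP", 2.0, 3.0)

query = {"logP": 2.4, "fragments": {"aromatic_nitro", "aziridine"}}
matches = [h.name for h in (aromatic_nitro, logp_2_3) if h.covers(query)]
```

Because both kinds of hypothesis expose the same `covers` test, hypotheses of different types can later be handled uniformly, whatever learning process or expert produced them.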
Figure 3 : Different types of hypotheses. A hypothesis represents a
class of compounds exhibiting a similar SAR trend.
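[Figure 12, chart labels: Decision Tree, Random Forest, SVM and SOHN; internal validation and external validation; axis ticks from 0.60 to 0.85.]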
Figure 12: Comparison with other machine learning techniques shows
that SOHN can be used as a prediction model with comparable accuracy
whilst providing a high level of transparency. SOHN also demonstrates
relatively good external domain adaptability compared to other algorithms
(less overfitting).
 Application: Sarah Nexus
Figure 11: Combination of individual hypotheses into an overall
classification call using a reasoning module. By default the hypotheses
are weighted using their confidence level.
Figure 4: Comparing the level of generalisation of hypotheses of identical type is
generally trivial (top). Comparing hypotheses of different types (bottom) is, however,
more challenging and cannot be evaluated directly on the hypotheses themselves.
Hypotheses of the same type can be compared directly for their degree of generalisation (figure 4, top); comparing
hypotheses of different types is more challenging (figure 4, bottom). To solve this issue, the SOHN approach uses
a reference dataset to define a generalisation order based on the coverage of the hypotheses within this dataset.
The method is illustrated in figure 5 with a reference dataset containing 10 examples (e1 to e10). If a hypothesis
h1 covers examples e1 to e5 and a hypothesis h2 applies to examples e1, e2 and e3, then we can infer that h1 is
more general than h2, since all the examples covered by h2 are also covered by h1.
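The coverage-based ordering can be expressed with plain set operations. The sketch below reuses the h1/h2 example from the text; the third hypothesis `h3` and the function name `more_general` are hypothetical additions that show a pair of hypotheses with no ordering between them.

```python
def more_general(cover_a, cover_b):
    """True if A is strictly more general than B on the reference dataset,
    i.e. every example covered by B is covered by A, and A covers more."""
    return cover_b < cover_a  # proper-subset test on sets


# Coverage of each hypothesis within the reference dataset e1..e10.
coverage = {
    "h1": {"e1", "e2", "e3", "e4", "e5"},
    "h2": {"e1", "e2", "e3"},
    "h3": {"e4", "e6"},  # hypothetical: overlaps h1 but is not contained in it
}

h1_over_h2 = more_general(coverage["h1"], coverage["h2"])
h2_over_h1 = more_general(coverage["h2"], coverage["h1"])
h1_over_h3 = more_general(coverage["h1"], coverage["h3"])
h3_over_h1 = more_general(coverage["h3"], coverage["h1"])
```

Since the test only looks at which reference examples are covered, it applies equally to structural, property-based or expert-written hypotheses. The resulting relation is a partial order: some pairs, such as h1 and h3 here, are simply not comparable.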
To improve safety in the context of mutagenicity risk assessment, international regulators introduced the
ICH-M7[2] guidelines on genotoxic impurities, recommending the combined usage of an expert system and a (Q)SAR
model. At Lhasa Limited, transparency in prediction has always been a priority and was our guiding principle for
developing Derek Nexus[3], an expert system for toxicity prediction. In response to ICH-M7, we developed Sarah
Nexus, a user-friendly and fully featured QSAR prediction application (figures 13 and 14), to complement Derek
Nexus. Sarah Nexus uses the SOHN methodology and has been trained for the Ames mutagenicity endpoint; it is
compliant with the OECD QSAR principles[4], allowing for accurate and transparent predictions. Sarah Nexus has
been favourably evaluated by the FDA[5]. Each prediction is supported by one or more self-explanatory hypotheses
along with a confidence level and supporting evidence. This rich set of knowledge assists the toxicology expert in
forming a well-informed opinion and facilitates a safe decision process.
[Figures 5 and 6, diagram labels: root hypothesis h0 (most generic, null hypothesis); hypotheses h1 to h7 (knowledge units); example hypotheses e1 to e10 (most specific, facts); generalisation and instantiation directions; unseen instance.]
Figure 5: Hypothesis hierarchy induced by the coverage relationship. Note
that h0 is a special hypothesis (the root) that covers the whole chemical
space, and each example ei is an ultimately specific hypothesis that covers only
itself.
Figure 6: Self Organised Hypothesis Network. Highlighted, a virtual search
path for the most specific hypotheses (h6, h7) applying to a given query
compound. The search is performed using recursive analysis starting from
the root.
Once hypotheses have been defined for a given endpoint, they can be organised into a hierarchical structure that
captures the different levels of generalisation of the knowledge. This structure is automatically updated each time
a new hypothesis is inserted and is called a Self-Organising Hypothesis Network or SOHN (figure 6). An example
of a SOHN for the mutagenicity endpoint is shown in figure 7.
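A minimal sketch of how such a network could reorganise itself on insertion is given below. It assumes each hypothesis is described by the set of reference examples it covers, and it recomputes the direct parent/child links from scratch after every insertion; the class name `HypothesisNetwork` and this brute-force strategy are illustrative assumptions, not the published algorithm.

```python
class HypothesisNetwork:
    """Toy hierarchy in which a parent covers a strict superset of its child."""

    def __init__(self, reference):
        # h0 is the root hypothesis covering the whole reference dataset.
        self.coverage = {"h0": frozenset(reference)}
        self.children = {"h0": set()}

    def insert(self, name, covered):
        """Add a hypothesis and let the network reorganise itself."""
        self.coverage[name] = frozenset(covered)
        self._reorganise()

    def _reorganise(self):
        names = list(self.coverage)
        self.children = {n: set() for n in names}
        for child in names:
            cov = self.coverage[child]
            # Ancestors: every hypothesis strictly more general than child.
            ancestors = [p for p in names if cov < self.coverage[p]]
            for parent in ancestors:
                # Keep only direct links: no other ancestor may sit
                # strictly between child and parent.
                if not any(self.coverage[mid] < self.coverage[parent]
                           for mid in ancestors):
                    self.children[parent].add(child)


net = HypothesisNetwork({f"e{i}" for i in range(1, 11)})
net.insert("h2", {"e1", "e2", "e3"})              # specific hypothesis first
net.insert("h1", {"e1", "e2", "e3", "e4", "e5"})  # then a more general one
net.insert("h3", {"e6", "e7"})                    # hypothetical extra hypothesis
```

Although h2 is inserted before h1, it ends up below h1 once h1 arrives, which is the self-organising behaviour described above. In this simplified version, hypotheses with identical coverage become siblings, and individual examples are not added as leaf hypotheses.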
To predict the class of a new compound, the algorithm identifies the most specific hypotheses that apply to the
query (figure 6). The path from the root to each selected hypothesis is used as supporting knowledge to interpret
the prediction's outcome (figure 8), and the actual prediction is built using a local weighted kNN model on the
examples covered by the hypothesis (figures 9 and 10).
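The two stages of a prediction, the recursive search for the most specific applicable hypotheses and the local weighted kNN evaluation, can be sketched as follows. Several details are assumptions made for illustration: compounds are represented as sets of fragment labels, similarity is the Tanimoto coefficient on those sets, and the confidence is taken as the absolute signal scaled by the mean similarity of the neighbours. The poster does not give the exact formulas.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fragment sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_specific(children, applies, node="h0"):
    """Recursively collect the deepest hypotheses that apply to the query,
    starting from the root. `applies(name)` tells whether a hypothesis
    covers the query compound."""
    hits = set()
    for child in children.get(node, ()):
        if applies(child):
            hits |= most_specific(children, applies, child)
    return hits or {node}  # no applicable child: this node is most specific


def weighted_knn(query, examples, k=3):
    """Local weighted kNN over the examples covered by one hypothesis.

    `examples` is a list of (fragment_set, s) with s=+1 for mutagens and
    s=-1 for non-mutagens. Returns (signal, confidence).
    """
    scored = [(tanimoto(query, frags), s) for frags, s in examples]
    nearest = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    total = sum(w for w, _ in nearest)
    if total == 0:
        return 0.0, 0.0
    signal = sum(w * s for w, s in nearest) / total  # in [-1, 1]
    confidence = abs(signal) * total / len(nearest)  # scaled by mean similarity
    return signal, confidence


# Hypothetical network and query: h1 and h2 apply, h3 does not.
children = {"h0": {"h1", "h3"}, "h1": {"h2"}}
selected = most_specific(children, {"h1", "h2"}.__contains__)

query = {"nitro", "ring"}
examples = [
    ({"nitro", "ring"}, +1),
    ({"nitro", "ring", "methyl"}, +1),
    ({"nitro"}, -1),
    ({"amine"}, -1),
]
signal, confidence = weighted_knn(query, examples, k=3)
```

A positive signal corresponds to a mutagen call. Neighbours that are dissimilar to the query lower the confidence even when they agree with each other, which mirrors the behaviour of x1 and x2 described in the caption of figure 9.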
Figure 13: Screenshot of the prediction for 2-(4-Nitrophenyl)aziridine. Sarah Nexus has identified two toxicophores (the aromatic nitro and aziridine motifs).
The query compound is shown with both hypotheses highlighted when selected in the hypothesis summary. In this example, both hypotheses suggest a
mutagenicity risk, with confidences of 47% and 66% respectively (0% = random), leading to the query compound being classified as a mutagen with 57% confidence.
 Conclusion
We have introduced a new paradigm in QSAR that decouples the learning methodology from the final knowledge
representation, using a collection of hypotheses organised in a network. Using the most relevant part of this
network as ad hoc local models leads to accurate and highly interpretable predictions. Additionally, this
methodology provides access to a meaningful confidence metric at the individual prediction level, which is critical
in a regulatory context.
We have successfully integrated the SOHN methodology into a new prediction application, Sarah Nexus. Ultimately,
the human expert remains the key asset in a risk assessment process. By providing transparent predictions, Sarah
Nexus and Derek Nexus combined offer valuable assistance in this decision process, in line with the ICH-M7 and
OECD guidelines.

References
[1] Thierry Hanser, Chris Barber, Edward Rosser, Jonathan D Vessey, Samuel J Webb and Stéphane Werner. Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge. Journal of Cheminformatics 2014, 6:21. doi:10.1186/1758-2946-6-21
[2] ICH: International Conference on Harmonisation, http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Multidisciplinary/M7/M7_Step_4.pdf
[3] Derek Nexus: http://www.lhasalimited.org/Public/Library/Lhasa%20Library/Brochures/Derek%20Nexus.pdf
[4] OECD: Organisation for Economic Co-operation and Development, http://www.oecd.org/env/ehs/risk-assessment/37849783.pdf
[5] FDA: Food and Drug Administration. Lidiya Stavitskaya, Barbara L. Minnier and Naomi L. Kruhlak. Benchmarking Assessment of Open Source and Newly Released Salmonella Mutagenicity (Q)SAR Models for Potential Use Under ICH M7. FDA Center for Drug Evaluation and Research (CDER), SOT conference, Phoenix, Arizona, 2014