
1521-009X/42/11/1811–1819$25.00
DRUG METABOLISM AND DISPOSITION
Copyright © 2014 by The American Society for Pharmacology and Experimental Therapeutics
http://dx.doi.org/10.1124/dmd.114.057893
Drug Metab Dispos 42:1811–1819, November 2014
In Silico Prediction of Major Drug Clearance Pathways by Support
Vector Machines with Feature-Selected Descriptors
Kouta Toshimoto, Naomi Wakayama, Makiko Kusama, Kazuya Maeda, Yuichi Sugiyama,
and Yutaka Akiyama
Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Tokyo,
Japan (K.T., Y.A.); Drug Metabolism and Pharmacokinetics Japan, Biopharmaceutical Assessment Core Function Unit, Eisai Product
Creation Systems, Eisai Co., Ltd., Ibaraki, Japan (N.W.); Laboratory of Pharmaceutical Regulatory Science (M.K.) and Laboratory of
Molecular Pharmacokinetics (K.M.), Graduate School of Pharmaceutical Sciences, The University of Tokyo, Tokyo, Japan; and
Sugiyama Laboratory, RIKEN Innovation Center, RIKEN Research Cluster for Innovation, RIKEN, Yokohama, Japan (Y.S.)
Received February 28, 2014; accepted August 14, 2014
We have previously established an in silico classification method
(“CPathPred”) to predict the major clearance pathways of drugs
based on an empirical decision with only four physicochemical
descriptors—charge, molecular weight, octanol-water distribution
coefficient, and protein unbound fraction in plasma—using a rectangular method. In this study, we attempted to improve the prediction
performance of the method by introducing a support vector machine
(SVM) and increasing the number of descriptors. The data set consisted of 141 approved drugs whose major clearance pathways were
classified into metabolism by CYP3A4, CYP2C9, or CYP2D6; organic
anion transporting polypeptide–mediated hepatic uptake; or renal
excretion. With the same four default descriptors as used in CPathPred,
the SVM-based predictor (named “default descriptor SVM”) resulted
in higher prediction performance compared with a rectangular-based
predictor judged by 10-fold cross-validation. Two SVM-based predictors were also established by adding some descriptors as follows: 1)
881 descriptors predicted in silico from the chemical structures of
drugs in addition to 4 default descriptors (“885 descriptor SVM”);
and 2) selected descriptors extracted by a feature selection based
on a greedy algorithm with default descriptors (“feature selection
SVM”). The prediction accuracies of the rectangular-based predictor, default descriptor SVM, 885 descriptor SVM, and feature
selection SVM were 0.49, 0.60, 0.72, and 0.87, respectively, and the
overall precision values for these four methods were 0.72, 0.77,
0.86, and 0.97, respectively. In conclusion, we successfully constructed SVM-based predictors with limited numbers of descriptors
to classify the major clearance pathways of drugs in humans with
high prediction performance.
Introduction
It is crucial to identify drug candidates that are not likely to succeed
as early as possible because the cost per drug approval increases enormously
as candidates progress to late clinical development phases. As
one of the important measures for good drugs, the pharmacokinetic
properties of drug candidates are intensively investigated to select promising
compounds possessing “drug-likeness” in the drug discovery stage
and to understand the possible cause of interindividual variability in the
pharmacological/toxicological effects of drugs (Lipinski et al., 2001). In
particular, it is necessary to understand the molecules responsible for
drug elimination and the relative contribution of each clearance pathway
to the overall total clearance of drugs to forecast the quantitative risk of
drug-drug interactions or genetic polymorphisms of certain metabolic
enzymes/transporters. Various methods for such purpose have been
proposed (Emoto et al., 2010; Yoshida et al., 2013); however, because
combinations of several experiments with significant cost and labor are
required, it is almost impossible to perform such experiments in the
early stage of drug discovery. Alternatively, in recent years, several
types of in silico structure-based and ligand-based drug design
approaches have been extensively applied to the various aspects of
drug discovery, such as the prediction of substrate specificity of each
enzyme/transporter, chemical substructures metabolized, and the clearance
of each drug from the physicochemical properties calculated on the
basis of chemical structures (van de Waterbeemd and Gifford, 2003;
Lewis and Ito, 2010). Although some in silico prediction methods were
established in accordance with the principles of theoretical physics,
there are also some well known empirical prediction methods created
on the basis of the experience and intuition of experts in the field of
pharmaceutical sciences, such as Lipinski’s rule of five (Lipinski et al.,
2001). Wu and Benet (2005) expanded the basic concept of the
biopharmaceutics classification system (Yu et al., 2002) and proposed
a biopharmaceutics drug disposition classification system to clearly
predict the involvement of metabolic enzymes and transporters in the
total clearance of drugs (Wu and Benet, 2005). This type of approach
helps us to approximately understand the pharmacokinetic properties
of drugs. There are several in silico methods to predict whether a
compound is susceptible to a certain metabolic enzyme (Yamashita
et al., 2008; Mishra et al., 2010; Cheng et al., 2012), but it is fairly
difficult to identify the major clearance pathways of drugs in humans
because multiple elements, such as metabolic clearance mediated by
This research was partly supported by the Okawa Foundation (to K.T. and Y.A.)
and by the Grant-in-aid for Young Scientists (A) [Grant 24689009] (to K.M.).
K.T. and N.W. contributed equally to this work.
ABBREVIATIONS: log D, octanol-water distribution coefficient (lipophilicity); log P, octanol-water partition coefficient; OATP, organic anion
transporting polypeptide; SVM, support vector machine; VDW, van der Waals.
Downloaded from dmd.aspetjournals.org at ASPET Journals on June 18, 2017
ABSTRACT
Toshimoto et al.
TABLE 1
Drug data set
The numbers in this table indicate the numbers of drugs in our data set that are primarily
eliminated through each clearance pathway as stratified by charge. The numbers of drugs in each
clearance pathway vary greatly, and drugs in some categories display a strong relationship with
their charge.

                                  Charge
Clearance Pathway     Anion   Cation or neutral   Total
CYP3A4                    0                  52      52
CYP2C9                   11                   1      12
CYP2D6                    0                  18      18
Renal                    18                  23      41
OATP                     18                   0      18
Total                    47                  94     141
Materials and Methods
Data Set. We used the same data set as our previous study (Kusama et al.,
2010). First, a data set was created using the appendix of Goodman and Gilman’s
The Pharmacological Basis of Therapeutics (Benet et al., 1996; Thummel and
Shen, 2001; Thummel et al., 2006) with the addition of major OATP substrates.
Information concerning clearance pathways in humans was collected from the
published data. Pharmacokinetic profiles after intravenous administration and in
vitro data on metabolism of drugs were also collected. The definition of the
“major” clearance pathway is that a certain clearance pathway is considered to
contribute to >50% of the total body clearance of drugs. Of the 294 drugs listed,
141 drugs that met the criteria for the following major clearance pathways were
selected for the training data set: metabolism by CYP2C9, CYP2D6, or CYP3A4;
renal excretion; or OATP-mediated hepatic uptake.
Descriptors. For the 141 drugs in the data set, their two-dimensional
structures were obtained from the PubChem Compound database (http://
pubchem.ncbi.nlm.nih.gov/), and their charges, defined as the charge of the
largest fraction of chemical species at pH 7.0, were estimated using the
ADMET Predictor (Simulations Plus, Lancaster, CA). The drugs were then
categorized by charge at neutral pH into the two following groups: cation or
neutral (cation/neutral) and anion. Then their predicted lipophilicity (log D)
values at pH 7.0 were cited from SciFinder Scholar 2007 (Chemical Abstracts
Service, Columbus, OH). In addition, protein unbound fraction in plasma was
predicted using ADME Boxes version 4.0 (Pharma Algorithms, Toronto, ON,
Canada). These four parameters are termed “default descriptors” in this paper.
Excluding the default descriptors, 1081 descriptors could be calculated using
PreADMET v2.0 (BMDRC, Seoul, South Korea). After deleting descriptors
with missing values for any drug and those with the same values for all drugs,
881 descriptors were finally obtained and used for further analyses.
Construction of Prediction Systems Using SVM. We used SVMlight
(http://svmlight.joachims.org/) as an SVM program and the Gaussian kernel as
a kernel function [K (x, z)] as follows:
$$ K(x, z) = \exp\left(-\gamma \lVert x - z \rVert^{2}\right) \tag{1} $$
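As an illustration of the Gaussian kernel in eq. 1, the following minimal Python sketch evaluates K(x, z) for two descriptor vectors. It is not taken from SVMlight; the function name `gaussian_kernel` and the example vectors are assumptions made only for this illustration.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Return exp(-gamma * ||x - z||^2) for two equal-length numeric vectors.

    Hypothetical helper for illustration; SVMlight computes this internally.
    """
    if len(x) != len(z):
        raise ValueError("x and z must have the same length")
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.0; the value decays
# toward 0 as the vectors move apart, faster for larger gamma.
print(gaussian_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

A larger gamma makes the kernel more local, so the resulting decision boundary can follow the training data more closely.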
In each SVM, it is necessary to optimize γ, the width parameter of the Gaussian
kernel, and C, which represents the trade-off between training error and margin size.
We chose the values of C and γ from C = {1, 5, 10} and γ = {0.001, 0.01, 0.1, 1, 10}
that gave the best prediction performance for each SVM, as determined by the
maximization of the F-measure, the harmonic mean of precision and recall, as follows:
Fig. 1. Schematic diagram of SVM-based machine learning. See
Construction of Prediction Systems Using SVM in Materials and
Methods for details.
various enzymes, should be correctly predicted on the basis of the
theoretical approach, such as the relative activity factor method
(Venkatakrishnan et al., 2000). Therefore, we previously established an in
silico empirical classifier using a rectangular method (“CPathPred”) to
predict the major clearance pathways of drugs, with only four physicochemical parameters easily predicted from their molecular structures
[charge; molecular weight; octanol-water distribution coefficient, or
lipophilicity (log D); and protein unbound fraction in plasma], by
optimizing the boundaries of these parameters for each clearance
pathway (Kusama et al., 2010). The major clearance pathways of
141 approved drugs that are known to be mainly metabolized by
CYP3A4, CYP2C9, or CYP2D6; taken up into hepatocytes by organic
anion transporting polypeptides (OATPs); or excreted into urine in an
unchanged form could be reasonably predicted by CPathPred with an
accuracy of 73%.
Although the rectangular method is visually intuitive and easily
understood, expansion of the number of clearance pathways would not
be practical due to an algorithmic issue and substantial time requirements
for the calculation (Kusama et al., 2010). In this study, we improved the
prediction performance for each clearance pathway via the introduction
of the support vector machine (SVM) with Gaussian kernel function,
which has a broad capacity for two-class classification owing to its
ability to find a nonlinear boundary in multidimensional data (Vapnik,
2000). This also enabled us to easily increase the number of descriptors
used for our prediction, and therefore, three SVM-based predictors were
established using the following descriptors: 1) 4 default physicochemical
parameters used in our previous study; 2) 881 descriptors predicted in
silico from the chemical structures of drugs in addition to the 4 default
parameters; and 3) selected descriptors from a pool of 881 descriptors
extracted by a feature selection using the greedy algorithm with the 4
default parameters.
In this study, we compared the prediction performance of the different
systems by 10-fold cross-validation, verified the effect of introducing
SVMs with an increased number of descriptors on the predictability of the
major clearance pathways of drugs in humans, and ultimately propose a
prediction system that performs better than the previous rectangular-based predictor.
Clearance Pathway Prediction by Support Vector Machines
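$$ \text{F-measure} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2} $$
$$ \text{precision} = \frac{TP}{TP + FP} \tag{3} $$
and
$$ \text{recall} = \frac{TP}{TP + FN} \tag{4} $$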
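The relationship among eqs. 2 to 4 can be checked numerically with a short Python sketch. The function name and the counts below are illustrative assumptions, not values from the study; TP, FP, and FN denote the numbers of true-positive, false-positive, and false-negative predictions.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw prediction counts.

    Zero denominators are mapped to 0.0, a convention assumed here for
    illustration only.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f_measure = precision_recall_f(tp=8, fp=2, fn=4)
print(precision, recall, f_measure)
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low, which is presumably why it is a convenient single criterion for unbalanced classes.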
1. Divide drug data (141 drugs) into training and test data for 10-fold
cross-validation.
2. Conduct random oversampling to generate 200 positive and 200 negative
examples.
3. Train the SVM with the 400 examples and predict the test data.
4. Repeat steps 2 and 3 30 times to ensure the randomness of oversampling.
5. Repeat steps 1–4 for each set of training and test data.
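The five steps above can be sketched in Python as follows. This is only an outline under stated assumptions: SVM training itself is omitted, oversampling is implemented as sampling with replacement (the text says only that data were replicated randomly), and all function names are hypothetical.

```python
import random


def k_fold_indices(n_items, k, rng):
    """Shuffle item indices and deal them into k disjoint test folds."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def random_oversample(examples, target, rng):
    """Draw `target` examples with replacement (assumed oversampling scheme)."""
    if not examples:
        raise ValueError("cannot oversample an empty class")
    return [rng.choice(examples) for _ in range(target)]


def oversampled_training_sets(labels, k=10, n_repeats=30, n_per_class=200, seed=0):
    """Yield (positive_idx, negative_idx, test_idx) for every fold and repeat.

    `labels` holds 1 for drugs in the pathway of interest and 0 otherwise.
    """
    rng = random.Random(seed)
    for test_idx in k_fold_indices(len(labels), k, rng):      # step 1
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        positives = [i for i in train_idx if labels[i] == 1]
        negatives = [i for i in train_idx if labels[i] == 0]
        for _ in range(n_repeats):                            # step 4
            yield (random_oversample(positives, n_per_class, rng),   # step 2
                   random_oversample(negatives, n_per_class, rng),
                   test_idx)  # step 3 would train an SVM here and predict test_idx
```

With 10 folds and 30 oversampling repeats, one round of cross-validation produces 300 training sets of 400 examples each.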
Fig. 2. Schematic diagram of a feature selection based on a greedy algorithm. See
Feature Selection of Descriptors Based on the Greedy Algorithm in Materials and
Methods for details.
Each SVM would yield the distance of a test compound from the borderline.
Tenfold cross-validation was repeated 10 times, and a final prediction result was
determined as the average of 300 SVM outputs for each drug. If a test compound
is located in the incorrect territory, then the distance was yielded as a negative
value. A compound would be considered to belong to a certain clearance
pathway if the averaged distance of the 300 SVMs for the corresponding clearance
pathway was positive.
Feature Selection of Descriptors Based on the Greedy Algorithm. We
constructed three types of predictors using SVMs with only the default descriptors
(“default descriptor SVM”), with the default descriptors and the 881 descriptors
(“885 descriptor SVM”), and with the default descriptors and the feature-selected
descriptors from the 881 descriptors (“feature selection SVM”). To incorporate the
most relevant descriptors and consequently increase the accuracy, we performed
feature selection using the greedy algorithm (Caruana and Freitag, 1994), as
theoretically there are 2^881 combinations of input data, making it impossible to
investigate all possible combinations in an exhaustive manner.
The process of feature selection based on the greedy algorithm is as follows (Fig. 2):
1. Randomly select 1 descriptor from the 881 descriptors.
2. Run the SVM with the default descriptors and the descriptor
selected in step 1 as input parameters.
3. Evaluate the performance of the feature selection by the F-measure of
each predictor as previously explained.
4. Repeat the aforementioned procedures for all 881 descriptors to verify
which descriptor was associated with the maximum F-measure.
TABLE 2
Actual and predicted clearance pathways of 10-fold cross-validation using the rectangular-based predictor
The shaded entries (the diagonal of the single-pathway columns) represent the numbers of drugs classified into a correct clearance pathway by the classification system.

Actual (observed)     Predicted (classified) pathway
                      Single
pathway                   3A4    2C9    2D6  Renal   OATP  Dual(a)  None(b)  Total  Recall(c)
3A4                        38      0      1      2      0        6        5     52       0.73
2C9                         0      2      0      0      5        1        4     12       0.17
2D6                         6      0      2      5      0        4        1     18       0.11
Renal                       0      0      0     23      1        3       14     41       0.56
OATP                        0      2      0      5      4        5        2     18       0.22
Total                      44      4      3     35     10       19       26    141
Precision(d)             0.86   0.50   0.67   0.66   0.40
Overall precision(e)     0.72
Accuracy(f)              0.49

(a) Dual: the number of drugs classified into two pathways.
(b) None: the number of drugs classified into none of the pathways.
(c) Recall: the proportion of drugs correctly classified into a pathway to that of the total number of drugs that are actually eliminated through that pathway.
(d) Precision: the proportion of drugs correctly classified into a pathway to that of the total number of drugs classified into that pathway.
(e) Overall precision: the ratio of the number of drugs correctly classified to that of those predicted to be in the same pathway.
(f) Accuracy: the proportion of drugs correctly classified to that of the total number of drugs in the training data set.
where TP, FP, and FN are the numbers of true-positive, false-positive, and
false-negative predictions, respectively.
As an SVM classifies data into only two classes, five independent SVM-based classifiers that corresponded to each clearance pathway were prepared.
Thus, the model is similar to the one-versus-the-rest method (Hsu and Lin,
2002), but we take multiple positives rather than choosing only an argument of
the maximum. As each SVM corresponding to a specific clearance pathway
independently tells us whether a compound is mainly eliminated via a specified
pathway (“Yes” or “No”) in our current system, our prediction system is
allowed to conclude that a compound has no major clearance pathway or
multiple major clearance pathways.
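The multiple-positive decision rule can be illustrated with a minimal Python sketch. It assumes, as stated elsewhere in Materials and Methods, that a compound is assigned to a pathway when the averaged SVM output for that pathway is positive; the function name and the numerical outputs are hypothetical.

```python
def classify_pathways(distances_by_pathway):
    """Return every pathway whose mean SVM output (signed distance) is positive.

    An empty list corresponds to "None"; two entries correspond to "Dual".
    """
    predicted = []
    for pathway, distances in distances_by_pathway.items():
        if distances and sum(distances) / len(distances) > 0:
            predicted.append(pathway)
    return predicted


# Hypothetical outputs for one drug (three values per pathway for brevity;
# the study averaged 300 SVM outputs per pathway).
outputs = {
    "CYP3A4": [-0.8, -0.5, -0.9],
    "CYP2C9": [0.4, 0.1, 0.3],
    "CYP2D6": [-1.2, -0.7, -1.0],
    "Renal": [-0.2, -0.4, -0.1],
    "OATP": [0.2, -0.1, 0.5],
}
print(classify_pathways(outputs))
```

Unlike an argmax over the five classifiers, this rule never forces a single answer, so ambiguous compounds surface as "Dual" or "None" rather than as a possibly wrong single pathway.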
It is not appropriate to train an SVM only with a direct input of data for 141 drugs,
as the drugs were not uniformly distributed among the clearance pathways (Table 1).
Thus, we adopted a random oversampling technique (Liu et al., 2006). In the random
oversampling process, we increased the number of examples in the minority class by
replicating data randomly. We evaluated SVM by 10-fold cross-validation, which
tests trained classifiers with subdivided test data within the training data set. Briefly,
the data were partitioned into 10 subsets, and a model was trained using 9 subsets.
Then validation was performed on the model using the remaining 1 subset. In our
analysis, to further reduce the fluctuation of outcomes caused by a random selection
of subsets, 10 rounds of cross-validation were performed using different partitions
and the validation results were averaged over the rounds.
The procedure of training an SVM with random oversampling and evaluating
the SVM by 10-fold cross-validation is as follows (Fig. 1):
TABLE 3
Actual and predicted clearance pathways of 10-fold cross-validation using the default descriptor SVM
The shaded entries (the diagonal of the single-pathway columns) represent the numbers of drugs classified into a correct clearance pathway by the classification system.

Actual (observed)     Predicted (classified) pathway
                      Single
pathway                   3A4    2C9    2D6  Renal   OATP  Dual(a)  None(b)  Total  Recall(c)
3A4                        42      0      1      4      0        3        2     52       0.81
2C9                         1      6      0      0      0        5        0     12       0.50
2D6                         5      0      6      4      0        3        0     18       0.33
Renal                       3      2      0     30      0        4        2     41       0.73
OATP                        0      2      0      3      1       12        0     18       0.06
Total                      51     10      7     41      1       27        4    141
Precision(d)             0.82   0.60   0.86   0.73   1.00
Overall precision(e)     0.77
Accuracy(f)              0.60

(a) Dual: the number of drugs classified into two pathways.
(b) None: the number of drugs classified into none of the pathways.
(c) Recall: the proportion of drugs correctly classified into a pathway to that of the total number of drugs that are actually eliminated through that pathway.
(d) Precision: the proportion of drugs correctly classified into a pathway to that of the total number of drugs classified into that pathway.
(e) Overall precision: the ratio of the number of drugs correctly classified to that of those predicted to be in the same pathway.
(f) Accuracy: the proportion of drugs correctly classified to that of the total number of drugs in the training data set.
Prediction Performance of the Entire Predictor. Prediction performance
of the entire predictor for five clearance pathways was evaluated using two
criteria, accuracy and overall precision, which were calculated using eqs. 5 and
6, respectively, as follows:
$$ \text{accuracy} = \frac{\#\,\text{correctly predicted data}}{\#\,\text{data}} \tag{5} $$
and
$$ \text{overall precision} = \frac{\#\,\text{correctly predicted data}}{\#\,\text{data which are predicted as a single clearance pathway}} \tag{6} $$
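As a worked example of eqs. 5 and 6, the short Python sketch below recomputes both criteria from the counts reported in Table 5 (feature selection SVM). The function name is an assumption introduced for illustration; the counts are the diagonal entries and single-pathway column totals of that table.

```python
def accuracy_and_overall_precision(n_correct, n_single_predictions, n_drugs):
    """Return (accuracy, overall precision) as defined in eqs. 5 and 6."""
    return n_correct / n_drugs, n_correct / n_single_predictions


# Table 5: correctly classified drugs per pathway (3A4, 2C9, 2D6, renal, OATP)
n_correct = 47 + 8 + 13 + 37 + 17            # 122
# Table 5: drugs classified into exactly one pathway, per predicted pathway
n_single_predictions = 50 + 8 + 13 + 37 + 18  # 126

accuracy, overall_precision = accuracy_and_overall_precision(
    n_correct, n_single_predictions, n_drugs=141
)
print(round(accuracy, 2), round(overall_precision, 2))  # 0.87 0.97
```

The two criteria differ only in the denominator: drugs classified as "Dual" or "None" lower the accuracy but are excluded from the overall precision.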
Construction of a Rectangular-Based Predictor. We also constructed
a rectangular-based predictor on the basis of four default descriptors
(Kusama et al., 2010). Model evaluation was performed using accuracy
and overall precision in a 10-fold cross-validation performed for 10 iterations. A
drug would be considered to belong to a certain clearance pathway if it
was categorized into a specified clearance pathway more than 6 of 10
times.
Validation of the Effectiveness of Feature Selection Based on the
Greedy Algorithm. To confirm whether the prediction accuracy obtained with
feature selection based on the greedy algorithm is superior to that obtained
with random feature selection, the prediction performance of 1.0 × 10^7
predictors with random feature selection was evaluated by 10-fold cross-validation.
Then the probability density distribution of the accuracy of the 1.0 × 10^7
predictors constructed with the random feature selection was calculated.
Results
Prediction Performance of Rectangular-Based Predictor, Default Descriptor SVM, 885 Descriptor SVM, and Feature Selection
SVM. Prediction results of the rectangular-based predictor, default
descriptor SVM, 885 descriptor SVM, and feature selection SVM are
TABLE 4
Actual and predicted clearance pathways of 10-fold cross-validation using the 885 descriptor SVM
The shaded entries (the diagonal of the single-pathway columns) represent the numbers of drugs classified into a correct clearance pathway by the classification system.

Actual (observed)     Predicted (classified) pathway
                      Single
pathway                   3A4    2C9    2D6  Renal   OATP  Dual(a)  None(b)  Total  Recall(c)
3A4                        40      0      3      1      1        2        5     52       0.77
2C9                         1      4      0      1      1        0        5     12       0.33
2D6                         1      0     14      0      0        2        1     18       0.78
Renal                       2      0      3     29      0        2        5     41       0.71
OATP                        2      0      0      0     14        0        2     18       0.78
Total                      46      4     20     31     16        6       18    141
Precision(d)             0.87   1.00   0.70   0.94   0.88
Overall precision(e)     0.86
Accuracy(f)              0.72

(a) Dual: the number of drugs classified into two pathways.
(b) None: the number of drugs classified into none of the pathways.
(c) Recall: the proportion of drugs correctly classified into a pathway to that of the total number of drugs that are actually eliminated through that pathway.
(d) Precision: the proportion of drugs correctly classified into a pathway to that of the total number of drugs classified into that pathway.
(e) Overall precision: the ratio of the number of drugs correctly classified to that of those predicted to be in the same pathway.
(f) Accuracy: the proportion of drugs correctly classified to that of the total number of drugs in the training data set.
5. Add the aforementioned descriptor to the default descriptors.
6. Repeat the aforementioned procedure until the F-measure is no longer
improved by the addition of the descriptor selected in step 5.
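The greedy procedure (steps 1 to 6) amounts to forward selection and can be sketched in Python as below. The callable `score` is a hypothetical stand-in for the cross-validated F-measure of an SVM trained on a given descriptor set, and starting from the score of the default descriptors alone is an assumption of this sketch.

```python
def greedy_forward_selection(candidates, base_features, score):
    """Add, one at a time, the candidate that most improves `score`.

    Stops as soon as no remaining candidate improves the current best score.
    """
    selected = list(base_features)
    remaining = list(candidates)
    best_score = score(selected)
    while remaining:
        # Steps 1-4: evaluate every remaining descriptor together with the
        # descriptors selected so far and keep the best-scoring one.
        trial_scores = [(score(selected + [c]), c) for c in remaining]
        top_score, top_candidate = max(trial_scores, key=lambda t: t[0])
        if top_score <= best_score:   # step 6: no further improvement
            break
        selected.append(top_candidate)  # step 5
        remaining.remove(top_candidate)
        best_score = top_score
    return selected, best_score


# Toy scoring function (illustrative only): each descriptor has a fixed gain.
gains = {"a": 0.3, "b": 0.2, "c": -0.1}
toy_score = lambda features: sum(gains.get(f, 0.0) for f in features)
print(greedy_forward_selection(["a", "b", "c"], ["default"], toy_score))
```

Each pass costs one model evaluation per remaining descriptor, so the search is linear in the pool size per added descriptor rather than exponential, at the price of possibly stopping at a local optimum.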
TABLE 5
Actual and predicted clearance pathways of 10-fold cross-validation using the feature selection SVM
The shaded entries (the diagonal of the single-pathway columns) represent the numbers of drugs classified into a correct clearance pathway by the classification system.

Actual (observed)     Predicted (classified) pathway
                      Single
pathway                   3A4    2C9    2D6  Renal   OATP  Dual(a)  None(b)  Total  Recall(c)
3A4                        47      0      0      0      0        0        5     52       0.90
2C9                         1      8      0      0      1        1        1     12       0.67
2D6                         2      0     13      0      0        0        3     18       0.72
Renal                       0      0      0     37      0        0        4     41       0.90
OATP                        0      0      0      0     17        1        0     18       0.94
Total                      50      8     13     37     18        2       13    141
Precision(d)             0.94   1.00   1.00   1.00   0.94
Overall precision(e)     0.97
Accuracy(f)              0.87

(a) Dual: the number of drugs classified into two pathways.
(b) None: the number of drugs classified into none of the pathways.
(c) Recall: the proportion of drugs correctly classified into a pathway to that of the total number of drugs that are actually eliminated through that pathway.
(d) Precision: the proportion of drugs correctly classified into a pathway to that of the total number of drugs classified into that pathway.
(e) Overall precision: the ratio of the number of drugs correctly classified to that of those predicted to be in the same pathway.
(f) Accuracy: the proportion of drugs correctly classified to that of the total number of drugs in the training data set.
Prediction Performance of Random Feature Selection. To confirm
that the performance achieved by the feature selection based on the greedy
algorithm is not likely to occur by chance, we compared the accuracy of
1.0 × 10^7 predictors prepared by random feature selection with that of
the feature selection SVM. Figure 3 shows the histogram of the probability
density distribution of the accuracy of the 1.0 × 10^7 predictors
constructed with the random feature selection.
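One way to summarize such a comparison is the fraction of randomly selected descriptor sets that reach the accuracy of the greedy selection. The Python sketch below illustrates this idea only; the callable `evaluate` is a hypothetical stand-in for training and cross-validating a predictor on a descriptor subset, and this summary statistic is an assumption of the sketch rather than the analysis reported here.

```python
import random


def fraction_at_least(reference_accuracy, candidates, subset_size,
                      evaluate, n_trials, seed=0):
    """Fraction of random descriptor subsets whose accuracy reaches the reference.

    Each trial draws `subset_size` distinct descriptors without replacement.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        subset = rng.sample(candidates, subset_size)
        if evaluate(subset) >= reference_accuracy:
            hits += 1
    return hits / n_trials


# Toy example: descriptor names are placeholders, and the evaluator returns a
# constant accuracy of 0.5, so no random subset reaches a reference of 0.87.
descriptors = [f"d{i}" for i in range(881)]
print(fraction_at_least(0.87, descriptors, 4, lambda subset: 0.5, n_trials=100))
```

A small fraction would indicate that the accuracy of the greedy selection is unlikely to arise from an arbitrary choice of the same number of descriptors.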
Discussion
In this study, we succeeded in improving the prediction performance
of an in silico classification method to predict the major clearance
pathways of drugs by identifying the multidimensional boundaries of
molecular descriptors calculated from chemical structures using SVMs
TABLE 6
Description of the descriptors selected on the basis of feature selection with a greedy algorithm (other than the four default descriptors) for each clearance pathway

Pathway  Order(a)  Name of Descriptor(b)                           Explanation
CYP3A4   1         DPSA3                                           Difference between the sum of the product of VDW surface area and partial charge of all positively charged atoms and that of all negatively charged atoms
CYP3A4   2         No. alcohol groups primary                      Number of primary alcohol groups
CYP3A4   3         AlogP98 054 H                                   Log P calculated by Ghose’s atom additive method (type 54)
CYP3A4   4         E state SdssC                                   Topological environment of atoms and the atom-atom electronic interactions in the molecule
CYP2C9   1         Auto Geary 05 AlogP98                           Autocorrelation used to transform AlogP98 into a uniform fixed-length representation
CYP2C9   2         No. rotatable bonds                             Number of rotatable bonds
CYP2D6   1         Fraction of 2D VSA H-bond acceptor              Fraction of surface area of H-bond acceptor atoms to total VDW surface area
CYP2D6   2         Auto Moran 02 polarizability                    An autocorrelation used to transform polarizability into a uniform fixed-length representation
CYP2D6   3         AlogP98 070 N                                   Log P calculated by Ghose’s atom additive method (type 70)
Renal    1         Fraction of 2D VSA polar                        Fraction of surface area of polar atoms to the total VDW surface area
Renal    2         Auto Moran 05 electronegativity                 Autocorrelation used to transform electronegativity into a uniform fixed-length representation
Renal    3         AlogP98 117 P                                   Log P calculated by Ghose’s atom additive method (type 117)
Renal    4         2D VSA polar                                    VDW surface area of polar atoms
OATP     1         BCUT lowest eigenvalue 02 AlogP98               Burden-CAS-University of Texas eigenvalues
OATP     2         Auto Moreau Bruto 05 electronegativity average  Autocorrelation used to transform electronegativity into a uniform fixed-length representation

(a) Order: the order of addition of descriptors based on the greedy algorithm.
(b) Data source: Manual of PreADMET v2.0 (http://www.bmdrc.org/).
shown in Table 2, Table 3, Table 4, and Table 5. The feature selection
SVM displayed the best prediction performance; accuracies for the
rectangular-based predictor, default descriptor SVM, 885 descriptor
SVM, and feature selection SVM were 0.49, 0.60, 0.72, and 0.87,
respectively, and the overall precision values for the four predictors
were 0.72, 0.77, 0.86, and 0.97, respectively.
Selected Parameters Based on Greedy Algorithm. The number of
descriptors including the four default descriptors for each clearance
pathway was six for CYP2C9 and OATP, seven for CYP2D6, and
eight each for CYP3A4 and renal excretion (Table 6). Some of the
feature-selected descriptors were associated with van der Waals (VDW)
surface area or lipophilicity. The prediction results of the feature selection SVM for 141 drugs are shown in Table 7 together with their
actual clearance pathways.
TABLE 7
Actual major clearance pathways for all drugs and the pathways as classified by the feature selection SVM
Results of 10-fold cross-validation. Dual classifications are shown as two pathways separated by a slash; None indicates that the drug was classified into none of the pathways.

Compound Name         Actual Pathway  Predicted Pathway
Acyclovir             Renal           Renal
Alprazolam            CYP3A4          CYP3A4
Amikacin              Renal           Renal
Amiodarone            CYP3A4          None
Amlodipine            CYP3A4          CYP3A4
Amoxicillin           Renal           Renal
Aprepitant            CYP3A4          CYP3A4
Atazanavir            CYP3A4          CYP3A4
Atenolol              Renal           Renal
Atomoxetine           CYP2D6          CYP2D6
Atorvastatin          OATP            OATP
Bosentan              OATP            OATP
Buprenorphine         CYP3A4          CYP3A4
Buspirone             CYP3A4          CYP3A4
Candesartan           OATP            OATP
Carbamazepine         CYP3A4          CYP3A4
Cefazolin             Renal           Renal
Cefdinir              Renal           Renal
Cefixime              Renal           Renal
Cefotetan             Renal           Renal
Cefuroxime            Renal           Renal
Celecoxib             CYP2C9          CYP3A4
Cephalexin            Renal           Renal
Cerivastatin          OATP            OATP
Chloroquine           Renal           None
Chlorpheniramine      CYP2D6          CYP2D6
Chlorpromazine        CYP2D6          CYP2D6
Chlorthalidone        Renal           Renal
Cidofovir             Renal           Renal
Cimetidine            Renal           Renal
Clonazepam            CYP3A4          CYP3A4
Clonidine             Renal           Renal
Daptomycin            Renal           None
Diazepam              CYP3A4          CYP3A4
Dicloxacillin         Renal           None
Didanosine            Renal           Renal
Digoxin               Renal           None
Diltiazem             CYP3A4          CYP3A4
Diphenhydramine       CYP2D6          CYP2D6
Dofetilide            Renal           Renal
Donepezil             CYP3A4          CYP3A4
Doxepin               CYP2D6          CYP2D6
Doxycycline           Renal           Renal
Enalapril             OATP            OATP
Enalaprilat           OATP            OATP/renal
Eplerenone            CYP3A4          CYP3A4
Erythromycin          CYP3A4          None
Ethambutol            Renal           Renal
Felodipine            CYP3A4          CYP3A4
Finasteride           CYP3A4          CYP3A4
Flecainide            CYP2D6          None
Fluconazole           Renal           Renal
Fluoxetine            CYP2D6          CYP2D6
Fluphenazine          CYP2D6          None
Fluvastatin           OATP            OATP
Foscarnet             Renal           Renal
Furosemide            Renal           Renal
Galantamine           CYP2D6          CYP3A4
Ganciclovir           Renal           Renal
Glibenclamide         CYP2C9          CYP2C9
Glimepiride           CYP2C9          CYP2C9
Glipizide             CYP2C9          CYP2C9
Granisetron           CYP3A4          None
Hydrochlorothiazide   Renal           Renal
Ibuprofen             CYP2C9          CYP2C9
Ifosfamide            CYP3A4          None
Imatinib              CYP3A4          CYP3A4
Imipramine            CYP2D6          CYP2D6
Indomethacin          CYP2C9          None
Irbesartan            CYP2C9          CYP2C9/OATP
Itraconazole          CYP3A4          CYP3A4
Ketoconazole          CYP3A4          CYP3A4
Ketorolac             Renal           Renal
Lamivudine            Renal           Renal
Lansoprazole          CYP3A4          CYP3A4
Leflunomide           CYP3A4          CYP3A4
Levetiracetam         Renal           Renal
Lopinavir             CYP3A4          CYP3A4
Lovastatin            CYP3A4          CYP3A4
Meloxicam             CYP2C9          CYP2C9
Metformin             Renal           Renal
Methadone             CYP3A4          CYP3A4
Methotrexate          Renal           Renal
Methylprednisolone    CYP3A4          CYP3A4
Metoprolol            CYP2D6          CYP2D6
Midazolam             CYP3A4          CYP3A4
Montelukast           CYP2C9          OATP
Nelfinavir            CYP3A4          CYP3A4
Nevirapine            CYP3A4          CYP3A4
Nifedipine            CYP3A4          CYP3A4
Nortriptyline         CYP2D6          CYP2D6
Olmesartan            OATP            OATP
Ondansetron           CYP3A4          CYP3A4
Oxybutynin            CYP3A4          CYP3A4
Oxycodone             CYP3A4          None
Paroxetine            CYP2D6          CYP2D6
Penicillin G          Renal           Renal
Phenytoin             CYP2C9          CYP2C9
Pitavastatin          OATP            OATP
Pramipexole           Renal           Renal
Pravastatin           OATP            OATP
Prednisolone          CYP3A4          CYP3A4
Prednisone            CYP3A4          CYP3A4
Procainamide          Renal           Renal
Propranolol           CYP2D6          CYP2D6
Quetiapine            CYP3A4          CYP3A4
Quinidine             CYP3A4          CYP3A4
Quinine               CYP3A4          CYP3A4
Ranitidine            Renal           Renal
Repaglinide           OATP            OATP
Rifampin              OATP            OATP
Risedronate           Renal           Renal
Risperidone           CYP2D6          CYP3A4
Ritonavir             CYP3A4          CYP3A4
Rosuvastatin          OATP            OATP
Sildenafil            CYP3A4          CYP3A4
Simvastatin acid      OATP            OATP
Sirolimus             CYP3A4          CYP3A4
Tacrolimus            CYP3A4          CYP3A4
Tadalafil             CYP3A4          CYP3A4
Tamoxifen             CYP3A4          CYP3A4
Tamsulosin            CYP3A4          CYP3A4
Telmisartan           OATP            OATP
Temocapril            OATP            OATP
Temocaprilat          OATP            OATP
Tenofovir             Renal           Renal
Timolol               CYP2D6          None
Tolbutamide           CYP2C9          CYP2C9
Tolterodine           CYP2D6          CYP2D6
Topiramate            Renal           Renal
Topotecan             Renal           Renal
Toremifene            CYP3A4          CYP3A4
Trazodone             CYP3A4          CYP3A4
Trimethoprim          Renal           Renal
Valsartan             OATP            OATP
Venlafaxine           CYP2D6          CYP2D6
Verapamil             CYP3A4          CYP3A4
Vincristine           CYP3A4          CYP3A4
Vinorelbine           CYP3A4          CYP3A4
Warfarin              CYP2C9          CYP2C9
Zolpidem              CYP3A4          CYP3A4
and feature selection based on the greedy algorithm compared with our
previous rectangular method (Kusama et al., 2010). Moreover, we
confirmed that improved prediction performance could be realized using
our current descriptor selection method compared with random feature
selection.
The rectangular method focuses mainly on empirical knowledge
and intuitiveness. However, because it requires heavy computation,
difficulties were encountered when the number of clearance
pathways to be predicted or the number of parameters
used for prediction was expanded. Concerning the algorithms used for
such predictors, the SVM was found to be superior to the rectangular
method: the rectangular method uses linear boundaries, whereas the SVM uses
nonlinear boundaries for classification. In silico prediction software
(PreADMET) can generate a large number of physicochemical parameters
in addition to the four default descriptors we employed in the
rectangular-based predictor, and we aimed to include these parameters
as well as other structural information. The greedy algorithm tends to
yield locally optimal solutions, but it is fast. Greedy
algorithm–based feature selection was among the most appropriate
approaches here because the learning process, which is time-consuming
(12 minutes per SVM), had to be repeated many times. Some of the descriptors selected by the greedy algorithm were associated with VDW
surface area and lipophilicity (Table 6). VDW power is an important
parameter to determine the affinity between an enzyme and a compound,
as reported in protein-ligand docking software (Ewing et al., 2001).
Lipophilicity-related measures, such as log D and protein binding, were
among the parameters used in the rectangular-based predictor and serve
as default descriptors in the current investigation. It is reasonable
that lipophilicity plays a key role in determining the clearance pathways of drugs because the octanol-water partition coefficient
(log P) was reported to be selected as an important descriptor for the
prediction of metabolic enzymatic activity (Sakiyama et al., 2008) and
the separation of drugs cleared through metabolism from those cleared
via urinary or biliary excretion (Obach et al., 2008; Varma et al., 2009;
Benet et al., 2011).
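The greedy selection strategy discussed above can be illustrated with a minimal sketch of greedy forward selection. This is not the authors' implementation: the `score` callback stands in for whatever performance measure is used (in this study it would come from training and cross-validating an SVM), and the descriptor names and gains in `TOY_GAIN` are invented purely so the example runs without any machine learning library.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    start: Sequence[str] = (),
) -> Tuple[List[str], float]:
    """Repeatedly add the single descriptor that most improves `score`.

    Stops as soon as no single addition helps, so the result can be a
    local optimum rather than the best possible subset.
    """
    selected = list(start)
    best = score(selected)
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        # Score every one-descriptor extension of the current subset.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top = max(trials)
        if top_score <= best:
            break  # no improvement available: accept the current subset
        selected.append(top)
        remaining.remove(top)
        best = top_score
    return selected, best


# Hypothetical stand-in for cross-validated SVM performance: each
# descriptor contributes a made-up gain (names and numbers are assumptions).
TOY_GAIN = {"logD": 0.20, "MW": 0.10, "charge": 0.15, "vdw_area": 0.12, "noise": -0.05}


def toy_score(subset: List[str]) -> float:
    return 0.2 + sum(TOY_GAIN[d] for d in subset)


selected, best = greedy_forward_selection(list(TOY_GAIN), toy_score)
print(selected, round(best, 2))
```

With n candidate descriptors, this loop calls `score` at most n(n + 1)/2 + 1 times, whereas an exhaustive search would need to score all 2^n subsets. That difference is what makes a fast heuristic attractive when every evaluation requires training an SVM.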
Authorship Contributions
Participated in research design: Toshimoto, Wakayama, Kusama, Maeda,
Sugiyama, Akiyama.
Fig. 3. Histogram of the probability density of prediction accuracy of SVM by
random feature selection. Ten million SVMs were generated by randomly selecting
the same number of descriptors for each clearance pathway to verify the statistical
significance of a feature selection based on a greedy algorithm compared with
random selection. The histogram exhibited a bimodal distribution, and an apparent
averaged value of this distribution is 0.49 ± 0.082 (mean ± S.D.) when fitted to
a normal distribution. The arrow denotes the accuracy of the feature selection SVM.
The probability density of accuracy of 1.0 × 10^7 predictors prepared
by random feature selection exhibited a bimodal distribution with
a mode of 0.57, which is similar to that of the default descriptor SVM
(0.60). This indicated that random feature selection did not result in
improved prediction performance. Conversely, the accuracy of the
feature-selected SVM was 0.87, which suggests that it is an outlier in
the probability density distribution of the accuracy in 1.0 × 10^7 predictors constructed using random feature selection, and thus, feature
selection based on the greedy algorithm effectively contributes to the
improvement of prediction performance.
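To give a rough sense of how far 0.87 lies from the random-selection results, the sketch below places it on the normal distribution fitted in Fig. 3 (mean 0.49, S.D. 0.082). This is an illustrative calculation only, not an analysis reported in the study, and it rests on a loud assumption: the observed histogram is bimodal, so the normal fit is only an approximation and the tail probability should be read as an order-of-magnitude indication.

```python
from statistics import NormalDist

# Values quoted in the text and the Fig. 3 caption.
FITTED_MEAN = 0.49
FITTED_SD = 0.082
SELECTED_ACCURACY = 0.87   # feature-selected SVM
N_RANDOM = 10_000_000      # randomly constructed predictors

# Assumption: treat random-selection accuracy as normally distributed.
fitted = NormalDist(mu=FITTED_MEAN, sigma=FITTED_SD)

# Distance of the feature-selected SVM from the fitted mean, in S.D. units.
z = (SELECTED_ACCURACY - FITTED_MEAN) / FITTED_SD

# Probability that a random predictor reaches at least this accuracy,
# and the corresponding expected count among all random predictors.
upper_tail = 1.0 - fitted.cdf(SELECTED_ACCURACY)
expected_at_least = upper_tail * N_RANDOM

print(f"z = {z:.2f}")
print(f"upper-tail probability = {upper_tail:.2e}")
print(f"expected random predictors at or above 0.87 = {expected_at_least:.0f}")
```

Under this approximation the feature-selected SVM sits more than 4.5 standard deviations above the fitted mean, which is consistent with describing it as an outlier relative to random feature selection.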
As a future perspective, we must first expand the number of predictable clearance pathways. In addition to the five clearance pathways
targeted in this research, CYP1A2, CYP2C8, CYP2C19, and glucuronide conjugation are recognized as other important clearance
pathways, and they will be included in our predictor. However, it is
difficult to achieve satisfactory learning with a small amount of
training data. In such a situation, it may be necessary to improve
the algorithm by excluding unnecessary data that make classification
boundaries indistinct (Tomek, 1976) and by introducing distribution correction (Sugiyama and Müller, 2005). Second, we need to
carefully assess the possibility of overlearning. Overlearning is also
called overfitting, in which a learning system excessively fits its learning data, resulting in suboptimal predictions for new external data. This
occurs when the learning procedure becomes longer because the
machine starts to memorize the data and not learn from it or when the
learning data are atypical. Random oversampling tends to cause
overlearning because the model is trained on duplicated copies of the same data (Batista et al.,
2004). Because there are more negative answers than positive ones
regarding clearance pathway prediction, oversampling will be likely
to occur for positive data. This may result in an increased number
of false-negative predictions. To mitigate overfitting, we performed
10-fold cross-validation 10 times, but further verification could be
provided using external data.
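The idea of excluding data that blur classification boundaries can be sketched with Tomek links: pairs of samples from different classes that are each other's nearest neighbor. The example below is a minimal illustration under stated assumptions, not part of the present predictor. It uses plain Euclidean distance, invented two-dimensional points, and removes both members of each link; removing only the majority-class member is a common variant.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Sequence[float]


def tomek_links(X: Sequence[Point], y: Sequence[int]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair is a Tomek link when the two samples carry different labels
    and each is the other's nearest neighbor.
    """
    n = len(X)
    nearest = [
        min((j for j in range(n) if j != i), key=lambda j: dist(X[i], X[j]))
        for i in range(n)
    ]
    return [
        (i, j)
        for i, j in enumerate(nearest)
        if i < j and nearest[j] == i and y[i] != y[j]
    ]


# Invented toy data: label 1 = "cleared by this pathway", 0 = "not".
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (2.0, 0.0), (2.2, 0.0)]
y = [0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)

# Drop both members of every link to sharpen the class boundary.
drop = {i for pair in links for i in pair}
X_clean = [x for i, x in enumerate(X) if i not in drop]
y_clean = [label for i, label in enumerate(y) if i not in drop]

print(links)
print(y_clean)
```

In the toy data, only the two samples that sit next to each other across the class boundary are flagged. Mutual nearest neighbors with the same label are left untouched.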
In the future, we must also account for the drugs cleared by multiple
clearance pathways, as 39 of 230 drugs were found to be cleared
through multiple clearance pathways in our data set. Regulatory guidance from the US Food and Drug Administration and European Medicines Agency stated that if the contribution of a clearance pathway was
>25%, sponsors were encouraged to test the effect of inhibitors for that
pathway on the pharmacokinetics of that drug (European Medicines
Agency, 2010; Food and Drug Administration, 2012).
Although there are in vitro data on individual enzymes and transporters, there are few data on the overall clearance pathways in humans. Clearance pathway determination is important for assessing
pharmacokinetic variability in the clinical setting. However, the fact
that we could collect data for only 141 drugs for our data set indicates
that such data are not readily available. Moreover, investigations on
transporters have only been conducted for recently developed drugs.
In such cases, industrial scientists, as well as regulatory and clinical
scientists, can use this prediction system to evaluate the pharmacokinetic properties of a compound.
In conclusion, we developed a highly accurate in silico prediction
system for major clearance pathways of drugs based on SVM and
a feature selection based on the greedy algorithm. This method will
enable drug discovery researchers to predict possible elimination
clearance mechanisms using only chemical structures prior to chemical synthesis, and it can be used to examine whether an intensive
investigation of the predicted major clearance pathways of drugs
should be performed by regulatory scientists for new chemical
entities.
Conducted experiments: Toshimoto, Wakayama, Kusama, Maeda.
Contributed new reagents or analytic tools: Toshimoto, Akiyama.
Performed data analysis: Toshimoto, Wakayama, Kusama, Maeda, Sugiyama,
Akiyama.
Wrote or contributed to the writing of the manuscript: Toshimoto, Wakayama,
Kusama, Maeda, Sugiyama, Akiyama.
References
Batista GE, Prati RC, and Monard MC (2004) A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations 6:20–29.
Benet LZ, Broccatelli F, and Oprea TI (2011) BDDCS applied to over 900 drugs. AAPS J 13:519–547.
Benet LZ, Oie S, and Schwartz JB (1996) Design and optimization of dosage regimens: pharmacokinetic data, in Goodman & Gilman’s The Pharmacological Basis of Therapeutics (Goodman LS, Gilman A, Hardman JG, Gilman AG, and Limbird LE eds) 9th ed, pp 1707–1792, McGraw-Hill, New York.
Caruana R and Freitag D (1994) Greedy Attribute Selection, ICML, New Brunswick, NJ.
Cheng F, Li W, Zhou Y, Shen J, Wu Z, Liu G, Lee PW, and Tang Y (2012) admetSAR: a comprehensive source and free tool for assessment of chemical ADMET properties. J Chem Inf Model 52:3099–3105.
Emoto C, Murayama N, Rostami-Hodjegan A, and Yamazaki H (2010) Methodologies for investigating drug metabolism at the early drug discovery stage: prediction of hepatic drug clearance and P450 contribution. Curr Drug Metab 11:678–685.
European Medicines Agency (2010) Guideline on the Investigation of Drug Interactions, European Medicines Agency, London. http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2010/05/WC500090112.pdf.
Ewing TJ, Makino S, Skillman AG, and Kuntz ID (2001) DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases. J Comput Aided Mol Des 15:411–428.
Food and Drug Administration (2012) Guidance for Industry: Drug Interaction Studies—Study Design, Data Analysis, Implications for Dosing, and Labeling Recommendations, Food and Drug Administration, Silver Spring, MD. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM292362.pdf.
Hsu CW and Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13:415–425.
Kusama M, Toshimoto K, Maeda K, Hirai Y, Imai S, Chiba K, Akiyama Y, and Sugiyama Y (2010) In silico classification of major clearance pathways of drugs with their physiochemical parameters. Drug Metab Dispos 38:1362–1370.
Lewis DF and Ito Y (2010) Human CYPs involved in drug metabolism: structures, substrates and binding affinities. Expert Opin Drug Metab Toxicol 6:661–674.
Lipinski CA, Lombardo F, Dominy BW, and Feeney PJ (2001) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev 46:3–26.
Liu Y, An A, and Huang X (2006) Boosting prediction accuracy on imbalanced datasets with SVM ensembles, in Advances in Knowledge Discovery and Data Mining (Ng WK, Kitsuregawa M, Li J, and Chang K eds), pp 107–118, Springer-Verlag, Berlin.
Mishra NK, Agarwal S, and Raghava GP (2010) Prediction of cytochrome P450 isoform responsible for metabolizing a drug molecule. BMC Pharmacol 10:8.
Obach RS, Lombardo F, and Waters NJ (2008) Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 670 drug compounds. Drug Metab Dispos 36:1385–1405.
Sakiyama Y, Yuki H, Moriya T, Hattori K, Suzuki M, Shimada K, and Honma T (2008) Predicting human liver microsomal stability with machine learning techniques. J Mol Graph Model 26:907–915.
Sugiyama M and Müller KR (2005) Input-dependent estimation of generalization error under covariate shift. Stat Decis 23:249–279.
Thummel KE and Shen DD (2001) Design and optimization of dose regimens: pharmacokinetic data, in Goodman & Gilman’s The Pharmacological Basis of Therapeutics (Gilman A and Goodman LS eds) 10th ed, pp 1917–2024, McGraw-Hill, New York.
Thummel KE, Shen DD, Isoherranen N, and Smith HE (2006) Design and optimization of dosage regimens: pharmacokinetic data, in Goodman & Gilman’s The Pharmacological Basis of Therapeutics (Gilman AG, Brunton LL, Gilman A, Goodman LS, Lazo JS, and Parker KL eds) 11th ed, pp 1787–1888, McGraw-Hill, New York.
Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6:769–772.
van de Waterbeemd H and Gifford E (2003) ADMET in silico modelling: towards prediction paradise? Nat Rev Drug Discov 2:192–204.
Vapnik VN (2000) The Nature of Statistical Learning Theory, 2nd ed, Springer-Verlag, New York.
Varma MV, Feng B, Obach RS, Troutman MD, Chupka J, Miller HR, and El-Kattan A (2009) Physicochemical determinants of human renal clearance. J Med Chem 52:4844–4852.
Venkatakrishnan K, von Moltke LL, Court MH, Harmatz JS, Crespi CL, and Greenblatt DJ (2000) Comparison between cytochrome P450 (CYP) content and relative activity approaches to scaling from cDNA-expressed CYPs to human liver microsomes: ratios of accessory proteins as sources of discrepancies between the approaches. Drug Metab Dispos 28:1493–1504.
Wu CY and Benet LZ (2005) Predicting drug disposition via application of BCS: transport/absorption/elimination interplay and development of a biopharmaceutics drug disposition classification system. Pharm Res 22:11–23.
Yamashita F, Hara H, Ito T, and Hashida M (2008) Novel hierarchical classification and visualization method for multiobjective optimization of drug properties: application to structure-activity relationship analysis of cytochrome P450 metabolism. J Chem Inf Model 48:364–369.
Yoshida K, Maeda K, and Sugiyama Y (2013) Hepatic and intestinal drug transporters: prediction of pharmacokinetic effects caused by drug-drug interactions and genetic polymorphisms. Annu Rev Pharmacol Toxicol 53:581–612.
Yu LX, Amidon GL, Polli JE, Zhao H, Mehta MU, Conner DP, Shah VP, Lesko LJ, Chen ML, and Lee VH, et al. (2002) Biopharmaceutics classification system: the scientific basis for biowaiver extensions. Pharm Res 19:921–925.
Address correspondence to: Dr. Yutaka Akiyama, Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, 2-12-1-W8-76, Ookayama, Meguro-ku, Tokyo 152-8552, Japan. E-mail: [email protected]