Challenge Kaggle: Transfer Learning on Stack Exchange Tags
KHOUFI Asma
MATMATI Hadhami
THIERRY Clément
a.khoufi@gmail.com
hadhami.mat@gmail.com
clement.thierry@u-psud.fr
Abstract
Stack Exchange is a network of question-and-answer websites that relies on a topic tagging system to help users find the most relevant topics they are looking for. We have been provided with data related to six different topics and have been challenged to predict the tags of a seventh by learning on the first six.
The originality of the challenge lies in the topic difference: instead of a usual text classification problem, we have to solve a transfer learning problem by learning from unrelated domains. As we will see, this difficulty has so far made all our attempts at transfer learning perform worse than simply applying a natural language processing method and analysis to the target topic.
1. Challenge description
1.1. Background
Founded in 2008 by Joel Spolsky and Jeff Atwood, Stack Exchange is a network of question-and-answer websites. It all started with the now famous Stack Overflow Q&A site about computer programming, and then extended to different fields. The sites are based on a reputation award process which allows them to be self-moderating. The network also relies on a topic tagging system to connect experts with questions they will be able to answer, by sorting the questions into specific, well-defined categories, and to help users find the most relevant topics they are looking for.
For this competition Stack Exchange provides us with data from six different platforms with different topics: biology, cooking, cryptography, DIY (do it yourself), robotics, and travel, and challenges us to predict appropriate physics tags by learning from those data sets.
At first glance it seems like a regular text classification problem, but the difficulty lies in the topic difference: how do we predict tags from models trained on unrelated topics?
This competition started October 28, 2016. The current best score is 0.67049; the second best is 0.29412, which is much closer to the following ones. The gap suggests a scoring anomaly for the first one, which is plausible for a competition with public data. Because of the intent of our project and in the spirit of the competition, we will try to create a strategy that does not use any external information related to the physics Stack Exchange site.
1.2. Evaluation metric
The evaluation metric for this competition is the mean F1 score. The F1 score measures accuracy using the statistics precision p and recall r. Precision is the ratio of true positives (tp) to all predicted positives (tp + fp). Recall is the ratio of true positives to all actual positives (tp + fn). The F1 score is given by:

$$F_1 = 2\,\frac{p \cdot r}{p + r}, \qquad p = \frac{tp}{tp + fp}, \qquad r = \frac{tp}{tp + fn} \tag{1}$$
We will use the same metric for validation: we will test our different methods by applying them to the other topics.
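
As an illustration, here is a minimal sketch of how such a mean F1 over predicted tag sets could be computed (a hypothetical helper, not the official Kaggle scorer):

    def mean_f1(predicted, actual):
        # predicted, actual: lists of tag sets, one pair per question
        scores = []
        for pred, true in zip(predicted, actual):
            tp = len(pred & true)                 # true positives
            if tp == 0:
                scores.append(0.0)
                continue
            p = tp / len(pred)                    # precision = tp / (tp + fp)
            r = tp / len(true)                    # recall    = tp / (tp + fn)
            scores.append(2 * p * r / (p + r))    # per-question F1
        return sum(scores) / len(scores)

    # Example: one shared tag out of two predicted and two actual gives F1 = 0.5
    print(mean_f1([{'oven', 'baking'}], [{'baking', 'bread'}]))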
2. Data
2.1. Data description
The corpus is provided by the Kaggle competition (https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags/data) and is composed of data sets from different fields. For each question we have the title, the content, and the associated tags, as provided by the Stack Exchange sites. The content of each question is given as HTML. The tags are words or phrases that describe the topic of the question.
There are six different topics: biology, cooking, cryptography, DIY, robotics and travel. Table 1 and Figure 1 show the number of questions per topic. We notice that some domains are more popular than others, and the number of entries varies from one domain to another.

Table 1. Number of questions per topic

    Topic          Number of questions
    Cooking        15404
    Biology        13196
    Cryptography   10432
    DIY            25918
    Robotics        2771
    Travel         19279
Figure 1. Percentage of questions per topic
The training set is composed of 87,000 questions. Overall, 225,138 tags have been used, 4,268 of which are unique, for an average of 2.59 tags per question.
In addition, we are provided with the test set, which comprises 81,926 questions from physics.stackexchange.com, totaling 6,352,313 words. To predict the potential tags for each question, we use the title and the question content.
Figure 2. Number of words per content and per tag
2.2. Data pre-processing
We applied pre-processing operations to the data before experimenting with it. The content of each question is given as HTML, so we removed every HTML tag and URI from the content of the questions using the BeautifulSoup module. Other pre-processing operations were removing the punctuation from the titles and contents, and removing the English stop words from the titles and contents using a stop-words module. We also split each tag string into a list of tags.
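
As an illustration, here is a minimal sketch of this cleaning step, assuming NLTK's English stop-word list plays the role of the stop-words module (function names are ours):

    import re
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords  # assumes the NLTK stopwords corpus is downloaded

    STOP_WORDS = set(stopwords.words('english'))

    def clean_question(html):
        # Strip HTML tags, then URIs, from the question content
        text = BeautifulSoup(html, 'html.parser').get_text()
        text = re.sub(r'https?://\S+', ' ', text)
        # Remove punctuation and lowercase everything
        text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())
        # Drop English stop words
        return ' '.join(w for w in text.split() if w not in STOP_WORDS)

    def split_tags(tag_string):
        # Tags come as a single space-separated string per question
        return tag_string.split()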
2.3. Data exploration
Figure 2 illustrates the number of words per topic, per content and per tag after pre-processing.
The word clouds illustrated in Figures 3, 4, 5, 6, 7 and 8 show that each topic has its own keywords; the most frequent word is displayed largest. For example, for the robotics topic the most frequent words are robot, motor and sensor. We notice that many words that appear in the question title and the question content also appear in the tags.
Figure 3. Word cloud of the cryptography topic (1:Question title,
2:Question content, 3:Tags)
Figure 4. Word cloud of the cooking topic (1:Question title,
2:Question content, 3:Tags)
Figure 5. Word cloud of the robotics topic (1: Question title, 2: Question content, 3: Tags)

Table 2. Number of words per tag for each tagged set

    Set        1 word   2 words   3 words   4 words
    Biology      291      782        22        0
    Cooking      403      848        12        2
    Travel       282     1691        99       12
    Robotics     112      268         3        0
    Crypto       119      460        67        2
    DIY          387      848        34        0
Figure 6. Word cloud of the travel topic (1:Question title, 2:Question content, 3:Tags)
Figure 7. Word cloud of the DIY topic (1:Question title, 2:Question content, 3:Tags)
Figure 10. Tags’ occurrence
Figure 9 shows the word cloud of the physics topic for the question title and the question content. The word "time" is frequent, so we can suppose that it is used in tags. Moreover, we notice that the same words appear in different topics, so we decided to compute the occurrence of each tag in each topic.
Figure 8. Word cloud of the biology topic (1:Question title,
2:Question content, 3:Tags)
Figure 10 shows, for example, that the word "untagged" appears in all the topics, whereas the word "lemonade" appears only in the tags of the cooking topic. So we can say that some words characterize a specific topic.
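
As a sketch, the per-topic tag occurrence can be counted as follows (assuming a pandas data frame with 'topic' and 'tags' columns, which are our own names):

    from collections import Counter

    def tag_occurrences(df):
        # df has a 'topic' column and a 'tags' column holding lists of tags
        return {topic: Counter(t for tags in group['tags'] for t in tags)
                for topic, group in df.groupby('topic')}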
We also analyzed the number of words per tag, as a tag can be composed of multiple words: for example, "quantum-physics" is composed of two words. As we can see in Table 2 and Figure 11, most of the tags are made of one or two words, more often two than one.
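
The words-per-tag counts of Table 2 can be reproduced with a short sketch like this (tags being hyphen-separated phrases):

    from collections import Counter

    def words_per_tag(tag_lists):
        # 'quantum-physics' -> 2 words, 'oven' -> 1 word
        return Counter(len(tag.split('-')) for tags in tag_lists for tag in tags)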
Figure 9. Word cloud of the physics topic (1:Question title,
2:Question content)
Finally, we plotted the percentage of tag words that appear exactly in the title or in the content. As we can see in Figure 12, most topics have half of their questions without any tag word in the title. More questions have their tags in the content, as seen in Figure 13, but even then a lot of tags are missing from the content: for example, crypto still has no tag words in the content for 40% of its questions.
Figure 12. Percentage of tag exactly in the title of the question
Figure 11. Word per tags for each of the topics
We plotted the percentage of tags that are in either the title or the content in Figure 14. The results show that, except for crypto and biology, only about 30% of questions have their tags in neither the title nor the content, most topics having about 20% of their tags in either one. Biology has almost 60% of its questions containing none of their tags, crypto almost 40%. We can also see that, except for robotics, all topics have fewer than 10% of their questions with all of their tags included in the title or the content. From this analysis, we deduce that we will most certainly not be able to derive tags only from the words contained in the content and title of the questions: doing so would be wrong at least 30% of the time, and even when some predicted tags are right, we will rarely get all of them, since only about 10% of questions contain all of their tags.
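
A hedged sketch of this last measurement, counting the questions whose tag words all appear in the title or content (column names are assumptions):

    def pct_all_tags_present(df):
        # Fraction of questions whose tag words all occur in title or content
        hits = 0
        for _, row in df.iterrows():
            text_words = set(row['title'].split()) | set(row['content'].split())
            tag_words = {w for tag in row['tags'] for w in tag.split('-')}
            if tag_words and tag_words <= text_words:
                hits += 1
        return hits / len(df)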
Figure 13. Percentage of tag exactly in the content of the question
3. Baseline
3.1. Tag extraction through tfidf
As a baseline, we tried to get results based on the most common methods currently used in the challenge, which rely on word frequency. We computed the tf-idf using the titles of each entry, the content, and both at the same time. The baseline results we obtained are in Table 3. We can see that using only the title seems to give better results with the tf-idf.
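
A minimal sketch of this tf-idf baseline, predicting each question's top-weighted words as its tags (the number of tags per question is an assumption):

    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_tags(texts, n_tags=3):
        # Fit tf-idf on the physics questions only (titles, contents, or both)
        vectorizer = TfidfVectorizer(stop_words='english')
        tfidf = vectorizer.fit_transform(texts)
        vocab = vectorizer.get_feature_names_out()
        tags = []
        for i in range(tfidf.shape[0]):
            row = tfidf[i].toarray().ravel()
            top = row.argsort()[::-1][:n_tags]    # highest-weighted words
            tags.append([vocab[j] for j in top if row[j] > 0])
        return tags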
Table 3. Baseline results for title, content, and both for the test set using the mean F1 metric.

    Data              Score
    Title             0.08271
    Title + content   0.05719
    Content           0.05021

We used only single-word tags, even though, as we saw in Table 2, most of the tags are made of two words. A possible improvement to this strategy would be to include word pairs in our word list.

This strategy does not actually use transfer learning, as we only analyze the physics question corpus and ignore all the other data sets when tagging each question.

Figure 14. Percentage of tags exactly in the content or the title of the question

4. Solution

As we wanted a transfer learning approach to the challenge, we mainly tried deep learning methods. Deep learning is applicable to domain adaptation because a model can be trained on a large amount of labeled data from the source domain, here the training set (the six tagged topics), and a large amount of unlabeled data from the target domain, here the test set (the untagged physics topic), by saving the layers and especially the weight values learned on the training data.

4.1. Data preprocessing

In this section we describe the preprocessing pipeline for our input data. After loading the different training sets, we combine them into a single data frame and apply the cleaning process, in which we filter out symbols and lowercase and lemmatize the words. We also remove HTML tags, combine the question titles and contents when needed, and separate the tags into lists. As we use deep learning models, we transform our data set into a [words x documents] corpus so it can be accepted as input by our models.

4.2. Processing

4.2.1. Convolutional Neural Network
A Convolutional Neural Network (CNN or ConvNet) is a specific type of neural network. The CNN architecture is an ensemble of multiple layers, alternating between convolution and pooling layers, and ending with a fully connected layer. A classic CNN architecture would look like this:

Input → Conv → ReLU → Conv → ReLU → Pool → ReLU → Conv → ReLU → Pool → Fully Connected
In the convolution layer we slide, or convolve, a filter (n-grams) over the input data, multiplying the values in the filter with the original data (i.e., computing element-wise multiplications). Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters and the amount of computation in the network. The fully connected layer takes an input volume (whatever the output of the preceding conv, ReLU or pool layer is) and outputs an N-dimensional vector, where N is the number of classes the program has to choose from. The ReLU (Rectified Linear Unit) is an activation function that increases the non-linear properties of the decision function.
The idea behind using a CNN is to take a pre-trained CNN model and remove the last output layer in order to extract the features it has learned, as is commonly done with CNNs trained on the ImageNet dataset. This is usually called transfer learning: layers from other CNNs can be reused as feature extractors for different problems, since the first-layer filters of a CNN work as edge detectors and can thus serve as general feature detectors. In our case, the main features will be the keywords of each question, and thus the tags.
In our processing pipeline, we used the following architecture for the CNN model:
Input → 250 ∗ Conv → MaxPooling → Dropout → ReLU → Fully Connected

Table 4. CNN results for title, content, and both for the test set using the mean F1 metric.

    Data              Score
    Title             0.07325
    Title + content   0.05620
    Content           0.05018
We use 250 one-dimensional convolution filters of length 3, as we noticed that questions are generally tagged with three words. Then we apply max pooling, selecting the max value of the final convolution matrix for every three words. We then use a dropout layer to prevent overfitting, followed by a ReLU activation function. We train this CNN on our training data and then extract the weights of the ReLU layer to detect the important features/words of our target topic. The results of this method are displayed in Table 4.
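
A minimal Keras sketch of this architecture; the embedding layer, vocabulary size, sequence length and dropout rate are our assumptions, since the report does not specify them:

    from tensorflow.keras import layers, models

    VOCAB_SIZE = 20000   # assumed vocabulary size
    MAX_LEN = 300        # assumed maximum question length in words
    N_TAGS = 4268        # unique tags in the training set

    model = models.Sequential([
        layers.Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
        # 250 one-dimensional convolution filters of length 3
        layers.Conv1D(250, 3),
        # Max pooling over the convolution output
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.5),
        # ReLU layer whose activations are later reused as features
        layers.Dense(250, activation='relu'),
        # One sigmoid output per candidate tag (multi-label prediction)
        layers.Dense(N_TAGS, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

    # Transfer step: drop the output layer and reuse the ReLU
    # layer's activations as features on the physics questions.
    feature_extractor = models.Model(inputs=model.input,
                                     outputs=model.layers[-2].output)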
4.2.2. Latent Dirichlet Allocation

Faced with the lack of an obvious way to do transfer learning on this problem, we wanted to try more classical methods to find our tags.
Latent Dirichlet Allocation (LDA) is a generative statistical model that allows us to assign a topic to each Stack Exchange post in the data. Each topic has a list of words, each with a probability of appearing for that topic. This model gives us a way to get words that are related to the question and more useful as tags, but that may not actually occur in the question. This method only needs an analysis of the physics corpus, so it is not actually transfer learning, which would mean learning unrelated topics and words from the other texts.
LDA gives us topics and the words related to them. We first tried to tag each post with the most probable words from each of these topics, and unsurprisingly got worse results than with the tf-idf; see in Table 5 the method "best 5 tags", where we take the 5 most probable tags, and "random from best 15", where we take 5 random tags from the best 15 in order to get different tags for each question. With this method we obviously had tags unrelated to the question, as we would always get the same tags for every question assigned to a given topic. This would solve the problem of tags that are words not present in the question, since we would get the most popular words from a topic related to the question, but we still need to find words closely related to the question itself. The computational time of LDA being too long to run many experiments, we could not use as many topics as we wanted, settling for 14, the largest number we could compute in a reasonable time (less than a day).

Table 5. LDA results with each method we tried

    Method                   Score
    Best 20 words in text    0.03861
    Best 5 words in text     0.02866
    Random from best 15      0.00862
    Best 5 tags              0.00824
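
A hedged scikit-learn sketch of the first method, tagging each question with the most probable words of its most probable topic (parameters are illustrative):

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    def lda_tags(texts, n_topics=14, n_tags=5):
        vectorizer = CountVectorizer(max_features=10000, stop_words='english')
        counts = vectorizer.fit_transform(texts)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        doc_topics = lda.fit_transform(counts)   # document-topic distribution
        vocab = vectorizer.get_feature_names_out()
        tags = []
        for dist in doc_topics:
            topic = dist.argmax()                # most probable topic
            top = lda.components_[topic].argsort()[::-1][:n_tags]
            tags.append([vocab[i] for i in top])
        return tags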
Another method we tried was to select, from the most popular words of each topic, only the words that also occur in the question; see in Table 5 the method "Best 5 words in text". This method has a higher computational cost that increases with the number of words we select from for each topic (selecting from the best 5 is much quicker than selecting from the best 20), but it gives better results than the first method we used with LDA. It still gives worse results than our baseline. It also only uses words that are in the text, which partly defeats the purpose of LDA, as we can no longer get related tags that are not in the text. A solution would again be to use more topics, in order to get more specific tags usable for tagging.
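
This variant only keeps, among a topic's top words, those that actually occur in the question; for instance (hypothetical helper, reusing the notation of the sketch above):

    def tags_in_text(question_words, topic_top_words, n_tags=5):
        # Keep only the topic's popular words that occur in the question
        return [w for w in topic_top_words if w in question_words][:n_tags]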
4.2.3. Future ideas

Currently we think that simple natural language processing algorithms applied only to the target data would do better than transfer learning from the training data to the target data in most cases, an opinion shared by several people on the Kaggle discussion forums of this challenge. But one idea we did not have time to explore, and that we think may give better results, is to learn a generic way to find the position of the tags in a question relative to the words around them. Basically, the idea is to use a Markov model the way it is usually used for part-of-speech tagging, but instead of tagging the different grammatical parts of sentences, we would tag, for each word, its proximity to the tags in the text, while differentiating the position of the sentence in the text, as the tags may appear more often in the title or in the first sentence of the question than in the last one. If this kind of approach had worked, we might have obtained a better result, as the learning process would have been totally independent of the nature of the words and topics.
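
As an illustration of this unexplored idea, the training sequences might label each word by whether it is a tag word, analogous to POS-tag sequences (a toy sketch under our own labeling scheme):

    def label_sequence(words, tag_words):
        # Label each word 'TAG' if it belongs to a tag, 'O' otherwise;
        # a Markov model would then be trained on these label sequences.
        return [(w, 'TAG' if w in tag_words else 'O') for w in words]

    # Example on a cooking question tagged with {'oven', 'baking'}
    print(label_sequence('my oven burns every cake when baking'.split(),
                         {'oven', 'baking'}))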
5. Conclusion
The problem of learning the tags of a forum from unrelated topics is a difficult one: few transfer learning methods can be applied to this kind of problem, and an analysis of the target forum's posts is usually preferred and gives better results. Part of the problem is the unknown nature of the possible labels, as we are not given the possible tags for the physics topic, and apparently the best results on the leaderboard all use a list of tags taken from the Stack Exchange website. We think that this is against the spirit of the competition and did not use it. We compared our results to ones easily obtained by calculating the tf-idf of the physics topic, to see whether a transfer learning method would beat the simplest approach. We tried using a Convolutional Neural Network, but got worse results than with the tf-idf; we might get better results by fine-tuning the CNN, but we lack the infrastructure to do that in any reasonable time. We tried using LDA and got a larger choice of tags for each question, at the cost of topic precision, giving us worse results for a much lower computational time. With more time to compute, or a new method to select appropriate tags from the topics that LDA gives us, it could become a better tagging method. There are some other methods we thought about but could not try, such as using a Markov model to tag each word's closeness to the tags.
References
Kaggle. Transfer Learning on Stack Exchange Tags. https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags/data