Challenge Kaggle: Transfer Learning on Stack Exchange Tags

KHOUFI Asma (a.khoufi@gmail.com)
MATMATI Hadhami (hadhami.mat@gmail.com)
THIERRY Clément (clement.thierry@u-psud.fr)

Abstract

Stack Exchange is a network of question-and-answer websites that relies on a topic tagging system to help users find the most relevant topics they are looking for. We have been provided with data related to six different topics and have been challenged to predict the tags of a seventh one by learning on the first six. The originality of the challenge lies in the topic difference: instead of a usual text classification problem, we have to solve a transfer learning problem by learning from unrelated domains. As we will see, this difficulty currently makes all our attempts at transfer learning perform worse than simply applying natural language processing and analysis to the target topic alone.

1. Challenge description

1.1. Background

Founded in 2008 by Joel Spolsky and Jeff Atwood, Stack Exchange is a network of question-and-answer websites. It all started with the now famous Stack Overflow Q&A site about computer programming, and then extended to different fields. The sites are based on a reputation award process which allows them to be self-moderating. The network also relies on a topic tagging system that connects experts with questions they can answer, by sorting the questions into specific, well-defined categories, and that helps users find the most relevant topics they are looking for.

For this competition, Stack Exchange provides us with data from six platforms on different topics: biology, cooking, cryptography, DIY (do it yourself), robotics, and travel, and challenges us to predict appropriate physics tags by learning from those data sets. At first glimpse it seems like a regular text classification problem, but the difficulty lies in the topic difference: how do we predict tags with models trained on unrelated topics?

This competition started October 28, 2016. The current best score is 0.67049; the second best is 0.29412, which is much closer to the following ones. The gap suggests a scoring anomaly for the first entry, which is plausible for a competition whose data is public. Because of the intent of our project and in the spirit of the competition, we will build a strategy that does not use any external information related to the physics Stack Exchange site.

1.2. Evaluation metric

The evaluation metric for this competition is the mean F1 score. The F1 score measures accuracy using the statistics precision $p$ and recall $r$. Precision is the ratio of true positives ($tp$) to all predicted positives ($tp + fp$). Recall is the ratio of true positives to all actual positives ($tp + fn$). The F1 score is given by:

$$F_1 = 2\,\frac{p \cdot r}{p + r}, \qquad \text{where } p = \frac{tp}{tp + fp}, \quad r = \frac{tp}{tp + fn} \tag{1}$$

We will use the same metric for validation: we will test our different methods by applying them to the other subjects.
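For illustration, here is a minimal sketch of how this metric can be computed over sets of predicted and true tags; the function names and example tags are ours, not part of any competition kit.

```python
def f1_score(predicted, actual):
    """F1 score for a single question, comparing two sets of tags."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # true positives
    if tp == 0:
        return 0.0
    p = tp / len(predicted)       # precision = tp / (tp + fp)
    r = tp / len(actual)          # recall    = tp / (tp + fn)
    return 2 * p * r / (p + r)

def mean_f1(all_predicted, all_actual):
    """The competition metric: F1 averaged over all questions."""
    return sum(f1_score(p, a) for p, a in zip(all_predicted, all_actual)) / len(all_actual)

# One of two predicted tags is correct, so p = r = 0.5 and F1 = 0.5.
print(mean_f1([["quantum-mechanics", "time"]], [["quantum-mechanics", "energy"]]))
```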
2. Data

2.1. Data description

The corpus is provided by the Kaggle competition (https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags/data) and is composed of data sets from different fields. For each question we have the title, the content, and the associated tags, as found on the Stack Exchange sites. The content of each question is given as HTML. The tags are words or phrases that describe the topic of the question. There are six different topics: biology, cooking, cryptography, diy, robotics and travel.

Table 1 and Figure 1 show the number of questions per topic. We notice that some domains are more popular than others: the number of entries varies from one domain to another.

Table 1. Number of questions per topic

Topic          Number of questions
Cooking        15404
Biology        13196
Cryptography   10432
DIY            25918
Robotics        2771
Travel         19279

[Figure 1. Percentage of questions per topic]

The training set is composed of 87,000 questions. Overall, 225,138 tags have been used, of which 4,268 are unique, for an average of 2.59 tags per question. In addition, we are provided with a test set of 81,926 questions from physics.stackexchange.com, composed of 6,352,313 words. In order to predict the potential tags for each question, we should use the title and question content.

2.2. Data pre-processing

We applied pre-processing operations to the data before experimenting with them. The content of each question is given as HTML, so we removed every HTML tag and URI from the content of the questions using the BeautifulSoup module. We also removed the punctuation from the titles and contents and removed the stop words from them using the stopwords module for English. Finally, we split the tags string into a list of tags.
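The following is a condensed sketch of this cleaning pipeline. It assumes NLTK's English stopword list and a pandas DataFrame loaded from one of the competition CSV files; the file name and column names follow the competition data but should be checked against the actual files.

```python
import string
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOP = set(stopwords.words('english'))
PUNCT = str.maketrans('', '', string.punctuation)

def clean(text):
    """Strip HTML markup, punctuation and English stop words from a question field."""
    text = BeautifulSoup(text, 'html.parser').get_text()  # drop HTML tags and URIs
    words = text.lower().translate(PUNCT).split()
    return ' '.join(w for w in words if w not in STOP)

df = pd.read_csv('cooking.csv')
df['title'] = df['title'].apply(clean)
df['content'] = df['content'].apply(clean)
df['tags'] = df['tags'].str.split()  # tags string -> list of tags
```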
2.3. Data exploration

Figure 2 illustrates the number of words per topic, per content and per tag after the pre-processing.

[Figure 2. Number of words per content and per tag]

The word clouds in Figures 3, 4, 5, 6, 7 and 8 show that each topic has its own keywords; the most frequent word is drawn the largest. For example, for the robotics topic the most frequent words are robot, motor and sensor. We notice that many words that appear in the question title and the question content also appear in the tags.

[Figure 3. Word cloud of the cryptography topic (1: question title, 2: question content, 3: tags)]
[Figure 4. Word cloud of the cooking topic (1: question title, 2: question content, 3: tags)]
[Figure 5. Word cloud of the robotics topic (1: question title, 2: question content, 3: tags)]
[Figure 6. Word cloud of the travel topic (1: question title, 2: question content, 3: tags)]
[Figure 7. Word cloud of the DIY topic (1: question title, 2: question content, 3: tags)]
[Figure 8. Word cloud of the biology topic (1: question title, 2: question content, 3: tags)]

Figure 9 shows the word cloud of the physics topic for the question titles and contents. The word "time" is frequent, so we can suppose that it will be used in tags.

[Figure 9. Word cloud of the physics topic (1: question title, 2: question content)]

Moreover, we notice that the same words appear in different topics, so we decided to compute the occurrence of each tag in each topic. Figure 10 shows, for example, that the word "untagged" appears in all the topics, whereas the word "lemonade" appears only in the tags of the cooking topic. So we can say that some words characterize one specific topic.

[Figure 10. Tags' occurrence]

We also ran an analysis of the number of words per tag, as a tag can be composed of multiple words; for example, "quantum-physic" is composed of two words. As we can see in Table 2 and in Figure 11, most of the tags are made of one or two words, more often two than one.

Table 2. Number of words per tag for each tagged set

Set        1 word   2 words   3 words   4 words
Biology       291       782        22         0
Cooking       403       848        12         2
Travel        282      1691        99        12
Robotics      112       268         3         0
Crypto        119       460        67         2
DIY           387       848        34         0

[Figure 11. Words per tag for each of the topics]

Finally, we plotted the percentage of tag words that appear exactly in the title or in the content of the question. As we can see in Figure 12, for most topics about half of the questions have none of their tag words in the title. More questions have their tags in the content, as seen in Figure 13, but even there many tags are missing: for crypto, for example, 40% of the questions still have none of their tags in the content.

[Figure 12. Percentage of tags appearing exactly in the title of the question]
[Figure 13. Percentage of tags appearing exactly in the content of the question]

Figure 14 shows the percentage of tags that are in either the title or the content. Except for crypto and biology, only about 30% of the questions have their tags in neither the title nor the content, with most topics having around 20% of their tags in either one. Biology has almost 60% of its questions not containing any of their tags, and crypto almost 40%. We can also see that, except for robotics, fewer than 10% of each topic's questions have all their tags included in either the title or the content.

[Figure 14. Percentage of tags appearing exactly in the content or the title of the question]

From this analysis, we deduce that we will most certainly not be able to derive tags only from the words contained in the title and content of the questions: doing so would be wrong at least 30% of the time, and even when some predicted tags are right, we will rarely get all of them, since only about 10% of the time are all the tags included in the question.

3. Baseline

3.1. Tag extraction through tf-idf

As a baseline, we tried to get results based on the most common methods currently used in the challenge, which rely on word frequency. We calculated the tf-idf over the titles of each entry, over the contents, and over both at the same time, keeping the highest-scoring words of each question as its tags. The baseline results we obtained are given in Table 3. We can see that using only the title seems to give the best results with tf-idf.

Table 3. Baseline results for title, content, and both, for the test set using the mean F1 metric.

Data              Score
Title             0.08271
Title + content   0.05719
Content           0.05021

We used only single-word tags, even though, as we saw in Table 2, most of the tags are made of two words. A possible improvement of this strategy would be to include word pairs in our word list. Note that this strategy does not actually use transfer learning: to tag each question, we only analyse the physics question corpus and ignore all the other data sets.
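A minimal sketch of this baseline follows, assuming test_df holds the cleaned physics questions produced by the pre-processing step; the choice of three tags per question is illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit tf-idf on the physics titles only: the baseline ignores the six
# training topics entirely.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(test_df['title'])
vocab = np.array(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

def top_tags(i, k=3):
    """Predict the k words of question i with the highest tf-idf weight."""
    scores = X[i].toarray().ravel()
    return vocab[np.argsort(scores)[::-1][:k]].tolist()

predictions = [top_tags(i) for i in range(X.shape[0])]
```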
4. Solution

Since we want a transfer learning approach to the challenge, we mainly tried deep learning methods. Deep learning is applicable to domain adaptation because a model can be trained on a large amount of labeled data from the source domain, here the training set of six tagged topics, and a large amount of unlabeled data from the target domain, here the test set of untagged physics questions, by reusing the layers, and especially the learned weights, from the training data.

4.1. Data preprocessing

After loading our different training sets, we combine them into a single data frame and apply the cleaning process, in which we filter out symbols and lowercase and lemmatize the words. We also remove HTML tags, combine the question titles and contents when needed, and separate the tags into lists. As we use deep learning models, we transform our data set into a [words x documents] corpus so it can be accepted as the input of our models.

4.2. Processing

4.2.1. Convolutional Neural Network

A Convolutional Neural Network (CNN, or ConvNet) is a specific type of neural network. The CNN architecture is a stack of layers, alternating between convolution and pooling layers and ending with a fully connected layer. A classic CNN architecture looks like this:

Input → Conv → ReLU → Conv → ReLU → Pool → ReLU → Conv → ReLU → Pool → Fully Connected

In the convolution layer we slide, or convolve, a filter (here over n-grams) across the input data, multiplying the values in the filter with the original data, i.e., computing element-wise multiplications. Another important concept of CNNs is pooling, a form of non-linear down-sampling: the pooling layer progressively reduces the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network. The fully connected layer takes an input volume (the output of the conv, ReLU or pool layer preceding it) and outputs an N-dimensional vector, where N is the number of classes to choose from. ReLU (Rectified Linear Unit) is an activation function that increases the nonlinear properties of the decision function.

The idea behind using a CNN comes from computer vision, where pre-trained CNN models have their last output layer removed in order to extract the features learned on the ImageNet data sets. This is usually called transfer learning, because layers from other CNNs can be used as feature extractors for different problems: since the first-layer filters of a CNN work as edge detectors, they can serve as general feature detectors elsewhere. In our case, the main features will be the keywords of each question, and thus the tags.

In our processing pipeline, we used the following architecture for the CNN model:

Input → 250 ∗ Conv → MaxPooling → Dropout → ReLU → Fully Connected

We use 250 one-dimensional convolution filters of length 3, as we noticed that questions are generally tagged with three words. We then apply max pooling, selecting the maximum value of the final convolution output over every three words, followed by a dropout layer to prevent overfitting and a ReLU activation. We train this CNN on our training data and then extract the weights of the ReLU layer to detect the important features, i.e. words, of our target topic. The results of this method are displayed in Table 4.

Table 4. CNN results for title, content, and both, for the test set using the mean F1 metric.

Data              Score
Title             0.07325
Title + content   0.05620
Content           0.05018
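A sketch of this architecture in Keras is given below. The report only fixes the 250 filters of length 3; the embedding dimension, vocabulary size, sequence length and dropout rate are our own illustrative placeholders.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # placeholder vocabulary size
SEQ_LEN = 200        # placeholder padded question length
N_TAGS = 4268        # unique tags in the six training topics

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128, input_length=SEQ_LEN),
    layers.Conv1D(250, 3, activation='relu'),    # 250 filters spanning 3 words
    layers.GlobalMaxPooling1D(),                 # keep the strongest response per filter
    layers.Dropout(0.5),                         # guard against overfitting
    layers.Dense(250, activation='relu'),        # the layer whose weights we inspect for keywords
    layers.Dense(N_TAGS, activation='sigmoid'),  # multi-label output, one unit per tag
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```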
4.2.2. Latent Dirichlet Allocation

Faced with the lack of an original method to do transfer learning on this problem, we wanted to try more classical methods to find our tags. Latent Dirichlet Allocation (LDA) is a generative statistical model that allows us to assign a topic to each Stack Exchange post in the data. Each topic comes with a list of words, each having a probability of appearing under that topic. This model gives us a way to obtain words that are related to the question, and thus useful as tags, even though they may not actually appear in the question. This method only needs an analysis of the physics corpus, so it is not actually transfer learning: learning on the other texts would only give us unrelated topics and words.

LDA gives us topics and the words related to them. We first tried to tag each post with the most probable words of its topics, and unsurprisingly got worse results than with tf-idf; see in Table 5 the method "best 5 tags", where we take the 5 most probable tags, and "random from best 15", where we take 5 random tags from the best 15 in order to get different tags for each question. With this method we obviously get tags unrelated to the question, since every question assigned to the same topics receives the same tags. The approach would solve the problem of tags whose words are not in the question, as we would obtain the most popular words of a topic related to the question, but we still need words closely related to the question itself.

Table 5. LDA results for each method we tried.

Method                  Score
Best 20 words in text   0.03861
Best 5 words in text    0.02866
Random from best 15     0.00862
Best 5 tags             0.00824

The computational time of LDA being too long for much experimentation, we could not use as many topics as we wanted, settling for 14, the largest number we could fit in a reasonable time (less than a day).

Another method we tried was to select, from the most popular words of each topic, only the words that also appear in the question; see the methods "best 5 words in text" and "best 20 words in text" in Table 5. The computational time grows with the number of candidate words per topic (selecting from the best 5 is much quicker than selecting from the best 20), but this method gives better results than our first use of LDA. It still performs worse than our baseline, and it only uses words that are in the text, which partly defeats the purpose of LDA, since we can no longer obtain related tags that are absent from the text. A solution would again be to use more topics, so as to have more specific tags available for tagging.
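Below is a sketch of the "best k words in text" variant using gensim (the report does not name a library, so this choice, the helper names and the passes setting are ours); texts is assumed to hold the tokenized, cleaned physics questions.

```python
from gensim import corpora, models

# texts: list of token lists, one per cleaned physics question
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

# 14 topics was the most we could fit in a reasonable time
lda = models.LdaModel(bow_corpus, num_topics=14, id2word=dictionary, passes=5)

def tags_for(doc_bow, doc_tokens, k=5, topn=20):
    """'Best k words in text': top LDA words of the document's dominant
    topic that also appear in the question itself."""
    topic_id = max(lda.get_document_topics(doc_bow), key=lambda t: t[1])[0]
    candidates = [word for word, _ in lda.show_topic(topic_id, topn=topn)]
    tokens = set(doc_tokens)
    return [w for w in candidates if w in tokens][:k]

predictions = [tags_for(b, t) for b, t in zip(bow_corpus, texts)]
```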
4.2.3. Future Ideas

We currently think that, in most cases, simple natural language processing algorithms applied only to the target data do better than transfer learning from the training data to the target data, an opinion shared by several people on the Kaggle discussion forums of this challenge. One idea we did not have time to explore, however, and that we think may give better results, is to learn a generic way to find the position of the tags in a question relative to the words around them. Our idea was to use a Markov model the way they are usually used for part-of-speech tagging, but instead of tagging the grammatical parts of sentences, we would tag, for each word, its proximity to the tags in the text, while differentiating the position of the sentence in the text, since tags may appear more often in the title or in the first sentence of a question than in the last one. If this kind of approach worked, we might obtain better results, as the learning process would be totally independent of the nature of the words and topics.

5. Conclusion

The problem of learning the tags of a forum from unrelated topics is a difficult one: there are not many transfer learning methods that can be applied to this kind of problem, and an analysis of the target forum's posts is usually preferred and gives better results. Part of the problem is the unknown nature of the possible labels, as we are not given the possible tags for the physics topic; apparently the best results on the leaderboard all use a list of tags taken from the Stack Exchange website. We think this is against the spirit of the competition and did not use it.

We compared our results to those we easily obtained by computing the tf-idf of the physics topic, to see whether a transfer learning method would beat the simplest approach. We tried a Convolutional Neural Network, but got worse results than with tf-idf; we might do better by fine-tuning the CNN, but we lack the infrastructure to do that in any reasonable time. We tried LDA and got a larger choice of tags for each question, at the cost of topical precision, which gave us worse results, though for a much lower computational cost. With more time to compute, or a new method to select appropriate tags from the topics LDA gives us, it could become a better tagging method. There are other methods we thought about but could not try, like using a Markov model to tag each word's closeness to the tags.

References

Kaggle. Transfer Learning on Stack Exchange Tags. https://www.kaggle.com/c/transfer-learning-on-stack-exchange-tags/data