SMS Spam Filtering Through Optimum-Path Forest-Based Classifiers

Dheny Fernandes, Department of Computing, São Paulo State University, Bauru, SP, Brazil ([email protected])
Kelton A. P. da Costa, Department of Computing, São Paulo State University, Bauru, SP, Brazil ([email protected])
Tiago A. Almeida, Department of Computer Science, Federal University of São Carlos, Sorocaba, SP, Brazil ([email protected])
João Paulo Papa, Department of Computing, São Paulo State University, Bauru, SP, Brazil ([email protected])

Abstract—In recent years, SMS messaging has proven to be a profitable source of revenue for the cell-phone industry, being one of the most widely used communication systems to date. However, this very same scenario has led spammers to concentrate their attention on spreading spam through SMS, with some success due to the lack of proper tools to cope with the problem. In this paper, we introduce the Optimum-Path Forest classifier to the context of spam filtering in SMS messages and compare it against some state-of-the-art supervised pattern recognition techniques. We show promising results with a user-friendly classifier that requires minimum user interaction and little knowledge about the dataset.

I. INTRODUCTION

The Short Message Service (SMS) is a text-based communication system that operates on standard communication protocols. The service, launched in 1992, achieved considerable success and is today one of the most widely used tools for message interchange. The growth of this service is sustained by high profits: worldwide revenues related to SMS reached around 128 billion dollars in 2011 [1], and the estimate for 2016 is about 153 billion dollars. Moreover, 9.5 trillion SMS messages were sent worldwide in 2009.

This scenario of widespread use and profitability became propitious for exploiting SMS-based systems in order to obtain personal advantages. In addition, the price of sending an SMS is very low in several countries, or even free of charge in a number of locations worldwide. Added to this picture is the lack of proper spam-filtering software, which makes the practice feasible. In fact, some reports indicate that spam in SMS systems corresponds to 30% of the entire traffic volume [2].

In order to cope with this threat, researchers have pursued different approaches. A method based on CAPTCHA was presented by Shirali-Shahreza and Shirali-Shahreza [3], in which a CAPTCHA test is prepared as soon as an SMS is ready to be sent. Such a test contains a message with an image and a multiple-choice question, in which the sender must choose the answer that represents the image and send it back to the receiver. If the answer is correct, the sender is considered legitimate and the original SMS is delivered. However, such an approach might not be scalable due to the amount of SMS messages sent daily.

Machine learning-based approaches have also been employed to learn the behavior of spammers and thus detect such messages automatically. Gunal et al. [4], for instance, employed Bayesian classifiers to identify spam messages in mobile environments. Their approach used information gain and chi-square metrics to select the best set of features, and two different Bayesian classifiers were then used for classification purposes. Nuruzzaman et al. [5] employed the well-known Support Vector Machines (SVMs) to detect spam in SMS messages as well, achieving reasonable accuracy (around 90% recognition rate). Later on, Almeida et al. [6] presented promising results on a new dataset made available by the very same group of authors. Very recently, Al-Hasan et al. [7] employed the Dendritic Cell Algorithm for feature selection and decision fusion aiming at spam detection in SMS messages.

Despite the efforts to date, there is still no definitive solution to this problem, because established techniques for automatic spam filtering have their performance degraded when dealing with SMS messages. This is mainly due to the fact that such messages are usually very short and often rife with idioms, slang, symbols, emoticons and abbreviations, which make even tokenization a challenging task. Furthermore, classification errors should be weighted differently, since an incorrectly blocked ham (legitimate message) is often more harmful than an uncaught spam.

Some years ago, Papa et al. [8], [9], [10] introduced the Optimum-Path Forest (OPF) classifier, which models the task of pattern recognition as a graph partition problem, in which samples are encoded as graph nodes and a predefined adjacency relation connects them. The graph partition is ruled by a competition process among some key nodes (prototypes) that aim at conquering the remaining samples by offering them optimum-path costs. Nowadays, the reader can find two variants of the supervised OPF classifier: (i) OPF with complete graph [9], [10] (OPFcpl), and (ii) OPF with k-NN graph [8] (OPFknn). These techniques employ the very same OPF algorithm, but with different configurations.

However, as far as we know, OPF has never been applied to the context of SMS spam classification. Therefore, the main contribution of this paper is to validate both versions of the supervised OPF for the task of spam identification in SMS messages. Since OPFcpl is parameterless and OPFknn has only one parameter, we show here that they can obtain very promising results with minimum user interaction.

The remainder of this paper is organized as follows: Sections II and III present the OPF background theory and the methodology, respectively. Section IV discusses the experimental results, while Section V states conclusions and guidelines for future work.

II. OPTIMUM-PATH FOREST

The OPF classifier models the pattern recognition problem as a graph partition task, in which the feature vector extracted from each sample is regarded as a graph node. The arcs are defined by some predefined adjacency relation and weighted by a distance metric applied to the corresponding feature vectors. The main idea of OPF-based classifiers is to rule a competition process among some key samples (i.e., prototypes) in order to conquer the remaining nodes. Such a process partitions the graph into optimum-path trees (OPTs) rooted at each prototype node, thus defining discrete influence regions for each OPT. The competition is carried out by means of path-cost functions offered to each sample by the prototype nodes. Notice that different path-cost functions may lead to different graph partitions.

Papa et al. [8], [9], [10] presented two different versions of the supervised OPF classifier: one that makes use of a complete graph (OPFcpl) [9], [10] and another that employs a k-NN graph (OPFknn) [8]. Both versions work similarly, i.e., they employ the very same OPF algorithm, but with the following modifications: (i) adjacency relation, (ii) methodology to estimate prototypes, and (iii) path-cost function.
The next sections present in more detail the two variants employed in this work.

A. OPF with Complete Graph

Let Z be a dataset such that Z = Z1 ∪ Z2, where Z1 and Z2 stand for the training and test sets, respectively. Each sample s ∈ Z is represented by its feature vector v(s) ∈ R^n and mapped into a graph node. Let G = (V, A) be such a graph, in which A is an adjacency relation that connects all pairs of nodes (i.e., a complete graph), and V contains the feature vectors v(s), ∀s ∈ Z. For the sake of explanation, we will refer to v(s) simply as s. In addition, let λ(·) be a function that assigns the true label to each sample in Z.

Consider a path πs in G with terminus at s, and a function f(πs) that associates a value to this path. The OPFcpl algorithm aims at minimizing f(πs) for every sample s using a path-cost function that computes the maximum arc weight along a path, i.e.:

    fmax(⟨s⟩) = 0 if s ∈ S, +∞ otherwise,
    fmax(π · ⟨s, t⟩) = max{fmax(π), d(s, t)},    (1)

where π · ⟨s, t⟩ is the concatenation of the path πs and the arc ⟨s, t⟩, and d(s, t) measures the dissimilarity between adjacent nodes. The set S ⊆ V stands for the prototype samples. Since we usually have a training and a test set, OPFcpl also comprises a training and a test phase: the former is in charge of building an optimum-path forest by means of a competition process using fmax, while the latter is used to evaluate its performance.

1) Training: Let S* ⊂ V1 be the set of prototype samples that start the competition process. Papa et al. [9] proposed selecting such samples as the nearest elements from different classes, i.e., the idea is to place the key samples in the regions most prone to classification errors (the class boundaries). To fulfil this purpose, we can compute a minimum spanning tree (MST) over the graph derived from the training set, i.e., G1 = (V1, A), where V1 contains all feature vectors extracted from the training samples, and then simply mark the elements from different classes that are connected in the MST as prototype nodes, thus composing the final S*. Notice that such a set is composed of training samples only. The training step then outputs an optimum-path forest over G1, as implemented by Algorithm 1.

Algorithm 1 – OPFcpl Training Step
INPUT:     A λ-labeled training graph V1, prototypes S* ⊂ V1, and the pair (v, d) for feature vector and distance computation.
OUTPUT:    Optimum-path forest P, path-value map V and label map L.
AUXILIARY: Priority queue Q and variable cst.
 1. For each s ∈ V1, set P(s) ← nil and V(s) ← +∞.
 2. For each s ∈ S*, set V(s) ← 0, P(s) ← nil, L(s) ← λ(s) and insert s in Q.
 3. While Q is not empty, do
 4.   Remove from Q a sample s such that V(s) is minimum.
 5.   For each t ∈ V1 such that s ≠ t and V(t) > V(s), do
 6.     Compute cst ← max{V(s), d(s, t)}.
 7.     If cst < V(t), then
 8.       If V(t) ≠ +∞, then remove t from Q.
 9.       P(t) ← s, L(t) ← L(s) and V(t) ← cst.
10.       Insert t in Q.

Lines 1–2 initialize the path-value and label maps for all samples s ∈ V1 and prototypes s ∈ S*; the predecessors are initialized as nil. In Lines 5–10, the algorithm checks whether the path that reaches an adjacent sample t through s has a lower cost than the path that currently ends at t. In that case, t is conquered by s, which becomes the predecessor of t; sample t then receives the label of s, and its cost value is updated. This very same process is repeated for all training samples in Lines 3–10.
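To make the competition process above concrete, the sketch below gives a minimal, illustrative implementation of the OPFcpl training step in Python. It is not the LibOPF code used in the experiments: the function names are ours, the priority-queue handling is kept naïve for readability, and the prototype selection simply marks MST edges that cross class boundaries, as described in the text.

```python
import heapq
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def find_prototypes(X, y):
    """Mark as prototypes the MST-connected training samples from different classes."""
    D = cdist(X, X)                              # pairwise distances d(s, t)
    mst = minimum_spanning_tree(D).toarray()     # MST over the complete training graph
    prototypes = set()
    for s, t in zip(*np.nonzero(mst)):
        if y[s] != y[t]:                         # edge crossing a class boundary
            prototypes.update((s, t))
    return prototypes

def opf_cpl_train(X, y):
    """Algorithm 1: competition among prototypes using the f_max path-cost function."""
    n = X.shape[0]
    D = cdist(X, X)
    S = find_prototypes(X, y)
    cost = np.full(n, np.inf)                    # path-value map V
    label = y.copy()                             # label map L
    pred = np.full(n, -1)                        # predecessor map P
    heap = []
    for s in S:
        cost[s] = 0.0                            # prototypes start with zero cost
        heapq.heappush(heap, (0.0, s))
    done = np.zeros(n, dtype=bool)
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:                              # stale heap entry, skip
            continue
        done[s] = True
        for t in range(n):
            if t == s or cost[t] <= cost[s]:
                continue
            cst = max(cost[s], D[s, t])          # f_max offered to t through s
            if cst < cost[t]:                    # t is conquered by s
                cost[t], label[t], pred[t] = cst, label[s], s
                heapq.heappush(heap, (cst, t))
    return cost, label, pred
```

Classification of a test sample then amounts to evaluating max{cost[s], d(s, t)} over all training samples s and keeping the label of the minimizer, as formalized next in Equation (2).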
2) Classification: Let G2 = (V2, A) be the graph derived from the test set Z2. For any sample t ∈ V2, we consider all arcs connecting t with samples s ∈ V1, as if t were part of the original training graph. Considering all possible paths between S* and t, we wish to find the optimum path P*(t) from S* to t, labelling t with the class λ(R(t)) of the prototype R(t) ∈ S* that is most strongly connected to it. This path can be identified incrementally by evaluating the optimum cost value V(t) as follows:

    V(t) = min_{∀s ∈ V1} max{V(s), d(s, t)}.    (2)

B. OPF with k-NN Graph

Papa and Falcão [8] introduced a variant of the supervised OPF presented in the previous section in which both nodes and arcs are weighted. In this variant, hereafter OPFknn, the adjacency relation is defined over a k-neighbourhood graph, i.e., G = (V, Ak), where Ak stores the k-nearest neighbours of each sample in V. This approach can be seen as a dual of OPFcpl, since we now seek to maximize the cost of each sample in V, and the prototypes are placed in the regions with a high concentration of samples. Once again, the approach comprises a training and a test phase, as described below.

1) Training: The training step of OPFknn builds the optimum-path forest using an algorithm similar to the one presented in the previous section (Algorithm 1), but with a different path-cost function and adjacency relation. As mentioned above, OPFknn places the prototypes in regions with a high density of samples. To fulfil this purpose, we first associate a probability density function (pdf) with each node of the training graph G1 = (V1, Ak), as follows:

    ρ(s) = 1 / (√(2πσ²) |A(s)|) · Σ_{∀t ∈ A(s)} exp(−d²(s, t) / (2σ²)),    (3)

where σ = df/3 and df is the largest arc length in G1. OPFknn aims at maximizing the path-cost function for all training samples according to fmin, as follows:

    fmin(⟨t⟩) = ρ(t) if t ∈ S, ρ(t) − δ otherwise,
    fmin(πs · ⟨s, t⟩) = min{fmin(πs), ρ(t)},    (4)

where δ is a sufficiently small number that avoids splitting an influence zone into multiple zones of influence. Algorithm 2 implements the main OPFknn concepts.

Algorithm 2 – OPFknn Algorithm
INPUT:     A training graph G1 = (V1, Ak), λ(s) for all s ∈ V1 and a path-value function fmin.
OUTPUT:    Label map L, path-value map V and the optimum-path forest P.
AUXILIARY: Priority queue Q and variable tmp.
 1. For each s ∈ V1, set P(s) ← nil, V(s) ← ρ(s) − δ, L(s) ← λ(s) and insert s in Q.
 2. While Q is not empty, do
 3.   Remove from Q a sample s such that V(s) is maximum.
 4.   If P(s) = nil, then
 5.     V(s) ← ρ(s).
 6.   For each t ∈ Ak(s) such that V(t) < V(s), do
 7.     tmp ← min{V(s), ρ(t)}.
 8.     If tmp > V(t), then
 9.       L(t) ← L(s), P(t) ← s and V(t) ← tmp.
10.       Update the position of t in Q.
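For illustration only, the following Python sketch mirrors the density estimation of Equation (3) and the maximization carried out by Algorithm 2; it is not the LibOPF implementation, and the names (opf_knn_train, k, delta) are ours. The k-NN adjacency is taken as directed, the stale-entry handling in the max-heap is a simplification, and ties are resolved arbitrarily.

```python
import heapq
import numpy as np
from scipy.spatial.distance import cdist

def opf_knn_train(X, y, k, delta=1e-6):
    """Sketch of Equation 3 plus Algorithm 2: density-weighted nodes and
    maximization of the f_min path-cost over a k-NN adjacency."""
    n = X.shape[0]
    D = cdist(X, X)
    neigh = np.argsort(D, axis=1)[:, 1:k + 1]     # k nearest neighbours, self excluded
    d_f = max(D[s, t] for s in range(n) for t in neigh[s])   # largest arc length
    sigma = d_f / 3.0
    # Equation 3: Gaussian pdf estimated over each k-neighbourhood
    rho = np.array([
        np.exp(-D[s, neigh[s]] ** 2 / (2 * sigma ** 2)).sum()
        / (np.sqrt(2 * np.pi * sigma ** 2) * k)
        for s in range(n)
    ])
    cost = rho - delta                            # V(s) initialized with rho(s) - delta
    label = y.copy()                              # L(s) initialized with the true labels
    pred = np.full(n, -1)                         # P(s) initialized as nil
    heap = [(-cost[s], s) for s in range(n)]      # max-heap via negated costs
    heapq.heapify(heap)
    removed = np.zeros(n, dtype=bool)
    while heap:
        _, s = heapq.heappop(heap)
        if removed[s]:                            # stale entry, skip
            continue
        removed[s] = True
        if pred[s] == -1:                         # s is elected a prototype (pdf maximum)
            cost[s] = rho[s]
        for t in neigh[s]:
            if cost[t] < cost[s]:
                tmp = min(cost[s], rho[t])        # f_min offered to t through s
                if tmp > cost[t]:                 # t is conquered by s
                    label[t], pred[t], cost[t] = label[s], s, tmp
                    heapq.heappush(heap, (-tmp, t))
    return cost, label, pred, rho
```

In the actual classifier, k is not fixed by hand but chosen as described next (Algorithm 3), and test samples are classified through Equation (6) over their k nearest training neighbours.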
As mentioned above, OPFknn works similarly to OPFcpl, but the idea now is to maximize the cost map instead of minimizing it. However, the main problem with the naïve fmin is that it cannot guarantee one prototype per class. In order to cope with this shortcoming, we can use a modified version of fmin, as follows:

    f*min(⟨t⟩) = ρ(t) if t ∈ S, ρ(t) − δ otherwise,
    f*min(πs · ⟨s, t⟩) = −∞ if λ(t) ≠ λ(s), min{f*min(πs), ρ(t)} otherwise.    (5)

The above equation weights every arc (s, t) ∈ Ak such that λ(t) ≠ λ(s) with −∞, thus preventing such arcs from belonging to any optimum path (i.e., it avoids sample t being conquered by a sample s from another class).

Finally, OPFknn has a parameter k to be fine-tuned, which controls the size of the window used to compute the pdf of each training node. Papa and Falcão [11] proposed an exhaustive search for k ∈ [1, kmax] that maximizes the accuracy over the training set, kmax being a parameter defined by the user. In short, the idea is to use the value k* that maximizes the OPFknn accuracy over the training set. Algorithm 3 implements this procedure.

Algorithm 3 – OPFknn Training Step
INPUT:     Training graph G1, λ(s) for all s ∈ V1, kmax and path-value functions fmin and f*min.
OUTPUT:    Label map L, path-value map V and optimum-path forest P.
AUXILIARY: Variables i, k, k*, MaxAcc ← −∞, Acc, and arrays FP and FN of size c.
 1. For k = 1 to kmax, do
 2.   Create a graph G1 = (V1, Ak) with nodes weighted through Equation 3.
 3.   Compute (L, V, P) using Algorithm 2 with fmin.
 4.   For each class i = 1, 2, ..., c, do
 5.     FP(i) ← 0 and FN(i) ← 0.
 6.   For each sample t ∈ Z1, do
 7.     If L(t) ≠ λ(t), then
 8.       FP(L(t)) ← FP(L(t)) + 1.
 9.       FN(λ(t)) ← FN(λ(t)) + 1.
10.   Compute the accuracy Acc.
11.   If Acc > MaxAcc, then
12.     k* ← k and MaxAcc ← Acc.
13.   Destroy graph G1 = (V1, Ak).
14. Create graph G1 = (V1, Ak*) with nodes weighted through Equation 3.
15. Compute (L, V, P) using Algorithm 2 with f*min.

2) Classification: A sample t ∈ V2 can be assigned to a given class i, i = 1, 2, ..., c, simply by identifying which root (prototype) offers the optimum path to it, as if this sample were part of the forest. Considering the k closest training neighbours of t, we use Equation 3 to compute ρ(t) and then evaluate the following equation:

    V(t) = max_{∀(s,t) ∈ Ak*} min{V(s), ρ(t)}.    (6)

The label of t will be the same as that of its predecessor P*(t).

III. METHODOLOGY

In order to evaluate OPF-based classifiers for the task of SMS spam filtering, we employed the SMS Spam Collection dataset [12], which is composed of 5,574 messages, of which 747 are spam and 4,827 are ham. First, we extracted the tokens of all messages in the dataset, creating a dictionary (bag of words) D with 12,622 words; we did not perform any other pre-processing step, such as stemming or stop-word removal, since such techniques tend to hurt spam-filtering accuracy [13], [14]. Based on this dictionary, each message is represented by a binary feature vector: the position corresponding to a dictionary token is set to "1" if that token occurs in the message, and to "0" otherwise.

As spam filtering is sensitive to chronological order, cross-validation is not recommended for assessing the algorithms' performance [6]. Therefore, we used the same protocol employed by Almeida et al. [6]: a stratified holdout validation with the first 30% of the messages for training (1,674 messages) and the remaining 70% for testing (3,900 messages), so that the earliest messages were used to train the algorithms and the newest ones to test them.
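As an illustration of this protocol, the sketch below builds the binary bag-of-words representation and the chronological 30/70 split described above. The tokenizer is an assumption made only for the example (plain whitespace splitting, lower-cased); the original work does not detail its tokenization, and, as stated, no stemming or stop-word removal is applied.

```python
import numpy as np

def tokenize(message):
    # assumption: whitespace tokenization, lower-cased; the paper does not
    # specify its tokenizer, and no stemming or stop-word removal is used
    return message.lower().split()

def build_dictionary(messages):
    """Bag-of-words dictionary built over the whole collection."""
    vocab = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: j for j, tok in enumerate(vocab)}

def binary_bow(messages, vocab):
    """Binary feature vectors: 1 if the dictionary token occurs in the message."""
    X = np.zeros((len(messages), len(vocab)), dtype=np.uint8)
    for i, msg in enumerate(messages):
        for tok in tokenize(msg):
            j = vocab.get(tok)
            if j is not None:            # tokens outside the dictionary are ignored
                X[i, j] = 1
    return X

def chronological_split(messages, labels, train_fraction=0.30):
    """Earliest 30% of the collection for training, newest 70% for testing."""
    n_train = int(len(messages) * train_fraction)
    return (messages[:n_train], labels[:n_train],
            messages[n_train:], labels[n_train:])
```

With the full SMS Spam Collection this yields a training matrix of roughly 1,674 x 12,622 binary entries, which is the representation given to all classifiers compared next.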
Regarding the classification techniques, we compared OPFcpl and OPFknn against SVM, Artificial Neural Networks with Multilayer Perceptrons (ANN-MLP), and the well-known k-Nearest Neighbours classifier (k-NN). For SVM, we employed LibSVM [15] with a radial basis function kernel and parameters optimized through 5-fold cross-validation, with C ∈ {2^-5, 2^-3, ..., 2^13, 2^15} and γ ∈ {2^3, 2^1, ..., 2^-13, 2^-15}. With respect to ANN-MLP, we used the FANN library [16] and a 12,622:8:8:2 architecture, i.e., 12,622 input neurons (the dictionary size), two hidden layers with 8 neurons each, and two output neurons (one per class, i.e., spam or ham). For the OPF-based classifiers, we used LibOPF [17] with kmax = 10 for OPFknn. Finally, for k-NN we used our own implementation with the value of k ∈ {1, 2, ..., 334} that maximized the accuracy over the training set. Notice that these values were set empirically; the value 334 in the k interval corresponds to one fifth of the training set size. The experiments were conducted on a PC equipped with an Intel Pentium i3 3.07 GHz processor running Ubuntu 14.04.2 LTS as the operating system.

In order to evaluate the performance of all classifiers, we used the following well-known performance measures:
• Spam Caught (SC%): proportion of spam correctly classified as spam;
• Blocked Ham (BH%): proportion of ham incorrectly classified as spam;
• Matthews Correlation Coefficient (MCC): the correlation coefficient between the observed and predicted binary classifications, ranging from -1 to +1; -1 indicates total disagreement between prediction and observation, 0 means the result is no better than random prediction, and +1 represents a perfect prediction;
• Accuracy (ACC%): number of correct classifications divided by the total number of samples.
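For clarity, the measures above can be computed from the binary confusion matrix as in the short sketch below; it is our own helper, given only to make the definitions precise, and it treats spam as the positive class.

```python
import numpy as np

def spam_metrics(y_true, y_pred):
    """SC, BH, ACC and MCC with spam encoded as 1 (positive) and ham as 0."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # spam caught
    fn = np.sum((y_true == 1) & (y_pred == 0))   # spam missed
    fp = np.sum((y_true == 0) & (y_pred == 1))   # ham blocked
    tn = np.sum((y_true == 0) & (y_pred == 0))   # ham delivered
    sc = tp / (tp + fn)                          # Spam Caught
    bh = fp / (fp + tn)                          # Blocked Ham
    acc = (tp + tn) / (tp + tn + fp + fn)        # Accuracy
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return sc, bh, acc, mcc
```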
IV. RESULTS

In this section, we present the experimental results obtained with the methodology described above. Table I reports the results for the SC, BH, ACC and MCC measures. The most accurate result concerning ACC was achieved by the SVM classifier, followed by k-NN and OPFcpl. Although the idea behind SVM is to map the original data to a higher-dimensional space, in this paper we observe the opposite situation, since we have fewer training samples than features. Therefore, it seems SVM benefited from a lower-dimensional space.

TABLE I. EXPERIMENTAL RESULTS.
Technique   SC(%)    BH(%)    ACC(%)   MCC
ANN-MLP     0.9941   0.2958   0.7420   0.4830
k-NN        0.4204   0.0000   0.9243   0.6219
OPFcpl      0.4047   0.0000   0.9223   0.6095
OPFknn      0.3215   0.0000   0.8618   0.5418
SVM         0.8389   0.0012   0.9779   0.9000

The OPF-based classifiers correctly classified all ham messages (BH = 0%), although they misclassified a considerable number of spam messages (i.e., they obtained low SC values). However, the high recognition rates of SVM come at the price of a high computational burden, as shown in Table II. Notice that the training time also includes the parameter optimization step. OPFcpl was about 720 times faster than SVM in the training step, which may be interesting in situations that require online learning, i.e., when we need to learn the behavior of new spammers very quickly. In fact, OPFcpl requires only 0.023 s per training sample.

TABLE II. TRAINING TIME [s].
Technique   Time
ANN-MLP     33,599.875
k-NN         6,474.248
OPFcpl          38.426
OPFknn       1,639.422
SVM         27,703.853

Although SVM was the most accurate technique, there is still room for improvement regarding the OPF-based classifiers, together with a very considerable time budget in which to exploit OPF's efficiency. A recent article by Saito et al. [18] highlighted the importance of choosing classifiers based on both their effectiveness and their efficiency, which means we may use larger training sets for the faster techniques.

V. CONCLUSION

The amount of spam in SMS-based communication systems has increased in recent years, leading to loss of productivity and extra network load. In this paper, we introduced the Optimum-Path Forest framework to the context of spam detection in SMS messages. We validated two distinct supervised OPF techniques against SVM, Artificial Neural Networks and the k-NN classifier on a recently developed dataset. The experiments showed promising results for both OPF-based classifiers, although SVM obtained the best results, at the price of a high computational load. As future work, we aim to apply feature reduction techniques in order to obtain lower-dimensional feature spaces, since this characteristic seemed to lead SVM to very good results. Moreover, we also intend to employ established NLP techniques for text normalization and semantic indexing, because SMS messages are short and often rife with idioms, slang, symbols, emoticons and abbreviations, which can hurt the classifiers' performance.

VI. ACKNOWLEDGMENTS

The authors thank Capes, CNPq grants #306166/2014-3 and #470571/2013-6, and FAPESP grants #2014/16250-9 and #2015/00801-9 for the financial support.

REFERENCES

[1] Portio Research, "Analysis and Growth Forecasts for Mobile Messaging Markets Worldwide: 6th Edition," Tech. Rep., Portio Research, 2011.
[2] T. Landesman, "Cloudmark's 2013 Annual Global Messaging Threat Report," Tech. Rep., Cloudmark, 2014.
[3] M. H. Shirali-Shahreza and M. Shirali-Shahreza, "An Anti-SMS-Spam Using CAPTCHA," in International Colloquium on Computing, Communication, Control, and Management, 2008.
[4] S. Gunal, A. P. Uysal, S. Ergin, and E. S. Gunal, "A Novel Framework for SMS Spam Filtering," in International Symposium on Innovations in Intelligent Systems and Applications, 2012.
[5] M. Taufiq Nuruzzaman, C. Lee, and D. Choi, "Independent and Personal SMS Spam Filtering," in IEEE International Conference on Computer and Information Technology, 2011.
[6] T. A. Almeida, J. M. G. Hidalgo, and T. P. Silva, "Towards SMS Spam Filtering: Results under a New Dataset," International Journal of Information Security Science, vol. 2, pp. 1–18, 2013.
[7] A. A. Al-Hasan and E.-S. M. El-Alfy, "Dendritic cell algorithm for mobile phone spam filtering," Procedia Computer Science, vol. 52, pp. 244–251, 2015 (The 6th International Conference on Ambient Systems, Networks and Technologies).
[8] J. P. Papa and A. X. Falcão, "A New Variant of the Optimum-Path Forest Classifier," in Proceedings of the 4th International Symposium on Advances in Visual Computing, Berlin, Heidelberg, 2008, pp. 935–944, Springer-Verlag.
[9] J. P. Papa, A. X. Falcão, and C. T. N. Suzuki, "Supervised pattern classification based on optimum-path forest," International Journal of Imaging Systems and Technology, vol. 19, no. 2, pp. 120–131, 2009.
[10] J. P. Papa, A. X. Falcão, V. H. C. Albuquerque, and J. M. R. S. Tavares, "Efficient supervised optimum-path forest classification for large datasets," Pattern Recognition, vol. 45, no. 1, pp. 512–520, 2012.
[11] J. P. Papa and A. X. Falcão, "A learning algorithm for the optimum-path forest classifier," in Graph-Based Representations in Pattern Recognition, A. Torsello, F. Escolano, and L. Brun, Eds., vol. 5534 of Lecture Notes in Computer Science, pp. 195–204, Springer Berlin Heidelberg, 2009.
[12] T. A. Almeida, J. M. Gómez Hidalgo, and A. Yamakami, "Contributions to the study of SMS Spam Filtering: New Collection and Results," in ACM Symposium on Document Engineering (ACM DOCENG'11), 2011.
[13] G. V. Cormack, "Email Spam Filtering: A Systematic Review," Foundations and Trends in Information Retrieval, vol. 1, pp. 335–455, Now Publishers, 2008.
[14] L. Zhang, J. Zhu, and T. Yao, "An Evaluation of Statistical Spam Filtering Techniques," ACM TALIP, vol. 3, 2004.
[15] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[16] S. Nissen, "Implementation of a Fast Artificial Neural Network Library (FANN)," Department of Computer Science, University of Copenhagen (DIKU), 2003. Software available at http://leenissen.dk/fann/.
[17] J. P. Papa, C. T. N. Suzuki, and A. X. Falcão, "LibOPF: A library for the design of optimum-path forest classifiers," 2014. Software version 2.1 available at http://www.ic.unicamp.br/~afalcao/LibOPF.
[18] P. T. M. Saito, R. Y. M. Nakamura, W. P. Amorim, J. P. Papa, P. J. Rezende, and A. X. Falcão, "Choosing the most effective pattern classification model under learning-time constraint," PLoS ONE, vol. 10, 2015 (accepted for publication).