Classification Of Resumes Using Text Mining and Machine Learning Techniques
Danielle Gaal & Lindsey Register
Supervisors: Dr. Cuixian Chen and Dr. Yishi Wang
UNCW Department of Mathematics and Statistics

Introduction

Businesses advertising open positions often receive hundreds, or even thousands, of resumes from those seeking employment. The task then falls on the Human Resources Department to sift through the applicants for prospective hires. Given the sheer number of resumes, however, a search carried out by hand can easily overwhelm the department's resources. An automated approach therefore becomes an increasingly attractive option, even if only to reduce the candidate pool to a more reasonable size for an HR team to handle. Text mining, the process of deriving useful information from text, can be used to create and train a model for classifying resumes. Introducing text analysis to the candidate search can thus help to expedite the selection process and promote business efficiency.

Data Cleaning & Pre-processing

Initial Document Term Matrix (documents: 100, terms: 14786)
Non-/sparse entries: 34431 / 1444169
Sparsity: 98%
Maximal term length: 89
Weighting: term frequency (tf)

Cleaning steps applied to the initial matrix: remove “white space”, remove punctuation, remove numbers, remove “stop words”.

[Figure: decision tree illustration for the Random Forest panel. Source: https://ask100people.wordpress.com/2010/04/13/grocery-shopping-decision-tree/]

Data Collection & Human Perception

Coming into this study, we knew we wanted to create a model that would help to categorize the resumes that any company might receive. To further specialize our model, we focused primarily on categorizing resumes related to “data science”. Since we did not have a resume bank of our own to start with, we scraped resumes from online sources. This process resulted in a sample of 100 different resumes, which we saved individually as .txt files.
After collecting our sample, our lab rated each resume to serve as a “ground truth” for later in the analysis.

Word Frequencies

The word cloud represents the most frequent terms from the sample of resumes, with a maximum of 100 terms plotted.

Rating Dispersion

The pie chart indicates the breakdown of ratings given to the sample of resumes.

Revised Document Term Matrix (documents: 100, terms: 10562)
Non-/sparse entries: 29075 / 1027125
Sparsity: 97%
Maximal term length: 65
Weighting: term frequency (tf)

Only keep terms that occur in 40* or more resumes.
*Note: this value can be changed.

Revised Document Term Matrix (documents: 100, terms: 34)
Non-/sparse entries: 1865 / 1535
Sparsity: 45%
Maximal term length: 11
Weighting: term frequency (tf)

Random Forest

Random forest is a classification method that combines many decision trees to assign observations to classes. It is based on the idea that while each individual tree is a "weak classifier", the entire forest is a "strong classifier". In our case, the trees split on the inclusion or exclusion of words in the text document. Though the presence of a single word is not a strong classifier, the combined inclusion and exclusion of all the words the random forest chooses can be a very strong classifier.

Preliminary Results

Trial   Min. Occur.   # of Trees   OOB Error
1       10            20           48.33%
2       30            20           51.67%
3       30            100          28.33%
4       30            200          33.33%
5       20            200          30.00%
6       35            100          31.67%
7       35            150          20.00%
8       35            175          21.67%
9       40            175           8.33%

Confusion Matrix:

        good   okay   poor   class.error
good      14      1      1        0.1250
okay       2     13      1        0.1875
poor       0      0     28        0.0000

Next Steps

While preliminary results were obtained, the focus was primarily on understanding and being able to perform the process of text mining for the purpose of resume classification. Now that we have established a general procedure, we want to hone the individual steps to produce more meaningful results.
This will include, but is not limited to, implementing a k-fold sampling method, using alternative methods to reduce the size of the document-term matrix, and writing a function to aid in model selection.
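
The poster does not show code, so the following is only a minimal sketch of the cleaning, term-filtering, and fold-splitting ideas described above, written in Python with the standard library alone. The stop-word list, the toy documents, and all function names (clean, document_term_matrix, keep_frequent_terms, sparsity, k_fold_indices) are assumptions made for illustration and are not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "with", "for"}


def clean(text):
    """Lowercase, drop punctuation and numbers, collapse white space, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # punctuation and digits become spaces
    tokens = text.split()                           # split() also collapses white space
    return [t for t in tokens if t not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (terms, rows), where rows[i][j] counts terms[j] in docs[i]."""
    counts = [Counter(clean(d)) for d in docs]
    terms = sorted(set().union(*counts))
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows


def keep_frequent_terms(terms, rows, min_docs):
    """Keep only the terms that occur in at least min_docs documents."""
    keep = [j for j in range(len(terms))
            if sum(1 for row in rows if row[j] > 0) >= min_docs]
    return [terms[j] for j in keep], [[row[j] for j in keep] for row in rows]


def sparsity(rows):
    """Fraction of zero entries in the matrix."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for v in row if v == 0)
    return zeros / total if total else 0.0


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


# Toy documents standing in for scraped resumes (invented for this sketch).
docs = [
    "Data scientist with 5 years of experience in Python and R.",
    "Python developer; experience with machine learning.",
    "Sales manager, 10 years in retail.",
]

terms, rows = document_term_matrix(docs)
kept_terms, kept_rows = keep_frequent_terms(terms, rows, min_docs=2)
print(len(terms), "terms before filtering, sparsity", round(sparsity(rows), 3))
print(kept_terms, "after filtering, sparsity", round(sparsity(kept_rows), 3))
print(k_fold_indices(len(docs), 3))
```

Sparsity here is the same ratio reported with each document-term matrix: for the initial matrix, 1444169 / (34431 + 1444169) is about 0.977, which is shown as 98%. Raising min_docs plays the role of the "Min. Occur." setting in the trial table, trading vocabulary size against sparsity.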