Classification of Resumes Using Text Mining and Machine Learning Techniques
Danielle Gaal & Lindsey Register
Supervisors: Dr. Cuixian Chen and Dr. Yishi Wang
UNCW Department of Mathematics and Statistics
Introduction
Businesses advertising open positions often receive hundreds, or even thousands, of resumes
from those seeking employment. The task then falls to the Human Resources (HR) department to
sift through the applicants for prospective hires. Reviewing that many resumes by hand can
easily overwhelm the department's resources. An automated approach therefore becomes an
attractive option, even if only to reduce the candidate pool to a size that an HR team can
reasonably handle. Text mining, the process of deriving useful information from text, can be
used to build and train a model that classifies resumes. Introducing text analysis to the
candidate search can thus expedite the selection process and improve business efficiency.
Data Collection & Human Perception
Coming into this study, we knew we wanted to create a model that would help categorize the
resumes a company might receive. To narrow the scope, we focused on categorizing resumes
related to “data science”. Since we did not have a resume bank of our own, we scraped resumes
from online sources. This process produced a sample of 100 resumes, each saved as an
individual .txt file. After collecting the sample, our lab rated each resume to serve as the
“ground truth” for later analysis.
Data Cleaning & Pre-processing
Initial Document Term Matrix (documents: 100, terms: 14786)
Non-/sparse entries: 34431 / 1444169
Sparsity: 98%
Maximal term length: 89
Weighting: term frequency (tf)
Cleaning steps:
Remove “white space”
Remove punctuation
Remove numbers
Remove “stop words”
Figure: Word Frequencies (word cloud)
Random Forest
Decision tree image source: https://ask100people.wordpress.com/2010/04/13/grocery-shopping-decision-tree/
Random forest is a classification method that combines many decision trees to assign
observations to groups. It is based on the idea that while each individual tree is a "weak
classifier", the entire forest is a "strong classifier". In our case, each split in a tree is
based on the inclusion or exclusion of a word in the text document. Though the presence of a
single word is not a strong classifier, the combination of inclusion and exclusion tests over
all of the words the random forest chooses can be a very strong classifier.
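As a rough illustration of the weak-versus-strong idea, the sketch below builds a toy forest in plain Python. It is not the procedure used for the results reported here: every tree is reduced to a single split (a "stump") on the presence of one word, each stump is trained on a bootstrap sample, and the forest predicts by majority vote. The documents, labels, and function names (`train_stump`, `train_forest`, `predict`) are hypothetical, and the code assumes a non-empty vocabulary.

```python
import random
from collections import Counter

def majority(labels, default):
    """Most common label in a list, or `default` if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(docs, labels, vocab, rng, n_candidates=2):
    """Fit a one-split tree on the best of a few randomly chosen words."""
    overall = majority(labels, None)
    best = None
    for word in rng.sample(vocab, min(n_candidates, len(vocab))):
        with_word = [y for d, y in zip(docs, labels) if word in d]
        without_word = [y for d, y in zip(docs, labels) if word not in d]
        yes = majority(with_word, overall)    # prediction when the word is present
        no = majority(without_word, overall)  # prediction when the word is absent
        correct = with_word.count(yes) + without_word.count(no)
        if best is None or correct > best[0]:
            best = (correct, word, yes, no)
    _, word, yes, no = best
    return word, yes, no

def train_forest(docs, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*docs))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(docs)) for _ in docs]
        forest.append(
            train_stump([docs[i] for i in idx], [labels[i] for i in idx], vocab, rng)
        )
    return forest

def predict(forest, doc):
    """Majority vote over all stumps in the forest."""
    votes = [yes if word in doc else no for word, yes, no in forest]
    return majority(votes, None)

# Toy data: each resume is represented as the set of words it contains.
docs = [
    {"python", "statistics", "sql"},
    {"python", "regression"},
    {"retail", "cashier"},
    {"cashier", "sales"},
]
labels = ["good", "good", "poor", "poor"]
forest = train_forest(docs, labels)
print(predict(forest, {"python", "sql"}))
```

A single stump only ever looks at one word, so it is a weak classifier on its own; the vote across many stumps trained on different samples and different candidate words is what gives the ensemble its strength. A full random forest grows deeper trees with many splits each.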
Preliminary Results
Figure: Rating Dispersion (pie chart)
Revised Document Term Matrix (documents: 100, terms: 10562)
Non-/sparse entries: 29075 / 1027125
Sparsity: 97%
Maximal term length: 65
Weighting: term frequency (tf)
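The cleaning steps (white space, punctuation, numbers, stop words), the term-frequency matrix, and the minimum-occurrence filter on terms can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the code behind the matrices reported here: the stop-word list is a tiny hypothetical one, lowercasing is an added assumption, and `clean`, `document_term_matrix`, and `sparsity` are illustrative names.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}

def clean(text):
    """Lowercase, drop punctuation and digits, collapse white space, drop stop words."""
    text = text.lower()  # assumption: case is ignored when counting terms
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", " ", text)  # remove numbers
    tokens = text.split()  # split() also discards extra white space
    return [t for t in tokens if t not in STOP_WORDS]

def document_term_matrix(docs, min_docs=1):
    """Return (terms, rows), where rows[i][j] counts terms[j] in document i.

    Only terms that occur in at least `min_docs` documents are kept.
    """
    counts = [Counter(clean(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    terms = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows

def sparsity(rows):
    """Fraction of zero cells in the matrix (0.0 for an empty matrix)."""
    cells = [v for row in rows for v in row]
    return sum(v == 0 for v in cells) / len(cells) if cells else 0.0

resumes = [
    "Data analyst with 3 years of Python and SQL.",
    "Statistics graduate; Python, R, and regression modeling.",
    "Cashier in retail, 5 years of sales.",
]
terms, rows = document_term_matrix(resumes, min_docs=2)
print(terms, rows, round(sparsity(rows), 2))
```

The same ratio of sparse cells to all cells reproduces the reported figure for the 10562-term matrix: 1027125 / (29075 + 1027125) is roughly 0.97.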
Trial   Min. Occur.   # of Trees   OOB Error
1       10            20           48.33%
2       30            20           51.67%
3       30            100          28.33%
4       30            200          33.33%
5       20            200          30.00%
6       35            100          31.67%
7       35            150          20.00%
8       35            175          21.67%
9       40            175          8.33%
Confusion Matrix:
        good   okay   poor   class.error
good      14      1      1        0.1250
okay       2     13      1        0.1875
poor       0      0     28        0.0000
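The class.error column can be checked directly from the counts. The short sketch below assumes that rows hold the rated classes and columns the predicted classes, which is consistent with the reported class.error values; the function names are illustrative.

```python
labels = ["good", "okay", "poor"]

# Rows: rated class. Columns: predicted class (assumed orientation).
confusion = [
    [14, 1, 1],
    [2, 13, 1],
    [0, 0, 28],
]

def class_errors(matrix):
    """Per-class error: the share of each row that falls off the diagonal."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_error(matrix):
    """Share of all observations that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(row[i] for i, row in enumerate(matrix))
    return 1 - correct / total

for name, err in zip(labels, class_errors(confusion)):
    print(f"{name}: {err:.4f}")
print(f"overall: {overall_error(confusion):.2%}")
# good: 0.1250
# okay: 0.1875
# poor: 0.0000
# overall: 8.33%
```

The counts sum to 60 resumes, of which 5 are misclassified, so the overall error of 5/60 (about 8.33%) equals the OOB error listed for Trial 9.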
Only keep terms that occur in 40* or more resumes.
*Note: this value can be changed
Revised Document Term Matrix (documents: 100, terms: 34)
Non-/sparse entries: 1865 / 1535
Sparsity: 45%
Maximal term length: 11
Weighting: term frequency (tf)
The word cloud represents the most frequent terms from the sample of resumes, with a
maximum of 100 terms plotted.
The pie chart indicates the breakdown of ratings given to the sample of resumes.
Next Steps
While preliminary results were obtained, our focus so far was on understanding and carrying
out the text mining process for resume classification. Now that we have established a general
procedure, we want to refine the individual steps to produce more meaningful results. This
will include, but is not limited to: implementing a k-fold sampling method, using alternative
methods to reduce the size of the document-term matrix, and writing a function to aid in
model selection.
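One way the k-fold step might look is sketched below, reading "k-fold sampling" as k-fold cross-validation, which is an assumption. The resumes are shuffled once and dealt into k folds, and each fold serves as the test set exactly once. The function names, the choice of k = 5, and the fixed seed are all illustrative.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_items, k, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_items, k, seed)
    for i, test in enumerate(folds):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test

# 100 resumes split into 5 folds: 80 for training and 20 for testing per round.
splits = list(train_test_splits(n_items=100, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Averaging the error over the k test folds gives an estimate that can be compared across settings such as the minimum-occurrence threshold and the number of trees.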