Deterring Password Sharing: User Authentication via
Fuzzy c-Means Clustering Applied to Keystroke Biometric Data
Salvador Mandujano
Rogelio Soto
Instituto Tecnológico y de Estudios Superiores de Monterrey
Center for Intelligent Systems
Monterrey, Mexico
{smv, rsoto}@itesm.mx
Abstract
This paper describes a clustering-based system to enhance user authentication by applying fuzzy techniques to
biometric data in order to deter password sharing. Fuzzy c-Means is used to train personal, per-keyboard profiles
based on the keystroke dynamics of users when entering passwords on a keyboard. These profiles are encrypted with
DES, taking the actual password as the key, and are read at logon time by the access control mechanism in order to
further validate the identity of the user. Fuzzy values obtained from membership functions applied to the input
(i.e., keystroke latencies) are compared against profile values, and a match, within a certain precision threshold γ,
will grant access to the user. With this technique, even when user A shares password PA with user B, B will still be
denied access unless he is capable of mimicking the keystroke dynamics of A. We describe the motivation, design, and
implementation of a prototype whose results indicate the accuracy level and feasibility of the approach.
1. Introduction
Biometric mechanisms represent the strongest means to authenticate people [1, 3, 13]. As human beings we have
characteristics that help distinguish us from others. Our genetic code, fingerprints, handwriting, and retinal
pattern are examples of biometric features that make us unique and distinguishable as individuals. There is another
source of biometric data which has not been exploited for the purpose of strengthening user identification: the
typing patterns of a person when using a computer keyboard [12, 5]. The keystroke frequency of a user is a
distinctive feature that, even though it is not as precise as others in terms of entropy and classification power [1],
has the advantage of not requiring costly equipment and software to be implemented, and therefore can be used to
strengthen password-based authentication.
Password systems have been the favorite authentication method for years in electronic systems for several reasons:
they are straightforward to implement, easy to use and maintain, their precision can be adjusted by enforcing
password-structure policies or by changing cryptographic algorithms depending on the security level desired, and they
are an inexpensive, scalable way of validating users, both locally and remotely, to all sorts of services [10, 2]. If
a username or password does not match the information stored in the access control repository at log-on time, the
user will be denied access; otherwise, he will be able to use the system.
The security of traditional password systems resides in
the actual ciphered string containing the password and the
inability to decipher it while it is stored on the filesystem (shadow-password models hide password information
from users in order to avoid dictionary attacks and password
cracking, although many systems typically allow users to read
this information given that it is encrypted [2]). These authentication mechanisms are based on something the user knows,
in this case, a password. If someone gets to know the password, he will be perfectly (although inappropriately) able
to log on to the system. There exist other models based
on something the user owns, for instance, a lock, a key, or a
badge. Any person who gets one of these objects will be able
to access the protected resource with no trouble. A third
type of system is based on something the user is, meaning
biometric features of the user. These methods convey much
more information for user validation and are more difficult
to break as they are based on something that is more difficult to share (unlike passwords and access badges, for example)
[10].
The system being introduced in the present paper fuses
two of these security mechanisms in order to fortify user authentication. It employs a password string as something the
Proceedings of the Fifth Mexican International Conference in Computer Science (ENC’04)
0-7695-2160-6/04 $20.00 © 2004 IEEE
user must know, and complements it with the corresponding
keystroke pattern of the user which represents something
the user must be. If he knows the password but cannot type
it on the keyboard at the right pace, he will be unable to log
on. Similarly, if the user does not remember the password
but he is actually the legitimate user, he will still be denied
access. Both components need to be present for the user
to be let in: the password and a “good-enough” keystroke
pattern (see Section 4 for details on the matching mechanism). If either is missing, no access is permitted.
This paper comprises the following sections. Section
2 describes the problem of user authentication and password sharing. Section 3 reviews background information on
clustering and password systems including related research
projects. Section 4 outlines the design and implementation
of the fuzzy c-Means prototype. Section 5 describes the experimental results obtained with the prototype. Conclusions
and references are at the end of the document.
2. Problem description
Users share their personal passwords with others in order to give them access to individual or corporate electronic
accounts [11, 12, 6]. If a user wants someone else to enter
the system on his behalf, he just needs to let that someone
know the password. This weakness can also be observed
if a password has been sniffed on a connection line: someone tapping on a network is potentially able to capture all
the passwords that travel in the clear [13]. If passwords
are not protected with cryptography or with a secondary
mechanism like application wrappers that hide all human-readable strings by scrambling the data, they will give access to an intruder who will be able to abuse the privileges
of the account and, perhaps, to extend the break-in to other
areas of the compromised host. These incidents have caused
serious losses in recent years and constitute a priority for
the information security teams of many governments and
corporations around the globe [11].
For system administrators, if more than one user is logging on to a host using the same user account, they will
be unable to tell which of those users should be held accountable for which actions – especially when it comes to
anomalous activity. Multi-user systems require mechanisms
to make sure that all the accounting is done correctly and
hardening password-based user authentication is a way of
guaranteeing the integrity of system records.
By improving authentication through biometrics, in our
case doing keystroke pattern analysis, it is possible to prevent people from utilizing passwords that do not belong to
them. Consequently, an intruder will have to do two things
in order to get access to a system account: 1) get the password of one of the users, and 2) reproduce the typing pattern of the
owner of that password. This additional security layer on
top of password strings makes security stronger, as biometric data is difficult to imitate and even to communicate over the phone or through email [1]. With this
mechanism we deter intentional password sharing and reduce the threat posed by a compromised password.
3. Background
This section covers three topics. It first describes crisp
clustering methods and then compares them with
fuzzy clustering. Toward the end of the section, we cite
other projects related to the keystroke approach followed
in this paper.
3.1 Data clustering
Clustering algorithms are a form of unsupervised
learning used to identify groupings among a population of
individuals [7, 8]. They analyze the similarity of a set of
samples in order to identify possible groupings into which the set can be split.
Once these groupings or clusters are defined, a new
incoming sample point can be classified according to its particular features and put into one of the clusters. A
point describing the members of a cluster is calculated so
that new points can be compared against it [14]. Some methods call this point the centroid of the cluster; it is recomputed during the learning phase and/or as new members are
received.
Let G be a set of points and let W(G) be the power set
of G. C will be a cluster or partition of G if and only if
C ∈ W(G) (i.e., C is a possible subset of G). We can
define a binary membership function for C as follows:
\[
u_C(x) =
\begin{cases}
1 & \text{if } x \in C \\
0 & \text{if } x \notin C
\end{cases}
\]
where x is an incoming sample. Now suppose there are
two clusters, C1 and C2 . We can apply a function like the
one above to a group of three input points p, q and r, and
build a partition matrix U with their crisp membership values. U will indicate to which of the clusters every sample
belongs:
\[
U =
\begin{bmatrix}
1 & 0 & 1 \\
0 & 1 & 0
\end{bmatrix}
\]
A 1 value denotes absolute membership and a 0 means
absolute non-membership. Any given point will belong
to one and only one cluster depending on the outcome of
a similarity measure used to group similar individuals together. For evaluating similarity, the point to classify is
compared with the centroids. In the case of data points
in an n-dimensional space, the similarity measure could be
some type of distance function between them [8, 14]. The
closer they are, the more likely they are to belong to
the same cluster. Figure 1 shows three clusters and how they
split data points. Note that there is no overlap since a
point will belong exclusively to the cluster that is closest in
distance.
Figure 1. Three crisp-membership clusters (C1, C2, and C3) on a bidimensional space (axes x1 and x2) defined by elliptic functions. Crosses represent
centroids.
These clusters classify points according to their coordinate values in x1 and x2 and the position of the centroid. In
circular clusters, for instance, the centroid is located at the
geometrical center of the set and its coordinates correspond
to the mean values of the coordinates of the cluster’s members (for other cluster shapes, the centroid and similarity
function may vary [14]).
When a new point pi needs to be classified, it is compared against current cluster centroids. In the case of the
example in Figure 1, three membership values will be computed for each point pi : uC1 (pi ), uC2 (pi ), and uC3 (pi ).
The highest membership value will determine the right cluster for pi .
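The crisp assignment just described can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name CrispAssign, the use of squared Euclidean distance as the similarity measure, and the sample coordinates are all made up for the example and are not taken from the prototype. Two centroids and three points are used so that the result reproduces the partition matrix U shown earlier.

```java
import java.util.Arrays;

class CrispAssign {
    // Squared Euclidean distance between two points of equal dimension
    // (an assumed similarity measure; the paper leaves the choice open).
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the point.
    static int nearestCluster(double[] point, double[][] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (squaredDistance(point, centroids[i]) < squaredDistance(point, centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    // Crisp partition matrix: rows are clusters, columns are points;
    // u[i][k] is 1 when point k belongs to cluster i and 0 otherwise.
    static int[][] partitionMatrix(double[][] points, double[][] centroids) {
        int[][] u = new int[centroids.length][points.length];
        for (int k = 0; k < points.length; k++) {
            u[nearestCluster(points[k], centroids)][k] = 1;
        }
        return u;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};          // hypothetical C1, C2
        double[][] points = {{1.0, 2.0}, {9.0, 8.0}, {-1.0, 0.5}};  // hypothetical p, q, r
        System.out.println(Arrays.deepToString(partitionMatrix(points, centroids)));
        // prints [[1, 0, 1], [0, 1, 0]]
    }
}
```

Every column of the matrix contains exactly one 1, which is the "one and only one cluster" property of crisp clustering.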
3.2 Fuzzy c-Means clustering
The c-Means algorithm [14] is a fuzzy clustering technique that works much like the method above but provides additional flexibility regarding membership. An individual will belong to one or more classes or clusters with
different membership degrees. This idea arises from the fact
that it can be ambiguous whether a point must go into a
certain cluster and not into another (consider points with
equal membership for two clusters, for instance).
To deal with this ambiguity, it is necessary to introduce
some fuzziness into the formulation of the problem. Instead
of having precise, crisp boundaries for a cluster representing a binary threshold which indicates whether a point definitely belongs to a cluster or not, fuzzy membership functions compute a membership degree of each point for every cluster. c-Means defines clusters from a set of input
points using this loose membership strategy; it is the best-known algorithm that has been developed
for this purpose [14].
The output of a fuzzy membership function will be a real
value between 0 and 1, that is, uF(x) ∈ [0, 1] for a
fuzzy cluster F and an input point x. The partition matrix
for two fuzzy clusters and three input points will look something like this:
\[
U_f =
\begin{bmatrix}
0.24 & 0.15 & 0.93 \\
0.76 & 0.85 & 0.03
\end{bmatrix}
\]
Each number represents the membership degree of
a point with respect to a cluster. The first of the three
points will belong to the first cluster with membership 0.24
and, at the same time, it will also belong to the second
cluster but with a membership of 0.76. The c-Means algorithm will build the clusters, compute their corresponding
centroids and maintain Uf . The algorithm works as follows:
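Step 1. Given an input data set $X = (x_1, x_2, \ldots, x_n)$, where $x_i \in \mathbb{R}^k$, fix the number $c$ of clusters with
$c \in \{2, 3, \ldots, n-1\}$ ($c$ is the variable that gives its name to the algorithm). Choose the weighting exponent
$m \in (1, \infty)$, which must be strictly greater than 1 because the exponent $2/(m-1)$ in Step 3 is undefined at $m = 1$,
and initialize the partition matrix $U^0$.
Step 2. At iteration $l$, with $l \in \mathbb{N} \cup \{0\}$, compute the $c$ mean vectors $v_i^l$ with $i \in \{1, 2, \ldots, c\}$; these are the average
points of the $c$ clusters. With $u_{ik}^l$ denoting the membership of $x_k$ with respect to cluster $i$:
\[
v_i^l = \frac{\sum_{k=1}^{n} (u_{ik}^l)^m \, x_k}{\sum_{k=1}^{n} (u_{ik}^l)^m}, \qquad 1 \le i \le c
\]
Step 3. Update $U^l = [u_{ik}^l]$ to $U^{l+1} = [u_{ik}^{l+1}]$:
\[
u_{ik}^{l+1} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{|x_k - v_i^l|}{|x_k - v_j^l|} \right)^{2/(m-1)}}, \qquad 1 \le i \le c, \; 1 \le k \le n
\]
Step 4. If $|U^{l+1} - U^l| < e$, where $e$ is the error tolerance, stop;
otherwise let $l = l + 1$ for the next iteration and go to Step 2.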
The algorithm will converge to a set of c clusters and a
partition matrix U which contains the membership values
of each point with respect to the clusters.
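The four steps can be sketched in Java for one-dimensional data such as latencies. This is a minimal illustration, not the prototype's code. The class name FuzzyCMeans1D, the deterministic initial partition (each point leans toward cluster k mod c), the use of the largest absolute membership change as the stopping norm, and the sample values are all assumptions made for the example.

```java
class FuzzyCMeans1D {
    // Assumed initial partition: point k leans (0.8) toward cluster k mod c;
    // the remaining 0.2 is shared by the other clusters, so columns sum to 1.
    static double[][] initialPartition(int c, int n) {
        double[][] u = new double[c][n];
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < c; i++) {
                u[i][k] = (k % c == i) ? 0.8 : 0.2 / (c - 1);
            }
        }
        return u;
    }

    // Step 2: v_i = sum_k (u_ik)^m x_k / sum_k (u_ik)^m.
    static double[] centroids(double[] x, double[][] u, double m) {
        double[] v = new double[u.length];
        for (int i = 0; i < u.length; i++) {
            double num = 0.0, den = 0.0;
            for (int k = 0; k < x.length; k++) {
                double w = Math.pow(u[i][k], m);
                num += w * x[k];
                den += w;
            }
            v[i] = num / den;
        }
        return v;
    }

    // Step 3: u_ik = 1 / sum_j (|x_k - v_i| / |x_k - v_j|)^(2/(m-1)).
    static double[][] updateMemberships(double[] x, double[] v, double m) {
        double[][] u = new double[v.length][x.length];
        double exponent = 2.0 / (m - 1.0);
        for (int k = 0; k < x.length; k++) {
            int exact = -1; // index of a centroid that coincides with x_k, if any
            for (int i = 0; i < v.length; i++) {
                if (x[k] == v[i]) { exact = i; break; }
            }
            for (int i = 0; i < v.length; i++) {
                if (exact >= 0) { // avoid dividing by a zero distance
                    u[i][k] = (i == exact) ? 1.0 : 0.0;
                    continue;
                }
                double sum = 0.0;
                for (int j = 0; j < v.length; j++) {
                    sum += Math.pow(Math.abs(x[k] - v[i]) / Math.abs(x[k] - v[j]), exponent);
                }
                u[i][k] = 1.0 / sum;
            }
        }
        return u;
    }

    // Steps 2 to 4, repeated until the largest membership change drops below eps.
    static double[] fit(double[] x, int c, double m, double eps, int maxIter) {
        double[][] u = initialPartition(c, x.length);
        double[] v = centroids(x, u, m);
        for (int l = 0; l < maxIter; l++) {
            double[][] next = updateMemberships(x, v, m);
            double change = 0.0;
            for (int i = 0; i < c; i++) {
                for (int k = 0; k < x.length; k++) {
                    change = Math.max(change, Math.abs(next[i][k] - u[i][k]));
                }
            }
            u = next;
            v = centroids(x, u, m);
            if (change < eps) break;
        }
        return v;
    }

    public static void main(String[] args) {
        double[] samples = {1.0, 1.1, 0.9, 5.0, 5.1, 4.9}; // made-up values
        System.out.println(java.util.Arrays.toString(fit(samples, 2, 2.0, 1e-6, 200)));
    }
}
```

With m = 2 the exponent 2/(m − 1) equals 2, so memberships fall off with the squared distance ratio. For the two well-separated groups in the example, the centroids should settle close to 1 and 5.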
3.3 Related projects
This approach to user authentication has not been widely
explored and just a few projects involving enhanced security through keystroke dynamics have been developed, the
main difference among them being the type of technique
used [12, 4, 9].
Ru and Eloff used fuzzy classes to characterize the typing
behavior of system users but they did not apply any sort of
clustering [12]. In addition to keystroke information, they
incorporated a password complexity value based on the distances between keys on the keyboard. Joyce and Gupta used
the same sort of “variables that make a handwritten signature a unique human identifier” in order to define a stream
of latency values that make up a profile [4]. No clustering was used here either, but the results obtained from this
project clearly support the use of keystroke biometrics for
password-based authentication. Ogoshi et al. created a
variant of the traditional keystroke-speed model [9]. They
generate “user rhythms” which capture a broader pattern
describing a user’s keystroke frequencies regardless of any
password. This model is not as accurate as
the others but it can certainly be used in a more elaborate
sort of authentication (perhaps challenge-response authentication using phrases).
Given that all of these solutions use individual variables
to capture different features from the user, clustering lends
itself naturally to this purpose as it can be used to extract
information from those variables in order to learn the behavior of each biometric aspect they capture. We explore
the fuzzy version of this method as an alternate solution by
defining considerably more variables per key, which lets us
increase detection accuracy at the key level and not at the
password level as in the other models.
4. Design and implementation
4.1 Input variables
This system is designed to learn the keystroke patterns of
users. When entering a password, there exist two variables
that will be considered: 1) the fraction of time a key stays
pressed, and 2) the time interval between releasing a key
and pressing the next one (see Figure 2).
Figure 2. Pressed-key and released-key variables
for a three-letter password (keys X, Y, and Z on time axis t, with pressed-key latencies p1, p2, p3 and released-key latencies r1, r2)
These two values, often referred to as latencies, are read
every time a key is pressed and they constitute the input
data to the clustering module. If we have, for example,
a password length of three characters, we will have two
released-key latencies, r1 and r2 (which are intervals), and
three pressed-key latencies, p1, p2, and p3. In general, for
an n-character password, there will be n pressed-key values
and n − 1 released-key values.
A password needs to be entered k times by the user during training. For convenience, the prototype offers four values of k: 5, 10, 20, or 30. From the training phase we will
get kn pressed-key values and k(n − 1) released-key values. The c-Means algorithm is then applied to define clusters that will capture the speed at which each key is being
pressed and released.
Every pi and ri variable will have three latency clusters
attached to it: slow latency, correct latency, and fast latency. Each one represents a fuzzy range that describes how
accurate the entered value is. Figure 3 depicts three clusters
and their corresponding centroids. Unlike crisp clustering,
there is overlap in fuzzy models.
Figure 3. Fuzzy clusters for a single password
character: d1 corresponds to slow latency, d2
to correct latency, and d3 to fast latency. Each
cluster has a centroid denoted by ci.
These values are based on time and, as such, are one-dimensional. The computation of centroids, in this case, is
equivalent to computing the arithmetic mean of input times.
Clusters are created with the purpose of finding the actual
centroids, which, coupled with standard deviation, represent a biometric feature captured from the user.
4.2 Authentication with fuzzy values
Once the system is trained and it has learned a keystroke
pattern, the information is ciphered with DES [13] using
the actual password as encryption and decryption key. (If
the password is wrong, regardless of the keystroke pattern,
the profile will not be accessed. If it is correct, the profile
is deciphered and compared against the observed pattern.)
Clustering data stored on the profile will be used to decide
whether or not a user is the person he says he is.
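The idea of keying the profile cipher with the password itself can be sketched with the standard javax.crypto API, used here in place of whatever library the prototype relies on. Several details are assumptions made only for this illustration: the class name ProfileCipher, the derivation of the DES key from the first 8 bytes of the password (zero-padded when shorter), the ECB mode with PKCS5 padding, and the hypothetical KSPROFILE marker that lets the code recognize a correctly deciphered profile.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.DESKeySpec;

class ProfileCipher {
    // Hypothetical marker prepended to the profile so a wrong key can be detected.
    private static final byte[] MAGIC = "KSPROFILE".getBytes(StandardCharsets.US_ASCII);

    // Assumption: the key is the first 8 bytes of the password, zero-padded if shorter.
    static SecretKey keyFromPassword(String password) throws GeneralSecurityException {
        byte[] keyBytes = Arrays.copyOf(password.getBytes(StandardCharsets.UTF_8), 8);
        return SecretKeyFactory.getInstance("DES").generateSecret(new DESKeySpec(keyBytes));
    }

    // Enciphers marker + profile with the password-derived key.
    static byte[] encrypt(String password, byte[] profile) {
        try {
            byte[] plain = Arrays.copyOf(MAGIC, MAGIC.length + profile.length);
            System.arraycopy(profile, 0, plain, MAGIC.length, profile.length);
            Cipher cipher = Cipher.getInstance("DES/ECB/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, keyFromPassword(password));
            return cipher.doFinal(plain);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("DES is unavailable", e);
        }
    }

    // Returns the profile bytes, or null when the password does not open the profile.
    static byte[] decrypt(String password, byte[] ciphertext) {
        try {
            Cipher cipher = Cipher.getInstance("DES/ECB/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, keyFromPassword(password));
            byte[] plain = cipher.doFinal(ciphertext);
            boolean headerOk = plain.length >= MAGIC.length
                    && Arrays.equals(Arrays.copyOf(plain, MAGIC.length), MAGIC);
            return headerOk ? Arrays.copyOfRange(plain, MAGIC.length, plain.length) : null;
        } catch (GeneralSecurityException e) {
            return null; // padding check failed: wrong key
        }
    }

    public static void main(String[] args) {
        byte[] profile = "p1=0.12;r1=0.08".getBytes(StandardCharsets.UTF_8); // made-up profile
        byte[] stored = encrypt("Kennedy1", profile);
        System.out.println(Arrays.equals(decrypt("Kennedy1", stored), profile));
        System.out.println(decrypt("blacknight", stored) == null);
    }
}
```

In this sketch a wrong password never reaches the keystroke comparison, because the profile cannot be recovered without the right key.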
If the user wants to authenticate to the system, he will
type in his password and keystroke latencies will be read by
the security system (see the modules of the prototype in Figure 5). For instance, if the second letter of a three-character
password is “Y” then the following six centroids will be
computed: p(2,s) , p(2,c) , and p(2,f ) which correspond to
pressed-key values slow, correct, and fast, and r(2,s) , r(2,c) ,
and r(2,f ) corresponding to released-key values (the time interval previous to character “Y” has been already computed
with the first input character).
The observed keystroke latencies are computed and then
compared to the corresponding centroid using the standard deviation σi of the cluster computed during training.
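As an illustration of how the observed latencies could be derived from raw key events, the following Java sketch computes the pressed-key values pi and released-key values ri from press and release timestamps. The class name LatencyExtractor, the millisecond timestamps, and the sample values are assumptions for the example; the paper does not describe how the prototype captures key events.

```java
import java.util.Arrays;

class LatencyExtractor {
    // p_i: how long key i stays pressed (release time minus press time).
    static long[] pressedLatencies(long[] pressMs, long[] releaseMs) {
        long[] p = new long[pressMs.length];
        for (int i = 0; i < p.length; i++) {
            p[i] = releaseMs[i] - pressMs[i];
        }
        return p;
    }

    // r_i: interval between releasing key i and pressing key i + 1.
    // An n-character password yields n - 1 such values.
    static long[] releasedLatencies(long[] pressMs, long[] releaseMs) {
        long[] r = new long[Math.max(0, pressMs.length - 1)];
        for (int i = 0; i < r.length; i++) {
            r[i] = pressMs[i + 1] - releaseMs[i];
        }
        return r;
    }

    public static void main(String[] args) {
        long[] press = {0, 150, 320};    // made-up timestamps for a three-letter password
        long[] release = {90, 260, 400};
        System.out.println(Arrays.toString(pressedLatencies(press, release)));  // [90, 110, 80]
        System.out.println(Arrays.toString(releasedLatencies(press, release))); // [60, 60]
    }
}
```

A released-key value becomes negative when the next key is pressed before the previous one is released, a case the sketch passes through unchanged.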
The number of correct matches (that is, matches that
fall into the correct latency cluster) over all variables is
summed up into a variable h, which is divided by the maximum number of possible correct matches m. The quotient q
is compared to the access threshold α to determine whether
the user will be granted access:
\[
q = \frac{h}{m}, \qquad
\begin{cases}
q \ge \alpha & \Rightarrow \text{“access granted”} \\
q < \alpha & \Rightarrow \text{“access denied”}
\end{cases}
\]
There is a precision constant γ which helps fine tune the
evaluation of profiles. This value is multiplied by all σi in
order to adjust the acceptance range of clusters (see Figure 4). A small γ will make a very narrow range within
which a variable can be considered correct, whereas a larger
γ will provide more flexibility when entering the password
(i.e., the matching will not be as tight, since the input need not
be as close to the centroids and standard deviation values
stored in the profile). α and γ are configuration settings that the
administrator can use to regulate how narrow the matching
area will be.
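The matching rule can be sketched as follows. The sketch rests on two assumptions that should be read as such: a variable counts as a correct match when the observed latency lies within γσi of the correct-latency centroid, and access is granted when the fraction q = h/m of correct matches reaches the threshold α. The class name KeystrokeMatcher and all numeric values are hypothetical.

```java
class KeystrokeMatcher {
    // Assumed match rule: the observed latency is within gamma * sigma of the centroid.
    static boolean matches(double observed, double centroid, double sigma, double gamma) {
        return Math.abs(observed - centroid) <= gamma * sigma;
    }

    // q = h / m, where h counts correct matches and m is the number of variables.
    static double matchRatio(double[] observed, double[] centroids, double[] sigmas, double gamma) {
        int h = 0;
        for (int i = 0; i < observed.length; i++) {
            if (matches(observed[i], centroids[i], sigmas[i], gamma)) {
                h++;
            }
        }
        return (double) h / observed.length;
    }

    // Assumed decision rule: grant access when q reaches the threshold alpha.
    static boolean accessGranted(double[] observed, double[] centroids, double[] sigmas,
                                 double gamma, double alpha) {
        return matchRatio(observed, centroids, sigmas, gamma) >= alpha;
    }

    public static void main(String[] args) {
        double[] observed = {100, 60, 120, 55, 90};  // made-up latencies in ms
        double[] centroids = {95, 62, 100, 50, 88};  // made-up "correct" centroids
        double[] sigmas = {10, 5, 8, 4, 6};          // made-up standard deviations
        System.out.println(accessGranted(observed, centroids, sigmas, 1.0, 0.75)); // false
        System.out.println(accessGranted(observed, centroids, sigmas, 2.0, 0.75)); // true
    }
}
```

With γ = 1 only three of the five variables match (q = 0.6), while γ = 2 widens the acceptance range enough for four to match (q = 0.8), which shows how the two settings interact.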
The above computations determine profile values which
characterize the keystroke latencies of the user. If the user
types in his password at the usual keystroke rhythm, the system
will take this to mean that he is probably the legitimate owner
of the account. If γ is too small, the user might have to try
several times before successfully logging in. Since latencies
may differ among keyboards, the prototype allows the user
to define per-keyboard profiles.
The proposed structure comprises three fuzzy sets that
correspond to the three fuzzy clusters defined for each variable (Figure 4). Every key entered by the user will generate
two or three variables (depending on whether it is an initial
or an intermediate character in the password) which generate three membership values each. This increases the level of detail
used to characterize keystroke behavior. If a variable
has a higher membership for the correct set, it
was typed at the right pace. If it belongs to any other set,
the matching algorithm will determine whether that value,
along with the rest of the characters, forms, as a whole,
an acceptable password.
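The three membership values of a single variable can be illustrated with the c-Means membership formula of Section 3.2. The sketch assumes m = 2, for which the formula reduces to normalized inverse squared distances to the three centroids. The exact shape of the prototype's membership functions is not given in the text, so the class name LatencyMembership, this choice of function, and the centroid values are assumptions.

```java
class LatencyMembership {
    static final String[] SETS = {"slow", "correct", "fast"};

    // Membership of latency t in each set, using the c-Means formula with m = 2:
    // u_i = (1 / d_i^2) / sum_j (1 / d_j^2), where d_i = |t - c_i|.
    static double[] memberships(double t, double[] centroids) {
        double[] u = new double[centroids.length];
        for (int i = 0; i < centroids.length; i++) {
            if (t == centroids[i]) { // exactly on a centroid: full membership there
                u[i] = 1.0;
                return u;
            }
        }
        double total = 0.0;
        for (int i = 0; i < centroids.length; i++) {
            double d = t - centroids[i];
            u[i] = 1.0 / (d * d);
            total += u[i];
        }
        for (int i = 0; i < centroids.length; i++) {
            u[i] /= total;
        }
        return u;
    }

    // Name of the set with the highest membership.
    static String strongestSet(double t, double[] centroids) {
        double[] u = memberships(t, centroids);
        int best = 0;
        for (int i = 1; i < u.length; i++) {
            if (u[i] > u[best]) {
                best = i;
            }
        }
        return SETS[best];
    }

    public static void main(String[] args) {
        double[] centroids = {200.0, 120.0, 60.0}; // made-up slow, correct, fast centroids in ms
        System.out.println(strongestSet(110.0, centroids)); // correct
        System.out.println(strongestSet(190.0, centroids)); // slow
    }
}
```

A latency of 110 ms is 10 ms away from the correct centroid and much farther from the other two, so nearly all of its membership goes to the correct set.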
Figure 4. Fuzzy functions for a single character (membership u(t) over time t for the slow, correct, and fast sets, with acceptance width γσ). Standard deviation values (σ) are computed during training, and are adjusted by
tuning variable γ
5. Experimental results
The prototype was developed in Java and is composed of
two modules: the training module and the test module. The
training module is designed to learn the keystroke patterns
from users. Values for keyboard number, user name, and
number of training rounds need to be defined before starting
the training. Keyboard number is used to let the user have
several profiles given that the typing patterns of a person
may vary from keyboard to keyboard.
Figure 5. The training and testing modules
communicate by reading from and writing to
keystroke profiles
In Figure 5, the lower panel of the training module
plots keystroke latencies read from the keyboard. This
gives an intuitive idea of what a keystroke pattern looks
like but is displayed for visualization purposes only. Once
the user has entered the password the requested number of
times (only correct passwords are taken into account for the
computation of clusters; latencies corresponding to wrong
passwords are discarded), a profile is created. This profile stores the centroids and standard deviations computed
with c-Means. The profile is then encrypted with DES, using the actual password as the key; DES is implemented by the
org.logi.crypto.keys.DESkey Java class (the prototype
system was developed in Java 2 using a 1.4.2 build).
Once the user has trained the system, he can launch
the test module to evaluate the learned patterns. A logon screen will send the user an “access granted” or an “access denied” message indicating whether the authentication
succeeded or not (matching variables α and γ are currently
hard-coded into the program).
For experimentation purposes, one keyboard was used
by 15 users who trained the system with their passwords
and created their profiles. They were asked to type the passwords of the other 14 users 15 times each (that accounts for
roughly 152 training samples, and 152 tests which generate
2n latency variables each, where n is the length of the password). The results are shown in
Table 1.
No.  User ID   Password        Success  Failure
1    gork      Kennedy1        0.95     0.17
2    cowboy    Winding96       0.91     0.21
3    alex      blacknight      0.95     0.32
4    donna     asdfasdf        0.94     0.31
5    watcher   w2567           0.89     0.11
6    vrije     r011ing$tone$   0.94     0.06
7    hicss     anthropos       0.93     0.24
8    three     aebgtw1         0.92     0.09
9    lucky     wanfcp0e        0.89     0.11
10   domino    DOMINO          0.96     0.23
11   fam       haml00          0.97     0.09
12   edward    e2mv            0.94     0.21
13   walta     Haba11owd00$    0.94     0.04
14   iaomi     iaomi76         0.98     0.22
15   Duff      d123998         0.89     0.10
Table 1. The Success column denotes the
success rate at identifying legitimate owners
(high is good). The Failure column denotes
the percentage of failure to detect an impostor (high is bad).
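As a reading aid, the two result columns of Table 1 can be averaged with a few lines of Java. The averages below are derived here from the tabulated values and are not figures reported by the authors; the class name TableSummary is made up for the example.

```java
import java.util.Locale;

class TableSummary {
    // Success and Failure columns of Table 1, in row order.
    static final double[] SUCCESS = {0.95, 0.91, 0.95, 0.94, 0.89, 0.94, 0.93, 0.92,
                                     0.89, 0.96, 0.97, 0.94, 0.94, 0.98, 0.89};
    static final double[] FAILURE = {0.17, 0.21, 0.32, 0.31, 0.11, 0.06, 0.24, 0.09,
                                     0.11, 0.23, 0.09, 0.21, 0.04, 0.22, 0.10};

    // Arithmetic mean of a column.
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "%.3f", mean(SUCCESS))); // 0.933
        System.out.println(String.format(Locale.ROOT, "%.3f", mean(FAILURE))); // 0.167
    }
}
```

Across the 15 profiles, legitimate owners are recognized about 93% of the time on average, while an impostor who knows the password gets through in about 17% of attempts.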
The overhead of this fuzzy authentication mechanism is
quite small. It grows as O(n) with the length n of the password and, considering the salt variation used by many password modules of Unix systems [2], this overhead would be
equivalent to the use of a salt variable from a busy-waiting
perspective.
It can be noticed that longer passwords provide a better
means to learn a user’s keystroke pattern. The number of
variables increases with the length of the password and this
allows for increased accuracy. It can be also inferred that
passwords containing dictionary words are weaker, and that
an unauthorized user can correctly type passwords that are
short.
The failure rate to detect an impostor is high for easy-to-type passwords, but the inclusion of special characters and
numbers provides additional security to the password and
its corresponding keystroke profile. An important point to
make is that, if an attacker is not aware that the password system features this biometric support, he will probably try a
password only a few times before giving up (in the experiments,
all users were requested to try each password 15 times).
This considerably increases the success of our approach.
From the learning perspective, the success rate obtained
with fuzzy clustering is high, resulting in the positive identification of legitimate users. Failure rates are low if we
consider 14 users trying to break into a system knowing the
password beforehand and attempting to log on 15 times with
each password. It will be convenient to combine a support
authentication module like this with a password policy that
eliminates the use of passwords that are easy to guess and
type [2].
6. Conclusions
Password authentication can be conveniently enhanced
through keystroke pattern monitoring. The proposed fuzzy
method using c-Means clustering provides an extra level
of security that makes password authentication stronger.
The main benefit of this approach is limiting the effects
of password sharing and password stealing by including
additional variables into the authentication equation. Our
experimental results show that this sort of biometric measure effectively identifies legitimate users and impostors,
and the prototype can be fine-tuned to regulate the level of
accuracy required for gaining access to the system.
References
[1] R. Bolle. Guide to Biometrics. Springer-Verlag, 1st edition,
December 2003.
[2] S. Garfinkel and E. H. Spafford. Practical UNIX Security.
O'Reilly, 2nd edition, April 1996.
[3] R. Hsu, M. Abdel-Mottaleb, and A. Jain. Face detection in
color images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 25(5):696–706, March 2002.
[4] R. Joyce and G. Gupta. Identity authentication based
on keystroke latencies. Communications of the ACM,
33(2):168–176, 1990.
[5] S. Kumar. Classification and Detection of Computer Intrusions. PhD thesis, Department of Computer Sciences, Purdue University, West Lafayette, IN, 1995.
[6] A. K. Lenstra and E. R. Verheul. Selecting cryptographic
key sizes. Journal of Cryptology: the journal of the International Association for Cryptologic Research, 14(4):255–
293, 2001.
[7] D. Matula. Graph theoretic techniques for cluster analysis
algorithms. Classification and Clustering, 1977.
[8] T. M. Mitchell. Machine Learning. McGraw Hill, 1st edition, 1997.
[9] Y. Ogoshi, A. Hinata, S. Hirose, and H. Kimura. Improving
user authentication based on keystroke intervals by using intentional keystroke rhythm. IPSJ Journal, 44(2–21), March
2003.
[10] C. P. Pfleeger. Security in Computing. Prentice Hall Inc.,
Upper Saddle River, NJ, 2nd edition, 1997.
[11] R. Richardson. Computer crime & security survey 2003.
Technical report, Computer Security Institute (CSI) and Federal Bureau of Investigation (FBI), 2003.
[12] W. G. Ru and J. H. Eloff. Enhanced password authentication
through fuzzy logic. In IEEE Expert, volume 12, pages 38–
45, Nov/Dec 1997.
[13] B. Schneier. Applied Cryptography. John Wiley and Sons,
New York, NY, 2nd edition, 1996.
[14] L.-X. Wang. A Course in Fuzzy Systems and Control. Prentice Hall, Inc, Upper Saddle River, NJ, 1997.