Using Hidden Markov Model to Predict the Surfing User’s Intention
of Cyber Purchase on the Web
Chun-Jung Lin 1, 2, Fan Wu 1, I-Han Chiu 1
1 Dep. of Management of Information System, National Chung-Cheng University, Taiwan
2 Dep. of Computer Science and Information System, HungKuang University, Taiwan
ABSTRACT
Predicting the intention of users on the Internet is increasingly important for e-business. To our knowledge, this
paper is the first to apply the Hidden Markov Model (HMM), a stochastic tool used in information extraction, to
predicting the behavior of users on the web. We collect web server logs, clean the data, and complete the paths that
users traverse. Based on the HMM, we construct a model of web browsing that predicts in real time whether a user
intends to purchase. Related measures, such as speeding up operations, offering friendly guidance, and providing other
conveniences, can then take effect when a user is in a purchasing mode. Simulation on the log of a real e-shop shows
that our model predicts the purchase intention of users with a precision of 51% and a recall of 73%.
Keywords: Hidden Markov Model; CRM; e-Commerce; purchase intention.
INTRODUCTION
With the development of the Internet, virtual distribution channels such as e-shops have brought an upsurge in
on-line shopping. Although innumerable Internet companies (the so-called dot-coms) collapsed when the bubble burst,
on-line shopping is becoming more and more popular. In 2004, the B2C e-marketplace in the U.S. reached 72.6 billion
dollars, growing at a rate of nearly 32%, and the number of online shoppers rose to 86.5 million (eMarketer, 2003).
These facts show that e-Commerce is maturing and growing in importance, and it has become an increasingly significant
research domain (Quinlan, 2000). However, designing, operating, and maintaining e-Commerce websites is difficult
work.
The goal of an e-Commerce website is to earn profits for its managers and to provide consumers with better service
quality, convenience, a secure trading platform, an easy-to-use interface, and so on. A critical task in e-Commerce is
attracting customers and enticing them to purchase. Much research aims to facilitate e-Commerce transactions. A
well-known solution uses Data Mining, an information extraction technique whose goal is to discover hidden facts
contained in databases. Combining machine learning, statistical analysis, modeling techniques, and database technology,
Data Mining can find subtle relationships among data and then infer rules of buying behavior. The knowledge derived
from Data Mining is valuable in Customer Relationship Management (CRM). Clearly, the more precisely an enterprise can
predict consumers' behavior on the web, the more profit it will earn.
Table 1: The amount of sales in B2C e-Commerce in the U.S. (billion dollars)
Year      2000    2001    2002    2003    2004    2005
Amount    28.15   34.38   45.54   55.03   72.60   88.10
PREVIOUS WORK
A common method of predicting consumers' behavior is to collect the profiles and routine activities of clients on
the Web. From the collected data, an intelligent agent can produce valuable, customized recommendations for an
individual user. However, collecting user data is hard because it raises serious privacy issues. Although cookies
enable a Web server to identify a user's data, they fail when the end user deletes them. With the rapid growth of Data
Mining, researchers use techniques such as classification, clustering, decision trees, association rules, and
sequential patterns to analyze observed phenomena and discover hidden relationships among events. In Data Mining, the
most widely used technique is finding frequent patterns, that is, patterns whose number of occurrences exceeds a
threshold. The mining results are derived through complex statistical computation and can be seen as an explanation of
the phenomenon. They are useful for recommendations but are not suitable for predicting user behavior in real time.
The term Web Usage Mining first appeared in 2000 (SIGKDD, 2000). Web Usage Mining is the application of Data
Mining techniques to large Web data repositories (Cooley et al., 1999). The mining results can be used to aid
navigation by providing a list of “popular” pages or sites reachable from a particular Web page, or to restructure a
website so that it serves its users more conveniently. Many existing web analysis tools (Wang and Liu, 2003) provide
mechanisms for reporting user activities statistically, such as the visiting frequency of the most popular pages and
the number of visitors per hour under different filters on the visit data. Most of these tools count the hits on the
server, the files accessed, the number of visits by each visitor, and the users' demographics. They are designed to
produce statistics and provide little or no analysis of the relationships among the accessed files or pages.
To provide immediate predictions with higher precision, we propose a novel method based on the Hidden Markov
Model (HMM). The HMM has proven to be an excellent method in the field of pattern recognition (Rabiner and Huang,
1993; Wilpon et al., 1993; Nakao et al., 2001). The HMM (Mobasher et al., 1996) is a theoretical method for describing
statistical signals and has been evaluated as a useful tool for information extraction. It is a powerful model derived
from the Markov chains of Andrei A. Markov. It has been applied to several fields, such as biological sequence
analysis (Cooley et al., 1997), signal processing (Mobasher et al., 1996; Durbin et al., 2002), spoken language
modeling (Rabiner and Huang, 1993), handwriting recognition (Rabiner, 1989; Wilpon et al., 1990), face recognition
(Huang et al., 2001), and indexing of multimedia documents (Hu et al., 1996; Nakao et al., 2001). We use the log data
of a website as training data to form our model, which is then used to guess a user's intention while he or she is
surfing an e-shop. This method is an effective aid for e-commerce website managers: it can help restructure the website
to provide easy, smooth steps, and it makes consumer behavior easier to analyze, which supports good CRM and increases
revenue.
Using Hidden Markov Model to Predict User’s Browsing Behavior
While users surf the Internet, it is possible to distinguish browsing behavior that carries an intention to purchase
from behavior that does not. From this point of view, we can roughly divide consumers into two groups: those who intend
to buy and those who do not. From the logs of some websites, we observe that many identical browsing histories
(sessions) end in different business results (purchasing behavior). Since the HMM is good at information extraction and
can guess the hidden state behind each observation symbol with high accuracy, we adapt the concepts of the HMM to
investigate the website log. We preprocess the raw log data into individual user sessions and judge whether the user
intends to buy by whether the shopping-cart page appears in the session. Next, we treat all sessions as training data
to adjust our HMM. Finally, we use the HMM to predict and discover users' behavior on the website. Our purpose is to
infer the buying intention of the consumer. The process is shown in Figure 1. Before training our model, we must
normalize the inconsistent names of the Web pages into a well-ordered form and define rules that unify individual
sessions into units. We first collect the raw log of the server and carry out the preprocessing steps. In
preprocessing, we need the Web structure topology to obtain each user's session (also called a unit).
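The labeling rule described above can be sketched as follows. This is only a minimal illustration, not the authors' code: the page name `cart.asp` and the representation of a session as a list of page names are hypothetical, since the paper does not give them.

```python
# Hypothetical name of the shopping-cart page; the paper does not state it.
CART_PAGE = "cart.asp"


def has_purchase_intention(session, cart_page=CART_PAGE):
    """Return True if the shopping-cart page appears anywhere in the session."""
    return cart_page in session


def label_sessions(sessions, cart_page=CART_PAGE):
    """Pair every session (a list of page names) with its intention label."""
    return [(s, has_purchase_intention(s, cart_page)) for s in sessions]


if __name__ == "__main__":
    demo = [
        ["index.asp", "cake_list.asp", "cart.asp"],
        ["index.asp", "about.asp", "contact.asp"],
    ]
    for pages, intends in label_sessions(demo):
        print(pages, "->", "intention" if intends else "no intention")
```

Under this rule the first demo session is labeled as carrying a purchase intention and the second is not.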
[Figure 1 flowchart; recoverable box labels: Start, Raw Log File, Web structure topology, Data Preprocessing, Unit
Partition (sessions), Training of the Hidden Markov Model, A Predictable HMM, Prediction & Analysis, Prediction of
Intention of Behavior, End]
Figure 1: The flowchart of using HMM to predict the intention of consumer
After the data units are processed, we must construct a suitable model, which includes the structure and the
parameters of the model. The precision of the model depends mainly on the state transition diagram and the associated
parameters. Initially, we randomly assign each model parameter (transition probability A) a value between 0 and 1 such
that a11 + a12 = 1 and a21 + a22 = 1 (see Figure 2). The initial state probability distribution, π1 and π2, is assigned
in the same way, with π1 + π2 = 1. We use the Baum-Welch algorithm to obtain suitable values for these parameters. We
then use the parameters λ = (A, B, π) with the Viterbi algorithm to uncover the hidden states as consumers browse the
website.
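As an illustration of the decoding step, the sketch below implements the standard Viterbi recursion in log space for a model λ = (A, B, π). It is not the authors' MATLAB code; pages are assumed to be encoded as integer indices 0..M-1, and the function name `viterbi` is chosen here for illustration.

```python
import math


def viterbi(obs, A, B, pi):
    """Return the most likely hidden-state path for an observation sequence.

    obs: list of page indices (0..M-1); A[i][j]: transition probabilities;
    B[i][k]: probability of page k in state i; pi[i]: initial probabilities.
    """
    n = len(pi)
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    # delta[i]: best log-probability of any path that ends in state i
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # one list of best predecessors per step after the first
    for o in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best)
            delta.append(prev[best] + log(A[best][j]) + log(B[j][o]))
        back.append(ptr)
    # Trace back from the best final state
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

With the invented toy parameters A = [[0.9, 0.1], [0.2, 0.8]], B = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]], π = [0.5, 0.5], the call `viterbi([0, 0, 2, 2], A, B, pi)` returns `[0, 0, 1, 1]`.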
Figure 2: The training phase of HMM
In our approach, there are M different web pages and two hidden states, named S1 and S2. The notation b1(P1)
denotes the probability of observing page P1 in state S1. According to the Baum-Welch algorithm, the probability of
being in state S1 at time t and in state S2 at time t+1 is defined as:
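$$\xi_t(S_1, S_2) = \frac{\alpha_t(S_1)\, a_{12}\, b_2(o_{t+1})\, \beta_{t+1}(S_2)}{P(O \mid \lambda)}$$

where $\alpha_t$ and $\beta_t$ denote the forward and backward variables. Furthermore,
$\sum_{t=1}^{T-1} \xi_t(S_1, S_2)$ is the expected number of transitions from S1 to S2. Thus the probability of state
S1 at time t, γt(S1), can be represented as:

$$\gamma_t(S_1) = \sum_{j=S_1}^{S_2} \xi_t(S_1, j)$$

We can estimate the new model parameters by the following equations:

$$\pi_1^* = \gamma_1(S_1) = \sum_{j=S_1}^{S_2} \xi_1(S_1, j) = \xi_1(S_1, S_1) + \xi_1(S_1, S_2)$$

$$\pi_2^* = \gamma_1(S_2) = \sum_{j=S_1}^{S_2} \xi_1(S_2, j) = \xi_1(S_2, S_1) + \xi_1(S_2, S_2)$$

The estimated transition probability is thus:

$$a_{s_1,s_2}^* = \frac{\text{expected number of transitions from } s_1 \text{ to } s_2}{\text{expected number of transitions from } s_1} = \frac{\sum_{t=1}^{T-1} \xi_t(S_1, S_2)}{\sum_{t=1}^{T-1} \gamma_t(S_1)}$$

And

$$b_1(\text{page}_M)^* = \frac{\text{expected number of times in state } s_1 \text{ with page } M}{\text{expected number of times in state } s_1} = \frac{\sum_{t=1,\, o_t = \text{page}_M}^{T} \gamma_t(S_1)}{\sum_{t=1}^{T} \gamma_t(S_1)}$$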
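The re-estimation formulas can be sketched in code as one Baum-Welch step over a single observation sequence. This is a minimal, unscaled sketch for short sequences, not the authors' MATLAB implementation: it assumes pages are encoded as integer indices, at least two observations, and strictly positive parameters. A practical version would add scaling and accumulate the sums over all sessions.

```python
def forward(obs, A, B, pi):
    """Forward variables alpha[t][i] for every time step t and state i."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B, n):
    """Backward variables beta[t][i] for every time step t and state i."""
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, A, B, pi):
    """One re-estimation step; returns the updated (A, B, pi)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, A, B, pi), backward(obs, A, B, n)
    p_obs = sum(alpha[-1])  # P(O | lambda)
    # xi[t][i][j]: probability of state i at time t and state j at time t+1
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # gamma[t][i]: probability of state i at time t
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
             for t in range(T)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_A, new_B, new_pi
```

Here γ is computed as α·β / P(O | λ), which for t < T equals the sum of ξ over the next state used in the text. Repeating the step until the likelihood stops increasing gives the converged parameters.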
When a user browses the website, every click is recorded by the server. Our HMM can immediately predict whether
or not the user intends to buy.
Figure 3: An illustration of prediction of the buying intention of consumer
In Figure 3, when a consumer enters the website (the log server records o1), our method immediately guesses the
consumer's intention. In the same way, when the consumer clicks on the first page, our method guesses the intention at
once, and so forth.
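One way to realize this click-by-click prediction is to keep only the Viterbi scores of the current step and update them when a new page arrives. The sketch below is an assumption about how such an update could be organized, not the authors' implementation; the class name `OnlineIntentionTracker` is hypothetical and pages are assumed to be integer indices.

```python
import math


class OnlineIntentionTracker:
    """Update the most likely current hidden state after every click."""

    def __init__(self, A, B, pi):
        self.A, self.B, self.pi = A, B, pi
        self.delta = None  # best log-probability of a path ending in each state

    @staticmethod
    def _log(p):
        return math.log(p) if p > 0 else float("-inf")

    def click(self, page):
        """Consume one observed page index and return the likeliest state."""
        n = len(self.pi)
        if self.delta is None:
            # First click: initial distribution times emission probability
            self.delta = [self._log(self.pi[i]) + self._log(self.B[i][page])
                          for i in range(n)]
        else:
            # Later clicks: best predecessor, then emission of the new page
            self.delta = [
                max(self.delta[i] + self._log(self.A[i][j]) for i in range(n))
                + self._log(self.B[j][page])
                for j in range(n)
            ]
        return max(range(n), key=lambda i: self.delta[i])
```

Each call costs a fixed amount of work per click, independent of the session length. Note that the online guess for an earlier click is never revised, so it can differ from the state that a full Viterbi decoding of the finished session would assign to that click.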
In this paper, we run simulations to verify our method. We use data from DIGICAKE
(http://www.digicake.com.tw) as the benchmark. The data is a log file of the e-commerce website Digicake.com, the
biggest on-line cake-ordering e-shop in Taiwan, which sells birthday cakes and biscuits. Because we are predicting the
buying intention of consumers, it is impossible to record the true intention behind each click as a consumer browses
the website. Therefore, we consider only the last observation symbol (page) of each session and determine whether our
method correctly discovers its hidden state. We evaluate the accuracy of our method in the following way:
$$\text{Precision} = \frac{\text{Number of relevant intentions predicted}}{\text{Total number of intentions predicted}} = \frac{TP}{TP + FP}$$

There are three types of prediction results:
1. Target in interesting state and correctly predicted (TP).
2. Target in interesting state but erroneously predicted (TN).
3. Target in uninteresting state but erroneously predicted (FP).

$$\text{Recall} = \frac{\text{Number of relevant intentions predicted}}{\text{Total number of relevant intentions in collection}} = \frac{TP}{TP + TN}$$

Note that TN here denotes a missed target, which is conventionally called a false negative (FN).
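As a worked check of these two measures, the short sketch below evaluates them on the counts reported in Table 2 (TP = 88, FP = 83, and 32 missed targets, which the paper labels TN). The function names are chosen here for illustration.

```python
def precision(tp, fp):
    """Share of predicted purchase intentions that were real."""
    return tp / (tp + fp)


def recall(tp, missed):
    """Share of real purchase intentions that were predicted."""
    return tp / (tp + missed)


# Counts reported in Table 2 (the paper labels the missed targets "TN").
tp, fp, missed = 88, 83, 32
print(f"precision = {precision(tp, fp):.2f}")  # 88 / 171
print(f"recall    = {recall(tp, missed):.2f}")  # 88 / 120
```

The two values round to 0.51 and 0.73, matching the 51% precision and 73% recall listed in Table 2.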
DATA CLEANING
The raw log files record everything a visitor requests from the site, including the pictures viewed, the sounds
played, and the actions taken. In other words, the log server automatically records everything downloaded from the
site. We must therefore clean the raw log files so that only HTML, ASP, or JSP file types remain. First, we load the
log files into MS-SQL and filter out irrelevant data. The logs are in the NCSA Combined log file format, in which the
Protocol Status column shows whether a page was accessed successfully. We keep only the entries with Protocol Status =
200, so that every accessed page in the database is meaningful. We also delete the actions of the site administrator
and of search-engine bots such as those of Google or Yahoo. These procedures increase the accuracy of the data to be
processed.
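The cleaning rules above can be sketched in plain Python instead of MS-SQL. This is only an illustration under stated assumptions: the regular expression targets the usual NCSA Combined layout, the list of bot markers is invented, and `admin_hosts` is a hypothetical set of administrator addresses.

```python
import re

# host ident user [time] "METHOD url protocol" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)
PAGE_EXTENSIONS = (".html", ".htm", ".asp", ".jsp")  # ".htm" added as an assumption
BOT_MARKERS = ("googlebot", "slurp", "bot", "spider", "crawler")  # assumed list


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None


def keep_entry(entry, admin_hosts=frozenset()):
    """Keep successful page requests that are not from admins or bots."""
    path = entry["url"].split("?", 1)[0].lower()
    agent = (entry["agent"] or "").lower()
    return (
        entry["status"] == "200"
        and path.endswith(PAGE_EXTENSIONS)
        and entry["host"] not in admin_hosts
        and not any(marker in agent for marker in BOT_MARKERS)
    )


def clean_log(lines, admin_hosts=frozenset()):
    """Return the parsed entries that survive all cleaning rules."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e and keep_entry(e, admin_hosts)]
```

Matching bots by user-agent substring is a simplification; a production filter would more likely rely on a maintained list of crawler signatures.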
We then follow the WEBMINER system proposed in (Cooley et al., 1997) and divide the log files into individual
users' sessions. Notably, the DIGICAKE website is built with ASP, so we do not suffer greatly from the local cache
problem; instead, we must deal with pages whose content varies. Hence we spent considerable time classifying the
enormous number of web pages. Second, we set a threshold of 3 on the length of a unit to reduce the interference of
shorter, less valuable sessions with accuracy. One article reports that the average on-line purchase takes 3.67 minutes
and 4.6 clicks (website: clickz). We therefore prune sessions of 2 pages or fewer but keep those of 3 pages, because
the third step approaches the decision-making point.
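The session partition and the length threshold can be sketched as follows. The 30-minute inactivity timeout and the identification of a visitor by a single key are assumptions made for illustration; the paper states only that sessions of 2 pages or fewer are pruned.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed timeout; not stated in the paper
MIN_SESSION_LENGTH = 3               # sessions of 2 pages or fewer are pruned


def split_sessions(hits, gap=SESSION_GAP):
    """Group (visitor, timestamp, page) hits into per-visitor sessions."""
    sessions, last_seen, current = [], {}, {}
    for visitor, when, page in sorted(hits, key=lambda h: (h[0], h[1])):
        # A long pause closes the visitor's open session
        if visitor in last_seen and when - last_seen[visitor] > gap:
            sessions.append(current.pop(visitor))
        current.setdefault(visitor, []).append(page)
        last_seen[visitor] = when
    sessions.extend(current.values())
    return sessions


def prune_short(sessions, min_length=MIN_SESSION_LENGTH):
    """Drop sessions that are shorter than the length threshold."""
    return [s for s in sessions if len(s) >= min_length]
```

A visitor who views three pages, leaves for two hours, and then views two more pages therefore yields two sessions, of which only the first survives pruning.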
SIMULATION & RESULTS
We built representative training and test sets for predicting the intention of purchase. These data contain not
only logs of page access sequences that end with an intention of purchase but also carefully chosen page access
sequences without an intention of purchase. Data set 1 in Table 2 is extracted from the log file of the DIGICAKE
website and consists of 5,000 units after preprocessing. We assign random values to the state transition probability
distribution A and the observation symbol emission probability distribution B. We use the Baum-Welch algorithm (Rabiner
and Huang, 1993; Rabiner, 1989) in MATLAB 7 to train our HMM until it converges, in order to obtain the optimized model
parameters A and B. We then use the Viterbi algorithm (Rabiner, 1989; Huang et al., 2001; Forney, 1973) with A and B to
recover the hidden states of data set 2.
We obtained the following experimental results. Table 2 shows that the intention of purchase can be predicted by
our model. With the HMM we can find the most probable state path to approximate the behavior of the users. When this
path goes through the + states, an intention of purchase is predicted. For the testing data, 88 sessions with an
intention of purchase are correctly recovered (true positives), and 83 sessions are newly predicted as having an
intention of purchase (false positives).
Statistical models have been used in a wide range of e-business applications. A common problem in these
applications is to find a probabilistic model and a set of model parameters that account for sequences of given data.
Hidden Markov models have been particularly successful at solving such problems. In this paper, we provide a stochastic
mechanism for modeling the intention of purchase. According to the experiments above, it performs reasonably well in
predicting the intention of purchase. The major contribution of this paper is a well-defined representation for
predicting the intention of purchase using a well-known graphical model, the hidden Markov model. Furthermore, we have
successfully constructed an HMM that predicts the intention of purchase with a precision of 51% and a recall of 73%.
Table 2: Results of the HMM in detecting intention and non-intention of purchase, with precision and recall.
Data set     Role            Units   Parameters and results
Data Set 1   Training Data   5,000   Initial A = [ 0.5 0.5 ; 0.5 0.5 ]
                                     Initial B = [1/323 , …, 1/323]
Data Set 2   Testing Data    1,020   Convergence A = [0.95 0.05 ; 0.001 0.99]
                                     Convergence B = [Appendix 2.]
                                     TP = 88, FP = 83, TN = 32
                                     Precision = 51%
                                     Recall = 73%
CONCLUSION
Our experiments indicate that the Hidden Markov Model is able to predict the buying intention of web users.
Although the precision and recall are not outstanding, there are still many ways to improve them. Site managers can
adopt our method to implement a customized shopping environment. If we detect in advance that a consumer intends to
buy, we can offer more considerate utilities, such as a calculator or a shopping agent, or provide relevant sales
information with a preferential price, in order to retain the customer and increase sales revenue.
Appendix 1
The formal definition of the Hidden Markov model:
N : the number of hidden states in the model
S : the individual states {s1, s2, … , sN}
qt : the state at time t, so that qt = si for some state si
M : the number of distinct observation symbols per state
V : the individual symbols {v1, v2, … , vM}
A : the state transition probability distribution {aij}
B : the observation symbol emission probability distribution {bj(k)}
π : the initial state probability distribution {π1, π2, … , πN}
λ : the model, written in the compact form λ = (A, B, π)
O : the observation sequence O = o1, o2, …, oT
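The definition above maps naturally onto a small data structure. The sketch below is one possible representation, with hypothetical names `HMM`, `is_valid`, and `uniform_hmm`; it checks only that the shapes agree and that every distribution sums to 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HMM:
    """Container for lambda = (A, B, pi)."""
    A: List[List[float]]   # N x N state transition probabilities a_ij
    B: List[List[float]]   # N x M emission probabilities b_j(k)
    pi: List[float]        # N initial state probabilities

    def is_valid(self, tol=1e-9):
        """Check shapes and that every distribution sums to 1."""
        n = len(self.pi)
        if len(self.A) != n or len(self.B) != n:
            return False
        if any(len(row) != n for row in self.A):
            return False
        if len({len(row) for row in self.B}) != 1:
            return False
        rows = self.A + self.B + [self.pi]
        return all(abs(sum(row) - 1.0) <= tol and min(row) >= 0 for row in rows)


def uniform_hmm(n_states, n_symbols):
    """Build a model in which every distribution is uniform."""
    return HMM(
        A=[[1.0 / n_states] * n_states for _ in range(n_states)],
        B=[[1.0 / n_symbols] * n_symbols for _ in range(n_states)],
        pi=[1.0 / n_states] * n_states,
    )
```

For example, `uniform_hmm(2, 323)` reproduces the initial A and B values listed in Table 2 for N = 2 states and M = 323 pages.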
REFERENCES
Mobasher, B., Jain, N., Han, E., and Srivastava, J. (1996), “Web Mining: Pattern discovery from World Wide Web transactions”, Technical Report TR
96-050, University of Minnesota, Dept. of Computer Science, Minneapolis.
http://www.w3.org
B2C e-Commerce in the US (2003), eMarketer.
Cooley, R., Mobasher, B., and Srivastava, J. (1997), “Grouping web page references into transactions for mining world wide web browsing patterns”.
Cooley, R., Mobasher, B., and Srivastava, J. (1997), “Web mining: Information and pattern discovery on the World Wide Web”, In International
Conference on Tools with Artificial Intelligence, pages 558-567, Newport Beach, CA.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (2002), “Biological sequence analysis”, Cambridge University Press.
Forney, G.D. (1973), The Viterbi algorithm, Proc. IEEE, vol.61, pp. 268-278.
Freitag, D., and McCallum, A. (1999), “Information Extraction with HMMs and Shrinkage,” Just Research.
Freitag, D., and McCallum, A. (2000), “Information Extraction with HMM Structures Learned by Stochastic Optimization”, Just Research.
http://www.iisfaq.com/
Hoel, P.G., Sidney, C.P., and Stone, C.J.(1987), Introduction to Stochastic Processes. Prospect Heights, IL: Waveland Press, Inc.
Hu, J., Brown, M. K., and Turin, W. (1996), “HMM Based On-line Handwriting Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence.
Huang, X. et al., (2001), “Spoken Language Processing”, Chapter 8.
Leonardi, R., and Migliorati, P. (2002), “Semantic indexing of multimedia documents,” IEEE Multimedia, vol.9, no.3.
Lin, H.C., Wang, L.L., and Yang, S.N. (1997), “Color Image Retrieval Based on Hidden Markov Models,” IEEE Trans. On Image Processing, vol.6.
Liu, J., and You, J. (2003), “Smart Shopper: An Agent-Based Web-Mining Approach to Internet Shopping,” IEEE Transactions on Fuzzy
Systems, vol. 11, no. 2.
http://www.clickz.com/stats/sectors/retailing/article.php/6061_3307261
Nakao, M., Akira, N., Shimodaira, H. and Sagayama, S. (2001), “Substroke Approach to HMM-Based On-line Kanji Handwriting Recognition,” IEEE.
Nefian, A.V., Hayes, M.H. (1998), “Hidden Markov Models for Face Recognition,” Proceeding of the IEEE, Acoustics, Speech, and Signal Process, vol. 5.
Cooley, R., Mobasher, B., and Srivastava, J. (1999), “Data Preparation for Mining World Wide Web Browsing Patterns”.
Rabiner, L. (1989), “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, vol. 77, No. 2.
Rabiner, L. and Huang, B. (1993), Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall.
Quinlan, R. (2000), “KDD-99 Panel on Last 10 and Next 10 years”, In SIGKDD Explorations.
SIGKDD (2000), Explorations Issues volume 1.
Wang, B., and Liu, Z. (2003), “Web Mining Research”, In Proceedings of the Fifth International Conference on Computational Intelligence and
Multimedia Applications (ICCIMA).
http://www.bluemartini.com/
http://www-306.ibm.com/software/websphere/
Wilpon, J.G. et al. (1990), “Automatic Recognition of Keywords in Unconstrained Speech Using Hidden Markov Models,” IEEE Trans. on ASSP,
vol. 38, no. 11, pp.1878-1970.