Context-Aware Online Learning for Course
Recommendation of MOOC Big Data
arXiv:1610.03147v1 [cs.LG] 11 Oct 2016
Yifan Hou, Member, IEEE, Pan Zhou, Member, IEEE, Ting Wang, Yuchong Hu, Member, IEEE, Dapeng Wu, Fellow, IEEE
Abstract—The Massive Open Online Course (MOOC) has expanded significantly in recent years. With the widespread adoption of MOOCs, the opportunity to study fascinating courses for free has attracted numerous people of diverse educational backgrounds all over the world. In the big data era, a key research topic for MOOCs is how to mine the needed courses from the massive course databases in the cloud for each individual (course) learner accurately and rapidly, as the number of courses is increasing quickly. In this respect, the key challenge is how to realize personalized course recommendation while reducing the computing and storage costs for the tremendous course data. In this paper, we propose a big-data-supported, context-aware, online-learning-based course recommender system that can handle dynamic and virtually infinite datasets, recommending courses by using personalized context information and historical statistics. The context-awareness takes personal preferences into consideration, making the recommendation suitable for people with different backgrounds. Besides, the algorithm achieves sublinear regret, which means it gradually recommends the most preferred and best matched courses to learners. Unlike other existing algorithms, ours bounds the time complexity and space complexity linearly. In addition, our devised storage module is extended to distributed, connected clouds, which can handle massive course storage from heterogeneous sources. Our experimental results verify the superiority of our algorithms compared with existing works in the big data setting.
Index Terms—MOOC, big data, contextual bandit, course recommendation, online learning
I. INTRODUCTION
MOOC is a concept first proposed in 2008 that became known to the world in 2012 [1] [2]. Not being accustomed to the traditional teaching model, or desiring a unique learning style, a growing number of people prefer learning on MOOCs. Advanced thoughts and novel ideas give great vitality to MOOCs, and over 15 million users have registered on Coursera [3], one of the major MOOC platforms. A course recommender system helps the learner find the required course directly in the course ocean of the numerous MOOC platforms such as Coursera, edX, and Udacity [4]. However, due to the rapid growth in the number of users, the needed courses
Yifan Hou and Pan Zhou are with School of Electronic Information and
Communications, Huazhong University of Science and Technology, Wuhan
430074, China.
Ting Wang is with Computer Science and Engineering, Lehigh University,
PA 18015, USA.
Yuchong Hu is with School of Computer Science and Technology,
Huazhong University of Science and Technology, Wuhan, 430074 China.
Dapeng Oliver Wu is with Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA.
Contacting email: [email protected]
This work was supported by the National Science Foundation of China
under Grant 61401169, Grant CNS-1116970, and Grant NSFC 61529101.
have been expanding continuously. According to a survey on MOOC completion rates, only 4% of learners finish their chosen courses. Therefore, finding preferable course resources and locating them in a massive data bank, e.g., on cloud computing and storage platforms, is a daunting "needle-in-a-haystack" problem.
One key challenge in future MOOC course recommendation is processing tremendous data that bear the big data features of volume, variety, velocity, variability and veracity [5]. Precisely, a recommender system for MOOC big data needs to handle dynamically changing, heterogeneous-source, a priori unknown-scale and nearly infinite course data effectively. Moreover, since the Internet and cloud computing services are moving toward supporting different users around the world, with cultural differences, geographic disparities and varying education levels, the recommender system needs to consider the features of the students (learners), i.e., each has a unique preference in evaluating a course in a MOOC. For example, one learner pays more attention to the quality of exercises while another focuses more on the classroom rhythm. We use the concept of context to represent these features as the learners' personalized information. The context space is encoded as a multidimensional space (dX dimensions), where dX is the number of features. As such, the recommendation becomes student-specific, which improves the recommendation accuracy. Hence, appending context information to the model handling the courses is inevitable [7] [8].
Previous context-aware algorithms such as [29] only perform well when the scale of the recommendation dataset is known. Specifically, the algorithm in [29] ranks all courses in the MOOC as leaf nodes, then clusters relevant courses together as course nodes to build their parent nodes based on historical information and current users' features. The algorithm keeps clustering the course nodes and building their parent nodes until the root node is reached (a bottom-up design). If a new course arrives, all the nodes and clusters change and must be computed again. For MOOC big data, since the number of courses keeps increasing and becomes fairly tremendous, the algorithms in [29] are prohibitive to apply.
The main theme of this paper is recommending courses from tremendous datasets to students in real time based on their preferences. The course data are stored in course cloud(s) and new courses can be loaded at any time. We devise a top-down binary tree to denote and record the process of partitioning the course dataset, where every node in the tree is a set of courses. The course scores fed back from students through the marking system are denoted as rewards. Specifically, there is only one root course node in the binary tree at first. Every time a course is recommended, a reward is fed back from the student, which is used to improve the accuracy of the next recommendation. The reward structure is an unknown stochastic function of the context and course features at each recommendation, and our algorithm tracks the expected reward of every node. The course binary tree then divides the current node into two child nodes and selects one course randomly from the node with the current best expected value. It omits most of the courses in the nodes that would not be selected, which greatly improves the learning performance. It also supports adding incoming new courses to the existing nodes as unselected items without changing the currently built pattern of the tree.
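The top-down course tree described above can be sketched as follows. This is a minimal illustration of the idea, not the exact RHT algorithm of Section IV; the class and the simple halving split rule are our own simplification:

```python
class CourseNode:
    """A node of the top-down binary course tree: a set of courses
    plus the running statistics used to estimate its expected reward."""
    def __init__(self, courses):
        self.courses = list(courses)   # course IDs contained in this node
        self.children = None           # None until the node is split
        self.pulls = 0                 # times a course was recommended here
        self.mean_reward = 0.0         # running average of learner feedback

    def update(self, reward):
        """Fold one learner payoff into the node's running mean."""
        self.pulls += 1
        self.mean_reward += (reward - self.mean_reward) / self.pulls

    def split(self):
        """Divide the node into two disjoint child nodes
        (the counterparts of N_{h+1,2i-1} and N_{h+1,2i})."""
        mid = len(self.courses) // 2
        self.children = (CourseNode(self.courses[:mid]),
                         CourseNode(self.courses[mid:]))
        return self.children

    def add_course(self, course):
        """New courses join an existing node without rebuilding the tree."""
        self.courses.append(course)

root = CourseNode(range(8))   # root N_{0,1} holds the whole course set
left, right = root.split()    # one top-down partition step
left.update(0.9)              # a learner's payoff for a course drawn from `left`
root.add_course(8)            # a newly uploaded course; the tree is untouched
```

The point of the design is visible in the last line: an incoming course only appends to a node's course list, so no re-clustering of the whole tree is needed.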
However, other challenges that influence the recommendation accuracy still remain. In practice, we observe that the number of courses keeps increasing and the in-memory storage cost of one online course is about 1 GB on average, which is fairly large. Therefore, how to store the tremendous course data and process it effectively becomes a challenge. Most previous works [29] [30] only achieve linear space complexity, which is not promising for MOOC big data. We propose a distributed storage scheme that stores the course data in many small storage units separately. For example, the storage units may be divided based on the MOOC platforms or the course languages. On the one hand, this method makes the invocation process effective with little extra cost on course recommendation. On the other hand, we prove that the space complexity can be bounded sublinearly under the optimal condition, which is better than [29].
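The distributed storage idea can be sketched as routing each course to a small storage unit keyed by an attribute such as platform or language. This is a toy sketch; the class name and the course/unit identifiers are hypothetical:

```python
class DistributedCourseStore:
    """Toy distributed store: courses are spread over small units keyed
    by an attribute (e.g., platform or language) instead of being kept
    in one big storage carrier."""
    def __init__(self):
        self.units = {}   # unit key -> list of course IDs

    def put(self, course_id, key):
        """Route a course to the small unit identified by `key`."""
        self.units.setdefault(key, []).append(course_id)

    def get_unit(self, key):
        """Only the relevant small unit is touched when recommending."""
        return self.units.get(key, [])

store = DistributedCourseStore()
store.put("cs101", key="Coursera")
store.put("ml201", key="edX")
store.put("algo301", key="Coursera")
```

Because a recommendation only needs to invoke the one unit matching the learner's request, the per-recommendation cost stays small even as the total number of stored courses grows.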
In summary, we propose an effective context-aware and
online learning algorithm for course big data recommendation to offer the courses to learners in MOOCs. The main
contributions are listed as follows:
• The algorithm can accommodate highly dynamic, increasing course database environments, realizing real big data support through a course tree that can index nearly infinite and dynamically changing datasets.
• We consider context-awareness for better personalized course recommendations, and devise an effective context partition scheme that greatly improves the learning rate and accuracy for students with different features.
• Our proposed distributed storage model stores data in distributed units rather than a single storage carrier, allowing the system to utilize the course data better and to perform well with huge amounts of data.
• Our algorithms enjoy superior time and space complexity. The time complexity is bounded linearly, which means the algorithms have a higher learning rate than previous methods in MOOCs. For the space complexity, we prove that it is linear for the primary algorithm and sublinear for the distributed storage algorithm under the optimal condition.
The rest of the paper is organized as follows: Section II reviews related works and compares them with our algorithms. Section III formulates the recommendation problem and the algorithm models. Section IV and Section V illustrate our algorithms and bound their regret. Section VI analyzes the space complexity of our algorithms and compares the theoretical results with existing works. In Section VII, we verify the algorithms by experimental results and compare with relevant previous algorithms [29] [31]. Section VIII concludes the paper.
II. RELATED WORKS
A plethora of previous works exists on recommendation algorithms. For MOOCs, the two major tactics are filtering-based approaches and online learning methods [11]. Among filtering-based approaches, there are branches such as collaborative filtering [12] [9], content-based filtering [13] and hybrid approaches [14] [15]. The collaborative filtering approach gathers students' learning records together, classifies them into groups based on the provided characteristics, and recommends a course from the group's learning records to new students [12] [9]. Content-based filtering recommends a course to the learner that is relevant to his or her previous learning records [13]. The hybrid approach combines the two methods. The filtering-based approaches can perform better at the beginning than online learning algorithms. However, when the data become very large-scale or stochastic, the filtering-based approaches lose accuracy and become incapable of utilizing the history records adequately. Meanwhile, lacking context makes these methods unable to recommend courses precisely by taking every learner's preference into account.
Online learning can overcome the deficiencies of filtering-based approaches. Most previous works on course recommendation utilize adaptive learning [16] [17] [18]. In [16], the CML model was presented. This model combines the cloud, a personalized course map and an adaptive MOOC learning system, which is quite comprehensive for course recommendation with context-awareness. Nevertheless, for big data the model is not efficient enough, since it cannot handle dynamic datasets and may have a high time cost with nearly "infinite" datasets. Similar works are widely found in [19]-[24] as contextual bandit problems. In these works, the systems know the rewards of the selected items and record them every time, which means the course feedback can be gathered from learners after they receive the recommended course. No previous work realizes contextual bandits with infinitely increasing datasets. Our work is motivated by [24] for big data support; however, we consider context-aware online learning for the first time with a delicately devised context partition scheme for MOOC big data.
III. PROBLEM FORMULATION
In this section, we present the system model, context model, course model and the regret definition. Besides, we define some relevant notations and preliminary definitions.
Fig. 1 illustrates the model of operation. Senior users such as professors or network platforms upload MOOC course resources, including pre-recorded videos for open classes, live videos for online courses, class tests, etc., to the course cloud. Then the system collects learners' essential feature information, for instance, the reputation of professors, the course language, the educational level concerning courses of appropriate difficulty, the quality of presentations, etc. When the system obtains the information, it recommends
Fig. 2: Context and Course Space Relation Schema
Courses: We model the set of courses as a dC-dimensional vector space, where each vector contains dC course features. The number of courses is unknown, so we cannot determine dC before obtaining the various datasets. Since the number of courses grows exponentially, we let the number tend to infinity to adapt to the development trend of MOOC big data. Specifically, we consider the course model as sets with unknown distributions. Similar to the context, we define the course space as C and the correlation distance between courses as DC(c, c′) to indicate the relativity between two courses.
Fig. 1: MOOC Course Recommendation and Feedback System
courses which the users have not learned before to the learners as alternative choices. Then learners give course satisfaction payoffs to the system based on their preferences; for example, some learners prefer computer science and others prefer classical music.
A. System Model
Definition 1. DC over C is a non-negative mapping: DC(c, c′) ≥ 0 (C² → R), with DC(c, c′) = 0 if c = c′.
We use time slots t = 1, 2, · · · , T to denote rounds. In each time slot, there are three running states: (1) a learner with exclusive context information comes into our model; (2) the model recommends a course from the current course cluster to the learner; (3) the learner provides a payoff for the newly recommended course to the system. Afterwards the model learns how to perform better in the next recommendation procedure on account of the rewards obtained. We give the concepts of three essential elements:
Learners: We denote the set of incoming learners as S = (1, 2, · · · ) during the run of the algorithm. When a learner comes, the system suggests a course based on the previous records of payoffs and context information.
Contexts: The context is illustrated as a set X, with the space being continuous. Besides, we assume the context space is a dX-dimensional space, which means the context x ∈ X is a vector of dX dimensions. For example, the context vector includes educational levels from zero basis to PhD in related fields, geographical positions and language backgrounds. With the normalization of the context information, we suppose the context space is X = [0, 1]^dX, a unit hypercube. DX(x, x′) is used to denote the distance between contexts. We use the Lipschitz condition to define the distance.
We assume that two courses which are more relevant have a smaller correlation distance (DC) between them. For example, regarding the language feature of context, two courses both taught in English have a closer distance than two courses in different languages.
Fig. 2 illustrates the relationship between context information and course information with respect to reward. To better illustrate the relations, we degenerate their dimensions to one, dX = dC = 1. Thus we take the context information and the course details as the two horizontal axes of a rectangular coordinate system. From the schematic diagram, the reward varies as the context and course change. To be more specific, for a given learner whose context is constant, the reward differs across courses, as shown in the blue plane coordinate system. On the other hand, for a given course, shown in the crystal plane coordinate system, different people have different attitudes. In practice, we still take the reward dimension as 1, but expand the dimension of context to dX and the dimension of course to dC.
B. Context Model for Individualization
Since the context space is a dX-dimensional vector space, we normalize every dimension of the context to range from 0 to 1. The realistic meaning of the dimensions denotes dX unrelated features, such as age, cultural background, nationality, etc., representing the preferences of the learner. We define the slicing number of the context unit hypercube as nT, indicating the number of grades in each dimension. For a better formulation, PT = {P1, P2, · · · , P(nT)^dX} is used to denote the sliced sub-hypercubes. As illustrated in Fig. 3, we let dX = 3 and nT = 2.
Assumption 1. There exists a constant LX > 0 such that for all contexts x, x′ ∈ X we have DX(x, x′) ≤ LX ||x − x′||^α, where || · || denotes the Euclidean norm in R^dX.
Note that the Lipschitz constant LX is not required to be known by our recommendation algorithm. It is only used in quantifying the performance. Assumption 1 bounds the maximal context deviation, whose threshold is defined by LX. We present the context distance mathematically with LX and α.
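The context partitioning of this subsection can be sketched as follows: a normalized context vector is mapped to its sub-hypercube index and then replaced by the cube's center point. This is a minimal sketch; the helper names are ours:

```python
def context_cell(x, n_T):
    """Map a normalized context vector x in [0,1]^dX to the index tuple
    of its sub-hypercube when each axis is sliced into n_T equal parts."""
    # min(..., n_T - 1) keeps the boundary value 1.0 inside the last slice
    return tuple(min(int(xi * n_T), n_T - 1) for xi in x)

def cell_center(cell, n_T):
    """Center point of a sub-hypercube, used in place of the raw context."""
    return tuple((i + 0.5) / n_T for i in cell)

# dX = 3, nT = 2: the unit cube is split into 2^3 = 8 sub-hypercubes
cell = context_cell((0.2, 0.9, 0.5), n_T=2)     # -> (0, 1, 1)
center = cell_center(cell, n_T=2)               # -> (0.25, 0.75, 0.75)
```

Replacing the raw context by the center introduces a bounded deviation; under Assumption 1 it is at most LX (sqrt(dX)/nT)^α, the diagonal of a sub-hypercube scaled by the Lipschitz constant.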
Fig. 3: Context and Course Partition Model

Every axis is divided into two parts, so the number of sub-hypercubes is (nT)^dX = 8. For simplicity, we use the center point of a sub-hypercube to represent the specific contexts in that cube. With this model of context, we divide the different users into (nT)^dX types.

C. Course Set Model for Recommendation

For the course model, we use a binary tree to index the whole course dataset. The courses are gathered together and then divided into two parts by the algorithm (based on the regret information), which is the exploring process of the tree model. The number of nodes at depth h is 2^h, so we let Nh,i (1 ≤ i ≤ 2^h) represent a node at depth h. The root node of the binary course tree is the set of all courses, N0,1 = C. With the exploration of the tree, the two child nodes together contain all the courses of the parent node and never intersect each other: Nh,i = Nh+1,2i−1 ∪ Nh+1,2i, and Nh,i ∩ Nh,j = ∅ for any i ≠ j. Thus C can be covered by the regions of the Nh,i at any depth: C = ∪_{i=1}^{2^h} Nh,i.

To better describe the nodes, we define diam(Nh,i) to indicate the size of a course node:

    diam(Nh,i) = sup_{c,c′ ∈ Nh,i} DC(c, c′).

The distance between courses can be represented by the gap between course languages, course time lengths, course types, or anything else that indicates the discrepancy. We denote the size of a node by the largest distance within it. Note that the diameter is based on the distance, which can be adjusted by selecting different mappings. For the convenience of analysis, we make some reasonable assumptions as follows.

Assumption 2. For any node Nh,i, there exist a constant k1 and m ∈ (0, 1) such that diam(Nh,i) ≤ k1 (m)^h.

With Assumption 2 we can limit the size of the nodes by k1(m)^h. We utilize this exponential convergence to keep a threshold on the node diameter. Since the model tree is a binary tree, the number of nodes increases exponentially as the depth rises. The data we get are stochastic (i.i.d.), so we use the mean reward to handle the model. For each context sub-hypercube, there is only one overall optimal course in a node of each depth, and each node has a local optimal course. We define the course gap with the help of the mean reward f(r).

Assumption 3. For all courses c1, c2 ∈ C at the same context point, they satisfy

    f(r^{Pt}_{c1}) − f(r^{Pt}_{c2}) ≤ max{ f(r^{Pt*}) − f(r^{Pt}_{c1}), DC(c1, c2) }.    (1)

Assumption 3 uses the suboptimal gap and the distance to denote the mean reward gap between course c1 and course c2. With this assumption, we only need to consider the suboptimal node's gap and distance rather than the gap between two normal nodes directly.

D. The Regret of Learning

Simply put, the regret indicates the loss of accuracy in the recommendation procedure due to the unknown dynamics. We assume the reward rt ∈ [0, 1] is the indication of accuracy. According to the reward, we define the regret as

    R(T) = Σ_{t=1}^{T} r^{Pt*} − E[ Σ_{t=1}^{T} I(r̂t = rt) ],    (2)

where rt represents the reward at time t and the superscript * denotes the optimal solution, i.e., r* is the best reward. The regret shows the convergence rate toward the optimal recommended option. When the regret is sublinear, R(T) = O(T^γ) with 0 < γ < 1, the algorithm will finally converge to the optimal solution. In the following sections we propose our algorithms with sublinear regret.

IV. REPARTIMIENTO HIERARCHICAL TREE

In this section we propose our main online learning algorithm to mine a course in MOOC big data.

A. Algorithm

The algorithm is called Repartimiento Hierarchical Trees (RHT) and its pseudocode is given in Algorithm 1. In this algorithm we first find the arriving learner's context sub-hypercube Pt in the context space and replace the original context with the center point xt of that sub-hypercube. We use the set Γ to denote all the nodes whose courses have been recommended and the set Ω to record the exploration path. Besides, we introduce some new concepts: the bound B^{Pt}_{h,i}, which is the upper bound of the reward, and the estimation E^{Pt}_{h,i}, which is the estimated value of the reward at depth h of the tree for context sub-hypercube Pt. Each time, the algorithm finds the course node whose E^{Pt}_{h,i} is highest in the set Γ and walks to that node along the route Ω, selecting one course from the node and recommending it in exchange for the reward from the learner. The procedure is shown in Fig. 4. When the reward is fed back, the algorithm refreshes the E^{Pt}_{h,i} values of the nodes of the current tree based on B^{Pt}_{h,i} and the reward rt.

Since exploring is a top-down process, after we refresh the upper bound of the reward in the course nodes, we update the estimation values from bottom to top. At the end of Algorithm 1, the tree is refreshed with the E^{Pt}_{h,i} values.

Algorithm 2 shows the exploration process in RHT. When we turn to explore new course nodes, the model prefers to select the nodes with higher estimation values. In other words, the essence of selecting a new node is reducing the size of the course cluster, making it easier to find the optimal courses for learners. After a new node is chosen, it is put into the sets Γ^{Pt} and Ω^{Pt} for the next calculation.

In Algorithm 3 we use B^{Pt}_{h,i} to indicate the upper bound of the highest reward. The first term µ̂^{Pt}_{h,i} is the average reward, computed from the learners' payoffs.
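The node bound B combines an empirical mean with an exploration term, a node-diameter term (from Assumption 2), and a context-discretization term. A simplified sketch follows; the constant values are illustrative, not the paper's exact tuning:

```python
import math

def bound(mean_reward, pulls, T, k1, k2, m, h, L_X, d_X, n_T, alpha=1.0):
    """Upper bound B for a course node at depth h: empirical mean plus
    exploration, node-diameter, and context-discretization terms."""
    exploration = math.sqrt(k2 * math.log(T) / pulls)  # sqrt(k2 ln T / T_{h,i})
    diameter = k1 * m ** h                             # k1 m^h (Assumption 2)
    context = L_X * (math.sqrt(d_X) / n_T) ** alpha    # L_X (sqrt(dX)/nT)^alpha
    return mean_reward + exploration + diameter + context

b = bound(mean_reward=0.6, pulls=25, T=1000, k1=2.0, k2=1.0,
          m=0.5, h=3, L_X=1.0, d_X=3, n_T=4)
```

As a node accumulates pulls, the exploration term shrinks; as the tree deepens, the diameter term shrinks; and a finer context grid (larger n_T) shrinks the context term, which is the trade-off the regret analysis balances.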
Algorithm 1 Repartimiento Hierarchical Trees (RHT)
Require: the constants k1 > 0, m ∈ (0, 1), the learner's context xt
Auxiliary functions: Exploration and Bound Updating
Initialization: for all context sub-hypercubes Pt belonging to PT, Γ^{Pt} = {N^{Pt}_{0,1}} and E^{Pt}_{1,i} = ∞ for i = 1, 2
1: for t = 1, 2, . . . , T do
2:   for dt = 0, 1, 2, . . . , dX do
3:     Find the context interval in dimension dt
4:   end for
5:   Get the context sub-hypercube Pt
6:   xt ← center point of Pt
7:   N^{Pt}_{h,i} ← N^{Pt}_{0,1}
8:   Ω^{Pt} ← N^{Pt}_{h,i}
9:   Exploration(Γ^{Pt})
10:  Select a course randomly from the node N^{Pt}_{h,i} and recommend it to the learner
11:  Get the reward rt
12:  for all Pt ∈ PT do
13:    Bound Updating(Ω^{Pt})
14:    Ω^{Pt}_temp ← Ω^{Pt}
15:    for Ω^{Pt}_temp ≠ N^{Pt}_{0,1} do
16:      N^{Pt}_{h,i} ← one leaf of Ω^{Pt}
17:      E^{Pt}_{h,i} ← min{ B^{Pt}_{h,i}, max{ E^{Pt}_{h+1,2i−1}, E^{Pt}_{h+1,2i} } }
18:      Delete N^{Pt}_{h,i} from Ω^{Pt}_temp
19:    end for
20:  end for
21: end for

Algorithm 2 Exploration (a sub-procedure used in RHT)
1: while N^{Pt}_{h,i} ∈ Γ^{Pt} do
2:   if E^{Pt}_{h+1,2i−1} > E^{Pt}_{h+1,2i} then
3:     Temp ← 1
4:   else if E^{Pt}_{h+1,2i−1} < E^{Pt}_{h+1,2i} then
5:     Temp ← 0
6:   else
7:     Temp ∼ B(0.5)
8:   end if
9:   N^{Pt}_{h,i} ← N^{Pt}_{h+1,2i−Temp}
10:  Ω^{Pt} ← Ω^{Pt} ∪ N^{Pt}_{h,i}
11: end while
12: Γ^{Pt} ← N^{Pt}_{h,i} ∪ Γ^{Pt}

Algorithm 3 Bound Updating (a sub-procedure used in RHT)
1: for all N^{Pt}_{h,i} ∈ Ω^{Pt} do
2:   T^{Pt}_{h,i} ← T^{Pt}_{h,i} + 1
3:   µ̂^{Pt}_{h,i} ← [ (T^{Pt}_{h,i} − 1) µ̂^{Pt}_{h,i} + rt ] / T^{Pt}_{h,i}
4:   B^{Pt}_{h,i} ← µ̂^{Pt}_{h,i} + sqrt( k2 ln T / T^{Pt}_{h,i} ) + k1 (m)^h + LX (√dX / nT)^α
5: end for
6: E^{Pt}_{h+1,2i−1} ← ∞
7: E^{Pt}_{h+1,2i} ← ∞

Fig. 4: Algorithm Demonstration

In the bound B^{Pt}_{h,i} of Algorithm 3, the first term µ̂^{Pt}_{h,i} is the average reward; it comes from the learners' payoffs, µ̂^{Pt}_{h,i} = [ (T^{Pt}_{h,i} − 1) µ̂^{Pt}_{h,i} + rt ] / T^{Pt}_{h,i}. The second term, sqrt( k2 ln T / T^{Pt}_{h,i} ), indicates the deviation in the exploration-exploitation problem; it is the tradeoff between exploration, which ensures the scope, and exploitation, which guarantees the depth. The third term is the deviation within the course nodes. As for the last term, since we substitute the sub-hypercube center point for the original context, we utilize the Lipschitz constant LX and the diagonal √dX / nT of the sub-hypercube to denote the context deviation LX (√dX / nT)^α.

B. Regret Analysis

According to the definition of regret, all the nodes which have been selected bring regret. We consider the regret in one sub-hypercube Pt and finally sum over all of them. Since the regret is the deviation from the optimal courses, we first need to define the optimal course.

Definition 2. To locate a node in an affirmative way, we define the optimal track as ℓ^{Pt*}_{h,i} = ∪_{h′=1}^{h} N^{Pt}_{h′,i′}.

The track is the aggregate of the optimal nodes at depths from 0 to h. To represent the regret precisely, we need to define the minimum suboptimality gap, which indicates the difference between the optimal course in a node and the overall optimal course, to better describe the model.

Definition 3. The minimum suboptimality gap and the context gap are

    D^{Pt}_{h,i} = f(r^{Pt*}) − f(r^{Pt*}_{h,i}),    LX (√dX / nT)^α.    (3)

Theoretically speaking, the minimum suboptimality gap of N^{Pt}_{h,i} is the expected reward difference between the overall optimal course and the best one in N^{Pt}_{h,i}, and the context gap is the difference between the original point and the center point in the context sub-hypercube Pt. However, what we can get is the mean payoff rather than the expectation, so we will bound this loss in the following lemmas. As for the context regret, we take the largest difference it can bring about in a sub-hypercube.

After these definitions, we can find a measurement to divide all the nodes into two kinds for the following proof. Based on Definition 3, we let the set φ^{Pt} be the 2[ k1(m)^h +
LX (√dX / nT)^α ]-optimal nodes at depth h:

    φ^{Pt}_h = { N^{Pt}_{h,i} : f(r^{Pt*}) − f(r^{Pt*}_{h,i}) ≤ 2[ k1(m)^h + LX (√dX / nT)^α ] }.

For convenience, we use the per-depth sets to illustrate the node sets:

    Γ^{Pt} = ∪_h φ^{Pt}_h ∪ ∪_h (φ^{Pt}_h)^c.    (7)

Note that we call the nodes in the set φ^{Pt} optimal nodes and those outside it suboptimal nodes.
We defined above the regret incurred when one node is selected; however, we should also determine how many times the node is chosen in the recommending process. We start with the suboptimal nodes. Based on Definition 2 and Definition 3, the suboptimal nodes depart from the optimal track ℓ^{Pt*}_{h,i} at some depth.

Definition 4. We introduce the concept of the packing number. Assuming the size of the course space is 1, at depth h the node size threshold is k1(m)^h, so the packing number is

    κ^{Pt}_h( ∪ N_{h,i}, Ra ) = K [Ra]^{−dC},    (8)

where K is the adjustment constant and Ra is the radius of the packing balls.

Lemma 1. Suppose the nodes N^{Pt}_{h,i} are suboptimal and at some depth j (1 ≤ j ≤ h − 1) the path leaves the best track. For an arbitrary integer q, the expected number of times the node N^{Pt}_{h,i} and its descendants are played in Pt is

    E[T^{Pt}_{h,i}(t′)] ≤ q + Σ_{n=q+1}^{t′} P[ B^{Pt}_{h,i}(n) > f(r^{Pt*}) and T^{Pt}_{h,i}(n) > q, or B^{Pt}_{j,i}(n) ≤ f(r^{Pt*}) for j ∈ {q+1, . . . , n−1} ].

Proof: See Appendix A.

We determine by Lemma 1 the threshold on the number of times the node N^{Pt}_{h,i} and its descendants are selected. Since we don't know how many times the context sub-hypercube Pt has been selected by time T, we use the context time t′ to represent the total times in Pt. The sum of the t′ is the total time, Σ t′ = T.

Lemma 4. In the same context sub-hypercube, the number of the 2[ k1(m)^h + LX(√dX/nT)^α ]-optimal nodes can be bounded as

    |φ^{Pt}_h| ≤ K [ k1(m)^h + LX(√dX/nT)^α ]^{−dC}.    (9)

Proof: Since the number of courses is infinite and we can't know the data exactly, the dimension of the course space can't be determined. There exists a constant d′ such that

    |φ^{Pt}_h| ≤ κ^{Pt}_h( ∪{ N^{Pt}_{h,i} ∈ φ^{Pt}_h }, k1(m)^h + LX(√dX/nT)^α ) ≤ K [ k1(m)^h + LX(√dX/nT)^α ]^{−d′}.    (10)

We take the minimal d′ as the course dimension dC. Since MOOC big data has various data structures, knowing the exact details of the courses is impossible; thus we can't determine the accurate dimension dC. After we know the regret of selecting one node, the number of times a node has been selected, and the number of chosen nodes, we can bound the whole regret in Theorem 1.
However, from Lemma 1 we can't get the expectation of the node chosen times directly, so we introduce Lemma 2 to bound the gap between the mean value and the expectation.

Lemma 2. Based on Assumption 2, the inequality

    P{ B^{Pt}_{h,i}(t′) ≤ f(r^{Pt*}) } ≤ (t′)^{−2k2+1}    (4)

holds for all optimal nodes N^{Pt}_{h,i} at any context time t′.

Proof: See Appendix B.

Theorem 1. From the lemmas above, we can bound the regret of RHT as

    E[R(T)] = O( T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+2)} ).    (11)
Lemma 2 determines the threshold for the optimal nodes, which means B^{Pt}_{h,i} will eventually exceed the expected reward. This guarantees the accuracy of our B^{Pt}_{h,i} with strict mathematical proofs. Combining Lemma 1 and Lemma 2, we can get precise results about the node chosen times in Lemma 3.

Lemma 3. For the suboptimal nodes N^{Pt}_{h,i} outside φ^{Pt}, if q satisfies

    q ≥ 4 k2 ln T / [ D^{Pt}_{h,i} − k1(m)^h − LX(√dX/nT)^α ]²,    (5)

then for all t′ ≥ 1 we have

    E[T^{Pt}_{h,i}(t′)] ≤ 4 k2 ln T / [ k1(m)^h + LX(√dX/nT)^α ]² + M,    (6)

where M is a constant less than 5.

Proof: See Appendix C.

In this lemma we use the deviations of context and course to bound the played times. Practically speaking, we find an upper bound for the number of times a course node is chosen, which means we can determine one node's regret during the process. But this is not sufficient to bound the whole regret; what we still have to know is how many nodes there are. We divide the nodes into two parts based on the course model as Γ^{Pt} = φ^{Pt} ∪ (φ^{Pt})^c.

Proof of Theorem 1: For the node division φ^{Pt}_h and (φ^{Pt}_h)^c, we divide the regret into three parts:

    E[R(T)] = E[R1(T)] + E[R2(T)] + E[R3(T)],

where Γ^{Pt}_1 contains the descendants of φ^{Pt}_H, Γ^{Pt}_2 contains the nodes φ^{Pt}_h with depth from 1 to H, and Γ^{Pt}_3 contains the nodes in (φ^{Pt}_h)^c. Note that Γ^{Pt}_3 consists of children of nodes in φ^{Pt}_H, since the suboptimal nodes won't be played deeper, and we use a node to represent itself and its descendants. The depth H is a constant to be determined later. As for the MOOC, the three terms represent three processes of recommending a course: E[R1(T)] comes from the optimal course nodes during the recommendation, E[R2(T)] represents the regret in the suboptimal course nodes at first, and E[R3(T)] comes from the follow-up procedures. Thus E[R3(T)] is much larger than E[R2(T)], which means E[R2(T)] hardly matters for the whole regret. Note that T = Σ t′; when all the time slots are in the same context sub-hypercube, the regret is smallest. We consider the situation where the time slots are distributed uniformly; in this extreme situation, all the context sub-hypercubes have the same times t′.

For E[R1(T)], the regret is generated from the optimal course clusters whose courses have been recommended. Since
The number of nodes cannot be bounded precisely, but the number of all the nodes played is t', so we suppose the nodes have been played the maximum number of times t'. Then

E[R1(T)] ≤ 4[k1(m^{Pt})^H + LX(√dX/nT)^α] T.   (12)

As for the second term, whose depth runs from 1 to H, these nodes represent the regret generated by the primary nodes; for MOOC, it is the regret in the worse course clusters selected at the beginning of the recommendation:

E[R2(T)] ≤ Σ_{Pt} Σ_{h=1}^{H} 4[k1(m^{Pt})^h + LX(√dX/nT)^α] |φ_h^{Pt}|
        ≤ (4K(nT)^{dX} / [k1(m^{Pt})^h]^{dC}) Σ_{h=0}^{H} [k1(m^{Pt})^h + LX(√dX/nT)^α].

From Lemma 4 we know that the number of suboptimal nodes at depth h is |φ_h^{Pt}| ≤ K[k1(m^{Pt})^h + LX(√dX/nT)^α]^{−dC}, and that the number of context sub-hypercubes is (nT)^{dX}; the last inequality follows. Here we take as representative the context sub-hypercube Pt whose regret deviation is largest, together with its H, so the sum of regret can be presented in that way.

When it comes to the last term, it is the regret in the suboptimal course nodes during the deeper recommendation process, and these nodes are 2[k1(m^{Pt})^h + LX(√dX/nT)^α]-suboptimal. The regret bound is

E[R3(T)] ≤ Σ_{Pt} Σ_{h=1}^{H} Σ_{N_{h,i}^{Pt} ∈ (φ_h^{Pt})^c} 4[k1(m^{Pt})^h + LX(√dX/nT)^α]
        ≤ (8K(nT)^{dX} / [k1(m^{Pt})^h]^{dC}) [4k2 ln T / (k1(m^{Pt})^h + LX(√dX/nT)^α)^2 + M].

When the two terms are equal, we can get

4[k1(m^{Pt})^H + LX(√dX/nT)^α] T = K ln T (nT)^{dX+1} / [k1(m^{Pt})^H]^{dC+1}.   (15)

Since the context deviation and the course deviation are unknown, to minimize the deviation we also assume the two deviations are equal, i.e., k1(m^{Pt})^H = LX(√dX/nT)^α. To simplify the problem we take α = 1 and k1 = 2; thus we can get nT = (T/ln T)^{1/(dX+dC+2)}, and the regret complexity is O(T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+2)}).
V. DISTRIBUTED STORED COURSE RECOMMENDATION TREE
A. Distributed Algorithm for Multiple Course Storages
In this section, we present a new algorithm called Distributed Storage Repartimiento Hierarchical Trees (DSRHT), which relieves the storage pressure by using distributed units to store the course data in clouds. The practical situation may be very complex, so we bound the number nd of distributed units with our tree model as 2^{z−1} ≤ nd ≤ 2^z, where z is the depth of the tree and 2^z is the number of nodes at that depth. In the following illustration we substitute 2^z for the number nd.
In Algorithm 4 we still find the context sub-hypercube first. Then, since there are 2^z distributed units, we first ascertain these top nodes. Based on the attained information, the algorithm can start to find the course by utilizing the Bound and Estimation.

The number of distributed units must be determined. We assume there are 2^z storage nodes, and the number satisfies

2^z ≤ (T/ln T)^{(dX+dC−1)/(dX+dC+2)},  m^{Pt} ∈ [1/2, 1).   (16)

Since the number of distributed units is limited while the time horizon is very large, this assumption is reasonable. As for m^{Pt}: since we take the binary tree model, each time a course cluster is divided into two clusters, so assuming that m is larger than 0.5 is quite reasonable.
The algorithm selects the initial nodes at depth z first, and then processes data similarly to RHT. For MOOC, the algorithm will start at the distributed nodes, such as websites or other platforms. For the tree partition, the difference is that we leave out the course nodes whose depth is less than z to cut down the storage cost. In the complexity section we will prove that the storage can be bounded linearly in the best situation.
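The choice of the number of distributed units can be sketched as follows. This is our illustrative reading of the constraint 2^{z−1} ≤ nd ≤ 2^z together with the cap in (16); dC = 8 is an assumed value, and dX = 4 is the paper's context dimension.

```python
import math

def choose_depth_z(n_d, T, d_X, d_C):
    """Depth z for n_d distributed storage units: 2**(z-1) <= n_d <= 2**z,
    capped by the constraint (16) on 2**z from the text."""
    z = math.ceil(math.log2(n_d))  # smallest z with n_d <= 2**z
    cap = (d_X + d_C - 1) / (d_X + d_C + 2) * math.log(T / math.log(T)) / math.log(2)
    return min(z, math.floor(cap))

print(choose_depth_z(n_d=1024, T=5e8, d_X=4, d_C=8))  # prints 10 for 1024 units
```

With 1024 units the cap in (16) is not binding, so z is simply log2(1024) = 10.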
We note that the nodes in Γ3^{Pt} are the child nodes of Γ2^{Pt}, since all the nodes in Γ3^{Pt} are the roots of the suboptimal subtrees; thus the number of nodes in Γ3^{Pt} is less than twice that of Γ2^{Pt}.

From the prerequisite of Lemma 3 we can get D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α ≥ k1(m^{Pt})^h + LX(√dX/nT)^α. Note that the second term is an infinitesimal of higher order than the third one, so we focus on the first and last terms: they are the decisive factors of the regret. We notice that as the depth increases, E[R1(T)] decreases but E[R3(T)] increases; letting their complexities be equal yields the minimum.

For a context sub-hypercube Pt, all the played nodes bring two kinds of regret: the context regret LX(√dX/nT)^α and the course node regret k1(m^{Pt})^H. The complexity of the first term is

O{4[k1(m^{Pt})^H + LX(√dX/nT)^α] T}.   (13)

As for the last term, the complexity is

O{(8K(nT)^{dX} / [k1(m^{Pt})^h]^{dC}) [4k2 ln T / (k1(m^{Pt})^h + LX(√dX/nT)^α)^2 + M]} = O(ln T (nT)^{dX+1} / [k1(m^{Pt})^H]^{dC+1}).   (14)

From this conclusion we can be sure that the average regret finally converges to zero, which means the algorithm can eventually find the optimal courses for the learners. Note that the tree actually exists: we store it in the cloud during the recommending process. Since datasets will be fairly large in the future, using distributed storage technology to solve the storage problem is unavoidable.
B. Regret Analysis
In this subsection we prove that the regret of DSRHT can be bounded sublinearly. Since we use the model to determine the threshold of the number of distributed units, we can
Algorithm 4 Distributed Course Recommendation Tree
Require: the constants k1 > 0, m ∈ (0, 1), the number of storage units 2^z, and the learner's context xt.
Auxiliary functions: Exploration and Bound Updating
Initialization: for all context sub-hypercubes belonging to PT:
  Γ^{Pt} = {N_{z,1}^{Pt}, N_{z,2}^{Pt}, ..., N_{z,2^z}^{Pt}};  E_{z,i}^{Pt} = ∞ for i = 1, 2, ..., 2^z
1: for t = 1, 2, ..., T do
2:   for dt = 0, 1, 2, ..., dX do
3:     Find the context interval in dimension dt
4:   end for
5:   Get the context sub-hypercube Pt
6:   xt ← center point of Pt
7:   for j = 1, 2, ..., 2^z − 1 do
8:     if N_{z,j}^{Pt} < N_{z,j+1}^{Pt} then
9:       N_{z,j}^{Pt} = N_{z,j+1}^{Pt}
10:    end if
11:  end for
12:  N_{h,i}^{Pt} ← N_{z,j}^{Pt}
13:  Ω^{Pt} ← N_{h,i}^{Pt}
14:  Exploration(Γ^{Pt})
15:  Select a course from the node N_{h,i}^{Pt} randomly and recommend it to the learner
16:  Get the reward rt
17:  for all Pt ∈ PT do
18:    Bound Updating(Ω^{Pt})
19:    Ω_temp^{Pt} ← Ω^{Pt}
20:    for Ω_temp^{Pt} ≠ ∅ do
21:      N_{h,i}^{Pt} ← one leaf of Ω^{Pt}
22:      E_{h,i}^{Pt} ← min{B_{h,i}^{Pt}, max{E_{h+1,2i−1}^{Pt}, E_{h+1,2i}^{Pt}}}
23:      Delete N_{h,i}^{Pt} from Ω_temp^{Pt}
24:    end for
25:  end for
26: end for
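A minimal Python sketch of one round of Algorithm 4 is given below. It is our illustration, not the authors' implementation: the `Node` class, the `sub_hypercube` helper, and the simplified UCB-style update that stands in for the paper's Bound and Estimation functions are all assumptions.

```python
import math

class Node:
    """A depth-z root of the course tree (hypothetical stand-in for N_{z,i})."""
    def __init__(self, depth, index):
        self.depth, self.index = depth, index
        self.pulls, self.mean = 0, 0.0
        self.E = float("inf")  # estimation value, initialised to infinity

def sub_hypercube(x, n_T):
    """Map a context vector x in [0,1]^d_X to its sub-hypercube index."""
    return tuple(min(int(v * n_T), n_T - 1) for v in x)

def dsrht_round(roots, x, n_T, reward_fn):
    """One round: locate the context cell (lines 2-5), pick the depth-z root
    with the largest estimation E (lines 7-12), recommend and observe a
    reward (lines 15-16), then refresh that node's statistics (lines 18-24)."""
    cell = sub_hypercube(x, n_T)
    best = max(roots[cell], key=lambda nd: nd.E)
    r = reward_fn(best)
    best.pulls += 1
    best.mean += (r - best.mean) / best.pulls
    # simplified stand-in for Bound Updating: a UCB-style B-value as E
    best.E = best.mean + math.sqrt(2 * math.log(1 + best.pulls) / best.pulls)
    return cell, best, r

# one context cell with four depth-z roots, rewards favouring node 2
roots = {(0,): [Node(5, i) for i in range(4)]}
for _ in range(200):
    dsrht_round(roots, (0.3,), n_T=1,
                reward_fn=lambda nd: 1.0 if nd.index == 2 else 0.2)
print(max(roots[(0,)], key=lambda nd: nd.pulls).index)  # prints 2
```

The toy run shows the intended behaviour: play concentrates on the top node whose subtree carries the highest reward, mirroring lines 7-12 of the pseudocode.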
VI. STORAGE COMPLEXITY
The storage problem has existed in big data for a long time, so how to use distributed storage technology to handle it matters a lot. In this section we analyze the two algorithms' space complexity mathematically. We use S(T) to denote the storage space complexity. For the RHT algorithm, the storage grows from the root node, and it is fairly simple to see that the space complexity is linear: O(E[S(T)]) = O(T).
Theorem 3. In the optimal condition, take the number of storage units to satisfy 2^z = (T/ln T)^{(dX+dC−1)/(dX+dC+2)}; then the storage space complexity is

E[S(T)] = O(T − (T/ln T)^{(dX+dC−1)/(dX+dC+2)}).

Proof: Every round t has to explore a new leaf node. To get the optimal result, we suppose the depth is as deep as possible; under that condition we can choose

z = ((dX+dC−1)/(dX+dC+2)) · ln(T/ln T) / ln 2.

Under the condition t < 2^{z+1}, we have S1(T) ≤ 2^z = (T/ln T)^{(dX+dC−1)/(dX+dC+2)}. When t ≥ 2^{z+1}, after each round one unplayed node is selected, so the second part is S2(T) ≤ T − 2^{z+1} = T − 2(T/ln T)^{(dX+dC−1)/(dX+dC+2)}. Thus we get the storage complexity

E[S(T)] = O(T − (T/ln T)^{(dX+dC−1)/(dX+dC+2)}).   (18)
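The bound in Theorem 3 is easy to evaluate numerically. The sketch below (our illustration, with an assumed dC = 8) just computes T minus the (T/ln T)^{(dX+dC−1)/(dX+dC+2)} saving:

```python
import math

def dsrht_storage_bound(T, d_X, d_C):
    """Storage bound of Theorem 3: stored nodes <= T - (T/ln T)**e,
    where e = (d_X + d_C - 1) / (d_X + d_C + 2)."""
    e = (d_X + d_C - 1) / (d_X + d_C + 2)
    saved = (T / math.log(T)) ** e  # depth-z nodes that need no extra storage
    return T - saved

T = 5e8
bound = dsrht_storage_bound(T, d_X=4, d_C=8)
print(bound < T)  # True: the distributed variant stores strictly fewer than T nodes
```

The saving term grows with T, so the gap to the plain O(T) storage of RHT widens as the course catalogue grows.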
Since the value of z is adjustable, an appropriate value can make the space complexity sublinear. From (18), if the data dimension is fairly large, the space complexity will be relatively small. However, a large dataset and a tremendous number of distributed units will make the algorithm learn too slowly and make the regret very large, so taking an appropriate parameter is crucial. There are many MOOC platforms, and gathering all the data together is fairly difficult and costly; thus DSRHT handles the practical situation better and improves the storage condition at the same time.
Besides, we compare our algorithms with some similar works that are all based on tree partition. In Table 1 we list the theoretical analysis results. Since ACR [29] has an ergodic process, its time complexity and regret bound are relatively high compared with the others. As for HCT [31], having no context information gives the algorithm a very low space complexity and regret bound compared with our algorithms.
continue to use the lemmas of the RHT algorithm. To be more specific, we transform the distributed problem into an RHT starting from depth z rather than from the root node.

Now we redivide the nodes, in contrast to the RHT algorithm, as Γ^{Pt} = Γ1^{Pt} + Γ2^{Pt} + Γ3^{Pt} + Γ4^{Pt}, where the first three partitions are the same as before and the last one contains the nodes at depth z that will not be played under Algorithm 1, with respect to the node partition φ_h^{Pt} and (φ_h^{Pt})^c. The depth H (z ≤ H) is a constant to be selected later.
VII. NUMERICAL RESULTS
In this section we present: (1) the source of the dataset; (2) that the sum of regrets is sublinear and the average regret finally converges to 0; (3) a comparison of the regret bounds of our algorithms with other similar works; (4) that the distributed storage method can reduce the space complexity. Fig. 5 illustrates the MOOC operation pattern in edX [27]: the right side is the teaching window and learning resources, and the left includes lesson contents, the homepage, forums, and other function options.
Theorem 2. The regret of the distributed stored algorithm is

E[R(T)] = O(T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+1)}).   (17)

Proof: See Appendix D.

Compared to the RHT algorithm, we notice that the regret bound slightly increases, and the initial performance is not as good as RHT's. However, the algorithm handles the practical problem better; considering the explosive growth of data in the future, this algorithm will perform better than RHT.
A. Description of the Dataset
We take the dataset, which contains feedback information and course details, from edX [27] and the intermediary
TABLE I: Theoretical Comparison

Algorithm  | Context | Infinity | Time Complexity | Space Complexity                  | Regret
ACR [29]   | Yes     | No       | O(T^2 + K^E T)  | O(Σ_{l=0}^{E} K^l + T)            | O(T^{(dI+dC+1)/(dI+dC+2)} ln T)
HCT [31]   | No      | Yes      | O(T ln T)       | O(T^{d/(d+2)} (ln T)^{2/(d+2)})   | O(T^{(d+1)/(d+2)} (ln T)^{1/(d+2)})
RHT        | Yes     | Yes      | O(T ln T)       | O(T)                              | O(T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+2)})
DSRHT      | Yes     | Yes      | O(T ln T)       | O(T − 2^z)                        | O(T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+1)})
website of MOOC [6]. In those platforms, the context dimensions contain nationality, gender, age, and the highest education level, so we take dX = 4. Besides, there is a questionnaire from Harvard University in edX to enrich the context information, enlarging the dimension to around thirty (dX = 30). As for the course dimensions, they comprise starting time, language, professional level, providing school, course and program proportion, whether the course is self-paced, subordinate subject, etc. For the feedback system, we can acquire reward information from review boards and forums; the reward is produced from two aspects, the marking system and comments from forums.
For the users: when a novel field comes into vogue, tremendous numbers of people gain access to it within seconds. The data we get include 2 × 10^5 learners using MOOC on those platforms, and the average number of courses a learner comments on is around 30. Our algorithm focuses on the group of learners in the same context sub-hypercube rather than on individuals; so when users come next time with context information and historical records, we simply treat them as new training data without distinguishing them. However, while the number of people is limited and generating a course is time-consuming, the number of courses is unlimited, and education runs through the development of human beings. Our algorithm pays more attention to the highly inflated MOOC curriculum resources of the future, and the existing data bank is not large enough to demonstrate the superiority of our algorithm, since MOOC is a new field in education. We find 11352 courses on those platforms, including plenty of finished courses. The number of courses doubles every year, and based on this trend the quantity will grow more than forty thousand times within 20 years. To account for both the accuracy and the scale of the data sources, we replicate the original data forty-five thousand times to satisfy the number requirement; thus we extend the 11352 courses acquired from edX to around 5 × 10^8, simulating the explosive course data size expected in 2030.
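The scale-up arithmetic can be checked directly; the replication factor of forty-five thousand is the one stated above:

```python
# Scaling the edX course set to a simulated 2030-sized catalogue.
courses_2016 = 11352
replication = 45_000  # forty-five thousand copies, as in the text
simulated = courses_2016 * replication
print(simulated)  # 510840000, i.e. about 5 x 10^8 courses
```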
Fig. 5: MOOC Learning Model (panels: lessons content; homework, video files and so on; forums for getting rewards; video window)
• Adaptive Clustering Recommendation Algorithm (ACR) [29]: the algorithm injects contextual factors and can adapt to more learners; however, when the course database is fairly large, the ergodic process in this model cannot handle the dataset well.
• High Confidence Tree algorithm (HCT) [31]: the algorithm supports an unlimited dataset, however large it is, but the recommendation model has only a single learner, since it does not take context into consideration.
• We consider both infinity and context, so our model better suits the future MOOC situation. In DSRHT we sacrifice some immediate interest to get better long-term performance.
To verify the conclusions practically, we divide the experiment into the following three steps:
1) Step 1: In this step we compare our RHT algorithm with the two previous works, ACR [29] and HCT [31], under different sizes of training data. We input over 6 × 10^6 training data points, including context information and feedback records in the reward space described in the dataset section, into the three models; the models then start to recommend the courses stored in the cloud. Since HCT does not support context, we normalize all the context information to the same value (the center point of the unit context hypercube). Since the reward distribution is stochastic, we run the simulation 10 times and take average values, so that the interference of random factors is restrained. The two regret tendency diagrams are then plotted to evaluate the algorithms' performance.
2) Step 2: We use the DSRHT algorithm to simulate the results. The RHT algorithm can be seen as a degraded DSRHT with z = 0, and we compare DSRHT under different parameters z. Without loss of generality, we take
B. Experimental Setup
As for our algorithm, the final number of training data points is over 6 × 10^6 and the number of courses is about 5 × 10^8, which means that relative to the training number we can regard the course number as infinite. Three works are introduced as follows.
TABLE II: Average Accuracies of RHT

Algorithm \ Number ×10^6 |   1    |   2    |   3    |   4    |   5    |   6
ACR [29]                 | 65.43% | 78.62% | 83.23% | 86.28% | 88.19% | 88.79%
HCT [31]                 | 81.02% | 82.13% | 82.76% | 83.01% | 83.22% | 83.98%
RHT                      | 85.34% | 87.62% | 89.92% | 90.45% | 91.09% | 91.87%

Fig. 6: Contrast of Regret (RHT) (regret ×10^5 against arrival data ×10^6; curves: ACR, HCT, RHT)
Fig. 8: Contrast of Regret with Different Parameter z (regret ×10^5 against arrival data ×10^6; curves: z = 0, z = 10, z = 20)
regret becomes lower than HCT's. From Fig. 7 (the average regret diagram), HCT's average regret is less than ACR's at first, and the result also shows that ACR finally performs slightly better than HCT.
Table 2 records the average accuracies, i.e., the total rewards divided by the number of training data points; "Number" is the number of training data points. We find that as time increases, all three algorithms improve. Our algorithm has the highest accuracy throughout the learning period. ACR does not perform well when the process starts (65.43%, worse than HCT's 81.02%), but it finally converges to 88.79%, while HCT is still at 83.98%. As for our algorithm, it reaches 91.87%, which is much better than HCT.
Fig. 8 and Fig. 9 analyze the DSRHT algorithm using the parameters z = 0, 10, and 20. Comparing z = 0 and z = 10 in the diagrams validates what we said before: the storage improvement focuses more on the
Fig. 7: Contrast of Average Regret (RHT) (average regret against arrival data ×10^6; curves: ACR, HCT, RHT)
z = 0, z = 10, and z = ((dX+dC−1)/(dX+dC+2)) · ln(T/ln T)/ln 2 ≈ 20. Then we plot the regret against z to analyze the optimal constant parameter.
3) Step 3: We record the storage data to analyze the space complexity of the four algorithms. First we upload 517.68 TB of course indexing information to our school servers and run the four algorithms successively. During training we record the regret six times, and at the end of training we record the space usage of the tree, which represents the training cost. As for DSRHT, we use virtual partitions on the school servers to simulate the distributed stored course data: specifically, we re-upload the course data to the school servers in 1024 virtual partitions and then run the DSRHT algorithm.
C. Results and Analysis
We analyze our algorithm from two different angles: comparison with the other two works, and comparison with itself under different parameters z. In each direction we first compare the regret, then analyze the average regret, and then discuss the accuracies based on the average regret. At last we compare the storage costs of the different algorithms.
In Fig. 6 and Fig. 7 we compare the RHT algorithm with ACR and HCT. From Fig. 6 (the regret diagram) we see that our method is better than the other two, with less regret from the beginning. The HCT algorithm performs better than ACR at the start; as time goes on, ACR's
Fig. 9: Contrast of Average Regret with Different Parameter z
TABLE III: Average Accuracies of DSRHT

DS factor z \ Number ×10^6 |   1    |   2    |   3    |   4    |   5    |   6
z = 0                      | 85.34% | 87.62% | 89.92% | 90.45% | 91.09% | 91.87%
z = 10                     | 82.67% | 86.98% | 90.49% | 91.50% | 92.03% | 92.89%
z = 20                     | 51.10% | 79.94% | 86.37% | 89.79% | 91.33% | 92.04%
TABLE IV: Average Storage Cost

Algorithm      | Storage Cost (TB) | Storage Ratio
ACR [29]       | 12573             | 24.287
HCT [31]       | 2762              | 5.335
RHT            | 4123              | 7.964
DSRHT (z = 10) | 2132              | 4.118
the node selected at depth k + 1). According to the definition of Bound and Estimation, we know that as the depth decreases, the Estimation of a node decreases as well. Thus E_{k+1,i_{(k+1)'}}^{Pt} ≤ E_{h,i}^{Pt} ≤ B_{h,i}^{Pt}.
We divide the set {B_{h,i}^{Pt}(t') ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t')} into {B_{h,i}^{Pt}(t') > f(r^{Pt*})} ∪ {f(r^{Pt*}) ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t')}. According to the definition of Bound and Estimation again, we can get

{f(r^{Pt*}) ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t')}
⊂ {f(r^{Pt*}) ≥ B_{k+1,i_{(k+1)'}}^{Pt}(t')} ∪ {B_{k+1,i_{(k+1)'}}^{Pt}(t') ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t')}
⊂ {f(r^{Pt*}) ≥ B_{k+1,i_{(k+1)'}}^{Pt}(t')} ∪ {f(r^{Pt*}) ≥ E_{k+2,i_{(k+2)'}}^{Pt}(t')}.

Based on the conclusion above, we know

{f(r^{Pt*}) ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t')} ⊂ {f(r^{Pt*}) ≥ B_{k+1,i_{(k+1)'}}^{Pt}(t')} ∪ ⋃_{j=k+2}^{n−1} {f(r^{Pt*}) ≥ E_{j,i_{(j)'}}^{Pt}(t')}.   (19)
long-run performance. However, when z = 20, the algorithm takes a lot of time before it starts to recommend courses precisely; even though its accuracy finally catches up with the RHT algorithm at the end, the effect is not as good as we expect.

Table 3 illustrates the problem more precisely. When the training number is less than 4 × 10^6, the z = 20 setting is the worst of the three; after that, it catches up with RHT's 91.09% at 91.33%. Thus we can see that selecting the distributed storage number cannot pursue quantity only; whether it is practical matters as well.

As for the storage analysis, we use the detailed information of the courses to represent the course data, and the whole course storage is 517.68 TB. To be more intuitive, we use the ratio of the actual space occupied to the course space occupied to denote the storage ratio. From Table 4 we know that the ACR [29] algorithm is not suitable for real big data, since its storage ratio reaches 24.287. The HCT [31] algorithm performs well in space complexity, better than RHT. As for DSRHT, the storage ratio is 4.118, which is less than HCT and nearly half of RHT.

When it comes to T_{h,i}^{Pt}(t'), a practical path including the node N_{h,i}^{Pt} means that {B_{h,i}^{Pt}(t') ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t')}. So

E[T_{h,i}^{Pt}(t')] = Σ_{n=1}^{t'} P{B_{h,i}^{Pt}(t') ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t'), T_{h,i}^{Pt}(t') ≤ q} + Σ_{n=1}^{t'} P{B_{h,i}^{Pt}(t') ≥ E_{k+1,i_{(k+1)'}}^{Pt}(t'), T_{h,i}^{Pt}(t') > q}
≤ q + Σ_{n=q+1}^{t'} P{[B_{h,i}^{Pt}(n) > f(r^{Pt*}) and T_{h,i}^{Pt}(n) > q] or [B_{j,i_{h'}}^{Pt}(n) ≤ f(r^{Pt*}) for j ∈ {q+1, ..., n−1}]}.

In the inequality, we let the first term be 1 at each time, so it grows to q. In the second term, since T_{h,i}^{Pt}(n) > q, the terms with n ≤ q are zero, and with the help of inclusion (19) we can get the conclusion.
VIII. CONCLUSION
In this paper, we have presented the RHT and DSRHT algorithms for course recommendation in MOOC big data. To better fit the changing datasets of the future, we extend the number of objects to infinity. Paying attention to individualization in recommender systems, we inject context-awareness into our algorithm. Furthermore, we use distributed storage to relieve the storage pressure and make the scheme more suitable for big data. We have also tested our algorithm and compared it with two similar algorithms. As for our ongoing work, we intend to reduce the space complexity through a combination of HCT [31] and distributed storage to further improve the storage condition.

APPENDIX B PROOF OF LEMMA 2
Proof: According to Assumption 3, let c1 and c2, selected from the same context, lie in the same optimal node, and let c1 be the best course; then we can get

f(r^{Pt*}) − f(r_{c2}^{Pt}) ≤ diam(N_{h,i}^{Pt}) ≤ k1(m^{Pt})^h.   (20)

Since within the context sub-hypercube we replace the context space with a point, we need to inject the context deviation:

f(r^{Pt*}) − f(r_{c2}^{Pt}) ≤ k1(m^{Pt})^h + LX(√dX/nT)^α.   (21)

We denote the event that the track goes through the node N_{h,i}^{Pt} by N_{h,i}^{Pt} ∈ ℓ_{H,I}^{Pt}. Therefore,

P{B_{h,i}^{Pt}(t') ≤ f(r^{Pt*}) and T_{h,i}^{Pt}(t') ≥ 1}
= P{μ̂_{h,i}^{Pt}(t') + √(k2 ln T / T_{h,i}^{Pt}(t')) + k1(m^{Pt})^h + LX(√dX/nT)^α ≤ f(r^{Pt*}) and T_{h,i}^{Pt}(t') ≥ 1}
= P{μ̂_{h,i}^{Pt}(t') + k1(m^{Pt})^h + LX(√dX/nT)^α − f(r^{Pt*}) ≤ −√(k2 ln T / T_{h,i}^{Pt}(t')) and T_{h,i}^{Pt}(t') ≥ 1}.

When we multiply both sides by T_{h,i}^{Pt}(t'), we can get the inequalities below.
APPENDIX A PROOF OF LEMMA 1
Proof: We assume that the path leaves the best path at depth k; thus the estimation at depth k + 1 deviates from the actual value, which means E_{k+1,i_{(k+1)'}}^{Pt} ≤ E_{k+1,i'}^{Pt} (the first one is the best path node and the second is
= P{ Σ_{n=1}^{t'} I{N_{h,i}^{Pt} ∈ ℓ_{H,I}^{Pt}} [f(r_{cn}^{Pt}) + k1(m^{Pt})^h + LX(√dX/nT)^α − f(r^{Pt*})] ≤ −√(k2(ln T)T_{h,i}^{Pt}(t')) and T_{h,i}^{Pt}(t') ≥ 1 }
≤ P{ Σ_{n=1}^{t'} (f(r̄_c^{Pt}) − f(r_{cn}^{Pt})) I{N_{h,i}^{Pt} ∈ ℓ_{H,I}^{Pt}} ≤ −√(k2(ln T)T_{h,i}^{Pt}(t')) and T_{h,i}^{Pt}(t') ≥ 1 }.

The last inequality is based on expression (21): the second term is positive, and we drop it to get the last expression. For convenience of illustration, we pick the rounds n at which I{N_{h,i}^{Pt} ∈ ℓ_{H,I}^{Pt}} equals 1, and we use r̃_c to indicate the r_c occurring in those rounds. Thus,

P{ Σ_{n=1}^{t'} (f(r̄_c^{Pt}) − f(r_{cn}^{Pt})) I{N_{h,i}^{Pt} ∈ ℓ_{H,I}^{Pt}} ≤ −√(k2(ln T)T_{h,i}^{Pt}(t')) and T_{h,i}^{Pt}(t') ≥ 1 }
≤ Σ_{n=1}^{t'} P{ Σ_{j=1}^{n} (f(r̃_c^{Pt}) − f(r̃_{cj}^{Pt})) ≤ −√(k2(ln T)n) }.

Here we consider the situations n = 1, 2, ..., T_{h,i}^{Pt}(t') and the fact that T_{h,i}^{Pt}(t') ≤ t'; the last inequality uses the union bound and loosens the threshold:

Σ_{n=1}^{t'} P{ Σ_{j=1}^{n} (f(r̃_c^{Pt}) − f(r̃_{cj}^{Pt})) ≤ −√(k2(ln T)n) } ≤ Σ_{n=1}^{t'} exp(−2k2 ln T) ≤ (t')^{−2k2+1}.   (22)

Note that the sum of time T represents the contextual sum of time, since the number of rounds falling in a context sub-hypercube is stochastic; for convenience, we use T as the sum of time. With the help of the Hoeffding-Azuma inequality [26], we get the conclusion.

APPENDIX C PROOF OF LEMMA 3
Proof: With the help of the assumption q ≥ 4k2 ln T / [D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α]^2, we can get

√(k2 ln T / q) ≤ [D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α] / 2.   (23)

Thus,

P{B_{h,i}^{Pt}(t') > f(r^{Pt*}) and T_{h,i}^{Pt}(t') ≥ q}
= P{ μ̂_{h,i}^{Pt}(t') + √(k2 ln T / T_{h,i}^{Pt}(t')) + k1(m^{Pt})^h + LX(√dX/nT)^α > f(r_{h,i}^{Pt*}) + D_{h,i}^{Pt} and T_{h,i}^{Pt}(t') ≥ q }
≤ P{ [μ̂_{h,i}^{Pt}(t') − f(r_{h,i}^{Pt*})] > [D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α] / 2 and T_{h,i}^{Pt}(t') ≥ q }.

According to Lemma 1 and the prerequisite of Lemma 3, we select the upper bound of q as 4k2 ln T / [D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α]^2 + 1. Thus,

E[T_{h,i}^{Pt}(t')] ≤ q + Σ_{n=q+1}^{t'} P{ [B_{h,i}^{Pt}(n) > f(r^{Pt*}) and T_{h,i}^{Pt}(n) > q] or [B_{j,i_{h'}}^{Pt}(n) ≤ f(r^{Pt*}) for j ∈ {q+1, ..., n−1}] }
≤ 4k2 ln T / [D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α]^2 + 1 + Σ_{n=q+1}^{t'} [(t')^{−2k2+1} + n^{−2k2+2}].

And taking the constant k2 ≥ 1,

1 + Σ_{n=q+1}^{t'} [(t')^{−2k2+1} + n^{−2k2+2}] ≤ 4 ≤ M.   (24)

Combining this with D_{h,i}^{Pt} − k1(m^{Pt})^h − LX(√dX/nT)^α ≥ k1(m^{Pt})^h + LX(√dX/nT)^α, we get the conclusion of Lemma 3.

APPENDIX D PROOF OF THEOREM 2
Proof: Since we substitute j with j', the lemmas are still valid for the distributed stored algorithm. Based on the segmentation, the regret can be presented as

E[R(T)] = E[R1(T)] + E[R2(T)] + E[R3(T)] + E[R4(T)].

For E[R1(T)], since it is the same as in Algorithm 1, the first term satisfies

E[R1(T)] ≤ 4[k1(m^{Pt})^H + LX(√dX/nT)^α] T.   (25)

The depth runs from z to H, revealing that H > z; to satisfy this, we suppose 2^H ≥ (T/ln T)^{(dX+dC−1)/(dX+dC+2)}. Since the exploration
process starts from depth z, the depths we can select satisfy the inequality above. Thus the second term's regret bound is

E[R2(T)] ≤ Σ_{Pt} Σ_{h=z}^{H} 4[k1(m^{Pt})^h + LX(√dX/nT)^α] |φ_h^{Pt}|
        ≤ (4K(nT)^{dX} / [k1(m^{Pt})^h]^{dC}) Σ_{h=z}^{H} [k1(m^{Pt})^h + LX(√dX/nT)^α].   (26)

We choose the context sub-hypercube whose regret bound is biggest to continue inequality (26). As for the third term, the regret bound is

E[R3(T)] ≤ Σ_{Pt} Σ_{h=z}^{H} Σ_{N_{h,i}^{Pt} ∈ (φ_h^{Pt})^c} 4[k1(m^{Pt})^h + LX(√dX/nT)^α].
[5] M. Hilbert, “Big data for development: a review of promises and
challenges,” Development Policy Review, pp. 135-174, 2016.
[6] http://mooc.guokr.com/
[7] G. Paquette, A. Miara, “Managing open educational resources on the
web of data,” International Journal of Advanced Computer Science and
Applications (IJACSA), vol. 5, no. 8, 2014.
[8] G. Paquette, O. Marino, D. Rogozan, M. Lonard, “Competency-based personalization for Massive Online Learning,” Available at: http://r-libre.teluq.ca/491/1/SLE-Competency-based%20personalisation-revised%2013.08.2014.pdf, 2014.
[9] C. G. Brinton, M. Chiang, “MOOC performance prediction via clickstream data and social learning networks,” IEEE Conference on Computer
Communications (INFOCOM), pp. 2299-2307, 2015.
[10] S. Bubeck, R. Munos, G. Stoltz, et al, “X-armed bandits,” Journal of
Machine Learning Research pp. 1655-1695, 2011.
[11] G. Adomavicius, A. Tuzhilin, “Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions,”
IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6,
pp. 734-749, 2005.
[12] D. Yanhui, W. Dequan, Z. Yongxin, et al. “A group recommender
system for online course study,” International Conference on Information
Technology in Medicine and Education, pp. 318-320, 2015.
[13] M. J. Pazzani, D. Billsus, “Content-based recommendation over a customer network for ubiquitous shopping,” IEEE Transactions on Services
Computing, vol. 2, no. 2, pp. 140-151, 2009.
[14] R. Burke, “Hybrid recommender systems: Survey and experiments,”
User Modeling and User-adapted Interaction, vol. 16, no. 2, pp. 325341. 2007.
[15] K. Yoshii, M. Goto, K. Komatani, T. Ogata, H. G. Okuno, “An efficient
hybrid music recommender system using an incrementally trainable
probabilistic generative model,” IEEE Transactions on Audio, Speech,
Language Processing, vol. 16, no. 2, pp. 435-447, 2008.
[16] L. Yanhong, Z. Bo, G. Jianhou, “Make adaptive learning of the MOOC:
The CML model,” International Conference on Computer Science and
Education (ICCSE), pp. 1001-1004, 2015.
[17] A. Alzaghoul, E. Tovar, “A proposed framework for an adaptive learning
of Massive Open Online Courses (MOOCs),” International Conference
on Remote Engineering and Virtual Instrumentation (REV), pp. 127-132,
2016.
[18] C. Cherkaoui, A. Qazdar, A. Battou, A. Mezouary, A. Bakki, D.
Mamass, A. Qazdar, B. Er-Raha, “A model of adaptation in online
learning environments (LMSs and MOOCs),” International Conference
on Intelligent Systems: Theories and Applications (SITA), pp. 1-6, 2015.
[19] E. Hazan, N. Megiddo, “Online learning with prior knowledge,” Learning Theory. Berlin, Germany: Springer-Verlag, pp. 499-513, 2007.
[20] A. Slivkins, “Contextual bandits with similarity information,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 2533-2568, Jan. 2014.
[21] J. Langford, T. Zhang, “The Epoch-Greedy algorithm for contextual multi-armed bandits,” Proc. NIPS, pp. 1096-1103, 2007.
[22] W. Chu, L. Li, L. Reyzin, R. E. Schapire, “Contextual bandits with linear payoff functions,” Proc. AISTATS, pp. 208-214, 2011.
[23] T. Lu, D. Pál, M. Pál, “Contextual multi-armed bandits,” Proc. AISTATS, pp. 485-492, 2010.
[24] S. Bubeck, R. Munos, G. Stoltz, et al., “X-armed bandits,” Journal of Machine Learning Research, vol. 12, pp. 1655-1695, 2011.
[25] C. Tekin, M. van der Schaar, “Distributed online big data classification
using context information,” IEEE Annual Allerton Conference: Communication, Control, and Computing, pp. 1435-1442, 2013.
[26] W. Hoeffding, “Probability inequalities for sums of bounded random
variables,” Journal of the American Statistical Association, pp. 13-30,
1963.
[27] https://www.edx.org/
[28] J. P. Berrut, L. N. Trefethen, “Barycentric lagrange interpolation,” pp.
501-517, 2004.
[29] L. Song, C. Tekin, M. van der Schaar, “Online learning in large-scale
contextual recommender systems,” vol. pp, no. 99, 2015.
[30] A. Slivkins, “Contextual bandits with similarity information,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533-2568, 2014.
[31] M. G. Azar, A. Lazaric, E. Brunskill, “Online Stochastic Optimization
under Correlated Bandit Feedback,” pp. 1557-1565, 2014.
We notice that the nodes in Γ3^{Pt} are the child nodes of Γ2^{Pt}. To be more specific, in the binary tree the child nodes are more numerous than the parent nodes but less than twice as many; thus the number of nodes in Γ3^{Pt} is less than twice that of Γ2^{Pt}:

Σ_{Pt} Σ_{h=z}^{H} Σ_{N_{h,i}^{Pt} ∈ (φ_h^{Pt})^c} 4[k1(m^{Pt})^h + LX(√dX/nT)^α]
≤ (8K(nT)^{dX} / [k1(m^{Pt})^h]^{dC}) [4k2 ln T / (k1(m^{Pt})^h + LX(√dX/nT)^α)^2 + M].

Similar to RHT, the second term is an infinitesimal of higher order than E[R3(T)]. When it comes to the fourth term, we notice that the depth z is bounded, and the worst situation happens when z is maximal:

E[R4(T)] ≤ (2^z − 1) [4k2 ln T / (k1(m^{Pt})^h + LX(√dX/nT)^α)^2 + M].   (27)

Since we know that 2^z ≤ (T/ln T)^{(dX+dC−1)/(dX+dC+2)}, we have

E[R4(T)] = O((T/ln T)^{(dX+dC−1)/(dX+dC+2)} ln T (nT)^2) = O(T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+1)}).   (28)

From Theorem 1, we minimize the deviation by making the two deviations equal, i.e., k1(m^{Pt})^H = LX(√dX/nT)^α. For simplicity we let k1 = 2 and α = 1; thus the slicing number is nT = (T/ln T)^{1/(dX+dC+2)}. Now we need to prove that there exists an H satisfying 2^H ≥ (T/ln T)^{(dX+dC−1)/(dX+dC+2)} and k1(m^{Pt})^H = LX(√dX/nT). Considering the complexity and ignoring constants, such an H exists when m^{Pt} ∈ [1/2, 1).

Based on the information attained above, the regret complexity is O(T^{(dX+dC+1)/(dX+dC+2)} (ln T)^{1/(dX+dC+1)}).
REFERENCES
[1] L. Pappano, “The Year of the MOOC,” The New York Times, 2014.
[2] T. Lewin, “Universities Abroad Join Partnerships on the Web,” New York
Times, 2013.
[3] https://zh.coursera.org/.
[4] A. Brown, “MOOCs make their move,” The Bent, vol. 104, no. 2, pp.
13-17, 2013.