Context-Aware Online Learning for Course Recommendation of MOOC Big Data
arXiv:1610.03147v1 [cs.LG] 11 Oct 2016
Yifan Hou, Member, IEEE, Pan Zhou, Member, IEEE, Ting Wang, Yuchong Hu, Member, IEEE, Dapeng Wu, Fellow, IEEE

Abstract—The Massive Open Online Course (MOOC) has expanded significantly in recent years. With the spread of MOOCs, the opportunity to study fascinating courses for free has attracted numerous people of diverse educational backgrounds all over the world. In the big data era, a key research topic for MOOC is how to mine the needed courses from the massive course databases in the cloud for each individual (course) learner accurately and rapidly, as the number of courses is increasing quickly. In this respect, the key challenge is how to realize personalized course recommendation as well as how to reduce the computing and storage costs for the tremendous course data. In this paper, we propose a big-data-supported, context-aware, online learning-based course recommender system that can handle dynamic and nearly infinite massive datasets, and that recommends courses by using personalized context information and historical statistics. The context-awareness takes personal preferences into consideration, making the recommendation suitable for people with different backgrounds. Besides, the algorithm achieves sublinear regret, which means it gradually learns to recommend the most preferred and best-matched courses to learners. Unlike other existing algorithms, ours bounds the time complexity and space complexity linearly. In addition, our devised storage module is extended to distributed, connected clouds, which can handle massive course storage from heterogeneous sources. Our experimental results verify the superiority of our algorithms compared with existing works in the big data setting.

Index Terms—MOOC, big data, contextual bandit, course recommendation, online learning

I. INTRODUCTION

MOOC is a concept first proposed in 2008 and known to the world in 2012 [1] [2]. Not being accustomed to the traditional teaching model, or desiring a unique learning style, a growing number of people prefer learning on MOOCs. Advanced thoughts and novel ideas give great vitality to MOOCs, and over 15 million users have registered on Coursera [3], which is one of its platforms. A course recommender system helps the learner find the requisite course directly in the course ocean of numerous MOOC platforms such as Coursera, edX, Udacity and so on [4]. However, due to the rapid growth of users, the amount of needed courses has been expanding continuously. According to a survey on MOOC completion rates, only 4% of people finish their chosen courses. Therefore, finding preferable course resources and locating them in a massive data bank, e.g., cloud computing and storage platforms, is a daunting "needle-in-a-haystack" problem. One key challenge in future MOOC course recommendation is processing tremendous data that bears the big data features of volume, variety, velocity, variability and veracity [5].

Yifan Hou and Pan Zhou are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China. Ting Wang is with Computer Science and Engineering, Lehigh University, PA 18015, USA. Yuchong Hu is with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China. Dapeng Oliver Wu is with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA. Contact email: [email protected] This work was supported by the National Science Foundation of China under Grant 61401169, Grant CNS-1116970, and Grant NSFC 61529101.
Precisely, the recommender system for MOOC big data needs to effectively handle course data that are dynamically changing, come from heterogeneous sources, have a priori unknown scale and are nearly infinite. Moreover, since the Internet and cloud computing services are moving toward supporting different users around the world, with cultural differences, geographic disparities and varying education levels, the recommender system needs to consider the features of students (learners); i.e., each learner has his/her unique preferences in evaluating a course in a MOOC. For example, one learner pays more attention to the quality of exercises while another focuses more on the classroom rhythm. We use the concept of context to represent these features as the learners' personalized information. The context space is encoded as a multidimensional space (dX dimensions), where dX is the number of features. As such, the recommendation becomes student-specific, which improves the recommendation accuracy. Hence, appending context information to the model handling the courses is inevitable [7] [8]. Previous context-aware algorithms such as [29] only perform well when the scale of the recommendation dataset is known. Specifically, the algorithm in [29] ranks all courses in the MOOC as leaf nodes, then clusters relevant courses together as course nodes to build their parent nodes based on the historical information and current users' features. The algorithm keeps clustering the course nodes and building their parent nodes until it reaches the root node (a bottom-up design). If a new course arrives, all the nodes and clusters change and must be computed again. For MOOC big data, since the number of courses keeps increasing and becomes fairly tremendous, the algorithm in [29] is prohibitive to apply. Our main theme in this paper is recommending courses from tremendous datasets to students in real time based on their preferences.
The course data are stored in course cloud(s), and new courses can be loaded at any time. We devise a top-down binary tree to denote and record the process of partitioning the course dataset, where every node in the tree is a set of courses. The course scores that students feed back through the marking system are denoted as rewards. Specifically, at first there is only one root course node in the binary tree. Every time a course is recommended, a reward is fed back from the student, which is used to improve the accuracy of the next recommendation. The reward is an unknown stochastic function of the context and course features at each recommendation, and our algorithm is concerned with the expected reward of every node. The course binary tree then divides the current node into two child nodes and selects one course randomly from the node with the current best expected value. It omits most of the courses in nodes that would not be selected, which greatly improves the learning performance. It also supports adding incoming new courses to the existing nodes as unselected items without changing the currently built pattern of the tree. However, other challenges that influence the recommendation accuracy still remain. In practice, we observe that the number of courses keeps increasing and that the in-memory storage cost of one online course is about 1 GB on average, which is fairly large. Therefore, how to store the tremendous course data and how to process them effectively become a challenge. Most previous works [29] [30] only realize linear space complexity; however, this is not promising for MOOC big data. We propose a distributed storage scheme to store the course data in many small, separate storage units. For example, the storage units may be divided based on the MOOC platforms or the languages of the courses. On the one hand, this method makes the invoking process effective with little extra cost on course recommendation.
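The top-down course tree and its reward feedback loop described above can be sketched in a few lines. This is a minimal illustration under our own assumptions (class and method names are hypothetical, not from the paper): a node holds a set of courses, splits into two disjoint children, recommends a random course and tracks its running mean reward.

```python
import random

class CourseNode:
    """A node in the top-down binary course tree: a set of courses
    plus running reward statistics (illustrative names, not the paper's)."""
    def __init__(self, courses, depth=0):
        self.courses = list(courses)
        self.depth = depth
        self.pulls = 0
        self.mean_reward = 0.0
        self.children = None

    def recommend(self):
        # Select one course at random from this node's cluster.
        return random.choice(self.courses)

    def update(self, reward):
        # Running average of the rewards fed back by learners.
        self.pulls += 1
        self.mean_reward += (reward - self.mean_reward) / self.pulls

    def split(self):
        # Divide the course set into two disjoint child nodes,
        # mirroring N_{h,i} = N_{h+1,2i-1} U N_{h+1,2i}.
        mid = len(self.courses) // 2
        self.children = (CourseNode(self.courses[:mid], self.depth + 1),
                         CourseNode(self.courses[mid:], self.depth + 1))
        return self.children

    def add_course(self, course):
        # New courses join an existing node without rebuilding the tree.
        self.courses.append(course)

root = CourseNode(["c%d" % i for i in range(8)])
left, right = root.split()
course = left.recommend()
left.update(0.7)
```

The key property is that a new course only appends to a node's list, so the already-built tree pattern is untouched.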
On the other hand, we prove that the space complexity can be bounded sublinearly under the optimal condition, which is better than [29]. In summary, we propose an effective context-aware online learning algorithm for course big data recommendation to offer courses to learners in MOOCs. The main contributions are listed as follows:
• An online learning algorithm whose course tree can index nearly infinite, dynamically changing course datasets, realizing real big data support.
• Context-awareness with an effective context partition scheme, which greatly improves the learning rate and the recommendation accuracy for students with different features.
• A distributed storage model that stores data in distributed units rather than a single storage carrier, performing well with huge amounts of course data.
• Superior complexity: the time complexity is bounded linearly, and the space complexity is linear for the primary algorithm and sublinear for the distributed storage algorithm under the optimal condition.

The rest of the paper is organized as follows: Section II reviews related works and compares them with our algorithms. Section III formulates the recommendation problem and the algorithm models. Sections IV and V illustrate our algorithms and bound their regret. Section VI analyzes the space complexity of our algorithms and compares the theoretical results with existing works. In Section VII, we verify the algorithms by experiment results and compare them with relevant previous algorithms [29] [31]. Section VIII concludes the paper.

II. RELATED WORKS

A plethora of previous works exists on recommendation algorithms. For MOOCs, the two major tactics are filtering-based approaches and online learning methods [11]. Among filtering-based approaches there are several branches, such as collaborative filtering [12] [9], content-based filtering [13] and hybrid approaches [14] [15]. The collaborative filtering approach gathers students' learning records together, classifies them into groups based on the provided characteristics, and recommends a course from the group's learning records to new students [12] [9]. Content-based filtering recommends to the learner a course that is relevant to his/her previous learning records [13]. The hybrid approach is the combination of the two methods. Filtering-based approaches can perform better at the beginning than online learning algorithms. However, when the data become very large-scale or stochastic, filtering-based approaches lose accuracy and become incapable of utilizing the history records adequately. Meanwhile, the lack of context makes these methods unable to recommend courses precisely by taking every learner's preferences into account. Online learning can overcome the deficiencies of filtering-based approaches. Most of the previous works on course recommendation utilize adaptive learning [16] [17] [18]. In [16], the CML model was presented.
This model combines the cloud, a personalized course map and an adaptive MOOC learning system together, which is quite comprehensive for course recommendation with context-awareness. Nevertheless, for big data the model is not efficient enough, since it cannot handle dynamic datasets and may incur a high time cost with nearly "infinite" datasets. Similar works are widely distributed in [19]–[24] as contextual bandit problems. In these works, the system knows the rewards of the selected items and records them each time, which means the course feedback can be gathered from learners after they receive the recommended course. No previous work realizes contextual bandits with infinitely increasing datasets. Our work is motivated by [24] for big data support; however, we consider context-aware online learning for the first time, with delicately devised context partition schemes for MOOC big data. The algorithm can accommodate highly dynamic, growing course database environments, realizing real big data support through a course tree that can index nearly infinite and dynamically changing datasets. We consider context-awareness for better personalized course recommendations, and devise an effective context partition scheme that greatly improves the learning rate and accuracy for students with different features. Our proposed distributed storage model stores data in distributed units rather than a single storage carrier, allowing the system to utilize the course data better and to perform well with huge amounts of data. Our algorithms enjoy superior time and space complexity. The time complexity is bounded linearly, which means the algorithms have a higher learning rate than previous methods in MOOC. For the space complexity, we prove that it is linear for the primary algorithm and sublinear for the distributed storage algorithm under the optimal condition.

III. PROBLEM FORMULATION

In this section, we present the system model, the context model, the course model and the regret definition. Besides, we define some relevant notations and preliminary definitions.

Fig. 1 illustrates the model of operation. Senior users such as professors or network platforms upload the course resources of the MOOC, including pre-recorded videos for open classes, live videos for online courses, class tests, etc., to the course cloud. Then the system collects learners' essential feature information, for instance, the reputation of professors, the course language, the educational level concerning courses with appropriate difficulty, the quality of presentations, etc. When the system obtains this information, it recommends courses that the user has not learned before to the learner as alternative choices. Then learners give satisfaction payoffs for the courses to the system based on their preferences; for example, some learners prefer computer science and others prefer classical music.

Fig. 1: MOOC Course Recommendation and Feedback System

A. System Model

We use time slots t = 1, 2, ..., T to denote rounds. In each time slot, there are three running states: (1) a learner with exclusive context information comes into our model; (2) the model recommends a course from the current course cluster to the learner; (3) the learner provides a payoff for the newly recommended course to the system. Afterwards the model learns how to perform better in the next recommendation procedure on account of the rewards obtained. We give the concepts of three essential elements:

Learners: We denote the set of incoming learners as S = (1, 2, ...) during the run of the algorithm. When a learner comes, the system suggests a course based on the previous records of payoffs and the context information.

Contexts: Context is illustrated as the set X, with the space being continuous. Besides, we assume the context space is a dX-dimensional space, which means a context x ∈ X is a vector of dimension dX. For example, the context vector includes educational levels from zero basis to PhD in related fields, geographical positions and language backgrounds. With the normalization of the context information, we suppose the context space is X = [0, 1]^{dX}, which is a unit hypercube. DX(x, x') denotes the distance between contexts. We use the Lipschitz condition to define the distance.

Courses: We model the set of courses as a dC-dimensional vector space, where each vector contains dC course features. The number of courses is unknown, thus we cannot determine dC before obtaining the various datasets. Since the number of courses grows exponentially, we let the number go to infinity to adapt to the development trend of MOOC big data. Specifically, we consider the course model as sets with unknown distributions. Similar to the context, we define the course space as C and the correlation distance of courses as DC(c, c') to indicate the relativity between two courses.

Definition 1. DC over C is a non-negative mapping (C^2 → R): DC(c, c') ≥ 0, and DC(c, c') = 0 when c = c'.

Fig. 2: Context and Course Space Relation Schema
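The two distance notions above can be illustrated concretely. The following is a minimal sketch under our own assumptions (function names and the binary course-feature encoding are hypothetical): the context distance is bounded via the Lipschitz form L_X ||x − x'||^α, and the course distance is any non-negative mapping that vanishes on identical courses.

```python
import math

# Illustrative constants; in the paper L_X and alpha need not be known.
L_X, ALPHA = 1.0, 1.0

def context_distance(x, xp):
    # Lipschitz-style bound: L_X * ||x - x'||^alpha (Euclidean norm).
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, xp)))
    return L_X * euclid ** ALPHA

def course_distance(c, cp):
    # A non-negative mapping with D_C(c, c) = 0, e.g. the fraction of
    # disagreeing entries over dC binary course features.
    return sum(a != b for a, b in zip(c, cp)) / len(c)

x, xp = (0.2, 0.5), (0.2, 0.9)
d_ctx = context_distance(x, xp)              # 0.4 with L_X = alpha = 1
d_course = course_distance((1, 0, 1), (1, 1, 1))
```

Two courses sharing more features (e.g. the same language) get a smaller D_C, matching the relevance interpretation used below.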
We assume that two courses which are more relevant have a smaller correlation distance DC between them. For example, as far as the language feature is concerned, two courses both taught in English have a closer distance than two courses in different languages. Fig. 2 illustrates the relationship between context information and course information over the reward. To better illustrate the relations, we degenerate their dimensions to dX = dC = 1. Thus we take the context information and the course details as two horizontal axes in a rectangular coordinate system. From the schematic diagram, the reward varies as the context and the course change. To be more specific, for a determined learner whose context is constant, the reward differs across courses, shown in the blue plane coordinate system. On the other hand, for a determined course, shown in the crystal plane coordinate system, different people have different attitudes. Practically, we still take the reward dimension as 1, while expanding the dimension of context to dX and the dimension of course to dC.

B. Context Model for Individualization

Since the context space is a dX-dimensional vector space, we normalize every dimension of the context to range from 0 to 1. The realistic meaning of the dimensions is dX unrelated features such as age, cultural background, nationality, etc., representing the preferences of the learner. We define the slicing number of the context unit hypercube as nT, indicating the number of grades in each dimension. For a better formulation, PT = {P1, P2, ..., P_{(nT)^{dX}}} denotes the sliced sub-hypercubes. As illustrated in Fig. 3, we let dX = 3 and nT = 2. We divide every axis into 2 parts, and the number of sub-hypercubes is (nT)^{dX} = 8. For simplicity, we use the center point of a sub-hypercube to represent the specific contexts in that cube. With this model of context, we divide the different users into (nT)^{dX} types.

Fig. 3: Context and Course Partition Model

Assumption 1. There exists a constant LX > 0 such that for all contexts x, x' ∈ X, we have DX(x, x') ≤ LX ||x − x'||^α, where ||·|| denotes the Euclidean norm in R^{dX}.

Note that the Lipschitz constant LX is not required to be known by our recommendation algorithm. It will only be used in quantifying the algorithm's performance. Assumption 1 indicates the initial maximal context deviation, whose threshold is defined by LX. We present the context distance mathematically with LX and α.

Assumption 3 below uses the suboptimal gap and the distance to denote the mean reward gap between two courses c1 and c2. With that assumption, we only need to consider the suboptimal node's gap and distance rather than the gap between two normal nodes directly.

C. Course Set Model for Recommendation

For the course model, we use a binary tree to index the whole course dataset. The courses are gathered together and then divided into two parts by the algorithm based on the reward information, which is the exploration process of the tree model.

D. The Regret of Learning

Simply put, the regret indicates the loss of accuracy in the recommendation procedure due to the unknown dynamics. We assume the reward rt ∈ [0, 1] is the indication of accuracy. According to the reward, we define the regret as

R(T) = \sum_{t=1}^{T} r^{P_t *} − E[ \sum_{t=1}^{T} r_t ],   (2)

where rt represents the reward at time t and the superscript * means the optimal solution, i.e., r^* is the best reward. The regret shows the convergence rate toward the optimal recommended option. When the regret is sublinear, R(T) = O(T^γ) with 0 < γ < 1, the algorithm will finally converge to the optimal solution. In the following sections we propose our algorithms with sublinear regret.
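The regret definition can be made concrete with a toy computation. This is our own construction, not an experiment from the paper: the regret is the cumulative gap between always playing the optimal course r^{P_t *} and the rewards the learner actually reports.

```python
# Toy illustration of R(T) = sum_t r^{P_t *} - sum_t r_t: as the
# algorithm learns, the per-round gap shrinks toward zero.
optimal_rewards = [0.9, 0.9, 0.9, 0.9]    # r^{P_t *} in each round
observed_rewards = [0.2, 0.6, 0.8, 0.9]   # r_t as recommendations improve

def regret(optimal, observed):
    return sum(o - r for o, r in zip(optimal, observed))

R_T = regret(optimal_rewards, observed_rewards)  # 0.7 + 0.3 + 0.1 + 0.0 = 1.1
```

Sublinear regret means exactly this: the total gap R(T) grows slower than T, so the average gap R(T)/T vanishes.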
The number of nodes at depth h is 2^h, thus we let N_{h,i} (1 ≤ i ≤ 2^h) represent the i-th node at depth h. The root node of the binary course tree is the set of all courses, N_{0,1} = C. With the exploration of the tree, the two child nodes together contain all the courses of their parent node, and they never intersect with each other: N_{h,i} = N_{h+1,2i−1} ∪ N_{h+1,2i}, and N_{h,i} ∩ N_{h,j} = ∅ for any i ≠ j. Thus C can be covered by the regions of N_{h,i} at any depth: C = ∪_{i=1}^{2^h} N_{h,i}.

To better describe the nodes, we define diam(N_{h,i}) to indicate the size of course nodes:

diam(N_{h,i}) = sup_{c,c' ∈ N_{h,i}} DC(c, c').

The distance between courses can be represented as the gap between course languages, course time lengths, course types and anything else that indicates a discrepancy. We denote the size of a node by the largest distance within it. Note that the diameter is based on the distance, and it can be adjusted by selecting different mappings. For the convenience of the analysis, we make some reasonable assumptions as follows.

Assumption 2. For any node N_{h,i}, there exist a constant k1 > 0 and m ∈ (0, 1) such that diam(N_{h,i}) ≤ k1 m^h.

With Assumption 2 we can limit the size of nodes by k1 m^h. We utilize exponential convergence to keep the threshold of the node diameter. Since the model tree is a binary tree, the number of nodes increases exponentially as the depth rises. The data we get are stochastic (i.i.d.), so we use the mean reward to handle the model. For each context sub-hypercube, there is only one overall optimal course in a node of each depth, and each node has a local optimal course. We define the course gap with the help of the mean reward f(r).

Assumption 3. For all courses c1, c2 ∈ C at the same context point, they satisfy

f(r^{P_t}_{c1}) − f(r^{P_t}_{c2}) ≤ max{ f(r^{P_t *}) − f(r^{P_t}_{c1}), DC(c1, c2) }.   (1)

IV. REPARTIMIENTO HIERARCHICAL TREE

In this section we propose our main online learning algorithm to mine a course in MOOC big data.

A. Algorithm

The algorithm is called Repartimiento Hierarchical Trees (RHT) and its pseudocode is given in Algorithm 1. In this algorithm we first find the arrived learner's context sub-hypercube P_t in the context space and replace the original context with the center point x_t of that sub-hypercube. We use the set Γ to denote all the nodes whose courses have been recommended and the set Ω to show the exploration path. Besides, we introduce some new concepts: the Bound B^{P_t}_{h,i}, which is an upper bound of the reward, and the Estimation E^{P_t}_{h,i}, which is the estimated value of the reward at depth h of the tree for context sub-hypercube P_t. At each time, the algorithm finds the course node whose E^{P_t}_{h,i} is highest in the set Γ, walks to that node along the route Ω, selects one course from that node and recommends it, obtaining the reward from the learner. The procedure is shown in Fig. 4. When the reward is fed back, the algorithm refreshes E^{P_t}_{h,i} of the nodes of the current tree based on B^{P_t}_{h,i} and the reward r_t. Since exploring is a top-down process, after we refresh the upper bound of the reward in the course nodes, we update the estimation values from bottom to top. At the end of Algorithm 1, the tree is refreshed with the E^{P_t}_{h,i} values.

Algorithm 2 shows the exploration process in RHT. When we turn to explore new course nodes, the model prefers to select the node with the higher estimation value. In other words, the essence of selecting a new node is reducing the size of the course cluster to make it easier to find the optimal courses for learners. After a new node is chosen, it is added to the sets Γ^{P_t} and Ω^{P_t} for the next calculation.

Algorithm 1 Repartimiento Hierarchical Trees (RHT)
Require: The constants k1 > 0, m ∈ (0, 1), the learner's context x_t.
Auxiliary functions: Exploration and Bound Updating
Initialization: For all context sub-hypercubes P_t belonging to P_T: Γ^{P_t} = {N_{0,1}}, E^{P_t}_{1,i} = ∞ for i = 1, 2.
1: for t = 1, 2, ..., T do
2:   for d_t = 0, 1, 2, ..., d_X do
3:     Find the context interval in dimension d_t
4:   end for
5:   Get the context sub-hypercube P_t
6:   x_t ← center point of P_t
7:   N^{P_t}_{h,i} ← N^{P_t}_{0,1}
8:   Ω^{P_t} ← N^{P_t}_{h,i}
9:   Exploration(Γ^{P_t})
10:  Select a course from the node N^{P_t}_{h,i} randomly and recommend it to the learner
11:  Get the reward r_t
12:  for all P_t ∈ P_T do
13:    Bound Updating(Ω^{P_t})
14:    Ω^{P_t}_temp ← Ω^{P_t}
15:    while Ω^{P_t}_temp ≠ N^{P_t}_{0,1} do
16:      N^{P_t}_{h,i} ← one leaf of Ω^{P_t}
17:      E^{P_t}_{h,i} ← min{ B^{P_t}_{h,i}, max{E^{P_t}_{h+1,2i−1}, E^{P_t}_{h+1,2i}} }
18:      Delete N^{P_t}_{h,i} from Ω^{P_t}_temp
19:    end while
20:  end for
21: end for

Algorithm 2 Exploration (a sub-procedure used in RHT)
1: while N^{P_t}_{h,i} ∈ Γ^{P_t} do
2:   if E^{P_t}_{h+1,2i−1} > E^{P_t}_{h+1,2i} then
3:     Temp ← 1
4:   else if E^{P_t}_{h+1,2i−1} < E^{P_t}_{h+1,2i} then
5:     Temp ← 0
6:   else
7:     Temp ∼ B(0.5)
8:   end if
9:   N^{P_t}_{h,i} ← N^{P_t}_{h+1,2i−Temp}
10:  Ω^{P_t} ← Ω^{P_t} ∪ N^{P_t}_{h,i}
11: end while
12: Γ^{P_t} ← N^{P_t}_{h,i} ∪ Γ^{P_t}

Algorithm 3 Bound Updating (a sub-procedure used in RHT)
1: for all N^{P_t}_{h,i} ∈ Ω^{P_t} do
2:   T^{P_t}_{h,i} ← T^{P_t}_{h,i} + 1
3:   μ̂^{P_t}_{h,i} ← [(T^{P_t}_{h,i} − 1) μ̂^{P_t}_{h,i} + r_t] / T^{P_t}_{h,i}
4:   B^{P_t}_{h,i} ← μ̂^{P_t}_{h,i} + sqrt(k2 ln T / T^{P_t}_{h,i}) + k1 (m^{P_t})^h + L_X (√d_X / n_T)^α
5: end for
6: E^{P_t}_{h+1,2i−1} ← ∞
7: E^{P_t}_{h+1,2i} ← ∞

Fig. 4: Algorithm Demonstration

In Algorithm 3 we use B^{P_t}_{h,i} to indicate the upper bound of the highest reward. The first term, the average reward μ̂^{P_t}_{h,i} = [(T^{P_t}_{h,i} − 1) μ̂^{P_t}_{h,i} + r_t] / T^{P_t}_{h,i}, comes from the learners' payoffs. The second term, sqrt(k2 ln T / T^{P_t}_{h,i}), indicates the deviation in the exploration-exploitation problem: it is the tradeoff between exploration, which ensures the scope, and exploitation, which guarantees the depth. The third term is the deviation within the course node. As for the last term, since we substitute the sub-hypercube center point for the original context, we utilize the Lipschitz constant L_X and the diagonal √d_X / n_T of the sub-hypercube to denote the context deviation L_X (√d_X / n_T)^α.

B. Regret Analysis

According to the definition of regret, all the nodes which have been selected bring regret. We consider the regret in one sub-hypercube P_t and then sum over all sub-hypercubes. Since the regret is the deviation from the optimal courses, we first need to define the optimal course.

Definition 2. To locate the node in an affirmative way, we define the optimal track as ℓ_{h,i} = ∪_{h'=1}^{h} N^{P_t *}_{h',i'}.

The track is the aggregate of the optimal nodes with depth from 0 to h. To represent the regret precisely, we need to define the minimum suboptimality gap, which indicates the difference between the optimal course in a node and the overall optimal course, to better describe the model.

Definition 3. The Minimum Suboptimality Gap and the context gap are

D^{P_t}_{h,i} = f(r^{P_t *}) − f(r^{P_t *}_{h,i})   and   L_X (√d_X / n_T)^α.   (3)

Theoretically speaking, the minimum suboptimality gap of N^{P_t}_{h,i} is the expected reward difference between the overall optimal course and the best one in N^{P_t}_{h,i}, and the context gap is the difference between the original point and the center point of the context sub-hypercube P_t. However, what we can get is the mean payoff rather than the expectation, so we will bound this loss in the following lemmas. As for the context regret, we take the largest difference it can bring about in the sub-hypercube. After these definitions, we can find a measurement to divide all the nodes into two kinds for the following proof. Based on Definition 3, we let φ^{P_t}_h be the set of 2[k1 (m^{P_t})^h + L_X (√d_X / n_T)^α]-optimal nodes at depth h:

φ^{P_t}_h = { N^{P_t}_{h,i} : f(r^{P_t *}) − f(r^{P_t *}_{h,i}) ≤ 2[k1 (m^{P_t})^h + L_X (√d_X / n_T)^α] }.

For convenience, we use the depth-wise sets to illustrate the node sets:

Γ^{P_t} = φ^{P_t}_h ∪ (φ^{P_t}_h)^c.   (7)
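The Bound Updating step admits a direct numeric sketch. This is our own Python rendering under assumed constants (the dict-based node record and the constant values are hypothetical, only the four-term formula follows the text): B is the empirical mean plus the exploration term, the course-cluster deviation and the context deviation.

```python
import math

# Assumed illustrative constants; k1, k2, m, L_X, alpha, n_T, d_X
# follow the paper's notation but these values are our own choices.
K1, K2, M_RATE = 2.0, 1.0, 0.5
L_X, ALPHA, N_T, D_X = 1.0, 1.0, 2, 3

def update_bound(node, reward, T):
    """One pass of Bound Updating for a single node on the path Omega."""
    node["T"] += 1
    node["mu"] += (reward - node["mu"]) / node["T"]        # running mean
    confidence = math.sqrt(K2 * math.log(T) / node["T"])   # exploration term
    node_diam = K1 * M_RATE ** node["depth"]               # course-cluster deviation
    ctx_dev = L_X * (math.sqrt(D_X) / N_T) ** ALPHA        # context deviation
    node["B"] = node["mu"] + confidence + node_diam + ctx_dev
    return node["B"]

node = {"T": 0, "mu": 0.0, "B": float("inf"), "depth": 2}
b = update_bound(node, reward=0.8, T=100)
```

As T^{P_t}_{h,i} grows, the confidence term shrinks, so B tightens toward the node's true mean reward plus the two fixed deviations.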
Note that we call the nodes in the set φ^{P_t}_h optimal nodes and those outside it suboptimal nodes. We defined above the regret when one node is selected; however, we must also determine how many times a node is chosen in the recommending process. We start with the suboptimal nodes. Based on Definition 2 and Definition 3, the suboptimal nodes leave the optimal track ℓ^{P_t *}_{h,i} at some depth.

Definition 4. We introduce the concept of the packing number. Assuming the size of the course space is 1, at depth h the node size threshold is k1 (m^{P_t})^h, so the packing number of ∪_i N^{P_t}_{h,i} with balls of radius Ra is

κ^{P_t}_h = K [Ra]^{−d_C},   (8)

where K is an adjustment constant and Ra is the packing balls' radius.

Lemma 1. Suppose the node N^{P_t}_{h,i} is suboptimal and its path leaves the best track at depth j (1 ≤ j ≤ h − 1). For a random integer q, the expected number of times the node N^{P_t}_{h,i} and its descendants are selected in P_t satisfies

E[T^{P_t}_{h,i}(t')] ≤ q + \sum_{n=q+1}^{t'} P[ B^{P_t}_{h,i}(n) > f(r^{P_t *}) and T^{P_t}_{h,i}(n) > q, or B^{P_t}_{j,i'}(n) ≤ f(r^{P_t *}) for some j ∈ {q+1, ..., n−1} ].

Proof: See Appendix A.

We determine the threshold of the number of times the node N^{P_t}_{h,i} and its descendants are selected by Lemma 1. Since we do not know how many times the context sub-hypercube P_t has been selected by time T, we use the context time t' to represent the total time spent in P_t. The sum of t' over all sub-hypercubes is the total time: Σ t' = T.

Lemma 4. In the same context sub-hypercube, the number of 2[k1 (m^{P_t})^h + L_X (√d_X / n_T)^α]-optimal nodes can be bounded as

|φ^{P_t}_h| ≤ K [k1 (m^{P_t})^h + L_X (√d_X / n_T)^α]^{−d_C}.   (9)

Proof: Since the course number is infinite and we cannot know the data exactly, the dimension of a course cannot be determined. There exists a constant d' such that

|φ^{P_t}_h| ≤ κ^{P_t}_h( ∪{N^{P_t}_{h,i} ∈ φ^{P_t}_h}, k1 (m^{P_t})^h + L_X (√d_X / n_T)^α ) ≤ K [k1 (m^{P_t})^h + L_X (√d_X / n_T)^α]^{−d'}.   (10)

We take the minimal such d' as the dimension of the courses, d_C. Since MOOC big data has various data structures, knowing the exact details of courses is impossible; thus we cannot determine the accurate dimension d_C. After we know the regret of selecting one node, the number of times a node has been selected and the number of chosen nodes, we can bound the whole regret in Theorem 1. However, from Lemma 1 we cannot get the expectation of the node chosen times directly, thus we introduce Lemma 2 to bridge the loss between the mean value and the expectation.

Lemma 2. Based on Assumption 2, the inequality

P{ B^{P_t}_{h,i}(t') ≤ f(r^{P_t *}) } ≤ (t')^{−2k_2+1}   (4)

holds for all optimal nodes N^{P_t}_{h,i} at any context time t'.

Proof: See Appendix B.

Lemma 2 determines the threshold of the optimal nodes, which means that B^{P_t}_{h,i} will eventually exceed the expectation of the reward. This guarantees the accuracy of our B^{P_t}_{h,i} with strict mathematical proofs. With the connection of Lemma 1 and Lemma 2, we can get precise results about the node chosen times in Lemma 3.

Theorem 1. From the lemmas above, we can get the regret of RHT:

E[R(T)] = O( T^{(d_X+d_C+1)/(d_X+d_C+2)} (ln T)^{1/(d_X+d_C+2)} ).   (11)

Proof: For the node division φ^{P_t}_h and (φ^{P_t}_h)^c, we divide the regret into three parts, E[R(T)] = E[R_1(T)] + E[R_2(T)] + E[R_3(T)], where Γ^{P_t}_1 contains the descendants of φ^{P_t}_H, Γ^{P_t}_2 contains the nodes of φ^{P_t}_h with depth from 1 to H, and Γ^{P_t}_3 contains the nodes in (φ^{P_t}_h)^c. Note that Γ^{P_t}_3 consists of children of the nodes in φ^{P_t}_H, since the suboptimal nodes will not be played deeper, and we use a node to represent itself and its descendants. As for MOOC, the three terms represent three processes of recommending a course: E[R_1(T)] comes from the optimal course nodes during the recommendation, E[R_2(T)] represents the regret in the suboptimal course nodes at the beginning, and E[R_3(T)] comes from the follow-up procedures. Thus E[R_3(T)] is much larger than E[R_2(T)], which means E[R_2(T)] does not matter for the whole regret.
The depth H is a constant to be determined later. Note that T = Σ t'; when all the time slots T fall in the same context sub-hypercube, the regret is smallest, and we consider the situation in which the time slots T are distributed uniformly. In this extreme situation, every context sub-hypercube has the same time t'.

Lemma 3. For the suboptimal nodes N^{P_t}_{h,i}, if q satisfies

q ≥ 4 k2 ln T / [ D^{P_t}_{h,i} − k1 (m^{P_t})^h − L_X (√d_X / n_T)^α ]^2,   (5)

then for all t' ≥ 1 we have

E[T^{P_t}_{h,i}(t')] ≤ 4 k2 ln T / [ k1 (m^{P_t})^h + L_X (√d_X / n_T)^α ]^2 + M,   (6)

where M is a constant less than 5.

Proof: See Appendix C.

We use the deviation of context and course to represent the played times in this lemma. Practically speaking, we find an upper bound on the number of times the course nodes are chosen, which means we can determine one node's regret during the process. But this is not sufficient to bound the whole regret; the last thing we have to know is how many nodes there are. We divide the nodes into two parts based on the course model as Γ^{P_t} = φ^{P_t} ∪ (φ^{P_t})^c. For the suboptimal nodes we can get D^{P_t}_{h,i} − k1 (m^{P_t})^h − L_X (√d_X / n_T)^α ≥ k1 (m^{P_t})^h + L_X (√d_X / n_T)^α.

For E[R_1(T)], the regret is generated from the optimal course clusters whose courses have been recommended. Since the number of these nodes cannot be bounded precisely, and the number of all nodes played is t', we suppose the nodes have been played the maximum number of times t':

E[R_1(T)] ≤ 4 [ k1 (m^{P_t})^H + L_X (√d_X / n_T)^α ] T.   (12)

When the two terms are equal, we can get

K ln T (n_T)^{d_X+1} / [ k1 (m^{P_t})^H ]^{d_C+1} = [ k1 (m^{P_t})^H + L_X (√d_X / n_T)^α ] T.   (15)

Since the context deviation and the course deviation are unknown, to minimize the deviation we assume the two deviations are equal as well, which means k1 (m^{P_t})^H = L_X (√d_X / n_T)^α. To simplify the problem, we take α = 1 and k1 = 2; thus we can get n_T = (T / ln T)^{1/(d_X+d_C+2)} and the regret complexity is O( T^{(d_X+d_C+1)/(d_X+d_C+2)} (ln T)^{1/(d_X+d_C+2)} ).
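The parameter choice just described can be sketched numerically. These helper functions are our own (not the paper's code); they compute the slicing number n_T = (T / ln T)^{1/(d_X+d_C+2)} and the resulting regret exponent.

```python
import math

def slicing_number(T, d_x, d_c):
    # n_T = (T / ln T)^(1/(dX + dC + 2)), with alpha = 1 and k1 = 2.
    return (T / math.log(T)) ** (1.0 / (d_x + d_c + 2))

def regret_exponent(d_x, d_c):
    # gamma in R(T) = O(T^gamma ...): (dX + dC + 1) / (dX + dC + 2).
    return (d_x + d_c + 1) / (d_x + d_c + 2)

n_t = slicing_number(T=10**6, d_x=3, d_c=3)
gamma = regret_exponent(3, 3)   # 7/8 < 1, i.e. sublinear regret
```

Since γ < 1 for any finite d_X and d_C, R(T)/T → 0: the average regret vanishes, though convergence slows as the dimensions grow.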
In the MOOC setting, this is the regret from the worse course clusters selected at the beginning of the recommendation:
$$\mathbb{E}[R_2(T)] \le \sum_{P_t}\sum_{h=1}^{H} 4\left[k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\right] \left|\varphi^{P_t}_h\right| \le 4K(n_T)^{d_X} \sum_{h=0}^{H} \frac{k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}}{\left[k_1(m^{P_t})^h\right]^{d_C}}.$$
From Lemma 4 we know that the number of suboptimal nodes at depth $h$ is $\left|\varphi^{P_t}_h\right| \le K\left[k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}\right]^{-d_C}$, and the number of context sub-hypercubes is $(n_T)^{d_X}$, from which the last inequality follows. Here we take the $P_t$ with the largest regret deviation as the representative context sub-hypercube, so the sum of regrets can be presented in this way.

The last term is the regret from the suboptimal course nodes in the deeper recommendation process, and its bound is
$$\mathbb{E}[R_3(T)] \le \sum_{P_t}\sum_{h=1}^{H} \sum_{N^{P_t}_{h,i} \in (\varphi^{P_t}_h)^c} 4\left[k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\right].$$

V. DISTRIBUTED STORED COURSE RECOMMENDATION TREE

A. Distributed Algorithm for Multiple Course Storages

In this section we present a new algorithm, the Distributed Storage Repartimiento Hierarchical Tree (DSRHT), which relieves storage pressure by using distributed units to store the course data in clouds. Since the practical situation may be very complex, we bound the number $n_d$ of distributed units with our tree model as $2^{z-1} \le n_d \le 2^z$, where $z$ is a depth of the tree and $2^z$ is the number of nodes at that depth. In the following we substitute $2^z$ for $n_d$. Algorithm 4 still locates the context sub-hypercube first. Then, since there are $2^z$ distributed units, it ascertains these top nodes, and with the attained information it starts to find the course using the Bound Updating and Exploration routines. The number of distributed units must be determined; we assume there are $2^z$ storage nodes satisfying
$$2^z \le \left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}}, \quad m^{P_t} \in \left[\tfrac{1}{2}, 1\right). \qquad (16)$$
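The two constraints above, $2^{z-1} \le n_d \le 2^z$ and condition (16), can be sketched as follows (a minimal illustration; the function names are my own, not from the paper):

```python
import math

def depth_for_units(n_d: int) -> int:
    """Smallest tree depth z with 2**(z-1) <= n_d <= 2**z."""
    return max(0, math.ceil(math.log2(n_d)))

def max_units(T: float, d_X: int, d_C: int) -> int:
    """Upper limit on the unit count 2**z from condition (16):
    2**z <= (T / ln T)^((dX+dC-1)/(dX+dC+2))."""
    exp = (d_X + d_C - 1) / (d_X + d_C + 2)
    return int((T / math.log(T)) ** exp)
```

For example, 1024 distributed units correspond to depth $z = 10$, the value used later in the experiments.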
Since the number of distributed units is limited while the time horizon is very large, this assumption is reasonable. As for $m^{P_t}$: since we adopt the binary tree model, each course cluster is divided into two clusters at every split, so taking $m^{P_t} > 0.5$ is quite reasonable. The algorithm first selects the initial nodes at depth $z$ and then processes the data in the same way as RHT. In the MOOC setting, the algorithm starts at the distributed nodes, such as individual websites or other platforms. For the tree partition, the difference is that we leave out the course nodes whose depth is less than $z$ to cut down the storage cost. In the complexity section we prove that the storage can be bounded linearly in the best situation.

Returning to the proof: each node in $(\varphi^{P_t}_h)^c$ is $2\left[k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}\right]$-suboptimal, so we can get $D^{P_t}_{h,i} - k_1(m^{P_t})^h - L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha} \ge k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}$. We also note that the nodes in $\Gamma^{P_t}_3$ are children of nodes in $\Gamma^{P_t}_2$, since all nodes in $\Gamma^{P_t}_3$ are roots of suboptimal subtrees; hence the number of nodes in $\Gamma^{P_t}_3$ is less than twice that of $\Gamma^{P_t}_2$. Then
$$\sum_{P_t}\sum_{h=1}^{H} 4\left[k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}\right] \left|(\varphi^{P_t}_h)^c\right| \le \sum_{P_t}\sum_{h} \frac{8K(n_T)^{d_X}}{\left[k_1(m^{P_t})^h\right]^{d_C}} \left( \frac{4k_2\ln T}{\left[k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}\right]^2} + M \right).$$
Note that the second term is an infinitesimal of higher order than the third, so we focus on the first and last terms, which are the decisive factors of the regret. As the depth $H$ increases, $\mathbb{E}[R_1(T)]$ decreases but $\mathbb{E}[R_3(T)]$ increases, and balancing them yields the regret complexity $O\!\left(T^{\frac{d_X+d_C+1}{d_X+d_C+2}} (\ln T)^{\frac{1}{d_X+d_C+2}}\right)$. From this conclusion the average regret finally converges to zero, which means the algorithm eventually finds the optimal courses for the learners. Note that the tree exists physically: we store it in the cloud during the recommending process. Since future datasets will be fairly large, using distributed storage technology to solve the storage problem is unavoidable.
When we set their complexities equal, we obtain the minimum. For a context sub-hypercube $P_t$, every played node brings two kinds of regret: the context regret $L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}$ and the course-node regret $k_1(m^{P_t})^H$. The complexity of the first term is
$$O\left\{ 4\left[ k_1(m^{P_t})^H + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \right] T \right\}. \qquad (13)$$
As for the last term, its complexity is
$$O\left( \sum_{P_t}\sum_{h} \frac{8K(n_T)^{d_X}}{\left[k_1(m^{P_t})^h\right]^{d_C}} \left[ \frac{4k_2\ln T}{\left(k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}\right)^2} + M \right] \right) = O\left( \frac{(n_T)^{d_X+1}\ln T}{\left[k_1(m^{P_t})^H\right]^{d_C+1}} \right). \qquad (14)$$

Algorithm 4 Distributed Course Recommendation Tree
Require: the constants $k_1 > 0$, $m \in (0,1)$, the number of storage units $2^z$, and the learner's context $x_t$.
Auxiliary functions: Exploration and Bound Updating.
Initialization: for all context sub-hypercubes belonging to $P_T$, set $\Gamma^{P_t} = \{N^{P_t}_{z,1}, N^{P_t}_{z,2}, \ldots, N^{P_t}_{z,2^z}\}$ and $E^{P_t}_{z,i} = \infty$ for $i = 1, 2, \ldots, 2^z$.
1: for $t = 1, 2, \ldots, T$ do
2:   for $d_t = 0, 1, 2, \ldots, d_X$ do
3:     Find the context interval in dimension $d_t$
4:   end for
5:   Get the context sub-hypercube $P_t$
6:   $x_t \leftarrow$ center point of $P_t$
7:   for $j = 1, 2, \ldots, 2^z - 1$ do
8:     if $N^{P_t}_{z,j} < N^{P_t}_{z,j+1}$ then
9:       $N^{P_t}_{z,j} = N^{P_t}_{z,j+1}$
10:    end if
11:  end for
12:  $N^{P_t}_{h,i} \leftarrow N^{P_t}_{z,j}$
13:  $\Omega^{P_t} \leftarrow N^{P_t}_{h,i}$
14:  Exploration($\Gamma^{P_t}$)
15:  Select a course from the node $N^{P_t}_{h,i}$ at random and recommend it to the learner
16:  Get the reward $r_t$
17:  for all $P_t \in P_T$ do
18:    Bound Updating($\Omega^{P_t}$)
19:    $\Omega^{P_t}_{\text{temp}} \leftarrow \Omega^{P_t}$
20:    while $\Omega^{P_t}_{\text{temp}} \neq \emptyset$ do
21:      $N^{P_t}_{h,i} \leftarrow$ one leaf of $\Omega^{P_t}$

B. Regret Analysis

In this subsection we prove that the regret of DSRHT can be bounded sublinearly. Since we use the model to determine the threshold on the number of distributed units, we can continue to use the lemmas from the RHT analysis.

VI. STORAGE COMPLEXITY

The storage problem has existed in big data for a long time, so how to use distributed storage technology to handle it matters a lot. In this section, we analyze the two algorithms' space complexity mathematically. We use $S(T)$ to denote the storage space complexity.
For the RHT algorithm, since storage grows from the root node, it is straightforward to see that the space complexity is linear, $O(\mathbb{E}[S(T)]) = O(T)$.

Theorem 3. In the optimal condition, taking the number of storage units as $2^z = \left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}}$, the storage space complexity satisfies
$$\mathbb{E}[S(T)] = O\!\left( T - \left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}} \right).$$

Proof: Every round $t$ explores at most one new leaf node. To get the optimal result, we suppose the tree is as deep as possible and choose $z = \frac{1}{\ln 2}\ln\!\left[ \left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}} \right]$. While $t < 2^{z+1}$, we have $S_1(T) \le 2^z = \left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}}$; when $t \ge 2^{z+1}$, each round selects one previously unplayed node, so the second part is $S_2(T) \le T - 2^{z+1} = T - 2\left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}}$. Thus we get the storage complexity
$$\mathbb{E}[S(T)] = O\!\left( T - \left(\frac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}} \right). \qquad (18)$$
Since the value of $z$ is adjustable, an appropriate choice makes the space complexity sublinear. From (18), if the data dimension is fairly large, the space complexity is relatively small; however, a very large dataset and a tremendous number of distributed units make the algorithm learn too slowly and make the regret very large. Thus choosing an appropriate parameter is crucial.

There are many MOOC platforms, and gathering all their data together is difficult and costly. DSRHT therefore handles the practical situation better while improving the storage condition. Besides, we compare our algorithms with similar tree-partition works; Table I lists the theoretical analysis results. Since ACR [29] involves an ergodic process, its time complexity and regret bound are relatively high compared with the others. As for HCT [31], having no context information gives it a very low space complexity and regret bound compared with our algorithms.
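The two-phase storage count in the proof of Theorem 3 can be sketched as follows (a minimal illustration under the proof's assumptions; function names are my own):

```python
import math

def optimal_depth(T: float, d_X: int, d_C: int) -> int:
    """Depth z from Theorem 3's proof: 2**z ≈ (T/ln T)^((dX+dC-1)/(dX+dC+2))."""
    exp = (d_X + d_C - 1) / (d_X + d_C + 2)
    return int(math.log2((T / math.log(T)) ** exp))

def stored_nodes(T: int, z: int) -> int:
    """Two-phase bound: at most 2**z initial nodes while t < 2**(z+1),
    then at most one newly explored leaf per remaining round."""
    s1 = 2 ** z
    s2 = max(0, T - 2 ** (z + 1))
    return s1 + s2
```

Larger $z$ shrinks the second phase (less storage) but delays precise recommendation, matching the regret-versus-storage trade-off discussed above.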
22:      $E^{P_t}_{h,i} \leftarrow \min\left\{ B^{P_t}_{h,i},\, \max\{E^{P_t}_{h+1,2i-1},\, E^{P_t}_{h+1,2i}\} \right\}$
23:      Delete $N^{P_t}_{h,i}$ from $\Omega^{P_t}_{\text{temp}}$
24:    end while
25:  end for
26: end for

To be more specific, we transform the distributed problem into RHT starting from depth $z$ rather than from the root node, so the lemmas of the RHT analysis remain valid. We now redivide the nodes, in contrast to the RHT algorithm, as $\Gamma^{P_t} = \Gamma^{P_t}_1 + \Gamma^{P_t}_2 + \Gamma^{P_t}_3 + \Gamma^{P_t}_4$, where the first three partitions are as before and the last one consists of the nodes at depth $z$ that are never played under Algorithm 1. For the node division $\varphi^{P_t}_h$ and $(\varphi^{P_t}_h)^c$, the depth $H$ ($z \le H$) is a constant to be selected later.

Theorem 2. The regret of the distributed stored algorithm is
$$\mathbb{E}[R(T)] = O\!\left( T^{\frac{d_X+d_C+1}{d_X+d_C+2}} (\ln T)^{\frac{1}{d_X+d_C+1}} \right). \qquad (17)$$

Proof: See Appendix D.

Compared with RHT, the regret bound increases slightly, and the initial performance is not as good as RHT's. However, the algorithm handles the practical setting better; considering the explosive growth of data in the future, DSRHT will perform better than RHT.

VII. NUMERICAL RESULTS

In this section we present: (1) the source of the dataset; (2) that the cumulative regret is sublinear and the average regret finally converges to 0; (3) a comparison of the regret of our algorithms against similar works; (4) that the distributed storage method reduces the space complexity. Fig. 5 illustrates the MOOC operation pattern in edX [27]: the right side is the teaching window and learning resources, and the left side includes lesson content, the homepage, forums and other function options.

A. Description of the Dataset

We take a dataset containing feedback information and course details from edX [27] and the intermediary MOOC website [6].

TABLE I: Theoretical Comparison

Algorithm | Context | Infinity | Time Complexity | Space Complexity | Regret
ACR [29] | Yes | No | $O(T^2 + K_E T)$ | $O\!\big(\sum_{l=0}^{E} K_l + T\big)$ | $O\!\big(T^{\frac{d_I+d_C+1}{d_I+d_C+2}} \ln T\big)$
HCT [31] | No | Yes | $O(T \ln T)$ | $O\!\big(T^{\frac{d}{d+2}} (\ln T)^{\frac{2}{d+2}}\big)$ | $O\!\big(T^{\frac{d+1}{d+2}} (\ln T)^{\frac{1}{d+2}}\big)$
RHT | Yes | Yes | $O(T \ln T)$ | $O(T)$ | $O\!\big(T^{\frac{d_X+d_C+1}{d_X+d_C+2}} (\ln T)^{\frac{1}{d_X+d_C+2}}\big)$
DSRHT | Yes | Yes | $O(T \ln T)$ | $O(T - 2^z)$ | $O\!\big(T^{\frac{d_X+d_C+1}{d_X+d_C+2}} (\ln T)^{\frac{1}{d_X+d_C+1}}\big)$
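The Estimation backup used in line 22 of Algorithm 4 propagates bounds bottom-up through the binary tree. A minimal sketch, assuming a simple node structure of my own devising (not the paper's data layout):

```python
import math

class Node:
    """Illustrative tree node: B is the node's optimistic bound,
    E its Estimation (infinity for unexplored leaves)."""
    def __init__(self, B, children=()):
        self.B = B
        self.children = list(children)
        self.E = math.inf

def backup(node: Node) -> float:
    """Bottom-up update mirroring line 22:
    E[h,i] = min(B[h,i], max(E of the two children)); a leaf keeps E = B."""
    if not node.children:
        node.E = node.B
    else:
        node.E = min(node.B, max(backup(c) for c in node.children))
    return node.E
```

Taking the max over children keeps the most promising subtree's estimate, while the min with the node's own bound prevents the estimate from exceeding what the node itself can justify.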
On those platforms, the context dimensions include nationality, gender, age and highest education level, so we take $d_X = 4$. Besides, edX offers a questionnaire from Harvard University to enrich the context information, enlarging the dimension to around thirty ($d_X = 30$). The course dimensions comprise starting time, language, professional level, providing school, course and program proportion, whether the course is self-paced, subordinate subject, etc. For the feedback system, we acquire reward information from review boards and forums; specifically, the reward is produced from two sources, the marking system and comments in the forums. As for users, when a novel field comes into vogue, a tremendous number of people can access it within seconds. Our data cover $2 \times 10^5$ learners using MOOCs on those platforms, and the average number of courses a learner comments on is around 30. Our algorithm focuses on the group of learners in the same context sub-hypercube rather than on individuals, so when new users arrive with context information and historical records, we simply treat them as new training data without distinguishing them. However, while the number of people is limited, the number of courses is effectively unlimited, even though generating a course is time-consuming, because education runs through the development of humankind. Our algorithm targets the highly inflated MOOC curriculum resources of the future, and the existing data bank is not large enough to demonstrate its superiority, since MOOC is a new field in education. We found 11352 courses on those platforms, including plenty of finished courses. The number of courses doubles every year, and based on this trend the quantity will grow more than forty-thousand-fold within 20 years. To balance accuracy against the scale of the data sources, we replicate the original sources forty-five thousand times to satisfy the size requirement.
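The context lookup this setup relies on (the $d_X = 4$ features quantized into $n_T$ intervals per dimension, as in lines 2-5 of the algorithms) can be sketched as follows. The feature names and the value of $n_T$ here are illustrative assumptions, not fixed by the paper:

```python
# Sketch: map a learner's normalized context vector to a sub-hypercube index.
from typing import List, Tuple

def context_cube(x: List[float], n_T: int) -> Tuple[int, ...]:
    """Quantize x in [0,1]^dX into one of n_T**dX context sub-hypercubes."""
    assert all(0.0 <= v <= 1.0 for v in x)
    # min(..., n_T - 1) keeps the boundary value 1.0 inside the last interval
    return tuple(min(int(v * n_T), n_T - 1) for v in x)

# Example with dX = 4 (say nationality, gender, age, education), scaled to [0,1]:
cube = context_cube([0.12, 1.0, 0.47, 0.73], n_T=5)
```

Learners whose contexts fall into the same tuple are then treated as one group, matching the paper's group-level (rather than per-individual) learning.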
Thus we extend the 11352 courses acquired from edX to around $5 \times 10^8$ to simulate the explosive data size of courses expected by 2030.

Fig. 5: MOOC Learning Model (the edX interface: lesson content including homework and video files, the forums from which rewards are collected, and the video window).

• Adaptive Clustering Recommendation algorithm (ACR) [29]: this algorithm injects contextual factors and can adapt to more learners; however, when the course database is fairly large, its ergodic process cannot handle the dataset well.
• High Confidence Tree algorithm (HCT) [31]: this algorithm supports an unlimited dataset however large it is, but its recommendation model effectively serves a single learner, since it does not take context into consideration.
• Our model considers both infinity and context, and thus better suits the future MOOC situation. In DSRHT we sacrifice some immediate performance to obtain better long-term performance.

To verify the conclusions practically, we divide the experiment into the following three steps:
1) Step 1: We compare our RHT algorithm with the two previous works, ACR [29] and HCT [31], under different sizes of training data. We input over $6 \times 10^6$ training samples, including context information and feedback records in the reward space described in the dataset section, into the three models, which then recommend the courses stored in the cloud. Since HCT does not support context, we normalize all context information to the same point (the center of the unit context hypercube). Because the reward distribution is stochastic, we run each simulation 10 times and average the results to restrain the interference of random factors. The two regret tendency diagrams are then plotted to evaluate the algorithms' performance.
2) Step 2: We use the DSRHT algorithm to simulate the results.
The RHT algorithm can be seen as a degraded DSRHT with $z = 0$, and we compare the DSRHT algorithm under different parameters $z$. Without loss of generality, we take $z = 0$, $z = 10$ and $z = \frac{d_X+d_C-1}{d_X+d_C+2} \cdot \frac{\ln(T/\ln T)}{\ln 2} \approx 20$.

B. Experimental Setup

As for our algorithms, the final number of training samples is over $6 \times 10^6$ and the number of courses is about $5 \times 10^8$, so relative to the training size the course number can be regarded as infinite. Three works are compared, as introduced above.

TABLE II: Average Accuracies of RHT

Number ($\times 10^6$) | 1 | 2 | 3 | 4 | 5 | 6
ACR [29] | 65.43% | 78.62% | 83.23% | 86.28% | 88.19% | 88.79%
HCT [31] | 81.02% | 82.13% | 82.76% | 83.01% | 83.22% | 83.98%
RHT | 85.34% | 87.62% | 89.92% | 90.45% | 91.09% | 91.87%

Fig. 6: Contrast of Regret (RHT). Fig. 7: Contrast of Average Regret (RHT). Fig. 8: Contrast of Regret with Different Parameter z.

From Fig. 6 (the regret diagram), as time goes on ACR's regret becomes lower than HCT's. From Fig. 7 (the average regret diagram), HCT's average regret is lower than ACR's at first, and the result also shows that ACR performs slightly better than HCT in the end. Table II records the average accuracies, i.e., the total reward divided by the number of training samples ("Number" denotes the number of training samples). We find that as the training size increases, all three algorithms improve, and our algorithm has the highest accuracy throughout the learning period. ACR performs poorly when the process starts, at 65.43%, worse than HCT's 81.02%; finally ACR converges to 88.79% while HCT is still at 83.98%. Our algorithm reaches 91.87%, much better than HCT. Fig. 8 and Fig. 9 analyze the DSRHT algorithm with the parameter $z$ set to 0, 10 and 20. Comparing $z = 0$ and $z = 10$ validates what we said before: the storage improvement focuses more on the long-run performance.
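The two metrics used above can be stated compactly: average accuracy is the cumulative reward divided by the number of training samples, and average regret is its complement against the per-round optimum. A minimal sketch (the reward sequences are illustrative, not data from the paper):

```python
def average_accuracy(rewards):
    """Cumulative reward / number of samples, as used for Table II."""
    return sum(rewards) / len(rewards)

def average_regret(rewards, optimal_reward=1.0):
    """Average regret against a per-round optimal reward; it converges to 0
    when the algorithm keeps recommending near-optimal courses."""
    return sum(optimal_reward - r for r in rewards) / len(rewards)
```

Sublinear cumulative regret is exactly the statement that `average_regret` tends to 0 as the number of samples grows.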
Then we plot the regret-versus-$z$ diagram to analyze the optimal constant parameter.

3) Step 3: We record the storage data to analyze the space complexity of the four algorithms. First we upload 517.68 TB of course indexing information to our school servers and run the four algorithms successively. During training we record the regret six times, and at the end of training we record the space usage of the tree, which represents the training cost. As for DSRHT, we use virtual partitions on the school servers to simulate the distributed stored course data: specifically, we re-upload the course data to the school servers in 1024 virtual partitions and then run the DSRHT algorithm.

C. Results and Analysis

We analyze our algorithms from two angles: comparison with the two related works, and comparison against itself under different parameters $z$. In each direction, we first compare the regret, then analyze the average regret, then discuss the accuracies derived from the average regret, and finally compare the storage conditions of the different algorithms. In Fig. 6 and Fig. 7 we compare the RHT algorithm with ACR and HCT. From Fig. 6 (the regret diagram), our method outperforms the other two, with less regret from the beginning; the HCT algorithm performs better than ACR at the start.

Fig. 9: Contrast of Average Regret with Different Parameter z.

TABLE III: Average Accuracies of DSRHT

Number ($\times 10^6$) | 1 | 2 | 3 | 4 | 5 | 6
z = 0 | 85.34% | 87.62% | 89.92% | 90.45% | 91.09% | 91.87%
z = 10 | 82.67% | 86.98% | 90.49% | 91.50% | 92.03% | 92.89%
z = 20 | 51.10% | 79.94% | 86.37% | 89.79% | 91.33% | 92.04%

TABLE IV: Average Storage Cost

Algorithm | Storage Cost (TB) | Storage Ratio
ACR [29] | 12573 | 24.287
HCT [31] | 2762 | 5.335
RHT | 4123 | 7.964
DSRHT (z = 10) | 2132 | 4.118
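The "storage ratio" column of Table IV is simply the actual space occupied divided by the raw course data size of 517.68 TB. A minimal sketch using the reported figures:

```python
# Storage ratio = actual space occupied / raw course data size (517.68 TB).
COURSE_DATA_TB = 517.68

def storage_ratio(actual_tb: float) -> float:
    return actual_tb / COURSE_DATA_TB

assert round(storage_ratio(2132), 3) == 4.118   # DSRHT with z = 10
```

The same computation reproduces the other rows of Table IV (e.g., 12573 TB for ACR gives 24.287).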
...the long-run performance. However, when $z = 20$, the algorithm takes a long time before it starts recommending courses precisely; even though its accuracy catches up with RHT at the end, the effect is not as good as we expected. Table III illustrates the problem more precisely: when the number of training samples is less than $4 \times 10^6$, $z = 20$ is the worst of the three settings; after that it catches up with RHT, 91.33% versus 91.09%. Thus, in selecting the number of distributed storage units one cannot pursue quantity alone; practicality matters as well.

Returning to the proof of Lemma 1 (Appendix A): according to the definition of Bound and Estimation, the Estimation of a node decreases as the depth decreases; thus $E^{P_t}_{k+1,(k+1)'} \le E^{P_t}_{k+1,i} \le E^{P_t}_{h,i} \le B^{P_t}_{h,i}$. We divide the event $\{B^{P_t}_{h,i}(t') \ge E^{P_t}_{k+1,(k+1)'}(t')\}$ into
$$\left\{B^{P_t}_{h,i}(t') > f(r^{P_t*})\right\} \cup \left\{f(r^{P_t*}) \ge E^{P_t}_{k+1,(k+1)'}(t')\right\}.$$
According to the definition of Bound and Estimation again, we get
$$\left\{f(r^{P_t*}) \ge E^{P_t}_{k+1,(k+1)'}(t')\right\} \subset \left\{f(r^{P_t*}) \ge B^{P_t}_{k+1,(k+1)'}(t')\right\} \cup \left\{B^{P_t}_{k+1,(k+1)'}(t') \ge E^{P_t}_{k+1,(k+1)'}(t')\right\}$$
$$\subset \left\{f(r^{P_t*}) \ge B^{P_t}_{k+1,(k+1)'}(t')\right\} \cup \left\{f(r^{P_t*}) \ge E^{P_t}_{k+2,(k+2)'}(t')\right\}.$$
Applying this inclusion repeatedly along the optimal path, we know
$$\left\{f(r^{P_t*}) \ge E^{P_t}_{k+1,(k+1)'}(t')\right\} \subset \bigcup_{j=k+1}^{n-1} \left\{f(r^{P_t*}) \ge B^{P_t}_{j,j'}(t')\right\} \cup \left\{f(r^{P_t*}) \ge E^{P_t}_{n,n'}(t')\right\}. \qquad (19)$$
As for $T^{P_t}_{h,i}(t')$: a played path that includes the node $N^{P_t}_{h,i}$ implies $B^{P_t}_{h,i}(t') \ge E^{P_t}_{k+1,(k+1)'}(t')$. Hence
$$\mathbb{E}\!\left[T^{P_t}_{h,i}(t')\right] = \sum_{n=1}^{t'} \mathbb{P}\left\{B^{P_t}_{h,i}(t') \ge E^{P_t}_{k+1,(k+1)'}(t')\right\} \le q + \sum_{n=1}^{t'} \mathbb{P}\left\{B^{P_t}_{h,i}(t') \ge E^{P_t}_{k+1,(k+1)'}(t') \text{ and } T^{P_t}_{h,i}(t') > q\right\}.$$
Continuing,
$$\mathbb{E}\!\left[T^{P_t}_{h,i}(t')\right] \le q + \sum_{n=q+1}^{t'} \mathbb{P}\Big\{ \big[B^{P_t}_{h,i}(n) > f(r^{P_t*}) \text{ and } T^{P_t}_{h,i}(n) > q\big] \text{ or } \big[B^{P_t}_{j,i_{h'}}(n) \le f(r^{P_t*}) \text{ for some } j \in \{q+1, \ldots, n-1\}\big] \Big\}.$$
In this inequality we bound each of the first $q$ probabilities by 1, so they contribute $q$; in the second term, since $T^{P_t}_{h,i}(n) > q$, the summands with $n \le q$ are zero, and with the help of inclusion (19) we obtain the conclusion.

As for the storage analysis, we use the detailed course information to represent the course data; the whole course storage is 517.68 TB. To make the comparison more intuitive, we denote by storage ratio the ratio of the actual space occupied to the space occupied by the course data. From Table IV we see that the ACR [29] algorithm is not suitable for real big data, since its storage ratio reaches 24.287. The HCT [31] algorithm performs well in space complexity, better than RHT. As for DSRHT, its storage ratio is 4.118, which is less than HCT's and nearly half of RHT's.

VIII. CONCLUSION

In this paper, we have presented the RHT and DSRHT algorithms for course recommendation over MOOC big data. To better fit the changing datasets of the future, we extend the number of objects to infinity. Attending to individualization in the recommender system, we inject context-awareness into our algorithms. Furthermore, we use distributed storage to relieve the storage pressure and make the system more suitable for big data. We have also tested our algorithms and compared them with two similar algorithms. As for our ongoing work, we intend to reduce the space complexity further by combining the method of HCT [31] with distributed storage to improve the storage condition.

APPENDIX B
PROOF OF LEMMA 2

Proof: According to Assumption 3, let $c_1, c_2$, selected under the same context, lie in the same optimal node, and let $c_1$ be the best course. Then
$$f(r^{P_t*}) - f(r^{P_t}_{c_2}) \le \operatorname{diam}\!\left(N^{P_t}_{h,i}\right) \le k_1 (m^{P_t})^h. \qquad (20)$$
Since within a context sub-hypercube we replace the context space by a single point, we must add the context deviation:
$$f(r^{P_t*}) - f(r^{P_t}_{c_2}) \le k_1 (m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha}. \qquad (21)$$
We denote the event that the played track goes through the node $N^{P_t}_{h,i}$ by $N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}$.
APPENDIX A
PROOF OF LEMMA 1

Proof: We assume that the played path leaves the best path at depth $k$; thus the Estimation at depth $k+1$ deviates from the actual value, which means $E^{P_t}_{k+1,(k+1)'} \le E^{P_t}_{k+1,i'}$, where the former is the best-path node and the latter is the node actually selected at depth $k+1$.

Continuing the proof of Lemma 2: therefore,
$$\mathbb{P}\left\{ B^{P_t}_{h,i}(t') \le f(r^{P_t*}) \text{ and } T^{P_t}_{h,i}(t') \ge 1 \right\}$$
$$= \mathbb{P}\left\{ \hat{\mu}^{P_t}_{h,i}(t') + \sqrt{\frac{k_2 \ln T}{T^{P_t}_{h,i}(t')}} + k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \le f(r^{P_t*}) \text{ and } T^{P_t}_{h,i}(t') \ge 1 \right\}$$
$$= \mathbb{P}\left\{ \hat{\mu}^{P_t}_{h,i}(t') + k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} - f(r^{P_t*}) \le -\sqrt{\frac{k_2 \ln T}{T^{P_t}_{h,i}(t')}} \text{ and } T^{P_t}_{h,i}(t') \ge 1 \right\}$$
$$\le \mathbb{P}\left\{ \sum_{n=1}^{t'} \left( f(\bar{r}^{P_t}_{c}) - f(r^{P_t}_{c_n}) \right) \mathbb{I}\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\} \le -\sqrt{k_2 (\ln T)\, T^{P_t}_{h,i}(t')} \text{ and } T^{P_t}_{h,i}(t') \ge 1 \right\}.$$
Here we multiplied both sides by $T^{P_t}_{h,i}(t')$; the last inequality follows from expression (21), since the dropped term is positive and removing it only loosens the bound. For convenience of illustration, we keep only the rounds $n$ for which $\mathbb{I}\{N^{P_t}_{h,i} \in \ell^{P_t}_{H,I}\} = 1$, and write $\breve{r}_c$ for the $r_c$ observed in those rounds.
Thus, with this notation,
$$\mathbb{P}\left\{ \sum_{n=1}^{T^{P_t}_{h,i}(t')} \left( \breve{r}_c - \breve{r}_{c_n} \right) \le -\sqrt{k_2 (\ln T)\, T^{P_t}_{h,i}(t')} \right\} \le \sum_{n=1}^{t'} \mathbb{P}\left\{ \sum_{j=1}^{n} \left( f(\breve{r}_c) - f(\breve{r}_{c_j}) \right) \le -\sqrt{k_2 (\ln T)\, n} \right\},$$
where we considered the cases $n = 1, 2, \ldots, T^{P_t}_{h,i}(t')$ and used $T^{P_t}_{h,i}(t') \le t'$; the last inequality applies the union bound and loosens the threshold. With the help of the Hoeffding-Azuma inequality [26],
$$\sum_{n=1}^{t'} \mathbb{P}\left\{ \sum_{j=1}^{n} \left( f(\breve{r}_c) - f(\breve{r}_{c_j}) \right) \le -\sqrt{k_2 (\ln T)\, n} \right\} \le \sum_{n=1}^{t'} \exp(-2k_2 \ln T) \le (t')^{-2k_2+1}, \qquad (22)$$
which gives the conclusion. Note that the sum of time $T$ here represents the contextual sum of time, since the number of courses in a context sub-hypercube is stochastic; for convenience we use $T$ for this sum.

APPENDIX C
PROOF OF LEMMA 3

Proof: On the event $T^{P_t}_{h,i}(t') \ge q$,
$$\mathbb{P}\left\{ B^{P_t}_{h,i}(t') > f(r^{P_t*}) \text{ and } T^{P_t}_{h,i}(t') \ge q \right\} = \mathbb{P}\left\{ \hat{\mu}^{P_t}_{h,i}(t') + \sqrt{\frac{k_2\ln T}{T^{P_t}_{h,i}(t')}} + k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} > f(r^{P_t*}_{h,i}) + D^{P_t}_{h,i} \text{ and } T^{P_t}_{h,i}(t') \ge q \right\}$$
$$\le \mathbb{P}\left\{ \left[\hat{\mu}^{P_t}_{h,i}(t') - f(r^{P_t*}_{h,i})\right] > \frac{1}{2}\left[ D^{P_t}_{h,i} - k_1(m^{P_t})^h - L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \right] \text{ and } T^{P_t}_{h,i}(t') \ge q \right\} \le (t')^{-2k_2+1}.$$
According to Lemma 1 and the prerequisite of Lemma 3, we select the upper bound of $q$ as $\frac{4k_2\ln T}{\left[D^{P_t}_{h,i} - k_1(m^{P_t})^h - L_X(\sqrt{d_X}/n_T)^{\alpha}\right]^2} + 1$. Thus
$$\mathbb{E}\!\left[T^{P_t}_{h,i}(t')\right] \le q + \sum_{n=q+1}^{t'} \mathbb{P}\Big\{ \big[ B^{P_t}_{h,i}(n) > f(r^{P_t*}) \text{ and } T^{P_t}_{h,i}(n) > q \big] \text{ or } \big[ B^{P_t}_{j,h'}(n) \le f(r^{P_t*}) \text{ for some } j \in \{q+1, \ldots, n-1\} \big] \Big\} \le q + \sum_{n=q+1}^{t'} \left[ (t')^{-2k_2+1} + n^{-2k_2+2} \right].$$
With the help of the assumption $q \ge \frac{4k_2\ln T}{\left[D^{P_t}_{h,i} - k_1(m^{P_t})^h - L_X(\sqrt{d_X}/n_T)^{\alpha}\right]^2}$, we can get
$$\sqrt{\frac{k_2 \ln T}{q}} \le \frac{1}{2}\left[ D^{P_t}_{h,i} - k_1(m^{P_t})^h - L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \right]. \qquad (23)$$
Taking the constant $k_2 \ge 1$,
$$1 + \sum_{n=q+1}^{t'} \left[ (t')^{-2k_2+1} + n^{-2k_2+2} \right] \le 4 \le M, \qquad (24)$$
and thus we obtain the conclusion of Lemma 3.

APPENDIX D
PROOF OF THEOREM 2

Proof: Since we substitute $j$ with $j'$, the lemmas remain valid for the distributed stored algorithm. Based on the new segmentation, the regret can be presented as
$$\mathbb{E}[R(T)] = \mathbb{E}[R_1(T)] + \mathbb{E}[R_2(T)] + \mathbb{E}[R_3(T)] + \mathbb{E}[R_4(T)].$$
For $\mathbb{E}[R_1(T)]$, since it is the same as in Algorithm 1, the first term satisfies
$$\mathbb{E}[R_1(T)] \le 4\left[ k_1(m^{P_t})^H + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \right] T. \qquad (25)$$
The depth now runs from $z$ to $H$, which requires $H > z$; to satisfy this we suppose $2^H \ge (T/\ln T)^{\frac{d_X+d_C-1}{d_X+d_C+2}}$. Since the exploration process starts from depth $z$, the selectable depths satisfy this inequality. Thus the second term's regret bound is
$$\mathbb{E}[R_2(T)] \le \sum_{P_t}\sum_{h=z}^{H} 4\left[ k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \right] \left|\varphi^{P_t}_h\right| \le 4K(n_T)^{d_X} \sum_{h=z}^{H} \frac{k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}}{\left[k_1(m^{P_t})^h\right]^{d_C}}. \qquad (26)$$
We choose the context sub-hypercube with the largest regret bound to continue inequality (26). As for the third term, the regret bound is
$$\mathbb{E}[R_3(T)] \le \sum_{P_t}\sum_{h=z}^{H} \sum_{N^{P_t}_{h,i} \in (\varphi^{P_t}_h)^c} 4\left[ k_1(m^{P_t})^h + L_X\!\left(\tfrac{\sqrt{d_X}}{n_T}\right)^{\alpha} \right].$$

[5] M. Hilbert, "Big data for development: a review of promises and challenges," Development Policy Review, pp. 135-174, 2016.
[6] http://mooc.guokr.com/
[7] G. Paquette, A. Miara, "Managing open educational resources on the web of data," International Journal of Advanced Computer Science and Applications (IJACSA), vol. 5, no. 8, 2014.
[8] G. Paquette, O. Marino, D. Rogozan, M. Léonard, "Competency-based personalization for massive online learning," available at: http://r-libre.teluq.ca/491/1/SLE-Competency-based%20personalisation-revised%2013.08.2014.pdf, 2014.
[9] C. G. Brinton, M. Chiang, "MOOC performance prediction via clickstream data and social learning networks," IEEE Conference on Computer Communications (INFOCOM), pp. 2299-2307, 2015.
[10] S. Bubeck, R. Munos, G. Stoltz, et al., "X-armed bandits," Journal of Machine Learning Research, vol. 12, pp. 1655-1695, 2011.
[11] G. Adomavicius, A. Tuzhilin, "Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, 2005.
[12] D. Yanhui, W. Dequan, Z. Yongxin, et al., "A group recommender system for online course study," International Conference on Information Technology in Medicine and Education, pp. 318-320, 2015.
[13] M. J. Pazzani, D. Billsus, "Content-based recommendation over a customer network for ubiquitous shopping," IEEE Transactions on Services Computing, vol. 2, no. 2, pp. 140-151, 2009.
[14] R. Burke, "Hybrid recommender systems: survey and experiments," User Modeling and User-Adapted Interaction, vol. 16, no. 2, pp. 325-341, 2007.
[15] K. Yoshii, M. Goto, K. Komatani, T. Ogata, H. G. Okuno, "An efficient hybrid music recommender system using an incrementally trainable probabilistic generative model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 435-447, 2008.
[16] L. Yanhong, Z. Bo, G. Jianhou, "Make adaptive learning of the MOOC: the CML model," International Conference on Computer Science and Education (ICCSE), pp. 1001-1004, 2015.
[17] A. Alzaghoul, E. Tovar, "A proposed framework for an adaptive learning of Massive Open Online Courses (MOOCs)," International Conference on Remote Engineering and Virtual Instrumentation (REV), pp. 127-132, 2016.
[18] C. Cherkaoui, A. Qazdar, A. Battou, A. Mezouary, A. Bakki, D. Mamass, A. Qazdar, B. Er-Raha, "A model of adaptation in online learning environments (LMSs and MOOCs)," International Conference on Intelligent Systems: Theories and Applications (SITA), pp. 1-6, 2015.
[19] E. Hazan, N. Megiddo, "Online learning with prior knowledge," Learning Theory, Berlin, Germany: Springer-Verlag, pp. 499-513, 2007.
[20] A. Slivkins, "Contextual bandits with similarity information," Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533-2568, Jan. 2014.
[21] J. Langford, T. Zhang, "The Epoch-Greedy algorithm for contextual multi-armed bandits," Proc. NIPS, pp. 1096-1103, 2007.
[22] W. Chu, L. Li, L. Reyzin, R. E. Schapire, "Contextual bandits with linear payoff functions," Proc. AISTATS, pp. 208-214, 2011.
[23] T. Lu, D. Pál, M. Pál, "Contextual multi-armed bandits," Proc. AISTATS, pp. 485-492, 2010.
[24] S. Bubeck, R. Munos, G. Stoltz, et al., "X-armed bandits," Journal of Machine Learning Research, vol. 12, pp. 1655-1695, 2011.
[25] C. Tekin, M. van der Schaar, "Distributed online big data classification using context information," IEEE Annual Allerton Conference on Communication, Control, and Computing, pp. 1435-1442, 2013.
[26] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, pp. 13-30, 1963.
[27] https://www.edx.org/
[28] J. P. Berrut, L. N. Trefethen, "Barycentric Lagrange interpolation," pp. 501-517, 2004.
[29] L. Song, C. Tekin, M. van der Schaar, "Online learning in large-scale contextual recommender systems," vol. PP, no. 99, 2015.
[30] A. Slivkins, "Contextual bandits with similarity information," Journal of Machine Learning Research, vol. 15, no. 1, pp. 2533-2568, 2014.
[31] M. G. Azar, A. Lazaric, E. Brunskill, "Online stochastic optimization under correlated bandit feedback," pp. 1557-1565, 2014.

We note again that the nodes in $\Gamma^{P_t}_3$ are children of the nodes in $\Gamma^{P_t}_2$. To be more specific, in a binary tree the number of child nodes is more than the number of parent nodes but less than twice it; thus the number of nodes in $\Gamma^{P_t}_3$ is less than twice that of $\Gamma^{P_t}_2$.
Hence
$$\mathbb{E}[R_3(T)] \le \sum_{P_t}\sum_{h=z}^{H} \frac{8K(n_T)^{d_X}}{\left[k_1(m^{P_t})^h\right]^{d_C}} \left[ \frac{4k_2\ln T}{\left( D^{P_t}_{h,i} - k_1(m^{P_t})^h - L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha} \right)^2} + M \right] \le \sum_{P_t}\sum_{h=z}^{H} \frac{8K(n_T)^{d_X}}{\left[k_1(m^{P_t})^h\right]^{d_C}} \left[ \frac{4k_2\ln T}{\left( k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha} \right)^2} + M \right].$$
Similar to the RHT case, the second term is an infinitesimal of higher order than $\mathbb{E}[R_3(T)]$. As for the fourth term, the depth $z$ is bounded, and the worst situation occurs when $z$ is maximal:
$$\mathbb{E}[R_4(T)] \le (2^z - 1)\left[ \frac{4k_2\ln T}{\left( k_1(m^{P_t})^h + L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha} \right)^2} + M \right]. \qquad (27)$$
Since we know that $2^z \le (T/\ln T)^{\frac{d_X+d_C-1}{d_X+d_C+2}}$, we have
$$\mathbb{E}[R_4(T)] = O\!\left( \left(\tfrac{T}{\ln T}\right)^{\frac{d_X+d_C-1}{d_X+d_C+2}} \ln T\, (n_T)^2 \right) = O\!\left( T^{\frac{d_X+d_C+1}{d_X+d_C+2}} (\ln T)^{\frac{1}{d_X+d_C+1}} \right). \qquad (28)$$
As in Theorem 1, we minimize the deviation by setting the two deviations equal, i.e., $k_1(m^{P_t})^H = L_X(\tfrac{\sqrt{d_X}}{n_T})^{\alpha}$. For simplicity we let $k_1 = 2$ and $\alpha = 1$, so the slicing number is $n_T = \left(\frac{T}{\ln T}\right)^{\frac{1}{d_X+d_C+2}}$. It remains to show that there exists $H$ satisfying both $2^H \ge (T/\ln T)^{\frac{d_X+d_C-1}{d_X+d_C+2}}$ and $k_1(m^{P_t})^H = L_X(\tfrac{\sqrt{d_X}}{n_T})$. Considering only the order of complexity and ignoring constants, such an $H$ exists whenever $m^{P_t} \in [\tfrac{1}{2}, 1)$. Based on the above, the regret complexity is $O\!\left( T^{\frac{d_X+d_C+1}{d_X+d_C+2}} (\ln T)^{\frac{1}{d_X+d_C+1}} \right)$.

REFERENCES

[1] L. Pappano, "The Year of the MOOC," The New York Times, 2012.
[2] T. Lewin, "Universities Abroad Join Partnerships on the Web," The New York Times, 2013.
[3] https://zh.coursera.org/
[4] A. Brown, "MOOCs make their move," The Bent, vol. 104, no. 2, pp. 13-17, 2013.