Predicting missing connections in a directed graph
Facebook challenge on Kaggle (ending date: 7/10/2012)
Brady Benware

Overview of Approach

While the implementation has some optimizations, conceptually the approach I used is as follows:

1. Edge Removal - Sample the provided graph to create a local data set with known missing edges
2. Feature Extraction
   a. Identify potential missing edges and extract feature data from the sampled graph
   b. Identify potential missing edges and extract feature data from the original graph
3. Random Forest
   a. Train a Random Forest (RF) on the data from 2.a to partition the potential missing edges into real (1) and fake (0)
   b. Use the RF from step 3.a to determine the probability that a potential missing edge is real
4. Top10 Selection - Create the submission files by listing, for each node, the potential missing edges ranked by their probability
   a. Create a submission file from the training-data out-of-bag estimates for local scoring
   b. Create a submission file for Kaggle from the predicted probabilities of the Test nodes

Edge Removal

I suspected that it would be critical to replicate the same edge removal process that was used by the competition administrators, so I spent some effort reverse engineering it. The final rules for removing an edge from the graph are as follows (a code sketch follows this section):

Repeat until there are 262588 nodes in “My Test nodes”:
a. Select a node at random (Node A).
b. If the node is in the list of “Test nodes”, then resample. (This rule is VERY important. If you allow nodes from the test set into the training data, then your learner will be trying to classify the real final missing edges as fake.)
c. If the node has 0 followers OR the node has 0 leaders, then resample. (This rule seems like a mistake on the admins’ part, as the rule should have been not to use nodes that have only one connection. It is nonetheless an important rule for getting the correct sampling.)
d. Randomly select an edge attached to Node A (it can connect to either a leader or a follower).
e. Identify the other node attached to this edge (call it Node B).
f. If Node B has only one edge, then start over (resample a new Node A).
g. Add the selected edge to “My Missing edges”, add Node A to “My Test nodes” (if not already present), and delete the edge from “My Graph”.

This procedure created a list of nodes in “My Test nodes” whose distribution of leader and follower connection counts matched almost perfectly the distribution found when analyzing the nodes in “Test nodes”.
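To make this sampling procedure concrete, here is a minimal sketch, assuming the graph is held as consistent dictionaries of sets (followers[x] = the nodes following x, leaders[x] = the nodes x follows); the function and variable names are placeholders, not the actual implementation:

```python
import random

def sample_missing_edges(followers, leaders, test_nodes, n_target=262588, seed=0):
    """Remove edges from the local graph, mimicking the admins' procedure.

    followers[x] / leaders[x] are sets of node ids (x's followers and the
    nodes x follows); test_nodes is the set of official "Test nodes".
    Edges are deleted from the dictionaries in place.
    """
    rng = random.Random(seed)
    nodes = list(set(followers) | set(leaders))
    my_test_nodes, my_missing_edges = set(), []

    while len(my_test_nodes) < n_target:
        a = rng.choice(nodes)                              # (a) pick Node A at random
        if a in test_nodes:                                # (b) never reuse an official test node
            continue
        if not followers.get(a) or not leaders.get(a):     # (c) needs >= 1 follower and >= 1 leader
            continue
        # (d) pick a random incident edge, written as (src, dst) meaning "src follows dst"
        incident = [(a, x) for x in leaders[a]] + [(x, a) for x in followers[a]]
        src, dst = rng.choice(incident)
        b = dst if src == a else src                       # (e) the node at the other end
        if len(followers.get(b, ())) + len(leaders.get(b, ())) <= 1:
            continue                                       # (f) Node B must have more than one edge
        my_missing_edges.append((src, dst))                # (g) record the edge ...
        my_test_nodes.add(a)                               #     ... mark Node A as a local test node
        leaders[src].discard(dst)                          #     ... and delete it from "My Graph"
        followers[dst].discard(src)
    return my_test_nodes, my_missing_edges
```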
Feature Extraction

The process I used to extract features was as follows. For each node A of interest:
a. Find a set of nodes, RECOMMENDERS, that will make suggestions for other nodes for A to follow.
b. For each REC in RECOMMENDERS:
   i. Identify the probability that A will follow a recommendation from REC (computed as the number of REC’s recommendations that A is already following divided by the total number of recommendations from REC).
   ii. For each recommendation B that A is not already following, add this probability to a variable tracking the probability for B.
c. Divide each B probability sum by the count of RECOMMENDERS to get the average; this becomes the feature value.

This process was used for several different recommender and recommendation types. For example, the leaders of node A could be the recommenders, and they in turn would recommend their followers. The full list of features extracted in this manner is as follows:

F – the recommender is A itself, recommending its followers
LL – the recommenders are the leaders of A, recommending their leaders
LF – the recommenders are the leaders of A, recommending their followers
FL – the recommenders are the followers of A, recommending their leaders
FF – the recommenders are the followers of A, recommending their followers
LLL – the recommenders are the leaders of the leaders of A, recommending their leaders
LLF – the recommenders are the leaders of the leaders of A, recommending their followers
LFL – the recommenders are the followers of the leaders of A, recommending their leaders
LFF – the recommenders are the followers of the leaders of A, recommending their followers
FLL – the recommenders are the leaders of the followers of A, recommending their leaders
FLF – the recommenders are the leaders of the followers of A, recommending their followers
FFL – the recommenders are the followers of the followers of A, recommending their leaders
FFF – the recommenders are the followers of the followers of A, recommending their followers

More features…

For each recommendation B, I also extracted information about how often node B is followed by a node whose relationship to B is the same as node A’s. For example, take the case where the recommenders are the followers of node A and they are recommending their followers; the data of interest about node B is then how often a leader of a leader of B follows B. I call these connect-back probabilities, and the full list included in my training data was:

Lcb – the probability that the leaders of node B will follow node B (I actually discovered a bug here; it turns out I was not using this one. Too bad, as I think it plays an important role, at least when you are talking about thousandths of a point. Guess I should have written this document sooner.)
LLcb – the probability that B’s leaders’ leaders will follow B
LFcb – the probability that B’s leaders’ followers will follow B
FLcb – the probability that B’s followers’ leaders will follow B
FFcb – the probability that B’s followers’ followers will follow B

Two more to go…

Rank – Because the list of recommendations can be quite large, it is necessary to truncate it. I chose to consider the top 50 recommendations based on a preliminary ranking. Initially my ranking was based on a weighted average of the recommender probabilities:

W.Ave. = F + 1/4 * (LL + LF + FL + FF) + 1/8 * (LLL + LLF + LFL + LFF + FLL + FLF + FFL + FFF)

The recommendations were then ranked according to this metric and assigned a value between 0 and 1, where 0 was the lowest rank in the top 50 and 1 was the highest; the recommendations in between were linearly spaced. Later, there was a wonderfully controversial post by Den Souleo on the forum, and I incorporated his scoring methodology and used it as my ranking. Best I can tell, this accounts for about a 0.001 difference in my leaderboard score: http://www.kaggle.com/c/FacebookRecruiting/forums/t/2082/0-711-is-the-new-0

score – This value is similar to Rank, except that it uses the absolute values of the initial ranking metric (either Den’s method or the weighted average above) rather than the rank positions: the top recommendation gets a value of 1, while the others get W.Ave.(i) / max(W.Ave.).
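To make the recommender probability computation concrete, here is a minimal sketch for a single recommender/recommendation combination; the leaders/followers dictionaries are the same placeholders as above, and the function name is hypothetical:

```python
from collections import defaultdict

def recommender_feature(a, recommenders_of, recommends, leaders):
    """Average recommendation probability for each candidate node B.

    recommenders_of(a) -> iterable of recommender nodes for A (e.g. the leaders of A)
    recommends(rec)    -> iterable of nodes recommended by REC (e.g. the followers of REC)
    leaders[a]         -> set of nodes A already follows
    Returns {B: feature value} for every recommendation B that A does not yet follow.
    """
    already = leaders.get(a, set())
    recs = list(recommenders_of(a))
    if not recs:
        return {}
    scores = defaultdict(float)
    for rec in recs:
        suggested = list(recommends(rec))
        if not suggested:
            continue
        # (i) probability that A follows a recommendation from this recommender
        p = sum(1 for b in suggested if b in already) / len(suggested)
        # (ii) accumulate that probability for every not-yet-followed recommendation
        for b in suggested:
            if b not in already and b != a:
                scores[b] += p
    # (iii) average over the number of recommenders
    return {b: s / len(recs) for b, s in scores.items()}

# Example: the LF feature -- the leaders of A recommend their followers
# lf = recommender_feature(a,
#                          recommenders_of=lambda x: leaders.get(x, ()),
#                          recommends=lambda r: followers.get(r, ()),
#                          leaders=leaders)
```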
Random Forest

For the training data, each potential missing edge identified in the feature extraction step was labeled ‘1’ if the edge was one that had really been removed from the graph and ‘0’ if it had not. For the nodes in “Test nodes”, the same edge identification and feature extraction procedure was followed, but of course the edges were not labeled. Both the training data set and the test data set were fed into a random forest, and a probability was produced for each potential missing edge in the test data. This probability was used as the final relative ranking of the potential edges for each test node. The optimal number of features to consider at each branch point (a.k.a. Mtry) was 2; this was pretty consistent throughout all my experiments.

In addition to producing probabilities for the test set, out-of-bag (OOB) predictions were made for the training data as well. These OOB probabilities were used to create a ranking of missing edges within my local environment, which I could use to calculate an expected MAP@10. This expected MAP@10 depended greatly on the sampling procedure used for edge removal; in my early experiments, where I had not yet reverse engineered the removal procedure, the MAP@10 value could only be used as a relative figure of merit. One insight this revealed is that the ability to predict missing edges for nodes with fewer connections is much better than for nodes with a larger number of connections. Perhaps this is because highly connected nodes actually belong to multiple networks that should be considered separately (much more time would be needed to explore that).
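As a rough illustration of this step, the sketch below assumes a scikit-learn random forest (with Mtry = 2 expressed as max_features); the feature matrices and helper names are placeholders rather than the actual implementation. It returns the out-of-bag probabilities used for local MAP@10 scoring and the test probabilities used for the final ranking:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_candidate_edges(X_train, y_train, X_test, n_trees=500, mtry=2, seed=0):
    """Train a random forest on the labeled candidate edges.

    X_train / X_test hold one row per candidate edge with the features
    described above (LL, LF, ..., connect-back probabilities, Rank, score);
    y_train is 1 if the candidate edge was actually removed, else 0.
    Returns out-of-bag probabilities (local scoring) and test probabilities
    (submission ranking).
    """
    rf = RandomForestClassifier(n_estimators=n_trees,
                                max_features=mtry,   # Mtry = 2 worked best
                                oob_score=True,
                                random_state=seed,
                                n_jobs=-1)
    rf.fit(X_train, y_train)
    oob_prob = rf.oob_decision_function_[:, 1]       # P(edge is real) from out-of-bag votes
    test_prob = rf.predict_proba(X_test)[:, 1]
    return oob_prob, test_prob

def map_at_10(grouped):
    """Mean average precision at 10.

    grouped: iterable of (probabilities, labels) array pairs, one pair per
    source node, covering that node's candidate edges.
    """
    scores = []
    for prob, label in grouped:
        top = np.argsort(-prob)[:10]                 # top 10 candidates by probability
        hits, precisions = 0, []
        for i, idx in enumerate(top, start=1):
            if label[idx] == 1:
                hits += 1
                precisions.append(hits / i)
        denom = min(int(label.sum()), 10)
        scores.append(sum(precisions) / denom if denom else 0.0)
    return float(np.mean(scores))
```

Grouping the candidate edges by source node and feeding the OOB probabilities to map_at_10 gives the local expected MAP@10 described above; the test probabilities, grouped the same way, give the Top10 ranking for the Kaggle submission.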