Predicting missing connections in a directed graph

Facebook challenge on Kaggle (ending date: 7/10/2012)
Brady Benware
Overview of Approach
While the implementation has some optimizations, conceptually the approach I used is as follows:
1. Edge Removal - Sample the provided graph to create a local data set with known missing edges
2. Feature Extraction
a. Identify potential missing edges and extract feature data from the sampled graph
b. Identify potential missing edges and extract feature data from the original graph
3. Random Forest
a. Train a Random Forest (RF) on data from 2.a. to partition the potential missing edges
into real (1) and fake (0)
b. Use the RF from step 3.a. to determine the probability that a potential missing edge is
real
4. Top10 selection - Create the submission files by listing, for each node, the potential missing
edges ranked by their probability
a. Create a submission file from the training data out-of-bag estimates for local scoring
b. Create a submission file for kaggle from the predicted probabilities of the Test nodes
Edge Removal
I suspected that it would be critical to try to replicate the same edge removal process that was used by
the competition administrators. I therefore spent some effort to reverse engineer this process. The
final rules for removing an edge from the graph are as follows:
Repeat until there are 262588 nodes in “My Test nodes”
a. Select a node at random (Node A)
b. If the node is in the list of “Test nodes” then resample (This rule is VERY important. If
you allow nodes from the test set into the training data, then your learner will be trying to
classify the real final missing edges as fake)
c. If the node has 0 followers OR the node has 0 leaders then resample (This rule
seems like a mistake on the admins' part, as the rule should have been not to use nodes
that have only one connection. It is nonetheless an important rule to replicate in order to get the correct sampling)
d. Randomly select an edge attached to Node A (can be either a leader or a follower)
e. Identify the other node attached to this edge (call it Node B)
f. If Node B has only one edge then start over (resample a new node A)
g. Add the selected edge to “My Missing edges”, add Node A to “My Test nodes” (if not
already present), and delete the edge from “My Graph”
This procedure created a list of nodes in “My Test nodes” whose distribution of leader and follower
connection counts matched almost perfectly with the distribution found when analyzing the nodes in
“Test nodes”.
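To make the procedure concrete, the sketch below implements the same loop in Python. The graph representation (consistent dicts of follower/leader sets, with every edge present in both maps), the function name, and the default target count are my own illustrative choices, not the actual competition code.

```python
import random

def sample_missing_edges(followers, leaders, test_nodes, target=262588):
    """Replicate the admins' edge-removal procedure (illustrative sketch).

    followers[x] = set of nodes that follow x
    leaders[x]   = set of nodes that x follows
    test_nodes   = the official "Test nodes" list
    An edge (u, v) means u follows v and is assumed to appear in both maps.
    """
    my_test_nodes, my_missing_edges = set(), []
    nodes = list(followers.keys() | leaders.keys())
    while len(my_test_nodes) < target:
        a = random.choice(nodes)                        # a. pick a node at random
        if a in test_nodes:                             # b. never reuse official test nodes
            continue
        if not followers.get(a) or not leaders.get(a):  # c. need >= 1 follower and >= 1 leader
            continue
        # d. pick a random incident edge, in either direction
        incident = [(a, x) for x in leaders[a]] + [(x, a) for x in followers[a]]
        u, v = random.choice(incident)
        b = v if u == a else u                          # e. the other endpoint (Node B)
        if len(followers.get(b, ())) + len(leaders.get(b, ())) <= 1:
            continue                                    # f. Node B must keep at least one edge
        my_missing_edges.append((u, v))                 # g. record, mark A, delete the edge
        my_test_nodes.add(a)
        leaders[u].discard(v)
        followers[v].discard(u)
    return my_test_nodes, my_missing_edges
```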
Feature Extraction
The process I used to extract features was as follows:
For each node A of interest
a. Find a set of nodes RECOMMENDERS that will make suggestions for other nodes for A to
follow
b. For each REC in RECOMMENDERS
i. Identify the probability that A will follow a recommendation from REC (this is
computed as the number of recommendations A is already following divided by
the number of recommendations from REC)
ii. For each recommendation B that A is not already following, add the probability
to a variable tracking the probability for B
c. Divide each B probability sum by the count of RECOMMENDERS to get the average; this
becomes the feature value
This process was used for several different recommender and recommendation types. For
example, the leaders of node A could be the recommenders, and they in turn would recommend their
followers. The full list of features extracted in this manner is as follows (a code sketch follows the list):
F – recommender is A itself, recommending its followers
LL – recommenders are the leaders of A, recommending their leaders
LF – recommenders are the leaders of A, recommending their followers
FL – recommenders are the followers of A, recommending their leaders
FF – recommenders are the followers of A, recommending their followers
LLL – recommenders are the leaders of the leaders of A, recommending their leaders
LLF – recommenders are the leaders of the leaders of A, recommending their followers
LFL – recommenders are the followers of the leaders of A, recommending their leaders
LFF – recommenders are the followers of the leaders of A, recommending their followers
FLL – recommenders are the leaders of the followers of A, recommending their leaders
FLF – recommenders are the leaders of the followers of A, recommending their followers
FFL – recommenders are the followers of the followers of A, recommending their leaders
FFF – recommenders are the followers of the followers of A, recommending their followers
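To illustrate, here is a minimal sketch of how one of these features (FF: the followers of A recommending their own followers) could be computed under the averaging scheme described above. The dict-of-sets graph representation and the function name are assumptions for illustration, not the actual implementation; the other features differ only in which neighbor sets supply the recommenders and the recommendations.

```python
from collections import defaultdict

def ff_feature(a, followers, leaders):
    """FF: recommenders are the followers of A; each recommends its own followers.

    followers[x] = set of nodes following x; leaders[x] = set of nodes x follows.
    Returns {candidate B: average acceptance probability}.
    """
    recommenders = followers.get(a, set())   # nodes that follow A
    already = leaders.get(a, set())          # nodes A already follows
    sums = defaultdict(float)
    for rec in recommenders:
        recs = followers.get(rec, set())     # REC recommends its followers
        if not recs:
            continue
        # probability A accepts a recommendation from REC:
        # fraction of REC's recommendations that A already follows
        p = len(recs & already) / len(recs)
        for b in recs - already - {a}:       # only candidates A does not yet follow
            sums[b] += p
    n = len(recommenders)
    return {b: s / n for b, s in sums.items()} if n else {}
```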
More features…
For each recommendation I also extracted information about how often node B is followed by a
node whose relationship to B is the same as node A's relationship to B. For example, take the case of the
recommenders being the followers of node A, recommending their own followers. The data of
interest about node B is then how often a leader of a leader of B follows B. I call these connect-back
probabilities, and the full list included in my training data was (a sketch follows the list):
Lcb – the probability that the leaders of node B will follow node B (actually, I discovered a bug
here and it turns out I was not actually using this. Too bad, as I think it plays an important role, at least
when you are talking about thousandths of a point. Guess I should have written this document sooner.)
LLcb – probability that B's leaders' leaders will follow B
LFcb – probability that B’s leaders’ followers will follow B
FLcb – probability that B’s followers’ leaders will follow B
FFcb – probability that B’s followers’ followers will follow B
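One plausible reading of these connect-back probabilities is sketched below: walk a two-step leader/follower path out from B and measure what fraction of the nodes reached already follow B directly. The function name, the path encoding, and the graph representation are illustrative assumptions, not the actual code.

```python
def connect_back(b, followers, leaders, path="LL"):
    """Connect-back probability, e.g. path="LL" (LLcb): of B's leaders' leaders,
    what fraction also follow B directly?

    followers[x] = nodes following x; leaders[x] = nodes that x follows.
    """
    rel = {"L": leaders, "F": followers}
    # walk the leader/follower path out from B
    frontier = {b}
    for hop in path:
        frontier = set().union(*(rel[hop].get(n, set()) for n in frontier))
    frontier.discard(b)
    if not frontier:
        return 0.0
    # fraction of the reached nodes that already follow B
    return len(frontier & followers.get(b, set())) / len(frontier)
```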
Two more to go…
Rank – Because the list of recommendations can be quite large, it is necessary to truncate it. I
chose to consider the top 50 recommendations based on a preliminary ranking. Initially my ranking
was based on a weighted average of the recommender probabilities:
W.Ave. = F + 1/4 * (LL + LF + FL + FF) + 1/8 * (LLL + LLF + LFL + LFF + FLL + FLF + FFL + FFF)
Then the recommendations were ranked according to this metric and assigned a value between 0 and 1,
where 0 was the lowest rank in the top 50 and 1 was the highest rank. The recommendations in
between were linearly spaced.
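A minimal sketch of that preliminary ranking, assuming the per-path feature values for a candidate are held in a dict keyed by path name, and the W.Ave. values for one node A are held in a dict keyed by candidate B (all names are illustrative):

```python
def weighted_average(f):
    """W.Ave. = F + 1/4*(LL+LF+FL+FF) + 1/8*(LLL+...+FFF) for one candidate.

    f: dict of recommender-probability features keyed by path name.
    """
    two_hop = ["LL", "LF", "FL", "FF"]
    three_hop = ["LLL", "LLF", "LFL", "LFF", "FLL", "FLF", "FFL", "FFF"]
    return (f.get("F", 0.0)
            + 0.25 * sum(f.get(k, 0.0) for k in two_hop)
            + 0.125 * sum(f.get(k, 0.0) for k in three_hop))

def rank_feature(wave, top_k=50):
    """Keep the top_k candidates by W.Ave. and assign linearly spaced rank values:
    1 for the best, 0 for the worst retained, linear in between."""
    top = sorted(wave, key=wave.get, reverse=True)[:top_k]
    n = len(top)
    if n <= 1:
        return {b: 1.0 for b in top}
    return {b: 1.0 - i / (n - 1) for i, b in enumerate(top)}
```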
Later, there was a wonderfully controversial forum post by Den Souleo; I incorporated his
scoring methodology and used it as my ranking. Best I can tell, this accounts for about a 0.001 difference
in my leaderboard score.
http://www.kaggle.com/c/FacebookRecruiting/forums/t/2082/0-711-is-the-new-0
Score – this value is similar to Rank, except that it reflects the absolute differences in the weighting
metric (either Den's method or the one above) rather than just the ordering. The top-ranked recommendation gets a value
of 1, while the others get W.Ave.(i) / max(W.Ave.).
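A corresponding sketch of the Score value, again assuming a dict of W.Ave. values per candidate (illustrative only):

```python
def score_feature(wave, top_k=50):
    """Score: W.Ave.(i) / max(W.Ave.) over the retained top_k candidates,
    so the top candidate gets 1 and the others keep their relative magnitude."""
    top = sorted(wave, key=wave.get, reverse=True)[:top_k]
    if not top:
        return {}
    best = wave[top[0]]
    return {b: (wave[b] / best if best > 0 else 0.0) for b in top}
```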
Random Forest
For the training data, each potential missing edge identified from the feature extraction step was
labeled with a ‘1’ if the edge was one that was really removed from the graph, or ‘0’ if it had not been
removed from the graph. For the nodes in “Test nodes”, the same edge identification and feature
extraction procedure was followed, but of course the edges were not labeled.
Both the training data set and test data set were fed into a random forest and a probability was
produced for each potential missing edge in the test data. The probability was used as the final relative
ranking of the potential edges for each test node.
The optimal number of features to consider at each branch point (a.k.a. Mtry) was ‘2’. This was pretty
consistent throughout all my experiments.
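For reference, an equivalent setup using scikit-learn's RandomForestClassifier would look like the sketch below (this is not the exact code I ran, and the array names and file paths are placeholders). Here max_features plays the role of Mtry, and oob_score=True produces the out-of-bag probabilities discussed next.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical arrays: one row per candidate (node A, node B) pair, columns are
# the F..FFF, *cb, Rank and Score features; y = 1 if the edge was really removed.
X_train = np.load("train_features.npy")   # placeholder file names
y_train = np.load("train_labels.npy")
X_test = np.load("test_features.npy")

rf = RandomForestClassifier(
    n_estimators=500,
    max_features=2,      # 'Mtry' = 2 features considered at each split
    oob_score=True,      # keep out-of-bag estimates for local scoring
    n_jobs=-1,
)
rf.fit(X_train, y_train)

# probability that each candidate edge in the test data is a real missing edge
p_test = rf.predict_proba(X_test)[:, 1]
# OOB probability estimates for the training candidates (used for local MAP@10)
p_oob = rf.oob_decision_function_[:, 1]
```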
In addition to producing the probability for the test set, out-of-bag predictions were made for the
training data as well. These OOB probabilities were used to create a ranking of missing edges within my
local environment which I could use to calculate an expected MAP@10. This expected MAP@10
depended greatly on the sampling procedure used for edge removal. In my early experiments where I
had not yet reverse engineered the removal procedure, this MAP@10 value could only be used as a
relative figure of merit. One insight that this revealed is that the ability to predict missing edges for
nodes with fewer connections is much better than the ability to predict missing edges for nodes with a
larger number of connections. Perhaps this is because highly connected nodes actually belong to
multiple networks that should be considered separately (much more time needed to explore that).
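For completeness, a small sketch of the MAP@10 computation used for that local scoring (names are illustrative, and it assumes each node's 10 suggestions are distinct):

```python
def map_at_10(predicted, actual):
    """Mean Average Precision at 10.

    predicted: {node: list of up to 10 suggested nodes, best first}
    actual:    {node: set of truly removed destination nodes}
    """
    total = 0.0
    for node, truth in actual.items():
        hits, precision_sum = 0, 0.0
        for i, b in enumerate(predicted.get(node, [])[:10]):
            if b in truth:
                hits += 1
                precision_sum += hits / (i + 1)   # precision at this cutoff
        total += precision_sum / min(len(truth), 10) if truth else 0.0
    return total / len(actual) if actual else 0.0
```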