Opinion Fraud Detection in Online Reviews using Network Effects

Leman Akoglu, Stony Brook University ([email protected])
Rishi Chandy, Carnegie Mellon University ([email protected])
Christos Faloutsos, Carnegie Mellon University ([email protected])

Motivation: Which reviews do/should you trust?

Problem Statement
A network classification problem. Given
• the user-product review network (bipartite), and
• review sentiments (+: thumbs-up, -: thumbs-down),
classify the network objects into type-specific classes:
• users: `honest' / `fraudster'
• products: `good' / `bad'
• reviews: `genuine' / `fake'

A Fake-Review(er) Detection System
Desired properties that such a system should have:
Property 1: Network effects. The fraudulence of reviews/reviewers is revealed in relation to others, so the review network should be used.
Property 2: Side information. Behavioral clues (e.g. login times) and linguistic clues (e.g. use of capital letters) should be exploited.
Property 3: Un/Semi-supervision. Methods should not expect a fully labeled training set (human labelers are at best close to random).
Property 4: Scalability. Methods should be (sub)linear in the data/network size.
Property 5: Incremental. Methods should compute fraudulence scores incrementally as new data arrives (hourly/daily).

Problem Formulation: A Collective Classification Approach
The objective function uses pairwise Markov Random Fields (Kindermann & Snell, 1980): node labels are modeled as random variables, each node carries a prior belief, and signed edges select compatibility potentials that couple the labels of neighboring (observed-neighbor) nodes. A reconstruction of this objective is sketched below, after the Conclusions.

Inference
Finding the best label assignments is the inference problem, which is NP-hard for general graphs. We use a computationally tractable (linearly scalable with network size) approximate inference algorithm called Loopy Belief Propagation (LBP) (Pearl, 1982). It is an iterative process in which neighboring variables "talk" to each other by passing messages: "I (variable x1) believe you (variable x2) belong in these states with various likelihoods…"
signed Inference Algorithm (sIA):
I) Repeat for each node: pass messages to its neighbors.
II) At convergence (when consensus is reached): calculate the beliefs.
A code sketch of this message-passing scheme is given after the Conclusions.

Datasets
I) SWM: all app reviews of the entertainment category (games, news, sports, etc.) from an anonymous online app store database. As of June 2012:
• 1,132,373 reviews
• 966,842 users
• 15,094 software products (apps)
Ratings: 1 (worst) to 5 (best); in the figures, '+' marks a 4-5 rating and 'o' a 1-2 rating.
II) Simulated fake-review data (with ground truth).

Competitors
Compared to two iterative classifiers (modified to handle signed edges):
I) Weighted-vote Relational Classifier (wv-RC) (Macskassy & Provost, 2003)
II) HITS, with honesty and goodness in mutual recursion (Kleinberg, 1999)

Results
Performance on simulated data: [figure; panels from left to right: sIA, wv-RC, HITS]
Real-data results: [figures: top 100 users and their product votes ("bot" members?); top-scorers matter, scoring shown before and after]

Conclusions
A novel framework that exploits network effects to automatically spot fake reviews and reviewers.
• Problem formulation as collective classification in bipartite networks
• Efficient scoring/inference algorithm that handles signed edges
• Desirable properties: i) general, ii) un/semi-supervised, iii) scalable
• Experiments on real & synthetic data: better than competitors, and finds real fraudsters.
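Sketch: pairwise-MRF objective. The poster names only the ingredients of the objective (prior beliefs, signed edges, compatibility potentials); the display below is a minimal reconstruction in the standard pairwise-MRF form, with assumed notation phi for prior potentials and psi^s for sign-dependent compatibilities. The exact symbols and potential values are not given on the poster.

\[
P(\mathbf{y}) \;=\; \frac{1}{Z}\,\prod_{Y_i \in \mathcal{V}} \phi_i(y_i)\;\prod_{(Y_i, Y_j, s) \in \mathcal{E}} \psi^{s}_{ij}(y_i, y_j)
\]

Here y_i is the label of node Y_i (honest/fraudster for users, good/bad for products), phi_i encodes the prior belief for node i, psi^s is the compatibility potential selected by the edge sign s in {+, -}, and Z is the partition function. Inference then amounts to estimating the marginal beliefs (or the most likely joint assignment) under this distribution.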
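Sketch: signed loopy belief propagation. Below is a minimal Python sketch of LBP on a signed bipartite review network, in the spirit of the signed Inference Algorithm (sIA) outlined above. The function name, priors, and compatibility matrices are illustrative assumptions; the poster does not specify the actual potential values, message scheduling, or damping.

# Minimal sketch of loopy belief propagation (LBP) over a signed bipartite
# review network. All concrete values (priors, compatibility matrices) are
# illustrative placeholders, not the ones used by the authors.
import numpy as np

STATES = 2  # users: {honest, fraudster}; products: {good, bad}

def signed_lbp(edges, num_nodes, priors, psi_pos, psi_neg,
               max_iter=50, tol=1e-4):
    """edges: list of (i, j, sign) with sign in {+1, -1}
              (+1: thumbs-up review, -1: thumbs-down review).
    priors:   (num_nodes, STATES) array of prior beliefs phi_i(y_i).
    psi_pos / psi_neg: (STATES, STATES) compatibility potentials for
              positively / negatively signed edges, indexed psi[y_i, y_j]."""
    # Each undirected signed edge carries messages in both directions.
    directed = [(i, j, s) for i, j, s in edges] + [(j, i, s) for i, j, s in edges]
    nbrs = {}
    for i, j, _ in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)
    msgs = {(i, j): np.full(STATES, 1.0 / STATES) for i, j, _ in directed}

    for _ in range(max_iter):
        new_msgs, max_delta = {}, 0.0
        for i, j, s in directed:
            psi = psi_pos if s > 0 else psi_neg
            # Prior of i times all incoming messages to i, except the one from j.
            prod = priors[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= msgs[(k, i)]
            m = psi.T @ prod          # sum over i's states -> message about j's states
            m /= m.sum()              # normalize to keep numbers stable
            max_delta = max(max_delta, np.abs(m - msgs[(i, j)]).max())
            new_msgs[(i, j)] = m
        msgs = new_msgs
        if max_delta < tol:           # consensus reached
            break

    # At convergence: belief = prior times product of all incoming messages.
    beliefs = priors.copy()
    for n in nbrs:
        for k in nbrs[n]:
            beliefs[n] *= msgs[(k, n)]
    beliefs /= beliefs.sum(axis=1, keepdims=True)
    return beliefs

# Tiny toy example (assumed values): two users reviewing one product.
if __name__ == "__main__":
    edges = [(0, 2, +1), (1, 2, -1)]               # users 0, 1 -> product 2
    priors = np.full((3, STATES), 0.5)
    priors[0] = [0.7, 0.3]                         # nudge user 0 toward 'honest'
    psi_pos = np.array([[0.9, 0.1], [0.1, 0.9]])   # '+' edge favors (honest, good)
    psi_neg = np.array([[0.1, 0.9], [0.9, 0.1]])   # '-' edge favors (honest, bad)
    print(signed_lbp(edges, 3, priors, psi_pos, psi_neg))

With ratings mapped to edge signs as in the Datasets panel (4-5 as '+', 1-2 as '-'), the returned beliefs can be read as fraudulence scores for user nodes and badness scores for product nodes, which is one way the top-scorer rankings described in the real-data results could be produced.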