Spam Filtering
Uma Sawant, Megha Pandey, Sambuddha Roy
LinkedIn

Spam
Irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc.
Spam removal essential for good user experience

Outline of the talk
❏ Challenges in spam filtering
❏ Man + machine motif
❏ Spam filtering @ LI

Fighting spam
Small network
❏ Cold start: no training data for ML algorithms
❏ Manual curation is doable
❏ Small network is less lucrative, not many sophisticated attacks
Large network (after growth)
❏ Need for automation at scale
❏ Manually curated data == training data for ML algorithms
❏ Highly lucrative for spammers, more and cleverer attacks

Types of spam
1. Offensive content
   ❏ Hate mail
   ❏ Nudity
2. Monetary gain
   ❏ Money scam
   ❏ Romance scam
3. Drive traffic to external link
   ❏ Phishing
   [Example screenshot annotations: wrong sender, misdirection]
4. Collection of personal data
   ❏ Email, phone, birthday
   ❏ Address book sale
   Example message: "Hi XXX, How are you? I hope you are fine. I came across your profile and found it very impressive,so I just reach out to you regarding an offer for a online business. Can I ask for your active phone number and email address?"
5. Fake profiles with malicious intent
   ❏ Connect with someone that would not have otherwise accepted the connection
   ❏ Precursor to fraud
6. Unprofessional content
   ❏ Puzzles
   ❏ Memes
   ❏ Jokes

Typical ML pipeline
❏ Collect training data
❏ Train algorithms
[Diagram labels: labeled training data, test data, ML algorithms, true, false]

Spam filtering: Man + machine
❏ Spam data distribution changes as spammers change tactics
❏ Human intervention and classifier retraining is important
[Diagram labels: user generated content, spam detection algorithms, spam, ham, FP, FN, member flagging, human review, labeled training data, feedback loop]

Data @ LinkedIn
❏ User profile (photo, description, geo ...)
❏ User updates (posts, comments, like, shares, ...)
❏ Emails
❏ Ads
❏ Network connections
❏ Groups
❏ ...
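
The man + machine loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the token-count "classifier", the function names (`train`, `spam_score`, `review_round`), the 0.5 threshold and the `human_label` callback are all hypothetical stand-ins, not the system used at LinkedIn. The flow it shows is one reading of the diagram: items the classifier flags, plus items members flag, go to human review, and the reviewers' verdicts are fed back as new labeled training data before retraining.

```python
from collections import Counter


def train(labeled):
    """Count how often each token appears in spam vs ham examples."""
    spam, ham = Counter(), Counter()
    for text, is_spam in labeled:
        (spam if is_spam else ham).update(set(text.lower().split()))
    return spam, ham


def spam_score(model, text):
    """Fraction of tokens that were seen more often in spam than in ham."""
    spam, ham = model
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if spam[t] > ham[t]) / len(tokens)


def review_round(model, labeled, items, human_label, threshold=0.5, member_flags=()):
    """One pass of the loop: classify, queue for human review, retrain."""
    # Classifier detections go to review so false positives can be caught.
    queue = [t for t in items if spam_score(model, t) >= threshold]
    # Member flagging surfaces items the classifier missed (possible FNs).
    queue += [t for t in member_flags if t not in queue]
    # Human verdicts become new labeled training data (the feedback loop).
    labeled = labeled + [(t, human_label(t)) for t in queue]
    return train(labeled), labeled


# Toy usage: a new spam tactic the initial model has never seen.
labeled = [("win free money now", True), ("quarterly results meeting", False)]
model = train(labeled)
new_spam = "earn crypto fast"
before = spam_score(model, new_spam)

model, labeled = review_round(
    model, labeled, items=[new_spam],
    human_label=lambda text: "crypto" in text,  # stand-in for a human reviewer
    member_flags=[new_spam],
)
after = spam_score(model, new_spam)
```

In this toy run the first model misses the new message entirely; only the member flag gets it in front of a reviewer, and retraining on the reviewer's label lets the next model catch the same tactic.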
Spam filtering signals: content + context
Goal: classify a post as spam or not spam
❏ Context: who is the author (number and types of connections, previous posts, etc.)
❏ Content of the post: text and rich media (image, video)
❏ Context: how the post propagates over the network
❏ Context: which user segment does the post engage
❏ Context: nature and quality of comments the post generates

Spam detection pipeline
[Diagram: user generated content feeds two classifiers. Author, geo and connections go to classification of contextual signals; text and rich media go to content classification to detect spam. Both produce spam scores, which are used to prioritize detected items for human review. Human review receives the ranked items and labels them spam or ham.]

Content Classification
Text Content
➢ Profanity classifier for short text
  • Detection of abuse, profanity and personal attacks in text updates and comments
  • Short to mid length text
  • Prevalent use of non-dictionary words and internet slang
➢ Blogs and articles classifier
  • Detection of objectionable content in blogs and articles
  • Larger length text documents
  • Understanding of context and overall semantic sense may be useful
➢ Scams and promotions classifier
  • Posted with malicious intent
  • Money scams, unsolicited promotions
  • Can be detected based on text content of the post
Multimedia Content
➢ Profanity classifier for user photos and video clips
  • Detection of objectionable content in images and videos
➢ Near de-duplication
  • Detect close visual similarity to known bad content
  • Uses image hashing techniques
➢ Other ML classifiers
  • Based on a variety of visuo-temporal and spatial features

Engineering constraints
❏ Latency
   ❏ Online
   ❏ Nearline
   ❏ Offline
❏ Respect member privacy
❏ Enhanced member experience (personalization)
❏ Precision vs recall (high cost of false positives)

©2014 LinkedIn Corporation. All Rights Reserved.
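
The near de-duplication step under multimedia content classification is said to use image hashing to detect close visual similarity to known bad content. The slides do not name a specific hash, so the sketch below uses average hashing, one common perceptual hashing technique, purely as an assumed example. The function names, the `max_distance` threshold and the tiny 4x4 grayscale grids are all hypothetical; a real pipeline would first decode each image and downscale it to a fixed-size grayscale grid.

```python
def average_hash(pixels):
    """Return a bit tuple: 1 where a pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(pixels, known_bad_hashes, max_distance=2):
    """True if the image hash is within max_distance bits of any known bad hash."""
    h = average_hash(pixels)
    return any(hamming(h, bad) <= max_distance for bad in known_bad_hashes)


# Hypothetical known bad image, already reduced to a 4x4 grayscale grid.
known_bad = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# The same picture after light noise and one altered pixel.
variant = [
    [210, 205, 15, 12],
    [198, 202, 9, 120],
    [11, 14, 199, 203],
    [10, 13, 201, 197],
]
# A visually different picture (vertical stripes).
unrelated = [
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
]

bad_hashes = [average_hash(known_bad)]
variant_distance = hamming(average_hash(variant), bad_hashes[0])
variant_is_dup = is_near_duplicate(variant, bad_hashes)
unrelated_is_dup = is_near_duplicate(unrelated, bad_hashes)
```

Because the hash depends only on which pixels are brighter than the mean, small edits change few bits, so a Hamming-distance threshold tolerates re-encoding or light tampering. The threshold trades precision against recall, which matters given the high cost of false positives noted under engineering constraints.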