The Dangerous Effects of Spam on Social Media Research Results

A white paper by Conversition

Introduction

If you've ever tried to comment on a blog post or ask a question in a web forum, you know that the internet is teeming with spam. Sadly, spammers work very hard to post millions of spam comments on millions of personal blogs and online forums, reaching millions of people every day. At that scale, only a tiny percentage of spam messages need to be acted on by unsuspecting shoppers for spammers to come out ahead. It also means that if you're collecting opinions about brands, products, and services from across the internet for social media research, you're sure to find a lot of this spam in your datasets.

Does all of that erroneous data really affect your research results? What are the unintended consequences of not removing it? This paper will show the effects of spam in terms of three criteria:

1. How does spam affect sample sizes?
2. How does spam affect sentiment scores?
3. How does spam affect important variables?

Datasets

To observe the full range of possible effects, we selected 12 brands from diverse categories such as food, beverage, electronics, home care, personal care, pharmaceutical, and celebrities. We chose some brands that were massively popular in social media, as well as some with relatively little social media data. We began by collecting three sets of data for each of the 12 brands, giving us 36 unique datasets.

Dataset 1. Any data that mentioned the brand name.

Dataset 2. Any data that mentioned the brand name, subsequently filtered through Conversition's proprietary spam cleaning system. This removed obviously erroneous data such as pornography, money-making schemes, and other intentional misdirections.

Dataset 3.
Any data that mentioned the brand name, subsequently filtered through both Conversition's proprietary spam cleaning system and our standard manual processes. This removed both the obviously erroneous data and the tricky pieces of data that sneak through the automated system. [This is the standard Conversition process.]

What Does Spam Look Like?

Spam takes many forms. Some of the more common types include financial spam (e.g., "buy more cheap buy cheaper cheapest"), pharmaceutical spam (e.g., "no rx deliver Saturday fedex Saturday COD Saturday"), free product scams (e.g., "free ms office free free"), and more. Common features of spam include repeated words, poor punctuation, poor grammar, and irrelevant words inserted in the middle of a phrase. Other problematic data includes records that don't suit the research objective (e.g., lists of coupons, lists of stock prices) and records that aren't actually about the brand (e.g., parkay flooring vs. Parkay margarine).

Below are just a few examples of real spam identified as part of this research. [All links in the messages have been removed.]

streptomycin for acne best price
Tetracycline online paypal free shipping
Tetracycline 500 mg discount presciptions
Himplasia apotheke
Tetracycline anovulation in internet fast delivery
Pepcid cost low buy no rx
Apcalis SX rx overnight
how much is Apcalis SX otc
cialis prix acheter levitra
Pepcid ds safety
get Apcalis SX cod
viagra without prescription us
Apcalis SX 20 mg without prescription without script
Propecia Propecia online prescription cheap
Propecia next day shipping
propecia bph
generic Propecia versus brand name
Pepcid 2 guys 1 horse official video
can you take milk of magnesia with pepcid
somali girls sex
Jayme Galloway
Jack D.
Ape - www.GorillaJack.com
Jimmy Fallon & Carly Rae Jepsen
Marcin Michalak
Shelley Ostrove
Beata, Tommy, Maks & Alex Janus
Katy Perry
Bart Rucinski
Nissan USA & Nissan Canada
Anytime Fitness
Black Speed Bisexual Dating London 2010 Roscoe village Laytown perth speed bisexual dating australia katy perry dating black guy online dating sites don't work best online bisexual dating vancouver best
http://tlcgct.com mdkyg jrwhg http://nytgdq.com wzqxo njkbo http://okrbzq.com atrou hhlfd http://zhgjuv.com ctijk maxnn http://bbsbze.com pyweh ggeqk
Love the blog!! Free CanadianRX, only the best for your strength for kat perry your woman.
Nursery Water with Floride 128 oz, $1 $0.55 off Nursery Water printable Final Price: $.45 each!
Parkay Margarine Spray 8 oz or Squeeze Bottle 12 oz, $1
Shopper's Value Beef Ribeye Steak 4 oz Boneless, $1
Shoppers Value Bleach 96 oz or Homelife Food Storage Bags Qt or Gallon 15-25 ct or Homelife Aluminum Foil

Effects on Sample Sizes

For each brand, we first identified the sample size of Dataset #1, which included any spam and irrelevant data that may have been caught (the red circle in the following images). Second, we calculated the sample size of Dataset #2, which incorporated the automatic cleaning that all Conversition datasets normally go through to eliminate spam (the orange circle). Third, we calculated the sample size of Dataset #3, which incorporated both our automated and human-intervention cleaning processes (the green circle). Each circle accurately reflects the percentage of data available.

Once the automated Conversition spam process had been applied to the 12 sets of data, between 18% and 98% of the original records remained, for an average of about 75%. And once human intervention took place and additional spam records were manually identified and removed, between 7% and 60% of the original data remained, for an average of about 50%.
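The surface features of spam described earlier, such as repeated words, link-stuffing, and pharmacy vocabulary, are the kinds of signals an automated screen can key on. As a minimal illustrative sketch only, with made-up thresholds, keyword list, and function name (this is not Conversition's proprietary system):

```python
import re

# Hypothetical pharmacy-spam vocabulary; an assumed, illustrative list.
PHARMA_TERMS = {"rx", "viagra", "cialis", "cod", "prescription"}

def looks_like_spam(message: str) -> bool:
    """Toy heuristic screen keyed on spam features noted in this paper."""
    words = re.findall(r"[a-z']+", message.lower())
    if not words:
        return True
    # Feature 1: heavy word repetition (e.g., "free ms office free free")
    repetition = 1 - len(set(words)) / len(words)
    # Feature 2: density of embedded links
    links = len(re.findall(r"https?://|www\.", message.lower()))
    # Feature 3: pharmacy-spam vocabulary hits
    pharma_hits = sum(1 for w in words if w in PHARMA_TERMS)
    # Assumed thresholds, chosen for illustration only
    return repetition > 0.3 or links >= 3 or pharma_hits >= 2

print(looks_like_spam("free ms office free free"))  # → True
print(looks_like_spam("Tried Pepcid last night and it actually worked fine."))  # → False
```

In practice, such heuristics only catch the obvious cases; as the paper's results show, a manual review pass is still needed for the trickier records that slip through.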
As you can see, some brands benefited massively from the automated cleaning processes (e.g., Pepcid, Nicoderm, iPhone) while others benefited massively from the manual cleaning processes (e.g., Speedstick, Quaker, Katy Perry). Clearly, spam can result in massively overestimating the volume of records in a social media dataset. More importantly, it's clear that identifying and removing inappropriate records requires both the science of an automated system and the art of manual intervention.

Effects on Sentiment Scores

Sentiment analysis is a process frequently applied to social media data to determine how positive or negative the data and opinions are. It corresponds to the Likert scales normally seen in surveys, in the sense that sentiment analysis can score social media data in terms of whether someone is "very positive", "somewhat positive", "neutral", "somewhat negative", or "very negative" about a product or service.

When spam is erroneously included in research data, it's hard to know what the effect will be. Is spam much more positive towards the brand because it encourages people to buy fake products? Is spam much more negative because it uses a shock-and-awe method of attracting attention?

We used traditional box scores to compare the sentiment of the three sets of data. In this case, we focused on Top 2 Box scores, which reflect the percentage of verbatims that are positive (as opposed to negative or neutral). The chart below plots Top 2 Box scores for each of the 12 brands and each of the 3 sets of data.

Dataset #1, the spam-filled data, is represented by the first red bar in each set of three. This data was on average 3 percentage points more positive than the fully clean Dataset #3, with differences ranging from -9 points to +40 points. Dataset #2, the automatically cleaned data, is represented by the middle orange bar in each set of three.
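The Top 2 Box calculation used for these comparisons can be sketched in a few lines; the five labels follow the Likert-style scale described above, while the function name and sample data below are illustrative assumptions:

```python
from collections import Counter

# The top two boxes of the five-point sentiment scale described above.
POSITIVE = {"very positive", "somewhat positive"}

def top2box(labels):
    """Percentage of verbatims scored in the top two (positive) boxes."""
    counts = Counter(labels)
    positive = sum(counts[label] for label in POSITIVE)
    return 100.0 * positive / len(labels)

# Made-up example verbatim scores, for illustration only.
sample = ["very positive", "neutral", "somewhat positive",
          "somewhat negative", "very positive"]
print(top2box(sample))  # → 60.0
```

Computing this score separately on the spam-filled, automatically cleaned, and fully cleaned datasets is what produces the per-brand differences discussed below.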
This data generated sentiment that was, on average, the same as the fully clean data, but at the individual brand level differed by -10 points to +8 points. This lack of consistency is worrisome: you cannot conclude that unclean data is generally more positive or generally more negative than clean data. The effect is unpredictable, leaving researchers to ponder what the truth might be for their specific dataset.

[Chart: Top 2 Box sentiment scores by brand, comparing the All Spam, Partial Spam, and No Spam datasets]

Effects on Key Variables

More important than sample sizes or overall sentiment, though, is how individual features of products and people can be falsely represented when erroneous data is included. Failing to exclude spam can cause a researcher or brand manager to think that people like or dislike a particular feature, perhaps the color, shape, or size of a product, much more than they really do. Conclusions based on this erroneous data can lead to actionable outcomes taken in the wrong direction, perhaps beginning production of a blue product instead of a red one, or creating a larger package when consumers actually wanted a smaller one.

We chose three brands to illustrate the potential problems. For each of these brands, we identified a number of relevant variables and then determined the sentiment score for Dataset #1 (with spam) and Dataset #3 (no spam).

Pepcid Heartburn Medication

Spam Pepcid data originates from illegal pharmacies trying to sell fake and illegal products. The key difference is that the spam data is significantly more positive. Not only is it more positive overall, its key market research variables, including recommendation, switching, and trial scores, are more positive. Further, drug interactions and nausea are discussed in less negative terms to encourage people to buy the product. This dataset was also massively populated with verbatims related to computer tablets (consider that Pepcid heartburn products come in tablet form).
Parkay Margarine

Parkay data contains thousands of coupon listings, generally undesirable for research, which often include brand names from ten or more other products listed alongside the margarine, many of which have nothing to do with the product. These erroneous inclusions, which mention the product attributes and nutrition of other products, cause scores to differ wildly from reality, with differences as large as 50 percentage points. In addition, this set of data was falsely populated with many conversations about parkay flooring (correctly spelled parquet).

Katy Perry

Sadly, Katy Perry data is heavily populated with pornography and misdirections (e.g., her name combined into keyword-loaded messages that also mention cars, banks, cookies, and pens). This either creates variables that are irrelevant, or falsely hypes and biases existing variables. Clean data is typically more positive than spam data, but there are many large differences in scores between the spam and clean data, some as large as 60 percentage points.

Conclusions

A less than careful approach to data quality can quickly and easily result in many negative outcomes. Because some brands elicit more spam than others, comparisons of social media presence without sufficient spam cleaning can lead to incorrect conclusions about which brand is the social media category leader or laggard. Sentiment scores can be falsely high or falsely low, creating the impression that a brand is more or less loved than it really is. And worst of all, individual variables can generate sentiment so different from reality that the actionable outcomes are simply the reverse of what should have taken place. The only solution is to ensure that a social media dataset is as clean as automatically, and humanly, possible.