The Dangerous Effects of Spam on Social Media Research Results
A white paper by Conversition
Introduction
If you’ve ever tried to comment on an internet blog post or ask questions in a web forum, you know that
the internet is teeming with spam. Spammers work hard to post millions of spam comments on
personal blogs and online forums, reaching millions of people every day. This
massive scale means that only a tiny percentage of spam messages needs to be acted on by unsuspecting
shoppers for spammers to come out ahead.
It also means that if you’re collecting opinions about brands, products, and services from across the
internet for social media research, you’re sure to find a lot of this spam in your datasets. Does all of that
erroneous data really affect your research results? What are the unintended consequences of not
removing that data?
This paper will show the effects of spam in terms of three criteria:
1. How does spam affect sample sizes?
2. How does spam affect sentiment scores?
3. How does spam affect important variables?
Datasets
To observe the full range of possible effects, we selected 12 brands from diverse categories such as food,
beverage, electronics, home care, personal care, pharmaceutical, and celebrities. We chose some brands
that were massively popular in social media, as well as some that have relatively little social media data.
We began by collecting three sets of data for each of the 12 brands, giving us 36 unique datasets in the
end.
Dataset 1. Any data that mentioned the brand name.
Dataset 2. Any data that mentioned the brand name, subsequently filtered through Conversition's
proprietary spam cleaning system. This removed obviously erroneous data such as pornography, money-making
schemes, and other intentional misdirections.
Dataset 3. Any data that mentioned the brand name, subsequently filtered through both
Conversition's proprietary spam cleaning system as well as cleaned through our standard manual
processes. This removed both the obviously erroneous data and the tricky pieces of data that
sneak through the automated system. [This is the standard Conversition process.]
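The three-dataset design above amounts to a nested filtering pipeline, where each dataset is a subset of the one before it. The sketch below is purely illustrative: the sample records, the repeated-word rule, and the manual flag are invented stand-ins, not Conversition's proprietary system.

```python
# Dataset 1: every record mentioning the brand, spam included.
dataset1 = [
    "great product, really helped",
    "buy cheap buy cheap best price",   # obvious spam: repeated words
    "love it, works well",
    "check my site for cheapest meds",  # tricky: slips past the toy rule
]

# Dataset 2: automated pass (a toy repeated-word check stands in
# for the proprietary automated cleaner).
def looks_like_spam(text):
    words = text.lower().split()
    return len(set(words)) < len(words)  # any word appearing twice

dataset2 = [r for r in dataset1 if not looks_like_spam(r)]

# Dataset 3: manual review removes what the automated pass missed.
manually_flagged = {"check my site for cheapest meds"}
dataset3 = [r for r in dataset2 if r not in manually_flagged]

retained = 100 * len(dataset3) / len(dataset1)  # percent of records kept
```

Note how the automated rule alone leaves the tricky record in place; only the manual stage removes it, mirroring the distinction between Dataset 2 and Dataset 3.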
What Does Spam Look Like?
Spam takes on many forms. Some of the more common types include financial spam (e.g., "buy more
cheap buy cheaper cheapest"), pharmaceutical spam (e.g., "no rx deliver Saturday fedex Saturday COD
Saturday"), and free product scams (e.g., "free ms office free free"). Common features of spam
include repeated words, poor punctuation, poor grammar, and irrelevant words inserted in the middle of
a phrase. Below are just a few examples of real spam identified as part of this research. Other
types of problematic data include records that don't suit the research objective (e.g., lists of coupons, lists of stock
prices) and records that aren't actually about the brand (e.g., parkay flooring vs. Parkay margarine). [All
links in the messages have been removed.]
streptomycin for acne best price Tetracycline online paypal free shipping Tetracycline 500 mg discount presciptions
Himplasia apotheke Tetracycline anovulation in internet fast delivery Pepcid cost low buy

no rx Apcalis SX rx overnight how much is Apcalis SX otc cialis prix acheter levitra Pepcid ds safety get Apcalis SX cod
viagra without prescription us Apcalis SX 20 mg without prescription without script

Propecia Propecia online prescription cheap Propecia next day shipping propecia bph generic Propecia versus brand
name Pepcid

2 guys 1 horse official video can you take milk of magnesia with pepcid somali girls sex

Jayme Galloway Jack D. Ape - www.GorillaJack.com Jimmy Fallon & Carly Rae Jepsen Marcin Michalak Shelley Ostrove
Beata, Tommy, Maks & Alex Janus Katy Perry Bart Rucinski Nissan USA & Nissan Canada Anytime Fitness

Black Speed Bisexual Dating London 2010 Roscoe village Laytown perth speed bisexual dating australia katy perry dating
black guy online dating sites don't work best online bisexual dating vancouver best

http://tlcgct.com mdkyg jrwhg http://nytgdq.com wzqxo njkbo http://okrbzq.com atrou hhlfd http://zhgjuv.com ctijk
maxnn http://bbsbze.com pyweh ggeqk

Love the blog!! Free CanadianRX, only the best for your strength for kat perry your woman.

Nursery Water with Floride 128 oz, $1 $0.55 off Nursery Water printable Final Price: $.45 each! Parkay Margarine Spray
8 oz or Squeeze Bottle 12 oz, $1 Shopper's Value Beef Ribeye Steak 4 oz Boneless, $1 Shoppers Value Bleach 96 oz or
Homelife Food Storage Bags Qt or Gallon 15-25 ct or Homelife Aluminum Foil
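Surface features like these (repeated words, embedded links, pharmacy keywords) lend themselves to simple automated scoring. The following is a minimal illustrative sketch; the keyword list and threshold are assumptions for the example, not Conversition's actual system.

```python
import re

# Hypothetical spam-signal counter; keyword list and threshold are
# illustrative assumptions, not part of any production system.
SPAM_KEYWORDS = {"rx", "cod", "cheap", "free", "viagra", "prescription"}

def spam_signals(text):
    words = text.lower().split()
    signals = sum(1 for w in words if w in SPAM_KEYWORDS)   # pharmacy/sales terms
    signals += len(re.findall(r"https?://\S+", text))       # embedded links
    signals += sum(words.count(w) - 1 for w in set(words))  # repeated words
    return signals

def is_probable_spam(text, threshold=3):
    return spam_signals(text) >= threshold
```

A real system would combine many more signals (grammar quality, posting frequency, account history) and tune its thresholds per source; a fixed cutoff like this would misclassify plenty of legitimate text, which is one reason manual review still matters.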
Effects on Sample Sizes
For each brand, we first identified the sample size of Dataset #1, which included any spam and irrelevant
data that may have been collected (the red circle in the following images). Second, we calculated the sample
size of Dataset #2, which incorporated the automated cleaning that all Conversition datasets normally go
through to eliminate spam (the orange circle). Third, we calculated the sample size of Dataset #3, which
incorporated both our automated and human-intervention cleaning processes (the green circle). Each
circle accurately reflects the percentage of data remaining.
Once the automated Conversition spam process had been applied to the 12 sets of data, between 18%
and 98% of the original records remained, for an average of about 75%. And, once human intervention
took place and additional spam records were manually identified and removed, between 7% and 60% of
the original data remained, for an average of about 50%.
As you can see, some of the brands benefited massively from the automated cleaning processes (e.g.,
Pepcid, Nicoderm, iPhone) while other brands benefited massively from the manual cleaning processes
(e.g., Speedstick, Quaker, Katy Perry). Clearly, spam can result in massively overestimating the volume of
records in a social media dataset. More importantly, it's clear that identifying and removing inappropriate
records can only be accomplished through the science of an automated system combined with the art of manual
intervention.
Effects on Sentiment Scores
Sentiment analysis is a process frequently applied to social media data to determine how positive or
negative the data and opinions are. It corresponds to the Likert scales that are normally seen in surveys in
the sense that sentiment analysis can score social media data in terms of whether someone is “very
positive”, “somewhat positive”, “neutral”, “somewhat negative”, or “very negative” about a product or
service.
When spam is erroneously included in research data, it’s hard to know what the effect will be. Is spam
much more positive towards the brand because it is encouraging people to buy fake products? Is spam
much more negative towards the brand because it is trying a shock and awe method of attracting
attention?
We used traditional box scores to compare sentiment across the three sets of data. In this case, we focused on
Top 2 Box scores, which reflect the percentage of verbatims that are positive (as opposed to negative or
neutral). The chart below plots Top 2 Box scores for each of the 12 brands and each of the 3 sets of data.
Dataset #1, the spam-filled data, is represented by the first red bar in each set of three. This data was on
average 3 percentage points more positive than the fully clean Dataset #3, with differences ranging from -9 points to +40 points. Dataset #2, the automatically cleaned data, is represented by the middle orange
bar in each set of three. This data generated sentiment that was, on average, the same as the fully clean
data, but on an individual level, differed by -10 points to +8 points.
This lack of consistency is worrisome, as you cannot conclude that unclean data is generally more positive
or generally more negative than clean data. The effect is unpredictable, leaving researchers to ponder
what could possibly be the truth for their specific dataset.
[Chart: Top 2 Box sentiment scores (0% to 60%) for each of the 12 brands across the three datasets: All Spam, Partial Spam, and No Spam.]
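Top 2 Box scoring itself is straightforward to compute from labeled sentiment. A minimal sketch, assuming verbatims have already been scored on the five-point scale described earlier:

```python
from collections import Counter

# Top 2 Box: percentage of verbatims in the two positive categories
# of the five-point scale ("very positive" and "somewhat positive").
def top2box(labels):
    counts = Counter(labels)
    top2 = counts["very positive"] + counts["somewhat positive"]
    return 100.0 * top2 / len(labels)

labels = ["very positive", "neutral", "somewhat positive",
          "somewhat negative", "very positive"]
score = top2box(labels)  # 3 of 5 verbatims are positive: 60.0
```

Because the score is a simple share of positives, a handful of spam verbatims skewed positive (or negative) shifts it directly, which is why the spam-filled Dataset #1 diverges so unpredictably from the clean Dataset #3.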
Effects on Key Variables
More important than sample sizes or overall sentiment, though, is how individual features of products and
people can be falsely represented when erroneous data is included. Failing to exclude spam can cause a
researcher or brand manager to think that people like or dislike a particular feature, perhaps the color,
shape, or size of a product, much more than they really do. Conclusions based on this erroneous data can
lead to actionable outcomes taken in the wrong direction, perhaps beginning production of a blue
product instead of a red product, or creating a larger package when consumers actually wanted a smaller
package.
We chose three brands to illustrate the potential problems. For each of these brands, we identified a
number of relevant variables and then determined the sentiment score for Dataset #1 (with spam) and
Dataset #3 (no spam).
Pepcid Heartburn Medication
Spam Pepcid data originates from illegal
pharmacies trying to sell fake and illegal
products.
The key difference is that spam data is
significantly more positive. Not only is it more
positive overall, its key market research
variables, including recommendation,
switching, and trial scores, are more positive.
Further, drug interactions and nausea are
discussed in less negative terms to encourage
people to buy the product.
This dataset was massively populated with
verbatims related to computer tablets
(consider that Pepcid heartburn products
come in tablet form).
Parkay Margarine
Parkay data contains thousands of coupon
listings, generally undesirable, which often
include brand names from ten or more other
products listed with the margarine, many of
which have nothing to do with the product.
These erroneous inclusions, which mention
product attributes and nutrition information for other
products, cause scores to differ wildly from
reality, with differences as large as 50
percentage points.
In addition, this set of data was falsely
inflated by many conversations about
parkay flooring (correctly spelled parquet).
Katy Perry
Sadly, Katy Perry data is highly populated with
pornography and misdirections (e.g., keyword-loaded
messages that also mention cars,
banks, cookies, and pens). This either
creates variables that are irrelevant, or
falsely hypes and biases existing
variables.
Clean data is typically more positive than
spam data, but there are many large
differences in scores between the spam and
clean data, some as large as 60 percentage
points.
Conclusions
A less than careful approach to data quality can quickly and easily result in many negative outcomes.
Because some brands elicit more spam than others, comparisons of social media presence without
sufficient spam cleaning can lead to incorrect conclusions about which brand is the social media category
leader or laggard. Sentiment scores can be falsely high or falsely low, creating the impression that a brand is
more or less loved than it really is. And worst of all, individual variables can generate sentiment so
different from reality that the actionable outcomes are simply the reverse of what should have taken
place. The only solution is to ensure that a social media dataset is as clean as automatically, and humanly,
possible.