Predicting the NFL Using Twitter

Shiladitya Sinha, Chris Dyer,
Kevin Gimpel, Noah Smith
Disclaimer
This talk will not teach you how to
become a successful sports bettor.
Questions
What is the NFL and what about it are we
predicting?
Why are we using Twitter?
NFL = National Football
League
What about the NFL are we
predicting?
● Two game outcomes are commonly bet upon:
○ Winner with the spread (Winner WTS)
○ Over-Under
● Outcomes are determined at the end of the game
using numbers set by bookies before the game
● Due to transaction costs, an accuracy of 53% is
needed to be profitable
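The 53% threshold can be motivated with a short calculation. The sketch below assumes the common convention of risking 110 units to win 100 on such a bet; the slides do not state the odds, so `risk`, `win`, and the function name are illustrative assumptions.

```python
def break_even_accuracy(risk: float = 110.0, win: float = 100.0) -> float:
    """Accuracy p at which expected profit is zero.

    Solves p * win - (1 - p) * risk = 0 for p.
    The 110/100 defaults are an assumption, not a figure from the slides.
    """
    return risk / (risk + win)


print(f"{break_even_accuracy():.4f}")  # prints 0.5238
```

Under that assumption the break-even point is about 52.4%, which rounds up to the 53% quoted above.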
Point Spread and Over Under
● Point Spread: Signed number used to determine winner
with the spread (WTS) given game score
○ If home_score + spread
■ > away_score: Home wins WTS
■ < away_score: Away wins WTS
■ = away_score: Push
● Over-Under: Betting line for total points scored by the
two teams; bettor chooses “over” or “under”
● “Push” if point differential equals point spread or total
points equals over-under
Point Spread and Over Under
Atlanta Falcons win a home game against the
Philadelphia Eagles 33 - 26
But did they win with the spread?
Point Spread    Outcome
< -7            Falcons lose WTS
> -7            Falcons win WTS
= -7            Push
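The spread and over-under rules, together with the Falcons example, can be checked with a minimal sketch. The function names and the specific spreads and betting line used below are hypothetical; only the comparison rules and the 33 - 26 score come from the slides.

```python
def winner_wts(home_score: int, away_score: int, spread: float) -> str:
    """Compare home_score + spread to away_score, as on the slide."""
    adjusted = home_score + spread
    if adjusted > away_score:
        return "home"
    if adjusted < away_score:
        return "away"
    return "push"


def over_under(home_score: int, away_score: int, line: float) -> str:
    """Compare total points to the over-under line."""
    total = home_score + away_score
    if total > line:
        return "over"
    if total < line:
        return "under"
    return "push"


# Falcons (home) 33, Eagles (away) 26, under three hypothetical spreads.
for spread in (-9.5, -3.5, -7):
    print(spread, winner_wts(33, 26, spread))
# -9.5 away
# -3.5 home
# -7 push

print(over_under(33, 26, 48.5))  # hypothetical line; prints over
```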
Why use Twitter?
Twitter has been used to predict and/or explain:
● Opinion Polls
● Elections
● Spread of contagious diseases
● The stock market
● Movie revenue
● Food poisoning
● Civil unrest in Latin America and the Middle East
● Citations of scientific papers
...and the list goes on
Why not sports?
Why use Twitter to predict
NFL games?
● Structure of the NFL
○ 32 teams spread geographically across the United
States
○ Regular season partitioned into 17 weeks
○ Roughly a week between games for a given team
● Participation of fans and spectators
○ Discussion on social media, including Twitter
○ Sports betting
○ Wisdom of the crowd
Fan Participation
Data Gathering
● Current and historical NFL game data from
NFLdata.com
● Tweets from 2010-2012 seasons via Twitter
Garden Hose stream (10% of all tweets)
How to isolate relevant tweets?
Data Alignment
● Use a preset list of team hashtags to identify
relevant tweets
○ Discard tweets containing hashtags corresponding
to multiple teams
○ Label tweets with the team their hashtags refer to
● Pittsburgh Steelers hashtags
○ #steelers #pittsburghsteelers #stillers #gosteelers
#gosteelersgo #letsgosteelers #gostillers #gostillersgo
#letsgostillers #stillernation #stillersnation #steelernation
#steelersnation
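The hashtag filter described above might look like the following minimal sketch. The Steelers hashtags are a subset of the list on the slide; the second team's hashtags, the regular expression, and the function name are hypothetical and only serve to illustrate the multi-team discard rule.

```python
import re
from typing import Dict, Optional, Set

TEAM_HASHTAGS: Dict[str, Set[str]] = {
    # Subset of the Steelers hashtags listed on the slide.
    "steelers": {"#steelers", "#pittsburghsteelers", "#stillers",
                 "#gosteelers", "#steelernation"},
    # Hypothetical entry for a second team, for illustration only.
    "ravens": {"#ravens", "#goravens"},
}

HASHTAG_RE = re.compile(r"#\w+")


def label_team(tweet: str) -> Optional[str]:
    """Return the single team a tweet's hashtags refer to, else None.

    Tweets with no team hashtag, or with hashtags for more than one
    team, are discarded (None).
    """
    tags = {t.lower() for t in HASHTAG_RE.findall(tweet)}
    teams = {team for team, team_tags in TEAM_HASHTAGS.items()
             if tags & team_tags}
    if len(teams) == 1:
        return next(iter(teams))
    return None


print(label_team("Big win today #Steelers"))       # steelers
print(label_team("#steelers vs #ravens tonight"))  # None
```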
Fan Participation
(Annotated)
How many Tweets?
Tweet sets:
● Pregame: 24 hours to 1 hour before upcoming game
● Postgame: 4 hours to 28 hours after previous game
● Weekly: 12 hours after previous game to 1 hour before upcoming game
Season    Pregame Tweets    Postgame Tweets    Weekly Tweets
2010      40,385            53,294             185,709
2011      130,977           147,834            524,453
2012      266,382           290,879            1,014,473
Data Alignment
● Given the set (pregame, weekly, or postgame) that a tweet belongs to,
label it with the appropriate week of the season
○ For pregame and weekly tweets, this is the week
the upcoming game will be played
○ For postgame tweets, this is the week the previous
game was played
The team and week labels of a tweet determine
the unique game the tweet corresponds to
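The time windows and week labels can be combined in a small sketch. It assumes each game has a single reference time (the slides do not say whether windows are measured from kickoff or from the final whistle), and the `Game` type, the function name, and the example dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional, Tuple


class Game(NamedTuple):
    week: int
    time: datetime  # single reference time per game (an assumption)


HOUR = timedelta(hours=1)


def label_tweet(tweet_time: datetime,
                previous: Optional[Game],
                upcoming: Optional[Game]) -> List[Tuple[str, int]]:
    """Return (set name, week) labels for one team's tweet."""
    labels = []
    if upcoming is not None:
        cutoff = upcoming.time - HOUR
        # Pregame: 24 hours to 1 hour before the upcoming game.
        if upcoming.time - 24 * HOUR <= tweet_time <= cutoff:
            labels.append(("pregame", upcoming.week))
        # Weekly: 12 hours after the previous game to 1 hour before
        # the upcoming one (unbounded start if there is no previous game).
        start = previous.time + 12 * HOUR if previous else datetime.min
        if start <= tweet_time <= cutoff:
            labels.append(("weekly", upcoming.week))
    if previous is not None:
        # Postgame: 4 to 28 hours after the previous game.
        if previous.time + 4 * HOUR <= tweet_time <= previous.time + 28 * HOUR:
            labels.append(("postgame", previous.week))
    return labels


prev_game = Game(5, datetime(2012, 10, 7, 13))
next_game = Game(6, datetime(2012, 10, 14, 13))
print(label_tweet(datetime(2012, 10, 14, 10), prev_game, next_game))
# [('pregame', 6), ('weekly', 6)]
```

A tweet can fall into more than one set, because the weekly window overlaps both the pregame window and part of the postgame window.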
A Simple Task
Sanity check:
Can we look at a postgame tweet and
determine if the team it corresponds to
won or lost?
Postgame Tweet Analysis
● Use a bag-of-words model to extract features for
each tweet
○ In-tweet word frequency features
○ TF-IDF features:
(in-tweet word frequency) / (word frequency over all postgame tweets
from the week)
● Classify tweets as wins for Home or Away team
○ For k in [1,16]:
■ Train classifier on all postgame tweets from 2010
to week k of the 2012 season
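The two feature types can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw counts as "frequency", and no lowercasing (the slides treat "win" and "WIN" as different features); the function names and example tweets are hypothetical, and the classifier itself is not shown.

```python
from collections import Counter
from typing import Dict, List


def tokenize(tweet: str) -> List[str]:
    # Whitespace tokenization is an assumption; case is preserved.
    return tweet.split()


def tweet_features(tweet: str, week_counts: Counter) -> Dict[str, float]:
    """Bag-of-words features for one postgame tweet.

    week_counts holds word counts over all postgame tweets from the
    same week and is expected to include this tweet.
    """
    features: Dict[str, float] = {}
    for word, count in Counter(tokenize(tweet)).items():
        features[f"tf:{word}"] = float(count)
        # Fall back to the tweet's own count if the word is missing
        # from week_counts, to avoid dividing by zero.
        denominator = week_counts[word] or count
        features[f"tfidf:{word}"] = count / denominator
    return features


week_tweets = ["great win today", "tough loss", "win win win"]
week_counts = Counter(w for t in week_tweets for w in tokenize(t))
print(tweet_features("win win win", week_counts))
# {'tf:win': 3.0, 'tfidf:win': 0.75}
```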
Postgame Tweet Analysis
● Average accuracy of 67% over 2012 season
○ Very simple features and parameter settings
○ Room for improvement
Highly weighted features (Top or bottom 30) for a
Home team win:
home: win
home: victory
away: loss
home: won
home: WIN
away: lost
home: Great
away: lose
away: refs
Forecasting
2012 Season Live Predictions
During the 2012 season, we tweeted predictions
before games using @NFLOracle
We predicted 34/58 or 65.4% of winner
WTS results correctly!
We did not encounter our own tweets in the Twitter
Garden Hose stream
Training and Testing
Predict outcomes of upcoming games using Logistic
Regression:
● Seasons 2010, 2011 and weeks [1, k-3] of 2012: train on all games,
with the exception of the last week of a season
● Weeks [k-2, k-1] of 2012: tune L1/L2 regularization parameters
● Week k of 2012: predict and test
Apply procedure over weeks 3-16 of 2012 (or current season)
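The rolling train/tune/test procedure can be sketched as a function that lists the (season, week) pairs in each split. Reading "the exception of the last week of a season" as dropping week 17 of the earlier seasons is an interpretation, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

SeasonWeek = Tuple[int, int]
LAST_WEEK = 17  # regular season length given earlier in the slides


def splits_for_week(k: int,
                    current: int = 2012,
                    past: Sequence[int] = (2010, 2011),
                    ) -> Tuple[List[SeasonWeek], List[SeasonWeek], List[SeasonWeek]]:
    """Return (train, tune, test) lists of (season, week) for test week k."""
    # Past seasons, excluding the last week (an interpretation).
    train = [(s, w) for s in past for w in range(1, LAST_WEEK)]
    # Weeks 1 .. k-3 of the current season (empty when k = 3).
    train += [(current, w) for w in range(1, k - 2)]
    tune = [(current, w) for w in (k - 2, k - 1) if w >= 1]
    test = [(current, k)]
    return train, tune, test


for k in range(3, 17):  # weeks 3-16, as on the slide
    train, tune, test = splits_for_week(k)
    print(k, len(train), tune, test)
```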
Predicting the NFL without
using Twitter
● Use historical game data to create simple
feature sets
○ Point spreads, over unders, scoring, etc.
● Combine pairs of simple features to get a
preliminary list of game-based feature sets
○ Highest average accuracy of 56%
○ Serve as a baseline for Twitter-derived feature sets
Simple features derived
from tweets
● Word level features (Twitter unigrams)
○ Use only words that appear in at least 0.1% of all tweets
○ Use log(word frequency + 1) over all weekly tweets
corresponding to the game
● Creates a high dimensional feature space (~20k words)
○ This is ‘high’ due to the small set of games
● How to combine these features with features generated
from historical game data?
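The unigram features described above can be sketched as follows. The sketch assumes whitespace tokenization and treats "word frequency" as a raw count over a game's weekly tweets; the function names and the tiny example corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def build_vocabulary(all_tweets: List[str],
                     min_fraction: float = 0.001) -> Set[str]:
    """Words that appear in at least min_fraction of all tweets.

    The default 0.001 corresponds to the 0.1% cutoff on the slide.
    """
    tweet_counts: Counter = Counter()
    for tweet in all_tweets:
        tweet_counts.update(set(tweet.split()))  # count each tweet once
    threshold = min_fraction * len(all_tweets)
    return {w for w, n in tweet_counts.items() if n >= threshold}


def game_unigram_features(weekly_tweets: List[str],
                          vocabulary: Set[str]) -> Dict[str, float]:
    """log(count + 1) for every vocabulary word, for one game."""
    counts = Counter(w for t in weekly_tweets for w in t.split()
                     if w in vocabulary)
    return {w: math.log(counts[w] + 1) for w in vocabulary}


corpus = ["go team", "go go", "win"]
vocab = build_vocabulary(corpus, min_fraction=0.5)  # toy cutoff
print(vocab)  # {'go'}
print(game_unigram_features(["go go", "go team"], vocab))
```

Words outside the vocabulary are dropped, and vocabulary words absent from a game's tweets get log(0 + 1) = 0.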
Dimensionality Reduction
● Canonical Correlation Analysis
○ Dimensionality Reduction
○ Combining of multiple data streams
Results of applying CCA to Twitter unigrams and game statistics
features:
Feature Set
Winner WTS Accuracy
Twitter Unigrams
47.6
1 component CCA
50.4
2 component CCA
51.0
4 component CCA
51.9
8 component CCA
48.1
The Rate Feature
● Measure the difference in volume of a team’s
tweets across consecutive weeks
○ Easy to compute
○ Considers tweets collectively rather than individually
○ Doesn’t use point spread or any game statistics
Highest average accuracy using a rate feature
set is 56%
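A rate feature of this kind could be computed as in the sketch below. The slides do not give the exact formula, so a plain difference in weekly tweet counts is assumed here; the function name and the example counts are hypothetical.

```python
from typing import Dict


def rate_features(weekly_volume: Dict[int, int]) -> Dict[int, int]:
    """Map week w to volume[w] - volume[w - 1].

    Weeks without a count for the preceding week are skipped.
    """
    return {w: weekly_volume[w] - weekly_volume[w - 1]
            for w in sorted(weekly_volume)
            if w - 1 in weekly_volume}


# Hypothetical tweet counts for one team over three weeks.
print(rate_features({1: 120, 2: 150, 3: 90}))  # {2: 30, 3: -60}
```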
Results
Conclusion
Our results suggest that a social-media-driven
approach can be effective in predicting sporting
events.
Download our dataset of NFL game outcomes
and tweet IDs!
www.ark.cs.cmu.edu/football/