Presentation - FSU Computer Science

• I was looking through many APIs to figure out what I wanted to use and how I wanted to develop this Twitterbot.
• My early attempts consisted of developing grammars using natural language APIs, or n-grams using other machine learning APIs.
• I was exploring my options for what I wanted the bot to do.
• I had envisioned a couple of things:
• Parsing news pages, web crawling, and making up stories
• Generating fantasy phrases based on Dungeons and Dragons modules from PDFs
• My initial corpus consisted of smaller, free text-based news articles and short stories.
• I had to parse through HTML and figure out which articles were decent to train with.
• There were some errors in the parsing, errors in the text itself, and uninteresting information such as programming descriptions.
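The HTML cleanup step might look roughly like the sketch below, using BeautifulSoup as named on a later slide. The tag choices (dropping scripts/styles, keeping only paragraph tags) are assumptions about typical article pages, not the author's actual rules:

```python
from bs4 import BeautifulSoup

def article_text(html):
    """Keep only paragraph text; markup, scripts, and styles pollute the corpus."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop embedded code that is not article prose
    # Assume the article body lives in <p> tags.
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

page = ("<html><body><script>var ad = 1;</script>"
        "<p>The village woke to strange news.</p>"
        "<p>No one believed the miller.</p></body></html>")
print(article_text(page))
# → The village woke to strange news. No one believed the miller.
```

Even with this, navigation text, captions, and boilerplate inside `<p>` tags can slip through, which matches the "errors in parsing" experience above.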
• Even though there were issues with parsing and using the Markov chains, I still got some neat results.
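A word-level Markov chain is simple enough to sketch without any library. This toy order-1 version (not the author's actual code) shows why the output is sometimes neat and sometimes garbled: each next word is drawn only from words that followed the current word in the corpus.

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words observed to follow it (order-1 chain)."""
    chain = defaultdict(list)
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, sampling a successor at each step."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor in training
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the ring is heavy and the ring is hidden".split()
chain = build_chain(corpus)
print(generate(chain, "the", 6, random.Random(1)))
```

Because "is" was followed by both "heavy" and "hidden" in training, the walk can splice the two sentences together, which is exactly where the remixed phrases come from.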
• I decided I liked the fantasy approach from my original idea. The PDFs of the Dungeons and Dragons books had a ton of bad data, such as dice rolls, and were in general an unfriendly corpus to parse.
• Amazon Web Services – Elastic Compute Cloud (EC2) instance
• Windows Server 2012
• Python 3.5 and libraries: Tweepy, Markovify, BeautifulSoup
• Create a corpus from the Lord of the Rings series and an equal portion of Harry Potter, which was about four of those books (to balance out the probabilities)
• Set a timeout for every hour and let it go!
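The hourly setup above could be wired together roughly as below. This is a sketch: the Markovify model and Tweepy call are stubbed out behind `generate` and `post` parameters so the timing logic stands alone. In the real bot, `generate` would be something like a Markovify model's sentence maker and `post` would be Tweepy's status update; the parameter names here are hypothetical.

```python
import time

def run_bot(generate, post, interval=3600, cycles=None):
    """Every `interval` seconds, ask for a candidate tweet and post it.

    `generate` may return None when no acceptable sentence is found,
    in which case the cycle is skipped. `cycles=None` runs forever
    (the always-on EC2 case).
    """
    n = 0
    while cycles is None or n < cycles:
        tweet = generate()
        if tweet is not None:
            post(tweet)
        n += 1
        time.sleep(interval)

# Dry run with stand-ins instead of Markovify/Tweepy:
sent = []
run_bot(generate=lambda: "One ring to teach them all.",
        post=sent.append, interval=0, cycles=3)
print(sent)
```

Keeping the generator and the poster as plain callables also makes the loop testable without touching the Twitter API.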
• Developing the corpus:
• Find the books as text files to make metadata removal easier? Nope!
• Find the e-books, then? Yeah!
• Convert the e-books to text files!
• Clean up the metadata!
• Remove the page, title, and chapter wording, leaving only sentence structure
• Remove all quotation marks and other grammatical symbols that would cause odd output
• Remove other miscellaneous data that would be bad for the training model
• Run the Markov chain algorithm, find sentence candidates under 140 characters per Twitter's restriction, then post every hour!
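The cleanup and length-filtering steps above could be sketched like this. The regex patterns are assumptions about what the converted e-book text looks like (whole lines of page/chapter headings), not the author's actual rules:

```python
import re

def clean(text):
    """Strip metadata-ish lines and quotation marks before training."""
    # Assumed pattern: whole lines that are chapter or page headings.
    text = re.sub(r"(?im)^\s*(chapter|page)\b.*$", "", text)
    # Quotation marks tend to produce unbalanced, odd-looking output.
    text = re.sub(r'["“”]', "", text)
    # Collapse the whitespace left behind into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def tweet_candidates(sentences, limit=140):
    """Keep only generated sentences within Twitter's (then) 140-char cap."""
    return [s for s in sentences if len(s) <= limit]

raw = "Chapter 1\n“It was a dark night,” he said.\nPage 2\nThe wizard laughed."
print(clean(raw))
# → It was a dark night, he said. The wizard laughed.
```

Markovify itself offers a short-sentence helper for the length cap; the filter above just makes the 140-character restriction explicit.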
• The bot posts new and interesting plotlines, deleted scenes, remixes, lots of interesting phrases, and comments that make me think of Mystery Science Theater 3000 (and occasionally gibberish).