Presentation - FSU Computer Science

• I was looking through many APIs to figure out what I wanted to use and how I wanted to develop this Twitterbot.
• My early attempts consisted of developing grammars using natural language APIs, or n-grams using other machine learning APIs.
• I was exploring my options for what I wanted the bot to do.
• I had envisioned a couple of things:
• Parsing news pages, web crawling, and making up stories
• Generating fantasy phrases based on Dungeons and Dragons modules from PDFs
• My initial corpus consisted of smaller, free text-based news articles and short stories.
• I had to parse through HTML and figure out which articles were decent to train with.
• There were some errors in the parsing, errors in the text itself, and uninteresting information such as programming descriptions.
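The HTML cleanup step might look roughly like the sketch below, using BeautifulSoup as named on a later slide. The tag choices (dropping scripts/styles, keeping only paragraph tags) are assumptions about typical article pages, not the author's actual rules:

```python
from bs4 import BeautifulSoup

def article_text(html):
    """Keep only paragraph text; markup, scripts, and styles pollute the corpus."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop embedded code that is not article prose
    # Assume the article body lives in <p> tags.
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

page = ("<html><body><script>var ad = 1;</script>"
        "<p>The village woke to strange news.</p>"
        "<p>No one believed the miller.</p></body></html>")
print(article_text(page))
# → The village woke to strange news. No one believed the miller.
```

Even with this, navigation text, captions, and boilerplate inside `<p>` tags can slip through, which matches the "errors in parsing" experience above.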
• Even though there were issues with parsing and using the Markov chains, I still got some neat results.
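A word-level Markov chain is simple enough to sketch without any library. This toy order-1 version (not the author's actual code) shows why the output is sometimes neat and sometimes garbled: each next word is drawn only from words that followed the current word in the corpus.

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each word to the list of words observed to follow it (order-1 chain)."""
    chain = defaultdict(list)
    for cur, nxt in zip(words, words[1:]):
        chain[cur].append(nxt)
    return chain

def generate(chain, start, length, rng):
    """Walk the chain from `start`, sampling a successor at each step."""
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor in training
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the ring is heavy and the ring is hidden".split()
chain = build_chain(corpus)
print(generate(chain, "the", 6, random.Random(1)))
```

Because "is" was followed by both "heavy" and "hidden" in training, the walk can splice the two sentences together, which is exactly where the remixed phrases come from.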
• I decided I liked the fantasy approach from my original idea. The PDFs of the Dungeons and Dragons books had a ton of bad data, such as dice rolls, and were in general an unfriendly corpus to parse.
• Amazon Web Services – Elastic Compute Cloud (EC2) instance
• Windows Server 2012
• Python 3.5 and libraries: Tweepy, Markovify, BeautifulSoup
• Create a corpus from the Lord of the Rings series and an equal portion of Harry Potter, which was about four of those books (to balance out the probabilities)
• Set a timeout for every hour and let it go!
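The hourly setup above could be wired together roughly as below. This is a sketch: the Markovify model and Tweepy call are stubbed out behind `generate` and `post` parameters so the timing logic stands alone. In the real bot, `generate` would be something like a Markovify model's sentence maker and `post` would be Tweepy's status update; the parameter names here are hypothetical.

```python
import time

def run_bot(generate, post, interval=3600, cycles=None):
    """Every `interval` seconds, ask for a candidate tweet and post it.

    `generate` may return None when no acceptable sentence is found,
    in which case the cycle is skipped. `cycles=None` runs forever
    (the always-on EC2 case).
    """
    n = 0
    while cycles is None or n < cycles:
        tweet = generate()
        if tweet is not None:
            post(tweet)
        n += 1
        time.sleep(interval)

# Dry run with stand-ins instead of Markovify/Tweepy:
sent = []
run_bot(generate=lambda: "One ring to teach them all.",
        post=sent.append, interval=0, cycles=3)
print(sent)
```

Keeping the generator and the poster as plain callables also makes the loop testable without touching the Twitter API.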
• Developing the corpus:
• Find the books as text files to make metadata removal easier? Nope!
• Find the e-books, then? Yeah!
• Convert the e-books to text files!
• Clean up the metadata!
• Remove the page, title, and chapter wording, leaving only sentence structure
• Remove all quotation marks and other grammatical symbols that would cause odd output
• Remove other miscellaneous data that would be bad for the training model
• Run the Markov chain algorithm, find sentence candidates under 140 characters per Twitter's restriction, then post every hour!
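The cleanup and length-filtering steps above could be sketched like this. The regex patterns are assumptions about what the converted e-book text looks like (whole lines of page/chapter headings), not the author's actual rules:

```python
import re

def clean(text):
    """Strip metadata-ish lines and quotation marks before training."""
    # Assumed pattern: whole lines that are chapter or page headings.
    text = re.sub(r"(?im)^\s*(chapter|page)\b.*$", "", text)
    # Quotation marks tend to produce unbalanced, odd-looking output.
    text = re.sub(r'["“”]', "", text)
    # Collapse the whitespace left behind into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def tweet_candidates(sentences, limit=140):
    """Keep only generated sentences within Twitter's (then) 140-char cap."""
    return [s for s in sentences if len(s) <= limit]

raw = "Chapter 1\n“It was a dark night,” he said.\nPage 2\nThe wizard laughed."
print(clean(raw))
# → It was a dark night, he said. The wizard laughed.
```

Markovify itself offers a short-sentence helper for the length cap; the filter above just makes the 140-character restriction explicit.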
• The bot posts new and interesting plotlines, deleted scenes, remixes, lots of interesting phrases, and comments that make me think of Mystery Science Theater 3000 (and occasionally gibberish).