• I looked through many APIs to figure out what I wanted to use and how I wanted to develop this Twitterbot.
• My early attempts consisted of developing grammars using natural language APIs, or n-grams using other machine learning APIs
• I was exploring my options for what I wanted the bot to do.
• I had envisioned a couple of things:
  • Parsing news pages, web crawling, and making up stories.
  • Generating fantasy phrases based on Dungeons and Dragons modules from PDFs
• My initial corpus consisted of smaller, free, text-based news articles and short stories
• I had to parse through HTML and figure out which articles were decent enough to train with.
• There were errors in the parsing and in the text itself, plus uninteresting information such as programming descriptions
• Even though there were issues with parsing and using the Markov chains, I still got some neat results.
• I decided I liked the fantasy approach from my original idea. The PDFs for Dungeons and Dragons books had a ton of bad data, such as dice rolls, and were in general an unfriendly corpus to parse
• Amazon Web Services – Elastic Compute Cloud (EC2) Instance
• Windows Server 2012
• Python 3.5 and Libraries: Tweepy, Markovify, BeautifulSoup
• Create a corpus from the Lord of the Rings series plus an equal amount of Harry Potter, which was about four of the books (to balance out the probabilities)
• Set a timeout of one hour between posts and let it go!
• Developing the Corpus:
  • Find the books as a text file to make metadata removal easier? Nope!
  • Find the e-books then? Yeah!
  • Convert the e-book to a text file!
  • Clean up the metadata!
  • Remove the page, title, and chapter wording, leaving only sentence structure
  • Remove all quotation marks and other grammatical symbols that would cause odd output
  • Remove other miscellaneous data that would be bad for the training model
• Run the Markov chain algorithm, keep sentence candidates under Twitter's 140-character limit, then post every hour!
• The bot posts new and interesting plotlines, deleted scenes, remixes, lots of interesting phrases, and comments that make me think of Mystery Science Theater 3000 (and occasionally gibberish)
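The corpus cleanup and Markov chain steps listed above can be sketched roughly as follows. This is a minimal, standard-library stand-in and not the bot's actual code: the bot used Markovify, whose `make_short_sentence(140)` method plays the role of the hypothetical `make_short_sentence` function here. The cleanup patterns (lines starting with "Chapter" or "Page", bare page numbers) and every function name are assumptions made for illustration, and posting through Tweepy is left out.

```python
import random
import re
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking sentence boundaries


def clean_corpus(raw):
    """Drop heading/page lines and strip quotation marks and brackets."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # Assumed patterns: chapter headings, page markers, bare page numbers.
        if not line or line.isdigit():
            continue
        if re.match(r"^(chapter|page)\b", line, re.IGNORECASE):
            continue
        kept.append(line)
    text = " ".join(kept)
    # Remove quotation marks and other symbols that produce odd output.
    text = re.sub(r'["“”()\[\]*_]', "", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def build_chain(sentences):
    """First-order word chain: each word maps to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def make_short_sentence(chain, max_chars=140, tries=100, max_words=60, rng=random):
    """Random-walk the chain until a finished sentence fits max_chars."""
    if not chain.get(START):
        return None  # empty corpus, nothing to generate
    for _ in range(tries):
        word, out, finished = START, [], False
        while len(out) < max_words:
            word = rng.choice(chain[word])
            if word == END:
                finished = True
                break
            out.append(word)
        candidate = " ".join(out)
        if finished and out and len(candidate) <= max_chars:
            return candidate
    return None  # no candidate fit within the allowed tries
```

A usage sketch: read the converted e-book text, call `clean_corpus`, pass `split_sentences(cleaned)` to `build_chain`, then call `make_short_sentence(chain)` once per hour and post the result if it is not `None`. Repeating words in the follower lists is what weights the random choice, which is why balancing the amount of text from each series matters.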