Sam R. Cosgrave

First breath on that haunts / a clean fresh start a failure / the world is mischief.

Sam R. Cosgrave is a bot masquerading as a haiku poet on Twitter.

It started when a friend of mine who writes haikus made his poems available in a handy machine readable format. I made a web page that recombines the lines of his poems to make new ones with a fair bit more non-sequitur.

Around the same time I was experimenting with some Markov chain code that I wrote for a bot that lived on a MUD. I wanted to Markov the haikus but I was missing an important thing: a way to make it stick to the haiku five-seven-five syllable format. So I set out to make a syllable counter.

Since it’s English there’s no straightforward way to count syllables, so I used two techniques. I grabbed the 1915 Webster’s dictionary text file from Project Gutenberg. This has pronunciation guides for every word in the form syll-ab-le, with dashes between syllables. So I wrote code to parse the dictionary text into lines of the form syllable: 3. There are 91,954 lines in the file; I used that format to make it easy to parse as a YAML file. If a word is not found in the dictionary (a word that didn’t exist in 1915, for instance) then I fall back to counting vowels, since syllables often correspond to consonants grouped around a vowel. So I now have code that returns the right number of syllables in a string most of the time.
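Here is a minimal Python sketch of that lookup-then-fallback idea. It is an illustration rather than the bot's actual code: the names guide_to_entry, count_word_syllables and count_syllables are hypothetical, the tiny SYLLABLE_DICT stands in for the full parsed dictionary, and the fallback counts runs of consecutive vowels (treating y as a vowel), which is only one possible reading of "counting vowels".

```python
import re

# Hypothetical stand-in for the parsed dictionary file, which would hold
# one "word: count" entry per line and could be loaded with a YAML parser.
SYLLABLE_DICT = {"syllable": 3, "haiku": 2, "world": 1}

VOWEL_GROUP = re.compile(r"[aeiouy]+")
WORD = re.compile(r"[a-z']+")


def guide_to_entry(guide):
    """Turn a dashed pronunciation guide such as 'syll-ab-le' into (word, count)."""
    word = guide.replace("-", "").lower()
    return word, guide.count("-") + 1


def count_word_syllables(word, dictionary=SYLLABLE_DICT):
    """Look the word up first, then fall back to counting vowel groups."""
    word = word.lower()
    if word in dictionary:
        return dictionary[word]
    # Fallback: each run of consecutive vowels is treated as one syllable,
    # with a minimum of one syllable per word.
    return max(1, len(VOWEL_GROUP.findall(word)))


def count_syllables(text, dictionary=SYLLABLE_DICT):
    """Sum the per-word estimates for every word in a string."""
    words = WORD.findall(text.lower())
    return sum(count_word_syllables(w, dictionary) for w in words)
```

The fallback will get plenty of words wrong (silent final e, for one), which fits the "most of the time" caveat above.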

So I made my friend a page that spits out Markov’d haikus based on his poems. Good fun. My friend was posting his haikus to Twitter with the #haiku tag. Looking there I realised a lot of other people are also posting haikus, and they’re often in a somewhat parsable format. People tend to either post their poems on a single line with the line breaks indicated by a / or a similar punctuation mark, or they use returns. It became clear to me that this could go much further than making fun of my friend.

So I wrote a program to gather haikus from Twitter. It uses the Twitter API to search for 100 recent tweets in English with the hashtag #haiku. Then for each tweet it removes the various cruft like @usernames, #hashtags and URLs, skipping the tweet entirely if it starts with RT. If whatever is left matches one of two regular expressions representing the two haiku posting formats I mentioned above then it is stored in a text file along with its Twitter post ID. The ID is used to make sure I don’t store the same tweet more than once.
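The cleaning and matching step might look something like the Python sketch below. The Twitter API call is left out, and everything here is assumed for illustration: the three regular expressions are guesses at the kind of patterns involved (the real ones aren't shown), and extract_haiku and collect are hypothetical names.

```python
import re

# Hypothetical patterns, not the ones the gathering script really uses.
CRUFT = re.compile(r"@\w+|#\w+|https?://\S+")
SLASH_FORMAT = re.compile(r"^([^/\n]+)/([^/\n]+)/([^/\n]+)$")
NEWLINE_FORMAT = re.compile(r"^([^/\n]+)\n+([^/\n]+)\n+([^/\n]+)$")


def extract_haiku(tweet_text):
    """Return the three lines of a haiku-shaped tweet, or None."""
    if tweet_text.startswith("RT"):
        return None  # retweets are skipped entirely
    cleaned = CRUFT.sub("", tweet_text).strip()
    for pattern in (SLASH_FORMAT, NEWLINE_FORMAT):
        match = pattern.match(cleaned)
        if match:
            # Collapse runs of whitespace left behind by the cruft removal.
            lines = [" ".join(part.split()) for part in match.groups()]
            if all(lines):
                return lines
    return None


def collect(tweets, seen_ids):
    """Keep unseen tweets that parse; tweets is an iterable of (id, text)."""
    kept = []
    for tweet_id, text in tweets:
        if tweet_id in seen_ids:
            continue  # already stored on an earlier run
        lines = extract_haiku(text)
        if lines is not None:
            seen_ids.add(tweet_id)
            kept.append((tweet_id, lines))
    return kept
```

In this sketch seen_ids is an in-memory set; a real script would rebuild it from the stored file on each run.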

The gathering script has been running hourly for nearly four years and has so far accumulated 810,623 haikus. There are some repeats and a fair bit of failed parsing nonsense but that doesn’t matter too much for what I’m going to do with them.

A script runs daily to build three Markov chains from the big list of haikus, one per line. It skips any haiku that contains a word from a list published by Darius Kazemi as part of https://github.com/dariusk/wordfilter. This is mainly to avoid a selection of racist, sexist and ableist slurs. There is still a lot of scope for colourfully poetic language. The resulting data structure representing the three chains is written out as a large JSON file, together with a file containing all of the source haiku lines as the keys of a hash, the purpose of which will be revealed shortly.
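As a rough Python sketch of that build step, under several assumptions of my own: the chains are first-order (each word maps to the words seen after it), the "<s>" and "</s>" sentinels mark line starts and ends, the filter is a simple substring check, and build_chain and build_chains are hypothetical names. None of these details are confirmed above.

```python
import json
from collections import defaultdict

START, END = "<s>", "</s>"  # assumed sentinel tokens for line boundaries


def build_chain(lines):
    """First-order chain: map each word to the list of words seen after it."""
    chain = defaultdict(list)
    for line in lines:
        words = [START] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return dict(chain)


def build_chains(haikus, banned_words):
    """Build one chain per line position, skipping haikus that hit the filter."""
    banned = [w.lower() for w in banned_words]
    per_position = [[], [], []]
    for haiku in haikus:
        text = " ".join(haiku).lower()
        if any(word in text for word in banned):
            continue  # drop the whole haiku, not just the offending line
        for position, line in enumerate(haiku):
            per_position[position].append(line)
    chains = [build_chain(lines) for lines in per_position]
    # Source lines become hash keys so originality checks are a quick lookup.
    seen_lines = {line: 1 for lines in per_position for line in lines}
    return chains, seen_lines


def dump(chains, seen_lines):
    """Serialise both structures as JSON strings ready to write to disk."""
    return json.dumps(chains), json.dumps(seen_lines)
```

Storing repeated successors in a list keeps the word frequencies, so a uniform random pick from the list later reproduces them.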

Every minute between 7am and 11pm, cron runs the srcosgrave_tweet script. There’s a 1 in 225 chance that it will generate and post a haiku. On World Poetry Day it is 7 times more likely to post. Generating goes like so:

For each line:

  • Generate a line from the Markov chain
  • Throw it away and generate another one if the line is any of:
    • Less than 10 characters (a cheap short-circuit: a line that short is unlikely to fit the syllable requirement).
    • Present in the keys of the hash of haiku lines (hash keys are quick to look up). This ensures originality.
    • Not the required number of syllables.
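The loop above is a kind of rejection sampling, and could be sketched in Python like this. It is illustrative only: it assumes first-order chains stored as word-to-successor lists with "<s>" and "</s>" sentinels, takes the syllable counter as a parameter, and adds max_words and max_tries caps that the steps above don't mention, so the sketch always terminates. All function names are hypothetical.

```python
import random

START, END = "<s>", "</s>"  # assumed sentinel tokens for line boundaries
TARGETS = (5, 7, 5)         # syllables required for each of the three lines


def generate_line(chain, rng=random, max_words=30):
    """Walk the chain from START until END (or a safety cap) is reached."""
    words = []
    current = START
    for _ in range(max_words):
        current = rng.choice(chain[current])
        if current == END:
            break
        words.append(current)
    return " ".join(words)


def acceptable(line, target, seen_lines, count_syllables):
    """Apply the three rejection tests, cheapest first."""
    if len(line) < 10:
        return False  # too short to plausibly fit the syllable count
    if line in seen_lines:
        return False  # identical to a source line, so not original
    return count_syllables(line) == target


def generate_haiku(chains, seen_lines, count_syllables, rng=random, max_tries=10000):
    """Return three accepted lines, or None if a line never passes."""
    haiku = []
    for chain, target in zip(chains, TARGETS):
        for _ in range(max_tries):
            line = generate_line(chain, rng)
            if acceptable(line, target, seen_lines, count_syllables):
                haiku.append(line)
                break
        else:
            return None  # gave up on this line after max_tries attempts
    return haiku
```

Ordering the tests this way means the comparatively slow syllable count only runs on lines that have already passed the two cheap checks.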

With a new haiku prepared, there’s a 50% chance it will generate an image to accompany the poem. This renders the text in one of seven fetching fonts overlaid onto one of seven background textures. The textures are by Ervin Bartis; they are Creative Commons licensed and available on Flickr.

It posts its poetry to the Twitter account @srcosgrave with the hashtag #haiku (which does indeed mean that it consumes its own output). At first I wanted to experiment with whether or not people would mistake it for a human, hence the name: “Sam” because I wanted it to be gender ambiguous; “R.” for Robot, as per the humanoid robots in Asimov’s books; “Cosgrave” after Lionel Cosgrave, who would frequently turn up in stuff written by Richard Herring, especially On The Hour. I don’t think anybody was actually fooled; it’s pretty obvious what’s going on if you look at more than one tweet. It mostly gets attention from bots searching for specific words.