Awful Markov Chains

As an exercise in writing inefficient Ruby code, I wrote these scripts to build a Markov model [1] from a body of text, then use that model to generate realistic-looking quotations! If you follow me on Twitter, you've unfortunately been subjected to the results for the past day.

Creating a Model

I used the clearest, most space-inefficient method I could imagine:

  1. Store the words in a giant hash: each key is a word from the text, and each value is an array of the words that followed it (one entry per occurrence)
  2. Save the hash to disk

The hash looks like this one, generated from the sentence "save the store from the storm":

db = { "" => ["save"],
       "save" => ["the"],
       "the" => ["store", "storm"],
       "store" => ["from"],
       "from" => ["the"]
       }
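To check the example by hand, here is a minimal in-memory sketch of the same idea. The helper name build_model is made up for illustration and is not part of the scripts in this post; it only assumes that words are separated by whitespace.

```ruby
# Build the successor hash for a string held in memory.
# build_model is a hypothetical helper, not part of learn.rb.
def build_model(text)
  db = {}
  text.each_line do |line|
    # Put "" in front of each line so its first word is recorded
    # as a possible starting point.
    ([""] + line.split).each_cons(2) do |preceding, word|
      (db[preceding] ||= []) << word
    end
  end
  db
end

model = build_model("save the store from the storm")
p model
```

Note that "storm" never becomes a key, because nothing follows it in the text.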

When this hash is saved to disk (using Marshal), it ends up being bigger than the original text, but the format is incredibly easy to use to produce text, so it was worth it to me as a quick hack. This is the code for generating it:

# usage: ruby learn.rb db-file-name

file = ARGV[0]

# load the existing model from disk (binread, since Marshal data is binary)
db = Marshal.load(File.binread(file)) rescue {}

# read words from $stdin; yield each with the one before
# (the first word of each line is paired with "")
def get_words
  $stdin.each_line do |line|
    preceding = ""
    line.split.each do |word|
      yield preceding, word
      preceding = word
    end
  end
end

# add words to the model
get_words do |preceding, word|
  db[preceding] ||= []
  db[preceding] << word
end

# save back to disk (binary mode, since Marshal data is not text)
File.open(file, "wb") { |f| f.write Marshal.dump(db) }
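The size claim is easy to check without touching the disk. This is a rough sketch that reuses the example hash from above; the variable names are only for illustration.

```ruby
text = "save the store from the storm"
db = { "" => ["save"],
       "save" => ["the"],
       "the" => ["store", "storm"],
       "store" => ["from"],
       "from" => ["the"] }

# Marshal.dump returns the bytes that learn.rb would write to the file.
dumped = Marshal.dump(db)
restored = Marshal.load(dumped)

puts "text:  #{text.bytesize} bytes"
puts "model: #{dumped.bytesize} bytes"
puts "round trip ok: #{restored == db}"
```

Every word is stored at least once as a key or a value, plus Marshal's own bookkeeping, so the dump comes out larger than the sentence it was built from.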

One important thing to note: the word at the beginning of each line is stored as occurring after "", the empty string. These are the possible starting points for generated chains…

Generating Text

Generating text is easy! I may as well show you the code first:

# usage: ruby produce.rb db-file-name max-characters

file = ARGV[0]
count = ARGV[1].to_i

# load the model (binread, since Marshal data is binary)
db = Marshal.load(File.binread(file)) rescue {}

# define convenience method for getting a random element of an array
# (Ruby 1.9 and later ship Array#sample, which does the same job)
class Array
  def rand
    self[Kernel.rand(length)]
  end
end

words = []
last = ""
loop do
  last = db[last].rand rescue nil
  break if last.nil?
  break if (words + [last]).join(" ").length >= count
  words << last
  # break if last[/[\.\?\!]$/] # stop at an end-of-sentence marker
end

puts words.join(" ")

It's easy to explain:

  1. Choose a first word from among those that had no preceding words during training (those at the beginning of lines)
  2. Randomly choose a word to come next from the pool of words that came after the previous one in the training text
  3. Continue until a word is chosen that has no successors, or the maximum number of characters is reached.
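The three steps above can be sketched as a single method. This is a rough rewrite rather than produce.rb itself: generate and its rng parameter are names made up for illustration, and it assumes a Ruby recent enough (1.9+) to have Array#sample, which stands in for the Array#rand patch.

```ruby
# generate is a hypothetical helper that mirrors the loop in produce.rb.
# Pass a seeded Random to get repeatable output.
def generate(db, max_chars, rng = Random.new)
  words = []
  last = ""                       # step 1: begin from the start-of-line key
  loop do
    successors = db[last]
    break if successors.nil? || successors.empty?    # step 3: dead end
    last = successors.sample(random: rng)            # step 2: random successor
    break if (words + [last]).join(" ").length >= max_chars  # step 3: cap
    words << last
  end
  words.join(" ")
end

db = { "" => ["save"],
       "save" => ["the"],
       "the" => ["store", "storm"],
       "store" => ["from"],
       "from" => ["the"] }

3.times { puts generate(db, 60) }
```

Because successors are stored once per occurrence, a word that followed "the" twice as often in the training text is twice as likely to be picked here.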

Using the training sentence from before ("save the store from the storm"), you can produce such glorious sentences as: "save the storm," "save the store from the storm," and, "save the store from the store from the store from the store from the storm."
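How likely is each of those? Here is a small sketch that multiplies the odds of every step. chain_probability is a made-up helper, it ignores the character limit, and strictly speaking it gives the chance that a chain begins with the given words (for this model the two are the same, since nothing follows "storm").

```ruby
# chain_probability is a hypothetical helper: it returns the chance that
# a generated chain starts with the words of the given sentence.
def chain_probability(db, sentence)
  ([""] + sentence.split).each_cons(2).reduce(Rational(1)) do |prob, (preceding, word)|
    successors = db[preceding] || []
    return Rational(0) if successors.empty?
    # Each occurrence in the array counts as one equally likely pick.
    prob * Rational(successors.count(word), successors.size)
  end
end

db = { "" => ["save"],
       "save" => ["the"],
       "the" => ["store", "storm"],
       "store" => ["from"],
       "from" => ["the"] }

puts chain_probability(db, "save the storm")                 # 1/2
puts chain_probability(db, "save the store from the storm")  # 1/4
```

Every trip around the "store from the" loop halves the odds again, which is why the long sentences show up less often.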

Applications

The first thing I did, of course, was train it with my plain text copy of the book of Genesis. Perhaps you do not have this text available. Why not try it with your own copy of the whole Bible [2]? I've gotten choice "quotations" like:

  • "And in the valleys, to him, he railed on the captains of Israel, hast put their father's brothers' sons: so is in the Egyptians, so I will"
  • "Then delivered unto thee, that shall carry forth fruit, as for a trance"
  • "For God hath ought to profit, but the ground and to wife. Then said the LORD, for ye not find no light."

Then I trained it on a database dump of a forum I frequent full of h4x0rz, g4m3rz, and nubc4k3z:

  • "now I thought as old T20 and is my post n64 games?/??"
  • "Marvel Ultimate and she just some holy Francisc, work a test server when a dirty commies."
  • "Your MOM looks like to appear!"

Get creative.

Kicking it up a notch: Twitter

Using this power, you can become nearly as annoying a tweeter as I am! This script downloads a user's Twitter RSS feed, parses it with Hpricot [3] (so you need the gem), then outputs new tweets to stdout (old tweets are cached). This output is meant to be piped into the learn.rb script … for generating tweets that are not entirely unlike those the poor victim wrote themselves.

The example code has _why [4]'s Twitter feed information hard-coded near the top, for ultimate randomness. The ID number is taken from the URL for his RSS feed on Twitter -- feel free to change this to the ID number for any user you'd like to mimic.

# usage: ruby getwhy.rb | ruby learn.rb why

require "rubygems"
require "hpricot"
require "open-uri"

@id = "3573501"
@file = "_why_tweets"

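# URI.open is required on Ruby 3.0+, where open-uri no longer patches Kernel#open;
# on Ruby older than 2.5, use plain open instead
rss = Hpricot(URI.open("http://twitter.com/statuses/user_timeline/#{@id}.rss"))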
new_statuses = (rss/"item title").map { |i| i.inner_html }.map { |i| i.split(" ", 2).last }
saved_statuses = Marshal.load(File.read(@file)) rescue []

never_before_seen = new_statuses - saved_statuses
statuses = new_statuses | saved_statuses
File.open(@file, "w") { |f| f.write Marshal.dump(statuses) }
puts never_before_seen.join("\n")
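The caching in that script comes down to two array operations, which can be tried without the network. The data below is invented for illustration, and the "username: text" title format is an assumption based on what the split(" ", 2) call appears to be stripping.

```ruby
# Hypothetical data standing in for the fetched feed and the cache file.
titles = ["_why: hello world", "_why: chunky bacon"]
saved_statuses = ["chunky bacon", "an older tweet"]

# Drop everything up to the first space (assumed to be the "username:" prefix).
new_statuses = titles.map { |t| t.split(" ", 2).last }

never_before_seen = new_statuses - saved_statuses   # only these get printed
statuses = new_statuses | saved_statuses            # union, duplicates dropped

p never_before_seen
p statuses
```

Only never_before_seen reaches learn.rb, so running the script repeatedly does not train the model on the same tweet twice.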
