Markov of Chain: Automating Weird Sun tweets

2017-May-27

Let’s use Python to train a Markov chain generator on all the tweets from a certain list of users, say this one. We’ll use the following libraries.

from functional import seq
import markovify
import re
import tweepy
import unidecode

To use the Twitter API, we need to authenticate ourselves. Register for your personal keys at https://apps.twitter.com/ and then create a config.json file that looks like this:

{
  "consumer_key":    "...",
  "consumer_secret": "...",
  "access_key":      "...",
  "access_secret":   "..."
}

Now we can initialize the Twitter API provided by tweepy.

config = seq.json('config.json').dict()
auth = tweepy.OAuthHandler(
    config['consumer_key'], config['consumer_secret'])
auth.set_access_token(config['access_key'], config['access_secret'])
api = tweepy.API(auth)
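
If you would rather not depend on seq.json for this step, here is a minimal sketch of the same config loading using only the standard library. It assumes the four keys shown in config.json above; load_config and REQUIRED_KEYS are hypothetical names, not part of tweepy or PyFunctional.

```python
import json

# Keys the config.json shown above is expected to contain.
REQUIRED_KEYS = ('consumer_key', 'consumer_secret',
                 'access_key', 'access_secret')


def load_config(path):
    """Read the JSON config and fail early if a key is missing.

    `load_config` is a hypothetical helper, not part of tweepy.
    """
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError('config is missing: ' + ', '.join(missing))
    return config
```

Calling load_config('config.json') returns a plain dict that can be passed to tweepy.OAuthHandler exactly as above. Failing early on a missing key gives a clearer error than a rejected request later.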

First we write the following function (based on this gist), which returns the most recent tweets of a given user. The API limits us to roughly the 3,200 most recent tweets per user.

def get_user_tweets(screen_name):
    alltweets = []

    #  200 is the maximum allowed count
    # 'extended' means return full unabridged tweet contents
    new_tweets = api.user_timeline(screen_name=screen_name, count=200,
                                  tweet_mode='extended')

    alltweets.extend(new_tweets)

    # nothing to page through if the user has no tweets
    if not alltweets:
        return []

    # save the id of the oldest tweet less one
    oldest_id = alltweets[-1].id - 1

    # keep grabbing tweets until there are no tweets left to grab
    while len(new_tweets) > 0:
        # since we're grabbing 200 at a time, we use `max_id` to
        #   ask for a certain range of tweets
        new_tweets = api.user_timeline(
                screen_name=screen_name, count=200,
                tweet_mode='extended', max_id=oldest_id)

        alltweets.extend(new_tweets)

        # update the id of the oldest tweet less one
        oldest_id = alltweets[-1].id - 1

        print("...{} tweets downloaded so far".format(len(alltweets)))

    # put each tweet on a single line
    tweet_texts = [re.sub(r'\s*\n+\s*', ' ', tweet.full_text)
                   for tweet in alltweets]

    return tweet_texts
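
The max_id paging used in get_user_tweets is easier to see without the network in the way. The sketch below runs the same idea against an in-memory list of made-up tweet ids. fake_user_timeline is a hypothetical stand-in for api.user_timeline, assumed to return results newest first and to treat max_id as an inclusive upper bound.

```python
def fake_user_timeline(all_ids, count, max_id=None):
    """Stand-in for api.user_timeline: newest first, ids <= max_id."""
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]


def fetch_all(all_ids, count=200):
    """Page through a newest-first timeline using max_id."""
    collected = []
    max_id = None
    while True:
        page = fake_user_timeline(all_ids, count, max_id)
        if not page:
            break
        collected.extend(page)
        # Next time, ask only for ids strictly older than the oldest seen.
        max_id = page[-1] - 1
    return collected


# A made-up timeline of 450 tweet ids, newest first.
timeline = list(range(1000, 550, -1))
pages = fetch_all(timeline)
```

Here fetch_all needs three non-empty pages (200, 200 and 50 ids) and stops on the fourth, empty one. Subtracting one from the oldest id is what keeps the boundary tweet from being fetched twice.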

The other interaction with Twitter we need to perform is to get all users in a list. We’ll write a function that fetches the usernames and calls get_user_tweets on each:

def get_list_tweets(screen_name, list_name):
    '''
    params: `screen_name` is the username of the owner of the list,
    `list_name` is the name of the list found in the URL
    '''

    # get list of all users in list
    user_names = []
    for user in tweepy.Cursor(
            api.list_members,
            screen_name,
            list_name).items():
        user_names.append(user.screen_name)

    # for each user, get their tweets
    list_tweets = []
    for user_name in user_names:
        user_tweets = get_user_tweets(user_name)
        list_tweets += user_tweets
        print('Found {} tweets from @{}.'
              .format(len(user_tweets), user_name))
    return list_tweets

Let’s run get_list_tweets and save the output to a file.

tweets = get_list_tweets('Grognor', 'weird-sun-twitter')

with open('data/tweetdump.txt', 'w') as f:
    f.write('\n'.join(tweets))

With all of the raw data saved, we’re done with the Twitter API and can process the data and generate tweets offline. Assuming the file tweetdump.txt contains a set of tweets, one per line, we load them as a list of strings, tweets.

with open('data/tweetdump.txt') as f:
    tweets = f.readlines()

Some processing needs to be done in order to get high quality text from the tweets. The next function process_tweet is called on each one.

def process_tweet(tweet):
    # convert to ASCII
    tweet = unidecode.unidecode(tweet)
    # remove URLs
    tweet = re.sub(r'http\S+', '', tweet)
    # remove mentions
    tweet = re.sub(r'@\S+', '', tweet)

    tweet = tweet.strip()

    # append terminal punctuation if absent
    if len(tweet) > 0:
        last_char = tweet[-1]
        if last_char not in '.!?':
            tweet += '.'

    return tweet

processed_tweets = [ process_tweet(tweet) for tweet in tweets ]

And we remove any tweets that aren’t useful.

def is_excluded(tweet):
    ex = False
    # no RTs
    ex = ex or bool(re.match(r'^RT', tweet))
    # remove whitespace-only tweets
    ex = ex or bool(re.match(r'^\s*$', tweet))
    return ex

good_tweets = [ tweet for tweet in processed_tweets
               if not is_excluded(tweet) ]
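
To see what the two cleaning steps do to concrete input, here is a self-contained sketch run on a few made-up tweets. clean mirrors process_tweet but leaves out the unidecode call so that it only needs the standard library, and keep is the inverse of is_excluded. Both names, and the sample tweets, are hypothetical.

```python
import re


def clean(tweet):
    """Same steps as process_tweet above, minus the unidecode call."""
    tweet = re.sub(r'http\S+', '', tweet)   # remove URLs
    tweet = re.sub(r'@\S+', '', tweet)      # remove mentions
    tweet = tweet.strip()
    if tweet and tweet[-1] not in '.!?':
        tweet += '.'                        # add terminal punctuation
    return tweet


def keep(tweet):
    """Inverse of is_excluded: drop retweets and empty tweets."""
    return not (re.match(r'RT', tweet) or re.match(r'\s*$', tweet))


samples = [
    'Everything is a costly signal https://t.co/abc123\n',
    '@someone agreed!\n',
    'RT @someone: the sun is weird\n',
    'https://t.co/onlyalink\n',
]
cleaned = [clean(t) for t in samples]
kept = [t for t in cleaned if keep(t)]
# kept == ['Everything is a costly signal.', 'agreed!']
```

The retweet is dropped because it still starts with RT after cleaning, and the link-only tweet is dropped because nothing is left once the URL is removed. Note that the mention pattern also eats anything after an @ in the middle of a word, such as the domain of an email address.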

We save the fully processed tweets for easy access later.

with open('data/processed_tweets.txt', 'w') as f:
    f.write('\n'.join(good_tweets))

The markovify library lets us train, and generate from, a Markov chain very easily. Just load the training text and set a state size.

text = open('data/processed_tweets.txt').read()

text_model = markovify.Text(text, state_size=3)

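for _ in range(5):
    sentence = text_model.make_short_sentence(140)
    # make_short_sentence returns None if it fails to build a sentence
    if sentence is not None:
        print('* ' + sentence)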
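
To get a feel for what state_size controls, here is a toy word-level Markov chain written from scratch. It is only a sketch of the general technique, not markovify’s implementation, which among other things also filters out generated sentences that overlap too much with the training text. build_chain, generate and the three-sentence corpus are all made up for illustration.

```python
import random
from collections import defaultdict

# Sentinel tokens marking the start and end of a sentence.
BEGIN, END = '<BEGIN>', '<END>'


def build_chain(sentences, state_size):
    """Map each run of `state_size` words to the words seen after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return chain


def generate(chain, state_size, rng):
    """Walk the chain from the start state until END is drawn."""
    state = (BEGIN,) * state_size
    output = []
    while True:
        word = rng.choice(chain[state])
        if word == END:
            return ' '.join(output)
        output.append(word)
        state = state[1:] + (word,)


corpus = [
    'the sun is weird.',
    'the sun is a costly signal.',
    'the moon is weird.',
]
chain = build_chain(corpus, state_size=2)
sentence = generate(chain, 2, random.Random(0))
```

With state_size=2 this tiny corpus can only reproduce its own sentences, because every pair of words has too few distinct continuations. With state_size=1 the chain can also produce "the moon is a costly signal.", which never appears in the input. That is the trade-off behind the choice of state size: larger states give more coherent but less novel output, which is why a large tweet dump is needed for a state size of 3 to be interesting.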

Some favorites:

“It is no coincidence we call them gods because we suppose they are trying to convince Robin Hanson.”
“Tell anyone who does not produce Scott Alexander.”
“Weird sun is a costly signal of the ability to remember sources of information, not just the study of complex manifolds.”
“If you read The Hobbit backwards, it’s about a layer of radioactive ash that develops the capacity to become larger.”
“When you read a physical book, you get a dust speck in the eye.”
“We all continuously scream about how the people in it are breaking the awkward silence.”
“People are important, but so are lexicographic preferences.”
“You don’t need an expert Bayesian Epistemologist to ensure it’s not a markov chain.”

Building a shell with JavaScript

2017-May-20

ShellJS is a JS library that provides functions like cd() and ls() which you can use to write Node scripts instead of bash scripts. That’s great for scripts, but what about an interactive shell? Well, we could just run the Node repl and import ShellJS:

$ node
> require('shelljs/global');
{}
> pwd()
{ [String: '/tmp']
  stdout: '/tmp',
  stderr: null,
  code: 0,
  cat: [Function: bound ],
  exec: [Function: bound ],
  grep: [Function: bound ],
  head: [Function: bound ],
  sed: [Function: bound ],
  sort: [Function: bound ],
  tail: [Function: bound ],
  to: [Function: bound ],
  toEnd: [Function: bound ],
  uniq: [Function: bound ] }

Hmm, that’s a little verbose, and we might want to avoid manually importing ShellJS. We also might want more features than the Node repl offers, such as vi keybindings.

We can get vi keybindings with rlwrap, but then tab completion goes away. The solution is given in this SO answer. First we need to install an rlwrap filter that negotiates tab-completion with a Node repl. The filter file can be found at that link, where it’s called node_complete. Put node_complete in $RLWRAP_FILTERDIR, which should be the folder on your system containing the RlwrapFilter.pm Perl module. For me it’s /usr/share/rlwrap/filters.

Now rlwrap is ready to negotiate tab completion, but the Node repl isn’t. We’ll have to write our own Node repl, which is easy because the repl module gives us all the tools we need. We’ll create a file called, say, myrepl.js, whose contents (only 9 lines) are also given in the SO answer. This script starts a repl with a hook to negotiate tab completion with rlwrap. If myrepl.js is in ~/bin, we can now run

$ rlwrap -z node_complete -e '' -c ~/bin/myrepl.js

and have both JS tab completion and rlwrap features, such as vi keybindings if that’s what we’ve configured. Let’s create a file called mysh with the following contents:

#!/usr/bin/env bash
rlwrap -z node_complete -e '' -c ~/bin/myrepl.js

Assuming ~/bin is in our PATH, we can put mysh there and launch our shell anywhere by just running mysh. So far so good, but we wanted to import ShellJS automatically. In myrepl.js, add the following:

var shell = require('shelljs');
Object.assign(myrepl.context, shell);

Those two lines add all the ShellJS functions to the JS global object inside the repl. We have:

$ mysh
> pwd()
{ [String: '/tmp']
  stdout: '/tmp',
  stderr: null,
  code: 0,
  cat: [Function: bound ],
  exec: [Function: bound ],
  grep: [Function: bound ],
  head: [Function: bound ],
  sed: [Function: bound ],
  sort: [Function: bound ],
  tail: [Function: bound ],
  to: [Function: bound ],
  toEnd: [Function: bound ],
  uniq: [Function: bound ] }

Progress. Now, how do we clean up this output? The repl module allows us to define a custom writer. This is a function which takes the output of a line of JS and returns a string to represent the output in the repl. What we need to do is intercept objects like the one returned by pwd() above and only show the stderr and stdout properties. Add the following near the beginning of myrepl.js:

var util = require('util');

var myWriter = function(output) {
  var isSS = (
      output &&
      output.hasOwnProperty('stdout') &&
      output.hasOwnProperty('stderr'));
  if (isSS) {
    var stderrPart = output.stderr || '';
    var stdoutPart = output.stdout || '';
    return stderrPart + stdoutPart;
  } else {
    return util.inspect(output, null, null, true);
  }
};

And load this writer by changing

var myrepl = require("repl").start({terminal:false});

to

var myrepl = require("repl").start({
  terminal: false,
  writer: myWriter});

Now we get

$ mysh
> pwd()
/tmp

Much better. However, since the echo function prints its argument to the console and returns an object with it in the stdout property, we get this:

$ mysh
> echo('hi')
hi
hi

I haven’t solved this issue quite yet although I’d be surprised if there isn’t a reasonable solution out there. You can add to mysh and myrepl.js to get more features, such as colors, custom evaluation, custom pretty printing, other pre-loaded libraries, et cetera. The sky is the limit. I added an inspect function which allows us to see the full ShellJS output of a command if we really want it. My complete myrepl.js file is:

#!/usr/bin/env node

var util = require('util');
var colors = require('colors/safe');

var inspect = function(obj) {
  if (obj && typeof obj === 'object') {
    obj['__inspect'] = true;
  }
  return obj;
};

var myWriter = function(output) {
  var isSS = (
      output &&
      output.hasOwnProperty('stdout') &&
      output.hasOwnProperty('stderr') &&
      !output.hasOwnProperty('__inspect'));
  if (isSS) {
    var stderrPart = output.stderr || '';
    var stdoutPart = output.stdout || '';
    return colors.cyan(stderrPart + stdoutPart);
  } else {
    if (typeof output === 'object') {
      delete output['__inspect'];
    }
    return util.inspect(output, null, null, true);
  }
};

// terminal:false disables readline (just like env NODE_NO_READLINE=1):
var myrepl = require("repl").start({
  terminal: false,
  prompt: colors.green('% '),
  ignoreUndefined: true,
  useColors: true,
  writer: myWriter});

var shell = require('shelljs');
Object.assign(myrepl.context, shell);
myrepl.context['inspect'] = inspect;

// add REPL command rlwrap_complete(prefix) that prints a simple list
//   of completions of prefix
myrepl.context['rlwrap_complete'] =  function(prefix) {
  myrepl.complete(prefix, function(err,data) {
    for (var x of data[0]) {console.log(x);}
  });
};

So this is basically what we wanted. We have a JS repl with convenient ShellJS commands. We also have vi keybindings, and tab completion for JS and filenames. It’s very rough around the edges, but it was really simple to make. GitHub user streamich built a more advanced form of this called jssh, which adds many features but lacks some too. The bottom line is, if you know JS, you might be surprised at what you can build.

Modeling aesthetics in mathematics

2016-Jul-5

What exactly is beautiful math?

[A]bove all, adepts [of mathematics] find therein delights analogous to those given by painting and music. They admire the delicate harmony of numbers and forms; they marvel when a new discovery opens to them an unexpected perspective; and has not the joy they thus feel the esthetic character, even though the senses take no part therein? Only a privileged few are called to enjoy it fully, it is true, but is not this the case for all the noblest arts?

-Henri Poincaré, The Value of Science

One expects a mathematical theorem or a mathematical theory not only to describe and to classify in a simple and elegant way numerous and a priori disparate special cases. One also expects “elegance” in its “architectural”, structural makeup. Ease in stating the problem, great difficulty in getting hold of it and in all attempts at approaching it, then again some very surprising twist by which the approach, or some part of the approach, becomes easy, etc. Also, if the deductions are lengthy or complicated, there should be some simple general principle involved, which “explains” the complications and detours, reduces the apparent arbitrariness to a few simple guiding motivations, etc. These criteria are clearly those of any creative art.

-John von Neumann, The Mathematician

The moral: a good proof is one that makes us wiser.

-Yuri Manin, A Course in Mathematical Logic for Mathematicians

My hypothesis is that generally when people talk about beauty in mathematics they’re talking about things that teach us something useful for proving new facts. For example, proving a difficult but simple theorem is useful because its difficulty means it may imply other previously difficult theorems, and its simplicity means it may show up and be used often. A theorem that establishes a connection between two previously disparate areas of mathematics is considered beautiful, and such a connection allows knowledge from one area to be applied to the other, potentially cracking new problems. An unexpected proof – “an unexpected perspective” or “surprising twist” – offers something new to be learned, something that can then be used for other problems.

Quote of the day: Yuri Gurevich

2016-Apr-29

I remember, in a geometry class, my teacher wanted to prove the congruence of two triangles. Let’s take a third triangle, she said, and I asked where do triangles come from. I worried that there may be no more triangles there. Those were hard times in Russia and we were accustomed to shortages. She looked at me for a while and then said: ‘Shut up’.

-Platonism, Constructivism, and Computer Proofs vs. Proofs by Hand

Link of the day: Learn the Greek alphabet

2016-Apr-3

A handy flashcards web app for memorizing all the Greek letters

Link of the day: Jon Skeet speaks

2016-Mar-15

Jon Skeet on the tricky edge cases that can show up with basic data types and how they model reality. Back to basics: the mess we’ve made of our fundamental data types
