Andrew's Blog     All posts     Feed     About

Uncertainty due to computational approximation in Bayesian inference

In Bayesian inference, we can factor approximate computation (e.g. linearization) into the actual posterior probabilities.

Suppose we have a pmf \(f(x) = P(X=x)\) which is hard to compute. If we approximate \(f\) by \(\tilde{f}\) then

\[\begin{align*} P\left(X = a \,|\, \text{we only compute } \tilde{f}\right) &= \sum_x x P \left(f(a)=x \,|\, a, \tilde{f}(a) \right)\\ &= E\left(f(a) \,|\, a, \tilde{f}(a) \right) \end{align*}\]

What is \(P\left(f(a)=x \,|\, a, \tilde{f}(a)\right)\)? Well, if \(f\) is hard to compute then we probably can’t gather much data, so there are various options to produce a subjective belief:

  • average-case analysis of \(\tilde{f}\) with an uninformed prior, e.g. probabilistic numerics
  • reference classes of “similar” cases
  • uniform distribution across worst-case bounds
  • past empirical experience
  • etc.

Note that if the mean of the pmf \(P(f(a)=\cdot \,|\, a, \tilde{f}(a))\) is \(f(a)\) then \(P(X = a \,|\, \text{we only compute } \tilde{f}) = P(X=a)\). So accounting for uncertainty due to approximation is equivalent to “de-biasing” it.

Example: Suppose \(f\) has a single atom and our approximation \(\tilde{f}\) is modeled as \(f\) shifted by some unknown amount: \(\tilde{f}(x) = f(x + Y - 5)\), where \(Y \sim {\rm B{\small IN}}(10, 1/2)\). If \(\tilde{f}(0) = 1\), then

\[\begin{align*} P(X=0 \,|\, \text{we only compute } \tilde{f}) &= P(f(0) = 1 \,|\, \tilde{f}(0) = 1) \\ &\approxeq P(\tilde{f}(0) = 1 \,|\, f(0) = 1) \\ &=\binom{10}{5} 2^{-10} \doteq 0.246. \end{align*}\]

(The approximate equality holds if, say, we assume the location of the atom is a priori uniformly distributed on a large integer interval.)

Note that this is not completely new. E.g. when inferring how likely it is that software is bug-free based on a finite set of tests, we are putting probability distributions on mathematically determined statements, assuming the software is deterministic.

Inference is approximated for computational reasons in many places such as linearization as mentioned already, clustering by compression using a zip algorithm (instead of computing Kolmogorov complexity), PASS-GLM, MCMC sampling, numerical methods, approximation algorithms, probabilistic data structures, et cetera.

Is this ultimately rigorous in a decision theoretic sense? I don’t think so, but what is rigorous can easily be mathematically intractable. So whatever, it’s a heuristic.

Comment ranking algorithms: Hacker News vs. YouTube vs. Reddit

Comments on social sites have to be sorted somehow. How do big platforms do it – is it some complicated mix of recommender systems, learning-to-rank algorithms, Markov decision processes, neural networks, and learning automata? Well, maybe in some cases but often it’s just a simple formula. In this article we put the formulas used by Hacker News, YouTube, and Reddit, along with a few alternatives, to the test, using virtual comment section simulations. Spoiler alert: YouTube does not do well.

The simulation model

240 visitors arrive at equally spaced increments over a 24 hour period. Each visitor is randomly assigned as a commenter (10%) or a voter (90%). Commenters leave a single comment, which gets a randomly assigned quality category: great (10%), mediocre (80%), or stinker (10%). Great comments have a high probability of receiving upvotes and a low probability of receiving downvotes; stinkers are the reverse; and mediocre comments have a low probability of receiving any votes. Voters, on the other hand, see the top-ranked comment and vote according to its probabilities. At this point they stop reading or keep going based on a probability that depends on the vote they just gave (0% for upvotes, 50% for downvotes, 15% for non-votes). If they don’t leave, they see the next-ranked comment and the process continues until they finally do leave or they read all the comments. When the simulation concludes, we log the average number of upvotes per visitor which we use as our utility function.

See Python source code for full details.

Of course this is not a perfect model of every comment section. These parameter values will not always be accurate, although I did play around with e.g. the commenter/voter ratio and I got basically the same final conclusions. Realistically the rate of visitors may vary over time. A voter’s probability of leaving after a certain comment conditional on the most recent (non-)vote may also depend on how many comments they’ve already read. Comment threads are not represented here. Vote probabilities may change over time. Et cetera, et cetera.

The ranking formulas

Here we use the following symbols

  • Number of upvotes received so far: \(n_{+}\)
  • Number of downvotes received so far: \(n_{-}\)
  • Age of comment, in hours: \(h\)

All ranking methods in our analysis rank comments by scoring each comment and sorting in descending order. The scores are determined by the formulas below.

Starting with the basics, we have the ratio \((n_{+} - n_{-})/(n_{+} + n_{-})\) and the difference \(n_{+} - n_{-}\), a.k.a. the number of net upvotes. We don’t expect these to be optimal but they’re useful baselines. Another version of the ratio is \(n_{+}/(n_{+} + n_{-})\) which performs similarly.

For testing purposes, we have the random ranking which is, well, just random, and the upvote probability ranking which ranks according to the true upvote probability.

Reddit’s algorithm, detailed here, is a frequentist method for estimating the true voting probabilities based on \(n_{+}\) and \(n_{-}\). The Bayesian version of this is what we’ll call the Bayesian average: the same as ratio but we imagine that a few extra “phantom” votes have been cast, say 3 downvotes and 3 upvotes.

Hacker News roughly uses the formula \((n_{+} - n_{-}) / (h+2)^{1.8}\), which is like ratio, if we interpret the denominator \((h+2)^{1.8}\) as an estimate of the number votes cast. In fact, this denominator is probably more naturally thought of as an estimate of the number of votes cast including implicit non-votes. Non-votes (with a value of 0) would not impact the numerator.

To get a sense of how the simulations look, here are the comments as presented to the 240th visitor from one run using the Hacker News scoring formula:

\(h\) Upvote probability Downvote probability \(n_{+}\) \(n_{-}\) HN score
7.9 0.671 0.324 47 5 0.657
14.2 0.671 0.076 82 3 0.515
21.9 0.496 0.14 110 10 0.324
23.3 0.434 0.051 72 12 0.174
8.9 0.162 0.03 8 0 0.094
14.1 0.112 0.054 12 3 0.060
10.9 0.184 0.058 6 0 0.059
5.1 0.151 0.008 2 0 0.058
12.9 0.226 0.049 6 0 0.046
15.0 0.114 0.061 10 6 0.024
7.3 0.021 0.009 1 0 0.017
13.4 0.071 0.008 1 1 0.0
5.2 0.489 0.038 0 0 0.0
3.6 0.151 0.041 1 0 0.0
1.0 0.579 0.087 0 0 0.0
0.7 0.047 0.024 0 0 0.0
21.7 0.158 0.222 19 20 -0.003
20.7 0.048 0.017 1 3 -0.007
10.4 0.055 0.044 1 2 -0.010
11.3 0.041 0.027 0 2 -0.018
19.5 0.104 0.166 5 10 -0.019
5.4 0.045 0.604 1 3 -0.054

YouTube also uses a formula that involves the age of the comment. Their system additionally factors in the user’s lifetime ratio, which for our tests we set to 0 as if all users are new.

Lastly, let’s consider how we might modify the Bayesian average to take time into account. To make new comments more visible we’ll make the phantom votes all upvotes at first, then asymptotically reduce them to non-votes. We’ll also switch to a denominator similar to the Hacker News formula’s in order to estimate non-votes. This yields the modified Bayes formula

\[\frac{n_{+} - n_{-} + n_p / (h+1)}{n_p + h},\]

where \(n_p\) is the number of phantom votes. We use the value \(n_p=7\) in the simulations.

Ranking the rankings

I did enough simulation runs (1000-20000) with each formula to be pretty confident about how they compare. Without further ado, voila:

Ranking algorithm Average number of upvotes per visitor
Upvote probability 0.978
Modified Bayes 0.916
Hacker News 0.899
Bayesian average 0.878
Difference 0.848
Reddit 0.836
Ratio 0.813
YouTube 0.644
Random 0.607

So YouTube is marginally better than random, Reddit is worse than the simple difference, and Hacker News is the only one of the three better than Bayesian average. Disappointing but also plausible. How generalizable are the results? As always, more work required…

Mathematics as a service

What would a market for mathematics look like?

Formal verification might allow an elegant mechanism: Someone posts a proposition in a formal language like Coq and the first to submit a proof that passes verification wins the bounty. Everything can be automated and maybe even trustless. This has been tried, at, which was shut down due to consistency bugs in the verifier. Even without bugs, proof assistants are still difficult to use; mathematician Thomas Hales says “It is very hard to learn to use Lean proficiently. Are you a graduate student at Stanford, CMU, or Pitt writing a thesis on Lean? Are you a student at Imperial being guided by Kevin Buzzard? If not, Lean might not be for you.”

If we stick to natural language to avoid the learning curve, things get messy. How does the market decide what a complete proof is, which proof is first, and who did it? Perhaps the only tenable solution is to leave these decisions up to the individuals who post the bounties. How would we know that bounties would ever get paid? Stack Exchange forces bounties to be put in escrow and if they’re not awarded to someone there’s no refund. Another option is to rely on reputation by using certified identities (e.g. users’ email addresses are verified and public so they can be checked against personal webpages).

Something along these lines might be doable (and if someone wants to build it I’ll donate the domain but what’s the use case? Monetary rewards for mathematical problems are rare and mathematicians generally already earn a salary, so the interest would likely be modest. Students (anywhere in the world) are plausible suppliers though, perhaps even high school students, while consumers could be anyone with a research grant usable for paying “research assistants”, or industry and non-profit research groups. A market that brings these two sides together could be of some value.

Paid question answering has been tried before, e.g. Google Answers which wasn’t very popular. Did it fail due to lack of network effects, lack of innovative mechanisms, or an essential flaw in the concept? I don’t know. Bounties on GitHub issues seem to be a bit more successful.

In addition to bounties, there could be a prediction market. The time of resolution may have to be indefinite, though, since resolving “proposition X will be publicly proved by date Y” would in general require determining the nonexistence of a public proof, which is at least somewhat error-prone. However, prediction markets are basically illegal so it’s a moot point.

March 2020 links

James I’s 1597 book Daemonologie, “a philosophical dissertation on contemporary necromancy … touches on topics such as werewolves and vampires”.

96.5% of 19-year-old males in Seoul have myopia.

List of Scottish Canadians.

Free ebook of classic novel plot summaries.

“Kime”: complex-valued time.

Robin Hanson predicts China virus disaster

Robin Hanson says “In few months, China is likely to be a basket case, having crashed their economy in failed attempt to stop COVID-19 spreading.” Quantifying the forecast, he says China’s economy (or growth?) will be “a factor of two to ten down” and seems to expect dramatic results in 6 months.

Ranking cities by weather

Let’s analyze data from from the last 10 years to compare weather (technically “climate”) in a selection of North American cities.

If we define a “nice day” as one where

  • there are at least 10 hours of daylight,
  • the high apparent temperature is at least 0°C and at most 30°C,
  • the cloud cover is at most 70%, and
  • the UV index is at most moderate (unfortunately I used UV index at a single point in time during the day and didn’t adjust for time zones),

we get:

City Probability of nice day
San Diego 0.27
Los Angeles 0.23
San Francisco 0.22
Raleigh 0.22
Austin 0.2
Vancouver 0.19
New York 0.19
Cambridge 0.19
Chicago 0.16
Ottawa 0.16
Toronto 0.15

What are the nicest months to visit Toronto?

Month Average number of nice days in Toronto
January 0
February 2.9
March 9.0
April 4.7
May 1.2
June 0.4
July 0.5
August 4.0
September 12.1
October 15.8
November 2.4
December 0

If we define a “sunny day” as one where

  • there are at least 10 hours of daylight,
  • the high apparent temperature is at least 15°C, and
  • the cloud cover is at most 50%,

we get:

City Probability of sunny day
Los Angeles 0.69
Austin 0.56
San Francisco 0.49
Raleigh 0.46
San Diego 0.45
New York 0.33
Cambridge 0.32
Chicago 0.26
Toronto 0.23
Vancouver 0.2
Ottawa 0.18

What are the sunniest months to visit Toronto?

Month Average number of sunny days in Toronto
January 0
February 0
March 0.7
April 2.6
May 10.0
June 12.5
July 17.9
August 17.8
September 15.1
October 6.1
November 0.4
December 0

Lastly, if we define a “warm day” as one where

  • the high apparent temperature is at least 15°C and at most 25°C and
  • the UV index is at most high,

we get:

City Probability of warm day
San Diego 0.5
San Francisco 0.45
Los Angeles 0.37
Vancouver 0.33
Raleigh 0.28
New York 0.25
Austin 0.25
Ottawa 0.23
Toronto 0.23
Cambridge 0.22
Chicago 0.21

What are the warmest months to visit Toronto?

Month Average number of warm days in Toronto
January 0
February 0.3
March 1.8
April 6.9
May 11.7
June 11.5
July 4.8
August 10.2
September 19.9
October 13.7
November 2.1
December 0.1