Finally, Tim Peters of the Spambayes Project proposed a way of generating a particularly useful spamminess indicator based on the combined probabilities.

"I hired programmers to do the programming to actually test it ..." Robinson helped develop recommendation-engine technology, which applies advanced mathematical techniques in software so that a computer can make informed guesses about what a consumer might like.

For example, if a consumer likes music by artists such as the Beach Boys, Bob Dylan and the Talking Heads, the software matches these preferences against a much larger dataset of other consumers who also like those three artists and who, taken together, have much greater musical knowledge than the single consumer.
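The matching idea in this example can be illustrated with a minimal sketch. It is not Robinson's algorithm: it assumes a deliberately simple rule (find consumers who like every artist the target likes, then count what else they like), and the function name `recommend`, the consumer names and the extra artists are all made up for illustration.

```python
from collections import Counter


def recommend(target_likes, other_consumers, top_n=3):
    """Suggest artists liked by consumers who share all of the target's likes.

    Hypothetical helper: a naive overlap count, not a production recommender.
    """
    target = set(target_likes)
    counts = Counter()
    for likes in other_consumers.values():
        likes = set(likes)
        # Only consumers who like every one of the target's artists count.
        if target <= likes:
            counts.update(likes - target)
    return [artist for artist, _ in counts.most_common(top_n)]


# Made-up data, for illustration only.
consumers = {
    "ann": {"Beach Boys", "Bob Dylan", "Talking Heads", "The Byrds", "Television"},
    "bob": {"Beach Boys", "Bob Dylan", "Talking Heads", "The Byrds"},
    "cat": {"Bob Dylan", "Miles Davis"},
}
suggestions = recommend({"Beach Boys", "Bob Dylan", "Talking Heads"}, consumers)
```

Here the third consumer is ignored because they do not share all three artists, so only the pooled tastes of the first two contribute to the suggestions.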

In 2003, Robinson's article in Linux Journal detailed a new technique, perhaps best described as a general-purpose classifier, which extended the usefulness of Bayesian filtering.

Robinson's method combined math-intensive algorithms with chi-square statistical testing to let computers examine an unknown file and make intelligent guesses about its contents.

The technique had wide applicability; for example, Robinson's method enabled computers to examine a file and guess, with much greater accuracy, whether it contained pornography, or whether an incoming email to a corporation was a technical question or a sales-related question.

SpamBayes assigned probability scores for both spam and ham (useful email) to guess intelligently whether an incoming email was spam; the scoring system enabled the program to return a value of "unsure" if both the spam and ham scores were high.
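The three-way outcome can be sketched as follows. This is an illustration rather than SpamBayes's actual code: it assumes the two scores are folded into a single indicator as (1 + spam score − ham score) / 2, and the function names and the cutoffs of 0.20 and 0.90 are assumptions chosen for the example.

```python
def spamminess_indicator(spam_score, ham_score):
    """Fold two scores in [0, 1] into one indicator in [0, 1].

    Near 1 means spam, near 0 means ham, and near 0.5 means the
    two scores cancel out (for example, both are high).
    """
    return (1.0 + spam_score - ham_score) / 2.0


def classify(spam_score, ham_score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return "spam", "ham" or "unsure". The cutoffs are illustrative assumptions."""
    indicator = spamminess_indicator(spam_score, ham_score)
    if indicator >= spam_cutoff:
        return "spam"
    if indicator <= ham_cutoff:
        return "ham"
    return "unsure"
```

With these assumed cutoffs, a message whose spam and ham scores are both close to 1 lands near 0.5 and is reported as "unsure" rather than being forced into either class.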

The approach described here truly has been a distributed effort in the best open-source tradition.

Paul Graham, an author of books on Lisp, suggested an approach to filtering spam in his on-line article, “A Plan for Spam”.

I took his approach for generating probabilities associated with words, altered it slightly and proposed a Bayesian calculation for dealing with words that hadn't appeared very often ...
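The passage does not spell out the calculation. A minimal sketch of one such adjustment, assuming the shrinkage form f(w) = (s·x + n·p(w)) / (s + n), is shown below. Here p(w) is the raw spam probability for a word, n is the number of emails in which the word has been seen, x is the probability assumed for a word never seen before, and s is the weight given to that assumption. The parameter names and the defaults s = 1 and x = 0.5 are assumptions made for the example.

```python
def adjusted_word_probability(raw_probability, count, strength=1.0, prior=0.5):
    """Pull a word's raw spam probability toward a prior when data is scarce.

    raw_probability: p(w), estimated from the training mail
    count:           n, how many training emails contained the word
    strength, prior: s and x, assumed defaults for illustration
    """
    return (strength * prior + count * raw_probability) / (strength + count)
```

Under these assumptions, a word seen once, and only in spam, scores 0.75 rather than 1.0, while the same raw probability backed by 99 sightings scores 0.995, so rarely seen words cannot dominate the result.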

an approach based on the chi-square distribution for combining the individual word probabilities into a combined probability (actually a pair of probabilities—see below) representing an e-mail.
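A chi-square combination of this kind can be sketched as follows. The sketch assumes Fisher's method: for n word probabilities, −2 times the log of their product is compared against a chi-square distribution with 2n degrees of freedom, once for the probabilities and once for their complements, which yields the pair of values. The function names are hypothetical, and the arrangement of the two scores follows one common convention rather than the article's exact notation.

```python
import math


def chi2_survival(chi, df):
    """P(X >= chi) for a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    m = chi / 2.0
    term = total = math.exp(-m)
    # Closed-form series for even df: exp(-m) * sum of m**i / i! for i < df / 2.
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(word_probabilities):
    """Return (spam_score, ham_score) for word probabilities strictly between 0 and 1."""
    n = len(word_probabilities)
    ln_prod_p = sum(math.log(p) for p in word_probabilities)
    ln_prod_not_p = sum(math.log(1.0 - p) for p in word_probabilities)
    # Many spammy words make prod(1 - p) tiny, which drives spam_score toward 1.
    spam_score = 1.0 - chi2_survival(-2.0 * ln_prod_not_p, 2 * n)
    # Many hammy words make prod(p) tiny, which drives ham_score toward 1.
    ham_score = 1.0 - chi2_survival(-2.0 * ln_prod_p, 2 * n)
    return spam_score, ham_score
```

For a message whose words all have probability 0.99, this sketch returns a spam score near 1 and a ham score near 0. For a message with an even mix of 0.99 and 0.01 words, both scores come out high, which is the conflicting-evidence case.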