Calculating Probability
Perl – Think Again
To tackle mathematical problems with conditional probabilities, math buffs rely on Bayes' formula or discrete distributions, generated by short Perl scripts.
Features
The Monty Hall problem is loved by statisticians around the world [1]. You might be familiar with this puzzle, in which a game show host offers a contestant a choice of three doors – behind one door is a prize, but the other two doors only reveal goats. After the contestant chooses a door, the TV host opens a different door, revealing a goat, and asks the candidate to reconsider (Figure 1). Who would have thought that probabilities in a static television studio could change so dramatically just because the host opens a door without a prize?
The human brain seems to struggle with conditional probabilities like these, which describe random events with preconditions. It's just like a task I set up in a Perl column 15 years ago, which involved a top hat with three cards inside. The first card is black on both the front and back, the second card is red on front and back, and the third card red on one side and black on the other (Figure 2). One card is drawn. You can see the front, which is red. What is the probability of the back of the card also being red?
Counterintuitive
Most people immediately respond "50 percent," because there are apparently two equally probable cases: the redandblack card and the redandred card – that is, the opposite side is black in one case and red in the other case. To many people's surprise, however, the correct answer is "66 percent." The clue lies within the preconditions of the experiment: The redandred card is drawn twice as often in comparison to the redandblack one, because it meets the precondition of the experiment – one side is red – on both sides, whereas the redandblack card only gets drawn onethird of the time.
More than 250 years ago, the mathematician Thomas Bayes tried to express such intuitively errorprone problems in an immutable formula. The Bayes' theorem [2] describes in its "diachronic interpretation" how the probabilities of hypotheses change over time, as you encounter new data in an experiment. The formula
P(HD) = P(H) × P(DH)/P(D)
defines the probability P of a hypothesis H, for example, "I have drawn the redandred card," under the condition of newly emerging data D, such as "The front of the drawn card is red" (Figure 3).
On the righthand side of the equation you see P(H), the probability of the hypothesis, before new data become available. Bayes multiplies it by P(DH), that is, the probability that the data are available under the hypothesis. How likely is it, for example, that the person drawing the cards actually will see a red front if they have drawn the redandred card?
The denominator on the right side, P(D), is the probability of the available data, independent of any hypothesis. How likely is it that the candidate will see a red front after drawing any card?
Clever Bayes
All three possible hypotheses – randomly drawn cards – are obviously equally likely in this experiment. In other words, P(H) – the probability that the redred card is drawn – is equal to 1/3, because it is drawn in one third of all cases. Assuming this hypothesis, the flip side of the drawn redred card is trivially also red in exactly 100 percent of all cases; that is, P(DH) = 1.
Regardless of the hypothesis of one drawn card, the probability of seeing a red card surface after drawing is 50 percent, that is, P(D) = 1/2. After all, there are six card sides in the hat, three of which are red, and three black. If you substitute this into the Bayesian formula, the result for P(HD) is thus 1/3 × 1/(1/2) = 2/3. Thus, the formula predicts the experimentally demonstrable correct result of 66 percent and contradicts our intuition.
Probability, Discretely Programmed
Think Bayes [3], which was published about six months ago, not only explains the Bayesian formula but also describes a numerical method that is easy to use for experiments with short software programs. To do this, you store the hypotheses as statistical distributions, that is, in a Python dictionary or Perl hash, along with their probabilities.
As additional data becomes available, the stored values are then multiplied by the conditional probability of the incoming data, under the assumption that the individual hypothesis is true. Later, the program normalizes all values so that they add up to 1, or 100 percent.
The result is the probability of all hypotheses in the light of the incoming data. This process and specifically the final normalizing step can only be used, however, if:
 not more than one of the hypotheses is true, and
 there are no other options; that is, at least one of the hypotheses is true.
Listing 1 [4] shows the implementation of the cards
script, which prints the correct solution, 0.666. It uses the Distrib.pm
module (Listing 2), which provides auxiliary functions for computing the statistical distribution. The probabilities for the hypotheses of drawing different cards (BB
, BR
, RR
for blackblack, blackred, redred) are fed to the module in lines 79, each with a value of 1/3.
Listing 2
Distrib.pm
01 use Moose; 02 03 has 'values' => (is => 'rw', isa => 'HashRef', 04 default => sub { {} } ); 05 06 sub set { 07 my( $self, $hypo, $prob ) = @_; 08 09 $self>values()>{ $hypo } = $prob; 10 } 11 12 sub mult { 13 my( $self, $hypo, $prob ) = @_; 14 15 $self>values()>{ $hypo } *= $prob; 16 } 17 18 sub normalize { 19 my( $self ) = @_; 20 21 my $values = $self>values(); 22 my $sum = 0; 23 24 for my $hypo ( keys %$values ) { 25 $sum += $values>{ $hypo }; 26 } 27 for my $hypo ( keys %$values ) { 28 $values>{ $hypo } /= $sum; 29 } 30 } 31 32 sub prob { 33 my( $self, $hypo ) = @_; 34 35 return $self>values()>{ $hypo }; 36 } 37 38 1;
Listing 1
cards
01 #!/usr/local/bin/perl w 02 use strict; 03 use Distrib; 04 05 my $distrib = Distrib>new(); 06 07 $distrib>set( "BB", 0.33 ); 08 $distrib>set( "BR", 0.33 ); 09 $distrib>set( "RR", 0.33 ); 10 11 $distrib>mult( "BB", 0 ); 12 $distrib>mult( "BR", 0.5 ); 13 $distrib>mult( "RR", 1 ); 14 15 $distrib>normalize(); 16 17 print $distrib>prob( "RR" ), "\n";
Lines 1113 of Listing 1 define the incoming data (i.e., the probabilities that the front sides of the drawn cards are red) for each hypothesis and multiply the supplied hypothesis values by them. Thus, the probability of the subject seeing a red front on the blackblack card is zero. The probability is 50 percent for the blackred card, depending on which way the card faces coming out of the hat. The redred card shows red in 100 percent of all cases, which is why line 13 sets the multiplier to a value of 1
.
Finally, the call to the prob("RR")
method returns the normalized probability of the hypothesis redred card, which is 2/3.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Direct Download
Read full article as PDF:
Price $2.95
News

Linus Addresses Criticism for Harsh Language
Kernel king admits his tone has alienated volunteers, but says the demands of the process require directness.

New SSL Attack Lets a Malicious Listener Steal Session Cookies
New flaw in an old encryption scheme leaves the experts scrambling to disable SSL 3

Leading Open Source Coder Rants about Rants
Lennart Poettering wants to change the way Linux developers talk to each other.

HP Splits into Two Companies
Enterprise giant frees itself from ink and home PCs (and visa versa).

Mozilla Labs Shuts Down
Mozilla’s product think tank sinks silently into history.

Internet Giants Launch Collaboration to Improve Open Source
TODO group will focus on open source tools in largescale environments.

Fedora Announces New Partition Manager App
New tool will look like GParted but support a wider range of storage technologies.

Mozilla Releases a New Firefox
New public key pinning feature will help prevent maninthemiddle attacks.

Tool Predicts Which Websites Will be Compromised
Carnegie Mellon researchers say 3 million pages could fall down the phishing hole in the next year.

SSH Key Management Guidelines
The US government rolls new bestpractice rules for protecting SSH.