Is Hirsch or Wilson confused? A commentary on "The pitfalls of heritability"
Christopher Viger and Daniel Dennett
Center for Cognitive Studies
In "The pitfalls of heritability," a review of Edward O. Wilson's Consilience [Times Literary Supplement, Feb 12, 1999, p33], Jerry Hirsch claims to have convicted Wilson of a "confusion about genetic similarity and difference." In his book, Wilson claims that if we assume that "a mere one thousand genes out of the fifty to a hundred thousand genes in the human genome were to exist in two forms in the population," the probability of any two humans--excluding identical siblings--having the same genotype is vanishingly small. Hirsch points out that a single genotype can be produced in more than one way, thus increasing the likelihood of a single genotype recurring in the human population. Hirsch's point is fair enough as far as it goes, but it does not go nearly far enough. Hirsch has failed to carry out all the relevant calculations needed to determine the probability of two humans having the same genotype. In the realm of Vast numbers ("Very much greater than ASTronomical"--Dennett, 1995, p109), increased likelihood in and of itself tells us nothing. Here, then, are some of the relevant calculations.
We begin with some general calculations before considering the specific case. Suppose we have a population in which $n$ genes in the genome exist in two forms. There are thus $2^n$ distinct haploid gametes in the population, determining a total of $(2^n)^2 = 4^n$ possible combinations for the genotypes of the diploid offspring, as represented in the cells of a Punnett Square ($4^{1000} \approx 10^{602}$, using the numbers Wilson considers). But as Hirsch correctly points out, many of these genotypes occur more than once in the Punnett Square. To determine the frequencies of each genotype in the Punnett Square, we can think of the diploid genotypes as sequences of $n$ pairs of genes. Since we are considering only the genes that exist in two forms, there are exactly four possibilities for each pair in a sequence. If the two forms of a gene are A and a, the four possibilities are AA, Aa, aA, and aa. A standard oversimplification supposes that we can consider the Aa and aA possibilities to be the same, it making no difference which form of the gene comes from the father and which from the mother. In other words, there are two ways of making the heterozygote Aa combination, and only one way of making either the AA or aa combination. This is an oversimplification since, in the phenomenon of "genomic imprinting," the difference between the paternal and maternal contributions does make a difference, but setting this aside for the sake of Hirsch's argument, we can go along with his claim that any sequence having at least one pair that contains a gene of each form can be produced in more than one way. Since we are assuming that the determinations of each gene pair are independent, in general a genotype in which $k$ of the total $n$ pairs of genes have one gene of each form appears in the Punnett Square $2^k$ times.
And the number of such genotypes is ${}_nC_k \times 2^{n-k}$: the number of ways of specifying $k$ of the $n$ gene pairs[1] times the number of possibilities for the other $n-k$ gene pairs, which is 2 for each pair since both genes in the pair must be of the same form and there are only two forms. Thus the total number of cells in the Punnett Square occupied by genotypes in which $k$ of the total $n$ pairs of genes have one gene of each form is ${}_nC_k \times 2^{n-k} \cdot 2^k = {}_nC_k \times 2^n$. Summing these values over all $k$, and using $\sum_k {}_nC_k = 2^n$, we obtain $2^n \cdot 2^n = 4^n$, as expected. We also see from this analysis that the number of distinct genotypes in the Punnett Square is $\sum_k {}_nC_k \times 2^{n-k} = 3^n$ ($3^{1000} \approx 10^{477}$ using Wilson's numbers), again in agreement with Hirsch (and Wilson). The question before us is this: when we randomly select $m$ cells in the Punnett Square, what is the probability that no two will represent the same genotype? (With Hirsch and Wilson, we are assuming complete independence of the loci and ignoring identical siblings and conditions of extreme inbreeding, to which we will return below.)
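The counting claims above can be checked by brute force for small $n$. The following is only a sketch, not part of the original argument: the function names `punnett_counts` and `check` are our own, and each locus is represented by the letters A and a with Aa and aA treated as the same genotype, as in the simplification adopted above.

```python
from collections import Counter
from itertools import product
from math import comb


def punnett_counts(n):
    """Count how often each genotype appears among the 4**n Punnett Square cells."""
    gametes = list(product("Aa", repeat=n))  # the 2**n haploid gametes
    counts = Counter()
    for mother, father in product(gametes, repeat=2):  # the 4**n cells
        # Treat Aa and aA as the same by sorting the two alleles at each locus.
        genotype = tuple("".join(sorted(pair)) for pair in zip(mother, father))
        counts[genotype] += 1
    return counts


def check(n):
    """Verify the counting identities from the text for a small n."""
    counts = punnett_counts(n)
    assert sum(counts.values()) == 4 ** n  # total number of cells
    assert len(counts) == 3 ** n           # number of distinct genotypes
    by_k = Counter()
    for genotype, times in counts.items():
        k = sum(1 for locus in genotype if locus == "Aa")  # heterozygous pairs
        assert times == 2 ** k             # such a genotype fills 2**k cells
        by_k[k] += 1
    for k in range(n + 1):
        assert by_k[k] == comb(n, k) * 2 ** (n - k)
    return True


for n in range(1, 6):
    check(n)
```

The enumeration grows as $4^n$, so it is practical only for a handful of loci; it confirms the pattern, not the case $n = 1000$.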
To ease the task of calculating the probability with the specific numbers suggested, we can conceive of the Punnett Square as subdivided into blocks, each containing $2^n$ cells. We can define these blocks so that repeating genotypes always occur in a single block as follows. The most frequently repeating genotype, in which every gene pair has one gene of each form, occurs exactly $2^n$ times. These $2^n$ cells constitute one block. In the general case where a genotype repeats $2^k$ times, any $2^{n-k}$ such genotypes grouped together determine a block containing exactly $2^n$ cells. Since there are ${}_nC_k \times 2^{n-k}$ such genotypes in total, there are ${}_nC_k$ blocks containing genotypes that repeat $2^k$ times. Notice that there are $\sum_k {}_nC_k = 2^n$ such blocks in the Punnett Square, each containing $2^n$ cells, again giving the total of $4^n$ cells. In other words, in imagination we can segregate all the repeating genotypes into blocks of the same size, so that you have to select from a single block to find a matching pair.
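The block bookkeeping is pure arithmetic, so it can be confirmed directly even for $n = 1000$ using exact integers. This is a minimal sketch under the assumptions of the text; the function name `block_summary` is ours.

```python
from math import comb


def block_summary(n):
    """Return (number of blocks, total cells) for the block decomposition."""
    blocks = 0
    cells = 0
    for k in range(n + 1):
        genotypes_k = comb(n, k) * 2 ** (n - k)  # genotypes with k heterozygous pairs
        per_block = 2 ** (n - k)                 # how many of them share one block
        blocks_k = genotypes_k // per_block      # number of blocks for this k
        assert blocks_k == comb(n, k)
        blocks += blocks_k
        cells += blocks_k * per_block * 2 ** k   # each genotype fills 2**k cells
    return blocks, cells


blocks, cells = block_summary(1000)
assert blocks == 2 ** 1000
assert cells == 4 ** 1000
```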
To simplify the calculation of the probability, we will consider only the probability of two of our sample coming from the same block. Notice that this will greatly overestimate the probability of two genotypes being identical, since the probability of two genotypes in the same block being identical is vanishingly small in general (though it reaches 1 in our "best case" block and .5 in our 1000 tied-for-second-best-case blocks). Thus we have $2^n$ disjoint blocks all of equal size and want to know the probability of obtaining two genotypes in the same block, given a sample of size $m$. Since the blocks are of equal size, a randomly chosen cell is equally likely to fall in any particular block. Our problem thereby reduces to the famous birthday problem: what is the probability that 2 people in a room of 30 have the same birthday? In general, if there are $N$ equally likely possibilities and we have a sample of size $m$, the probability of no two being alike, $P(N,m)$, is given by $P(N,m) = N(N-1)(N-2)\cdots(N-m+1)/N^m$. Now the numbers in our problem are so large that this calculation is not practical. We can greatly simplify the situation, however, by underestimating $P(N,m)$.
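For modest numbers the birthday formula can be evaluated exactly. The sketch below uses the illustrative values $N = 365$ and $m = 30$ from the classic version of the problem, assuming all birthdays are equally likely; the function name `prob_all_distinct` is ours.

```python
from fractions import Fraction


def prob_all_distinct(N, m):
    """P(N, m) = N(N-1)...(N-m+1) / N**m, computed exactly."""
    p = Fraction(1)
    for i in range(m):
        p *= Fraction(N - i, N)
    return p


p = prob_all_distinct(365, 30)
print(float(p))      # chance that 30 people all have different birthdays
print(float(1 - p))  # chance that at least two of them share a birthday
```

With these values the chance of a shared birthday comes out a little above 0.7, which is why the problem is famous: matches are far likelier than intuition suggests when $m^2$ is comparable to $N$. In the genotype case $m^2$ is nowhere near $N$.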
$$P(N,m) = \frac{N(N-1)(N-2)\cdots(N-m+1)}{N^m} > \frac{(N-m)^m}{N^m}$$
$$= \frac{N^m - {}_mC_1 N^{m-1} m + {}_mC_2 N^{m-2} m^2 - \cdots + (-1)^m m^m}{N^m}$$
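The chain of underestimates can be sanity-checked numerically on small values, where everything can be computed exactly. This is a sketch only; the function names are ours. It also checks that keeping just the first two terms of the expansion, $1 - m^2/N$, never exceeds $(N-m)^m/N^m$, which is an instance of Bernoulli's inequality $(1-x)^m \ge 1 - mx$ with $x = m/N$.

```python
from fractions import Fraction


def exact_no_match(N, m):
    """Exact birthday probability N(N-1)...(N-m+1) / N**m."""
    p = Fraction(1)
    for i in range(m):
        p *= Fraction(N - i, N)
    return p


def lower_bound(N, m):
    """The underestimate (N - m)**m / N**m."""
    return Fraction(N - m, N) ** m


def first_two_terms(N, m):
    """First two terms of the binomial expansion: 1 - m**2 / N."""
    return 1 - Fraction(m * m, N)


for N in (10, 100, 10 ** 6):
    for m in (1, 2, 5, 9):
        assert first_two_terms(N, m) <= lower_bound(N, m) < exact_no_match(N, m)
```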
Now plug Wilson's example of a thousand genes with two forms into this formula to (under)estimate the probability that no two humans, excluding identical siblings, have the same genotype, by determining the probability that no two humans' genotypes are in the same block in the Punnett Square.
The total number of possibilities is the total number of blocks, $2^{1000} \approx 10^{301}$. We assume a sample size of one quadrillion ($10^{15}$) to (over)estimate the total number of humans that will ever live. So we use $P(N,m) > (N^m - {}_mC_1 N^{m-1} m + {}_mC_2 N^{m-2} m^2 - \cdots + (-1)^m m^m)/N^m$, where $N = 10^{301}$, $m = 10^{15}$. Now since $10^{301} \gg 10^{15}$, all but the first two terms in the estimate of $P(N,m)$ are negligible, which can be seen easily as follows. To again underestimate $P(N,m)$, suppose all the remaining terms are negative and approximate ${}_mC_k$ with the much larger $m^k$. The remaining terms are thus bounded by the geometric series $-\sum_k m^k N^{m-k} m^k / N^m = -\sum_k (m^2/N)^k$ as $k$ ranges from 2 to $m$, which for $N = 10^{301}$, $m = 10^{15}$, is on the order of $-10^{-540}$. So, up to this negligible correction, $P(N,m) > (N^m - {}_mC_1 N^{m-1} m)/N^m = 1 - m^2/N$. Thus $P(N,m) > 1 - (10^{15})^2/10^{301} = 1 - 10^{30}/10^{301} = 1 - 10^{-271}$, which is a vanishingly small difference from 1, just as Wilson claims. To put this the other way around, the probability of there being two humans, excluding identical siblings, with the same genotype is $1 - P(N,m) < 10^{-271}$. In fact we can see from the above calculation that we would require a human population size of approximately $10^{150}$, a number larger than current estimates for the number of elementary particles in the universe, in order to have a non-negligible probability of two humans having the same genotype. Notice that if we ignore the duplications in the Punnett Square, as Wilson seems to have done, the estimate is $10^{30}/10^{477} = 10^{-447}$. So Hirsch is right to point out that duplication makes the probability of repeating genotypes more likely--on this estimate it is more likely by an unimaginably huge factor of $10^{176}$.[2] Nevertheless, Wilson's main claim still stands!
It is certainly not clear that Wilson has demonstrated any confusion on the matter; rather, like Hirsch himself, he has simply not bothered to perform the relevant calculations because his intuition is that with such Vast numbers the probability of two humans, excluding identical siblings, having the same genotype is Vanishingly small, an intuition borne out by performing the exact calculation Hirsch suggests.
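The orders of magnitude in this estimate are easy to reproduce by working in base-10 logarithms. The sketch below is ours, not the authors' calculation: the function name `log10_estimates` is hypothetical, and it uses $N = 2^{1000}$ and $3^{1000}$ exactly rather than the rounded powers of ten, so the exponents agree with the text only after rounding.

```python
from math import log10


def log10_estimates(n=1000, m=10 ** 15):
    """Base-10 exponents of the bounds discussed in the text."""
    log_m_squared = 2 * log10(m)
    log_blocks = log10(2 ** n)     # N = 2**n equally likely blocks
    log_genotypes = log10(3 ** n)  # 3**n distinct genotypes, duplication ignored
    return {
        # log10 of m**2 / N, the bound on the chance of any repeated genotype
        "blocks": log_m_squared - log_blocks,
        # the same bound if every distinct genotype were treated as equally likely
        "distinct": log_m_squared - log_genotypes,
        # how much likelier duplication makes a repeat, on this estimate
        "factor": log_genotypes - log_blocks,
        # log10 of the sample size at which m**2 / N reaches order 1
        "threshold": log_blocks / 2,
    }


for name, exponent in log10_estimates().items():
    print(f"{name}: 10^{exponent:.1f}")
```

The four exponents round to -271, -447, 176, and roughly 150, matching the figures quoted above.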
The real-world situation is of course much more complicated than these fantastic Vast numbers suggest. Any real population, a Vanishing thread of related individuals in the deep space of Vast possibilities, has much less independence than the model supposes, but even in cases of small, extremely inbred, incestuous populations, where the genetic similarity between individuals is much greater than in large outbreeding populations, the odds against duplicate genotypes even in siblings are Vast. And if we assume ubiquitous genomic imprinting, as there is reason to do, the cells of the Punnett Square all represent distinct genotypes; no two are strictly identical.
Concerning the rest of Consilience, we make no comment here. We simply wished to demonstrate that Wilson's claim, based on his expertise in genetics, that the probability of two humans having the same genotype is vanishingly small is correct, contrary to what Hirsch implies. Unlike many of the critics of Consilience, who content themselves with handwringing and unsupported charges, Hirsch, to his credit, attempts an explicit demonstration of error, a criticism that is worthy of careful assessment and rebuttal. He fails, so he also fails to provide support for his larger, and more damning claim: "Where a shortcoming occurs in a field as close to Wilson's own as genetics, one's confidence in his expertise in remoter areas must be badly shaken."
Dennett, Daniel C., 1995, Darwin's Dangerous Idea, New York: Simon & Schuster.
[1] We use the older ${}_nC_k$ notation for readability in the text. This is, of course, just the number of combinations with $k$ members drawn from a set of size $n$.
[2] Actually a more careful calculation using the exact frequencies in which genotypes repeat shows the probability of two humans with the same genotype is approximately $10^{-400}$, so the increased likelihood of humans with the same genotype is a factor of "only" $10^{50}$. Our thanks to Terry Gannon of the University of Alberta for carrying out this calculation and verifying the results presented here.