Let's use RockYou:

$ wc -l r 32603387 r $ sort -uS 4G r | wc -l 14344391 $ shuf r > rs # Sanity check $ wc -l rs 32603388 rs # Apparently, "wc -l" didn't count the last line before "shuf" because there was no linefeed character after it. A very minor discrepancy, OK to ignore. $ sort -uS 4G rs | wc -l 14344391 $ head -32603 rs | sort -uS 4G | wc -l 28815 $ head -326034 rs | sort -uS 4G | wc -l 247675 $ head -3260339 rs | sort -uS 4G | wc -l 1976676 # Calculate the ratios $ bc scale=10 28815/32603 .8838143729 247675/326034 .7596600354 1976676/3260339 .6062792856 14344391/32603387 .4399662832

Let's go to Wolfram Alpha and extrapolate:

FindFit[{{32603,28815},{326034,247675},{3260339,1976676},{32603387,14344391}},a x^b,{a,b},x] FindFit[{{32603,28815},{326034,247675},{3260339,1976676},{32603387,14344391}},a x^b log x,{a,b},x] FindFit[{{32603,0.883814},{326034,0.75966},{3260339,0.606279},{32603387,0.439966}},{(1+log(1+x/a))^b,b<0,(1+log(1+32603387/a))^b=0.439966},{a,b},x] FindFit[{{32603,0.883814},{326034,0.75966},{3260339,0.606279},{32603387,0.439966}},{(1+log(1+x/a)/10)^b,b<0,(1+log(1+3260339/a)/10)^b=0.606279,(1+log(1+32603387/a)/10)^b=0.439966},{a,b},x] Table[{x((1+log(1+x/242428))^-0.462189),x((1+log(1+x/197913)/10)^-1.98901),4.76888 x^0.862246,0.812062 x^0.799794 log x},{x,{1,1000,32603,10^5,326034,3260339,32603387,10^9,10^10,10^11}}] Plot[{x,x((1+log(1+x/242428))^-0.462189),x((1+log(1+x/197913)/10)^-1.98901),4.76888 x^0.862246,0.812062 x^0.799794 log x},{x,1,10^5}] # ... and so on for other plot scales, up to x=10^11

Update: the longest FindFit above may be simplified to:

FindFit[{{3260339,0.606279},{32603387,0.439966}},{(1+log(1+x/a)/10)^b,b<0},{a,b},x]

which gives almost the same result (a=197787 b=-1.9888). With either, we're essentially solving two equations with two unknowns.

Further, when fitting the curve to only 3 points, Wolfram Alpha is able to find three parameters at once, which lets us do:

FindFit[{{326034,0.75966},{3260339,0.606279},{32603387,0.439966}},{(1+a log(1+x/b))^c,c<0},{a,b,c},x] FindFit[{{326034,0.75966},{3260339,0.606279},{32603387,0.439966}},{(1+log(1+x/b)/a)^c,c<0},{a,b,c},x] Table[{x((1+0.00799391 log(1+x/35208.8))^-14.8322),x((1+log(1+x/1891.9)/209.283)^-15.0619)},{x,{1,1000,32603,10^5,326034,3260339,32603387,10^9,10^10,10^11}}]

This looks like it'd match our data really well (including in the 1 to 326k range not explicitly included in the FindFit queries), but unfortunately at these parameter values (a power of -15) things get somewhat less stable (precision loss?) We might need to use a better tool for this. The table and graphs below do not use these 3-parameter formulas.

In plain English: we got four pretty good approximation formulas, which produce values matching RockYou's nearly perfectly in the ~300k to 32.6M range. (The simpler two formulas produce impossible values in the 1 to 100k range, though. The more complicated two are chosen such that they can't produce theoretically impossible values, but they nevertheless do produce higher than RockYou's actual numbers of unique passwords in the ~100 to ~200k range.) Then we use these formulas to extrapolate to up to 100 billion of total passwords.

Some numbers of interest: at 1 billion total, there are estimated to be about 300 million unique passwords. At 10 billion, there are ~2.3 billion uniques. At 100 billion, there are ~18 billion uniques. (This is using mean values of the four formulas' results.)

A flaw of this approach: the world is not a service like RockYou. Presumably, most people who let RockYou have a password of theirs did so for only one or a few of their passwords, for specific services. Those same people generally also have other password-protected resources, and on at least some of those they might reuse the same passwords. Thus, if we analyzed the full set of {resource, password} combinations of those same people, we'd likely see more duplicate passwords, and would extrapolate to fewer unique passwords in the world. Another way to look at this, though, is that we might be estimating the number of unique passwords in use by a certain number of people (our “total”) rather than among a certain number of {resource, password} combinations.

The above analysis was based on the RockYou leak. In the ECB mode encrypted passwords leak from Adobe that occurred later, there are about 130 million passwords total, 56 million unique (the exact passwords aren't always known to us, but ECB mode lets us figure out how many are unique anyway). Extrapolation from RockYou using the formulas above gives 47 million to 52 million unique, so Adobe's 56 million suggests that their users' passwords are slightly better in this respect.

Perl script to calculate unique from total using the four formulas, as well as average and mean:

#!/usr/bin/perl die "Usage: $0 TOTAL\n" unless ($#ARGV eq 0); $x = $ARGV[0]; $a = $x * ((1 + log(1 + $x / 242428)) ** -0.462189); $b = $x * ((1 + log(1 + $x / 197913) / 10) ** -1.98901); $c = 4.76888 * $x ** 0.862246; $d = 0.812062 * $x ** 0.799794 * log($x); $avg = ($a + $b + $c + $d) / 4; $mean = ($a * $b * $c * $d) ** (1 / 4); printf "%.0f %.0f %.0f %.0f\naverage = %.0f\nmean = %.0f\n", $a, $b, $c, $d, $avg, $mean;

Back to my pseudo homepage.