# Markov generator

The assumption behind this theory is that people can remember their passwords because there is a hidden Markov model in the way they are generated. This paper was the source of inspiration for this patch. The basic hypothesis is that the probability that a letter appears at position n is a function of the letter at position (n-1) 1). This models the fact that it is more likely to find after a c the letter h than the letter z.

If P(x) is the probability that x is the first letter of a password, and P(a|b) the probability that b follows a, then for the word `bull` P(bull) = P(b)*P(u|b)*P(l|u)*P(l|l).

When converting to integers by defining P'(x) = round(-10*log( P(x) )), this becomes P'(bull) = P(b)+P'(u|b)+P'(l|u)+P'(l|l).

## Properties

This convertion allows for faster computations and several properties, explained in the aforementionned paper, such as:

• it is possible to efficiently count the quantity of passwords where P(password) is less than a given value
• it is possible to efficiently generate the Nth password where P(password) is less than a given value
• it is easy to compute the score P' of any word

## Patch usage

The patch is already included in this tarball, as well as in the jumbo patch (1.7.3.1-all-4 and above). There is no standalone patch because of the general lack of interest in it. It adds a new option to `john` and adds three programs.

### calc_stat

This program is used to generate your statistic file. It is recommended to do this to tune the generator to your specific language by running it on a large sample of clear text passwords. Its usage is:

`./calc_stat [-p] dictionnary_file statfile`

### mkvcalcproba

This program takes a file with a password per line and a stats file as its input, and outputs statistics about them as a tab separated output. The columns are:

• the details of the P' calculation
• the value of P' (markov score)
• the position of the password if it was brute forced blindly
• the markov score of the password without the first two characters

This program is of limited interest to most, but will let you

• generate nice graphs to include in reports (will be added soon)

### genmkvpwd

This program could be used to output passwords on stdout, but is mainly useful for planning a cracking session. Its syntax is:

`./genmkvpwd statfile max_lvl [max_len] [start] [end]`

With a given statfile, it is mainly used in the following way : `./genmkvpwd statfile 0 len` gives, for `level` between 100 and 350, the quantity of passwords whose maximum length in `len` and markov strength is or equal to `level`. For example:

```\$ ./genmkvpwd stats 0 12
lvl=100 (2424 Kb for nbparts) 9846 possible passwords
lvl=101 (2448 Kb for nbparts) 10 K possible passwords (10849)
lvl=102 (2472 Kb for nbparts) 11 K possible passwords (11915)
lvl=103 (2496 Kb for nbparts) 13 K possible passwords (13104)
lvl=104 (2520 Kb for nbparts) 14 K possible passwords (14446)
...
lvl=346 (8328 Kb for nbparts) 440023 G possible passwords (440023594138549)
lvl=347 (8352 Kb for nbparts) 478482 G possible passwords (478482142956335)
lvl=348 (8376 Kb for nbparts) 520144 G possible passwords (520144425218940)
lvl=349 (8400 Kb for nbparts) 565264 G possible passwords (565264299077832)
lvl=350 (8424 Kb for nbparts) 614114 G possible passwords (614114588006381)```

### john mode

The syntax of the new generator is:

`–markov[=level[:start:end[:maxlen]]]`

Here `level` correspond to the maximum markov strength of candidate passwords, `start` to the first password to be generated (default: 0), `end` to the last (default: 0, 0 meaning “last password”), and `maxlen` to the maximum length of candidate passwords.

## Real world usage example

In this example, two computers are available to the cracker:

• a laptop, with a 1.2Ghz Core 2 CPU, with 2 cores
• a server, with a 1.86Ghz Xeon CPU, with 4 cores

Password of length less than 12 will be tested against a single Raw-MD5 hash during 24 hours.

### Benchmarking against the target

It is important to be able to benchmark against the passwords that are going to be tested:

```\$ time ./john -markov:200:10000000:50000000:12 -format:raw-md5 truc.md5
MKV start (lvl=200 len=12 pwd=40000000)
guesses: 0  time: 0:00:00:11 100%  c/s: 3442K  trying: sinounol - sinounot

real    0m11.646s
user    0m11.581s
sys     0m0.008s```

The `time` command is more precise than `john`'s output. The actual speed is 40000000/11.646 = 3434655 passwords/s. Note that while this matches the c/s figure in this example, this is not always the case.

### Getting the total quantity of passwords to generate

The benchmarks give:

• 3434655 passwords/s for the laptop
• 9912442 passwords/s for the server (not an actual benchmark, made up for this example)

There are two cores on the laptop, and four on the server, leading to a total speed of (3434655*2 + 9912442*4) = 46519078 passwords/s. During 24 hours, this is 46519078*3600*24 = 4019248339200 passwords, or 4019G passwords. The `genmkvpwd` for length 12 passwords gives the following values:

```lvl=294 (7080 Kb for nbparts) 3634 G possible passwords (3634988600673)
lvl=295 (7104 Kb for nbparts) 4014 G possible passwords (4014261346547)
lvl=296 (7128 Kb for nbparts) 4432 G possible passwords (4432325494466)```

The quantity of passwords for markov level 295 matches the quantity of passwords that could be tested in 24 hours.

4014261346547 passwords are to be divided between 6 CPUs, capable of generating a total of 46519078 passwords/s.

host CPU speed passwords to generate start
laptop 1 3434655 296385986094 0
laptop 2 3434655 296385986094 296385986094
server 1 9912442 855372343589 592771972188
server 2 9912442 855372343589 1448144315777
server 3 9912442 855372343589 2303516659366
server 4 9912442 the rest 3158889002955

The following command lines are to be executed:

• on the laptop
```\$ ./john -markov:295:0:296385986093:12 -format:raw-md5 truc.md5 &
\$ ./john -markov:295:296385986094:592771972187:12 -format:raw-md5 truc.md5```
• on the server
```\$ ./john -markov:295:592771972188:1448144315776:12 -format:raw-md5 truc.md5 &
\$ ./john -markov:295:1448144315777:2303516659365:12 -format:raw-md5 truc.md5 &
\$ ./john -markov:295:2303516659366:3158889002954:12 -format:raw-md5 truc.md5 &
\$ ./john -markov:295:3158889002955:0:12 -format:raw-md5 truc.md5```

It is recommended to automate all these steps.

## Markov vs. "incremental" mode

Matt Weir ran some curious tests of the two cracking modes. In his tests, the Markov mode with carefully chosen limit outperformed “incremental” in the end. On the other hand, “incremental” could continue running (or the session could be interrupted and continued at a later time), whereas the Markov mode terminated upon reaching the limit.

Please see Matt's first blog post on the topic and his follow-up blog post.

1) “incremental” mode as supported by the “official” JtR also uses this hypothesis