A simple next-word prediction engine

Quick start

# Fetch a sample corpus
$ mkdir data
$ mkdir data/samples
$ curl -o data/samples/big.txt

# Generate stats using NSP
$ mkdir data/output
$ cd scripts
$ ./ ../data/samples/big.txt ../data/output/

# Create binary dictionaries
$ cd ..
$ mkdir dictionaries
$ mkdir dictionaries/test
$ cd python
$ python -u ../data/output/unigrams.txt -n ../data/output/ngrams2.ll,../data/output/ngrams3.ll,../data/output/ngrams4.ll -o ../dictionaries/test/big.dict

# Create binary dictionaries for unit tests
$ python -t
$ python
$ cd ../cpp
$ make test

Generating statistics

To create a binary dictionary, we need statistics generated by the N-Gram Statistics Package (NSP). The script in the scripts/ folder serves this purpose.

A sample corpus can be downloaded with curl:

$ curl -o data/samples/big.txt

We can generate the desired statistics in the following way:

$ cd scripts


The script first generates a simple word frequency list, unigrams.txt, in OUTPUT_DIR. Each line has the form weight unigram. Example output:

79377 the
39997 of
38076 and
28604 to
21780 in
20910 a

The weight is simply the number of occurrences of the corresponding word in the corpus.
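As an illustration of the format, the word frequency list can be read back into a mapping with a few lines of Python. This is a minimal sketch, not part of the project's scripts: the function name parse_unigrams is hypothetical, and it assumes every non-blank line holds exactly one integer weight followed by one word.

```python
from typing import Dict, Iterable


def parse_unigrams(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'weight unigram' lines into a word -> count mapping.

    Hypothetical helper for illustration; assumes one integer weight
    followed by one word on each non-blank line.
    """
    counts: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        weight, word = line.split(None, 1)
        counts[word] = int(weight)
    return counts


# Lines taken from the example output above
sample = ["79377 the", "39997 of", "38076 and"]
counts = parse_unigrams(sample)
print(counts["the"])  # prints 79377
```

In practice the lines would come from an open file handle rather than a list, since the function accepts any iterable of strings.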


The script then generates lists of bi-, tri-, and four-grams (ngrams2.ll, ngrams3.ll, ngrams4.ll, also located in OUTPUT_DIR), in which each line has the form unigram<>unigram<>...<>rank weight (we ignore rank for now). Example output:

of<>the<>2 25053.6988
in<>the<>6 10335.9606
did<>not<>8 9798.6723
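To make the n-gram line format concrete, here is a minimal Python sketch that splits one line into its words, rank, and weight. The function name parse_ngram_line is hypothetical, and the sketch assumes lines look exactly like the examples above; real NSP output may include header lines or extra columns, which it does not handle.

```python
from typing import List, Tuple


def parse_ngram_line(line: str) -> Tuple[List[str], int, float]:
    """Split 'w1<>w2<>...<>rank weight' into (words, rank, weight).

    Hypothetical helper for illustration; assumes the weight is the
    last whitespace-separated field and the rank is the last
    '<>'-separated field before it.
    """
    fields, weight = line.strip().rsplit(None, 1)
    parts = fields.split("<>")
    words, rank = parts[:-1], int(parts[-1])
    return words, rank, float(weight)


# Line taken from the example output above
words, rank, weight = parse_ngram_line("of<>the<>2 25053.6988")
print(words, rank, weight)  # prints ['of', 'the'] 2 25053.6988
```

The same function works for tri- and four-grams, since the number of words is simply whatever precedes the rank field.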

Generating dictionaries

A script in the python/ folder generates a binary dictionary from the NSP output. Example usage:


Using dictionaries

Implementations in Python and C++ are currently available for loading a binary dictionary and querying it for:

  • Corrections
  • Completions (Python only)
  • Next-word predictions


Here is a simple usage example in Python:

bindict = BinaryDictionary.from_file('../dictionaries/test/test.dict')
bindict.get_predictions(['hello']) # => [('there',10),('sir',3)]
bindict.get_corrections('yuur')    # => ['your','you','year']
bindict.get_completions('yo', 2)   # => ['you','your']
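To show what predictions and completions mean conceptually, here is a toy in-memory sketch built on plain Python dictionaries. It is not how BinaryDictionary works internally, and the functions predict_next and complete are hypothetical names used only for illustration. The data is taken from the example statistics shown earlier.

```python
from typing import Dict, List, Tuple


def predict_next(bigrams: Dict[Tuple[str, str], float], word: str,
                 limit: int = 3) -> List[Tuple[str, float]]:
    """Return up to `limit` (next_word, weight) pairs, heaviest first."""
    candidates = [(w2, weight) for (w1, w2), weight in bigrams.items()
                  if w1 == word]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:limit]


def complete(unigrams: Dict[str, int], prefix: str,
             limit: int = 3) -> List[str]:
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    matches = [w for w in unigrams if w.startswith(prefix)]
    matches.sort(key=lambda w: unigrams[w], reverse=True)
    return matches[:limit]


# Toy data copied from the example outputs in "Generating statistics"
bigrams = {("of", "the"): 25053.6988,
           ("in", "the"): 10335.9606,
           ("did", "not"): 9798.6723}
unigrams = {"the": 79377, "of": 39997, "to": 28604}

print(predict_next(bigrams, "of"))  # prints [('the', 25053.6988)]
print(complete(unigrams, "t"))      # prints ['the', 'to']
```

A linear scan like this is only reasonable for tiny examples; a compact binary format is presumably what makes the same lookups practical for a full corpus.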


Here is a simple usage example in C++:

BinaryDictionary bindict;

string phrase[] = {"how", "are"};
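// Next-word predictions for the phrase "how are"
vector<weighted_string> predictionHolder;
vector<weighted_string> predictions = bindict.getPredictions(phrase, 2, predictionHolder, 4);

// Corrections for the word "you"
vector<weighted_string> correctionHolder;
vector<weighted_string> corrections = bindict.getCorrections("you", correctionHolder, 100);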

Note that querying for word completions is not yet implemented in C++.

Unit tests

The unit tests are designed to be used with a simple dictionary, located at dictionaries/test/test.dict, and generated using the -t option:

$ python -t


The Python unit tests use the unittest module, and are available in python/

$ python


The C++ unit tests, located at cpp/tests/unit/test.cpp, are based on the UnitTest++ framework (included). Simply use the provided Makefile in the cpp folder to run the tests:

$ make test

Mastodon is released under the MIT license. See