Mastodon

A simple next-word prediction engine


Quick start

# Fetch a sample corpus
$ mkdir data
$ mkdir data/samples
$ curl http://norvig.com/big.txt -o data/samples/big.txt

# Generate stats using NSP
$ mkdir data/output
$ cd scripts
$ ./generate_stats.sh ../data/samples/big.txt ../data/output/

# Create binary dictionaries
$ cd ..
$ mkdir dictionaries
$ mkdir dictionaries/test
$ cd python
$ python makedict.py -u ../data/output/unigrams.txt -n ../data/output/ngrams2.ll,../data/output/ngrams3.ll,../data/output/ngrams4.ll -o ../dictionaries/test/big.dict

# Create binary dictionaries for unit tests
$ python makedict.py -t
$ python unittests.py
$ cd ../cpp
$ make test

Generating statistics

To create a binary dictionary, we need data created from the N-Gram Statistics Package (NSP), available at http://www.d.umn.edu/~tpederse/nsp.html. The script generate_stats.sh in the scripts/ folder serves this purpose.

A sample corpus can be found at http://norvig.com/big.txt.

$ curl http://norvig.com/big.txt -o data/samples/big.txt

We can generate the desired statistics in the following way:

$ cd scripts
$ ./generate_stats.sh INPUT_FILE OUTPUT_DIR

Unigrams

The script generates a simple word frequency list unigrams.txt in OUTPUT_DIR, in which each line is of the form weight unigram. Example output:

79377 the
39997 of
38076 and
28604 to
21780 in
20910 a
...

The weight is simply the number of occurrences of the corresponding word in the corpus.
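As an illustration (not part of Mastodon itself), the weight unigram lines above can be read into a frequency table with a few lines of Python:

```python
def parse_unigrams(lines):
    """Map each word to its corpus frequency.

    Each line has the form "weight unigram", as in the example above.
    """
    freqs = {}
    for line in lines:
        weight, word = line.split()
        freqs[word] = int(weight)
    return freqs

sample = ["79377 the", "39997 of", "38076 and"]
print(parse_unigrams(sample))  # {'the': 79377, 'of': 39997, 'and': 38076}
```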

N-grams

The script then generates lists of bi-, tri-, and four-grams (ngrams2.ll, ngrams3.ll, ngrams4.ll, also located in OUTPUT_DIR) of the form unigram<>unigram<>...<>rank weight (we ignore rank for now). Example output:

of<>the<>2 25053.6988
in<>the<>6 10335.9606
did<>not<>8 9798.6723
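A sketch of how one such line can be split into its parts (the field layout is taken from the format description above; the function name is ours, not makedict.py's):

```python
def parse_ngram_line(line):
    """Split an NSP .ll line into (words, rank, weight).

    Lines look like "of<>the<>2 25053.6988": the words are separated
    by "<>", the last "<>"-field is the rank (ignored by Mastodon for
    now), and the trailing number is the weight.
    """
    head, weight = line.rsplit(None, 1)
    parts = head.split("<>")
    words, rank = tuple(parts[:-1]), int(parts[-1])
    return words, rank, float(weight)

print(parse_ngram_line("of<>the<>2 25053.6988"))
# (('of', 'the'), 2, 25053.6988)
```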

Generating dictionaries

To generate a binary dictionary from the NSP output, the script makedict.py in the python/ folder is available. Example usage:

$ python makedict.py -u UNIGRAM_FILE -n BIGRAM_FILE,TRIGRAM_FILE,FOURGRAM_FILE -o OUTPUT_FILE

Using dictionaries

Implementations in Python and C++ are currently available for loading a binary dictionary and querying it for:

  • Corrections
  • Completions (Python only)
  • Next-word predictions
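Conceptually, next-word prediction amounts to looking up all n-grams whose prefix matches the given context and returning the heaviest continuations. The toy below sketches that idea over a plain dict of n-gram weights; the real implementations query the binary dictionary instead:

```python
def predictions(ngrams, context, limit=2):
    """Return up to `limit` (word, weight) continuations of `context`,
    heaviest first.  `ngrams` maps word tuples to weights."""
    context = tuple(context)
    candidates = [(weight, words[-1])
                  for words, weight in ngrams.items()
                  if words[:-1] == context]
    candidates.sort(reverse=True)
    return [(word, weight) for weight, word in candidates[:limit]]

ngrams = {("hello", "there"): 10, ("hello", "sir"): 3, ("how", "are"): 7}
print(predictions(ngrams, ["hello"]))  # [('there', 10), ('sir', 3)]
```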

Python

Here is an example of basic usage in Python:

bindict = BinaryDictionary.from_file('../dictionaries/test/test.dict')
bindict.get_predictions(['hello']) # => [('there',10),('sir',3)]
bindict.get_corrections('yuur')    # => ['your','you','year']
bindict.get_completions('yo', 2)   # => ['you','your']

C++

Here is an example of basic usage in C++:

BinaryDictionary bindict;
bindict.fromFile("../dictionaries/test/test.dict");

string phrase[] = {"how", "are"};
vector<weighted_string> holder;
vector<weighted_string> predictions = bindict.getPredictions(phrase, 2, holder, 4);

holder.clear();
vector<weighted_string> corrections = bindict.getCorrections("you", holder, 100);

Note that querying for word completions is not yet implemented in C++.

Unit tests

The unit tests are designed to be used with a simple dictionary, located at dictionaries/test/test.dict, and generated using the -t option:

$ python makedict.py -t

Python

The Python unit tests use the unittest module, and are available in python/unittests.py:

$ python unittests.py

C++

The C++ unit tests, located at cpp/tests/unit/test.cpp, are based on the UnitTest++ framework (included). Simply use the provided Makefile in the cpp folder to run the tests:

$ make test


License

Mastodon is released under the MIT license. See LICENSE.md.