Python Package

Infini-gram offers a Python package, which allows you to run the infini-gram engine on your own machine with indexes stored locally. You can access all functionalities offered by the API Endpoint and the Web Interface, plus a little extra, while sparing yourself the annoying network latency and rate limits.

You can run the engine on our pre-built indexes, which we have made available for download. You can also build new indexes on datasets of your choice.

Getting Started

To make queries on a local index, you first need to instantiate an engine with this index, and then you can make queries by invoking the appropriate methods in the engine. Here’s a minimal example to get started:

>>> from infini_gram.engine import InfiniGramEngine
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_bos_token=False, add_eos_token=False)
>>> engine = InfiniGramEngine(index_dir='index/v4_pileval_llama', eos_token_id=tokenizer.eos_token_id)

>>> input_ids = tokenizer.encode('natural language processing')
>>> input_ids
[5613, 4086, 9068]
>>> engine.count(input_ids=input_ids)
{'count': 76, 'approx': False}

You can read about other query types in the Query Types section below.

This Python Package vs. the API Endpoint

The Python package has all functionalities of the API endpoint, plus a few extra features:

  1. There is no hard upper limit on query parameters (e.g., max_support, max_clause_freq, max_diff_tokens, max_disp_len). The only limit will be your machine’s compute power.

There are a few other distinctions:

  1. Inputting query strings is not allowed. You need to tokenize your query yourself.

  2. CNF queries have separate method names (count_cnf(), find_cnf()) from simple queries (count(), find()).

  3. The input field query_ids is replaced with more specific names (e.g., input_ids, cnf, prompt_ids, cont_ids).

  4. The output does not contain fields token_ids, tokens, and latency.
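
In practice, the first three of these distinctions mean your calls look like this (a minimal sketch reusing the tokenizer and engine from Getting Started):

>>> input_ids = tokenizer.encode('natural language processing')  # tokenize the query yourself; no query strings
>>> result = engine.count(input_ids=input_ids)                   # the field is input_ids, not query_ids
>>> result_cnf = engine.count_cnf(cnf=[[input_ids]])             # CNF queries have their own method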

Installation

  1. Check your system and make sure it satisfies the following requirements (you can verify most of these with the self-check snippet after this list):

  • You have Linux or macOS. Sorry, no Windows support :)

  • Supported architectures: x86_64 and i686 on Linux; x86_64 and arm64 on macOS.

  • Your system needs to be little-endian. This should be the case for most modern machines.

  • Make sure you have Python >=3.11, and strictly speaking CPython, not PyPy or another implementation.

  2. Install this package: pip install infini-gram

  3. If you’d like to run the engine on one of our pre-built indexes, download the index that you would like to query. For the sake of performance, it is strongly recommended that you put the index on an SSD. See details in the Pre-built Indexes section below.

  4. If none of the pre-built indexes fits your needs, you can build new indexes on datasets of your own choice. See details in Indexing Custom Datasets.
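
Here is a small self-check snippet (not part of the package) for the requirements in step 1; the sample outputs are what you should see on a conforming x86_64 Linux machine:

>>> import sys, platform
>>> sys.byteorder                        # must be 'little'
'little'
>>> sys.version_info >= (3, 11)
True
>>> platform.machine()                   # must be a supported architecture
'x86_64'
>>> platform.python_implementation()     # must be CPython
'CPython'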

Pre-built Indexes

We have made the following indexes publicly available on AWS S3:

| Name | Documents | Tokens | Storage | Corpus | Tokenizer | S3 URL |
|------|-----------|--------|---------|--------|-----------|--------|
| v4_olmoe-mix-0924-dclm_llama | 2,948,096,911 | 4,341,627,197,578 | 33TiB | olmoe-mix-0924 (the DCLM part) | Llama-2 | s3://infini-gram/index/v4_olmoe-mix-0924-dclm_llama |
| v4_olmoe-mix-0924-nodclm_llama | 133,343,623 | 233,848,504,469 | 1.8TiB | olmoe-mix-0924 (everything except DCLM) | Llama-2 | s3://infini-gram/index/v4_olmoe-mix-0924-nodclm_llama |
| v4_olmo-2-0325-32b-anneal-adapt_llama | 82,461,386 | 35,153,386,430 | 268GiB | dolmino-mix-1124 (except those already in pre-training); SFT; DPO; RLVR | Llama-2 | s3://infini-gram/index/v4_olmo-2-0325-32b-anneal-adapt_llama |
| v4_olmo-2-1124-13b-anneal-adapt_llama | 82,534,460 | 35,273,912,238 | 269GiB | dolmino-mix-1124 (except those already in pre-training); SFT; DPO; RLVR | Llama-2 | s3://infini-gram/index/v4_olmo-2-1124-13b-anneal-adapt_llama |
| v4_olmoe-0125-1b-7b-anneal-adapt_llama | 82,513,183 | 35,262,277,074 | 269GiB | dolmino-mix-1124 (except those already in pre-training); SFT; DPO; RLVR | Llama-2 | s3://infini-gram/index/v4_olmoe-0125-1b-7b-anneal-adapt_llama |
| v4_dolma-v1_7_llama | 3,403,336,408 | 2,604,642,372,173 | 20TiB | Dolma-v1.7 | Llama-2 | s3://infini-gram/index/v4_dolma-v1_7_llama |
| v4_rpj_llama_s4 | 931,361,530 | 1,385,942,948,192 | 8.9TiB | RedPajama | Llama-2 | s3://infini-gram/index/v4_rpj_llama_s4 |
| v4_piletrain_llama | 210,607,728 | 383,299,322,520 | 2.5TiB | Pile-train | Llama-2 | s3://infini-gram/index/v4_piletrain_llama |
| v4_c4train_llama | 364,868,892 | 198,079,554,945 | 1.3TiB | C4-train | Llama-2 | s3://infini-gram/index/v4_c4train_llama |
| v4_dolma-v1_6-sample_llama | 13,095,416 | 9,178,218,956 | 62GiB | Dolma-v1.6-sample | Llama-2 | s3://infini-gram/index/v4_dolma-v1_6-sample_llama |
| v4_dolmasample_olmo | 13,095,416 | 8,039,098,124 | 53GiB | Dolma-v1.6-sample | OLMo | s3://infini-gram-lite/index/v4_dolmasample_olmo |
| v4_pileval_llama | 214,670 | 393,769,120 | 2.3GiB | Pile-val | Llama-2 | s3://infini-gram-lite/index/v4_pileval_llama |
| v4_pileval_gpt2 | 214,670 | 383,326,404 | 2.2GiB | Pile-val | GPT-2 | s3://infini-gram-lite/index/v4_pileval_gpt2 |

Smaller indexes are stored in the s3://infini-gram-lite bucket and can be downloaded for free, without an AWS account. These indexes are v4_pileval_llama, v4_pileval_gpt2, and v4_dolmasample_olmo. To download, run this command:

aws s3 cp --no-sign-request --recursive {S3_URL} {LOCAL_INDEX_PATH}
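
For example, to download the Pile-val index used in the examples on this page into a local directory index/v4_pileval_llama:

aws s3 cp --no-sign-request --recursive s3://infini-gram-lite/index/v4_pileval_llama index/v4_pileval_llama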

Larger indexes are stored in the s3://infini-gram bucket. To download these indexes, you need to pay the data transfer fee (~$0.09 per GB according to AWS S3 pricing). Make sure you have correctly set up your AWS credentials before downloading these indexes. These indexes include v4_rpj_llama_s4, v4_piletrain_llama, and v4_c4train_llama. To download, run this command:

aws s3 cp --request-payer requester --recursive {S3_URL} {LOCAL_INDEX_PATH}
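
For example, to download the Pile-train index (this will incur data transfer charges on your AWS account):

aws s3 cp --request-payer requester --recursive s3://infini-gram/index/v4_piletrain_llama index/v4_piletrain_llama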

Query Types

Before submitting any type of query, you need to instantiate the engine with the index you would like to query. As an example, below we create an engine with the index for Pile-val (the validation set of Pile), which was built with the Llama-2 tokenizer:

>>> from infini_gram.engine import InfiniGramEngine
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_bos_token=False, add_eos_token=False) # the tokenizer should match that of the index you load below
>>> engine = InfiniGramEngine(index_dir='index/v4_pileval_llama', eos_token_id=tokenizer.eos_token_id) # please replace index_dir with the local directory where you store the index

1. Count an n-gram (or a CNF of multiple n-grams)

This query type counts the number of occurrences of an n-gram, or a CNF of multiple n-grams.

1.1 Count simple queries

With simple queries, the engine counts the number of occurrences of a single n-gram in the corpus.

For example, to find out the number of occurrences of n-gram natural language processing in the Pile-val corpus:

>>> input_ids = tokenizer.encode('natural language processing')
>>> input_ids
[5613, 4086, 9068]

>>> engine.count(input_ids=input_ids)
{'count': 76, 'approx': False}

The approx field indicates whether the count is approximate. For simple queries with a single n-gram term, this is always False (the count is always exact). As you will see later, count for complex queries may be approximate.

If you submit an empty query, the engine returns the total number of tokens in the corpus:

>>> engine.count(input_ids=[])
{'count': 393769120, 'approx': False}

1.2 Count CNF queries

You can make more complex queries by connecting multiple n-grams with AND/OR operators in conjunctive normal form (CNF), in which case the engine counts the number of times this logical constraint is satisfied in the corpus. A CNF query is a triply-nested list: the top level is a list of disjunctive clauses (which are connected with the AND operator); each disjunctive clause is a list of n-gram terms (which are connected with the OR operator); and each n-gram term has the same format as input_ids above, i.e., a list of token ids.

# natural language processing OR artificial intelligence
>>> cnf = [
...     [tokenizer.encode('natural language processing'), tokenizer.encode('artificial intelligence')]
... ]
>>> cnf
[[[5613, 4086, 9068], [23116, 21082]]]

>>> engine.count_cnf(cnf=cnf)
{'count': 499, 'approx': False}
# natural language processing AND deep learning
>>> cnf = [
...     [tokenizer.encode('natural language processing')],
...     [tokenizer.encode('deep learning')],
... ]
>>> cnf
[[[5613, 4086, 9068]], [[6483, 6509]]]

>>> engine.count_cnf(cnf=cnf)
{'count': 6, 'approx': False}
# (natural language processing OR artificial intelligence) AND deep learning
>>> cnf = [
...     [tokenizer.encode('natural language processing'), tokenizer.encode('artificial intelligence')],
...     [tokenizer.encode('deep learning')],
...     ]
>>> cnf
[[[5613, 4086, 9068], [23116, 21082]], [[6483, 6509]]]

>>> engine.count_cnf(cnf=cnf)
{'count': 19, 'approx': False}

Approximation: If the CNF query contains AND operator(s), the engine needs to enumerate all occurrences of each clause and pick the cases where they co-occur within a reasonable distance. This distance is controlled by the optional parameter max_diff_tokens, which has a default value of 100. If you increase this value, you may get more counts:

# natural language processing AND deep learning
>>> engine.count_cnf(cnf=[
...     [tokenizer.encode('natural language processing')],
...     [tokenizer.encode('deep learning')],
... ], max_diff_tokens=1000)
{'count': 14, 'approx': False}

However, if one of the clauses has a very high count, it is impractical to enumerate all its occurrences. Our solution is to take a subsample of its occurrences when the count is higher than a threshold, which is controlled by the optional parameter max_clause_freq and has a default value of 50000. When subsampling happens on any of the clauses, the count will be reported as approximate:

>>> engine.count(input_ids=tokenizer.encode('this'))
{'count': 739845, 'approx': False}
>>> engine.count(input_ids=tokenizer.encode('that'))
{'count': 1866317, 'approx': False}

# this AND that
>>> engine.count_cnf(cnf=[[tokenizer.encode('this')], [tokenizer.encode('that')]])
{'count': 982128, 'approx': True}

Increasing this value yields a more accurate estimate of the count, and when this value is larger than (or equal to) the counts of all clauses, the count becomes exact:

>>> engine.count_cnf(cnf=[[tokenizer.encode('this')], [tokenizer.encode('that')]], max_clause_freq=500000)
{'count': 430527, 'approx': True}

>>> engine.count_cnf(cnf=[[tokenizer.encode('this')], [tokenizer.encode('that')]], max_clause_freq=2000000)
{'count': 480107, 'approx': False}
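
Following this rule, one way to guarantee an exact count when every clause has a single term is to first count each term and pass the maximum as max_clause_freq (a small sketch; it costs one extra count() call per term):

>>> cnf = [[tokenizer.encode('this')], [tokenizer.encode('that')]]
>>> max_cnt = max(engine.count(input_ids=term)['count'] for clause in cnf for term in clause)
>>> max_cnt
1866317
>>> engine.count_cnf(cnf=cnf, max_clause_freq=max_cnt)
{'count': 480107, 'approx': False}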

2. Prob of the last token

This query type computes the n-gram LM probability of a token conditioning on a preceding prompt.

For example, to compute P(processing | natural language):

>>> input_ids = tokenizer.encode('natural language processing')
>>> input_ids
[5613, 4086, 9068]

>>> engine.prob(prompt_ids=input_ids[:-1], cont_id=input_ids[-1])
{'prompt_cnt': 257, 'cont_cnt': 76, 'prob': 0.29571984435797666}

In this case, prompt_cnt is the count of the 2-gram natural language, cont_cnt is the count of the 3-gram natural language processing, and prob is the ratio of these two counts.
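
Indeed, dividing the two counts reproduces the reported probability:

>>> 76 / 257
0.29571984435797666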

If the prompt cannot be found in the corpus, the probability would be 0/0=NaN. In these cases we report prob = -1.0 to indicate an error:

>>> input_ids = tokenizer.encode('I love natural language processing')
>>> input_ids
[306, 5360, 5613, 4086, 9068]

>>> engine.prob(prompt_ids=input_ids[:-1], cont_id=input_ids[-1])
{'prompt_cnt': 0, 'cont_cnt': 0, 'prob': -1.0}

3. Next-token distribution

This query type computes the n-gram LM next-token distribution conditioning on a preceding prompt.

For example, this will return the token distribution following natural language:

>>> input_ids = tokenizer.encode('natural language')
>>> input_ids
[5613, 4086]

>>> engine.ntd(prompt_ids=input_ids)
{'prompt_cnt': 257, 'result_by_token_id': {13: {'cont_cnt': 1, 'prob': 0.0038910505836575876}, 297: {'cont_cnt': 1, 'prob': 0.0038910505836575876}, ..., 30003: {'cont_cnt': 1, 'prob': 0.0038910505836575876}}, 'approx': False}

result_by_token_id is a dict that maps token id to the probability of that token as a continuation of the prompt.
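
For instance, to pull out the most likely next token and its probability (a small sketch on top of the result above):

>>> result = engine.ntd(prompt_ids=input_ids)
>>> top_id, top_info = max(result['result_by_token_id'].items(), key=lambda kv: kv[1]['prob'])
>>> tokenizer.decode([top_id]), top_info['prob']   # the most likely continuation of 'natural language'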

If the prompt cannot be found in the corpus, you will get an empty distribution:

>>> input_ids = tokenizer.encode('I love natural language processing')
>>> input_ids
[306, 5360, 5613, 4086, 9068]

>>> engine.ntd(prompt_ids=input_ids[:-1])
{'prompt_cnt': 0, 'result_by_token_id': {}, 'approx': False}

Approximation: For each occurrence of the prompt, the engine needs to inspect the token appearing after it. This is time-consuming and infeasible when prompt_cnt is large. Once the prompt count crosses a threshold, the engine downsamples the number of cases it inspects, and the resulting distribution becomes approximate (as reflected in the approx field). This threshold is controlled by the optional parameter max_support, which has a default value of 1000. For example, to get the unigram token distribution, you can query with an empty prompt, and the result will be approximate:

>>> engine.ntd(prompt_ids=[])
{'prompt_cnt': 393769120, 'result_by_token_id': {12: {'cont_cnt': 1013873, 'prob': 0.00257479052699714}, 13: {'cont_cnt': 14333030, 'prob': 0.03639957851443506}, ..., 30934: {'cont_cnt': 489584, 'prob': 0.0012433275621003496}}, 'approx': True}

4. ∞-gram prob

This query type computes the ∞-gram LM probability of a token conditioning on a preceding prompt. It uses the longest suffix of the prompt that has a non-zero count in the corpus.

>>> input_ids = tokenizer.encode('I love natural language processing')
>>> input_ids
[306, 5360, 5613, 4086, 9068]

>>> engine.infgram_prob(prompt_ids=input_ids[:-1], cont_id=input_ids[-1])
{'prompt_cnt': 257, 'cont_cnt': 76, 'prob': 0.29571984435797666, 'suffix_len': 2}

The field suffix_len indicates the length (in tokens) of the longest suffix of the prompt that can be found in the corpus. In this case, since [5613, 4086] can be found in the corpus but [5360, 5613, 4086] cannot, the longest suffix is [5613, 4086], which has length 2.
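
You can confirm this with count queries: the 2-token suffix appears in the corpus, while the 3-token suffix does not:

>>> engine.count(input_ids=[5613, 4086])
{'count': 257, 'approx': False}
>>> engine.count(input_ids=[5360, 5613, 4086])
{'count': 0, 'approx': False}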

5. ∞-gram next-token distribution

This query type computes the ∞-gram LM next-token distribution conditioning on a preceding prompt.

>>> input_ids = tokenizer.encode('I love natural language')
>>> input_ids
[306, 5360, 5613, 4086]

>>> engine.infgram_ntd(prompt_ids=input_ids, max_support=10)
{'prompt_cnt': 257, 'result_by_token_id': {297: {'cont_cnt': 32, 'prob': 0.1245136186770428}, 470: {'cont_cnt': 32, 'prob': 0.1245136186770428}, 508: {'cont_cnt': 1, 'prob': 0.0038910505836575876}, 8004: {'cont_cnt': 32, 'prob': 0.1245136186770428}, 9068: {'cont_cnt': 96, 'prob': 0.3735408560311284}, 24481: {'cont_cnt': 32, 'prob': 0.1245136186770428}, 29889: {'cont_cnt': 32, 'prob': 0.1245136186770428}}, 'approx': True, 'suffix_len': 2}

6. Search documents

This query type returns documents in the corpus that match your query.

6.1 Search with simple queries

With simple queries, the engine can return documents containing a single n-gram.

First, you need to call find() to get information about where the matching documents are located.

>>> input_ids = tokenizer.encode('natural language processing')
>>> input_ids
[5613, 4086, 9068]

>>> engine.find(input_ids=input_ids)
{'cnt': 76, 'segment_by_shard': [(365362993, 365363069)]}

The returned segment_by_shard is a list of 2-tuples; each tuple represents a range of “ranks” in one of the shards of the index, and each rank can be traced back to a matched document in that shard. The length of this list is equal to the total number of shards. For example, to retrieve the first matched document in shard 0, you can do

>>> engine.get_doc_by_rank(s=0, rank=365362993, max_disp_len=10)
{'doc_ix': 47865, 'doc_len': 12932, 'disp_len': 10, 'metadata': '', 'token_ids': [363, 5164, 11976, 1316, 408, 5613, 4086, 9068, 518, 29992]}

The returned dict represents a document. You can see that the query input_ids [5613, 4086, 9068] is present in this document.
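
To view the matched context as text, you can decode the returned token ids:

>>> doc = engine.get_doc_by_rank(s=0, rank=365362993, max_disp_len=10)
>>> text = tokenizer.decode(doc['token_ids'])   # the text surrounding the match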

The ranges are left-inclusive and right-exclusive. To enumerate all documents, you can do something like

>>> find_result = engine.find(input_ids=input_ids)
>>> for s, (start, end) in enumerate(find_result['segment_by_shard']):
...     for rank in range(start, end):
...         doc = engine.get_doc_by_rank(s=s, rank=rank)

6.2 Search with CNF queries

With CNF queries, the engine can return documents that satisfy the logical constraint specified in the CNF.

You need to first call find_cnf() which returns locations of matching documents in a different protocol:

# natural language processing AND deep learning
>>> cnf = [
...     [tokenizer.encode('natural language processing')],
...     [tokenizer.encode('deep learning')],
... ]
>>> cnf
[[[5613, 4086, 9068]], [[6483, 6509]]]

>>> engine.find_cnf(cnf=cnf)
{'cnt': 6, 'approx': False, 'ptrs_by_shard': [[717544382, 377178100, 706194108, 25563710, 250933686, 706194476]]}

Note that the returned field is not segment_by_shard but rather ptrs_by_shard. For each shard, instead of a range of “ranks”, we now get a list of “pointers”, and each pointer can be traced back to a matched document in that shard of the index. The length of the outer list is equal to the total number of shards. To get documents with these pointers, you need to call a different helper function:

# Get the document at pointer #2 in shard 0
>>> engine.get_doc_by_ptr(s=0, ptr=706194108, max_disp_len=20)
{'doc_ix': 191568, 'doc_len': 3171, 'disp_len': 20, 'metadata': '', 'token_ids': [29889, 450, 1034, 13364, 508, 367, 4340, 1304, 304, 7945, 6483, 6509, 2729, 5613, 4086, 9068, 9595, 1316, 408, 10013]}

You can see that both [5613, 4086, 9068] and [6483, 6509] are present in this document. (For illustration I use a small max_disp_len; since the default max_diff_tokens = 100, you might need to increase max_disp_len to see the document covering all clauses in the CNF query.)

To enumerate all documents, you can do something like

>>> find_result = engine.find_cnf(cnf=cnf)
>>> for s, ptrs in enumerate(find_result['ptrs_by_shard']):
...     for ptr in ptrs:
...         doc = engine.get_doc_by_ptr(s=s, ptr=ptr)