tantivy

Python bindings for the search engine library Tantivy.

Tantivy is a full-text search engine library written in Rust.

It is closer to Apache Lucene than to Elasticsearch and Apache Solr in the sense that it is not an off-the-shelf search engine server, but rather a library that can be used to build such a search engine. Tantivy is, in fact, strongly inspired by Lucene's design.

Example:
>>> import json
>>> import tantivy
>>> builder = tantivy.SchemaBuilder()
>>> builder.add_text_field("title", stored=True)
>>> builder.add_text_field("body")
>>> schema = builder.build()
>>> index = tantivy.Index(schema)
>>> writer = index.writer()
>>> doc = tantivy.Document()
>>> doc.add_text("title", "The Old Man and the Sea")
>>> doc.add_text("body", ("He was an old man who fished alone in a "
...                       "skiff in the Gulf Stream and he had gone "
...                       "eighty-four days now without taking a fish."))
>>> writer.add_document(doc)
>>> writer.add_json(json.dumps({
...     "title": ["Frankenstein", "The Modern Prometheus"],
...     "body": ("You will rejoice to hear that no disaster has "
...              "accompanied the commencement of an enterprise which "
...              "you have regarded with such evil forebodings.  "
...              "I arrived here yesterday, and my first task is to "
...              "assure my dear sister of my welfare and increasing "
...              "confidence in the success of my undertaking.")
... }))
>>> writer.commit()
>>> index.reload()
>>> searcher = index.searcher()
>>> query = index.parse_query("sea whale", ["title", "body"])
>>> result = searcher.search(query, 10)
>>> assert len(result.hits) == 1

version

__version__: str = 'tantivy v0.26.0, index_format v7'

DocAddress

class DocAddress:

DocAddress contains all the necessary information to identify a document given a Searcher object.

It consists of an id identifying its segment and its segment-local DocId. The id used for the segment is actually an ordinal in the list of segments held by a Searcher.

doc: int

The segment-local DocId.

segment_ord: int

The segment ordinal is an id identifying the segment hosting the document. It is only meaningful in the context of a searcher.
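
As a rough illustration of the two-part address, the sketch below models what a searcher holds as a plain list of segments, each a list of documents. The `Address` type, the `resolve` helper and the sample data are hypothetical and are not part of the tantivy API; only the names `segment_ord` and `doc` come from the class above.

```python
from typing import NamedTuple


class Address(NamedTuple):
    """Hypothetical stand-in for DocAddress: (segment ordinal, segment-local id)."""
    segment_ord: int
    doc: int


def resolve(segments: list[list[str]], address: Address) -> str:
    """Look up a document by picking the segment first, then the local id."""
    return segments[address.segment_ord][address.doc]


# Two segments as one searcher might see them. Another searcher could hold a
# different list, which is why an address is only meaningful for its searcher.
segments = [["doc-a", "doc-b"], ["doc-c"]]
found = resolve(segments, Address(segment_ord=1, doc=0))
```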

Document

class Document:

Tantivy's Document is the object that can be indexed and then searched for.

Documents are fundamentally a collection of unordered tuples (field_name, value). In this list, one field may appear more than once.

Example:
>>> doc = tantivy.Document()
>>> doc.add_text("title", "The Old Man and the Sea")
>>> doc.add_text("body", ("He was an old man who fished alone in a "
...                       "skiff in the Gulf Stream and he had gone "
...                       "eighty-four days now without taking a fish."))
>>> doc
Document(body=[He was an ],title=[The Old Ma])

For simplicity, it is also possible to build a Document by passing the field values directly as constructor arguments.

Example:
>>> doc = tantivy.Document(title=["The Old Man and the Sea"], body=["..."])

As syntactic sugar, tantivy also allows the user to pass a single value if there is only one. In other words, the following is also legal.

Example:
>>> doc = tantivy.Document(title="The Old Man and the Sea", body="...")

For numeric fields, the Document constructor does not have any information about the type and will try to guess it. Therefore, it is recommended to use Document.from_dict(), Document.extend(), or the Document.add_*() functions to provide explicit type information.

Example:
>>> schema = (
...     tantivy.SchemaBuilder()
...         .add_unsigned_field("unsigned")
...         .add_integer_field("signed")
...         .add_float_field("float")
...         .build()
... )
>>> doc = tantivy.Document.from_dict(
...     {"unsigned": 1000, "signed": -5, "float": 0.4},
...     schema,
... )
def add_boolean(self, field_name: str, value: bool) -> None:

Add a boolean value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the value.
  • value (bool): The boolean that will be added to the document.
def add_bytes(self, field_name: str, bytes: bytes) -> None:

Add a bytes value to the document.

Arguments:
  • field_name (str): The field for which we are adding the bytes.
  • bytes (bytes): The bytes that will be added to the document.
def add_date(self, field_name: str, value: datetime.datetime) -> None:

Add a date value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the date.
  • value (datetime): The date that will be added to the document.
def add_facet(self, field_name: str, facet: Facet) -> None:

Add a facet value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the facet.
  • facet (Facet): The Facet that will be added to the document.
def add_float(self, field_name: str, value: float) -> None:

Add a float value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the value.
  • value (float): The float that will be added to the document.
def add_integer(self, field_name: str, value: int) -> None:

Add a signed integer value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the integer.
  • value (int): The integer that will be added to the document.
def add_ip_addr(self, field_name: str, ip_addr: str) -> None:

Add an IP address value to the document.

Arguments:
  • field_name (str): The field for which we are adding the IP address.
  • ip_addr (str): The IP address that will be added to the document.

Raises a ValueError if the IP address is invalid.

def add_json(self, field_name: str, value: Any) -> None:

Add a JSON value to the document.

Arguments:
  • field_name (str): The field for which we are adding the JSON.
  • value (str | Dict[str, Any]): The JSON object that will be added to the document.

Raises a ValueError if the JSON is invalid.

def add_text(self, field_name: str, text: str) -> None:

Add a text value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the text.
  • text (str): The text that will be added to the document.
def add_unsigned(self, field_name: str, value: int) -> None:

Add an unsigned integer value to the document.

Arguments:
  • field_name (str): The field name for which we are adding the unsigned integer.
  • value (int): The integer that will be added to the document.
def extend(self, py_dict: dict, schema: Schema | None) -> None:

Add the field values in py_dict to this document. When a schema is given, it provides the type information for each field.

def from_dict( py_dict: dict, schema: Schema | None = None) -> Document:

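Build a Document from a dictionary mapping field names to values. When a schema is given, it provides the type information for each field.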

def get_all(self, field_name: str) -> list[typing.Any]:

Get all the values associated with the given field.

Arguments:
  • field_name (str): The name of the field for which we would like to get the values.

Returns a list of values. The type of the value depends on the field.

def get_first(self, field_name: str) -> Any | None:

Get the first value associated with the given field.

Arguments:
  • field_name (str): The name of the field for which we would like to get the value.

Returns the value if one is found, otherwise None. The type of the value depends on the field.

is_empty: bool

True if the document is empty, False otherwise.

num_fields: int

The number of fields that have been added to the document.

def to_dict(self) -> dict[str, list[typing.Any]]:

Returns a dictionary with the different field values.

In tantivy, a Document can hold multiple values for a single field.

For this reason, the dictionary associates a list of values with every field.
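
The shape of that dictionary can be pictured with a small standalone sketch. It does not call tantivy; `group_values` is a hypothetical helper that only shows how a collection of (field_name, value) pairs, where a field may repeat, ends up as one list per field.

```python
from collections import defaultdict
from typing import Any


def group_values(pairs: list[tuple[str, Any]]) -> dict[str, list[Any]]:
    """Group (field_name, value) pairs into a dict holding one list per field."""
    grouped = defaultdict(list)
    for field_name, value in pairs:
        grouped[field_name].append(value)
    return dict(grouped)


pairs = [
    ("title", "Frankenstein"),
    ("title", "The Modern Prometheus"),  # same field, second value
    ("year", 1818),
]
as_dict = group_values(pairs)
```

Even a field with a single value, such as `year` here, is wrapped in a list, which matches the `dict[str, list[Any]]` return type.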

Explanation

class Explanation:

Represents an explanation of how a document matched a query.

def to_json(self) -> str:

Returns a JSON representation of the explanation.

Facet

class Facet:

A Facet represents a point in a given hierarchy.

They are typically represented similarly to a filepath. For instance, an e-commerce website could have a Facet for /electronics/tv_and_video/led_tv.

A document can be associated with any number of facets. The hierarchy implicitly implies that a document belonging to a facet also belongs to the ancestors of that facet. In the example above, the document also belongs to /electronics/tv_and_video and /electronics.
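
The implied ancestor relationship can be sketched on plain path strings. The helpers below are hypothetical and do not use the Facet class; they assume that facets are compared segment by segment and that a facet does not count as its own ancestor.

```python
def facet_ancestors(facet_path: str) -> list[str]:
    """Return every proper ancestor of a facet path, nearest first."""
    segments = [s for s in facet_path.split("/") if s]
    return ["/" + "/".join(segments[:i]) for i in range(len(segments) - 1, 0, -1)]


def is_prefix(parent: str, child: str) -> bool:
    """True if `parent` is a strict ancestor of `child`, compared by segment."""
    p = [s for s in parent.split("/") if s]
    c = [s for s in child.split("/") if s]
    return len(p) < len(c) and c[:len(p)] == p


ancestors = facet_ancestors("/electronics/tv_and_video/led_tv")
```

Comparing whole segments rather than raw string prefixes is what keeps a path such as /electronics/tv from being treated as an ancestor of /electronics/tv_and_video.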

def from_encoded(encoded_bytes: bytes) -> Facet:

Creates a Facet from its binary representation.

def from_string(cls, facet_string: str) -> Facet:

Create a Facet object from a string.

Arguments:
  • facet_string (str): The string that contains a facet.

Returns the created Facet.

def is_prefix_of(self, other: Facet) -> bool:

Returns true if another Facet is a subfacet of this facet.

Arguments:
  • other (Facet): The Facet to test as a possible subfacet of this facet.
is_root: bool

Returns true if the facet is the root facet /.

def root(cls) -> Facet:

Create a new instance of the "root facet". Equivalent to /.

def to_path(self) -> list[str]:

Returns the list of segments that forms a facet path.

For instance, /europe/france becomes ["europe", "france"].

def to_path_str(self) -> str:

Returns the facet string representation.

FieldType

class FieldType:

Tantivy's Type

Boolean = FieldType.Boolean
Bytes = FieldType.Bytes
Facet = FieldType.Facet
Float = FieldType.Float
Integer = FieldType.Integer
IpAddr = FieldType.IpAddr
Unsigned = FieldType.Unsigned

Filter

class Filter:

All Tantivy's builtin TokenFilters.

Example

filter = Filter.alphanum_only()

Usage

In general, filter objects exist to be passed to the filter() method of a TextAnalyzerBuilder instance.

https://docs.rs/tantivy/latest/tantivy/tokenizer/index.html

def alphanum_only() -> Filter:

AlphaNumOnlyFilter

def ascii_fold() -> Filter:

AsciiFoldingFilter

def custom_stopword(stopwords: list[str]) -> Filter:

StopWordFilter (user-provided stop word list)

This variant of Filter.stopword() lets you provide your own custom list of stopwords.

Args:

  • stopwords (list[str]): a list of words to be removed.
def lowercase() -> Filter:

LowerCaser

def remove_long(length_limit: int) -> Filter:

RemoveLongFilter

Args:

  • length_limit (int): max character length of token.
def split_compound(constituent_words: list[str]) -> Filter:

SplitCompoundWords

https://docs.rs/tantivy/latest/tantivy/tokenizer/struct.SplitCompoundWords.html

Args:

  • constituent_words (list[str]): words that make up the compound word (must be in order).

Example:

# useless, contrived example:
compound_splitter = Filter.split_compound(['butter', 'fly'])
# Will split 'butterfly' -> ['butter', 'fly'],
# but won't split 'buttering' or 'buttercupfly'
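
The behaviour described in the comments above can be modelled in plain Python. This is a simplified sketch under the assumption that a token is split only when it decomposes entirely into the given words; the real filter's matching strategy may differ, and `split_compound_model` is a hypothetical name.

```python
from typing import Optional


def split_compound_model(token: str, constituents: list[str]) -> list[str]:
    """Split `token` only if it decomposes entirely into the given words."""
    words = set(constituents)

    def decompose(rest: str) -> Optional[list[str]]:
        if not rest:
            return []
        for end in range(len(rest), 0, -1):  # try the longest prefix first
            if rest[:end] in words:
                tail = decompose(rest[end:])
                if tail is not None:
                    return [rest[:end]] + tail
        return None  # some part of the token is not a known word

    parts = decompose(token)
    return parts if parts else [token]


pieces = split_compound_model("butterfly", ["butter", "fly"])
```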
def stemmer(language: str) -> Filter:

Stemmer

def stopword(language: str) -> Filter:

StopWordFilter (builtin stop word list)

Args:

  • language (string): Stop words list language. Valid values: { "arabic", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", "turkish" }

Index

class Index:

Create a new index object.

Arguments:
  • schema (Schema): The schema of the index.
  • path (str, optional): The path where the index should be stored. If no path is provided, the index will be stored in memory.
  • reuse (bool, optional): Should we open an existing index if one exists or always create a new one.

If an index already exists it will be opened and reused. Raises OSError if there was a problem during the opening or creation of the index.

def config_reader(self, reload_policy: str = 'commit', num_warmers: int = 0) -> None:

Configure the index reader.

Arguments:
  • reload_policy (str, optional): The reload policy that the IndexReader should use. Can be "manual" or "commit" (the default).
  • num_warmers (int, optional): The number of searchers that the reader should create.
def exists(path: str) -> bool:

Check if the given path contains an existing index.

Arguments:
  • path: The path where tantivy will search for an index.

Returns True if an index exists at the given path, False otherwise.

Raises OSError if the directory cannot be opened.

def open(path: str) -> Index:

Open the existing index stored at the given path.

def parse_query( self, query: str, default_field_names: list[str] | None = None, field_boosts: dict[str, float] | None = None, fuzzy_fields: dict[str, tuple[bool, int, bool]] | None = None, conjunction_by_default: bool = False, allow_regexes: bool = False) -> Query:

Parse a query

Arguments:
  • query: the query, following the tantivy query language.
  • default_field_names (list[str]): A list of field names used to search if no field is specified in the query.
  • field_boosts: A dictionary keyed on field names which provides default boosts for the query constructed by this method.
  • fuzzy_fields: A dictionary keyed on field names which provides (prefix, distance, transpose_cost_one) triples making queries constructed by this method fuzzy against the given fields and using the given parameters. prefix determines if terms which are prefixes of the given term match the query. distance determines the maximum Levenshtein distance between terms matching the query and the given term. transpose_cost_one determines if transpositions of neighbouring characters are counted only once against the Levenshtein distance.
  • conjunction_by_default: If true, the query will be parsed as a conjunction query. Defaults to a disjunction query.
  • allow_regexes: If true, allow regexes in queries.
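
To make the distance and transpose_cost_one parameters concrete, here is a brute-force sketch of the edit distance they refer to. It is an illustration only: `edit_distance` is a hypothetical helper, and nothing here is meant to describe how tantivy evaluates fuzzy queries internally.

```python
def edit_distance(a: str, b: str, transpose_cost_one: bool = True) -> int:
    """Levenshtein distance; optionally count an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if (transpose_cost_one and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


swapped = edit_distance("whale", "wahle", transpose_cost_one=True)
```

With a maximum distance of 1, the misspelling "wahle" would be within reach of "whale" only when the swap is counted once.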
def parse_query_lenient( self, query: str, default_field_names: list[str] | None = None, field_boosts: dict[str, float] | None = None, fuzzy_fields: dict[str, tuple[bool, int, bool]] | None = None, conjunction_by_default: bool = False, allow_regexes: bool = False) -> tuple[Query, list[typing.Any]]:

Parse a query leniently.

This variant parses an invalid query on a best-effort basis. If some part of the query can't reasonably be executed (a range query without a field, a search on a nonexistent field, a search that does not specify a field when no default field is provided...), it may get turned into a "match-nothing" subquery.

Arguments:
  • query: the query, following the tantivy query language.
  • default_field_names (list[str]): A list of field names used to search if no field is specified in the query.
  • field_boosts: A dictionary keyed on field names which provides default boosts for the query constructed by this method.
  • fuzzy_fields: A dictionary keyed on field names which provides (prefix, distance, transpose_cost_one) triples making queries constructed by this method fuzzy against the given fields and using the given parameters. prefix determines if terms which are prefixes of the given term match the query. distance determines the maximum Levenshtein distance between terms matching the query and the given term. transpose_cost_one determines if transpositions of neighbouring characters are counted only once against the Levenshtein distance.
  • conjunction_by_default: If true, the query will be parsed as a conjunction query. Defaults to a disjunction query.
  • allow_regexes: If true, allow regexes in queries.

Returns a tuple containing the parsed query and a list of errors.

Raises ValueError if a field in default_field_names is not defined or marked as indexed.

def register_fast_field_tokenizer(self, name: str, text_analyzer: TextAnalyzer) -> None:

Register a custom text analyzer for fast fields by name. (Confusingly, this is one of the places where Tantivy uses 'tokenizer' to refer to a TextAnalyzer instance.)

def register_tokenizer(self, name: str, text_analyzer: TextAnalyzer) -> None:

Register a custom text analyzer by name. (Confusingly, this is one of the places where Tantivy uses 'tokenizer' to refer to a TextAnalyzer instance.)

def reload(self) -> None:

Update searchers so that they reflect the state of the last .commit().

If you set up the reload policy to be on 'commit' (which is the default), every commit should be rapidly reflected on your IndexReader and you should not need to call reload() at all.

schema: Schema

The schema of the current index.

def searcher(self) -> Searcher:

Returns a searcher

This method should be called every single time a search query is performed. The same searcher must be used for a given query, as it ensures the use of a consistent segment set.

def writer( self, heap_size: int = 128000000, num_threads: int = 0) -> IndexWriter:

Create an IndexWriter for the index.

The writer will be multithreaded and the provided heap size will be split between the given number of threads.

Arguments:
  • heap_size (int, optional): The total target heap memory usage of the writer. Tantivy requires that this can't be less than 3000000 per thread. Lower values will result in more frequent internal commits when adding documents (slowing down write progress), and larger values will result in fewer commits but greater memory usage. The best value will depend on your specific use case.
  • num_threads (int, optional): The number of threads that the writer should use. If this value is 0, tantivy will choose automatically the number of threads.

Raises ValueError if there was an error while creating the writer.
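
The relationship between the two arguments can be checked with some quick arithmetic. The sketch below assumes the heap is split evenly between threads and reuses the 3000000 lower bound quoted above; `per_thread_budget` is a hypothetical helper, not a tantivy function.

```python
MIN_HEAP_PER_THREAD = 3_000_000  # lower bound per thread quoted above


def per_thread_budget(heap_size: int, num_threads: int) -> int:
    """Split the total heap evenly and reject budgets under the documented minimum."""
    if num_threads < 1:
        # 0 means "let tantivy choose", so there is nothing to compute here.
        raise ValueError("pass an explicit, positive thread count")
    budget = heap_size // num_threads
    if budget < MIN_HEAP_PER_THREAD:
        raise ValueError(f"{budget} per thread is below {MIN_HEAP_PER_THREAD}")
    return budget


budget = per_thread_budget(128_000_000, 8)
```

Under these assumptions the default heap_size leaves room for a few dozen threads before the per-thread minimum becomes the limiting factor.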

IndexWriter

class IndexWriter:

IndexWriter is the user entry-point to add documents to the index.

To create an IndexWriter first create an Index and call the writer() method on the index object.

def add_document(self, doc: Document) -> int:

Add a document to the index.

If the indexing pipeline is full, this call may block.

Returns an opstamp, which is an increasing integer that can be used by the client to align commits with its own document queue. The opstamp represents the number of documents that have been added since the creation of the index.

def add_json(self, json: str) -> int:

Helper for the add_document method, but passing a json string.

If the indexing pipeline is full, this call may block.

Returns an opstamp, which is an increasing integer that can be used by the client to align commits with its own document queue. The opstamp represents the number of documents that have been added since the creation of the index.

def commit(self) -> int:

Commits all of the pending changes

A call to commit blocks. After it returns, all of the documents that were added since the last commit are published and persisted.

In case of a crash or a hardware failure (as long as the hard disk is spared), it will be possible to resume indexing from this point.

Returns the opstamp of the last document that made it in the commit.

commit_opstamp: int

The opstamp of the last successful commit.

This is the opstamp the index will roll back to if there is a failure like a power surge.

This is also the opstamp of the commit that is currently available for searchers.

def delete_all_documents(self) -> None:

Deletes all documents from the index.

def delete_documents(self, field_name: str, field_value: Any) -> int:

def delete_documents_by_query(self, query: Query) -> int:

Delete all documents matching a given query.

Example:
import json
from tantivy import Index, SchemaBuilder

schema_builder = SchemaBuilder()
schema_builder.add_text_field("title", fast=True)
schema = schema_builder.build()
index = Index(schema)
writer = index.writer()
source_doc = {
    "title": "Here is some text"
}
writer.add_json(json.dumps(source_doc))
writer.commit()
writer.wait_merging_threads()

query = index.parse_query("title:text")
writer = index.writer()
writer.delete_documents_by_query(query)
writer.commit()
writer.wait_merging_threads()
Arguments:
  • query (Query): The query to filter the deleted documents.

Raises ValueError if the query is not valid, and Exception if the query is not supported.

def delete_documents_by_term(self, field_name: str, field_value: Any) -> int:

Delete all documents containing a given term.

This method does not parse the given term; it expects the term to be already tokenized according to any tokenizers attached to the field. This can often result in surprising behaviour. For example, if you store UUIDs as text in a field, those values have hyphens, and you use the default tokenizer (which removes punctuation), you will not be able to delete a document added with a particular UUID by passing the same UUID to this method. In workflows where deletions are required, particularly with string values, it is strongly recommended to use the "raw" tokenizer, as this will match exactly. In situations where you do want tokenization to be applied, it is recommended to use the delete_documents_by_query method instead, which will delete documents matching the given query using the same query parser as used in search queries.

Arguments:
  • field_name (str): The field name for which we want to filter deleted docs.
  • field_value (Any): Python object with the value we want to filter.

Raises ValueError if the field_name is not in the schema, and Exception if the field_value is not supported.
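
The UUID pitfall described above can be reproduced without an index. The tokenizer below is a rough model (split on punctuation, lowercase) and an assumption about what a default-style tokenizer does to such a value; it is not the library's implementation.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Rough model of a punctuation-splitting, lowercasing tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


uuid = "123e4567-e89b-12d3-a456-426614174000"
indexed_terms = set(simple_tokenize(uuid))  # terms a tokenized field would hold
raw_terms = {uuid}                          # the single term a "raw" field would hold

# Deleting by term is an exact lookup among the indexed terms.
found_in_tokenized = uuid in indexed_terms
found_in_raw = uuid in raw_terms
```

In this model the full UUID is never a term of the tokenized field, so an exact term lookup finds nothing, while the raw field holds exactly the value that was added.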

def garbage_collect_files(self) -> None:

Detects and removes the files that are not used by the index anymore.

def rollback(self) -> int:

Rollback to the last commit

This cancels all of the updates that happened after the last commit. After calling rollback, the index is in the same state as it was after the last commit.

def wait_merging_threads(self) -> None:

If there are some merging threads, blocks until they all finish their work and then drops the IndexWriter.

This will consume the IndexWriter. Further accesses to the object will result in an error.

Occur

class Occur:

Tantivy's Occur

Must = Occur.Must
MustNot = Occur.MustNot
Should = Occur.Should

Order

class Order:

Enum representing the direction in which something should be sorted.

Asc = Order.Asc
Desc = Order.Desc

parse_query

def parse_query(query: str) -> dict[str, typing.Any]:

Parse a query string into an abstract syntax tree (AST).

Arguments:
  • query: The query string to parse.
Returns:

A dictionary representing the parsed query AST.

Raises:
  • ValueError: If the query has invalid syntax.

parse_query_lenient

def parse_query_lenient(query: str) -> tuple[dict[str, typing.Any], list[dict[str, typing.Any]]]:

Parse a query string leniently, recovering from syntax errors.

Arguments:
  • query: The query string to parse.
Returns:

A tuple containing:
  • A dictionary representing the parsed query AST.
  • A list of error dictionaries describing syntax errors.

Query

class Query:

Tantivy's Query

def all_query() -> Query:

Construct a Tantivy's AllQuery

def and_must_match(self, *queries: Query) -> Query:

Convenience method to combine queries with AND (MUST) logic. Returns a query matching documents that match this query and every given query. Accepts any number of queries, so a list can be passed with argument unpacking: query.and_must_match(*queries).

def and_must_not_match(self, *queries: Query) -> Query:

Convenience method to combine queries with AND NOT (MUST NOT) logic. Returns a query matching documents that match this query and none of the given queries. Accepts any number of queries, so a list can be passed with argument unpacking: query.and_must_not_match(*queries).

def boolean_query( subqueries: Sequence[tuple[Occur, Query]], minimum_number_should_match: int | None = None) -> Query:

Construct a Tantivy's BooleanQuery
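
How Must, MustNot and Should clauses combine can be sketched with sets of document ids. This is a simplified model that ignores scoring and minimum_number_should_match, and it assumes the usual convention that Should clauses become optional once a Must clause is present; `boolean_match` and the string labels are hypothetical.

```python
def boolean_match(clauses: list[tuple[str, set[int]]]) -> set[int]:
    """Combine (occur, matching doc ids) clauses into the final matching set."""
    must = [docs for occur, docs in clauses if occur == "must"]
    should = [docs for occur, docs in clauses if occur == "should"]
    must_not = [docs for occur, docs in clauses if occur == "must_not"]

    if must:
        matched = set.intersection(*must)  # every MUST clause has to match
    elif should:
        matched = set.union(*should)       # otherwise any SHOULD clause suffices
    else:
        matched = set()                    # MUST NOT alone matches nothing
    for docs in must_not:
        matched -= docs                    # MUST NOT always excludes
    return matched


hits = boolean_match([("must", {1, 2, 3}), ("must", {2, 3, 4}), ("must_not", {3})])
```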

def boost_query(query: Query, boost: float) -> Query:

Construct a Tantivy's BoostQuery

def const_score_query(query: Query, score: float) -> Query:

Construct a Tantivy's ConstScoreQuery

def disjunction_max_query( subqueries: Sequence[Query], tie_breaker: float | None = None) -> Query:

Construct a Tantivy's DisjunctionMaxQuery

def empty_query() -> Query:

Construct a Tantivy's EmptyQuery

def exists_query( fast_field_name: str, json_subpaths: bool = False) -> Query:

Construct a Tantivy's ExistsQuery. Executing a search with this query will fail if the specified field doesn't exist or is not a fast field.

Arguments

  • fast_field_name - Field name to be searched.
  • json_subpaths - If true, check all the subpaths inside a JSON field
def explain( self, searcher: Searcher, doc_address: DocAddress) -> Explanation:

Explain how this query matches a given document.

Arguments

  • searcher (Searcher): The searcher used to perform the search.
  • doc_address (DocAddress): The address of the document to explain.

Returns

  • Explanation: An object containing detailed information about how the document matched the query, with a to_json() method.
def fuzzy_term_query( schema: Schema, field_name: str, text: str, distance: int = 1, transposition_cost_one: bool = True, prefix=False) -> Query:

Construct a Tantivy's FuzzyTermQuery

Arguments

  • schema - Schema of the target index.
  • field_name - Field name to be searched.
  • text - String representation of the query term.
  • distance - (Optional) Edit distance you are going to allow. When not specified, the default is 1.
  • transposition_cost_one - (Optional) If true, a transposition (swapping) cost will be 1; otherwise it will be 2. When not specified, the default is true.
  • prefix - (Optional) If true, prefix levenshtein distance is applied. When not specified, the default is false.
def more_like_this_document_fields_query( schema: Schema, document_fields: dict[str, typing.Any | list[typing.Any]], min_doc_frequency: int | None = 5, max_doc_frequency: int | None = None, min_term_frequency: int | None = 2, max_query_terms: int | None = 25, min_word_length: int | None = None, max_word_length: int | None = None, boost_factor: float | None = 1.0, stop_words: list[str] = []) -> Query:

Construct a Tantivy's MoreLikeThisQuery from caller-provided field values.

def more_like_this_query( doc_address: DocAddress, min_doc_frequency: int | None = 5, max_doc_frequency: int | None = None, min_term_frequency: int | None = 2, max_query_terms: int | None = 25, min_word_length: int | None = None, max_word_length: int | None = None, boost_factor: float | None = 1.0, stop_words: list[str] = []) -> Query:

Construct a Tantivy's MoreLikeThisQuery from a document already in the index, identified by doc_address.

def or_should_match(self, *queries: Query) -> Query:

Convenience method to combine queries with OR (SHOULD) logic. Returns a query matching documents that match this query or any of the given queries. Accepts any number of queries, so a list can be passed with argument unpacking: query.or_should_match(*queries).

def phrase_prefix_query( schema: Schema, field_name: str, words: list[str | tuple[int, str]]) -> Query:

Construct a Tantivy's PhrasePrefixQuery with custom offsets

Arguments

  • schema - Schema of the target index.
  • field_name - Field name to be searched.
  • words - Word list that constructs the phrase. A word can be a term text or a pair of term text and its offset in the phrase.
def phrase_query( schema: Schema, field_name: str, words: list[str | tuple[int, str]], slop: int = 0) -> Query:

Construct a Tantivy's PhraseQuery with custom offsets and slop

Arguments

  • schema - Schema of the target index.
  • field_name - Field name to be searched.
  • words - Word list that constructs the phrase. A word can be a term text or a pair of term text and its offset in the phrase.
  • slop - (Optional) The number of gaps permitted between the words in the query phrase. Default is 0.
def range_query( schema: Schema, field_name: str, field_type: FieldType, lower_bound: ~_RangeType | None = None, upper_bound: ~_RangeType | None = None, include_lower: bool = True, include_upper: bool = True, use_inverted_index: bool = False) -> Query:

Construct a range query over a numeric, date, or IP address field.

Pass None for lower_bound or upper_bound to leave that side unbounded. Both bounds cannot be None; use Query.all_query() to match all documents. Setting include_lower or include_upper to False while the corresponding bound is None is an error—unbounded sides are always inclusive by definition.

Arguments

  • schema - Schema of the target index.
  • field_name - Field name to be searched.
  • field_type - Type of the field (FieldType.Integer, FieldType.Float, FieldType.Date, etc.).
  • lower_bound - Lower bound value, or None for unbounded.
  • upper_bound - Upper bound value, or None for unbounded.
  • include_lower - Whether the lower bound is inclusive. Must be left as True when lower_bound is None.
  • include_upper - Whether the upper bound is inclusive. Must be left as True when upper_bound is None.
  • use_inverted_index - If True, use an inverted index range query instead of a fast-field range query.
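
The rules for bounds and inclusivity can be summarised in a standalone sketch. Both helpers are hypothetical and do not call tantivy; the choice of ValueError is an assumption, since the text above only says these combinations are an error.

```python
from typing import Any, Optional


def check_range_args(lower_bound: Optional[Any], upper_bound: Optional[Any],
                     include_lower: bool = True, include_upper: bool = True) -> None:
    """Raise ValueError for the argument combinations ruled out above."""
    if lower_bound is None and upper_bound is None:
        raise ValueError("both bounds are None; use an all-documents query instead")
    if lower_bound is None and not include_lower:
        raise ValueError("an unbounded lower side cannot be exclusive")
    if upper_bound is None and not include_upper:
        raise ValueError("an unbounded upper side cannot be exclusive")


def in_range(value: Any, lower_bound: Optional[Any] = None,
             upper_bound: Optional[Any] = None,
             include_lower: bool = True, include_upper: bool = True) -> bool:
    """Model of which values a range with these settings would accept."""
    check_range_args(lower_bound, upper_bound, include_lower, include_upper)
    if lower_bound is not None:
        if value < lower_bound or (value == lower_bound and not include_lower):
            return False
    if upper_bound is not None:
        if value > upper_bound or (value == upper_bound and not include_upper):
            return False
    return True


at_lower_edge = in_range(5, lower_bound=5)
```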
def regex_phrase_query( schema: Schema, field_name: str, words: list[str | tuple[int, str]], slop: int = 0) -> Query:

Construct a Tantivy's RegexPhraseQuery with custom offsets and slop

Arguments

  • schema - Schema of the target index.
  • field_name - Field name to be searched.
  • words - Word list that constructs the phrase. A word can be a term text or a pair of term text and its offset in the phrase.
  • slop - (Optional) The number of gaps permitted between the words in the query phrase. Default is 0.
def regex_query( schema: Schema, field_name: str, regex_pattern: str) -> Query:

Construct a Tantivy's RegexQuery

def term_query( schema: Schema, field_name: str, field_value: Any, index_option: str = 'position') -> Query:

Construct a Tantivy's TermQuery

def term_set_query( schema: Schema, field_name: str, field_values: Sequence[Any]) -> Query:

Construct a Tantivy's TermSetQuery

Schema

class Schema:

Tantivy schema.

The schema is very strict. To build the schema the SchemaBuilder class is provided.

SchemaBuilder

class SchemaBuilder:

Tantivy has a very strict schema. You need to specify in advance whether a field is indexed or not, stored or not.

This is done by creating a schema object, and setting up the fields one by one.

Examples:
>>> builder = tantivy.SchemaBuilder()
>>> title = builder.add_text_field("title", stored=True)
>>> body = builder.add_text_field("body")
>>> schema = builder.build()
def add_boolean_field( self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False) -> SchemaBuilder:

Add a new boolean field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
  • indexed (bool, optional): If true sets the field to be indexed.
  • fast (bool, optional): Set the numeric options as a fast field. A fast field is column-oriented storage in tantivy. It is designed for fast random access to some document fields given a document id.

Returns the SchemaBuilder, so that calls can be chained. Raises a ValueError if there was an error with the field creation.
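
The phrase "column-oriented" can be pictured with plain Python containers. This is a conceptual sketch only and says nothing about tantivy's actual on-disk layout; the field names and values are made up.

```python
# Row-oriented view: one record per document.
rows = [
    {"in_stock": True, "price": 9.5},
    {"in_stock": False, "price": 20.0},
    {"in_stock": True, "price": 3.25},
]

# Column-oriented view: one array per field, indexed by document id.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def column_value(field_name: str, doc_id: int):
    """Random access to one field of one document without reading other fields."""
    return columns[field_name][doc_id]


price_of_doc_1 = column_value("price", 1)
```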

def add_bytes_field( self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False, index_option: str = 'position') -> SchemaBuilder:

Add a bytes field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
  • indexed (bool, optional): If true sets the field to be indexed.
  • fast (bool, optional): Set the bytes options as a fast field. A fast field is column-oriented storage in tantivy. It is designed for fast random access to some document fields given a document id.
def add_date_field( self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False) -> SchemaBuilder:

Add a new date field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
  • indexed (bool, optional): If true sets the field to be indexed.
  • fast (bool, optional): Set the date options as a fast field. A fast field is column-oriented storage in tantivy. It is designed for fast random access to some document fields given a document id.

Returns the SchemaBuilder, so that calls can be chained. Raises a ValueError if there was an error with the field creation.

def add_facet_field(self, name: str) -> SchemaBuilder:

Add a Facet field to the schema.

Arguments:
  • name (str): The name of the field.

def add_float_field(self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False) -> SchemaBuilder:

Add a new float field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true, the field is stored and its content can later be retrieved from a Searcher. Defaults to False.
  • indexed (bool, optional): If true, the field is indexed. Defaults to False.
  • fast (bool, optional): If true, the field is a fast field. Fast fields use column-oriented storage designed for fast random access to a field's value given a document id. Defaults to False.

Returns the associated field handle. Raises a ValueError if there was an error with the field creation.

def add_integer_field(self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False) -> SchemaBuilder:

Add a new signed integer field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true, the field is stored and its content can later be retrieved from a Searcher. Defaults to False.
  • indexed (bool, optional): If true, the field is indexed. Defaults to False.
  • fast (bool, optional): If true, the field is a fast field. Fast fields use column-oriented storage designed for fast random access to a field's value given a document id. Defaults to False.

Returns the associated field handle. Raises a ValueError if there was an error with the field creation.

def add_ip_addr_field(self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False) -> SchemaBuilder:

Add an IP address field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true, the field is stored and its content can later be retrieved from a Searcher. Defaults to False.
  • indexed (bool, optional): If true, the field is indexed. Defaults to False.
  • fast (bool, optional): If true, the field is a fast field. Fast fields use column-oriented storage designed for fast random access to a field's value given a document id. Defaults to False.

def add_json_field(self, name: str, stored: bool = False, tokenizer_name: str = 'default', index_option: str = 'position') -> SchemaBuilder:

Add a new json field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true, the field is stored and its content can later be retrieved from a Searcher. Defaults to False.
  • fast (bool, optional): If true, the field is a fast field (column-oriented storage). Text fast fields store term ids, and the fast field is multivalued. The "raw" tokenizer is recommended because it stores the original text unchanged; the "default" tokenizer stores terms in lower case, which is reflected in the dictionary.
  • tokenizer_name (str, optional): The name of the tokenizer used to process the field. Defaults to 'default'.
  • index_option (str, optional): Sets which information is indexed with the tokens. One of 'position', 'freq' or 'basic'. Defaults to 'position'. 'basic' records only the document id; 'freq' records the document id and the term frequency; 'position' records the document id, the term frequency and the positions of the term occurrences in the document.

Returns the associated field handle. Raises a ValueError if there was an error with the field creation.

def add_text_field(self, name: str, stored: bool = False, fast: bool = False, tokenizer_name: str = 'default', index_option: str = 'position') -> SchemaBuilder:

Add a new text field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true, the field is stored and its content can later be retrieved from a Searcher. Defaults to False.
  • fast (bool, optional): If true, the field is a fast field (column-oriented storage). Text fast fields store term ids, and the fast field is multivalued. The "raw" tokenizer is recommended because it stores the original text unchanged; the "default" tokenizer stores terms in lower case, which is reflected in the dictionary. Defaults to False.
  • tokenizer_name (str, optional): The name of the tokenizer used to process the field. Defaults to 'default'.
  • index_option (str, optional): Sets which information is indexed with the tokens. One of 'position', 'freq' or 'basic'. Defaults to 'position'. 'basic' records only the document id; 'freq' records the document id and the term frequency; 'position' records the document id, the term frequency and the positions of the term occurrences in the document.

Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
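
To make the three index_option levels concrete, here is a small pure-Python sketch of what a posting list could record at each level. It is an illustration only, not Tantivy's implementation: the helper name build_postings and the whitespace tokenization are assumptions made for this example.

```python
from collections import defaultdict


def build_postings(docs, index_option="position"):
    """Toy postings for whitespace-tokenized, lower-cased documents.

    docs maps a document id to its text. Each term maps to:
      'basic'    -> sorted list of document ids
      'freq'     -> {doc_id: term frequency}
      'position' -> {doc_id: [token positions]} (frequency is the list length)
    """
    if index_option not in ("basic", "freq", "position"):
        raise ValueError(f"unknown index_option: {index_option!r}")
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            positions[token][doc_id].append(pos)
    if index_option == "position":
        return {t: dict(p) for t, p in positions.items()}
    if index_option == "freq":
        return {t: {d: len(ps) for d, ps in p.items()} for t, p in positions.items()}
    return {t: sorted(p) for t, p in positions.items()}


docs = {0: "the old man and the sea", 1: "the sea"}
print(build_postings(docs, "basic")["the"])     # [0, 1]
print(build_postings(docs, "freq")["the"])      # {0: 2, 1: 1}
print(build_postings(docs, "position")["the"])  # {0: [0, 4], 1: [0]}
```

Each level strictly extends the previous one, so 'position' costs the most space but is the only level that carries enough information to check that terms occur next to each other.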

def add_unsigned_field(self, name: str, stored: bool = False, indexed: bool = False, fast: bool = False) -> SchemaBuilder:

Add a new unsigned integer field to the schema.

Arguments:
  • name (str): The name of the field.
  • stored (bool, optional): If true, the field is stored and its content can later be retrieved from a Searcher. Defaults to False.
  • indexed (bool, optional): If true, the field is indexed. Defaults to False.
  • fast (bool, optional): If true, the field is a fast field. Fast fields use column-oriented storage designed for fast random access to a field's value given a document id. Defaults to False.

Returns the associated field handle. Raises a ValueError if there was an error with the field creation.

def build(self) -> Schema:

Finalize the creation of a Schema.

Returns a Schema object. After this call, the SchemaBuilder can no longer be used.

def is_valid_field_name(name: str) -> bool:

Returns whether name is a valid field name.

Searcher

class Searcher:

Tantivy's Searcher class

A Searcher is used to search the index given a prepared Query.

def aggregate(self, query: Query, agg: dict) -> dict:

Execute an aggregation query and return the results as a dict.

Arguments:
  • query (Query): The query that filters the documents to aggregate over.
  • agg (dict): The aggregation specification as a Python dict.

Returns a dict containing the aggregation results.
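
As a sketch of what the agg argument might look like, the dict below assumes the Elasticsearch-style aggregation layout that Tantivy's aggregation module accepts. The field names "price" and "category" and the aggregation names are hypothetical, and both fields are assumed to be declared as fast fields.

```python
import json

# Hypothetical schema: "price" is a float fast field, "category" a text fast field.
agg = {
    # Metric aggregation: average of a numeric field.
    "avg_price": {"avg": {"field": "price"}},
    # Bucket aggregation: document counts per distinct term.
    "by_category": {"terms": {"field": "category", "size": 5}},
}

# The specification is plain data, so it survives a JSON round trip unchanged.
assert json.loads(json.dumps(agg)) == agg

# With a Searcher and a Query in hand, the call would then be:
#     result = searcher.aggregate(query, agg)
print(sorted(agg))  # ['avg_price', 'by_category']
```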

def cardinality(self, query: Query, field_name: str) -> float:

Returns the number of distinct values of field_name among the documents matching the query.

Arguments:
  • query (Query): The query that will be used for the search.
  • field_name (str): The field for which to compute the cardinality.

Returns the cardinality.

def doc(self, doc_address: DocAddress) -> Document:

Fetches a document from Tantivy's store given a DocAddress.

Arguments:
  • doc_address (DocAddress): The DocAddress that is associated with the document that we wish to fetch.

Returns the Document. Raises a ValueError if the document cannot be found.

def doc_freq(self, field_name: str, field_value: Any) -> int:

Return the overall number of documents containing the given term.

def fast_field_values(self, field_name: str, doc_addresses: list[DocAddress]) -> list[int | float | bool | None]:

Read a numeric fast field for a batch of DocAddresses without fetching stored documents.

Fast fields are column-oriented and support O(1) random access by segment-local DocId. Use this instead of doc().to_dict()[field] when you only need a single numeric field for many documents.

The field type is resolved from the schema automatically: u64 and i64 fields return Python int; f64 fields return Python float; bool fields return Python bool.

Arguments:
  • field_name: Name of a u64, i64, f64, or bool field declared with fast=True.
  • doc_addresses: List of DocAddress objects (e.g. from search().hits).

Returns:

A list of values in the same order as doc_addresses. None is returned for any address where the column is absent (e.g. a segment written before the field was added to the schema).

Raises:
  • ValueError: if the field does not exist, is not a fast field, or has an unsupported type (only u64, i64, f64, and bool are supported).
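
The lookup semantics described above (one value per address, in input order, with None where a segment has no column) can be pictured with a toy column store. This is only an illustrative model: Addr is a hypothetical stand-in for DocAddress, and read_column is not part of the library.

```python
from typing import NamedTuple


class Addr(NamedTuple):
    """Hypothetical stand-in for DocAddress: segment ordinal plus segment-local doc id."""
    segment_ord: int
    doc: int


def read_column(segments, field_name, addrs):
    """Return one value per address, or None when the segment lacks the column."""
    values = []
    for addr in addrs:
        column = segments[addr.segment_ord].get(field_name)
        values.append(None if column is None else column[addr.doc])
    return values


# Segment 0 was written before "rating" existed; segment 1 has the column.
segments = [
    {"year": [1952, 1818]},
    {"year": [1851], "rating": [4.5]},
]
addrs = [Addr(1, 0), Addr(0, 1), Addr(0, 0)]
print(read_column(segments, "year", addrs))    # [1851, 1818, 1952]
print(read_column(segments, "rating", addrs))  # [4.5, None, None]
```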

num_docs: int

Returns the overall number of documents in the index.

num_segments: int

Returns the number of segments in the index.

def search(self, query: Query, limit: int = 10, count: bool = True, order_by_field: str | None = None, offset: int = 0, order: Order = Order.Desc, weight_by_field: str | None = None) -> SearchResult:

Search the index with the given query and collect results.

Arguments:
  • query (Query): The query that will be used for the search.
  • limit (int, optional): The maximum number of search results to return. Defaults to 10.
  • count (bool, optional): Whether the number of documents matching the query should also be returned. Defaults to True.
  • order_by_field (str, optional): Name of a field that the results should be ordered by. The field must be declared as a fast field when building the schema. Supported field types: Text, Unsigned, Integer, Float, Boolean and Date.
  • offset (int, optional): The number of leading results to skip before results are returned. Defaults to 0.
  • order (Order, optional): The order in which the results should be sorted. If not specified, defaults to descending.
  • weight_by_field (str, optional): Name of a field that the results should be weighted by. The field must be declared as a fast field when building the schema. Note, this only works for Float, Integer and Unsigned fields. The given field value is first transformed using the formula log2(2.0 + value) and then multiplied with the original score. This means that a weight field value of 0.0 results in no change to the original score. If the weight value is negative, it is treated as 0.0.
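
The weighting rule for weight_by_field can be checked by hand. The helper below is a standalone restatement of the documented formula, not a library function; the name weighted_score is an assumption for this example.

```python
import math


def weighted_score(score: float, weight_value: float) -> float:
    """Apply score * log2(2.0 + value), treating negative values as 0.0."""
    value = max(weight_value, 0.0)
    return score * math.log2(2.0 + value)


print(weighted_score(1.5, 0.0))   # 1.5  (log2(2) == 1, score unchanged)
print(weighted_score(1.5, 2.0))   # 3.0  (log2(4) == 2)
print(weighted_score(1.5, 6.0))   # 4.5  (log2(8) == 3)
print(weighted_score(1.5, -3.0))  # 1.5  (negative weight treated as 0.0)
```

Because the transform is logarithmic, large weight values raise the score slowly: going from 2.0 to 6.0 adds the same amount as going from 0.0 to 2.0.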

Returns a SearchResult object whose hits is a list of (order_key, DocAddress) tuples. When no order_by_field is given, order_key is a float score. When ordering by a field, order_key matches the field's Python type (int, float, bool, or str), except for date fields, which return an int of nanoseconds since the epoch.
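
When ordering by a date field, the integer order_key usually needs converting back to a datetime. The sketch below assumes "the epoch" means the Unix epoch in UTC; order_key_to_datetime is a hypothetical helper, not part of the library.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def order_key_to_datetime(nanos: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to an aware UTC datetime.

    datetime has microsecond resolution, so sub-microsecond digits are dropped.
    """
    return EPOCH + timedelta(microseconds=nanos // 1_000)


# 2021-01-01T00:00:00Z expressed in nanoseconds since the epoch.
print(order_key_to_datetime(1_609_459_200_000_000_000).isoformat())
# 2021-01-01T00:00:00+00:00
```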

Raises a ValueError if there was an error with the search.

def terms_with_prefix(self, field_name: str, prefix: str, filter_query: Query | None = None, limit: int | None = None) -> list[tuple[str, int]]:

Walk the term dictionary for field_name and return all terms that begin with prefix, together with their document frequencies.

Arguments:
  • field_name: Name of an indexed text field in the schema.
  • prefix: Only terms beginning with this string are returned. An empty string returns all terms in the field.
  • filter_query: Optional Query. When provided, each term's count reflects only documents matched by the query (e.g. for permission filtering). Counts are still summed across segments.
  • limit: If given, only the top-limit entries (by count) are returned.

Returns:

[(term, count), ...] sorted by count descending, then alphabetically. Terms present in multiple segments have their counts summed.

Raises:
  • ValueError: if the field does not exist or is not a text field.
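
The merging and ordering rules described above (counts summed across segments, sorted by count descending and then alphabetically, truncated to limit) can be sketched without an index. The per-segment dicts and the helper name merge_prefix_counts are assumptions for illustration.

```python
from collections import Counter


def merge_prefix_counts(segment_term_counts, prefix, limit=None):
    """Sum per-segment term counts, keep terms starting with prefix, and rank them."""
    totals = Counter()
    for term_counts in segment_term_counts:
        for term, count in term_counts.items():
            if term.startswith(prefix):
                totals[term] += count
    # Count descending, then term ascending to break ties.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked if limit is None else ranked[:limit]


segments = [
    {"sea": 3, "seal": 1, "ship": 2},
    {"sea": 1, "search": 1, "seal": 1},
]
print(merge_prefix_counts(segments, "sea"))
# [('sea', 4), ('seal', 2), ('search', 1)]
print(merge_prefix_counts(segments, "s", limit=2))
# [('sea', 4), ('seal', 2)]
```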

SearchResult

class SearchResult:

Object holding the results of a successful search.

count

How many documents matched the query. Only available if count was set to true during the search.

hits: list[tuple[typing.Any, DocAddress]]

A list of (order_key, DocAddress) tuples, one per search result. The first element is the relevance score unless the search was ordered by a field.

Snippet

class Snippet:

A fragment of a document with highlighted search terms.

Contains a text fragment (a window around the matched terms) and the byte ranges within that fragment that matched the query.

def fragment(self) -> str:

Returns the text fragment that contains the highlighted terms.

def highlighted(self) -> list[Range]:

Returns the highlighted ranges within the fragment.

The offsets are relative to the string returned by fragment(), not the original document text.
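
Since the ranges are byte offsets, slicing the Python str directly can be off for non-ASCII text. The sketch below assumes the offsets refer to the UTF-8 encoding of the fragment and uses plain (start, end) tuples as a hypothetical stand-in for the Range objects.

```python
def highlighted_texts(fragment: str, byte_ranges):
    """Slice matched substrings out of a fragment using (start, end) byte offsets."""
    data = fragment.encode("utf-8")
    return [data[start:end].decode("utf-8") for start, end in byte_ranges]


fragment = "café by the sea"
# "sea" starts at character 12 but at byte 13, because "é" takes two bytes in UTF-8.
print(highlighted_texts(fragment, [(13, 16)]))  # ['sea']
```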

def to_html(self) -> str:

Returns the fragment as HTML with matched terms wrapped in <b> tags.
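
For markup other than <b>, the same output can be assembled from fragment() and highlighted(). The sketch below is one possible approach, not the library's implementation: it assumes sorted, non-overlapping UTF-8 byte ranges given as (start, end) tuples, and escaping the text is a choice made here.

```python
import html


def render_html(fragment: str, byte_ranges, tag: str = "b") -> str:
    """Wrap each (start, end) byte range in <tag> and HTML-escape all text."""
    data = fragment.encode("utf-8")
    parts, cursor = [], 0
    for start, end in sorted(byte_ranges):
        parts.append(html.escape(data[cursor:start].decode("utf-8")))
        matched = html.escape(data[start:end].decode("utf-8"))
        parts.append(f"<{tag}>{matched}</{tag}>")
        cursor = end
    parts.append(html.escape(data[cursor:].decode("utf-8")))
    return "".join(parts)


print(render_html("old man & the sea", [(0, 3), (14, 17)]))
# <b>old</b> man &amp; the <b>sea</b>
```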

SnippetGenerator

class SnippetGenerator:

def create(searcher: Searcher, query: Query, schema: Schema, field_name: str) -> SnippetGenerator:

Creates a SnippetGenerator for the field field_name from the given searcher, query and schema.

def set_max_num_chars(self, max_num_chars: int) -> None:

Sets the maximum number of characters in a generated snippet.

def snippet_from_doc(self, doc: Document) -> Snippet:

Generates a Snippet from the given Document.

TextAnalyzer

class TextAnalyzer:

Tantivy's TextAnalyzer

Do not instantiate this class directly. Use the TextAnalyzerBuilder class instead.

def analyze(self, text: str) -> list[str]:

Tokenize a string.

Args:

  • text (str): The text to tokenize.

Returns:

  • list[str]: A list of tokens/words.

TextAnalyzerBuilder

class TextAnalyzerBuilder:

Tantivy's TextAnalyzerBuilder

Example

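from tantivy import Filter, TextAnalyzer, TextAnalyzerBuilder, Tokenizer

# Tokenize with the simple tokenizer, then lowercase and ASCII-fold each token.
my_analyzer: TextAnalyzer = (
    TextAnalyzerBuilder(Tokenizer.simple())
    .filter(Filter.lowercase())
    .filter(Filter.ascii_fold())
    .build()
)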

https://docs.rs/tantivy/latest/tantivy/tokenizer/struct.TextAnalyzerBuilder.html

TextAnalyzerBuilder(tokenizer: Tokenizer)

def build(self) -> TextAnalyzer:

Build final TextAnalyzer object.

Returns:

  • TextAnalyzer with tokenizer and filters baked in.

Tip: TextAnalyzer's analyze(text) -> tokens method lets you easily check if your analyzer is working as expected.

def filter(self, filter: Filter) -> TextAnalyzerBuilder:

Add filter to the builder.

Args:

  • filter (Filter): A Filter object.

Returns:

  • TextAnalyzerBuilder: A new instance of the builder.

Note: The builder is _not_ mutated in-place.
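
The practical consequence of a non-mutating builder is that the return value of filter() must be kept. The toy class below models that behaviour in plain Python; ToyBuilder is hypothetical and only mirrors the pattern, not the library's internals.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyBuilder:
    """Hypothetical immutable builder: filter() returns a new object."""
    filters: Tuple[str, ...] = ()

    def filter(self, name: str) -> "ToyBuilder":
        return replace(self, filters=self.filters + (name,))


base = ToyBuilder()
base.filter("lowercase")  # return value discarded, so base is unchanged
chained = base.filter("lowercase").filter("stemmer")

print(base.filters)     # ()
print(chained.filters)  # ('lowercase', 'stemmer')
```

Chaining calls, as in the class-level example, avoids this pitfall because each call operates on the previous call's result.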

Tokenizer

class Tokenizer:

All of Tantivy's built-in tokenizers in one place. Each static method, e.g. Tokenizer.simple(), creates a wrapper around a Tantivy tokenizer.

Example:

tokenizer = Tokenizer.regex(r"\w+")

Usage

In general, Tokenizer objects exist only to be passed to TextAnalyzerBuilder(tokenizer=...).

https://docs.rs/tantivy/latest/tantivy/tokenizer/index.html

def facet() -> Tokenizer:

FacetTokenizer

def ngram(min_gram: int = 2, max_gram: int = 3, prefix_only: bool = False) -> Tokenizer:

NgramTokenizer

Args:

  • min_gram (int): Minimum character length of each ngram.
  • max_gram (int): Maximum character length of each ngram.
  • prefix_only (bool, optional): If true, ngrams must count from the start of the word.
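
The effect of min_gram, max_gram and prefix_only is easiest to see on a short input. The function below is a plain-Python approximation of character ngram generation, written for illustration; char_ngrams is a hypothetical name and the output order is an assumption about how the tokens are emitted.

```python
def char_ngrams(text, min_gram=2, max_gram=3, prefix_only=False):
    """List character ngrams of length min_gram..max_gram, grouped by start offset."""
    starts = [0] if prefix_only else range(len(text))
    grams = []
    for start in starts:
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(char_ngrams("hello"))
# ['he', 'hel', 'el', 'ell', 'll', 'llo', 'lo']
print(char_ngrams("hello", prefix_only=True))
# ['he', 'hel']
```

Prefix-only ngrams are the usual choice for search-as-you-type, since they match what a user has typed so far while producing far fewer tokens.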

def raw() -> Tokenizer:

RawTokenizer

def regex(pattern: str) -> Tokenizer:

RegexTokenizer
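
A regex tokenizer emits each match of the pattern as one token. The sketch below imitates that with Python's re module, under the assumption that matches are non-overlapping and taken left to right. Python's regex syntax differs in places from the syntax Tantivy accepts, so patterns should be tested against the real tokenizer as well.

```python
import re


def regex_tokens(pattern: str, text: str):
    """Return every non-overlapping match of pattern in text as one token."""
    return [match.group(0) for match in re.finditer(pattern, text)]


print(regex_tokens(r"\w+", "The Old-Man, and the sea!"))
# ['The', 'Old', 'Man', 'and', 'the', 'sea']
```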

def simple() -> Tokenizer:

SimpleTokenizer

def whitespace() -> Tokenizer:

WhitespaceTokenizer
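
The difference between the simple and whitespace tokenizers shows up on punctuation. The two functions below are rough pure-Python approximations, assuming the simple tokenizer splits on every non-alphanumeric character and the whitespace tokenizer splits on whitespace only. Neither lowercases its output; in this model that is left to a filter.

```python
import re


def simple_like(text):
    """Approximate a simple tokenizer: runs of letters and digits are tokens."""
    return re.findall(r"[^\W_]+", text)


def whitespace_like(text):
    """Approximate a whitespace tokenizer: split on whitespace, keep punctuation."""
    return text.split()


text = "Gulf-Stream, eighty-four days."
print(simple_like(text))
# ['Gulf', 'Stream', 'eighty', 'four', 'days']
print(whitespace_like(text))
# ['Gulf-Stream,', 'eighty-four', 'days.']
```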