tantivy
Python bindings for the search engine library Tantivy.
Tantivy is a full-text search engine library written in Rust.
It is closer to Apache Lucene than to Elasticsearch and Apache Solr in the sense that it is not an off-the-shelf search engine server, but rather a library that can be used to build such a search engine. Tantivy is, in fact, strongly inspired by Lucene's design.
Example:
>>> import json
>>> import tantivy
>>> builder = tantivy.SchemaBuilder()
>>> title = builder.add_text_field("title", stored=True)
>>> body = builder.add_text_field("body")
>>> schema = builder.build()
>>> index = tantivy.Index(schema)
>>> writer = index.writer()
>>> doc = tantivy.Document()
>>> doc.add_text(title, "The Old Man and the Sea")
>>> doc.add_text(body, ("He was an old man who fished alone in a "
...                     "skiff in the Gulf Stream and he had gone "
...                     "eighty-four days now without taking a fish."))
>>> writer.add_document(doc)
>>> doc = schema.parse_document(json.dumps({
...     "title": ["Frankenstein", "The Modern Prometheus"],
...     "body": ("You will rejoice to hear that no disaster has "
...              "accompanied the commencement of an enterprise which "
...              "you have regarded with such evil forebodings. "
...              "I arrived here yesterday, and my first task is to "
...              "assure my dear sister of my welfare and increasing "
...              "confidence in the success of my undertaking.")
... }))
>>> writer.add_document(doc)
>>> writer.commit()
>>> reader = index.reader()
>>> searcher = reader.searcher()
>>> query = index.parse_query("sea whale", [title, body])
>>> result = searcher.search(query, 10)
>>> assert len(result) == 1
version
DocAddress
DocAddress contains all the necessary information to identify a document given a Searcher object.
It consists of an id identifying its segment, and its segment-local DocId. The id used for the segment is actually an ordinal in the list of segments held by a Searcher.
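To make the addressing scheme concrete, here is a minimal pure-Python model of it. It does not use tantivy: `ToyDocAddress`, `resolve`, and the nested lists standing in for segments are hypothetical names chosen for illustration only.

```python
from typing import List, NamedTuple


class ToyDocAddress(NamedTuple):
    """Hypothetical stand-in for a DocAddress: (segment ordinal, local doc id)."""
    segment_ord: int  # position of the segment in the searcher's segment list
    doc: int          # document id local to that segment


def resolve(segments: List[List[str]], address: ToyDocAddress) -> str:
    """Look up a document by going to its segment, then to its local id."""
    return segments[address.segment_ord][address.doc]


# Two toy "segments", each holding a few documents.
segments = [["doc-a", "doc-b"], ["doc-c", "doc-d", "doc-e"]]
found = resolve(segments, ToyDocAddress(segment_ord=1, doc=2))
```

Because the segment id is an ordinal into one particular segment list, an address in this model only makes sense together with the list it was produced from, which mirrors why a DocAddress is tied to a Searcher.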
Document
Tantivy's Document is the object that can be indexed and then searched for.
Documents are fundamentally a collection of unordered tuples (field_name, value). In this list, one field may appear more than once.
Example:
>>> doc = tantivy.Document()
>>> doc.add_text("title", "The Old Man and the Sea")
>>> doc.add_text("body", ("He was an old man who fished alone in a "
...                       "skiff in the Gulf Stream and he had gone "
...                       "eighty-four days now without taking a fish."))
>>> doc
Document(body=[He was an ],title=[The Old Ma])
For simplicity, it is also possible to build a Document by passing the field
values directly as constructor arguments.
Example:
>>> doc = tantivy.Document(title=["The Old Man and the Sea"], body=["..."])
As syntactic sugar, tantivy also allows the user to pass a single value if there is only one. In other words, the following is also legal.
Example:
>>> doc = tantivy.Document(title="The Old Man and the Sea", body="...")
For numeric fields, the Document constructor does not have any
information about the type and will try to guess it.
Therefore, it is recommended to use Document.from_dict(),
Document.extract(), or the Document.add_*() functions to provide
explicit type information.
Example:
>>> schema = (
...     SchemaBuilder()
...     .add_unsigned_field("unsigned")
...     .add_integer_field("signed")
...     .add_float_field("float")
...     .build()
... )
>>> doc = tantivy.Document.from_dict(
...     {"unsigned": 1000, "signed": -5, "float": 0.4},
...     schema,
... )
Add a boolean value to the document.
Arguments:
- field_name (str): The field name for which we are adding the value.
- value (bool): The boolean that will be added to the document.
Add a bytes value to the document.
Arguments:
- field_name (str): The field for which we are adding the bytes.
- value (bytes): The bytes that will be added to the document.
Add a date value to the document.
Arguments:
- field_name (str): The field name for which we are adding the date.
- value (datetime): The date that will be added to the document.
Add a facet value to the document.
Arguments:
- field_name (str): The field name for which we are adding the facet.
- value (Facet): The Facet that will be added to the document.
Add a float value to the document.
Arguments:
- field_name (str): The field name for which we are adding the value.
- value (float): The float that will be added to the document.
Add a signed integer value to the document.
Arguments:
- field_name (str): The field name for which we are adding the integer.
- value (int): The integer that will be added to the document.
Add an IP address value to the document.
Arguments:
- field_name (str): The field for which we are adding the IP address.
- value (str): The IP address object that will be added to the document.
Raises a ValueError if the IP address is invalid.
Add a JSON value to the document.
Arguments:
- field_name (str): The field for which we are adding the JSON.
- value (str | Dict[str, Any]): The JSON object that will be added to the document.
Raises a ValueError if the JSON is invalid.
Add a text value to the document.
Arguments:
- field_name (str): The field name for which we are adding the text.
- text (str): The text that will be added to the document.
Add an unsigned integer value to the document.
Arguments:
- field_name (str): The field name for which we are adding the unsigned integer.
- value (int): The integer that will be added to the document.
Get all the values associated with the given field.
Arguments:
- field (Field): The field for which we would like to get the values.
Returns a list of values. The type of the value depends on the field.
Get the first value associated with the given field.
Arguments:
- field (Field): The field for which we would like to get the value.
Returns the value if one is found, otherwise None. The type of the value depends on the field.
Returns a dictionary with the different field values.
In tantivy, a Document can hold multiple
values for a single field.
For this reason, the dictionary will associate a list of values with every field.
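The list-per-field shape can be illustrated without tantivy. The sketch below is a hypothetical helper (`group_field_values` is not part of the library) that groups (field_name, value) pairs into a dictionary of lists. It only mirrors the shape described above and makes no claim about how the bindings build the result.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def group_field_values(pairs: Iterable[Tuple[str, Any]]) -> Dict[str, List[Any]]:
    """Group (field_name, value) pairs into a dict of lists, one list per field."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for field_name, value in pairs:
        grouped[field_name].append(value)
    return dict(grouped)


# A field may appear more than once, so every field maps to a list,
# even when it holds a single value.
pairs = [
    ("title", "Frankenstein"),
    ("title", "The Modern Prometheus"),
    ("year", 1818),
]
as_dict = group_field_values(pairs)
```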
Explanation
Represents an explanation of how a document matched a query.
Facet
A Facet represents a point in a given hierarchy.
They are typically represented similarly to a filepath. For instance, an e-commerce website could have a Facet for /electronics/tv_and_video/led_tv.
A document can be associated with any number of facets. The hierarchy implicitly implies that a document belonging to a facet also belongs to the ancestors of that facet. In the example above, these are /electronics/tv_and_video/ and /electronics.
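The ancestor relationship can be sketched with plain string handling. This is an illustration only: `facet_ancestors` and `is_ancestor` are hypothetical helpers operating on path strings, not methods of the Facet class.

```python
from typing import List


def facet_ancestors(facet_path: str) -> List[str]:
    """Return every proper ancestor of a slash-separated facet path."""
    parts = [p for p in facet_path.split("/") if p]
    # Build "/a", "/a/b", ... stopping before the full path itself.
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def is_ancestor(ancestor: str, facet_path: str) -> bool:
    """True if `ancestor` is a proper ancestor of `facet_path`."""
    return ancestor.rstrip("/") in facet_ancestors(facet_path)


ancestors = facet_ancestors("/electronics/tv_and_video/led_tv")
```

Note that the comparison works on whole path components, so /electronics/tv is not treated as an ancestor of /electronics/tv_and_video/led_tv even though it is a string prefix.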
Create a Facet object from a string.
Arguments:
- facet_string (str): The string that contains a facet.
Returns the created Facet.
Returns true if another Facet is a subfacet of this facet.
Arguments:
- other (Facet): The Facet to check for being a subfacet of this facet.
FieldType
Tantivy's Type
Filter
All of Tantivy's built-in TokenFilters.
Example
filter = Filter.alpha_num()
Usage
In general, filter objects exist to be passed to the filter() method of a TextAnalyzerBuilder instance.
StopWordFilter (user-provided stop word list)
This variant of Filter.stopword() lets you provide your own custom list of stopwords.
Args:
- stopwords (list(str)): a list of words to be removed.
SplitCompoundWords
https://docs.rs/tantivy/latest/tantivy/tokenizer/struct.SplitCompoundWords.html
Args:
- constituent_words (list(string)): the words that make up the compound words (must be in order).
Example:
# useless, contrived example:
compound_splitter = Filter.split_compounds(['butter', 'fly'])
# Will split 'butterfly' -> ['butter', 'fly'],
# but won't split 'buttering' or 'buttercupfly'
StopWordFilter (builtin stop word list)
Args:
- language (string): Stop words list language. Valid values: { "arabic", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "italian", "norwegian", "portuguese", "romanian", "russian", "spanish", "swedish", "tamil", "turkish" }
Index
Create a new index object.
Arguments:
- schema (Schema): The schema of the index.
- path (str, optional): The path where the index should be stored. If no path is provided, the index will be stored in memory.
- reuse (bool, optional): Should we open an existing index if one exists or always create a new one.
If an index already exists it will be opened and reused. Raises OSError if there was a problem during the opening or creation of the index.
Configure the index reader.
Arguments:
- reload_policy (str, optional): The reload policy that the IndexReader should use. Can be Manual or OnCommit.
- num_warmers (int, optional): The number of searchers that the reader should create.
Check if the given path contains an existing index.
Arguments:
- path: The path where tantivy will search for an index.
Returns True if an index exists at the given path, False otherwise.
Raises OSError if the directory cannot be opened.
Parse a query
Arguments:
- query: the query, following the tantivy query language.
- default_field_names (List[Field]): A list of fields used to search if no field is specified in the query.
- field_boosts: A dictionary keyed on field names which provides default boosts for the query constructed by this method.
- fuzzy_fields: A dictionary keyed on field names which provides (prefix, distance, transpose_cost_one) triples, making queries constructed by this method fuzzy against the given fields and using the given parameters. prefix determines if terms which are prefixes of the given term match the query. distance determines the maximum Levenshtein distance between terms matching the query and the given term. transpose_cost_one determines if transpositions of neighbouring characters are counted only once against the Levenshtein distance.
- conjunction_by_default: If true, the query will be parsed as a conjunction query. Defaults to a disjunction query.
- allow_regexes: If true, allow regexes in queries.
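To show how distance and transpose_cost_one interact, here is a small self-contained sketch of an edit-distance function. It is a textbook dynamic-programming version written for illustration, not the automaton tantivy uses internally, and `edit_distance` is a hypothetical name.

```python
def edit_distance(a: str, b: str, transpose_cost_one: bool = True) -> int:
    """Levenshtein distance; optionally count an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            swapped = (
                i > 1 and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            )
            if transpose_cost_one and swapped:
                # Treat the swap of two neighbouring characters as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


with_swap = edit_distance("whale", "wahle", transpose_cost_one=True)
without_swap = edit_distance("whale", "wahle", transpose_cost_one=False)
```

With a maximum distance of 1, the misspelling "wahle" would be within range of "whale" only when transpositions are counted once. Otherwise the swap counts as two edits.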
Parse a query leniently.
This variant parses an invalid query on a best-effort basis. If some part of the query can't reasonably be executed (a range query without a field, searching on a nonexistent field, searching without specifying a field when no default field is provided...), it may get turned into a "match-nothing" subquery.
Arguments:
- query: the query, following the tantivy query language.
- default_field_names (List[Field]): A list of fields used to search if no field is specified in the query.
- field_boosts: A dictionary keyed on field names which provides default boosts for the query constructed by this method.
- fuzzy_fields: A dictionary keyed on field names which provides (prefix, distance, transpose_cost_one) triples, making queries constructed by this method fuzzy against the given fields and using the given parameters. prefix determines if terms which are prefixes of the given term match the query. distance determines the maximum Levenshtein distance between terms matching the query and the given term. transpose_cost_one determines if transpositions of neighbouring characters are counted only once against the Levenshtein distance.
- conjunction_by_default: If true, the query will be parsed as a conjunction query. Defaults to a disjunction query.
- allow_regexes: If true, allow regexes in queries.
Returns a tuple containing the parsed query and a list of errors.
Raises ValueError if a field in default_field_names is not defined or marked as indexed.
Register a custom text analyzer for fast fields by name. (Confusingly, this is one of the places where Tantivy uses 'tokenizer' to refer to a TextAnalyzer instance.)
Register a custom text analyzer by name. (Confusingly, this is one of the places where Tantivy uses 'tokenizer' to refer to a TextAnalyzer instance.)
Update searchers so that they reflect the state of the last .commit().
If you set up the reload policy to be on 'commit' (which is the default), every commit should be rapidly reflected on your IndexReader and you should not need to call reload() at all.
Returns a searcher
This method should be called every single time a search query is performed. The same searcher must be used for a given query, as it ensures the use of a consistent segment set.
Create an IndexWriter for the index.
The writer will be multithreaded and the provided heap size will be split between the given number of threads.
Arguments:
- overall_heap_size (int, optional): The total target heap memory usage of the writer. Tantivy requires at least 3000000 per thread. Lower values will result in more frequent internal commits when adding documents (slowing down write progress), and larger values will result in fewer commits but greater memory usage. The best value will depend on your specific use case.
- num_threads (int, optional): The number of threads that the writer should use. If this value is 0, tantivy will choose automatically the number of threads.
Raises ValueError if there was an error while creating the writer.
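As a rough aid for choosing these two values together, the sketch below checks a proposed budget against the per-thread minimum quoted above. It is an assumption-laden illustration: `heap_per_thread` is a hypothetical helper, the even split is assumed rather than documented, and the automatic thread count selected by passing 0 is not modelled.

```python
MIN_HEAP_PER_THREAD = 3_000_000  # lower bound quoted in the text above


def heap_per_thread(overall_heap_size: int, num_threads: int) -> int:
    """Split the overall heap evenly and reject budgets below the minimum."""
    if num_threads < 1:
        raise ValueError("this sketch needs an explicit thread count")
    share = overall_heap_size // num_threads
    if share < MIN_HEAP_PER_THREAD:
        raise ValueError(
            f"per-thread budget {share} is below the minimum {MIN_HEAP_PER_THREAD}"
        )
    return share


share = heap_per_thread(overall_heap_size=128_000_000, num_threads=4)
```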
IndexWriter
IndexWriter is the user entry-point to add documents to the index.
To create an IndexWriter first create an Index and call the writer() method on the index object.
Add a document to the index.
If the indexing pipeline is full, this call may block.
Returns an opstamp, which is an increasing integer that can be used
by the client to align commits with its own document queue.
The opstamp represents the number of documents that have been added
since the creation of the index.
Helper for the add_document method, but passing a json string.
If the indexing pipeline is full, this call may block.
Returns an opstamp, which is an increasing integer that can be used
by the client to align commits with its own document queue.
The opstamp represents the number of documents that have been added
since the creation of the index.
Commits all of the pending changes
A call to commit blocks. After it returns, all of the documents that were added since the last commit are published and persisted.
In case of a crash or a hardware failure (as long as the hard disk is spared), it will be possible to resume indexing from this point.
Returns the opstamp of the last document that made it in the commit.
The opstamp of the last successful commit.
This is the opstamp the index will rollback to if there is a failure like a power surge.
This is also the opstamp of the commit that is currently available for searchers.
Delete all documents matching a given query.
Example:
import json
from tantivy import Index, SchemaBuilder

# Build an in-memory index with a single text field.
schema_builder = SchemaBuilder()
schema_builder.add_text_field("title", fast=True)
schema = schema_builder.build()
index = Index(schema)

# Add one document and commit it.
writer = index.writer()
source_doc = {"title": "Here is some text"}
writer.add_json(json.dumps(source_doc))
writer.commit()
writer.wait_merging_threads()

# Delete every document matching the query, then commit the deletion.
query = index.parse_query("title:text")
writer = index.writer()
writer.delete_documents_by_query(query)
writer.commit()
writer.wait_merging_threads()
Arguments:
- query (Query): The query to filter the deleted documents.
Raises ValueError if the query is not valid. Raises Exception if the query is not supported.
Delete all documents containing a given term.
This method does not parse the given term and it expects the term to be
already tokenized according to any tokenizers attached to the field. This
can often result in surprising behaviour. For example, if you want to store
UUIDs as text in a field, and those values have hyphens, and you use the
default tokenizer which removes punctuation, you will not be able to delete
a document added with particular UUID, by passing the same UUID to this
method. In such workflows where deletions are required, particularly with
string values, it is strongly recommended to use the
"raw" tokenizer as this will match exactly. In situations where you do
want tokenization to be applied, it is recommended to use the
delete_documents_by_query method instead, which will delete documents
matching the given query using the same query parser as used in search queries.
Arguments:
- field_name (str): The field name for which we want to filter deleted docs.
- field_value (PyAny): Python object with the value we want to filter.
Raises ValueError if field_name is not in the schema. Raises Exception if field_value is not supported.
Detects and removes the files that are no longer used by the index.
Rollback to the last commit
This cancels all of the updates that happened after the last commit. After calling rollback, the index is in the same state as it was after the last commit.
If there are some merging threads, blocks until they all finish
their work and then drops the IndexWriter.
This will consume the IndexWriter. Further accesses to the
object will result in an error.
Occur
Tantivy's Occur
Order
Enum representing the direction in which something should be sorted.
parse_query
Parse a query string into an abstract syntax tree (AST).
Arguments:
- query: The query string to parse.
Returns:
A dictionary representing the parsed query AST.
Raises:
- ValueError: If the query has invalid syntax.
parse_query_lenient
Parse a query string leniently, recovering from syntax errors.
Arguments:
- query: The query string to parse.
Returns:
A tuple containing:
- A dictionary representing the parsed query AST
- A list of error dictionaries describing syntax errors
Query
Tantivy's Query
Convenience method to combine queries with AND (MUST) logic.
Returns a query matching documents that match this query and every
given query. Accepts any number of queries, so a list can be passed
with argument unpacking: query.and_must_match(*queries).
Convenience method to combine queries with AND NOT (MUST NOT) logic.
Returns a query matching documents that match this query and none of
the given queries. Accepts any number of queries, so a list can be
passed with argument unpacking: query.and_must_not_match(*queries).
Construct a Tantivy BooleanQuery.
Construct a Tantivy ExistsQuery. Executing a search with this query will fail if the specified field doesn't exist or is not a fast field.
Arguments
- fast_field_name: Field name to be searched.
- json_subpaths: If true, check all the subpaths inside a JSON field.
Explain how this query matches a given document.
Arguments
- searcher (Searcher): The searcher used to perform the search.
- doc_address (DocAddress): The address of the document to explain.
Returns
Explanation: An object containing detailed information about how the document matched the query, with a to_json() method.
Construct a Tantivy FuzzyTermQuery.
Arguments
- schema: Schema of the target index.
- field_name: Field name to be searched.
- text: String representation of the query term.
- distance: (Optional) Edit distance you are going to allow. When not specified, the default is 1.
- transposition_cost_one: (Optional) If true, a transposition (swapping) cost will be 1; otherwise it will be 2. When not specified, the default is true.
- prefix: (Optional) If true, prefix Levenshtein distance is applied. When not specified, the default is false.
Construct a Tantivy MoreLikeThisQuery from caller-provided field values.
Convenience method to combine queries with OR (SHOULD) logic.
Returns a query matching documents that match this query or any of
the given queries. Accepts any number of queries, so a list can be
passed with argument unpacking: query.or_should_match(*queries).
Construct a Tantivy PhrasePrefixQuery with custom offsets and slop.
Arguments
- schema: Schema of the target index.
- field_name: Field name to be searched.
- words: Word list that constructs the phrase. A word can be a term text or a pair of term text and its offset in the phrase.
Construct a Tantivy PhraseQuery with custom offsets and slop.
Arguments
- schema: Schema of the target index.
- field_name: Field name to be searched.
- words: Word list that constructs the phrase. A word can be a term text or a pair of term text and its offset in the phrase.
- slop: (Optional) The number of gaps permitted between the words in the query phrase. Default is 0.
Construct a range query over a numeric, date, or IP address field.
Pass None for lower_bound or upper_bound to leave that side unbounded.
Both bounds cannot be None; use Query.all_query() to match all documents.
Setting include_lower or include_upper to False while the corresponding
bound is None is an error—unbounded sides are always inclusive by definition.
Arguments
- schema: Schema of the target index.
- field_name: Field name to be searched.
- field_type: Type of the field (FieldType.Integer, FieldType.Float, FieldType.Date, etc.).
- lower_bound: Lower bound value, or None for unbounded.
- upper_bound: Upper bound value, or None for unbounded.
- include_lower: Whether the lower bound is inclusive. Ignored (and must be True) when lower_bound is None.
- include_upper: Whether the upper bound is inclusive. Ignored (and must be True) when upper_bound is None.
- use_inverted_index: If True, use an inverted index range query instead of a fast-field range query.
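The bound rules above can be restated as a small predicate. This is a sketch of the semantics as described in this section, applied to a single value in plain Python. `in_range` is a hypothetical helper and does not construct or execute a tantivy query.

```python
from typing import Optional


def in_range(
    value: float,
    lower_bound: Optional[float],
    upper_bound: Optional[float],
    include_lower: bool = True,
    include_upper: bool = True,
) -> bool:
    """Check one value against the bound rules described above."""
    if lower_bound is None and upper_bound is None:
        raise ValueError("at least one bound is required")
    if lower_bound is None and not include_lower:
        raise ValueError("an unbounded lower side must be inclusive")
    if upper_bound is None and not include_upper:
        raise ValueError("an unbounded upper side must be inclusive")
    if lower_bound is not None:
        if value < lower_bound or (value == lower_bound and not include_lower):
            return False
    if upper_bound is not None:
        if value > upper_bound or (value == upper_bound and not include_upper):
            return False
    return True
```

For example, `in_range(5, 5, 10)` is true because both bounds are inclusive by default, while `in_range(5, 5, 10, include_lower=False)` is false.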
Construct a Tantivy PhraseQuery with custom offsets and slop.
Arguments
- schema: Schema of the target index.
- field_name: Field name to be searched.
- words: Word list that constructs the phrase. A word can be a term text or a pair of term text and its offset in the phrase.
- slop: (Optional) The number of gaps permitted between the words in the query phrase. Default is 0.
Schema
Tantivy schema.
The schema is very strict. To build the schema the SchemaBuilder class is
provided.
SchemaBuilder
Tantivy has a very strict schema. You need to specify in advance whether a field is indexed or not, stored or not.
This is done by creating a schema object, and setting up the fields one by one.
Examples:
>>> builder = tantivy.SchemaBuilder()
>>> title = builder.add_text_field("title", stored=True)
>>> body = builder.add_text_field("body")
>>> schema = builder.build()
Add a new boolean field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the numeric options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
Add a fast bytes field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the bytes options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Add a new date field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the date options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
Add a new float field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the numeric options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
Add a new signed integer field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the numeric options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
Add an IP address field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the IP address options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Add a new json field to the schema.
Arguments:
- name (str): the name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- fast (bool, optional): Set the text options as a fast field. A fast field is column-oriented storage in tantivy. Text fast fields will have the term ids stored in the fast field. The fast field will be a multivalued fast field. It is recommended to use the "raw" tokenizer, since it will store the original text unchanged. The "default" tokenizer will store the terms in lower case, and this will be reflected in the dictionary.
- tokenizer_name (str, optional): The name of the tokenizer that should be used to process the field. Defaults to 'default'
- index_option (str, optional): Sets which information should be indexed with the tokens. Can be one of 'position', 'freq' or 'basic'. Defaults to 'position'. The 'basic' index_option records only the document ID, the 'freq' option records the document id and the term frequency, while the 'position' option records the document id, term frequency and the positions of the term occurrences in the document.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
Add a new text field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- fast (bool, optional): Set the text options as a fast field. A fast field is column-oriented storage in tantivy. Text fast fields will have the term ids stored in the fast field. The fast field will be a multivalued fast field. It is recommended to use the "raw" tokenizer, since it will store the original text unchanged. The "default" tokenizer will store the terms in lower case, and this will be reflected in the dictionary.
- tokenizer_name (str, optional): The name of the tokenizer that should be used to process the field. Defaults to 'default'
- index_option (str, optional): Sets which information should be indexed with the tokens. Can be one of 'position', 'freq' or 'basic'. Defaults to 'position'. The 'basic' index_option records only the document ID, the 'freq' option records the document id and the term frequency, while the 'position' option records the document id, term frequency and the positions of the term occurrences in the document.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
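The difference between the three index_option values can be pictured with a toy posting entry. The sketch below is purely illustrative: `posting_for` is a hypothetical function, the document id is fixed at 0, and the dictionary layout is not tantivy's on-disk format.

```python
from typing import Dict, List


def posting_for(tokens: List[str], term: str, index_option: str) -> Dict[str, object]:
    """Show what one document's posting for `term` would carry per option."""
    if index_option not in ("basic", "freq", "position"):
        raise ValueError(f"unknown index_option: {index_option!r}")
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return {}  # the term does not occur in this document
    posting: Dict[str, object] = {"doc_id": 0}  # 'basic': document id only
    if index_option in ("freq", "position"):
        posting["term_freq"] = len(positions)   # 'freq' adds the frequency
    if index_option == "position":
        posting["positions"] = positions        # 'position' adds token offsets
    return posting


tokens = "the old man and the sea".split()
basic = posting_for(tokens, "the", "basic")
freq = posting_for(tokens, "the", "freq")
position = posting_for(tokens, "the", "position")
```

Each option is a superset of the previous one, so the choice trades index size against the information available at query time.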
Add a new unsigned integer field to the schema.
Arguments:
- name (str): The name of the field.
- stored (bool, optional): If true sets the field as stored, the content of the field can be later restored from a Searcher. Defaults to False.
- indexed (bool, optional): If true sets the field to be indexed.
- fast (bool, optional): Set the numeric options as a fast field. A fast field is column-oriented storage in tantivy, designed for fast random access to a document's field values given a document id.
Returns the associated field handle. Raises a ValueError if there was an error with the field creation.
Searcher
Tantivy's Searcher class
A Searcher is used to search the index given a prepared Query.
Execute an aggregation query and return the results as a dict.
Arguments:
- query (Query): The query that filters the documents to aggregate over.
- agg (dict): The aggregation specification as a Python dict.
Returns a dict containing the aggregation results.
Returns the cardinality of a query.
Arguments:
- query (Query): The query that will be used for the search.
- field_name (str): The field for which to compute the cardinality.
Returns the cardinality.
Fetches a document from Tantivy's store given a DocAddress.
Arguments:
- doc_address (DocAddress): The DocAddress that is associated with the document that we wish to fetch.
Returns the Document, raises ValueError if the document can't be found.
Return the overall number of documents containing the given term.
Read a numeric fast field for a batch of DocAddresses without fetching stored documents.
Fast fields are column-oriented and support O(1) random access by segment-local DocId. Use this instead of doc().to_dict()[field] when you only need a single numeric field for many documents.
The field type is resolved from the schema automatically: u64 and i64 fields return Python int; f64 fields return Python float; bool fields return Python bool.
Arguments:
- field_name: Name of a u64, i64, f64, or bool field declared with fast=True.
- doc_addresses: List of DocAddress objects (e.g. from search().hits).
Returns:
A list of values in the same order as doc_addresses. None is returned for any address where the column is absent (e.g. a segment written before the field was added to the schema).
Raises:
- ValueError: if the field does not exist, is not a fast field, or has an unsupported type (only u64, i64, f64, and bool are supported).
Search the index with the given query and collect results.
Arguments:
- query (Query): The query that will be used for the search.
- limit (int, optional): The maximum number of search results to return. Defaults to 10.
- count (bool, optional): Should the number of documents that match the query be returned as well. Defaults to true.
- order_by_field (str, optional): Name of a field that the results should be ordered by. The field must be declared as a fast field when building the schema. Supported field types: Text, Unsigned, Integer, Float, Boolean and Date.
- offset (int, optional): The offset from which the results have to be returned.
- order (Order, optional): The order in which the results should be sorted. If not specified, defaults to descending.
- weight_by_field (str, optional): Name of a field that the results should be weighted by. The field must be declared as a fast field when building the schema. Note, this only works for Float, Integer and Unsigned fields. The given field value is first transformed using the formula log2(2.0 + value) and then multiplied with the original score. This means that a weight field value of 0.0 results in no change to the original score. If the weight value is negative, it is treated as 0.0.
Returns a SearchResult object whose hits is a list of (order_key,
DocAddress) tuples. When no order_by_field is given, order_key is
a float score. When ordering by a field, order_key matches the
field's Python type (int, float, bool, or str), except for date fields
which return an int of nanoseconds since the epoch.
Raises a ValueError if there was an error with the search.
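The weight_by_field formula described above is easy to check by hand. The following sketch reproduces the arithmetic in plain Python. `weighted_score` is a hypothetical helper written from the description, not a function exposed by the bindings.

```python
import math


def weighted_score(score: float, weight_value: float) -> float:
    """Apply the log2(2.0 + value) weighting, treating negatives as 0.0."""
    clamped = max(weight_value, 0.0)
    return score * math.log2(2.0 + clamped)


unchanged = weighted_score(1.5, 0.0)    # log2(2.0) == 1.0, score is unchanged
negative = weighted_score(1.5, -3.0)    # negative weights behave like 0.0
doubled = weighted_score(1.0, 2.0)      # log2(4.0) == 2.0
```

Because the transform is logarithmic, large weight values boost the score only gradually: a weight of 6.0 triples the score, and a weight of 14.0 quadruples it.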
Walk the term dictionary for field_name and return all terms that
begin with prefix, together with their document frequencies.
Arguments:
- field_name: Name of an indexed text field in the schema.
- prefix: Only terms beginning with this string are returned. An empty string returns all terms in the field.
- filter_query: Optional Query. When provided, each term's count reflects only documents matched by the query (e.g. for permission filtering). Counts are still summed across segments.
- limit: If given, only the top-limit entries (by count) are returned.
Returns:
[(term, count), ...] sorted by count descending, then alphabetically. Terms present in multiple segments have their counts summed.
Raises:
- ValueError: if the field does not exist or is not a text field.
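The merging and ordering rules for the returned list can be sketched without an index. In the illustration below each segment is modelled as a plain dict of term counts. `merge_term_counts` and the sample data are hypothetical, and the filter_query argument is not modelled.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def merge_term_counts(
    segments: List[Dict[str, int]],
    prefix: str = "",
    limit: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Sum per-segment counts, keep matching terms, sort by count then term."""
    totals: Counter = Counter()
    for segment in segments:
        for term, count in segment.items():
            if term.startswith(prefix):  # an empty prefix keeps every term
                totals[term] += count
    # Count descending first, then alphabetical order to break ties.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked if limit is None else ranked[:limit]


segments = [{"sea": 2, "seal": 1, "ship": 4}, {"sea": 1, "search": 3}]
merged = merge_term_counts(segments, prefix="sea")
top_one = merge_term_counts(segments, prefix="sea", limit=1)
```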
SearchResult
Object holding the results of a successful search.
How many documents matched the query. Only available if count was set
to true during the search.
The list of tuples that contains the scores and DocAddress of the search results.
Snippet
A fragment of a document with highlighted search terms.
Contains a text fragment (a window around the matched terms) and the byte ranges within that fragment that matched the query.
Returns the highlighted ranges within the fragment.
The offsets are relative to the string returned by fragment(),
not the original document text.
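Since the ranges are byte offsets into the fragment, applying them means slicing the UTF-8 encoding of the fragment rather than the Python string. The sketch below shows one way to do that. `render_highlights` is a hypothetical helper, and it assumes the ranges are half-open (start, end) pairs that do not overlap.

```python
from typing import List, Tuple


def render_highlights(fragment: str, ranges: List[Tuple[int, int]]) -> str:
    """Wrap each (start, end) byte range of `fragment` in <b>...</b>."""
    raw = fragment.encode("utf-8")
    pieces: List[bytes] = []
    cursor = 0
    for start, end in sorted(ranges):
        pieces.append(raw[cursor:start])                   # text before the match
        pieces.append(b"<b>" + raw[start:end] + b"</b>")   # the highlighted match
        cursor = end
    pieces.append(raw[cursor:])                            # trailing text
    return b"".join(pieces).decode("utf-8")


html = render_highlights("the old man and the sea", [(8, 11), (20, 23)])
```

Working on bytes matters as soon as the fragment contains non-ASCII text: in "café sea" the word "sea" starts at byte 6, not at character index 5.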
SnippetGenerator
TextAnalyzer
Tantivy's TextAnalyzer
Do not instantiate this class directly.
Use the TextAnalyzerBuilder class instead.
TextAnalyzerBuilder
Tantivy's TextAnalyzerBuilder
Example
my_analyzer: TextAnalyzer = (
    TextAnalyzerBuilder(Tokenizer.simple())
    .filter(Filter.lowercase())
    .filter(Filter.ngram())
    .build()
)
https://docs.rs/tantivy/latest/tantivy/tokenizer/struct.TextAnalyzerBuilder.html
Build final TextAnalyzer object.
Returns:
- TextAnalyzer with tokenizer and filters baked in.
Tip: TextAnalyzer's analyze(text) -> tokens method lets you
easily check if your analyzer is working as expected.
Tokenizer
All Tantivy's built-in tokenizers in one place. Each static method, e.g. Tokenizer.simple(), creates a wrapper around a Tantivy tokenizer.
Example:
tokenizer = Tokenizer.regex(r"\w+")
Usage
In general, tokenizer objects' only reason
for existing is to be passed to
TextAnalyzerBuilder(tokenizer=