Elasticsearch: How to store term vectors

I am working on a project where I heavily use Elasticsearch and leverage the moreLikeThis query to implement some features.
The official documentation for the MLT query states the following:
In order to speed up analysis, it could help to store term vectors at
index time, but at the expense of disk usage.
This appears in the *How it works* section. The idea, then, is to tune the mapping so that it stores the pre-calculated term vectors. The problem is that it seems unclear from the documentation how exactly this should be done. On one side, the MLT documentation provides an example mapping that looks like this:
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed",
              "term_vector": "yes"
            }
          }
        }
      }
    }
  }
}'
On the other side, the Term Vectors documentation provides a mapping in its Example 1 section that looks like this:
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "index_analyzer": "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer": "fulltext_analyzer"
        }
      }
    }
....
This should create an index that stores term vectors, payloads, etc.
Now the question is: which of the mappings should be used? Is this a flaw in the documentation, or am I missing something?

You are right, it doesn't seem to be explicitly mentioned in the current version of the documentation; however, the documentation for the upcoming 2.0 release gives a more detailed explanation.
Term vectors contain information about the terms produced by the
analysis process, including:
a list of terms.
the position (or order) of each term.
the start and end character offsets mapping the term to its origin in the original string.
These term vectors can be stored so that they can be retrieved for a
particular document.
The term_vector setting accepts:
no: No term vectors are stored. (default)
yes: Just the terms in the field are stored
with_positions: Terms and positions are stored
with_offsets: Terms and character offsets are stored
with_positions_offsets: Terms, positions, and character offsets are stored
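A quick way to verify what actually got stored is to ask for the term vectors of a single document. A minimal sketch, using the index/type from the docs example above and assuming a document with id 1 has been indexed (note the endpoint is _termvector on 1.x and _termvectors on 2.x and later):
# Fetch the stored term vectors (terms, positions, offsets, payloads) for document 1:
curl -s -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?fields=text&pretty'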

Related

What is the correct setup for ElasticSearch 7.6.2 highlighting with FVH?

How to properly setup highlighting search words in huge documents using fast vector highlighter?
I've gone through the documentation and tried the following settings for the index (written as a Python literal; the commented-out alternative settings were also tried, both with store and without):
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "members": {
      "dynamic": "strict",
      "properties": {
        "url": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          #"index_options": "offsets",
          "store": True
        },
        "title": {
          "type": "text",
          #"index_options": "offsets",
          "term_vector": "with_positions_offsets",
          "store": True
        },
        "content": {
          "type": "text",
          #"index_options": "offsets",
          "term_vector": "with_positions_offsets",
          "store": True
        }
      }
    }
  }
}
Search is done with the following query (again, the commented-out parts were tried one by one, in various combinations):
{
  "query": {
    "multi_match": {
      "query": term,
      "fields": ["url", "title", "content"]
    },
  },
  "_source": {
    #"includes": ["url", "title", "_id"],
    #"excludes": ["content"]
  },
  "highlight": {
    "number_of_fragments": 40,
    "fragment_size": 80,
    "fields": {
      "content": {"matched_fields": ["content"]},
      #"content": {"type": "fvh", "matched_fields": ["content"]},
      #"title": {"type": "fvh", "matched_fields": ["title"]},
    }
  }
}
The problem is that when FVH is not used, Elasticsearch complains that the "content" field is too large (and I do not want to increase the allowed size). When I add the "fvh" type, ES complains that term vectors are needed, even though I've checked they are there by querying the document info (offsets, starts, etc.):
the field [content] should be indexed with term vector with position
offsets to be used with fast vector highlighter
It seems like:
When I omit "type": "fvh", it is not used, even though the documentation says it is the default when "term_vector": "with_positions_offsets" is set.
I can see term vectors in the index (indirectly: when indexing with term vectors the index is almost twice as large), yet ES does not find them.
All the trials included removing the old index and creating it again.
It's also treacherous in that it fails only when a large document is encountered. Highlights are there for queries where the documents are small.
What is the proper way to set up highlighting in Elasticsearch 7, free edition (I tried it under Ubuntu with the binary deb from the vendor)?
The fvh highlighter uses the Lucene Fast Vector Highlighter. It can only be used on fields with term_vector set to with_positions_offsets in the mapping, which increases the size of the index.
you can define a mapping like below for your field.
"mappings": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
While querying, you need to set "type": "fvh" on the highlight fields.
(The documentation states that the fast vector highlighter is used by default for a text field when term vectors are enabled, but as you observed, requesting it explicitly is the reliable approach.)
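For example, a minimal sketch of such a highlight request (the index name and query term are illustrative; field names follow the question's mapping):
# Explicitly request the fast vector highlighter per field:
curl -s -XGET 'http://localhost:9200/members/_search' -H 'Content-Type: application/json' -d '{
  "query": { "multi_match": { "query": "example term", "fields": ["url", "title", "content"] } },
  "highlight": {
    "number_of_fragments": 40,
    "fragment_size": 80,
    "fields": {
      "content": { "type": "fvh" },
      "title": { "type": "fvh" }
    }
  }
}'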

I want to find exact term of sub string, exact term not just part of the term

I have a group of JSON documents from Wikidata (http://www.wikidata.org) to index into Elasticsearch for search.
Each document has several fields. For example, it looks like the one below.
{
  eId: Q25338,
  eLabel: "The Little Prince, Little Prince",
  ...
}
Here, what I want is for the user to search by the 'exact term', not part of the term. Meaning, if a user searches for 'prince', I don't want this document to show up in the search results. Only when the user types the whole term, 'the little prince' or 'little prince', should this document be included in the results.
Should I pre-process each comma-separated eLabel string (some eLabels have tens of elements in the list), turn it into a bunch of different documents, and make the keyword term field for each?
If not, how can I write a mapping that makes this search behave as expected?
My current Mappings.json.
"mappings": {
"entity": {
"properties": {
"eLabel": { # want to replace
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"eid": {
"type": "keyword"
} ,
"subclass": {
"type": "boolean"
} ,
"pLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"prop_id": {
"type": "keyword"
} ,
"pType": {
"type": "keyword"
} ,
"way": {
"type": "keyword"
} ,
"chain": {
"type": "integer"
} ,
"siteKey": {
"type": "keyword"
},
"version": {
"type": "integer"
},
"docId": {
"type": "integer"
}
}
}
}
Should I pre-process each comma-separated eLabel string (some eLabels have tens of elements in the list), turn it into a bunch of different documents, and make the keyword term field for each?
This is exactly what you should do. Elasticsearch can't process the comma-separated list for you; it will treat your data as just one whole string. But if you preprocess it and then make the resulting field a Keyword field, that will work very well - it's exactly what the Keyword field type is designed for. I'd recommend using a Term query to search for exact matches. (Unlike a Match query, a Term query does not analyse the incoming query and is thus more efficient.)
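A minimal sketch of that approach, assuming a recent (typeless) Elasticsearch version; the index name and the split label values are illustrative:
# Map eLabel as a keyword field so each array element is one exact term:
curl -s -XPUT 'http://localhost:9200/entities' -H 'Content-Type: application/json' -d '{
  "mappings": { "properties": { "eId": { "type": "keyword" }, "eLabel": { "type": "keyword" } } }
}'

# Index each label variant as a separate value of the keyword field:
curl -s -XPUT 'http://localhost:9200/entities/_doc/Q25338' -H 'Content-Type: application/json' -d '{
  "eId": "Q25338",
  "eLabel": ["The Little Prince", "Little Prince"]
}'

# A term query only matches a whole keyword value, never part of it,
# so searching for "prince" returns nothing, while "Little Prince" matches:
curl -s -XGET 'http://localhost:9200/entities/_search' -H 'Content-Type: application/json' -d '{
  "query": { "term": { "eLabel": "Little Prince" } }
}'
Note that a term query is exact, including case, so you may want to lowercase the values during preprocessing (or configure a lowercase normalizer on the keyword field).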

Partial update into large document

I'm facing a performance problem. My application is about chatting.
I designed the index mapping with nested objects like below.
{
  "conversation_id-v1": {
    "mappings": {
      "stream": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "message": {
            "type": "text",
            "fields": {
              "analyzerName": {
                "type": "text",
                "term_vector": "with_positions_offsets",
                "analyzer": "analyzerName"
              },
              "language": {
                "type": "langdetect",
                "analyzer": "_keyword",
                "languages": ["en", "ko", "ja"]
              }
            }
          },
          "comments": {
            "type": "nested",
            "properties": {
              "id": {
                "type": "keyword"
              },
              "message": {
                "type": "text",
                "fields": {
                  "analyzerName": {
                    "type": "text",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "analyzerName"
                  },
                  "language": {
                    "type": "langdetect",
                    "analyzer": "_keyword",
                    "languages": ["en", "ko", "ja"]
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
** the mapping actually has a lot more fields
A document has around 4,000 nested objects. When I upsert data into a document, the CPU peaks at 100%, and so does disk I/O on writes. The input rate is around 1000/s.
How can I tune this to improve performance?
Hardware
3x 2vCPUs 13GB on GCP
4,000 nested objects sounds like a lot - if I were you, I would look long and hard at your mapping design to be very certain you actually need that many nested objects.
Quoting from the docs:
Internally, nested objects index each object in the array as a separate hidden document.
Since a document has to be fully reindexed on update, you're indexing 4000 documents with a single update.
Why so many fields?
The reason you gave in the comments for needing so many fields
I'd like to search comments in nested and come with their parent stream for display.
makes me think that you may be mixing two concerns here.
ElasticSearch is meant for search, and your mapping should be optimized for search. If your mapping shape is dictated by the way you want to display information, then something is wrong.
Design your index around search
Note that by "search" I mean both indexing and querying.
For the use case you have, it seems like you could:
Index only the comments, with a reference (some id) to the parent stream in the indexed comment document.
After you get the search results (a list of comments) back from the search index, you can retrieve each comment along with its parent stream from some other data source (e.g. a relational database).
The point is, it may be much more efficient to re-retrieve the comment, along with whatever else you want, from some other source that is better than Elasticsearch at joining data.
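A minimal sketch of that design, assuming a recent (typeless) Elasticsearch version; the index and field names are illustrative:
# Each comment is its own small document, carrying only a reference to its stream:
curl -s -XPUT 'http://localhost:9200/comments/_doc/comment-1' -H 'Content-Type: application/json' -d '{
  "stream_id": "conversation-42",
  "message": "hello there"
}'
# Adding or updating one comment now reindexes a single small document,
# instead of reindexing a stream document that contains thousands of nested objects.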

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all these texts:
Something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors but they appear to apply to single documents. So I feel it will be a combination of term vectors and aggregation with n-gram analysis of sorts. But I have no idea how to go about implementing this. Any pointers will be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles:
Use the configuration from the Elastic blog post, but with:
"filter_shingle":{
"type":"shingle",
"max_shingle_size":3,
"min_shingle_size":3,
"output_unigrams":"false"
}
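Expanded into full analysis settings, this might look roughly as follows (a sketch; the analyzer name and the standard tokenizer are assumptions, adjust them to match the blog post and your needs):
"settings": {
  "analysis": {
    "filter": {
      "filter_shingle": {
        "type": "shingle",
        "max_shingle_size": 3,
        "min_shingle_size": 3,
        "output_unigrams": "false"
      }
    },
    "analyzer": {
      "analyzer_shingle": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "filter_shingle"]
      }
    }
  }
}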
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "three-word-phrases": {
      "terms": {
        "field": "body",
        "size": 100
      }
    }
  }
}

How can I get a search term with a space to be one search term

I have an elasticsearch index, with a field called "name" with a mapping as follows:
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
Now let's say I have a record "Brooklyn Technical High School".
I would like somebody searching for "brooklyn t*" to have that show up. For example: http://myserver/_search?q=name:brooklyn+t*
However, it seems to be tokenizing the search term and searching for both "brooklyn" and "t", because I get back results like "Ps 335 Granville T Woods".
I would like it to search the not_analyzed field using the whole term. Enclosing the term in quotes doesn't seem to help either.
You need to use the term query.
A term query won't analyze/tokenize the string before applying the search.
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
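Applied to the field from the question, a hedged sketch might look like this (run the term-level query against the not_analyzed name.raw sub-field; for the starts-with behaviour from the question, a prefix query, which is likewise not analyzed, is one option):
# Exact whole-value match against the not_analyzed sub-field:
curl -s -XGET 'http://myserver/_search' -d '{
  "query": { "term": { "name.raw": "Brooklyn Technical High School" } }
}'

# "Starts with" behaviour on the same sub-field:
curl -s -XGET 'http://myserver/_search' -d '{
  "query": { "prefix": { "name.raw": "Brooklyn T" } }
}'
Note that a not_analyzed field keeps its original casing, so the query value has to match the stored case exactly.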

Resources