Partial update into large document - elasticsearch

I'm facing the problem about performance. My application is about chatting.
I designed mapping index with nested object like below.
{
"conversation_id-v1": {
"mappings": {
"stream": {
"properties": {
"id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"analyzerName": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "analyzerName"
},
"language": {
"type": "langdetect",
"analyzer": "_keyword",
languages: ["en", "ko", "ja"]
}
}
},
"comments": {
"type": "nested",
"properties": {
"id": {
"type": "keyword"
},
"message": {
"type": "text",
"fields": {
"analyzerName": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "analyzerName"
},
"language": {
"type": "langdetect",
"analyzer": "_keyword",
languages: ["en", "ko", "ja"]
}
}
}
}
}
}
}
}
}
}
** actually have a lot of fields
A document has around 4,000 nested objects. When I upsert data into document, It peak the cpu to 100% also disk i/o in case write. Input ratio around 1000/s.
How can I tuning to improve performance?
Hardware
3x 2vCPUs 13GB on GCP

4000 nested fields sounds like a lot - if I were you, I would look long and hard at your mapping design to be very certain you actually need that many nested fields.
Quoting from the docs:
Internally, nested objects index each object in the array as a separate hidden document.
Since a document has to be fully reindexed on update, you're indexing 4000 documents with a single update.
Why so many fields?
The reason you gave in the comments for needing so many fields
I'd like to search comments in nested and come with their parent stream for display.
makes me think that you may be mixing two concerns here.
ElasticSearch is meant for search, and your mapping should be optimized for search. If your mapping shape is dictated by the way you want to display information, then something is wrong.
Design your index around search
Note that by "search" I mean both indexing and querying.
For the use case you have, it seems like you could:
Index only the comments, with a reference (some id) to the parent stream in the indexed comment document.
After you get the search results (a list of comments) back from the search index, you can retrieve each comment along with its parent stream from some other data source (e.g. a relational database).
The point is, it may be much more efficient to re-retrieve the comment along with whatever else you want from some other source that is more better than ElasticSearch at joining data.

Related

Elasticsearch Index Design based on very big nested array (could have more than 300,000 records)

We have a following index schema:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"data": {
"properties": {
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
},
"type": "nested"
}
}
}
}
Id is a unique identifier. Here data field is an array and it could have more than 300,000 objects and may be more. Is it sensible and correct way to index this kind of data? Or we should change our design and make the schema like following:
PUT index_data
{
"aliases": {},
"mappings": {
"properties": {
"Id": {
"type": "integer"
},
"title": {
"type": "text"
},
"FieldName": {
"type": "text"
},
"FieldValue": {
"type": "text"
}
}
}
}
In this design, we cant use Id as a document id because with this design, id would be repeating. If we have 300,000 FieldName and FieldValues for one Id, Id would be repeating 300,000 time. The challenge here is to generate our custom id using some mechanism. Because we need to handle both insert and update cases.
In the first approach one document size would be too large so that it could contain an array of 300,000 objects or may be more.
In second approach, we would have too many documents. 75370611530 is the number we currently have. This is the number of FieldNames and FieldValues we have. How should we handle this kind of data? Which approach would be better? What should be the size of shards in this index?
I noticed that the current mapping is not nested. I assume you would need to be nested as the query seems to be "find value for key = key1".
If it is known that 300K objects are expected - It may not be a good idea. ES soft limit is 10K. Indexing issues are going to give trouble with this approach in addition to possible slow queries.
I doubt if indexing 75 billion documents for this purpose is useful - given the resources required, though it is feasible and will work.
May be consider RDBMS?

What is the correct setup for ElasticSearch 7.6.2 highlighting with FVH?

How to properly setup highlighting search words in huge documents using fast vector highlighter?
I've tried documentation and the following settings for the index (as Python literal, commented alternative settings, which I also tried, with store and without):
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"members": {
"dynamic": "strict",
"properties": {
"url": {
"type": "text",
"term_vector": "with_positions_offsets",
#"index_options": "offsets",
"store": True
},
"title": {
"type": "text",
#"index_options": "offsets",
"term_vector": "with_positions_offsets",
"store": True
},
"content": {
"type": "text",
#"index_options": "offsets",
"term_vector": "with_positions_offsets",
"store": True
}
}
}
}
}
Search done by the following query (again, commented places were tried one by one, in some combinations):
{
"query": {
"multi_match": {
"query": term,
"fields": ["url", "title", "content"]
},
},
"_source": {
#"includes": ["url", "title", "_id"],
# "excludes": ["content"]
},
"highlight": {
"number_of_fragments": 40,
"fragment_size": 80,
"fields": {
"content": {"matched_fields": ["content"]},
#"content": {"type": "fvh", "matched_fields": ["content"]},
#"title": {"type": "fvh", "matched_fields": ["title"]},
}
}
}
The problem is, that when FVH is not used, ElasticSearch complains that "content" field is too large. (And I do not want to increase the allowed size). When I add "fvh" type, ES complain that terms vectors are needed: Even though I've checked those are there by querying document info (offsets, starts, etc):
the field [content] should be indexed with term vector with position
offsets to be used with fast vector highlighter
It seems like:
When I omit "type": "fvh", it is not used even though documentation mentions it's the default when "term_vector": "with_positions_offsets".
I can see term vectors in the index, but ES does not find them. (indirectly, when indexing with term vectors the index is almost twice as large)
All the trials included removing old index and adding it again.
It's also so treacherous, that it fails only when a large document is encountered. Highlights are there for queries, where documents are small.
What is the proper way to setup highlights in ElasticSearch 7, free edition (I tried under Ubuntu with binary deb from the vendor)?
The fvh highlighter uses the Lucene Fast Vector highlighter. This highlighter can be used on fields with term_vector set to with_positions_offsets in the mapping. The fast vector highlighter requires setting term_vector to with_positions_offsets which increases the size of the index.
you can define a mapping like below for your field.
"mappings": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
while querying for highlight fields, you need to use "type" : "fvh"
The fast vector highlighter will be used by default for the text field because term vectors are enabled.

Elasticsearch: better to have more values or more fields?

Suppose to have an index with documents describing vehicles.
Your index needs to deal with two different type of vehicles: motorcycle and car.
Which of the following mapping is better from a performance point of view?
(nested is required for my purposes)
"vehicle": {
"type": "nested",
"properties": {
"car": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
},
"motorcycle": {
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
}
}
}
}
}
or this one:
"vehicle": {
"type": "nested",
"properties": {
"model": {
"type": "string"
},
"cost": {
"type": "integer"
},
"vehicle_type": {
"type": "string" ### "car", "motorcycle"
}
}
}
The second one is more readable and thin.
But the drawback that I'll have is that when I make my queries, if I want to focus only on "car", I need to put this condition as part of the query.
If I use the first mapping, I just need to have a direct access to the stored field, without adding overhead to the query.
The first mapping, where cars and motorcycles are isolated in different fields, is more likely to be faster. The reason is that you have one less filter to apply as you already know, and because of the increased selectivity of the queries (e.g less documents for a given value of vehicle.car.model than just vehicle.model)
Another option would be to create two distinct indexes car and motorcycle, possibly with the same index template.
In Elasticsearch, a query is processed by a single-thread per shard. That means, if you split your index in two, and query both in a single request, it will be executed in parallel.
So, when needed to query only one of cars or motorcycles, it's faster simply because indexes are smaller. And when it comes to query both cars and motorcycles it could also be faster by using more threads.
EDIT: one drawback of the later option you should know, the inner lucene dictionary will be duplicated, and if values in cars and motorcycles are quite identical, it doubles the list of indexed terms.

how to index questions and answers in elaticsearch

I am doing a project to index questions and answers of a website in elasticsearch (version 6) for search purpose.
I have first thought of creating two indexes as shown below, one for questions and one for answers.
questions mapping:
{"mappings": {
"question": {
"properties": {
"title":{
"type":"text"
},
"question": {
"type": "text"
},
"questionId":{
"type":"keyword"
}
}
}
}
}
answers mapping:
{"mappings": {
"answer": {
"properties": {
"answer":{
"type":"text"
},
"answerId": {
"type": "keyword"
},
"questionId":{
"type":"keyword"
}
}
}
}
}
I have used multimatch query along with term and top_hits aggregation to search the indexed Q&As (referred question).I used this method to remove the duplicates from the search results. As answers or the question itself of the same question can appear in the result. I only want one entry per question in the results. the problem I am facing is to paginate the results. there is no possible way to paginate aggregation in elasticsearch. It can only paginate hits not aggregations.
then I thought of saving the both question and answers in one document, answers in a Json array. the problem with this approach is that there is no clean way to add, remove, update a specific answer in a given question document. only way I found was using a groovy script (referred question). which is deprecated in elasticsearch v6 AFAIK.
Is there a better and clean way to design this ?
Thanks.
Parent-Child Relationship
Use the parent-child relationship. It is similar to the nested model, and allows association of one entity with another. You can associate one document type with another, in a one-to-many relationship.
More information on here: https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html
Child documents can be added, changed, or deleted without affecting the parent nor other children. You can do pagination on the parent documents using the Scroll API.
Child documents can be retrieved using the has_parent join.
The trade-off: you do not have to take care of duplicates and pagination problems, but parent-child queries can be 5 to 10 times slower than the equivalent nested query.
Your mapping can be like the following:
PUT /my-index
{
"mappings": {
"question": {
"properties": {
"title": {
"type": "text"
},
"question": {
"type": "text"
},
"questionId": {
"type": "keyword"
}
}
},
"answer": {
"_parent": {
"type": "question"
},
"properties": {
"answer": {
"type": "text"
},
"answerId": {
"type": "keyword"
},
"questionId": {
"type": "keyword"
}
}
}
}
}

elasticsearch: sorting by values of matching nested document

I chose nested documents to realize a multilingual book search with common book data in root of the doc and edition data in nested docs. The mapping:
{
"book": {
"properties": {
"bookinfo": {
...
},
"editions": {
"type": "nested",
"properties": {
"editionid": {
"type": "long",
"store": "yes",
"index": "no"
},
"title_author": {
"type": "string",
"store": "no",
"index": "analyzed"
},
"title": {
"type": "string",
"store": "yes",
"index": "not_analyzed"
},
"languageid": {
"type": "short",
"store": "yes",
"index": "no"
},
"ratings": {
"type": "integer",
"store": "no"
}
}
}
}
}
}
Different editions of one book go in the nested doc - that can be different languages but also just different publishers, isbn and so on. Sometimes even the title differs from editions in the same language.
When searching the document (on the title_author field) I need to know the other nested doc information like languageid and ratings to boost the matching score according to the users language skills and relevance of the edition.
The reason why I don't put every edition in a separate document is that I only want to have one hit (the best matching one) per book. And ElasticSearch doesn't have a UNIQUE functionality. And I need pagination. So whenever I change a result set after querying with double books inside, pagination of ElasticSearch breaks.
Nested sorting functionality doesn't seem to help here, since it sorts over all nested documents of one book.
How do I access the information of the matching nested doc?
And if that is not achievable, how could I solve this by a multi search?
In order to access the nested documents fields you can use:
doc['editions. languageid'].value
And for the boosting part, try some of the examples here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
Is that what you're looking for?

Resources