I was working on a project that needs to index a bunch of Products and their Variants into Elasticsearch. Variants have the same schema as Products in the database, so naturally I started by designing a mapping that is exactly the same as the Product schema and indexing products and variants as their own documents.
But later, when I accidentally tried to index variants as nested objects inside products, the indexing process was 3x-5x faster (tested several times locally with 1000 products & 5 variants, 2000 products & 10 variants, and 25000 products & 5 variants). The mapping looks something like this:
id: keyword
name: text
sku: keyword
price: long
color: keyword
...
variants: [
  {
    id: keyword
    name: text
    sku: keyword
    price: long
    color: keyword
    ...
  }
]
So the question is: why? Since the data size is the same, I expected the nested mapping to take longer to index because it has twice as many fields. Also, I'm using the _bulk API to index products together with their variants in each API call, so the request count is the same either way.
Thanks in advance for any suggestions on why this is.
PS: I'm running Elasticsearch 6.7 locally
Just trying to answer the question of why the indexing time is different.
Nested documents are indexed differently. Internally, nested documents are indexed as separate documents, but indexed as a single block within Lucene.
Suppose your document contains two variants in the nested data structure. In that case, the total number of documents indexed will be 3 (1 parent doc + 2 variants as separate docs), internally via Lucene's addDocuments() call. This guarantees the documents are indexed in a single block and are available to query using the nested query (the nested query joins these documents at runtime).
This results in different indexing behavior. In your case it got faster, but if you have, say, thousands of variants per product, too many nested structures can give you indexing problems. There are limits (such as index.mapping.nested_objects.limit) to guard against this.
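For reference, a minimal sketch of what the nested variant mapping could look like in 6.x syntax (the index name products and the type name _doc are illustrative, and the field list is abbreviated to the fields shown above):

PUT /products
{
  "mappings": {
    "_doc": {
      "properties": {
        "id":    { "type": "keyword" },
        "name":  { "type": "text" },
        "sku":   { "type": "keyword" },
        "price": { "type": "long" },
        "color": { "type": "keyword" },
        "variants": {
          "type": "nested",
          "properties": {
            "id":    { "type": "keyword" },
            "name":  { "type": "text" },
            "sku":   { "type": "keyword" },
            "price": { "type": "long" },
            "color": { "type": "keyword" }
          }
        }
      }
    }
  }
}

With "type": "nested" on variants, each product and its variants are indexed together as one Lucene block, which is the behavior described above.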
Related
I would like to index documents with a nested field. There will be cases where some documents have more than 10k elements in the nested field (even 80k). I'm also going to query this index with inner_hits, using the from and size parameters (I need to paginate the nested objects). My question is: is it a good approach to use a nested field when I need to paginate a list with a huge amount of data, or would it be better to denormalize the model?
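For context, a minimal sketch of the kind of paginated inner_hits query meant here (the index name my_index and the nested field name items are placeholders):

GET /my_index/_search
{
  "query": {
    "nested": {
      "path": "items",
      "query": { "match_all": {} },
      "inner_hits": {
        "from": 20,
        "size": 10
      }
    }
  }
}

Note that recent Elasticsearch versions cap from + size in inner_hits via the index.max_inner_result_window setting (100 by default), which is worth keeping in mind with tens of thousands of nested elements.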
We have an Elasticsearch index that contains over half a billion documents, each of which has a url field that stores a URL.
The url field mapping currently has the settings:
{
  "index": "not_analyzed",
  "doc_values": true,
  ...
}
We want our users to be able to search URLs, or portions of URLs, without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser that leverages the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and, for each document, create a new field that is a list of URL parts (a sketch of such a document follows this list). This field could still have doc_values: true, so it would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
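For illustration, the second option could produce documents like the following; the field name url_parts and the exact set of segment permutations stored are assumptions:

{
  "url": "/part1/user#site/part2/part3.ext",
  "url_parts": [
    "part1",
    "user#site",
    "part2",
    "part3.ext",
    "user#site/part2/part3.ext",
    "part2/part3.ext"
  ]
}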
My question is this:
Which is better for performance: having a variable-length list field with doc_values on (option 2), or having an analysed field (option 1)? Or is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but cannot index with the keyword analyzer.
You did not mention why you need doc_values, but you did mention that you currently do not search on this field.
So I guess the search field does not have to have the same name: you can copy the field value into another field that exists only for search ("store": false). For this new field you can use the pattern analyzer or pattern tokenizer for your use case.
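A rough sketch of that idea, in the pre-5.x string/not_analyzed style the question's mapping uses (the analyzer, tokenizer, and sub-field names are assumptions, and the pattern here only splits on slashes):

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "url_segments": { "type": "pattern", "pattern": "/" }
      },
      "analyzer": {
        "url_parts": { "type": "custom", "tokenizer": "url_segments" }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "url": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true,
          "fields": {
            "search": { "type": "string", "analyzer": "url_parts" }
          }
        }
      }
    }
  }
}

A match query against url.search would then see part1, user#site, part2, and part3.ext as separate terms, and a match_phrase query can handle multi-segment fragments like part2/part3.ext.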
It seems that no one has actually performance-tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of Gatling tests to run against the indices and compared the Gatling results and the netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netdata landscape: the analysed field showed a spike, although only a small one, on all Elasticsearch nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off: you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate to it.
I use Elasticsearch to store system vulnerabilities. Right now my typical entry is
{
  "_id": "somerandomid",
  "_source": {
    "ip": "10.10.10.10",
    "vuln_name": "v1",
    "vuln_type": 1
  }
}
This approach has the advantage of simplifying queries ("number of machines with a vuln of type 1" becomes an aggregation; "number of vulnerabilities" a match_all search and its associated total value; ...).
It also has drawbacks, in particular:
the information is heavily duplicated: the information about one host is copied across all of its vulnerabilities
there are as many entries as vulnerabilities, not hosts (50x more on average)
the natural container is "host" and not "vulnerability": a host can be updated, deleted, etc. more easily
I am therefore considering changing the scheme to a "host" base one:
{
  "_id": "machine1",
  "_source": {
    "ip": "10.10.10.10",
    "vuln": [
      {
        "name": "v1",
        "type": 1
      },
      {
        "name": "v2",
        "type": 1
      }
    ]
  }
}
The problem I am running into is that I still fundamentally query vulnerabilities and do not know how to "explode" them in a query.
Specifically (I believe my problem will gravitate around this family of queries), how can I query:
the total number of vulnerabilities of type 1 (not the hosts: there can be several vulns of type 1 per host, and the basic query retrieves the entries, which are hosts)
the same as above, but with some filtering on, say, the vulnerability name ("number of vulnerabilities of type 1 with 'Microsoft' in the name"); the filtering is on a feature of the vulnerability, not the host
Just to give you a simple overview: in Elasticsearch you have two ways to manage nested data, a nested object or an inner object, and behind the scenes they are completely different.
The nested type is a specialized version of the object datatype that allows arrays of objects to be indexed and queried independently of each other.
Nested docs are stored in the same Lucene block as each other, which helps read/query performance.
Reading a nested doc is faster than the equivalent parent/child.
Updating a single field in a nested document (parent or nested children) forces ES to reindex the entire nested document. This can be very expensive for large nested docs
"Cross referencing" nested documents is impossible
Best suited for data that does not change frequently
An inner object is an object embedded inside the parent document.
Easy, fast, performant
Only applicable when one-to-one relationships are maintained
No need for special queries
Please have a look at the following link for further information on the difference between inner objects and nested objects:
https://www.elastic.co/blog/managing-relations-inside-elasticsearch
In order to query and aggregate (to get the total number), have a look at the following links:
Query : https://www.elastic.co/guide/en/elasticsearch/guide/master/nested-objects.html
Aggregations :
https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-aggregation.html
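To make the counts from the question concrete, here is a minimal sketch, assuming vuln is mapped as a nested field and the index is called hosts (both assumptions). The doc_count of the filter aggregation counts nested vulnerability documents rather than hosts, and the match clause covers the "Microsoft in the name" variant:

GET /hosts/_search
{
  "size": 0,
  "aggs": {
    "all_vulns": {
      "nested": { "path": "vuln" },
      "aggs": {
        "type_1_microsoft": {
          "filter": {
            "bool": {
              "must": [
                { "term": { "vuln.type": 1 } },
                { "match": { "vuln.name": "Microsoft" } }
              ]
            }
          }
        }
      }
    }
  }
}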
I need to aggregate by 7 fields in Elasticsearch, then retrieve the data with top_hits and do some calculations (sum and avg). Is there any way to get the resulting buckets of hits and calculations without many loops/recursion?
According to Elasticsearch documentation:
"The terms aggregation does not support collecting terms from multiple fields in the same document. The reason is that the termsagg doesn’t collect the string term values themselves, but rather uses global ordinals to produce a list of all of the unique values in the field. Global ordinals results in an important performance boost which would not be possible across multiple fields.
There are two approaches that you can use to perform a terms agg across multiple fields:
Script
Use a script to retrieve terms from multiple fields. This disables the global ordinals optimization and will be slower than collecting terms from a single field, but it gives you the flexibility to implement this option at search time.
copy_to field
If you know ahead of time that you want to collect the terms from two or more fields, then use copy_to in your mapping to create a new dedicated field at index time which contains the values from both fields. You can aggregate on this single field, which will benefit from the global ordinals optimization."
Source: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_multi_field_terms_aggregation
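For the script approach, a sketch of what such an aggregation could look like (the index name my_index and the field names field_a and field_b are placeholders):

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "combined_terms": {
      "terms": {
        "script": {
          "lang": "painless",
          "source": "doc['field_a'].value + '|' + doc['field_b'].value"
        }
      }
    }
  }
}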
EDIT: If you use the copy_to field approach, there is no reason to analyze that field, since you only aggregate on it; for this you only have to change its mapping:
"metaFieldName" => [
"type" => "string",
"index" => "not_analyzed"
]
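And a sketch of the copy_to variant in current JSON mapping syntax (index and field names are again placeholders; note the combined field is keyword, i.e. not analyzed, in line with the edit above):

PUT /my_index
{
  "mappings": {
    "properties": {
      "field_a":  { "type": "keyword", "copy_to": "combined" },
      "field_b":  { "type": "keyword", "copy_to": "combined" },
      "combined": { "type": "keyword" }
    }
  }
}

A plain terms aggregation on combined then benefits from the global ordinals optimization.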
I have a question regarding the setup of my Elasticsearch index... I have created a table which I have rivered into an Elasticsearch index. The table is built from a script that queries multiple tables to denormalize the data, making it easier to index by a unique id at a 1:1 ratio.
An example of a set of fields I have is street, city, state, zip, which I can query on. But my question is: should I keep those fields individually indexed, or concatenate them into one big field like address which combines all of the previous fields? Or should I put in the extra time to set up parent-child indexes?
The example use case: I have a customer with billing info coming in from one direction, and I want to query Elasticsearch to see if that customer already exists, or at least return the closest result.
I know this question is more conceptual than programming-related; I just can't find any information on best practices.
Concatenation
For the first part of your question: I wouldn't concatenate the different fields into one field containing all the information. Having multiple fields gives you the advantage of calculating facets and aggregates on those fields, e.g. how many customers are from a specific city or have a specific zip. You can still use a match or multi_match query to search for information across different fields.
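For instance, a query like the following (the index name customers and the query text are made up) searches all four address fields at once:

GET /customers/_search
{
  "query": {
    "multi_match": {
      "query": "main street springfield",
      "fields": ["street", "city", "state", "zip"]
    }
  }
}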
In addition to having the information in separate fields, I would use multi-fields with an analyzed and a not_analyzed part (fieldname.raw). This again allows for aggregates, facets and sorting.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html
Think of 'New York': if you analyze it, it will be stored as ['New', 'York'] and you will not be able to see all people from 'New York'. What you'd see are all people from 'New' and 'York'.
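A sketch of such a multi-field mapping (using the later fields syntax rather than the 0.90 multi_field type linked above; the field name is illustrative):

{
  "city": {
    "type": "string",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Facets, aggregations and sorting on city.raw then see the whole value 'New York' instead of the separate tokens 'new' and 'york'.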
_all field
There is a special _all field in Elasticsearch which does the concatenation in the background, so you don't have to do it yourself. It is possible to enable/disable it.
Parent Child relationship
Concerning the part about whether to use nested objects or a parent-child relationship: I think that using a parent-child relationship is more appropriate for your case. Arrays of inner objects are stored in a 'flattened' way, i.e. the information from the objects in an array is stored as being part of one document. Consider the following example:
You have orders for two clients:
client: 'Samuel Thomson'
  orderline: 'Strong Thinkpad'
  orderline: 'Light Macbook'
client: 'Jay Rizzi'
  orderline: 'Strong Macbook'
With flattened inner objects, if you search for clients who ordered 'Strong Macbook' you'd get both clients. This is because 'Samuel Thomson' and his orders are stored altogether, i.e. ['Strong', 'Thinkpad', 'Light', 'Macbook']; there is no distinction between the two orderlines.
By using parent-child documents, the orderlines for the same client are not mixed together and preserve their identity.
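A rough sketch of that parent-child setup in the pre-5.x syntax this answer's era implies (index, type, and field names are made up):

PUT /shop
{
  "mappings": {
    "client": {
      "properties": {
        "name": { "type": "string" }
      }
    },
    "orderline": {
      "_parent": { "type": "client" },
      "properties": {
        "description": { "type": "string" }
      }
    }
  }
}

GET /shop/client/_search
{
  "query": {
    "has_child": {
      "type": "orderline",
      "query": { "match_phrase": { "description": "Strong Macbook" } }
    }
  }
}

Because each orderline is its own document, the has_child query above would only return 'Jay Rizzi'.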