Query performance when applying the "Great mapping refactoring" - elasticsearch

Our application's entities are dynamic: we don't know in advance how many properties they'll have or what their types will be.
Up until now, we've indexed our data in the following way:
{
  "message": "some string",
  "count": 1,
  "date": "2015-06-01"
}
After reading the following blog post, we understood that it's better to index the data like this:
{
  "data": [
    {
      "key": "message",
      "str_val": "some_string"
    },
    {
      "key": "count",
      "int_val": 1
    },
    {
      "key": "date",
      "date_val": "2015-06-01"
    }
  ]
}
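For reference, here is a minimal sketch of the nested mapping this key/value layout implies, written in current (7.x+) syntax; the index name myindex and the exact field types are assumptions based on the example above:
PUT myindex
{
  "mappings": {
    "properties": {
      "data": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "str_val": { "type": "text" },
          "int_val": { "type": "long" },
          "date_val": { "type": "date" }
        }
      }
    }
  }
}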
We were wondering how this index would behave in terms of nested aggregations (a sample aggregation is sketched below).
Will the mapping refactoring above hurt indexing time (and/or query/aggregation time), given that every entity is now nested one level deeper?
We have thousands of different object types, so our mapping file is huge. That slows down indexing, which is why a mapping refactoring is badly needed.
Are you aware of any disadvantages to refactoring our mapping as explained in the blog above?
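For concreteness, a nested aggregation over this layout would look something like the sketch below (again assuming the myindex mapping above); every bucket or metric now has to pass through an extra nested step:
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "all_data": {
      "nested": { "path": "data" },
      "aggs": {
        "by_key": {
          "terms": { "field": "data.key" },
          "aggs": {
            "total": { "sum": { "field": "data.int_val" } }
          }
        }
      }
    }
  }
}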

Related

Ingesting / enriching / transforming data in one elasticsearch index with dynamic information from a second one

I would like to dynamically enrich an existing index based on the (weighted) term frequencies given in a second index.
Imagine I have one index with one field I want to analyze (field_of_interest):
POST test/_doc/1
{
  "field_of_interest": "The quick brown fox jumps over the lazy dog."
}
POST test/_doc/2
{
  "field_of_interest": "The quick and the dead."
}
POST test/_doc/3
{
  "field_of_interest": "The lazy quack was quick to quip."
}
POST test/_doc/4
{
  "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! "
}
and a second one (scores) with pairs of keywords and weights:
POST scores/_doc/1
{
  "term": "quick",
  "weight": 1
}
POST scores/_doc/2
{
  "term": "brown",
  "weight": 2
}
POST scores/_doc/3
{
  "term": "lazy",
  "weight": 3
}
POST scores/_doc/4
{
  "term": "green",
  "weight": 4
}
I would like to define and perform some kind of analysis, ingestion, transform, enrichment, or re-indexing that dynamically adds a new field, points, to the first index: for each document, the sum of the weighted occurrences in field_of_interest of each of the terms from the second index. After performing this operation, I would want the new index to look something like this (some fields omitted):
{
  "_id": "1",
  "_source": {
    "field_of_interest": "The quick brown fox jumps over the lazy dog.",
    "points": 6
  }
},
{
  "_id": "2",
  "_source": {
    "field_of_interest": "The quick and the dead.",
    "points": 1
  }
},
{
  "_id": "3",
  "_source": {
    "field_of_interest": "The lazy quack was quick to quip.",
    "points": 4
  }
},
{
  "_id": "4",
  "_source": {
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "points": 9
  }
}
If possible, it might even be interesting to get an individual field for each of the terms, listing the weighted sum of its occurrences, e.g.
{
  "_id": "4",
  "_source": {
    "field_of_interest": "Quick, quick, quick, you lazy, lazy guys! ",
    "quick": 3,
    "brown": 0,
    "lazy": 6,
    "green": 0,
    "points": 9
  }
}
The question I now have is how to go about this in Elasticsearch. I am fairly new to Elastic, and there are many concepts that seem promising, but so far I have not been able to pinpoint even a partial solution.
I am on Elasticsearch 7.x (but would be open to move to 8.x) and want to do this via the API, i.e. without using Kibana.
I first thought of an _ingest pipeline with an _enrich policy, since I am kind of trying to add information from one index to another. But my understanding is that the matching does not allow for a query, so I don't see how this could work.
I also looked at _transform, _update_by_query, custom scoring, and _term_vector, but to be honest, I am a bit lost.
I would appreciate any pointers on whether what I want to do can be done with Elasticsearch (I assumed it would kind of be the perfect tool) and, if so, which of the many different Elasticsearch concepts would be most suitable for my use case.
Follow this sequence of steps:
1. Scroll (_scroll) over every document in the second index.
2. For each term, search for it in the first index (a simple match query).
3. Increment points with a scripted update operation on every matching document (see the sketch below).
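As a rough sketch of step 3, assuming the test index and the field names from the question, a scripted _update_by_query per term could look like this (shown for the term quick with weight 1):
POST test/_update_by_query
{
  "query": {
    "match": { "field_of_interest": "quick" }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.points = (ctx._source.points != null ? ctx._source.points : 0) + params.weight",
    "params": { "weight": 1 }
  }
}
Note that this adds the weight once per matching document; counting repeated occurrences within a single document would need term vectors or client-side counting.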
Having individual words as fields in the first index is not a good idea. We do not know which words will be found inside the sentences, so your index mapping would explode with a lot of dynamic fields, which is not desirable. A better way is to add a nested field to the first index, with the following mapping:
{
  "words": {
    "type": "nested",
    "properties": {
      "name": { "type": "keyword" },
      "weight": { "type": "float" }
    }
  }
}
Then you simply append to this array for every word that is found. points can be a separate field.
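A sketch of such an append, via a scripted _update call (the document ID, word, and weight are placeholders):
POST test/_update/1
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.words == null) { ctx._source.words = []; } ctx._source.words.add(params.word);",
    "params": {
      "word": { "name": "quick", "weight": 1.0 }
    }
  }
}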
What you want to do has to be done client side; there is no built-in way to handle such an operation.
HTH.

Index main object and sub-objects, and do a search on sub-objects (that returns sub-objects)

I have an object like this (simplified here). Each strain has many chromosomes, which have many loci, which have many features, which have many products; here I just put one of each.
The structure in JSON is:
{
  "name": "my strain",
  "public": false,
  "authorized_users": [1, 23, 51],
  "chromosomes": [
    {
      "name": "C1",
      "locus": [
        {
          "name": "locus1",
          "features": [
            {
              "name": "feature1",
              "products": [
                {
                  "name": "product1"
                  //...
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}
I want to add this object to Elasticsearch. For the moment I've added the objects separately: locus, features, and products. That works for search (I type a keyword and look in the names of loci, features, and products), but I have to duplicate data like public and authorized_users into each sub-object.
Can I index the whole object in Elasticsearch and still search at each level (locus, features, products) and get those levels back individually (i.e. not the whole strain object)?
Yes, you can search at any level (i.e., with a query on a field like "chromosomes.locus.name").
But as you have arrays at each level, you will have to use nested objects (and nested queries) to get exactly what you want, which is a bit more complex:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.3/query-dsl-nested-query.html
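For example, a minimal sketch of a nested query against the locus level, assuming an index named strains in which chromosomes and chromosomes.locus are both mapped as nested:
GET strains/_search
{
  "query": {
    "nested": {
      "path": "chromosomes",
      "query": {
        "nested": {
          "path": "chromosomes.locus",
          "query": {
            "match": { "chromosomes.locus.name": "locus1" }
          }
        }
      }
    }
  }
}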
For your last question: no, you cannot get sub-objects individually; Elasticsearch returns the whole JSON source object.
If you only want data from sub-objects, you will have to use nested aggregations.

Elasticsearch: querying on both nested object properties and parent properties

I have some documents with nested objects inside nested objects:
{
  "started_at": 1455088063966,
  "ended_at": 1455088131966,
  "tags": [{
    "type": "transfer",
    "at": 1455088064462,
    "events": [{
      "type": "transfer_processed",
      "at": 1455088131981
    }]
  }, {
    "at": 1455088138232,
    "item": "tag",
    "type": "info"
  }]
}
Here, the main document has several nested objects (the tags), and for each tag there are several nested objects (the events).
I would like to get all the documents where the events of type transfer_processed occurred 60000 milliseconds after the tags of type transfer.
For this, I would need to query on tags.at, tags.type, tags.events.at, and tags.events.type at the same time. And I can't figure out how: I only manage to query on the tags.events properties, or only on the tags properties, not both.
Nested objects are actually separate Lucene documents under the hood, so you are essentially trying to "join" multiple documents together to do your comparisons. Unfortunately, this is not supported by Elasticsearch.
Have a look at this similar question and answer, which explain it well.

elasticsearch can't understand signed date

I'm trying to save a document with the following values to Elasticsearch (1.7) from my Ruby script.
{
  "P577": [{
    "snaktype": "value",
    "property": "P577",
    "hash": "2a7ea4b81277334f08c4cd9efbce76001505a481",
    "datavalue": {
      "value": {
        "time": "+2015-10-16T00:00:00Z",
        "timezone": 0,
        "before": 0,
        "after": 0,
        "precision": 11,
        "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
      },
      "type": "time"
    },
    "datatype": "time"
  }]
}
It turns out that ES doesn't know how to handle a time like +2015-10-16T00:00:00Z.
Is there a way to make ES understand this kind of date?
I know I could probably use a mapping, but what I showed is a very small piece of a giant JSON document, with lots of nested nodes (like here).
Unfortunately, the only way to do this is with a custom mapping and changing how ES detects dates:
What I've done in the past is to:
1. Insert a document.
2. Extract the mapping (see how here).
3. Update the mapping to my liking.
4. Do a PUT of the new mapping. To do this, you might need to create a different type, as sometimes ES cannot replace a mapping on the fly.
I've never done this with dates, but I've had to do it a few times with the way a field is indexed.
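In this particular case, one way of changing how ES detects dates is simply to turn date detection off for the type, so the value is indexed as a plain string. A minimal sketch in ES 1.7 syntax, where the index and type names are placeholders:
PUT myindex
{
  "mappings": {
    "mytype": {
      "date_detection": false
    }
  }
}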
"+2015-10-16T00:00:00Z" remove plus sign and try it,.
"2015-10-16T00:00:00Z"

ElasticSearch performance when querying by element type

Assume that we have a dataset containing a collection of domains { domain.com, domain2.com } and also a collection of users { user#domain.com, angryuser#domain2.com, elastic#domain3.com }.
That said, let's assume that domains and users have several attributes in common, such as "domain", and that when the attribute name matches, so do the mapping and the possible values.
Then we load our Elasticsearch index with both collections, separating them by type: domain and user.
Obviously, in our system we have many more users than domains, so when querying for domain-related data the expectation is that filtering the query by type would make it much faster, right?
My question is: with around 5 million users and 200k domains, why do queries run much faster when my index contains only domain data (users deleted) than when I filter the objects by their type? Shouldn't the performance be at least similar? Currently we can match 20 domains per second when there are no users in the index, but that drops to 4 when we load the users, even though we still filter by type.
Maybe it is something I'm missing, as I'm new to Elasticsearch.
UPDATE:
This is the query, basically:
"query" : {
"flt_field": {
"domain_address": {
"like_text": "chroma",
"fuzziness": 0.3
}
}
}
And the mapping is something like this:
"user": {
"properties": {
...,
"domain_address": {
"type": "string",
"boost": 2.4,
"similarity": "linear"
}
}
},
"domain": {
"properties": {
...,
"domain_address": {
"type": "string",
"boost": 2.4,
"similarity": "linear"
}
}
}
There are other fields in ..., but their mapping should not influence the outcome, should it?
