Best way to index arbitrary attribute-value pairs in Elasticsearch

I am trying to index documents in Elasticsearch which have attribute-value pairs. Example documents:
{
id: 1,
name: "metamorphosis",
author: "franz kafka"
}
{
id: 2,
name: "techcorp laptop model x",
type: "computer",
memorygb: 4
}
{
id: 3,
name: "ss2014 formal shoe x",
color: "black",
size: 42,
price: 124.99
}
Then, I need queries like:
1. "author" EQUALS "franz kafka"
2. "type" EQUALS "computer" AND "memorygb" GREATER THAN 4
3. "color" EQUALS "black" OR ("size" EQUALS 42 AND price LESS THAN 200.00)
What is the best way to store these documents for efficiently querying them? Should I store them exactly as shown in the examples? Or should I store them like:
{
fields: [
{ "type": "computer" },
{ "memorygb": 4 }
]
}
or like:
{
fields: [
{ "key": "type", "value": "computer" },
{ "key": "memorygb", "value": 4 }
]
}
And how should I map my indices for being able to perform both my equality and range queries?

If someone is still looking for an answer, I wrote a post about how to index arbitrary data into Elasticsearch and then search it by specific fields and values, all without blowing up your index mapping.
The post: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/
In short, you will need to create the special index described in the post. Then you will need to flatten your data using the flattenData function (https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4). The flattened data can then be safely indexed into the Elasticsearch index.
For example:
flattenData({
id: 1,
name: "metamorphosis",
author: "franz kafka"
});
Will produce:
[
{
"key": "id",
"type": "long",
"key_type": "id.long",
"value_long": 1
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "metamorphosis"
},
{
"key": "author",
"type": "string",
"key_type": "author.string",
"value_string": "franz kafka"
}
]
And
flattenData({
id: 2,
name: "techcorp laptop model x",
type: "computer",
memorygb: 4
});
Will produce:
[
{
"key": "id",
"type": "long",
"key_type": "id.long",
"value_long": 2
},
{
"key": "name",
"type": "string",
"key_type": "name.string",
"value_string": "techcorp laptop model x"
},
{
"key": "type",
"type": "string",
"key_type": "type.string",
"value_string": "computer"
},
{
"key": "memorygb",
"type": "long",
"key_type": "memorygb.long",
"value_long": 4
}
]
Then you can build Elasticsearch queries to query your data. Every query should specify both the key and the type of the value. If you are unsure which keys or types the index has, you can run an aggregation to find out; this is also discussed in the post.
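For instance, a minimal sketch of such a discovery aggregation, assuming your index is named myindex (hypothetical) and flatData.key_type is indexed as an exact-value field, as in the post:
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "all_key_types": {
      "nested": { "path": "flatData" },
      "aggs": {
        "key_types": {
          "terms": { "field": "flatData.key_type", "size": 100 }
        }
      }
    }
  }
}
The buckets returned tell you which key/type combinations exist, so you know whether to target value_string, value_long, etc. in your queries.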
For example, to find a document where author == "franz kafka" you need to execute the following query:
{
"query": {
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "author"}},
{"match": {"flatData.value_string": "franz kafka"}}
]
}
}
}
}
}
To find documents where type == "computer" and memorygb > 4 you need to execute the following query:
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "type"}},
{"match": {"flatData.value_string": "computer"}}
]
}
}
}
},
{
"nested": {
"path": "flatData",
"query": {
"bool": {
"must": [
{"term": {"flatData.key": "memorygb"}},
{"range": {"flatData.value_long": {"gt": 4}}}
]
}
}
}
}
]
}
}
}
Here, because we want the same document to match both conditions, we use an outer bool query with a must clause wrapping two nested queries.
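Following the same pattern, the question's third query, "color" EQUALS "black" OR ("size" EQUALS 42 AND price LESS THAN 200.00), becomes an outer bool query with a should clause. A sketch, assuming the flattening convention stores floating-point numbers in a value_float field (check the post for the exact field name):
{
  "query": {
    "bool": {
      "should": [
        {
          "nested": {
            "path": "flatData",
            "query": {
              "bool": {
                "must": [
                  {"term": {"flatData.key": "color"}},
                  {"match": {"flatData.value_string": "black"}}
                ]
              }
            }
          }
        },
        {
          "bool": {
            "must": [
              {
                "nested": {
                  "path": "flatData",
                  "query": {
                    "bool": {
                      "must": [
                        {"term": {"flatData.key": "size"}},
                        {"term": {"flatData.value_long": 42}}
                      ]
                    }
                  }
                }
              },
              {
                "nested": {
                  "path": "flatData",
                  "query": {
                    "bool": {
                      "must": [
                        {"term": {"flatData.key": "price"}},
                        {"range": {"flatData.value_float": {"lt": 200.0}}}
                      ]
                    }
                  }
                }
              }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
The explicit "minimum_should_match": 1 keeps the should clause behaving as a strict OR.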

Elasticsearch is a schema-less data store which allows dynamic indexing of new attributes, and there is no performance impact in having optional fields. Your first mapping is absolutely fine, and you can use boolean queries around your dynamic attributes.
There is no inherent performance benefit in making them nested fields; they will be flattened on indexing anyway, into fields like fields.type, fields.memorygb, etc.
On the contrary, your last mapping, where you try to store key-value pairs, will have a performance impact, since you will have to query on two different indexed fields, i.e. where key='memorygb' and value=4.
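For example, the question's second query could run directly against that first flat mapping; a minimal sketch, assuming type is indexed as an exact-value (not_analyzed/keyword) field:
{
  "query": {
    "bool": {
      "must": [
        { "term": { "type": "computer" } },
        { "range": { "memorygb": { "gt": 4 } } }
      ]
    }
  }
}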
Have a look at the documentation about dynamic mapping:
One of the most important features of Elasticsearch is its ability to be schema-less. There is no performance overhead if an object is dynamic; the ability to turn it off is provided as a safety mechanism so "malformed" objects won't, by mistake, index data that we do not wish to be indexed.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html

You need a filtered query: use a range query together with a match query.
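A sketch of what that could look like with the ES 1.x filtered query (removed in 5.0; on modern versions use a bool query with a filter clause instead):
{
  "query": {
    "filtered": {
      "query": { "match": { "type": "computer" } },
      "filter": { "range": { "memorygb": { "gt": 4 } } }
    }
  }
}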

Related

How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in Elasticsearch. In those documents, you have a multi_field (keyword, text) tags:
{
...
tags: ['Race', 'Racing', 'Mountain Bike', 'Horizontal'],
...
},
{
...
tags: ['Tracey Chapman', 'Silverfish', 'Blue'],
...
},
{
...
tags: ['Surfing', 'Race', 'Disgrace'],
...
},
You can use these values as filters (facets) against a query to pull only the documents that contain this tag:
...
"filter": [
{
"terms": {
"tags": [
"Race"
]
}
},
...
]
But you want the user to be able to query for possible tag filters. So if the user types race, the results should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can query for a filter to use. In order to accomplish this, I had to use aggregates:
{
"aggs": {
"topics": {
"terms": {
"field": "tags",
"include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
"size": 6
}
}
},
"size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint; it does not help me.
You may think, "Just use a query before the aggregate!" But the issue is that it'll pull all tag values for all documents matching that query. Meaning, you could be displaying tags that are completely unrelated. If I queried for race before the aggregate, and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc...
How can I rewrite this aggregation to work faster? Is there a better way to write this? Do I really have to make a separate index just for the values? (sad face) This seems like it would be a common issue, but I have found no answers through documentation and googling.
You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
"settings": {
"index": {
"max_ngram_diff": 10
},
"analysis": {
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [ "lowercase" ]
}
},
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10,
"token_chars": [ "letter", "digit" ]
}
}
}
},
{ "mappings": ... } --> see below
}
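Once the index exists, you can sanity-check what the analyzer emits with the _analyze API; for "Race" it should return the lowercased 3-to-10-character n-grams rac, race, and ace:
POST tagindex/_analyze
{
  "analyzer": "my_ngrams_analyzer",
  "text": "Race"
}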
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which can take either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, and we don't want a regex either! That's why we need to use a nested list, which will treat each tag separately.
Now, nested lists are expected to contain objects so
{
"tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
"tags": [
{ "tag": "Race" },
{ "tag": "Racing" },
{ "tag": "Mountain Bike" },
{ "tag": "Horizontal" }
]
}
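If you'd rather not change your indexing code, one way to do this conversion at index time is a script processor in an ingest pipeline; a sketch (the pipeline name wrap_tags is hypothetical):
PUT _ingest/pipeline/wrap_tags
{
  "description": "Wrap each plain tag string in an object, e.g. Race -> { tag: Race }",
  "processors": [
    {
      "script": {
        "source": "def wrapped = []; for (def t : ctx.tags) { wrapped.add(['tag': t]) } ctx.tags = wrapped;"
      }
    }
  ]
}
Indexing with POST tagindex/_doc?pipeline=wrap_tags then accepts the original flat-list shape.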
After that we'll proceed with the multi field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
"index": { ... },
"analysis": { ... },
"mappings": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"tag": {
"type": "text",
"fields": {
"tokenized": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a terms aggregation wrapped in nested and filter aggregations:
GET tagindex/_search
{
"aggs": {
"topics_parent": {
"nested": {
"path": "tags"
},
"aggs": {
"topics": {
"filter": {
"term": {
"tags.tag.tokenized": "race"
}
},
"aggs": {
"topics": {
"terms": {
"field": "tags.tag.keyword",
"size": 100
}
}
}
}
}
}
},
"size": 0
}
yielding
{
...
"topics_parent" : {
...
"topics" : {
...
"topics" : {
...
"buckets" : [
{
"key" : "Race",
"doc_count" : 2
},
{
"key" : "Disgrace",
"doc_count" : 1
},
{
"key" : "Tracey Chapman",
"doc_count" : 1
}
]
}
}
}
}
Caveats
- In order for this to work, you'll have to reindex.
- N-grams will increase the storage footprint -- depending on how many tags per doc you have, it may become a concern.
- Nested fields are internally treated as "separate documents", so this affects disk space too.
P.S.: This is an interesting use case. Let me know how the implementation went!

Sort similar data by property

I have the following data:
[
{
DocumentId": "85",
"figureText": "General Seat Assembly - DBL",
"descriptionShort": "Seat Assembly - DBL",
"partNumber": "1012626-001FG05",
"itemNumeric": "5"
},
{
DocumentId": "85",
"figureText": "General Seat Assembly - DBL",
"descriptionShort": "Seat Assembly - DBL",
"partNumber": "1012626-001FG05",
"itemNumeric": "45"
}
]
I use the following query to get data:
{
"query": {
"bool": {
"must": {
"match": {
"DocumentId": "85"
}
},
"should": [
{
"match": {
"figureText": {
"boost": 5,
"query": "General Seat Assembly - DBL",
"operator": "or"
}
}
},
{
"match": {
"descriptionShort": {
"boost": 4,
"query": "Seat Assembly - DBL",
"operator": "or"
}
}
},
{
"term": {
"partNumber": {
"boost": 1,
"value": "1012626-001FG05"
}
}
}
]
}
}
}
Currently, it returns the item with "itemNumeric" = 45, and I would like to get itemNumeric = "5" (the lowest).
Is there a way to do that? I tried with "sort":[{"itemNumeric":"desc"}].
Thanks
Looking at your comment, you can resolve the issue in two ways.
Solution 1: Update your mapping so that your query works as expected:
PUT my_index/_mapping/_doc
{
"properties": {
"itemNumeric": {
"type": "text",
"fielddata": true
}
}
}
Solution 2: Check the mapping of your itemNumeric field. If your mapping was created dynamically, the field itemNumeric will be a multi-field:
"itemNumeric": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
In this case you can apply your sorting logic on the itemNumeric.keyword field:
"sort":[{"itemNumeric.keyword":"desc"}]
In Elasticsearch, whenever you have text data, it is always recommended to create two fields for it: one of type text, so that you can apply full-text queries, and one of type keyword, so that you can use it for sorting or aggregation operations.
Solution 1 is not recommended, as the ES official documentation mentions the following reason:
Fielddata is disabled on text fields by default. Set fielddata=true on [your_field_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
I'd suggest reading about multi-fields and fielddata so that you have more clarity on what's happening.

Finding which nested entry matched an elasticsearch query

Say I'm indexing elasticsearch data like so:
{"entities": {
"type": "firstName",
"value": "Barack",
},
{
"type": "lastName",
"value": "Obama"
}}
I'd like users to be able to add custom attributes, so I don't know every possible value of "type" ahead of time.
My mappings might look like:
typename:
entities:
type: nested
If I do a match query for the text "Obama", with highlighting, is there a way to get back the full nested "entity" which matched? I would like to know if my query for "Obama" matched the firstName or the lastName.
I was able to solve this with inner_hits (thanks Andrei!)
{
"query": {
"nested": {
"path": "entities",
"query": {
"match": {"entities.value": "Obama"}
},
"inner_hits": {
"highlight": {
"fields": {
"entities.value": {}
}
}
}
}
}
}
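Each returned hit then carries the matching nested documents, with their highlights, under inner_hits; the response looks roughly like:
"hits": [
  {
    "_source": { ... },
    "inner_hits": {
      "entities": {
        "hits": {
          "hits": [
            {
              "_source": { "type": "lastName", "value": "Obama" },
              "highlight": { "entities.value": [ "<em>Obama</em>" ] }
            }
          ]
        }
      }
    }
  }
]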

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here is the index I use.
{
"index": {
"analysis": {
"analyzer": {
"prefix-test-analyzer": {
"filter": "dotted",
"tokenizer": "prefix-test-tokenizer",
"type": "custom"
}
},
"filter": {
"dotted": {
"patterns": [
"([^.]+)"
],
"type": "pattern_capture"
}
},
"tokenizer": {
"prefix-test-tokenizer": {
"delimiter": ".",
"type": "path_hierarchy"
}
}
}
}
}
{
"metrics": {
"_routing": {
"required": true
},
"properties": {
"tenantId": {
"type": "string",
"index": "not_analyzed"
},
"unit": {
"type": "string",
"index": "not_analyzed"
},
"metric_name": {
"index_analyzer": "prefix-test-analyzer",
"search_analyzer": "keyword",
"type": "string"
}
}
}
}
The above index creates the following terms for a metric name foo.bar.baz
foo
bar
baz
foo.bar
foo.bar.baz
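You can verify the emitted terms with the _analyze API; a sketch using the 1.x-era syntax to match the curl style below, with the index and analyzer names defined above:
curl -XGET 'http://localhost:9200/metrics_alias/_analyze?analyzer=prefix-test-analyzer' -d 'foo.bar.baz'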
If I have a bunch of metrics, like the ones below,
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above
for level = 0, I should get [a, x]
for level = 1, with 'a' as first token I should get [b]
with 'x' as first token I should get [y]
for level = 2, with 'a.b' as first token I should get [c, m]
I couldn't think of any way other than to write a terms aggregation. To figure out the level-2 tokens of a.b, here is the query I came up with:
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
"size": 0,
"query": {
"term": {
"tenantId": "12345"
}
},
"aggs": {
"metric_name_tokens": {
"terms": {
"field" : "metric_name",
"include": "a[.]b[.][^.]*",
"execution_hint": "map",
"size": 0
}
}
}
}'
This would result in the following buckets. I parse the output and grab [c, m] from there.
"buckets" : [ {
"key" : "a.b.c",
"doc_count" : 2
}, {
"key" : "a.b.m",
"doc_count" : 1
} ]
So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants which have large amounts of data (around 1 million documents), the performance is really slow. I am guessing the terms aggregation takes time.
I am wondering if a terms aggregation is the right choice for this kind of data, and I am also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter at the aggregations level in the query part as well. So, for a.b. matching, use the following as a query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use a regexp query with the same regular expression as in the aggregation part. Either way, the aggregation will have to evaluate fewer buckets, since fewer documents reach the aggregation phase.
You mentioned that regexp is working better for you; my initial guess was that the prefix query would perform better.
change "size": 0 from aggregations to "size": 100. After testing you mentioned this doesn't make any difference
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing you mentioned that the default execution_hint was performing far worse.
the only other thing I can think of is to relieve the pressure at search time by moving it to indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programmatically (rather than having ES do it) and index each element of the hierarchy in a separate field, for the same document. For example, a.b in field2, a.b.c in field3, and so on. Then, at search time, you look at a specific field depending on the search text. This whole idea, though, requires some additional work outside ES; a sketch follows below.
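For instance, documents could be indexed with one extra not_analyzed field per hierarchy level (the field names level0, level1, level2 are hypothetical and would be filled in by your indexing code):
{
  "metric_name": "a.b.c.d",
  "level0": "a",
  "level1": "a.b",
  "level2": "a.b.c"
}
Then, to get the level-2 tokens under a.b, you filter on level1 and aggregate on level2 instead of running a regex include:
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "tenantId": "12345" } },
        { "term": { "level1": "a.b" } }
      ]
    }
  },
  "aggs": {
    "level2_tokens": {
      "terms": { "field": "level2" }
    }
  }
}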
Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.

ElasticSearch: Labelling documents with matching search term

I'm using Elasticsearch 1.7 and need a way to label documents with the part of a query_string query they match.
I've been experimenting with highlighting, but found that it gets a bit messy in some cases. I'd love to have the documents tagged with the matching search terms.
Here is the query that I'm using (note: this is a Ruby hash that later gets encoded to JSON):
{
query: {
query_string: {
fields: ["title^10", "keywords^4", "content"],
query: query_string,
use_dis_max: false
}
},
size: 20,
from: 0,
sort: [
{ pub_date: { order: :desc }},
{ _score: { order: :desc }}
]
}
The query_string variable is based on the topics a user follows and might look something like this: "(the AND walking AND dead) OR (iphone) OR (video AND games)"
Is there any option I can use so that returned documents would carry a property indicating the matching search term, like "the walking dead" or "(the AND walking AND dead)"?
If you're ready to switch to bool/should queries, you can split the match per field and use named queries; then in the results you'll get the names of the queries that matched.
It goes basically like this: in a bool/should query, you add one query_string query per field and name each query so as to identify that field (e.g. title_query for the title field, etc.):
{
"query": {
"bool": {
"should": [
{
"query_string": {
"fields": [
"title^10"
],
"query": "query_string",
"use_dis_max": false,
"_name": "title_query"
}
},
{
"query_string": {
"fields": [
"keywords^4"
],
"query": "query_string",
"use_dis_max": false,
"_name": "keywords_query"
}
},
{
"query_string": {
"fields": [
"content"
],
"query": "query_string",
"use_dis_max": false,
"_name": "content_query"
}
}
]
}
}
}
In the results, below the _source of each hit, you'll then get another array called matched_queries containing the names of the queries that matched the returned document:
"_source": {
...
},
"matched_queries": [
"title_query"
],
