Searching and sorting on dynamic fields in ElasticSearch - sorting

Have created an object of colors, which contains the hex code of colors along with it's percentage,
/colors/mapping returns
{
"colors": {
"mappings": {
"colors_array": {
"dynamic": "strict",
"properties": {
"colors": {
"dynamic": "true",
"properties": {
"#2f4f4f": {
"type": "long"
},
"#808080": {
"type": "long"
},
"#bc8f8f": {
"type": "long"
},
"#dc143c": {
"type": "long"
},
"#e6e6fa": {
"type": "long"
}
}
}
}
}
}
}
}
Note: Color hex codes are created as dynamic mapping objects
How exactly should I formulate the query such that I pass specific colors and the results are sorted on the basis of aggregated matching percentage i.e. sum the percentage of the colors that match and sort the results on the basis of this figure?

Related

How to detect whether elasticsearch has enabled dynamic field

I don't know whether my index has enabled/disabled dynamic field. When I use get index mapping command it just responses these informations:
GET /my_index1/_mapping
{
"my_index1": {
"mappings": {
"properties": {
"goodsName": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"type": "text"
},
"auditTime": {
"type": "long"
},
"createUserId": {
"type": "long"
}
}
}
}
}
If you don't explicitly set the dynamic to false or strict, it will be true by default. If you explicitly set that, you will see that in your mappings:
{
"mappings": {
"dynamic": false,
"properties": {
"name": {
"type": "text"
}
}
}
}
And when you index the following document:
{"name":"products", "clickCount":1, "bookingCount":2, "isPromoted":1}
Only the field name will be indexed, the rest won't. If you call the _mapping endpoint again, it will give you the exact mappings above.

How to avoid index explosion in ElasticSearch

I have two docs from the same index that originally look like this (only _source value is shown here)
{
"id" : "3",
"name": "Foo",
"property":{
"schemaId":"guid_of_the_RGB_schema_defined_extenally",
"value":{
"R":255,
"G":100,
"B":20
}
}
}
{
"id" : "2",
"name": "Bar",
"property":{
"schemaId":"guid_of_the_HSL_schema_defined_extenally",
"value":{
"H":255,
"S":100,
"L":20
}
}
}
The schema(used for validation of value) is stored outside of ES since it has nothing to do with the indexing.
If I don't define mapping, the value field will be consider Object mapping. And its subfield will grow once there is a new subfield.
Currently, ElasticSearch supports Flattened mapping https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html to prevent this explosion in the index. However it has a limited support for searching for inner field due to its restriction: As with queries, there is no special support for numerics — all values in the JSON object are treated as keywords. When sorting, this implies that values are compared lexicographically.
I need to be able to query the index to find the document match a given doc (e.g. B in the range [10,30])
So far I come up with a solution that structure my doc like this
{
"id":4,
"name":"Boo",
"property":
{
"guid_of_the_normalized_RGB_schema_defined_extenally":
{
"R":0.1,
"G":0.2,
"B":0.5
}
}
Although it does not solve my issue of the explosion in mapping, it mitigates some other issue.
My mapping now will look similar like this for the field property
"property": {
"properties": {
"guid_of_the_RGB_schema_defined_extenally": {
"properties": {
"B": {
"type": "long"
},
"G": {
"type": "long"
},
"R": {
"type": "long"
}
}
},
"guid_of_the_normalized_RGB_schema_defined_extenally": {
"properties": {
"B": {
"type": "float"
},
"G": {
"type": "float"
},
"R": {
"type": "float"
}
},
"guid_of_the_HSL_schema_defined_extenally": {
"properties": {
"B": {
"type": "float"
},
"G": {
"type": "float"
},
"R": {
"type": "float"
}
}
}
}
}
This solve the issue with the case where the field have the same name but different data type.
Can someone suggest me a solution that could solve the explosion of indices with out suffering from the limit that the Flattened has in searching?
To avoid mapping explosion, the best solution is to normalize your data better.
You can set "dynamic": "strict", in your mapping, then a doc will be rejected if it contains a field which is not already in the mapping.
After that, you can still add new fields but you will have to add them explicitly in the mapping before.
You can add a pipeline to clean up and normalize your data before ingestion.
If you don't want, or cannot reindex:
To make your query easy even if you can not know the "middle" part of your key, you can use a multimatch with a star.
GET myindex/_search
{
"query": {
"multi_match": {
"query": 0.5,
"fields": ["property.*.B"]
}
}
}
But you will still not be able to sort it as you want.
For ordering on multiple 'unknown' field names without touching the data, you can use a script: https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html
But maybe you could simplify the whole process by adding a dynamic template to your index.
PUT test/_mapping
{
"dynamic_templates": [
{
"unified_red": {
"path_match": "property.*.R",
"mapping": {
"type": "float",
"copy_to": "unified_color.R"
}
}
},
{
"unified_green": {
"path_match": "property.*.G",
"mapping": {
"type": "float",
"copy_to": "unified_color.G"
}
}
},
{
"unified_blue": {
"path_match": "property.*.B",
"mapping": {
"type": "float",
"copy_to": "unified_color.B"
}
}
}
],
"properties": {
"unified_color": {
"properties": {
"R": {
"type": "float"
},
"G": {
"type": "float"
},
"B": {
"type": "float"
}
}
}
}
}
Then you'll be able to query any value with the same query :
GET test/_search
{
"query": {
"range": {
"unified_color.B": {
"gte": 0.1,
"lte": 0.6
}
}
}
}
For already existing fields, you'll have to add the copy_to by yourself on the mapping, and after that run an _update_by_query to populate them.

Return only top level fields in elasticsearch query?

I have a document that has nested fields. Example:
"mappings": {
"blogpost": {
"properties": {
"title": { "type": "text" },
"body": { "type": "text" },
"comments": {
"type": "nested",
"properties": {
"name": { "type": "text" },
"comment": { "type": "text" },
"age": { "type": "short" },
"stars": { "type": "short" },
"date": { "type": "date" }
}
}
}
}
}
}
Can the query be modified so that the response only contains non-nested fields?
In this example, the response would only contain body and title.
Using _source you can exclude/include fields
GET /blogpost/_search
{
"_source":{
"excludes":["comments"]
}
}
But you have to explicitly put the field names inside exclude, I'm searching for a way to exclude all nested fields without knowing their field name
You can achieve that but in a static way, which means you entered the field(s) name using excludes keyword, like:
GET your_index/_search
{
"_source": {
"excludes": "comments"
},
"query": {
"match_all" : {}
}
}
excludes can take an array of strings; not just one string.

Elasticsearch: Schema without mapping?

According to Elasticsearch's roadmap, mapping types are going to be completely removed at 7.x
How are we going to give a schema structure to Documents without mapping?
For example how would we replace this (A Doc/mapping_type with 3 fields of specific data type):
PUT twitter
{
"mappings": {
"user": {
"properties": {
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" }
}
}
}
They are going to remove types (user in you example) from mapping, because there is only 1 type per index now, the rest will be the same:
PUT twitter
{
"mappings": {
"_doc": {
"properties": {
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" }
}
}
}
}
As you can see, there is no user type anymore.

Elasticsearch queries slow performance

We have setup elasticsearch cluster with 7 nodes. Each node having configuration like 16G RAM, 8 Core cpu, centos 6.
Elasticsearch Version : 1.3.0
Heap Memory is - 9000m
1 Master (Non data)
1 Capable master (Non data)
5 Data node
Having 10 indices, In which one index having 55 million documents [ 254Gi (508Gi with replica) ] size rest all indices having approx 20k documents.
Every 1 seconds there are 5-10 new documents are indexing.
But problem is search is bit slow. Almost taking average of 2000 ms to 5000 ms. Some queries are in 1 secs.
Mapping:
{
"my_index": {
"mappings": {
"product": {
"_id": {
"path": "product_refer_id"
},
"properties": {
"product_refer_id": {
"type": "string"
},
"body": {
"type": "string"
},
"cat": {
"type": "string"
},
"cat_score": {
"type": "float"
},
"compliant": {
"type": "string"
},
"created": {
"type": "integer"
},
"facets": {
"properties": {
"ItemsPerCategoryCount": {
"properties": {
"terms": {
"properties": {
"field": {
"type": "string"
},
"size": {
"type": "long"
}
}
}
}
}
}
},
"fields": {
"type": "string"
},
"from": {
"type": "string"
}
"id": {
"type": "string"
},
"image": {
"type": "string"
},
"lang": {
"type": "string"
},
"main_cat": {
"properties": {
"Technology": {
"type": "double"
}
}
},
"md5_product": {
"type": "string"
},
"post_created": {
"type": "long"
},
"query": {
"properties": {
"bool": {
"properties": {
"must": {
"properties": {
"query_string": {
"properties": {
"default_field": {
"type": "string"
},
"query": {
"type": "string"
}
}
},
"range": {
"properties": {
"main_cat.Technology": {
"properties": {
"gte": {
"type": "string"
}
}
},
"sub_cat.Technology.computers": {
"properties": {
"gte": {
"type": "string"
}
}
}
}
},
"term": {
"properties": {
"product.secondary_cat": {
"type": "string"
}
}
}
}
}
}
},
"match_all": {
"type": "object"
}
}
},
"secondary_cat": {
"type": "string"
},
"secondary_cat_score": {
"type": "float"
},
"size": {
"type": "long"
},
"sort": {
"properties": {
"_uid": {
"type": "string"
}
}
},
"sub_cat": {
"properties": {
"Technology": {
"properties": {
"audio": {
"type": "double"
},
"computers": {
"type": "double"
},
"gadgets": {
"type": "double"
},
"geekchic": {
"type": "double"
}
}
}
}
},
"title": {
"type": "string"
},
"product": {
"type": "string"
}
}
}
}
}
}
We are using Default Analyzer.
Any Suggestion? Does this configuration is not enough?
Looks like the indices can not fit into memory, so there will be some more disk I/O going on. Do you use SSDs? If not you should get some.
Besides this your nodes need more resources (memory, CPU) to handle that index size.
I am a little surprised about the sizes here: ~250 GB for "just" 55 million documents is huge and I don't see you are storing any bigger blobs there (I might be mistaken, its hard to see just from the mapping definition). Maybe you can consider to keep some data not analyzed in case you don't need to query it, but just retrieve it. That would reduce the index size.
Except this I have no other ideas, without knowing all the relevant infrastructure in more detail.
To add to Torsten Engelbrecht's answer, default analyzer might be part of the culprit. This analyzer will index every form of each word as a separate token, meaning that a single verb in a language with complex conjugation can be indexed a dozen times. Also, that degrades the quality of the search results. The same applies if your documents contain formatting information (HTML markup ?).
More, stop words are disabled by default, meaning that each "the", "a"... in english for instance will be indexed as well.
You should consider using localized analyzers (snowball analyzer maybe ?) and stop words for the language used in your documents in order to limit the inverted index size and this way, increase performance.
Also, consider making not_analyzed fields as md5, urls, ids, and other sorts of unsearchable fields.

Resources