Elasticsearch: indexing homogeneous objects under dynamic keys

The kind of document we want to index and query contains variable keys, which are grouped under a common root key, as follows:
{
  "articles": {
    "0000000000000000000000000000000000000001": {
      "crawled_at": "2016-05-18T19:26:47Z",
      "language": "en",
      "tags": ["a", "b", "d"]
    },
    "0000000000000000000000000000000000000002": {
      "crawled_at": "2016-05-18T19:26:47Z",
      "language": "en",
      "tags": ["b", "c", "d"]
    }
  },
  "articles_count": 2
}
We want to be able to ask: which documents contain articles with tags "b" and "d" and language "en"?
The reason we don't use a list for articles is that Elasticsearch can efficiently and automatically merge documents on partial updates. The challenge, however, is to index the objects stored under the variable keys. One approach we tried is to use dynamic_templates, as follows:
{
  "sources": {
    "dynamic": "strict",
    "dynamic_templates": [
      {
        "article_template": {
          "mapping": {
            "fields": {
              "crawled_at": {
                "format": "dateOptionalTime",
                "type": "date"
              },
              "language": {
                "index": "not_analyzed",
                "type": "string"
              },
              "tags": {
                "index": "not_analyzed",
                "type": "string"
              }
            }
          },
          "path_match": "articles.*"
        }
      }
    ],
    "properties": {
      "articles": {
        "dynamic": false,
        "type": "object"
      },
      "articles_count": {
        "type": "integer"
      }
    }
  }
}
However, this dynamic template fails: when documents are inserted, the following can be found in the logs:
[2016-05-30 17:44:45,424][WARN ][index.codec] [node]
[main] no index mapper found for field:
[articles.0000000000000000000000000000000000000001.language] returning
default postings format
The same warning appears for the two other fields. When I query for the existence of a certain article, or even of articles, no documents are returned (no error, just empty hits):
curl -LsS -XGET 'localhost:9200/main/sources/_search' -d '{"query":{"exists":{"field":"articles"}}}'
When I query for the existence of articles_count, it returns everything. Is there a minor error in what we are trying to achieve, for example in the schema: in the definition of articles as a property, or in the dynamic template? What about the types and dynamic: false? The path seems correct. Maybe it is not possible to define templates for objects under variable keys, but it should be according to the documentation.
Otherwise, what alternatives are possible, ideally without changing the document?
Notes: we have other types in the same index main that also have fields like language; I don't know whether that could have an influence. The version of ES we are using is 1.7.5 (we cannot upgrade to 2.x for now).
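For reference, one variant we have been considering but have not verified (an untested sketch, our own guess: the template's mapping uses properties instead of fields, and articles is left dynamic so that the template has a chance to fire; if dynamic: false on articles suppresses dynamic mapping entirely, that could also explain why the template never applies):

```json
{
  "sources": {
    "dynamic": "strict",
    "dynamic_templates": [
      {
        "article_template": {
          "path_match": "articles.*",
          "mapping": {
            "type": "object",
            "properties": {
              "crawled_at": { "type": "date", "format": "dateOptionalTime" },
              "language":   { "type": "string", "index": "not_analyzed" },
              "tags":       { "type": "string", "index": "not_analyzed" }
            }
          }
        }
      }
    ],
    "properties": {
      "articles": { "type": "object", "dynamic": true },
      "articles_count": { "type": "integer" }
    }
  }
}
```

Again, this is speculation on our part; we have not confirmed it against 1.7.5.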

Related

Elasticsearch searchable synthetic fields

Provided that in a source document (JSON) there exist two fields named a and b,
both of type long, I would like to construct a synthetic field (e.g. c)
by concatenating the values of those fields with an underscore, and
index it as a keyword.
That is, I am looking into a feature that could be supported with an imaginary, partial, mapping like this:
...
"a": { "type": "long" },
"b": { "type": "long" },
"c": {
  "type": "keyword",
  "expression": "${a}_${b}"
},
...
NOTE: The mapping above was made up just for the sake of the example. It is NOT valid!
So what I am looking for is whether there is a feature in Elasticsearch, or a recipe or hint, to support
this requirement. The field need not be registered in _source; it just needs to be searchable.
There are two steps to this: a dynamic mapping and an ingest pipeline.
I'm assuming your field c is non-trivial, so you may want to match that field in a dynamic template using match and assign the keyword mapping to it:
PUT synthetic
{
  "mappings": {
    "dynamic_templates": [
      {
        "c_like_field": {
          "match_mapping_type": "string",
          "match": "c*",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ],
    "properties": {
      "a": { "type": "long" },
      "b": { "type": "long" }
    }
  }
}
Then you can set up a pipeline which'll concatenate your a & b:
PUT _ingest/pipeline/combined_ab
{
  "description": "Concatenates fields a & b",
  "processors": [
    {
      "set": {
        "field": "c",
        "value": "{{_source.a}}_{{_source.b}}"
      }
    }
  ]
}
After ingesting a new doc (with the activated pipeline!)
POST synthetic/_doc?pipeline=combined_ab
{
  "a": 531351351351,
  "b": 251531313213
}
we're good to go:
GET synthetic/_search
yields
{
  "a": 531351351351,
  "b": 251531313213,
  "c": "531351351351_251531313213"
}
Verify w/ GET synthetic/_mapping too.
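As a quick local sanity check of what the set processor produces (plain Python, not part of Elasticsearch; field names mirror the example above):

```python
# Mimic the set processor's "{{_source.a}}_{{_source.b}}" template locally.
def synthesize_c(doc):
    """Return the synthetic keyword value the pipeline would produce."""
    return f"{doc['a']}_{doc['b']}"

doc = {"a": 531351351351, "b": 251531313213}
print(synthesize_c(doc))  # 531351351351_251531313213
```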

Elasticsearch: Duplicate properties in a single record

I have to find every document in Elasticsearch that has duplicate properties. My mapping looks something like this:
"type": {
"properties": {
"thisProperty": {
"properties" : {
"id":{
"type": "keyword"
},
"other_id":{
"type": "keyword"
}
}
}
The documents I have to find have a pattern like this:
"thisProperty": [
{
"other_id": "123",
"id": "456"
},
{
"other_id": "123",
"id": "456"
},
{
"other_id": "4545",
"id": "789"
}]
So, I need to find any document, by type, that has repeated property fields. Also, I cannot search by term because I do not know what the value of either id field is. So far the API hasn't shown a clear way to do this via a query, and the programmatic approach is possible but cumbersome. Is it possible to get this result set in an Elasticsearch query? If so, how?
(The version of Elasticsearch is 5.3)
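For completeness, the cumbersome programmatic approach mentioned above can be sketched as follows (plain Python; how the documents are fetched, e.g. via the scroll API, is left out):

```python
def has_duplicate_entries(this_property):
    """True if the list contains the same (id, other_id) pair more than once."""
    seen = set()
    for entry in this_property:
        pair = (entry["id"], entry["other_id"])
        if pair in seen:
            return True
        seen.add(pair)
    return False

# The example document from the question: the first two entries repeat.
doc = [
    {"other_id": "123", "id": "456"},
    {"other_id": "123", "id": "456"},
    {"other_id": "4545", "id": "789"},
]
print(has_duplicate_entries(doc))  # True
```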

ElasticSearch [term] query doesn't support multiple fields

I have the following data structure in ElasticSearch:
"assets": {
"type": "nested",
"properties": {
"assetId": {
"type": "keyword"
},
"assetSource": {
"type": "keyword"
},
}
Say I want to exclude results where assetSource has the value 'Web'.
I used Term(field='assets.assetSource', query='web') in query.exclude, but since assets is a multi-field, it complains that the [term] query doesn't support multiple fields.
How do I work around this problem?
Stupid me, I should have used the Term filter like this:
Term(**{'assets.assetSource':'vault'})
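For reference, the raw query body that such an exclusion corresponds to can be built as a plain dict (a sketch; since assets is mapped as nested, the term clause is wrapped in a nested query here, which may or may not be needed depending on your setup):

```python
# Exclude documents whose nested assets have assetSource == "Web".
# Field names are taken from the mapping in the question.
query = {
    "query": {
        "bool": {
            "must_not": [
                {
                    "nested": {
                        "path": "assets",
                        "query": {
                            "term": {"assets.assetSource": "Web"}
                        }
                    }
                }
            ]
        }
    }
}
```

The dotted field name lives inside the term object, which is why the keyword-argument unpacking trick above works in the DSL.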

Why do Elasticsearch dynamic templates create explicit fields in the mapping?

The document that I want to index is as follows
{
  "Under Armour": 0.16667,
  "Skechers": 0.14774,
  "Nike": 0.24404,
  "New Balance": 0.11905,
  "SONOMA Goods for Life": 0.11236
}
Fields under this node are dynamic, which means that as documents are added, various new fields (brands) will arrive with those documents.
If I create an index without specifying a mapping, ES eventually says "maximum number of fields (1000) have been reached". Though we can increase this limit, doing so is not good practice.
In order to support the above document, I created an index with the following mapping.
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "template1": {
            "match_mapping_type": "double",
            "match": "*",
            "mapping": {
              "type": "float"
            }
          }
        }
      ]
    }
  }
}
When I add the above document to the index and check the mapping again, it looks like this:
{
  "my_index": {
    "mappings": {
      "my_type": {
        "dynamic_templates": [
          {
            "template1": {
              "match": "*",
              "match_mapping_type": "double",
              "mapping": {
                "type": "float"
              }
            }
          }
        ],
        "properties": {
          "New Balance": { "type": "float" },
          "Nike": { "type": "float" },
          "SONOMA Goods for Life": { "type": "float" },
          "Skechers": { "type": "float" },
          "Under Armour": { "type": "float" }
        }
      }
    }
  }
}
As you can see, the mapping I created earlier and the mapping after I added a document to the index are different: the new fields were added statically to the mapping. As I keep adding more documents, new fields keep being added to the mapping (which will eventually end with "maximum number of fields (1000) has been reached").
My questions are:
Is the mapping I defined above correct for the document shown?
If it is correct, why are new fields added to the mapping?
According to the posts I read, increasing the number of fields in an index is not good practice, as it may increase resource usage.
In this case, there is an enormous number of brands, and new brands keep being introduced.
The proper solution for such a case is to introduce key-value pairs. (Probably I need to do a transformation during ETL.)
{
  "brands": [
    {
      "key": "Under Armour",
      "value": 0.16667
    },
    {
      "key": "Skechers",
      "value": 0.14774
    },
    {
      "key": "Nike",
      "value": 0.24404
    }
  ]
}
When the data is formatted as above, the mapping won't change.
A good read that I found was
https://www.elastic.co/blog/found-beginner-troubleshooting#keyvalue-woes
Thanks @Val for the suggestion.
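The ETL transformation mentioned above can be sketched in a few lines (plain Python; brand names taken from the example document):

```python
def to_key_value_pairs(brand_scores):
    """Turn a flat {"Nike": 0.24404, ...} map into the stable key/value list shape."""
    return {"brands": [{"key": k, "value": v} for k, v in brand_scores.items()]}

doc = {"Under Armour": 0.16667, "Skechers": 0.14774, "Nike": 0.24404}
print(to_key_value_pairs(doc))
```

With this shape the mapping stays fixed (brands.key and brands.value), no matter how many brands appear.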

Date_histogram Elasticsearch facet can't find field

I am using the date_histogram facet to find results based on a Epoch timestamp. The results are displayed on a histogram, with the date on the x-axis and count of events on the y-axis. Here is the code that I have that doesn't work:
angular.module('controllers', [])
  .controller('FacetsController', function($scope, $http) {
    var payload = {
      query: {
        match: {
          run_id: '9'
        }
      },
      facets: {
        date: {
          date_histogram: {
            field: 'event_timestamp',
            factor: '1000',
            interval: 'second'
          }
        }
      }
    }
It works if I am using
field: '#timestamp'
which is in ISO8601 format; however, I need it to now work with Epoch timestamps.
Here is an example of what's in my Elasticsearch, maybe this can lead to some answers:
{
  "#version": "1",
  "#timestamp": "2014-07-04T13:13:35.372Z",
  "type": "automatic",
  "installer_version": "0.3.0",
  "log_type": "access.log",
  "user_id": "1",
  "event_timestamp": "1404479613",
  "run_id": "9"
}
When I run this, I receive this error:
POST 400 (Bad Request)
Any ideas as to what could be wrong here? I don't understand why I'd get such a difference between using the two fields, as the only difference is the format. I researched as best I could and discovered I should be using 'factor', but that didn't seem to solve my problem. I am probably making a silly beginner mistake!
You need to set up the mapping initially. Elasticsearch is good at defaults, but it is not possible for it to determine whether a provided value is a timestamp, an integer or a string, so it's your job to tell Elasticsearch.
Let me explain by example. Let's say the following document is what you are trying to index:
{
  "#version": "1",
  "#timestamp": "2014-07-04T13:13:35.372Z",
  "type": "automatic",
  "installer_version": "0.3.0",
  "log_type": "access.log",
  "user_id": "1",
  "event_timestamp": "1404474613",
  "run_id": "9"
}
So initially you don't have an index and you index your document by making an HTTP request like so:
POST /test/date_experiments
{
  "#version": "1",
  "#timestamp": "2014-07-04T13:13:35.372Z",
  "type": "automatic",
  "installer_version": "0.3.0",
  "log_type": "access.log",
  "user_id": "1",
  "event_timestamp": "1404474613",
  "run_id": "9"
}
This creates a new index called test and a new doc type in index test called date_experiments.
You can check the mapping of this doc type date_experiments by doing so:
GET /test/date_experiments/_mapping
And what you get back is the mapping that Elasticsearch auto-generated:
{
  "test": {
    "date_experiments": {
      "properties": {
        "#timestamp": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "#version": { "type": "string" },
        "event_timestamp": { "type": "string" },
        "installer_version": { "type": "string" },
        "log_type": { "type": "string" },
        "run_id": { "type": "string" },
        "type": { "type": "string" },
        "user_id": { "type": "string" }
      }
    }
  }
}
Notice that the type of the event_timestamp field is set to string, which is why your date_histogram is not working. Also notice that the type of the #timestamp field is already date, because you pushed the date in a standard format, which made it easy for Elasticsearch to recognize that your intention was to push a date in that field.
Drop this mapping by sending a DELETE request to /test/date_experiments and let's start from the beginning.
This time, instead of pushing the document first, we will create the mapping according to our requirements, so that our event_timestamp field is treated as a date.
Make the following HTTP request:
PUT /test/date_experiments/_mapping
{
  "date_experiments": {
    "properties": {
      "#timestamp": { "type": "date" },
      "#version": { "type": "string" },
      "event_timestamp": { "type": "date" },
      "installer_version": { "type": "string" },
      "log_type": { "type": "string" },
      "run_id": { "type": "string" },
      "type": { "type": "string" },
      "user_id": { "type": "string" }
    }
  }
}
Notice that I have changed the type of the event_timestamp field to date. I have not specified a format, because Elasticsearch is good at understanding a few standard formats, as in the case of the #timestamp field where you pushed a date. Here, Elasticsearch will understand that you are pushing a UNIX timestamp, convert it internally to treat it as a date, and allow all date operations on it. You can specify a date format in the mapping in case the dates you are pushing are not in a standard format.
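If you would rather normalize the value at ingestion time instead, converting the epoch seconds to an ISO 8601 string (the format that worked out of the box for the #timestamp field) is straightforward (a plain Python sketch, outside Elasticsearch):

```python
from datetime import datetime, timezone

def epoch_to_iso(epoch_seconds):
    """Convert a UNIX timestamp in seconds to an ISO 8601 UTC string."""
    dt = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(epoch_to_iso("1404474613"))  # 2014-07-04T11:50:13Z
```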
Now you can start indexing your documents and start running your date queries and facets the same way as you were earlier.
You should read more about mapping and date format.
