ElasticSearch - Unique Tags for multiple documents (indexing)

We would like a unique tag with multiple values in Elasticsearch. To be clearer: we need to build a time-series graph, so we fetch values between two dates. But of course we have different kinds of data, and that is where our tags come in. We want to search our tags with autocompletion, then choose our values by date.
{tag :["sdfsf", "fddsfsd", "fsdfsf"]
{
values : 145.45
date : "2004-10-23"
},
{
values : 556.09
date : "2010-02-13"
}
}
After a bit of research we found the parent/child technique, but because we want to do a completion on the tag (in the parent), we need an aggregation, which is impossible in ES with "has_parent".
Our solution is to do:
[
    {
        "tag" : ["sdfsf", "fddsfsd", "fsdfsf"],
        "values" : 145.45,
        "date" : "2004-10-23"
    },
    {
        "tag" : null,
        "values" : 556.09,
        "date" : "2010-02-13"
    },
    { etc... }
]
So we only have one tag, which is easy to check with completion. But it's kind of "ugly".
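For reference, the mapping for this workaround might look roughly like this, with a completion field for the autocompletion on tag (a minimal sketch; the index name timeseries and the current typeless mapping syntax are our additions, not from the original post):

PUT timeseries
{
    "mappings" : {
        "properties" : {
            "tag" : { "type" : "completion" },
            "values" : { "type" : "double" },
            "date" : { "type" : "date" }
        }
    }
}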
Does anybody have a correct way to do what we want to do?
Thanks in advance

Related

comparing data between different mappings

I am relatively new to Elasticsearch, so I apologize if the terms are not accurate. I have a few indexes, and a few almost identical indexes but with fewer fields in the mapping.
(The original indexes have data and the new ones with fewer fields are empty.)
How can I compare the data and insert the relevant documents into the new indexes with fewer fields?
For example, the original index mapping:
{
    "first_name" : "Dana",
    "last_name" : "Leon",
    "birth_date" : "1990-01-09",
    "social_media" : {
        "facebook_id" : "K8426dN",
        "google_id" : "8764873",
        "linkedin_id" : "Gdna"
    }
}
The new mapping with fewer fields:
{
    "first_name" : "Dana",
    "last_name" : "Leon",
    "social_media" : {
        "facebook_id" : "K8426dN",
        "google_id" : "8764873",
        "linkedin_id" : "Gdna"
    }
}
Thanks
You can use reindex with a script:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#docs-reindex-change-name
In the "script" you'll need to specify the fields, that you want to remove like:
ctx._source.remove("birth_date")"
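A complete request might look like this (a sketch; the index names people and people_slim are placeholders):

POST _reindex
{
    "source" : { "index" : "people" },
    "dest" : { "index" : "people_slim" },
    "script" : {
        "lang" : "painless",
        "source" : "ctx._source.remove('birth_date')"
    }
}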
The second option is to use an ingest pipeline with the "remove" processor:
https://www.elastic.co/guide/en/elasticsearch/reference/current/remove-processor.html, and to reindex with a default pipeline defined in the destination index's settings, but this will be harder to implement.
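A sketch of that second approach, referencing the pipeline directly in the reindex request instead of through the index settings (the pipeline name drop_birth_date is a placeholder):

PUT _ingest/pipeline/drop_birth_date
{
    "processors" : [
        { "remove" : { "field" : "birth_date", "ignore_missing" : true } }
    ]
}

POST _reindex
{
    "source" : { "index" : "people" },
    "dest" : { "index" : "people_slim", "pipeline" : "drop_birth_date" }
}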

Conditional sum metric (sub-total column) in Kibana data table

I need to display subtotal columns in a Kibana data table: not by filtering the entire table, but only certain columns.
I've seen posts about doing conditional counts in a metric's JSON input field:
{
    "script" : {
        "inline" : "doc['SomeField'].value == 'SomeValue' ? 1 : 0",
        "lang" : "painless"
    }
}
But I have found no reference to conditional sums of numeric data. My loosely expressed need:
sum(bytes) where category = 'write'
Alternatively, the Kibana Enhanced Table plugin was suggested as a way to implement computed columns.
Is it possible to achieve conditional sums using JSON input on a specific data table metric? Is anyone using the plugin? Should it be done upstream in an Elasticsearch index? What is best practice?
The solution is a simple change: show the actual value in the true condition, rather than a 1 for counting:
{
    "script" : "doc['category.keyword'].value == 'write' ? doc['bytes'].value : 0"
}
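In the same shape as the count example above, the full JSON input for the metric would then presumably be (assuming the same field names):

{
    "script" : {
        "inline" : "doc['category.keyword'].value == 'write' ? doc['bytes'].value : 0",
        "lang" : "painless"
    }
}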

Kibana scripted field which loops through an array

I am trying to use the Metricbeat http module to monitor F5 pools.
I make a request to the F5 API and bring back JSON, which is saved to Kibana. But the JSON contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array. E.g.
doc['http.f5pools.items.monitor'].value.length()
returns the following in the preview (shown with the same 'Additional Field' added for comparison):
[
    {
        "_id": "rT7wdGsBXQSGm_pQoH6Y",
        "http": {
            "f5pools": {
                "items": [
                    {
                        "monitor": "default"
                    },
                    {
                        "monitor": "default"
                    }
                ]
            }
        },
        "pool.MemberCount": [
            7
        ]
    },
If I try
doc['http.f5pools.items']
or similar, I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e. is my code or the way I'm indexing the data wrong?
If not, is there an alternative approach within Metricbeat? I don't want to have to make a whole new API to do the calculation and add a separate field.
-- Update
Weirdly, it seems that the number values in the array do return the expected results, i.e.
doc['http.f5pools.items.ratio']
returns
{
    "_id": "BT6WdWsBXQSGm_pQBbCa",
    "pool.MemberCount": [
        1,
        1
    ]
},
-- Update 2
OK, so if the strings in the field have different values then you get all the values; if they are the same you just get one. WTF?
I'm adding another answer instead of deleting my previous one, which doesn't address the actual question but may still be helpful for someone else in the future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Value Intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting; my hypothesis is that while sorting you essentially don't want the same values repeated, and hence the data structure they use removes those duplicates. That still did not answer why it works differently for strings than for numbers: numbers are preserved but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
Some more deep-diving into doc values revealed it to be a compression technique which actually de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE from the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior, you can disable doc values.
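For example, doc values can be switched off per field in the mapping (a sketch; the index and field names are assumed, using the same typed mapping syntax as the t5 example further down):

PUT my-index
{
    "mappings" : {
        "doc" : {
            "properties" : {
                "monitor" : {
                    "type" : "keyword",
                    "doc_values" : false
                }
            }
        }
    }
}

The field then can no longer be used for sorting, aggregations, or doc[...] access in scripts.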
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So, as I discovered, arrays are pre-filtered to only return distinct values (except in the case of ints, apparently?).
The solution is to use params._source instead of doc[].
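For illustration, a minimal Painless sketch of such a scripted field, counting the pool members whose monitor field has a given value (the value 'default' is taken from the sample document above; the null checks guard against missing _source fields):

int count = 0;
if (params._source.http != null
        && params._source.http.f5pools != null
        && params._source.http.f5pools.items != null) {
    for (def item : params._source.http.f5pools.items) {
        if (item.monitor == 'default') {
            count++;
        }
    }
}
return count;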
The answer for why doc doesn't work.
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo-points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, it's important to add a null check, as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
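A guarded doc-values access could then look like this (a sketch, reusing the field from the example above):

if (doc.containsKey('http.f5pools.items.monitor')
        && doc['http.f5pools.items.monitor'].size() > 0) {
    return doc['http.f5pools.items.monitor'].value;
}
return null;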
Also, here is why _source works.
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
Responding to your comment with an example:
The keyword here is: it cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running the below and see the mapping it creates:
PUT t5/doc/2
{
    "items" : [
        {
            "monitor" : "default"
        },
        {
            "monitor" : "default"
        }
    ]
}
GET t5/_mapping
{
    "t5" : {
        "mappings" : {
            "doc" : {
                "properties" : {
                    "items" : {
                        "properties" : {
                            "monitor" : { <-- monitor is a property of the items property (an object)
                                "type" : "text",
                                "fields" : {
                                    "keyword" : {
                                        "type" : "keyword",
                                        "ignore_above" : 256
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

Nested count queries

I'm looking to add a feature to an existing query. Basically, I run a query that returns, say, 1000 documents. Those documents all have the same structure; only the values of certain fields vary. What I'd like is to not only get the full list as a result, but also count how many results have field X with value Y, how many have the same field X with value Z, etc.
Basically, get all the results plus 4 or 5 "counts" that would act like the SQL "GROUP BY", in a way.
The point of this is to allow full-text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects, etc.
Any way to do this without running additional/separate queries?
EDIT WITH ANSWER:
Aggregations are the way to go. Here's how I did it; it's so straightforward that I expected much harder work!
{
    "query" : {
        "term" : {
            "_type" : "client"
        }
    },
    "aggregations" : {
        "agg1" : {
            "terms" : {
                "field" : "listType.typeRef.keyword"
            }
        }
    }
}
Note that the field is even in a list of terms and not a single field; that's just how easy it was!
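For context, the response then carries the normal hits plus bucket counts roughly of this shape (the keys and counts below are purely illustrative, not real results):

"aggregations" : {
    "agg1" : {
        "buckets" : [
            { "key" : "activeClient", "doc_count" : 420 },
            { "key" : "pastClient", "doc_count" : 310 },
            { "key" : "activeProspect", "doc_count" : 87 }
        ]
    }
}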
I believe what you are looking for is the aggregation query.
The documentation should be clear enough, but if you struggle please give us your ES query and we will help you from there.

Finding duplicate documents

I have some documents whose IDs are randomly generated. The issue here is that I need to find the duplicates amongst these documents. I have three fields which, taken together, should not be identical for any two documents. So how do I check for duplicates based on multiple fields?
Sample documents
document 1 = {
    "process" : "business",
    "processId" : 5433321,
    "country" : "US"
}
document 2 = {
    "process" : "operations",
    "processId" : 334233,
    "country" : "UK"
}
document 3 = {
    "process" : "business",
    "processId" : 5433321,
    "country" : "US"
}
Here, as you can see, document 1 and document 3 are the same, but they have different IDs in my database, so they exist as separate documents. So at runtime I need to find the above as duplicates and, if possible, keep only one.
The best option here would be to model your documents around the document ID: for each unique document, create a doc ID which is a hash of the content of the document. This makes sure that only one unique document exists across the index. Next, use the _create API to index documents; it will fail any request that would overwrite an existing document with the same document ID.
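A hedged sketch of the idea using the sample documents above (the index name processes is a placeholder, and the ID shown stands in for a real hash, e.g. a SHA-1 of the three concatenated field values):

PUT processes/_create/hash-of-business-5433321-US
{
    "process" : "business",
    "processId" : 5433321,
    "country" : "US"
}

Indexing document 3 would produce the same hash, so the second _create call fails with a version conflict and only one copy is kept.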
You can read further about other duplication issues and their solutions here.
