How can I show a table with the sum of value x of all children within Kibana - elasticsearch

I have an Elasticsearch database with documents stored the following way (a comma separates the documents):
{
"path": "path/to/data",
"kind": "type1"
},
{
"path": "path/to/data/values1",
"kind": "type2",
"x": 2
},
{
"path": "path/to/data/values2",
"kind": "type2",
"x": 2
},
{
"path": "path/to/data/datasub",
"kind": "type1"
},
{
"path": "path/to/data/datasub/values1",
"kind": "type2",
"x": 1
}
Now I want to create a table view/chart showing each path with the sum of x of all its children.
So I expect the total of path/to/data to be 5 and the total of path/to/data/datasub to be 1.
To consider: the depth of this structure could theoretically be unlimited.
I'm running Elasticsearch 7 and Kibana 7. I want to use the table visualisation to start with, but I would like to be able to use this kind of aggregation throughout multiple visualisations. I have Googled a lot and found all kinds of Elasticsearch queries, but nothing on how to achieve this in Kibana.
All help is much appreciated.

For those who run into the same question:
The solution I ended up using is to split the path into tokens prior to importing it into Elasticsearch. So consider a document having a path like "/this/is/a/path". This becomes the following array in the document:
[
"/this",
"/this/is",
"/this/is/a",
"/this/is/a/path"
]
You can then use a terms aggregation on it with various metrics to calculate your desired measurements.

Related

How do I SUM the average from different sources in elasticsearch?

Good morning.
First of all, I want to say that I am new to Elastic, so maybe this question is too easy, but I don't know how to do it.
I have 3 sources (3 Kafka brokers) generating the same metric:
$curl http://brokerX:7771/jolokia/read/kafka.server:type=Produce,user=*/byte-rate
{
"request": {
"mbean": "kafka.server:type=Produce,user=*",
"attribute": "byte-rate",
"type": "read"
},
"value": {
"kafka.server:type=Produce,user=USWNBIB01": {
"byte-rate": 55956.404059932334
},
"kafka.server:type=Produce,user=ngbi": {
"byte-rate": 19778.793941126038
},
"kafka.server:type=Produce,user=admin": {
"byte-rate": 2338235.8307990045
}
},
"timestamp": 1654588517,
"status": 200
}
These metrics are ingested into Elastic via Jolokia.
Example of a record (I have only included some fields):
Field Value
_id giMtPYEB2QGR_VpVmCz4
_index idx-ls-confluent-metrics-ro-2022.06.07-000424
...
agent.type metricbeat
...
event.module jolokia
host.name broker1
index_template confluent-metrics
...
jolokia.jolokia_metrics.mbean kafka.server:type=Produce,user=sena
jolokia.jolokia_metrics.UserByteRate 885,160.3
logstash.pipeline bi-confluent
metricset.name jmx
...
I need a dashboard (stacked vertical bar) where I have the sum of the averages across all the brokers.
When I create the dashboard, if I put average(jolokia.jolokia_metrics.UserByteRate) on the vertical axis, I get the average across all nodes (but not the sum of the averages); if I put sum(jolokia.jolokia_metrics.UserByteRate), I get a higher value than I should:
For example, the actual value should be the sum of:
"byte-rate": 2935617.4496644298
"byte-rate": 3328181.9137749737
"byte-rate": 2874583.589457018
That is almost 9 MB, not 23 MB.
I think the problem is that I need sum(average(jolokia.jolokia_metrics.UserByteRate)), but this formula is not accepted by Elastic:
The Formula sum(average(jolokia.jolokia_metrics.UserByteRate)) cannot be parsed.
If I use the formula average(jolokia.jolokia_metrics.UserByteRate), the average across the whole set of brokers appears, but I want the sum of that.
I do not know if I have been able to explain myself well.

Elasticsearch - query based on event frequency

I have multiple indexes storing user tracking logs, one of which is index-pageview. How can I query the list of users who viewed the page 10 times between 2021-12-11 and 2021-12-13 using the iOS operating system?
Log example:
index: index-pageview
[
{
"user_id": 1,
"session_id": "xxx",
"timestamp": "2021-12-11 hh:mm:ss",
"platform": "IOS"
},
{
"user_id": 1,
"session_id": "yyy",
"timestamp": "2021-12-13 hh:mm:ss",
"platform": "Android"
}
]
You can try building a normal bool query on timestamp and platform and then either a terms aggregation (possibly with min_doc_count: 10) or a collapse on user_id; see the sketch below. Both ways have some limitations though:
aggregation might be slower (needs benchmarking)
the number of aggregation buckets is limited (10k by default)
collapse will work on at most size docs at a time (also capped at 10k), so you might need scrolling and app-side processing
Either way, performance might be pretty poor. If you need to run queries like this very often, I would consider using another storage (SQL? Something more fancy?).

Joining logstash with parent record

I'm using logstash to analyze my web server access logs. At this time, it works pretty well. I used a configuration file that produces this kind of data:
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "end"
},
"object_id": "boreal:12345"
...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
"type": "publication",
"id": "boreal:12345",
"sm_title": "The title of the publication",
"sm_type": "thesis",
"sm_creator": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
"sm_departement": [
"UCL/CORE - Center for Operations Research and Econometrics",
],
"sm_date": "2001",
"ss_state": "A"
...
}
And I would like to create a query like "give me all accesses for 'Smith, John' publications".
As all my data is not in the same index, I can't use a parent-child relation (am I right?).
I read this on a forum, but it's an old post:
By limiting itself to parent/child type relationships elasticsearch makes life
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.
Using logstash, I can't place all data in a single index named logstash. Per month I have more than 1M accesses, so in 1 year I will have more than 15M records in one index, and I need to store the web access data for a minimum of 5 years (15M * 5 = 75M).
I don't think it's a good idea to deal with a single index containing more than 75M records (if I'm wrong, please let me know).
Does a solution to my problem exist? I haven't found any elegant one.
The only one I have at this time, in my Python script, is: a first query to collect all ids of 'Smith, John' publications, then a loop over each publication to get all web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES, and the response time is not acceptable (more than 7 seconds; not so bad given the number of records in ES, but not acceptable for the final user).
Thanks for your help; sorry for my English.
Renaud
An idea would be to use the elasticsearch Logstash filter in order to look up a given publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field from the publications document having the same object_id and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* indices.
elasticsearch {
  hosts => ["localhost:9200"]
  index => "publications"
  query => "id:%{object_id}"
  # copy sm_creator from the matched publication into the new "author" field
  fields => { "sm_creator" => "author" }
}
As a result of this, your access log document will look like the one below, and for "give me all accesses for 'Smith, John' publications" you can simply query the author field in all your logstash indices (see the query sketch after the document):
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "end"
},
"object_id": "boreal:12345",
"author": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
...
}

elasticsearch more_like_this query is taking a long time to run

I have the below more_like_this query in Elasticsearch.
I run it in a loop 15 times, with a different art_title and art_tags each time. For some articles it takes very little time, but for other articles in the loop it takes too long to execute. Is there anything I can do to optimize this query? Any help is appreciated.
bodyquery = {
    "query": {
        "bool": {
            "should": [
                {
                    "more_like_this": {
                        "like_text": art_title,
                        "fields": ["title"],
                        "max_query_terms": 30,
                        "boost": 5,
                        "min_term_freq": 1
                    }
                },
                {
                    "more_like_this": {
                        "like_text": art_tags,
                        "fields": ["tags"],
                        "max_query_terms": 30,
                        "boost": 5,
                        "min_term_freq": 1
                    }
                }
            ]
        }
    }
}
I believe you might have solved this already by now, but depending on the content of your indexed docs and the analyzers applied to the fields you are querying, this can take a wide range of time to complete. Think about how similarity works and how it is calculated for your documents, and you will probably find the answer. You can also use the explain param to get a detailed, step-by-step Lucene breakdown of the scoring (see the sketch at the end of this answer). Just in case, I want to add that it is virtually impossible to determine anything more specific without further details:
What your mappings look like
How are those fields analyzed
What version of ES are you using
Your ES setup
Also, describe in English what you are trying to retrieve: "I want documents in the catalog index that have a title similar to art_title and/or a tag similar to art_tags".
There is a reference to the more_like_this syntax in the official documentation if you are using the latest version of ES.
Cheers

Does the _id of a document affect scoring?

I added two identical documents where the only difference is the _id (I restarted the scenario for each of them and did not add them sequentially, to be sure my test was correct).
One of them changes the order of the results of this query and one of them does not:
GET index_for_test/business/_search
{
"query": {
"multi_match": {
"query": "italian",
"type": "most_fields",
"fields": [ "name^2", "categories" ]
}
}
}
My original question was:
https://github.com/elastic/elasticsearch/issues/10341
As mentioned in https://groups.google.com/forum/?fromgroups=&hl=en-GB#!topic/elasticsearch/VWqA_P4zzH8, my answer is in this documentation:
https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch
Documents are spread across 5 shards by default, and queries score documents within each shard before fetching them. On small data sets this leads to inaccurate results, so if the database is small it is better to run your queries with search_type=dfs_query_then_fetch; however, that search type has scalability problems and should be dropped once the data grows.
