Elastic Ingest Pipeline split field and create a nested field - elasticsearch

Dear friendly helpers,
I have an index that is fed by a database via Kafka. This database holds a field that aggregates several pieces of information as key/value; key/value; pairs (don't ask why, I have no idea who designed it like that ;-) )
93/4; 34/12;
it can be empty, or it can hold 1..n key/value pairs.
I want to use an ingest pipeline and ideally end up with a "nested" field which holds all the values that are in that field.
Probably like this:
{"categories":
{ "93": 7,
"82": 4
}
}
The use case is the following: we want to visualize the sum of a filtered set of these categories (they tell us how many minutes longer a specific process took) and group the results into ranges.
Example: I filter categories x, y, z and then count how many documents for the day had no delay, how many had a delay of up to 5 minutes, and how many had a delay between 5 and 15 minutes.
I tried to get the fields neatly separated with the kv processor and wanted to work from there, but I guess that was completely the wrong approach.
"kv": {
"field": "IncomingField",
"field_split": ";",
"value_split": "/",
"target_field": "delays",
"ignore_missing": true,
"trim_key": "\\s",
"trim_value": "\\s",
"ignore_failure": true
}
When I test the pipeline, the output looks OK:
"delays": {
"62": "3",
"86": "2"
}
but there are two things that don't work.
I can't know upfront how many of these combinations I will get, so converting the values from string to int in the same pipeline is an issue.
When I want to create a Kibana index pattern, I end up with many fields like delay.82 and delay.82.keyword, which does not make sense at all for the use case, as I can't filter (get only the sum of delays where the key is one of x, y, z) and aggregate.
I have looked into other processors (dot_expander) but can't really get my head around how to get this working.
I hope my question is clear (sorry, my English is limited) and that someone can point me in the right direction.
Thank you very much!

Instead, you should structure them as an array of objects with shared accessors, for instance:
[ {key: 93, value: 7}, ...]
That way, you'll be able to aggregate on categories.key and categories.value.
So this means iterating over the categories' entrySet() using a custom script processor, like so:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "extracts k/v pairs",
    "processors": [
      {
        "script": {
          "source": """
            def categories = ctx.categories;
            def kv_pairs = new ArrayList();
            for (def pair : categories.entrySet()) {
              def k = pair.getKey();
              def v = pair.getValue();
              kv_pairs.add(["key": k, "value": v]);
            }
            ctx.categories = kv_pairs;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "categories": {
          "82": 4,
          "93": 7
        }
      }
    }
  ]
}
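Since the kv processor in the question emits string values under its target_field delays, a minimal variation of the same script could also parse them on the fly; this is just a sketch, assuming every value is a plain integer:
"script": {
  "source": """
    def delays = ctx.delays;
    def kv_pairs = new ArrayList();
    for (def pair : delays.entrySet()) {
      // convert the string value ("3") into an integer (3)
      kv_pairs.add(["key": pair.getKey(), "value": Integer.parseInt(pair.getValue())]);
    }
    ctx.categories = kv_pairs;
    // drop the intermediate kv output
    ctx.remove('delays');
  """
}
If the incoming field can be empty, you'd also want a null check on ctx.delays (or keep ignore_failure) so the processor doesn't fail on documents without it.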
P.S.: Do make sure your categories field is mapped as nested, because otherwise you'll lose the connection between the keys and the values (this is also called flattening).
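For reference, a minimal sketch of such a nested mapping, together with the filtered sum the original use case asks for (the index name my-index and the selected keys are just placeholders):
PUT my-index
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "nested",
        "properties": {
          "key":   { "type": "keyword" },
          "value": { "type": "integer" }
        }
      }
    }
  }
}

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "delays": {
      "nested": { "path": "categories" },
      "aggs": {
        "selected_keys": {
          "filter": { "terms": { "categories.key": ["62", "86"] } },
          "aggs": {
            "total_delay": { "sum": { "field": "categories.value" } }
          }
        }
      }
    }
  }
}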

Related

Binning Data With Two Timestamps

I'm posting because I have found no content surrounding this topic.
My goal is essentially to produce a time-binned graph that plots some aggregated value. Usually this would be a doddle, since there is a single timestamp for each value, making it relatively straightforward to bin.
However, my problem lies in having two timestamps for each value, a start and an end, similar to a Gantt chart. I essentially want to bin the values (average) for the time the timelines fall within each bin (bin boundaries could be where a new/old task starts/ends).
I'm looking for a basic example, or an answer to whether this is even supported in Vega-Lite. My current working example would yield no benefit to this discussion.
I see that you found a Vega solution, but I think what you were looking for in Vega-Lite was something like the following: put the start field in "x" and the end field in "x2", add bin and type to "x", and all should work.
"encoding": {
"x": {
"field": "start_time",
"bin": { "binned": true },
"type": "temporal",
"title": "Time"
},
"x2": {
"field": "end_time"
}
}
I lost my old account, but I was the person who posted this. Here is my solution to my question. The value I am aggregating here is the total time each datapoint's timeline is contained within each bin.
First, you want to use a joinaggregate transform to get the max and min times your data extends to. You could also hardcode these.
{
  type: joinaggregate
  fields: [startTime, endTime]
  ops: [min, max]
  as: [min, max]
}
You then want to find a step size for your bins; you can hardcode this or compute it with a formula transform and write it into a new field, for instance like the sketch below.
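A minimal sketch of such a transform (assuming you simply want, say, 20 equal-width bins; the field name step is arbitrary):
{
  type: formula
  expr: (datum.max - datum.min) / 20
  as: step
}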
Then you want to create two new fields in your data: one a sequence from the min to the max, and the other the same sequence offset by your step.
{
  type: formula
  expr: sequence(datum.min, datum.max, datum.step)
  as: startBin
}
{
  type: formula
  expr: sequence(datum.min + datum.step, datum.max + datum.step, datum.step)
  as: endBin
}
The new fields will be arrays. So if we go ahead and use a flatten transform we will get a row for each data value in each bin.
{
  type: flatten
  fields: [startBin, endBin]
}
You then want to calculate the total time your data spans within each specific bin. To do this, clamp the start time up to the bin start and the end time down to the bin end, then take the difference between the two.
{
  type: formula
  expr: if(datum.startTime<datum.startBin, datum.startBin, if(datum.startTime>datum.endBin, datum.endBin, datum.startTime))
  as: startBinTime
}
{
  type: formula
  expr: if(datum.endTime<datum.startBin, datum.startBin, if(datum.endTime>datum.endBin, datum.endBin, datum.endTime))
  as: endBinTime
}
{
  type: formula
  expr: datum.endBinTime - datum.startBinTime
  as: timeInBin
}
Finally, you just need to aggregate the data by the bins and sum up these times. Then your data is ready to be plotted.
{
  type: aggregate
  groupby: [startBin, endBin]
  fields: [timeInBin]
  ops: [sum]
  as: [timeInBin]
}
Although this solution is long, it is relatively easy to implement in the transform section of your data. In my experience it runs fast and just shows how versatile Vega can be. Freedom to visualisations!

Maps vs Lists in Elasticsearch for optimized query performance

I have some data I will be putting into Elasticsearch, and want to decide on a format that will optimize query performance. The query will be in words: "Is ID X in category Y?". I have a fixed number of categories (small, say, 5), and possibly a large number of IDs to put into each category (currently in the dozens, but of indeterminate size in the future). Each ID will be in at most one category (possibly none).
Format 1:
{
  "field1": "value1",
  ...
  "categories": {
    "category1": ["id10", "id24", "id38", ...],
    ...
    "category5": ["id62", "id19", "id82", ...]
  }
}
or
Format 2:
{
  "field1": "value1",
  ...
  "categories": {
    "id1": "category4",
    "id2": "category2",
    "id3": "category1",
    ...
  }
}
Which data format would be preferred? The latter format has linear lookup time, but possibly many keys.
I think format 1 is better. The number of IDs will keep growing, and with format 2 you may need to stop mapping the categories object or raise the index's field limit; with format 1 it is also more convenient to determine the category of a single ID (e.g. with indexOf). There are pros and cons either way; maybe there's a better approach.
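For the "Is ID X in category Y?" check with format 1, a minimal sketch (assuming the category arrays are mapped as keyword; the index name my-index is a placeholder):
GET my-index/_search
{
  "query": {
    "term": {
      "categories.category1": "id10"
    }
  }
}
A hit means id10 is in category1 for that document.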

One large Elasticsearch lookup index, or several smaller ones?

I'm creating a lookup index that I'll use solely as a terms filter. So no searching/aggregating, only filtering and GETs.
I'm debating the structure of this lookup index, whether each document should contain all of the fields I want to filter for, or whether I should create an index per field.
For example, let's say each document pertains to a user. Each user has a list of games they've played, books they've read, and movies they've watched. When searching for game/book/movie recommendations, I'll use the term filter to filter out those items they've already interacted with.
I'm wondering if I should have a single lookup index with a document mapping like:
users_index
{
  'game_ids': [],
  'movie_ids': [],
  'book_ids': []
}
or one index per lookup value, like:
user_games_index
{
  'game_ids': []
}

user_movies_index
{
  'movie_ids': []
}

user_books_index
{
  'book_ids': []
}
Pros for one index:
Each index comes with overhead, so the fewer the better
If I ever want to retrieve all of a user's info, it's all in one index
Pros for multiple indices:
According to the update API docs, updating a document means retrieving the whole document first. I will be updating each document a lot, and those arrays can become rather large (think thousands of IDs). Updating a book ID would then retrieve all of the game IDs as well, which takes up memory. If they were in separate indices, I could avoid that.
Just easier to maintain on my end of things
I should note that if I use multiple indices, it'll only be 4 or 5, with about 500k documents per index. Also, only 1 primary shard per index, no replicas, and I'm on a single m5.2xlarge EC2 instance (8 cores, 32G ram).
Are these stats so small that it won't really matter at this point, or should I favor one index or many?
How about a third option?
You have one index, and each document in it looks something like this:
{
  "user_id": "some_user",
  "document_type": "movie" or "game" or "book",
  "document_id": "id of the movie, game or book"
}
Why? Since you say a user's games, movies or books will be updated often, this approach lets you easily add / delete individual movies, games or books for users.
You also can easily filter the books/movies/games for specific users.
All values are of type "keyword" and filtering should be fast.
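For example, recording or removing a single interaction then becomes a plain index or delete call (the index name user_interactions and the document ID scheme here are just hypothetical placeholders):
PUT user_interactions/_doc/some_user-movie-m123
{
  "user_id": "some_user",
  "document_type": "movie",
  "document_id": "m123"
}

DELETE user_interactions/_doc/some_user-movie-m123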
PS: A "good" mapping for an ES index will try to minimize the numbers of updates on individual documents and rather work at the level of inserting / deleting documents as ES does this task very well compared to finding & updating documents.
Edit: I have added query examples to illustrate how you can filter out results with bool query.
Example:
I want all movies / games / books a user X has NOT interacted with.
GET _search
{
  "query": {
    "bool": {
      "must_not": {
        "term": {
          "user_id": "user X"
        }
      }
    }
  }
}
I want only movies a user X has NOT interacted with.
GET _search
{
  "query": {
    "bool": {
      "must_not": {
        "term": {
          "user_id": "user X"
        }
      },
      "filter": {
        "term": {
          "document_type": "movie"
        }
      }
    }
  }
}

Protocol buffers Fieldmask on Collections within resource

If I want to update the "amount" field within a particular element of the "f_units" collection in the resource below (a protocol buffer), what will the FieldMask look like? Does the FieldMask operate on array indices for collections?
{
  "f_sel": {
    "f_units": [
      {
        "id": "1",
        "amount": {
          "coefficient": 1000,
          "exponent": -2
        }
      },
      {
        "id": "2",
        "amount": {
          "coefficient": 2000,
          "exponent": -2
        }
      }
    ]
  }
}
Will it be "f_sel.f_units.0.amount"? How can I update the amount using a FieldMask?
As far as I know, there is no way to replace individual elements of a repeated field with an index in a FieldMask.
Instead, you'd update the amount field for the element within f_units you wish to change and set the FieldMask to
"f_sel.f_units"
It would be slightly more efficient to only have to send a delta to the original list, but it would be hard to prevent bugs. For example, what if the proto was modified in the meantime and the specified index (presuming there was a way to specify one) for the repeated field was no longer in range?
As an aside, Google does propose the concept of MergeOptions which defines semantics for how repeated fields are to be handled when merging. Currently, it appears they intend for you either to replace the repeated field in its entirety or append to the end of the destination field. Both of these merging strategies avoid the aforementioned bug that could be caused by specifying an invalid index.

Relative Performance of ElasticSearch on inner fields vs outer fields

All other things being equal, including indexing, I'm wondering if it is more performant to search on fields closer to the root of the document.
For example, let's say we have a document with a customer ID. There are two ways to store this:
{
  "customer_id": "xyz"
}
and
{
  "customer": {
    "id": "xyz"
  }
}
Will it be any slower to search for documents where "customer.id = 'xyz'" than to search for documents where "customer_id = 'xyz'"?
That's pure syntactic sugar. The second form, i.e. using an object type, will be flattened out and internally stored as
"customer.id": "xyz"
Hence, both forms you described are semantically equivalent as far as what gets indexed into ES, i.e.:
"customer_id": "xyz"
"customer.id": "xyz"
