Joining logstash with parent record - elasticsearch

I'm using Logstash to analyze my web servers' access logs. At this point it works pretty well. I use a configuration file that produces this kind of data:
{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "en"
  },
  "object_id": "boreal:12345"
  ...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
  "type": "publication",
  "id": "boreal:12345",
  "sm_title": "The title of the publication",
  "sm_type": "thesis",
  "sm_creator": [
    "Smith, John",
    "Dupont, Albert",
    "Reegan, Ronald"
  ],
  "sm_departement": [
    "UCL/CORE - Center for Operations Research and Econometrics"
  ],
  "sm_date": "2001",
  "ss_state": "A"
  ...
}
And I would like to run a query like "give me all accesses for 'Smith, John' publications".
Since all my data is not in the same index, I can't use a parent-child relation (am I right?).
I read this on a forum, but it's an old post:
By limiting itself to parent/child type relationships elasticsearch makes life
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.
Using Logstash, I can't place all the data in a single index named logstash. Each month I have more than 1M accesses, so more than 12M records per year in one index, and I need to keep the web access data for at least 5 years (1M * 12 * 5 = 60M records at minimum).
I don't think it's a good idea to deal with a single index containing tens of millions of records (if I'm wrong, please let me know).
Does a solution to my problem exist? I haven't found any elegant one.
The only approach I have at this time, in my Python script, is: a first query to collect all the IDs of 'Smith, John' publications, then a loop over each publication to get all the web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES, and the response time is not acceptable (more than 7 seconds; not so bad given the number of records in ES, but not acceptable for the end user).
Thanks for your help, and sorry for my English.
Renaud

An idea would be to use the elasticsearch Logstash filter in order to look up a given publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field from the publications index document whose id matches the access log's object_id, and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* indices.
filter {
  elasticsearch {
    hosts  => ["localhost:9200"]
    index  => "publications"
    # find the publication whose id matches the access log's object_id
    query  => "id:%{object_id}"
    # copy its sm_creator field into the event as "author"
    fields => { "sm_creator" => "author" }
  }
}
As a result, your access log document will look like the one below, and for "give me all access for 'Smith, John' publications" you can simply query the author field across all your logstash indices:
{
  "type": "apache_access",
  "clientip": "192.243.xxx.xxx",
  "verb": "GET",
  "request": "/publications/boreal:12345?direction=rtl&language=en",
  ...
  "url_path": "/publications/boreal:12345",
  "url_params": {
    "direction": "rtl",
    "language": "en"
  },
  "object_id": "boreal:12345",
  "author": [
    "Smith, John",
    "Dupont, Albert",
    "Reegan, Ronald"
  ],
  ...
}
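For instance, a minimal sketch of that follow-up query (assuming the enriched author field ends up mapped as analyzed text; if it is mapped as a keyword, use a term query on author.keyword instead):

GET /logstash-*/_search
{
  "query": {
    "match_phrase": { "author": "Smith, John" }
  }
}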

Related

Elasticsearch re-index all vs join

I'm pretty new to Elasticsearch and all its concepts. I would like to understand how I could accomplish in an Elasticsearch architecture what I have in my relational DB.
The scenario is the following.
I have an index "data":
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1",
  "categories": ["A", "A1", "B"]
}
The requirement says that data can be queried by:
some text search in the content field
belonging to a specific type or category
So far, so simple, so good.
This data will not be complete at creation time. It might happen that categories are added to or removed from the data later, so many data uploads/re-indexes might happen along the way.
For example:
create the data
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1",
  "categories": ["A"]
}
Then it was decided that all data with type=T1 must belong to both A & B categories.
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1",
  "categories": ["A", "B"]
}
If I have a billion hits for type=T1, I would have to update/re-index a billion entries. Maybe that is how things should work, and this is where my question lies.
Is it OK to re-index all the data just to add/remove a category, or would it be possible to have a second, much smaller index just to hold this association and somehow join both indexes at query time?
Something like it:
Data:
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1"
}
DataCategories:
{
  "type": "T1",
  "categories": ["A", "B"]
}
Is it acceptable/possible?
This is a common scenario, but unfortunately there is no 1:1 mapping of RDBMS features in text search engines like Lucene/Elasticsearch.
Possible options:
1 - For the best performance, reindex. It may not be practical depending on the velocity of your changes.
2 - Consider parent-child. Though it's a slower option, it will often meet performance requirements. The category could be a parent document, each having several thousands of children.
3 - If it's only category renaming, consider using IDs for the categories and translating them to text in the application.
4 - Whether to update documents in place depends on how many need updating; for a few thousand, run an update-by-query (see the sketch below); if more, reindex.
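For option 4, a minimal sketch of such an update-by-query, assuming the index is called data and categories is an array field (both names are taken from the example above):

POST data/_update_by_query
{
  "query": {
    "match": { "type": "T1" }
  },
  "script": {
    "lang": "painless",
    "source": "if (!ctx._source.categories.contains('B')) { ctx._source.categories.add('B') }"
  }
}

This adds category B to every T1 document that doesn't already have it, without a full reindex.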
Suggested reading - https://www.elastic.co/blog/managing-relations-inside-elasticsearch

Reindexing more than 10k documents in Elasticsearch

Let's say I have an index, A. It contains 26k documents. Now I want to change the type of a field, status, to keyword. As I can't change the type of the already existing status field in A, I will create a new index B with my desired mapping.
I followed reindex API:
POST _reindex
{
  "source": {
    "index": "A",
    "size": 10000
  },
  "dest": {
    "index": "B",
    "version_type": "external"
  }
}
But the problem is that this way I can migrate only 10k docs. How do I copy the rest?
How can I copy all the docs without losing any?
Delete the "size": 10000 and the problem will be solved.
By the way, the size field inside source in the Reindex API is the batch size Elasticsearch should use to fetch and reindex docs on each scroll; by default the batch size is 1000. (You thought it meant how many documents you want to reindex in total.)
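For reference, the corrected request with the size field removed would look like this; a quick count comparison afterwards confirms nothing was lost:

POST _reindex
{
  "source": {
    "index": "A"
  },
  "dest": {
    "index": "B",
    "version_type": "external"
  }
}

GET A/_count
GET B/_count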

Joining two indexes in Elastic Search like a table join

I am relatively new to Elasticsearch. I have an index called post which contains documents like this:
{
  "id": 1,
  "link": "https://www.instagram.com/p/XXXXX/",
  "profile_id": 11,
  "like_count": 100,
  "comment_count": 12
}
I have another index called profile which contains documents like this:
{
  "id": 11,
  "username": "superman",
  "name": "Superman",
  "followers": 12312
}
So, as you guys can see, I have all the profile data under the index called profile and all the post data under the index called post. The "profile_id" present in a post document is linked with the "id" present in a profile document.
Is there any way, when I am querying the post index and filtering the post documents, for the profile data to also appear along with the post document, based on the "profile_id" present in the post document? Or to somehow fetch both data sets in a multi-index search?
Thank you guys in advance, any help will be appreciated.
For the sake of performance, Elasticsearch encourages you to denormalize your data and model your documents according to the responses you wish to get from your queries. However, in your case, I would suggest defining the post-profile relation by using a join datatype (link to Elastic documentation) and using the parent-join queries to run your searches (link to Elastic documentation).
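A minimal sketch of what that could look like, assuming you are willing to store profiles and posts together in a single index (the join datatype requires parent and child documents to live in the same index, with children routed to their parent); the index name social and the field name relation are made up for this example:

PUT /social
{
  "mappings": {
    "properties": {
      "relation": {
        "type": "join",
        "relations": { "profile": "post" }
      }
    }
  }
}

GET /social/_search
{
  "query": {
    "has_parent": {
      "parent_type": "profile",
      "query": { "match": { "username": "superman" } }
    }
  }
}

The has_parent query returns the post documents whose parent profile matches; to go the other way (profiles for matching posts) you would use has_child.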

How can I show a table with the sum of value x of all children within Kibana

I have an Elasticsearch database with documents stored in the following way (a comma separates the documents):
{
  "path": "path/to/data",
  "kind": "type1"
},
{
  "path": "path/to/data/values1",
  "kind": "type2",
  "x": 2
},
{
  "path": "path/to/data/values2",
  "kind": "type2",
  "x": 2
},
{
  "path": "path/to/data/datasub",
  "kind": "type1"
},
{
  "path": "path/to/data/datasub/values1",
  "kind": "type2",
  "x": 1
}
Now I want to create a table view/chart that shows, for every type1 path, the sum of x over all of its children.
So I expect the total for path/to/data to be 5 and the total for path/to/data/datasub to be 1.
To consider: the depth of this structure could theoretically be unlimited.
I'm running Elasticsearch 7 and Kibana 7. I want to use the table visualisation to start with, but I would like to be able to use this kind of aggregation throughout multiple visualisations. I have Googled a lot and found all kinds of Elasticsearch queries, but nothing on how to achieve this in Kibana.
All help is much appreciated.
For those who run into the same question:
The solution I ended up using is to split the path into tokens prior to importing it into Elasticsearch. So consider a document with a path like "/this/is/a/path". This becomes the following array in the document:
[
  "/this",
  "/this/is",
  "/this/is/a",
  "/this/is/a/path"
]
You can then use a terms aggregation on it with various metrics to calculate your desired measurements.
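For example, a minimal sketch of such an aggregation, assuming the token array is indexed as a keyword field named path_tokens and the documents live in an index called logs (both names are made up here):

GET /logs/_search
{
  "size": 0,
  "aggs": {
    "per_path": {
      "terms": { "field": "path_tokens", "size": 100 },
      "aggs": {
        "total_x": { "sum": { "field": "x" } }
      }
    }
  }
}

Each bucket then corresponds to a path prefix, and total_x is the sum of x over every document under that prefix, which is what the Kibana table visualisation can reproduce with a terms bucket and a sum metric.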

elasticsearch - query between document types

I have a production_order document_type
i.e.
{
  "part_number": "abc123",
  "start_date": "2018-01-20"
},
{
  "part_number": "1234",
  "start_date": "2018-04-16"
}
I want to create a commodity document type
i.e.
{
  "part_number": "abc123",
  "commodity": "1 meter machining"
},
{
  "part_number": "1234",
  "commodity": "small flat & form"
}
Production orders are data-warehoused every week and are immutable.
Commodities, on the other hand, could change over time, e.g. abc123 could change from "1 meter machining" to "5 meter machining", so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being on part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in Elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized, and our production order data warehouse currently holds 20 million records.
I have found that you can indeed now query between indexes in Elasticsearch; however, you have to ensure your data is stored correctly. Here is an example from the Elasticsearch 6.3 docs:
Terms lookup twitter example: At first we index the information for user with id 2, specifically its followers, then index a tweet from user with id 1. Finally we search on all the tweets that match the followers of user 2.
PUT /users/user/2
{
  "followers": ["1", "3"]
}

PUT /tweets/tweet/1
{
  "user": "1"
}

GET /tweets/_search
{
  "query": {
    "terms": {
      "user": {
        "index": "users",
        "type": "user",
        "id": "2",
        "path": "followers"
      }
    }
  }
}
Here is the link to the original page
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above, I need to set up my storage so that the commodity is a field and its value is an array of part numbers.
i.e.
{
  "1 meter machining": ["abc1234", "1234"]
}
I can then look up the "1 meter machining" part numbers against my production_order documents.
I have tested this and it works.
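A minimal sketch of that lookup, assuming the commodity data is stored in an index named commodities with one document per commodity (its _id being the commodity name and a part_numbers array holding the parts; these names are made up, adjust them to your actual mapping), queried against a production_orders index:

GET /production_orders/_search
{
  "query": {
    "terms": {
      "part_number": {
        "index": "commodities",
        "id": "1 meter machining",
        "path": "part_numbers"
      }
    }
  }
}

On 6.x you would also pass the "type" parameter, as in the docs example above; from 7.x onwards it is no longer needed.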
There are no joins supported in Elasticsearch.
You can query twice: first get all the part numbers matching "small flat & form", and then use those part numbers to query the other index (a sketch of the two queries follows below).
Otherwise, try to find a way to merge these into a single index; that would be better. Updating the commodities would not cause you any problems once both are combined.
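A minimal sketch of that two-step approach, using the field names from the question and made-up index names commodities and production_orders; the part-number list in the second request is whatever the first response returned:

GET /commodities/_search
{
  "_source": ["part_number"],
  "query": {
    "match": { "commodity": "small flat & form" }
  }
}

GET /production_orders/_search
{
  "query": {
    "terms": { "part_number": ["1234"] }
  }
}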
