Elasticsearch merge multiple indexes based on common field

I'm using the ELK stack to generate views from data stored in two different databases: one is MySQL, the other is PostgreSQL. There is no way to write a join query across those two DB instances, but I have a common field called "nic". Following are the documents from each index.
MySQL
index: user_detail
"_id": "871123365V",
"_source": {
"type": "db-poc-user",
"fname": "Iraj",
"#version": "1",
"field_lname": "Sanjeewa",
"nic": "871456365V",
"#timestamp": "2020-07-22T04:12:00.376Z",
"id": 2,
"lname": "Santhosh"
}
PostgreSQL
Index: track_details
"_id": "871456365V",
"_source": {
"#version": "1",
"nic": "871456365V",
"#timestamp": "2020-07-22T04:12:00.213Z",
"track": "ELK",
"type": "db-poc-ceg"
},
I want to merge both indexes into a single new index using the common field "nic", so I can create visualizations in Kibana. How can this be achieved?
Please note that each document in the new index should have "nic", "fname", "lname", and "track" as fields, not an aggregation.

I would leverage the enrich processor to achieve this.
First, you need to create an enrich policy (use the smallest index, let's say it's user_detail):
PUT /_enrich/policy/user-policy
{
  "match": {
    "indices": "user_detail",
    "match_field": "nic",
    "enrich_fields": ["fname", "lname"]
  }
}
Then you can execute that policy in order to create an enrichment index
POST /_enrich/policy/user-policy/_execute
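If you want to double-check what was created, you can retrieve the policy definition with the get enrich policy API (optional):
GET /_enrich/policy/user-policy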
The next step requires you to create an ingest pipeline that uses the above enrich policy/index:
PUT /_ingest/pipeline/user_lookup
{
  "description" : "Enriching user details with tracks",
  "processors" : [
    {
      "enrich" : {
        "policy_name": "user-policy",
        "field" : "nic",
        "target_field": "tmp",
        "max_matches": "1"
      }
    },
    {
      "script": {
        "if": "ctx.tmp != null",
        "source": "ctx.putAll(ctx.tmp); ctx.remove('tmp');"
      }
    },
    {
      "remove": {
        "field": ["@version", "@timestamp", "type"]
      }
    }
  ]
}
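Before reindexing anything, you can optionally dry-run the pipeline against a sample document with the simulate API (the document below just mimics one of your track_details records):
POST /_ingest/pipeline/user_lookup/_simulate
{
  "docs": [
    {
      "_source": {
        "@version": "1",
        "nic": "871456365V",
        "@timestamp": "2020-07-22T04:12:00.213Z",
        "track": "ELK",
        "type": "db-poc-ceg"
      }
    }
  ]
}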
Finally, you're now ready to create your target index with the joined data. Simply leverage the _reindex API combined with the ingest pipeline we've just created:
POST _reindex
{
  "source": {
    "index": "track_details"
  },
  "dest": {
    "index": "user_tracks",
    "pipeline": "user_lookup"
  }
}
After running this, the user_tracks index will contain exactly what you need, for instance:
{
  "_index" : "user_tracks",
  "_type" : "_doc",
  "_id" : "0uA8dXMBU9tMsBeoajlw",
  "_score" : 1.0,
  "_source" : {
    "fname" : "Iraj",
    "nic" : "871456365V",
    "lname" : "Santhosh",
    "track" : "ELK"
  }
}
If your source indexes ever change (new users, changed names, etc.), you'll need to re-run the above steps, but before doing so you need to delete the ingest pipeline and the enrich policy (in that order):
DELETE /_ingest/pipeline/user_lookup
DELETE /_enrich/policy/user-policy
After that you can freely re-run the above steps.
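Optionally, you may also want to delete the destination index so the next _reindex rebuilds it from scratch instead of writing into the existing one (this step is my addition, not strictly required):
DELETE /user_tracks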
PS: Just note that I cheated a bit since the record in user_detail doesn't have the same nic in your example, but I guess it was a copy/paste issue.

Related

Elasticsearch query nested object

I have this record in elastic:
{
  "FirstName": "Winona",
  "LastName": "Ryder",
  "Notes": "<p>she is an actress</p>",
  "Age": "40-50",
  "Race": "Caucasian",
  "Gender": "Female",
  "HeightApproximation": "No",
  "Armed": false,
  "AgeCategory": "Adult",
  "ContactInfo": [
    {
      "ContactPoint": "stranger@gmail.com",
      "ContactType": "Email",
      "Details": "Details of tv show"
    }
  ]
}
I want to query inside the ContactInfo object. I used the query below but I don't get any results back:
{
  "query": {
    "nested" : {
      "path" : "ContactInfo",
      "query" : {
        "match" : { "ContactInfo.Details" : "Details of tv show" }
      }
    }
  }
}
I also tried:
{
  "query": {
    "term" : { "ContactInfo.ContactType" : "email" }
  }
}
Here is the mapping for ContactInfo:
"ContactInfo": {
  "type": "object"
}
I think I know the issue: the field is not set as nested in the mapping. Is there a way to still run a nested query without changing the mapping? I just want to avoid re-indexing the data if possible.
I'm pretty new to Elasticsearch, so I need your help.
Thanks in advance.
Elasticsearch has no concept of inner objects.
Some important points from the official Elasticsearch documentation on the nested field type:
The nested type is a specialized version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.
If you need to index arrays of objects and to maintain the independence of each object in the array, use the nested data type instead of the object data type.
Internally, nested objects index each object in the array as a separate hidden document, such that each nested object can be queried independently of the others with the nested query.
Refer to this SO answer for more details.
Below is a working example with index mapping, search query, and search result. Note that you have to reindex your data after applying the nested data type.
Index Mapping:
{
  "mappings": {
    "properties": {
      "ContactInfo": {
        "type": "nested"
      }
    }
  }
}
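Since the nested type can't be switched on in place, moving the existing documents over to an index with the mapping above can be done with the _reindex API (a minimal sketch; the index names are placeholders for your actual old and new indices):
POST _reindex
{
  "source": {
    "index": "old_index"
  },
  "dest": {
    "index": "new_nested_index"
  }
}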
Search Query:
{
  "query": {
    "nested" : {
      "path" : "ContactInfo",
      "query" : {
        "match" : { "ContactInfo.Details" : "Details of tv show" }
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "stof_64269180",
"_type": "_doc",
"_id": "1",
"_score": 1.1507283,
"_source": {
"FirstName": "Winona",
"LastName": "Ryder",
"Notes": "<p>she is an actress</p>",
"Age": "40-50",
"Race": "Caucasian",
"Gender": "Female",
"HeightApproximation": "No",
"Armed": false,
"AgeCategory": "Adult",
"ContactInfo": [
{
"ContactPoint": "stranger#gmail.com",
"ContactType": "Email",
"Details": "Details of tv show"
}
]
}
}
]

Is there a way to enable _source on existing data?

I created an index without the _source field (for memory considerations). I want to enable this field on the existing data. Is there a way to do that?
For example, I will create a dummy-index:
PUT /dummy-index?pretty
{
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}
and I will add the following document:
PUT /dummy-index/_doc/1?pretty
{
  "name": "CoderIl"
}
When I search, I get only the hit metadata, without the name field:
{
  "_index" : "dummy-index",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0
}
The question is whether I can change _source to enabled so that when I search again I'll get the missing data (in this example, the "name" field):
{
  "_index" : "dummy-index",
  "_type" : "_doc",
  "_id" : "1",
  "_score" : 1.0,
  "_source" : {
    "name" : "CoderIl"
  }
}
As clarified in the chat, the issue is that the _source field is disabled, and in the search results he wants what was stored in the fields, which is returned as part of _source when it's enabled, like below:
"_source" : {
  "name" : "CoderIl"
}
In order to achieve this, the store option must be enabled on the field. Please note this can't be changed dynamically; you have to re-index the data again with the updated mapping.
Example
Index mapping
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "name": {
        "type": "text"
      },
      "title": {
        "type": "text",
        "store": true
      }
    }
  }
}
Index sample docs
{
  "name": "coderIL"
}
{
  "name": "coderIL",
  "title": "seconds docs"
}
Search docs with stored field content using stored_fields:
{
  "stored_fields": [
    "title"
  ],
  "query": {
    "match": {
      "name": "coderIL"
    }
  }
}
And the search result:
"hits": [
{
"_index": "without_source",
"_type": "_doc",
"_id": "1",
"_score": 0.18232156
},
{
"_index": "without_source",
"_type": "_doc",
"_id": "2",
"_score": 0.18232156,
"fields": {
"title": [
"seconds docs"
]
}
}
]
The store option on a field controls this; from the same official doc:
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved. Usually this doesn't matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
As mentioned in the doc, store is disabled by default, and if you want to save space you can enable it on specific fields only, which requires re-indexing the data again.
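For comparison, if _source had been enabled, you could have fetched just the one field with source filtering instead of stored fields, e.g. (a sketch, assuming a hypothetical index where _source was left enabled):
GET /with_source/_search
{
  "_source": ["name"],
  "query": {
    "match": {
      "name": "coderIL"
    }
  }
}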
Edit: the index option (enabled by default) controls whether a field is indexed or not, which is required for searching on the field, while the store option controls whether the field value is stored or not. The stored value is useful if you want to get back the non-analyzed value, i.e. what you sent to ES in your index request, which, depending on the field type, otherwise goes through text analysis as part of indexing. Refer to this SO question for more info.
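To make the two options concrete, here is a hypothetical mapping (index and field names are made up for illustration) where one field is searchable but not stored, and another is stored but not searchable:
PUT /options_demo
{
  "mappings": {
    "properties": {
      "searchable_only": {
        // indexed (the default) but not stored: can be queried, not returned via stored_fields
        "type": "text"
      },
      "stored_only": {
        // not searchable, but retrievable via stored_fields
        "type": "text",
        "index": false,
        "store": true
      }
    }
  }
}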

Replace document in Elasticsearch index with field condition

If I have indexed a document in Elasticsearch that contains a datetime parameter, or some kind of sequence number, can I update/replace the entire document with a new version if, and only if, the value in my new document is greater than that in the currently indexed document?
Searching has shown me so far how I can affect the values of specific fields through scripting, but I'm not sure if I can use a script or operation as an update criterion, and replace the whole document if it's met.
To be more specific, we have a document object that contains a timestamp of when it was placed on the queue for processing, and since we may have multiple processors pulling things off the queue we would like to ensure that we only index documents newer than the one we already have in the index, discarding any old changes.
Try the _update_by_query API.
Update By Query
Example:
Mappings
PUT my_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "integer"
      }
    }
  }
}
Indexing documents
POST my_index/_doc/1
{
  "user": "user1",
  "timestamp": 1234
}
POST my_index/_doc/2
{
  "user": "user2",
  "timestamp": 1235
}
Update By Query
Let's update only documents with a timestamp greater than 1234 (the script updates the user field):
POST /my_index/_update_by_query
{
  "script": {
    "source": "ctx._source.user = 'new user';",
    "lang": "painless"
  },
  "query": {
    "range": {
      "timestamp": {
        "gt": 1234
      }
    }
  }
}
You can update other fields or insert new ones; just play with the script source, e.g. "ctx._source.user = 'new user'; ctx._source.timestamp = 456; ctx._source.new_field = 'value'".
Results
{
  "_index": "my_index",
  "_type": "_doc",
  "_id": "2",
  "_score": 1,
  "_source": {
    "user": "new user",
    "timestamp": 1235
  }
}
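Closer to the original requirement (replace the whole document only when the incoming timestamp is newer), a single-document scripted upsert can compare timestamps and skip the write otherwise. A minimal sketch, assuming the incoming document is passed via params (the params.doc payload below is illustrative, not from the question):
POST my_index/_update/2
{
  "scripted_upsert": true,
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.timestamp != null && ctx._source.timestamp >= params.doc.timestamp) { ctx.op = 'noop' } else { ctx._source = params.doc }",
    "params": {
      "doc": {
        "user": "user2",
        "timestamp": 1300
      }
    }
  },
  "upsert": {}
}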
Hope this helps

How to index same doc in different indices with different routing

I need to be able to index the same document into different indexes with different routing values.
Basically the problem to solve is to be able to calculate complex aggregations about payment information from the perspective of payer and collector. For example, "payments made / received in the last 15 days grouped by status"
I was wondering how we can achieve this using ElasticSearch bulk api.
Is it possible to achieve this without generating redundancy in the ndjson? Something like this for example:
POST _bulk
{ "index" : { "_index" : "test_1", "_id" : "1", "routing": "1234" } }
{ "index" : { "_index" : "test_2", "_id" : "1", "routing": "5678" } }
{ "field1" : "value1" }
I looked for documentation but I didn't find a place that explains this.
By only using the bulk API, you'll need to repeat the document each time.
Another way of doing it is to bulk-index the documents into the first index and then, using the Reindex API, create the second index with a different routing value for each document.
POST _bulk
{ "index" : { "_index" : "test_1", "_id" : "1", "routing": "1234" } }
{ "field1" : "value1", "routing2": "5678" }
And then you can reindex into the second index using the second routing value (which you need to store in the document somehow):
POST _reindex
{
  "source": {
    "index": "test_1"
  },
  "dest": {
    "index": "test_2"
  },
  "script": {
    "source": "ctx._routing = ctx._source.routing2",
    "lang": "painless"
  }
}
That way, you only index the data once using the bulk API, which will take roughly half the time compared to doubling all documents, and then by leveraging the Reindex API all the data is reindexed internally (i.e. without the added network latency of sending a potentially big payload a second time).
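If you don't want the helper routing2 field to end up in the second index, the same reindex script can also strip it out (a small variation of the above):
POST _reindex
{
  "source": {
    "index": "test_1"
  },
  "dest": {
    "index": "test_2"
  },
  "script": {
    "source": "ctx._routing = ctx._source.routing2; ctx._source.remove('routing2')",
    "lang": "painless"
  }
}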

Replacing (Bulk Update) Nested documents in ElasticSearch

I have an Elasticsearch index with vacation rentals (100K+), each including a property with nested documents for availability dates (1,000+ per 'parent' document). Periodically (several times daily), I need to replace the entire set of nested documents for each property (to have fresh availability data per vacation rental property); however, Elasticsearch's default behavior is to merge nested documents.
Here is a snippet of the mapping (availability dates in the "bookingInfo"):
{
  "vacation-rental-properties": {
    "mappings": {
      "property": {
        "dynamic": "false",
        "properties": {
          "bookingInfo": {
            "type": "nested",
            "properties": {
              "avail": {
                "type": "integer"
              },
              "datum": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "in": {
                "type": "boolean"
              },
              "min": {
                "type": "integer"
              },
              "out": {
                "type": "boolean"
              },
              "u": {
                "type": "integer"
              }
            }
          }
          // this part left out
        }
      }
    }
  }
}
Unfortunately, our current underlying business logic does not allow us to replace or update parts of the bookingInfo nested documents; we need to replace the entire array of nested documents. With the default behavior, updating the 'parent' doc merely adds new nested docs to bookingInfo (unless they already exist, in which case they're updated), leaving the index with a lot of old dates that should no longer be there (if they're in the past, they're not bookable anyway).
How do I go about making the update call to ES?
Currently using a bulk call such as (two lines for each doc):
{ "update" : {"_id" : "abcd1234", "_type" : "property", "_index" : "vacation-rental-properties"} }
{ "doc" : {"bookingInfo" : ["all of the documents here"]} }
I have found this question that seems related, and I wonder if the following will work (after first enabling scripts via script.inline: on in the config file for version 1.6+):
curl -XPOST localhost:9200/the-index-and-property-here/_update -d '{
  "script" : "ctx._source.bookingInfo = updated_bookingInfo",
  "params" : {
    "updated_bookingInfo" : {"field": "bookingInfo"}
  }
}'
How do I translate that to a bulk call for the above?
Using ElasticSearch 1.7, this is the way I solved it. I hope it can be of help to someone, as a future reference.
{ "update": { "_id": "abcd1234", "_retry_on_conflict" : 3} }\n
{ "script" : { "inline": "ctx._source.bookingInfo = param1", "lang" : "js", "params" : {"param1" : ["All of the nested docs here"]}}\n
...and so on for each entry in the bulk update call.
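For reference, on recent Elasticsearch versions (where Painless has replaced the js/groovy script languages), the equivalent bulk lines would presumably look like this (a sketch, not tested against the 1.7-era mapping above):
{ "update": { "_id": "abcd1234", "retry_on_conflict": 3 } }
{ "script": { "source": "ctx._source.bookingInfo = params.param1", "lang": "painless", "params": { "param1": ["all of the nested docs here"] } } }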
