Duplicated Elasticsearch documents

We use a Spring Boot application to insert/update Elasticsearch documents. Our data provider sends us data via Kafka. Our app processes each event, tries to find the record, and inserts the record if it does not exist or updates it if the received record differs from the saved one. There shouldn't be any duplicated records in Elasticsearch.
The app inserts/updates documents with the IMMEDIATE refresh policy.
Problem:
Occasionally we have to remove all the data and load it again, because there are duplicated records. I found out that these cloned records differ only in insert date; it's usually a few hours' difference.
Generally it works as expected, and detailed integration tests on org.codelibs.elasticsearch-cluster-runner are green.
Example metadata from elastic search query:
{
  "docs" : [
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "s0z-BHIBCvxpj4TjysIf",
      "_version" : 1,
      "_seq_no" : 17315835,
      "_primary_term" : 40,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589221706262,
        ...
      }
    },
    {
      "_index" : "reference",
      "_type" : "reference",
      "_id" : "jdVCBHIBXucoJmjM8emL",
      "_version" : 1,
      "_seq_no" : 17346529,
      "_primary_term" : 41,
      "found" : true,
      "_source" : {
        ...
        "insertedDate" : 1589209395577,
        ...
      }
    }
  ]
}
Tests
I loaded data into a local instance of ES many times - no duplications
I created a few long-running integration tests with a large number of inserts, updates, and queries on a local instance of org.codelibs.elasticsearch-cluster-runner with 1 to 5 in-memory nodes - no duplications
Details:
Elasticsearch version - 7.5
ES connection via org.elasticsearch.client.RestHighLevelClient

The reason has been found: one of the nodes had problems establishing a connection and tended to disconnect from time to time.
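That would explain the duplicates: the _id values in the example above look auto-generated, and indexing without an explicit ID is not idempotent, so a write retried after a dropped connection creates a second document. A minimal sketch of the difference, assuming the record has some unique business key (the businessKey field here is hypothetical):

With an auto-generated _id, a retry creates a duplicate:
POST /reference/_doc
{
  "businessKey" : "abc-123",
  "insertedDate" : 1589221706262
}

With a deterministic _id derived from the business key, a retry overwrites the same document instead:
PUT /reference/_doc/abc-123
{
  "businessKey" : "abc-123",
  "insertedDate" : 1589221706262
}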

Related

How to de-normalize the relational data in Elasticsearch

I am working on a social networking application and I am using Elasticsearch for service data. I have multiple joins in Elasticsearch. Users can share posts, and each post has one parent user. I have a scenario where I show the posts of those users whom you follow.
Type Post
{
  "_index" : "xxxxxx",
  "_type" : "_doc",
  "_id" : "p-370648",
  "_score" : null,
  "_routing" : "2",
  "_source" : {
    "uid" : "9a73b0e0-a52c-11ec-aa58-37061b467b8c",
    "user_id" : 87,
    "id" : 370648,
    "type" : {
      "parent" : "u-87",
      "name" : "post"
    },
    "item_type_number" : 2,
    "source_key" : "youtube-5wcpIrpbvXQ#2"
  }
}
Type User
{
  "_index" : "trending",
  "_type" : "_doc",
  "_id" : "u-56432",
  "_score" : null,
  "_routing" : "1",
  "_source" : {
    "gender" : "female",
    "picture" : "125252125.jpg",
    "uid" : "928de1a5-cc93-4fd3-adec-b9fb220abc2b",
    "full_name" : "Shannon Owens",
    "dob" : "1990-08-18",
    "id" : 56432,
    "username" : "local_12556",
    "type" : {
      "name" : "user"
    }
  }
}
Type Follow
{
  "_index" : "trending",
  "_type" : "_doc",
  "_id" : "fr-561763",
  "_score" : null,
  "_routing" : "6",
  "_source" : {
    "user_id" : 25358,
    "id" : 561763,
    "object_id" : 36768,
    "status" : "U",
    "type" : {
      "parent" : "u-36768",
      "name" : "followers"
    }
  }
}
So in this scenario, if a user follows someone, we save a record in Elasticsearch with object_id (the followed user), user_id (the user who follows), and type "followers"; on the other hand, each post has one parent user. So when I try to fetch posts with type post, I need two levels of joins.
The first join ties the post to its parent user, and the second checks the following status against the user. This query works well when there is no traffic on the system, but when traffic comes and the system receives concurrent requests, the Elasticsearch query breaks down under the processing load, even though I tried to fix the issue with a better server with higher CPU/RAM.
So I decided to denormalize the post type data, but the problem is that I have failed to check the following status with the post.
If I do another query from the DB and use some caching, I face memory exhaustion when thousands of followed users' data come back in a query. So is there any way to check the following status directly with type posts, instead of adding a parent join in the query?
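One denormalization-friendly option (a sketch, not from the question): keep the author's user_id on each post document, maintain one document per follower listing everyone they follow, and use a terms lookup query so Elasticsearch fetches that list itself at search time. The following index, its following_ids field, and the IDs below are all hypothetical:

PUT /following/_doc/u-25358
{
  "following_ids" : [36768, 87]
}

GET /trending/_search
{
  "query" : {
    "terms" : {
      "user_id" : {
        "index" : "following",
        "id" : "u-25358",
        "path" : "following_ids"
      }
    }
  }
}

The trade-off is write-time maintenance: every follow/unfollow must update that one following_ids document, but the search side then needs no parent/child join at all.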

How to Curl Kibana Dashboard with GET Response to Find All Dashboard IDs and/or Dashboard Content

So I am trying to automate the scrape of our internal Kibana dashboards from within our environments for information-gathering purposes. I have looked through the following link, but Elasticsearch doesn't really seem to provide good examples of what I am trying to do or accomplish here. Several constraints I have: 1. the commands must be in BASH; 2. I cannot use an interpreter such as Python with the Requests and/or BeautifulSoup modules to grab the information and parse it.
Here is my Dilemma:
I log in to the Kibana Dashboard:
http://<IP_ADDRESS>:5601/app/kibana#/dashboards?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-1h,mode:quick,to:now))
It will look like this if I am properly tunneled into the environment.
There are three dashboards that I want to collect:
API RESPONSES
Logs
Notifications
The example curl commands I am using to scrape the dashboards are as follows:
curl -s http://<IP_ADDRESS>:5601/app/kibana#/dashboard/API\ RESPONSES
curl -s http://<IP_ADDRESS>:5601/app/kibana#/dashboard/logs
curl -s http://<IP_ADDRESS>:5601/app/kibana#/dashboard/notifications
Now, the Elasticsearch documentation mentions something about a Dashboard ID, which I cannot see unless I open a webpage and use the inspect tool on the particular element I am sending the GET request to. I am trying to accomplish this by curling the main dashboard page:
curl -s http://<IP_ADDRESS>:5601/app/kibana#/dashboard/_search?pretty
The output returns HTML, but it doesn't seem to change, and I cannot properly acquire the dashboards without knowing the Dashboard ID. Furthermore, I want to see which dashboards are available and scrape all of them, depending on what a person has set up within the environment, so it's important that this process is dynamic. My eventual and ultimate goals are to get:
Dashboard IDs Available
Scrape the Dashboards by IDs
Basically I want to curl this output to get the return JSON.
Any thoughts would be greatly appreciated.
So apparently, I was curling the wrong location.
I needed to curl the VIP on port 9200 and query the .kibana index to pull in the available dashboards.
rbarrett#cfg01:~$ curl -s http://<IP_ADDRESS>:9200/.kibana/dashboard/_search?pretty
{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : ".kibana",
        "_type" : "dashboard",
        "_id" : "logs",
        "_score" : 1.0,
        "_source" : {
          "description" : "",
          "hits" : 0,
          "kibanaSavedObjectMeta" : {
            "searchSourceJSON" : "{\"filter\":[{\"query\":{\"query_string\":{\"analyze_wildcard\":true,\"query\":\"*\"}}}]}"
          },
          "optionsJSON" : "{\"darkTheme\":true}",
          "panelsJSON" : "[{\"col\":1,\"columns\":[\"Hostname\",\"Logger\",\"programname\",\"severity_label\",\"Payload\",\"environment_label\"],\"id\":\"search-logs\",\"panelIndex\":5,\"row\":13,\"size_x\":12,\"size_y\":12,\"sort\":[\"Timestamp\",\"desc\"],\"type\":\"search\"},{\"col\":1,\"id\":\"NUMBER-OF-LOG-MESSAGES-PER-SEVERITY\",\"panelIndex\":7,\"row\":9,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"TOP-10-PROGRAMS\",\"panelIndex\":9,\"row\":5,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":1,\"id\":\"LOG-MESSAGES-OVER-TIME-PER-SOURCE\",\"panelIndex\":10,\"row\":1,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"TOP-10-HOSTS\",\"panelIndex\":11,\"row\":9,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":1,\"id\":\"TOP-10-SOURCES\",\"panelIndex\":14,\"row\":5,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"LOG-MESSAGES-OVER-TIME-PER-SEVERITY\",\"panelIndex\":16,\"row\":1,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"}]",
          "timeFrom" : "now-1h",
          "timeRestore" : true,
          "timeTo" : "now",
          "title" : "Logs",
          "uiStateJSON" : "{\"P-10\":{\"vis\":{\"legendOpen\":true}},\"P-11\":{\"vis\":{\"colors\":{\"Count\":\"#629E51\"},\"legendOpen\":true}},\"P-12\":{\"spy\":{\"mode\":{\"fill\":false,\"name\":null}},\"vis\":{\"colors\":{\"Count\":\"#2F575E\"},\"legendOpen\":false}},\"P-14\":{\"vis\":{\"legendOpen\":true}},\"P-7\":{\"vis\":{\"legendOpen\":false}},\"P-9\":{\"vis\":{\"colors\":{\"Count\":\"#99440A\"},\"legendOpen\":true}}}",
          "version" : 1
        }
      },
      {
        "_index" : ".kibana",
        "_type" : "dashboard",
        "_id" : "notifications",
        "_score" : 1.0,
        "_source" : {
          "description" : "",
          "hits" : 0,
          "kibanaSavedObjectMeta" : {
            "searchSourceJSON" : "{\"filter\":[{\"query\":{\"query_string\":{\"analyze_wildcard\":true,\"query\":\"*\"}}}]}"
          },
          "optionsJSON" : "{\"darkTheme\":true}",
          "panelsJSON" : "[{\"col\":1,\"columns\":[\"Logger\",\"publisher\",\"severity_label\",\"event_type\",\"old_state\",\"old_task_state\",\"state\",\"new_task_state\",\"environment_label\",\"display_name\"],\"id\":\"search-notifications\",\"panelIndex\":1,\"row\":14,\"size_x\":12,\"size_y\":11,\"sort\":[\"Timestamp\",\"desc\"],\"type\":\"search\"},{\"col\":1,\"id\":\"NOTIFICATIONS-OVER-TIME-PER-SOURCE\",\"panelIndex\":2,\"row\":1,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"NOTIFICATIONS-OVER-TIME-PER-SEVERITY\",\"panelIndex\":3,\"row\":1,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"EVENT-TYPE-BREAKDOWN\",\"panelIndex\":4,\"row\":5,\"size_x\":6,\"size_y\":5,\"type\":\"visualization\"},{\"col\":1,\"id\":\"SOURCE-BREAKDOWN\",\"panelIndex\":5,\"row\":5,\"size_x\":6,\"size_y\":5,\"type\":\"visualization\"},{\"col\":1,\"id\":\"HOST-BREAKDOWN\",\"panelIndex\":6,\"row\":10,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"},{\"col\":7,\"id\":\"NOTIFICATIONS-PER-SEVERITY\",\"panelIndex\":7,\"row\":10,\"size_x\":6,\"size_y\":4,\"type\":\"visualization\"}]",
          "timeFrom" : "now-1h",
          "timeRestore" : true,
          "timeTo" : "now",
          "title" : "Notifications",
          "uiStateJSON" : "{\"P-4\":{\"vis\":{\"legendOpen\":true}},\"P-7\":{\"vis\":{\"legendOpen\":false}}}",
          "version" : 1
        }
      }
    ]
  }
}
After which I was able to pull out the existing IDs with jq.
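The exact jq command wasn't shown; given the response shape above, something like this sketch would list the dashboard IDs:

curl -s http://<IP_ADDRESS>:9200/.kibana/dashboard/_search?pretty | jq -r '.hits.hits[]._id'

With the two hits above, that would print logs and notifications, which can then be fed back into per-dashboard queries.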

What is the difference between modifying and updating in Elasticsearch?

I am following the official Elasticsearch docs, where there is a section on Modifying Your Data: https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_modifying_your_data.html
So I already have a document under /customer/_doc/1:
{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "ajay"
  }
}
Below is the request to "modify"
PUT /customer/_doc/1
{
  "firstname": "ajay",
  "lastname": "tanwar"
}
GET would return the updated document
{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 2,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "firstname" : "ajay",
    "lastname" : "tanwar"
  }
}
On the next page of docs, Updating Documents https://www.elastic.co/guide/en/elasticsearch/reference/6.2/_updating_documents.html
Below is the request used to "update"
POST /customer/_doc/1/_update
{
  "doc": {
    "firstname": "ajay",
    "lastname": "tanwar"
  }
}
This also returns the same result as "modify".
Two differences I noticed between them:
1. The "modify" request bumps the _version on each request, whereas the "update" request keeps the _version the same.
2. The "modify" request's response contains "result" : "updated", whereas the "update" request's response contains "result" : "noop".
But I have a few doubts: first of all, why does "modify" return "result" : "updated"? The docs themselves say it's a modification operation. And why does "update" return "result" : "noop"? What is noop, BTW?
And if we go logically, modifying and updating are the same thing. What is the purpose of these two different APIs?
When you modify a document, you delete the old document and insert an entirely new document in its place. This is similar to HTTP's PUT method: it simply replaces the old document with whatever is sent in the HTTP body.
When you update a document, you make changes to the old document. Internally, Elasticsearch will also delete the old document and insert a new (updated) document, but the operation should be treated as if it just made changes to the old document. This is similar to HTTP's PATCH method: it keeps the old document and only applies the changes sent in the HTTP body.
"result" : "updated" means changes were made to the Elasticsearch database, whereas "result" : "noop" (no operation) means nothing happened (probably because the end result after the update would have been the same as before the update).

Using field instead of "_id" for more-like-this query

I have a slug field that I want to use to identify the object to use as a reference, instead of the "_id" field. But instead of being used as a reference, the doc seems to be used as query text to compare against. Since slug is a unique field with a simple analyzer, the query just returns exactly one result, like the following. As far as I know, there is no way to use a custom field as the _id field:
https://github.com/elastic/elasticsearch/issues/6730
So is a double lookup (finding out Elasticsearch's ID first, then doing more_like_this) the only way to achieve what I am looking for? Someone seems to have asked a similar question three years ago, but it doesn't have an answer.
ArticleDocument.search().query(
    "bool",
    should=Q(
        "more_like_this",
        fields=["slug", "text"],
        like={
            "doc": {"slug": "OEXxySDEPWaUfgTT54QvBg"},
            "_index": "article",
            "_type": "doc",
        },
        min_doc_freq=1,
        min_term_freq=1,
    ),
).to_queryset()
Returns:
<ArticleQuerySet [<Article: OEXxySDEPWaUfgTT54QvBg)>]>
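In the meantime, the double lookup is straightforward to express; a sketch, reusing the ArticleDocument from the question:

# Step 1: resolve the slug to Elasticsearch's _id (one extra round trip)
hit = ArticleDocument.search().filter("term", slug="OEXxySDEPWaUfgTT54QvBg").execute()[0]

# Step 2: reference that _id in more_like_this, so the slug is no longer
# treated as query text to compare against
ArticleDocument.search().query(
    "more_like_this",
    fields=["text"],
    like=[{"_index": "article", "_id": hit.meta.id}],
    min_doc_freq=1,
    min_term_freq=1,
).to_queryset()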
You can make one of your document's fields act as the "default" _id while ingesting data.
Logstash
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "my_name"
    document_id => "%{some_field_id}"
  }
}
Spark (Scala)
DF.saveToEs("index_name" + "/some_type", Map("es.mapping.id" -> "some_field_id"))
Index API
PUT twitter/_doc/1
{
  "user" : "kimchy",
  "post_date" : "2009-11-15T14:12:12",
  "message" : "trying out Elasticsearch"
}

The response:

{
  "_shards" : {
    "total" : 2,
    "failed" : 0,
    "successful" : 2
  },
  "_index" : "twitter",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "result" : "created"
}

Elasticsearch - How to update document

How does Elasticsearch update a document? Does it delete the original document and create a new one? I've heard this is how NoSQL databases update. Does Elasticsearch do the same as other NoSQL DBs, or does it replace/insert just the fields that need to change?
For example, I'm running Elasticsearch 7.0.0.
First, I created one document,
PUT /employee/_doc/1
{
  "first_name" : "John",
  "last_name" : "Snow",
  "age" : 19,
  "about" : "King in the north",
  "sex" : "male"
}
Then I updated it via
POST /employee/_update/1/
{
  "doc": {
    "first_name" : "Aegon",
    "last_name" : "Targaryen",
    "skill": "fighting and leading"
  }
}
Finally, I got the correct result when I ran:
GET /employee/_doc/1
{
  "_index" : "employee",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 9,
  "_seq_no" : 11,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "first_name" : "Aegon",
    "last_name" : "Targaryen",
    "age" : 19,
    "about" : "King in the north",
    "sex" : "male",
    "skill" : "fighting and leading"
  }
}
Documents in Elasticsearch are immutable objects. Updating a document is always a reindex, and it consists of the following steps:
Retrieve the JSON (that you want to reindex)
Change it
Delete the old document
Index a new document
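The _version metadata makes this visible: even a one-field partial update bumps _version, because the whole document is reindexed. A sketch against the employee document above (the new age value is illustrative):

POST /employee/_update/1
{
  "doc": {
    "age": 20
  }
}

Since the field value actually changes, the response reports an incremented _version (10 here, following the 9 in the GET above), not an in-place edit.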
Elasticsearch documentation
For the answer you can check the documentation:
In addition to being able to index and replace documents, we can also
update documents. Note though that Elasticsearch does not actually do
in-place updates under the hood. Whenever we do an update,
Elasticsearch deletes the old document and then indexes a new document
with the update applied to it in one shot.
