Elasticsearch document count doesn't reflect higher indexing rate

When we monitor our Elasticsearch cluster health through Kibana, we see a very high indexing rate for one particular index, but the document count doesn't increase proportionally. How can these two observations be reconciled?
Sample document:
{
  "_index": "finance_report_fgl_reporting_log",
  "_type": "fgl_reporting_logs",
  "_id": "1907688_POINTS_ACCOUNT_DEBIT",
  "_score": 9.445704,
  "_source": {
    "reportingLogId": {
      "journalId": 1907688,
      "postingAccountId": "POINTS_ACCOUNT",
      "postingAccountingEntry": "DEBIT"
    },
    "journalId": 1907688,
    "journalEventId": "trip_completed",
    "journalEventLogId": "15db1f2b-b9d0-4edd-96f0-c4e4f8e68150",
    "journalAccountingRuleId": "trip_completed_points_payment_rule",
    "journalReferenceId": "174558200",
    "journalGrossAmount": 154.11,
    "postingJournalId": 1907688,
    "postingAccountingRuleId": "trip_completed_points_payment_rule",
    "postingReferenceId": "174558200",
    "postingAccountId": "POINTS_ACCOUNT",
    "postingAccountingPeriod": "2019_08",
    "postingAccountingEntry": "DEBIT",
    "postingCurrencyTypeId": "POINTS",
    "postingAmount": 154.11,
    "accountId": "POINTS_ACCOUNT",
    "accountStakeholderId": "OPERATOR",
    "accountCurrencyTypeId": "POINTS",
    "accountTypeId": "CONTROLLER",
    "accountingRuleId": "trip_completed_points_payment_rule",
    "accountingRuleDescription": "Points payment",
    "eventId": "trip_completed",
    "eventReferenceParam": "body.trip.id",
    "createdDate": "2019-08-29T10:03:32.000+0530",
    "modifiedDate": "2019-08-29T10:03:32.000+0530",
    "createdBy": "ENGINE",
    "modifiedBy": "ENGINE",
    "version": "3.12.6",
    "createYear": 2019,
    "routingKey": "_2019"
  }
}

The usual reason for this is that your indexing operations are not creating new documents but updating existing ones, because you're sending documents whose IDs already exist.
Every few hours a new batch of documents is created (hence the jumps in the graphs) because a new set of IDs shows up.
Verify how you're generating your IDs; the solution is hidden in there somewhere. In the sample document the _id (1907688_POINTS_ACCOUNT_DEBIT) appears to be built from the journal id, account, and entry type, so re-processing the same journal entry overwrites the existing document instead of adding a new one.

You might get some more info by running GET _cat/indices?v and checking the "docs.deleted" column: an update operation is essentially a "create new + mark old as deleted" operation, so a growing deleted count is a strong hint that documents are being overwritten rather than added.
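As a quick check (a sketch using a trimmed-down version of the sample document; the extra field in the second call is only there to make the payload different), index the same _id twice and watch the responses: the first call reports the document as created, the second as updated, and docs.count stays at 1 while docs.deleted eventually reflects the replaced copy:

PUT finance_report_fgl_reporting_log/fgl_reporting_logs/1907688_POINTS_ACCOUNT_DEBIT
{
  "journalId": 1907688,
  "postingAccountId": "POINTS_ACCOUNT",
  "postingAccountingEntry": "DEBIT"
}

PUT finance_report_fgl_reporting_log/fgl_reporting_logs/1907688_POINTS_ACCOUNT_DEBIT
{
  "journalId": 1907688,
  "postingAccountId": "POINTS_ACCOUNT",
  "postingAccountingEntry": "DEBIT",
  "postingAmount": 154.11
}

The exact response fields differ slightly between ES versions, but the _version counter incrementing on the second call is the giveaway.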

Related

Does Elasticsearch execute operations in a specific order?

I read that ES is near real-time, and therefore all index/create/update/delete etc. operations are not executed immediately.
Let's say I index 3 documents with the same id, in this order, 1 millisecond apart, and then force a refresh:
{
  "_id": "A",
  "_source": { "text": "a" }
}
{
  "_id": "A",
  "_source": { "text": "b" }
}
{
  "_id": "A",
  "_source": { "text": "c" }
}
Then, if I search for a document with id "A", I will get 1 result, but which one?
When Elasticsearch performs a refresh, does it execute operations sequentially in the order in which they arrive?
In this instance it will come down to which indexing approach you take.
A bulk request does not guarantee that the order you submitted it in is the order in which it will be applied. It might happen to be the same order in (some of) your tests, but Elasticsearch provides no such guarantee.
You can manage this by specifying a version on each document, so that the document with the higher version is always the one that ends up indexed.
Indexing with 3 individual POSTs will be ordered, because you are making 3 separate, sequential requests one after the other: each request has the same _id, so it is routed to the same shard and applied in the order it is received.
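For example (a sketch, not from the question: my-index is a hypothetical index name and the URL assumes a 7.x-style typeless endpoint), external versioning makes the document carrying the highest version number win regardless of arrival order; a request with a lower or equal version is rejected with a conflict:

PUT my-index/_doc/A?version=3&version_type=external
{
  "text": "c"
}

PUT my-index/_doc/A?version=2&version_type=external
{
  "text": "b"
}

Here the second request fails with a version conflict, so "c" remains the indexed value for _id A.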

Use Kafka Connect to update Elasticsearch field on existing document instead of creating new

I have a Kafka setup running with the Elasticsearch connector, and I am successfully indexing new documents into an ES index based on the incoming messages on a particular topic.
However, based on incoming messages on another topic, I need to append data to a field on a specific document in the same index.
Pseudo-schema below:
{
  "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "title": "A title",
  "body": "A body",
  "created_at": 164584548,
  "views": []
}
^ This document is being created fine in ES based on the data in the topic mentioned above.
However, how do I then add items to the views field using messages from another topic? Like so:
article-view topic schema:
{
  "article_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "user_id": 123456,
  "timestamp": 136389734
}
Instead of simply creating a new document in an article-view index (which I don't even want to have), it should append this to the views field of the article document whose _id equals the article_id from the message.
So the end result after one message would be:
{
  "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
  "title": "A title",
  "body": "A body",
  "created_at": 164584548,
  "views": [
    {
      "user_id": 123456,
      "timestamp": 136389734
    }
  ]
}
Using the ES API this is possible with a script, like so:
{
  "script": {
    "lang": "painless",
    "params": {
      "newItems": [{
        "timestamp": 136389734,
        "user_id": 123456
      }]
    },
    "source": "ctx._source.views.addAll(params.newItems)"
  }
}
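Wrapped in a full request (a sketch: articles is a placeholder for the index holding the article documents, the endpoint assumes a 7.x-style _update API, and the upsert branch only covers the case where the article hasn't been indexed yet; the script is not run when the upsert document is inserted), a single such update looks like:

POST articles/_update/6993e0a6-271b-45ef-8cf5-1c0d0f683acc
{
  "script": {
    "lang": "painless",
    "params": {
      "newItems": [{
        "timestamp": 136389734,
        "user_id": 123456
      }]
    },
    "source": "ctx._source.views.addAll(params.newItems)"
  },
  "upsert": {
    "views": [{
      "timestamp": 136389734,
      "user_id": 123456
    }]
  }
}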
I can generate scripts like above dynamically in bulk, and then use the helpers.bulk function in the ES Python library to bulk update documents this way.
Is this possible with Kafka Connect / Elasticsearch? I haven't found any documentation on Confluent's website to explain how to do this.
It seems like a fairly standard requirement and an obvious thing people would need to do with Kafka and a sink connector like the Elasticsearch one.
Thanks!
Edit: Partial updates are possible with write.method=upsert (src)
The Elasticsearch connector doesn't support this. You can update documents in place, but you need to send the full document, not a delta to be appended, which I think is what you're after.
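For the write.method=upsert route mentioned in the edit above, the relevant sink-connector properties look roughly like this (a sketch: the connector name, topic, and URL are placeholders, and the exact property set depends on the connector version you run). Note that this performs a document-level upsert, merging the top-level fields of the incoming record into the target document, not a scripted append to an array:

{
  "name": "article-view-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "article-view",
    "connection.url": "http://localhost:9200",
    "key.ignore": "false",
    "write.method": "upsert"
  }
}

So to end up with an appended views array, you would still need to build the full document yourself (for example in a stream processor) before the record reaches the connector.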

Elasticsearch query to get results irrespective of spaces in search text

I am trying to fetch data from Elasticsearch by matching on a field called name. I have the following two records:
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
and
{
  "_index": "sam_index",
  "_type": "doc",
  "_id": "key1",
  "_version": 1,
  "_score": 2,
  "_source": {
    "name": "Sample Name"
  }
}
When I search using texts like sam, sample, Sa, etc., I am able to fetch both records using a match_phrase_prefix query. The query I tried is:
GET sam_index/doc/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "sample"
    }
  }
}
I am not able to fetch the records when I search with the string samplen. I need the search to return results irrespective of spaces in the text. How can I achieve this in Elasticsearch?
First, you need to understand how Elasticsearch works and why it returns results in one case but not in the other.
ES works on token matches. Documents you index in ES go through an analysis process, and the tokens it generates are stored in an inverted index, which is what searches run against.
When you run a query, that query also produces search tokens: either taken as-is from the query (in the case of a term query) or generated by the analyzer defined on the search field (in the case of a match query). Hence it's very important to understand the internals of your search query.
Also, it's very important to understand the mapping of your index; ES uses the standard analyzer on text fields by default.
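For instance, you can check what the standard analyzer does to your data with the _analyze API; "Sample Name" is split into the two tokens sample and name, and since no stored token starts with samplen, a prefix-style match finds nothing:

POST _analyze
{
  "analyzer": "standard",
  "text": "Sample Name"
}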
You can use the Explain API to understand the internals of a query, such as which search tokens it generates, how documents matched it, and how the score was calculated.
In your case, I created the name field as text with the word-joined analyzer explained in Ignore spaces in Elasticsearch, and I was able to retrieve the document containing sample name when searching for samplen.
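A minimal sketch of such an index (the analyzer and char filter names here are mine, and the mapping uses the 7.x typeless style) strips the spaces at index time with a pattern_replace char filter, so "Sample Name" is stored as the single token samplename and a match_phrase_prefix query for samplen matches it:

PUT sam_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "remove_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": ""
        }
      },
      "analyzer": {
        "word_joined": {
          "type": "custom",
          "char_filter": ["remove_spaces"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "word_joined"
      }
    }
  }
}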
Let us know if you also want to achieve the same and if it solves your issue.

Truncate and Index String values in Elasticsearch 2.3.x

I am running ES 2.3.3. I want to index a non-analyzed String but truncate it to a certain number of characters. The ignore_above property, according to the documentation, will NOT index a field whose value is longer than the given length. I don't want that. I want to take, say, a field that could potentially be 30K characters long and truncate it to 10K, but still be able to filter and sort on the 10K that is retained.
Is this possible in ES 2.3.3, or do I need to do it in Java before indexing the document?
I want to index a non-analyzed String but truncate it to a certain number of characters.
Technically it's possible with the Update API and the upsert option, but, depending on your exact needs, it may not be very handy.
Let's say you want to index this document:
{
  "name": "foofoofoofoo",
  "age": 29
}
but you need to truncate the name field so that it keeps only 5 characters. Using the Update API, you'd have to execute a script:
POST http://localhost:9200/insert/test/1/_update
{
  "script": "ctx._source.name = ctx._source.name.substring(0,5);",
  "scripted_upsert": true,
  "upsert": {
    "name": "foofoofoofoo",
    "age": 29
  }
}
It means that if ES does not find a document with the given id (here id=1), it indexes the document inside the upsert element and then runs the given script. As you can see, it's rather inconvenient if you want automatically generated ids, as you have to provide the id in the URI.
Result:
GET http://localhost:9200/insert/test/1
{
  "_index": "insert",
  "_type": "test",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "foofo",
    "age": 29
  }
}

Elasticsearch find uniqueness of content

I have a system that pulls in articles and stores them in an Elasticsearch index. When a new article is available, I want to determine how unique the article's content is before I publish it on my site, so that I can try to reduce duplicates.
Currently I search for the new article's title against the index using a min_score filter, and if there are 0 results then it can be published:
{
  "index": "articles",
  "type": "article",
  "body": {
    "min_score": 1,
    "query": {
      "multi_match": {
        "query": "[ARTICLE TITLE HERE]",
        "type": "best_fields",
        "fields": [
          "title^3",
          "description"
        ]
      }
    }
  }
}
This is not very accurate, as you can imagine; most articles get published with a fair number of duplicates.
How do you think I could improve this (if at all)?
Well, you need to handle this before indexing the document.
My best solution would be to model the _id on the title, so that if the same title already exists the new document can be discarded (using the _create API), or all duplicate documents can be discarded.
Even better, you can use an upsert so that the existing document is updated with the duplicate's info; for example, you can record that news from this source has also appeared in that other source.
You can see a practical example of the same here.
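A rough sketch of the _create approach (the _id here is a hypothetical hash of the title; any deterministic function of the title will do): the second attempt to index a document with the same _id fails with a version-conflict error, which your pipeline can treat as "duplicate, skip publishing":

PUT articles/article/8e2fa62dfa4b2a09cbb2dbb904cb9bfa/_create
{
  "title": "ARTICLE TITLE HERE",
  "description": "Article description here"
}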
