Elasticsearch: Handling updates out of order

Imagine I have the following document:
{
  "name": "Foo",
  "age": 0
}
We receive events that trigger updates to these fields:
Event 1
{
  "service_timestamp": "2019-09-15T09:00:01",
  "updated_name": "Bar"
}
Event 2
{
  "service_timestamp": "2019-09-15T09:00:02",
  "updated_name": "Foo"
}
Event 2 was published by our service 1 second later than Event 1, so we would expect our document to first update the "name" property to "Bar", then back to "Foo". However, imagine that for whatever reason these events hit out of order (Event 2 THEN Event 1). The final state of the document will be "Bar", which is not the desired behavior.
We need to guarantee that we update our document in the order of the "service_timestamp" field on the event.
One solution we came up with is to have an additional last_updated_time property on each field, like so:
{
  "name": {
    "value": "Foo",
    "last_updated_time": "1970-01-01T00:00:00"
  },
  "age": {
    "value": 0,
    "last_updated_time": "1970-01-01T00:00:00"
  }
}
We would then only update the property if the service_timestamp of the event occurs after the last_updated_time of the property in the document:
{
  "script": {
    "source": "if (ctx._source.name.last_updated_time.compareTo(params.event.service_timestamp) < 0) { ctx._source.name.value = params.event.updated_name; ctx._source.name.last_updated_time = params.event.service_timestamp; }",
    "params": {
      "event": {
        "service_timestamp": "2019-09-15T09:00:01",
        "updated_name": "Bar"
      }
    }
  }
}
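For completeness, such a scripted update could be issued against the _update endpoint like this (a sketch carrying Event 1; the index name and document id are placeholders, and the URL shape assumes a recent, typeless Elasticsearch version):
POST localhost:9200/my-index/_update/<doc id>
{
  "script": {
    "source": "if (ctx._source.name.last_updated_time.compareTo(params.event.service_timestamp) < 0) { ctx._source.name.value = params.event.updated_name; ctx._source.name.last_updated_time = params.event.service_timestamp; }",
    "params": {
      "event": {
        "service_timestamp": "2019-09-15T09:00:01",
        "updated_name": "Bar"
      }
    }
  }
}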
While this would work, it seems costly to perform a read, then a write on each update. Are there any other ways to guarantee events update in the correct order?
Edit 1: Some other things to consider
We cannot assume out-of-order events will occur within a small time window. Imagine the following: we attempt to update a customer's name, but this update fails, so we store the update event in some dead letter queue with the intention of refiring it later. We fix the bug that caused the update to fail, and refire all events in the dead letter queue. If no other events updated the name field while we were fixing the bug, then the event from the dead letter queue should successfully update the property. However, if some events did update the name in the meantime, the event from the dead letter queue should not update the property.

Everything Mousa said is correct with regard to "internal" versioning, which is where you let Elasticsearch handle incrementing the version.
However, Elasticsearch also supports "external" versioning, where you provide a version with each update that gets checked against the current document's version. I believe this would solve your case of events reaching Elasticsearch out of order, and would prevent those issues across any timeframe between events (whether 1 second or 1 week apart, as in your dead letter queue example).
To do this, you'd track the version of documents in your primary datastore (Elasticsearch should never be a primary datastore!), and attach it to indexing requests.
First you'd create your doc with any version number you want; let's start with 1:
POST localhost:9200/my-index/my-type/<doc id>?version=1&version_type=external -d
{
  "name": "Foo",
  "age": 0
}
Then the updates would also get assigned versions from your service and/or primary datastore:
Event 1
POST localhost:9200/my-index/my-type/<doc id>?version=2&version_type=external -d
{
  "service_timestamp": "2019-09-15T09:00:01",
  "updated_name": "Bar"
}
Event 2
POST localhost:9200/my-index/my-type/<doc id>?version=3&version_type=external -d
{
  "service_timestamp": "2019-09-15T09:00:02",
  "updated_name": "Foo"
}
This ensures that even if the updates are applied out of order, the most recent one wins. If Event 1 is applied after Event 2, you'd get a 409 response (a VersionConflictEngineException), and, most importantly, Event 1 would NOT override Event 2.
Instead of incrementing a version integer by 1 each time, you could convert your timestamps to epoch milliseconds and provide that as the version - similar to your idea of a last_updated_time property, but taking advantage of Elasticsearch's built-in versioning. That way, the most recently timestamped update will always "win" and be applied last.
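For example, Event 1's timestamp converts to epoch milliseconds like this (a sketch; it assumes the timestamps are UTC, and the index, type, and document id are placeholders as above):
POST localhost:9200/my-index/my-type/<doc id>?version=1568538001000&version_type=external
{
  "service_timestamp": "2019-09-15T09:00:01",
  "updated_name": "Bar"
}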
I highly recommend you read this short blog post on Elasticsearch versioning - it goes into way more detail than I did here: https://www.elastic.co/blog/elasticsearch-versioning-support.
Happy searching!

Related

Message order with Kafka Connect Elasticsearch Connector

We are having problems enforcing the order in which messages from a Kafka topic are sent to Elasticsearch using the Kafka Connect Elasticsearch Connector. In the topic the messages are in the right order with the correct offsets, but if there are two messages with the same ID created in quick succession, they are intermittently sent to Elasticsearch in the wrong order. This causes Elasticsearch to have the data from the second last message, not from the last message. If we add some artificial delay of a second or two between the two messages in the topic, the problem disappears.
The documentation here states:
Document-level update ordering is ensured by using the partition-level
Kafka offset as the document version, and using version_mode=external.
However I can't find any documentation anywhere about this version_mode setting, and whether it's something we need to set ourselves somewhere.
In the log files from the Kafka Connect system we can see the two messages (for the same ID) being processed in the wrong order, a few milliseconds apart. It might be significant that it looks like these are processed in different threads. Also note that there is only one partition for this topic, so all messages are in the same partition.
Below is the log snippet, slightly edited for clarity. The messages in the Kafka topic are populated by Debezium, which I don't think is relevant to the problem, but which handily happens to include a timestamp value. This shows that the messages are processed in the wrong order, even though they're in the correct order in the Kafka topic:
[2019-01-17 09:10:05,671] DEBUG http-outgoing-1 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE SECOND UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER SECOND UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716205205
}
" (org.apache.http.wire)
...
[2019-01-17 09:10:05,696] DEBUG http-outgoing-2 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE FIRST UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER FIRST UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716204190
}
" (org.apache.http.wire)
Does anyone know how to force this connector to maintain message order for a given document ID when sending the messages to Elasticsearch?
The problem was that our Elasticsearch connector had the key.ignore configuration set to true.
We spotted this line in the GitHub source for the connector (in DataConverter.java):
final Long version = ignoreKey ? null : record.kafkaOffset();
This meant that, with key.ignore=true, the indexing operations that were being generated and sent to Elasticsearch were effectively "versionless" ... basically, the last set of data that Elasticsearch received for a document would replace any previous data, even if it was "old data".
From looking at the log files, the connector seems to have several consumer threads reading the source topic, then passing the transformed messages to Elasticsearch, but the order that they are passed to Elasticsearch is not necessarily the same as the topic order.
Using key.ignore=false, each Elasticsearch message now contains a version value equal to the Kafka record offset, and Elasticsearch refuses to update the index data for a document if it has already received data for a later "version".
That wasn't the only thing that fixed this. We still had to apply a transform to the Debezium message from the Kafka topic to get the key into a plain text format that Elasticsearch was happy with:
"transforms": "ExtractKey",
"transforms.ExtractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.ExtractKey.field": "id"

Protocol buffers Fieldmask on Collections within resource

If I want to update the "amount" field within a particular element inside the "f_units" collection in the resource below (a protocol buffer), what will the FieldMask look like? Does the field mask operate on array indices for collections?
{
  "f_sel": {
    "f_units": [
      {
        "id": "1",
        "amount": {
          "coefficient": 1000,
          "exponent": -2
        }
      },
      {
        "id": "2",
        "amount": {
          "coefficient": 2000,
          "exponent": -2
        }
      }
    ]
  }
}
Will it be "f_sel.f_units.0.amount"? How can I update the amount using a FieldMask?
As far as I know, there is no way to replace individual elements of a repeated field with an index in a FieldMask.
Instead, you'd update the amount field for the element within f_units you wish to change and set the FieldMask to
"f_sel.f_units"
It would be slightly more efficient to only have to send a delta to the original list, but it would be hard to prevent bugs. For example, what if the proto was modified in the meantime and the specified index (presuming there was a way to specify one) for the repeated field was no longer in range?
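For example, an update request might look something like this (a sketch; the "resource" wrapper and the Update RPC shape are assumptions, and note that the whole new list is sent, not a delta - here with id 1's amount changed):
{
  "resource": {
    "f_sel": {
      "f_units": [
        { "id": "1", "amount": { "coefficient": 1500, "exponent": -2 } },
        { "id": "2", "amount": { "coefficient": 2000, "exponent": -2 } }
      ]
    }
  },
  "update_mask": "f_sel.f_units"
}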
As an aside, Google does propose the concept of MergeOptions which defines semantics for how repeated fields are to be handled when merging. Currently, it appears they intend for you either to replace the repeated field in its entirety or append to the end of the destination field. Both of these merging strategies avoid the aforementioned bug that could be caused by specifying an invalid index.

Elasticsearch: auto increment integer field across two index

I need an auto-increment integer field across two indices.
Can Elasticsearch do this automatically, like MySQL's auto-increment field in a table?
E.g., when putting documents into two different indices:
POST /my_index_1/blogpost/
{
  "title": "Foo Bar"
}
POST /my_index_2/blogpost/
{
  "title": "Baz quux"
}
On retrieval, I want:
GET /my_index_*/blogpost/
{
  "uid": 1,
  "title": "Foo Bar"
},
{
  "uid": 2,
  "title": "Baz quux"
}
No, ES does not have any auto-increment feature. Since it is a distributed system, figuring out the correct value for the counter is non-trivial, especially since (bulk) indexing tends to be heavily concurrent: you can typically max out the CPUs on all nodes if you throw enough documents at it.
So, your best option is to do this outside of ES, before you send the documents to ES. Or even better, don't do this at all. If you need some kind of insertion order, a better option is to simply use a timestamp; they are actually stored as numbers internally. You might still get duplicates, of course, if two documents get indexed in the same millisecond. A trick we've used to work around that is to offset documents indexed at the same time by 1 ms, to ensure we keep the insertion order (sketched below).
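A minimal sketch of that trick in Node.js (the function name is made up for illustration):
// Produce a strictly increasing timestamp to index alongside each document.
// If two documents arrive within the same millisecond, the second one is
// bumped by 1 ms so insertion order is preserved.
var lastTs = 0;
function nextInsertionTimestamp() {
  var ts = Date.now();
  if (ts <= lastTs) {
    ts = lastTs + 1; // collision: step past the previous timestamp
  }
  lastTs = ts;
  return ts;
}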

ES keeps returning every document

I recently inherited an ES instance, and I made sure to read an entire book on ES cover to cover before posting this; however, I'm afraid I'm unable to get even simple examples to work.
I have an index on our staging environment which returns every document no matter what, while a similar index on our QA environment works as I would expect. For example, I am running the following query against http://staging:9200/people_alias/_search?explain:
{ "query" :
{ "filtered" :
{ "query" : { "match_all" : {} },
"filter" : { "term" : { "_id" : "34414405382" } } } } }
What I noticed on this staging environment is that the score of every document is 1, and EVERY document in the index is returned no matter what value I specify. Using ?explain, I see the following:
_explanation: {
  value: 1
  description: ConstantScore(*:*), product of:
  details: [
    { value: 1, description: boost },
    { value: 1, description: queryNorm }
  ]
}
On my QA environment, which correctly returns only one record I observe for ?explain:
_explanation: {
  value: 1
  description: ConstantScore(cache(_uid:person#34414405382)), product of:
  details: [
    { value: 1, description: boost },
    { value: 1, description: queryNorm }
  ]
}
The mappings are almost identical on both indices - the only difference is that I removed the manual field-level boost values on some fields, since I read that field-level boosting is not recommended in favor of query-time boosting. However, that should not affect the behavior of filtering on the document ID (right?)
Is there any clue I can glean from the differences in the explain output, or should I post the index mappings? Are there any server-level settings I should consider checking? It doesn't matter what query I use on Staging; I can use match queries and exact-match lookups on other fields, and Staging just keeps returning every result with a score of 1.0.
I feel like I'm doing something glaringly and obviously wrong on my Staging environment. Could someone please explain the presence of ConstantScore, boost and queryNorm? From looking at examples in other literature, I thought I would see things like term frequency, etc.
EDIT: I am issuing the query from the Elasticsearch Head plugin.
In your Head plugin, you need to use POST in order to send the query in the payload; otherwise, the _search endpoint is hit without any constraints.
In your browser, if you open the developer tools and look at the networking tab, you'll see that nothing is sent in the payload when using GET.
It's a common mistake people often make. Some HTTP clients (like curl) do send a payload with GET, but others (like Head) don't. Sense will warn you if you use GET instead of POST when sending a payload, and will automatically force POST instead of GET.
So to sum it up, it's best to always use POST whenever you wish to send some payload to your servers, so you don't have to care about the behavior of the HTTP client you're using.
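To make that concrete, the same query can be sent explicitly as a POST (a sketch using curl against the staging host from the question):
curl -XPOST 'http://staging:9200/people_alias/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "_id": "34414405382" } }
    }
  }
}'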

DynamoDB: What's the best way to structure and query a sorted list of timestamped logs?

In the interest of better understanding Amazon's DynamoDB, Lambda functions and IAM roles (I'll stick to DynamoDB in this question), I'm setting up a Linux device to listen for new DynamoDB items and audibly read out updates that are being added by other functions at a regular interval. My goal is to query or scan items, returning those items in ascending order since a specific timestamp (the last time the device checked).
Here's the item structure I'm using so far:
{
  "id": {
    "S": "1eb4520d44715b6daa5f9d907fe43aab" // md5sum of "time"
  },
  "message": {
    "S": "I'm creating the audible reporting log now."
  },
  "status": {
    "S": "working"
  },
  "time": {
    "S": "1452297505" // timestamp: should probably add milliseconds for sake of unique "id"
  }
}
"id" is the partition key. "time" is the sort key. Looking at this now, I'm guessing I should probably make "time" a number, not a string...
Query or scan? Query seems like the correct option for sorting, but it requires a specific partition key in the query (at least in the AWS website query tool), so perhaps I'm adding those incorrectly. Scan loads all items, and I'm guessing that sorting is not automatic or even an option (at least not in the AWS website query tool). I really only want to load items greater than a timestamp value, sorted.
Where am I off in my thinking? I appreciate the assistance in advance.
UPDATE
After further experimentation with AWS-CLI and DynamoDB, I ended up using a slightly different solution. Since this is a small scale "hello world" type of project, all update items are added to the same table with a single partition key, "SF Reporter", for now. This could scale if I decide to start monitoring additional "reporter"/service updates with separate queries and/or devices.
{
  "datetime": { // sort key
    "S": "2016-01-11T05:15:02"
  },
  "message": {
    "S": "It is all good."
  },
  "reporter": { // primary partition key
    "S": "SF Reporter"
  },
  "status": {
    "S": "ok"
  }
}
The JSON query itself looks something like this (abbreviated node.js example):
var AWS = require("aws-sdk");
AWS.config.credentials = new AWS.SharedIniFileCredentials({ profile: 'default' });
AWS.config.update({"region": "us-west-2"});
var docClient = new AWS.DynamoDB.DocumentClient();

// Define the callback before passing it to query(); with a var function
// expression, only the declaration is hoisted, so referencing it earlier
// would pass undefined.
var onUpdatesReceived = function(err, data) {
  if (err) {
    console.log(err, err.stack);
  } else {
    console.log(data);
  }
};

var params = {
  TableName: "spoken_reports",
  KeyConditionExpression: "#reporter = :reporter and #datetime >= :datetime",
  ExpressionAttributeNames: {
    "#reporter": "reporter",
    "#datetime": "datetime"
  },
  ExpressionAttributeValues: {
    ":reporter": "SF Reporter",
    ":datetime": "2016-01-11T05:15:02"
  }
};
docClient.query(params, onUpdatesReceived);
The query gets the latest updates, sorted by a string timestamp (ascending order by default in this example). This allows for some scaling, as I can have multiple devices checking the same table for the latest updates. I would create a scheduled query/function to clear out old updates once in a while to keep things light.
Dead simple way:
You should set up a global secondary index with "isNew" as its hash key and the timestamp as its range key.
On creation of an entry, set isNew to a UUID or something similar. This will make the item project into the index.
When you need to check for data, scan the secondary index - it will contain only the unread results. Then, updateItem each item you have read to delete its isNew attribute; the item drops out of the secondary index, so it is not read again.
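A minimal sketch of that "mark as read" step with the DocumentClient (the table and key names are carried over from the example above, and the sparse index on isNew is assumed to already exist):
var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();

// Removing the isNew attribute drops the item out of the sparse
// global secondary index, so subsequent scans won't see it again.
var params = {
  TableName: "spoken_reports",
  Key: {
    "reporter": "SF Reporter",
    "datetime": "2016-01-11T05:15:02"
  },
  UpdateExpression: "REMOVE isNew"
};
docClient.update(params, function(err, data) {
  if (err) console.log(err, err.stack);
  else console.log("marked as read");
});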
If you stick with this table design, scanning the entire table is the only option you have, for the reasons you've mentioned: for querying, you need a partition key, which is something your devices have no way of knowing beforehand.
There is another solution that comes to my mind:
Let's say your current table is called T1. Create another table, T2, that has deviceID as partition key and timestamp as sort key.
You define an AWS Lambda function on T1's stream that will, on any update, push that row into T2 as well, one copy per device.
Now whenever any of your devices wakes up, it queries (not scans) T2 with its own device ID, processes all the rows, and deletes them.
In other words, T2 will always have all the rows that a given device is yet to process.
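A rough sketch of that stream-handling Lambda in Node.js (the device list and the T2 table name are placeholders, and it assumes T1's stream is configured to emit NEW_IMAGE):
var AWS = require("aws-sdk");
var docClient = new AWS.DynamoDB.DocumentClient();
var DEVICE_IDS = ["device-1", "device-2"]; // hypothetical device registry

exports.handler = function(event, context, callback) {
  // Fan out each change from T1's stream to T2, one row per device.
  var writes = [];
  event.Records.forEach(function(record) {
    if (!record.dynamodb.NewImage) return; // skip REMOVE events
    // Stream images arrive in DynamoDB's AttributeValue format;
    // unmarshall converts them to plain JavaScript objects.
    var item = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    DEVICE_IDS.forEach(function(deviceID) {
      writes.push(docClient.put({
        TableName: "T2",
        Item: {
          deviceID: deviceID,      // partition key
          datetime: item.datetime, // sort key
          message: item.message,
          status: item.status
        }
      }).promise());
    });
  });
  Promise.all(writes)
    .then(function() { callback(null, "done"); })
    .catch(callback);
};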
