We are having problems enforcing the order in which messages from a Kafka topic are sent to Elasticsearch using the Kafka Connect Elasticsearch Connector. In the topic the messages are in the right order with the correct offsets, but if there are two messages with the same ID created in quick succession, they are intermittently sent to Elasticsearch in the wrong order. This causes Elasticsearch to have the data from the second last message, not from the last message. If we add some artificial delay of a second or two between the two messages in the topic, the problem disappears.
The documentation here states:
Document-level update ordering is ensured by using the partition-level
Kafka offset as the document version, and using version_mode=external.
However, I can't find any documentation about this version_mode setting, or whether it's something we need to set ourselves somewhere.
In the log files from the Kafka Connect system we can see the two messages (for the same ID) being processed in the wrong order, a few milliseconds apart. It might be significant that it looks like these are processed in different threads. Also note that there is only one partition for this topic, so all messages are in the same partition.
Below is the log snippet, slightly edited for clarity. The messages in the Kafka topic are populated by Debezium, which I don't think is relevant to the problem, but it handily happens to include a timestamp value. This shows that the messages are processed in the wrong order (even though they're in the correct order in the Kafka topic):
[2019-01-17 09:10:05,671] DEBUG http-outgoing-1 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE SECOND UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER SECOND UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716205205
}
" (org.apache.http.wire)
...
[2019-01-17 09:10:05,696] DEBUG http-outgoing-2 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE FIRST UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER FIRST UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716204190
}
" (org.apache.http.wire)
Does anyone know how to force this connector to maintain message order for a given document ID when sending the messages to Elasticsearch?
The problem was that our Elasticsearch connector had the key.ignore configuration set to true.
We spotted this line in the GitHub source for the connector (in DataConverter.java):
final Long version = ignoreKey ? null : record.kafkaOffset();
This meant that, with key.ignore=true, the indexing operations that were being generated and sent to Elasticsearch were effectively "versionless" ... basically, the last set of data that Elasticsearch received for a document would replace any previous data, even if it was "old data".
From looking at the log files, the connector seems to have several consumer threads reading the source topic, then passing the transformed messages to Elasticsearch, but the order that they are passed to Elasticsearch is not necessarily the same as the topic order.
Using key.ignore=false, each Elasticsearch message now contains a version value equal to the Kafka record offset, and Elasticsearch refuses to update the index data for a document if it has already received data for a later "version".
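To illustrate the effect (a sketch, not a capture from our system; the index name, type and offset are placeholders, and the document ID is the one from the logs above), a bulk index action with an external version looks something like this:
POST /_bulk
{ "index": { "_index": "my-index", "_type": "_doc", "_id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1", "version": 1234, "version_type": "external" } }
{ "id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1", "name": "example document body" }
If Elasticsearch has already indexed that document with a higher version (i.e. a later offset), the action fails with a version conflict instead of overwriting the newer data.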
Setting key.ignore=false wasn't the only change needed to fix this. We still had to apply a transform to the Debezium message from the Kafka topic to get the key into a plain-text format that Elasticsearch was happy with:
"transforms": "ExtractKey",
"transforms.ExtractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.ExtractKey.field": "id"
I have two Elasticsearch data nodes, Slave and Master.
M and S can communicate with each other; however, for security reasons S cannot push data to M as it receives it. M must request data from S, and (assuming no other requirements on what data S exports) when this happens M receives the requested data from S.
S's data is then incorporated into M's data.
Is this behaviour achievable with Elasticsearch? Unless I am mistaken, neither replication nor snapshotting achieves this behaviour, and while I am aware that I could use S's REST API to receive this data on M before purging the copied data from S, that solution seems clunky and error-prone.
Is there an elegant solution to achieve this architecture?
It is true that Cross Cluster Replication (CCR) is a potential solution for this, but that solution requires the most expensive version of Elasticsearch, and there is a free alternative.
The elasticsearch input and output plugins for Logstash work for this, albeit with some tweaking to get them to behave exactly as you want.
Below is a crude example which queries one Elasticsearch node for data and exports it to another. This does mean that you need a Logstash instance between the Slave and Master nodes to handle this behaviour.
input {
  elasticsearch {
    docinfo => true              # Necessary to get metadata info
    hosts => "192.168.0.1"       # Slave (source) elasticsearch instance
    query => '{ "query": { "query_string": { "query": "*" } } }' # Query to return documents; this example returns all data, which is bad if you combine it with the schedule below
    schedule => "* * * * *"      # Run periodically; this example runs every minute
  }
}
output {
  elasticsearch {
    hosts => "192.168.0.2:9200"  # Master (destination) elasticsearch instance
    index => "replica.%{[@metadata][_index]}"
    document_id => "%{[@metadata][_id]}"
  }
}
Imagine I have the following document:
{
  "name": "Foo",
  "age": 0
}
We receive events that trigger updates to these fields:
Event 1
{
"service_timestamp": "2019-09-15T09:00:01",
"updated_name": "Bar"
}
Event 2
{
"service_timestamp": "2019-09-15T09:00:02",
"updated_name": "Foo"
}
Event 2 was published by our service 1 second later than Event 1, so we would expect our document to first update the "name" property to "Bar", then back to "Foo". However, imagine that for whatever reason these events hit out of order (Event 2 THEN Event 1). The final state of the document will be "Bar", which is not the desired behavior.
We need to guarantee that we update our document in the order of the "service_timestamp" field on the event.
One solution we came up with is to have an additional last_updated_property on each field like so:
{
  "name": {
    "value": "Foo",
    "last_updated_time": "1970-01-01T00:00:00"
  },
  "age": {
    "value": 0,
    "last_updated_time": "1970-01-01T00:00:00"
  }
}
We would then only update the property if the service_timestamp of the event occurs after the last_updated_time of the property in the document, passing the event values in as script params:
{
  "script": {
    "source": "if (params.service_timestamp.compareTo(ctx._source.name.last_updated_time) > 0) { ctx._source.name.value = params.updated_name; ctx._source.name.last_updated_time = params.service_timestamp; }",
    "params": {
      "updated_name": "Bar",
      "service_timestamp": "2019-09-15T09:00:01"
    }
  }
}
While this would work, it seems costly to perform a read, then a write on each update. Are there any other ways to guarantee events update in the correct order?
Edit 1: Some other things to consider
We cannot assume out-of-order events will occur in a small time window. Imagine the following: we attempt to update a customer's name, but this update fails, so we store the update event in some dead letter queue with the intention of refiring it later. We fix the bug that caused the update to fail, and refire all events in the dead letter queue. If no updates occurred that update the name field during the time we were fixing this bug, then the event in the dead letter queue should successfully update the property. However, if some events did update the name, the event in the dead letter queue should not update the property.
Everything Mousa said is correct wrt "Internal" versioning, which is where you let Elasticsearch handle incrementing the version.
However, Elasticsearch also supports "External" versioning, where you can provide a version with each update that gets checked against the current doc's version. I believe this would solve your case of events indexing to ES "out of order", and would prevent those issues across any timeframe of events (whether 1 second or 1 week apart, as in your dead letter queue example).
To do this, you'd track the version of documents in your primary datastore (Elasticsearch should never be a primary datastore!), and attach it to indexing requests.
First you'd create your doc with any version number you want, let's start with 1:
POST localhost:9200/my-index/my-type/<doc id>?version=1&version_type=external -d
{
  "name": "Foo",
  "age": 0
}
Then the updates would also get assigned versions from your service and/or primary datastore.
Event 1
POST localhost:9200/my-index/my-type/<doc id>?version=2&version_type=external -d
{
"service_timestamp": "2019-09-15T09:00:01",
"updated_name": "Bar"
}
Event 2
POST localhost:9200/my-index/my-type/<doc id>?version=3&version_type=external -d
{
"service_timestamp": "2019-09-15T09:00:02",
"updated_name": "Foo"
}
This ensures that even if the updates are applied out of order the most recent one wins. If Event 1 is applied after event 2, you'd get a 409 error code that represents a VersionConflictEngineException, and most importantly Event 1 would NOT override event 2.
Instead of incrementing a version int by 1 each time, you could choose to convert your timestamps to epoch millis and provide that as the version - similar to your idea of creating a last_updated_property field, but taking advantage of Elasticsearch's built in versioning. That way, the most recently timestamped update will always "win" and be applied last.
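For example, converting Event 2's timestamp to epoch millis (1568538002000, assuming the timestamps are UTC) and using that directly as the version:
POST localhost:9200/my-index/my-type/<doc id>?version=1568538002000&version_type=external -d
{
  "service_timestamp": "2019-09-15T09:00:02",
  "updated_name": "Foo"
}
If Event 1 then arrives late with version=1568538001000, it gets the same 409 conflict and is simply ignored.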
I highly recommend you read this short blog post on Elasticsearch versioning - it goes into way more detail than I did here: https://www.elastic.co/blog/elasticsearch-versioning-support.
Happy searching!
I am trying to create a way to navigate my log files and the main features I need are:
search for strings inside a log file (returning the line numbers of occurrences).
pagination from line x to line y.
Now I was checking Logstash, and it looked great for my first feature (searching), but not so much for the second one. I was under the impression that I could somehow index the file line number along with the log information of each record, but I can't seem to find a way.
Is there a Logstash filter or a Filebeat processor that can do this? I can't make it work.
I was thinking that maybe I could have all my processes log into a database with processed information, but that's also pretty much impossible (or very difficult), because the log handler also doesn't know what the current log line is.
In the end, all I could do to serve paginated access to my log file (through a service) would be to actually open the file and navigate to a specific line, which is not very optimal, as the file could be very big, and I am already indexing it into Elasticsearch (with Logstash).
My current configuration is very simple:
Filebeat
filebeat.prospectors:
  - type: log
    paths:
      - /path/of/logs/*.log

output.logstash:
  hosts: ["localhost:5044"]
Logstash
input {
beats {
port => "5044"
}
}
output {
elasticsearch {
hosts => [ "localhost:9200" ]
}
}
Right now for example I am getting an item like:
{
  "beat": {
    "hostname": "my.local",
    "name": "my.local",
    "version": "6.2.2"
  },
  "@timestamp": "2018-02-26T04:25:16.832Z",
  "host": "my.local",
  "tags": [
    "beats_input_codec_plain_applied"
  ],
  "prospector": {
    "type": "log"
  },
  "@version": "1",
  "message": "2018-02-25 22:37:55 [mylibrary] INFO: this is an example log line",
  "source": "/path/of/logs/example.log",
  "offset": 1124
}
If I could somehow include a field like line_number: 1 in that item, it would be great, as I could use Elasticsearch filters to actually navigate through the whole logs.
If you guys have ideas for different ways to store (and navigate) my logs, please also let me know.
Are the log files generated by you, or can you change the log structure? If so, you can add a counter as a prefix and filter it out with Logstash.
For example for
12345 2018-02-25 22:37:55 [mylibrary] INFO: this is an example log line
your filter must look like this:
filter {
  grok {
    match => { "message" => "%{INT:count} %{GREEDYDATA:message}" }
    overwrite => ["message"]
  }
}
New field "count" will be created. You can then possibly use it for your purposes.
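For instance, here is a rough sketch (assuming you cast count to a number, e.g. with %{INT:count:int} or a mutate convert, so that range queries work on it): pagination "from line x to line y" for one file then becomes a range filter:
POST /logstash-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "source": "/path/of/logs/example.log" } },
        { "range": { "count": { "gte": 100, "lte": 120 } } }
      ]
    }
  },
  "sort": [
    { "count": { "order": "asc" } }
  ]
}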
At this moment, I don't think there are any solutions here. Logstash, Beats, Kibana all have the idea of events over time and that's basically the way things are ordered. Line numbers are more of a text editor kind of functionality.
To a certain degree Kibana can show you the events in a file. It won't give you a page by page kind of list where you can actually click on a page number, but using time frames you could theoretically look at an entire file.
There are similar requests (enhancements) for Beats and Logstash.
First let me give what is probably the main reason why Filebeat doesn't already have a line number field. When Filebeat resumes reading a file (like after a restart) it does an fseek to resume from the last recorded offset. If it had to report the line numbers it would either need to store this state in its registry or re-read the file and count newlines up to the offset.
If you want to offer a service that allows you to paginate through the logs that are backed by Elasticsearch you can use the scroll API with a query for the file. You must sort the results by @timestamp and then by offset. Your service would use a scroll query to get the first page of results.
POST /filebeat-*/_search?scroll=1m
{
"size": 10,
"query": {
"match": {
"source": "/var/log/messages"
}
},
"sort": [
{
"#timestamp": {
"order": "asc"
}
},
{
"offset": "asc"
}
]
}
Then to get all future pages you use the scroll_id returned from the first query.
POST /_search/scroll
{
"scroll" : "1m",
"scroll_id" : "DnF1ZXJ5VGhlbkZldGNoBwAAAAAAPXDOFk12OEYw="
}
This will give you all log data for a given file name even tracking it across rotations. If line numbers are critical you could produce them synthetically by counting events starting with the first event that has offset == 0, but I avoid this because it's very error prone especially if you ever add any filtering or multiline grouping.
I'm using Logstash to analyze my web server's access logs. At this time, it works pretty well. I used a configuration file that produces this kind of data:
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "end"
},
"object_id": "boreal:12345"
...
}
These records are stored in the "logstash-2016.10.02" index (one index per day).
I also created another index named "publications". This index contains the publication metadata.
A JSON record looks like this:
{
"type": "publication",
"id": "boreal:12345",
"sm_title": "The title of the publication",
"sm_type": "thesis",
"sm_creator": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
"sm_departement": [
"UCL/CORE - Center for Operations Research and Econometrics",
],
"sm_date": "2001",
"ss_state": "A"
...
}
And I would like to create a query like "give me all access for 'Smith, John' publications".
As all my data is not in the same index, I can't use a parent-child relation (am I right?)
I read this on a forum, but it's an old post:
By limiting itself to parent/child type relationships elasticsearch makes life
easier for itself: a child is always indexed in the same shard as its parent,
so has_child doesn’t have to do awkward cross shard operations.
Using Logstash, I can't place all the data in a single index named logstash. Per month I have more than 1M accesses... In 1 year, I will have more than 15M records in one index... and I need to store the web access data for a minimum of 5 years (1M * 12 * 15 = 180M).
I don't think it's a good idea to deal with a single index containing that many records (if I'm wrong, please let me know).
Does a solution to my problem exist? I can't find an elegant one.
The only approach I have at this time, in my Python script, is: a first query to collect all the IDs of 'Smith, John' publications, then a loop over each publication to get all web server accesses for that specific publication.
So if "Smith, John" has 321 publications, I send 321 HTTP requests to ES, and the response time is not acceptable (more than 7 seconds; not so bad when you know the number of records in ES, but not acceptable for the final user).
Thanks for your help; sorry for my English.
Renaud
An idea would be to use the elasticsearch logstash filter in order to get a given publication while an access log document is being processed by Logstash.
That filter would retrieve the sm_creator field in the publications index having the same object_id and enrich the access log with whatever fields from the publication document you need. Thereafter, you can simply query the logstash-* index.
elasticsearch {
  hosts => ["localhost:9200"]
  index => "publications"
  query => "id:%{object_id}"
  fields => {"sm_creator" => "author"}
}
As a result of this, your access log document will look like the one below afterwards, and for "give me all access for 'Smith, John' publications" you can simply query the author field (renamed from sm_creator above) in all your logstash indices; see the query sketch after the document below.
{
"type": "apache_access",
"clientip": "192.243.xxx.xxx",
"verb": "GET",
"request": "/publications/boreal:12345?direction=rtl&language=en",
...
"url_path": "/publications/boreal:12345",
"url_params": {
"direction": "rtl",
"language": "end"
},
"object_id": "boreal:12345",
"author": [
"Smith, John",
"Dupont, Albert",
"Reegan, Ronald"
],
...
}
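A sketch of that final query (the author field comes from the rename above; depending on your mapping you may prefer a keyword/not_analyzed sub-field to avoid matching other people named Smith or John):
POST /logstash-*/_search
{
  "query": {
    "match_phrase": {
      "author": "Smith, John"
    }
  }
}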
I recently inherited an ES instance, and I made sure to read an entire book on ES cover to cover before posting this; however, I'm afraid I'm unable to get even simple examples to work.
I have an index on our staging environment which exhibits behavior where every document is returned no matter what; I have a similar index on our QA environment which works as I would expect. For example, I am running the following query against http://staging:9200/people_alias/_search?explain:
{ "query" :
{ "filtered" :
{ "query" : { "match_all" : {} },
"filter" : { "term" : { "_id" : "34414405382" } } } } }
What I noticed on this staging environment is that the score of every document is 1 and it is returning EVERY document in my index no matter what value I specify. Using ?explain I see the following:
_explanation: {
  value: 1,
  description: ConstantScore(*:*), product of:
  details: [
    { value: 1, description: boost },
    { value: 1, description: queryNorm }
  ]
}
On my QA environment, which correctly returns only one record, I observe with ?explain:
_explanation: {
  value: 1,
  description: ConstantScore(cache(_uid:person#34414405382)), product of:
  details: [
    { value: 1, description: boost },
    { value: 1, description: queryNorm }
  ]
}
The mappings are almost identical on both indices. The only difference is that I removed the manual field-level boost values on some fields, as I read that field-level boosting is not recommended in favor of query-time boosting; however, this should not affect the behavior of filtering on the document ID (right?)
Is there any clue I can glean from the differences in the explain output, or should I post the index mappings? Are there any server-level settings I should consider checking? It doesn't matter what query I use on Staging: I can use match queries and exact-match lookups on other fields, and Staging just keeps returning every result with a score of 1.0.
I feel like I'm doing something very glaringly and obviously wrong on my Staging environment. Could someone please explain the presence of ConstantScore, boost and queryNorm? I thought from looking at examples in other literature I would see things like term frequency etc.
EDIT: I am issuing the query from the Elasticsearch Head plugin.
In your HEAD plugin, you need to use POST in order to send the query in the payload, otherwise the _search endpoint is hit without any constraints.
In your browser, if you open the developer tools and look at the networking tab, you'll see that nothing is sent in the payload when using GET.
It's a common mistake people often do. Some HTTP clients (like curl) do send a payload using GET, but others (like /head/) don't. Sense will warn you if you use GET instead of POST when sending a payload and will automatically force POST instead of GET.
So to sum it up, it's best to always use POST whenever you wish to send some payload to your servers, so you don't have to care about the behavior of the HTTP client you're using.
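For example, the query from the question sent explicitly as a POST (same body, nothing else changed) should then return only the matching document:
POST http://staging:9200/people_alias/_search?explain
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": { "term": { "_id": "34414405382" } }
    }
  }
}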