How to check properties before update in elasticsearch? - go

I've already read the official documentation and found no way to do this.
My data goes into ES from Kafka, and messages can sometimes arrive out of order. Previously, each Kafka message was parsed and used to directly insert or update the ES document with a specific ID. To avoid older data overwriting newer data, I have to check whether the document with that specific ID already exists and whether some of its properties meet certain conditions. Only then do I perform the UPDATE (or INSERT).
What I'm doing now is 'search before update'.
Before updating a doc, I search ES for the specific ID (included in the Kafka message). Then I check whether the doc meets the conditions (for example, whether its update_time is older). Finally I update the doc, and I set refresh to true so the index is updated instantly.
What I'm worried about:
This seems transactional.
If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?

If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
That is a possibility, since indexes are refreshed once every second by default. Reducing this interval is neither recommended nor guaranteed to give you the desired result, since Elasticsearch is simply not designed for this.
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
You can use a script if the number of fields being updated is very limited. Personally I have found scripts to be best suited to single-field updates, and even then only for corner cases; they should not be used as a general practice. Any more than that and you run into the same risk as with stored procedures in the RDBMS world: it makes data management volatile overall and the system harder to maintain and extend in the long run.
Your use case is best suited to the optimistic locking support available in Elasticsearch out of the box. Take a look at Elasticsearch versioning support for the full details.
You can very well use the built-in doc version if concurrency is the only problem you need to solve. If, however, you need more than concurrency (out-of-order message delivery and the corresponding ES updates), then you should use an application/domain-specific field, since the built-in version won't work as-is.
You can use any application-specific (numeric) field as a version field and use it for optimistic locking during document updates. If you take this approach, pay special attention to all insert, update, and delete operations on that index. Quoting as-is from the versioning documentation: when using external versioning, make sure you always add the current version (and version_type) to any index, update or delete calls. If you forget, Elasticsearch will use its internal system to process that request, which will cause the version to be incremented erroneously.
I recommend you evaluate the built-in version first and use it if it fulfills your needs, as it will make the overall design much simpler. Consider an application-specific version as the second option if the built-in version does not meet your requirements.
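If you do end up with an application-specific version, the write is just an extra pair of query parameters on the index call. Below is a minimal Go sketch over the plain REST API; the index name, field names, and the choice of a millisecond update_time as the version are illustrative assumptions. A 409 response means a newer version is already stored, so the stale Kafka message can simply be dropped.

```go
// Sketch: indexing with external versioning so older data cannot overwrite newer data.
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

// indexWithExternalVersion writes doc only if version is greater than the
// version already stored for this ID; otherwise Elasticsearch returns 409.
func indexWithExternalVersion(es, index, id string, version int64, doc []byte) error {
	url := fmt.Sprintf("%s/%s/_doc/%s?version=%d&version_type=external", es, index, id, version)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(doc))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusConflict {
		// A newer version is already indexed: drop the stale message.
		return nil
	}
	if resp.StatusCode >= 300 {
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("index failed: %s: %s", resp.Status, body)
	}
	return nil
}

func main() {
	doc := []byte(`{"title":"hello","update_time":1718000000000}`)
	if err := indexWithExternalVersion("http://localhost:9200", "posts", "42", 1718000000000, doc); err != nil {
		log.Fatal(err)
	}
}
```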

If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
Ad 1. It is possible to save data in Elasticsearch and, shortly afterwards, receive a stale result (before the index has been refreshed).
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
Ad 2. If you process Kafka messages in several threads, it is best to use business data (e.g. some business ID) as the partition key in Kafka, so that all messages for the same entity are processed in order. Use Kafka itself to spread messages across many consumer threads; don't consume with a single consumer and fan out to multiple threads afterwards.
It seems best to ensure the data is processed in order and then drop the check in Elasticsearch, since it is not guaranteed to give valid results.
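To illustrate the partition-key advice, here is a rough Go sketch using the Shopify/sarama client; the topic name, the document ID used as the key, and the payload are made up for the example. The default hash partitioner sends equal keys to the same partition, which is what gives per-document ordering.

```go
// Sketch: keying Kafka messages by a business ID so all updates for the same
// entity land on the same partition and are therefore consumed in order.
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true // required by SyncProducer

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	docID := "doc-42" // business identifier of the ES document (assumed field)
	msg := &sarama.ProducerMessage{
		Topic: "es-updates",
		// The default hash partitioner maps equal keys to the same partition,
		// so updates for one document are totally ordered within it.
		Key:   sarama.StringEncoder(docID),
		Value: sarama.StringEncoder(`{"id":"doc-42","update_time":1718000000000}`),
	}
	if _, _, err := producer.SendMessage(msg); err != nil {
		log.Fatal(err)
	}
}
```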

Related

How to know when data has been inserted in clickhouse

I understood that ClickHouse is eventually consistent, so once an insert call returns, it doesn't mean the data will appear in a select query.
Does that apply to stand-alone ClickHouse (no distribution, no replication)?
I understand the concept of eventual consistency for data replication, but does it apply with distribution but no replication?
Using a distributed and replicated ClickHouse, what is a recommended way to know that some insert(s) can be safely looked up?
Basically I didn't find much information on this topic, so maybe I am not asking the best questions. Feel free to enlighten me.
1. No, but a single-node setup shouldn't be considered reliable either.
2. By default, yes: you insert into the node the client is connected to (probably via some load balancer), and the Distributed table asynchronously forwards each piece of data to the node where it belongs. The insert_distributed_sync=1 setting makes the client wait synchronously.
3. On insert, use the ***MergeTree shard tables directly (not the Distributed table) with the insert_quorum=2 setting (if there are 3 replicas), and retry indefinitely with exactly the same batch if there are errors (you can use different replicas on retries, since there is deduplication based on the batch hash). Then on reads, use the select_sequential_consistency=1 setting.
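A rough Go sketch of the retry-with-the-same-batch idea from point 3 is below. The table name, the clickhouse-go v2 driver, its DSN format, and passing insert_quorum / select_sequential_consistency as DSN parameters are all assumptions to verify against your driver's documentation; the part that matters is retrying the identical batch, which replicated MergeTree deduplicates by block hash.

```go
// Sketch: quorum inserts retried with the exact same batch, then a read that
// expects the quorum-acknowledged data.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers "clickhouse"
)

func main() {
	// Passing the settings via the DSN is an assumption; they can also be set
	// per session/query or in a server-side settings profile.
	dsn := "clickhouse://localhost:9000/default?insert_quorum=2&select_sequential_consistency=1"
	db, err := sql.Open("clickhouse", dsn)
	if err != nil {
		log.Fatal(err)
	}

	// Always retry with the identical batch: replicated MergeTree deduplicates
	// inserts by block hash, so a retried duplicate is dropped, not re-applied.
	const insert = "INSERT INTO events_local (id, payload) VALUES (1, 'a'), (2, 'b')"
	for attempt := 1; ; attempt++ {
		if _, err := db.Exec(insert); err != nil {
			log.Printf("insert attempt %d failed, retrying: %v", attempt, err)
			time.Sleep(2 * time.Second)
			continue
		}
		break
	}

	// Read that should only see data acknowledged by the quorum.
	var n uint64
	if err := db.QueryRow("SELECT count() FROM events_local").Scan(&n); err != nil {
		log.Fatal(err)
	}
	log.Println("rows:", n)
}
```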

Apache Nifi - Federated Search

My team's been thrown into the deep end: we've been asked to build a federated search of customers over a variety of large datasets, which hold varying amounts of differing data about each individual (and no matching identifiers), and I was wondering how to go about implementing it.
I was thinking Apache Nifi would be a good fit to query our various databases, merge the results, deduplicate the entries via an external tool, and then push the result into a database which is then queried to populate an Elasticsearch instance for the application's use.
So, roughly speaking, something like this:
For example's sake, the following data then exists in the result database from the first flow:

Then I'd run https://github.com/dedupeio/dedupe over this database table, which will add cluster ids to aid the record linkage, e.g.:

The second flow would then query the result database and feed the result into the Elasticsearch instance used by the application's API, which would use the cluster id to link the duplicates.
A couple of questions:
How would I trigger dedupe to run once the merged content has been pushed to the database?
The corollary question: how would the second flow know when to fetch results for pushing into Elasticsearch? Periodic polling?
I also haven't considered any CDC process here, and the databases will be constantly updated, which I'd need to handle. So I'm really interested whether anybody has solved a similar problem or used a different approach (happy to consider other technologies too).
Thanks!
For de-duplicating...
You will probably need to write a custom processor, or use ExecuteScript. Since dedupe looks like a Python library, I'm guessing you'd write a script for ExecuteScript, unless there is a Java library.
For triggering the second flow...
Do you need that intermediate DB table for something else?
If you do need it, then you can send the success relationship of PutDatabaseRecord as the input to the follow-on ExecuteSQL.
If you don't need it, then you can just go MergeContent -> Dedupe -> ElasticSearch.

How do you perform hitless reindexing in elastic search that avoids races and keeps consistency?

I'm implementing Elastic Search for a blog where documents can be updated.
I need to perform hitless reindexing in Elastic Search that avoids races and keeps consistency. (By consistency, I mean if the application does a write followed by a query, the query should show the changes even during reindexing).
The best advice I've been able to find is to use aliases to atomically switch which index the application is using, and to have the application write to both the old index (via a write_alias) and the new index (via a special write_next_version alias) during a reindexing operation, while reading from the old index (via a read_alias). Any races between the concurrent writes from reindex and from the application are resolved by the document version numbers, as long as the application writes to the old index first and the new index second. When the reindexing is done, you atomically switch the application's read and write aliases to the new index and delete the write_next_version alias.
However, there are still races and performance issues.
My application doesn't know there's a reindex occurring; the reindex and the associated alias switching are a separate long-running process. I could use a HEAD request to find out whether the special write_next_version alias exists and only write to it if it does, but that's an extra round trip to the ES servers. There's also still a race between the HEAD request and the reindex process described above, which deletes the write_next_version alias. Alternatively, I could just do both writes every time and silently handle the error on the usually non-existent write_next_version alias. I'd do this via the bulk API if my documents were small, but they are blog entries and can be fairly large.
So should I just write twice every time and swallow the error on the second write? Or should I use the HEAD API to determine whether the application needs to perform the second write for consistency? Or is there some better way of doing this?
The general outline of this strategy is shown in this article. This older article also shows how to do it, but it doesn't use aliases, which is not acceptable. There is a related issue on the Elasticsearch GitHub, but it does not address the problem that two writes need to be done to maintain consistency, nor the races or performance issues (and the issue was closed).
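For what it's worth, the "write twice and swallow the error on the second write" option can be kept fairly small. The sketch below (plain REST calls from Go, simplified error handling, made-up alias and document names) treats a failure on write_next_version as non-fatal; note that it assumes automatic index creation is disabled for that name, otherwise a write while no reindex is running would silently create a stray index instead of failing.

```go
// Sketch of dual writes during reindexing: the primary write must succeed,
// the secondary write (to the next index's alias) is best-effort.
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func putDoc(es, alias, id string, doc []byte) error {
	req, err := http.NewRequest(http.MethodPut,
		fmt.Sprintf("%s/%s/_doc/%s", es, alias, id), bytes.NewReader(doc))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("PUT %s/%s: %s", alias, id, resp.Status)
	}
	return nil
}

func writeBlogPost(es, id string, doc []byte) error {
	// Primary write via the alias the application normally uses.
	if err := putDoc(es, "write_alias", id, doc); err != nil {
		return err
	}
	// Secondary write only matters while a reindex is in flight. If the
	// write_next_version alias doesn't exist, just log and move on
	// (assumes auto index creation is disabled for this name).
	if err := putDoc(es, "write_next_version", id, doc); err != nil {
		log.Printf("secondary write skipped/failed: %v", err)
	}
	return nil
}

func main() {
	doc := []byte(`{"title":"post","body":"...","version":7}`)
	if err := writeBlogPost("http://localhost:9200", "blog-123", doc); err != nil {
		log.Fatal(err)
	}
}
```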

Optimistic locking over multiple documents

I need to update several documents at once, like an RDBMS transaction. The best way to do this for a single document in a key-value store like Couchbase seems to be optimistic locking, which would work for me. However, I need to update multiple documents at once.
I need all documents to be updated, or none. Is this possible in couchbase or some similar highly scalable database?
(by the way, I'm using Go)
There are three approaches to resolve this:
Take another look at your key/document design and identify whether it's possible to combine your multiple docs into one. Then you will be able to do a single transactional update in Couchbase.
Simulate a transaction: the effect can be simulated by writing a suitable document and view definition that produces the effect while still only requiring a single document update to be applied.
Simulate multi-phase transactions: use a transaction record to record each stage of the update process.
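For reference, the per-document optimistic locking the question mentions is exposed in Couchbase as CAS. A rough Go sketch with the gocb v2 SDK follows (bucket name, key, credentials, and document shape are placeholders); it only protects a single document, so all-or-nothing updates across several documents still need one of the approaches above.

```go
// Sketch: read-modify-write on one document guarded by its CAS value.
package main

import (
	"log"

	"github.com/couchbase/gocb/v2"
)

type Account struct {
	Balance int `json:"balance"`
}

func main() {
	cluster, err := gocb.Connect("couchbase://localhost", gocb.ClusterOptions{
		Username: "user",
		Password: "password",
	})
	if err != nil {
		log.Fatal(err)
	}
	// Production code would typically wait for the bucket to be ready
	// (bucket.WaitUntilReady) before issuing operations.
	col := cluster.Bucket("app").DefaultCollection()

	// Read the document together with its CAS value.
	res, err := col.Get("account::1", nil)
	if err != nil {
		log.Fatal(err)
	}
	var acc Account
	if err := res.Content(&acc); err != nil {
		log.Fatal(err)
	}
	acc.Balance += 100

	// Replace succeeds only if the CAS is unchanged, i.e. nobody else wrote
	// the document in between; on a CAS mismatch, retry the read-modify-write.
	if _, err := col.Replace("account::1", acc, &gocb.ReplaceOptions{Cas: res.Cas()}); err != nil {
		log.Fatal(err)
	}
}
```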

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with, using a 3-value status column:
0 = Not indexed
1 = Indexed
2 = Updated
There will be 2 jobs.
Job 1 will select the top X records where status = 0 and push them into a queue like RabbitMQ.
Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and push them onto RabbitMQ. Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
Of course we may need an intermediate status for "queued" so that neither job picks up the same record again (and the same job should not run again if it hasn't completed). The chances of a queued record being updated are slim to none, since updates usually only happen at the end of the day or on the next day.
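A rough Go sketch of Job 1 under this scheme might look like the following. The table and queue names, the intermediate "queued" status value (3 here), and the drivers used (denisenkom/go-mssqldb for SQL Server, streadway/amqp for RabbitMQ) are assumptions for illustration, not part of the original design.

```go
// Sketch of Job 1: select unindexed rows (status = 0), publish them to
// RabbitMQ for the ES consumer, and mark them as queued.
package main

import (
	"database/sql"
	"encoding/json"
	"log"

	_ "github.com/denisenkom/go-mssqldb" // registers the "sqlserver" driver
	"github.com/streadway/amqp"
)

type Row struct {
	ID   int64  `json:"id"`
	Data string `json:"data"`
}

func main() {
	db, err := sql.Open("sqlserver", "sqlserver://user:pass@localhost?database=app")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := ch.QueueDeclare("es-index", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Grab a batch of not-yet-indexed rows.
	rows, err := db.Query("SELECT TOP 500 id, data FROM documents WHERE status = 0")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var r Row
		if err := rows.Scan(&r.ID, &r.Data); err != nil {
			log.Fatal(err)
		}
		body, _ := json.Marshal(r)
		// Publish to the queue the ES bulk-insert consumer reads from.
		if err := ch.Publish("", "es-index", false, false, amqp.Publishing{
			ContentType: "application/json",
			Body:        body,
		}); err != nil {
			log.Fatal(err)
		}
		// Intermediate "queued" status so the next run doesn't pick it up again.
		if _, err := db.Exec("UPDATE documents SET status = 3 WHERE id = @p1", r.ID); err != nil {
			log.Fatal(err)
		}
	}
}
```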
So I know there are rivers (but they're being deprecated and probably not as flexible as ETL).
I would like to bulk insert records from my SQL Server into Elasticsearch.
I could write a scheduled batch job of some sort, either ETL or any other tool, it doesn't matter.
select from table where id > lastIdInsertedToElasticSearch; this would allow loading the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in SQL Server? What would be a good pattern to track updated records in SQL Server and then push the updated records into ES? I know ES keeps document versions when putting the same ID, but I can't seem to visualize a pattern.
So IMHO, batch inserts are good for building or rebuilding the index. For the first run, you can have batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better for that same codebase to also update the documents in Elasticsearch, maybe not directly but by notifying some other service or with the help of queues, so you don't waste time while responding to requests (if that's the kind of setup you have).
We have a pretty similar use case for Elasticsearch. We provide search inside our app across different categories of data. Some of this data is actually created by the users of our app through the app itself, so we handle this easily: the app writes that data to our SQL data store and pushes the same data into RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ we have a consumer written in Python that basically replaces the entire document in Elasticsearch. The corresponding rows in our SQL datastore and documents in Elasticsearch share the same ID, which enables us to update the document.
The other case is where a few of the data types we search on come from a 3rd-party service that exposes the data over its HTTP API. The data creation is in our control, but we don't have an automated mechanism for updating the entries in Elasticsearch, so we basically run a cron job that takes care of it. We have tuned the cron's schedule carefully because we also have a limited API query quota. In this case our data doesn't really change that much per day, so this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river. We get upwards of 6k rows/sec)
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To handle large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the stored procedure to reset the delta and reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.
