How do you perform hitless reindexing in Elasticsearch that avoids races and keeps consistency?

I'm implementing Elasticsearch for a blog where documents can be updated.
I need to perform hitless reindexing in Elasticsearch that avoids races and keeps consistency. (By consistency, I mean that if the application does a write followed by a query, the query should reflect the change even during reindexing.)
The best advice I've been able to find is to use aliases to atomically switch which index the application is using: during a reindexing operation, the application writes to both the old index (via a write_alias) and the new index (via a special write_next_version alias), and reads from the old index (via a read_alias). Any races between the reindex and the application's concurrent writes are resolved by document version numbers, as long as the application writes to the old index first and the new index second. When the reindexing is done, you atomically switch the application's read and write aliases to the new index and delete the write_next_version alias.
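For reference, the final atomic alias switch can be expressed as a single update_aliases call. Below is a minimal sketch using the elasticsearch-py client; the index names blog_v1/blog_v2 are assumptions for illustration, not anything from the linked articles:
# Sketch: atomically repoint the read/write aliases from the old index to the
# new one and drop the temporary write_next_version alias. All actions inside
# one update_aliases call are applied atomically by Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "blog_v1", "alias": "read_alias"}},
        {"remove": {"index": "blog_v1", "alias": "write_alias"}},
        {"remove": {"index": "blog_v2", "alias": "write_next_version"}},
        {"add": {"index": "blog_v2", "alias": "read_alias"}},
        {"add": {"index": "blog_v2", "alias": "write_alias"}},
    ]
})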
However, there are still races and performance issues.
My application doesn't know there's a reindex occurring; the reindex and the alias switching involved are a separate long-running process. I could use a HEAD request to find out whether the special write_next_version alias exists and only write to it if it does. However, that's an extra round trip to the ES servers, and there's still a race between the HEAD request and the reindex process described above when it deletes the write_next_version alias. Alternatively, I could just do both writes every time and silently swallow the error on the usually non-existent write_next_version alias. I'd do this via the bulk API if my documents were small, but they are blog entries and can be fairly large.
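For what it's worth, here is a minimal sketch of the write-twice-and-swallow option with elasticsearch-py; the exception handling is an assumption about one way to wire this up, and it assumes action.auto_create_index is disabled so a write to the missing alias fails instead of silently creating a stray index:
# Sketch: always write to the current index, then attempt the write to the
# next index and ignore the failure when no reindex is in progress (i.e. the
# write_next_version alias does not exist).
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("http://localhost:9200")

def save_post(doc_id, doc):
    es.index(index="write_alias", id=doc_id, body=doc)  # old index first
    try:
        es.index(index="write_next_version", id=doc_id, body=doc)
    except NotFoundError:
        pass  # no reindex in progress; the alias doesn't exist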
So should I just write twice every time and swallow the error on the second write? Or should I use the HEAD API to determine whether the application needs to perform the second write for consistency? Or is there some better way of doing this?
The general outline of this strategy is shown in this article. This older article also shows how to do it, but without aliases, which is not acceptable. There is a related issue on the Elasticsearch GitHub, but it does not address the problem that two writes need to be done in order to maintain consistency, nor the races or performance issues. (They closed the issue...)

Related

How to check properties before update in elasticsearch?

I've already read the official documentation and found no way to do this.
My data comes into ES from Kafka, and messages can sometimes arrive out of order. Previously, each Kafka message was parsed and used to directly insert or update the ES doc with a specific ID. To avoid older data overwriting newer data, I have to check whether the doc with that ID already exists and whether some of its properties meet certain conditions; only then do I perform the UPDATE (or INSERT).
What I'm doing now is 'search before update'.
Before updating a doc, I search ES for the specific ID (included in the Kafka message), then check whether the doc meets the conditions (for example, whether its update_time is older). Lastly, I update the doc, and I set refresh to true so the index is updated instantly.
What am I worried about?
It seems transactional.
If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
That is a possibility, since indices are refreshed once every second by default. Reducing this value is neither recommended nor guaranteed to give you the desired result, since Elasticsearch is NOT designed for this.
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
You can use a script if the number of fields being updated is very limited. Personally, I have found scripts to be best suited to single-field updates, and even then only for corner cases; they should not be used as a general practice. Any more than that and you run the same risk as with stored procedures in the RDBMS world: it makes data management volatile overall and the system harder to maintain and extend in the long run.
Your use case is best suited for optimistic locking support available from Elasticsearch out of the box. Take a look at Elasticsearch Versioning Support for full details.
You can very well use the inbuilt doc version if concurrency is the only problem you need to solve. If, however, you need more than concurrency (out-of-order message delivery and the corresponding ES updates), then you should use an application/domain-specific field, as the inbuilt version wouldn't work as-is.
You can equally well use any app-specific (numeric) field as a version field and use it for optimistic locking during document updates. If you take this approach, pay special attention to all insert, update, and delete operations for that index. Quoting as-is from the versioning support article: when using external versioning, make sure you always add the current version (and version_type) to any index, update or delete calls. If you forget, Elasticsearch will use its internal system to process that request, which will cause the version to be incremented erroneously.
I'd recommend evaluating the inbuilt version first and using it if it fulfills your needs; it'll make the overall design much simpler. Consider an app-specific version as the second option only if the inbuilt version does not meet your requirements.
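A minimal sketch of the external-versioning approach (assuming a numeric update_time from the Kafka message serves as the version; the index and field names are illustrative):
# Sketch: optimistic locking with external versioning. Elasticsearch rejects
# the write with a conflict if the stored version is not lower than the one
# supplied, so out-of-order Kafka messages cannot overwrite newer data.
from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")

def apply_message(msg):
    try:
        es.index(
            index="docs",
            id=msg["id"],
            body=msg["payload"],
            version=msg["update_time"],  # app-specific, monotonically increasing
            version_type="external",
        )
    except ConflictError:
        pass  # a newer version is already indexed; drop the stale message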
If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
Ad 1. Yes, it is possible to save data in Elasticsearch and, for a short while afterwards, receive stale results (before the index is refreshed).
If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?
Ad 2. If you process Kafka messages in several threads, it is best to use business data (e.g. some business ID) as the partition key in Kafka, to ensure data is processed in order. Let Kafka distribute partitions across your consumer threads; don't consume messages with a single consumer and fan out to multiple threads afterwards.
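A minimal sketch of keying messages by a business ID so that all updates for one document land on the same partition and are consumed in order (kafka-python is used here as an example client; the topic and field names are assumptions):
# Sketch: messages with the same key always go to the same partition, and a
# partition is consumed by at most one consumer in a group, so per-document
# ordering is preserved end to end.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_update(doc):
    producer.send("doc-updates", key=doc["id"].encode("utf-8"), value=doc)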
It seems it would be best to ensure data is processed in order, and then drop the check in Elasticsearch, since it is not guaranteed to give valid results.

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with, using a 3-value status column:
0 = Not indexed
1 = Indexed
2 = Updated
There will be 2 jobs...
Job 1 will select the top X records where status = 0 and pop them into a queue like RabbitMQ.
Then a consumer will bulk-insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and pop them onto RabbitMQ. Then a consumer will bulk-insert those records into ES and set the status of the DB records back to 1.
Of course we may need an intermediate "queued" status so neither job picks up the same record again, but the same job should not run if it hasn't completed. The chances of a queued record being updated are slim to none, since updates usually only happen at the end of the day. A sketch of the two jobs follows.
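A minimal sketch of the two jobs under stated assumptions (a posts table with the status column, pyodbc against SQL Server, the elasticsearch bulk helper, and the RabbitMQ hop omitted for brevity; all names are illustrative):
# Sketch: poll for rows needing (re)indexing, bulk-index them into ES, then
# mark them as indexed (status 1). Job 1 picks status = 0, Job 2 status = 2.
import pyodbc
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")
conn = pyodbc.connect("DSN=mydb")  # hypothetical DSN

def run_job(status_to_pick, batch_size=500):
    cur = conn.cursor()
    cur.execute(
        "SELECT TOP (?) id, title, body FROM posts WHERE status = ?",
        batch_size, status_to_pick,
    )
    rows = cur.fetchall()
    if not rows:
        return
    bulk(es, [
        {"_index": "posts", "_id": r.id, "_source": {"title": r.title, "body": r.body}}
        for r in rows
    ])
    placeholders = ",".join("?" * len(rows))
    cur.execute(
        f"UPDATE posts SET status = 1 WHERE id IN ({placeholders})",
        *[r.id for r in rows],
    )
    conn.commit()

run_job(0)  # Job 1: rows never indexed
run_job(2)  # Job 2: rows updated since last indexing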
So I know there are rivers (but they're being deprecated and probably not as flexible as an ETL tool).
I would like to bulk-insert records from my SQL Server into Elasticsearch.
Write a scheduled batch job of some sort, either ETL or any other tool, it doesn't matter.
A query like select * from table where id > lastIdInsertedToElasticSearch will allow loading the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in SQL Server? What would be a good pattern to track updated records in SQL Server and then push the updated records to ES? I know ES keeps document versions when putting to the same ID, but I can't seem to visualize a pattern.
So IMHO, batch inserts are good for building or rebuilding the index. For the first load, you can run batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better for the same codebase to update the documents in Elasticsearch as well, maybe not directly but by notifying some other service or with the help of queues, so as not to waste time responding to requests (if that's the kind of setup you have).
We have a pretty similar use case for Elasticsearch. We provide search inside our app, which runs across different categories of data. Some of this data is created by the users of our app through the app itself, so we handle this easily: our app writes that data to our SQL data store and pushes the same data into RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer written in Python that basically replaces the entire document in Elasticsearch. The corresponding rows in our SQL data store and documents in Elasticsearch share an ID, which enables us to update the document.
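A minimal sketch of such a consumer (pika for RabbitMQ; the queue, index, and message field names are assumptions, since the answer doesn't show its actual code):
# Sketch: consume messages from RabbitMQ and replace the whole document in
# Elasticsearch; indexing with the shared ID overwrites the previous version.
import json
import pika
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def on_message(channel, method, properties, body):
    doc = json.loads(body)
    es.index(index="app-search", id=doc["id"], body=doc)  # full replace
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="es-updates", on_message_callback=on_message)
channel.start_consuming()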
Another case is where a few of the data types we search on come from a 3rd-party service that exposes the data over its HTTP API. The data creation is in our control, but we don't have an automated mechanism for updating the entries in Elasticsearch. In this case, we basically run a cron job that takes care of it. We have tuned the cron schedule because we also have a limited API query quota. But in this case our data isn't really updated that often, so this kind of system works for us.
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec).
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open-sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to use the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To combat large updates, we aren't hitting tables directly. Instead we use stored procedures that only grab deltas. We also have an option on the sp to reset the delta to reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

How to create multiple indexes at once in RethinkDB?

In RethinkDB, is it possible to create multiple indexes at once?
Something like (which doesn't work):
r.db('test').table('user').indexCreate('name').indexCreate('email').run(conn, callback)
Index creation is a fairly heavyweight operation because it requires scanning the existing documents to bring the index up to date. It's theoretically possible to allow creation of two indexes at the same time such that they both perform this process in parallel and halve the work. We don't support that right now, though.
However, I suspect that's not what you're asking about. If you're just looking for a way to avoid waiting for one index to complete before starting the next, the best way would be to do:
table.index_create("foo").run(noreply=True)
# returns immediately
table.index_create("bar").run(noreply=True)
# returns immediately
You can also always do any number of writes in a single query by putting them in an array, like so:
r.expr([table.index_create("foo"), table.index_create("bar")]).run()
I can't actually think of why this would be useful for index creation, because index writes don't block until the index is ready, but hey, who knows. It's definitely useful for table creation.

Solr deployment strategies for 100% Up Time while creating whole index

I'm working on a Solr 3.6 + ASP.NET MVC 3 e-commerce project.
I have an index of approximately 100,000 (1 lakh) products in Solr. There has been a change in requirements, and we need to rebuild the whole index. A full reindex takes almost an hour and a half, during which the site needs to be down.
How can I rebuild the index while keeping the site live, serving content from the older index? What are the best practices for reducing downtime while rebuilding the whole index? I wish I could do it with 100% uptime.
Edit
I'm storing several URLs in Solr as stored fields, which are dynamically generated while adding data to Solr. If I deploy the application on a different subdomain like test.example.com, it generates the wrong URLs; they only work with example.com. So hosting another copy of the application is not an option for me.
You can leverage the concept of multiple cores in Solr and thereby have a live core that users are currently searching against and a standby core where you can make schema changes, re-index, etc. Then, using the SWAP command, you can switch the live and standby cores without any user downtime. The swap is handled internally by Solr, and your users will never notice a difference.
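A minimal sketch of triggering the swap from Python via Solr's CoreAdmin API (the core names live and standby are assumptions):
# Sketch: after fully rebuilding the standby core, atomically swap it with
# the live core using the CoreAdmin SWAP action.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/cores",
    params={"action": "SWAP", "core": "standby", "other": "live", "wt": "json"},
)
resp.raise_for_status()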
As I see it, there are several ways to solve this problem correctly:
Do not rebuild the whole index; just update the necessary records on the fly when they change. Solr can do this quite simply.
Create 2 Solr instances on different ports and use them alternately: while the first is rebuilding, you can use the old index on the second, and when the first is rebuilt, use it until the index on the second instance is rebuilt.
Add a boolean field to your index named, for example, "old_index". When reindexing starts, update all current records to set old_index=1, then configure the application to look for records with old_index==1. Then start reindexing, then delete the old records. This can be done with Solr's deleteByQuery (a sketch follows) and either atomic updates in Solr 4.x or manual updates.
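A minimal sketch of the cleanup step from the third option, deleting the old generation through Solr's JSON update API (the core URL and field value are assumptions):
# Sketch: once the new documents are in place, remove every record from the
# previous generation (flagged old_index=1) with a single delete-by-query.
import requests

resp = requests.post(
    "http://localhost:8983/solr/mycore/update",
    json={"delete": {"query": "old_index:1"}},
    params={"commit": "true"},
)
resp.raise_for_status()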

How to optimize "text search" for inverted index and relational database? [closed]

Update 2022-08-12
I re-thought it and realized I was overcomplicating things. I found that the best way to enhance this system is by using good old information-retrieval techniques, i.e. using the 'location' of a word in a sentence and 'ranking' query results to display the best hits. The approach is illustrated in the following picture.
Update 2015-10-15
Back in 2012, I was building a personal online application and actually wanted to reinvent the wheel, because I am curious by nature, for learning purposes, and to improve my algorithm and architecture skills. I could have used Apache Lucene and others; however, as I mentioned, I decided to build my own mini search engine.
Question: So is there really no way to enhance this architecture except by using available services like Elasticsearch, Lucene, and others?
Original question
I am developing a web application in which users search for specific titles (say, for example: book x, book y, etc.), whose data is in a relational database (MySQL).
I am following the principle that each record fetched from the DB is cached in memory, so that the app makes fewer calls to the database.
I have developed my own mini search engine, with the following architecture:
This is how it works:
a) The user searches for a record name.
b) The system checks which character the query starts with and looks the query up: if it is known, it gets the records; if not, it adds the query and gets all matching records from the database in one of two ways:
Either the query is already in the table "Queries" (which is a sort of history table), in which case it gets the records based on the stored IDs (fast performance);
Or it uses a MySQL LIKE %% statement to get the records/IDs (and then stores the query the user used in the history table "Queries", along with the IDs it maps to).
Then it adds the records and their IDs to the cache, and only the IDs to the inverted index map.
c) The results are returned to the UI.
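To make step (b) concrete, here is a minimal sketch of the lookup flow (the cache and inverted index are plain in-memory dicts; the db helpers are hypothetical placeholders for the MySQL access described above):
# Sketch: query history -> cache -> inverted index, falling back to a
# LIKE %% scan only for queries never seen before.
cache = {}           # record id -> record
inverted_index = {}  # query string -> list of record ids

def search(query, db):
    if query in inverted_index:                 # seen before: serve from memory
        return [cache[rid] for rid in inverted_index[query]]
    ids = db.lookup_query_history(query)        # hypothetical: table "Queries"
    if ids is None:                             # never seen: full LIKE %% scan
        records = db.find_records_like(query)   # hypothetical helper
        ids = [r["id"] for r in records]
        db.save_query_history(query, ids)       # hypothetical helper
    else:
        records = db.fetch_records_by_ids(ids)  # hypothetical helper
    for r in records:
        cache[r["id"]] = r
    inverted_index[query] = ids
    return records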
The system works fine; however, I have two main issues that I couldn't find a good solution for (I've been trying for the past month):
First issue:
if you check point (b), in the case where no query "history" is found and it has to use the LIKE %% statement, the process becomes time-consuming when the query matches numerous records in the database (instead of one or two):
It takes some time to get the records from MySQL (this is why I used indexes on the specific columns);
then time to save the query history;
then time to add the records/IDs to the cache and the inverted index map.
Second issue:
The application allows users to add new records themselves, which can immediately be used by other users logged into the application.
However, to achieve this, the inverted index map and the table "queries" have to be updated so that any old query that matches the new word still maps to it. For example, if a new record "woodX" is added, the old query "wood" should still map to it. So in order to re-hook the query "wood" to this new record, here is what I am doing now:
the new record "woodX" gets added to the "records" table;
then I run a LIKE %% statement to see which already-existing queries in the table "queries" map to this record (for example "wood"), and add each such query with the new record ID as a new row: [wood, new id];
then, in memory, I update the inverted index map's "wood" key's value (i.e. the list) by adding the new record ID to it.
Thus, if a remote user now searches for "wood", they will get both wood and woodX from memory.
The issue here is also time consumption: matching all query histories (in the table "queries") against the newly added word takes a lot of time (the more matching queries, the more time), and the in-memory update also takes a lot of time.
What I'm thinking of doing to fix this timing issue is to return the desired results to the user first, and then have the application make an AJAX POST with the required data to carry out all these update tasks. But I'm not sure whether this is bad practice or an unprofessional way of doing things.
So for the past month (a bit more) I have tried to think of the best optimization/modification/update for this architecture, but I am not an expert in the document-retrieval field (this is actually my first mini search engine ever built).
I would appreciate any feedback or guidance on what I should do to be able to achieve this kind of architecture.
Thanks in advance.
PS:
It's a J2EE application using servlets.
I am using MySQL InnoDB (thus I cannot use the full-text search option).
I would strongly recommend the Sphinx search server, which is highly optimized for full-text searching. Visit http://sphinxsearch.com/.
It's designed to work with MySQL, so it's an addition to your current stack.
I don't pretend to have THE solution, but here are my ideas.
First, I thought like you about the time-consuming LIKE %% queries: I would execute a query limited to a few answers in MySQL, say a dozen, return those to the user, and wait to see whether the user wants more matching records, or launch the full query in the background, depending on your indexing needs for future searches.
More generally, I think that storing everything in memory could one day lead to excessive memory consumption. And although the search engine becomes faster and faster when it keeps everything in memory, you'll have to keep all these caches up to date when data is added or updated, and that will certainly take more and more time.
That's why I think the solution I once saw in an "open-source forum software" (I can't remember its name) is not bad for text searching in posts: each time data is inserted, a table named "Words" keeps track of every existing word, and another table (let's say "WordsLinks") keeps the links between each word and the posts it appears in.
This kind of solution has some drawbacks:
Each insert, delete, and update in the database is a lot slower.
Data selection for the search engine must be anticipated: if you choose to start keeping two-letter words you never kept before, it is too late for already-recorded data, unless you launch a complete data reprocessing.
You must take care of DELETE as well as UPDATE and INSERT.
But I think there are some big advantages:
Computing time is probably about the same as with the "memory solution" (eventually), but it is divided across each database create/update/delete rather than spent at query time.
Looking for a whole word, or for words "starting with", is instantaneous: when indexed, searching the "Words" table is dichotomic (a binary search), and a "WordsLinks" table query is also very fast with an index.
Looking for multiple words at the same time can be simple: gather a group of "WordsLinks" entries for each word found, and intersect them to keep only the "database IDs" common to all these groups (see the sketch after this list). For example, the word "tree" could give the record IDs {1, 4, 6}, and "leaf" could give {1, 3, 6, 9}; the intersection keeps only the common part: {1, 6}.
A "LIKE %%" on a single-column table is probably faster than a lot of "LIKE %%" queries over different fields of different tables. And every database engine handles some caching: the "Words" table may well be small enough to be kept in memory.
I think there is a small risk of performance and memory problems if data becomes huge.
As every search is fast, you can even look for synonyms; for example, search for "network" if the user didn't find anything with "ethernet".
You can apply rules, like splitting camel-case words, to generate, for example, the 3 words "wood", "X", and "woodX" from "woodX". Each "word" is very lightweight to store and find, so you can do a lot of things.
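A minimal sketch of the multi-word intersection described above, using Python sets over a hypothetical in-memory copy of the "WordsLinks" data:
# Sketch: resolve each search word to the set of record ids it appears in,
# then intersect the sets to keep only the ids common to all words.
def find_records(words, links):
    # links: dict mapping word -> set of record ids ("WordsLinks" data)
    id_sets = [links.get(w, set()) for w in words]
    return set.intersection(*id_sets) if id_sets else set()

links = {"tree": {1, 4, 6}, "leaf": {1, 3, 6, 9}}
print(find_records(["tree", "leaf"], links))  # -> {1, 6}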
I think the solution you need could be a blend of methods: for example, you can keep UPDATE, INSERT, and DELETE lightweight, and launch the "Words" and "WordsLinks" feeding from a TRIGGER.
Just as an anecdote, I saw a piece of software developed by my company in which it was decided to keep "everything" (!) in memory. That led us to recommend that our customers buy servers with 64 GB of RAM. A little bit expensive. It explains why I am very cautious when I see solutions that could eventually lead to filling up memory.
I have to say, I don't think your design fits the problem very well. The issues you see now are consequences of that. And apart from that, your current solution doesn't scale.
Here is a possible solution:
Redesign your database to only contain authoritative data, but no derived data. So all cache entries must vanish from MySQL.
Keep data only for the duration of a request in memory within your application. This makes the design of your application much simpler (think race conditions) and enables you to scale to a sensible number of clients.
Introduce a caching layer. I'd strongly recommend using an established product rather than building this yourself. This frees you from all the custom-built caching logic in your application and even does the job much better.
You can take a look at Redis or Memcached for the caching layer; I think an LRU strategy would fit here. Depending on how complex your queries become, a dedicated indexed search mechanism like Lucene might make sense as well.
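A minimal read-through cache sketch with redis-py, to illustrate the suggested caching layer (the key naming and TTL are assumptions; with maxmemory-policy allkeys-lru set in Redis, eviction follows the LRU strategy mentioned above):
# Sketch: check Redis first; on a miss, load from MySQL and cache the result
# with a TTL so entries refresh themselves over time.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_record(record_id, db):
    key = f"record:{record_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    record = db.fetch_record(record_id)     # hypothetical MySQL access helper
    r.setex(key, 3600, json.dumps(record))  # cache for one hour
    return record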
I'm sure this can be implemented in MySQL, but it would be a lot less effort to just use an existing search-oriented database such as Elasticsearch. It uses the Lucene library to implement the inverted index, has extensive documentation, supports horizontal scaling, has a fairly simple query language, and so forth. I guess it has been quite a lot of work to get this far, and it would be even more work to handle caches, race conditions, bugs, performance issues, etc. to make the solution "production grade".
