Solr (re)indexing database - performance

I've looked at several other questions related to mine, but so far I haven't found a solution to my issue.
Here is the situation:
A database with table table_x
A cronjob which runs every 2 minutes and indexes newly added/updated content in table_x using Solr
Extra information about the cronjob and table_x
- The cronjob checks a field in table_x to determine whether a row has to be indexed with Solr or not
- table_x contains over 400k records
What we want is for Solr to reindex the whole of table_x. But there are (we think) two factors that are unclear to us:
- What will happen when Solr is indexing all 400k records and the cronjob detects more records to be indexed?
- What will happen when a search query is performed on the website while Solr is indexing all 400k records?
Can anyone answer these questions for me?
Kind regards,
Pim

The question has two parts:
What happens to indexing when new changes are detected while the initial indexing is still in progress?
You can make the second cron trigger wait until the first run has completed. This is more of an application question, and how you want to handle it is up to you.
How are queries affected by new indexing or indexing in progress?
Which version of Solr are you using? You can use Near Real-Time Search (NRT) to see the changes before hard commits.
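For the first part, a common way to make the second cron trigger wait is a simple lock: if a previous run still holds it, the new run exits and the remaining rows are picked up on the next tick. A minimal sketch in Python; the core name, flag query, and document fields are placeholders, not your actual schema:

    import fcntl
    import sys

    import requests

    # Placeholder core name; the /update path is Solr's standard JSON update handler.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/table_x/update"

    def fetch_dirty_rows():
        # Stand-in for: SELECT ... FROM table_x WHERE <flag field> = 1
        # (the field the cronjob checks; the query is hypothetical).
        return [{"id": "1", "title": "example row"}]

    def main():
        lock = open("/tmp/solr_index.lock", "w")
        try:
            # A second cron invocation cannot take the lock and exits,
            # so overlapping runs never index concurrently; skipped rows
            # are picked up on the next 2-minute tick.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit(0)

        docs = fetch_dirty_rows()
        # commitWithin (ms) lets Solr batch commits rather than hard
        # committing per request; with NRT soft commits the documents
        # become searchable while queries keep being served.
        requests.post(SOLR_UPDATE_URL, json=docs,
                      params={"commitWithin": "10000"}).raise_for_status()

    if __name__ == "__main__":
        main()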

Related

Table To Elastic Search - Create/Update

I have a table with a size of 50 GB. I want to move the table to Elasticsearch for faster search. Currently it takes several minutes even to run a count(1) query.
Is there an open source project which does this, so that we can use it for faster searching of results? Also, whenever there are considerable updates in the table, we need to push the changes to Elasticsearch.
Please advise.
Thanks
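Logstash's JDBC input plugin is one open source project that implements exactly this pattern (periodic SQL polling pushed into Elasticsearch). As a rough sketch of the incremental push, assuming the table has a primary key id and an updated_at column (both assumptions) and using the official Python client:

    import sqlite3  # stand-in for the real 50 GB database

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def sync_changes(conn, since):
        # Pull only rows changed since the last sync instead of the
        # whole table (assumes an updated_at column exists).
        rows = conn.execute(
            "SELECT id, name, updated_at FROM big_table WHERE updated_at > ?",
            (since,),
        )
        actions = (
            {
                "_index": "big_table",
                "_id": row[0],  # reuse the primary key so re-syncs overwrite
                "_source": {"name": row[1], "updated_at": row[2]},
            }
            for row in rows
        )
        # The bulk helper batches requests, which is far faster than
        # one indexing call per row.
        helpers.bulk(es, actions)

    # e.g. run from cron, persisting `since` between runs:
    conn = sqlite3.connect("mydb.sqlite")  # placeholder database
    sync_changes(conn, "2014-01-01 00:00:00")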

How to debug document not available for search in Elasticsearch

I am trying to search and fetch documents from Elasticsearch, but in some cases I am not getting the updated documents. By updated I mean we update the documents periodically in Elasticsearch. The documents in Elasticsearch are updated at an interval of 30 seconds, and the number of documents could range from 10,000 to 100,000. I am aware that updates are generally a slow process in Elasticsearch.
I suspect this is happening because Elasticsearch accepted the documents, but they were not yet available for searching. Hence I have the following questions:
Is there a way to measure the time between indexing and the documents becoming available for search? Is there a setting in Elasticsearch which can log more information in the Elasticsearch logs?
Is there a setting in Elasticsearch which enables logging whenever the merge operation happens?
Any other suggestion to help in optimizing the performance?
Thanks in advance for your help.
By default the refresh_interval parameter is set to 1 second, so unless you have changed this parameter, each update will be searchable after at most 1 second.
If you want to make the results searchable as soon as you have performed the update operation, you can use the refresh parameter.
Using refresh=wait_for, the endpoint will respond once a refresh has occurred. If you use refresh=true, a refresh operation will be triggered immediately. Be careful with refresh=true if you have many updates, since it can impact performance.
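To illustrate the two options, a minimal sketch assuming a recent Elasticsearch (refresh=wait_for has existed since 5.0, the typeless _doc endpoint since 7.x); the index name and documents are placeholders:

    import requests

    doc = {"status": "updated"}

    # refresh=wait_for: the request blocks until the next scheduled
    # refresh makes the document searchable (at most refresh_interval,
    # 1 second by default).
    requests.put(
        "http://localhost:9200/my-index/_doc/1",
        params={"refresh": "wait_for"},
        json=doc,
    ).raise_for_status()

    # refresh=true: forces an immediate refresh instead. Cheap for a
    # single document, but costly if done on every update under load.
    requests.put(
        "http://localhost:9200/my-index/_doc/2",
        params={"refresh": "true"},
        json=doc,
    ).raise_for_status()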

Elasticsearch database sync

I'm using the jdbc river to sync Elasticsearch and the database. The known problem is that rows deleted from the database remain in ES; the jdbc river plugin doesn't solve that. The author of the jdbc river suggested a way of solving the problem:
A good method would be windowed indexing. Each timeframe (maybe once per day or per week) a new index is created for the river, and added to an alias. Old indices are to be dropped after a while. This maintenance is similar to logstash indexing, but it is outside the scope of a river.
My question is, what does that mean in precise way?
Let's say I have a table in the database called table1 with a million rows. My attempt is as follows:
- Create a river called river1, with index1. index1 contains the indexed rows of table1, and index1 is added to an alias.
- Some rows from table1 are deleted during the day, so every night I create another river called river2, with index2, which contains only what is now present in table1.
- Remove the old index1 from the alias and add index2 to the alias.
- Delete the old index1.
Is that the right way?
How about using the _ttl field? Define a static _ttl in the SQL statement that is longer than the SQL update frequency.
The SQL would be something like this when the river is scheduled to run more frequently than once an hour:
"select '1h' as _ttl, some_id as _id, ..."
This way the _ttl gets refreshed each time the river runs, but deleted rows no longer get updated and will be removed from ES when their _ttl expires.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
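Note that _ttl only takes effect if it is enabled in the type mapping first (and the field was removed entirely in Elasticsearch 5.0, so this applies to the 1.x/2.x river era). A minimal sketch with placeholder index and type names:

    import requests

    # Without this mapping change, the "select '1h' as _ttl, ..." column
    # is silently ignored and documents never expire.
    mapping = {"my_type": {"_ttl": {"enabled": True}}}

    requests.put(
        "http://localhost:9200/my_index/_mapping/my_type",
        json=mapping,
    ).raise_for_status()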
Yes, it can be done using the _ttl field, but I solved it using scripts.
Every night a script starts indexing the table, creating an index for that day. Indexing can last a few hours.
Another script periodically reads the output from localhost:9200/_river/jdbc/*/_state?pretty and checks whether all rivers are finished (by checking the existence of the lastEndDate field). When all rivers are finished, the alias is refreshed with the newly created index, and the old index is dropped.
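The alias refresh at the end can be done in a single atomic _aliases call, so searches against the alias never see a half-built index. A minimal sketch, with placeholder index and alias names:

    import requests

    def swap_alias(alias, old_index, new_index):
        # Both actions are applied atomically, so the alias always
        # points at exactly one complete index.
        actions = {
            "actions": [
                {"remove": {"index": old_index, "alias": alias}},
                {"add": {"index": new_index, "alias": alias}},
            ]
        }
        requests.post("http://localhost:9200/_aliases", json=actions).raise_for_status()

    # e.g. after the nightly reindex has finished:
    swap_alias("table1_search", "table1_20140101", "table1_20140102")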

how do you update or sync with a jdbc river

A question about rivers and data syncing with a production database using elastic search:
Are rivers suited only for bulk loading data initially, or do they somehow listen or monitor for changes?
If I have a nightly import of data, is it just better to delete the rivers and indexes, and re-index and recreate the rivers?
If I update or change a river, do I have to delete and re-create the index?
How do I set up a schedule with a river to fetch new data periodically? Can it store the last max id so that it can run diff queries in the SQL to select new rows into the river?
Any suggestions on a better way to keep the database and Elasticsearch in sync, without calling individual index update functions with a PUT command?
All of the Elasticsearch rivers are different - some are provided directly by Elasticsearch, many more are developed by third parties:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Each operates differently, so to answer your questions you have to choose a specific river. For your case, since you're looking to index data from a production database, I'll assume that the JDBC river is what you would use:
https://github.com/jprante/elasticsearch-river-jdbc
This river will index data from your JDBC source, including picking up changes. It can do so on a schedule (there is detailed documentation on the schedule parameter on this page: https://github.com/jprante/elasticsearch-river-jdbc). However, this river will not pick up deletes:
https://github.com/jprante/elasticsearch-river-jdbc/issues/213
You may find this discussion useful concerning getting around the lack of delete support by building a new river/index daily and using index aliases: ElasticSearch river JDBC MySQL not deleting records
You can just map the id in your DB to _id using an alias in the SQL; this way Elasticsearch will identify whether a document has changed or not.
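Putting the pieces together, a sketch of registering such a river with a schedule and the id-to-_id alias (river plugins belong to the Elasticsearch 1.x era; the connection details, SQL, and cron expression are placeholders, with the exact schedule format described in the river's documentation):

    import requests

    river = {
        "type": "jdbc",
        "jdbc": {
            "url": "jdbc:mysql://localhost:3306/mydb",  # placeholder connection
            "user": "reader",
            "password": "secret",
            # Aliasing the primary key to _id makes each scheduled run
            # overwrite existing documents instead of duplicating them.
            "sql": "select id as _id, name, updated_at from table1",
            # Cron-style schedule (see the river's docs for the format):
            # here, roughly every 20 minutes.
            "schedule": "0 0/20 * * * ?",
        },
    }

    requests.put(
        "http://localhost:9200/_river/my_jdbc_river/_meta",
        json=river,
    ).raise_for_status()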

When a river updates data in elastic search, is missing data deleted?

I am using the JDBC river to populate docs in Elasticsearch from SQL Server. I am fetching data using a simple SQL query and have set the polling interval to 20 minutes. Now suppose the river fetches 100 docs the first time it polls SQL Server. After 20 minutes, when it fetches the data again, it gets 120 docs: there are 40 new records, and 20 records which were deleted in SQL Server are no longer there.
Will the records which were deleted from SQL Server also be deleted from the index in Elasticsearch? (This doesn't seem to be happening.)
After observing the behavior overnight, I find that the index has the correct records and the deleted records are not present anymore. Strangely, this did not happen when I restarted the Elasticsearch service. Anyway, that answers my question.