Elasticsearch database sync - elasticsearch

I'm using the jdbc river to sync Elasticsearch with a database. The known problem is that rows deleted from the database remain in ES; the jdbc river plugin doesn't handle that. The author of the jdbc river suggested a way of solving the problem:
A good method would be windowed indexing. Each timeframe (maybe once per day or per week) a new index is created for the river, and added to an alias. Old indices are to be dropped after a while. This maintenance is similar to logstash indexing, but it is outside the scope of a river.
My question is, what does that mean in precise terms?
Let's say I have a table in the database called table1 with a million rows. My attempt is as follows:
Create a river called river1, with index1. index1 contains the indexed rows of table1, and index1 is added to an alias.
Some rows from table1 are deleted during the day, so every night I create another river called river2, with index2, which contains only what is now present in table1.
Remove the old index1 from the alias and add index2 to it.
Delete the old index1.
Is that the right way?

How about using the _ttl field? Define a static _ttl in the SQL statement that is longer than the interval between river runs.
The SQL would be something like this when the river is scheduled to run more often than once an hour:
"select '1h' as _ttl, some_id as _id, ..."
This way the _ttl gets refreshed every time the river runs, but deleted rows will no longer be refreshed and will be removed from ES when their _ttl expires.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
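For this to work, _ttl also has to be enabled in the mapping before the river writes to the index. A minimal sketch, with made-up index and type names:

curl -XPUT 'localhost:9200/index1' -d '{
  "mappings": {
    "table1": {
      "_ttl": { "enabled": true, "default": "1h" }
    }
  }
}'

Note that _ttl was deprecated in Elasticsearch 2.0 and removed in 5.0, so this only applies to the 1.x releases that rivers run on.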

Yes, it can be done using the _ttl field, but I solved it using scripts.
Every night a script starts indexing the table, creating a new index for that day. Indexing can last a few hours.
Another script periodically reads the output of localhost:9200/_river/jdbc/*/_state?pretty and checks whether all rivers have finished (by checking for the existence of the lastEndDate field). When all rivers have finished, the alias is switched to the newly created index and the old index is dropped.
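The alias switch at the end boils down to a single _aliases call; a minimal sketch, with made-up index and alias names:

curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions": [
    { "remove": { "index": "table1_20140101", "alias": "table1" } },
    { "add":    { "index": "table1_20140102", "alias": "table1" } }
  ]
}'

curl -XDELETE 'localhost:9200/table1_20140101'

Both actions in the one _aliases request are applied atomically, so searches against the alias never hit a moment where no index is attached.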

Related

Is querying multiple indices slower than querying a single index in Elasticsearch?

I have a retention index which is used to save transaction data. The index pattern is yearly, which means transaction-2000, transaction-2001, etc. There is a timestamp field inside each document which indicates when the document occurred.
I also have an alias transaction which points to all yearly transaction indices. When I query the transaction data in my application, I just use the alias name rather than the yearly index name.
My question is: if I query documents from just one year based on the timestamp field, e.g. 2000, will the query be faster if I only query the single index transaction-2000 rather than the alias transaction, or are they the same speed?
Joey, this is a classic Elasticsearch problem. When you have multiple indices behind an alias, all of them are queried. One way to overcome this is to use routing, which comes in extremely handy if you already know which index to go to. At query time, if you already know the timestamp range (2000, or 2000 and 2001 for example), you can specifically instruct Elasticsearch to search only those indices behind the alias using routing:
https://www.elastic.co/blog/customizing-your-document-routing
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shard-routing.html
We had a similar issue when we scaled a large dataset search (a similar design where multiple indices were behind an alias); routing came in handy and the queries scaled to our requirements. Hope this helps.
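As a rough sketch (index, alias and field names are invented), you can either address the yearly index directly instead of the alias, or pass a routing value at query time if the documents were indexed with one:

# hit only the index for the year you need
curl -XGET 'localhost:9200/transaction-2000/_search' -H 'Content-Type: application/json' -d '{
  "query": { "range": { "timestamp": { "gte": "2000-01-01", "lt": "2001-01-01" } } }
}'

# or restrict the search to the shards holding a given routing value
curl -XGET 'localhost:9200/transaction/_search?routing=2000' -H 'Content-Type: application/json' -d '{
  "query": { "range": { "timestamp": { "gte": "2000-01-01", "lt": "2001-01-01" } } }
}'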

How to make Logstash replace old data?

I have an Oracle DB. Logstash retrieves data from Oracle and puts it to ElasticSearch.
But when Logstash runs its scheduled export every 5 minutes, Elasticsearch fills up with duplicates because the old data is still there. This is expected: the Oracle data barely changes during those 5 minutes. Let's say 2-3 rows are added and 4-5 are deleted.
How can we replace the old data with the new data, without duplicates?
For example:
Delete the whole old index;
Create a new index with the same name and the same configuration (nGram configuration and mapping);
Add all the new data;
Wait 5 minutes and repeat.
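A rough sketch of that cycle (the index name, analyzer and mapping are made up; the nGram settings stand in for whatever configuration is actually used):

curl -XDELETE 'localhost:9200/oracle-data'

curl -XPUT 'localhost:9200/oracle-data' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "tokenizer": { "my_ngram": { "type": "ngram", "min_gram": 2, "max_gram": 3 } },
      "analyzer": { "ngram_analyzer": { "type": "custom", "tokenizer": "my_ngram", "filter": ["lowercase"] } }
    }
  },
  "mappings": {
    "doc": { "properties": { "name": { "type": "text", "analyzer": "ngram_analyzer" } } }
  }
}'

The drawback is that the index is empty (or only partially filled) while the re-import runs, which is what the alias-based approach in the answer below avoids.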
It's pretty easy: create a new index for each import and apply the mappings, then switch your alias to the most recent index afterwards. Remove old indices if needed. Your current data will always stay searchable while you index the most recent data (a sketch follows the list below).
Here are the sources you'll probably need to read:
Use aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) to point to the most current data when searching in Elasticsearch (BTW it's always a good idea to have aliases in place).
Use rollover api (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) to create a new index for each import run - note the alias handling here too.
Use index templates (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html) to automatically apply the mappings/settings to your newly created indices.
Shrink, close and/or delete old indices so your cluster only handles the data you really need. Have a look at Curator (https://github.com/elastic/curator) as a standalone tool.
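A minimal sketch of those pieces together (all names, mappings and conditions are made up; on 6.x+ the template key is index_patterns rather than template):

# template so every import index gets the same settings/mappings
curl -XPUT 'localhost:9200/_template/oracle-import' -H 'Content-Type: application/json' -d '{
  "template": "oracle-import-*",
  "settings": { "number_of_shards": 1 },
  "mappings": { "doc": { "properties": { "name": { "type": "text" } } } }
}'

# bootstrap the first index and the search alias
curl -XPUT 'localhost:9200/oracle-import-000001'
curl -XPOST 'localhost:9200/_aliases' -H 'Content-Type: application/json' -d '{
  "actions": [ { "add": { "index": "oracle-import-000001", "alias": "oracle-import" } } ]
}'

# before each import run, roll the alias over to a fresh index
curl -XPOST 'localhost:9200/oracle-import/_rollover' -H 'Content-Type: application/json' -d '{
  "conditions": { "max_age": "5m" }
}'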
You just need to use a fingerprint/hash of each document, or a hash of the unique fields in each document, as the document id. That way you overwrite the same documents with their updated versions in place every time, while still adding new documents as well.
But this approach will not handle data that was deleted from Oracle.
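In practice this just means indexing with a deterministic _id (in Logstash, via the document_id option of the elasticsearch output). The raw request looks roughly like this, with made-up index, type and field names:

# re-running the import with the same _id overwrites the document instead of duplicating it
curl -XPUT 'localhost:9200/oracle-data/doc/42' -H 'Content-Type: application/json' -d '{
  "id": 42,
  "name": "some row from Oracle"
}'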

ElasticSearch 5, document time to live, create and update

I'm using ElasticSearch 5 and I need documents older than X days/weeks, or older than a given date, to be deleted automatically. I'm not sure _ttl is available in 5, but from what I read Elastic doesn't recommend it anyway.
I will keep updating my documents; it's only the ones that haven't been updated for a defined period that I need deleted.
Any ideas?
If you need to do that for all docs which are older than a date X, then it's definitely better to create one index per period (say, per day) and after X days simply drop the index.
That's way more efficient than running individual delete-doc operations.
If it's tied to a given query (docs that are older than X days and match XYZ), then add a timestamp to your documents and run a delete-by-query call every day.
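A minimal sketch of both options (index names, the field name and the 30-day cut-off are made up):

# option 1: time-based indices, drop a whole index once it falls out of the retention window
curl -XDELETE 'localhost:9200/myindex-2017.01.01'

# option 2: keep a timestamp in each document and run a daily delete-by-query (in ES 5 core)
curl -XPOST 'localhost:9200/myindex/_delete_by_query' -H 'Content-Type: application/json' -d '{
  "query": { "range": { "timestamp": { "lt": "now-30d" } } }
}'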

how do you update or sync with a jdbc river

A question about rivers and keeping data in sync with a production database using Elasticsearch:
Are rivers suited only to bulk loading data initially, or do they somehow listen for or monitor changes?
If I have a nightly import of data, is it better to just delete the rivers and indexes and re-create both?
If I update or change a river, do I have to delete and re-create the index?
How do I set up a schedule with a river to fetch new data periodically? Can it store the last max id so that it can run diff queries in the SQL to select into the river?
Any suggestions on a better way to keep the database and Elasticsearch in sync, without calling individual index update functions with a PUT command?
All of the Elasticsearch rivers are different - some are provided directly by Elasticsearch, many more are developed by third parties:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Each operates differently, so to answer your questions you have to choose a specific river. For your case, since you're looking to index data from a production database, I'll assume that the JDBC river is what you would use:
https://github.com/jprante/elasticsearch-river-jdbc
This river will index data from your JDBC source, including picking up changes. It can do so on a schedule (there is detailed documentation on the schedule parameter on this page: https://github.com/jprante/elasticsearch-river-jdbc). However, this river will not pick up deletes:
https://github.com/jprante/elasticsearch-river-jdbc/issues/213
You may find this discussion useful, concerning working around the lack of delete support by building a new river/index daily and using index aliases: ElasticSearch river JDBC MySQL not deleting records
You can just map the id column in your DB to _id with a column alias; that way Elasticsearch can tell whether a document has changed or not.
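With the JDBC river that column alias goes straight into the river's SQL. A rough sketch of a river definition (connection details, table and schedule are made up, and the exact parameters depend on the river version):

curl -XPUT 'localhost:9200/_river/my_jdbc_river/_meta' -H 'Content-Type: application/json' -d '{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:mysql://localhost:3306/mydb",
    "user": "user",
    "password": "password",
    "sql": "select id as _id, name, updated_at from table1",
    "schedule": "0 0/30 * * * ?"
  }
}'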

Solr (re)indexing database

I've looked at several other questions related to mine, but so far I couldn't find a solution for my issue.
Here is the situation:
A database with table table_x
A cronjob which checks every 2 minutes for newly added/updated content in table_x to index with Solr
Extra information about the cronjob and table_x
- The cronjob checks a field in table_x to determine whether a row has to be indexed with Solr or not
- table_x contains over 400k records
What we want is for Solr to reindex the whole of table_x. But there are (we think) 2 factors that are not clear to us:
- What will happen when Solr is indexing all 400k records and the cronjob detects more records to be indexed?
- What will happen when a search query is performed on the website while Solr is indexing all 400k records?
Is there someone who can answer this for me?
Kind regards,
Pim
The question has two parts.
What happens to indexing when changes are detected while the initial indexing is going on?
You can make the second cron trigger wait until the first run has completed. This is more a question about your application and how you want to handle it.
How are queries affected by new indexing or indexing in progress?
Which version of Solr are you using? You can use Near Real Time (NRT) search to see the changes before hard commits.
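If soft commits aren't already configured, a quick way to make freshly indexed documents visible (assuming a stock /update handler and a core named core1; adjust to your setup) is to request one explicitly:

# ask Solr for a soft commit so new documents become searchable without a hard commit
curl 'http://localhost:8983/solr/core1/update?softCommit=true'

For a permanent setup, autoSoftCommit in solrconfig.xml is the usual way to get near-real-time visibility.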
