I am using the JDBC river to populate documents in Elasticsearch from SQL Server. I fetch the data with a simple SQL query and have set the polling interval to 20 minutes. Suppose the river fetches 100 docs the first time it polls SQL Server. When it polls again 20 minutes later it gets 120 docs: 40 new records have been added, and 20 records that were deleted in SQL Server are no longer present.
Will the records that were deleted from SQL Server also be deleted from the index in Elasticsearch? (This doesn't seem to be happening.)
After observing the behavior overnight, I find that the index has the correct records and the deleted records are no longer present. Strangely, this did not happen when I restarted the Elasticsearch service. Anyway, that answers my question.
Related
I have 400000+ records stored in MongoDB with regular indexes, but when I fire an update or search query through Laravel Eloquent it takes too much time to get the particular records.
In the where condition we use indexed columns only.
We are using an Atlas M10 cluster instance with multiple replicas.
If anyone has an idea about this, please share it with us.
my replication lag graph
this is my profiler data
My indexes in the schema
I'm using the JDBC river to sync Elasticsearch with a database. The known problem is that rows deleted from the database remain in ES; the JDBC river plugin doesn't solve that. The author of the JDBC river suggested a way of solving the problem:
A good method would be windowed indexing. Each timeframe (maybe once per day or per week) a new index is created for the river, and added to an alias. Old indices are to be dropped after a while. This maintenance is similar to logstash indexing, but it is outside the scope of a river.
My question is, what does that mean precisely?
Let's say I have a table in the database called table1 with a million rows. My attempt is as follows:
Create a river called river1, with index1. index1 contains the indexed rows of table1. index1 is added to the alias.
Some rows from table1 are deleted during the day, so every night I create another river called river2, with index2, which contains only what is now present in table1.
Remove old index1 from alias and add index2 to alias.
Delete old index1.
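As a rough sketch of the alias switch in the steps above, here is how the request body could be built in Python. The per-day index naming scheme and the alias name `table1_alias` are hypothetical; the only Elasticsearch-specific part is the `actions` format accepted by `POST /_aliases`, which applies the remove and the add in a single request.

```python
import json
from datetime import date


def daily_index_name(prefix: str, day: date) -> str:
    """Build a per-day index name, e.g. 'table1-2014.03.01' (hypothetical naming scheme)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    old = daily_index_name("table1", date(2014, 3, 1))
    new = daily_index_name("table1", date(2014, 3, 2))
    # Send this JSON to POST /_aliases, then delete the old index.
    print(json.dumps(alias_swap_body("table1_alias", old, new), indent=2))
```

Searches that go through the alias never see a moment where it points at no index, which is the reason to swap it in one request rather than two.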
Is that the right way?
How about using the _ttl field? Define a static _ttl in the SQL-statement to be longer than the SQL-update frequency.
The SQL would be something like this when the river is scheduled to run more frequently than 1 hour:
"select '1h' as _ttl, some_id as _id, ..."
This way the _ttl is refreshed every time the river runs, but deleted rows are not refreshed and will be removed from ES when their _ttl expires.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
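A minimal sketch of what this could look like, written in Python only to build the JSON bodies. The table and column names, the connection details, and the exact set of river parameters are assumptions (river settings differ between plugin versions); the `_ttl` mapping fragment reflects the fact that `_ttl` is disabled by default and has to be enabled for the type.

```python
import json


def ttl_select(table, id_column, columns, ttl="1h"):
    """SQL that stamps every fetched row with a static _ttl and maps the key to _id."""
    cols = ", ".join(columns)
    return f"select '{ttl}' as _ttl, {id_column} as _id, {cols} from {table}"


def ttl_mapping(type_name):
    """Mapping fragment that enables _ttl for the given type."""
    return {type_name: {"_ttl": {"enabled": True}}}


def river_body(url, user, password, sql):
    """River definition; only the commonly documented jdbc keys are used here."""
    return {"type": "jdbc", "jdbc": {"url": url, "user": user, "password": password, "sql": sql}}


if __name__ == "__main__":
    # Hypothetical table, columns and connection string.
    sql = ttl_select("table1", "some_id", ["name", "price"], ttl="1h")
    print(json.dumps(ttl_mapping("table1"), indent=2))
    print(json.dumps(river_body("jdbc:sqlserver://localhost:1433", "user", "secret", sql), indent=2))
```

The TTL value should stay comfortably longer than the polling interval, otherwise live rows can expire between two runs.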
Yes, it can be done using the _ttl field, but I solved it using scripts.
Every night a script starts indexing the table and creates an index for that day. Indexing can last a few hours.
Another script periodically reads the output of localhost:9200/_river/jdbc/*/_state?pretty and checks whether all rivers have finished (by checking for the existence of the lastEndDate field). When all rivers have finished, the alias is switched to the newly created index and the old index is dropped.
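The polling check could be sketched roughly like this in Python. The shape of the `_state` response is an assumption here (a top-level `state` list with one object per river); the code only relies on `lastEndDate` appearing somewhere inside each river's object, and searches for it recursively so that the exact nesting does not matter.

```python
import json
from urllib.request import urlopen


def has_key(obj, key):
    """Return True if `key` occurs anywhere inside a nested dict/list structure."""
    if isinstance(obj, dict):
        return key in obj or any(has_key(v, key) for v in obj.values())
    if isinstance(obj, list):
        return any(has_key(item, key) for item in obj)
    return False


def all_rivers_finished(response):
    """True only if at least one river is reported and each one has a lastEndDate."""
    states = response.get("state", [])  # assumed response layout
    return bool(states) and all(has_key(s, "lastEndDate") for s in states)


def fetch_state(url="http://localhost:9200/_river/jdbc/*/_state"):
    """Read the river state document from Elasticsearch."""
    with urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    if all_rivers_finished(fetch_state()):
        print("all rivers finished; the alias can be switched now")
    else:
        print("still indexing; check again later")
```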
A question about rivers and data syncing with a production database using elastic search:
Are rivers suited only for bulk loading data initially, or do they somehow listen for or monitor changes?
If I have a nightly import of data, is it just better to delete rivers and indexes, and re-index and recreate the rivers?
If I update or change a river, do I have to delete and re-create the index?
How do I set up a schedule with a river to fetch new data periodically? Can it store the last max ID so that it can do diff queries in the SQL it selects into the river?
Any suggestions on a better way to keep the database and elastic search in sync - without calling individual index update functions with a PUT command?
All of the Elasticsearch rivers are different - some are provided directly by Elasticsearch, many more are developed by third parties:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Each operates differently, so to answer your questions you have to choose a specific river. For your case, since you're looking to index data from a production database, I'll assume that the JDBC river is what you would use:
https://github.com/jprante/elasticsearch-river-jdbc
This river will index data from your JDBC source, including picking up changes. It can do so on a schedule (there is detailed documentation on the schedule parameter on this page: https://github.com/jprante/elasticsearch-river-jdbc). However, this river will not pick up deletes:
https://github.com/jprante/elasticsearch-river-jdbc/issues/213
You may find this discussion useful. It covers working around the lack of delete support by building a new river/index daily and using index aliases: ElasticSearch river JDBC MySQL not deleting records
You can map the ID column in your DB to _id using a column alias; that way Elasticsearch can identify whether a document has changed or not.
I've looked at several other questions related to mine, but so far I couldn't find a solution for my issue.
Here is the situation:
A database with table table_x
A cronjob which checks every 2 minutes to index newly added/updated content in table_x using Solr
Extra information about the cronjob and table_x
- The cronjob checks a field in table_x to determine whether a row has to be indexed with Solr or not
- table_x contains over 400k records
What we want is for Solr to reindex the whole of table_x. But there are (we think) 2 factors that are not clear to us:
- What will happen when Solr is indexing all 400k records and the cronjob detects more records to be indexed?
- What will happen when a search query is performed on the website while Solr is indexing all 400k records?
Is there someone who can answer this for me?
Kind regards,
Pim
The question has two parts.
What happens to indexing when changes are detected while the initial indexing is still going on?
You can make the second cron run wait until the first has completed. This is more of an application question and depends on how you want to handle it.
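One common way to keep cron runs from overlapping is a lock file. The sketch below uses Python's `fcntl.flock` (Unix only); the lock file path and the `run_indexing` placeholder are hypothetical. Note that this version skips a run when the lock is busy rather than waiting; dropping `fcntl.LOCK_NB` would make the second run block until the first releases the lock.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_run_lock(path):
    """Yield True if this run obtained the lock, False if another run holds it."""
    handle = open(path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        handle.close()  # closing the file releases the lock if we held it


def run_indexing():
    """Placeholder for the real work: select flagged rows and send them to Solr."""
    print("indexing flagged rows")


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.gettempdir(), "table_x_index.lock")
    with single_run_lock(lock_path) as got_lock:
        if got_lock:
            run_indexing()
        else:
            print("previous run still in progress; skipping this run")
```

Rows that are flagged while a long run is in progress are simply picked up by the next run that obtains the lock.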
How are queries affected by new indexing or indexing in progress?
Which version of Solr are you using? You can use NearRealtimeSearch to see the changes before hard commits.
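As an illustration, in Solr 4 and later an update request can ask for a soft commit, which makes newly indexed documents searchable without the cost of a hard commit. The sketch below only builds the update URL; the base URL and the core name `table_x` are assumptions.

```python
from urllib.parse import urlencode


def update_url(base_url, core, soft_commit=True):
    """URL of the update handler, asking for a soft commit or a hard commit."""
    params = {"softCommit": "true"} if soft_commit else {"commit": "true"}
    return f"{base_url.rstrip('/')}/{core}/update?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name.
    print(update_url("http://localhost:8983/solr", "table_x"))
    print(update_url("http://localhost:8983/solr", "table_x", soft_commit=False))
```

Searches keep being served from the last committed view of the index while documents are being added, so queries made during a full reindex return results, just not the rows that have not been committed yet.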
At present I am using elasticsearch-1.0.0.Beta2 for quick searching of records. I have 20000 records in the user table, but on the Elasticsearch server (http://www.domain.com) I can see only fifty records. All records do seem to be indexed: when I request http://www.domain.com/index/type/id (for example http://www.domain.com/salambc/Question/51f9ac9024c8ce5261f4ee55), I get the data for every record that I have in my local database. But only 50 records are shown in the Elasticsearch browser.
Can you please help me find where the remaining records are stored? Thanks in advance.