how do you update or sync with a jdbc river - elasticsearch

A question about rivers and data syncing with a production database using elastic search:
Are rivers suited for only bulk loading data initially, or does it somehow listen or monitor for changes.
If I have a nightly import of data, is it just better to delete rivers and indexes, and re-index and recreate the rivers?
If I update or change a river, do I have to delete and re-create the index?
How do I set up a schedule with a river to fetch new data periodically. Can it store last maxid so that it can do diff queries in the sql to select into the river?
Any suggestions on a better way to keep the database and elastic search in sync - without calling individual index update functions with a PUT command?

All of the Elasticsearch rivers are different - some are provided directly by Elasticsearch, many more are developed by third parties:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Each operates differently, so to answer your questions you have to choose a specific river. For your case, since you're looking to index data from a production database, I'll assume that the JDBC river is what you would use:
https://github.com/jprante/elasticsearch-river-jdbc
This river will index data from your JDBC source, including picking up changes. It can do so on a schedule (there is detailed documentation on the schedule parameter on this page: https://github.com/jprante/elasticsearch-river-jdbc). However, this river will not pick up deletes:
https://github.com/jprante/elasticsearch-river-jdbc/issues/213
you may find this discussion useful, concerning getting around the lack of delete support with building a new river/index daily and using index aliases: ElasticSearch river JDBC MySQL not deleting records

You can just map your id in your DB to be _id with alias, this way the elastic will identify when the document was changed or not.

Related

ElasticSearch as primary DB for document library

My task is a full-text search system for a really large amount of documents. Now I have documents as RTF file and their metadata, so all this will be indexed in elastic search. These documents are unchangeable (they can be only deleted) and I don't really expect many new documents per day. So is it a good idea to use elastic as primary DB in this case?
Maybe I'll store the RTF file separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here. So it's a good case for elasticsearch as the primary DB
Elastic is more known as distributed full text search engine , not as database...
If you preserve the document _source it can be used as database since almost any time you decide to apply document changes or mapping changes you need to re-index the documents in the index(known as table in relation world) , there is no possibility to update parts of the elastic lucene inverse index , you need to re-index the whole document ...
Elastic index survival mechanism is one of the best , meaning that if you loose node the index lost replicas are automatically replicated to some of the other available nodes so you dont need to do any manual operations ...
If you do regular backups and having no requirement the data to be 24/7 available it is completely acceptable to hold the data and full text index in elasticsearch as like in database ...
But if you need highly available combination I would recommend keeping the documents in mongoDB (known as best for distributed document store) for example and use elasticsearch only in its original purpose as full text search engine ...

Setting up a daily partitioned index

I'm looking to setup my index such that it is partitioned into daily sub-indices that I can adjust the individual settings of depending on the age of that index, i.e. >= 30 days old should be moved to slower hardware etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join-the-dots on is how to setup the original index to be partitioned by day. When adding data/querying, do I need to specify the individual daily indicies or is there something in Elasticsearch that will do this for me? If the later, how does it work with adding/querying (assuming they are different?)...how does it determine the partitions that are relevant for the query/partition to add a document to? (I'm assuming there is a timestamp field - but I can't see from the docs how its all linked together)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
there's no such thing as sub indices or partitions in Elasticsearch. if you want to use ilm, which you should, then you are using aliases and multiple indices
you will need to upgrade from 7.7 - which is EOL - and use the default distribution to get access to ilm as well
getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. but to your questions;
the major assumption of using ilm is that data being ingested is current, so on a rough level, data from today will end up in an index from today
if you are indexing historic data then you may want to put that into "traditional" index names, eg logs-2021.08.09 and then attach them to the ilm policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
when querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. it does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing of all documents, such as adding a field to all documents, which from my understanding means retrieving the document, performing the desirable operation, deleting the original document from the index and reindex it. This seems to be somewhat dangerous and a backup of the original index seems to be preferable before performing this (obviously).
This made me wonder if Elasticsearch actually is suitable as a final storage solution at all, or if I should keep the raw documents that makes up an index separately stored to be able to recreate an index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field need to be enabled for this feature to work.
Using Elasticsearch as a final storage solution at all : It depends on how you intend to use Elastic Search as storage. Do you need RDBMS , key Value store, column based datastore or a document store like MongoDb? Elastic Search is definitely well suited when you need a distributed document store (json, html, xml etc) with Lucene based advanced search capabilities. Have a look at the various use cases for ES especially the usage at The Guardian:http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure, that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard about this kind of a practice to backup index of search engine.
Usual schema when you using ElasticSearch or Solr or whatever search engine you have:
You have some kind of a datasource (it could be database, legacy mainframe, excel papers, some REST service with data or whatever)
You have search engine that should index this datasource to add to your system capability for search. When datasource is changed - you could reindex it, or index only changed part with the help of incremental indexation.
If something happen to search engine index - you could easily reindex all your data.

Tool to compare elasticsearch index into data base records to ascertain inconsistent state

I Want to know whether any tool available for comparing database entries into elastcisearch index to find the mismatch.
Thanks in advance.
There is a way to do this with the Scrutineer tool, which provides support for comparing data stored in Elasticsearch against a source of truth, usually relational database.
After running this tool, you'll get a report of:
records in the source of truth and not in ES (missed create)
records in ES and not in the source of truth (missed delete)
records in ES and the source of truth which are out of synch (missed update)
Basically, this will give you an exact overview of the de-/synchronization state of the two data stores you're comparing (ES + DB).
UPDATE 1:
Here is another interesting blog article on the subject: Elasticsearch: Verifying Data Integrity with External Data Stores
UPDATE 2:
Here is yet another interesting blog article on the subject: How to keep Elasticsearch synchronized with a relational database using Logstash
I believe not, this has the potential to be a very taxing operation. However if you have used a unique PK from your database as the _id for the documents in elasticsearch then you could use the following command whilst iterating through records -
curl -XHEAD -i 'http://localhost:9200/twitter/tweet/1'
This will return an appropriate response as to whether the document exists or not. Storing all _id's which don't exist and placing these into ElasticSearch, within your own bespoke script or application.
If this is not the case the complexity of the problem significantly rise as do the implications to the cluster.

Couchbase XDCR Elasticsearch speed and deletions

We are thinking about implementing some sort of message cache which would hold onto the messages we send to our search index so we could persist while the index was down for an extended period of time (for example a complete re-index) then 're-apply' the messages. These messages are creations or updates of the documents we index. If space were cheap enough, with something as scalable as Couchbase we may even be able to hold all messages but I haven't done any sort of estimations of message size and quantity yet. Anyway, I suggested Couchbase + XDCR + Elasticsearch for this task as most of the work would be done automatically however there are 4 questions I have remaining:
If we were implementing this as a cache, I would not want Elasticsearch to remove any documents that were not in Couchbase, is this possible to do (perhaps it is even the default behaviour)?
Is it possible to apply some sort of versioning so that a document in the index is not over-written by an older version coming from Couchbase?
If I were to add a new field to the index, I might need to re-index from the actual document datasource then re-apply all the messages stored in Couchbase. I may have 100 million documents in Elasticsearch and say 500,000 documents in Couchbase that I want to re-apply to Elasticsearch? What would the speed be like.
Would I be able to apply any sort of logic in-between Couchbase and Elasticsearch?
Update:
So we store documents in an RDBMS as we need instant access to inserted docs plus some other stuff. We send limited versions of the document to a search engine via messages. If we want to add a field to the index we need to re-index the system from the RDBMS somehow. If we have this Couchbase message cache we could add the field to messages first, then switch off the indexing of old messages and re-index from the RDBMS. We could then switch back on the indexing of the messages and the entire 'queue' of messages would be indexed without having lost anything.
This system (if it worked) would remove the need for an MQ server, a message listener and make sure no documents were missing from the index.
The versioning would be necessary as we don't want to apply an 'update' to the index which actually contains a more recent document (not sure if this would ever happen now I think about it).
I appreciate it's probably not too great a job to implement points 1 and 4 by changing the Elasticsearch plugin code but I would like to confirm that the idea is reasonable first!
The Couchbase-Elasticsearch integration today should be seen as an indexing engine for Couchbase. This means the index is "managed/controlled" by the data that are in Couchbase.
The XDCR is used to sent "all the events" to Elasticsearch. This means the index is update/delete every time a document (stored in Couchbase) is created, modified or deleted.
So "all the documents" stored into a Couchbase bucket are indexed into Elasticsearch.
Let's answer your questions one by one, based on the current implementation of the Couchbase-Elasticsearch.
When a document is removed from Couchbase, the Elasticsearch index is update (entry removed).
Not sure to understand the question. How an "older" version could come from Couchbase? Anyway once again everytime the document that is stored into Couchbase is modified, the index in Elasticsearch is updated.
Not sure to understand where you want to add a new field? If this is into a document that is stored into Couchbase, when the document will be sent to Elasticsearch the index will be updated. But based on what I have said before : all document "stored" into Couchbase will be present in Elasticsearch index.
Not with the plugin as it is today, but as you know it is an open source project so you can either add some logic to it or even contribute your ideas to the project ( https://github.com/couchbaselabs/elasticsearch-transport-couchbase )
So let me ask you more questions:
- how do you inser the document into you application? (and where Couchbase? Elasticsearch?)
- what are the types of documents?
- what do you want to cache into Couchbase?

Resources