Manually run a scheduled JDBC River instance with Elasticsearch - elasticsearch

I have a JDBC river instance with an index scheduled to run at a specific time.
I expected that it would run on creation but this does not seem to be the case.
Is it possible to use the API to manually notify the instance that it should run the index process now?
elasticsearch-river-jdbc

The rivers API in Elasticsearch is deprecated, so I would highly recommend you move to a push model instead of pulling data in via the JDBC river.
We had the same issues with the JDBC river before moving the code to an external process. The JDBC river wouldn't consistently start when we restarted ES, we couldn't manually kick it off, and it was just a pain to maintain.
We ended up writing small scripts to push data in and run them as local cron jobs. It's been much more reliable and we can run them at any time and debug them easily.
(As a note, if you have a lot of data, you'll need to use the bulk API so you don't overwhelm ES with too many individual writes.)
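As a rough illustration of that push model, here is a minimal sketch of such a script, posting newline-delimited JSON to Elasticsearch's bulk endpoint over HTTP. The index name, table, and connection details are placeholders, and the source database driver is just an example:

```python
import json
import requests
import psycopg2  # or whichever driver your source database uses

ES_URL = "http://localhost:9200/_bulk"  # adjust to your cluster
INDEX = "my_index"                       # hypothetical index name

def fetch_rows():
    # Placeholder query: pull whatever rows you want to push into Elasticsearch.
    conn = psycopg2.connect("dbname=mydb user=me")
    with conn.cursor() as cur:
        cur.execute("SELECT id, title, body FROM documents")
        for row in cur:
            yield {"id": row[0], "title": row[1], "body": row[2]}

def bulk_payload(rows):
    # The bulk API expects newline-delimited JSON: an action line, then a source line.
    lines = []
    for doc in rows:
        lines.append(json.dumps({"index": {"_index": INDEX, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    payload = bulk_payload(fetch_rows())
    resp = requests.post(ES_URL, data=payload,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    print("bulk errors:", resp.json().get("errors"))
```

Run it from cron on whatever schedule you need; for large tables, split the payload into chunks of a few thousand documents per request.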

Related

Archive old data from Elasticsearch to Google Cloud Storage

I have an Elasticsearch server installed on a Google Compute Engine instance. A huge amount of data is ingested every minute, and the underlying disk fills up pretty quickly.
I understand we can increase the size of the disks, but this would cost a lot for storing long-term data.
We need 90 days of data on the Elasticsearch server (Compute Engine disk), and data older than 90 days (up to 7 years) should be stored in Google Cloud Storage buckets. The older data should be retrievable in case it is needed for later analysis.
One way I know is to take snapshots frequently and delete the indices older than 90 days from Elasticsearch server using Curator. This way I can keep the disks free and minimize the storage cost.
Is there any other way this can be done without automating the above-mentioned idea manually?
For example, something provided by Elasticsearch out of the box that archives data older than 90 days itself and keeps the data files on disk; we could then move those files from the disk to Google Cloud Storage ourselves.
There is no other way around it: to make backups of your data you need to use the snapshot/restore API; it is the only safe and reliable option available.
There is a plugin to use Google Cloud Storage as a repository.
If you are using version 7.5+ and Kibana with the basic license, you can configure snapshots directly from the Kibana interface; if you are on an older version or do not have Kibana, you will need to rely on Curator or a custom script run from a crontab scheduler.
While you can copy the data directory, you would need to stop your entire cluster every time you want to copy the data, and to restore it you would also need to create a new cluster from scratch every time; this is a lot of work and not practical when you have something like the snapshot/restore API.
Look into Snapshot Lifecycle Management and Index Lifecycle Management. They are available with a Basic license.
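For reference, here is a sketch of what registering a GCS repository and a snapshot lifecycle policy looks like through the REST API, assuming the repository-gcs plugin is installed and a service-account key is already in the Elasticsearch keystore. The repository, bucket, and policy names are placeholders:

```python
import requests

ES = "http://localhost:9200"  # adjust to your cluster

# 1. Register a GCS snapshot repository (requires the repository-gcs plugin
#    and a service-account key added to the Elasticsearch keystore).
requests.put(f"{ES}/_snapshot/gcs_archive", json={
    "type": "gcs",
    "settings": {"bucket": "my-es-archive-bucket"}  # hypothetical bucket name
}).raise_for_status()

# 2. Create an SLM policy (7.4+) that snapshots every night and keeps 90 days
#    of snapshots in the repository.
requests.put(f"{ES}/_slm/policy/nightly-archive", json={
    "schedule": "0 30 1 * * ?",      # 01:30 every day
    "name": "<nightly-{now/d}>",
    "repository": "gcs_archive",
    "config": {"indices": ["*"]},
    "retention": {"expire_after": "90d"}
}).raise_for_status()
```

Deleting the local indices after 90 days is still a separate step, e.g. an ILM delete phase or Curator, as described above.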

Does creating a snapshot make the cluster slow?

I use Elasticsearch 5.6.
While a snapshot is running, I call
http://localhost:9200/_cluster/health
but do not get a response for more than 10 seconds.
I can also see that while the snapshot runs, the machines show heavy disk/network I/O.
Such a delay does not happen when no snapshot is running.
I check _cluster/health with a timeout to verify that creating a snapshot does not slow down queries.
Is this the correct way to check?
In practice, will creating snapshots slow queries down?
Yes, there is increased disk activity as indices are read; however, this excerpt from the Elastic documentation states:
The index snapshot process is incremental. In the process of making the index snapshot Elasticsearch analyses the list of the index files that are already stored in the repository and copies only files that were created or changed since the last snapshot. That allows multiple snapshots to be preserved in the repository in a compact form. Snapshotting process is executed in non-blocking fashion. All indexing and searching operation can continue to be executed against the index that is being snapshotted.
Apart from the _cluster/health check taking more than 10 seconds, do you see any impact on data indexing or searching?
How frequently are you running the snapshots? Is it a full-cluster snapshot? Where is the snapshot repository: filesystem, S3, Azure, or Google Cloud?
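If you want to go beyond watching _cluster/health, a more direct check is to measure query latency with and without a snapshot in progress. A rough sketch; the index name is a placeholder, and the health call uses a client-side timeout so it fails fast instead of hanging:

```python
import time
import requests

ES = "http://localhost:9200"
INDEX = "my_index"  # hypothetical index name

def health_status(timeout_s=5):
    # Client-side timeout: give up after timeout_s seconds instead of waiting.
    try:
        r = requests.get(f"{ES}/_cluster/health", timeout=timeout_s)
        return r.json().get("status")
    except requests.exceptions.Timeout:
        return f"no response within {timeout_s}s"

def avg_search_latency(runs=20):
    # Wall-clock latency of a trivial query, averaged over a few runs.
    times = []
    for _ in range(runs):
        t0 = time.time()
        requests.post(f"{ES}/{INDEX}/_search",
                      json={"query": {"match_all": {}}, "size": 1})
        times.append(time.time() - t0)
    return sum(times) / len(times)

print("cluster health:", health_status())
print("avg search latency (s): %.3f" % avg_search_latency())
```

Run it once while no snapshot is running and once during a snapshot, and compare the averages.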

Apache NiFi job is not terminating automatically

I am new to the Apache NiFi tool. I am trying to import data from MongoDB and put that data into HDFS. I have created two processors, one for MongoDB and a second for HDFS, and I have configured them correctly. The job runs successfully and stores the data into HDFS, but it should terminate automatically on success. It does not, and it creates too many files in HDFS. I want to know how to make an on-demand job in NiFi and how to determine that a job was successful.
GetMongo will continue to pull data from MongoDB based on the provided properties such as Query, Projection, and Limit. It has no way of tracking the execution process, at least for now. What you can do, however, is change the Run Schedule and/or Scheduling Strategy. You can find them by right-clicking on the processor and clicking Configure. By default, the Run Schedule is 0 sec, which means running continuously. Changing it to, say, 60 min will make the processor run once every hour. It will still read the same documents from MongoDB every hour, but since you mentioned that you only want to run it once, this is the approach I'm suggesting.

Index data from PostgreSQL to Elasticsearch

I want to set up an Elasticsearch cluster using the multicast feature. One node is an external Elasticsearch node and the other is a node client (client property set to true, so it does not hold data).
This node client is created using Spring Data Elasticsearch. I want to index data from a PostgreSQL database into the external Elasticsearch node. Previously I indexed data using the JDBC river plugin.
But I want to know: is there any application I can use to index data from PostgreSQL instead of using the river plugin?
It is possible to do this in realtime, although it requires writing a dedicated Postgres->ES gateway and using some Postgres-specific features. I've written about it here: http://haltcondition.net/2014/04/realtime-postgres-elasticsearch/
The principle is actually pretty simple; the complexity of the method I came up with is due to handling corner cases such as multiple gateways running and gateways becoming unavailable for a while. In short, my solution is:
Attach a trigger to all tables of interest that copies the updated row IDs to a temporary table.
The trigger also emits an async notification that a row has been updated.
A separate gateway (mine is written in Clojure) attaches to the Postgres server and listens for notifications. This is the tricky part, as not all Postgres client drivers support async notifications (there is a new experimental JDBC driver that does, which is what I use).
On update, the gateway reads, transforms and pushes the data to Elasticsearch (see the sketch after this answer).
In my experiments this model is capable of sub-second updates to Elasticsearch after a Postgres row insert/update. Obviously this will vary in the real world though.
There is a proof-of-concept project with Vagrant and Docker test frameworks here: https://bitbucket.org/tarkasteve/postgres-elasticsearch-realtime
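To illustrate the gateway step (the original gateway is written in Clojure), here is a minimal Python sketch using psycopg2's LISTEN/NOTIFY support. The channel name, table, columns and index are placeholders and must match whatever your trigger emits:

```python
import select
import psycopg2
import psycopg2.extensions
import requests

ES_URL = "http://localhost:9200/my_index/_doc"  # hypothetical index
CHANNEL = "row_changed"                          # must match the trigger's NOTIFY channel

conn = psycopg2.connect("dbname=mydb user=gateway")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute(f"LISTEN {CHANNEL};")

while True:
    # Block until the Postgres socket becomes readable, then drain notifications.
    if select.select([conn], [], [], 10) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        row_id = note.payload  # assumes the trigger sends the row id as the payload
        cur.execute("SELECT id, title, body FROM documents WHERE id = %s", (row_id,))
        row = cur.fetchone()
        if row is None:
            continue
        doc = {"id": row[0], "title": row[1], "body": row[2]}
        requests.put(f"{ES_URL}/{doc['id']}", json=doc).raise_for_status()
```

On the database side, the trigger just has to call pg_notify() with the channel name and the row id as the payload after each insert/update.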

Rails - searching and configuration of Solr

I use Solr for searching content in my app. What I don't like is that every time I restart my computer I have to start Solr manually, and whenever there is new content in the app I have to reindex it, because otherwise Solr won't find the new data.
This is not very convenient. How does working with Solr look on a server, e.g. on Heroku? Do I have to keep starting Solr there all the time, and reindex data over and over again, as I do on my localhost?
Finally, is there a better solution for searching than Solr?
You are using the included server, right?
You can choose to deploy it in Tomcat. You just have to copy your files to Tomcat and register your Solr application in the Tomcat configuration. Tomcat runs as a service. Or, you can use a script to start Jetty on startup.
A professional Solr service also tries to keep your Solr application alive and your data safe against any cause, such as crashed software, a failed server or even a datacenter going down.
Check what Heroku (or other hosted Solr solutions) promises you in their terms. They would do a much better job than an individual (no restarting Solr instances frequently!).
When you add something to Solr, it is persisted to disk. When committed, it is available to search. If a document changes, you reindex it to reflect the new changes.
When you restart Solr, the same persisted data is available. What is your exact trouble?
There is the DIH (DataImportHandler) if you want to automatically index from a DB.
I'm happy with Solr so far.
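To make the add/commit behaviour above concrete, here is a small sketch that indexes one document via Solr's JSON update endpoint and commits it so it becomes searchable; the core name and fields are placeholders:

```python
import requests

SOLR = "http://localhost:8983/solr/my_core"  # hypothetical core name

# Index (or reindex) a document; commit=true makes it searchable immediately.
requests.post(
    f"{SOLR}/update",
    params={"commit": "true"},
    json=[{"id": "42", "title": "Hello Solr"}],
).raise_for_status()

# The document is now visible to queries.
hits = requests.get(f"{SOLR}/select",
                    params={"q": "title:Hello", "wt": "json"}).json()
print(hits["response"]["numFound"])
```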
As far as starting the Solr instance after restarting your computer goes, you can write a bash script that does it for you, or declare an alias that starts both Solr and your app server.
As far as re-indexing goes, new and updated records should be re-indexed automatically, unless you manipulate your data from the console.
For alternative solutions, check out Thinking Sphinx.
