Index data from PostgreSQL to Elasticsearch

I want to set up an Elasticsearch cluster using the multicast feature. One node is an external Elasticsearch node and the other is a node client (client property set to true, so it does not hold data).
The node client is created using Spring Data Elasticsearch. I want to index data from a PostgreSQL database into the external Elasticsearch node. So far I have indexed data using the JDBC river plugin.
But I want to know: is there any application I can use to index data from PostgreSQL instead of the river plugin?

It is possible to do this in real time, although it requires writing a dedicated Postgres->ES gateway and using some Postgres-specific features. I've written about it here: http://haltcondition.net/2014/04/realtime-postgres-elasticsearch/
The principle is actually pretty simple; the complexity of the method I have come up with is due to handling corner cases such as multiple gateways running and gateways becoming unavailable for a while. In short, my solution is:
Attach a trigger to all tables of interest that copies the updated row IDs to a temporary table.
The trigger also emits an async notification that a row has been updated.
A separate gateway (mine is written in Clojure) attaches to the Postgres server and listens for notifications. This is the tricky part, as not all Postgres client drivers support async notifications (there is a new experimental JDBC driver that does, which is what I use).
On update the gateway reads, transforms and pushes the data to Elasticsearch.
In my experiments this model is capable of sub-second updates to Elasticsearch after a Postgres row insert/update. Obviously this will vary in the real world though.
There is a proof-of-concept project with Vagrant and Docker test frameworks here: https://bitbucket.org/tarkasteve/postgres-elasticsearch-realtime
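The gateway in the linked post is written in Clojure against an experimental async-capable JDBC driver, but the same idea can be sketched in Python, since psycopg2 supports LISTEN/NOTIFY. The table, queue, channel and index names below are made up for illustration; this is a rough sketch of the approach, not the project's actual code:

```python
# Rough sketch of the trigger + LISTEN/NOTIFY gateway in Python.
# All names here (es_queue, es_changes, the documents table/index) are
# made up for illustration; the original gateway is written in Clojure.
import select

import psycopg2
import psycopg2.extensions
from elasticsearch import Elasticsearch

SETUP_SQL = """
CREATE TABLE IF NOT EXISTS es_queue (id bigint);

CREATE OR REPLACE FUNCTION queue_for_es() RETURNS trigger AS $$
BEGIN
    INSERT INTO es_queue (id) VALUES (NEW.id);    -- remember the changed row
    PERFORM pg_notify('es_changes', '');          -- wake up the gateway
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS documents_to_es ON documents;
CREATE TRIGGER documents_to_es
    AFTER INSERT OR UPDATE ON documents
    FOR EACH ROW EXECUTE PROCEDURE queue_for_es();
"""


def drain_queue(conn, es):
    """Read queued row IDs, fetch the rows and push them to Elasticsearch."""
    with conn.cursor() as cur:
        cur.execute("DELETE FROM es_queue RETURNING id")
        ids = [row[0] for row in cur.fetchall()]
        for row_id in ids:
            cur.execute("SELECT id, title, body FROM documents WHERE id = %s", (row_id,))
            rec = cur.fetchone()
            if rec:
                # elasticsearch-py 8.x uses document= instead of body=
                es.index(index="documents", id=rec[0], body={"title": rec[1], "body": rec[2]})


def main():
    es = Elasticsearch(["http://localhost:9200"])
    conn = psycopg2.connect("dbname=mydb user=me")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    with conn.cursor() as cur:
        cur.execute(SETUP_SQL)
        cur.execute("LISTEN es_changes")
    drain_queue(conn, es)  # catch up on anything queued while the gateway was down
    while True:
        # Block until Postgres signals a change (60 s safety timeout).
        if select.select([conn], [], [], 60) == ([], [], []):
            continue
        conn.poll()
        if conn.notifies:
            del conn.notifies[:]  # one wake-up is enough; drain everything queued
            drain_queue(conn, es)


if __name__ == "__main__":
    main()
```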

Related

How to implement search functionalities?

Currently we implement our search functionality with an in-memory cache mechanism.
The data is now getting very large and we can no longer handle it in memory. We are also getting more input sources from different systems (Oracle, flat files, and Git).
Can you please share how we can achieve this?
We thought ES would help with this. But how can we provide input if any changes happen in the source systems? (Batch processing will NOT help.)
Hadoop: we are NOT handling that level of data.
Please also share your thoughts.
we are getting more input sources from different systems (Oracle, flat files, and Git)
I assume that's why you tagged Kafka? It'll work, but you bring up a valid point:
But how can we provide input if any changes happen...?
For plain text or Git events, you'll obviously need to alter some parser engine and restart the job to get extra data into the message schema.
For Oracle, the GoldenGate product will publish table column changes, and Kafka Connect can recognize those events and update the payload accordingly.
If all you care about is searching things, plenty of tools exist; but since you mention Elasticsearch, Filebeat works for plain text, and Logstash works with various other types of input source. If you have Kafka, then feed events to Kafka and let Logstash or Kafka Connect update ES.
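If you do go the Kafka route, the producing side can be as small as the sketch below (using kafka-python; the topic name and event shape are made up for illustration). The Kafka-to-Elasticsearch side is then handled by Logstash or Kafka Connect as mentioned above:

```python
# Minimal sketch of "feed events to Kafka": whenever a source system
# (Oracle, flat file, Git hook, ...) reports a change, publish it as a JSON
# event. Topic name and event fields are placeholders.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_change(source, key, payload):
    """Publish one change event; Logstash or Kafka Connect indexes it into ES."""
    producer.send("search-updates", {"source": source, "key": key, "doc": payload})

publish_change("git", "repo/readme.md", {"text": "updated contents ..."})
producer.flush()
```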

Can Kafka be used as a messaging service between Oracle and Elasticsearch?

Can Kafka be used as a messaging service between Oracle and Elasticsearch? Are there any downsides to this approach?
Kafka Connect provides a JDBC source connector and an Elasticsearch sink connector.
No downsides that I am aware of, other than service maintenance.
Feel free to use Logstash instead, but Kafka provides better resiliency and scalability.
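As a rough sketch of what that looks like (not tied to a specific connector version), you can register both connectors through the Kafka Connect REST API. The hosts, credentials, table, topic and index names below are placeholders, and the exact property set depends on the connector versions you run:

```python
# Sketch: register a JDBC source and an Elasticsearch sink on a Kafka Connect
# cluster via its REST API. All connection details are placeholders; check the
# connector documentation for the full list of properties.
import requests

CONNECT_URL = "http://localhost:8083/connectors"

jdbc_source = {
    "name": "oracle-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:oracle:thin:@//dbhost:1521/ORCL",
        "connection.user": "app",
        "connection.password": "secret",
        "table.whitelist": "ORDERS",
        "mode": "timestamp+incrementing",
        "incrementing.column.name": "ID",
        "timestamp.column.name": "UPDATED_AT",
        "topic.prefix": "oracle-",
    },
}

es_sink = {
    "name": "orders-es-sink",
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "connection.url": "http://localhost:9200",
        "topics": "oracle-ORDERS",
        "key.ignore": "true",
        "type.name": "_doc",
    },
}

for connector in (jdbc_source, es_sink):
    resp = requests.post(CONNECT_URL, json=connector)
    resp.raise_for_status()
    print("created", connector["name"])
```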
I have tried this in the past with SQL Server instead of Oracle and it works great, and I am sure the same approach works with Oracle, since the Logstash JDBC input plugin described below supports Oracle as well.
Basically you need a Logstash JDBC input plugin (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html) that points at your Oracle DB instance and pushes the rows over to Kafka using the Kafka output plugin (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-kafka.html).
To read the contents back from Kafka you need another Logstash instance (this is the indexer) with the Kafka input plugin (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kafka.html). Finally, use the Elasticsearch output plugin in the Logstash indexer configuration file to push the events to Elasticsearch.
So the pipeline looks like this (example configurations are sketched below):
Oracle -> Logstash Shipper -> Kafka -> Logstash Indexer -> Elasticsearch
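The two pipeline configurations might look roughly like the following. All hosts, credentials, table, topic and index names are placeholders, and the exact options depend on your plugin versions:

```
# logstash-shipper.conf -- Oracle -> Kafka (paths and names are placeholders)
input {
  jdbc {
    jdbc_driver_library => "/opt/drivers/ojdbc8.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@//dbhost:1521/ORCL"
    jdbc_user => "app"
    jdbc_password => "secret"
    schedule => "* * * * *"
    statement => "SELECT * FROM orders WHERE updated_at > :sql_last_value"
  }
}
output {
  kafka {
    bootstrap_servers => "kafka:9092"
    topic_id => "oracle-orders"
    codec => json
  }
}

# logstash-indexer.conf -- Kafka -> Elasticsearch
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["oracle-orders"]
    codec => json
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "orders"
  }
}
```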
Overall I think this is a pretty scalable way to push events from your DB to Elasticsearch. As for downsides: at times it can feel like there is one component too many in the pipeline, which gets frustrating when you have failures. So you need to put appropriate controls and monitoring in place at every level to make sure the data aggregation pipeline described above keeps functioning. Give it a try and good luck!

Replicating Oracle data using Apache Kafka

I would like to expose a data table from my Oracle database into Apache Kafka. Is it technically possible?
I also need to stream data changes from my Oracle table and publish them to Kafka.
Do you know of good documentation for this use case?
Thanks
You need the Kafka Connect JDBC source connector to load data from your Oracle database. There is an open source, bundled connector from Confluent. It has been packaged and tested with the rest of the Confluent Platform, including the schema registry. Using this connector is as easy as writing a simple connector configuration and starting a standalone Kafka Connect process, or making a REST request to a Kafka Connect cluster. Documentation for this connector can be found here
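In standalone mode, the connector configuration is just a small properties file passed to the connect-standalone script (which ships with Kafka and the Confluent Platform) together with a worker config. The values below are placeholders; check the property names against the connector version you install:

```
# oracle-source.properties (all connection details and names are placeholders)
name=oracle-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:oracle:thin:@//dbhost:1521/ORCL
connection.user=app
connection.password=secret
table.whitelist=ORDERS
mode=incrementing
incrementing.column.name=ID
topic.prefix=oracle-
```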
To move change data in real time from Oracle transactional databases to Kafka, you first need a proprietary Change Data Capture (CDC) tool that requires a commercial license, such as Oracle's GoldenGate, Attunity Replicate, Dbvisit Replicate or Striim. Then you can leverage the Kafka Connect connectors that they all provide. They are all listed here
Debezium, an open source CDC tool from Red Hat, is planning to work on a connector that does not rely on an Oracle GoldenGate license. The related JIRA is here.
You can use Kafka Connect for data import/export to Kafka. Using Kafka Connect is quite simple, because there is no need to write code. You just need to configure your connector.
You would only need to write code if no connector is available and you want to provide your own. There are already 50+ connectors available.
There is a connector ("Golden Gate") for Oracle from Confluent Inc: https://www.confluent.io/product/connectors/
On the surface this is technically feasible. However, understand that the question has implications for downstream applications.
So, to comprehensively address the original question of technical feasibility, bear in mind the following:
Are ordering/commit semantics important, particularly across tables?
Is capturing continued table changes across crashes of the Kafka/CDC components important?
When the table definition changes, do you expect the application to keep working, or will you resort to planned change control?
Will you want to move partial subsets of the data?
What datatypes need to be supported (e.g. nested table support)?
Will you need to handle compressed logical changes, e.g. on update/delete operations? How will you address this on the consumer side?
You can also consider using OpenLogReplicator. This is a new open source tool which reads Oracle database redo logs and sends messages to Kafka. Since it is written in C++, it has very low latency (around 10 ms) together with relatively high throughput.
It is at an early stage of development, but there is already a working version. You can build a POC and check for yourself how it works.

Why does Solr JDBC only support SolrCloud? Can a single node get this feature?

I want to use Solr JDBC in my project, which is based on a single node. But it only supports SolrCloud? So can a single node get this feature?
The reason for this is that the SQL feature of Solr is a MapReduce task built on top of parallel processing across the nodes. This requires a collection to keep track of SQL plans and available workers. Since this feature uses the collection API (and doesn't have an alternative implementation), and the JDBC driver connects to ZooKeeper (not to Solr directly) to get information about your Solr cluster, cloud mode is required for JDBC and SQL support.
You could run in SolrCloud mode with a single node, and you can also run multiple instances on a single server (like having a cluster with three nodes, but only a single physical server).
I don't think making this work with a non-cloud setup would be a very high priority task, as the SQL feature is rather new and experimental and cloud mode is becoming the norm (and "cloud mode" and "old mode" will probably be merged into a single mode later).
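Concretely, the single-node workaround looks something like the following (version-dependent details aside): start one node in cloud mode, which brings up an embedded ZooKeeper, and point the JDBC driver at that ZooKeeper.

```
# Start a single Solr node in SolrCloud mode (embedded ZooKeeper on port 9983)
bin/solr start -c

# Create a collection on that single node
bin/solr create -c mycollection -shards 1 -replicationFactor 1

# The JDBC driver then connects through ZooKeeper, e.g.
#   jdbc:solr://localhost:9983?collection=mycollection
```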

Manually run a scheduled JDBC River instance with Elasticsearch

I have a JDBC river instance with an index scheduled to run at a specific time.
I expected that it would run on creation, but this does not seem to be the case.
Is it possible to use the API to manually tell the instance that it should run the indexing process now?
The rivers API for Elasticsearch has been deprecated, so I would highly recommend moving to a push model instead of pulling data in via the JDBC river.
We had the same issues with the JDBC river before moving the code to an external process. The JDBC river wouldn't consistently start when we restarted ES, we couldn't manually kick it off, and it was just a pain to maintain.
We ended up writing small scripts to push data in and run them as local cron jobs. It's been much more reliable and we can run them at any time and debug them easily.
(As a note, if you have a lot of data you'll need to use the ES bulk API so you don't overwhelm ES with too many individual writes.)
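As an illustration (not our exact scripts), a minimal cron-driven push script using the Python client's bulk helper might look like the sketch below. It assumes a Postgres source purely for the sake of the example; the table, columns, index name and connection settings are placeholders, and "changed in the last 5 minutes" is only one way to track what needs reindexing:

```python
# Sketch of a cron-driven push script: pull recently changed rows and index
# them with the bulk API. All names and connection settings are placeholders.
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk


def changed_rows():
    conn = psycopg2.connect("dbname=mydb user=me")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title, body FROM documents "
            "WHERE updated_at > now() - interval '5 minutes'"
        )
        for row_id, title, body in cur:
            yield {
                "_index": "documents",
                "_id": row_id,
                "_source": {"title": title, "body": body},
            }
    conn.close()


def main():
    es = Elasticsearch(["http://localhost:9200"])
    ok, errors = bulk(es, changed_rows(), raise_on_error=False)
    print(f"indexed {ok} docs, {len(errors)} errors")


if __name__ == "__main__":
    main()
```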

Resources