How do I achieve Change Data Capture in CnosDB - clickhouse

I am migrating some of the IoT data from MySQL to CnosDB and then replicate some of the anomalies and also the changes/delta of the change to ClickHouse for further analysis.
Currently I have to do full replication (similar to full backup). Is there an easy way to just replicate the change data from CnosDB to ClickHouse. FYI. I am using the Kafka in the middle to stream the data.

Related

Transferring data from elasticsearch to influx

I am trying to send data from Elasticsearch to Influxdb. Is there any other way to do this except writing plugins and configuration files. I am new to both these databases, and trying to understand the overall picture.
Also, am I right in understanding that Kapacitor processes influx data and then sends it to Kafka for streaming? Or should I stream data using Kapacitor only?
I am trying to learn all these new technologies in as short time frame, and all the new terminologies have got me confused. Thanks for your time and help.
ElasticSearch is a search engine, not a database. InfluxDB is a time series data. I did't understand why you need to transfer data from search results to a time series database.
Kapacitor can process data in 2 different ways. Either in batch mode or in streaming mode. Assume some application streaming sensor data(or some time series data) to InfluxDb. You can set Kapacitor to process that data as soon it is available in InfluxDB by setting kapacitor in streaming mode. Or in case you need to process data from InfluxDB every 2 hour, you can configure that job as batch job. Once you process the data, Kapacitor can persist the data in InfluxDB. Or in case you need to stream the data to Kafka, Kpacitor has a Kafka plugin. Please note Kapacitor has more plugins than I mentioned in my answer.

How to implement search functionalities?

Currently we are making a memory cache mechanism implementation in search functionalities.
Now the data is getting very large and we are not able to handle it in memory. Also we are getting more input source from different systems (oracle, flat file, and git).
Can you please share me how can we achieve this process?
We thought ES will help on this. But how can we provide input if any changes happen in end source? (batch processing will NOT help)
Hadoop - Not that level of data we are NOT handling
Also share your thoughts.
we are getting more input source from different systems (oracle, flat file, and git)
I assume that's why you tagged Kafka? It'll work, but you bring up a valid point
But how can we provide input if any changes happen...?
For plain text, or Git events, you'll obviously need to alter some parser engine and restart the job to get extra data in the message schema.
For Oracle, the GoldenGate product will publish table column changes, and Kafka Connect can recognize those events and update the payload accordingly.
If all you care about is searching things, plenty of tools exist, but you mention Elasticsearch, so using Filebeat works for plaintext, and Logstash can work with various other types of input sources. If you have Kafka, then feed events to Kafka, let Logstash or Kafka Connect update ES

Oracle replication data using Apache kafka

I would like to expose the data table from my oracle database and expose into apache kafka. is it technicaly possible?
As well i need to stream data change from my oracle table and notify it to Kafka.
do you know good documentation of this use case?
thanks
You need Kafka Connect JDBC source connector to load data from your Oracle database. There is an open source bundled connector from Confluent. It has been packaged and tested with the rest of the Confluent Platform, including the schema registry. Using this connector is as easy as writing a simple connector configuration and starting a standalone Kafka Connect process or making a REST request to a Kafka Connect cluster. Documentation for this connector can be found here
To move change data in real-time from Oracle transactional databases to Kafka you need to first use a Change Data Capture (CDC) proprietary tool which requires purchasing a commercial license such as Oracle’s Golden Gate, Attunity Replicate, Dbvisit Replicate or Striim. Then, you can leverage the Kafka Connect connectors that they all provide. They are all listed here
Debezium, an open source CDC tool from Redhat, is planning to work on a connector that is not relying on Oracle Golden Gate license. The related JIRA is here.
You can use Kafka Connect for data import/export to Kafka. Using Kafka Connect is quite simple, because there is no need to write code. You just need to configure your connector.
You would only need to write code, if no connector is available and you want to provide your own connector. There are already 50+ connectors available.
There is a connector ("Golden Gate") for Oracle from Confluent Inc: https://www.confluent.io/product/connectors/
At the surface this is technically feasible. However, understand that the question has implications on downstream applications.
So to comprehensively address the original question regarding technical feasibility, bear in mind the following:
Are ordering/commit semantics important? Particularly across tables.
Are continued table changes across instance crashes (Kafka/CDC components) important?
When the table definition changes - do you expect the application to continue working, or will resort to planned change control?
Will you want to move partial subsets of data?
What datatypes need to be supported? e.g. Nested table support etc.
Will you need to handle compressed logical changes - e.g. on update/delete operations? How will you address this on the consumer side?
You can consider also using OpenLogReplicator. This is a new open source tool which reads Oracle database redo logs and sends messages to Kafka. Since it is written in C++ it has a very low latency like around 10ms and yet a relatively high throughput ratio.
It is in an early stage of development but there is already a working version. You can try to make a POC and check yourself how it works.

IIS Logs Straming to Hadoop real time

I am trying to do a POC in Hadoop for log aggregation. we have multiple IIS servers hosting atleast 100 sites. I want to to stream logs continously to HDFS and parse data and store in Hive for further analytics.
1) Is Apache KAFKA correct choice or Apache Flume
2) After streaming is it better to use Apache storm and ingest data into Hive
Please help with any suggestions and also any information of this kind of problem statement.
Thanks
You can use either Kafka or flume also you can combine both to get data into HDFSbut you need to write code for this There are Opensource data flow management tools available, you don't need to write code. Eg. NiFi and Streamsets
You don't need to use any separate ingestion tools, you can directly use those data flow tools to put data into hive table. Once table is created in hive then you can do your analytics by providing queries.
Let me know you need anything else on this.

Oracle to Hadoop data ingestion in real-time

I have a requirement to ingest the data from an Oracle database to Hadoop in real-time.
What's the best way to achieve this on Hadoop?
The important problem here is getting the data out of the Oracle DB in real time. This is usually called Change Data Capture, or CDC. The complete solution depends on how you do this part.
Other things that matter for this answer are:
What is the target for the data and what are you going to do with it?
just store plain HDFS files and access for adhoc queries with something like Impala?
store in HBase for use in other apps?
use in a CEP solution like Storm?
...
What tools is your team familiar with
Do you prefer the DIY approach, gluing together existing open-source tools and writing code for the missing parts?
or do you prefer a Data integration tool like Informatica?
Coming back to CDC, there are three different approaches to it:
Easy: if you don't need true real-time and have a way to identify new data with an SQL query that executes fast enough for the required data latency. Then you can run this query over and over and ingest its results (the exact method depends on the target, the size of each chunk, and the preferred tools)
Complicated: Roll your own CDC solution: download the database logs, parse them into series of inserts/updates/deletes, ingest these to Hadoop.
Expensive: buy a CDC solution, that does this for you (like GoldenGate or Attunity)
Expanding a bit on what #Nickolay mentioned, there are a few options, but the best would be too opinion based to state.
Tungsten (open source)
Tungsten Replicator is an open source replication engine supporting a variety of different extractor and applier modules. Data can be extracted from MySQL, Oracle and Amazon RDS, and applied to transactional stores, including MySQL, Oracle, and Amazon RDS; NoSQL stores such as MongoDB, and datawarehouse stores such as Vertica, Hadoop, and Amazon rDS.
Oracle GoldenGate
Oracle GoldenGate is a comprehensive software package for real-time data integration and replication in heterogeneous IT environments. The product set enables high availability solutions, real-time data integration, transactional change data capture, data replication, transformations, and verification between operational and analytical enterprise systems. It provides a handler for HDFS.
Dell Shareplex
SharePlex™ Connector for Hadoop® loads and continuously replicates changes from an Oracle® database to a Hadoop® cluster. This gives you all the benefits of maintaining a real-time or near real-time copy of source tables
Apache Sqoop is a data transfer tool to transfer bulk data from any RDBMS with JDBC connectivity(supports Oracle also) to hadoop HDFS.

Resources