I want to ingest data using NiFi in two directions, one into HDFS and one into an Oracle database. Is it possible? - apache-nifi

We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two separate subscriber programs.

NiFi has processors to get data from an RDBMS (e.g. Oracle) such as QueryDatabaseTable and ExecuteSQL, and also from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or into HDFS (e.g. PutHDFS). So you can get your data from multiple sources and send it to multiple targets with NiFi.
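As a rough sketch (assuming the data arrives from Kafka; any source processor works the same way), a single incoming flow can be fanned out by connecting the same relationship to two downstream branches, so every FlowFile is delivered to both targets:

ConsumeKafka
  ├── success ──> PutHDFS              (writes the records to an HDFS directory)
  └── success ──> PutDatabaseRecord    (writes the same records to Oracle)

PutDatabaseRecord needs a Record Reader matching your data format (JSON, CSV, Avro, ...) and a DBCPConnectionPool controller service configured with the Oracle JDBC driver; PutHDFS only needs the Hadoop configuration files and a target directory.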

Related

Import a small stream into Impala

We are currently working on a Big Data project.
The Big Data platform is Cloudera Hadoop.
As input to our system we have a small flow of data that we collect via Kafka (approximately 80 MB/h, continuously).
The messages are then stored in HDFS to be queried via Impala.
Our client does not want to separate the hot data from the cold data: after 5 minutes, the data must be accessible in the historical (cold) data. We chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g. INSERT INTO ...).
We are aware that this is not the recommended solution: each Impala insertion creates a small file (<10 KB) in HDFS.
We are looking for a way to insert a small stream into an Impala database without ending up with many small files.
What solution would you recommend?

how to efficiently move data from Kafka to an Impala table?

Here are the steps to the current process:
Flafka writes logs to a 'landing zone' on HDFS.
A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
Records from the staging table are added to a permanent Hive table (e.g. insert into permanent_table select * from staging_table).
The data, from the Hive table, is available in Impala by executing refresh permanent_table in Impala.
When I look at the process I've built, it "smells" bad: there are too many intermediate steps that impair the flow of data.
About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?
I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.
If you need to dump your Kafka data as-is to HDFS, the best option is Kafka Connect with the Confluent HDFS connector.
You can dump the data to Parquet files on HDFS that you can then load into Impala.
You'll want to use a TimeBasedPartitioner to roll new Parquet files every X milliseconds (tuned via the partition.duration.ms configuration parameter).
Adding something like this to your Kafka Connect configuration might do the trick:
# Don't flush less than 1000 messages to HDFS
flush.size=1000
# Dump to Parquet files
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=TimeBasedPartitioner
# One file every hour. If you change this, remember to change path.format to reflect the change
partition.duration.ms=3600000
# Directory layout for the time-based partitions
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm
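For context, those properties live inside an ordinary HDFS sink connector configuration. A hypothetical framing (connector name, topic, and namenode URL are placeholders, and exact property names can differ between connector versions) might look like:

name=kafka-to-hdfs
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
topics=events
hdfs.url=hdfs://namenode:8020
topics.dir=/user/kafka/topics
locale=en_US
timezone=UTC
# ...plus the flush.size, format.class, partitioner and path settings shown above

The TimeBasedPartitioner also needs the locale and timezone properties so it can resolve the path.format pattern.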
Answering this question in 2022, I would say the solution is to stream messages from Kafka into Kudu and query them through Impala, since Impala and Kudu are already tightly integrated.
Here is an example of an Impala table definition for Kudu:
CREATE EXTERNAL TABLE my_table
STORED AS KUDU
TBLPROPERTIES (
'kudu.table_name' = 'my_kudu_table'
);
Apache Kudu supports SQL inserts and uses its own storage format under the hood. Alternatively, you could use Apache Phoenix, which supports inserts and upserts (useful if you need exactly-once semantics) and uses HBase under the hood.
As long as Impala is your final way of accessing the data, you shouldn't care about the underlying formats.
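Since Impala speaks SQL directly against the Kudu-backed table, a minimal usage example (assuming the hypothetical Kudu table has an id primary key and a message column) is just:

INSERT INTO my_table (id, message) VALUES (1, 'first event');
UPSERT INTO my_table (id, message) VALUES (1, 'first event, corrected');

UPSERT gives you update-or-insert semantics on the Kudu primary key, which is the closest Impala/Kudu equivalent to the Phoenix upserts mentioned above.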

how to load data from hadoop to solr using sqoop?

I want to copy indexes created via MR jobs, which are now residing in HDFS, into Solr. Is that possible using Sqoop?
If yes, what JDBC connector or driver should I use? If not Sqoop, is there any other way to do this?
You may want to consider using Flume: https://flume.apache.org/FlumeUserGuide.html#flume-1-5-2-user-guide
MorphlineSolrSink: this sink is well suited for use cases that stream raw data into HDFS (via the HDFS sink) and simultaneously extract, transform and load the same data into Solr (via the MorphlineSolrSink).
For more, see https://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
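The fan-out described above can be expressed in a single Flume agent. A rough sketch, assuming a spooling-directory source replicated over two channels (agent, directory, and path names are placeholders; see the user guide for the full set of required properties):

agent.sources = src
agent.channels = c1 c2
agent.sinks = k1 k2

# Replicate every event from the source to both channels
agent.sources.src.type = spooldir
agent.sources.src.spoolDir = /var/flume/incoming
agent.sources.src.selector.type = replicating
agent.sources.src.channels = c1 c2

agent.channels.c1.type = memory
agent.channels.c2.type = memory

# Branch 1: raw data into HDFS
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://namenode/flume/events

# Branch 2: the same events transformed and indexed into Solr
agent.sinks.k2.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent.sinks.k2.channel = c2
agent.sinks.k2.morphlineFile = /etc/flume/conf/morphline.conf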

Moving data from an RDBMS to Hadoop using SQOOP and FLUME

I am in the process of learning Hadoop and am stuck on a few concepts about moving data from a relational database to Hadoop and vice versa.
I have transferred files from MySQL to HDFS using SQOOP import queries. The files I transferred were structured datasets, not server log data. I recently read that Flume is usually used for moving log files into Hadoop.
My questions are:
1. Can we use SQOOP for moving log files as well?
2. If yes, which of SQOOP or FLUME is preferred for log files, and why?
1) Sqoop can be used to transfer data between any RDBMS and HDFS. To use Sqoop, the data has to be structured, usually as described by the schema of the database from which the data is being imported or exported. Log files are not always structured (it depends on the source and type of log), so Sqoop is not used for moving log files; see the sketch after this answer for what a typical structured import looks like.
2) Flume can collect and aggregate data from many different kinds of customizable data sources. It gives more flexibility in controlling which specific events to capture and how to use them in a user-defined workflow before storing them in, say, HDFS.
I hope this clarifies the difference between Sqoop and Flume.
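For reference, a minimal structured import of the kind described in point 1 might look roughly like this (host, database, table, and target directory are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders

This only works because the orders table has a fixed schema that Sqoop can read from the database; there is no equivalent schema for free-form log lines, which is why Flume is the better fit there.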
SQOOP is designed to transfer data from an RDBMS to HDFS, whereas FLUME is for moving large amounts of log data.
The two are different and specialized for different purposes.
For example:
You can use SQOOP to import data via JDBC (which you cannot do with FLUME),
and
You can use FLUME to say something like "I want to tail 200 lines of a log file from this server" (see the sketch below).
Read more about FLUME here
http://flume.apache.org/
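To make the tailing example concrete, that behaviour comes from Flume's exec source; a minimal sketch (agent and file names are placeholders), with the channel and HDFS sink wired up as in any other Flume agent:

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/myapp/server.log
a1.sources.r1.channels = c1

Sqoop has no equivalent of this; it only sees what a JDBC query against the source schema can return.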
SQOOP transfers data not only from an RDBMS but also from NoSQL databases like MongoDB, and you can transfer the data directly to HDFS or to Hive.
When transferring data to Hive, you do not have to create the table beforehand: Sqoop takes the schema from the source database itself (see the sketch below).
Flume is used to fetch log data or streaming data.
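Building on the import command sketched earlier (connection details and table name are placeholders), adding --hive-import, and optionally --create-hive-table, is what makes Sqoop create and load the Hive table from the source schema:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username etl_user -P \
  --table customers \
  --hive-import \
  --create-hive-table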

Is there a way to access Avro data stored in HBase using Hive to do analysis?

My HBase table has rows that contain both serialized Avro (put there using HAvroBase) and string data. I know that a Hive table can be mapped to Avro data stored in HDFS for data analysis, but I was wondering if anyone has tried to map Hive to HBase table(s) that contain Avro data. Basically, I need to be able to query both Avro and non-Avro data stored in HBase, do some analysis, and store the result in a different HBase table. I need the capability to do this as a batch job as well. I don't want to write a Java MapReduce job to do this because we have constantly changing configurations and we need a scripted approach. Any suggestions? Thanks in advance!
You can write an HBase coprocessor to expose the Avro record as regular HBase qualifiers. You can see an implementation of that approach in Intel's panthera-dot.
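Once the Avro fields are exposed as plain qualifiers, the usual Hive-over-HBase mapping applies. A minimal sketch (the table, column family, and qualifier names are hypothetical), after which the table can be queried and joined like any other Hive table:

CREATE EXTERNAL TABLE hbase_events (rowkey STRING, user_id STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:user_id,cf:payload')
TBLPROPERTIES ('hbase.table.name' = 'events');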
