in our project we load data from one database(oracle) to another database(oracle) and run some batch level analytics to it.
as of now it is done via pl/sql jobs where we are pulling 3 years of data into destination db..
i have got a task to automate the flow using APache nifi..
cluster info:
1. APache hadoop cluster of 5 nodes
2. all the softwares are open source being used.
i have tried creating a flow where i am using a processor queryDatabaseTable -> putDatabaseRecord. but as far as i know that queryDatabaseTable outputs avro format..
i request to suggest me how to convert and what should be the processors sequence also i need to handle incremental loads/Change data capture. kindly suggest.
thanks in advance :)
PutDatabaseRecord configured with an Avro reader will be able to read the Avro produced by QueryDatabaseTable.
Related
I'm referring this below article;
https://community.cloudera.com/t5/Community-Articles/Create-Dynamic-Partitions-based-on-FlowFile-Con...
I'm trying to create pipeline in nifi while data coming realtime streaming based say some example kafka, while data put in hdfs in partitioned location, it may ended be with many small files at the same while querying im facing performance lag issue; can you please give some apporaches to resolve small files issue in nifi itself with orc file format;
I need to get data from csv files ( daily extraction from différent business Databasses ) to HDFS then move it to Hbase and finaly charging agregation of this data to a datamart (sqlServer ).
I would like to know the best way to automate this process ( using java or hadoops tools )
I'd echo the comment above re. Kafka Connect, which is part of Apache Kafka. With this you just use configuration files to stream from your sources, you can use KSQL to create derived/enriched/aggregated streams, and then stream these to HDFS/Elastic/HBase/JDBC/etc etc etc
There's a list of Kafka Connect connectors here.
This blog series walks through the basics:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
https://www.confluent.io/blog/blogthe-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
Little to no coding required? In no particular order
Talend Open Studio
Streamsets Data Collector
Apache Nifi
Assuming you can setup a Kafka cluster, you can try Kafka Connect
If you want to program something, probably Spark. Otherwise, pick your favorite language. Schedule the job via Oozie
If you don't need the raw HDFS data, you can load directly into HBase
I am learner of Apache nifi and currently expolering on "import mysql data to hdfs using apache nifi"
Please guide me on creating flow by providing an doc, end to end flow.
i have serached my sites, its not available.
To import MySQL data, you would create a DBCPConnectionPool controller service, pointing at your MySQL instance, driver, etc. Then you can use any of the following processors to get data from your database (please see the documentation for usage of each):
ExecuteSQL
QueryDatabaseTable
GenerateTableFetch
Once the data is fetched from the database, it is usually in Avro format. If you want it in another format, you will need to use some conversion processor(s) such as ConvertAvroToJSON. When the content of the flow file(s) is the way you want it, you can use PutHDFS to place the files into HDFS.
I am trying to do a POC in Hadoop for log aggregation. we have multiple IIS servers hosting atleast 100 sites. I want to to stream logs continously to HDFS and parse data and store in Hive for further analytics.
1) Is Apache KAFKA correct choice or Apache Flume
2) After streaming is it better to use Apache storm and ingest data into Hive
Please help with any suggestions and also any information of this kind of problem statement.
Thanks
You can use either Kafka or flume also you can combine both to get data into HDFSbut you need to write code for this There are Opensource data flow management tools available, you don't need to write code. Eg. NiFi and Streamsets
You don't need to use any separate ingestion tools, you can directly use those data flow tools to put data into hive table. Once table is created in hive then you can do your analytics by providing queries.
Let me know you need anything else on this.
I have a requirement to read huge CSV file from Kafka topic to Cassandra. I configured Apache Nifi to achieve the same.
Flow:
User does not have a control on Nifi setup. He only specifies the URL where the CSV is located. The web application writes the URL into kafka topic. Nifi fetches the file and inserts into Cassandra.
How will I know that Nifi has inserted all the rows from the CSV file into Cassandra? I need to let the user know that inserting is done.
Any help would be appreciated.
I found the solution.
Using MergeContent processor, all FlowFiles with the same value for "fragment.identifier" will be grouped together. Once MergeContent has defragmented them, we can notify the user.