I am a learner of Apache NiFi and am currently exploring "importing MySQL data to HDFS using Apache NiFi".
Please guide me on creating the flow by providing a doc for the end-to-end flow.
I have searched many sites; it is not available.
To import MySQL data, you would create a DBCPConnectionPool controller service, pointing at your MySQL instance, driver, etc. Then you can use any of the following processors to get data from your database (please see the documentation for usage of each):
ExecuteSQL
QueryDatabaseTable
GenerateTableFetch
Once the data is fetched from the database, it is usually in Avro format. If you want it in another format, you will need to use some conversion processor(s) such as ConvertAvroToJSON. When the content of the flow file(s) is the way you want it, you can use PutHDFS to place the files into HDFS.
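For illustration, here is a minimal sketch of the equivalent steps in plain Python, assuming sqlite3 stands in for MySQL and plain JSON stands in for the Avro-to-JSON conversion; in NiFi the DBCPConnectionPool, ExecuteSQL/QueryDatabaseTable, ConvertAvroToJSON, and PutHDFS components do this work for you.

```python
import json
import sqlite3

# Sketch only: sqlite3 stands in for MySQL, and plain JSON for the Avro
# output; in NiFi the DBCPConnectionPool + ExecuteSQL + ConvertAvroToJSON
# + PutHDFS processors perform the equivalent steps.

def fetch_table_as_json(conn, table):
    """Fetch all rows of a table as JSON-ready records, roughly what
    ExecuteSQL followed by ConvertAvroToJSON produces."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

records = fetch_table_as_json(conn, "users")
print(json.dumps(records))
```

The table name and sample data are made up; the point is only that each fetched row becomes one structured record before it is written out to HDFS.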
Related
In our project we load data from one database (Oracle) to another database (Oracle) and run some batch-level analytics on it.
As of now this is done via PL/SQL jobs, where we pull 3 years of data into the destination DB.
I have been given a task to automate the flow using Apache NiFi.
Cluster info:
1. Apache Hadoop cluster of 5 nodes
2. All the software being used is open source.
I have tried creating a flow using the processors QueryDatabaseTable -> PutDatabaseRecord, but as far as I know QueryDatabaseTable outputs Avro format.
Please suggest how to convert the data and what the processor sequence should be; I also need to handle incremental loads / change data capture. Kindly suggest.
Thanks in advance :)
PutDatabaseRecord configured with an Avro reader will be able to read the Avro produced by QueryDatabaseTable.
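For the incremental-load part: QueryDatabaseTable handles this through its "Maximum-value Columns" property — it remembers the largest value seen for that column and fetches only newer rows on each run. A minimal sketch of that idea, with sqlite3 and a plain dict standing in for the real database and NiFi's state management:

```python
import sqlite3

# Sketch of QueryDatabaseTable-style incremental loading: NiFi keeps the
# largest value it has seen for a configured "Maximum-value Column" (e.g.
# an auto-increment id or a last_updated timestamp) and fetches only rows
# beyond it on each run. sqlite3 and the `state` dict are stand-ins for
# the real database and NiFi's state management.

def incremental_fetch(conn, table, max_value_column, state):
    """Return only rows newer than the stored max value, then advance it."""
    last = state.get(table, 0)
    cur = conn.execute(
        f"SELECT * FROM {table} WHERE {max_value_column} > ? "
        f"ORDER BY {max_value_column}",
        (last,),
    )
    cols = [d[0] for d in cur.description]
    idx = cols.index(max_value_column)
    rows = cur.fetchall()
    if rows:
        state[table] = rows[-1][idx]  # remember the new high-water mark
    return rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 3.0)])

state = {}
first = incremental_fetch(conn, "orders", "id", state)   # both existing rows
conn.execute("INSERT INTO orders VALUES (3, 7.25)")
second = incremental_fetch(conn, "orders", "id", state)  # only the new row
```

The table and column names are hypothetical. Note this approach only captures inserts (and updates, if the column is a last-modified timestamp); it does not see deletes, which true CDC tooling would.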
I am trying to query GitHub data provided by the GHTorrent API using Hadoop. How can I ingest this much data (4-5 TB) into HDFS? Also, their databases are real time. Is it possible to process real-time data in Hadoop using tools such as Pig, Hive, or HBase?
Go through this presentation. It describes how you can connect to their MySQL or MongoDB instance and fetch data. Basically you have to share your public key; they will add that key to their repository and then you can SSH in. As an alternative, you can download their periodic dumps from this link.
Important links:
query MongoDB programmatically
connect to MySQL instance
For processing real-time data, you cannot do that using Pig or Hive. Those are batch-processing tools. Consider using Apache Spark.
I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka the correct choice, or Apache Flume?
2) After streaming, is it better to use Apache Storm and ingest the data into Hive?
Please help with any suggestions and also any information on this kind of problem statement.
Thanks
You can use either Kafka or Flume, and you can also combine both to get data into HDFS, but you need to write code for this. There are open-source data flow management tools available where you don't need to write code, e.g. NiFi and StreamSets.
You don't need to use any separate ingestion tools; you can directly use those data flow tools to put data into a Hive table. Once the table is created in Hive, you can do your analytics by running queries.
Let me know if you need anything else on this.
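Whichever transport you pick (Kafka, Flume, or NiFi), the IIS logs still have to be parsed into columns somewhere before a Hive table can be mapped onto them. A minimal sketch of that parsing step, assuming W3C-format logs; the sample field list and log lines below are made up for illustration:

```python
# Sketch of the parsing step that has to happen somewhere in the pipeline
# (in NiFi, Storm, or a Hive SerDe): turn W3C-format IIS log lines into
# structured records. Real IIS logs declare their columns in a "#Fields:"
# header line; the sample below uses an assumed subset of fields.

def parse_iis_log(lines):
    """Turn W3C-format log lines into dicts keyed by the #Fields header."""
    fields, records = [], []
    for line in lines:
        if line.startswith("#Fields:"):
            fields = line.split()[1:]      # column names declared by IIS
        elif line.startswith("#") or not line.strip():
            continue                       # skip other comments and blanks
        else:
            records.append(dict(zip(fields, line.split())))
    return records

sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2024-01-01 12:00:00 GET /index.html 200",
    "2024-01-01 12:00:01 POST /api/login 401",
]
rows = parse_iis_log(sample)
```

Each resulting record maps one-to-one onto a row of the eventual Hive table, which is what makes the "no separate ingestion tool" approach workable.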
I have a requirement to read a huge CSV file from a Kafka topic into Cassandra. I configured Apache NiFi to achieve this.
Flow:
The user does not have control over the NiFi setup. He only specifies the URL where the CSV is located. The web application writes the URL into a Kafka topic. NiFi fetches the file and inserts it into Cassandra.
How will I know that NiFi has inserted all the rows from the CSV file into Cassandra? I need to let the user know that the insert is done.
Any help would be appreciated.
I found the solution.
Using the MergeContent processor, all FlowFiles with the same value for "fragment.identifier" will be grouped together. Once MergeContent has defragmented them, we can notify the user.
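To make the mechanism concrete, here is a simplified sketch of the completeness check behind MergeContent's Defragment mode. The attribute names (fragment.identifier, fragment.index, fragment.count) are the real ones set by NiFi's splitting processors; the dict-based tracking is a stand-in for illustration only.

```python
# Sketch of the completeness check behind MergeContent's Defragment mode:
# flowfiles that share a fragment.identifier are held until fragment.count
# of them have arrived. The attribute names match the ones NiFi's splitting
# processors set; the dict-based tracking is a simplified stand-in.

def is_batch_complete(flowfiles, identifier):
    """True once every fragment of the given identifier has arrived."""
    group = [f for f in flowfiles if f["fragment.identifier"] == identifier]
    return bool(group) and len(group) == int(group[0]["fragment.count"])

flowfiles = [
    {"fragment.identifier": "csv-42", "fragment.index": "0", "fragment.count": "2"},
    {"fragment.identifier": "csv-42", "fragment.index": "1", "fragment.count": "2"},
]
complete = is_batch_complete(flowfiles, "csv-42")
```

In the flow itself, the success side of MergeContent (once the batch is whole) is what triggers the user notification.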
Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on the HDFS data?
If yes, kindly provide some links on how to get Cassandra data into HDFS.
You can write identity MapReduce code which takes input from CFS (the Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table onto it and run queries.
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case then you don't need to access it as HDFS; you need a Hive storage handler for using it against Cassandra.
For this you can use Tuplejump's project, CASH. The README provides the instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them, as you would from HDFS, you will need a file system that runs on Cassandra, like DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project Early Access Repo).
Disclaimer: I work for Tuplejump, Inc.
You can use Tuplejump Calliope Project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra for Cassandra 2.0 and Hadoop 2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary reads directly from SSTables