I've just started working on a Hadoop use case: analyzing CDRs (call detail records) in near real time.
The CDRs are encoded in ASN.1. A remote server is fed regularly with new CDRs. I'm wondering how to ingest the CDRs from this server into my cluster and decode them to generate CSV files that can be processed by Hive (or Spark Streaming, etc.).
Is Flume suited to ingesting this kind of data? When do you think I should decode the ASN.1, before or after ingesting? I have a program written in C for decoding ASN.1.
If Flume is a good fit, should I implement an Avro client on the server holding the initial data, or is there another well-suited method?
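For context, my understanding is that such an Avro client would look roughly like the sketch below (untested; the Flume agent host, port, and event body are placeholders):

    // Hypothetical sketch of a Flume Avro RPC client running on the CDR server.
    // Host, port, and the event payload are placeholders for illustration only.
    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class CdrAvroClient {
        public static void main(String[] args) throws EventDeliveryException {
            // Connect to the Avro source of a Flume agent running on a cluster edge node.
            RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414);
            try {
                // This would be a decoded CDR (e.g., one CSV line), or the raw ASN.1
                // bytes if decoding happens after ingestion.
                Event event = EventBuilder.withBody("decoded,cdr,fields", StandardCharsets.UTF_8);
                client.append(event); // send one event to the Flume agent
            } finally {
                client.close();
            }
        }
    }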
Related
We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two subscriber programs.
NiFi has processors to get data from an RDBMS (e.g., Oracle), such as QueryDatabaseTable and ExecuteSQL, and also from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or into HDFS (PutHDFS). Because the same outgoing relationship can be connected to several downstream processors, each of which receives its own copy of the flow file, you can get your data from one source and send it to multiple targets (e.g., both PutHDFS and PutDatabaseRecord) with NiFi.
We are currently on a Big Data project.
The Big Data platform is Cloudera's Hadoop distribution.
As input to our system we have a small stream of data that we collect via Kafka (approximately 80 MB/h, continuously).
The messages are then stored in HDFS to be queried via Impala.
Our client does not want to separate hot data from cold data: after 5 minutes, the data must be accessible in the historical (cold) data. We chose to have a single database.
To insert the data, we use the JDBC connector provided by the Impala API (e.g., INSERT INTO ...).
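Simplified, our insert path looks roughly like the sketch below (table and column names are made up for illustration, and the URL assumes the Impala JDBC driver on the default port 21050):

    // Simplified sketch of our per-message insert path (names and URL are illustrative).
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ImpalaJdbcInsert {
        public static void main(String[] args) throws SQLException {
            // Assumes the Impala JDBC driver is on the classpath; 21050 is Impala's default JDBC port.
            try (Connection conn = DriverManager.getConnection("jdbc:impala://impala-host:21050/default");
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO events (event_time, payload) VALUES (?, ?)")) {
                // One INSERT per Kafka message: this is what produces a tiny HDFS file per insert.
                ps.setString(1, "2017-01-01 00:00:00");
                ps.setString(2, "message body");
                ps.executeUpdate();
            }
        }
    }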
We are aware that this is not the recommended solution: each Impala insertion creates a small file (< 10 KB) in HDFS.
We are looking for a way to insert a small stream into an Impala database that avoids producing many small files.
What solution would you recommend?
I am very new to big data and I have a little confusion regarding Sqoop and Flume.
So I get the difference between Sqoop and Flume:
Sqoop is for transferring bulk data from an RDBMS.
Flume is for streaming data such as log files.
My confusion comes from the big data architecture I am looking at (of which I have no virtual copy), which groups structured data as transferred by Sqoop and unstructured data as streamed by Flume.
My question regarding that is: does that mean Flume is only for streaming?
What about high-frequency data? And does Flume support transferring unstructured data that is not log files (e.g., audio, video), or would Sqoop be able to handle that?
My final question is: can Sqoop work with federated data sources? If yes, with both real and virtual ones?
Thanks,
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including, but not limited to, network traffic data, social-media-generated data, email messages and pretty much any data source possible.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (it imports data, transforms the data in Hadoop MapReduce, and then exports the data).
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Source: sqoop-vs-flume-battle-of-the-hadoop
Reference: INGESTION AND STREAMING
Flume is efficient with streams, and if you just want to dump data from an RDBMS, why not use Sqoop?
If by high-frequency data you mean social media, then yes, Flume can handle it. Unstructured data: yes, Flume may handle that too.
Sqoop is essentially a tool to ingest data into HDFS from an RDBMS. Under the hood, it generates simple Java code that submits a query to the RDBMS and writes the result to HDFS. This means that you can import with Sqoop anything that can be accessed via a JDBC connection and that has a Java driver available. For this reason, you can't use it for files (like logs) or things like that.
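Conceptually, and only as a rough hand-written sketch (this is not the code Sqoop actually generates; the connection URL, table, and output path are placeholders), a Sqoop-style import boils down to running a JDBC query and writing the rows to HDFS:

    // Rough sketch of what a Sqoop import boils down to: run a JDBC query and
    // write the rows to a file in HDFS. URL, table, and paths are placeholders.
    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SqoopLikeImport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (Connection db = DriverManager.getConnection("jdbc:mysql://db-host/sales", "user", "pass");
                 Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, amount FROM orders");
                 FileSystem fs = FileSystem.get(conf);
                 BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                         fs.create(new Path("/user/etl/orders/part-00000")), StandardCharsets.UTF_8))) {
                while (rs.next()) {
                    // One CSV line per row; real Sqoop parallelizes this with MapReduce mappers.
                    out.write(rs.getLong("id") + "," + rs.getBigDecimal("amount"));
                    out.newLine();
                }
            }
        }
    }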
So Sqoop can't handle video or audio files.
Flume, instead, is used to monitor and ingest information in real time. You can ingest anything for which there is a Flume source available (https://flume.apache.org/FlumeUserGuide.html#flume-sources).
I am trying to build a big data platform to receive and store in Hadoop a large amount of heterogeneous data (documents, videos, images, sensor data, etc.), and then implement a classification process.
So what architecture can help me? I'm currently using:
VMware vSphere ESXi
Hadoop
HBase
Thrift
XAMPP
All of these are working fine, but I don't know how to receive and store a large amount of data, because I discovered that HBase is a column-oriented database and not a data warehouse.
You have to customize the solution for the type of Big Data (structured, semi-structured and unstructured):
You can use Hive/HBase for structured data if the total data size is <= 10 TB.
You can use Sqoop to import structured data from a traditional RDBMS such as Oracle, SQL Server, etc.
You can use Flume for processing unstructured data.
You can use a Content Management System (CMS) to handle unstructured and semi-structured data at terabyte or petabyte scale. If you are storing unstructured data, I prefer to store the data itself in the CMS and keep the metadata in a NoSQL database like HBase (see the sketch after this list).
To process streaming Big Data, you can use Pig.
Have a look at structured and unstructured data handling in Hadoop.
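For the metadata-in-HBase part, a minimal sketch with the HBase Java client could look like this (the table name, column family and qualifiers are hypothetical; the object itself stays in the CMS and HBase only holds a pointer plus descriptive metadata):

    // Minimal sketch: store metadata about an unstructured object (kept in the CMS)
    // as one HBase row. Table, column family, and qualifiers are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ContentMetadataWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("content_meta"))) {
                // Row key = document id; columns describe the object and point back to the CMS.
                Put put = new Put(Bytes.toBytes("doc-0001"));
                put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("type"), Bytes.toBytes("video"));
                put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("cms_uri"), Bytes.toBytes("cms://bucket/doc-0001"));
                table.put(put);
            }
        }
    }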
We are receiving huge amounts of XML data via an API. In order to handle this large data set, we were planning to do it in Hadoop.
I need your help in understanding how to efficiently bring the data into Hadoop. What tools are available? Is it possible to ingest this data in real time?
Please provide your inputs.
Thanks for your help.
Since you are receiving huge amounts of data, the appropriate way, IMHO, would be to use some aggregation tool like Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of data into your Hadoop cluster from different types of sources.
You can easily write custom sources based on your needs to collect the data. You might find this link helpful to get started. It presents a custom Flume source designed to connect to the Twitter Streaming API and ingest tweets in raw JSON format into HDFS. You could try something similar for your XML data.
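As a rough illustration only (untested; the HTTP/API call is left as a placeholder and the property name is made up), a pollable custom source has this general shape:

    // Rough skeleton of a pollable custom Flume source that fetches XML from an API
    // and hands each response to the channel as one event. The fetch itself is a placeholder.
    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Context;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class XmlApiSource extends AbstractSource implements Configurable, PollableSource {

        private String endpoint;

        @Override
        public void configure(Context context) {
            // Read the API endpoint from the agent configuration (e.g., a1.sources.r1.endpoint = ...).
            endpoint = context.getString("endpoint");
        }

        @Override
        public Status process() throws EventDeliveryException {
            // Placeholder: call the API here and get back one XML document as a String.
            String xml = fetchNextDocument(endpoint);
            if (xml == null) {
                return Status.BACKOFF;   // nothing new, back off before polling again
            }
            getChannelProcessor().processEvent(
                    EventBuilder.withBody(xml, StandardCharsets.UTF_8));
            return Status.READY;
        }

        private String fetchNextDocument(String endpoint) {
            // Hypothetical: whatever HTTP client / SDK call your API requires goes here.
            return null;
        }

        // Backoff tuning required by PollableSource in recent Flume versions.
        @Override
        public long getBackOffSleepIncrement() { return 1000L; }

        @Override
        public long getMaxBackOffSleepInterval() { return 5000L; }
    }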
You might also want to have a look at Apache Chukwa, which does the same thing.
HTH
Flume, Scribe and Chukwa are tools that can accomplish the above task. However, Flume is the most popular of the three. Flume has strong reliability and failover mechanisms available. Flume also has commercial support available from Cloudera, while the other two do not.
If your only objective is for the data to land in HDFS, you can keep writing the XML responses to disk following some convention such as data-2013-08-05-01.xml and write a daily (or hourly) cron job to import the XML data into HDFS. Running Flume would be overkill if you don't need streaming capabilities. From your question, it is not immediately obvious why you need Hadoop. Do you need to run MapReduce jobs?
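For illustration, the cron job could simply shell out to hdfs dfs -put, or call the HDFS Java API directly; a minimal sketch of the latter (local and HDFS paths are placeholders) would be:

    // Minimal sketch: copy locally accumulated XML files into HDFS.
    // A cron entry could run this class (or simply `hdfs dfs -put`) hourly/daily.
    // Local and HDFS paths are placeholders.
    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class XmlToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                File[] xmlFiles = new File("/var/data/xml").listFiles((dir, name) -> name.endsWith(".xml"));
                if (xmlFiles == null) {
                    return; // nothing to do
                }
                for (File f : xmlFiles) {
                    // copyFromLocalFile(delSrc, src, dst): delete the local copy once it is in HDFS
                    fs.copyFromLocalFile(true, new Path(f.getAbsolutePath()),
                            new Path("/data/xml/incoming/" + f.getName()));
                }
            }
        }
    }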
You want to put the data into Avro or your choice of protocol buffer for processing. Once you have a buffer matching the format of the text, the Hadoop ecosystem is much better at helping you process the structured data.
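For example, a hand-rolled sketch of writing parsed records into an Avro container file might look like the following (the schema and field names are invented placeholders for whatever your XML actually contains):

    // Sketch: write records parsed out of the XML into an Avro data file.
    // The schema and field names are invented placeholders.
    import java.io.File;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class XmlToAvro {
        private static final String SCHEMA_JSON =
                "{\"type\":\"record\",\"name\":\"ApiRecord\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"string\"},"
              + "{\"name\":\"body\",\"type\":\"string\"}]}";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            try (DataFileWriter<GenericRecord> writer =
                         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("records.avro"));
                // In practice each record would come from your XML parser.
                GenericRecord rec = new GenericData.Record(schema);
                rec.put("id", "42");
                rec.put("body", "<payload>...</payload>");
                writer.append(rec);
            }
        }
    }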
Hadoop was originally found most useful for taking one-line entries from log files and structuring/processing the data from there. XML is already structured and requires more processing power to get it into a Hadoop-friendly format.
A more basic solution would be to chunk the XML data and process it using Wukong (Ruby streaming) or a Python alternative. Since you are network-bound by the third-party API, a streaming solution might be more flexible and just as fast in the end for your needs.