There is 4mc-compressed data on HDFS. When I use Flink's
env.readTextFile("hdfs://127.0.0.1:8020/search_logs/4mc")
it cannot recognize the compression format. Does Flink have a related API that I can implement to handle this kind of compressed data?
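One direction I was thinking of trying (not sure if it is the right one): going through flink-hadoop-compatibility and reusing the Hadoop InputFormat that the 4mc library ships. A rough sketch follows; the exact 4mc class/package name is an assumption on my part and needs to be checked against the 4mc docs.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class Read4mcWithFlink {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // FourMcTextInputFormat is meant to be the Hadoop input format shipped
        // with the 4mc library; the package name below is a guess and may
        // differ depending on the 4mc version.
        DataSet<Tuple2<LongWritable, Text>> lines = env.createInput(
                HadoopInputs.readHadoopFile(
                        new com.fing.mapreduce.FourMcTextInputFormat(),
                        LongWritable.class, Text.class,
                        "hdfs://127.0.0.1:8020/search_logs/4mc"));

        lines.first(5).print();
    }
}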
Thanks.
I am very new to big data and I have a little confusion regarding Sqoop and Flume.
So I get the difference between Sqoop and Flume:
Sqoop is for transferring bulk data from an RDBMS
Flume is for streaming data such as log files
My confusion comes from the big data architecture I am looking at (of which I have no virtual copy): it groups structured data, which is transferred by Sqoop, and unstructured data, which is streamed by Flume.
My question regarding that is: does that mean Flume is only for streaming?
What about high-frequency data? And does Flume support the transfer of unstructured data that is not log files (e.g. audio, video), or would Sqoop be able to handle that?
My final question is: can Sqoop work with federated data sources? If yes, with both real and virtual ones?
Thanks,
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data, including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (it imports data, transforms the data in Hadoop MapReduce, and then exports the data).
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Source: sqoop-vs-flume-battle-of-the-hadoop
Reference: INGESTION AND STREAMING
Flume is efficient with streams, and if you just want to dump data from an RDBMS, why not use Sqoop?
If by high-frequency data you mean social media, then yes, Flume can handle it. Unstructured data: yes, Flume may handle that too.
Sqoop is essentially a tool to ingest data into HDFS from an RDBMS. Under the hood, it generates simple Java code which submits a query to the RDBMS and writes the result to HDFS (a toy sketch of this is at the end of this answer). This means that with Sqoop you can import anything that can be accessed via a JDBC connection and that has a Java driver available. For this reason, you can't use it for files (like logs) or things like that.
So Sqoop can't handle video or audio files.
Flume, instead, is used to monitor and ingest information in real time. You can ingest anything for which a Flume source is available (https://flume.apache.org/FlumeUserGuide.html#flume-sources).
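To make the "under the hood" description above concrete, here is a toy, single-threaded sketch of what a Sqoop import essentially boils down to: run a JDBC query and write the rows to a file on HDFS. The connection details, table, and output path are made up.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ToySqoopImport {
    public static void main(String[] args) throws Exception {
        // Write target on HDFS (path is a placeholder).
        FileSystem fs = FileSystem.get(new Configuration());

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://dbhost/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM users");
             FSDataOutputStream out = fs.create(new Path("/user/hadoop/users.csv"))) {

            // Dump each row as one CSV line, like a Sqoop text import would.
            while (rs.next()) {
                out.writeBytes(rs.getLong(1) + "," + rs.getString(2) + "\n");
            }
        }
        // Real Sqoop does the same thing in parallel map tasks,
        // splitting the query on a key column (--split-by).
    }
}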
I have to build a tool which will move our data from HBase (HFiles) to HDFS in Parquet format.
Please suggest the best way to move data from HBase tables to Parquet tables.
We have to move 400 million records from HBase to Parquet. How can this be achieved, and what is the fastest way to move the data?
Thanks in advance.
Regards,
Pardeep Sharma.
Please have a look into this project: tmalaska/HBase-ToHDFS,
which reads an HBase table and writes the output as Text, Seq, Avro, or Parquet.
Example usage for Parquet:
Exports the data to Parquet
hadoop jar HBaseToHDFS.jar ExportHBaseTableToParquet exportTest c export.parquet false avro.schema
I recently opensourced a patch to HBase which tackles the problem you are describing.
Have a look here: https://github.com/ibm-research-ireland/hbaquet
I've just started working on a Hadoop use case: analyzing CDRs in near real time.
The CDRs are encoded in ASN.1. A remote server is fed regularly with CDRs. I'm wondering how to ingest the CDRs from this server into my cluster and decode them to generate CSV files that can be processed by Hive (or Spark Streaming, etc.).
Is Flume suited to ingesting this kind of data? When do you think I should decode the ASN.1, before or after ingesting? I have a program written in C for decoding ASN.1.
If Flume is suited to ingesting the data, should I implement an Avro client on the server containing the initial data, or is there another well-suited method?
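For what it's worth, my current understanding of the Avro client option is something like the sketch below, using Flume's standard RPC client API; the host, the port, and the idea of sending already-decoded CSV lines (one per event) are just my assumptions.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class CdrFlumeClient {
    public static void main(String[] args) {
        // Connect to a Flume agent whose Avro source listens on this host/port
        // (both are placeholders).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent-host", 41414);
        try {
            // In the real setup this line would come from the C ASN.1 decoder,
            // one decoded CDR per event.
            String csvLine = "imsi;calling;called;duration;timestamp";
            Event event = EventBuilder.withBody(csvLine, StandardCharsets.UTF_8);
            client.append(event);
        } catch (EventDeliveryException e) {
            e.printStackTrace();
        } finally {
            client.close();
        }
    }
}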
I want to copy indexes created via MR jobs, which currently reside in HDFS, into Solr. Is this possible using Sqoop?
If yes, what JDBC connector or driver should I use? If not Sqoop, is there any other way to do this?
You may want to consider using Flume: https://flume.apache.org/FlumeUserGuide.html#flume-1-5-2-user-guide
MorphlineSolrSink: This sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink).
For more: https://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
In the Hadoop MapReduce programming model, when we are processing files, is it mandatory to keep the files in the HDFS file system, or can I keep the files in other file systems and still get the benefit of the MapReduce programming model?
Mappers read input data from an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or HDFS. (By default, data is read from HDFS and the results of the MapReduce job are stored in HDFS as well.) You can write a custom InputFormat when you want your data to be read from an alternative data source other than HDFS.
TableInputFormat would read data records directly from HBase, and DBInputFormat would access data from relational databases. You could also imagine a system where data is streamed to each machine over the network on a particular port; the InputFormat reads data from the port and parses it into individual records for mapping.
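As an illustration of such an alternative input source, here is a rough sketch of a job that takes its input from HBase via TableInputFormat instead of from files in HDFS; the table name, scan tuning, and output path are placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HBaseSourceJob {

    public static class RowKeyMapper extends TableMapper<Text, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context ctx)
                throws IOException, InterruptedException {
            // Emit just the row key; a real job would read columns from `value`.
            ctx.write(new Text(Bytes.toString(row.get())), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-as-mapreduce-input");
        job.setJarByClass(HBaseSourceJob.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // reasonable scanner caching for MR jobs
        scan.setCacheBlocks(false);  // don't pollute the region server block cache

        // TableInputFormat is wired up by this helper; "mytable" is a placeholder.
        TableMapReduceUtil.initTableMapperJob(
                "mytable", scan, RowKeyMapper.class, Text.class, NullWritable.class, job);

        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hbase_rowkeys"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}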
However, in your case, you have data in an ext4 filesystem on one or more servers. In order to conveniently access this data within Hadoop, you'd have to copy it into HDFS first. This way you benefit from data locality when the file chunks are processed in parallel.
I strongly suggest reading the tutorial from Yahoo! on this topic for detailed information. For collecting log files for MapReduce processing, also take a look at Flume.
You can keep the files elsewhere, but you'd lose the data-locality advantage.
For example, if you're using AWS, you can store your files in S3 and access them directly from MapReduce code, Pig, Hive, etc.
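A minimal driver sketch, assuming the S3A connector and credentials are already configured; the bucket and paths are placeholders, and the mapper/reducer setup is omitted since only the paths differ from an HDFS-based job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class S3InputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mapreduce-over-s3");
        job.setJarByClass(S3InputJob.class);

        // No mapper/reducer set: the identity mapper just passes
        // (offset, line) records through, which is enough to show the point.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input and output live in S3 rather than HDFS; the job reads them
        // directly, at the cost of losing data locality.
        FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/logs/"));
        FileOutputFormat.setOutputPath(job, new Path("s3a://my-bucket/mr-output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}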
In order to use Apache Hadoop, you must have your files in HDFS, the Hadoop file system. Though there are different abstract types of HDFS, like AWS S3, these are all, at their basic level, HDFS storage.
The data needs to be in HDFS because HDFS distributes the data across your cluster. During the mapping phase, each Mapper goes through the data stored on its node and then sends it to the proper node running the reducer code for the given chunk.
You can't have Hadoop MapReduce without using HDFS.