Persisting unstructured data to Hadoop using Spark Streaming

I have an ingest pipeline created with Spark Streaming, and I would like to store the RDDs in Hadoop as one large unstructured (JSONL) datafile to simplify future analysis.
What is the best approach for persisting a stream to Hadoop without ending up with very large numbers of small files? (Hadoop does not handle those well, and they complicate analysis workflows.)

First, I would suggest using a persistence layer that can handle this, such as Cassandra. But if you are dead set on HDFS, the mailing list already has an answer:
You can use the FileUtil.copyMerge API (from hadoop fs) and point it at the folder where saveAsTextFiles is writing its part files.
Suppose your directory is /a/b/c, use:
FileUtil.copyMerge(srcFS, new Path("/a/b/c"),      // source FileSystem and part-file directory
                   dstFS, new Path("/a/b/c.txt"),  // destination FileSystem and merged file
                   true,                           // delete the original directory
                   conf, null);
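For completeness, here is a minimal, self-contained sketch of that merge step; the paths are the same placeholders as above, and it assumes the source and destination are the same HDFS:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeStreamOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);  // source and destination are the same HDFS here

        Path partsDir = new Path("/a/b/c");    // directory written by saveAsTextFiles (part-00000, ...)
        Path merged = new Path("/a/b/c.txt");  // single merged file for later analysis

        // true => delete the part-file directory once the merge succeeds
        FileUtil.copyMerge(fs, partsDir, fs, merged, true, conf, null);
    }
}

Note that FileUtil.copyMerge was removed in Hadoop 3, so this particular helper applies to Hadoop 2.x clusters.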

Related

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single node Hadoop cluster using Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation you have time-series data. Hadoop with HDFS itself is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend filesystem; it is good for random access.
Also, for your need to parse and rearrange the data, you can make use of Hadoop's MapReduce. HBase has built-in support for this: HBase can be used as the input/output of a MapReduce job.
You can get basic information from here. For a better understanding, try the HBase: The Definitive Guide or HBase in Action books.
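As a rough illustration of the parsing step, here is a minimal mapper sketch that splits the raw Flume records into fields; the field layout and the choice of req-id as the output key are assumptions based on the format shown in the question:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Splits "timestamp req-id level module-name message" into tab-separated fields,
// keyed by req-id. Assumes the timestamp is a single whitespace-free token.
public class LogParseMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\\s+", 5);  // the message may contain spaces
        if (fields.length < 5) {
            return;  // skip malformed records
        }
        String timestamp = fields[0], reqId = fields[1], level = fields[2],
               module = fields[3], message = fields[4];
        context.write(new Text(reqId),
                new Text(timestamp + "\t" + level + "\t" + module + "\t" + message));
    }
}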

Hadoop HDFS dependency

In the Hadoop MapReduce programming model, when we are processing files, is it mandatory to keep the files in the HDFS file system, or can I keep the files in other file systems and still have the benefit of the MapReduce programming model?
Mappers read input data from an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or HDFS. (By default, data is read from HDFS and the results of the MapReduce job are stored in HDFS as well.) You can write a custom InputFormat when you want your data to be read from an alternative data source other than HDFS.
TableInputFormat would read data records directly from HBase and DBInputFormat would access data from relational databases. You could also imagine a system where data is streamed to each machine over the network on a particular port; the InputFormat reads data from the port and parses it into individual records for mapping.
However, in your case, you have data in an ext4 filesystem on a single server or multiple servers. In order to conveniently access this data within Hadoop you'd have to copy it into HDFS first. This way you will benefit from data locality when the file chunks are processed in parallel.
I strongly suggest reading the tutorial from Yahoo! on this topic for detailed information. For collecting log files for MapReduce processing, also take a look at Flume.
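To make the InputFormat's role concrete, here is a minimal, hypothetical driver sketch; the paths are placeholders and the mapper is left as the default identity mapper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        job.setJarByClass(InputFormatDemo.class);

        // The InputFormat decides where records come from. TextInputFormat reads
        // line-oriented files from HDFS (or any filesystem URI Hadoop understands).
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));  // placeholder HDFS directory

        // Swapping the InputFormat class is what changes the data source, e.g.
        // TableInputFormat (HBase), DBInputFormat (relational databases),
        // or your own custom InputFormat.

        // Map-only identity job: records pass through unchanged.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}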
You can keep the files elsewhere but you'd lose the data locality advantage.
For example, if you're using AWS, you can store your files on S3 and access them directly from MapReduce code, Pig, Hive, etc.
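As a small sketch of that case (the bucket name is a placeholder, and it assumes the S3A connector and AWS credentials are configured on the cluster):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class S3InputExample {
    // Adds an S3 location as job input; "my-bucket" is a hypothetical bucket name.
    public static void addS3Input(Job job) throws IOException {
        FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/logs/"));
    }
}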
In order to use Apache Hadoop you must have your files in HDFS, the Hadoop file system. Though there are different storage backends that can stand in for HDFS, like AWS S3, at their basic level these all present themselves as HDFS-style storage.
The data needs to be in HDFS because HDFS distributes the data across your cluster. During the mapping phase, each mapper goes through the data stored on its node and then sends it to the proper node running the reducer code for the given chunk.
You can't have Hadoop MapReduce without using HDFS.

Understanding more about Hadoop/HDFS Data Loading

I'm researching Hadoop and MapReduce (I'm a beginner!) and have a simple question regarding HDFS. I'm a little confused about how HDFS and MapReduce work together.
Let's say I have logs from System A, Tweets, and a stack of documents from System B. When this is loaded into Hadoop/HDFS, is it all thrown into one big HDFS bucket, or would there be three areas (for want of a better word)? If so, what is the correct terminology?
The question stems from understanding how to execute a MapReduce job. If I only wanted to concentrate on the logs, for example, can this be done, or are all jobs executed on the entire content stored on the cluster?
Thanks for your guidance!
TM
HDFS is a file system. As in your local filesystem, you can organize all your logs and documents into multiple files and directories. When you run MapReduce jobs you usually specify a directory with your input files, so it is possible to execute a job only on the logs from System A or the documents from System B.
However, the input for your mappers is specified by the InputFormat. Most implementations originate from FileInputFormat, which reads files, but it is possible to implement custom InputFormats in order to read data from other sources. You can find an explanation of input and output formats in this Hadoop tutorial.
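As a hypothetical illustration (the directory names are placeholders), a job only ever reads the paths you add as input:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class LogsOnlyInput {
    // Suppose the data is organized as:
    //   /data/system-a-logs/   /data/tweets/   /data/system-b-docs/
    // A job that should only process the System A logs just adds that directory;
    // the tweets and documents are never read by this job.
    public static void configureInput(Job job) throws IOException {
        FileInputFormat.addInputPath(job, new Path("/data/system-a-logs"));
    }
}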

Can Hadoop MapReduce run over other filesystems?

I heard that for MapReduce jobs the input need not be in HDFS; it can be on another file system. Can someone please give me more details on this?
I am a little confused about it. In standalone mode, the data can be on the local file system. But in cluster mode, how can we point MapReduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs that target HBase using its TableInputFormat pull records over the network from HBase nodes as inputs to their map tasks. DBInputFormat can be used to pull data from a SQL database into a job. You could also build an InputFormat that does something like read data off an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if the data is local to the nodes where the job is being run, since disk throughput > network throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a number of InputFormats, and it's not difficult to write a custom InputFormat either, say to provide a proprietary format as input to a job.
Along the same lines, Hadoop provides a number of OutputFormats, and it shouldn't be difficult to write a custom OutputFormat either.
Here is a nice article on the DBInputFormat.
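As a rough sketch of what wiring up DBInputFormat involves (the JDBC driver, connection details, table, and columns are all hypothetical placeholders):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbInputExample {

    // One row of the hypothetical "events" table.
    public static class EventRecord implements Writable, DBWritable {
        long id;
        String message;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            message = rs.getString("message");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, message);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            message = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(message);
        }
    }

    public static Job buildJob() throws IOException {
        Configuration conf = new Configuration();
        // JDBC driver, URL, user, and password are placeholders.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/analytics", "user", "password");

        Job job = Job.getInstance(conf, "db-input-example");
        job.setJarByClass(DbInputExample.class);
        job.setInputFormatClass(DBInputFormat.class);
        // Read the "events" table, ordered by id, pulling the two listed columns.
        DBInputFormat.setInput(job, EventRecord.class, "events", null, "id", "id", "message");
        return job;
    }
}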
Another way to achieve this is to put files into HDFS that contain information about where the real data is. The mapper will get this information and pull the real data for processing.
For example, we can have several files containing URLs of the data to be processed.
What we lose in this case is data locality; otherwise it is fine.
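A minimal sketch of that pattern, assuming each input line is just a URL; the fetch-and-emit logic stands in for whatever processing is actually needed:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input record is a URL; the mapper fetches the referenced data over the network
// and emits its lines keyed by the source URL. Data locality is lost, as noted above.
public class UrlPullMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text urlLine, Context context)
            throws IOException, InterruptedException {
        String url = urlLine.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                context.write(new Text(url), new Text(line));
            }
        }
    }
}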

HDFS: Using HDFS API to append to a SequenceFile

I've been trying to create and maintain a Sequence File on HDFS using the Java API without running a MapReduce job as a setup for a future MapReduce job. I want to store all of my input data for the MapReduce job in a single Sequence File, but the data gets appended over time throughout the day. The problem is, if a SequenceFile exists, the following call will just overwrite the SequenceFile instead of appending to it.
// fs and conf are set up for HDFS, not as a LocalFileSystem
seqWriter = SequenceFile.createWriter(fs, conf, new Path(hdfsPath),
        keyClass, valueClass, SequenceFile.CompressionType.NONE);
seqWriter.append(new Text(key), new BytesWritable(value));
seqWriter.close();
Another concern is that I cannot maintain a file of my own format and turn the data into a SequenceFile at the end of the day as a MapReduce job could be launched using that data at any point.
I cannot find any other API call to append to a SequenceFile and maintain its format. I also cannot simply concatenate two SequenceFiles because of their formatting needs.
I also wanted to avoid running a MapReduce job for this since it has high overhead for the little amount of data I'm adding to the SequenceFile.
Any thoughts or work-arounds? Thanks.
Support for appending to existing SequenceFiles was added in the Apache Hadoop 2.6.1 and 2.7.2 releases onwards, via the enhancement JIRA https://issues.apache.org/jira/browse/HADOOP-7139.
For example usage, see the test case: https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/io/TestSequenceFileAppend.java#L63-L140
CDH5 users can find the same ability in version CDH 5.7.1 onwards.
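With those versions, an append-capable writer looks roughly like this; the path, key, and value are placeholders, and the key piece is the Writer.appendIfExists(true) option:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileAppender {
    public static void appendRecord(Configuration conf, String hdfsPath,
                                    String key, byte[] value) throws Exception {
        // appendIfExists(true) makes the writer append to an existing SequenceFile
        // instead of overwriting it (Hadoop 2.6.1 / 2.7.2 and later).
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path(hdfsPath)),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE),
                SequenceFile.Writer.appendIfExists(true))) {
            writer.append(new Text(key), new BytesWritable(value));
        }
    }
}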
Sorry, currently the Hadoop FileSystem does not support appends. But there are plans for it in a future release.
