Huge files in hadoop: how to store metadata? - hadoop

I have a use case that requires uploading several terabytes of text files as sequence files on HDFS.
These text files come in several layouts, ranging from 32 to 62 columns (metadata).
What would be a good way to upload these files along with their metadata:
creating a key/value class per text file layout and using it to create and upload the sequence files, or
creating a SequenceFile.Metadata header in each file being uploaded as a sequence file individually?
Any input is appreciated!
Thanks

I prefer storing metadata with the data and then designing your application to be metadata driven, as opposed to embedding metadata in the design or implementation of your application, which means that updates to the metadata require updates to your app. Of course, there are limits to how far you can take a metadata-driven application.
You can embed the metadata with the data itself, for example by using an encoding scheme like JSON, or you can keep the metadata alongside the data, for example by having records in the SeqFile dedicated to describing the metadata, perhaps using reserved tags for the keys so as to give the metadata its own namespace, separate from the namespace used by the keys of the actual data.
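If you go the SequenceFile.Metadata route mentioned in the question, a minimal sketch of writing one file with a per-layout metadata header could look like the following (the path, tag names, and key/value choices are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MetadataSeqFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Per-file metadata header describing the layout of this particular source file
        SequenceFile.Metadata metadata = new SequenceFile.Metadata();
        metadata.set(new Text("layout.version"), new Text("v32"));   // hypothetical tag
        metadata.set(new Text("layout.columns"), new Text("32"));    // hypothetical tag

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/data/input-0001.seq")),  // hypothetical path
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.metadata(metadata))) {
            // key = record id, value = the raw line; both hypothetical choices
            writer.append(new Text("rec-0001"), new Text("col1|col2|col3|..."));
        }
    }
}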
As for the recommendation of whether this should be packaged into separate Hadoop files, bear in mind that Hadoop can be instructed to split a file into splits (the input for the map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single HDFS file is that it more closely resembles the unit of containment of your original data.
As for the recommendation about key types (i.e. whether to use Text vs. binary), consider that the key will be compared against other keys. The more compact the key, the faster the comparison, so if you can store a dense version of the key, that is preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same, it will also help performance. For instance, serializing a Java class as the key is not recommended, because the serialized stream begins with the package name of your class, which is likely to be the same for every other class, and thus for every other key, in the file.

If you want the data and its metadata bundled together, then the Avro format is the appropriate one. It also allows schema evolution.

The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data to be the key; the data itself is the value, as Text. SequenceFiles are designed for storing key/value pairs; if that's not what your data is, then don't use a SequenceFile. You could just upload unprocessed text files and feed those to Hadoop.
For best performance, do not make each file terabytes in size. The map stage of Hadoop runs one map task per input split (for a file that cannot be split, that means one task per file), so you want more splits than you have CPU cores in your Hadoop cluster. Otherwise you will have one CPU doing 1 TB of work and a lot of idle CPUs. A good file size is probably 64-128 MB (around the HDFS block size), but for best results you should measure this yourself.

Related

Data storage format for unstructured data rows on HDFS

We are consuming very large amounts of data that need to be written as fast as we receive them, and we are already using HDFS, so we would prefer to keep using it. The data is almost unstructured, and we will run basic queries on it only rarely. The data is flat, with some fields, each row representing a record, for example:
key1=str key2=30.3 key3=longtexthere
Another data row:
key1=3 key5=abc
SequenceFile seemed the most natural fit, but I could not find how to store multiple rows in a single SequenceFile.
Currently, in our temporary solution, we have multiple writers that write to multiple text files. When querying is needed, we read them in parallel. However, the current text files contain thousands of rows, and I don't think creating a single SequenceFile for each row would be feasible; it would incur a lot of overhead for storing metadata, and querying would mean reading far too many files at once.
I think the problem could be solved by using HBase or Cassandra, a columnar database, but we are more or less required to use HDFS. Am I missing something with SequenceFiles, or should we really use a columnar database?
So the sequence file format is like this:
<key, value>
<key, value>
<key, value>
...
where the key is a WritableComparable and the value is a Writable.
Now what a lot of people are doing - and you could do the same - is:
Only use the key OR the value 'column'
Implement a custom Writable which wraps a set of other Writables (call it record, row, ...)
That way you can model everything you want. That record Writable could have a fixed schema, e.g. containing 'IntWritable, Text, IntWritable, IntWritable' (depending on your fields). Or, in case you don't want to support different types, you could use the existing ArrayWritable as your 'record'.
Knowing the schema of each file (e.g. by putting it into the metadata of the sequence file) will allow you to do reads on files with different/evolved schemas.
So it's a lot of handcrafting, but you can build a very efficient and flexible structure (see the sketch below). I've never used it myself, but take a look at http://pangool.net/userguide/schemas.html; I think they have already modeled such a flexible record/tuple schema on top of sequence files.
Bottom line: I think you can achieve what you want with sequence files.
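As a rough sketch of the custom 'record' Writable idea, the field layout here (an int plus two text columns) is purely hypothetical and would have to match your own schema:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// A 'record' Writable that wraps a fixed set of other Writables.
public class RecordWritable implements Writable {
    private final IntWritable id = new IntWritable();   // hypothetical field
    private final Text name = new Text();               // hypothetical field
    private final Text payload = new Text();            // hypothetical field

    public void set(int idVal, String nameVal, String payloadVal) {
        id.set(idVal);
        name.set(nameVal);
        payload.set(payloadVal);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the wrapped Writables in a fixed order
        id.write(out);
        name.write(out);
        payload.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order as write()
        id.readFields(in);
        name.readFields(in);
        payload.readFields(in);
    }
}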
However, I would recommend also having a look at columnar file formats like Parquet or ORC. Those come with their own tradeoffs, but you get a higher compression ratio and selective reads (column projection, filter pushdown). Also, you don't have to invent the schema/tuple structure yourself.

What are binary types in hadoop?

Hadoop - The Definitive Guide says
If you want to log binary types, plain text isn’t a suitable format.
My questions are: 1. Why not? 2. What are binary types?
and further:
Hadoop's SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged.
Why can't a text file be used, and why is a sequence file required?
On the same page, it was quoted that:
For some applications, you need a specialized data structure to hold your data. For doing
MapReduce-based processing, putting each blob of binary data into its own file doesn’t
scale, so Hadoop developed a number of higher-level containers for these situations.
E.g. assume that you are uploading images to Facebook and you have to remove duplicate images. You can't store an image in text format. What you can do is get the MD5SUM of the image file, and if that MD5SUM already exists in the system, simply discard the duplicate upload. In your text file, you can simply have fields like "Date:" and "Number of images uploaded". The image itself can be stored outside of HDFS, for example on a CDN or some other web server.
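A minimal sketch of that MD5 de-duplication check; the in-memory seenHashes set is a hypothetical stand-in for wherever you actually keep already-seen hashes:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ImageDedup {
    // Hypothetical in-memory stand-in for the store of already-seen hashes
    private static final Set<String> seenHashes = new HashSet<>();

    public static boolean isDuplicate(String imagePath) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(imagePath));
        byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);

        // Hex-encode the digest so it can be compared/stored as a string
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String md5 = hex.toString();

        // Set.add returns false if the hash was already present, i.e. a duplicate image
        return !seenHashes.add(md5);
    }
}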

Merge multiple document categorizer models in OpenNLP

I am trying to write a map-reduce implementation of a Document Categorizer using OpenNLP.
During the training phase, I plan to read a large number of files and create a model file as the result of the map-reduce computation (maybe a chain of jobs). I will distribute the files to different mappers and would end up with a number of model files as the result of this step. Now, I wish to reduce these model files to a single model file to be used for classification.
I understand that this is not the most intuitive of use cases, but I am ready to get my hands dirty and extend/modify the OpenNLP source code, assuming it is possible to tweak the maxent algorithm to work this way.
In case this seems too far-fetched, I would appreciate suggestions for doing this by generating document samples corresponding to the input files as the output of the map step, and reducing them to model files by feeding them to the document categorizer trainer.
Thanks!
I've done this before, and my approach was not to have each reducer produce the model, but rather to only produce the properly formatted data.
Rather than using a category as a key (which separates all the categories), just use a single key and make the value the proper format (category, sample, newline). Then, in the single reducer, you can read that data in as a string via a ByteArrayInputStream and train the model. Of course this is not the only way. You wouldn't have to modify OpenNLP at all to do this.
Simply put, my recommendation is to use a single job that behaves like this (a rough sketch follows below):
Map: read in your data and create the category label / sample pair. Use a key called 'ALL' and context.write each pair with that key.
Reduce: use a StringBuilder to concatenate all the category/sample pairs into the proper training format. Convert the string into a ByteArrayInputStream, feed it to the training API, and write the model somewhere.
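A rough sketch of that single job; the class names and the assumed tab-separated "category, document text" input layout are hypothetical, and the actual OpenNLP trainer call is only indicated by a comment since its exact API depends on the OpenNLP version:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DoccatTraining {

    public static class SampleMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text ALL = new Text("ALL");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical input layout: "<category>\t<document text>"
            String[] parts = value.toString().split("\t", 2);
            if (parts.length == 2) {
                // Emit "category sample" under a single key so one reducer sees everything
                context.write(ALL, new Text(parts[0] + " " + parts[1]));
            }
        }
    }

    public static class TrainReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate all "category sample" lines into the doccat training format
            StringBuilder training = new StringBuilder();
            for (Text v : values) {
                training.append(v.toString()).append('\n');
            }
            ByteArrayInputStream in =
                    new ByteArrayInputStream(training.toString().getBytes(StandardCharsets.UTF_8));
            // Feed 'in' to OpenNLP's document categorizer trainer here and
            // write the resulting model out to HDFS (API call omitted; version-dependent).
        }
    }
}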
A problem may occur if your sample data is too huge to send to one node. If so, you can write the values to a NoSQL DB and read them in from a beefier training node. Or you can use randomization in your mapper to produce many keys and build many models, and then at classification time write a wrapper that tests the data across all of them and gets the best from each one... Lots of options.
HTH

Modeling Data in Hadoop

Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables closely follow a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing CSV files.
I'm looking at better ways to store these files before kicking off the MR jobs. Should I follow some kind of model, or build an aggregate, before working on the MR jobs? I'm basically looking for ways to store related data together.
Most things I have found by searching are about storing trivial CSV files and reading them with opencsv. I'm looking for something a bit more involved, and not just for CSV files. If moving to another format works better than CSV, that is no problem.
It boils down to: how best to store a bunch of related data in HDFS so as to have a good experience with MR?
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data + schema in the same file, yet is compact and efficient for fast serialization. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports it for both reading and writing, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
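To illustrate the data-plus-schema point, here is a small sketch that opens an Avro file produced by Sqoop and prints the schema embedded in it (the file name is hypothetical):

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroPeek {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of a Sqoop-produced Avro file
        File avroFile = new File("part-m-00000.avro");

        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            // The writer's schema travels with the data, so no external definition is needed
            System.out.println(reader.getSchema().toString(true));

            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}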
Storing these files in CSV is fine, since you can process them using the plain text input/output formats and can also read them through Hive using a specific delimiter. You can change the delimiter if you don't like the comma, e.g. to a pipe ("|"); that's what I do most of the time. You generally need to have large files in Hadoop, but if the data is large enough to partition, and each partition is on the order of a few hundred gigs, then it would be good to split these files into separate directories based on your partition column.
It is also usually a better idea to have most of the columns in a single table rather than in many small normalized tables, but that depends on your data size. Also, make sure that whenever you copy, move, or create data you do all the constraint checks in your applications, as it will be difficult to make small changes to the table later on: you would need to rewrite the complete file for even a small change.
Hive's partitioning and bucketing concepts can be used effectively to put similar data together (not on the same nodes, but in the same files and folders) based on a particular column. Here are some nice tutorials for partitioning and bucketing.

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in HDFS which contains structured data. Now the goal is to process only a portion of the data in the file, e.g. all the lines where the second column value is between so and so. Is it possible to launch the MR job such that HDFS only streams the relevant portion of the file, versus streaming everything to the mappers?
The reason is that I want to speed up the job by only working on the portion that I need. One approach would be to run an MR job to create a new, filtered file, but I am wondering if that can be avoided?
Please note that the goal is to keep the data in HDFS, and I do not want to read from and write to a database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1 TB, each map slot may only end up reading about 1/28th of the total file, or roughly 36 GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
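A minimal sketch of that kind of in-mapper filter, assuming tab-delimited lines and a hypothetical numeric range on the second column:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RangeFilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    // Hypothetical bounds on the second column
    private static final double LOW = 10.0;
    private static final double HIGH = 20.0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split("\t");
        if (cols.length < 2) {
            return; // malformed row, skip it
        }
        double value;
        try {
            value = Double.parseDouble(cols[1]);
        } catch (NumberFormatException e) {
            return; // non-numeric second column, skip it
        }
        // Only emit rows whose second column falls in the range of interest
        if (value >= LOW && value <= HIGH) {
            context.write(line, NullWritable.get());
        }
    }
}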
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is to look at the way Hive solves this problem. The data is in "tables", which are really just metadata about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition, so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside the date directories would be your actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
The bottom line is that if you want functionality like this, you either want to move to a higher-level language (Hive/Pig) or you need to roll your own solution.
A big part of the processing cost is parsing the data to produce the key-value pairs passed to the Mapper. We (usually) create one Java object per value, plus some container, which is costly both in terms of CPU and garbage-collector pressure.
I would suggest a solution "in the middle". You can write an input format which reads the input stream and skips non-relevant data at an early stage (for example, by looking at the first few bytes of each line). As a result you will still read all the data, but actually parse and pass to the Mapper only a portion of it.
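A rough sketch of such an early-skipping input format, wrapping the standard LineRecordReader and using a hypothetical prefix test to decide which lines are worth passing on:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class FilteringTextInputFormat extends TextInputFormat {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new FilteringLineRecordReader();
    }

    // Delegates to LineRecordReader but silently skips lines that cannot be relevant,
    // so the Mapper never sees (or allocates extra objects for) them.
    public static class FilteringLineRecordReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate = new LineRecordReader();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            while (delegate.nextKeyValue()) {
                // Hypothetical cheap relevance test on the first bytes of the line
                if (delegate.getCurrentValue().toString().startsWith("2011-12")) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }

        @Override
        public Text getCurrentValue() { return delegate.getCurrentValue(); }

        @Override
        public float getProgress() throws IOException { return delegate.getProgress(); }

        @Override
        public void close() throws IOException { delegate.close(); }
    }
}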
Another approach I would consider is to use the RCFile format (or another columnar format), and take care that the relevant and non-relevant data sit in different columns.
If the files that you want to process have some unique attribute in their filename (like an extension or a partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and "_xxx" files/dirs, but you can extend this with setInputPathFilter.
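A small sketch of such a filter; the .dat extension is just a hypothetical filename attribute to match on:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Only let files with a (hypothetical) .dat extension through to the job.
public class DatOnlyFilter implements PathFilter {
    @Override
    public boolean accept(Path path) {
        return path.getName().endsWith(".dat");
    }
}

The filter would then be registered when setting up the job with FileInputFormat.setInputPathFilter(job, DatOnlyFilter.class).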
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this, since it breaks the "one block per mapper" paradigm, but sometimes this is acceptable. It can sometimes take more effort to "do it right", especially if you're dealing with a small amount of data and the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.
