Parquet API doesn't have the concept of Keys? - hadoop

Ok so after getting exceptions about not being able to write keys into a parquet file via spark I looked into the API and found only this.
public class ParquetOutputFormat<T> extends FileOutputFormat<Void, T> {....
(My assumption could be wrong =D, and there might be another API somewhere. )
Ok This makes some warped sense, after all you can project/restrict the data as it is materialising out of the container file. However, just to be on the safe side. A Parquet file does not have the notion of a sequence file's "key" value , right ?
I find this a bit odd, the Hadoop infrastructure builds around the fact that a sequence file may have a key. And I assume this key is used liberally to partition data into blocks for locality (not at the HDFS level ofc) ? Spark has a lot of API calls that work with the code to do reductions and join's etc. Now I have to do extra step to map the keys out from the body of the materialised object. Weird.
So any good reasons why a key is not a first class citizen in the parquet world ?

You are correct. Parquet file is not a key/value file format. It's a columnar format. Your "key" can be a specific column from your table. But it's not like HBase where you have a real key concept. Parquet is not a sequence file.

Related

Data storage format for unstructured data rows on HDFS

We are consuming very large data that needs to be written as fast as we receive and we are using HDFS, so we prefer using it. The data is almost unstructured, and we will be doing basic queries on them rarely. The data is flat with some fields, each row representing another data.
key1=str key2=30.3 key3=longtexthere
Another data row:
key1=3 key5=abc
SequenceFile seemed the most natural one but I could not find how to store multiple rows in a single SequenceFile.
Currently, in our temporary solution, we have multiple writers that writes to multiple text files. So when querying is needed, we read them in parallel. However, current text files contains 1000s of rows and I don't think creating a single SequenceFile for each row would be feasiable, it would incur much overhead for storing metadata and reading many too many files at once when querying.
I think the problem can be solved by using HBase or Cassandra, a columunar database but we are almost required to use HDFS. Am I missing something with SequenceFiles or we should really use a columunar database?
So sequence file format is like this:
<key, value>
<key, value>
<key, value>
...
where the key is a WritableComparable and the value is a Writable.
Now what a lot of people are doing - and you could do the same - is:
Only use the key OR the value 'column'
Implement a custom Writable which wraps a set of other Writables (call it record, row, ...)
That way you can model everything you want. That record writable could have a fixed schema, like it contains 'IntWritable, Text, IntWritable, IntWritable' (depending on you fields). Or in case you don't wanna support different types, you could use the existing ArrayWritable as your 'record'.
Knowing the schema of each file (e.g. put it into the metadata of the sequence file, will allow you to do reads on files with different/evolved schema's.
So its a lot of handcrafting, but build can very efficient and flexible structure. Never used it, but take a look at http://pangool.net/userguide/schemas.html, think they already modeled suche a flexible record/tuple schema on top of sequence files.
Bottom line, i think you can achieve what you want with sequence files.
However i would recommend to also have a look at columnar file formats like Parquet or ORC files. Those come with their own tradeoffs, but you will have a higher compression rate and selective reads (column projection, filter pushdown). Also you don't have to invent the schema/tuple structure.

Map Reduce : Which is the underlying Data Structure used

I am wondering that if such a large datasets are used in Hadoop Map Reduce then what are the data structures which are used by hadoop. If possible please somebody provide me a detail view of underlying data structures in hadoop.
HDFS is the default underlying storage platform of Hadoop.
Its like any other file system in the sense that - it does not care what structure the files have. It only ensures that files will be saved in a redundant fashion and available for retrieval quickly.
So it is totally upto you the user, to store files with whatever structure you like inside them.
A Map Reduce program simply gets the file data fed to it as an input. Not necessarily the entire file, but parts of it depending on InputFormats etc. The Map program then can make
use of the data in whatever way it wants to.
'Hive' - on the other hand deals with TABLES (columns/rows). And you can query them in a SQL like fashion using Hive-QL.
Thanks to all of you
I got the answer of my question. The underlying HDFS uses block as a storing units a detail description of which is mentioned in the following book and networking streaming concepts.
All the details are available in the third chapter of Hadoop: The Definitive Guide.

Sequence File of Objects into Hive

We started with a bunch of data stored in NetCDF files. From there, some Java code was written to create sequence files from the NetCDF files. We don't know much about the original intentions of the code, but we have been able to learn a little bit about the sequence files themselves. Ultimately, we are trying to create tables within Hive using these sequence files, but seem incapable of doing so at the moment.
We know that the keys and values within the sequence files are stored as objects that implements WritableComparable. We are also capable of creating Java code to iterate through all of the data in the sequence files.
So, what would be necessary to actually get Hive to read the data within the objects of these sequence files properly?
Thanks in advanced!
UPDATE: The reason it is so difficult to describe where I am having trouble exactly is because I am not necessarily getting any errors. Hive is simply just reading the sequence files incorrectly. When running the Hadoop -text command on my sequence file I get a list of objects as such:
NetCDFCompositeKey#263c7e3f , NetCDFRecordWritable#4d846db5
The data is within those objects themselves. So, currently from the help of #Tariq I believe what I have to do in order to actually read those objects is to create a custom InputFormat to read the keys and a custom SerDe to serialize and deserialize the objects?
I'm sorry, i'm not able to understand from your question where exactly you are facing the problem. If you wish to use SequenceFiles through Hive you just have to add STORED AS SEQUENCEFILE clause while issuing CREATE TABLE(most probably you already know this, nothing new). When you work on SequenceFiles Hive treats each key/value pair of the SequenceFiles similar to rows in normal files. Important thing here is that keys will be ignored. Apart from that nothing very special.
Having said that, if you wish to read both keys and values, you might have to write a custom InputFormat that can read both keys and values. See this project for example. It allows us to access data stored in a SequenceFile's key.
Also, if your keys and values are custom classes, you will require to write a SerDe as well to serialize and deserialize your data.
HTH
P.S. : I don't know if this is exactly what you were looking for. Do let me know if it is not and add some more detail to your question. I'll try addressing that.

Modeling Data in Hadoop

Currently I am bringing into Hadoop around 10 tables from an EDW (Enterprise Data Warehouse), these tables are closely related to a Star Schema model. I'm usig Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at what are some better ways to store these files before striking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at how might be some ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
Storing these files in csv is fine. Since you will be able to process these files using text output format and could also read it through hive using specific delimiter. You could change the delimiter if you do not like comma to pipe("|") that's what I do most of the time. Also you generally need to have large files in hadoop but if its large enough that you can partition these files and each file partition is in the size of few 100 gigs then it would be a good to partition these files into separate directory based on your partition column.
Also it would be better idea to have most of the columns in single table than having many normalized small tables. But that varies depending on your data size. Also make sure whenever you copy , move or create data you do all the constraint check on your applications as it will be difficult to make small changes in the table later on, you will need to modify the complete file for even small change.
Hive Partitioning and Bucketing concepts can be used to effectively used to put similar data together (not in nodes, but in files and folders) based on a particular column. Here are some nice tutorials for Partitioning and Bucketing.

Huge files in hadoop: how to store metadata?

I have a use case to upload some tera-bytes of text files as sequences files on HDFS.
These text files have several layouts ranging from 32 to 62 columns (metadata).
What would be a good way to upload these files along with their metadata:
creating a key, value class per text file layout and use it to create and upload as sequence files ?
create SequenceFile.Metadata header in each file being uploaded as sequence file individually ?
Any inputs are appreciated !
Thanks
I prefer storing meta data with the data and then designing your application to be meta data driven, as opposed to embedding meta data in the design or implementation of your application which then means updates to metadata require updates to your app. Ofcourse there are limits to how far you can take a metadata driven application.
You can embed the meta data with the data such as by using an encoding scheme like JSON or you could have the meta data along side the data such as having records in the SeqFile specifically for describing meta data perhaps using reserved tags for the keys so as to given metadata its own namespace separate from the namespace used by the keys for the actual data.
As for the recommendation of whether this should be packaged into separate Hadoop files, bare in mind that Hadoop can be instructed to split a file into Splits (input for map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single hdfs file is that it more closely resembles the unit of containment of your original data.
As for the recommendation about key types (i.e. whether to use Text vs. binary), consider that the key will be compared against other values. The more compact the key, the faster the comparison. Thus if you can store a dense version of the key that would be preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same then it will also help performance. So, for instance, serializing a Java class as the key would not be recommended because the text stream begins with the package name of your class which is likely to be the same as every other class and thus key in the file.
If you want data and its metadata bundled together, then AVRO format is the appropriate one. It allows schema evolution also.
The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data to make the Key, the data itself is the value as a Text. SequenceFiles are designed for storing key/value pairs, if that's not what your data is then don't use a SequenceFile. You could just upload unprocessed text files and input those to Hadoop.
For best performance, do not make each file terabytes in size. The Map stage of Hadoop runs one job per input file. You want to have more files than you have CPU cores in your Hadoop cluster. Otherwise you will have one CPU doing 1 TB of work and a lot of idle CPUs. A good file size is probably 64-128MB, but for best results you should measure this yourself.

Resources