I'm trying to figure out the following case in Hadoop.
Which file format is best, Avro or SequenceFile, for storing images in HDFS and processing them afterwards with Python?
SequenceFiles are key-value oriented, so I think Avro files will work better?
I use SequenceFile to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, hence they can store images efficiently. As keys in the SequenceFile I usually use the original image file names.
SequenceFiles are used in many image processing products, such as OpenIMAJ. You can use existing tools for working with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.
In addition, you can take a look at HipiImageBundle. This is a special format provided by HIPI (Hadoop Image Processing Interface). In my experience, HipiImageBundle has better performance than SequenceFile, but it can be used only by HIPI.
If you don't have a large number of files (fewer than 1M), you can try storing them without packaging them into one big file and use CombineFileInputFormat to speed up processing.
I have never used Avro to store images, and I don't know of any project that uses it.
I'm trying to read in a large dataset of parquet files piece by piece, do some operation and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset and I'm aware of RecordBatchStreamReader but I'm not sure how to combine them.
How can I use Pyarrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.
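Given the file-level granularity described above, one workaround is to iterate over the files yourself and keep only one of them in memory at a time. The following is a minimal sketch under that assumption, not an official pyarrow recipe: process_one_file_at_a_time and its parameters are hypothetical names, and the only pyarrow call it relies on is pyarrow.parquet.read_table.

```python
def _read_parquet_file(path):
    # Imported lazily so the helper below can also be driven by another reader.
    import pyarrow.parquet as pq

    return pq.read_table(path)  # reads one complete file into memory


def process_one_file_at_a_time(paths, process, read_file=_read_parquet_file):
    """Yield process(table) for each path, loading a single file per step.

    `paths`, `process` and `read_file` are illustrative parameters; nothing
    is read until the caller asks for the next result.
    """
    for path in paths:
        table = read_file(path)
        yield process(table)
        # Drop the reference so the table can be freed before the next read.
        del table
```

With pyarrow installed, something like `for n in process_one_file_at_a_time(paths, lambda t: t.num_rows): print(n)` would report the row count of each file in turn. Each individual file still has to fit in memory.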
I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS s3) are compressed gzip files, I should first load the data with the option STORED AS TextFile and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table set with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?
You must load your data in its own format. That is, if your files are text files you should load them as TextFile, and if your files are Sequence Files you should load them as SEQUENCEFILE.
For Hive the compression format doesn't matter, because it will decompress the files on the fly using the file extension as a reference (provided the compression codec is configured properly in Hadoop).
The suggestion on the page you are sharing is that it is better to work with Sequence Files than with compressed text files. That is because a Gzip file is not splittable: if you have a very big Gzip file, the whole file has to be processed by a single Mapper, which prevents the work from being distributed in parallel among the cluster nodes.
So Hive's suggestion is to convert compressed text files into Sequence Files to avoid that limitation. It is only about performance.
If your files are small (less than one Hadoop block, 128MB by default), then it doesn't matter.
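To make the splittability argument concrete, here is a rough back-of-the-envelope sketch. It assumes one map task per block-sized split and ignores the finer details of how Hadoop actually computes splits; estimated_map_tasks is a hypothetical helper, not a Hadoop or Hive API.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size mentioned above


def estimated_map_tasks(file_size, splittable, block_size=BLOCK_SIZE):
    """Rough estimate of how many map tasks a single input file gets."""
    if file_size <= 0:
        return 0
    if not splittable:
        # A Gzip-compressed text file must be read from start to end by one task.
        return 1
    # A splittable file (for example a SequenceFile) gets about one task per block.
    return math.ceil(file_size / block_size)


ten_gb = 10 * 1024 ** 3
gzip_tasks = estimated_map_tasks(ten_gb, splittable=False)
sequence_file_tasks = estimated_map_tasks(ten_gb, splittable=True)
```

Under these assumptions a 10 GB Gzip file is handled by a single task, while a 10 GB splittable file is spread over many. For a file smaller than one block, both cases give one task, which is why the conversion does not matter for small files.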
I know there are different file formats in Hadoop. By default Hadoop uses the text input format. What are the advantages and disadvantages of using the text input format?
What are the advantages and disadvantages of Avro over the text input format?
Also, please help me understand the use cases for the different file formats (Avro, Sequence, TextInput, RCFile).
I believe Text has no advantages as the default other than that its contents are human readable and friendly. You can easily view the contents by issuing hadoop fs -cat.
The disadvantages of the Text format are:
It takes more space on disk, which impacts production job efficiency.
Writing and parsing text records takes more time.
There is no way to maintain data types when the text is composed of multiple columns.
The Sequence, Avro, and RCFile formats have very significant advantages over the Text format.
Sequence - The key/value objects are stored directly in binary format through Hadoop's native serialization, by implementing the Writable interface. The data types of the columns are preserved, and parsing the records with the relevant data types is easy. It generally takes less space than Text because of the binary format.
Avro - It's a very compact binary storage format for Hadoop key/value pairs. It reads and writes records through Avro serialization/deserialization. It is very similar to the Sequence file format but also provides language interoperability and cell versioning.
You may choose Avro over Sequence only if you need cell versioning or if the stored data will be used by other applications written in languages other than Java. Avro files can be processed from languages such as C, Ruby, Python, PHP, and Java, whereas Sequence files are specific to Java.
RCFile - The Record Columnar File format is column oriented. It is a Hive-specific storage format designed to make Hive data loads faster and to reduce storage space.
Apart from this you may also consider the ORC and the Parquet file formats.
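The size and type-safety points above can be illustrated with a small sketch. It uses Python's struct module as a stand-in for a binary encoding; it is not the Writable or Avro wire format, and the record values are made up for the example.

```python
import struct

# One record with three typed columns: a long, a double, and an int.
record = (1234567890123, 3.141592653589793, 42)

# Text encoding: tab-separated fields, one record per line.
text_line = "\t".join(str(field) for field in record) + "\n"
text_bytes = text_line.encode("utf-8")

# Binary encoding: 8-byte integer, 8-byte double, 4-byte integer (20 bytes).
binary_bytes = struct.pack("<qdi", *record)

# Reading the text back yields strings that must be re-parsed column by column.
parsed_text = text_line.rstrip("\n").split("\t")

# Reading the binary form back yields values with their original types.
parsed_binary = struct.unpack("<qdi", binary_bytes)
```

For this record the binary form is smaller than the text form and round-trips to the original values, while the text form comes back as strings. For records made of very small numbers, text can occasionally be the shorter of the two, so the space saving is a tendency rather than a guarantee.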
I'm currently using Hadoop MapReduce jobs with SequenceFiles of Writables.
The same Writable types are also used for serialization in the non-Hadoop parts of the system.
This method is hard to maintain, mainly because of the lack of a schema and the need to handle version changes manually.
It appears that Apache Avro handles these issues.
The problem is that during the migration I will have data in both formats.
Is there a simple way to handle the migration?
I haven't tried it myself, but maybe using AvroSequenceFile format would help. It's just a wrapper around SequenceFile so in theory you should be able to write data in both your old SequenceFile format as well as your new Avro format which should make the migration easier.
Here is more information about this format.
Generally, there is nothing stopping you from using Avro data and SequenceFiles interchangeably. Use whatever InputFormat is necessary for the type of data you need, and for output it of course makes sense to use Avro formats whenever practical. If your input comes in different formats, take a look at MultipleInputs. Essentially, you will still have to implement separate Mappers, but that's to be expected considering that the Map input key/value types are different.
Moving to Avro is a wise move. If you have the capacity in time and hardware, it might even be worthwhile to explicitly convert your data from SequenceFile to Avro right away. You can use any language supported by Avro that also happens to support SequenceFiles to do this. Java certainly does, but Pig is also pretty handy for this.
The user-contributed PiggyBank project has functionality for reading a SequenceFile, and then it is simply a matter of using AvroStorage from the same PiggyBank project with the appropriate Avro schema to get your Avro file.
If only Pig supported loading Avro schemas from a file! If you use Pig, you will unfortunately have to write scripts that explicitly contain the Avro schema, which can be a bit annoying.
I have a use case where I need to upload some terabytes of text files as sequence files on HDFS.
These text files have several layouts ranging from 32 to 62 columns (metadata).
What would be a good way to upload these files along with their metadata:
Create a key/value class per text file layout and use it to create and upload the sequence files?
Create a SequenceFile.Metadata header in each file being uploaded as a sequence file individually?
Any input is appreciated!
Thanks
I prefer storing metadata with the data and then designing the application to be metadata driven, as opposed to embedding metadata in the design or implementation of the application, which means that updates to the metadata require updates to your app. Of course, there are limits to how far you can take a metadata-driven application.
You can embed the metadata in the data, for example by using an encoding scheme like JSON. Alternatively, you can keep the metadata alongside the data, for example by having records in the SeqFile specifically for describing metadata, perhaps using reserved tags for the keys so as to give the metadata its own namespace, separate from the namespace used by the keys for the actual data.
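The reserved-key idea can be sketched in plain Python. This is only an illustration of the record layout, not code that writes an actual SequenceFile; the "__meta__/" prefix, the function names, and the "|" delimiter are assumptions made up for the example.

```python
import json

META_PREFIX = "__meta__/"  # hypothetical reserved namespace for metadata keys


def build_records(layout_name, columns, rows):
    """Return key/value pairs with one leading metadata record.

    The metadata value is JSON, so the column layout travels with the data.
    """
    records = [(META_PREFIX + layout_name, json.dumps({"columns": columns}))]
    records.extend(rows)
    return records


def read_records(records, delimiter="|"):
    """Yield (key, row_dict) pairs, using the metadata record to name columns."""
    columns = None
    for key, value in records:
        if key.startswith(META_PREFIX):
            columns = json.loads(value)["columns"]
            continue
        if columns is None:
            raise ValueError("data record seen before any metadata record")
        yield key, dict(zip(columns, value.split(delimiter)))


records = build_records("layout_a", ["id", "name"], [("row1", "7|alice")])
decoded = list(read_records(records))
```

Because the reader learns the column names from the metadata record, a new layout only needs a new metadata record rather than a change to the reading code.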
As for whether this should be packaged into separate Hadoop files, bear in mind that Hadoop can be instructed to split a file into Splits (the input for the map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single HDFS file is that it more closely resembles the unit of containment of your original data.
As for key types (i.e. whether to use Text vs. binary), consider that the key will be compared against other keys. The more compact the key, the faster the comparison. Thus, if you can store a dense version of the key, that would be preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same, that will also help performance. So, for instance, serializing a Java class as the key would not be recommended, because the serialized stream begins with the package name of your class, which is likely to be the same for every other key in the file.
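The effect of a shared key prefix can be illustrated with a small sketch that counts how many bytes a byte-by-byte comparison has to look at before it finds a difference. The helper name and the sample keys are made up for the example, and this is a simplification of what a raw byte comparator does.

```python
import struct


def bytes_examined(a, b):
    """Count the bytes a byte-by-byte comparison reads before a and b differ."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index + 1
    # One key is a prefix of the other (or they are equal).
    return min(len(a), len(b))


# Keys that start with the same long prefix, such as a package name.
prefixed_1 = b"com.example.records.Key:0001"
prefixed_2 = b"com.example.records.Key:0002"

# Dense keys: the same counters packed as 4-byte big-endian integers.
dense_1 = struct.pack(">I", 1)
dense_2 = struct.pack(">I", 2)

prefixed_cost = bytes_examined(prefixed_1, prefixed_2)
dense_cost = bytes_examined(dense_1, dense_2)
```

With the prefixed keys the comparison has to walk through the whole shared prefix before reaching the byte that differs, while the dense keys are decided within four bytes.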
If you want the data and its metadata bundled together, then the Avro format is the appropriate one. It also allows schema evolution.
The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data as the key, and store the data itself as the value, also as Text. SequenceFiles are designed for storing key/value pairs; if that's not what your data is, then don't use a SequenceFile. You could just upload the unprocessed text files and use those as input to Hadoop.
For best performance, do not make each file terabytes in size. The Map stage of Hadoop runs at least one task per input file, and a file that cannot be split is handled by a single task. You want to have more files than you have CPU cores in your Hadoop cluster. Otherwise you can end up with one CPU doing 1 TB of work and a lot of idle CPUs. A good file size is probably 64-128MB, but for best results you should measure this yourself.