I want to save image files (like jpeg, png etc) on HDFS (Hadoop File System). I tried two ways :
Saved the image files as it is (i.e in the same format) into HDFS using put command. The full command was : hadoop fs -put /home/a.jpeg /user/hadoop/. It was successfully placed.
Converted these image files into Hadoop's Sequence File format & then saved in HDFS using put command.
I want to know which format should be used to save in HDFS.
And what are the pros of using Sequence File format. One of the advantage that I know is that it is splittable. Is there any other ?
images are very small in size compare to block size of HDFS storage. The problem with small files is the impact on processing performance, This is why you should use Sequence Files, HAR, HBase or merging solutions. see these two threads more info.
effective way to store image files
How many files is too many on a modern HDP cluster?
Processing a 1Mb file has an overhead to it. So processing 128 1Mb
files will cost you 128 times more "administrative" overhead, versus
processing 1 128Mb file. In plain text, that 1Mb file may contain 1000
records. The 128 Mb file might contain 128000 records.
Related
When importing unstructured data - audio and video file - into HDFS, are those files splited into 128MB block and saved like other data does?
Blocks are created, yes. It is then up to the client to determine how to re-assemble the files into proper blobs, which is why object/blob storage, such as Apache Ozone, would be recommended for these filetypes, rather than HDFS block storage.
I have a cronjob that that downloads zip files (200 bytes to 1MB) from a server on the internet every 5 minutes. If I import the zip files into HDFS as is, I encounter the infamous Hadoop small file size issue. In order to avoid the build up of small files in HDFS, process of the the text data in the zip files and convert them into avro files and wait every 6 hours to add my avro file into HDFS. Using this method, I have managed to get avro files imported into HDFS with a file size larger than 64MB. The files sizes range from 50MB to 400MB. What I'm concerned about is that what happens if I start building file sizes that start getting into the 500KB avro file size range or larger. Will this cause issues with Hadoop? How does everyone else handle this situation?
Assuming that you have some Hadoop post-aggregation step and that you're using some splittable compression type (sequence, snappy, none at all), you shouldn't face any issues from Hadoop's end.
If you would like your avro file sizes to be smaller, the easiest way to do this would be to make your aggregation window configurable and lower it when needed (6 hours => 3 hours?). Another way you might be able to ensure more uniformity in file sizes would be to keep a running count of lines seen from downloaded files and then combine upload after a certain line threshold has been reached.
I came across this page on CompressedStorage in the documentation and it has me a bit confused.
According to the page, if my input files (on AWS s3) are compressed gzip files, I should first load the data with the option STORED AS TextFile and then create another table with the option STORED AS SEQUENCEFILE and insert the data into that. Is that really the recommended way?
Or can I just load the data straight into a table set with the option STORED AS SEQUENCEFILE?
If the former method is really the recommended way, is there any further explanation as to why it is?
You must load your data in its format. It means, if your files are Text Files then you should load them as TextFile and if your files are Sequence Files then load them as SEQUENCEFILE.
For Hive the compression format doesn't matter because it will decompress them on fly using the extension of the file as reference (If the compression codec was configured properly in Hadoop).
The suggestion in the page that you are sharing is that it's better work with Sequence Files than Compressed Text Files. That is because a Gzip file is not splittable and if you have a very big Gzip file all the file have to be processed with only one Mapper not allowing work in parrallel distributing the effort among the cluster nodes.
Then the Hive's suggestion is convert Compressed Text Files into Sequence Files to avoid that limitation. It is only about performance.
If your files are small, then it doesn't matter (< 1 Hadoop block size - 128MB by default).
Iam using Hadoop to parse ample(about 1 million) text files and each has lot of data into it.
Firstly I uploaded all my text files into hdfs using Eclipse. But when uploading the files, my map-reduce operation resulted in huge amount of files in following directory C:\tmp\hadoop-admin\dfs\data.
So , is there any mechanism, using which I can shrink the size of my HDFS (basically above mentioned drive).
to shrink your HDFS size you can set a greater value (in bytes) to following hdfs-site.xml property
dfs.datanode.du.reserved=0
You can also lower the amount of data generated by map outputs by enabling map output compression.
map.output.compress=true
hope that helps.
I read that whenever the client needs to create a file in HDFS (The Hadoop Distributed File System), client's file must be of 64mb. Is that true? How can we load a file in HDFS which is less than 64 MB? Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
I read that whenever the client needs to create a file in HDFS (The Hadoop Distributed File System), client's file must be of 64mb.
Could you provide the reference for the same? File of any size can be put into HDFS. The file is split into 64 MB (default) blocks and saved on different data nodes in the cluster.
Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
It doesn't matter if a block or file is on a particular data node or on all the data nodes. Data nodes can fetch data from each other as long as they are part of a cluster.
Think of HDFS as a very big hard drive and write the code for reading/writing data from HDFS. Hadoop will take care of the internals like 'reading from' or 'writing to' multiple data nodes if required.
Would suggest to read the following 1 2 3 on HDFS, especially the 2nd one which is a comic on HDFS.