I am using Hadoop to parse a large number (about 1 million) of text files, each of which contains a lot of data.
First I uploaded all my text files into HDFS using Eclipse. But while uploading the files, my map-reduce operation produced a huge number of files in the following directory: C:\tmp\hadoop-admin\dfs\data.
So, is there any mechanism I can use to shrink the size of my HDFS (basically, the drive mentioned above)?
To shrink your HDFS size, you can set a greater value (in bytes) for the following hdfs-site.xml property, which reserves space on each DataNode volume for non-HDFS use:
dfs.datanode.du.reserved=0
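You can also lower the amount of data generated by map outputs by enabling map output compression. The property is named mapred.compress.map.output in Hadoop 1.x and mapreduce.map.output.compress in Hadoop 2.x and later.
mapreduce.map.output.compress=true
Hope that helps.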
Related
Suppose I am writing a 200MB file into HDFS, where the HDFS block size is 128MB. What happens if the write fails after 150MB of the 200MB has been written? Will I be able to read the portion of data that was written? What if I try to write the same file again? Will that be a duplicate? What happens to the 150MB of data written before the failure?
The HDFS default block size is 128MB. If the write fails partway through, the Hadoop administration UI shows the file with the _COPYING_ extension.
Only the 150MB of data will have been copied.
Yes, you can read only that portion of the data (150MB).
Once you restart the copy, it will continue from the previous point (if both the paths and the file name are the same).
Every piece of data is replicated according to your replication factor.
The previously written data will remain available in HDFS.
I want to save image files (JPEG, PNG, etc.) on HDFS (Hadoop Distributed File System). I tried two ways:
Saved the image files as they are (i.e., in the same format) into HDFS using the put command. The full command was: hadoop fs -put /home/a.jpeg /user/hadoop/. The file was placed successfully.
Converted these image files into Hadoop's SequenceFile format and then saved them in HDFS using the put command.
I want to know which format should be used for saving to HDFS.
And what are the pros of using the SequenceFile format? One advantage that I know of is that it is splittable. Are there any others?
Images are very small compared to the HDFS block size. The problem with small files is their impact on processing performance, which is why you should use Sequence Files, HAR, HBase, or a merging solution. See these two threads for more info:
effective way to store image files
How many files is too many on a modern HDP cluster?
Processing a 1Mb file has an overhead to it. So processing 128 1Mb
files will cost you 128 times more "administrative" overhead, versus
processing 1 128Mb file. In plain text, that 1Mb file may contain 1000
records. The 128 Mb file might contain 128000 records.
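The overhead comparison in the quote can be made concrete with a rough count of map tasks. This is only a sketch: it assumes the default behaviour where each file produces at least one input split and larger files produce one split per block, and it ignores options such as CombineFileInputFormat. The function name `count_splits` is made up for illustration.

```python
import math

def count_splits(file_sizes_mb, block_size_mb=128):
    """Approximate the number of input splits (and so map tasks).

    Assumption: every file yields at least one split, and files larger
    than a block yield one split per block.
    """
    return sum(max(1, math.ceil(size / block_size_mb)) for size in file_sizes_mb)

# 128 files of 1 MB each versus a single 128 MB file
print(count_splits([1] * 128))  # 128
print(count_splits([128]))      # 1
```

Both inputs hold the same amount of data, but under these assumptions the small files need 128 task start-ups instead of one.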
I have a cron job that downloads zip files (200 bytes to 1MB) from a server on the internet every 5 minutes. If I import the zip files into HDFS as is, I run into the infamous Hadoop small-files issue. To avoid a buildup of small files in HDFS, I process the text data in the zip files, convert it into Avro files, and wait 6 hours before adding my Avro file to HDFS. Using this method, I have managed to get Avro files imported into HDFS with a file size larger than 64MB. The file sizes range from 50MB to 400MB. What I'm concerned about is what happens if I start building files that get into the 500KB Avro file size range or larger. Will this cause issues with Hadoop? How does everyone else handle this situation?
Assuming that you have some Hadoop post-aggregation step and that you're using some splittable compression type (sequence, snappy, none at all), you shouldn't face any issues from Hadoop's end.
If you would like your Avro file sizes to be smaller, the easiest way to do this would be to make your aggregation window configurable and lower it when needed (6 hours => 3 hours?). Another way to get more uniform file sizes would be to keep a running count of lines seen in the downloaded files and then combine and upload them once a certain line threshold has been reached.
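The line-threshold idea could look roughly like the sketch below. The class name `LineThresholdBatcher` and the `flush` callback are hypothetical; in a real pipeline the callback would write the Avro file and upload it to HDFS, which is deliberately left out here.

```python
class LineThresholdBatcher:
    """Buffer lines from downloaded files and flush once a threshold is hit."""

    def __init__(self, threshold, flush):
        self.threshold = threshold  # number of lines that triggers an upload
        self.flush = flush          # hypothetical callback: receives a list of lines
        self.buffer = []

    def add_file(self, lines):
        """Add the lines of one downloaded file; flush if the threshold is reached."""
        self.buffer.extend(lines)
        if len(self.buffer) >= self.threshold:
            self.flush(self.buffer)
            self.buffer = []

# Usage: collect batches in a list instead of uploading them
uploaded = []
batcher = LineThresholdBatcher(threshold=5, flush=uploaded.append)
batcher.add_file(["a", "b"])            # below threshold, nothing flushed
batcher.add_file(["c", "d", "e", "f"])  # 6 lines buffered, one batch flushed
print(len(uploaded))  # 1
```

Because a batch is flushed only after a whole file has been added, batches can overshoot the threshold by up to one file's worth of lines.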
I have data that is stored within sub-directories and would like to put the parent directory into HDFS. The data is always in the last directory, and the directory structure extends up to 2 levels.
So the structure is [parent_dir]->[sub_directories]->[sub_directories]->data
I tried to add the entire directory by doing
hadoop fs -put parent_dir input
This takes a very long time! There are possibly 258 x 258 sub-directories. And it eventually fails with
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(X.X.X.245:50010, storageID=DS-262356658-X.X.X.245-50010-1394905028736, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on devic
I can see the required space on the nodes. What am I doing wrong here?
Also, the way I was planning to access my files was
hadoop jar Computation.jar input/*/* output
This worked well for a small data set.
That error message is usually fundamentally correct. You may not be taking into account the replication factor for the HDFS filesystem. If your replication factor is set to 3, which is the default, then you need 300GB of storage available to store a 100GB dataset.
There are a couple of things you can do to help get around the issue:
1) Decrease your replication factor (dfs.replication) and your maximum block replication (dfs.replication.max) to 2 in your hdfs-site.xml
2) Compress your datasets. Hadoop can operate on bzip2 and gzip compressed files (though you need to be careful with splitting: gzip files cannot be split, while bzip2 files can)
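The replication arithmetic above can be sketched as follows. The function names are made up for illustration, and the calculation ignores compression and any non-HDFS data on the disks.

```python
def raw_storage_needed_gb(dataset_gb, replication_factor=3):
    """Raw disk space HDFS needs to hold a dataset with the given replication."""
    return dataset_gb * replication_factor

def fits(dataset_gb, free_gb, replication_factor=3):
    """True if the replicated dataset fits into the free cluster space."""
    return raw_storage_needed_gb(dataset_gb, replication_factor) <= free_gb

print(raw_storage_needed_gb(100))     # 300
print(raw_storage_needed_gb(100, 2))  # 200
```

For example, with 250GB free across the cluster, a 100GB dataset does not fit at replication factor 3 but does fit at replication factor 2.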
I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64MB. Is that true? How can we load a file into HDFS that is smaller than 64MB? Can we load a file that serves only as a reference for processing another file and has to be available to all datanodes?
I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64MB.
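Could you provide the reference for that? A file of any size can be put into HDFS. The file is split into 64 MB (default) blocks and saved on different data nodes in the cluster. A file smaller than the block size is stored as a single block that takes up only as much disk space as the file itself.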
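As a rough illustration of how a file maps onto blocks, the sketch below lists the block sizes for a given file size. The helper `block_layout` is hypothetical and works in whole megabytes for simplicity; the 64 MB default matches the figure in this answer, and newer Hadoop versions default to 128 MB.

```python
def block_layout(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        blocks.append(remainder)  # the last block holds only the leftover data
    return blocks

print(block_layout(10))        # [10]
print(block_layout(200, 128))  # [128, 72]
```

A 10 MB file ends up as one 10 MB block, so nothing forces a file to be as large as the block size.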
Can we load a file that serves only as a reference for processing another file and has to be available to all datanodes?
It doesn't matter if a block or file is on a particular data node or on all the data nodes. Data nodes can fetch data from each other as long as they are part of a cluster.
Think of HDFS as a very big hard drive and write the code for reading/writing data from HDFS. Hadoop will take care of the internals like 'reading from' or 'writing to' multiple data nodes if required.
I would suggest reading the following (1, 2, 3) on HDFS, especially the second one, which is a comic on HDFS.