Writing a file larger than block size in hdfs - hadoop

Suppose I am writing a 200 MB file into HDFS, where the HDFS block size is 128 MB. What happens if the write fails after 150 MB of the 200 MB have been written? Will I be able to read the portion of data that was written? What if I try to write the same file again? Will that create a duplicate? What happens to the 150 MB of data written prior to the failure?

The HDFS default block size is 128 MB. If the write fails partway through, the Hadoop administration UI shows the file with a .copying extension.
Only the 150 MB of data that was written will be present.
Yes, you can read that written portion (150 MB).
If you restart the copy, it will continue from the previous point (provided both the path and the file name are the same).
Every block written is replicated according to your replication factor.
The previously written data remains available in HDFS.
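The block arithmetic above can be sketched in plain Python (a toy model, not the HDFS API): a 200 MB file spans two 128 MB blocks, and a write that fails at 150 MB leaves one full block plus a partial one.

```python
BLOCK_SIZE_MB = 128  # HDFS default block size in Hadoop 2.x

def blocks_written(bytes_written_mb):
    """Return the list of block sizes (in MB) present in HDFS
    after bytes_written_mb of data have been written."""
    full, partial = divmod(bytes_written_mb, BLOCK_SIZE_MB)
    blocks = [BLOCK_SIZE_MB] * full
    if partial:
        blocks.append(partial)  # the last block only holds what was written
    return blocks

# A write that fails after 150 MB of a 200 MB file:
print(blocks_written(150))  # [128, 22] -> one full block and a 22 MB partial block
print(blocks_written(200))  # [128, 72] -> what a completed write would look like
```

The 150 MB that made it to HDFS is exactly the readable portion: one complete block plus a 22 MB partial block.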

Related

When to move data to HDFS/Hive?

So I'm developing an application that is expected to deal with large amounts of data, and as such I've decided to use Hadoop to process it.
My services node and datanodes are separated from the webapp, so I'm using HttpFS to let the app communicate with Hadoop.
So, whenever a new row of data is generated in my application, should I already call the corresponding HttpFS URL to append the data to an HDFS file? Should I write this data in a file in the webserver and using a cronjob upload it to HDFS for example every hour?
Should I have the Hive table updated or should I just load the data in there whenever I need to query it?
I'm pretty new to Hadoop so any link that could help will also be useful.
I prefer below approach.
Do not call the HttpFS URL to append data to an HDFS file for every row update. HDFS is efficient when data files are larger than the block size: 128 MB (in Hadoop 2.x) or 64 MB (in Hadoop 1.x).
Write the data on the web server, and use a rolling appender that rolls the file when it reaches a certain size limit, in multiples of 128 MB, e.g. a 1 GB file.
You can use hourly cron jobs, but make sure you send a big data file (e.g. 1 GB, or a multiple of 128 MB) instead of just sending the log file accumulated in one hour.
Regarding loading the data, you can use internal or external Hive tables. Have a look at this article.
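The rolling-appender idea above can be sketched in plain Python. The `upload` callable here is a hypothetical placeholder for the real HttpFS/WebHDFS PUT request, and the roll size is illustrative:

```python
import os
import tempfile

class RollingBuffer:
    """Accumulate rows in a local file; when it reaches roll_size bytes,
    hand the file to `upload` (a stand-in for the real HttpFS call)."""

    def __init__(self, upload, roll_size=1024 ** 3):  # ~1 GB, a multiple of 128 MB
        self.upload = upload
        self.roll_size = roll_size
        self._open_new()

    def _open_new(self):
        fd, self.path = tempfile.mkstemp(suffix=".log")
        self.fh = os.fdopen(fd, "w")
        self.size = 0

    def append(self, row):
        line = row.rstrip("\n") + "\n"
        self.fh.write(line)
        self.size += len(line.encode("utf-8"))
        if self.size >= self.roll_size:
            self.roll()

    def roll(self):
        self.fh.close()
        self.upload(self.path)  # one big file per HDFS write, not one per row
        os.remove(self.path)
        self._open_new()

# Demo with a tiny threshold so the roll triggers after two rows:
uploaded = []
buf = RollingBuffer(uploaded.append, roll_size=10)
buf.append("row-1")
buf.append("row-2")
print(len(uploaded))  # 1 -> the buffer rolled exactly once
```

The point of the design is that each HttpFS call ships one large file, instead of one tiny append per generated row.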

Hadoop Avro file size concern

I have a cronjob that downloads zip files (200 bytes to 1MB) from a server on the internet every 5 minutes. If I import the zip files into HDFS as is, I encounter the infamous Hadoop small file size issue. In order to avoid the build-up of small files in HDFS, I process the text data in the zip files, convert it into Avro files, and wait six hours before adding my Avro file into HDFS. Using this method, I have managed to get Avro files imported into HDFS with a file size larger than 64MB. The file sizes range from 50MB to 400MB. What I'm concerned about is what happens if I start building file sizes that get into the 500MB range or larger. Will this cause issues with Hadoop? How does everyone else handle this situation?
Assuming that you have some Hadoop post-aggregation step and that you're using some splittable compression type (sequence, snappy, none at all), you shouldn't face any issues from Hadoop's end.
If you would like your Avro file sizes to be smaller, the easiest way is to make your aggregation window configurable and lower it when needed (6 hours => 3 hours?). Another way to get more uniform file sizes is to keep a running count of lines seen across the downloaded files, and combine and upload once a certain line threshold has been reached.
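The line-count threshold idea can be sketched in plain Python (the file names and counts are hypothetical, and the grouping stands in for the real combine-and-upload step):

```python
def plan_uploads(file_line_counts, threshold):
    """Group downloaded files into batches, closing a batch once the
    running line count reaches `threshold`."""
    batches, current, count = [], [], 0
    for name, lines in file_line_counts:
        current.append(name)
        count += lines
        if count >= threshold:
            batches.append(current)  # this batch is big enough to upload
            current, count = [], 0
    if current:
        batches.append(current)      # leftovers wait for the next run
    return batches

files = [("a.zip", 40_000), ("b.zip", 70_000), ("c.zip", 30_000)]
print(plan_uploads(files, threshold=100_000))  # [['a.zip', 'b.zip'], ['c.zip']]
```

Batching on a line count rather than a fixed time window keeps the combined Avro files close to a target size even when the download rate varies.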

Add an entire directory to the hadoop file system (hdfs)

I have data that is stored within sub-directories and would like to put the parent directory in HDFS. The data is always present in the last directory, and the directory structure extends up to 2 levels.
So the structure is [parent_dir]->[sub_directories]->[sub_directories]->data
I tried to add the entire directory by doing
hadoop fs -put parent_dir input
This takes a long, long time! There are possibly 258x258 sub-directories. And it eventually fails with:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(X.X.X.245:50010, storageID=DS-262356658-X.X.X.245-50010-1394905028736, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on device
I can see the required space on the nodes. What am I doing wrong here?
Also, the way I was planning to access my files was
hadoop jar Computation.jar input/*/* output
This worked well for small data set.
That error message is usually fundamentally correct. You may not be taking into account the replication factor for the HDFS filesystem. If your replication factor is set to 3, which is the default, then you need 300GB of storage available to store a 100GB dataset.
There are a couple of things you can do to help get around the issue:
1) Decrease your replication factor (dfs.replication) and your maximum block replication (dfs.replication.max) to 2 in your hdfs-site.xml
2) Compress your datasets. Hadoop can operate on bzip and gzip compressed files (though you need to be careful of splitting)
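The replication arithmetic behind the answer can be sketched in plain Python (illustrative only):

```python
def raw_storage_needed(dataset_gb, replication=3):
    """Raw HDFS capacity needed to hold a dataset at a given
    replication factor (3 is the HDFS default)."""
    return dataset_gb * replication

print(raw_storage_needed(100))                 # 300 GB at the default factor of 3
print(raw_storage_needed(100, replication=2))  # 200 GB after lowering dfs.replication
```

So a directory tree that fits comfortably on a single node can still exhaust the cluster once each block is written three times.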

How to shrink size of HDFS in Hadoop

I am using Hadoop to parse a large number of text files (about 1 million), each containing a lot of data.
First I uploaded all my text files into HDFS using Eclipse. But when uploading the files, my map-reduce operation resulted in a huge number of files in the following directory: C:\tmp\hadoop-admin\dfs\data.
So, is there any mechanism by which I can shrink the size of my HDFS (basically the above-mentioned drive)?
To limit how much disk space HDFS uses on a datanode, you can set a larger value (in bytes) for the following hdfs-site.xml property, which reserves that much space for non-HDFS use (the default is 0):
dfs.datanode.du.reserved=0
You can also lower the amount of intermediate data generated by map outputs by enabling map output compression:
mapreduce.map.output.compress=true
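As a sketch, the two settings would look like this in the config files (property names as in Hadoop 2.x; the 10 GB reserved value is purely illustrative):

```xml
<!-- hdfs-site.xml: reserve ~10 GB per volume for non-HDFS use -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>

<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```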
hope that helps.

Hadoop. About file creation in HDFS

I read that whenever the client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64 MB. Is that true? How can we load a file into HDFS which is less than 64 MB? Can we load a file which will just be used for reference when processing another file, and which has to be available to all datanodes?
I read that whenever the client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64 MB.
Could you provide a reference for that? A file of any size can be put into HDFS. The file is split into 64 MB blocks (the default) which are saved on different data nodes in the cluster.
Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
It doesn't matter if a block or file is on a particular data node or on all the data nodes. Data nodes can fetch data from each other as long as they are part of a cluster.
Think of HDFS as a very big hard drive and write the code for reading/writing data from HDFS. Hadoop will take care of the internals like 'reading from' or 'writing to' multiple data nodes if required.
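The key point, that HDFS does not pad files out to the block size, can be sketched in plain Python (a toy model, not the HDFS API):

```python
BLOCK_SIZE_MB = 64  # the default block size in the Hadoop version the question refers to

def hdfs_block_usage_mb(file_size_mb):
    """Sizes of the blocks a file occupies; the last block only
    uses the bytes it actually holds, with no padding."""
    full, partial = divmod(file_size_mb, BLOCK_SIZE_MB)
    return [BLOCK_SIZE_MB] * full + ([partial] if partial else [])

print(hdfs_block_usage_mb(10))   # [10] -> a 10 MB file occupies a single 10 MB block
print(hdfs_block_usage_mb(200))  # [64, 64, 64, 8]
```

So a file smaller than 64 MB is perfectly fine: it simply occupies one block of its actual size.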
I would suggest reading the following on HDFS: 1 2 3, especially the 2nd one, which is a comic on HDFS.
