How to achieve desired block size with Hadoop with data on local filesystem

I have a 2TB sequence file that I am trying to process with Hadoop. It resides on a cluster set up to use a local (Lustre) filesystem for storage instead of HDFS. My problem is that no matter what I try, I am always forced to have about 66000 map tasks when I run a map/reduce job with this data as input. This seems to correspond to a block size of 2TB/66000 ≈ 32MB. The actual computation in each map task executes very quickly, but the overhead associated with so many map tasks slows things down substantially.
For the job that created the data and for all subsequent jobs, I have dfs.block.size=536870912 and fs.local.block.size=536870912 (512MB). I also found suggestions that said to try this:
hadoop fs -D fs.local.block.size=536870912 -put local_name remote_location
to make a new copy with larger blocks, which I did to no avail. I have also changed the stripe size of the file on Lustre. It seems that any parameters having to do with block size are ignored for the local file system.
I know that using Lustre instead of HDFS is a non-traditional use of Hadoop, but this is what I have to work with. I'm wondering if others either have experience with this, or have any ideas to try other than what I have mentioned.
I am using cdh3u5 if that is useful.
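As a rough check on the numbers above, the sketch below recomputes the task count implied by a 32MB split and shows what a larger minimum split size would give. It uses the split-size rule from Hadoop's newer FileInputFormat, max(minSize, min(maxSize, blockSize)). The function names are illustrative, "2TB" is assumed to mean 2 TiB, and whether the older API shipped with cdh3u5 on Lustre follows exactly this rule is an assumption. Treat it as a way to reason about the task count, not as a confirmed fix.

```python
TIB = 1024 ** 4
MIB = 1024 ** 2

def compute_split_size(block_size, min_size, max_size):
    # Split-size rule: the block size, clamped between min and max split size.
    return max(min_size, min(max_size, block_size))

def num_splits(file_size, split_size):
    # Integer ceiling division; ignores Hadoop's slack on the last split.
    return (file_size + split_size - 1) // split_size

file_size = 2 * TIB          # assumption: "2TB" means 2 TiB
no_limit = 2 ** 63 - 1       # stands in for "no maximum split size"

# With a 32 MiB block size and no minimum, each block becomes one split.
default_splits = num_splits(file_size, compute_split_size(32 * MIB, 1, no_limit))

# Raising the minimum split size to 512 MiB overrides the small block size.
raised_splits = num_splits(file_size, compute_split_size(32 * MIB, 512 * MIB, no_limit))

print(default_splits, raised_splits)
```

Under these assumptions the first figure is close to the 66000 map tasks observed, which suggests the 32MB value comes from the block size the local filesystem reports rather than from dfs.block.size.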


Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of how it works. Please help with the following questions to start with:
If the file I am trying to copy into HDFS is too big to fit on the available commodity hardware in my Hadoop cluster, what can be done? Will the copy wait until space becomes available, or will there be an error?
How can I predict well in advance that the above scenario will occur in a Hadoop production environment where we continue to receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but I want to know which files I need to alter.
How many blocks does a node have? If I assume that a node is a machine with storage (500 GB HDD), RAM (1 GB) and a processor (dual core), is it 500 GB / 64 MB, assuming that each block is configured to hold 64 MB?
If I copyFromLocal a 1TB file into HDFS, which portion of the file will be placed in which block in which node? How can I know this?
How can I find which record/row of the input file is available in which file of the multiple files split by Hadoop?
What is the purpose of each configured XML file (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce jobs will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest ways to find answers to all of the above queries.
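For the blocks-per-node question, the arithmetic can be sketched as follows. This assumes the question means a 500 GB disk and a 64 MB block size, uses binary units, and ignores replication and any disk space reserved for non-HDFS use. The function names are illustrative.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def full_blocks_on_disk(capacity_bytes, block_size_bytes):
    # Upper bound on how many full-size blocks fit on one disk.
    return capacity_bytes // block_size_bytes

def blocks_for_file(file_bytes, block_size_bytes):
    # A file is cut into fixed-size blocks; the last one may be partial.
    return (file_bytes + block_size_bytes - 1) // block_size_bytes

node_blocks = full_blocks_on_disk(500 * GB, 64 * MB)
file_blocks = blocks_for_file(1024 * GB, 64 * MB)

print(node_blocks, file_blocks)
```

So a 1 TB file needs more blocks than a single such node can hold, before replication is even considered.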

Can I get around the no-update restriction in HDFS?

Thanks for the answers. I'm still not quite getting the answer I want. It's a particular question involving HDFS and the concat api.
Here it is. When concat talks about files, does it mean only "files created and managed by HDFS?" Or will it work on files that are not known to HDFS but just happen to live on the datanodes?
The idea is to
Create a file and save it through HDFS. It's broken up into blocks and saved to the datanodes.
Go directly to the datanodes and make local copies of the blocks using normal shell commands.
Alter those copies. I now have a set of blocks that Hadoop doesn't know about. The checksums are definitely bad.
Use concat to stitch the copies together and "register" them with HDFS.
At the end of all that, I have two files as far as HDFS is concerned. The original and an updated copy. Essentially, I put the data blocks on the datanodes without going through Hadoop. The concat code put all those new blocks into a new HDFS file without having to pass the data through Hadoop.
I don't think this will work, but I need to be sure it won't. It was suggested to me as a possible solution to the update problem. I need to convince them this will not work.
The base philosophy of HDFS is:
write-once, read-many
Therefore, it is not possible to update files with the base implementation of HDFS. You can only append to the end of an existing file, and only if you are using a Hadoop branch that allows it (the original version does not).
An alternative could be to use a non-standard HDFS implementation such as the MapR file system.
Go for HBase, which is built on top of Hadoop to support CRUD operations in the big data Hadoop world.
If you are not allowed to use a NoSQL database, then there is no way to update HDFS files. The only option is to rewrite them.

Adding new files to a running hadoop cluster

Consider that you have 10GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10GB to HDFS at the beginning and then running the program, I want to, for example, copy 1GB, start the work, and gradually add the remaining 9GB while the job runs. I wonder if this is possible in Hadoop.
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup process will only work on those blocks and won't dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm or Apache Spark.

How to make Hadoop Map Reduce process multiple files in a single run?

We run our Hadoop Map Reduce program by executing the command $hadoop jar my.jar DriverClass input1.txt hdfsDirectory. How can we make Map Reduce process multiple files (input1.txt and input2.txt) in a single run?
Like this:
hadoop jar my.jar DriverClass hdfsInputDir hdfsOutputDir
hdfsInputDir is the path on HDFS where your input files are stored (i.e., the parent directory of input1.txt and input2.txt)
hdfsOutputDir is the path on HDFS where the output will be stored (it should not exist before running this command).
Note that your input should be copied to HDFS before running this command.
To copy it to HDFS, you can run:
hadoop dfs -copyFromLocal localPath hdfsInputDir
This is the small files problem: a mapper will run for every file.
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files, then you probably have lots of them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
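The 3 gigabyte figure follows directly from the 150-byte rule of thumb: 10 million files plus 10 million blocks is 20 million namenode objects. A minimal sketch of that estimate, with an illustrative function name and the per-object size taken from the text above:

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb size of one namenode object

def namenode_memory_bytes(num_files, blocks_per_file=1, num_dirs=0):
    # Each file, each block and each directory counts as one object.
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT

estimate = namenode_memory_bytes(10_000_000)
print(estimate / 1e9, "GB")
```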
HAR files
Hadoop Archives (HAR files) were introduced to HDFS in 0.18.0 to alleviate the problem of lots of files putting pressure on the namenode’s memory. HAR files work by building a layered filesystem on top of HDFS. A HAR file is created using the hadoop archive command, which runs a MapReduce job to pack the files being archived into a small number of HDFS files. To a client using the HAR filesystem nothing has changed: all of the original files are visible and accessible (albeit using a har:// URL). However, the number of files in HDFS has been reduced.
Sequence Files
The usual response to questions about “the small files problem” is: use a SequenceFile. The idea here is that you use the filename as the key and the file contents as the value. This works very well in practice. Going back to the 10,000 100KB files, you can write a program to put them into a single SequenceFile, and then you can process them in a streaming fashion (directly or using MapReduce) operating on the SequenceFile. There are a couple of bonuses too. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They support compression as well, unlike HARs. Block compression is the best option in most cases, since it compresses blocks of several records (rather than per record).
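To illustrate the filename-as-key, contents-as-value idea, here is a toy sketch that packs several small files into one stream of length-prefixed records and reads them back one at a time. This is not the SequenceFile on-disk format and does not use any Hadoop API. The record layout and function names are assumptions made up for the example.

```python
import io
import struct

def pack_records(files, out):
    # Write each (name, data) pair as: key length, value length, key, value.
    for name, data in files:
        key = name.encode("utf-8")
        out.write(struct.pack(">II", len(key), len(data)))
        out.write(key)
        out.write(data)

def iter_records(stream):
    # Stream the records back without loading the whole container.
    while True:
        header = stream.read(8)
        if not header:
            break
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len).decode("utf-8")
        value = stream.read(value_len)
        yield key, value

small_files = [("a.log", b"first\n"), ("b.log", b"second\n")]
container = io.BytesIO()
pack_records(small_files, container)
container.seek(0)
unpacked = list(iter_records(container))
print(unpacked)
```

The point of the design is that the filesystem sees one large file instead of many small ones, while each original file stays addressable by its key.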

Spark/Hadoop throws exception for large LZO files

I'm running an EMR Spark job on some LZO-compressed log-files stored in S3. There are several logfiles stored in the same folder, e.g.:
In the spark-shell I'm running a job that counts the lines in the files. If I count the lines individually for each file, there is no problem, e.g. like this:
// Works fine
If I use a wild-card to load all the files with a one-liner, I get two kinds of exceptions.
// One-liner throws exceptions
The exceptions are:
java.lang.InternalError: lzo1x_decompress_safe returned: -6
at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
and Compressed length 1362309683 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(
It seems to me that the solution is hinted by the text given with the last exception, but I don't know how to proceed. Is there a limit to how big LZO files are allowed to be, or what is the issue?
My question is: Can I run Spark queries that load all LZO-compressed files in an S3 folder, without getting I/O related exceptions?
There are 66 files of roughly 200MB per file.
The exception only occurs when running Spark with Hadoop2 core libs (ami 3.1.0). When running with Hadoop1 core libs (ami 2.4.5), things work fine. Both cases were tested with Spark 1.0.1.
kgeyti's answer works fine, but:
LzoTextInputFormat introduces a performance hit, since it checks for an .index file for each LZO file. This can be especially painful with many LZO files on S3 (I've experienced delays of up to several minutes, caused by thousands of requests to S3).
If you know up front that your LZO files are not splittable, a more performant solution is to create a custom, non-splittable input format:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
// Never split an input file: each file is read as a single split.
class NonSplittableTextInputFormat extends TextInputFormat {
  override def isSplitable(context: JobContext, file: Path): Boolean = false
}
and read the files like this:
I haven't run into this specific issue myself, but it looks like .textFile expects files to be splittable, much like Cedrik's problem of Hive insisting on using CombineFileInputFormat.
You could either index your lzo files, or try using the LzoTextInputFormat - I'd be interested to hear if that works better on EMR:
.map(_._2.toString) // if you just want a RDD[String] without writing a new InputFormat
Yesterday we deployed Hive on an EMR cluster and had the same problem with some LZO files in S3 that had been read without any problem by another, non-EMR cluster. After some digging in the logs I noticed that the map tasks read the S3 files in 250MB chunks, although the files are definitely not splittable.
It turned out that the parameter mapreduce.input.fileinputformat.split.maxsize was set to 250000000 (~250MB). That resulted in LZO opening a stream from within a file, and ultimately in a corrupt LZO block.
I set the parameter mapreduce.input.fileinputformat.split.maxsize=2000000000, larger than the maximum file size of our input data, and everything works now.
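The effect of the maximum split size can be sketched with a simplified model that cuts a file into byte ranges no larger than that maximum. The 600000000-byte file size is hypothetical, and the real split logic also takes block size and minimum split size into account. The sketch only shows why a small maximum makes readers start in the middle of a file.

```python
def split_ranges(file_size, max_split_size):
    # Return (offset, length) pairs that cover the file in order.
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(max_split_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

file_size = 600_000_000  # hypothetical non-splittable LZO file

# Maximum of 250000000: later splits begin at arbitrary byte offsets.
small_max = split_ranges(file_size, 250_000_000)

# Maximum of 2000000000: the whole file is one split starting at offset 0.
large_max = split_ranges(file_size, 2_000_000_000)

print(small_max)
print(large_max)
```

Any split whose offset is not 0 asks the decompressor to start reading at a position that is not an LZO block boundary, which matches the corrupt-block symptom described above.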
I'm not sure exactly how that relates to Spark, but changing the InputFormat might help. The InputFormat seems to be the problem in the first place, as mentioned in "How Amazon EMR Hive Differs from Apache Hive".
