Hadoop Distcp - increasing distcp.dynamic.max.chunks.tolerable config and tuning distcp

I am trying to move data between two Hadoop clusters using distcp. There is a lot of data to move, with a large number of small files. To make it faster, I tried using -strategy dynamic, which, according to the documentation, 'allows faster data-nodes to copy more bytes than slower nodes'.
I am setting the number of mappers to 400. When I launch the job, I see this error: java.io.IOException: Too many chunks created with splitRatio:2, numMaps:400. Reduce numMaps or decrease split-ratio to proceed.
When I googled it, I found this link: https://issues.apache.org/jira/browse/MAPREDUCE-5402
In that ticket, the author asks for the ability to increase distcp.dynamic.max.chunks.tolerable to resolve this issue.
The ticket says the issue was resolved in version 2.5.0. The Hadoop version I am using is 2.7.3, so I believe it should be possible for me to increase the value of distcp.dynamic.max.chunks.tolerable.
However, I am not sure how I can increase it. Can this configuration be set for a single distcp job by passing it with -D (the way -Dmapreduce.job.queuename is passed), or do I have to update it in mapred-site.xml?
Also, does this approach work well when there are a large number of small files? Are there any other parameters I can use to make it faster? Any help would be appreciated.
Thank you.

I was able to figure it out. The properties can be passed on the distcp command line instead of having to update mapred-site.xml. With the dynamic strategy the number of chunks is roughly numMaps × splitRatio (here 400 × 2 = 800), which is more than distcp.dynamic.max.chunks.tolerable allows by default, so the limit has to be raised to at least that many:
hadoop distcp -Ddistcp.dynamic.recordsPerChunk=50 -Ddistcp.dynamic.max.chunks.tolerable=10000 -skipcrccheck -m 400 -prbugc -update -strategy dynamic "hdfs://source" "hdfs://target"

Related

Spark 2.2.0 FileOutputCommitter

DirectFileOutputCommitter is no longer available in Spark 2.2.0. This means writing to S3 takes an insanely long time (3 hours vs. 2 minutes). I'm able to work around this in spark-shell by setting the FileOutputCommitter algorithm version to 2, like this:
spark-shell --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The same does not work with spark-sql:
spark-sql --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The above command seems to set version=2, but when the query is executed it still shows version 1 behaviour.
Two questions:
1) How do I get FileOutputCommitter version 2 behaviour with spark-sql?
2) Is there a way I can still use DirectFileOutputCommitter in Spark 2.2.0? [I'm fine with a non-zero chance of missing data]
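Not from this thread, but a sketch that may help with question 1, assuming the query can be run through a SparkSession-based job rather than the spark-sql shell; the app name, paths and the explicit set on the Hadoop Configuration are my own illustrative additions:
import org.apache.spark.sql.SparkSession;

public class CommitterV2Check {
    public static void main(String[] args) {
        // Set the committer algorithm both as a Spark conf and directly on the
        // Hadoop Configuration the session uses, then print what actually took effect.
        SparkSession spark = SparkSession.builder()
                .appName("committer-v2-check")
                .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                .enableHiveSupport()
                .getOrCreate();
        spark.sparkContext().hadoopConfiguration()
                .set("mapreduce.fileoutputcommitter.algorithm.version", "2");
        System.out.println("committer algorithm version in effect: "
                + spark.sparkContext().hadoopConfiguration()
                       .get("mapreduce.fileoutputcommitter.algorithm.version"));
        // spark.sql("INSERT ...") statements issued from here should pick up version 2.
        spark.stop();
    }
}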
Related items:
Spark 1.6 DirectFileOutputCommitter
I have been hit by this issue. Spark discourages the use of DirectFileOutputCommitter as it might lead to data loss in a race situation, and algorithm version 2 doesn't help a lot.
I tried using gzip instead of snappy compression to save the data to S3, which gave some benefit.
The real issue here is that Spark writes to s3://<output_directory>/_temporary/0 first and then copies the data from the temporary location to the output. This process is pretty slow on S3 (generally around 6 MB/s), so if you have a lot of data you will see a considerable slowdown.
The alternative is to write to HDFS first and then use distcp / s3-dist-cp to copy the data to S3.
Also, you could look at the solution Netflix provided.
I haven't evaluated it.
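A minimal sketch of the write-to-HDFS-first pattern mentioned above, assuming a SparkSession-based job; the paths and bucket name are purely illustrative:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HdfsFirstThenS3 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("hdfs-first").getOrCreate();

        // Whatever the job produces; reading some illustrative input here.
        Dataset<Row> result = spark.read().parquet("hdfs:///data/input");

        // 1) Commit to HDFS, where the _temporary -> final rename is a cheap metadata operation.
        result.write().mode("overwrite").parquet("hdfs:///staging/output");
        spark.stop();

        // 2) Copy the finished files to S3 as a separate step, outside Spark, e.g.
        //      hadoop distcp hdfs:///staging/output s3a://my-bucket/output
        //    or s3-dist-cp on EMR, so the job never depends on S3 renames.
    }
}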
EDIT:
The new Spark 2.4 release has solved the problem of slow S3 writes. I have found that the S3 write performance of Spark 2.4 with Hadoop 2.8 on the latest EMR release (5.24) is almost on par with HDFS writes.
See these documents:
https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-performance.html

Understanding file handling in hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please assist with the following queries to start with:
If the file I am trying to copy into HDFS is very big and cannot be accommodated by the commodity hardware available in my Hadoop cluster, what can be done? Will the file wait until space frees up, or will there be an error?
How can I find out well in advance, or predict, that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but which files do I need to alter?
How many blocks does a node have? Suppose a node is a machine with storage (HDD, 500 GB), RAM (1 GB) and a dual-core processor. In this scenario, is it 500 GB / 64 MB, assuming each block is configured to be 64 MB?
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find out which record/row of the input file ends up in which of the multiple files Hadoop splits it into?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml & mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking such basic questions. Kindly suggest ways to find answers to all of the above queries.

Mesos & Hadoop: How to get the running job input data size?

I'm running Hadoop 1.2.1 on top of Mesos 0.14. My goal is to log the input data size, running time, CPU usage, memory usage, and so on for optimization purposes later. All of these except the data size are obtained using Sigar.
Is there any way I can get the input data size of any job which is running?
For example, when I'm running the Hadoop examples' terasort, I need to get the size of the data teragen generated before the job actually runs. If I'm running the wordcount example, I need to get the wordcount input file size. I need to get the data size automatically, since I won't know in advance which job will be run inside this framework.
I'm using Java to write some of the Mesos library code. Preferably, I want to get the data size inside the MesosExecutor class. For some reason, upgrading Hadoop/Mesos isn't an option.
Any suggestions or related API will be appreciated. Thank you.
Does hadoop fs -dus satisfy your requirement? Before submitting the job to Hadoop, calculate the input file size and pass it as a parameter to your executor.
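Since the executor code is in Java, the programmatic equivalent of hadoop fs -dus is FileSystem#getContentSummary; a minimal sketch, with an illustrative input path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/user/hadoop/terasort-input");    // illustrative path
        long bytes = fs.getContentSummary(input).getLength();    // total size of everything under the path
        System.out.println(input + " is " + bytes + " bytes");
        // Pass `bytes` along to the executor (e.g. in the task data) before launching the job.
    }
}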

Merge HDFS files without going through the network

I could do this:
hadoop fs -text /path/to/result/of/many/reducers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv
But it makes the file data get streamed through the network. Is there a way to tell HDFS to merge a few files on the cluster itself?
I have a problem similar to yours.
Here is an article with a number of HDFS file-merging options, but all of them have some specifics. None of them meets my requirements. Hope this helps you.
HDFS concat (actually FileSystem.concat()). Not such an old API. Requires the original file to have its last block full (a sketch follows after this list).
MapReduce jobs: I will probably take some solution based on this technology, but it's slow to set up.
copyMerge: as far as I can see, this will again be a copy, but I have not checked the details yet.
File crush: again, looks like MapReduce.
So the main takeaway is: if MapReduce setup speed suits you, there is no problem. If you have real-time requirements, things get complex.
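For the concat option in the first list item, a minimal sketch of what the call looks like; the paths are illustrative, only HDFS implements concat(), and the preconditions noted above (full last blocks, matching block sizes) still apply:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatParts {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The target must already exist; the part files' blocks are re-linked onto it
        // by the NameNode, so no data is streamed through the client.
        Path target = new Path("/path/to/concatenated/file/target.csv");
        Path[] parts = {
            new Path("/path/to/result/of/many/reducers/part-00000"),
            new Path("/path/to/result/of/many/reducers/part-00001")
        };
        fs.concat(target, parts);   // the part files are removed once their blocks are moved
    }
}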
One of my 'crazy' ideas is to use HBase coprocessor mechanics (endpoints) and file-block locality information for this, as I have HBase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction

Using mahout and hadoop

I am a newbie trying to understand how Mahout and Hadoop are used for collaborative filtering. I have a single-node Cassandra setup and I want to fetch data from Cassandra.
Where can I find clear installation steps for Hadoop first, and then for Mahout to work with Cassandra?
(I think this is the same question you just asked on user@mahout.apache.org? Copying my answer.)
You may not need Hadoop at all, and if you don't, I'd suggest you not use it, for simplicity. It's a "necessary evil" for scaling past a certain point.
You can have the data in Cassandra, but you will want to be able to read it into memory. If you can dump it as a file, you can use FileDataModel. Or, you can emulate the code in FileDataModel to create one based on Cassandra.
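For example, FileDataModel reads a plain comma-separated preference file, so dumping the Cassandra data into that shape is enough; a minimal sketch with an illustrative file name:
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class LoadModel {
    public static void main(String[] args) throws Exception {
        // Each line is "userID,itemID,preference", e.g.
        //   1,101,5.0
        //   1,102,3.0
        //   2,101,2.0
        DataModel model = new FileDataModel(new File("prefs.csv"));   // illustrative file name
        System.out.println(model.getNumUsers() + " users, " + model.getNumItems() + " items");
    }
}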
Then, your two needs are easily answered:
1) This is not even a recommendation problem. Just pick an implementation of UserSimilarity, use it to compare a user to all others, and pick the ones with the highest similarity. (Wrapping it with CachingUserSimilarity will help a lot.)
2) This is just a recommender problem. Use a GenericUserBasedRecommender with your UserSimilarity and DataModel and you're done.
It can of course get much more complex than this, but this is a fine starting point.
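A minimal sketch of those two points using the non-distributed Taste API, assuming the FileDataModel above; PearsonCorrelationSimilarity, the neighborhood size and the user ID are illustrative choices, not prescriptions:
import java.io.File;
import java.util.Arrays;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteStarter {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv"));

        // (1) Most-similar users: pick a UserSimilarity, wrap it in CachingUserSimilarity,
        //     and compare one user against the rest.
        UserSimilarity similarity =
                new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);

        // (2) Recommendations: plug the same similarity and model into a
        //     GenericUserBasedRecommender.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        long userId = 1L;   // illustrative user ID
        long[] similarUsers = recommender.mostSimilarUserIDs(userId, 5);
        List<RecommendedItem> recs = recommender.recommend(userId, 5);
        System.out.println("similar users: " + Arrays.toString(similarUsers));
        System.out.println("recommendations: " + recs);
    }
}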
If you later use Hadoop, then yes, you have to set up Hadoop according to its instructions; there is no separate Mahout "setup". For recommenders, you would look at one of the RecommenderJob classes, which invoke the necessary jobs on your Hadoop cluster. You would run it with the "hadoop" command -- again, this is where you'd just need to understand Hadoop.
The book Mahout in Action writes up most of the Mahout Hadoop jobs in some detail.
The book Mahout in Action did indeed just save me from a frustrating lack of docs.
I was following https://issues.apache.org/jira/browse/MAHOUT-180 ... which suggests a 'hadoop -jar' syntax that only gave me errors. The book has 'jar' instead, and with that fix my test job is happily running.
Here's what I did:
Used the utility at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?showComment=1298565709376#c3501116664672385942 to convert a CSV representation of my matrix to Mahout's file format, then copied it into the Hadoop filesystem.
Uploaded mahout-examples-0.5-SNAPSHOT-job.jar from a freshly built Mahout on my laptop onto the Hadoop cluster's control box. No other Mahout stuff on there.
Ran this (assuming Hadoop is configured, which I confirm with dfs -ls /user/danbri):
hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
--input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 50
...now whether I got this right is quite another matter, but it seems to be doing something!
You can follow the tutorial below to learn; it is easy to understand and clearly covers the basics of Hadoop:
http://developer.yahoo.com/hadoop/tutorial/
