Spark 2.2.0 FileOutputCommitter - hadoop

DirectFileOutputCommitter is no longer available in Spark 2.2.0. This means writing to S3 takes insanely long time (3 hours vs 2 mins). I'm able to work around this by setting FileOutputCommitter version to 2 in spark-shell by doing this,
spark-shell --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
same does not work with spark-sql
spark-sql --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The above command seems to be setting the version=2 but when the query is exeucted it still shows version 1 behaviour.
Two questions,
1) How do I get FileOutputCommitter version 2 behaviour with spark-sql?
2) Is there a way I can still use DirectFileOutputCommitter in spark 2.2.0? [I'm fine with non-zero chance of missing data]
Related items:
Spark 1.6 DirectFileOutputCommitter

I have been hit by this issue. Spark is discouraging the usage of DirectFileOutputCommitter as it might lead to data loss in case of race situation. The algorithm version 2 doesn't help a lot.
I have tried to use the gzip to save the data in s3 instead of snappy compression which gave some benefit.
The real issue here is that spark writes in the s3://<output_directory>/_temporary/0 first then copies the data from temporary to the output. This process is pretty slow in s3,(Generally 6MBPS) So if you get lot of data you will get considerable slowdown.
The alternative is to write to HDFS first then use distcp / s3distcp to copy the data to s3.
Also , You could look for a solution Netflix provided.
I haven't evaluated that.
EDIT:
The new spark2.4 version has solved the problem of slow s3 write. I have found the s3 write performance of spark2.4 with hadoop 2.8 in the latest EMR version (5.24) is almost at par with HDFS write.
See the documents
https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-performance.html

Related

Hadoop Distcp - increasing distcp.dynamic.max.chunks.tolerable config and tuning distcp

I am trying to move data between two hadoop clusters using distcp. There is a lot of data to move with a large number of small files. In order to make it faster, I tried using -strategy dynamic, which according to the documentation, 'allows faster data-nodes to copy more bytes than slower nodes'.
I am setting the number of mappers to 400. when I launch the job, I see this error: java.io.IOException: Too many chunks created with splitRatio:2, numMaps:400. Reduce numMaps or decrease split-ratio to proceed.
when I googled about it, I found this link: https://issues.apache.org/jira/browse/MAPREDUCE-5402
In this link the author asks for a feature where we can increase distcp.dynamic.max.chunks.tolerable to resolve the issue.
The ticket says issue was resolved in the version 2.5.0. The hadoop version I am using is 2.7.3. So I believe it should be possible for me to increase the value of distcp.dynamic.max.chunks.tolerable.
However, I am not sure how can I increase that. Can this configuration be updated for a single distcp job by passing it like -Dmapreduce.job.queuename or do I have to update it on mapred-site.xml ? Any help would be appreciated.
Also does this approach work well if there are a large number of small files? are there any other parameters I can use to make it faster? Any help would be appreciated.
Thank you.
I was able to figure it out. A parameter can be passed with the distcp command instead of having to update the mapred-site.xml:
hadoop distcp -Ddistcp.dynamic.recordsPerChunk=50 -Ddistcp.dynamic.max.chunks.tolerable=10000 -skipcrccheck -m 400 -prbugc -update -strategy dynamic "hdfs://source" "hdfs://target"

Why is Spark setting partitions to the file size in bytes?

I have a very simple pyspark program that is supposed to read CSV files from S3:
r = sc.textFile('s3a://some-bucket/some-file.csv')
.map(etc... you know the drill...)
This was failing when running a local Spark node (it works in EMR). I was getting OOM errors and GC crashes. Upon further inspection, I realized that the number of partitions was insanely high. In this particular case r.getNumPartitions() would return 2358041.
I realized that that's exactly the size of my file in bytes. This, of course, makes Spark crash miserably.
I've tried different configurations, like chaning mapred.min.split.size:
conf = SparkConf()
conf.setAppName('iRank {}'.format(datetime.now()))
conf.set("mapred.min.split.size", "536870912")
conf.set("mapred.max.split.size", "536870912")
conf.set("mapreduce.input.fileinputformat.split.minsize", "536870912")
I've also tried using repartition or changing passing a partitions argument to textFile, to no avail.
I would love to know what makes Spark think that it's a good idea to derive the number of partitions from the file size.
In general it doesn't. As nicely explained by eliasah in his answer to Spark RDD default number of partitions it uses max of minPartitions (2 if not provided) and splits computed by Hadoop input format.
The latter one will by unreasonably high, only if instructed by the configuration. This suggests that some configuration file interferes with your program.
The possible problem with your code is that you use wrong configuration. Hadoop options should be set using hadoopConfiguration not Spark configuration. It looks like you use Python so you have to use private JavaSparkContext instance:
sc = ... # type: SparkContext
sc._jsc.hadoopConfiguration().setInt("mapred.min.split.size", min_value)
sc._jsc.hadoopConfiguration().setInt("mapred.max.split.size", max_value)
There was actually a bug in Hadoop 2.6 which would do this; the initial S3A release didn't provide a block size to Spark to split up, the default of "0" meant one-byte-per-job.
Later version should all take fs.s3a.block.size as the config option specifying the block size...something like 33554432 (= 32 MB) would be a start.
If you are using Hadoop 2.6.x. Don't use S3A. That's my recommendation.

Spark/Hadoop throws exception for large LZO files

I'm running an EMR Spark job on some LZO-compressed log-files stored in S3. There are several logfiles stored in the same folder, e.g.:
...
s3://mylogfiles/2014-08-11-00111.lzo
s3://mylogfiles/2014-08-11-00112.lzo
...
In the spark-shell I'm running a job that counts the lines in the files. If I count the lines individually for each file, there is no problem, e.g. like this:
// Works fine
...
sc.textFile("s3://mylogfiles/2014-08-11-00111.lzo").count()
sc.textFile("s3://mylogfiles/2014-08-11-00112.lzo").count()
...
If I use a wild-card to load all the files with a one-liner, I get two kinds of exceptions.
// One-liner throws exceptions
sc.textFile("s3://mylogfiles/*.lzo").count()
The exceptions are:
java.lang.InternalError: lzo1x_decompress_safe returned: -6
at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
and
java.io.IOException: Compressed length 1362309683 exceeds max block size 67108864 (probably corrupt file)
at com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:291)
It seems to me that the solution is hinted by the text given with the last exception, but I don't know how to proceed. Is there a limit to how big LZO files are allowed to be, or what is the issue?
My question is: Can I run Spark queries that load all LZO-compressed files in an S3 folder, without getting I/O related exceptions?
There are 66 files of roughly 200MB per file.
EDIT:
The exception only occurs when running Spark with Hadoop2 core libs (ami 3.1.0). When running with Hadoop1 core libs (ami 2.4.5), things work fine. Both cases were tested with Spark 1.0.1.
kgeyti's answer works fine, but:
LzoTextInputFormat introduces a performance hit, since it checks for an .index file for each LZO file. This can be especially painful with many LZO files on S3 (I've experienced up to several minutes delay, caused by thousands of requests to S3).
If you know up front that your LZO files are not splittable, a more performant solution is to create a custom, non-splittable input format:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
class NonSplittableTextInputFormat extends TextInputFormat {
override def isSplitable(context: JobContext, file: Path): Boolean = false
}
and read the files like this:
context.newAPIHadoopFile("s3://mylogfiles/*.lzo",
classOf[NonSplittableTextInputFormat],
classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text])
.map(_._2.toString)
I haven't run into this specific issue myself, but it looks like .textFile expects files to be splittable, much like the Cedrik's problem of Hive insisting on using CombineFileInputFormat
You could either index your lzo files, or try using the LzoTextInputFormat - I'd be interested to hear if that works better on EMR:
sc.newAPIHadoopFile("s3://mylogfiles/*.lz",
classOf[com.hadoop.mapreduce.LzoTextInputFormat],
classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text])
.map(_._2.toString) // if you just want a RDD[String] without writing a new InputFormat
.count
yesterday we deployed Hive on a EMR cluster and had the same problem with some LZO files in S3 which have been taken without any problem by another non EMR cluster. After some digging in the logs I noticed, that the map tasks read the S3 files in 250MB chunks, although the files are definitely not splittable.
It turned out that the paramter mapreduce.input.fileinputformat.split.maxsize was set to 250000000 ~ 250MB. That resulted in LZO opening a stream from within a file and a ultimately a corrupt LZO block.
I set the parameter mapreduce.input.fileinputformat.split.maxsize=2000000000 bigger as the maximum file size of our input data and everything works now.
I'm not exactly sure how that correlates to Spark exactly, but changing the InputFormat might help, which seems like the problem in first place, as it has been mentioned in How Amazon EMR Hive Differs from Apache Hive.

Merge HDFS files without going through the network

I could do this:
hadoop fs -text /path/to/result/of/many/reudcers/part* | hadoop fs -put - /path/to/concatenated/file/target.csv
But it will make the HDFS file get streamed through the network. Is there a way to tell the HDFS to merge few files on the cluster itself?
I have problem similar to yours.
Here is article with number of HDFS files merging options but all of them have some specifics. No one from this list meets my requirements. Hope this could help you.
HDFS concat (actually FileSystem.concat()). Not so old API. Requires original file to have last block full.
MapReduce jobs: probably I will take some solution based on this technology but it's slow to setup.
copyMerge - as far as I can see this will be again copy. But I did not check details yet.
File crush - again, looks like MapReduce.
So main result is if MapReduce setup speed suits you, no problem. If you have realtime requirements, things are getting complex.
One of my 'crazy' ideas is to use HBase coprocessor mechanics (endpoints) and files blocks locality information for this as I have Hbase on the same cluster. If the word 'crazy' doesn't stop you, look at this: http://blogs.apache.org/hbase/entry/coprocessor_introduction

Using mahout and hadoop

I am a newbie trying to understand how will mahout and hadoop be used for collaborative filtering. I m having single node cassandra setup. I want to fetch data from cassandra
Where can I find clear installation steps for hadoop first and then mahout to work with cassandra?
(I think this is the same question you just asked on user#mahout.apache.org? Copying my answer.)
You may not need Hadoop at all, and if you don't, I'd suggest you not use it for simplicity. It's "necessary evil" to scale past a certain point.
You can have data on Cassandra but you will want to be able to read it into memory. If you can dump as a file, you can use FileDataModel. Or, you can emulate the code in FileDataModel to create one based on Cassandra.
Then, your two needs are easily answered:
This is not even a recommendation
problem. Just pick an implementation
of UserSimilarity, and use it to
compare a user to all others, and
pick the ones with highest
similarity. (Wrapping with
CachingUserSimilarity will help a
lot.)
This is just a recommender
problem. Use a
GenericUserBasedRecommender with
your UserSimilarity and DataModel
and you're done.
It of course can get much more complex than this, but this is a fine start point.
If later you use Hadoop, yes you have to set up Hadoop according to its instructions. There is no Mahout "setup". For recommenders, you would look at one of the RecommenderJob classes which invokes the necessary jobs on your Hadoop cluster. You would run it with the "hadoop" command -- again, this is where you'd need to just understand Hadoop.
The book Mahout in Action writes up most of the Mahout Hadoop jobs in some detail.
The book Mahout in Action did indeed just save me from a frustrating lack of docs.
I was following https://issues.apache.org/jira/browse/MAHOUT-180 ... which suggests a 'hadoop -jar' syntax that only gave me errors. The book has 'jar' instead, and with that fix my test job is happily running.
Here's what I did:
used the utility at http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html?showComment=1298565709376#c3501116664672385942 to convert a CSV representation of my matrix to a mahout file format. Copied it into Hadoop filesystem.
Uploaded mahout-examples-0.5-SNAPSHOT-job.jar from a freshly built Mahout on my laptop, onto the hadoop cluster's control box. No other mahout stuff on there.
Ran this: (assumes hadoop is configured; which I confirm with dfs -ls /user/danbri )
hadoop jar ./mahout-examples-0.5-SNAPSHOT-job.jar \
org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver \
--input svdoutput.mht --output outpath --numRows 0 --numCols 4 --rank 50
...now whether I got this right is quite another matter, but it seems to be doing something!
you can follow following tutorial to learn. its ease to understand and stated clearly about basics of Hadoop:
http://developer.yahoo.com/hadoop/tutorial/

Resources