Issue with Hadoop and Google Cloud Storage Connector

I've deployed a Hadoop cluster via the Deployments interface in the Google console (Hadoop 2.x).
My task is to filter data stored in one Google Cloud Storage (GS) bucket and write the results to another, so this is a map-only job with a simple Python script. Note that the cluster and the output bucket are in the same zone (EU).
Using the Google Cloud Storage Connector, I run the following streaming job:
hadoop jar /home/hadoop/hadoop-install/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-file file_1 \
-file mymapper.py \
-input gs://inputbucket/somedir/somedir2/*-us-* \
-output gs://outputbucket/somedir3/somedir2 \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper mymapper.py
What happens is that all the mappers process the data and store the results in a temporary directory in GS, which looks like:
gs://outputbucket/somedir3/somedir2/_temporary/1/mapper-0000/part-0000.gz
After all the mappers finish, job progress hangs at 100% map, 0% reduce. Looking at the output bucket with gsutil, I see that the result files are being copied to the destination directory:
gs://outputbucket/somedir3/somedir2
This process takes a very long time and negates the whole benefit of using Hadoop.
My questions are:
1) Is this a known issue, or have I just done something wrong? I couldn't find any relevant info.
2) Am I correct in saying that HDFS would normally move those files to the destination directory, but GS can't perform a move, so the files are copied instead?
3) What can I do to avoid this pattern?

You're almost certainly running into the "Slow FileOutputCommitter" issue, which affects Hadoop 2.0 through 2.6 inclusive and is fixed in 2.7.
If you're looking for a nice managed Hadoop option on Google Cloud Platform, you should consider Google Cloud Dataproc (documentation here), where we maintain our distro to ensure we pick up patches relevant to Google Cloud Platform quickly. Dataproc indeed configures mapreduce.fileoutputcommitter.algorithm.version so that the final commitJob is fast.
For something more "do-it-yourself", you can use our command-line bdutil tool, which also has the latest update to use the fast FileOutputCommitter.
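If you manage your own cluster and can move to Hadoop 2.7 or later, you can also opt into the fast commit algorithm per job by setting mapreduce.fileoutputcommitter.algorithm.version=2. An abbreviated, hedged sketch against the streaming job above (the 2.7.1 jar path is an assumption):
# requires Hadoop 2.7+; jar path shown is an assumption
hadoop jar /home/hadoop/hadoop-install/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D mapreduce.fileoutputcommitter.algorithm.version=2 \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-file mymapper.py \
-input gs://inputbucket/somedir/somedir2/*-us-* \
-output gs://outputbucket/somedir3/somedir2 \
-mapper mymapper.py
With algorithm version 2, each task commits its output directly into the final output directory, so the commitJob phase no longer has to copy every part file at the end of the job.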

Related

Insufficient number of DataNodes reporting when creating dataproc cluster

I am getting an "Insufficient number of DataNodes reporting" error when creating a Dataproc cluster with gs:// as the default FS. Below is the command I am using to create the Dataproc cluster.
gcloud dataproc clusters create cluster-538f --image-version 1.2 \
--bucket dataproc_bucket_test --subnet default --zone asia-south1-b \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 \
--num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project delcure-firebase \
--properties 'core:fs.default.name=gs://dataproc_bucket_test/'
I checked and confirmed that the default folder is created in the bucket I am using.
As Igor suggests, Dataproc does not support GCS as a default FS, and I also suggest unsetting this property. Note that the fs.default.name property can be passed to individual jobs and will work just fine.
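For example, instead of making GCS the cluster default, you can pass the property to a single job at submit time; a hedged sketch (the cluster name and bucket come from the question, while the cat mapper/reducer and the input/output paths are placeholders):
# pass fs.default.name as a per-job property rather than a cluster property
gcloud dataproc jobs submit hadoop --cluster cluster-538f \
--jar file:///usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
--properties fs.default.name=gs://dataproc_bucket_test/ \
-- -input gs://dataproc_bucket_test/input -output gs://dataproc_bucket_test/output \
-mapper cat -reducer cat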
The error arises when the file system is accessed (HdfsClientModule). So I think it is probable that Google Cloud Storage doesn't have a specific feature that Hadoop requires, and cluster creation fails after some folders have been created.
As somebody else mentioned previously, it is better to give up on using GCS as the default FS and let HDFS keep that role in Dataproc. Nonetheless, you can still take advantage of Cloud Storage for data persistence, reliability, and performance, because data in HDFS is removed when a cluster is shut down.
1. From a Dataproc node you can access data through the hadoop command to move data in and out, for example:
hadoop fs -ls gs://CONFIGBUCKET/dir/file
hadoop distcp hdfs://OtherNameNode/dir/ gs://CONFIGBUCKET/dir/file
2. To access data from Spark or any Hadoop application, just use the gs:// prefix to access your bucket.
Furthermore, if the Cloud Storage connector is installed on an on-premises cluster, it can help move HDFS data to Cloud Storage, which can then be accessed from a Dataproc cluster.

Error when running python map reduce job using Hadoop streaming in Google Cloud Dataproc environment

I want to run a Python MapReduce job in Google Cloud Dataproc using the Hadoop streaming method. My MapReduce Python script, input file, and job output are located in Google Cloud Storage.
I tried to run this command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
-mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py \
-file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-reducer gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat
But I got this error output :
File:
/home/ramaadhitia/gs:/bucket-name/intro_to_mapreduce/mapper_prod_cat.py
does not exist, or is not readable.
Try -help for more information Streaming Command Failed!
Is the Cloud Storage connector not working with Hadoop streaming? Is there any other way to run a Python MapReduce job using Hadoop streaming with the Python script and input file located in Google Cloud Storage?
Thank You
The -file option from hadoop-streaming only works for local files. Note, however, that its help text mentions that the -file flag is deprecated in favor of the generic -files option. Using the generic -files option lets you specify a remote (hdfs / gs) file to stage. Note also that generic options must precede application-specific flags.
Your invocation would become:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-mapper mapper_prod_cat.py \
-reducer reducer_prod_cat.py \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat
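One additional caveat, offered as an assumption rather than a diagnosis of the error above: the executable bit is not necessarily preserved when a script is staged from GCS, so if the tasks fail with a permission error it can be safer to invoke the interpreter explicitly instead of relying on the shebang:
# variant that calls python explicitly on the staged scripts
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-files gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py,gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py \
-mapper "python mapper_prod_cat.py" \
-reducer "python reducer_prod_cat.py" \
-input gs://bucket-name/intro_to_mapreduce/purchases.txt \
-output gs://bucket-name/intro_to_mapreduce/output_prod_cat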

Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if bucket access is read-only.
What am I doing:
Deploy a cluster with
bdutil deploy -e datastore_env.sh
On the master:
vgorelik#vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
java.io.IOException: Multiple IOExceptions.
java.io.IOException: Multiple IOExceptions.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
Looking at the GCS connector's Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writeable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
Is it possible to use read-only buckets as MR job input somehow?
I use gsutil for file uploads. Can it be forced to create these empty objects on file upload?
Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.
For example, I have run this job many times:
./hadoop-install/bin/hadoop \
jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
-mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
-output gs://big-data-roadshow/output
This accesses the same read-only bucket you mention in your example above.
The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.
I recommend you use gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects) and once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
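For example, a quick check with gsutil (wildcards are expanded client-side, so this works against a read-only bucket), using the same glob as the command above:
# preview which objects the glob matches before submitting the job
gsutil ls gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* | head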
The answer to your second question ("Can gsutil be forced to create these empty objects on file upload") is currently "no".

Make files available locally on Elastic MapReduce

The Hadoop documentation states that it's possible to make files available locally by using the -file option.
How can I do this using the Elastic MapReduce Ruby CLI?
You could use the DistributedCache with EMR to do this.
With the ruby client this can be done with the following option:
`--cache <path_to_file_being_cached#name_in_current_working_dir>`
It places a single file in the DistributedCache. It lets you specify the location (s3n or hdfs) of the file, followed by the name it will be referenced by in the current working directory of the application, and will place the file locally on your task nodes in the directory identified by mapred.local.dir (I think).
You can then access the files in your Mapper/Reducer tasks easily. I believe you can directly access it just like any normal file, but you may have to do something like DistributedCache.getLocalCacheFiles(job); in the setup method of your tasks.
An example to do this in the Ruby client taken from Amazon's forums:
./elastic-mapreduce --create --stream --input s3n://your_bucket/wordcount/input --output s3n://your_bucket/wordcount/output --mapper s3n://your_bucket/wordcount/wordSplitter.py --reducer aggregate --cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list

How to use Hadoop Streaming with LZO-compressed Sequence Files?

I'm trying to play around with the Google ngrams dataset using Amazon's Elastic Map Reduce. There's a public dataset at http://aws.amazon.com/datasets/8172056142375670, and I want to use Hadoop streaming.
For the input files, it says "We store the datasets in a single object in Amazon S3. The file is in sequence file format with block level LZO compression. The sequence file key is the row number of the dataset stored as a LongWritable and the value is the raw data stored as TextWritable."
What do I need to do in order to process these input files with Hadoop Streaming?
I tried adding an extra "-inputformat SequenceFileAsTextInputFormat" to my arguments, but this doesn't seem to work -- my jobs keep failing for some unspecified reason. Are there other arguments I'm missing?
I've tried using a very simple identity as both my mapper and reducer
#!/usr/bin/env ruby
STDIN.each do |line|
  puts line
end
but this doesn't work.
LZO is packaged as part of Elastic MapReduce, so there's no need to install anything.
I just tried this and it works:
hadoop jar ~hadoop/contrib/streaming/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
-inputformat SequenceFileAsTextInputFormat \
-output test_output \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper
LZO compression has been removed from Hadoop 0.20.x onwards due to licensing issues. If you want to process LZO-compressed sequence files, the LZO native libraries have to be installed and configured on the Hadoop cluster.
Kevin's hadoop-lzo project is the current working solution I am aware of. I have tried it; it works.
Install (if not done already) the lzo-devel packages at the OS level. These packages enable LZO compression at the OS level, without which Hadoop LZO compression won't work.
Follow the instructions in the hadoop-lzo README and compile it. After the build, you get the hadoop-lzo jar and the Hadoop LZO native libraries. Ensure that you compile on the machine (or a machine of the same architecture) where your cluster is configured.
Hadoop's standard native libraries are also required; they are provided in the distribution by default for Linux. If you are using Solaris, you will also need to build Hadoop from source in order to get the standard Hadoop native libraries.
Restart the cluster once all changes are made.
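Once the jar and native libraries are deployed, the codecs can also be registered per job. The following is only a hedged sketch: the streaming jar and hadoop-lzo jar paths are assumptions, and the property and class names follow the hadoop-lzo README.
# register the LZO codecs for a single streaming job (paths are assumptions)
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
-libjars /path/to/hadoop-lzo.jar \
-D io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec \
-D io.compression.codec.lzo.class=com.hadoop.compression.lzo.LzoCodec \
-D mapred.reduce.tasks=0 \
-input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
-inputformat SequenceFileAsTextInputFormat \
-output lzo_test_output \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper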
You may want to look at this https://github.com/kevinweil/hadoop-lzo
I had weird results using LZO, and my problem was resolved with another codec:
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
Then things just work. You don't need to (and maybe also shouldn't) change the -inputformat.
Version: 0.20.2-cdh3u4, 214dd731e3bdb687cb55988d3f47dd9e248c5690
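For completeness, a hedged sketch of how those flags could fit into a full streaming invocation; note that the corresponding compress switches also have to be enabled (property names use the old 0.20.x naming, matching the version quoted above):
# enable Snappy for map output and job output in a streaming job
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
-D mapred.compress.map.output=true \
-D mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-D mapred.reduce.tasks=0 \
-input s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-all/1gram/ \
-inputformat SequenceFileAsTextInputFormat \
-output snappy_test_output \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper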
