Insufficient number of DataNodes reporting when creating dataproc cluster - hadoop

I am getting an "Insufficient number of DataNodes reporting" error when creating a Dataproc cluster with gs:// as the default FS. Below is the command I am using to create the Dataproc cluster.
gcloud dataproc clusters create cluster-538f --image-version 1.2 \
--bucket dataproc_bucket_test --subnet default --zone asia-south1-b \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 \
--num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project delcure-firebase \
--properties 'core:fs.default.name=gs://dataproc_bucket_test/'
I checked and confirmed that the default folders can be created in the bucket I am using.

As Igor suggests, Dataproc does not support GCS as a default FS, so I also suggest unsetting this property. Note that the fs.default.name property can be passed to individual jobs and will work just fine.
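For illustration, a minimal sketch of passing it per job instead of at cluster creation time (the examples jar path and the wordcount input/output arguments are placeholders, not something from the question):

gcloud dataproc jobs submit hadoop --cluster cluster-538f \
  --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  --properties fs.default.name=gs://dataproc_bucket_test/ \
  -- wordcount /input /output

With the property set only for that job, unqualified paths such as /input resolve against the GCS bucket, while the cluster itself keeps HDFS as its default FS.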

The error arises when the file system is accessed (HdfsClientModule), so I think it is probable that Google Cloud Storage lacks a specific feature that Hadoop requires as a default file system, and the cluster creation fails after some folders have already been created.
As somebody else mentioned previously, it is better to give up the idea of using GCS as the default FS and let HDFS keep that role in Dataproc. Nonetheless, you can still take advantage of Cloud Storage for data persistence, reliability, and performance; remember that data in HDFS is removed when a cluster is shut down.
1.- From a Dataproc node you can access data through the hadoop command to move data in and out, for example:
hadoop fs -ls gs://CONFIGBUCKET/dir/file
hadoop distcp hdfs://OtherNameNode/dir/ gs://CONFIGBUCKET/dir/file
2.- To access data from Spark or any Hadoop application, just use the gs:// prefix when referring to your bucket.
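For example, a hedged sketch of a Spark submission that reads from and writes to Cloud Storage directly (the class, jar, and paths are hypothetical, and it assumes the application takes its input and output locations as arguments):

spark-submit --class com.example.FilterJob my-job.jar \
  gs://CONFIGBUCKET/dir/input/ gs://CONFIGBUCKET/dir/output/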
Furthermore, if the Cloud Storage connector is installed on premises, it can help move HDFS data into Cloud Storage, which can then be accessed from a Dataproc cluster.

Related

Sqoop failing when importing as avro in AWS EMR

I'm trying to perform a Sqoop import in Amazon EMR (Hadoop 2.8.5, Sqoop 1.4.7). The import goes pretty well when the Avro option (--as-avrodatafile) is not specified, but once it's set, the job fails with
19/10/29 21:31:35 INFO mapreduce.Job: Task Id : attempt_1572305702067_0017_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
Using this option -D mapreduce.job.user.classpath.first=true doesn't work.
Running locally (on my machine) I found that copying the avro-1.8.1.jar shipped with Sqoop into the Hadoop lib folder works, but in the EMR cluster I only have access to the master node, so doing the above doesn't work because it isn't the master node that runs the jobs.
Has anyone faced this problem?
The solution I found was to connect to every node in the cluster (I thought I only had access to the master node, but I was wrong; in EMR we have access to all nodes) and replace the Avro jar that is included with Hadoop with the Avro jar that comes with Sqoop. It's not an elegant solution, but it works.
[UPDATE]
It turned out that the option -D mapreduce.job.user.classpath.first=true wasn't working because I was using s3a as the target dir, whereas Amazon says we should use s3. As soon as I started using s3, Sqoop could perform the import correctly, so there is no need to replace any file on the nodes. Using s3a can lead to some strange errors under EMR due to Amazon's own configuration, so don't use it there. Even in terms of performance, s3 is better than s3a on EMR, as the s3 implementation is Amazon's own.
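For reference, a hedged sketch of the form that ended up working for me (the JDBC URL, credentials file, and table name are placeholders; the important parts are --as-avrodatafile and the s3:// target dir):

sqoop import \
  -D mapreduce.job.user.classpath.first=true \
  --connect jdbc:mysql://mydb.example.com:3306/sales \
  --username myuser \
  --password-file /user/hadoop/.db.password \
  --table orders \
  --as-avrodatafile \
  --target-dir s3://my-bucket/sqoop/orders \
  -m 4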

Set YARN application name for Hadoop Distcp job

NOTE: I don't want to specify a YARN-queue name as in Hadoop: specify yarn queue for distcp
I frequently use hadoop distcp for moving data around HDFS and would like to have a descriptive application name for these jobs.
Presently all copying jobs just appear with the name "distcp" on Resource Manager UI and there's no way to distinguish between different jobs.
Is there a way to improve it?
Like many other MR tools, hadoop distcp also allows you to pass mapred properties using
-Dmapred.property.name=property-value
so when I use
hadoop distcp \
-Dmapred.job.name=billing_db.replicate \
-m 10 \
/user/hive/warehouse/billing_db.db/ \
s3a://my-s3-bucket/billing_db.db/
it appears nicely in the Resource Manager UI
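As a quick check from the command line (a small sketch, assuming the YARN CLI is available on the node), the new name also shows up in the application list:

yarn application -list | grep billing_db.replicate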
References
Hadoop: specify yarn queue for distcp
Sqoop User Guide: Using Generic and Specific Arguments

Issue with Hadoop and Google Cloud Storage Connector

I've deployed a Hadoop cluster via the Deployments interface in the Google console (Hadoop 2.x).
My task was to filter data stored in one Google Storage (GS) bucket and put the results into another. So this is a map-only job with a simple Python script. Note that the cluster and the output bucket are in the same zone (EU).
Leveraging Google Cloud Storage Connector, I run the following streaming job:
hadoop jar /home/hadoop/hadoop-install/share/hadoop/tools/lib/hadoop-streaming-2.4.1.jar \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-D mapreduce.job.reduces=0 \
-file file_1 \
-file mymapper.py \
-input gs://inputbucket/somedir/somedir2/*-us-* \
-output gs://outputbucket/somedir3/somedir2 \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-mapper mymapper.py
What happens is that all the mappers process the data and store the results in a temporary directory in GS, which looks like:
gs://outputbucket/somedir3/somedir2/_temporary/1/mapper-0000/part-0000.gz
After all mappers have finished, the job progress hangs at 100% map, 0% reduce. Looking at the output bucket with gsutil, I can see that the result files are being copied to the destination directory:
gs://outputbucket/somedir3/somedir2
This process takes a very long time and kills the whole benefit of using Hadoop.
My questions are:
1) Is this a known issue, or have I just done something wrong? I couldn't find any relevant info.
2) Am I correct in saying that normally HDFS would move those files to the destination dir, but GS can't perform a move, so the files are copied instead?
3) What can I do to avoid this pattern?
You're almost certainly running into the "Slow FileOutputCommitter" issue which affects Hadoop 2.0 through 2.6 inclusive and is fixed in 2.7.
If you're looking for a nice managed Hadoop option on Google Cloud Platform, you should consider Google Cloud Dataproc (documentation here), where we maintain our distro to ensure we pick up patches relevant to Google Cloud Platform quickly. Dataproc indeed configures the mapreduce.fileoutputcommitter.algorithm.version so that the final commitJob is fast.
For something more "do-it-yourself", you can use our command-line bdutil tool, which also has the latest update to use the fast FileOutputCommitter.
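If you prefer to keep managing the cluster yourself, here is a hedged sketch of the same streaming job after upgrading to Hadoop 2.7+, with the v2 commit algorithm enabled (the jar path/version is illustrative; the remaining flags are taken from the command in the question):

hadoop jar /home/hadoop/hadoop-install/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
  -D mapreduce.fileoutputcommitter.algorithm.version=2 \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapreduce.job.reduces=0 \
  -file file_1 \
  -file mymapper.py \
  -input gs://inputbucket/somedir/somedir2/*-us-* \
  -output gs://outputbucket/somedir3/somedir2 \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -mapper mymapper.py

With algorithm version 2, per-task output is committed directly into the final destination, so the job does not end with a long single-threaded copy out of the _temporary directory.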

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a Spark program in cluster mode on Amazon EC2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode, even though I'm able to read it in standalone mode. In cluster mode it's looking to read from HDFS, so I put the file into HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, both on the node's disk. I believe you misunderstand the functionality of the hadoop commands.
Instead, you should try putting the file explicitly into the persistent HDFS (/root/persistent-hdfs/) and then loading it using the hdfs://... prefix.
Is the persistent HDFS server up?
Please take a look here; it seems Spark only starts the ephemeral HDFS server by default. To switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
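Putting the two checks together, a hedged sketch (it assumes the persistent-hdfs install ships its own hadoop client under bin/, and the NameNode host/port of the persistent HDFS are placeholders):

# Switch from the ephemeral HDFS to the persistent one
/root/ephemeral-hdfs/bin/stop-dfs.sh
/root/persistent-hdfs/bin/start-dfs.sh

# Put the file into the persistent HDFS using its own client
/root/persistent-hdfs/bin/hadoop fs -mkdir -p /wordcount/input
/root/persistent-hdfs/bin/hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt

# In the Spark code, read it with an explicit hdfs:// URI, e.g.
# sc.textFile("hdfs://<master-ip>:<port>/wordcount/input/input.txt")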
Please try these things; I hope they serve you well.

Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if the bucket access is read-only.
What I am doing:
Deploy a cluster with
bdutil deploy -e datastore_env.sh
On the master:
vgorelik#vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
java.io.IOException: Multiple IOExceptions.
java.io.IOException: Multiple IOExceptions.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
Looking at the GCS Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
Is it possible to use read-only buckets as MR job input somehow?
I use gsutil for file uploads. Can it be forced to create these empty objects on file upload?
Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.
For example, I have run this job many times:
./hadoop-install/bin/hadoop \
jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
-mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
-output gs://big-data-roadshow/output
This accesses the same read-only bucket you mention in your example above.
The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.
I recommend you use gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects) and once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
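For example, a small sketch of that workflow, reusing the glob from the command above:

# Check that the glob matches the objects you expect
gsutil ls gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* | head

# Then pass the same glob as the -input of the hadoop streaming command shown earlier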
The answer to your second question ("Can gsutil be forced to create these empty objects on file upload") is currently "no".
