Issue connecting to hdfs using cloud shell - hadoop

I find it quite hard accessing my hadoop data file system using google cloud shell (I’ve created a cluster on the Google Cloud Platform just to learn).
The generic ‘hdfs dfs -ls’ or ‘hadoop fs -ls gs://‘ doesn’t seem to work and I’ve been doing quite a but of trial-errors to figure out how.
Can anyone help me out on this?
Thanks :)

You can use Cloud Storage connector which provides an implementation of the FileSystem abstraction, and is available in different HDP versions, to facilitate access to GCS, and then you should be able to use 'hadoop fs -ls gs://CONFIGBUCKET/dir/file' in the hadoop shell. Please check this tutorial and also be sure that you are properly configured access to Google Cloud Storage.

The simplest way to access HDFS through Hadoop CLI is to SSH on the Dataproc cluster master node and use CLI utilities there:
gcloud compute ssh ${DATAPROC_CLUSTER_NAME}-m
hdfs dfs -ls
hadoop fs -ls gs:/
It doesn't work in Cloud Shell because it doesn't have Hadoop CLI utilities pre-installed.

Related

Why hadoop commands don't work on google cloud shell

After creating cluster for my project in google DataProc I've tried to type several commands for Hadoop (like hadoop fs -ls). Unfortunately it appears cloud shell doesn't see Hadoop at all!
-bash: hadoop: command not found
Someone on stackoverflow said:
"It doesn't work in Cloud Shell because it doesn't have Hadoop CLI
utilities pre-installed."
But I've no idea how to install it or either activate it. Maybe through cluster creation, but had issue with creating it through dataproc API. I've done it through cloud shell instead.
What should I do to use Hadoop commands in cloud shell properly?
apparently hadoop commands works only on VM Instances not on general project directory. So make sure you connect to cluster via Compute Engine -> VM instances -> [your node] in INSTANCES tab via SSH

How to run HDFS Copy commands using Airflow?

May I know how to execute HDFS copy commands on DataProc cluster using airflow.
After the cluster is created using airflow, I have to copy few jar files from Google storage to the HDFS master node folder.
You can execute hdfs commands on dataproc cluster using something like this
gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --
region=europe-west1
The easiest way is [1] via
gcloud dataproc jobs submit pig --execute 'fs -ls /'
or otherwise [2] as a catch-all for other shell commands.
For a single small file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because
hdfs://<master node>
is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
For a large file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consider [3] for details.
[1] https://pig.apache.org/docs/latest/cmds.html#fs
[2] https://pig.apache.org/docs/latest/cmds.html#sh
[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
I am not sure about your use case to do this via airflow because if its onetime setup then i think we can run commands directly on dataproc cluster. But found some links which might be of some help. As i understand we can use BashOperator and can run commands.
https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/
Airflow Dataproc operator to run shell scripts

Hortonworks VM - Hadoop batch upload?

Is there a way to batch upload files to Hadoop under a Hortonworks VM running CentOS? I see I can use the Ambari - Sandbox's HDFS Files tool, but that only allows uploading one-by-one. Apparently you could use Redgate's HDFS Explorer in the past, but it's no longer available. Hadoop is made to process big data, but it's absurd having to upload all files one-by-one...
Thank you!
Of course you can use the * wildcard in copyFromLocal, f.e.:
hdfs dfs -copyFromLocal input/* /tmp/input

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a spark program in cluster mode in Amazon Ec2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode even if I'm able to read in standalone mode. In cluster mode, it's looking to read from hdfs. So I put the file in hdfs at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /workcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong. Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copy the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, all in the node disk. I believe you misunderstand the functionality of the hadoop commands.
Instead, you should try putting the file explicitly in the persistent HDFS (root/persistent-hdfs/), and then loading it using the hdfs://... prefix.
Is the persistent HDFS server up?
Please take a look here, it seems Spark only starts the ephemeral HDFS server by default. In order to switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these things, I hope they can serve you well.

hadoop file system change directory command

I was going through the HADOOP fs commands list. I am little perplexed not to find any "cd" command in hadoop fs.
Why is it so? It might sound silly question for the HADOOP users, but as I am beginner I can not understand why there is no list of cd command in HADOOP fs level?
Think about it like this:
Hadoop has a special file system called "hdfs" which runs on top of existing say linux file system. There is no concept of current or present working directory a.k.a. pwd
Let's say we have following structure in hdfs:
d1/
d2/
f1
d3/
f2
d4/
f3
You could do cd in your Linux file system from moving from one to the other but do you think changing directory in hadoop would makes sense? HDFS is like virtual file system and you dont directly interact with hdfs except via hadoop command or job tracker.
HDFS provides various features that enable accessing HDFS(Hadoop Filesystem) easy on local machines or edge nodes. You have an option to mount HDFS using any of the following methods. Once Hadoop file system is mounted on your machine, you may use cd command to browse through the file system (It's is like mounting remote network filesystem like NAS)
Fuse dfs (Available from Hadoop 0.20 onwards )
NFSv3 Gateway access to HDFS data (Available from Hadoop version
Hadoop 2.2.0)

Resources