How can I see the underlying Hadoop file system from Spark - hadoop

I have started Spark like this:
spark-shell --master local[10]
I'm trying to see the files on the underlying Hadoop installation.
I want to do something like this:
hdfs ls
How can I do it?

You can execute any underlying system/OS command (such as hdfs dfs -ls, or even plain shell/DOS commands) from Scala (which ships with Spark) simply by importing the sys.process package.
See the examples below.
Linux
import sys.process._
val oldcksum = "cksum oldfile.txt" !!
val newcksum = "cksum newfile.txt" !!
val hdpFiles = "hdfs dfs -ls" !!
Windows
import sys.process._ // This lets underlying OS commands be executed.
val oldhash = "certUtil -hashFile PATH_TO_FILE" !! // CertUtil is a Windows command
If you plan to read from and write to HDFS in Spark, you first need to integrate Spark with Hadoop:
http://spark.apache.org/docs/latest/configuration.html#inheriting-hadoop-cluster-configuration
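Once the Hadoop configuration is inherited (for example, by pointing HADOOP_CONF_DIR at your cluster's configuration directory before launching spark-shell, as the linked page describes), Spark can address HDFS paths directly. A minimal sketch from spark-shell, where the path is only a placeholder:
// Read a text file from HDFS; the path below is a placeholder
val lines = sc.textFile("hdfs:///user/input/sample.txt")
println(lines.count())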

If I understand your question correctly, you want to execute HDFS commands from the shell. In my opinion, running a Spark job will not help.
You need to start your HDFS instance first. Below are the commands from the documentation. Once HDFS is started, you can run the shell commands.
To start a Hadoop cluster you will need to start both the HDFS and YARN cluster.
The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as hdfs:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format
Start the HDFS NameNode with the following command on the designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
Start a HDFS DataNode with the following command on each designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
If etc/hadoop/slaves and ssh trusted access is configured (see Single Node Setup), all of the HDFS processes can be started with a utility script. As hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh
Start the YARN with the following command, run on the designated ResourceManager as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
Run a script to start a NodeManager on each designated host as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager
Start a standalone WebAppProxy server. Run on the WebAppProxy server as yarn. If multiple servers are used with load balancing it should be run on each of them:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver
If etc/hadoop/slaves and ssh trusted access is configured (see Single Node Setup), all of the YARN processes can be started with a utility script. As yarn:
[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh
Start the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
The second option is the programmatic way: you can use Hadoop's FileSystem class (a Java API) to perform HDFS operations.
Below is the link to the javadoc:
https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html
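For example, a minimal sketch in Scala, assuming the Hadoop configuration is on the classpath (as it is inside spark-shell) and using a placeholder path:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Pick up core-site.xml / hdfs-site.xml from the classpath
val conf = new Configuration()
val fs = FileSystem.get(conf)

// List an HDFS directory ("/user/input" is a placeholder) and print each entry's path
fs.listStatus(new Path("/user/input")).foreach(status => println(status.getPath))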

You can see the underlying HDFS file system using the following commands in spark-shell:
import scala.sys.process._
val lsOutput = Seq("hdfs","dfs","-ls","/path/to/folder").!!
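If you want to work with the listing programmatically rather than just print it, here is a small follow-up sketch reusing lsOutput from above:
// Split the command output into non-empty lines, one entry per line
val entries = lsOutput.split("\n").filter(_.trim.nonEmpty)
entries.foreach(println)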

Related

How to run HDFS Copy commands using Airflow?

How can I execute HDFS copy commands on a Dataproc cluster using Airflow?
After the cluster is created using Airflow, I have to copy a few jar files from Google Cloud Storage to a folder on the HDFS master node.
You can execute hdfs commands on a Dataproc cluster using something like this:
gcloud dataproc jobs submit hdfs 'ls /hdfs/path/' --cluster=my-cluster --region=europe-west1
The easiest way is [1] via
gcloud dataproc jobs submit pig --execute 'fs -ls /'
or otherwise [2] as a catch-all for other shell commands.
For a single small file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because
hdfs://<master node>
is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
For a large file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
See [3] for details.
[1] https://pig.apache.org/docs/latest/cmds.html#fs
[2] https://pig.apache.org/docs/latest/cmds.html#sh
[3] https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
I am not sure about your use case for doing this via Airflow, because if it is a one-time setup then I think you can run the commands directly on the Dataproc cluster. But I found some links which might be of some help. As I understand it, you can use the BashOperator to run commands.
https://big-data-demystified.ninja/2019/11/04/how-to-ssh-to-a-remote-gcp-machine-and-run-a-command-via-airflow/
Airflow Dataproc operator to run shell scripts

Preferred way to start a hadoop datanode

Functionally, I was wondering whether there is a difference between starting a Hadoop datanode with the command:
$HADOOP_HOME/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
and:
hdfs datanode
The first command gives me an error stating that the datanode cannot ssh into itself (this is running in a Docker container), while the second command seems to run without that issue. The official Hadoop documentation for this version (2.9.1) doesn't mention "hdfs datanode" as a way to start a datanode.

Cannot connect slave1:8088 in hadoop 2.7.2

I am new to Hadoop and I have installed Hadoop 2.7.2 on two machines, master and slave1. I have followed this tutorial. It was not mentioned in the tutorial, but I have also edited the JAVA_HOME and HADOOP_CONF_DIR variables in hadoop-env.sh. In the end I have Hadoop installed on both machines. On master, NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager are running, and on slave1, DataNode and NodeManager are running.
I am able to go to master:8088 in the browser, and when I go to http://master:8088/cluster/nodes there is only the master node there. I am not able to go to isci17:8088 and it is not a live node. Why could that be?
Port 8088 is the ResourceManager web UI port, so if it is running on master you probably won't have it on the slave.
You should also be able to go to the NameNode web UI on port 50070 on your name node to see status, such as http://master:50070/, and to the MapReduce JobHistory Server web UI at http://hostname:19888/.
If you have access to a terminal session, run the following command on each server as a root/sudo user to see which ports are listening on which server:
sudo lsof -i tcp | grep -i LISTEN
You can also run Hadoop CLI commands that will give you info. Run the following to check Hadoop's ports in a terminal session:
hdfs portmap
Other health checks on the command line:
hdfs classpath
hdfs getconf -namenodes
hdfs dfsadmin -report -live
hdfs dfsadmin -report -dead
hdfs dfsadmin -printTopology
Depending on whether the hadoop CLI command works automatically, you might have to find the executable and run ./hdfs. Also, depending on the distro/version, you might have to replace the hdfs command with hadoop.
If you want to see your cluster configuration, check your /etc/hadoop/conf folder along with /etc/hadoop/hive. You will find about 5-10 *-site.xml files. These configuration files contain your cluster's configuration, with the hostnames and ports.

Reading a file in Spark in cluster mode in Amazon EC2

I'm trying to execute a spark program in cluster mode in Amazon Ec2 using
spark-submit --master spark://<master-ip>:7077 --deploy-mode cluster --class com.mycompany.SimpleApp ./spark.jar
And the class has a line that tries to read a file:
JavaRDD<String> logData = sc.textFile("/user/input/CHANGES.txt").cache();
I'm unable to read this txt file in cluster mode, even though I can read it in standalone mode. In cluster mode, it's trying to read from HDFS. So I put the file into HDFS at /root/persistent-hdfs using
hadoop fs -mkdir -p /wordcount/input
hadoop fs -put /app/hadoop/tmp/input.txt /wordcount/input/input.txt
And I can see the file using hadoop fs -ls /wordcount/input. But Spark is still unable to read the file. Any idea what I'm doing wrong? Thanks.
You might want to check the following points:
Is the file really in the persistent HDFS?
It seems that you just copied the input file from /app/hadoop/tmp/input.txt to /wordcount/input/input.txt, both on the node's local disk. I believe you misunderstand the functionality of the hadoop commands.
Instead, you should try putting the file explicitly into the persistent HDFS (root/persistent-hdfs/), and then loading it using the hdfs://... prefix (see the sketch at the end of this answer).
Is the persistent HDFS server up?
Please take a look here; it seems Spark only starts the ephemeral HDFS server by default. In order to switch to the persistent HDFS server, you must do the following:
1) Stop the ephemeral HDFS server: /root/ephemeral-hdfs/bin/stop-dfs.sh
2) Start the persistent HDFS server: /root/persistent-hdfs/bin/start-dfs.sh
Please try these things, I hope they can serve you well.
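Once the persistent HDFS server is running and the file has been put into it, the load would look like this. This is only a sketch in spark-shell/Scala syntax with a placeholder NameNode host; the same hdfs:// URI form works with the Java API used in the question:
// <master-node> is a placeholder for your persistent HDFS NameNode host
val logData = sc.textFile("hdfs://<master-node>/wordcount/input/input.txt").cache()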

How to check if hdfs is running?

I would like to see if the hdfs file system for Hadoop is working properly. I know that jps lists the daemons that are running, but I don't actually know which daemons to look for.
I ran the following commands:
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
$HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager
$HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
Only namenode, resourcemanager, and nodemanager appeared when I entered jps.
Which daemons are supposed to be running in order for hdfs/Hadoop to function? Also, what could you do to fix hdfs if it is not running?
Use any of the following approaches to check the status of your daemons:
The jps command will list all active daemons.
The command below is the most appropriate:
hadoop dfsadmin -report
This will list details of the DataNodes, which is essentially your HDFS.
cat any file available on an HDFS path.
So, I spent two weeks validating my setup (it was fine), and finally found this command:
sudo -u hdfs jps
Initially my plain jps command was showing only one process, but Hadoop 2.6 under Ubuntu 14.04 LTS was up. I was using 'sudo' to run the startup scripts.
Here is the startup that works, with jps listing multiple processes:
sudo su hduser
/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh
