Can I connect to HDFS without a Hadoop installation?

I've read the docs and tutorials, and I can see that every node acting as a NameNode or DataNode needs a Hadoop installation.
But what about the client that actually requests a file read/write operation on HDFS?
Does the client require a Hadoop installation too, or can it perform HDFS I/O just by communicating with the NameNode URL somehow?
For example, in Python I've seen sample code that imports pyarrow and reads data from HDFS by passing the NameNode URL as a parameter. In such cases, is a Hadoop installation required?

You need the Hadoop client libraries installed to make RPC requests to Hadoop services such as HDFS or YARN; you do not need the full Hadoop daemons (NameNode, DataNode, etc.) running locally.
PyArrow, Spark, Flink, etc. are clients, and they do not require a local Hadoop installation to run or write code.
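As a concrete illustration of the PyArrow case from the question, here is a minimal sketch. The hostname, port, and file path are placeholders, and in line with the answer above it assumes the Hadoop client libraries are available locally: PyArrow loads libhdfs, which in turn needs the Hadoop jars on the CLASSPATH (for example CLASSPATH=$(hadoop classpath --glob)).

    # Minimal sketch: read a file from HDFS with PyArrow.
    # No NameNode/DataNode daemons run locally; only the client
    # libraries (libhdfs + Hadoop jars on CLASSPATH) are assumed.
    # Hostname, port, and path below are placeholders.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)
    with hdfs.open_input_stream("/data/example.txt") as f:
        print(f.read().decode("utf-8"))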

Related

Why can I use HBase without starting Hadoop/HDFS?

I am new to HBase. I recently installed HBase and started it on my Mac; everything is fine and I can play with HBase. Some articles say I should start Hadoop first when using HBase, so I am wondering whether this prerequisite has changed.
Hadoop is not a hard requirement for HBase unless you are running fully distributed, which you are not. Running on a single node as you are, you can use the local filesystem. See HBase run modes: Standalone and Distributed for more information.
Your local filesystem (the file:// URI) is Hadoop-compatible. HBase requires a Hadoop-compatible storage layer, but that does not mean it must literally be HDFS.
HDFS simply adds scalability and reliability.
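To make that concrete, here is a sketch of the relevant hbase-site.xml setting; the paths and hostname are placeholders. In standalone mode, hbase.rootdir can point at the local filesystem, while a distributed cluster would point it at HDFS:

    <!-- hbase-site.xml (sketch): the storage layer is selected by a URI. -->
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <!-- standalone mode: local filesystem -->
        <value>file:///home/alice/hbase-data</value>
        <!-- a distributed cluster would use something like:
             hdfs://namenode.example.com:8020/hbase -->
      </property>
    </configuration>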

What exactly does "Client" mean for Hadoop / HDFS?

I understand the general concept behind it, but I would like more clarification and a clear-cut definition of what a "client" is.
For example, if I just run an hdfs command in the terminal, is it still a "client"?
Client in Hadoop refers to the interface used to communicate with the Hadoop filesystem. There are different types of clients available with Hadoop to perform different tasks.
The basic filesystem client hdfs dfs is used to connect to a Hadoop Filesystem and perform basic file related tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read/write block data.
To perform administrative tasks on HDFS, there is hdfs dfsadmin. For HA related tasks, hdfs haadmin.
There are similar clients available for performing YARN related tasks.
These clients can be invoked using their respective CLI commands from a node where Hadoop is installed and which has the necessary configuration and libraries required to connect to a Hadoop filesystem. Such nodes are often referred to as Hadoop clients.
For example, if I just run an hdfs command in the terminal, is it still a "client"?
Technically, yes. If you are able to access the FS using the hdfs command, then the node has the configuration and libraries required to be a Hadoop client.
PS: APIs are also available to create these Clients programmatically.
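For instance, here is a sketch of a programmatic client using the third-party hdfs Python package, which speaks WebHDFS over HTTP; the URL, user, and paths are placeholders (9870 is the default NameNode HTTP port in Hadoop 3):

    # Sketch of a programmatic HDFS client via the third-party "hdfs"
    # Python package (WebHDFS-based). URL, user, and paths are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode.example.com:9870", user="alice")
    print(client.list("/"))                          # like: hdfs dfs -ls /
    client.upload("/data/report.csv", "report.csv")  # like: hdfs dfs -put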
Edge nodes are the interface between the Hadoop cluster and the outside network. Such a node/host has all the client libraries and components present, as well as the current cluster configuration needed to connect to HDFS.
This thread discusses the same topic.

Spark cluster - read/write on Hadoop

I would like to read data from Hadoop, process it on Spark, and write the result to Hadoop and Elasticsearch. I have a few worker nodes to do this.
Is a Spark standalone cluster sufficient, or do I need to set up a Hadoop cluster to use YARN or Mesos?
If standalone cluster mode is sufficient, should the jar file be set on all nodes, unlike in YARN or Mesos mode?
First of all, you cannot read or write data "in Hadoop" as such. It is HDFS (a component of the Hadoop ecosystem) that is responsible for reading and writing data.
Now, coming to your questions:
Yes, it is possible to read data from HDFS, process it in the Spark engine, and then write the output back to HDFS.
YARN, Mesos, and Spark standalone are all cluster managers; you can use any one of them to manage resources in your cluster, and that choice has nothing to do with Hadoop. However, since you want to read and write data from/to HDFS, you need HDFS installed on the cluster, so it is simplest to install Hadoop on all your nodes, which installs HDFS on all of them. Whether you then use YARN, Mesos, or Spark standalone is your choice; all of them work with HDFS. I myself use Spark standalone for cluster management.
It is not clear which jar files you are talking about, but I assume you mean Spark's. If so, yes, you need to set the Spark jar path on each node so that there is no contradiction in paths when Spark runs.
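As a minimal sketch of that read-process-write flow (the NameNode URL, paths, and the "key" column are placeholders; writing to Elasticsearch would additionally need the elasticsearch-hadoop connector, not shown here):

    # Minimal PySpark sketch: read from HDFS, transform, write back.
    # URL, paths, and the "key" column are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-roundtrip").getOrCreate()

    df = spark.read.csv("hdfs://namenode.example.com:8020/input/data.csv",
                        header=True)
    result = df.groupBy("key").count()   # stand-in for real processing
    result.write.parquet("hdfs://namenode.example.com:8020/output/counts")

    spark.stop()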

Is it possible to write to a remote HDFS?

As the title says, is it possible to write to a remote HDFS?
For example, I have installed an HDFS cluster on AWS EC2, and I want to write a file from my local computer to that HDFS cluster.
There are two ways you could write to a remote HDFS:
1. Use the WebHDFS API. It allows systems running outside the Hadoop cluster to access and manipulate HDFS contents, and it doesn't require the client systems to have Hadoop binaries installed (see the sketch after the references below).
2. Configure the client system as a Hadoop edge node to interact with the Hadoop cluster/HDFS.
Please refer to:
https://hadoop.apache.org/docs/r1.2.1/webhdfs.html
http://www.dummies.com/how-to/content/edge-nodes-in-hadoop-clusters.html
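To illustrate option 1, here is a sketch of WebHDFS's two-step file create from Python; the hostname, port, user, and paths are placeholders (9870 is the default NameNode HTTP port in Hadoop 3; older releases used 50070):

    # Sketch: write a local file to a remote HDFS via WebHDFS, with no
    # Hadoop binaries on the client. Host, user, and paths are placeholders.
    import requests

    namenode = "http://namenode.example.com:9870"
    path = "/user/alice/hello.txt"

    # Step 1: the NameNode answers the CREATE with a redirect to a DataNode.
    r = requests.put(
        f"{namenode}/webhdfs/v1{path}",
        params={"op": "CREATE", "user.name": "alice", "overwrite": "true"},
        allow_redirects=False,
    )
    datanode_url = r.headers["Location"]

    # Step 2: send the file contents to the DataNode URL.
    with open("hello.txt", "rb") as f:
        requests.put(datanode_url, data=f)

Note that the redirect in step 1 points at a DataNode hostname, so your local machine must be able to reach the DataNodes directly; on EC2 this usually means public DNS names or a VPN.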

Moving files to Hadoop HDFS using SFTP

I have a VPC subnet with multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine to the HDFS filesystem using SFTP.
Does Hadoop have APIs that can achieve this goal?
PS: I've installed Hadoop using the Cloudera CDH4 distribution.
This is a requirement that is much easier to implement on the FTP/SFTP server side than on the HDFS side.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.
A workflow written in Apache Oozie would do it. It comes with the Cloudera distribution. Other tools for orchestration could be Talend or PDI Kettle.
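Neither answer shows code, so here is one pragmatic hand-rolled sketch of a different approach: pull the file over SFTP with paramiko, then load it into HDFS with hdfs dfs -put. All hostnames, credentials, and paths are placeholders, and it assumes the Hadoop client binaries are installed on the machine running the script:

    # Sketch: fetch a file over SFTP (paramiko), then put it into HDFS
    # via the hdfs CLI. Hosts, credentials, and paths are placeholders.
    import subprocess
    import paramiko

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect("source-host.internal", username="alice", password="secret")

    sftp = ssh.open_sftp()
    sftp.get("/exports/data.csv", "/tmp/data.csv")   # SFTP download
    sftp.close()
    ssh.close()

    # -f overwrites an existing file in HDFS
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", "/tmp/data.csv", "/landing/data.csv"],
        check=True,
    )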
