As the title says, is it possible to write to a remote HDFS?
E.g. I have installed an HDFS cluster on AWS EC2, and I want to write a file from my local computer to the HDFS cluster.
There are two ways you could write to a remote HDFS:
1) Use the WebHDFS REST API. It allows systems running outside the Hadoop cluster to access and manipulate HDFS contents, and it does not require the client systems to have the Hadoop binaries installed (see the sketch after the links below).
2) Configure the client system as a Hadoop edge node so it can interact with the Hadoop cluster/HDFS directly.
Please refer to:
https://hadoop.apache.org/docs/r1.2.1/webhdfs.html
http://www.dummies.com/how-to/content/edge-nodes-in-hadoop-clusters.html
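For the WebHDFS option, writing a file is a two-step REST call: the NameNode answers the CREATE request with a redirect to a DataNode, and the client then sends the file content to that DataNode. Below is a minimal Python sketch using the requests library; the namenode host, port and user name are placeholders for your cluster, and the DataNode hostname returned in the redirect must be reachable from the client (on EC2 this usually means opening the DataNode ports or using public hostnames).

    # Minimal sketch: write a local file to a remote HDFS via WebHDFS.
    # NAMENODE, HDFS_PATH and the user name are placeholders.
    import requests

    NAMENODE = "http://ec2-xx-xx-xx-xx.compute.amazonaws.com:50070"  # hypothetical host
    HDFS_PATH = "/user/hdfs/example.txt"
    LOCAL_FILE = "example.txt"

    # Step 1: ask the NameNode where to write; it replies with a 307 redirect
    # to one of the DataNodes (do not follow it automatically).
    url = f"{NAMENODE}/webhdfs/v1{HDFS_PATH}?op=CREATE&user.name=hdfs&overwrite=true"
    resp = requests.put(url, allow_redirects=False)
    datanode_url = resp.headers["Location"]

    # Step 2: send the actual file content to the DataNode returned above.
    with open(LOCAL_FILE, "rb") as f:
        resp = requests.put(datanode_url, data=f)
    resp.raise_for_status()  # 201 Created on success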
Related
I've read the docs and tutorials, and I can see that every node acting as a namenode or datanode needs a Hadoop installation.
But what about the client that actually requests a file read/write operation on HDFS?
Does the client require a Hadoop installation too? Or can it do HDFS I/O just by communicating with the namenode URL somehow?
For example, in Python I've seen sample code that imports pyarrow and reads data from HDFS by passing the namenode URL as a parameter. In such cases, is a Hadoop installation required?
You need the Hadoop client libraries to be able to make RPC requests to Hadoop services such as HDFS or YARN.
PyArrow, Spark, Flink, etc. act as such clients; they do not require a full local Hadoop installation just to write and run code.
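As an illustration of the PyArrow case, here is a minimal sketch of writing to HDFS through pyarrow.fs.HadoopFileSystem. The namenode host, port and user are placeholders; note that pyarrow drives libhdfs under the hood, so the Hadoop client JARs and native library still need to be reachable on the client (via CLASSPATH / ARROW_LIBHDFS_DIR), even though no full cluster installation is required.

    # Minimal sketch: write a file to a remote HDFS from Python with pyarrow.
    # Host, port and user below are placeholders for your cluster.
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020, user="hdfs")

    # Write a small file, then read it back to verify the round trip.
    with hdfs.open_output_stream("/user/hdfs/hello.txt") as out:
        out.write(b"hello from a remote client\n")

    with hdfs.open_input_stream("/user/hdfs/hello.txt") as inp:
        print(inp.read())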
Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and I want to write to HDFS using NiFi.
On the NiFi cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates successfully. However, rather than seeing the file land on HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How to make sure PutHDFS writes to HDFS, rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect)
I don't use Kerberos on the HDFS cluster (I do use it on the NiFi cluster)
I did not see anything that looked like an error in the NiFi app log (which makes sense, as it successfully writes, just in the wrong place)
Both clusters are newly generated on Amazon AWS with CloudBreak, and opening all nodes to all traffic did not help
Can you first make sure that you are able to move a file from the NiFi node to Hadoop using the command below:
hadoop fs -put <local-file> <hdfs-path>
If you are able to move your file using the above command, then check the Hadoop config files you are passing in your PutHDFS processor.
Also, check that you don't have any other flow running, to make sure that no other flow is processing that file.
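A common cause of this exact symptom is that fs.defaultFS in the configuration files handed to PutHDFS resolves to file:/// rather than to the HDFS cluster's NameNode, in which case the processor "succeeds" by writing to the local filesystem. Here is a small, hedged sketch for checking that property; the file path is a placeholder for whatever you set in the processor's Hadoop Configuration Resources property.

    # Sketch: print fs.defaultFS from the core-site.xml given to PutHDFS.
    # CORE_SITE is a placeholder path -- point it at the file referenced
    # in the processor's configuration.
    import xml.etree.ElementTree as ET

    CORE_SITE = "/etc/hadoop/conf/core-site.xml"  # hypothetical path

    root = ET.parse(CORE_SITE).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            print("fs.defaultFS =", prop.findtext("value"))
            # Expect something like hdfs://<namenode-host>:8020 here;
            # file:/// would explain writes landing on the local filesystem.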
I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands in the cluster. I should be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the below options:
1) Sqoop the data from Hive? Is that possible?
2) S3DistCp (running it on AWS)? If so, what configuration would be needed?
Any suggestions?
If the Hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, a Hive query which uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems; distcp is at its best when it schedules the uploads on the hosts which actually hold the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files; this is what incremental backup tools tend to use.
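To make the distcp-to-S3 route concrete: the copy is normally of the form hadoop distcp <hdfs-source> <s3a-target>, with S3 credentials supplied as fs.s3a.* properties (or picked up from the environment / instance profile). Below is a hedged sketch invoking it from Python on a host that has the Hadoop client installed; bucket, paths and keys are placeholders.

    # Sketch: run distcp from a machine that can reach both the cluster and S3.
    # All values below are placeholders.
    import subprocess

    cmd = [
        "hadoop", "distcp",
        # Credentials can also come from core-site.xml or an instance profile;
        # passing them as -D properties is just one option.
        "-Dfs.s3a.access.key=PLACEHOLDER",
        "-Dfs.s3a.secret.key=PLACEHOLDER",
        "hdfs://namenode.example.com:8020/apps/hive/warehouse/mytable",
        "s3a://my-bucket/backups/mytable",
    ]
    subprocess.run(cmd, check=True)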
I understand the general concept behind it, but I would like more clarification and a clear-cut definition of what a "client" is.
For example, if I just write an hdfs command on the Terminal, is it still a "client" ?
A client in Hadoop refers to the interface used to communicate with the Hadoop filesystem. There are different types of clients available with Hadoop to perform different tasks.
The basic filesystem client hdfs dfs is used to connect to a Hadoop Filesystem and perform basic file related tasks. It uses the ClientProtocol to communicate with a NameNode daemon, and connects directly to DataNodes to read/write block data.
To perform administrative tasks on HDFS, there is hdfs dfsadmin. For HA related tasks, hdfs haadmin.
There are similar clients available for performing YARN related tasks.
These clients can be invoked using their respective CLI commands from a node where Hadoop is installed and which has the necessary configurations and libraries required to connect to a Hadoop filesystem. Such nodes are often referred to as Hadoop clients.
For example, if I just write an hdfs command on the Terminal, is it still a "client"?
Technically, yes. If you are able to access the FS using the hdfs command, then the node has the configurations and libraries required to be a Hadoop client.
PS: APIs are also available to create these Clients programmatically.
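As one illustration of the programmatic route, the third-party Python package hdfs (HdfsCLI) talks to WebHDFS, so a script can act as an HDFS client without a local Hadoop install. A minimal sketch; host, port and user are placeholders.

    # Sketch using the third-party HdfsCLI package (pip install hdfs),
    # which goes through WebHDFS. Host, port and user are placeholders.
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode.example.com:50070", user="hdfs")

    # List a directory and write a small file, acting as an HDFS client.
    print(client.list("/user/hdfs"))
    with client.write("/user/hdfs/from-python.txt", overwrite=True) as writer:
        writer.write(b"written through WebHDFS\n")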
Edge nodes are the interface between the Hadoop cluster and the outside network. This node/host will have all the libraries and client components present, as well as the current configuration of the cluster needed to connect to HDFS.
This thread discusses the same topic.
I have a VPC subnet which has multiple machines inside it.
On one of the machines, I have some files stored. On another machine, I have the Hadoop HDFS service installed and running.
I need to move those files from the first machine to the HDFS filesystem using SFTP.
Does Hadoop have any APIs that can achieve this goal?
PS : I've installed Hadoop using Cloudera CDH4 distribution.
This is a requirement which is much easier to implement on the FTP/SFTP server side than on the HDFS side.
Check out hdfs-over-ftp, an FTP server that works on top of HDFS.
A workflow written in Apache Oozie would do it. It comes with the Cloudera distribution. Other tools for orchestration could be Talend or PDI Kettle.