How to copy a file from HDFS to a Windows machine? - bash

I want to copy a .csv file from our Hadoop cluster to my local Desktop, so I can edit the file and upload it back (replacing the original).
I tried:
hadoop fs -copyToLocal /c_transaction_label.csv C:/Users/E_SJIRAK/Desktop
which yielded:
copyToLocal: '/Users/E_SJIRAK/Desktop': No such file or directory:
file:////Users/E_SJIRAK/Desktop
Help would be appreciated.

If you have SSH'd into the Hadoop cluster, then copyToLocal cannot write to your Windows machine; it writes to the local filesystem of the Linux host you are logged into.
You need a two-step process. First download from HDFS to the Linux environment. Then use SFTP (WinSCP, FileZilla, etc.) or PuTTY's pscp command from the Windows host to get the files onto your Windows machine.
Otherwise, you need to set up the hadoop CLI on Windows itself.
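The two-step process could look roughly like the sketch below. It assumes the hadoop client is on the PATH of the Linux node; the staging directory, the user and host names in the comments, and the function names are illustrative placeholders, not something from the question.

```shell
# Step 1 (on the Linux node you SSH'd into): copy a file out of HDFS
# into a staging directory on that node's local disk.
stage_from_hdfs() {
  local hdfs_path="$1"     # e.g. /c_transaction_label.csv
  local staging_dir="$2"   # hypothetical staging directory, e.g. /tmp/hdfs_export
  mkdir -p "$staging_dir" &&
    hadoop fs -copyToLocal "$hdfs_path" "$staging_dir/"
}

# Step 2 (on the Windows machine, not on the cluster): fetch the staged file,
# for example with PuTTY's pscp; user and edge-node are placeholders:
#   pscp user@edge-node:/tmp/hdfs_export/c_transaction_label.csv C:\Users\E_SJIRAK\Desktop

# After editing, upload the file to the Linux node again and overwrite the
# HDFS copy from there (-f replaces an existing destination file).
replace_in_hdfs() {
  local local_file="$1"
  local hdfs_path="$2"
  hadoop fs -put -f "$local_file" "$hdfs_path"
}
```

On the Linux node this might be called as stage_from_hdfs /c_transaction_label.csv /tmp/hdfs_export, followed by the transfer from Windows.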

Related

0 datanodes when copying file from local to hadoop

My OS is Windows 10.
Ubuntu 20.04.3 LTS (GNU/Linux 4.4.0-19041-Microsoft x86_64) installed on Windows 10.
When I copy a local file to Hadoop, I receive an error saying 0 datanodes are available.
I am able to copy a file from Hadoop to a local folder, and I can see the file in the local directory using the command $ ls -l
I am also able to create directories or files in Hadoop. But if I restart the Ubuntu terminal, those directories and files no longer exist; HDFS shows up empty.
The steps I followed:
1. start-all.sh
2. jps
(datanodes missing)
3. copy the local file to hadoop
ERROR as 0 datanodes available
4. copy files from hadoop to local directory successful
If you stop or restart the WSL2 terminal without first running stop-dfs or stop-all, you risk corrupting the namenode. It then needs to be reformatted using hadoop namenode -format, rather than by removing the namenode directory with rm.
After formatting, you can restart the datanodes and they should become healthy again.
The same logic applies in a production environment, which is why you should always have a standby namenode for failover.
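A minimal sketch of that routine, assuming the standard Hadoop scripts (stop-dfs.sh, start-dfs.sh) and jps are on the PATH. The check_datanodes helper is a hypothetical name introduced here. Keep in mind that formatting the namenode erases the HDFS metadata, so existing HDFS contents are lost.

```shell
# Report whether any DataNode process is running, based on jps output
# (jps prints one "<pid> <name>" line per JVM).
check_datanodes() {
  local count
  count=$(jps | grep -c 'DataNode$')
  if [ "$count" -eq 0 ]; then
    echo "no DataNode running"
    return 1
  fi
  echo "$count DataNode process(es) running"
}

# Typical sequence, run by hand:
#   stop-dfs.sh                # always stop HDFS before closing the terminal
#   hadoop namenode -format    # only if the namenode is corrupted; wipes HDFS metadata
#   start-dfs.sh
#   check_datanodes
```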

Copying a directory from a remote HDFS local file system to my local machine

I have a directory in my HDFS environment that I want to copy to my local computer. I am accessing HDFS over ssh (with a password).
I tried many of the suggested copy commands, but none worked.
What I tried:
scp 'username@hn0-sc-had:Downloads/*' ~/Downloads
as mentioned in this link.
What am I doing wrong?
SCP copies from the remote Linux server's own filesystem.
HDFS does not live on a single server, nor is it a "local filesystem", so SCP is not the right tool to copy from it directly.
Your options include:
1. SSH to the remote server
2. Use hdfs dfs -copyToLocal to pull the files from HDFS
3. Use SCP from your computer to get the files you just downloaded on the remote server
Or
1. Configure a local Hadoop CLI using the XML files from the remote server
2. Use hdfs dfs -copyToLocal directly against HDFS from your own computer
Or
1. Install the HDFS NFS Gateway
2. Mount an NFS volume on your local computer, and copy files from it
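The first option can be scripted from your own computer. This is a rough sketch, assuming ssh access to a cluster node that has the hdfs client installed; fetch_hdfs_dir and all of its arguments are hypothetical names used for illustration.

```shell
fetch_hdfs_dir() {
  local remote="$1"    # user@host of a cluster node
  local hdfs_dir="$2"  # directory in HDFS
  local staging="$3"   # scratch path on the remote node's local disk
  local dest="$4"      # destination on your own computer
  # Pull the directory out of HDFS onto the remote node, then copy it over SSH.
  ssh "$remote" "hdfs dfs -copyToLocal '$hdfs_dir' '$staging'" &&
    scp -r "$remote:$staging" "$dest"
}
```

A call might look like fetch_hdfs_dir username@hn0-sc-had /some/hdfs/dir /tmp/hdfs_stage ~/Downloads, where both paths are placeholders for your own.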

Saving a file from a remote hdfs server to my local computer via zeppelin

I have access to a zeppelin notebook which sits on a remote server.
In this notebook I can access files on a remote HDFS cluster.
For example, via this notebook I can see the files in HDFS, in a folder called /user/zeppelin/, by running hadoop fs -ls with the shell interpreter. There are some files there that I want to transfer to my local computer (a Mac), from which I access the notebook. Is that possible? How can I do that?
These files were created by me using spark code on the notebook.
I'm really new to spark, zeppelin and HDFS. I did not need to install anything to access this notebook.
Thanks

Retrieve files from remote HDFS

My local machine does not have an hdfs installation. I want to retrieve files from a remote hdfs cluster. What's the best way to achieve this? Do I need to get the files from hdfs onto the local filesystem of one of the cluster machines and then use ssh to retrieve them? I want to be able to do this programmatically through, say, a bash script.
Here are the steps:
1. Make sure there is connectivity between your host and the target cluster
2. Configure your host as a client: install compatible Hadoop binaries. Your host also needs to be running the same operating system.
3. Make sure you have the same configuration files (core-site.xml, hdfs-site.xml)
4. Run the hadoop fs -get command to get the files directly
There are also alternatives:
If WebHDFS/HttpFS is configured, you can download files using curl or even your browser, and you can write bash scripts around it.
If your host cannot have the Hadoop binaries installed to act as a client, you can use the following instructions:
1. Enable passwordless login from your host to one of the nodes on the cluster
2. Run the command ssh <user>@<host> "hadoop fs -get <hdfs_path> <os_path>"
3. Then use the scp command to copy the files to your host
You can put the above 2 commands in one script.
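For the WebHDFS route, a download is a single HTTP GET with op=OPEN. The sketch below assumes WebHDFS is enabled, that the cluster uses simple (non-Kerberos) authentication, and that you know the namenode's HTTP address; the function names and the example host are made up for illustration.

```shell
# Build the WebHDFS URL that opens (reads) a file.
webhdfs_open_url() {
  local namenode="$1"   # host:port of the namenode HTTP endpoint
  local hdfs_path="$2"  # absolute HDFS path, e.g. /user/me/data.csv
  local user="$3"       # HDFS user name to act as
  echo "http://${namenode}/webhdfs/v1${hdfs_path}?op=OPEN&user.name=${user}"
}

# Download the file; -L follows the redirect from the namenode to a datanode,
# -f makes curl fail on HTTP errors instead of saving the error body.
webhdfs_get() {
  curl -fL -o "$4" "$(webhdfs_open_url "$1" "$2" "$3")"
}
```

A call could look like webhdfs_get nn.example.com:9870 /user/me/data.csv me ./data.csv; the port depends on your Hadoop version and configuration.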

How to copy files from a remote server to hdfs location

I want to copy files from a remote server using sftp to an hdfs location directly without copying the files to local. The hdfs location is a secured cluster. Please suggest if this is feasible and how to proceed in that case.
I would also like to know whether there is any other way to connect and copy apart from sftp.
I think the most convenient way (given that your remote machine is able to connect to the hadoop cluster) is to make that remote machine act as an HDFS client. Just ssh to that machine, install the hadoop distribution, configure it properly, then run:
hadoop fs -put /local/path /hdfs/path
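If the goal is to avoid an intermediate local copy, one alternative (a different technique from the answer above, not part of it) is to run a pipeline on a machine that already has a configured Hadoop client and stream the remote file into HDFS through standard input, which hadoop fs -put reads when the source is given as -. This sketch assumes ssh access to the remote server and, on a secured cluster, a valid Kerberos ticket obtained beforehand (for example with kinit); the function name and its arguments are hypothetical.

```shell
stream_to_hdfs() {
  local remote="$1"       # user@host of the server holding the file
  local remote_file="$2"  # path of the file on that server
  local hdfs_path="$3"    # destination path in HDFS
  # cat runs on the remote server; its output is piped straight into HDFS.
  ssh "$remote" "cat '$remote_file'" | hadoop fs -put - "$hdfs_path"
}
```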

Resources