copy a file from wsl to hdfs running on docker - hadoop

I'm trying to copy a file from my local drive to hdfs.
I'm running Hadoop on docker as an image. I try to perform some exercise on MapReduce, therefore, I want to copy a data file from a local drive (let's say my d: drive) to hdfs.
I tried the command below, but it fails with ssh: connect to host localhost port 22: Connection refused:
scp -P 50070 /mnt/d/project/recreate.out root@localhost:/root
Since I'm new to Hadoop and big data, my explanation may be terrible. Please bear with me.
I'm trying to do the above from Windows Subsystem for Linux (WSL).
Regards,
crf

SCP won't move data to Hadoop, and port 50070 is not accepting connections over that protocol (SSH).
You need to setup and use a command similar to hdfs dfs -copyFromLocal. You can install the HDFS cli on the Windows host command prompt, too, so you don't need WSL to upload files...
When using Docker, I would suggest doing this:
Add a volume mount from your host to some Hadoop container, outside of the datanode and namenode directories (in other words, don't override the data that is there; mounting files here will not "upload to HDFS")
docker exec into this running container
Run the above hdfs command, uploading from the mounted volume (see the sketch below)
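For example, a minimal sketch of those steps, assuming a hypothetical container name hadoop-namenode, a placeholder image name, and the WSL path from the question mounted at /data inside the container (adjust names and paths to your setup):

# start the Hadoop container with the host folder mounted somewhere harmless (not over the HDFS data dirs)
docker run -d --name hadoop-namenode -v /mnt/d/project:/data <your-hadoop-image>

# open a shell inside the running container
docker exec -it hadoop-namenode bash

# inside the container, copy the mounted file into HDFS
hdfs dfs -mkdir -p /root
hdfs dfs -copyFromLocal /data/recreate.out /root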

Related

How to copy a file from HDFS to a Windows machine?

I want to copy a .csv file from our Hadoop cluster to my local Desktop, so I can edit the file and upload it back (replace).
I tried:
hadoop fs -copyToLocal /c_transaction_label.csv C:/Users/E_SJIRAK/Desktop
which yielded:
copyToLocal: '/Users/E_SJIRAK/Desktop': No such file or directory:
file:////Users/E_SJIRAK/Desktop
Help would be appreciated.
If you have SSH'd into the Hadoop cluster, then you cannot copyToLocal into Windows.
You need a 2-step process: download from HDFS to the Linux environment, then use SFTP (WinSCP, FileZilla, etc.) or PuTTY's pscp command from the Windows host to get the files onto your Windows machine.
Otherwise, you need to set up the hadoop CLI command on Windows itself.
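A minimal sketch of that two-step route, with the cluster hostname, username, and Linux-side path as placeholders:

# step 1: on the cluster's Linux environment, pull the file out of HDFS onto local disk
hdfs dfs -copyToLocal /c_transaction_label.csv /home/your_user/

# step 2: from a Windows command prompt, fetch it with PuTTY's pscp (or use WinSCP/FileZilla)
pscp your_user@cluster-gateway:/home/your_user/c_transaction_label.csv C:\Users\E_SJIRAK\Desktop\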

Copying a directory from a remote HDFS local file system to my local machine

I have a directory in my HDFS environment that I want to copy to my local computer. I am accessing HDFS using ssh (with a password).
I tried many of the suggested copy commands, but none worked.
What I tried:
scp 'username@hn0-sc-had:Downloads/*' ~/Downloads
as mentioned in this link.
What am I doing wrong?
SCP will copy from the remote Linux server.
HDFS is not a "local filesystem" that lives on any single server, therefore SCP is not the right tool to copy from it directly.
Your options include
SSH to remote server
Use hdfs dfs -copyToLocal in order to pull files from HDFS
Use SCP from your computer to get the files you just downloaded on the remote server (see the sketch after this list)
Or
Configure a local Hadoop CLI using XML files from remote server
Use hdfs dfs -copyToLocal directly against HDFS from your own computer
Or
Install HDFS NFS Gateway
Mount an NFS volume on your local computer, and copy files from it
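As a concrete sketch of the first option, with the username, host, and paths as placeholders:

# 1. SSH to a node of the cluster
ssh username@hn0-sc-had

# 2. on that node, pull the directory out of HDFS onto its local disk
hdfs dfs -copyToLocal /user/username/Downloads ~/Downloads

# 3. back on your own machine, copy the downloaded directory down over SCP
scp -r 'username@hn0-sc-had:Downloads/*' ~/Downloads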

Write to HDFS running in Docker from another Docker container running Spark

I have a docker image for spark + jupyter (https://github.com/zipfian/spark-install)
I have another docker image for hadoop. (https://github.com/kiwenlau/hadoop-cluster-docker)
I am running 2 containers from the above 2 images in Ubuntu.
For the first container:
I am able to successfully launch jupyter and run python code:
import pyspark
sc = pyspark.SparkContext('local[*]')
rdd = sc.parallelize(range(1000))
rdd.takeSample(False,5)
For the second container:
In the host Ubuntu OS, I am able to successfully browse to
localhost:8088 in the web browser and see all the Hadoop applications, and to
localhost:50070 to browse the HDFS file system.
Now I want to write to the HDFS file system (running in the 2nd container) from jupyter (running in the first container).
So I add the additional line
rdd.saveAsTextFile("hdfs:///user/root/input/test")
I get the error:
HDFS URI, no host: hdfs:///user/root/input/test
Am I giving the hdfs path incorrectly?
My understanding is that I should be able to talk to a docker container running hdfs from another container running spark. Am I missing anything?
Thanks for your time.
I haven't tried docker compose yet.
The URI hdfs:///user/root/input/test is missing an authority (hostname) section and port. To write to hdfs in another container you would need to fully specify the URI and make sure the two containers were on the same network and that the HDFS container has the ports for the namenode and data node exposed.
For example, you might have set the host name for the HDFS container to be hdfs.container. Then you can write to that HDFS instance using the URI hdfs://hdfs.container:8020/user/root/input/test (assuming the Namenode is running on 8020). Of course you will also need to make sure that the path you're seeking to write has the correct permissions as well.
So to do what you want:
Make sure your HDFS container has the namenode and datanode ports exposed. You can do this using an EXPOSE directive in the Dockerfile (the container you linked does not have these) or using the --expose argument when invoking docker run. The default ports are 8020 and 50010 (for the NN and DN, respectively).
Start the containers on the same network. If you just do docker run with no --network they will start on the default network and you'll be fine. Start the HDFS container with a specific name using the --name argument.
Now modify your URI to include the proper authority (this will be the value of the docker --name argument you passed) and port as described above, and it should work.
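A rough sketch of those steps, using a user-defined bridge network so the containers can resolve each other by name; the network name, the container name hdfs.container, and the image names are placeholders, and 8020 is assumed for the Namenode port:

# create a shared network and start the HDFS container on it with a fixed name and the NN/DN ports exposed
docker network create hadoop-net
docker run -d --name hdfs.container --network hadoop-net --expose 8020 --expose 50010 <your-hadoop-image>

# start the Spark + Jupyter container on the same network
docker run -d --network hadoop-net -p 8888:8888 <your-spark-jupyter-image>

From Jupyter you would then write with the fully qualified URI, e.g. rdd.saveAsTextFile("hdfs://hdfs.container:8020/user/root/input/test").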

Retrieve files from remote HDFS

My local machine does not have an hdfs installation. I want to retrieve files from a remote hdfs cluster. What's the best way to achieve this? Do I need to get the files from hdfs to one of the cluster machines' filesystems and then use ssh to retrieve them? I want to be able to do this programmatically, through say a bash script.
Here are the steps:
Make sure there is connectivity between your host and the target cluster
Configure your host as a client: you need to install compatible Hadoop binaries, and your host needs to be running the same operating system.
Make sure you have the same configuration files (core-site.xml, hdfs-site.xml)
You can run the hadoop fs -get command to get the files directly
There are also alternatives:
If WebHDFS/HttpFS is configured, you can download files using curl or even your browser, and you can script this with bash.
If your host cannot have the Hadoop binaries installed to act as a client, then you can use the following instructions:
Enable passwordless login from your host to one of the nodes on the cluster
Run the command ssh <user>@<host> "hadoop fs -get <hdfs_path> <os_path>"
Then use the scp command to copy the files
You can combine the above 2 commands in one script (see the sketch below)
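A bash sketch that puts the two commands in one script, with the WebHDFS alternative as a one-liner; the user, host, and paths are placeholders, and the 50070 WebHDFS port assumes an unsecured Hadoop 2.x default:

#!/bin/bash
# pull the file out of HDFS onto the cluster node's local disk, then copy it down
ssh user@cluster-node "hadoop fs -get /data/report.csv /tmp/report.csv"
scp user@cluster-node:/tmp/report.csv ./report.csv

# WebHDFS alternative: stream the file straight from the REST API (curl follows the redirect to a datanode)
curl -L "http://namenode-host:50070/webhdfs/v1/data/report.csv?op=OPEN" -o report.csv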

How to copy files from a remote server to hdfs location

I want to copy files from a remote server using sftp to an hdfs location directly, without copying the files to local. The hdfs location is on a secured cluster. Please suggest if this is feasible and how to proceed in that case.
I would also like to know if there is any other way to connect and copy apart from sftp.
I think the most convenient way (given that your remote machine is able to connect to the Hadoop cluster) is to make that remote machine act as an HDFS client. Just ssh to that machine, install the Hadoop distribution, configure it properly, then run:
hadoop fs -put /local/path /hdfs/path
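A rough sketch of what that looks like once you have ssh'd to the remote machine and configured the Hadoop client there with the cluster's core-site.xml/hdfs-site.xml; the keytab, principal, and paths are placeholders, and Kerberos is assumed to be what "secured cluster" means:

# authenticate to the secured cluster first (assuming Kerberos)
kinit -kt /etc/security/keytabs/etl_user.keytab etl_user@EXAMPLE.COM

# push the files into HDFS directly; no intermediate copy on another machine is needed
hadoop fs -put /local/path /hdfs/path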
