Cannot see the file I have uploaded to Hadoop

pscp tells me the transfer was successful:
pscp -P 22 part-00003 username@172.31.143.131:/home/username/lab2_hdfs
username@172.31.143.131's password:
part-00000 | 758 kB | 758.9 kB/s | ETA: 00:00:00 | 100%
But I don't see it in HDFS when I run hdfs dfs -ls. Why?

HDFS is not the same as your local filesystem. You can't upload files to HDFS using SCP.

According to your command, you have just transferred your local file to a remote host, into a remote directory (/home/username/lab2_hdfs). At that stage HDFS wasn't involved at all and therefore does not know about the new file.
You may have a look at articles like Hadoop: copy a local file to HDFS and use commands like
hadoop fs -put part-00003 /hdfs/path
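As a minimal sketch of the full path from your machine into HDFS, assuming the Hadoop client is available on the remote host and that /user/username/lab2_hdfs is the HDFS target directory you want (that path is an assumption, not taken from your setup):
pscp -P 22 part-00003 username@172.31.143.131:/home/username/lab2_hdfs
ssh username@172.31.143.131
hdfs dfs -mkdir -p /user/username/lab2_hdfs                                    # create the HDFS target directory (assumed path)
hdfs dfs -put /home/username/lab2_hdfs/part-00003 /user/username/lab2_hdfs/   # local disk -> HDFS
hdfs dfs -ls /user/username/lab2_hdfs                                          # the file should now be visible here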

Related

How to copy a file from a remote server to HDFS

I have a remote server and an authenticated Hadoop environment.
I want to copy a file from the remote server to the Hadoop machine and into HDFS.
Please advise an efficient approach/HDFS command to copy files from the remote server to HDFS.
Any example would be helpful.
The ordinary way to copy a file from the remote server to the server itself is
scp -rp file remote_server:/tmp
but this approach does not support copying directly to HDFS.
You can try that:
ssh remote-server "hdfs dfs -put - /tmp/file" < file
Here I assume the remote server is not in the same network as the Hadoop nodes. If that is the case, you can scp from the remote machine to a Hadoop node's local file system and then use the -put or -copyFromLocal command to move the file to HDFS.
example: hadoop fs -put file-name hdfs://namenode-uri/path-to-hdfs
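As a rough sketch of that two-step approach (hadoop-node, the file name and the HDFS path are placeholders, and it assumes you can ssh into a node that has the Hadoop client configured):
scp -rp file hadoop-node:/tmp/file
ssh hadoop-node "hdfs dfs -copyFromLocal /tmp/file hdfs://namenode-uri/path-to-hdfs && rm /tmp/file"
The rm at the end simply cleans up the temporary copy on the node's local disk.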

Netezza utility NZLOAD: pointing the -df location to an HDFS location

Currently we copy the files from HDFS to local disk and use the NZLOAD utility to load the data into Netezza, but we wanted to know if it is possible to provide the HDFS location of the files directly, as below:
nzload -host ${NZ_HOST} -u ${NZ_USER} -pw ${NZ_PASS} -db ${NZ_DB} -t ${TAR_TABLE} -df "hdfs://${HDFS_Location}"
As HDFS is a different file system, nzload will not recognise the file if you provide an HDFS path in the -df option of Netezza nzload.
You can use hdfs dfs -cat along with nzload to load a Netezza table from an HDFS directory:
$ hdfs dfs -cat /data/stud_dtls/stud_detls.csv | nzload -host 192.168.1.100 -u admin -pw password -db training -t stud_dtls -delim ','
Load session of table 'STUD_DTLS' completed successfully
Load HDFS file into Netezza Table Using nzload and External Tables
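If the HDFS directory holds several part files with the same delimited layout, the same pattern should work with a glob; this is an untested sketch reusing the host, credentials and paths from the example above as placeholders:
$ hdfs dfs -cat /data/stud_dtls/part-* | nzload -host 192.168.1.100 -u admin -pw password -db training -t stud_dtls -delim ','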

Curl, Kerberos authenticated file copy on hadoop

We need to establish a file copy between HDFS folders. We currently use the curl command shown below, in a shell script loop.
/usr/bin/curl -v --negotiate -u : -X PUT "<hnode>:<port>/webhdfs/v1/busy/rg/stg/"$1"/"$table"/"$table"_"$3".dsv?op=RENAME&destination=/busy/rg/data/"$1"/"$table"/"$table"_$date1.dsv"
However, this achieves a file move. We need a file copy, such that the file is retained at the original staging location.
I wanted to know if there is a corresponding curl operation. Instead of op=RENAME&destination, what else could work?
WebHDFS alone does not offer a copy operation in its interface. The WebHDFS interface provides lower-level file system primitives. A copy operation is a higher-level application that uses those primitive operations to accomplish its work.
The implementation of hdfs dfs -cp against a webhdfs: URL essentially combines op=OPEN and op=CREATE calls to complete the copy. You could potentially re-implement a subset of that logic in your script. If you want to pursue that direction, the CopyCommands class is a good starting point in the Apache Hadoop codebase for seeing how that works.
Here is a starting point for how this could work. There is an existing file at /hello1 that we want to copy to /hello2. This script calls curl to open /hello1 and pipes the output to another curl command, which creates /hello2, using stdin as the input source.
> hdfs dfs -ls /hello*
-rw-r--r-- 3 cnauroth supergroup 6 2017-07-06 09:15 /hello1
> curl -sS -L 'http://localhost:9870/webhdfs/v1/hello1?op=OPEN' |
> curl -sS -L -X PUT -d @- 'http://localhost:9870/webhdfs/v1/hello2?op=CREATE&user.name=cnauroth'
> hdfs dfs -ls /hello*
-rw-r--r-- 3 cnauroth supergroup 6 2017-07-06 09:15 /hello1
-rw-r--r-- 3 cnauroth supergroup 5 2017-07-06 09:20 /hello2
But my requirement is to connect from an external Unix box, do an automated Kerberos login into HDFS and then move the files within HDFS, hence the curl.
Another option could be a client-only Hadoop installation on your external host. You would have an installation of the Hadoop software and the same configuration files from the Hadoop cluster, and then you could issue the hdfs dfs -cp commands instead of running curl commands against HDFS.
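With a client-only installation plus a keytab on the external box, the automated Kerberos login and the copy could be sketched roughly like this (the keytab path, principal and HDFS paths are placeholders, not taken from your environment):
kinit -kt /path/to/user.keytab user@EXAMPLE.COM                              # non-interactive Kerberos login from a keytab
hdfs dfs -cp /busy/rg/stg/app/table/file.dsv /busy/rg/data/app/table/        # copy within HDFS; the source stays in place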
I don't know what distribution you use; if you use Cloudera, try BDR (the Backup and Disaster Recovery module) through its REST APIs.
I have used it to copy files/folders within a Hadoop cluster and across Hadoop clusters; it works with encrypted zones (TDE) as well.

Move zip files from one server to hdfs?

What is the best approach to move files from a Linux box to HDFS: should I use Flume or SSH?
SSH command:
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"
The only problem with SSH is that I need to enter the password every time; I need to check how to avoid the password prompt.
Can Flume move files straight to HDFS from one server?
Maybe you can set up passwordless SSH, then transfer files without entering a password.
You could also create a script, in Python for example, which does the job for you.
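A minimal passwordless setup, assuming key-based login is allowed on the Hadoop node, might look like:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa      # generate a key pair once, with no passphrase
ssh-copy-id user@hadoopdatanode.com           # install the public key on the Hadoop node
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"   # now runs without a password prompt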
You could install the Hadoop client on the Linux box that has the files. Then you could "hdfs dfs -put" your data directly from that box to the Hadoop cluster.

Failed to copy file from FTP to HDFS

I have an FTP server (F [ftp]), a Linux box (S [standalone]) and a Hadoop cluster (C [cluster]). The current file flow is F -> S -> C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp attempt, being a job, is killed by timeout; the logs (hadoop job -logs) only say it was killed by timeout. I tried to wget from the FTP server on one of the nodes of C and it worked. What could be the reason, and any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The single - tells hadoop fs -put to read from stdin, and -O - tells wget to write the download to stdout.
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used, as the source file is treated as a file in the local file system; the command does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem.
DistCp is only for large intra- or inter-cluster copies (between Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. The two-step process is still your best bet, or write a program that reads from FTP and writes to HDFS.
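If the streaming one-liner does not work in your setup, a scripted version of the two-step process run on a node of C could be a fallback, assuming enough local temporary space (the URL and HDFS path are the placeholders from the question):
#!/bin/bash
TMP=$(mktemp)                                                         # temporary landing file on local disk
wget -O "$TMP" "ftp://user:password@ftpserver/absolute_path_to_file"  # fetch from FTP
hadoop fs -copyFromLocal "$TMP" path_in_hdfs                          # push into HDFS
rm -f "$TMP"                                                          # clean up the local copy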
