Is it possible to initialize the connection with Hive by passing an init script from HDFS?
I tried beeline -u <url> -i wasb://<path-to-init-script>; the connection is established, but the script initialization fails with 'no such file or directory', even though the file exists and is listed by both hdfs dfs -ls <path-to-file> and hdfs dfs -ls wasb://<path-to-file>. The same happens if I try beeline -u <url> -f wasb://<path-to-file>.
I am running inside an HDInsight cluster (HDI 3.5) connected via SSH, and the connection to Hive runs in HTTP mode. The Hadoop version is 2.7.
Related
I have a remote server and an authenticated Hadoop environment.
I want to copy files from the remote server to HDFS on the Hadoop machine.
Please advise an efficient approach/HDFS command to copy files from the remote server to HDFS.
Any example will be helpful.
The ordinary way to copy a file from a remote server to the server itself is
scp -rp file remote_server:/tmp
but this approach does not support copying directly to HDFS.
You can try this:
ssh remote-server "hadoop fs -put - /tmp/file" < file
The - makes hadoop fs -put read from stdin, so the local file is streamed over SSH straight into HDFS.
If the remote server you mean is not on the same network as the Hadoop nodes, you can scp from the remote machine to a Hadoop node's local file system and then use the -put or -copyFromLocal command to move the files to HDFS.
example: hadoop fs -put file-name hdfs://namenode-uri/path-to-hdfs
Currently, we copy the files from HDFS to the local file system and use the NZLOAD utility to load the data into Netezza, but I wanted to know whether it is possible to provide the HDFS location of the files directly, as below:
nzload -host ${NZ_HOST} -u ${NZ_USER} -pw ${NZ_PASS} -db ${NZ_DB} -t ${TAR_TABLE} -df "hdfs://${HDFS_Location}"
As HDFS is a different file system, nzload will not recognise the file if you provide an HDFS file path in the -df option of Netezza nzload.
You can use hdfs dfs -cat along with nzload to load Netezza table from hdfs directory.
$ hdfs dfs -cat /data/stud_dtls/stud_detls.csv | nzload -host 192.168.1.100 -u admin -pw password -db training -t stud_dtls -delim ','
Load session of table 'STUD_DTLS' completed successfully
Load HDFS file into Netezza Table Using nzload and External Tables
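Since nzload only ever sees a byte stream in this setup, it can be worth sanity-checking the delimiter before the load starts. A small sketch: the inline sample here is made up and stands in for real hdfs dfs -cat output.

```shell
#!/bin/sh
# Print the number of rows whose field count differs from the first row's.
# In practice the input would come from: hdfs dfs -cat /data/stud_dtls/*.csv
check_fields() {
  awk -F"$1" 'NR == 1 { n = NF } NF != n { bad++ } END { print bad + 0 }'
}

# Hypothetical CSV sample; a well-formed stream should print 0.
sample='1,anil,phd
2,asha,msc
3,ravi,bsc'

printf '%s\n' "$sample" | check_fields ','
```

A non-zero count before piping the same stream into nzload usually means a wrong -delim or embedded delimiters in the data.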
How can I determine a Hive database's size from Bash or from the Hive CLI?
The hdfs and hadoop commands are also available in Bash.
A database in Hive is metadata storage, meaning it holds information about tables and has a default location. Tables in a database can also be stored anywhere in HDFS if a location is specified when the table is created.
You can see all tables in a database using show tables command in Hive CLI.
Then, for each table, you can find its location in hdfs using describe formatted <table name> (again in Hive CLI).
Last, for each table you can find its size using hdfs dfs -du -s -h /table/location/
I don't think there's a single command to measure the sum of sizes of all tables of a database. However, it should be fairly easy to write a script that automates the above steps. Hive can also be invoked from bash CLI using: hive -e '<hive command>'
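The scripted version of those steps can't be shown end to end without a cluster, but the summing step is easy to sketch. Here a hypothetical sample of hdfs dfs -du -s output (one line per table, size in bytes first) stands in for the real calls; the awk filter totals the first column:

```shell
#!/bin/sh
# Sum the byte counts (first column) of `hdfs dfs -du -s` output lines.
# In practice you would feed this from something like:
#   for t in $(hive -e 'use mydb; show tables;'); do
#     hdfs dfs -du -s "/apps/hive/warehouse/mydb.db/$t"
#   done
sum_du() {
  awk '{ total += $1 } END { print total }'
}

# Hypothetical `hdfs dfs -du -s` output for three tables:
sample='123456 370368 /apps/hive/warehouse/mydb.db/t1
7890 23670 /apps/hive/warehouse/mydb.db/t2
1000 3000 /apps/hive/warehouse/mydb.db/t3'

printf '%s\n' "$sample" | sum_du
```

The database and table paths above are assumptions; substitute the locations reported by describe formatted for your own tables.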
Show Hive databases on HDFS
sudo hadoop fs -ls /apps/hive/warehouse
Show Hive database size
sudo hadoop fs -du -s -h /apps/hive/warehouse/{db_name}
If you want the size of your complete database, run this on your "warehouse":
hdfs dfs -du -h /apps/hive/warehouse
This gives you the size of each DB in your warehouse.
If you want the size of the tables in a specific DB, run:
hdfs dfs -du -h /apps/hive/warehouse/<db_name>
Run a "grep warehouse" on hive-site.xml to find your warehouse path.
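If grepping by hand feels fragile, the property value can be extracted directly. This is only a sketch that assumes the usual hive-site.xml layout (name and value elements on their own lines); a sample fragment stands in for the real /etc/hive/conf/hive-site.xml, and a proper XML parser (e.g. xmllint) would be more robust:

```shell
#!/bin/sh
# Extract hive.metastore.warehouse.dir from a hive-site.xml-style file.
# A sample fragment stands in for /etc/hive/conf/hive-site.xml here.
cat > /tmp/hive-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/apps/hive/warehouse</value>
  </property>
</configuration>
EOF

warehouse_dir() {
  # Split on < and >, so $2 is the tag name and $3 is the element text.
  awk -F'[<>]' '
    $3 == "hive.metastore.warehouse.dir" { found = 1; next }
    found && $2 == "value" { print $3; exit }
  ' "$1"
}

warehouse_dir /tmp/hive-site-sample.xml
```

The printed path is what you would pass to hdfs dfs -du in the commands above.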
I have an FTP server (F), a Linux box (S, standalone) and a Hadoop cluster (C). The current file flow is F -> S -> C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp attempt, being a job, is killed by timeout. The logs (hadoop job -logs) only say it was killed by timeout. I tried to wget from the FTP server on one of the nodes of C and it worked. What could be the reason, and any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The single - tells hadoop fs -put to read from stdin, and wget's -O - writes the downloaded file to stdout instead of saving it locally.
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used, as the source is treated as a file on the local file system; the scheme you are trying to pass is not taken into account. Refer to the javadoc: FileSystem
DistCp is only for large intra- or inter-cluster copies (between Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. A two-step process is still your best bet. Or write a program to read from FTP and write to HDFS.
Does Hadoop version 2.0.0 / CDH4 have an SFTP file system in place? I know Hadoop supports an FTP FileSystem. Does it have something similar for SFTP? I have seen some patches submitted for the same, though I couldn't make sense of them.
Consider using hadoop distcp.
Check here. That would be something like:
hadoop distcp
-D fs.sftp.credfile=/user/john/credstore/private/mycreds.prop
sftp://myHost.ibm.com/home/biadmin/myFile/part1
hdfs:///user/john/myfiles
After some research, I have figured out that Hadoop currently doesn't have a FileSystem written for SFTP. Hence, if you wish to read data over an SFTP channel, you have to either write an SFTP FileSystem (which is quite a big deal, extending and overriding lots of classes and methods; patches have been developed, though not yet integrated into Hadoop), or get a customized InputFormat that reads from streams, which again is not implemented in Hadoop.
You need to ensure core-site.xml has the property fs.sftp.impl set to the value org.apache.hadoop.fs.sftp.SFTPFileSystem.
After this, hadoop commands will work. A couple of samples are given below.
ls command
Command on hadoop
hadoop fs -ls /
equivalent for SFTP
hadoop fs -D fs.sftp.user.{hostname}={username} -D fs.sftp.password.{hostname}.{username}={password} -ls sftp://{hostname}:22/
Distcp command
Command on hadoop
hadoop distcp {sourceLocation} {destinationLocation}
equivalent for SFTP
hadoop distcp -D fs.sftp.user.{hostname}={username} -D fs.sftp.password.{hostname}.{username}={password} sftp://{hostname}:22/{sourceLocation} {destinationLocation}
Ensure you replace all the placeholders when trying these commands. I tried them on AWS EMR 5.28.1, which has Hadoop 2.8.5 installed.
So hopefully this cleans these answers up into something more digestible. Basically, Hadoop/HDFS is capable of supporting SFTP; it's just not enabled by default, nor is it documented very well in core-default.xml.
The key configuration you need to set to enable SFTP support is:
<property>
<name>fs.sftp.impl</name>
<value>org.apache.hadoop.fs.sftp.SFTPFileSystem</value>
</property>
Alternatively, you can set it right at the CLI, depending on your command:
hdfs dfs \
-Dfs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
-Dfs.sftp.keyfile=~/.ssh/java_sftp_testkey.ppk \
-ls sftp://$USER@localhost/tmp/
The biggest requirement is that your SSH keyfile needs to be passphrase-less to work. This can be done via:
cp ~/.ssh/mykeyfile.ppk ~/.ssh/mykeyfile.ppk.orig
ssh-keygen -p -P MyPass -N "" -f ~/.ssh/mykeyfile.ppk
mv ~/.ssh/mykeyfile.ppk ~/.ssh/mykeyfile_nopass.ppk
mv ~/.ssh/mykeyfile.ppk.orig ~/.ssh/mykeyfile.ppk
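You can confirm the converted key really is passphrase-less before pointing fs.sftp.keyfile at it: ssh-keygen -y with an empty passphrase prints the public key only when no passphrase is set. The key path below is a throwaway generated purely for illustration; substitute your real mykeyfile_nopass.ppk.

```shell
#!/bin/sh
# Generate a throwaway, passphrase-less key purely for demonstration.
rm -f /tmp/demo_sftp_key /tmp/demo_sftp_key.pub
ssh-keygen -q -t rsa -b 2048 -N "" -f /tmp/demo_sftp_key

# -y re-derives the public key; combined with -P "" it succeeds only if
# the private key has no passphrase.
if ssh-keygen -y -P "" -f /tmp/demo_sftp_key >/dev/null; then
  echo "key is passphrase-less"
fi
```

If this prompts or fails instead, the ssh-keygen -p step above didn't actually strip the passphrase.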
And finally, the biggest (and maybe neatest) win is using this via distcp, if you need to send or receive a large amount of data to or from an SFTP server. There's an oddity: the SSH keyfile is needed locally to generate the directory listing, as well as on the cluster nodes for the actual workers.
Something like this should work well enough:
cd workdir
ln -s ~/.ssh/java_sftp_testkey.ppk
hadoop distcp \
--files ~/.ssh/java_sftp_testkey.ppk \
-Dfs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
-Dfs.sftp.keyfile=java_sftp_testkey.ppk \
hdfs:///path/to/source/ \
sftp://user@FQDN/path/to/dest