Hadoop: configure which cluster is queried based on a flag/env parameter - hadoop

Apologies beforehand if this turns out to be a silly question, I am new to hadoop environment.
I have two hadoop clusters my-prod-cluster and my-bcp-cluster.
Both are accessible over the same network.
Is there any way to configure my clusters such that when I am in BCP mode, all my queries to my-prod-cluster get routed to my-bcp-cluster (on the basis of some config parameter or environment variable)?
So when flag=prod
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-prod-cluster/mydir
and fetches the data in /my-prod-cluster/mydir
and when flag=bcp
hadoop fs -ls /my-prod-cluster/mydir translates to hadoop fs -ls /my-bcp-cluster/mydir
and fetches data from /my-bcp-cluster/mydir
I am using the MapR flavour of Hadoop (provided by HPE), version 6.1, in case that matters.

You could easily make a shell wrapper script that prepends the NameNode address to each query.
For example, a fully-qualified command would look like this:
hdfs dfs -ls hdfs://my-prod-cluster.domain.com/path/to/mydir
So, refactoring that, you could have a script like
#!/bin/sh
if [ "$1" = "prod" ]; then
  NAMENODE=hdfs://my-prod-cluster.domain.com
fi
# TODO: error handling and more clusters
HDFS_PATH=$2   # avoid naming this PATH, which would clobber the shell's command lookup
hdfs dfs -ls "${NAMENODE}${HDFS_PATH}"
Then execute something like my-hdfs-ls prod /mydir
If you need something more complex than that, like Kerberos tickets and such, then creating a separate HADOOP_CONF_DIR with unique core-site and hdfs-site XMLs for each cluster would be recommended.
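For example, a minimal sketch of the HADOOP_CONF_DIR approach, assuming you keep one copy of core-site.xml/hdfs-site.xml per cluster under hypothetical directories /etc/hadoop/conf.prod and /etc/hadoop/conf.bcp:
#!/bin/sh
# Pick a client config directory based on a flag; the directory names are illustrative.
case "$1" in
  prod) export HADOOP_CONF_DIR=/etc/hadoop/conf.prod ;;
  bcp)  export HADOOP_CONF_DIR=/etc/hadoop/conf.bcp ;;
  *)    echo "usage: $0 prod|bcp <path>" >&2; exit 1 ;;
esac
shift
# Every command below now resolves fs.defaultFS from the chosen config directory.
hadoop fs -ls "$@"
Then something like my-hdfs-switch bcp /mydir lists /mydir on whichever cluster the BCP configs point at, without changing the path you type.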

Related

How to create new user in hadoop

I am new to Hadoop. I have done an Apache Hadoop multinode installation, and the user name is hadoop.
I am using 3 nodes in total: 1 namenode and 2 datanodes.
I have to create a new user for data isolation. I have found a few links on Google, but those are not working and I am unable to access HDFS.
[user1@datanode1 ~]# hdfs dfs -ls -R /
bash: hdfs: command not found...
Can someone help me with the steps to create a new user that can access HDFS for data isolation? And on which node should I create the new user?
Thanks
Hadoop doesn't have users the way Linux does. Users are generally managed by external LDAP/Kerberos systems. By default there are no security features at all: the user name is taken from the HADOOP_USER_NAME environment variable (which anyone can override with export), or, if that is unset, from the current OS username. For example, your command [user1@datanode1 ~]# hdfs dfs -ls would actually run hdfs dfs -ls /user/user1, and return an error if that folder doesn't exist yet.
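To see how loose this is without Kerberos, here is a quick sketch (the user name is a placeholder):
export HADOOP_USER_NAME=username    # the client simply reports this identity to the cluster
hdfs dfs -ls                        # with no path, lists that user's home directory /user/username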
However, your actual error says that your OS PATH variable does not include $HADOOP_HOME/bin. Edit your .bashrc to fix this.
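For example, something like this in the new user's .bashrc; the install path is an assumption, so point HADOOP_HOME at wherever Hadoop actually lives on your nodes:
# ~/.bashrc  (adjust /opt/hadoop to your actual installation directory)
export HADOOP_HOME=/opt/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"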
You'd create an HDFS home folder for a user named "username" with
hdfs dfs -mkdir /user/username
hdfs dfs -chown username /user/username
hdfs dfs -chmod -R 770 /user/username
You should also run the useradd command on the namenode machine to make sure it knows about a user named "username".
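Putting it together, a rough sketch; it assumes the HDFS superuser account is called hdfs, which is common but may differ on your distribution:
sudo useradd -m username                          # OS account on the namenode
sudo -u hdfs hdfs dfs -mkdir -p /user/username    # create the HDFS home dir as the superuser
sudo -u hdfs hdfs dfs -chown username /user/username
sudo -u hdfs hdfs dfs -chmod 770 /user/username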

HDFS space consumed: "hdfs dfs -du /" vs "hdfs dfsadmin -report"

Which tool is the right one to measure HDFS space consumed?
When I sum up the output of "hdfs dfs -du /" I always get less amount of space consumed compared to "hdfs dfsadmin -report" ("DFS Used" line). Is there data that du does not take into account?
The Hadoop file system provides reliable storage by putting a copy of the data on several nodes. The number of copies is the replication factor; usually it is greater than one.
The command hdfs dfs -du / shows the space consumed by your data without replication.
The command hdfs dfsadmin -report (the DFS Used line) shows actual disk usage, taking data replication into account. So it should be several times bigger than the number you get from the dfs -du command.
How HDFS storage works, in brief:
Say the replication factor = 3 (the default)
and the data file size = 10GB (e.g. xyz.log).
HDFS will take 10x3 = 30GB to store that file.
Depending on which command you use, you will get a different value for the space occupied in HDFS (10GB vs 30GB).
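To see both numbers side by side on a live cluster (the file path below is just an example):
hdfs dfs -du -s -h /                          # logical size: what your data adds up to before replication
hdfs dfsadmin -report | grep "DFS Used"       # physical size: includes every replica on disk
hdfs dfs -stat "%r %b %n" /path/to/xyz.log    # per-file replication factor and length in bytes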
If you are on a recent version of Hadoop, try the following command. In my case this works very well on Hortonworks Data Platform (HDP) 2.3.* and above. It should also work on Cloudera's latest platform.
hadoop fs -count -q -h -v /path/to/directory
(-q = quota, -h = human readable values, -v = verbose)
This command will show the following fields in the output.
QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Where
CONTENT_SIZE = real file size without replication (10GB) and
SPACE_QUOTA = space occupied in HDFS to save the file (30GB)
Notes:
Control the replication factor by modifying the "dfs.replication" property in the hdfs-site.xml file (under the conf/ dir of the default Hadoop installation directory). Changing this via Ambari/Cloudera Manager is recommended if you have a multinode cluster.
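You can also change replication per path or per write from the CLI, without touching the cluster-wide default (paths below are placeholders):
hdfs dfs -setrep -w 2 /path/to/xyz.log                 # change an existing file's replication and wait for it
hdfs dfs -D dfs.replication=2 -put xyz.log /path/to/   # write a new file with a non-default factor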
There are other commands to check storage space, e.g. hadoop fsck and hadoop dfs -dus.

Failed to copy file from FTP to HDFS

I have an FTP server (F [ftp]), a Linux box (S [standalone]) and a Hadoop cluster (C [cluster]). The current file flow is F->S->C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp one, being a job, is killed by timeout. The logs (hadoop job -logs) only say it was killed by timeout. I tried to wget from the FTP server on one of the nodes of C and it worked. What could be the reason, and any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The -O - makes wget write to stdout, and the single - tells HDFS put to read from stdin.
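A similar one-liner with curl, which streams to stdout by default (credentials and paths are placeholders):
curl -s -u user:password "ftp://ftpserver/absolute_path_to_file" | hadoop fs -put - path_in_hdfs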
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used: -cp treats the source as a file in the local file system and does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem.
distcp is only for large intra- or inter-cluster copies (to be read as between Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. The 2-step process is still your best bet. Or write a program to read from FTP and write to HDFS.

How does one run "hadoop fs -text ." remotely using the Java API?

Basically, what I want is to use the Hadoop Java API to make a call from my local machine to a remote Hadoop cluster, and have the remote Hadoop cluster execute the command.
It should be roughly equivalent to "ssh user@remote 'hadoop fs -text .'"
First of all, if all you want is exactly what hadoop fs -text gives you, then you can certainly just install the Hadoop client on your local machine and run it there, being sure to specify the full path:
hadoop fs -text hdfs://remote.namenode.host:9000/my/file
But if you do have a reason to do it from Java, the basic answer is something like this:
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
Configuration conf = new Configuration();
Path p = new Path("hdfs://remote.namenode.host:9000/foo/bar");
FileSystem fs = p.getFileSystem(conf);  // picks the remote cluster's FileSystem based on the hdfs:// scheme
InputStream in = fs.open(p);            // stream the remote file's bytes
You can then read from that input stream however you like: copy it to stdout or whatever.
Note that fs -text is a little more clever than just raw copying. It detects gzipped files and sequence files and "decodes" them into text. That gets pretty tricky; you can check out the source code to see how it's done internally.

SFTP file system in hadoop

Do Hadoop version 2.0.0 and CDH4 have an SFTP file system in place? I know Hadoop has support for an FTP file system. Does it have something similar for SFTP? I have seen some patches submitted for the same, though I couldn't make sense of them.
Consider using hadoop distcp.
Check here. That would be something like:
hadoop distcp \
  -D fs.sftp.credfile=/user/john/credstore/private/mycreds.prop \
  sftp://myHost.ibm.com/home/biadmin/myFile/part1 \
  hdfs:///user/john/myfiles
After some research, I have figured out that Hadoop currently doesn't have a FileSystem written for SFTP. Hence, if you wish to read data over an SFTP channel, you have to either write an SFTP FileSystem (which is quite a big deal, extending and overriding lots of classes and methods; patches for it have already been developed, though not yet integrated into Hadoop), or get a customized InputFormat that reads from streams, which again is not implemented in Hadoop.
You need to ensure core-site.xml has the property fs.sftp.impl set to the value org.apache.hadoop.fs.sftp.SFTPFileSystem.
After that, hadoop commands will work. A couple of samples are given below.
ls command
Command on hadoop
hadoop fs -ls /
equivalent for SFTP
hadoop fs -D fs.sftp.user.{hostname}={username} -D fs.sftp.password.{hostname}.{username}={password} -ls sftp://{hostname}:22/
Distcp command
Command on hadoop
hadoop distcp {sourceLocation} {destinationLocation}
equivalent for SFTP
hadoop distcp -D fs.sftp.user.{hostname}={username} -D fs.sftp.password.{hostname}.{username}={password} sftp://{hostname}:22/{sourceLocation} {destinationLocation}
Ensure you replace all the placeholders when trying these commands. I tried them on AWS EMR 5.28.1, which has Hadoop 2.8.5 installed on it.
So, hopefully cleaning these answers up into something more digestible: basically, Hadoop/HDFS is capable of supporting SFTP; it's just not enabled by default, nor is it documented very well in core-default.xml.
The key configuration you need to set to enable SFTP support is:
<property>
<name>fs.sftp.impl</name>
<value>org.apache.hadoop.fs.sftp.SFTPFileSystem</value>
</property>
Alternatively, you can set it right at the CLI, depending on your command:
hdfs dfs \
-Dfs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
-Dfs.sftp.keyfile=~/.ssh/java_sftp_testkey.ppk \
-ls sftp://$USER@localhost/tmp/
The biggest requirement is that your SSH keyfile needs to be password-less to work. This can be done via:
cp ~/.ssh/mykeyfile.ppk ~/.ssh/mykeyfile.ppk.orig        # back up the original key
ssh-keygen -p -P MyPass -N "" -f ~/.ssh/mykeyfile.ppk    # strip the passphrase in place
mv ~/.ssh/mykeyfile.ppk ~/.ssh/mykeyfile_nopass.ppk      # keep the password-less copy under a new name
mv ~/.ssh/mykeyfile.ppk.orig ~/.ssh/mykeyfile.ppk        # restore the original, protected key
And finally, the biggest (and maybe neatest) use is via distcp, if you need to send/receive a large amount of data to/from an SFTP server. There's an oddity about the SSH keyfile being needed locally to generate the directory listing, as well as on the cluster for the actual workers.
Something like this should work well enough:
cd workdir
ln -s ~/.ssh/java_sftp_testkey.ppk
hadoop distcp \
--files ~/.ssh/java_sftp_testkey.ppk \
-Dfs.sftp.impl=org.apache.hadoop.fs.sftp.SFTPFileSystem \
-Dfs.sftp.keyfile=java_sftp_testkey.ppk \
hdfs:///path/to/source/ \
sftp://user@FQDN/path/to/dest
