How to point a client's HADOOP_CONF_DIR files (yarn-site.xml, ...) at a remote HDFS directory - hadoop

I have a single YARN cluster that is used by many remote clients, each of which submits Spark applications to it.
I need to set the HADOOP_CONF_DIR environment variable on each client because my master is YARN (--master yarn), but I don't want to copy the configuration directory from the YARN cluster to every client separately.
I want to put the configuration directory in HDFS, which is accessible to all clients.
Now, how can I set this environment variable (HADOOP_CONF_DIR) on each client so that it reads the configuration from an HDFS URL?
For example, I tried:
export HADOOP_CONF_DIR=hdfs://namenodeIP:9000/path/to/conf_dir
or, in Python code:
os.environ['HADOOP_CONF_DIR'] = 'hdfs://namenodeIP:9000/path/to/conf_dir'
Neither of them works for me.
What is the correct form?
And where should I set this: in code, in spark-env.sh, in the terminal, ...?

I don't think this is possible. HADOOP_CONF_DIR has to point at a local directory: the client needs those files to learn the HDFS NameNode location (not only the YARN ResourceManager address) in the first place, so it cannot read them from HDFS before it is configured.
If you have control over the client machines, you could use a tool like Syncthing to distribute the files automatically, but that assumes your intra-cluster connection values are the same ones external clients use (i.e. use FQDNs in all server addresses).
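If you do want HDFS to remain the single source of truth for the configuration files, one possible workaround (just a sketch, not a replacement for the advice above) is to pull them down to a local directory on each client and point HADOOP_CONF_DIR at that local copy. The sketch below assumes WebHDFS is enabled on the NameNode and reachable on its HTTP port (9870 by default in Hadoop 3.x, 50070 in 2.x); the host and paths are placeholders from the question, and note that the NameNode address still has to be hard-coded on the client.
import json
import os
import urllib.request

# Assumptions: WebHDFS is enabled on the NameNode HTTP port (9870 in Hadoop 3.x,
# 50070 in 2.x); the host name and paths below are placeholders.
NAMENODE_HTTP = "http://namenodeIP:9870"
HDFS_CONF_DIR = "/path/to/conf_dir"
LOCAL_CONF_DIR = os.path.expanduser("~/hadoop-conf")

os.makedirs(LOCAL_CONF_DIR, exist_ok=True)

# List the configuration files stored in HDFS.
list_url = f"{NAMENODE_HTTP}/webhdfs/v1{HDFS_CONF_DIR}?op=LISTSTATUS"
with urllib.request.urlopen(list_url) as resp:
    files = json.load(resp)["FileStatuses"]["FileStatus"]

# Download each file (yarn-site.xml, core-site.xml, ...) to the local directory.
for f in files:
    name = f["pathSuffix"]
    open_url = f"{NAMENODE_HTTP}/webhdfs/v1{HDFS_CONF_DIR}/{name}?op=OPEN"
    with urllib.request.urlopen(open_url) as resp, \
            open(os.path.join(LOCAL_CONF_DIR, name), "wb") as out:
        out.write(resp.read())

# HADOOP_CONF_DIR must be a local path; set it before launching spark-submit.
os.environ["HADOOP_CONF_DIR"] = LOCAL_CONF_DIR
Setting the variable in spark-env.sh or in the shell works the same way, as long as the value is a local filesystem path.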

Related

How to send files from Local Machine to HortonBox instance running on Virtual Box?

I'm using Hortonbox 3.0.1 in VirtualBox and SSH into it using PuTTY. I have some files on my local machine (Windows 10) that I want to store in the Hadoop file system.
SSH-ing into the Hortonbox instance gives me a terminal on the instance, which means the files on the Windows machine are not visible from that terminal.
Is there any way I can put files into the HDFS instance directly?
I am aware of WinSCP, but that does not really serve my purpose. WinSCP would mean sending the file to the VM, using my SSH session to store the file in Hadoop, and then deleting the file from the VM after it has been stored on the data nodes. I might be wrong, but this seems like additional, redundant work, and I would always need buffer storage on the machine where Hadoop is running. For extremely large files this approach will almost certainly fail, since I would first need to store the entire file on the secondary disk and only then send it to the data nodes through the name node.
Is there any way to achieve this, or is the problem I'm facing due to using a Hortonbox instance? How do organizations handle sending data from several nodes to the namenode and then to the datanodes?
First, you don't send data through the namenode for it to be placed on the datanodes. When you issue an hdfs put command, the only thing requested from the namenode is the set of datanode locations where the blocks should be placed; the client then streams the data to those datanodes directly.
That being said, if you want to skip SSH entirely, you need to forward the NameNode and DataNode ports from the VM to your host, then install and configure the hadoop fs/hdfs commands on your Windows host so that you can issue them directly from CMD.
The alternative is to use Fuse/SFTP/NFS/Samba mounts (referred to as a "shared folder" in the VirtualBox GUI) from Windows into the VM, where you could then run put without copying anything into the VM.
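For the first option, once the Hadoop client on the Windows host is configured (with fs.defaultFS in core-site.xml pointing at the forwarded NameNode port), an upload could look like the hedged sketch below; the paths and the port-forwarded address are illustrative assumptions, not values from the question.
import subprocess

# Assumptions: the Hadoop client is installed on the Windows host, its bin
# directory is on PATH, and core-site.xml points fs.defaultFS at the
# port-forwarded NameNode (e.g. hdfs://localhost:8020). Paths are placeholders.
local_file = r"C:\data\example.csv"
hdfs_target = "/user/someuser/data/"

# Equivalent to typing "hadoop fs -put ..." in CMD; shell=True lets cmd.exe
# resolve hadoop.cmd on Windows.
subprocess.run(f'hadoop fs -put "{local_file}" "{hdfs_target}"', shell=True, check=True)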

Hadoop access without SSH

Is there a way to allow a developer to access a Hadoop command line without SSH? I would like to place some Hadoop clusters in a specific environment where SSH is not permitted. I have searched for alternatives such as a desktop client but so far have not seen anything. I will also need to federate sign-on info for developers.
If you're asking about hadoop fs and similar commands, you don't need SSH for this.
You just need to download the Hadoop clients and configure core-site.xml (and hdfs-site.xml) to point at the remote cluster. However, this is an administrative security hole, so setting up an edge node that does have trusted and audited SSH access is preferred.
Similarly, Hive, HBase or Spark jobs can be run with the appropriate clients or configuration files without any SSH access, just local libraries.
You don't need SSH to use Hadoop. Also, Hadoop is a combination of different stacks; which part of Hadoop are you referring to specifically? If you are talking about HDFS, you can use WebHDFS. If you are talking about YARN, you can use its REST API. There are also various UI tools such as Hue, and notebook apps such as Zeppelin or Jupyter can also be helpful.
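As a concrete illustration of the YARN REST API route, the sketch below lists the applications known to the ResourceManager over plain HTTP; the hostname is a placeholder and the default web port 8088 is assumed.
import json
import urllib.request

# Assumptions: the ResourceManager web/REST port (8088 by default) is reachable
# from the client machine; "resourcemanager-host" is a placeholder.
RM = "http://resourcemanager-host:8088"

# List applications known to the ResourceManager -- no SSH involved.
with urllib.request.urlopen(f"{RM}/ws/v1/cluster/apps") as resp:
    apps = json.load(resp).get("apps") or {}

for app in apps.get("app", []):
    print(app["id"], app["state"], app["name"])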

Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for fetching data from HDFS to a local system. StreamSets was suggested to us, and no one on the team has any experience with it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?
Just an additional question.
I have set up StreamSets locally, for example on a local IP such as xxx.xx.x.xx:18630, and it works fine on that machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications, such as Shiny Server, work fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question, Data Collector listens on all addresses by default. There is a http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, the * in front of 18630 in the output, means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
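If you want to rule the firewall in or out, a quick reachability check run from the other machine can help; here is a small sketch (the IP is a placeholder, as in the question).
import socket

# Placeholder address of the machine running Data Collector, as in the question.
HOST, PORT = "xxx.xx.x.xx", 18630

try:
    # Attempt a plain TCP connection to the Data Collector port.
    with socket.create_connection((HOST, PORT), timeout=5):
        print("Port is reachable - the problem is probably not the network.")
except OSError as exc:
    print(f"Cannot reach {HOST}:{PORT} - likely a firewall or bind/forwarding issue: {exc}")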
I believe that by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an NFS Gateway. Or you can probably use StreamSets to write to FTP / NFS.
It's not clear what data you're trying to get, but many BI tools can perform CSV exports, and Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is a minimalist way to get data from HDFS to a local machine. However, Hadoop typically stores many TB of data in the ideal case, and if you're working with anything smaller, dumping those results into a database is usually a better option than moving flat files around.
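Purely as an illustration of that last point, a getmerge call wrapped in Python might look like the following; the paths are placeholders and a configured Hadoop client on the local machine is assumed.
import subprocess

# Assumptions: a configured Hadoop client is on PATH; both paths are placeholders.
# Concatenates every file under the HDFS directory into one local file.
subprocess.run(
    ["hdfs", "dfs", "-getmerge", "/user/someuser/output", "/tmp/output_merged.csv"],
    check=True,
)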

How users should work with an Ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster should actually use it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell allows.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
In the Ambari provisioning wizard there is a step where you choose which hosts the clients are installed on; in my case I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check the Hosts view in Ambari, which lists the components (including clients) installed on each host.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, a client can use a service regardless of which server the service is actually running on.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

Hadoop cluster on NFS

I'm trying to set up a Hadoop cluster on 5 machines on the same LAN, using NFS. The problem I'm facing is that the copy of Hadoop on one machine is shared with all the machines, so I can't provide exclusive properties for each slave. Because of this, I get "Cannot create lock" kinds of errors. The FAQ suggests that NFS should not be used, but I have no other option.
Is there a way I can specify properties so that the master picks its conf files from location1, slave1 picks its conf files from location2, and so on?
Just to be clear, there's a difference between configurations for compute nodes and HDFS storage. Your issue appears to be solely about where the configurations are stored. This can and should be done locally, or at least let each machine map to a symlink based on some locally identified configuration (e.g. mach01 -> /etc/config/mach01, ...).
(Revision 1) Regarding the comment/question below about symlinks: first, I'll admit that this is not something I can immediately solve. There are two approaches I see (a rough sketch of both follows below):
Have a script (e.g. at startup, or as a wrapper for starting Hadoop) on each machine determine the hostname (e.g. with hostname -a), which then identifies a local symlink (e.g. /usr/local/hadoopConfig) pointing to the correct directory in the NFS directory structure.
Set an environment variable, a la HADOOP_HOME, based on the local machine's hostname, and let the various scripts work with it.
Although #1 should work, it is a method relayed to me, not one I set up myself, and I'd be a little concerned about the symlinks in the event that a hostname is misconfigured (this can happen). Method #2 seems more robust.
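Here is a minimal sketch of both ideas, under the assumption of a per-host configuration layout on the NFS mount; every path is a placeholder, and creating the symlink typically requires the appropriate permissions.
import os
import socket

# Assumed layout: one configuration directory per host on the shared NFS mount,
# e.g. /nfs/hadoop/conf/mach01, /nfs/hadoop/conf/mach02, ... (placeholders).
NFS_CONF_ROOT = "/nfs/hadoop/conf"
LOCAL_LINK = "/usr/local/hadoopConfig"   # stable path the start scripts can use

host = socket.gethostname().split(".")[0]   # short hostname, e.g. "mach01"
per_host_conf = os.path.join(NFS_CONF_ROOT, host)

if not os.path.isdir(per_host_conf):
    raise SystemExit(f"No configuration directory found for host {host!r}")

# Approach #1: refresh a local symlink so Hadoop always reads this machine's config.
if os.path.islink(LOCAL_LINK):
    os.remove(LOCAL_LINK)
os.symlink(per_host_conf, LOCAL_LINK)

# Approach #2: export an environment variable (HADOOP_CONF_DIR used here as an example).
os.environ["HADOOP_CONF_DIR"] = per_host_conf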
