Hadoop cluster on NFS

I'm trying to set up a Hadoop cluster on 5 machines on the same LAN with NFS. The problem I'm facing is that the copy of Hadoop on one machine is replicated on all the machines, so I can't provide exclusive properties for each slave. Because of this, I get "Cannot create lock" kind of errors. The FAQ suggests that NFS should not be used, but I have no other option.
Is there a way I can specify properties so that the master picks its conf files from location1, slave1 picks its conf files from location2, and so on?

Just to be clear, there's a difference between configurations for compute nodes and HDFS storage. Your issue appears to be solely the storage for configurations. This can and should be done locally, or at least let each machine map to a symlink based on some locally identified configuration (e.g. Mach01 -> /etc/config/mach01, ...).
Regarding the comment/question below about symlinks: first, I'll admit that this is not something I can immediately solve. There are two approaches I see:
Have a script (e.g. at startup or as a wrapper for starting Hadoop) on the machine determine the hostname (e.g. hostname -a), which then identifies a local symlink (e.g. /usr/local/hadoopConfig) pointing to the correct directory in the NFS directory structure.
Set an environment variable, a la HADOOP_HOME, based on the local machine's hostname, and let various scripts work with this.
Although #1 should work, it is a method relayed to me, not one that I set up myself, and I'd be a little concerned about the symlinks in the event that the hostname is misconfigured (this can happen). Method #2 seems more robust; a rough sketch of it follows.
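A rough sketch of method #2, assuming the per-host configuration directories live on the NFS share under /nfs/hadoop/conf/<hostname> (the paths and the wrapper name are illustrative, not taken from the question):

#!/bin/sh
# hadoop-conf-wrapper.sh - hypothetical helper that picks a per-host config directory before starting Hadoop.
HOST=$(hostname -s)
CONF_DIR=/nfs/hadoop/conf/${HOST}

if [ ! -d "${CONF_DIR}" ]; then
    echo "No config directory for ${HOST} at ${CONF_DIR}" >&2
    exit 1
fi

# Hadoop's start scripts honour HADOOP_CONF_DIR if it is set.
export HADOOP_CONF_DIR=${CONF_DIR}
exec "$@"

It would be invoked as, for example, ./hadoop-conf-wrapper.sh hadoop-daemon.sh start datanode, so each machine reads only its own conf directory even though the Hadoop installation itself is shared over NFS.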

Related

How to address HADOOP_CONF_DIR files (yarn-site.xml, ...) from a client to a remote HDFS directory

I have a single Yarn cluster which is used by many remote clients that submit Spark applications to it.
I need to set the HADOOP_CONF_DIR environment variable on each client because my master is yarn (--master yarn), but I don't want to copy it from the Yarn cluster to each client separately.
I want to put HADOOP_CONF_DIR in HDFS, which is accessible to all clients.
Now, how can I point this environment variable (HADOOP_CONF_DIR) at an HDFS URL on each client so that it reads the configuration from there?
For example, when I use it like this:
export HADOOP_CONF_DIR=hdfs://namenodeIP:9000/path/to/conf_dir
or in Python code I use:
os.environ['HADOOP_CONF_DIR'] = 'hdfs://namenodeIP:9000/path/to/conf_dir'
Neither of them works for me.
What is the correct form?
And where should I set it: in code, in spark-env.sh, in the terminal, ...?
I don't think this is possible. HADOOP_CONF_DIR has to point at a local directory: the files inside it are what tell the client where the HDFS NameNode is in the first place, not only where YARN is, so the variable cannot itself be an hdfs:// URL.
If you have control over the client machines, you could use a tool like Syncthing to distribute the files automatically, but that assumes the connection addresses used inside the cluster are the same ones external clients use (i.e. use FQDNs in all server addresses).
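If an occasional copy is acceptable, a low-tech sketch of that idea is to pull the config files from a cluster node once and point the variable at the local copy (the host name and paths below are placeholders, not from the question):

# One-time (or cron-driven) copy of the cluster's client configs to this machine.
mkdir -p ~/hadoop-conf
scp 'yarn-master.example.com:/etc/hadoop/conf/*' ~/hadoop-conf/

# HADOOP_CONF_DIR must be a local path; set it in spark-env.sh or the shell profile.
export HADOOP_CONF_DIR=$HOME/hadoop-conf
spark-submit --master yarn --deploy-mode cluster my_app.py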

Sync config files between nodes on hadoop cluster

I have a hadoop cluster consisting of 4 nodes on which I am running a pyspark script. I have a config.ini file which contains details like locations of certificates, passwords, server names etc which are needed by the script. Each time this file is updated I need to sync the changes across all 4 nodes. Is there a way to avoid that?
I have not needed to do this for the script itself: making changes on just one node and running it from there is enough. Is the same possible for the config file?
The most secure answer is likely to learn how to use a keystore with Spark.
A little less secure but still good: you could put the file in HDFS and just reference it from there (lower security but easier to use).
Insecure methods that are easy to use:
You can pass it as a file to spark-submit, which transfers the file to the cluster for you (see the example below).
Or you could add the values directly to your spark-submit command.
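A minimal sketch of the spark-submit --files approach, assuming the script is called job.py and the config sits next to it on the node you submit from (the names and paths are illustrative):

# Ship config.ini to the driver and executors along with the job.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /home/user/config.ini \
  /home/user/job.py

# Inside job.py the shipped copy can be opened from the task's working directory
# (e.g. via pyspark's SparkFiles.get("config.ini")), so only the copy on the
# submitting node ever needs to be updated.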

How to send files from Local Machine to HortonBox instance running on Virtual Box?

I'm using Hortonbox 3.0.1 on VirtualBox and SSH into it using PuTTY. I have some files on my local machine (Windows 10) which I want to store in the Hadoop file system.
SSH-ing into the Hortonbox instance gives me a terminal on the instance, which means the files on the Windows machine are not visible to that terminal.
Is there any way I can put files into the HDFS instance?
I am aware of WinSCP, but that does not really serve my purpose. WinSCP would mean sending the file onto the VM, using my SSH session to store the file in Hadoop, and then deleting the file from the VM after it is stored on the data nodes. I might be wrong, but this seems like additional and redundant work, and I would always need buffer storage on the machine where Hadoop is running. For extremely large files this solution will almost certainly fail, since I would first need to store the entire file on the secondary disk and then send it to the data nodes through the name node. Is there any way to achieve this, or is the problem I'm facing due to using a Hortonbox instance? How do organizations handle sending data from several nodes to the namenode and then to the datanodes?
First, you don't send data to the namenode for it to be placed on datanodes. When you issue hdfs put commands, the only information requested from the namenode is the set of datanode locations where the blocks should be placed; the client then writes the data directly to those datanodes.
That being said, if you want to skip SSH entirely, you need to forward the NameNode and datanode ports from the VM to your host, then install and configure the hadoop fs/hdfs commands on your Windows host so that you can issue them directly from CMD.
The alternative is to use FUSE/SFTP/NFS/Samba mounts (referred to as a "shared folder" in the VirtualBox GUI) from Windows into the VM, where you could then run put without copying anything into the VM.
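A rough sketch of the port-forwarding route, assuming the VM is named "hortonbox", uses NAT networking, and runs Hadoop 3 on its default ports (the names and port numbers are assumptions, check your sandbox):

# Forward the NameNode RPC and DataNode transfer ports to the Windows host.
# The VM must be powered off for modifyvm; use "VBoxManage controlvm ... natpf1 ..." while it is running.
VBoxManage modifyvm "hortonbox" --natpf1 "nn-rpc,tcp,,8020,,8020"
VBoxManage modifyvm "hortonbox" --natpf1 "dn-data,tcp,,9866,,9866"

# With a Hadoop client installed on Windows and fs.defaultFS pointed at hdfs://localhost:8020,
# files can then be pushed straight from CMD without landing on the VM's disk first:
hdfs dfs -put C:\data\bigfile.csv /user/hadoop/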

CoreOS & HDFS - Running a distributed file system in Linux Containers/Docker

I need some sort of distributed file system running on a CoreOS cluster.
As such I'd like to run HDFS on CoreOS nodes. Is this possible?
I can see 2 options:
Expand CoreOS - Install HDFS directly onto CoreOS - not ideal as it breaks the whole concept of CoreOS's containerisation and would mean installing a lot of additional components
Somehow run HDFS in a Docker container on CoreOS and set affinities
Option 2 seems like the best approach; however, there are some potential blockers:
How do I reliably expose the physical disks to the Docker container running HDFS?
How do you scale container affinities?
How does this work with the NameNodes, etc.?
Cheers.
I'll try to provide two possibilities. I haven't tried either of these, so they are mostly suggestions, but they could get you down the right path.
The first, if you want to do HDFS and it requires device access on the host, would be to run the HDFS daemons in a privileged container that had access to the required host devices (the disks directly). See https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration for information on the --privileged and --device flags.
In theory, you could pass the devices to the container that is handling access to the disks. Then you could use something like --link so the containers can talk to each other. The NameNode would store the metadata on the host using a volume (passed with -v); a rough sketch is below. Though, given the little reading I have done about the NameNode, it seems there isn't a good solution yet for high availability anyway, and it is a single point of failure.
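A rough sketch of what that first option might look like, assuming an image with Hadoop installed (my-hadoop-image) and a dedicated data disk at /dev/sdb; the image name, paths and commands are illustrative, not a tested setup:

# NameNode: keep the metadata on the host through a volume.
docker run -d --name namenode \
  -v /srv/hdfs/name:/hadoop/dfs/name \
  my-hadoop-image hdfs namenode

# DataNode: privileged so it can reach the raw disk passed in with --device,
# linked to the NameNode container so the daemons can talk to each other.
docker run -d --name datanode \
  --privileged --device /dev/sdb:/dev/sdb \
  --link namenode:namenode \
  my-hadoop-image hdfs datanode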
The second option to explore, if you are looking for a clustered file system and not HDFS in particular, would be to check out the recent Ceph FS support added to the kernel in CoreOS 471.1.0: https://coreos.com/releases/#471.1.0. You might then be able to use the same approach of privileged container to access host disks to build a Ceph FS cluster. Then you might have a 'data only' container that had Ceph tools installed to mount a directory on the Ceph FS cluster, and expose this as a volume for other containers to use.
Both of these are only ideas, and I haven't used HDFS or Ceph personally (though I am keeping an eye on Ceph and would like to try something like this soon as a proof of concept).

Hadoop Configuration Physical Location

I want to ask a basic question whose answer I couldn't find in online tutorials.
Do the Hadoop config files need to be on all nodes (NameNode, DataNode, JobTracker, etc.)?
Or do they only need to reside on the user@machine where the NameNode resides?
In other words, to properly set up a fully distributed cluster, do I need to replicate the config files to every single node?
Thanks!
Yes, you are right: the config files need to be on every slave.
I say just the slaves, because a master usually has other configurations you may want to use, which makes the configuration on the slaves a bit more verbose.
Two things that make life easier:
Use an NFS mount for the configuration of the slaves (see the sketch below)
Or use a tool like Chef that does this for you
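A minimal sketch of the NFS-mount option, assuming the master exports /export/hadoop/conf and each slave mounts it over its local config path (the server name and paths are illustrative):

# On each slave: mount the shared, read-only config directory exported by the master.
sudo mount -t nfs -o ro master.example.com:/export/hadoop/conf /etc/hadoop/conf

# Or make it permanent with an /etc/fstab entry:
# master.example.com:/export/hadoop/conf  /etc/hadoop/conf  nfs  ro,defaults  0  0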
