How to run parallelized Python jobs on YARN using Dask? - hadoop

I have a couple of questions on using Dask with Hadoop/Yarn.
1) How do I connect Dask to Hadoop/YARN and parallelize a job?
When I try using:
from dask.distributed import Client
client = Client('Mynamenode:50070')
It results in the error:
CommClosedError: in : Stream is closed: while trying to call remote method 'identity'
Should I pass the address of the namenode or a datanode? Can I point it at ZooKeeper instead?
2) How do I read data from HDFS using Dask and hdfs3?
When I try to read a file using:
import dask.dataframe as dd
import distributed.hdfs
df = dd.read_csv('hdfs:///user/uname/dataset/temps.csv')
It results in the following error:
ImportError: No module named lib
I have tried uninstalling and reinstalling hdfs3, but the error persists.
I have installed knit and tried launching yarn containers using this example:
http://knit.readthedocs.io/en/latest/examples.html#ipython-parallel
This fails with a security error.
I do not have sudo access on the cluster, so installing packages on each node in the cluster is out of the question; the only installations I can do are via conda and pip under my user ID.
Finally, it would be very helpful if someone could post a working example of Dask on YARN.
Any help is greatly appreciated.

The simplest implementation of dask-on-yarn would look like the following:
Install knit with conda install knit -c conda-forge (the package "dask-yarn", perhaps a more obvious name, should be available soon).
The simplest example of how to create a Dask cluster can be found in the documentation. There you create a local conda environment, upload it to HDFS, and have YARN distribute it to the workers, so you do not need sudo access.
Note that there are a lot of parameters that you can pass, so you are encouraged to read the usage and troubleshooting parts of the docs.
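For concreteness, a minimal sketch of that flow, assuming knit's dask_yarn module; the exact constructor and start() arguments may differ between knit versions, and the environment path is a placeholder:

from dask.distributed import Client
from knit.dask_yarn import DaskYARNCluster

# Point knit at a local conda environment; it packages it, uploads it to HDFS,
# and YARN ships it to each container, so no sudo is needed on the nodes.
# The path below is a placeholder for your own environment.
cluster = DaskYARNCluster(env='/home/uname/miniconda3/envs/dask_env')
cluster.start(2, cpus=1, memory=2048)   # ask YARN for 2 worker containers

client = Client(cluster)                # connect to the scheduler knit started
print(client.scheduler_info())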
Specific answers to questions
1) Client('Mynamenode:50070') - Hadoop does not know anything about Dask; there is no reason a namenode server should know what to do with a Dask client connection.
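To illustrate what the client expects instead: a Dask Client connects to a dask-scheduler process (port 8786 by default), not to the namenode's web port; the host name below is a placeholder:

from dask.distributed import Client

# Connect to a running dask-scheduler, e.g. one started by knit/dask-yarn
# or by hand with the dask-scheduler command; 8786 is its default port.
client = Client('dask-scheduler-host:8786')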
2) No module named lib - this is very strange and perhaps a bug that should be reported on its own. I would encourage you to check that you have compatible versions of hdfs3 (ideally the latest) on the client and any workers.
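A quick way to compare the hdfs3 installations, sketched under the assumption that a cluster is already running at a placeholder scheduler address:

import hdfs3
from dask.distributed import Client

print(hdfs3.__version__)            # version installed on the client

def hdfs3_version():
    import hdfs3
    return hdfs3.__version__

# Run the same check on every worker and compare the results.
client = Client('dask-scheduler-host:8786')
print(client.run(hdfs3_version))    # {worker address: version, ...}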
3) fails with a security error - this is rather nebulous and I cannot say more without further information. What security do you have enabled, and what error do you see? It may be that you need to authenticate with Kerberos but have not run kinit.

Related

Is there a way to load the install-interpreter.sh file in EMR in order to load 3rd party interpreters?

I have an Apache Zeppelin notebook running and I'm trying to load the jdbc and/or postgres interpreter to my notebook in order to write to a postgres DB from Zeppelin.
The main resource for loading new interpreters, here, tells me to run the command below to get other interpreters:
./bin/install-interpreter.sh --all
However, when I run this command in EMR terminal, I find that the EMR cluster does not come with an install-interpreter.sh executable file.
What is the recommended path?
1. Should I find the install-interpreter.sh file and load that to the EMR cluster under ./bin/?
2. Is there an EMR configuration on start time that would enable the install-interpreter.sh file?
Currently all tutorials and documentation assume that you can run the install-interpreter.sh file.
The solution is to not run the command below from the root directory (i.e., ./ ):
./bin/install-interpreter.sh --all
Instead, on EMR, run the command above from the Zeppelin installation directory, which on the EMR cluster is /usr/lib/zeppelin.

Starting Hadoop Services using Command Line (CDH 5)

I know how to start services using Cloudera manager interface, but I prefer to know what is really happening behind the scene and not rely on "magic".
I read this page but it does not give the desired information.
I know there are some .sh files to be used, but they seem to vary from version to version, and I'm using the latest as of today (5.3).
I would be grateful to have a list of service-starting commands (specifically for HDFS).
PS: Looks like somehow Cloudera ditched the classic Apache scripts (start-dfs.sh etc.)
You can figure this out by installing Cloudera's optional service packages.
These use the service command to start services instead of Cloudera Manager.
hadoop-hdfs-namenode - for namenode
hadoop-hdfs-secondarynamenode - for secondary namenode
hadoop-hdfs-datanode - for datanode
hadoop-hdfs-journalnode - for journalnode
You can see the CDH5.9 RPMs here:
http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/5.9/RPMS/x86_64/
After you install them, you can look at the respective /etc/init.d/SERVICENAME
to understand how they are run (assuming you're comfortable looking at shell scripts).

H2O: unable to connect to h2o cluster through python

I have a 5-node Hadoop cluster running HDP 2.3.0. I set up an H2O cluster on YARN as described here.
On running following command
hadoop jar h2odriver_hdp2.2.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 512m -nodes 3 -output /user/hdfs/H2OTestClusterOutput
I get the following output:
H2O cluster (3 nodes) is up
(Note: Use the -disown option to exit the driver after cluster formation)
(Press Ctrl-C to kill the cluster)
Blocking until the H2O cluster shuts down...
When I try to execute the command
h2o.init(ip="10.113.57.98", port=54321)
The process remains stuck at this stage. On trying to connect to the web UI at ip:54321, the browser endlessly tries to load the H2O admin page but nothing ever displays.
On forcefully terminating the init process I get the following error:
No instance found at ip and port: 10.113.57.98:54321. Trying to start local jar...
However, if I try to use H2O with Python without setting up an H2O cluster, everything runs fine.
I executed all commands as the root user. The root user has permission to read and write the /user/hdfs HDFS directory.
I'm not sure if this is a permissions error or that the port is not accessible.
Any help would be greatly appreciated.
It looks like you are using H2O2 (H2O Classic). I recommend upgrading your H2O to the latest (H2O 3). There is a build specifically for HDP2.3 here: http://www.h2o.ai/download/h2o/hadoop
Running H2O3 is a little cleaner too:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Also, 512 MB per node is tiny - what is your use case? I would give the nodes some more memory.
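Once the H2O 3 driver reports that the cluster is up, connecting from Python is just a matter of pointing h2o.init() at the address it prints; a small sketch reusing the ip/port from the question (note that by default the Python package version must match the h2o.jar version running on the cluster):

import h2o

# The Python client checks that its version matches the server's
# (strict_version_check defaults to True), so keep the h2o package and
# the h2o.jar from the same H2O 3 download.
print(h2o.__version__)

# ip/port are whatever the hadoop jar driver prints when the cluster is up.
h2o.init(ip="10.113.57.98", port=54321)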

Unable to set up pseudo-distributed Hadoop cluster

I am using CentOS 7. I downloaded and untarred Hadoop 2.4.0 and followed the instructions in the link Hadoop 2.4.0 setup.
Ran the following command.
./hdfs namenode -format
Got this error:
Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode
I see a number of posts with the same error but no accepted answers, and I have tried them all without any luck.
This error can occur if the necessary jar files are not readable by the user running the "./hdfs" command, or are misplaced so that they can't be found by hadoop/libexec/hadoop-config.sh.
Check the permissions on the jar files under hadoop-install/share/hadoop/*:
ls -l share/hadoop/*/*.jar
and if necessary, chmod them as the owner of the respective files to ensure they're readable. Something like chmod 644 should be sufficient to at least check if that fixes the initial problem. For the more permanent fix, you'll likely want to run the hadoop commands as the same user that owns all the files.
I followed the link Setup hadoop 2.4.0
and I was able to get past the error message.
It seems the documentation on the Hadoop site is not complete.

How to disable the Install Pig step in AWS Data Pipeline

I am creating a data pipeline that uses an EMR cluster as a resource.
After the cluster is created and bootstrapped, it automatically executes some steps:
enable debugging,
Install Hive,
Install Pig,
install Task Runner.
Everything is okay, but I want to remove the Install Pig step.
Is there any way to do that?
This answer considers the info about the error in the comments.
It appears you are getting a very old Hadoop version installed. Look at the hadoopVersion field defined on the EmrCluster object; it likely says "0.20". I would remove this field and replace it with amiVersion (http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html). Select a more recent version as listed at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html.
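As an illustration only (not the questioner's actual definition), an EmrCluster object using amiVersion instead of hadoopVersion might look like the following, written as a Python dict and dumped to the JSON that Data Pipeline expects; the id, instance types, and AMI version are example values:

import json

# Hypothetical EmrCluster resource: no "hadoopVersion" field; the Hadoop
# version is implied by "amiVersion" ("3.9.0" is just an example of a
# newer AMI than the old 0.20-era default).
emr_cluster = {
    "id": "EmrClusterForProcessing",
    "type": "EmrCluster",
    "amiVersion": "3.9.0",
    "masterInstanceType": "m3.xlarge",
    "coreInstanceType": "m3.xlarge",
    "coreInstanceCount": "2",
    "terminateAfter": "2 Hours",
}

print(json.dumps({"objects": [emr_cluster]}, indent=2))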
