Hadoop benchmark tests - hadoop

I am currently running some tests on a small cluster (4 nodes), and I would like to know whether there are any best practices regarding the node that actually launches the test scripts in the cluster. I am using test tools like hive-testbench and HiBench.
Is there any performance impact if I run a Spark script on the same machine that hosts the Thrift server, or should I add another machine to the cluster just to launch the test/benchmark scripts?
Thanks in advance.

Related

Jenkins as JobServer on Hadoop EdgeNode

I'm not sure whether anyone can help me, but I'll give it a try.
I'm running Jenkins on an OpenShift cluster, using it for deployment and as a job server for running ETL jobs. These jobs transfer data from flat files to databases and from database to database.
Now I need to extend the system to transfer data to a Hadoop cluster using MapR.
What I would like to know is how I can use a new Jenkins slave as a job server on an edge node of the Hadoop cluster using MapR. Do I need Jenkins on the edge node, or can I use MapR from my existing Jenkins job server?
Maybe someone can help me or has some information or links on how to solve this.
Thanks to all.
"Use MapR" isn't quite clear to me because I just view it as Hadoop at the end of the day, but you can effectively make your Jenkins slave an "edge node" by installing only the Hadoop Java (maybe also MapR) client utilities plus any XML configuration files from the other edge nodes that define how to communicate with the cluster.
Then Jenkins would be able to run sh("hadoop jar app.jar"), for example.
If you're using OpenShift, you might also try putting a Hadoop client inside a Docker image that could run in Jenkins, or anywhere else.
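As a minimal sketch of the shell step such a Jenkins agent could run once the client is installed: the default config path, the two XML file names (the standard Apache Hadoop names; a MapR client may lay things out differently), and the has_client_config helper are assumptions for illustration. Only the hadoop jar app.jar call comes from the answer above.

```shell
#!/bin/sh
# Assumed location of the XML files copied from an existing edge node.
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"

# Hypothetical helper: succeeds only if the directory holds the client configs.
has_client_config() {
    [ -f "$1/core-site.xml" ] && [ -f "$1/hdfs-site.xml" ]
}

if command -v hadoop >/dev/null 2>&1 && has_client_config "$HADOOP_CONF_DIR"; then
    export HADOOP_CONF_DIR      # the hadoop launcher script reads this variable
    hadoop jar app.jar
else
    echo "Hadoop client or its configuration is missing on this agent" >&2
fi
```

Checking for the client before submitting makes a misconfigured agent fail with a readable message instead of an obscure connection error.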

Jmeter master slave in AWS on demand

I was hoping to get some help/suggestions regarding my JMeter Master/slave test set up.
Here is my scenario:
I need to do load testing using a JMeter master/slave setup. I am planning to launch the master and slave nodes on AWS (Windows boxes, because one of the tools I launch via JMeter depends on Windows). I want to launch this master/slave setup on demand, specifying how many slave nodes I want. I have looked at a lot of blog posts about using JMeter with AWS, and they all assume the nodes are launched manually and then need further configuration so that the master and slave nodes can talk to each other. For tests with 5 or 10 slave nodes this would be fine, but I want to launch 50 instances (the tool I use with JMeter has a limitation that forces me to treat each slave node as one user, instead of having one slave node act as multiple users), and manually updating each slave node would be very cumbersome. Has anybody else run into this issue, and do you have any suggestions? In the meantime I am looking into other solutions that would let one slave node mimic multiple users, which would reduce the number of slave nodes I need to launch.
Regards,
Vikas
Have you seen JMeter ec2 Script? It seems to be something you're looking for.
If for any reason you don't want to use this particular script, be aware that Amazon has an API, so you should be able to automate instance creation with a script that uses the AWS Java SDK or the AWS CLI.
You can even automate instance creation from a separate JMeter script, using either the JSR223 Sampler or the OS Process Sampler.
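A rough sketch of the AWS CLI route, under stated assumptions: the AMI ID, instance type, and the Role=jmeter-slave tag are placeholders rather than values from this thread, and the image is assumed to start jmeter-server on boot. The script only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Launch N identical slave instances in one call (placeholder AMI and instance type).
launch_slaves() {
    run aws ec2 run-instances \
        --image-id "${AMI_ID:-ami-PLACEHOLDER}" \
        --instance-type "${INSTANCE_TYPE:-t3.medium}" \
        --count "$1" \
        --tag-specifications 'ResourceType=instance,Tags=[{Key=Role,Value=jmeter-slave}]'
}

# Turn whitespace-separated IPs (as printed with --output text) into the
# comma-separated list that JMeter's -R option expects.
to_remote_hosts() {
    awk '{ for (i = 1; i <= NF; i++) { printf "%s%s", sep, $i; sep = "," } } END { print "" }'
}

launch_slaves 50
```

The slave addresses could then be collected with aws ec2 describe-instances (filtering on the same tag, with --query 'Reservations[].Instances[].PrivateIpAddress' --output text), piped through to_remote_hosts, and passed to the master as jmeter -n -t plan.jmx -R "$hosts", so no slave has to be configured by hand.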

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input files can be big, it's not enough to run my application standalone. With one more physical machine, how should I design the architecture?
I'm considering using Mesos as the cluster manager, but I'm pretty new to HDFS. Is there any way to do this without HDFS (for sharing file data)?
Spark supports a few cluster modes: YARN, Mesos, and Standalone. You may start with Standalone mode, which means you work on your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-In order to validate that the worker was added to the cluster, you may refer to the following URL: http://localhost:8080 on your master machine and get Spark UI that shows more info about the cluster and its workers.
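The steps above can be sketched as one script. The hostname, class name, and jar are placeholders; 7077 is the default master port. The script only prints the commands unless DRY_RUN=0 is set, and recent Spark releases name the worker script start-worker.sh instead of start-slave.sh.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

MASTER_HOST="${MASTER_HOST:-master-host}"    # placeholder hostname
MASTER_URL="spark://$MASTER_HOST:7077"       # 7077 is the default master port

# On the master machine:
run ./sbin/start-master.sh

# On each worker machine, pointing at the master's URL:
run ./sbin/start-slave.sh "$MASTER_URL"

# Submit an application to the standalone cluster (class and jar are placeholders):
run ./bin/spark-submit --master "$MASTER_URL" --class com.example.Analyzer app.jar
```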
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my Spark cluster in standalone mode. I am reading data from flat files or Cassandra (depending on the job) and writing the processed data back to Cassandra.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, would it give me additional performance advantages, such as shorter execution time and better resource management?
Currently, when I process a huge chunk of data, stages sometimes fail during shuffling. If I migrate to YARN, can the resource manager address this issue?
Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode, all your job-related files are copied to one of the machines in the cluster, which then submits the job on your behalf. If you submit the application in client mode, the machine from which the job is submitted takes care of the driver-related activities. This means that in client mode the submitting machine cannot go offline, whereas in cluster mode it can.
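The client/cluster distinction described above comes down to the --deploy-mode flag of spark-submit. A small sketch that prints both variants; the master URL, class name, and jar are placeholders, and submit_cmd is a hypothetical helper:

```shell
#!/bin/sh
# Print the spark-submit command line for a given deploy mode ("client" or "cluster").
# The master URL, class name, and jar are placeholders.
submit_cmd() {
    case "$1" in
        client|cluster) ;;
        *) echo "unknown deploy mode: $1" >&2; return 1 ;;
    esac
    echo "spark-submit --master ${MASTER_URL:-spark://master-host:7077}" \
         "--deploy-mode $1 --class com.example.Job app.jar"
}

submit_cmd client    # driver runs on the submitting machine, which must stay online
submit_cmd cluster   # driver runs inside the cluster; the submitter may disconnect
```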
Having a Cassandra cluster would not change any of this behavior, except that it can save you network traffic if the Spark executor can use the nearest contact point (similar to data locality).
Failed stages get rescheduled whichever cluster manager you use.
I was wondering: if I switch to Hadoop and start using a resource manager like YARN or Mesos, would it give me additional performance advantages, such as shorter execution time and better resource management?
In standalone cluster mode, each application by default uses all the available cores in the cluster.
From the spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster), you may prefer YARN.
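If you stay on standalone mode, the per-application cap mentioned in the quoted documentation can be set with the spark.cores.max property. A minimal sketch that prints such a submit command; the master URL, class name, jar, and the capped_submit_cmd helper are placeholders for illustration:

```shell
#!/bin/sh
# Print a spark-submit command that caps the cores one application may take.
# spark.cores.max applies to standalone mode; master URL, class, and jar are placeholders.
capped_submit_cmd() {
    echo "spark-submit --master ${MASTER_URL:-spark://master-host:7077}" \
         "--conf spark.cores.max=$1 --class com.example.Job app.jar"
}

capped_submit_cmd 4
```

With a cap in place, a second application can obtain cores while the first is still running, instead of waiting behind it in the FIFO queue.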
Currently, when I process a huge chunk of data, stages sometimes fail during shuffling. If I migrate to YARN, can the resource manager address this issue?
I'm not sure, since your application logic is not known, but you can give YARN a try.
Have a look at this related SE question on the benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

Hadoop YARN - performance of LocalJobRunner vs. cluster deployed job

I'm doing some tests with M/R jobs on a two-node Hadoop 2.2.0 cluster. One thing I would like to understand is the performance difference between running a job in local mode (not managed by the ResourceManager) and running it on YARN. My tests show that the job runs much faster when executed via LocalJobRunner than when managed by YARN. When setting up the cluster I followed the steps described here http://raseshmori.wordpress.com/2012/10/14/install-hadoop-nextgen-yarn-multi-node-cluster/ ; perhaps there is some configuration the guide forgot to mention?
Thanks!
You'd run LocalJobRunner for tests and small examples. You'd use the cluster when you need to process amounts of data that would justify using Hadoop in the first place (a.k.a. "Big data").
When you run a small example, the overhead of running things distributed overwhelms the benefits of parallelization.
Arnon is right. In one of my use cases, running with LocalJobRunner was much faster than running on YARN. LocalJobRunner runs the map tasks in-process on the local machine. The job is not submitted to the cluster, so map tasks are not scheduled across multiple machines. LocalJobRunner should therefore be used only for unit testing the code. For all other practical purposes, use YARN.
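Which runner a job uses is controlled by the mapreduce.framework.name property (local for LocalJobRunner, yarn for the cluster). A sketch that prints both invocations; the jar, driver class, and paths are placeholders, and passing the property with -D only takes effect if the driver parses generic options (for example via ToolRunner). Otherwise it would be set in mapred-site.xml.

```shell
#!/bin/sh
# Print a hadoop jar command that selects the MapReduce runtime.
# "local" uses LocalJobRunner; "yarn" submits to the ResourceManager.
# The jar, driver class, and paths are placeholders.
job_cmd() {
    case "$1" in
        local|yarn) ;;
        *) echo "unknown framework: $1" >&2; return 1 ;;
    esac
    echo "hadoop jar app.jar com.example.Driver" \
         "-D mapreduce.framework.name=$1 /input /output"
}

job_cmd local   # in-process run, handy for unit tests and small samples
job_cmd yarn    # distributed run, worth the overhead only for large inputs
```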

Resources