How can I run a MapReduce program in Java with TestNG? - hadoop

I have a MapReduce program based on Hadoop. I know how to run it with the ${hadoop_home}/bin/hadoop jar command.
However, I want to run this program with TestNG. What do I need to do to start a TestNG test case against Hadoop?

If you're looking for a unit testing framework, look at Apache MRUnit. You can wrap the driver code in either JUnit or, I'm assuming, TestNG (disclaimer - I've only ever used JUnit).
This allows you to push values through a mapper or reducer and assert that particular output / counters came out.
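For example, here is a minimal sketch (not from the original answer) of an MRUnit map-side test written as a TestNG test. WordCountMapper is a hypothetical mapper under test; the driver shown is MRUnit's new-API MapDriver.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.testng.annotations.Test;

public class WordCountMapperTest {
    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        // WordCountMapper is a hypothetical Mapper<LongWritable, Text, Text, IntWritable>
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new WordCountMapper());
        driver.withInput(new LongWritable(0), new Text("cat dog"))
              .withOutput(new Text("cat"), new IntWritable(1))
              .withOutput(new Text("dog"), new IntWritable(1))
              .runTest();
    }
}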

Finally I figured it out; the key point is the classpath setting.
Besides hadoop-core-.jar, the runtime classpath must contain:
- all jar files under ${hadoop_home}/share/hadoop/lib/
- the jar files under ${hadoop_home}/share/hadoop/lib/jsp-/
- any other jar files the program code depends on
- ${hadoop_home}/conf and ${hadoop_home}/share/hadoop
- the MR program itself, compiled and packaged as a jar
Once this is set up, TestNG test cases that invoke the MR method can be run in the regular way.
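As a concrete illustration, a TestNG test along these lines can drive the whole job once the classpath above is in place (a sketch only; MyJobDriver is a hypothetical Tool implementation and the input/output paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.testng.Assert;
import org.testng.annotations.Test;

public class MyJobDriverTest {
    @Test
    public void runsJobEndToEnd() throws Exception {
        String[] args = { "/tmp/mr-test/input", "/tmp/mr-test/output" };
        // MyJobDriver is assumed to implement org.apache.hadoop.util.Tool
        int exitCode = ToolRunner.run(new Configuration(), new MyJobDriver(), args);
        Assert.assertEquals(exitCode, 0);
    }
}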

Related

How to run tests from a jar file on a remote machine?

I have a Spring Boot project where I'm using Gradle for dependency management. I copy the project jar to a remote machine, where I want to run the unit tests using my jar file. I'm using JUnit for running the unit tests. How do I run the unit tests from my jar file on the remote machine?
There is something unclear in your understanding of JARs and unit testing.
Unit tests are there to help you build your application properly: you run tests against your classes. To relate this to a JAR file, your unit tests exist to make sure you build a JAR that "works" (i.e. passes your tests).
There is therefore no reason to try to execute your tests on the remote machine. Besides, if you open your JAR file (which is just a ZIP), you will see that your test classes are not inside. That is because, in a JAR, you only want the classes that will be used in production.
Instead, ask yourself this:
- What are you trying to achieve by running unit tests on the remote machine?
- Isn't this more like an integration or end-to-end test? Basically, what you want is to deploy the JAR on the server and make sure it still works.
If you really want to run your tests on the remote machine, check the link in your comment: How to run JUnit test cases from the command line. I wouldn't recommend it because, as I said, there is little chance (depending on your configuration, of course) that your test classes are embedded in your jar.
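If you do decide to bundle the tests anyway, one possibility (a sketch only, assuming JUnit 4 and that the test classes really are packaged into the jar, which Maven and Gradle do not do by default; MyServiceTest is a hypothetical test class) is to launch them programmatically from a main method:
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

public class RemoteTestRunner {
    public static void main(String[] args) {
        // Run the bundled test class and report the outcome
        Result result = JUnitCore.runClasses(MyServiceTest.class);
        for (Failure failure : result.getFailures()) {
            System.out.println(failure.toString());
        }
        System.out.println("All tests passed: " + result.wasSuccessful());
    }
}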
Hope this helps your thinking. Don't hesitate to update your post to add more information.

Correct way of submitting a job to a YARN cluster when the job has dependencies on external jars?

I am trying to understand the correct way of submitting an MR (or, for that matter, a Spark-based Java) job to a YARN cluster.
Consider the situation below:
A developer writes the (MR or Spark) job code on a client machine, and the code uses third-party jars. When the developer has to submit the job to the YARN cluster, what is the correct way to submit it so that there is no ClassNotFoundException at runtime? Since the job is submitted as a jar file, how can the developer "put" the third-party jars in?
I am having difficulty understanding this; can anyone help me?
You simply have to build a "fat jar" with Gradle or Maven that contains not only your compiled code but also all transitive dependencies.
You can use either the Maven Assembly Plugin or any of the Gradle plugins like the Shadow Plugin.
The output of these is what you should supply to spark-submit.
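For example (the names here are made up for illustration), if the build produces a shaded jar called myjob-1.0-all.jar whose main class is com.example.MyJob, the submission looks roughly like
spark-submit --class com.example.MyJob --master yarn myjob-1.0-all.jar <job args>
or, for a plain MapReduce job,
hadoop jar myjob-1.0-all.jar com.example.MyJob <job args>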

Maven Test Automation Project

I have a Maven test automation project developed using Selenium and TestNG. This is what I am doing with my framework:
1. I have a main class in src/main/java, and within the main class I trigger methods to dynamically create and run the TestNG XML.
2. The tests to be run are determined from the XMLFlag.xls sheet, and the test data for the tests is set in the TestNG.xlsx sheet in src/main/resources.
3. I have successfully created a jar of my entire package.
Since I have put my test data sheets (TestNG.xlsx and XMLFlag.xls) under src/main/resources, these sheets get packaged within the jar.
Ideally, however, I would like to run my test scripts against different sets of test data and see whether the scripts succeed.
For example, I would like to run my scripts with, say, username abcd and password 1234 the first time, and then run the same set of scripts with username efgh and password 9876.
But with the test data sheets packaged inside the jar I cannot achieve this, because I cannot edit them.
So, say that to overcome the above problem:
1. I put my test data sheets in src/test/resources instead of src/main/resources and then create a jar.
But when I do this and try to run the jar, I get an error message:
.\src\test\resources\XMLFlag.xls (The system cannot find the path specified)
This, I believe, is because the test data sheets get placed in the test-classes folder under the target folder and not within the jar.
To put it in simple words:
i. I want the test data sheets to be outside my jar, so that they can be edited and the test scripts run according to the user's requirements.
ii. If the test data sheets are outside my jar while everything in my framework that depends on the test data (i.e. the test scripts and testng.xml) is packaged within the jar, my jar will not run.
Is there a way to avoid this error and accomplish what I want to do? Or should I restructure my entire framework?
Kindly help me out.
How about passing the Test Data sheets as program arguments when you're executing your jar?
That is,
java -jar c:/path/to/your/jar c:/path/to/your/testng.xlsx c:/path/to/your/xmlflag.xlsx
and then, in your main method, read the file names and their contents, for example
File testNgXlsxFile = new File(args[0]);
and pass the information from the files to your framework.
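Fleshing that out a little, a main method along these lines (a rough sketch; FrameworkRunner.runSuite is a hypothetical entry point into your framework) keeps the workbooks outside the jar so they can be edited between runs:
import java.io.File;

public class TestLauncher {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java -jar framework.jar <TestNG.xlsx> <XMLFlag.xls>");
            System.exit(1);
        }
        File testNgXlsx = new File(args[0]); // external test data sheet
        File xmlFlagXls = new File(args[1]); // external sheet that decides which tests run
        // Hand the external files to the framework instead of loading them
        // from src/main/resources inside the jar (hypothetical method).
        FrameworkRunner.runSuite(testNgXlsx, xmlFlagXls);
    }
}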
Is that possible?

Hadoop - submit a job with lots of dependencies (jar files)

I want to write some sort of "bootstrap" class which will watch MQ for incoming messages and submit map/reduce jobs to Hadoop. These jobs use some external libraries heavily. For the moment I have an implementation of these jobs, packaged as a ZIP file with bin, lib, and log folders (I'm using maven-assembly-plugin to tie things together).
Now I want to provide small wrappers for Mapper and Reducer, which will use parts of the existing application.
As far as I have learned, when a job is submitted, Hadoop finds the JAR file that contains the mapper/reducer classes and copies that jar over the network to the data nodes that will process the data. But it's not clear how I tell Hadoop to copy all the dependencies as well.
I could use maven-shade-plugin to create an uber-jar with the job and its dependencies, and another jar for the bootstrap (which would be executed with the hadoop shell script).
Please advise.
One way could be to put the required jars in the distributed cache. An alternative would be to install all the required jars on the Hadoop nodes and tell the TaskTrackers about their location. I would suggest you go through this post once; it talks about the same issue.
Use Maven to manage the dependencies and ensure the correct versions are used during builds and deployment. Popular IDEs have Maven support, so you don't have to worry about building classpaths for editing and building. Finally, you can instruct Maven to build a single jar (a "jar-with-dependencies") containing your app and all its dependencies, making deployment very easy.
As for dependencies such as Hadoop that are guaranteed to be on the runtime classpath, you can declare them with the "provided" scope so they're not included in the uber-jar.
Use the -libjars option of the hadoop launcher script to specify dependencies for jobs running on remote JVMs;
use the $HADOOP_CLASSPATH variable to set dependencies for the JobClient running on the local JVM.
A detailed discussion is here: http://grepalex.com/2013/02/25/hadoop-libjars/
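Roughly, that combination looks like this (the jar names are placeholders; -libjars is only honoured if the driver goes through ToolRunner/GenericOptionsParser):
export HADOOP_CLASSPATH=lib/dep-a.jar:lib/dep-b.jar
hadoop jar myjob.jar com.example.MyJobDriver -libjars lib/dep-a.jar,lib/dep-b.jar /input /output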

Hadoop Distributed Cache doesn't work

I'm new to Hadoop. I'm using Hadoop 0.22.
In the driver I used this code:
Job job = Job.getInstance(configuration);
...
// JAR_DIR points at the archive to be added to the task classpath
job.addArchiveToClassPath(new Path(JAR_DIR));
...
In the Map class, what code do I have to use to add the jar to the local classpath?
More details
I have a job that needs htmlunit.jar in the map and reduce phases. I add this jar to the classpath with the code above, but when I submit the job I get a ClassNotFoundException at the line where I use the htmlunit references. If the code above is OK, and the DistributedCache automatically adds the jar to the TaskTrackers' classpath, what could be the problem?
Do I also have to use the -libjars htmlunit.jar option when I submit the job? Do I have to use another Hadoop component?
You don't need to do anything.
Once you add a jar to the job classpath, what you're saying is
"include this in the classpath of the map and reduce tasks".
As long as your mappers and reducers extend the Mapper and Reducer base classes, it will 'just work'.
Worth noting: you should probably use addFileToClassPath instead, for each individual jar you need.
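For instance, in the driver (a sketch; the HDFS paths are hypothetical and the jars are assumed to have been copied to HDFS beforehand, e.g. with hadoop fs -put):
// Each call adds one jar from HDFS to the task classpath via the distributed cache
job.addFileToClassPath(new Path("/apps/lib/htmlunit.jar"));
job.addFileToClassPath(new Path("/apps/lib/commons-lang.jar"));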
An alternative (we do this) is to create one single jar containing your source and your dependencies.
Build your code jar as usual, then create a subdirectory in the jar called 'lib' and add all of your dependency jars in there. Your entire job is then self-contained and you don't need to worry about adding other jars to the distributed cache.
So for example you'd have a jar with the following contents:
/com/example/Something.class
/com/example/SomethingElse.class
/lib/dependency.jar
/lib/dependency2.jar
(a jar is just a zip file, so you can use regular zip creation utilities to build it)
For various reasons this also performs better than adding the .class files of your dependencies to the jar directly.
