How can I execute a MapReduce jar programmatically from another Java program? - hadoop

I have a MapReduce program (DataProfiler.jar) which performs some data profiling by taking the table name and column family name as command-line parameters.
hadoop jar DataProfiler.jar -D tableName=MyTable -D columnFamilyName=CF1
Is there a way I could wrap this in another Java program, so that I can execute the jar for all the tables (by connecting to the database and getting a list of all the tables and columns)?
Thanks!

Instead of calling the MapReduce jar from a separate Java program, I would suggest writing the MapReduce driver class in a slightly more generic way.
Let's call this class JobRunner. You can define member variables that hold the information about the tables and columns you need to process, then set up the MapReduce configuration and start the job. Technically you achieve the same thing, just in a slightly different way, and I think it is a better approach than shelling out to a jar to start the MapReduce job.
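A minimal sketch of such a JobRunner, reusing the parameter names from the question (tableName, columnFamilyName); the JDBC URL, credentials, metadata query, and the mapper/reducer wiring are placeholders you would replace with whatever is already inside DataProfiler.jar:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobRunner {

    public static void main(String[] args) throws Exception {
        // Runs once, on the client: fetch the list of tables/column families to profile.
        // The JDBC URL, credentials and query are placeholders.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://dbhost/metadata", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT table_name, cf_name FROM profiling_targets")) {
            while (rs.next()) {
                runProfilingJob(rs.getString("table_name"), rs.getString("cf_name"));
            }
        }
    }

    private static void runProfilingJob(String table, String columnFamily) throws Exception {
        Configuration conf = new Configuration();
        // Same effect as "-D tableName=... -D columnFamilyName=..." on the command line.
        conf.set("tableName", table);
        conf.set("columnFamilyName", columnFamily);

        Job job = Job.getInstance(conf, "profile-" + table + "-" + columnFamily);
        job.setJarByClass(JobRunner.class);
        // Wire in the mapper/reducer, input/output formats and paths that
        // DataProfiler.jar already uses, e.g.:
        //   job.setMapperClass(...); job.setReducerClass(...);
        //   job.setOutputKeyClass(...); job.setOutputValueClass(...);

        job.waitForCompletion(true);   // submit and wait; jobs run one after another
    }
}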

Related

Passing Parameters to MapReduce Program

I need to pass some parameters to my map program. The values for these parameters need to be fetched from a database, and they are dynamic. I know how to pass the parameters using the Configuration API. If I write JDBC code to retrieve these values from the database in the driver or client and then set them via the Configuration API, how many times will this code be executed? Will the driver code be distributed and executed on each data node where the Hadoop framework decides to run the MR program?
What is the best way to do this?
Yes, the driver code will be executed on each machine.
I suggest fetching the data outside the MapReduce program and then passing it in as a parameter.
Say you have a script that launches the job: fetch the data from the database into a variable, then pass that variable to the Hadoop job.
I think this will do what you need.
If the data you need is big (more than a few kilobytes), Configuration may not be suitable. A better alternative is to use Sqoop to fetch that data from the database into HDFS, and then use the Hadoop distributed cache so that your map or reduce code can read it directly, without passing any parameters.
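A small sketch of the distributed-cache half of that suggestion, assuming Sqoop (or anything else) has already landed the data at a hypothetical HDFS path /lookup/params.txt, using the Hadoop 2.x Job API:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileExample {

    public static class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // The "#params" fragment used below makes the cached file show up
            // in the task's working directory under that name.
            try (BufferedReader in = new BufferedReader(new FileReader("params"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // ... load each lookup record into a field/map for use in map() ...
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-file-example");
        job.setJarByClass(CacheFileExample.class);
        job.setMapperClass(CacheAwareMapper.class);
        // File written to HDFS beforehand (e.g. by Sqoop); the path is a placeholder.
        job.addCacheFile(new URI("/lookup/params.txt#params"));
        // ... input/output formats and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}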
You can retrieve the values from the DB in the driver code. The driver code executes only once per job.
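A minimal sketch of that approach; the JDBC URL, query, and property name are illustrative placeholders:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParameterPassingExample {

    public static class ParamAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String threshold;

        @Override
        protected void setup(Context context) {
            // Runs in every mapper task; it only reads what the driver stored.
            threshold = context.getConfiguration().get("app.threshold", "0");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... use 'threshold' while processing each record ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Executed exactly once per job submission, on the client, not on the data nodes.
        try (Connection con = DriverManager.getConnection("jdbc:mysql://dbhost/config", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT value FROM params WHERE name = 'threshold'")) {
            if (rs.next()) {
                conf.set("app.threshold", rs.getString(1));
            }
        }

        Job job = Job.getInstance(conf, "param-passing-example");
        job.setJarByClass(ParameterPassingExample.class);
        job.setMapperClass(ParamAwareMapper.class);
        // ... input/output formats and paths as in your existing job ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}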

Mapreduce Job to implement a HiveQL statement

I have a question: how do I write a MapReduce job that implements a HiveQL statement? For example, we have a table with columns color, width, and some others. In Hive, if I want to select color I can run select color from tablename;. What is the equivalent code for getting color in MapReduce?
You can use the Thrift server and connect to Hive through JDBC. All you need is to include the hive-jdbc jar in your classpath.
However, is this advisable? That I am not really sure about. It is a very bad design pattern if you do it in the mapper, because the number of mappers is determined by the data size.
The same thing can often be achieved by feeding multiple inputs into the MR job.
But then I do not know that much about your use case, so Thrift may be the way to go.
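A minimal sketch of the JDBC route, assuming a HiveServer2 instance at localhost:10000 and the hive-jdbc jar (plus its dependencies) on the classpath; host, port, and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSelect {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT color FROM tablename")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));   // one color value per row
            }
        }
    }
}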
For converting Hive queries to MapReduce jobs, YSmart is the best option:
http://ysmart.cse.ohio-state.edu/
YSmart can either be downloaded or used online.
Check the companion code Chapter 5 - Join Patterns from the MapReduce Design Patterns book. In the join pattern the fields are extracted in the mapper and emitted.
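If you want the plain-MapReduce equivalent of select color from tablename, a map-only job that projects the column is enough. A sketch, assuming the table lives in HDFS as comma-separated text with color as the first field (the field index and delimiter are assumptions):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ColorProjectionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text color = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length > 0) {
            color.set(fields[0]);                       // the 'color' column
            context.write(color, NullWritable.get());   // emit just the projection
        }
    }
}

In the driver, set this class as the mapper and call job.setNumReduceTasks(0) so the job stays map-only, like a simple SELECT.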

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
Andrew

Pig HBaseStorage customization

How can I customize HBaseStorage for a Pig script? I want to perform some business logic on the data before it is loaded into the Pig script. It would be something like a custom storage on top of HBaseStorage.
For example, my row key has a structure like A_B_C. Currently I pass the A_B_C key to HBaseStorage in my Pig script, but I want to apply some logic, such as filtering against a key like A_B_C_D, before serving the input data to the actual Pig script. How is this possible?
You may have to end up looking at the HBaseStorage Java class and implementing your own class based on it. Depending on how HBaseStorage and its associated classes have been written, this could vary from easy (just extend HBaseStorage itself and override where necessary) to a real headache.
You then have to ensure that the jar containing your code is on the Pig classpath.
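A hedged sketch of the "extend and override" route: the class name, key prefix, and the choice to filter in getNext() are assumptions, and you should check the HBaseStorage constructor signatures in your Pig version before relying on this.

import java.io.IOException;

import org.apache.pig.backend.hadoop.hbase.HBaseStorage;
import org.apache.pig.data.Tuple;

public class FilteringHBaseStorage extends HBaseStorage {

    private final String keyPrefix;

    // "-loadKey true" asks HBaseStorage to put the row key in the first field.
    public FilteringHBaseStorage(String columnList, String keyPrefix) throws Exception {
        super(columnList, "-loadKey true");
        this.keyPrefix = keyPrefix;   // e.g. "A_B_C_D"
    }

    @Override
    public Tuple getNext() throws IOException {
        Tuple t;
        // Drop tuples whose row key does not start with the configured prefix,
        // so the Pig script only ever sees pre-filtered rows.
        while ((t = super.getNext()) != null) {
            Object key = t.get(0);
            if (key != null && key.toString().startsWith(keyPrefix)) {
                return t;
            }
        }
        return null;   // end of input
    }
}

In the Pig script you would then REGISTER the jar and load with something like: raw = LOAD 'hbase://MyTable' USING FilteringHBaseStorage('cf1:*', 'A_B_C_D');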
I find HBaseStorage to be a real pain, so I write regular Java MR jobs to query HBase and create custom sequence files, which I then use from Pig with a simple custom loader. I find this saves a ton of time, since the sequence file can be reused many times throughout the day to get quick results, rather than scanning everything in HBase for every Pig script.

replace text in input file with hadoop MR

I am a newbie on the MR and Hadoop front.
I wrote an MR job for finding missing values in a CSV file and it is working fine.
Now I have a use case where I need to parse a CSV file and recode it into the corresponding category.
Example: "11,abc,xyz,51,61,78","11,adc,ryz,41,71,38", ...
now has to be replaced with "1,abc,xyz,5,6,7","1,adc,ryz,4,7,3", ...
Here I am doing a mod of 10, but there will be different cases of mods.
The data size is in GBs.
I want to know how to replace the content of the input file in place. Is this achievable with MR?
Basically, I have not seen any Hadoop examples anywhere that are based on file handling or rewriting.
At this point I do not want to go to HBase or other DB tools.
You cannot replace data in place, since HDFS files are append-only and cannot be edited.
I think the simplest way to achieve your goal is to register your data in Hive as an external table and write your transformation in HQL.
Hive is a system that sits alongside Hadoop and translates your queries into MR jobs.
Using it is not as serious an infrastructure decision as adopting HBase.
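If you would rather stay with plain MapReduce than go through Hive, the usual pattern is a map-only job that writes the transformed records to a new output directory and swaps it in afterwards. A sketch, assuming comma-separated fields; the question says "mod of 10" but the sample values match integer division by 10, so that is what this uses (adjust the recoding rule as needed):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecodeMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text out = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        StringBuilder recoded = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            String f = fields[i].trim();
            if (i > 0) recoded.append(',');
            if (f.matches("\\d+")) {
                recoded.append(Integer.parseInt(f) / 10);   // e.g. 51 -> 5, 78 -> 7
            } else {
                recoded.append(f);                          // leave text columns alone
            }
        }
        out.set(recoded.toString());
        context.write(out, NullWritable.get());             // goes to the new output path
    }
}

In the driver, call job.setNumReduceTasks(0) and point FileOutputFormat at a fresh directory; once the job succeeds you can replace or delete the original input.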
