I need to pass some parameters to my map program. The values for these parameters need to be fetched from a database, and they are dynamic. I know how to pass parameters using the Configuration API. If I write JDBC code in the driver (client) to retrieve these values from the database and then set them on the Configuration, how many times will that code be executed? Will the driver code be distributed to and executed on every data node where the Hadoop framework decides to run the MR program?
What is the best way to do this ?
No, the driver code is not shipped to the data nodes; it runs on the machine that submits the job.
I suggest fetching the data outside the map-reduce program and then passing it in as a parameter.
Say you have a script that launches the job: fetch the data from the database into a variable and then pass that variable to the hadoop job.
I think this will do what you need.
If the data you need is large (more than a few kilobytes), Configuration may not be suitable. A better alternative is to use Sqoop to import that data from the database into HDFS, and then use the Hadoop distributed cache so that your map or reduce code can read the data without any parameter passing.
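For illustration, a minimal driver-side sketch of the distributed-cache part, assuming Sqoop has already imported the table to the HDFS path shown (the path and the rest of the job setup are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetupExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "job-with-cached-lookup");
    job.setJarByClass(CacheSetupExample.class);

    // The "#lookup" fragment makes the cached file appear as a local symlink
    // named "lookup" in every task's working directory, so mappers/reducers
    // can open it in setup() without JDBC calls or extra parameters.
    job.addCacheFile(new URI("/user/me/sqoop_import/lookup_table/part-m-00000#lookup"));

    // ... mapper/reducer/input/output setup as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}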
You can retrieve the values from the DB in the driver code. The driver code executes only once per job.
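To make that concrete, here is a minimal sketch of the pattern; the JDBC URL, query, and property names are made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParamDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // This block runs exactly once, on the machine that submits the job.
    try (Connection con = DriverManager.getConnection(
             "jdbc:mysql://dbhost:3306/jobconfig", "user", "pass");
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT threshold FROM job_params WHERE job_name = 'my_job'")) {
      if (rs.next()) {
        conf.set("my.job.threshold", rs.getString("threshold"));
      }
    }

    Job job = Job.getInstance(conf, "param-example");
    job.setJarByClass(ParamDriver.class);
    // ... mapper/reducer/input/output setup ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Every map task can then read the value from its Configuration, e.g. context.getConfiguration().get("my.job.threshold").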
I have a question regarding an effective way of reading values from a DB and generating a report.
I use Hadoop to read data from multiple tables and do data analysis based on the results.
I want to know whether there is an effective tool or way to read data from multiple tables, check whether the values of certain columns are the same across the tables, and send a report if they are not. I have two options: either I can read the data from Hadoop, or I can connect to the database in DB2 and do it there. Without writing a new Java program, is there a tool that helps with this, like the Talend tool, which reads XML and writes the output to a DB?
You can use Talend for this. With Talend you can read data from Hadoop as well as from a database, perform your operations on the fetched data in between, and generate the report.
If you are working with a lot of data and do this sort of thing often, Elasticsearch is also a great help in this area. Use the ELK stack, although you would not necessarily need the 'L' (Logstash) part of it.
I have a MapReduce program (DataProfiler.jar) that performs some data profiling, taking the table name and column name as command-line parameters:
hadoop jar DataProfiler.jar -D tableName=MyTable -D columnFamilyName=CF1
Is there a way I could wrap this in another Java program, so that I can execute this jar for all the tables (by connecting to the database and getting a list of all the tables and columns)?
Thanks!
Instead of calling the MapReduce jar from a separate Java program, I would suggest writing your MapReduce driver class in a slightly more generic way.
Let's call this class JobRunner. You can define member variables that hold the information about the tables and columns you need to process, then set up the MapReduce configuration and start the job for each of them. Technically you achieve the same thing, but in a slightly different way, and I think it is a better approach than shelling out to a jar to start each MapReduce job.
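A rough sketch of that idea, with the JDBC URL, query, and configuration keys as placeholders (the real job setup would mirror whatever DataProfiler currently does):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobRunner {

  public static void main(String[] args) throws Exception {
    // Fetch the list of tables/column families once, on the client.
    try (Connection con = DriverManager.getConnection("jdbc:db2://dbhost:50000/mydb", "user", "pass");
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT table_name, column_family FROM profiling_targets")) {
      while (rs.next()) {
        runProfilingJob(rs.getString("table_name"), rs.getString("column_family"));
      }
    }
  }

  private static void runProfilingJob(String tableName, String columnFamilyName) throws Exception {
    Configuration conf = new Configuration();
    // Same keys the existing jar expects via -D on the command line.
    conf.set("tableName", tableName);
    conf.set("columnFamilyName", columnFamilyName);

    Job job = Job.getInstance(conf, "profile-" + tableName);
    job.setJarByClass(JobRunner.class);
    // ... mapper/reducer/input/output setup, as in the existing DataProfiler driver ...
    job.waitForCompletion(true);
  }
}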
Scenario: you are writing an MR job that uses mappers to process data and then uses reducers to insert the resulting data directly into an external RDBMS. What must you be sure to do, and why?
Prerequisites:
1. Ensure that the database driver is present on the client machine that submits the job.
2. Disable speculative execution for the data-insert job.
1) If you forget to disable speculative execution, multiple instances of a given reducer could run, which would insert extra (duplicate) data into the RDBMS.
2) The database driver is only needed on the client machine if you plan to connect to the RDBMS from that client; otherwise it is not needed there (the reducers that actually do the inserts need the driver available on the cluster nodes).
So the thing you must be sure to do is disable speculative execution.
I came up with this solution; can anybody improve this answer or correct me if there are any issues? Thank you.
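For reference, a minimal driver-side sketch of both points, with the JDBC URL, driver class, and table/field names made up for illustration (the key type written by the reducer would have to implement DBWritable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class RdbmsExportDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The JDBC driver class must also be available to the reduce tasks on the cluster.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost:3306/reports", "user", "pass");

    Job job = Job.getInstance(conf, "export-to-rdbms");
    job.setJarByClass(RdbmsExportDriver.class);

    // Prevent duplicate inserts from speculative reducer attempts.
    job.setReduceSpeculativeExecution(false);

    job.setOutputFormatClass(DBOutputFormat.class);
    DBOutputFormat.setOutput(job, "results_table", "id", "value");

    // ... mapper/reducer/input setup goes here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}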
I have a Hadoop MapReduce program that takes a text file as input. The metadata about this file is stored in an Oracle database. In the mapper I need this information, i.e. the metadata from the Oracle table.
What's the best practice to get this?
Solution 1:
In the MapReduce driver class, get the details using JDBC, store the information in the distributed cache, and access it from the mapper's setup method.
My thoughts: Any other quick solutions?
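For reference, a rough mapper-side sketch of Solution 1; the cache file name, the properties format, and the key names are just placeholders. The driver would write the metadata to a small HDFS file and register it with job.addCacheFile(new URI("/user/me/file_metadata.properties#metadata")), and the mapper would read it once in setup:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MetadataAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Properties metadata = new Properties();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // "metadata" is the symlink created from the URI fragment above; it points
    // at the cached file in this task's local working directory.
    try (FileInputStream in = new FileInputStream("metadata")) {
      metadata.load(in);
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Use the metadata while parsing each line, e.g. the file's declared delimiter.
    String delimiter = metadata.getProperty("delimiter", "\t");
    String[] fields = value.toString().split(delimiter, -1);
    context.write(new Text(fields[0]), value);
  }
}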
Solution 2:
Access the metadata directly from the mapper's setup method (i.e. query the Oracle DB there).
My thoughts: No, I don't want to do this. The number of DB hits would be far too high; bad practice.
Any other smart solutions???
I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
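In case it helps others, the wiring roughly follows the pattern in that test; the aliases, database/table names, and key/value classes below are placeholders, and the method names should be double-checked against the HCatalog version you are on:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat;
import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat.JobConfigurer;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

public class MultiSchemaSetup {
  // Driver side: register one alias per output table/schema.
  public static Job configureJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "multi-schema-reduce");
    job.setOutputFormatClass(MultiOutputFormat.class);

    JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);
    configurer.addOutputFormat("typeA", HCatOutputFormat.class, NullWritable.class, DefaultHCatRecord.class);
    configurer.addOutputFormat("typeB", HCatOutputFormat.class, NullWritable.class, DefaultHCatRecord.class);
    HCatOutputFormat.setOutput(configurer.getJob("typeA"), OutputJobInfo.create("mydb", "table_a", null));
    HCatOutputFormat.setOutput(configurer.getJob("typeB"), OutputJobInfo.create("mydb", "table_b", null));
    // HCatOutputFormat.setSchema(...) would also be called per alias here.
    configurer.configure();
    return job;
  }
}

// In the reducer, each record type is then routed to its alias, e.g.:
//   MultiOutputFormat.write("typeA", NullWritable.get(), recordA, context);
//   MultiOutputFormat.write("typeB", NullWritable.get(), recordB, context);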
Andrew