I have a Hadoop MapReduce program that takes a text file as input. The metadata about this file is stored in an Oracle database. In the mapper I need this metadata from the Oracle table.
What's the best practice to get this?
Solution 1:
In the MapReduce driver class, fetch the details over JDBC. Store the information in the distributed cache. In the mapper, read it in the setup method.
My thoughts: Any other quick solutions?
Solution 2:
Query the database for the metadata directly from the mapper's setup method.
My thoughts: No, I don't want to do this. Every mapper would hit the database, so the number of DB hits would be very high. Bad coding.
Any other smart solutions?
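For illustration, Solution 1 can be sketched with the plain JDK only. This is a minimal model, not Hadoop code: the class name `MetadataSideFile`, its methods, and the metadata keys are hypothetical, and the values that would come from the JDBC query are simply passed in. In a real job the driver would register the file with the distributed cache (for example via `job.addCacheFile(uri)`) and the mapper's `setup()` would read the local copy.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Plain-JDK model of Solution 1. No Hadoop or JDBC classes are used here.
class MetadataSideFile {

    // Driver side: runs once per job. In a real driver the values would come
    // from a single JDBC query against Oracle; here they are passed in.
    static Path writeMetadata(Properties metadata) {
        try {
            Path file = Files.createTempFile("file-metadata", ".properties");
            try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                metadata.store(out, "metadata fetched once by the driver");
            }
            // In Hadoop, this file would be copied to HDFS and registered
            // with the distributed cache at this point.
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mapper side: the equivalent of setup(), which runs once per task rather
    // than once per record, reading the local copy of the cached file.
    static Properties loadMetadata(Path localCopy) {
        Properties metadata = new Properties();
        try (Reader in = Files.newBufferedReader(localCopy, StandardCharsets.UTF_8)) {
            metadata.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return metadata;
    }

    public static void main(String[] args) {
        Properties fromDb = new Properties();
        fromDb.setProperty("delimiter", "|");     // hypothetical metadata keys
        fromDb.setProperty("columnCount", "12");

        Path cached = writeMetadata(fromDb);
        Properties inMapper = loadMetadata(cached);
        System.out.println(inMapper.getProperty("delimiter"));
        cached.toFile().delete();
    }
}
```

The point of the design is that the database is queried once, in the driver, and every mapper only reads a small local file.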
I need to pass some parameters to my map program. The values for these parameters are dynamic and must be fetched from a database. I know how to pass parameters using the Configuration API. If I write JDBC code in the driver (client) to retrieve these values and then set them on the Configuration, how many times will this code be executed? Will the driver code be distributed and executed on each data node that the Hadoop framework selects to run the MR program?
What is the best way to do this?
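The driver code is not executed on each machine; it runs once, on the client that submits the job. Only the map and reduce tasks are distributed across the cluster.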
I suggest fetching the data outside the MapReduce program and then passing it in as a parameter.
Say you launch the job from a script: fetch the data from the database into a variable and then pass that variable to the Hadoop job.
I think this will do the job.
If the data you need is big (more than a few kilobytes), Configuration may not be suitable. A better alternative is to use Sqoop to copy the data from the database into HDFS. Then use the Hadoop distributed cache so that your map or reduce code can read the data without any parameter passing.
You can retrieve the values from the DB in the driver code. The driver code executes only once per job.
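To make the "once per job" point concrete, here is a small plain-JDK simulation. It is only a sketch under stated assumptions: `CountingSource` is a hypothetical stand-in for the database, the `threshold` parameter is invented, and the shared map plays the role that the job Configuration plays in Hadoop (set in the driver, read in each task's `setup()`).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;

class FetchOncePerJob {

    // Stand-in for the database: counts how many times it is queried.
    static final class CountingSource {
        final AtomicInteger hits = new AtomicInteger();

        Map<String, String> query() {
            hits.incrementAndGet();
            Map<String, String> params = new HashMap<>();
            params.put("threshold", "42"); // hypothetical parameter
            return params;
        }
    }

    // Driver-side fetch: one query, and every task reads the shared result.
    static int runWithDriverFetch(CountingSource db, int mapTasks) {
        Map<String, String> jobParams = db.query(); // once per job
        for (int task = 0; task < mapTasks; task++) {
            Objects.requireNonNull(jobParams.get("threshold"));
        }
        return db.hits.get();
    }

    // Mapper-side fetch: every task queries the database in its own setup.
    static int runWithMapperFetch(CountingSource db, int mapTasks) {
        for (int task = 0; task < mapTasks; task++) {
            Objects.requireNonNull(db.query().get("threshold"));
        }
        return db.hits.get();
    }

    public static void main(String[] args) {
        System.out.println(runWithDriverFetch(new CountingSource(), 100)); // prints 1
        System.out.println(runWithMapperFetch(new CountingSource(), 100)); // prints 100
    }
}
```

With 100 map tasks, the driver-side approach queries the source once, while the mapper-side approach queries it 100 times.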
I am wondering, given that such large datasets are processed with Hadoop MapReduce, what data structures Hadoop uses. If possible, could somebody please provide a detailed view of the underlying data structures in Hadoop?
HDFS is the default underlying storage platform of Hadoop.
It's like any other file system in the sense that it does not care what structure the files have. It only ensures that files are stored redundantly and can be retrieved quickly.
So it is entirely up to you, the user, to store files with whatever internal structure you like.
A MapReduce program simply gets the file data fed to it as input: not necessarily the entire file, but parts of it, depending on the InputFormat and so on. The map program can then make use of the data in whatever way it wants to.
Hive, on the other hand, deals with tables (columns/rows), and you can query them in a SQL-like fashion using HiveQL.
Thanks to all of you.
I got the answer to my question. HDFS uses blocks as its storage unit; a detailed description of this, along with the network streaming concepts, is in the following book.
All the details are available in the third chapter of Hadoop: The Definitive Guide.
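As a rough illustration of the block idea, the arithmetic below shows how a file of a given length divides into fixed-size blocks. It is a sketch, not HDFS code: the class and method names are hypothetical, and the block size is an assumption (128 MB is the `dfs.blocksize` default in Hadoop 2 and later; older releases defaulted to 64 MB).

```java
class BlockLayout {

    // Assumed block size; configurable per cluster and per file in HDFS.
    static final long DEFAULT_BLOCK_SIZE = 128L * 1024 * 1024;

    // Number of blocks needed to hold a file of the given length.
    static long blockCount(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and block size > 0");
        }
        return fileLength / blockSize + (fileLength % blockSize == 0 ? 0 : 1);
    }

    // Length of the final block; it holds only the bytes that remain,
    // so a short last block does not occupy a full block of storage.
    static long lastBlockLength(long fileLength, long blockSize) {
        if (fileLength < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and block size > 0");
        }
        if (fileLength == 0) {
            return 0;
        }
        long remainder = fileLength % blockSize;
        return remainder == 0 ? blockSize : remainder;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long fileLength = 300 * mb;
        System.out.println(blockCount(fileLength, DEFAULT_BLOCK_SIZE));           // prints 3
        System.out.println(lastBlockLength(fileLength, DEFAULT_BLOCK_SIZE) / mb); // prints 44
    }
}
```

So a 300 MB file with 128 MB blocks is stored as two full blocks plus one 44 MB block.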
I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
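The general idea of writing several record types in one reduce pass can be modeled in plain Java. To be clear, this is not the HCatalog API: `AliasRouter`, its methods, and the `orders`/`refunds` names are all hypothetical, and in-memory lists stand in for the per-schema writers. It only illustrates registering one named output per schema and routing each record to it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-JDK model of "one named output per schema" in a single pass.
class AliasRouter {

    private final Map<String, List<String>> outputs = new LinkedHashMap<>();

    // Register an output under a name before any record is written to it.
    void addOutput(String alias) {
        outputs.put(alias, new ArrayList<>());
    }

    // Route one record to the output registered under the given name.
    void write(String alias, String record) {
        List<String> sink = outputs.get(alias);
        if (sink == null) {
            throw new IllegalArgumentException("unknown output alias: " + alias);
        }
        sink.add(record);
    }

    // Read-only view of what was written to one output.
    List<String> recordsFor(String alias) {
        List<String> sink = outputs.get(alias);
        return sink == null ? Collections.<String>emptyList()
                            : Collections.unmodifiableList(sink);
    }

    public static void main(String[] args) {
        AliasRouter router = new AliasRouter();
        router.addOutput("orders");   // hypothetical record types
        router.addOutput("refunds");

        // One pass over mixed input, as in a single reduce step.
        String[][] mixed = { {"orders", "o-1"}, {"refunds", "r-1"}, {"orders", "o-2"} };
        for (String[] rec : mixed) {
            router.write(rec[0], rec[1]);
        }
        System.out.println(router.recordsFor("orders"));  // prints [o-1, o-2]
        System.out.println(router.recordsFor("refunds")); // prints [r-1]
    }
}
```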
Andrew
I am trying to convert an application that has a relational database as its backend. Can I store the data relationally in HDFS as well?
Just for the sake of storing, you can store anything in HDFS. But that won't make any sense. First of all, you should not think of Hadoop as a replacement for your RDBMS (which is what you are trying to do here). Both are meant for totally different purposes. Hadoop is not a good fit for your transactional, relational or real-time kind of needs. It was meant to serve your offline batch processing needs. So, it's better to analyze your use case properly and then finalize your decision.
As a suggestion, I would like to point you to Hive. It provides warehousing capabilities on top of your existing Hadoop cluster. It also provides a SQL-like interface to your warehouse, which will make your life much easier if you are coming from a SQL background. But again, Hive is also a batch processing system and is not a good fit if you need something real time.
You can have a look at HBase though, as suggested by abhinav. It's a DB that runs on top of your Hadoop cluster and provides random, real-time read/write access to your data. But you should keep one thing in mind: it's a NoSQL DB. It doesn't follow SQL terminology and conventions, so you might find it a bit alien initially. You will have to think about issues like how to store your data in a new, column-oriented storage style, unlike the row-style storage of your RDBMS. Otherwise, it is not hard to set up and use.
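To give a feel for that storage style, HBase's data model is often described as a sorted map of maps. The sketch below models that with JDK collections only. It is not the HBase client API: the class and method names are hypothetical, versions/timestamps are left out, and strings are used where HBase stores raw bytes.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Simplified model: row key -> column family -> column qualifier -> value.
class ColumnFamilyTable {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> rows =
            new TreeMap<>();

    // Store one cell; rows need not share the same set of columns.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    // Random read of a single cell; returns null if it does not exist.
    String get(String rowKey, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, String>> families = rows.get(rowKey);
        if (families == null) {
            return null;
        }
        NavigableMap<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(qualifier);
    }

    // Range scan over sorted row keys: start inclusive, stop exclusive.
    NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> scan(
            String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        ColumnFamilyTable table = new ColumnFamilyTable();
        table.put("user#001", "info", "name", "Ann");      // hypothetical data
        table.put("user#002", "info", "name", "Bob");
        table.put("user#002", "contact", "email", "bob@example.com");

        System.out.println(table.get("user#002", "contact", "email"));
        System.out.println(table.scan("user#001", "user#002").keySet()); // prints [user#001]
    }
}
```

Because rows are kept sorted by key, row key design largely takes the place of the indexes and joins you would rely on in an RDBMS.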
HTH
Any file can be stored in HDFS. But if you want database-style access on top of Hadoop, you should go for HBase. If you store your data directly in HDFS, you will not be able to store it relationally.
Our environment relies heavily on storing data in Hive. I am currently working on something that is outside that scope, though. I have written a MapReduce job, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is thrown out in one or two columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get it in a more direct manner?
Alternatively, if I could get pointed to documentation on manually using the CombineHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is solely used inside of Hive, using its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.