Can I use HCatInputFormat with MultipleInputs in Hadoop? - hadoop

I'm attempting to do a join between two datasets: one is stored in a Hive table, the other is not. From what I can tell this is not a very common setup; people either define everything as Hive tables or nothing at all.
There's the MultipleInputs class, but its addInputPath method takes a Configuration, a Path, an InputFormat and a Mapper.
I could use HCatInputFormat there and try to disguise the table name as a Path, but that sounds like a wild guess at best.
There's a patch for newer versions of Hive (I'm on CDH4, so that sadly means Hive 0.10 and HCatalog 0.5). The patch below is not exactly straightforward to backport to my version, and it also seems to only handle multiple Hive tables, not a mix of Hive and non-Hive inputs.
https://issues.apache.org/jira/browse/HIVE-4997
Is this possible or have you any recommendations?
The only thing I can think of is reading the raw data without going through the table, but that means implementing logic over Hive-specific storage formats, which I'd rather avoid.

HCatMultipleInputs can be used to read from multiple Hive tables.
Here is a patch (for 0.13) that we can look at installing for multiple-table support. It adds HCatMultipleInputs to support multiple Hive tables.
https://issues.apache.org/jira/browse/HIVE-4997
Example usage:
HCatMultipleInputs.addInput(job, table1, db1, properties1, Mapper1.class);
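For context, a rough driver sketch built around that call. HCatMultipleInputs and its argument order come from the patch rather than a released API, so treat them as assumptions; the table, database, mapper and reducer names below are placeholders:

import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TwoTableJoinDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "two-table-join");
    job.setJarByClass(TwoTableJoinDriver.class);

    // One mapper per Hive table; both should emit a common intermediate key/value type.
    // The type of the properties argument is a guess based on the example above.
    HCatMultipleInputs.addInput(job, "table1", "db1", new Properties(), Table1Mapper.class);
    HCatMultipleInputs.addInput(job, "table2", "db1", new Properties(), Table2Mapper.class);

    job.setReducerClass(JoinReducer.class);
    // ... set output key/value classes and an output format as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}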
You can use the working code in the below link:
https://github.com/abhirj87/training/tree/master/multipleinputs

The solution here is apparently either to upgrade to 0.14.0 (or backport the patch to your version), or to skip HCatalog, read the metastore directly, and manually add each partition subdirectory to MultipleInputs.
Personally, since I can't upgrade easily and handling the sub-partitioning by hand is too much work, I've just focused on optimising the jobs in other ways and am content with running a sequence of jobs for now.
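For anyone going the metastore route, this is roughly what I mean. HiveMetaStoreClient and MultipleInputs are standard classes, but the sketch assumes the table is stored as delimited text so a plain TextInputFormat (plus your own record parsing) applies:

import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MetastoreInputSetup {
  // Looks up every partition of a Hive table in the metastore and adds its
  // directory as an input path, bound to the given mapper.
  public static void addHiveTableInputs(Job job, String db, String table,
      Class<? extends Mapper> mapperClass) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    try {
      List<Partition> partitions = client.listPartitions(db, table, (short) -1);
      for (Partition p : partitions) {
        Path location = new Path(p.getSd().getLocation());
        MultipleInputs.addInputPath(job, location, TextInputFormat.class, mapperClass);
      }
    } finally {
      client.close();
    }
  }
}

The non-Hive dataset then gets its own MultipleInputs.addInputPath call with whatever input format it already uses.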

Is there a way to apply the patch on its own, in a separate MapReduce program? It seems the patch still hasn't been committed, but I want to use the solution in my job.

Related

Best method to compare data between 2 different databases on separate servers

We are migrating data from a DB2 database to Hadoop. The migration really amounts to running select * from table1 on DB2, exporting it to a delimited file, and putting that file into Hadoop. DB2 and Hadoop reside on different servers on different networks. We need to run some validation steps to make sure that the data extracted from DB2 has been imported into Hadoop in its entirety. Just running select count(1) from table1 on both systems would not help, since we could have cases where some column values could not be imported due to specific character issues (e.g. newlines).
What would be the best method to programmatically test that data is identical on both the systems?
P.S.: Both Hadoop and DB2 are running on RHEL, so feel free to include any Linux-specific tools that would be helpful in this process.
Not sure if this is the "best" way, but my approach would be:
As one of the previous posters suggested, export the data from Hadoop to a delimited file and run a diff against the DB2 export files. This is probably the easiest method.
Write a simple utility which connects to both databases simultaneously, fetches the data from the two tables to compare, and compares the result sets. Having Googled a bit, it seems there are some utilities out there already - for example http://www.dbsolo.com/datacomp.html.
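If you do end up writing your own utility, here is a minimal sketch of the comparison idea: compute an order-independent fingerprint (row count plus a sum of per-row hashes) of each export and compare the two. It assumes both sides can be dumped to delimited text; a row whose embedded newline got mangled will show up as a fingerprint mismatch, which is exactly the case you want to catch:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ExportFingerprint {
  public static void main(String[] args) throws IOException {
    long rows = 0;
    long hashSum = 0;
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line;
      while ((line = in.readLine()) != null) {
        // Strip the things that typically differ between exports:
        // carriage returns and trailing whitespace.
        String normalized = line.replace("\r", "").trim();
        rows++;
        hashSum += normalized.hashCode(); // overflow is fine, both sides overflow the same way
      }
    }
    System.out.println("rows=" + rows + " hashSum=" + hashSum);
  }
}

Run it once against the DB2 export and once against the data pulled back out of HDFS (e.g. hadoop fs -cat over the table directory, redirected to a file), then diff the two one-line outputs.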
Hope this helps.

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
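Roughly what that wiring looks like, pieced together from TestHCatMultiOutputFormat; treat the configurer methods as my reading of that test rather than documented API, the database/table/alias names are placeholders, and older releases use the org.apache.hcatalog package instead of org.apache.hive.hcatalog:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat;
import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat.JobConfigurer;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

public class MultiSchemaSetup {
  // Driver side: register one HCatOutputFormat per target table under an alias.
  public static void configureOutputs(Job job) throws IOException {
    job.setOutputFormatClass(MultiOutputFormat.class);

    JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);
    configurer.addOutputFormat("summary", HCatOutputFormat.class,
        NullWritable.class, DefaultHCatRecord.class);
    configurer.addOutputFormat("detail", HCatOutputFormat.class,
        NullWritable.class, DefaultHCatRecord.class);

    // Each alias has its own Job view so HCatOutputFormat can carry per-table
    // output info (the test also calls HCatOutputFormat.setSchema here).
    HCatOutputFormat.setOutput(configurer.getJob("summary"),
        OutputJobInfo.create("mydb", "summary_table", null));
    HCatOutputFormat.setOutput(configurer.getJob("detail"),
        OutputJobInfo.create("mydb", "detail_table", null));
    configurer.configure();
  }

  // Reducer side, inside reduce(): route each record type to its alias.
  //   MultiOutputFormat.write("summary", NullWritable.get(), summaryRecord, context);
  //   MultiOutputFormat.write("detail", NullWritable.get(), detailRecord, context);
}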
Andrew

Can we store relational data in hdfs

I am trying to convert an application that has a relational database as its backend. Can I store the data relationally in HDFS as well?
Just for the sake of storing, you can store anything in HDFS. But that won't make much sense on its own. First of all, you should not think of Hadoop as a replacement for your RDBMS (which is what you are trying to do here). Both are meant for totally different purposes. Hadoop is not a good fit for transactional, relational or real-time needs; it was meant to serve offline batch-processing needs. So it's better to analyze your use case properly and then make your decision.
As a suggestion I would like to point you to Hive. It provides warehousing capabilities on top of your existing Hadoop cluster, along with an SQL-like interface to your warehouse, which will make your life much easier if you are coming from an SQL background. But again, Hive is also a batch-processing system and is not a good fit if you need something real-time.
You can have a look at HBase though, as suggested by abhinav. It's a DB that can run on top of your Hadoop cluster and provides you random, real-time read/write access to your data. But you should keep one thing in mind: it's a NoSQL DB and doesn't follow SQL terminology and conventions, so you might find it a bit alien initially. You might have to think about how to store your data in a new, columnar storage style, unlike the row-style storage of your RDBMS. Otherwise it's not a problem to set up and use.
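To make the row-versus-column point a bit more concrete, this is roughly how a single relational row (user_id, name, email) might land in HBase with the current client API; the table, column family and qualifier names here are just illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// One relational row becomes one HBase row keyed by user_id, with each
// column stored as family:qualifier -> value cells.
public class UserWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {
      Put put = new Put(Bytes.toBytes("user42")); // row key
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
          Bytes.toBytes("Alice"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("alice@example.com"));
      table.put(put);
    }
  }
}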
HTH
Any file can be stored in HDFS. But if you want an SQL-type DB you should go for HBase. If you store your data directly in HDFS you will not be able to preserve the relations between tables.

Modeling Data in Hadoop

Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables closely follow a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model, or build an aggregate before working on the MR jobs? I'm basically looking at ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
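To illustrate the "Map Reduce can use it seamlessly" point, here is a minimal sketch of pointing a job at Avro files such as those a Sqoop import produces (the job name is arbitrary and the reader schema is a placeholder for your own):

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroInputSetup {
  // Each map task then receives AvroKey<GenericRecord> keys (with NullWritable
  // values); the writer schema travels inside the data files themselves.
  public static Job createJob(Configuration conf, Schema readerSchema) throws Exception {
    Job job = Job.getInstance(conf, "read-avro-import");
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, readerSchema); // may differ from the writer schema (schema evolution)
    return job;
  }
}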
Storing these files as csv is fine, since you can process them with plain text input/output formats and also read them through Hive by specifying the delimiter. You can change the delimiter if you don't like commas, for example to pipe ("|"), which is what I do most of the time. You generally want large files in Hadoop, but if the data is large enough, it is a good idea to partition the files into separate directories based on your partition column, with each partition in the range of a few hundred gigs.
It is also usually better to have most of the columns in a single table than to have many small normalized tables, though that depends on your data size. And make sure that whenever you copy, move or create data you do all the constraint checks in your applications, as it will be difficult to make small changes to the table later on; you would need to rewrite the complete file even for a small change.
Hive partitioning and bucketing concepts can be used effectively to put similar data together (not on the same nodes, but in the same files and folders) based on a particular column. There are some nice tutorials available for partitioning and bucketing.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavy into storing data in Hive. I find myself currently working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is thrown out in one or two columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get at it more directly?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot. But it seems like that InputFormat is solely used inside of Hive, with its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.