Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata - hadoop

Our environment is heavy into storing data in Hive. I currently find myself working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is dumped into one or two columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get it in a more structured manner?
Alternatively, if I could get pointed to documentation on manually using CombineHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is used solely inside Hive, with its own custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.
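For illustration, here is a rough sketch of scraping that metadata through HiveMetaStoreClient instead of parsing the extended output (the database/table names below are placeholders, and hive-site.xml with the metastore URI is assumed to be on the classpath):

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Partition;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetastoreScrape {
        public static void main(String[] args) throws Exception {
            // Assumes hive-site.xml (with hive.metastore.uris) is on the classpath.
            HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

            // "mydb" and "mytable" are placeholders.
            Table table = client.getTable("mydb", "mytable");

            // Column names and types, not including partition keys.
            for (FieldSchema col : table.getSd().getCols()) {
                System.out.println(col.getName() + " : " + col.getType());
            }

            // HDFS location of every partition; a mapper could match its
            // input split's path against these to find its own partition.
            for (Partition p : client.listPartitions("mydb", "mytable", (short) -1)) {
                System.out.println(p.getValues() + " -> " + p.getSd().getLocation());
            }

            client.close();
        }
    }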

Related

Does anybody know how to choose the data model when using Impala?

There are several kinds of file formats, such as Impala internal tables or external table formats like CSV, Parquet, and HBase. We need to guarantee an average insert rate of 50K rows/s, where each row is about 1 KB. Some of the data can also be updated occasionally, and we also need to do some aggregation operations on that data.
I think HBase is not a good choice for large aggregation computations when using Impala with external tables. Does anybody have suggestions about this?
Thanks, Chen.
I've never worked with Impala, but I can tell you a few things based on my experience with Hive.
HBase will be faster if you have a good key design and a proper schema, because, just like with Hive, Impala will translate your WHERE clause into scan filters; how well that works will depend a lot on the type of queries you run. There are multiple techniques to reduce the amount of data read by a job: from simple ones like providing start and stop rowkeys, time ranges, reading only some families/columns, and the already mentioned filters, to more complex solutions like performing realtime aggregations on your data (*) and keeping them as counters.
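For example, here is a minimal sketch (with made-up table, family, and rowkey values) of what narrowing a scan looks like in the HBase Java API:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NarrowScan {
        // Rowkeys, family, and qualifier values below are made up for illustration.
        public static Scan build() throws IOException {
            Scan scan = new Scan(Bytes.toBytes("user123#2013"), Bytes.toBytes("user123#2014")); // start/stop rowkeys
            scan.setTimeRange(1356998400000L, 1388534400000L);           // only cells written in this time window
            scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount")); // read a single family/qualifier
            scan.setFilter(new SingleColumnValueFilter(                  // server-side filter, like a WHERE clause
                    Bytes.toBytes("d"), Bytes.toBytes("status"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("OK")));
            scan.setCaching(500);                                        // fewer RPC round trips
            return scan;
        }
    }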
Regarding your insert rate, HBase can handle it perfectly well with the proper infrastructure (it's better to use the native HBase Java API). You can also buffer your writes to get even better performance.
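To illustrate the buffering point, here is a rough sketch using the BufferedMutator from the HBase 1.0+ client (table and column names are made up); on older clients, HTable.setAutoFlush(false) together with a write buffer size achieves the same effect:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedWrites {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 // Puts are batched client-side and flushed in bulk instead of one RPC per row.
                 BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf("events"))) {
                for (long i = 0; i < 100000; i++) {
                    Put put = new Put(Bytes.toBytes("row-" + i));
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("value-" + i));
                    mutator.mutate(put);
                }
                mutator.flush(); // push anything still sitting in the client-side buffer
            }
        }
    }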
*Not sure if Impala supports HBase counters.

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
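From memory, the pattern in that test looks roughly like the sketch below; the aliases, database, and table names are placeholders, and the exact method signatures should be checked against the MultiOutputFormat class in your HCatalog version:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat;
    import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat.JobConfigurer;
    import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

    public class MultiSchemaSetup {
        public static void configure(Job job) throws Exception {
            job.setOutputFormatClass(MultiOutputFormat.class);
            JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);

            // One alias per record type / HCatalog table ("orders" and "clicks" are placeholders).
            configurer.addOutputFormat("orders", HCatOutputFormat.class,
                    WritableComparable.class, HCatRecord.class);
            configurer.addOutputFormat("clicks", HCatOutputFormat.class,
                    WritableComparable.class, HCatRecord.class);

            // Each alias gets its own child Job, so each can point at its own table/schema.
            // (HCatOutputFormat.setSchema would also need to be called per alias before writing.)
            HCatOutputFormat.setOutput(configurer.getJob("orders"),
                    OutputJobInfo.create("mydb", "orders", null));
            HCatOutputFormat.setOutput(configurer.getJob("clicks"),
                    OutputJobInfo.create("mydb", "clicks", null));

            configurer.configure();
        }
    }

Inside the reducer, each record is then emitted with MultiOutputFormat.write("orders", key, record, context), using the alias of the schema it belongs to.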
Andrew

Sequence File of Objects into Hive

We started with a bunch of data stored in NetCDF files. From there, some Java code was written to create sequence files from the NetCDF files. We don't know much about the original intentions of the code, but we have been able to learn a little bit about the sequence files themselves. Ultimately, we are trying to create tables within Hive using these sequence files, but seem incapable of doing so at the moment.
We know that the keys and values within the sequence files are stored as objects that implement WritableComparable. We are also capable of creating Java code to iterate through all of the data in the sequence files.
So, what would be necessary to actually get Hive to read the data within the objects of these sequence files properly?
Thanks in advance!
UPDATE: The reason it is so difficult to describe exactly where I am having trouble is that I am not necessarily getting any errors. Hive is simply reading the sequence files incorrectly. When running the hadoop fs -text command on my sequence file, I get a list of objects like this:
NetCDFCompositeKey#263c7e3f , NetCDFRecordWritable#4d846db5
The data is within those objects themselves. So, currently, with the help of @Tariq, I believe that in order to actually read those objects I have to create a custom InputFormat to read the keys and a custom SerDe to serialize and deserialize the objects. Is that right?
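For reference, a minimal way to confirm which classes the file actually stores is to open it with SequenceFile.Reader (the path argument is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class InspectSeqFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path(args[0]); // path to one sequence file
            SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
            System.out.println("key class   = " + reader.getKeyClassName());
            System.out.println("value class = " + reader.getValueClassName());

            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                // toString() of the custom classes is what hadoop fs -text prints;
                // the real fields live inside the objects themselves.
                System.out.println(key + "\t" + value);
            }
            reader.close();
        }
    }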
I'm sorry, I'm not able to understand from your question where exactly you are facing the problem. If you wish to use SequenceFiles through Hive, you just have to add the STORED AS SEQUENCEFILE clause while issuing CREATE TABLE (most probably you already know this, nothing new). When you work with SequenceFiles, Hive treats each key/value pair of the SequenceFile similarly to rows in normal files. The important thing here is that the keys will be ignored. Apart from that, nothing very special.
Having said that, if you wish to read both keys and values, you might have to write a custom InputFormat that can do so. See this project for an example; it allows us to access data stored in a SequenceFile's key.
Also, if your keys and values are custom classes, you will need to write a SerDe as well to serialize and deserialize your data.
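To make that concrete, a bare-bones read-only SerDe could look like the sketch below. The column names, their types, and the NetCDFRecordWritable getters are hypothetical placeholders for whatever the NetCDF value object actually exposes:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.AbstractSerDe;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.SerDeStats;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class NetCDFSerDe extends AbstractSerDe {

        private ObjectInspector inspector;

        @Override
        public void initialize(Configuration conf, Properties tbl) throws SerDeException {
            // Hypothetical columns; replace with whatever the NetCDF records really contain.
            List<String> names = Arrays.asList("station", "measurement");
            List<ObjectInspector> inspectors = Arrays.<ObjectInspector>asList(
                    PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                    PrimitiveObjectInspectorFactory.javaDoubleObjectInspector);
            inspector = ObjectInspectorFactory.getStandardStructObjectInspector(names, inspectors);
        }

        @Override
        public Object deserialize(Writable blob) throws SerDeException {
            // NetCDFRecordWritable is the value class already in your sequence files;
            // its package and getters (getStation, getMeasurement) are assumptions here.
            NetCDFRecordWritable record = (NetCDFRecordWritable) blob;
            return Arrays.asList(record.getStation(), record.getMeasurement());
        }

        @Override
        public ObjectInspector getObjectInspector() throws SerDeException {
            return inspector;
        }

        @Override
        public SerDeStats getSerDeStats() {
            return null;
        }

        @Override
        public Class<? extends Writable> getSerializedClass() {
            return Text.class;
        }

        @Override
        public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
            throw new UnsupportedOperationException("read-only SerDe: writing back is not supported");
        }
    }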
HTH
P.S. : I don't know if this is exactly what you were looking for. Do let me know if it is not and add some more detail to your question. I'll try addressing that.

Modeling Data in Hadoop

Currently I am bringing around 10 tables into Hadoop from an EDW (Enterprise Data Warehouse); these tables are closely related to a star schema model. I'm using Sqoop to bring all these tables across, resulting in 10 directories containing CSV files.
I'm looking at better ways to store these files before kicking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at ways of storing related data together.
Most things I have found by searching involve storing trivial CSV files and reading them with OpenCSV. I'm looking for something a bit more involved and not just for CSV files. If moving to another format works better than CSV, that is no problem.
It boils down to: what is the best way to store a bunch of related data in HDFS so as to have a good experience with MR?
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file, and it is compact and efficient for fast serialization. It gives you versioning facilities, which are useful when bringing in updated data with a different schema. Hive supports reading and writing it, and MapReduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
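For what it's worth, the fact that the schema travels with the data is easy to see by opening one of the resulting files with the plain Avro API (the file name below is a placeholder):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ReadAvro {
        public static void main(String[] args) throws Exception {
            // Any .avro file written by Sqoop or MapReduce; the path is a placeholder.
            File file = new File("part-m-00000.avro");
            DataFileReader<GenericRecord> reader =
                    new DataFileReader<>(file, new GenericDatumReader<GenericRecord>());

            // The writer's schema is embedded in the file header.
            Schema schema = reader.getSchema();
            System.out.println(schema.toString(true));

            // Records come back as generic objects; no code generation required.
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
            reader.close();
        }
    }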
Storing these files as CSV is fine, since you will be able to process them using the text output format and also read them through Hive using a specific delimiter. You can change the delimiter from a comma to a pipe ("|") if you prefer; that's what I do most of the time. You generally want large files in Hadoop, but if the data is large enough to partition, and each partition comes to a few hundred gigs, then it is a good idea to split these files into separate directories based on your partition column.
Also, it is usually a better idea to have most of the columns in a single table than to have many small, normalized tables, though that varies depending on your data size. Make sure that whenever you copy, move, or create data you do all the constraint checks in your applications, as it will be difficult to make small changes to the table later on; you would need to rewrite the complete file for even a small change.
Hive's Partitioning and Bucketing concepts can be used effectively to put similar data together (not on nodes, but in files and folders) based on a particular column. There are some nice tutorials available for Partitioning and Bucketing.

Can HBase Access Text Documents and CSV Documents Just as Hadoop?

In Hadoop, I can easily create Map/Reduce apps which access and process data in huge text files and CSV files. My question is: can HBase do the same and access such huge files, or does HBase have other uses?
HBase runs queries just like a relational database, so I have a hard time understanding the advantage of HBase, unless it can access huge text and CSV files just as Hadoop does.
First of all, HBase is just a store, and a store never accesses anything; rather, you access the store to fetch or put data. Like any other datastore, HBase has only one job to do: store your data and make it available to you whenever you need it. You can write MapReduce jobs, sequential Java programs, etc. to put data into HBase or fetch data from it. It's totally up to you which path you prefer.
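To make that concrete, here is a tiny sketch of putting and fetching a single row with the current HBase Java client (table, family, and qualifier names are made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutAndFetch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("docs"))) {

                // Store one cell: row "doc-1", family "d", qualifier "body".
                Put put = new Put(Bytes.toBytes("doc-1"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("some text"));
                table.put(put);

                // Random read of exactly that row; no full-file scan involved.
                Result result = table.get(new Get(Bytes.toBytes("doc-1")));
                System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"))));
            }
        }
    }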
Coming to the second part of your question: HBase never works like a traditional relational database. Everything, from storing the data to accessing the data, is totally different. The advantage of using HBase is that you can store a really huge amount of data in it and have random read/write access. The data can be of any type: text, CSV, TSV, binary, etc. But before going ahead, you must think carefully about whether HBase is a suitable choice for you, as one size doesn't fit all.
HTH

Resources