Read data from multiple tables, evaluate the results, and generate a report - hadoop

I have a question regarding an effective way of reading values from a DB and generating a report.
I use Hadoop to read data from multiple tables and do data analysis based on the results.
I want to know if there is an effective tool or way to read data from multiple tables, check whether the values of certain columns are the same across tables, and send a report if they are not. I have two options: I can either read the data from Hadoop or connect to the database in DB2 and do it there. Without creating a new Java program, is there a tool that helps with this, like the Talend tool, which reads XML and writes output to a DB?

You can use Talend for this. With Talend you can read data from Hadoop as well as from a database, perform your comparison on the fetched data in between, and generate the report.
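For what it's worth, the check such a job performs is conceptually simple. Below is a hypothetical JDBC sketch purely to illustrate the operation the Talend job needs to do (both connection strings, the table, and the column names are made up); it is not a suggestion to hand-code this instead of using Talend.

```java
import java.sql.*;
import java.util.HashSet;
import java.util.Set;

public class ColumnCompare {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoints -- adjust hosts/ports/credentials; the DB2 and
        // Hive JDBC drivers must be on the classpath.
        try (Connection db2 = DriverManager.getConnection(
                 "jdbc:db2://db2host:50000/SALESDB", "user", "password");
             Connection hive = DriverManager.getConnection(
                 "jdbc:hive2://hadoophost:10000/default", "hive", "")) {

            Set<String> db2Values  = readColumn(db2,  "SELECT order_id FROM orders");
            Set<String> hiveValues = readColumn(hive, "SELECT order_id FROM orders");

            // Values present on one side but not the other.
            Set<String> missingInHive = new HashSet<>(db2Values);
            missingInHive.removeAll(hiveValues);
            Set<String> missingInDb2 = new HashSet<>(hiveValues);
            missingInDb2.removeAll(db2Values);

            if (!missingInHive.isEmpty() || !missingInDb2.isEmpty()) {
                System.out.println("Mismatch: missing in Hive=" + missingInHive
                        + ", missing in DB2=" + missingInDb2);
            }
        }
    }

    private static Set<String> readColumn(Connection conn, String query) throws SQLException {
        Set<String> values = new HashSet<>();
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(query)) {
            while (rs.next()) {
                values.add(rs.getString(1));
            }
        }
        return values;
    }
}
```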

If you are working with a lot of data and do this sort of operation often, Elasticsearch is also a great help in this area. Use the ELK stack, although you would not necessarily need the 'L' (Logstash) part of it.

Related

Hadoop data visualization

I am a new Hadoop developer and I have been able to install and run Hadoop services in a single-node cluster. The problem comes during data visualization. What role does the MapReduce jar file play when I need to use a data visualization tool like Tableau? I have a structured data source to which I need to add a layer of logic so that the data makes sense during visualization. Do I need to write MapReduce programs if I am going to visualize with other tools? Please shed some light on how I could go about this issue.
This probably depends on what distribution of Hadoop you are using and which tools are present. It also depends on the actual data preparation task.
If you don't want to actually write map-reduce or Spark code yourself, you could try SQL-like queries using Hive (which translates to map-reduce) or the even faster Impala. Using SQL you can create tabular data (Hive tables) which can easily be consumed. Tableau has connectors for both of them that automatically translate your Tableau configurations/requests to Hive/Impala. I would recommend connecting with Impala because of its speed.
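To make the "create tabular data" step concrete, here is a hedged sketch that runs the aggregation once through the Hive JDBC driver and leaves a flat table behind for Tableau's connector to read. The host, table, and column names are invented; Impala can usually be reached with the same driver on a different port (commonly 21050), but check your distribution and security setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BuildReportingTable {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; for Impala, point the same driver at its port instead.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hadoophost:10000/default");
             Statement st = conn.createStatement()) {

            // Materialize an aggregated, tabular view of the raw data so that the
            // visualization tool only ever touches a small, flat table.
            st.execute("DROP TABLE IF EXISTS sales_by_region");
            st.execute(
                "CREATE TABLE sales_by_region STORED AS PARQUET AS " +
                "SELECT region, product, SUM(amount) AS total_amount, COUNT(*) AS orders " +
                "FROM raw_sales " +
                "GROUP BY region, product");
        }
    }
}
```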
If you need to do work that requires more programming or where SQL just isn't enough you could try Pig. Pig is a high level scripting language that compiles to map-reduce code. You can try all of the above in their respective editor in Hue or from CLI.
If you feel like all of the above still don't fit your use case, I would suggest writing map-reduce or Spark code. Spark code does not have to be written in Java and has the advantage of being generally faster.
Most tools can integrate with hive tables meaning you don't need to rewrite code. If a tool does not provide this you can make CSV extracts from the hive tables or you can keep the tables stored as CSV/TSV. You can then import these files in your visualization tool.
The existing answer already touches on this but is a bit broad, so I decided to focus on the key part:
Typical steps for data visualisation
Do the complex calculations using any Hadoop tool that you like
Offer the output in a (Hive) table
Pull the data into the memory of the visualisation tool (e.g. Tableau), for instance using JDBC (see the sketch below)
If the data is too big to be pulled into memory, you could pull it into a normal SQL database instead and work on that directly from your visualisation tool. (If you work directly on Hive, you will go crazy, as even the simplest queries take 30+ seconds.)
In case it is not possible/desirable to connect your visualisation tool for some reason, the workaround would be to dump output files, for instance as CSV, and then load these into the visualisation tool.
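A hedged sketch of the pull step and the CSV fallback in one go (the connection string, table name, and output file are assumptions): it reads the prepared Hive table over JDBC and writes it out as a flat CSV that any visualisation tool can import.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class ExportForVisualisation {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hadoophost:10000/default");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM sales_by_region");
             PrintWriter out = new PrintWriter("sales_by_region.csv", "UTF-8")) {

            ResultSetMetaData md = rs.getMetaData();
            int cols = md.getColumnCount();

            // Header row with the column names.
            StringBuilder header = new StringBuilder();
            for (int i = 1; i <= cols; i++) {
                if (i > 1) header.append(',');
                header.append(md.getColumnName(i));
            }
            out.println(header);

            // Data rows (naive CSV: assumes values contain no commas or quotes).
            while (rs.next()) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) row.append(',');
                    row.append(rs.getString(i));
                }
                out.println(row);
            }
        }
    }
}
```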
Check out some end-to-end solutions for data visualization.
For example, Metatron Discovery uses Druid as its OLAP engine, so you just link your Hadoop cluster with Druid and can then manage and visualize your Hadoop data accordingly. It is open source, so you can also look at the code.

How to identify the data types while creating a table in Hive

I am learning to use Hadoop for performing Big Data-related operations.
I need to perform some queries on a collection of data sets split across 8 CSV files. Each file has multiple sheets, and the query concerns only one of them (sheet name: Table4).
The dataset can be downloaded here: http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
A sample data snapshot is attached for quick reference.
I have already converted the above XLS file to CSV.
I am not sure how to group the data while creating the table in Hive.
It will be really helpful if you can guide me here.
Note: I am a novice with Hadoop and Big Data, so if anyone could guide me with how to proceed further I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks!
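One common approach is to look at a few rows of the converted CSV and declare an external table over it, choosing STRING for labels, BIGINT for counts, and DOUBLE for rates. Below is a minimal, hypothetical sketch of such a DDL; the column names and HDFS location are invented, not taken from Table4, and it is issued here through the Hive JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateCensusTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hadoophost:10000/default");
             Statement st = conn.createStatement()) {

            // External table over the directory holding the converted CSV files.
            // Types are chosen per column by inspecting the data: identifiers and
            // labels as STRING, counts as BIGINT, rates/percentages as DOUBLE.
            st.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS health_utilization (" +
                "  state STRING, " +
                "  age_group STRING, " +
                "  population BIGINT, " +
                "  insured BIGINT, " +
                "  uninsured_rate DOUBLE" +
                ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
                "STORED AS TEXTFILE " +
                "LOCATION '/user/hive/census/table4' " +
                "TBLPROPERTIES ('skip.header.line.count'='1')");
        }
    }
}
```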

Does anybody know how to choose the data model when using Impala?

There are several kinds of storage formats, such as Impala internal tables or external table formats like CSV, Parquet, and HBase. We need to guarantee an average insert rate of 50K rows/s, with each row about 1 KB, and some of the data can also be updated occasionally. We also need to do some aggregation operations on that data.
I think HBase is not a good choice for large aggregation computations when used from Impala as an external table. Does anybody have suggestions about this?
Thanks, Chen.
I've never worked with Impala, but I can tell you a few things based on my experience with Hive.
HBase will be faster if you have a good key design and a proper schema, because, just like with Hive, Impala will translate your WHERE clause into scan filters; how much that helps will depend a lot on the type of queries you run. There are multiple techniques to reduce the amount of data read by a job: from simple ones like providing start and stop rowkeys, time ranges, reading only some families/columns, or the already mentioned filters, to more complex solutions like performing realtime aggregations on your data (*) and keeping them as counters.
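As an illustration of those simpler techniques, here is a hedged sketch with the HBase Java client; the table name, rowkey layout, and column family are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class NarrowScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("metrics"))) {

            Scan scan = new Scan();
            // Assumed rowkey layout "entity#yyyyMMdd": restrict to one entity and one month.
            scan.setStartRow(Bytes.toBytes("device42#20150101"));
            scan.setStopRow(Bytes.toBytes("device42#20150201"));
            // Read only one column family instead of the whole row.
            scan.addFamily(Bytes.toBytes("d"));
            // Optionally restrict by cell timestamp as well.
            scan.setTimeRange(0L, System.currentTimeMillis());

            long rows = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    rows++; // aggregate or inspect r here
                }
            }
            System.out.println("rows scanned: " + rows);
        }
    }
}
```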
Regarding your insert rate, HBase can handle it perfectly well with the proper infrastructure (it is better to use the native HBase Java API); you can also buffer your writes to get even better performance.
*Not sure if Impala supports HBase counters.
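For the write path, here is a sketch of buffered puts with the native client, again with made-up table and family names; at a sustained 50K rows/s you would typically also pre-split the table and spread the writes over several threads.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWrites {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Buffer mutations client-side and send them in batches.
        BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("metrics"))
                .writeBufferSize(8 * 1024 * 1024); // 8 MB buffer, tune to taste

        try (Connection connection = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = connection.getBufferedMutator(params)) {

            for (int i = 0; i < 100_000; i++) {
                Put put = new Put(Bytes.toBytes("device42#" + i));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"),
                        Bytes.toBytes("payload-" + i));
                mutator.mutate(put); // buffered, not sent immediately
            }
            // Remaining buffered puts are flushed when the mutator is closed.
        }
    }
}
```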

Writing to multiple HCatalog schemas in single reducer?

I have a set of Hadoop flows that were written before we started using Hive. When we added Hive, we configured the data files as external tables. Now we're thinking about rewriting the flows to output their results using HCatalog. Our main motivation to make the change is to take advantage of the dynamic partitioning.
One of the hurdles I'm running into is that some of our reducers generate multiple data sets. Today this is done with side-effect files, so we write out each record type to its own file in a single reduce step, and I'm wondering what my options are to do this with HCatalog.
One option obviously is to have each job generate just a single record type, reprocessing the data once for each type. I'd like to avoid this.
Another option for some jobs is to change our schema so that all records are stored in a single schema. Obviously this option works well if the data was just broken apart for poor-man's partitioning, since HCatalog will take care of partitioning the data based on the fields. For other jobs, however, the types of records are not consistent.
It seems that I might be able to use the Reader/Writer interfaces to pass a set of writer contexts around, one per schema, but I haven't really thought it through (and I've only been looking at HCatalog for a day, so I may be misunderstanding the Reader/Writer interface).
Does anybody have any experience writing to multiple schemas in a single reduce step? Any pointers would be much appreciated.
Thanks.
Andrew
As best I can tell, the proper way to do this is to use the MultiOutputFormat class. The biggest help for me was the TestHCatMultiOutputFormat test in Hive.
Andrew
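For what it's worth, the configuration pattern looks roughly like the sketch below. The class is org.apache.hive.hcatalog.mapreduce.MultiOutputFormat; the signatures here are written from memory of its javadoc, so treat them as assumptions and use the TestHCatMultiOutputFormat test as the authoritative reference. The database, table names, and aliases are invented.

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat;
import org.apache.hive.hcatalog.mapreduce.MultiOutputFormat.JobConfigurer;
import org.apache.hive.hcatalog.mapreduce.OutputJobInfo;

public class MultiSchemaJob {
    public static void configureOutputs(Job job) throws Exception {
        // The job writes through MultiOutputFormat; each alias below is a
        // separate HCatalog table with its own schema.
        job.setOutputFormatClass(MultiOutputFormat.class);
        JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);

        configurer.addOutputFormat("events", HCatOutputFormat.class,
                NullWritable.class, DefaultHCatRecord.class);
        configurer.addOutputFormat("errors", HCatOutputFormat.class,
                NullWritable.class, DefaultHCatRecord.class);

        // Null partition spec => dynamic partitioning (check your HCatalog version).
        HCatOutputFormat.setOutput(configurer.getJob("events"),
                OutputJobInfo.create("mydb", "events", null));
        HCatOutputFormat.setOutput(configurer.getJob("errors"),
                OutputJobInfo.create("mydb", "errors", null));
        // Each alias also needs its output schema, e.g.:
        // HCatOutputFormat.setSchema(configurer.getJob("events"), eventsSchema);

        configurer.configure(); // serializes the per-alias configs into the job
    }
}
```

Inside the reducer, each record would then be routed with something like MultiOutputFormat.write("events", NullWritable.get(), record, context), picking whichever alias matches the record type.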

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment is heavily invested in storing data in Hive. I currently find myself working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is dumped into 1 or 2 columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a way to get it in a more direct manner?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot more. But it seems like that InputFormat is solely used inside of Hive, using its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.
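One way this kind of metadata is commonly read programmatically, instead of parsing the almost-JSON that queries like DESCRIBE EXTENDED return, is to go to the metastore client directly. Here is a hedged sketch; the database and table names are invented, and the client API has shifted a bit between Hive versions.

```java
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.Table;

public class ScrapeHiveMetadata {
    public static void main(String[] args) throws Exception {
        // Needs hive-site.xml on the classpath so the client can find the metastore.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
            Table table = client.getTable("mydb", "mytable");

            // Columns, excluding the partition keys.
            for (FieldSchema col : table.getSd().getCols()) {
                System.out.println("column: " + col.getName() + " " + col.getType());
            }

            // Partition values and their storage locations.
            List<Partition> partitions = client.listPartitions("mydb", "mytable", (short) -1);
            for (Partition p : partitions) {
                System.out.println("partition " + p.getValues() + " -> " + p.getSd().getLocation());
            }
        } finally {
            client.close();
        }
    }
}
```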
