Hbase / Hadoop Query Help - hadoop

I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResult objects when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?

I think you, like many of us, are making the mistake of treating Bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means, for example, storing many-to-one relationships within a single row where possible. Your queries should return very few rows, each of which may contain many data points.
Perhaps if you told us more about what you are trying to store, we could help you design your schema to match the Bigtable/HBase way of doing things.
For a good rundown of what HBase does differently from a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
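To make the "many-to-one relationships within a single row" idea concrete, here is a minimal sketch in plain Python. It does not use the HBase client API; the table is modelled as a dict from row key to a sparse dict of columns, and the row keys, the "post" and "comment" column families, and the put/get helpers are all hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Toy stand-in for a sparse, column-oriented table:
# row key -> {"family:qualifier": value}. Absent columns cost nothing.
table = defaultdict(dict)

def put(row_key, column, value):
    """Store one cell in the row identified by row_key."""
    table[row_key][column] = value

def get(row_key, family=None):
    """Fetch a whole row, or only the columns of one family."""
    row = table.get(row_key, {})
    if family is None:
        return dict(row)
    prefix = family + ":"
    return {c: v for c, v in row.items() if c.startswith(prefix)}

# One blog post and all of its comments live in a single row,
# instead of a posts table joined to a comments table.
put("post#42", "post:title", "Hello HBase")
put("post#42", "comment:0001", "Nice post")
put("post#42", "comment:0002", "Thanks")
put("post#43", "post:title", "Sparse rows")  # no comments, no empty cells

comments = get("post#42", family="comment")
```

Reading "post#42" touches one row and returns every comment, which is the "few rows, many data points" access pattern described above.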

If you want to access HBase using a query language and a JDBC driver, it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make HBase a little easier to use.

I looked at Hadoop and HBase and, as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered, JDBC-compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't got any further with either of these than reading the documentation, so I can't tell which of them is any good, if any.)

I'd recommend taking a look at the Apache Hive project, which is similar to HBase in the sense that it is a distributed data store built on Hadoop, and which implements a SQL-esque query language.

Thanks for the reply, Sean, and sorry for my late response. I often make the mistake of treating HBase like an RDBMS. So often, in fact, that I've had to rewrite code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables, which is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.

Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by Bigtable, which happens to be exposed by App Engine as its datastore API, so browsing the App Engine docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
Regards,
David

Related

Where can we get resources for Vertica DB?

Can I find any resources, such as a PDF or user guide, for learning Vertica DB?
I am a beginner in Vertica, and I am also looking for the factors that affect performance while loading data.
All of the documentation is posted publicly on my.vertica.com. Data-load performance depends on many factors; you should probably start with Bulk-Loading Data and then review the many COPY parameters. For a general beginner introduction to Vertica, see Getting Started.

Big data case study or use case example

I have read a lot of blogs and articles on how different industries are using Big Data analytics, but most of these articles fail to mention:
What kind of data these companies used, and what the size of the data was.
What kinds of tools and technologies they used to process the data.
What problem they were facing, and how the insight they got from the data helped them resolve it.
How they selected the tools and technologies to suit their needs.
What kinds of patterns they were looking for in the data, and which patterns they identified.
I wonder if someone can answer all of these questions, or provide a link that answers at least some of them.
It would be great if someone could share how the finance industry is making use of Big Data analytics.
Your question is very broad, but I will try to answer from my own experience.
1 - What kind of data did these companies use?
One of the strengths of Hadoop is that it can take data from a very wide range of sources: .csv / .txt files, JSON, MySQL, photos, videos ...
It can contain data about marketing, social networks, server logs ...
What was the size of the data?
There are no rules about that. It can range from 50-60 GB to 1 PB, depending on the data and the company.
2 - What kinds of tools and technologies did they use to process the data?
No rules about that either; it depends on the needs. To organize and process data, they use Hadoop with Hive and Pig. To query data, they want short response times, so they use a NoSQL / in-memory database holding a smaller dataset (refined by Hadoop). In some cases, companies use an ETL tool like Talend in order to go faster.
3 - What was the problem they were facing, and how did the insight they got from the data help them resolve it?
The main issue for companies is the growth of their data. At some point the data becomes too big to process with traditional tools like MySQL, so they start to use Hadoop, for example.
4 - How did they select the tools and technologies to suit their needs?
I think that is an internal matter. Companies choose their tools based on the price of the licence, their own skills, their final needs ...
5 - What kinds of patterns did they identify in the data, and what kinds of patterns were they looking for?
I don't really understand this question.
Hope this helps.
I think getting what you want is a difficult job; you will have to gather the information little by little from different sources. Make sure to visit these links:
A collection of free reports (I am studying the list right now):
http://www.oreilly.com/data/free/
And the famous McKinsey report:
http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx

Cassandra/Hadoop WITH COMPACT STORAGE option: why is it needed, and is it possible to add it to existing tables/CFs?

I'm working on a Hadoop / Cassandra integration, and I have a couple of questions I was hoping someone could help me with.
First, I seem to require the source table/CF to have been created with the WITH COMPACT STORAGE option; otherwise I get an error (can't read keyspace) in my map/reduce code.
I was wondering if this is just how it needs to be?
If so, my second question is: is it possible to add the WITH COMPACT STORAGE option to a pre-existing table, and how? Or am I going to have to re-create the tables and move the data around?
I am using Cassandra 1.2.6.
Thanks in advance,
Gerry
I'm assuming you are using job.setInputFormatClass(ColumnFamilyInputFormat.class);
Instead, try using job.setInputFormatClass(CqlPagingInputFormat.class);
The Mapper input key and value types for this are Map<String, ByteBuffer> and Map<String, ByteBuffer>.
Similarly, if you need to write out to Cassandra, use CqlOutputFormat and the appropriate output types.
See http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive for more info.
@Gerry
"WITH COMPACT STORAGE" is CQL3 syntax for creating a table structure that is compatible with Thrift clients and legacy column families.
Essentially, when you use this option, the table (or should I say column family) is created without using any composites.
You should know that CQL3 tables rely heavily on composites to work.
Now to answer your questions:
I was wondering if this is just how it needs to be?
Probably because your map/reduce code cannot deal with Composites. But I believe in version 1.2.6 of Cassandra, you have all the necessary code to deal with CQL3 tables. Look at classes in package org.apache.cassandra.hadoop.
is it possible/how do I add the WITH COMPACT STORAGE option on to a pre-existing table?
No, it's not possible to change this part of the table structure once the table has been created. You'll need some kind of migration: create a new table and copy the data across.

HBase 0.92.1 Secondary Index Example

I am working with millions of rows and columns in HBase 0.92.1. I want to know how to create a secondary index using a coprocessor. Please give some example programs for this.
Kindly give a program that works with HBase 0.92.1.
There's no single great way to do secondary indexing with HBase. The way you approach the problem will be dictated by your data and your use case. Some good discussion of secondary indexing is located here.
As far as I know, prior to 0.20 the HBase API had an HTableDescriptor that was still writable, so you could call HTableDescriptor.addIndex() to create indices on columns.
An example can be found here.
Indexing then started moving to IHBase; see the JIRA issue here.
To answer your question: in 0.92.1, I don't think there is anything available out of the box yet, so you will have to write the coprocessor yourself. There is, however, a JIRA issue for coprocessor-based secondary indexes whose progress you might want to watch :)
Meanwhile, you can try IdxColumnDescriptor here; looking at the test TestIdxColumnDescriptor.java might also help.
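As a rough illustration of what a hand-written index has to do, here is a minimal sketch in plain Python. It is not the HBase coprocessor API and not the IHBase implementation; both "tables" are dicts, and the indexed column name, the "value|row key" index-key layout, and the put/find helpers are assumptions made for the example. The idea is simply that every write to the data table also updates a second table keyed by the indexed value, which is the job a post-put hook would take on.

```python
# Toy stand-ins for two HBase tables.
data_table = {}    # row key -> {column: value}
index_table = {}   # "value|row key" -> row key

INDEXED_COLUMN = "info:email"  # hypothetical column to index

def put(row_key, column, value):
    """Write a cell and keep the index in step, as a post-put hook might."""
    row = data_table.setdefault(row_key, {})
    if column == INDEXED_COLUMN:
        old = row.get(column)
        if old is not None:
            index_table.pop(f"{old}|{row_key}", None)  # drop the stale entry
        index_table[f"{value}|{row_key}"] = row_key
    row[column] = value

def find_by_indexed_value(value):
    """Prefix-scan the index table, then fetch the matching data rows."""
    prefix = f"{value}|"
    keys = [index_table[k] for k in sorted(index_table) if k.startswith(prefix)]
    return [(k, data_table[k]) for k in keys]

put("u1", "info:email", "a@example.org")
put("u2", "info:email", "b@example.org")
put("u1", "info:email", "c@example.org")  # update: old index entry is removed

stale = find_by_indexed_value("a@example.org")
current = find_by_indexed_value("c@example.org")
```

In a real cluster the two writes are not atomic across tables, so a production design also has to decide how to handle a failure between the data write and the index write.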

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. pagename, useragent, ip, etc.) and a bunch of values (e.g. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (various reasons) and this meant a practical limitation on the size (in MiB) you could make these things.
I'm currently learning the ins and outs of Hadoop and the like.
Running the above query as a map/reduce job on a Bigtable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
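The map/reduce formulation just described can be sketched in plain Python as follows. This is only a single-process illustration of the technique, not Hadoop code; the sample records and the function names are made up, and the field names follow the meta-SQL above.

```python
from collections import defaultdict

# Hypothetical sample records; field names follow the meta-SQL above.
records = [
    {"date": "20090914", "hour": 9, "pagename": "Homepage", "browser": "firefox", "hits": 3, "bytes": 300},
    {"date": "20090914", "hour": 9, "pagename": "Homepage", "browser": "googlebot", "hits": 5, "bytes": 500},
    {"date": "20090914", "hour": 9, "pagename": "Homepage", "browser": "safari", "hits": 1, "bytes": 120},
    {"date": "20090914", "hour": 10, "pagename": "Homepage", "browser": "firefox", "hits": 2, "bytes": 250},
    {"date": "20090914", "hour": 10, "pagename": "About", "browser": "firefox", "hits": 7, "bytes": 700},
    {"date": "20090913", "hour": 9, "pagename": "Homepage", "browser": "firefox", "hits": 4, "bytes": 400},
]

def map_record(rec):
    """Map step: apply the WHERE clause, emit (hour, (hits, bytes))."""
    if (rec["date"] == "20090914" and rec["pagename"] == "Homepage"
            and rec["browser"] != "googlebot"):
        yield rec["hour"], (rec["hits"], rec["bytes"])

def reduce_hour(hour, values):
    """Reduce step: sum hits and bytes for one hour."""
    values = list(values)
    return hour, sum(h for h, _ in values), sum(b for _, b in values)

def run(recs):
    groups = defaultdict(list)  # shuffle: group mapped values by key
    for rec in recs:
        for key, value in map_record(rec):
            groups[key].append(value)
    return [reduce_hour(k, groups[k]) for k in sorted(groups)]

result = run(records)  # [(9, 4, 420), (10, 2, 250)]
```

Every record still passes through the map step, so this is a full scan; that is why the approach suits batch jobs better than interactive queries.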
Can you run a query like the one I showed above (or at least one with the same output) on a Bigtable kind of system in 'real time' (i.e. via a user interface where the user gets their answer ASAP) instead of in batch mode?
If not, what is the appropriate technology for doing something like this in the realm of Bigtable/Hadoop/HBase/Hive and the like?
It's even kind of been done (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project, "mroll", but it doesn't have anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it onto appropriate HBase qualifiers. For more detail, visit the site below.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
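One plausible reading of "pre-aggregating and mapping onto qualifiers" is sketched below in plain Python. This is not the design from the linked post, which may differ; the "date|pagename" row-key layout, the "hits:HH" / "bytes:HH" qualifier names, and both helper functions are assumptions for illustration. Totals are updated at write time, so the hourly query becomes a single-row read instead of a scan.

```python
from collections import defaultdict

# Toy stand-in for a counter table: row key -> {qualifier: running total}.
cube = defaultdict(lambda: defaultdict(int))

def record_hit(date, hour, pagename, browser, hits, nbytes):
    """Update pre-aggregated totals at write time."""
    if browser == "googlebot":  # filter fixed in advance, baked into the aggregate
        return
    row_key = f"{date}|{pagename}"
    cube[row_key][f"hits:{hour:02d}"] += hits
    cube[row_key][f"bytes:{hour:02d}"] += nbytes

def hourly_totals(date, pagename):
    """Answer the query by reading one row; no scan over raw events."""
    row = cube.get(f"{date}|{pagename}", {})
    hours = sorted({q.split(":")[1] for q in row})
    return [(int(h), row[f"hits:{h}"], row[f"bytes:{h}"]) for h in hours]

record_hit("20090914", 9, "Homepage", "firefox", 3, 300)
record_hit("20090914", 9, "Homepage", "googlebot", 5, 500)
record_hit("20090914", 9, "Homepage", "safari", 1, 120)
record_hit("20090914", 10, "Homepage", "firefox", 2, 250)

totals = hourly_totals("20090914", "Homepage")
```

The trade-off is that the dimensions and filters must be known when the data is written; a query on a dimension that was not pre-aggregated still needs a scan or a batch job.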
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir gave an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery scales out automatically on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce, but it is geared towards high-speed table scans like the one you described.
