HBase 0.92.1 Secondary Index Example - hadoop

I am working with millions of rows and columns in HBase 0.92.1. Now I want to know how to create a secondary index using a coprocessor. Please give some example programs for this.
Kindly give a program that works with HBase 0.92.1.

There's no single great way to do secondary indexing with HBase. The way you approach the problem will be dictated by your data and your use case. Some good discussion of secondary indexing is located here.

As far as I know, prior to 0.20 the HBase API had an HTableDescriptor that was still writable, so you could call HTableDescriptor.addIndex() to create indices against the columns.
An example can be found here.
Then indexing moved to IHBase; see the JIRA story here.
To answer your question: in 0.92.1 I don't think there is anything out of the box yet, so you will have to write the coprocessor yourself. But there is a JIRA story for a coprocessor-based secondary index whose progress you might want to watch :)
Meanwhile you can try IdxColumnDescriptor here; looking at the test TestIdxColumnDescriptor.java might also help.
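If you do end up writing the coprocessor yourself, the common pattern is a RegionObserver that intercepts writes to the data table and mirrors them into a separate index table keyed by the indexed value. Below is a minimal, untested sketch of that idea against the 0.92-era observer API; the table name "index_table", the column family and qualifier names are made up for the example, and the prePut signature changed in later HBase releases.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Mirrors every write of data:email into index_table, keyed by the email value,
// so row keys can later be looked up by email with a Get/Scan on index_table.
public class EmailIndexObserver extends BaseRegionObserver {

    private static final byte[] DATA_CF = Bytes.toBytes("data");       // hypothetical family
    private static final byte[] EMAIL = Bytes.toBytes("email");        // hypothetical qualifier
    private static final byte[] INDEX_TABLE = Bytes.toBytes("index_table");
    private static final byte[] INDEX_CF = Bytes.toBytes("rows");

    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, boolean writeToWAL) throws IOException {
        List<KeyValue> kvs = put.get(DATA_CF, EMAIL);
        if (kvs.isEmpty()) {
            return; // this Put does not touch the indexed column
        }
        byte[] indexedValue = kvs.get(0).getValue();
        // index row key = indexed value; qualifier = original row key; empty value
        Put indexPut = new Put(indexedValue);
        indexPut.add(INDEX_CF, put.getRow(), new byte[0]);
        HTableInterface indexTable = ctx.getEnvironment().getTable(INDEX_TABLE);
        try {
            indexTable.put(indexPut);
        } finally {
            indexTable.close();
        }
    }
}

You would register the observer on the data table (via the table descriptor's coprocessor setting), handle deletes and updates the same way, and decide how to cope with a failure between the data write and the index write; that last part is the hard bit of any home-grown secondary index.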

Related

Elasticsearch to index RDBMS data

These are three simple questions for which it was surprisingly hard to find definitive answers.
Does Elasticsearch support indexing data in RDBMS tables (Oracle/SQL Server/Informix) out of the box?
If yes, can you please point me to documentation on how to do it?
If not, what are the alternative approaches (plugins like Rivers are deprecated) with a good reputation?
I'm surprised there isn't any solid answer for this yet. So here's the solution: Logstash directly gives us the ability to push data from an RDBMS into Elasticsearch.
Here's a link to a tutorial that tells you how to go about it. Briefly (all details in the first link), you simply need a JDBC driver for the relational database you'll be using (Postgres, MySQL, etc.) and a config file specifying your input as the relational database and your output as Elasticsearch. You can also specify a schedule so the data keeps updating at regular intervals.
Here's the article which mentions the configuration and gets you started (See Example 2): https://www.elastic.co/blog/logstash-jdbc-input-plugin
Here's the article which tells you how to configure the Cronjob as such: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html#_scheduling
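To make that concrete, here is a minimal sketch of such a pipeline config. The driver path, connection string, credentials, SQL statement, and index name are placeholders to replace with your own; the option names (jdbc_connection_string, statement, schedule, and so on) are the standard logstash-input-jdbc settings covered in the links above.

input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "secret"
    statement => "SELECT id, name, updated_at FROM products"
    schedule => "*/5 * * * *"   # cron-style: re-run the query every 5 minutes
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "products"
    document_id => "%{id}"      # one ES document per primary key, updated in place
  }
}

With schedule set, Logstash re-runs the statement on a cron-like interval, and document_id keeps Elasticsearch updating the same document for each primary key instead of creating duplicates.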

Suggestions on technology/design for this BIGDATA use case

I am new to big data technologies and design, so I'm looking for help from the Java world.
I have a concept of tags and tag combinations.
For example, U.S.A and Pen are two tags, and if they come together in some definition then a tagcombination (U.S.A-Pen) is registered for that.
tags (U.S.A, Pen, Pencil, India, Shampoo)
tagcombinations (U.S.A-Pen, India-Pencil, U.S.A-Pencil, India-Pen, India-Pen-Shampoo)
millions of tags
billions of tagcombinations
one tagcombination generally has 2-8 tags
every day we get lakhs (hundreds of thousands) of new tagcombinations to write
daily crores (tens of millions) of queries to find matching combinations by a set of tags
Queries need to support:
In how many tagcombination IDs does one tag or a set of tags appear?
If I query for Pen, India then it should return two tagcombinations (India-Pen, India-Pen-Shampoo). The query will be fired by the application in real time.
Please suggest a solution that is distributed, has a Java client, and can
handle the scale of data I am looking at.
I have already tried Cassandra but was not able to conclude that it is the right match for my problem.
Thanks
Naresh
I suggest you look into the Apache Lucene project:
http://lucene.apache.org/
You won't be able to use Cassandra directly for this, but if you store your data in Cassandra you can use Solr to add extra indexes on top of it. DataStax has a bundled solution called DataStax Enterprise that ships Cassandra and Solr together:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise
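To show why Lucene (or Solr, which sits on top of it) fits this query shape, here is a minimal sketch: one Lucene document per tag combination, one indexed term per tag, and a boolean AND query asking "which combinations contain all of these tags?". It targets a recent Lucene API, so class names may differ slightly between versions, and the field names and IDs are invented for the example.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TagCombinationIndex {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();   // in-memory, for the example only
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            index(writer, "tc-1", "USA", "Pen");
            index(writer, "tc-2", "India", "Pen");
            index(writer, "tc-3", "India", "Pen", "Shampoo");
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // "Which combinations contain ALL of these tags?" -> one MUST clause per tag
            BooleanQuery.Builder q = new BooleanQuery.Builder();
            q.add(new TermQuery(new Term("tag", "India")), BooleanClause.Occur.MUST);
            q.add(new TermQuery(new Term("tag", "Pen")), BooleanClause.Occur.MUST);
            for (ScoreDoc hit : searcher.search(q.build(), 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("id"));   // tc-2, tc-3
            }
        }
    }

    // One Lucene document per tag combination; each tag is a separate indexed term.
    private static void index(IndexWriter writer, String id, String... tags) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        for (String tag : tags) {
            doc.add(new StringField("tag", tag, Field.Store.NO));
        }
        writer.addDocument(doc);
    }
}

At the scale you describe (billions of combinations) a single index won't do, which is exactly where a distributed layer such as Solr or DataStax Enterprise on top of the same document model comes in.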

Cassandra/Hadoop WITH COMPACT STORAGE option. Why is it needed, and is it possible to add it to existing tables/CFs?

I'm working on a Hadoop / Cassandra integration and I have a couple of questions I was hoping someone could help me with.
First, I seem to require the source table/CF to have been created with the option WITH COMPACT STORAGE, otherwise I get a "can't read keyspace" error in my map/reduce code.
I was wondering if this is just how it needs to be?
And if so, my second question is: is it possible (and how) to add the WITH COMPACT STORAGE option to a pre-existing table, or am I going to have to re-create the tables and move the data around?
I am using Cassandra 1.2.6
thanks in advance
Gerry
I'm assuming you are using job.setInputFormatClass(ColumnFamilyInputFormat.class);
Instead, try using job.setInputFormatClass(CqlPagingInputFormat.class);
The Mapper input key and value types for this are Map<String, ByteBuffer> and Map<String, ByteBuffer>.
Similarly, if you need to write out to Cassandra, use CqlPagingOutputFormat and the appropriate output types.
See http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive for more info.
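For reference, here is a rough sketch of that job setup against Cassandra 1.2.x. The keyspace, table, and host are placeholders, and the ConfigHelper/CqlConfigHelper calls are the ones in org.apache.cassandra.hadoop; double-check the exact method names against your Cassandra version.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CqlJobSetup {
    public static Job buildJob(Configuration conf) throws Exception {
        Job job = new Job(conf, "cql3-table-read");
        job.setJarByClass(CqlJobSetup.class);
        job.setMapperClass(MyMapper.class);

        // CQL3 tables (no COMPACT STORAGE needed) are read through CqlPagingInputFormat
        job.setInputFormatClass(CqlPagingInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_keyspace", "my_table");
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
        return job;
    }

    // Keys and values both arrive as column-name -> value maps
    public static class MyMapper
            extends Mapper<Map<String, ByteBuffer>, Map<String, ByteBuffer>, Text, LongWritable> {
        @Override
        protected void map(Map<String, ByteBuffer> key, Map<String, ByteBuffer> columns,
                           Context context) throws IOException, InterruptedException {
            // decode the ByteBuffers here, e.g. with org.apache.cassandra.utils.ByteBufferUtil
        }
    }
}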
@Gerry
The "WITH COMPACT STORAGE" thing is CQL3 syntax to create tables structure compatible with Thrift clients and legacy column families.
Essentially, when using this option the table, or should I say column family, is created without using any composites.
You should know that CQL3 tables rely heavily on composites to work.
Now to answer your questions:
I was wondering if this is just how it needs to be?
Probably because your map/reduce code cannot deal with composites. But I believe that in version 1.2.6 of Cassandra you have all the necessary code to deal with CQL3 tables; look at the classes in the package org.apache.cassandra.hadoop.
is it possible/how do I add the WITH COMPACT STORAGE option on to a pre-exsting table?
No, it's not possible to change a table's structure once it is created. You'll need some kind of migration.

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now, an OLAP cube the way I used it is simply a large table (OK, it was stored a bit smarter than that) where each row is basically a measurement or an aggregated set of measurements. Each measurement has a bunch of dimensions (e.g. which page name, user agent, IP, etc.) and a bunch of values (e.g. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT hour, SUM(hits), SUM(bytes)
FROM MyCube
WHERE date='20090914' AND pagename='Homepage' AND browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (for various reasons), and this put a practical limit on how big (in MiB) you could make them.
I'm currently learning the ins and outs of Hadoop and the likes.
Running the above query as a mapreduce on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
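Something like this rough, untested sketch is what I have in mind, using HBase's TableMapper; the column family "d" and the qualifier names are just placeholders for this example, and the date filter would presumably be pushed into the Scan's row-key range:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class HourlyHitsMapper extends TableMapper<Text, LongWritable> {
    private static final byte[] D = Bytes.toBytes("d"); // hypothetical column family

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
        // the WHERE clause, applied map-side
        String browser = Bytes.toString(columns.getValue(D, Bytes.toBytes("browser")));
        String page = Bytes.toString(columns.getValue(D, Bytes.toBytes("pagename")));
        if ("googlebot".equals(browser) || !"Homepage".equals(page)) {
            return;
        }
        byte[] hits = columns.getValue(D, Bytes.toBytes("hits"));
        String hour = Bytes.toString(columns.getValue(D, Bytes.toBytes("hour")));
        if (hits == null || hour == null) {
            return;
        }
        context.write(new Text(hour), new LongWritable(Bytes.toLong(hits))); // 'hour' is the key
    }

    // GROUP BY hour / SUM(hits) happens in the reducer
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text hour, Iterable<LongWritable> hits, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable h : hits) {
                total += h.get();
            }
            context.write(hour, new LongWritable(total));
        }
    }
}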
Can you run a query like I showed above (or at least one with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface where the user gets their answer ASAP) instead of in batch mode?
If not; what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the likes?
It's even kind of been done (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A Google search turned up a Google Code project "mroll", but it doesn't have anything except contact info (no code, nothing). Still, you might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low-latency OLAP in HBase by pre-aggregating a SQL query and mapping it into appropriate HBase qualifiers. For more detail, visit the site below.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir made an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery does automatic scale-out on the back end, which gives interactive response times. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce, but it is geared towards high-speed table scans like what you described.

Hbase / Hadoop Query Help

I'm working on a project with a friend that will use HBase to store its data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResults when, in SQL land, I could write a simple query. Am I missing something? Or is HBase missing something?
I think you, like many of us, are making the mistake of treating bigtable and HBase like just another RDBMS when it's actually a column-oriented storage model meant for efficiently storing and retrieving large sets of sparse data. This means storing, ideally, many-to-one relationships within a single row, for example. Your queries should return very few rows but contain (potentially) many datapoints.
Perhaps if you told us more about what you were trying to store, we could help you design your schema to match the bigtable/HBase way of doing things.
For a good rundown of what HBase does differently than a "traditional" RDBMS, check out this awesome article: Matching Impedance: When to use HBase by Bryan Duxbury.
If you want to access HBase using a query language and a JDBC driver it is possible. Paul Ambrose has released a library called HBQL at hbql.com that will help you do this. I've used it for a couple of projects and it works well. You obviously won't have access to full SQL, but it does make it a little easier to use.
I looked at Hadoop and HBase and, as Sean said, I soon realised it didn't give me what I actually wanted, which was a clustered JDBC-compliant database.
I think you could be better off using something like C-JDBC or HA-JDBC, which seem more like what I was after. (Personally, I haven't got further with either of these than reading the documentation, so I can't tell which of them is any good, if any.)
I'd recommend taking a look at the Apache Hive project, which is similar to HBase (in the sense that it's a distributed database) and implements a SQL-esque language.
Thanks for the reply Sean, and sorry for my late response. I often make the mistake of treating HBase like a RDBMS. So often in fact that I've had to re-write code because of it! It's such a hard thing to unlearn.
Right now we have only 4 tables. Which, in this case, is very few considering my background. I was just hoping to use some RDBMS functionality while mostly sticking to the column-oriented storage model.
Glad to hear you guys are using HBase! I'm not an expert by any stretch of the imagination, but here are a couple of things that might help.
HBase is based on / inspired by BigTable, which happens to be exposed by AppEngine as their db api, so browsing their docs should help a great deal if you're working on a webapp.
If you're not working on a webapp, the kind of iterating you're describing is usually handled via map/reduce (don't emit the values you don't want). Skipping over values using iterators virtually guarantees your application will have bottlenecks with HBase-sized data sets. If you find you're still thinking in SQL, check out Cloudera's Pig tutorial and Hive tutorial.
Basically the whole HBase/SQL mental difference (for non-webapps) boils down to "Send the computation to the data, don't send the data to the computation" -- if you keep that in mind while you're coding you'll do fine :-)
Regards,
David
