Hbase Schema design - hadoop

I have to design an Hbase table to store users information, this information is targeted for social networking, like: age, sex, education, hobbies, read books, traveled countries ...
NOTE: we could add more information in future, we dont know all information now.
for example:
name: Olha, age: 25, sex: female, education: bachelor Information technology, education: master computer science, hobby: basket ball, hobby: ping pong, book: gone with the wind, book: Davinci code, language: english, language: french, Country: Germany
The main idea is to be able to do queries like:
return all people who are female, age: 22 years old, speak: english, speak: french, read the book gone with the wind, like ping pong, like basket ball and German.
so you can add any criteria to the search query.
what is your suggestion about the HBASE table schema ( row key, column family ... ) that optimized this kind of search queries ( taking into consideration that we will add more information in future )
what is the best way to write such query ( scan, get, MapReduce ).
Thank you

I would agree with Ian Varley that Solr/Lucene and it's faceted queries and joins allow you to pivot the data in the way you want to see it - however - I also think your question might be a "counting" question or a "membership" question....
It sounds like you are after a list of people who match (N) attributes - the problem you have is that for each attribute you could have millions of user ids?
HBase is a good fit when all you are trying to do is compute intersection/union sizes.. Your key/value pairs can be put into Hbase, and you can "encode" the IDs of the users into either a Bloom Filter and HyperLogLog. Trading speed for accuracy and memory. Likely running map/reduce style jobs hourly/nightly on click-streams of log aggregation of some type.
Others have done this in the advertising space and online space for exactly the type of queries you are running ("find people who like red bull and pop-tarts that live in florida")
References
Contextual Advertising using Apache Hive and Amazon EMR http://aws.amazon.com/articles/2855
Scaling Distributed Counters: http://whynosql.com/scaling-distributed-counters/
Google: Sharding counters https://developers.google.com/appengine/articles/sharding_counters
Distributed Counter Performance in HBase - Part 1 http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1
Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Realtime Analytics with Hadoop and HBase - http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase
Log Event Processing with HBase http://tellapart.com/log-event-processing-with-hbase
Clickstream Analytics at BazaarVoice http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics
Realtime Analytics with HBase - http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version

This isn't a great use of HBase, in the sense that this is exactly the kind of thing that search indexes (like Lucene) are good for.
One normal schema to store users and their information might look a lot like a relational database, in that you'd have 1 row per user, and store all the attributes as columns & values (age=22, language=french, etc). This works well for the extensibility you mention (you don't need to change any schema in order to store new attributes). With this schema, you could look up any one user (and all of their attributes) by the unique user id. That'd be blazingly fast to do, no matter how many users you have.
However, with that schema, if you want to search in the way you describe ("return all users whose age is 22"), every single query is going to end up being a scan of the entire table, because HBase only allows you to access things via their primary key; it does not have secondary indexing of any kind. That will be extremely inefficient (picture having to scan a million rows every time you want to do any single query).
How to fix this? You could "reverse" the ordering of the data and put the values in the row key and then point to all the users with that value. For example, the row key could be "age:22", and then in the columns of the row could be all the userids that are age 22. This is problematic for a lot of reasons, not least of which is that it will be extremely expensive and tricky to make updates. But it would perform well for those specific queries.
The trick? That's exactly what a search index (like Lucene) does, and it does it much better than you could by rolling your own with HBase. That sounds like the tool you want to be using here.
If you must use HBase (as you say, since it's a research project) it might be worth looking into using HBase and Lucene together; google that for pointers.

Related

Elassandra data modeling: When to create a secondary index and when not to

I'm not an absolute expert of Cassandra, but what I know (correct me if I'm wrong) is that creating a secondary index for all the fields in a data model is an anti-pattern.
I'm using Elassandra and my data model looks like this :
A users object that represents a user, with : userID, name, phone, e-mail, and all kind of infos on users (say these users are selling things)
A sales object that represent a sale made by the user, with : saleID, userID, product name, price, etc. (There can be a lot more fields)
Given that I want to make complex searches on the user (search by phone, search by e-mail, etc etc) only on name, e-mail and phone, is it a good idea to create the 3 following tables from this data model :
"User core" table with only userID, name, phone and e-mail (fields for search) [Table fully indexed and mapped in Elasticsearch]
"User info" table with userID + the other infos [Table not indexed or mapped in Elasticsearch]
"Sales" table with userID, saleID, product name, price, etc. [Table not indexed or mapped in Elasticsearch]
I see at least one advantage : Any kind of indexation (or reindexation when changes happen) and associated costs will happen only if there is a change in the "User core" table, which should not change too frequently.
Also, if I need to get all other infos (User other infos or sales), I can just make 2 queries: 1 in "User core" to get the userID and 1 in the other table (with the userID) to get the other data.
But I'm not sure this is a good pattern, or maybe I should not worry about secondary indexation and just basically index any other table ?
In a more summarized way, what are the key reasons to chose - a secondary index like Elasticsearch in Elassandra - VS - denormalizing tables and use partition&clustering keys - ?
Please feel free to ask if you need more examples on my use case.
You should not normalise the tables when you're using Cassandra. The most important aspect of data modelling for Cassandra is to design one table for each application query. To put it another way, you should always denormalise your tables.
After you've modelled a table for each query, index the table with Elassandra which contains the most columns that you need to query.
It's important to note that Elassandra is not a magic bullet. In a lot of cases, you do not need to index the tables if you have modelled them against your application queries correctly.
The use case for Elassandra is to take advantage of features such as free-form text search, faceting, boosting, etc., but it will not be as performant as a native table. The fact is index lookups require more "steps" than a straight-forward single-partition Cassandra read. Of course, YMMV depending on your use case and access patterns. Cheers!
I dont think Erick´s answer is fully correct in case of Elassandra.
It is correct that native Cassandra queries will outperform elastic and in pure cassandra you should wrap your tables around the queries.
But if you prefer flexibility over performance (and this is why you mainly choose to use elassandra), you can use cassandra as primary storage and benefit from cassandra´s replication performance and index the tables for search in elastic.
This enables you to be flexible on the search side and still be sure not to lose data, in case something goes wrong on the elastic side.
In fact on production we use a combination of both: tables have its partition / clustering keys and are indexed in elastic (when necessary). In backend you can decide, if you can query by cassandra keys or if elastic is required.

Using Hadoop for storing stock market tick data

I'm having fun learning about Hadoop and the various projects around it and currently have 2 different strategies I'm thinking about for building a system to store a large collection of market tick data, I'm just getting started with both Hadoop/HDSF and HBase but hoping someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and requirements with some query and data usage use cases and lastly my current thinking about the best approach from the little documentation I have read. It is an open ended question and I'll gladly like any answer that is insightful and accept the best one, feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to leverage the data store for historical back testing of systems, historical data charting and future data mining. Once stored, data will always be read-only, fast data access is desired but not a must-have when back testing.
Static Schema - Very Simple, I want to capture 3 types of messages from the feed:
Timestamp including date,day,time
Quote including Symbol,timestamp,ask,askSize,bid,bidSize,volume....(About 40 columns of data)
Trade including Symbol,timestamp,price,size,exchange.... (About 20 columns of data)
Data Insert Use Cases - Either from a live market stream of data or lookup via broker API
Data Query Use Cases - Below demonstrates how I would like to logically query my data.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
Get me the number of trades for these 50 symbols for each day over the last 90 days.
The Holy Grail - Can MapReduce be used for uses cases like these below??
Generate meta-data from the raw market data through distributed agents. For example, Write a job that will compute the average trading volume on a 1 minute interval for all stocks and all sessions stored in the database. Create the job to have an agent for each stock/session that I tell what stock and session it should compute this value for. (Is this what MapReduce can do???)
On the classpath of the agents can I add my own util code so that the use case above for example could publish its value into a central repo or Messaging server? Can I deploy an agent as an OSGI bundle?
Create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?
High Frequency Trading
I'm also interested if anyone can share some experience using Hadoop in the context of high frequency trading systems. Just getting into this technology my initial sense is Hadoop can be great for storing and processing large volumes of historic tick data, if anyone is using this for real-time trading I'd be interested in learning more! - Duncan Krebs
Based of my understanding of your requirements, Hadoop would be really good solution to store your data and run your queries on it using Hive.
Storage: You can store the data in Hadoop in a directory structure like:
~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
Inside the hours folder, the data specific to that hour of the day can reside.
One advantage of using such structure is that you can create external tables in Hive over this data with your partitions on years, months, days and hours. Something like this:
Create external table stock_data (schema) PARTITIONED BY (years bigint, months bigint, days bigint, hours int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION
'~/stock_data'
Coming to the queries part, once you have the data stored in the format mentioned above you can easily run simple queries.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
select * from stock_data where stock = 'GOOG' and days = 20140922
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901)
You can run any such aggregation queries once in a day and use the output to come up with the metrics before pre-market trading. Since Hive internally runs mapreduce these queries won't be very fast.
In order to get faster results, you can use some of the in memory projects like Impala or Spark. I have myself used Impala to run queries on my hive tables and I have seen a major improvement in the run time for my queries (around 40x). Also you wouldn't need to make any changes to the structure of the data.
Data Insert Use Cases : You can use tools like Flume or Kafka for inserting data in real time to Hadoop (and thus to the hive tables). Flume is linearly scalable and can also help in processing events on the fly while transferring.
Overall, a combination of multiple big data technologies can provide a really decent solution to the problem you proposed and these solution would scale to huge amounts of data.

HBase schema/key for real-time analytics solution

We are looking at using HBase for real-time analytics.
Prior to HBase, we will be running a Hadoop Map Reduce job over our log files and aggregating the data, and storing the fine-grained aggregate results in HBase to enable real-time analytics and queries on the aggregated data. So the HBase tables will have pre-aggregated data (by date).
My question is: how to best design the schema and primary key design for the HBase database to enable fast but flexible queries.
For example, assume that we store the following lines in a database:
timestamp, client_ip, url, referrer, useragent
and say our map-reduce job produces three different output fields, each of which we want to store in a separate "table" (HBase column family):
date, operating_system, browser
date, url, referrer
date, url, country
(our map-reduce job obtains the operating_system, browser and country fields from the user agent and client_ip data.)
My question is: how can we structure the HBase schema to allow fast, near-realtime and flexible lookups for any of these fields, or a combination? For instance, the user must be able to specify:
operating_system by date ("How many iPad users in this date range?")
url by country and date ("How many users to this url from this country for the last month?")
and basically any other custom query?
Should we use keys like this:
date_os_browser
date_url_referrer
date_url_country
and if so, can we fulfill the sort of queries specified above?
You've got the gist of it, yes. Both of your example queries filter by date, and that's a natural "primary" dimension in this domain (event reporting).
A common note you'll get about starting your keys with a date is that it will cause "hot spotting" problems; the essence of that problem is, date ranges that are contiguous in time will also be contiguous servers, and so if you're always inserting and querying data that happened "now" (or "recently"), one server will get all the load while the others sit idle. This doesn't sound like it'd be a huge concern on insert, since you'll be batch loading exclusively, but it might be a problem on read; if all of your queries go to one of your 20 servers, you'll effectively be at 5% capacity.
OpenTSDB gets around this by prepending a 3-byte "metric id" before the date, and that works well to spray updates across the whole cluster. If you have something that's similar, and you know you always (or usually) include a filter for it in most queries, you could use that. Or you could prepend a hash of some higher order part of the date (like "month") and then at least your reads would be a little more spread out.

Hbase Map reduce and Index

I am crawling different industry data and storing the data into single hbase table. For example I am crawling Electronics and Computer industries and stored in a table called 'industry_tbl'. Now I want to run a map reduce on the sets of data namely for Electronics and computer industries and produce the reducer output with the different sets of data collected but currently hbase is taking the entire data of both the industries and giving me the reduced results which I cant differentiate by Industries.
Any Help or idea on how to solve this?
Include industry as part of the key you emit in the mapper.
Make industry the most-significant part of your hbase key and use pass that to the SCAN you define for the map-reduce
You could also do a Column Scan on the Hbase Table.
In order to do that, put all the information for a particular industry under a particular industry column family.
For example, my industry table would probably look like this.
For a given row: cf1-science cf2-technology etc.
This way, your industry data would be closely partitioned in certain regions, bringing down your query time.
Now I would just query by using the Scan api and include a particular column family to scan.
So the scan would return me only the details pertaining to a particular industry.
The row in this case would still remain the same as you would have had it previously.
Hope this explanation helps.

Can OLAP be done in BigTable?

In the past I used to build WebAnalytics using OLAP cubes running on MySQL.
Now an OLAP cube the way I used it is simply a large table (ok, it was stored a bit smarter than that) where each row is basically a measurement or and aggregated set of measurements. Each measurement has a bunch of dimensions (i.e. which pagename, useragent, ip, etc.) and a bunch of values (i.e. how many pageviews, how many visitors, etc.).
The queries that you run on a table like this are usually of the form (meta-SQL):
SELECT SUM(hits), SUM(bytes),
FROM MyCube
WHERE date='20090914' and pagename='Homepage' and browser!='googlebot'
GROUP BY hour
So you get the totals for each hour of the selected day with the mentioned filters.
One snag was that these cubes usually meant a full table scan (various reasons) and this meant a practical limitation on the size (in MiB) you could make these things.
I'm currently learning the ins and outs of Hadoop and the likes.
Running the above query as a mapreduce on a BigTable looks easy enough:
Simply make 'hour' the key, filter in the map and reduce by summing the values.
Can you run a query like I showed above (or at least with the same output) on a BigTable kind of system in 'real time' (i.e. via a user interface and the user get's their answer ASAP) instead of batch mode?
If not; what is the appropriate technology to do something like this in the realm of BigTable/Hadoop/HBase/Hive and the likes?
It's even kind of been done (kind of).
LastFm's aggregation/summary engine: http://github.com/zohmg/zohmg
A google search turned up a google code project "mroll" but it doesn't have anything except contact info (no code, nothing). Still, might want to reach out to that guy and see what's up. http://code.google.com/p/mroll/
We managed to create low latency OLAP in HBase by preagragating a SQL query and mapping it into appropriate Hbase qualifiers. For more detail visit below site.
http://soumyajitswain.blogspot.in/2012/10/hbase-low-latency-olap.html
My answer relates to HBase, but applies equally to BigTable.
Urban Airship open-sourced datacube, which I think is close to what you want. See their presentation here.
Adobe also has a couple of presentations (here and here) on how they do "low-latency OLAP" with HBase.
Andrei Dragomir made an interesting talk about how Adobe performs OLAP functionality with M/R and HBase.
Video: http://www.youtube.com/watch?v=5U3EnfiKs44
Slides: http://hstack.org/hbasecon-low-latency-olap-with-hbase/
If you are looking for a table-scan approach, have you considered Google BigQuery? BigQuery does automatic scale-out on the back-side that gives interactive response. There is a good session by Jordan Tigani from the 2012 Google I/O event that explains some of the internals.
http://www.youtube.com/watch?v=QI8623HlYd4
It's not MapReduce but it is geared towards high-speed table scan like what you described.

Resources