HBase schema/key for real-time analytics solution - hadoop

We are looking at using HBase for real-time analytics.
Prior to HBase, we will be running a Hadoop Map Reduce job over our log files and aggregating the data, and storing the fine-grained aggregate results in HBase to enable real-time analytics and queries on the aggregated data. So the HBase tables will have pre-aggregated data (by date).
My question is: how to best design the schema and primary key design for the HBase database to enable fast but flexible queries.
For example, assume that we store the following lines in a database:
timestamp, client_ip, url, referrer, useragent
and say our map-reduce job produces three different output fields, each of which we want to store in a separate "table" (HBase column family):
date, operating_system, browser
date, url, referrer
date, url, country
(our map-reduce job obtains the operating_system, browser and country fields from the user agent and client_ip data.)
My question is: how can we structure the HBase schema to allow fast, near-realtime and flexible lookups for any of these fields, or a combination? For instance, the user must be able to specify:
operating_system by date ("How many iPad users in this date range?")
url by country and date ("How many users to this url from this country for the last month?")
and basically any other custom query?
Should we use keys like this:
date_os_browser
date_url_referrer
date_url_country
and if so, can we fulfill the sort of queries specified above?

You've got the gist of it, yes. Both of your example queries filter by date, and that's a natural "primary" dimension in this domain (event reporting).
A common note you'll get about starting your keys with a date is that it will cause "hot spotting" problems; the essence of that problem is, date ranges that are contiguous in time will also be contiguous servers, and so if you're always inserting and querying data that happened "now" (or "recently"), one server will get all the load while the others sit idle. This doesn't sound like it'd be a huge concern on insert, since you'll be batch loading exclusively, but it might be a problem on read; if all of your queries go to one of your 20 servers, you'll effectively be at 5% capacity.
OpenTSDB gets around this by prepending a 3-byte "metric id" before the date, and that works well to spray updates across the whole cluster. If you have something that's similar, and you know you always (or usually) include a filter for it in most queries, you could use that. Or you could prepend a hash of some higher order part of the date (like "month") and then at least your reads would be a little more spread out.

Related

How database sharding based on timestamp and pkid is better in performance?

I was checking a system design of twitter.
Sharding based on UserID: We can try storing all the data of a user on one server. While storing, we can pass the UserID to our hash function that will map the user to a database server where we will store all of the user’s tweets, favorites, follows, etc. While querying for tweets/follows/favorites of a user, we can ask our hash function where can we find the data of a user and then read it from there. This approach has a couple of issues:
What if a user becomes hot? There could be a lot of queries on the server holding the user. This high load will affect the performance of our service.
Over time some users can end up storing a lot of tweets or having a lot of follows compared to others. Maintaining a uniform distribution of growing user data is quite difficult.
To recover from these situations either we have to repartition/redistribute our data or use consistent hashing.
Sharding based on TweetID: Our hash function will map each TweetID to a random server where we will store that Tweet. To search for tweets, we have to query all servers, and each server will return a set of tweets. A centralized server will aggregate these results to return them to the user. Let’s look into timeline generation example; here are the number of steps our system has to perform to generate a user’s timeline:
Our application (app) server will find all the people the user follows.
App server will send the query to all database servers to find tweets from these people.
Each database server will find the tweets for each user, sort them by recency and return the top tweets.
App server will merge all the results and sort them again to return the top results to the user.
This approach solves the problem of hot users, but, in contrast to sharding by UserID, we have to query all database partitions to find tweets of a user, which can result in higher latencies.
We can further improve our performance by introducing cache to store hot tweets in front of the database servers.
Sharding based on Tweet creation time: Storing tweets based on creation time will give us the advantage of fetching all the top tweets quickly and we only have to query a very small set of servers. The problem here is that the traffic load will not be distributed.
What if we can combine sharding by TweetID and Tweet creation time? If we don’t store tweet creation time separately and use TweetID to reflect that, we can get benefits of both the approaches. This way it will be quite quick to find the latest Tweets. For this, we must make each TweetID universally unique in our system and each TweetID should contain a timestamp too.
We can use epoch time for this. Let’s say our TweetID will have two parts: the first part will be representing epoch seconds and the second part will be an auto-incrementing sequence. So, to make a new TweetID, we can take the current epoch time and append an auto-incrementing number to it. We can figure out the shard number from this TweetID and store it there. How does this approach helps better than the above approaches?

HBASE versus HIVE: What is more suitable for data that is uniquely defined by multiple fields?

We are building a DB infrastructure on the top of Hadoop systems. We will be paying a vendor to do that and I do not think we are getting the right answers from the first vendor. So, I need the help from some experts to validate if I am right or I miss something
1. We have about 1600 fields in the data. A unique record is identified by those 1600 records
We want to be able to search records in a particular timeframe
(aka, records for a given time frame)
There are some fields that change overtime (monthly)
The vendor stated that the best way to go is HBASE and that they have to choices: (1) make the search optimize for machine learning (2) make adhoc queries.
The (1) will require a concatenate key with all the fields of interest. The key length will determine how slow or fast the search will run.
I do not think this is correct.
1. We do not need to use HBASE. We can use HIVE
2. We do not need to concatenate field names. We can translate those to a number and have a key as a number
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on what is your use case. In simpler terms, Hive alone is not good when it comes to interactive queries however one of the best when it comes to analytics.
Hbase on the other hand, is really good for interactive queries, however doing analytics would not be that easy as hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 records
HBase
Hbase is a NoSQL, columner database which stores information in Map(Dictionary) like format. Where each row needs to have one column which uniquely identifies the row. This is called key.
You can have key as a combination of multiple columns as well if you don't have a single column which can uniquely identifies the row. And then you can search record using partial key. However this is going to affect the performance ( as compared to have single column key).
Hive:
Hive has a SQL like language (HQL) to Query HDFS, which you can use for analytics. However, it doesn't require any primary key so you can insert duplicate records if required.
The vendor stated that the best way to go is HBASE and that they have
to choices: (1) make the search optimize for machine learning (2) make
adhoc queries. The (1) will require a concatenate key with all the
fields of interest. The key length will determine how slow or fast the
search will run.
In a way your vendor is correct, as I explained earlier.
We do not need to use HBASE. We can use HIVE 2. We do not need to concatenate field names. We can translate those to a number and have a key as a number 3. I do not think we need to choose one or the other.
Weather you can use HBASE or Hive is depends on your use case. However, if you are planning to use Hive then you don't even need to generate a pseudo key (row numbers you are talking about)
There is one more option if you have hortonworks deployment. Consider Hive for analytics and LLAP for interactive queries.

Simulating server-side group and sort in Azure table storage

I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields, and then using Insert-or-update, to maintain a table in which you kept pointers for the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read this excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch most recent lines for Resource A and B, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks) reversed i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reverse ticks is required to that most recent entries will show up on the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
In short, your data storage approach should be primarily governed by how you want to fetch the data. It would even mean you will have to save the same data multiple times.
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. Basically in this approach, as a resource is accessed you add a message in a queue. This message will have resource name and date/time resource was last accessed. Then have a background process poll this queue and fetch messages. As the messages are received, you first get the current count and last access date/time for that resource. If no records are found, you simply insert a record in this table with count as 1. If a record is found then you compare the date/time from the table with the date/time sent in the message. If the date/time from the table is smaller than the date/time sent in the message, you update both count (increase that by 1) and last access date/time. If the date/time from the table is more than the date/time sent in the message, you only update the count.
Now to find most accessed resources in a time span, you simply query this table. Assuming there are limited number of resources (say in 100s), you can get this information from the table with at least 1 request. Since you're dealing with small amount of data, you can simply download this data on the client side and order it anyway you see fit. However to see the access details for a particular resource, you would have to fetch detailed data (1000 entities at a time).
Part of your brain might still be unconsciously trapped in relational-table design paradigms, I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know what the total $ amount of these transactions are. Because Azure table storage doesn't (yet?) offer aggregate functions I can't simply go .Sum(). To get around that I'm going to:
Sum the values of the transactions in my app before I pass them to azure.
I'll then pass that the result of the sum into azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by increment the value of RunningTotal each time i get new transactions.
Of course there are risks to this but the app is a personal one so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. I'll almost be using table storage as a long-term cache rather than a database.

Using Hadoop for storing stock market tick data

I'm having fun learning about Hadoop and the various projects around it and currently have 2 different strategies I'm thinking about for building a system to store a large collection of market tick data, I'm just getting started with both Hadoop/HDSF and HBase but hoping someone can help me plant a system seed that I won't have to junk later using these technologies. Below is an outline of my system and requirements with some query and data usage use cases and lastly my current thinking about the best approach from the little documentation I have read. It is an open ended question and I'll gladly like any answer that is insightful and accept the best one, feel free to comment on any or all of the points below. - Duncan Krebs
System Requirements - Be able to leverage the data store for historical back testing of systems, historical data charting and future data mining. Once stored, data will always be read-only, fast data access is desired but not a must-have when back testing.
Static Schema - Very Simple, I want to capture 3 types of messages from the feed:
Timestamp including date,day,time
Quote including Symbol,timestamp,ask,askSize,bid,bidSize,volume....(About 40 columns of data)
Trade including Symbol,timestamp,price,size,exchange.... (About 20 columns of data)
Data Insert Use Cases - Either from a live market stream of data or lookup via broker API
Data Query Use Cases - Below demonstrates how I would like to logically query my data.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
Get me the number of trades for these 50 symbols for each day over the last 90 days.
The Holy Grail - Can MapReduce be used for uses cases like these below??
Generate meta-data from the raw market data through distributed agents. For example, Write a job that will compute the average trading volume on a 1 minute interval for all stocks and all sessions stored in the database. Create the job to have an agent for each stock/session that I tell what stock and session it should compute this value for. (Is this what MapReduce can do???)
On the classpath of the agents can I add my own util code so that the use case above for example could publish its value into a central repo or Messaging server? Can I deploy an agent as an OSGI bundle?
Create different types of agents for different types of metrics and scores that are executed every morning before pre-market trading?
High Frequency Trading
I'm also interested if anyone can share some experience using Hadoop in the context of high frequency trading systems. Just getting into this technology my initial sense is Hadoop can be great for storing and processing large volumes of historic tick data, if anyone is using this for real-time trading I'd be interested in learning more! - Duncan Krebs
Based of my understanding of your requirements, Hadoop would be really good solution to store your data and run your queries on it using Hive.
Storage: You can store the data in Hadoop in a directory structure like:
~/stock_data/years=2014/months=201409/days=20140925/hours=01/file
Inside the hours folder, the data specific to that hour of the day can reside.
One advantage of using such structure is that you can create external tables in Hive over this data with your partitions on years, months, days and hours. Something like this:
Create external table stock_data (schema) PARTITIONED BY (years bigint, months bigint, days bigint, hours int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION
'~/stock_data'
Coming to the queries part, once you have the data stored in the format mentioned above you can easily run simple queries.
Get me all Quotes,Trades,Timestamps for GOOG on 9/22/2014
select * from stock_data where stock = 'GOOG' and days = 20140922
Get me all Trades for GOOG,FB BEFORE 9/1/2014 AND AFTER 5/1/2014
select * from stock_data where stock in ('GOOG', 'FB') and days > 20140501 and days < 20140901)
You can run any such aggregation queries once in a day and use the output to come up with the metrics before pre-market trading. Since Hive internally runs mapreduce these queries won't be very fast.
In order to get faster results, you can use some of the in memory projects like Impala or Spark. I have myself used Impala to run queries on my hive tables and I have seen a major improvement in the run time for my queries (around 40x). Also you wouldn't need to make any changes to the structure of the data.
Data Insert Use Cases : You can use tools like Flume or Kafka for inserting data in real time to Hadoop (and thus to the hive tables). Flume is linearly scalable and can also help in processing events on the fly while transferring.
Overall, a combination of multiple big data technologies can provide a really decent solution to the problem you proposed and these solution would scale to huge amounts of data.

Hbase Schema design

I have to design an Hbase table to store users information, this information is targeted for social networking, like: age, sex, education, hobbies, read books, traveled countries ...
NOTE: we could add more information in future, we dont know all information now.
for example:
name: Olha, age: 25, sex: female, education: bachelor Information technology, education: master computer science, hobby: basket ball, hobby: ping pong, book: gone with the wind, book: Davinci code, language: english, language: french, Country: Germany
The main idea is to be able to do queries like:
return all people who are female, age: 22 years old, speak: english, speak: french, read the book gone with the wind, like ping pong, like basket ball and German.
so you can add any criteria to the search query.
what is your suggestion about the HBASE table schema ( row key, column family ... ) that optimized this kind of search queries ( taking into consideration that we will add more information in future )
what is the best way to write such query ( scan, get, MapReduce ).
Thank you
I would agree with Ian Varley that Solr/Lucene and it's faceted queries and joins allow you to pivot the data in the way you want to see it - however - I also think your question might be a "counting" question or a "membership" question....
It sounds like you are after a list of people who match (N) attributes - the problem you have is that for each attribute you could have millions of user ids?
HBase is a good fit when all you are trying to do is compute intersection/union sizes.. Your key/value pairs can be put into Hbase, and you can "encode" the IDs of the users into either a Bloom Filter and HyperLogLog. Trading speed for accuracy and memory. Likely running map/reduce style jobs hourly/nightly on click-streams of log aggregation of some type.
Others have done this in the advertising space and online space for exactly the type of queries you are running ("find people who like red bull and pop-tarts that live in florida")
References
Contextual Advertising using Apache Hive and Amazon EMR http://aws.amazon.com/articles/2855
Scaling Distributed Counters: http://whynosql.com/scaling-distributed-counters/
Google: Sharding counters https://developers.google.com/appengine/articles/sharding_counters
Distributed Counter Performance in HBase - Part 1 http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1
Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Realtime Analytics with Hadoop and HBase - http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase
Log Event Processing with HBase http://tellapart.com/log-event-processing-with-hbase
Clickstream Analytics at BazaarVoice http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics
Realtime Analytics with HBase - http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version
This isn't a great use of HBase, in the sense that this is exactly the kind of thing that search indexes (like Lucene) are good for.
One normal schema to store users and their information might look a lot like a relational database, in that you'd have 1 row per user, and store all the attributes as columns & values (age=22, language=french, etc). This works well for the extensibility you mention (you don't need to change any schema in order to store new attributes). With this schema, you could look up any one user (and all of their attributes) by the unique user id. That'd be blazingly fast to do, no matter how many users you have.
However, with that schema, if you want to search in the way you describe ("return all users whose age is 22"), every single query is going to end up being a scan of the entire table, because HBase only allows you to access things via their primary key; it does not have secondary indexing of any kind. That will be extremely inefficient (picture having to scan a million rows every time you want to do any single query).
How to fix this? You could "reverse" the ordering of the data and put the values in the row key and then point to all the users with that value. For example, the row key could be "age:22", and then in the columns of the row could be all the userids that are age 22. This is problematic for a lot of reasons, not least of which is that it will be extremely expensive and tricky to make updates. But it would perform well for those specific queries.
The trick? That's exactly what a search index (like Lucene) does, and it does it much better than you could by rolling your own with HBase. That sounds like the tool you want to be using here.
If you must use HBase (as you say, since it's a research project) it might be worth looking into using HBase and Lucene together; google that for pointers.

Resources