Please help me understand the best way to store information in HBase.
Basically, I have a rowkey like hashed_uid+date+session_id with metrics like duration, date, time, location, depth and so on.
I have read a lot of material and am a bit confused. People have suggested using fewer column families for better performance, so I am choosing between three options:
Have each metric sit in its own row, like rowkey_key cf1->alias1:value
Have many columns, like rowkey cf1->key1:val1, cf1->key2:val2 ...
Have all the key-value pairs encoded into one big string, like rowkey cf1->"k1:v1,k2:v2,k3:v3..."
Thank you in advance. I don't know which to choose. The goal of my HBase design is to prepare for incremental windowing functions over user-profiling output, such as percentiles, engagement, and summary statistics for the last 60 days. Most likely, I will use Hive for that.

You may be confused by the similar naming of column families and columns. These are different concepts in HBase. A column family consists of several columns. This design speeds up access when you need to read only some types of columns. For example, suppose you have raw data and processed data: reading the processed data will not touch the raw data if the two are stored in separate column families. You can have practically any number of columns per row key, but a row must be stored in one region, which should be no more than 10GB. The design depends on what you want:
The first variant is the only choice when you need to store a lot of data per row key that can't be stored in one region. More than
The second is good when you need to get only a few metrics per single read per row key.
The last variant is suitable when you always get all metrics per single read per row key.
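To make the three layouts concrete, here is a minimal Python sketch that only models the cells each option would produce. It does not talk to HBase; the function names, the "value" and "all" qualifiers, and the sample row key are assumptions made up for illustration.

```python
# Cells are modelled as (row_key, "family:qualifier", value) tuples.
# All names here are hypothetical; this is not HBase client code.

def one_row_per_metric(row_key, metrics, family="cf1"):
    """Option 1: the metric name becomes part of the row key (tall table)."""
    return [(f"{row_key}_{name}", f"{family}:value", str(value))
            for name, value in metrics.items()]

def one_column_per_metric(row_key, metrics, family="cf1"):
    """Option 2: one row, one column qualifier per metric (wide row)."""
    return [(row_key, f"{family}:{name}", str(value))
            for name, value in metrics.items()]

def one_blob(row_key, metrics, family="cf1"):
    """Option 3: all metrics packed into a single cell."""
    blob = ",".join(f"{name}:{value}" for name, value in metrics.items())
    return [(row_key, f"{family}:all", blob)]

metrics = {"duration": 42, "depth": 3}
row_key = "a1b2|20150101|s1"  # stands in for hashed_uid+date+session_id

for layout in (one_row_per_metric, one_column_per_metric, one_blob):
    print(layout.__name__, layout(row_key, metrics))
```

With option 2 a read can ask for a single qualifier, while option 3 always returns (and must parse) the whole string, which matches the trade-off described above.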


Cassandra Modeling for filter and range queries

I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.
I want to be able to write queries like these:
get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds
get all users who are men who are 6'0" are ages 31-37 and have black hair
How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.
Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?
If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. Best case scenario, 2/3 billion records on three nodes. In the second case, I will be putting 2/5 billion records on each node in the best case with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?
If you are trying to model your data inside Cassandra, the general rule is that you need to make a table per query. There are also significant restrictions on what you can filter your query by. If you want to understand some of the restrictions I suggest you take a look at this post:
or my long answer here:
cassandra - how to perform table query?
All of the above only applies if you are running fixed queries that are known ahead of time. If instead you are looking to perform some sort of analytics on your data (it sounds like you might be), then I would look at using Spark in conjunction with Cassandra. This will give you a fast tool for in-memory processing of your data. If you use DataStax (Community or Enterprise), Spark also has a connector that makes reading and writing data to and from Cassandra easy.
Edited with Additional Information
Based on the query "get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds" you would need to build a table like the following:
CREATE TABLE user_by_haircolor_weight_height (
  haircolor text,
  weight float,
  height_in int,
  user varchar,
  PRIMARY KEY ((haircolor), weight, height_in)
);
You could then query this by:
SELECT * FROM user_by_haircolor_weight_height WHERE haircolor='red' AND weight>100 AND height_in>=61 AND height_in<=72;
For the query "get all users who are men who are 6'0" are ages 31-37 and have black hair" you would need to build a similar table with a
PRIMARY KEY ((haircolor, sex), height_in, age)
In the end, if what you are trying to do is perform either ad-hoc analytics or a fixed set of analytics (i.e. work that can tolerate a bit more latency than a straight CQL query) on the data stored in your Cassandra table, then I suggest you look at using Spark. If you need something a bit more real-time to handle ad-hoc queries, you can look at using Solr to perform Lucene-powered searches on your table.
My recommendation is:
1) Keep a main table with a proper partition key, so that the millions of records are spread across the cluster. Don't use any clustering column here that would push a partition past the 2GB row limit.
2) Depending on the query pattern, create as many additional tables (acting as indexes) as you need to hold inverted-index data, because writes are cheap.
3) Use multiple queries to get what you need.
4) As a last option, use the DSE Solr search capability.
Just to reiterate the end of the conversation:
"Your understanding is correct and you are correct in stating that partition keys should be more unique than that. Each partition has a maximum size of 2GB, but the practical limit is lower. In practice you would want your data partitioned into far smaller chunks than the table above. Given the ad-hoc nature of the queries in your example, I do not think you would be able to practically do this by data modelling alone. I would suggest looking at using a Solr index on a table. This would give you a robust search capability. If you use DataStax you are even able to query this via CQL."
Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.
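The partition-size concern raised in the question can be checked with simple arithmetic. The sketch below assumes the cardinalities given in the question (three hair colors, two sexes) and a perfectly even distribution, which is the best case; the function name is made up for illustration. Note that a composite partition key of (haircolor, sex) yields at most 3 × 2 = 6 distinct partitions.

```python
from math import prod

def rows_per_partition(total_rows, cardinalities):
    """Best-case (perfectly even) rows per partition for a partition key
    whose components have the given numbers of distinct values."""
    return total_rows / prod(cardinalities)

users = 2_000_000_000

by_hair = rows_per_partition(users, [3])         # partition key: (haircolor)
by_hair_sex = rows_per_partition(users, [3, 2])  # partition key: (haircolor, sex)

print(f"(haircolor):      {by_hair:,.0f} rows per partition")
print(f"(haircolor, sex): {by_hair_sex:,.0f} rows per partition")
```

Even in the best case each partition holds hundreds of millions of rows, which is why low-cardinality partition keys are not workable at this scale.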

Determining Bucketing Configuration on Hive Table

I was curious if someone could provide a little more clarification on how to configure the bucketing property on a Hive table. I see that it helps with joins, and I believe I read that it's good to put it on a column that you will use to join. That could be wrong. I am also curious about how to determine the number of buckets to choose.
If anyone could give a brief explanation and some documentation on how to determine all of these things that would be great.
Thanks in advance for the assistance.
If you want to implement bucketing in your table, first set the property
set hive.enforce.bucketing=true;
This enforces the bucketing.
Cardinality: the number of possible values for a column.
If you implement bucketing using the Clustered By clause, your bucketing column should have high cardinality; then you will get better performance.
If you implement partitioning using the Partitioned By clause, your partition column should have low cardinality; then you will get better performance.
Depending on the use case you can choose the number of buckets. It's good to choose (number of buckets) < (your HDFS block size), and it should be a power of 2.
Bucketing always creates files, not directories.
The following are a few suggestions to consider while designing buckets.
Buckets are generally created on the most critical columns, either a single column or a set of columns. These are typically the primary columns used in join conditions, because the idea of bucketing is to hash these columns and store the data so that it can be located in HDFS quickly, which makes retrieval fast. It is advisable not to bucket on all the join columns, only on the critical ones that you expect to improve performance.
The number of buckets should be a power of 2. The number of buckets determines the number of reducers that run, and that determines the final number of files in which the data is stored. So the number of buckets has to be chosen with the size of the data in mind, avoiding both a large number of small files in HDFS and a small number of very large files, to improve Hive query speed and optimization.
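The hashing idea described above can be sketched in a few lines of Python. This is only an illustration of "hash the bucketing column, then take it modulo the number of buckets"; it uses zlib.crc32 as a stand-in hash, which is an assumption and not Hive's own hash function, and the user_ids column is hypothetical.

```python
import zlib
from collections import Counter

def bucket_for(value, num_buckets):
    """Map a column value to a bucket index: hash, then modulo."""
    h = zlib.crc32(str(value).encode("utf-8"))  # stand-in hash, NOT Hive's
    return h % num_buckets

num_buckets = 8            # a power of 2, as suggested above
user_ids = range(10_000)   # hypothetical high-cardinality bucketing column

# Each bucket corresponds to one output file; count rows per bucket.
counts = Counter(bucket_for(uid, num_buckets) for uid in user_ids)
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

A high-cardinality column gives the hash many distinct inputs to spread over the buckets, whereas a column with only two or three distinct values could fill at most two or three of them.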

Time series in Cassandra when measures can go "back in time"

This is related to "cassandra time series modeling when time can go backward", but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
key text,
measure_time timestamp,
value int,
PRIMARY KEY (key, measure_time))
The purpose of the clustering key is to have data arranged in decreasing timestamp order. This leads to very efficient range-based queries, which for a given key result in sequential disk reads (which are intrinsically fast).
Many times I have seen suggestions to use a generated timeuuid as the timestamp value (using now()), which is obviously intrinsically ordered. But you can't always do that. It seems to me a very common situation that you can't use it, namely if:
1) your user wants to query on the actual time when the measure was taken, not the time when the measure was written.
2) you use multiple writing threads
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear. I couldn't find any explanation about this on the DataStax site, but I admit I'm quite a novice on Cassandra. Any pointers appreciated.
Sure. Once written, an SSTable file is immutable. Your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That is less efficient than reading from a single SSTable.
The compaction process may merge those two SSTables into a new single one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
Also try to avoid very wide rows/partitions, which will be the case if you have a lot of measurements (i.e. a lot of measure_time values) for a single key.
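The read-time merge described above can be illustrated with a small Python sketch. It is a simplified model, not Cassandra code: each "SSTable" is just an immutable list already sorted by measure_time (ascending here for simplicity), and the sample values are made up.

```python
import heapq

# Each flushed SSTable is immutable and internally sorted by clustering key.
sstable_1 = [(0, 10), (3, 11), (10, 12)]  # (measure_time, value), flushed first
sstable_2 = [(9, 99), (12, 13)]           # flushed later, holds the "late" t=9

def read_partition(*sstables):
    """Merge several sorted runs into one sorted result at read time."""
    return list(heapq.merge(*sstables))

rows = read_partition(sstable_1, sstable_2)
print(rows)
```

Nothing is rearranged inside the existing files: the late measurement simply lives in a second sorted run, and the reader pays for a merge across runs. In this model, compaction amounts to writing the merged result out as one new run.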

Bad performance when writing log data to Cassandra with timeuuid as a column name

Following the pointers in an eBay tech blog and a DataStax developers blog, I modeled some event log data in Cassandra 1.2. As a partition key, I use “yymmddhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int,
rId int, created timeuuid, data map, PRIMARY
KEY((yymmddhh, bucket), created) );
(rId identifies the resource that fired the event.)
(map holds key-value pairs derived from a JSON; the keys change, but not much)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but can't pinpoint the problem.
When I use the timeuuid as a part of the row key, the performance remains stable on a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain at around <0.005s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post link due to reputation limitations, sorry, but easy to find) to say that each key creates an additional column -- or does it create one new column per "map"?? That would be... hard for me to believe.
I suggest you model your rows a little differently. Collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol, which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it, the size field will overflow, and even though the server sends all of the elements back to the client, the client only sees the first N % 2^16 elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
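The overflow arithmetic can be reproduced with a short Python sketch. It only models a count being squeezed into an unsigned two-byte field; it is not actual protocol or driver code, and the function name is made up for illustration.

```python
import struct

def size_seen_by_client(n_elements):
    """Encode the element count in an unsigned 16-bit field and decode it."""
    wire = struct.pack(">H", n_elements & 0xFFFF)  # only the low 16 bits fit
    (decoded,) = struct.unpack(">H", wire)
    return decoded

print(size_seen_by_client(100))        # small collections are unaffected
print(size_seen_by_client(2**16 + 3))  # the count wraps around
```

The second call reports 3, which is the N % 2^16 behaviour described above.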
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
  yymmddhh VARCHAR,
  bucket INT,
  created TIMEUUID,
  rId INT,
  key VARCHAR,
  value VARCHAR,
  PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
);
(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume that this would be correct.)
This has two drawbacks over using a MAP: it requires you to reassemble the map when you query the data (you would get back a row per map entry), and it uses a little more space since C* will insert a few extra columns. The upside is that there is no problem with collections getting too big.
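Reassembling the map on the client is a simple grouping step. The Python sketch below assumes the query returns one (created, rId, key, value) tuple per map entry; the sample rows and the function name are hypothetical, and no driver calls are shown.

```python
from collections import defaultdict

def reassemble_maps(rows):
    """Group (created, rId, key, value) rows back into one dict per event."""
    events = defaultdict(dict)
    for created, r_id, key, value in rows:
        events[(created, r_id)][key] = value
    return dict(events)

# Hypothetical result set: one row per former map entry.
rows = [
    ("t1", 7, "status", "ok"),
    ("t1", 7, "user", "alice"),
    ("t2", 7, "status", "error"),
]

events = reassemble_maps(rows)
print(events)
```

Because created and rId precede key in the clustering order, all entries of one event arrive next to each other, so the grouping could also be done in a streaming fashion.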
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
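A rough way to see why more buckets spread the load better is to estimate what fraction of the nodes receives at least one of the current hour's partitions. The sketch below assumes each bucket lands on a node uniformly and independently, which is a simplification of real token placement and replication; the six-node cluster size is a made-up example.

```python
def expected_fraction_of_nodes_used(num_buckets, num_nodes):
    """Expected share of nodes that own at least one bucket, assuming each
    bucket is placed on a node uniformly and independently at random."""
    return 1.0 - (1.0 - 1.0 / num_nodes) ** num_buckets

nodes = 6  # hypothetical cluster size
for buckets in (3, 300):
    share = expected_fraction_of_nodes_used(buckets, nodes)
    print(f"{buckets:>3} buckets: about {share:.0%} of nodes take writes")
```

Under these assumptions, three buckets leave most of a six-node cluster idle for the current hour, while three hundred buckets involve essentially every node.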
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 or some other fixed number of records. Buckets can be managed using a counter column family.

Join a record of a text file with all the other records in the same file in MapReduce

The article xrds:article, in the subsection "An Example of the Tradeoff", describes a way (the first one) in which every single record is joined with all the other records of the input file. I wonder how that could be possible in MapReduce without passing the whole input file to a single mapper.
There are three major types of joins (there are a few others out there) for MapReduce.
Reduce Side Join - For both data sets, you output the "foreign key" as the output key of the mapper. You use something like MultipleInputs to load two data sets at once. In the reducer, data from both data sets is brought together by foreign key, which allows you to do the join logic (like Cartesian product, perhaps) there. This is general purpose and will work for just about every situation.
Replicated Join - You push the smaller data set out into the DistributedCache. In each mapper, you load the smaller data set from there into memory. As records pass through the mapper, you join them against the in-memory data set. This is what you suggest in your question. It should only be used when the smaller data set can be stored in memory.
Composite Join - This one is a bit niche because it needs to be set up. If two data sets are sorted and partitioned by the foreign key, then you can do a composite join using CompositeInputFormat. It basically does a merge-like operation that is pretty efficient.
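As an illustration of the first pattern, here is a minimal in-memory Python sketch of a reduce-side join. It simulates the map, shuffle, and reduce phases with plain functions rather than using Hadoop; the "A"/"B" tags, the function names, and the sample data sets are assumptions made up for the example.

```python
from collections import defaultdict
from itertools import product

def map_phase(records, tag):
    """Mapper: emit (foreign_key, (tag, payload)) for each input record."""
    for key, payload in records:
        yield key, (tag, payload)

def reduce_phase(key, tagged_values):
    """Reducer: split values by source tag, then pair them up."""
    left = [v for t, v in tagged_values if t == "A"]
    right = [v for t, v in tagged_values if t == "B"]
    for a, b in product(left, right):  # Cartesian product within one key
        yield key, a, b

def reduce_side_join(dataset_a, dataset_b):
    # The shuffle is simulated by grouping mapper output by key.
    groups = defaultdict(list)
    emitted = list(map_phase(dataset_a, "A")) + list(map_phase(dataset_b, "B"))
    for key, tagged in emitted:
        groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_phase(key, groups[key]))
    return joined

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = reduce_side_join(users, orders)
print(result)
```

Only records that share a key meet in the same reducer call, so no single task has to see the whole input. Joining every record with every other record would correspond to giving all records the same key, which is exactly the single-reducer bottleneck the question worries about.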
Shameless plug for my book MapReduce Design Patterns: there is a whole chapter on joins (chapter 5).
Check out the code examples for the book here: https://github.com/adamjshook/mapreducepatterns/tree/master/MRDP/src/main/java/mrdp/ch5
