I am crawling data from different industries and storing it in a single HBase table. For example, I crawl the Electronics and Computer industries and store both in a table called 'industry_tbl'. Now I want to run a MapReduce job over these data sets (Electronics and Computer) and produce reducer output separated by industry, but currently the job takes the entire data of both industries and gives me reduced results that I can't differentiate by industry.
Any help or ideas on how to solve this?
Include industry as part of the key you emit in the mapper.
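A minimal sketch of that, assuming the industry name is stored in a column such as info:industry (the family and qualifier names here are placeholders, not your actual schema):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class IndustryMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        // Read the industry of this row (assumed to live in info:industry).
        String industry = Bytes.toString(
                value.getValue(Bytes.toBytes("info"), Bytes.toBytes("industry")));
        // Prefixing the emitted key with the industry keeps the two data sets
        // apart in the reducer output.
        context.write(new Text(industry + "|" + Bytes.toString(row.get())), ONE);
    }
}
```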
Make industry the most significant part of your HBase row key and pass that prefix to the Scan you define for the map-reduce job.
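For example, if the row key starts with the industry (say "electronics|..."), the Scan you hand to the job can be limited to that prefix. Everything below (key layout, table name, job setup) is an assumption, not your actual code:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class ElectronicsJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "electronics-rollup");

        Scan scan = new Scan();
        // Only rows whose key starts with the industry prefix reach the mapper
        // (setRowPrefixFilter is HBase 1.1+; older versions can use setStartRow/setStopRow).
        scan.setRowPrefixFilter(Bytes.toBytes("electronics|"));
        scan.setCaching(500);
        scan.setCacheBlocks(false);   // recommended for MapReduce scans

        TableMapReduceUtil.initTableMapperJob(
                "industry_tbl",        // source table from the question
                scan,
                IndustryMapper.class,  // mapper from the previous sketch
                Text.class,
                IntWritable.class,
                job);
        // ... set reducer / output format and submit as usual
    }
}
```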
You could also do a column scan on the HBase table.
To do that, put all the information for a particular industry under its own column family.
For example, the industry table would look something like this.
For a given row: cf1-science, cf2-technology, etc.
This way, your industry data would be closely partitioned in certain regions, bringing down your query time.
Then you would just query using the Scan API, adding a particular column family to the scan.
The scan would then return only the details pertaining to that particular industry.
The row key in this case would still remain the same as you had it previously.
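A rough sketch of such a scan, assuming a column family named after the industry (e.g. "electronics"):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneIndustry {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("industry_tbl"))) {
            Scan scan = new Scan();
            // Restrict the scan to one industry's column family;
            // the other families are never read.
            scan.addFamily(Bytes.toBytes("electronics"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // process only the electronics columns of each row
                }
            }
        }
    }
}
```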
Hope this explanation helps.
We are building a DB infrastructure on top of Hadoop. We will be paying a vendor to do that, and I do not think we are getting the right answers from the first vendor. So I need help from some experts to validate whether I am right or am missing something.
1. We have about 1600 fields in the data. A unique record is identified by those 1600 fields
2. We want to be able to search for records in a particular time frame (i.e., records for a given time frame)
3. There are some fields that change over time (monthly)
The vendor stated that the best way to go is HBase and that they have two choices: (1) optimize the search for machine learning, or (2) support ad-hoc queries.
Option (1) would require a concatenated key with all the fields of interest. The key length will determine how slow or fast the search will run.
I do not think this is correct.
1. We do not need to use HBase. We can use Hive.
2. We do not need to concatenate field names. We can translate those to a number and use a number as the key.
3. I do not think we need to choose one or the other.
Could you let me know what you think about that?
It all depends on what your use case is. In simpler terms, Hive alone is not good when it comes to interactive queries, but it is one of the best when it comes to analytics.
HBase, on the other hand, is really good for interactive queries; however, doing analytics with it is not as easy as with Hive.
We have about 1600 fields in the data. A unique record is identified by those 1600 fields
HBase
HBase is a NoSQL, columnar database which stores information in a map (dictionary) like format, where each row needs one column that uniquely identifies the row. This is called the key.
You can have the key as a combination of multiple columns as well if you don't have a single column that uniquely identifies the row, and then you can search records using a partial key. However, this is going to affect performance (as compared to having a single-column key).
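A small sketch of the idea, with made-up field values; the composite key is built from the identifying columns and then searched by its leading part:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CompositeKeySketch {
    public static void main(String[] args) {
        // Write: composite key built from the identifying fields
        // (field values and separators here are purely illustrative).
        byte[] rowKey = Bytes.toBytes("CUST001|20240115|REC42");
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("field1"), Bytes.toBytes("value1"));

        // Read: partial-key search on the leading components only.
        // This is efficient because the prefix is the most significant part
        // of the key; filtering on trailing components would touch far more rows.
        Scan scan = new Scan();
        scan.setRowPrefixFilter(Bytes.toBytes("CUST001|20240115"));
    }
}
```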
Hive:
Hive has a SQL-like language (HQL) to query data on HDFS, which you can use for analytics. It doesn't require any primary key, so you can insert duplicate records if required.
The vendor stated that the best way to go is HBase and that they have two choices: (1) optimize the search for machine learning (2) support ad-hoc queries. Option (1) would require a concatenated key with all the fields of interest. The key length will determine how slow or fast the search will run.
In a way your vendor is correct, as I explained earlier.
1. We do not need to use HBase. We can use Hive. 2. We do not need to concatenate field names. We can translate those to a number and use a number as the key. 3. I do not think we need to choose one or the other.
Whether you use HBase or Hive depends on your use case. However, if you are planning to use Hive then you don't even need to generate a pseudo key (the row numbers you are talking about).
There is one more option if you have a Hortonworks deployment: consider Hive for analytics and Hive LLAP for interactive queries.
My HBase row key is different, and I also need to aggregate the data and store it separately. In this use case, which is the best approach?
What is the best approach: creating multiple HBase tables, or multiple column families in a single HBase table?
I am refining my question.
Below is my use case.
I am processing weblogs which have retailer, category, and product clicks.
I am storing the above weblog data in one HBase table (Log) with separate row keys and the same column family.
Ex.
A.
for Retailer -- IP | DateTime | Sid | Retailer
B.
for Category -- IP | DateTime | Sid | Retailer | Category
C.
for Product -- IP | DateTime | Sid | Retailer | Category | Product
From the above table I am calculating daily clicks and storing them into other HBase tables (Retailer_Day_cnt, Category_Day_Cnt, Product_Day_Cnt).
My question is: for the two cases above, what is the best way to store the data in HBase: separate HBase tables, or column families in a single table?
Note: in case 1 I am doing only writes, but in case 2 I will do multiple reads and writes.
Thanks in advance
Surendra
From a performance perspective, the fewer column families, the better. All the column families in a table are flushed at the same time, even if some of them have very little data, which makes flushes less efficient. If your table is write-heavy this results in a lot of HFiles -> more compactions -> longer GC pauses, and this can make the whole HBase cluster very slow. So don't use multiple column families unless you really need them or all column families will have a similar amount of data.
Find more details here:
HBase Book
Similar question
This depends on your use case.
If you have the same row key but different data, then you can divide it into different column families. But if the row keys are different, put the data into different tables.
This will also depend on whether you have single-write, multiple-read access (i.e. low write throughput is OK) or you want high write throughput, and also on how your data is distributed. If one column family has a lot of data (in size) compared to the rest of the column families, it is better to put them into different tables.
If you give more details on your use case I can be more specific.
Row key design is the main challenge in these scenarios.
If you are able to design your row key in such a way that you can use it for all of your purposes, then you may proceed with different column families; otherwise multiple tables are the only option. In your case, it seems like you are storing aggregated results in the second table, which must have a different logical row key. So you should go with the two-table approach: the first table stores all the inputs (written once, read multiple times) and the second table stores the processed/aggregated data.
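A minimal sketch of that split, with table, family, and qualifier names assumed rather than taken from your schema:

```java
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoTableSketch {
    public static void main(String[] args) {
        // Raw log table "Log": keyed by the click event, written once,
        // read later by the aggregation job.
        Put logPut = new Put(Bytes.toBytes("10.0.0.1|2024-01-15T10:22:31|sid123|retailerX"));
        logPut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("category"), Bytes.toBytes("electronics"));

        // Aggregate table "Retailer_Day_cnt": a different logical row key
        // (retailer + day), read and written repeatedly, so it lives in its
        // own table rather than in another column family of "Log".
        Increment dayCount = new Increment(Bytes.toBytes("retailerX|2024-01-15"));
        dayCount.addColumn(Bytes.toBytes("cnt"), Bytes.toBytes("clicks"), 1L);
    }
}
```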
I have a client that mostly runs calculations over a single column of many rows from a table (a different column each time), which is a classic use case for a columnar DB.
The problem is that he is using Oracle, so what I thought of doing was to build a set of cluster tables where each table has just one column besides the PK, and in this way allow him to work in a pseudo-columnar model.
What are your thoughts on the subject?
Will it even work as expected, or am I just forcing the solution here?
Thanks,
Daniel
I didn't test it in the end, but I did achieve close to columnar performance using a sorted hash cluster table.
I have to design an HBase table to store users' information. This information is targeted at social networking, like: age, sex, education, hobbies, books read, countries traveled to, and so on.
NOTE: we could add more information in the future; we don't know all the information now.
for example:
name: Olha, age: 25, sex: female, education: bachelor Information technology, education: master computer science, hobby: basket ball, hobby: ping pong, book: gone with the wind, book: Davinci code, language: english, language: french, Country: Germany
The main idea is to be able to do queries like:
return all people who are female, 22 years old, speak English and French, read the book Gone with the Wind, like ping pong and basketball, and are from Germany.
so you can add any criteria to the search query.
What is your suggestion for an HBase table schema (row key, column families, ...) that optimizes this kind of search query (taking into consideration that we will add more information in the future)?
And what is the best way to run such a query (Scan, Get, MapReduce)?
Thank you
I would agree with Ian Varley that Solr/Lucene and its faceted queries and joins allow you to pivot the data in the way you want to see it. However, I also think your question might be a "counting" question or a "membership" question...
It sounds like you are after a list of people who match N attributes; the problem you have is that for each attribute you could have millions of user IDs.
HBase is a good fit when all you are trying to do is compute intersection/union sizes. Your key/value pairs can be put into HBase, and you can "encode" the IDs of the users into either a Bloom filter or a HyperLogLog, trading some accuracy for speed and memory, likely running map/reduce style jobs hourly/nightly over click-streams or log aggregations of some type.
Others have done this in the advertising and online space for exactly the type of queries you are running ("find people who like Red Bull and Pop-Tarts and live in Florida").
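As a rough illustration of the membership idea (using Guava Bloom filters; a HyperLogLog, e.g. from stream-lib, would give approximate set sizes instead), with made-up attribute names and user IDs:

```java
import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class AudienceOverlapSketch {
    public static void main(String[] args) {
        // One Bloom filter per attribute, sized for the expected audience.
        BloomFilter<String> likesPingPong =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);
        BloomFilter<String> speaksFrench =
                BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

        // Populated by the nightly MapReduce job over the user data.
        likesPingPong.put("user123");
        speaksFrench.put("user123");

        // Approximate membership/intersection test: a small false-positive
        // rate, no false negatives, and a tiny memory footprint.
        boolean probablyBoth = likesPingPong.mightContain("user123")
                && speaksFrench.mightContain("user123");
        System.out.println(probablyBoth);
    }
}
```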
References
Contextual Advertising using Apache Hive and Amazon EMR - http://aws.amazon.com/articles/2855
Scaling Distributed Counters - http://whynosql.com/scaling-distributed-counters/
Google: Sharding Counters - https://developers.google.com/appengine/articles/sharding_counters
Distributed Counter Performance in HBase, Part 1 - http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1
Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day - http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Realtime Analytics with Hadoop and HBase - http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase
Log Event Processing with HBase - http://tellapart.com/log-event-processing-with-hbase
Clickstream Analytics at BazaarVoice - http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics
Realtime Analytics with HBase - http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version
This isn't a great use of HBase, in the sense that this is exactly the kind of thing that search indexes (like Lucene) are good for.
One normal schema to store users and their information might look a lot like a relational database, in that you'd have 1 row per user, and store all the attributes as columns & values (age=22, language=french, etc). This works well for the extensibility you mention (you don't need to change any schema in order to store new attributes). With this schema, you could look up any one user (and all of their attributes) by the unique user id. That'd be blazingly fast to do, no matter how many users you have.
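A minimal sketch of that row-per-user layout, with made-up family and qualifier names:

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowPerUserSketch {
    public static void main(String[] args) {
        // One row per user; every attribute is just another column, so new
        // attributes need no schema change.
        Put user = new Put(Bytes.toBytes("user:olha"));
        user.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("age"), Bytes.toBytes("25"));
        user.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("sex"), Bytes.toBytes("female"));
        user.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("language:english"), Bytes.toBytes(""));
        user.addColumn(Bytes.toBytes("attr"), Bytes.toBytes("language:french"), Bytes.toBytes(""));

        // Looking up one user (and all of their attributes) by id is a single
        // Get, fast regardless of table size.
        Get byId = new Get(Bytes.toBytes("user:olha"));
    }
}
```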
However, with that schema, if you want to search in the way you describe ("return all users whose age is 22"), every single query is going to end up being a scan of the entire table, because HBase only allows you to access things via their primary key; it does not have secondary indexing of any kind. That will be extremely inefficient (picture having to scan a million rows every time you want to do any single query).
How to fix this? You could "reverse" the ordering of the data and put the values in the row key and then point to all the users with that value. For example, the row key could be "age:22", and then in the columns of the row could be all the userids that are age 22. This is problematic for a lot of reasons, not least of which is that it will be extremely expensive and tricky to make updates. But it would perform well for those specific queries.
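A sketch of that reversed layout, again with illustrative names only:

```java
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ReversedIndexSketch {
    public static void main(String[] args) {
        // The attribute value is the row key; each matching user id becomes
        // a column qualifier.
        Put index = new Put(Bytes.toBytes("age:22"));
        index.addColumn(Bytes.toBytes("users"), Bytes.toBytes("user123"), Bytes.toBytes(""));
        index.addColumn(Bytes.toBytes("users"), Bytes.toBytes("user456"), Bytes.toBytes(""));

        // One Get now answers "all users whose age is 22"...
        Get allAge22 = new Get(Bytes.toBytes("age:22"));
        // ...but every change to a user's age means updating (and cleaning up)
        // these index rows, which is the expensive, tricky part mentioned above.
    }
}
```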
The trick? That's exactly what a search index (like Lucene) does, and it does it much better than you could by rolling your own with HBase. That sounds like the tool you want to be using here.
If you must use HBase (as you say, since it's a research project) it might be worth looking into using HBase and Lucene together; google that for pointers.
We are looking at using HBase for real-time analytics.
Prior to HBase, we will be running a Hadoop Map Reduce job over our log files and aggregating the data, and storing the fine-grained aggregate results in HBase to enable real-time analytics and queries on the aggregated data. So the HBase tables will have pre-aggregated data (by date).
My question is: how to best design the schema and primary key design for the HBase database to enable fast but flexible queries.
For example, assume that we store the following lines in a database:
timestamp, client_ip, url, referrer, useragent
and say our map-reduce job produces three different output fields, each of which we want to store in a separate "table" (HBase column family):
date, operating_system, browser
date, url, referrer
date, url, country
(our map-reduce job obtains the operating_system, browser and country fields from the user agent and client_ip data.)
My question is: how can we structure the HBase schema to allow fast, near-realtime and flexible lookups for any of these fields, or a combination? For instance, the user must be able to specify:
operating_system by date ("How many iPad users in this date range?")
url by country and date ("How many users to this url from this country for the last month?")
and basically any other custom query?
Should we use keys like this:
date_os_browser
date_url_referrer
date_url_country
and if so, can we fulfill the sort of queries specified above?
You've got the gist of it, yes. Both of your example queries filter by date, and that's a natural "primary" dimension in this domain (event reporting).
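For instance, with a yyyyMMdd_os_browser key (an assumed format), the "operating_system by date" query becomes a bounded row scan plus a key filter; a rough sketch:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class DateOsBrowserScanSketch {
    public static void main(String[] args) {
        // "How many iPad users in this date range?" over a yyyyMMdd_os_browser key.
        Scan scan = new Scan();
        scan.withStartRow(Bytes.toBytes("20240101"));   // inclusive (HBase 2.x API;
        scan.withStopRow(Bytes.toBytes("20240201"));    //  1.x uses setStartRow/setStopRow)
        // The date bounds the row range; the OS part of the key is matched with
        // a filter, so only rows inside the range are ever examined.
        scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
                new SubstringComparator("_iPad_")));
    }
}
```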
A common note you'll get about starting your keys with a date is that it will cause "hot spotting" problems; the essence of that problem is that date ranges which are contiguous in time will also be contiguous on servers, so if you're always inserting and querying data that happened "now" (or "recently"), one server will get all the load while the others sit idle. This doesn't sound like it'd be a huge concern on insert, since you'll be batch loading exclusively, but it might be a problem on read; if all of your queries go to one of your 20 servers, you'll effectively be at 5% capacity.
OpenTSDB gets around this by prepending a 3-byte "metric id" before the date, and that works well to spray updates across the whole cluster. If you have something that's similar, and you know you always (or usually) include a filter for it in most queries, you could use that. Or you could prepend a hash of some higher order part of the date (like "month") and then at least your reads would be a little more spread out.
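A small sketch of that salting idea (bucket count and key layout are assumptions, not a recommendation for your exact cluster):

```java
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeySketch {
    public static void main(String[] args) {
        // Spread a date-led key across the cluster by prepending a small
        // bucket derived from a hash, in the spirit of OpenTSDB's metric-id prefix.
        final int NUM_BUCKETS = 20;                  // roughly one per region server
        String logicalKey = "20240115_iPad_safari";
        int bucket = Math.floorMod(logicalKey.hashCode(), NUM_BUCKETS);
        byte[] rowKey = Bytes.toBytes(String.format("%02d_%s", bucket, logicalKey));

        // Within each bucket the keys stay date-ordered, so a date-range read
        // becomes NUM_BUCKETS parallel scans (one per bucket prefix) whose
        // results are merged client-side.
        System.out.println(Bytes.toString(rowKey));
    }
}
```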