How to choose columns for partitioning and bucketing in a Hive table?

What would be the ideal columns for partitioning and bucketing for the schema below? Is it necessary to implement both, or is one enough?
user_id INTEGER UNSIGNED,
product_id VARCHAR(20),
gender ENUM('M','F') default NULL,
age VARCHAR(6),
occupation TINYINT UNSIGNED default NULL,
city_category ENUM('A','B','C','D','E') default NULL,
stay_in_current_city_years VARCHAR(6),
marital_status TINYINT UNSIGNED default 0,
product_category_1 TINYINT UNSIGNED default 0,
product_category_2 TINYINT UNSIGNED default 0,
product_category_3 TINYINT UNSIGNED default 0,
purchase_amount INTEGER UNSIGNED default 0
The main goal is to do some analysis based on the above attributes using Hive.

In Hive, you design a table around its usage pattern, so you should choose both partitioning and bucketing based on what your analysis queries will look like.
That said, the following guidelines are advisable.
Partitioning
Partitioning speeds up queries with predicates (i.e. WHERE conditions). So in your case, if city_category is the field you will use most often in your WHERE condition, you should choose that field for partitioning.
Keep in mind that it might degrade the performance of other queries.
Also make sure the cardinality of the partition column is not too high; otherwise, query performance will suffer.
To understand these points you need to understand how partitioning works. When you create a partition (or subpartition), Hive creates a subfolder with that name and stores the data files in that folder.
So if you partition on city_category, your directory layout would look like this:
/data/table_name/city_category=A
/data/table_name/city_category=B
...
/data/table_name/city_category=E
This helps Hive find a particular record when you provide city_category in the WHERE condition, because it only has to scan one folder.
However, if you try to find a record by user_id or product_id, Hive needs to scan all the folders.
And if you end up partitioning on purchase_amount, you will have a huge number of folders. The NameNode has to track the location of every folder and file, so this creates a lot of load on the NameNode and will clearly degrade query performance.
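As a rough sketch of what this could look like for the schema above (the purchases table name is hypothetical, and the Hive column types are assumptions, since Hive has no UNSIGNED or ENUM types):

CREATE TABLE purchases (
    user_id INT,
    product_id STRING,
    gender STRING,
    age STRING,
    occupation TINYINT,
    stay_in_current_city_years STRING,
    marital_status TINYINT,
    product_category_1 TINYINT,
    product_category_2 TINYINT,
    product_category_3 TINYINT,
    purchase_amount INT
)
PARTITIONED BY (city_category STRING)
STORED AS ORC;

-- Prunes to the single folder /data/purchases/city_category=A
SELECT SUM(purchase_amount) FROM purchases WHERE city_category = 'A';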
Bucketing
Bucketing helps speed up join queries when the other table in the join is bucketed the same way.
However, it's a good idea to make sure the data is distributed evenly across the buckets.
What bucketing does is apply a hash function to the given field and, based on the hash value, store each record in one of the buckets.
So let's say you bucket on city_category and ask for 50 buckets:
CLUSTERED BY (city_category) INTO 50 BUCKETS
Since we have only 5 categories, the other 45 buckets would be empty. That is something you don't want, as it will degrade the performance of your queries.
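A higher-cardinality column such as user_id would be a better bucketing candidate for this schema. Below is a minimal sketch combining both techniques, abbreviated to three columns; the purchases name and the bucket count of 32 are assumptions to be tuned against your data volume:

CREATE TABLE purchases (
    user_id INT,
    product_id STRING,
    purchase_amount INT
)
PARTITIONED BY (city_category STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- On Hive 1.x, enable bucket enforcement before inserting:
-- SET hive.enforce.bucketing = true;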

Related

Frequent, highly fragmented index on heap table

I have a table (a heap table) with a nonclustered index that frequently becomes highly fragmented. The data in column ID is imported from a CSV file, and the ID is then used in other table relations for reporting purposes. The table is updated (data is inserted) from a CSV several times a day. I frequently run an index REORGANIZE to reduce the fragmentation.
Do you have any other ideas to help keep fragmentation from occurring so frequently?
The following is a sample script of the table:
CREATE TABLE [dbo].[MyTable](
[ID] uniqueidentifier NOT NULL,
[EventID] uniqueidentifier NOT NULL,
[AssemblyID] uniqueidentifier NOT NULL,
[TimeStamp] [smalldatetime] NOT NULL,
[IsTrue] [bit] NOT NULL,
[IsExempt] [bit] NOT NULL CONSTRAINT [DF_IsExempt] DEFAULT ((0)),
CONSTRAINT [UQ_MyTable_ID] UNIQUE NONCLUSTERED ([ID] ))
GO
Another idea would be to lower the fill factor of that particular table.
While the fill factor does not affect the heap itself, the index is affected.
Also see: Intro to fill factor.
SQL Server only uses fillfactor when you’re creating, rebuilding, or reorganizing an index. It does not use fillfactor if it’s allocating a fresh new page at the end of the index.
Let’s look at the example of a clustered index where the key is an increasing INT identity value again. We’re just inserting rows and it’s adding new pages at the end of the index. The index was created with a 70% fillfactor (which maybe wasn’t a good idea). As inserts add new pages, those pages are filled as much as possible, likely over 70%. (It depends on the row size and how many rows can fit on the page.)
Having a lower fill factor will let SQL Server insert more rows randomly across your table and somewhat reduce the number of page splits. You'll have to test what works best for your data insertion patterns.
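For example, here is a sketch of applying this to the index from the script above (the figure of 80 is an arbitrary assumption; test against your own insert patterns):

-- Rebuild the unique index behind UQ_MyTable_ID, leaving 20% free space per page
ALTER INDEX [UQ_MyTable_ID] ON [dbo].[MyTable]
REBUILD WITH (FILLFACTOR = 80);
GO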

Hive partition table query optimisation

I am new to Hive, and to the Hadoop ecosystem in general. From the basics of Hive I have learnt that you can create partitions on a Hive table based on certain attributes, and if a query mentions that attribute it should get a performance boost, since Hive only scans that particular partition instead of the whole table. My question is: suppose there is some hierarchical structure in the data. Say I partition a table on unique state values; every time a query is based on state, Hive only scans that state's partition instead of the whole table. However, say every state also has unique district names. If I make a query based only on district values, will Hive scan the whole table?
If so, is there some way to change the query so that I can manually instruct Hive to look only in the particular state partition to which the district belongs, and then perform the other operations only on that partition, instead of scanning the whole table for matching district values?
One of the strengths of Hive is that it has strong support for partitioning. However, it cannot read your mind when you write queries.
If you have a partition on state, then you need state in the where clause for partition pruning. So, if you query only on district, the whole table would be scanned.
If you have a partition on district, then you need the district. A query on state would scan the whole table.
If you have a partition on both... well, then it is a little more complicated to declare, but your queries would read only a minority of partitions when filtering on either state or district.
If you are just learning about partitions, I would advise you to start with date partitions. These are the most common and a good way to get familiar with the concept.
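As a minimal sketch of the two-level layout discussed above (the addresses table and its columns are hypothetical):

CREATE TABLE addresses (
    id BIGINT,
    street STRING
)
PARTITIONED BY (state STRING, district STRING);

-- Prunes to a single state=.../district=... subfolder:
SELECT * FROM addresses WHERE state = 'Karnataka' AND district = 'Mysore';

-- Also prunes: only the district=Mysore subfolders are read, across all states.
SELECT * FROM addresses WHERE district = 'Mysore';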

How do we define an HBase row key so that records are fetched efficiently when there are millions of records in the table?

I have 30 million records in a table, but when I try to find one record it takes too much time to retrieve it. Could you suggest how I should generate the row key so that records can be fetched fast?
Right now I use an auto-incrementing id (1, 2, 3, and so on) as the row key. What steps should I take to improve performance? Let me know your concerns.
Generally, when we tune a SQL structured table for performance, we follow some basic/general steps: apply proper indexes to the columns used in queries, apply proper logical partitioning or bucketing to the table, and give the buffer enough memory to handle complex operations.
When it comes to big data, and especially if you are using Hadoop, the real problems come from context switching between the hard disk and the buffer, and context switching between different servers. You need to figure out how to reduce that context switching to get better performance.
Some notes:
Use the EXPLAIN feature to understand the query structure and try to improve performance.
An integer row key will give the best performance, but design the row key/index when the table is created, because adding it later kills performance.
When creating external tables in Hive / Impala against HBase tables, map the HBase row key to a string column in Hive / Impala (see the sketch after these notes). If this is not done, the row key is not used in the query and the entire table is scanned.
Never use LIKE in a row-key query, because it scans the whole table; use BETWEEN or =, <, >=.
If you are not using a filter against the row-key column in your query, your row-key design may be wrong. The row key should be designed to contain the information you need to find specific subsets of data.
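As a sketch of the Hive mapping mentioned in the notes above (hbase_records, the records HBase table, and the cf column family are all hypothetical names):

CREATE EXTERNAL TABLE hbase_records (
    rowkey STRING,
    payload STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:payload')
TBLPROPERTIES ('hbase.table.name' = 'records');

-- A range predicate on the row-key column becomes an HBase scan over just
-- that key range instead of a full-table scan:
SELECT * FROM hbase_records WHERE rowkey >= '20150101' AND rowkey < '20150201';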

Delete rows vs Delete Columns performance

I'm creating the datamodel for a timeseries application on Cassandra 2.1.3. We will be preserving X amount of data for each user of the system and I'm wondering what is the best approach to design for this requirement.
Option1:
Use a 'bucket' in the partition key, so data for X period goes into the same row. Something like this:
((id, bucket), timestamp) -> data
I can delete a single row at once at the expense of maintaining this bucket concept. It also limits the range I can query on timestamp, probably resulting in several queries.
Option2:
Store all the data in the same row; deletes are then per column.
(id, timestamp) -> data
Range queries are easy again. But what about performance after many column deletes?
Given that we plan to use TTL to let the data expire, which of the two models would deliver the best performance? Is the tombstone overhead of Option1 << Option2 or will there be a tombstone per column on both models anyway?
I'm trying to avoid to bury myself in the tombstone graveyard.
I think it will all depend on how much data you plan on having for the given partition key you end up choosing, what your TTL is and what queries you are making.
I typically lean towards option #1, especially if your TTL is the same for all writes. In addition, if you are using LeveledCompactionStrategy or DateTieredCompactionStrategy, Cassandra will do a great job of keeping data from the same partition in the same SSTable, which will greatly improve read performance.
If you use Option #2, data for the same partition could likely be spread across multiple levels (if using LCS) or just across multiple SSTables in general, which may cause you to read from a lot of SSTables, depending on the nature of your queries. There is also the issue of hotspotting, where you could overload particular Cassandra nodes if you have a really wide partition.
The other benefit of #1 (which you allude to), is that you can easily delete the entire partition, which creates a single tombstone marker which is much cheaper. Also, if you are using the same TTL, data within that partition will expire pretty much at the same time.
I do agree that it is a bit of a pain to have to make multiple queries to read across multiple partitions as it pushes some complexity into the application end. You may also need to maintain a separate table to keep track of the buckets for the given id if they can not be determined implicitly.
As far as performance goes, do you see it as likely that you will need to read across partitions when your application makes queries? For example, if you have a query for 'the most recent 1000 records' and a partition is typically wider than that, you may only need to make 1 query for Option #1. However, if you want a query like 'give me all records', Option #2 may be better, as otherwise you'll need to make a query for each bucket.
After creating the tables you described above:
CREATE TABLE option1 (
    id bigint,
    bucket bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY ((id, bucket), timestamp)
) WITH default_time_to_live=10;

CREATE TABLE option2 (
    id bigint,
    timestamp timestamp,
    data text,
    PRIMARY KEY (id, timestamp)
) WITH default_time_to_live=10;
I inserted a test row into each:
INSERT INTO option1 (id,bucket,timestamp,data) VALUES (1,2015,'2015-03-16 11:24:00-0500','test1');
INSERT INTO option2 (id,timestamp,data) VALUES (1,'2015-03-16 11:24:00-0500','test2');
...then I waited 10 seconds, queried with tracing on, and saw identical tombstone counts for each table. So either way, that shouldn't be too much of a concern for you.
The real issue is that if you think you'll ever hit the limit of 2 billion columns per partition, then Option #1 is the safe one. If you have a lot of data, Option #1 might perform better (because you eliminate the need to look at partitions that don't match your bucket), but really either one should be fine in that respect.
tl;dr;
As the issues of performance and tombstones are going to be similar no matter which option you choose, I'm thinking that Option #2 is the better one, just due to ease of querying.
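To illustrate that last point, compare a time-range read against each model (the timestamp values are placeholders):

-- Option #2: one query per id covers any time span
SELECT data FROM option2
WHERE id = 1
  AND timestamp >= '2015-03-16 00:00:00-0500'
  AND timestamp < '2015-03-17 00:00:00-0500';

-- Option #1: the application must also name every bucket it wants to read
SELECT data FROM option1
WHERE id = 1 AND bucket = 2015
  AND timestamp >= '2015-03-16 00:00:00-0500'
  AND timestamp < '2015-03-17 00:00:00-0500';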

Primary Key Effect on Performance in SQLite

I have an SQLite database used to store information about backup jobs. Each run, it grows by approximately 25 MB as a result of adding around 32,000 entries to a particular table.
This table is a "map table" used to link certain info to records in another table... and it has a primary key (autoincrement int) that I don't use.
SQLite will store an INT column in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of its value. This table only has 3 additional columns, also of INT type.
I've added indexes to the database on the columns that I use as filters (WHERE) in my queries.
In the presence of indexes, etc. and in the situation described, do primary keys have any useful benefit in terms of performance?
Note: Performance is very, very important to this project - but not if 10ms saved on a 32,000 entry job means an additional 10MB of data!
A primary key index is used to look up a row for a given primary key. It is also used to ensure that the primary key values are unique.
If you search your data using other columns, the primary key index will not be used, and as such will yield no performance benefit. Its mere existence should not have a negative performance impact either, though.
An unnecessary index wastes disk space, and makes INSERT and UPDATE statements execute slower. It should have no negative impact on query performance.
If you really don't use this id, why don't you drop the column and its primary key? The only reason to keep an unused primary key id column alive is to make it possible to create a master-detail relation with another table.
Another possibility is to keep the column but drop the primary key. That means the application has to provide a unique id with every insert statement, and before and after each batch operation you have to check that the column is still unique. This doesn't work in, for instance, MySQL and Oracle because of concurrency issues, but it does work in SQLite.
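A related technique, not mentioned above, is SQLite's WITHOUT ROWID table (available since SQLite 3.8.2): if the unused autoincrement id is dropped, the remaining columns can serve as a composite primary key. A sketch with hypothetical column names:

-- Map table without the unused autoincrement id; the composite key doubles
-- as the lookup index, and WITHOUT ROWID stores rows directly in that index,
-- saving the per-row rowid storage.
CREATE TABLE map_table (
    job_id INTEGER NOT NULL,
    file_id INTEGER NOT NULL,
    flags INTEGER NOT NULL,
    PRIMARY KEY (job_id, file_id)
) WITHOUT ROWID;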
