How do you efficiently design a Hive/Impala table considering the following facts?
The table receives tool data of about 100 million rows every
day. The date on which it receives the data is stored in a column in
the table along with its tool id.
Each tool receives about 500 runs per day, each identified by a run id column. Each run id holds approximately 1 MB of data.
The default block size is 64 MB.
The table will be searched by date, tool id, and run id, in that order.
If you are doing analytics on this data, then a solid choice with Impala is the Parquet format. What has worked well for our users is to partition by year, month, and day based on a date value on the record.
So, for example: CREATE TABLE foo (tool_id INT, eff_dt TIMESTAMP) PARTITIONED BY (year INT, month INT, day INT) STORED AS PARQUET;
When loading the data into this table we use something like this to create dynamic partitions:
INSERT INTO foo partition (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;
Then train your users that, for the best performance, they should add YEAR, MONTH, and DAY to their WHERE clause so the query hits the right partitions. Have them also include eff_dt in the SELECT list so they get a date value in the format they like to see in their final results.
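For instance, a hypothetical query against the foo table above might look like this (the literal values are just an example):
SELECT tool_id, eff_dt
FROM foo
WHERE year = 2017 AND month = 4 AND day = 6   -- prunes to a single partition
  AND tool_id = 42;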
In CDH, Parquet by default stores data in 256 MB chunks (which is configurable). Here is how to configure it: http://www.cloudera.com/documentation/enterprise/latest/topics/impala_parquet_file_size.html
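The file size can also be changed per session with the PARQUET_FILE_SIZE query option described at that link; a small sketch (the 128 MB value is only an example):
-- in impala-shell, before the INSERT that writes the Parquet files
SET PARQUET_FILE_SIZE=134217728;   -- 128 MB, in bytes
INSERT INTO foo PARTITION (year, month, day)
SELECT tool_id, eff_dt, year(eff_dt), month(eff_dt), day(eff_dt)
FROM source_table;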
My Hive table will have call record data.
Three of its columns are CALL_DATE, FROM_PHONE_NUM, and TO_PHONE.
I would query something like:
1) I want to get all call records between particular dates.
2) I want to get all call records for a FROM_PHONE_NUM phone number between certain dates.
3) I want to get all call records for a TO_PHONE phone number between certain dates.
My table size is approximately 6TB.
How should I apply partitioning or bucketing for better performance on all of these queries?
Your requirement is always to get data between certain dates and filter on it, so partition the table based on date.
See also: how to create dynamic partitions.
You can use the date as the partition key in yyyymmdd format (e.g., 20170406 for 6th April 2017).
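A minimal sketch of what that could look like (the table name call_records, the source table source_calls, and the call_dt partition column are assumptions for illustration):
CREATE TABLE call_records (
  call_date      TIMESTAMP,
  from_phone_num STRING,
  to_phone       STRING
)
PARTITIONED BY (call_dt INT)   -- yyyymmdd, e.g. 20170406
STORED AS PARQUET;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE call_records PARTITION (call_dt)
SELECT call_date, from_phone_num, to_phone,
       CAST(date_format(call_date, 'yyyyMMdd') AS INT)
FROM source_calls;
Each query can then add a range predicate on call_dt (e.g. call_dt BETWEEN 20170101 AND 20170131) so only the relevant partitions are scanned.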
I have a table bucketed on the column flightnum (10 buckets); the data size is approximately 700 MB and bucketing is enforced.
When I execute the query:
select count(flightnum) from flight_buck where flightnum=10;
I get the response in approximately 46 seconds, using 27 mappers in total.
When executing the same query on a non-bucketed table with the same data:
select count(flightnum) from flight_temp where flightnum=10;
I get the response in approximately 47 seconds, using 30 mappers in total.
Why am I getting the response in the same amount of time?
Bucketing helps joins run faster; to speed up a simple SELECT you have to use partitioned tables. Try partitioning the table by flightnum and running the selects again, as sketched below.
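A rough sketch of that (the flight_part name and the non-partition columns are assumptions; flight_temp is the table from the question):
CREATE TABLE flight_part (
  carrier   STRING,
  dep_delay INT
)
PARTITIONED BY (flightnum INT)
STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE flight_part PARTITION (flightnum)
SELECT carrier, dep_delay, flightnum
FROM flight_temp;

-- the predicate now prunes down to a single partition directory
SELECT count(flightnum) FROM flight_part WHERE flightnum = 10;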
Why does this happen?
Let's create a bucketed but not partitioned table like this:
create table `t1b`(
`exchange` string,
`stock_symbol` string,
`date` string,
`stock_price_open` float,
`stock_price_high` float,
`stock_price_low` float,
`stock_price_close` float,
`stock_volume` int,
`stock_price_adj_close` float)
clustered by ( `stock_symbol` ) sorted by ( `date` ) into 306 buckets;
And let's fill it with data... There are as many reducers as buckets, because each reducer processes only the records with the same keys and stores the data into its own file, sorted the way you asked for, in this case by date.
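A sketch of that load (stocks_raw is an assumed source table; enforcing bucketing was needed before Hive 2.x):
set hive.enforce.bucketing = true;   -- implicit from Hive 2.x onwards
insert into table t1b
select `exchange`, `stock_symbol`, `date`,
       `stock_price_open`, `stock_price_high`, `stock_price_low`,
       `stock_price_close`, `stock_volume`, `stock_price_adj_close`
from stocks_raw;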
Let's look at HDFS...
Note what we got: 306 files (buckets)...
and inside each of them there are records which share the same clustering key...
But all the files are in the same folder, and when SELECTing with Hive there is no way to know which files hold the value we are looking for, so bucketing with no partitioning does not speed up a SELECT, because there is no information about where the data we are looking for lives.
What does bucketing do? When you are JOINing data, a whole bucket can be loaded into RAM, so we get a fast join in the MAP phase instead of a slow join in the REDUCE phase.
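As a sketch, assuming a second table t2b bucketed the same way on stock_symbol, the bucket map join can be switched on like this:
set hive.optimize.bucketmapjoin = true;
select /*+ MAPJOIN(b) */ a.`stock_symbol`, a.`stock_price_close`, b.`stock_volume`
from t1b a
join t2b b
  on a.`stock_symbol` = b.`stock_symbol` and a.`date` = b.`date`;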
I have daily transactions with up to 5-10 GB of data per day. In my view it makes more sense to partition by month.
Here is an example:
My table has the following columns:
TRANSACTION_DATE TIMESTAMP -- transaction date
TRANSACTION_AMOUNT INTEGER -- transaction amount
DWH_PARTITION STRING -- technical field that goes into the PARTITIONED BY section
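A minimal DDL sketch of that layout (TEST is the table name used in the queries below; monthly 'yyyy-MM' partition values are an assumption):
CREATE TABLE TEST (
  TRANSACTION_DATE   TIMESTAMP,
  TRANSACTION_AMOUNT INT
)
PARTITIONED BY (DWH_PARTITION STRING)   -- e.g. '2015-01'
STORED AS PARQUET;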
Now I want to query for the amount of transactions between January 15th 2015 and November 15th 2015.
My query would be
select sum(TRANSACTION_AMOUNT) from TEST where TRANSACTION_DATE >= CAST('2015-01-15' as timestamp) AND TRANSACTION_DATE < CAST('2015-11-15' as timestamp)
This query returns correct data, but it does a full table scan, while I would like it to just use partitions 2015-01, 2015-02, ..., 2015-11.
To do so I need to manually specify which partitions to use, so the query would be as follows:
select sum(TRANSACTION_AMOUNT) from TEST where TRANSACTION_DATE >= CAST('2015-01-15' as timestamp) AND TRANSACTION_DATE < CAST('2015-11-15' as timestamp) and DWH_PARTITION in ('2015-01',.........'2015-11');
Because we cannot partition by timestamp, a business analyst would have to know the exact partitioning pattern (whether a given table is partitioned by month, day, etc.).
Please also note that the date information needs to be specified twice: once for the transaction date and once for the partitions.
Do you know of any partitioning methods that avoid specifying the same information twice and release the user from having to know the partitioning patterns of all the tables they need to query?
That could only be achieved with range partitioning, which is currently not supported. A UDF might help, but I am not 100% sure.
We solved that problem by providing a simple web interface where the user can choose the table and the filter columns; under the covers the application is intelligent enough to generate the query, leveraging partition pruning.
I have a table that stores millions of url, date and name entries. Each row is unique in terms of either:
url + date
or
date + name.
I require this table to be stored in descending date order so that when I query it I can simply "SELECT * FROM mytable LIMIT 1000" to get me the most recent 1000 records, no sorting involved. Does anyone know how to set things up to do this please? To the best of my current understanding I am trying the following but it does not store them in date order:
CREATE TABLE mytable (
url text,
date timestamp,
name text,
PRIMARY KEY ((url, name), date)
)
WITH CLUSTERING ORDER BY (date DESC);
To store the data according to an order, you'd need to change the partitioner to byte-ordered. This is no longer a good idea... it's maintained for backwards compatibility, but there are issues:
http://www.datastax.com/documentation/cassandra/2.1/cassandra/architecture/architecturePartitionerBOP_c.html
You could also apply bucketing and query over your buckets. Each bucket would be a partition, and each partition would have its data stored in order. Not exactly what you want, but worth trying.
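A rough CQL sketch of that idea, assuming a month string as the bucket (the bucket column and its format are assumptions):
CREATE TABLE mytable_bucketed (
  bucket text,        -- e.g. '2015-11', one partition per month
  date   timestamp,
  url    text,
  name   text,
  PRIMARY KEY ((bucket), date, url, name)
) WITH CLUSTERING ORDER BY (date DESC, url ASC, name ASC);

-- newest records within one bucket, already in date order
SELECT * FROM mytable_bucketed WHERE bucket = '2015-11' LIMIT 1000;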
I have a log table with a lot of information.
I would like to partition it into two: the first part is the logs from the past month, since they are commonly viewed; the second part is the logs from the rest of the year (compressed).
My problem is that all the partitioning examples are like "up until 1/1/2013", "more recent than 1/1/2013" - that is, with fixed dates...
What I am looking for is a way to define a partition on the last month, so that when the day changes, the logs from 30 days ago are "automatically" moved to the compressed partition.
I guess I could create another table which is completely compressed and move data using jobs, but I was hoping for a built-in solution.
Thank you.
I think you want interval partitioning based on a date. This will automatically generate the partitions for you. For example, monthly partitions would be:
create table test_data (
created_date DATE default sysdate not null,
store_id NUMBER,
inventory_id NUMBER,
qty_sold NUMBER
)
PARTITION BY RANGE (created_date)
INTERVAL(NUMTOYMINTERVAL(1, 'MONTH'))
(
PARTITION part_01 values LESS THAN (TO_DATE('20130101','YYYYMMDD'))
)
As data is inserted, Oracle will put it into the proper partition, or create one if needed. The partition names will be a bit cryptic (SYS_xxxx), but you can use the "partition for" clause to grab only the month you want. For example:
select * from test_data partition for (to_date('20130101', 'YYYYMMDD'))
It is not possible to automatically transfer data to a compressed partition. You can, however, schedule a simple job to compress last month's partition at the beginning of every month with this statement:
ALTER TABLE some_table
MOVE PARTITION FOR (add_months(trunc(SYSDATE), -1))
COMPRESS;
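A sketch of how that statement could be scheduled with DBMS_SCHEDULER (the job name and the run-on-the-1st schedule are just examples):
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name        => 'COMPRESS_LAST_MONTH_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN
                          EXECUTE IMMEDIATE
                            ''ALTER TABLE some_table
                              MOVE PARTITION FOR (add_months(trunc(SYSDATE), -1))
                              COMPRESS'';
                        END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'FREQ=MONTHLY;BYMONTHDAY=1',
    enabled         => TRUE);
END;
/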
If you wanted to stay with only two partitions (the current month and an archive of all past transactions), you could also merge partitions with ALTER TABLE ... MERGE PARTITIONS, but as far as I know that rebuilds the whole archive partition, so I would discourage doing so and stay with storing each month in its own partition.