How does data distribution happen in bucketing in Hive? - hadoop

I have created a table as below with 3 buckets, and loaded some data into it.
create table testBucket (id int,name String)
partitioned by (region String)
clustered by (id) into 3 buckets;
I have set the bucketing property as well: set hive.enforce.bucketing=true;
When I listed the table files in HDFS I could see that 3 files were created, matching the 3 buckets I specified.
However, the data got loaded into only one file and the other 2 files are empty. So I am confused: why did my data get loaded into only one file?
Could someone please explain how data distribution happens in bucketing?
[test#localhost user]$ hadoop fs -ls /user/hive/warehouse/database2.db/buckettab/region=USA
Found 3 items
-rw-r--r-- 1 user supergroup 38 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000000_0
-rw-r--r-- 1 user supergroup 0 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000001_0
-rw-r--r-- 1 user supergroup 0 2016-06-27 08:34 /user/hive/warehouse/database2.db/buckettab/region=USA/000002_0

Bucketing is a method to evenly distribute data across many files. Hive creates the specified number of buckets and places each record into one of them based on some logic, usually a hashing algorithm.
The bucketing feature of Hive can be used to distribute/organize table or partition data into multiple files such that similar records end up in the same file. While creating a Hive table, the user specifies the columns to bucket on and the number of buckets to store the data in. Which bucket a record goes to is decided by the hash value of the bucketing columns:
[Hash(column(s))] MOD [Number of buckets]
The hash value is calculated differently for different column types. For int columns, the hash value is equal to the int value itself. For string columns, the hash value is computed from the characters of the string.
Data for each bucket is stored in a separate file under the table directory on HDFS. Inside each bucket, we can define the ordering of the data by providing a SORTED BY column while creating the table.
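If you want to check which bucket a given row should land in, you can evaluate the same expression yourself. A minimal sketch against the testBucket table from the question (hash() and pmod() are built-in Hive functions; pmod keeps the result non-negative even when hash() returns a negative value):
-- expected bucket number (0, 1 or 2) for each row in the USA partition
SELECT id, name, pmod(hash(id), 3) AS expected_bucket
FROM testBucket
WHERE region = 'USA';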
Let's see an example.
Creating a Hive table using bucketing
For creating a bucketed table, we need to use the CLUSTERED BY clause to define the columns for bucketing and provide the number of buckets. The following query creates a table Employee bucketed on the ID column into 5 buckets.
CREATE TABLE Employee(
ID BIGINT,
NAME STRING,
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING
)
COMMENT 'This is Employee table stored as textfile clustered by id into 5 buckets'
CLUSTERED BY(ID) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
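If you also want the rows inside each bucket to be ordered (as mentioned above), a hedged variant of the same DDL adds a SORTED BY clause; apart from the table name this mirrors the example table:
CREATE TABLE Employee_sorted(
ID BIGINT,
NAME STRING,
AGE INT,
SALARY BIGINT,
DEPARTMENT STRING
)
CLUSTERED BY(ID) SORTED BY(ID ASC) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;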
Inserting data into a bucketed table
We have the following data in the Employee_old table.
0: jdbc:hive2://localhost:10000> select * from employee_old;
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| employee_old.id | employee_old.name | employee_old.age | employee_old.salary | employee_old.department |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
| 1 | Sudip | 34 | 62000 | HR |
| 2 | Suresh | 45 | 76000 | FINANCE |
| 3 | Aarti | 25 | 37000 | BIGDATA |
| 4 | Neha | 27 | 39000 | FINANCE |
| 5 | Rajesh | 29 | 59000 | BIGDATA |
| 6 | Suman | 37 | 63000 | HR |
| 7 | Paresh | 42 | 71000 | BIGDATA |
| 8 | Rami | 33 | 56000 | HR |
| 9 | Arpit | 41 | 46000 | HR |
| 10 | Sanjeev | 51 | 99000 | FINANCE |
| 11 | Sanjay | 32 | 67000 | FINANCE |
+------------------+--------------------+-------------------+----------------------+--------------------------+--+
We will select data from the table Employee_old and insert it into our bucketed table Employee.
We need to set the property hive.enforce.bucketing to true before inserting data into a bucketed table, so that Hive enforces the bucketing during the insert.
Set the property
0: jdbc:hive2://localhost:10000> set hive.enforce.bucketing=true;
Insert data into Bucketed table employee
0: jdbc:hive2://localhost:10000> INSERT OVERWRITE TABLE Employee SELECT * from Employee_old;
Verify the Data in Buckets
Once we execute the INSERT query, we can verify that 5 files are created under the Employee table directory on HDFS.
Name Type
000000_0 file
000001_0 file
000002_0 file
000003_0 file
000004_0 file
Each file represents a bucket. Let us see the contents of these files.
Content of 000000_0
All records with Hash(ID) mod 5 == 0 go into this file.
5,Rajesh,29,59000,BIGDATA
10,Sanjeev,51,99000,FINANCE
Content of 000001_0
All records with Hash(ID) mod 5 == 1 go into this file.
1,Sudip,34,62000,HR
6,Suman,37,63000,HR
11,Sanjay,32,67000,FINANCE
Content of 000002_0
All records with Hash(ID) mod 5 == 2 go into this file.
2,Suresh,45,76000,FINANCE
7,Paresh,42,71000,BIGDATA
Content of 000003_0
All records with Hash(ID) mod 5 == 3 go into this file.
3,Aarti,25,37000,BIGDATA
8,Rami,33,56000,HR
Content of 000004_0
All records with Hash(ID) mod 5 == 4 go into this file.
4,Neha,27,39000,FINANCE
9,Arpit,41,46000,HR
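You can also read a single bucket back through HiveQL instead of opening the HDFS files directly. A small sketch using Hive's TABLESAMPLE bucket sampling (bucket numbers in TABLESAMPLE are 1-based, so BUCKET 1 corresponds to file 000000_0):
-- read only the first bucket of the Employee table
SELECT * FROM Employee TABLESAMPLE(BUCKET 1 OUT OF 5 ON ID);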

I suspect that ID MOD 3 comes out the same for all the IDs in the USA partition (region=USA) of your sample data.
The bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a 0x7FFFFFFF in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.
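You can observe this behaviour directly from Hive. A minimal sketch against the testBucket table from the question (hash() is the built-in Hive function): for the int column the hash equals the value itself, while for the string column it is a derived number.
-- hash of an int column equals the value; hash of a string is some derived number
SELECT id, hash(id) AS hash_of_id, name, hash(name) AS hash_of_name
FROM testBucket
WHERE region = 'USA';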

Take a look at the Hive language manual here.
It states:
How does Hive distribute the rows across the buckets? In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a 0x7FFFFFFF in there too, but that's not that important.) The hash_function depends on the type of the bucketing column. For an int, it's easy: hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you an even distribution in the buckets.
In your case, because you are clustering by Id (an int) and bucketing into only 3 buckets, it looks like all your values are being hashed into the same bucket. To confirm this is working, add some rows with ids different from the ones you have in the file, and increase the number of buckets, then check whether they get hashed into separate files this time around.
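As a hedged sketch of that experiment (staging_ids is a hypothetical helper table and the id values are made up; with ids 1 to 6 and 3 buckets you should see rows land in all three files):
-- load a few ids that differ modulo 3 through a plain staging table
CREATE TABLE staging_ids (id INT, name STRING);
INSERT INTO TABLE staging_ids VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f');

SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE testBucket PARTITION (region='USA')
SELECT id, name FROM staging_ids;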

Related

In HiveQL, what is the most elegant/performant way of calculating an average value if some of the data is implicitly not present?

In HiveQL, what is the most elegant and performant way of calculating an average value when there are 'gaps' in the data, with implicit repeated values between them? I.e., consider a table with the following data:
+----------+----------+----------+
| Employee | Date | Balance |
+----------+----------+----------+
| John | 20181029 | 1800.2 |
| John | 20181105 | 2937.74 |
| John | 20181106 | 3000 |
| John | 20181110 | 1500 |
| John | 20181119 | -755.5 |
| John | 20181120 | -800 |
| John | 20181121 | 1200 |
| John | 20181122 | -400 |
| John | 20181123 | -900 |
| John | 20181202 | -1300 |
+----------+----------+----------+
If I try to calculate a simple average of the November rows, it returns ~722.78, but the average should take into account that the days which are not shown carry the same balance as the previous record. In the above data, John had a balance of 1800.2 between 20181101 and 20181104, for example.
Assuming that the table always has exactly one row for each date/balance change, and given that I cannot change how this data is stored (and probably shouldn't, since it would be a waste of storage to write rows for days with unchanged balances), I've been tinkering with getting the average from a select with subqueries for all the days in the queried month, returning NULL for the absent days, and then using a case expression to get the balance from the previous available date in reverse order. All of this just to avoid writing temporary tables.
Step 1: Original Data
The 1st step is to recreate a table with the original data. Let's say the original table is called daily_employee_balance.
daily_employee_balance
use default;
drop table if exists daily_employee_balance;
create table if not exists daily_employee_balance (
employee_id string,
employee string,
iso_date date,
balance double
);
Insert Sample Data in original table daily_employee_balance
insert into table daily_employee_balance values
('103','John','2018-10-25',1800.2),
('103','John','2018-10-29',1125.7),
('103','John','2018-11-05',2937.74),
('103','John','2018-11-06',3000),
('103','John','2018-11-10',1500),
('103','John','2018-11-19',-755.5),
('103','John','2018-11-20',-800),
('103','John','2018-11-21',1200),
('103','John','2018-11-22',-400),
('103','John','2018-11-23',-900),
('103','John','2018-12-02',-1300);
Step 2: Dimension Table
You will need a dimension table where you have a calendar (a table with all the possible dates); call it dimension_date. Having a calendar table is a normal industry standard, and you can probably download sample data for one from the internet.
use default;
drop table if exists dimension_date;
create external table dimension_date(
date_id int,
iso_date string,
year string,
month string,
month_desc string,
end_of_month_flg string
);
Insert some sample data for entire month of Nov 2018:
insert into table dimension_date values
(6880,'2018-11-01','2018','2018-11','November','N'),
(6881,'2018-11-02','2018','2018-11','November','N'),
(6882,'2018-11-03','2018','2018-11','November','N'),
(6883,'2018-11-04','2018','2018-11','November','N'),
(6884,'2018-11-05','2018','2018-11','November','N'),
(6885,'2018-11-06','2018','2018-11','November','N'),
(6886,'2018-11-07','2018','2018-11','November','N'),
(6887,'2018-11-08','2018','2018-11','November','N'),
(6888,'2018-11-09','2018','2018-11','November','N'),
(6889,'2018-11-10','2018','2018-11','November','N'),
(6890,'2018-11-11','2018','2018-11','November','N'),
(6891,'2018-11-12','2018','2018-11','November','N'),
(6892,'2018-11-13','2018','2018-11','November','N'),
(6893,'2018-11-14','2018','2018-11','November','N'),
(6894,'2018-11-15','2018','2018-11','November','N'),
(6895,'2018-11-16','2018','2018-11','November','N'),
(6896,'2018-11-17','2018','2018-11','November','N'),
(6897,'2018-11-18','2018','2018-11','November','N'),
(6898,'2018-11-19','2018','2018-11','November','N'),
(6899,'2018-11-20','2018','2018-11','November','N'),
(6900,'2018-11-21','2018','2018-11','November','N'),
(6901,'2018-11-22','2018','2018-11','November','N'),
(6902,'2018-11-23','2018','2018-11','November','N'),
(6903,'2018-11-24','2018','2018-11','November','N'),
(6904,'2018-11-25','2018','2018-11','November','N'),
(6905,'2018-11-26','2018','2018-11','November','N'),
(6906,'2018-11-27','2018','2018-11','November','N'),
(6907,'2018-11-28','2018','2018-11','November','N'),
(6908,'2018-11-29','2018','2018-11','November','N'),
(6909,'2018-11-30','2018','2018-11','November','Y');
Step 3: Fact Table
Create a fact table from the original table. In normal practice, you ingest the data into HDFS/Hive, then process the raw data and create a table with historical data into which you keep inserting incrementally. You can look into data warehousing for the proper definition, but I call this a fact table: f_employee_balance.
This will re-create the original table, filling in the missing dates and populating the missing balances with the earlier known balance.
--inner query to get all the possible dates
--outer self join query will populate the missing dates and balance
drop table if exists f_employee_balance;
create table f_employee_balance
stored as orc tblproperties ("orc.compress"="SNAPPY") as
select q1.employee_id, q1.iso_date,
nvl(last_value(r.balance, true) --initial dates to be populated with 0 balance
over (partition by q1.employee_id order by q1.iso_date rows between unbounded preceding and current row),0) as balance,
month, year from (
select distinct
r.employee_id,
d.iso_date as iso_date,
d.month, d.year
from daily_employee_balance r, dimension_date d )q1
left outer join daily_employee_balance r on
(q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date);
Step 4: Analytics
The query below will give you the true average by month:
select employee_id, monthly_avg, month, year from (
select employee_id,
row_number() over (partition by employee_id,year,month) as row_num,
avg(balance) over (partition by employee_id,year,month) as monthly_avg, month, year from
f_employee_balance)q1
where row_num = 1
order by year, month;
Step 5: Conclusion
You could have just combined steps 3 and 4 together; this would save you from creating the extra table. When you are in the big data world, you don't worry much about using extra disk space or development time: you can easily add another disk or node and automate the process using workflows. For more information, please look into data warehousing concepts and Hive analytical queries.
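For example, a hedged sketch of steps 3 and 4 combined into a single query (same logic as above, just nested, using only the tables already defined; no intermediate ORC table is created):
select employee_id, monthly_avg, month, year from (
  select employee_id,
         row_number() over (partition by employee_id, year, month) as row_num,
         avg(balance) over (partition by employee_id, year, month) as monthly_avg,
         month, year
  from (
    -- same logic as the f_employee_balance query above, inlined
    select q1.employee_id, q1.iso_date,
           nvl(last_value(r.balance, true)
               over (partition by q1.employee_id order by q1.iso_date
                     rows between unbounded preceding and current row), 0) as balance,
           q1.month, q1.year
    from (
      select distinct r.employee_id, d.iso_date as iso_date, d.month, d.year
      from daily_employee_balance r, dimension_date d
    ) q1
    left outer join daily_employee_balance r
      on (q1.employee_id = r.employee_id) and (q1.iso_date = r.iso_date)
  ) f
) q2
where row_num = 1
order by year, month;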

Oracle partitioned table query cost vs non-partitioned table query cost

I have a table PO_HEADER with ~20 million records. Considering our future load on the table, we have decided to partition the table to increase the performance of the SQL queries. Below are the queries used to create the new partitioned table.
CREATE TABLE PO_HEADER_LP
PARTITION BY LIST (BUYER_IDENTIFIER)
(PARTITION GC66287246AA VALUES ('GC66287246AA') TABLESPACE MITRIX_TABLES,
PARTITION GC43837235JK VALUES ('GC43837235JK') TABLESPACE MITRIX_TABLES,
PARTITION GC84338293AA VALUES ('GC84338293AA') TABLESPACE MITRIX_TABLES,
PARTITION DEFAULTBUID VALUES (DEFAULT) TABLESPACE MITRIX_TABLES)
AS SELECT *
FROM PO_HEADER;
create index PO_HEADER_LP_SI_IDX on PO_HEADER_LP("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES LOCAL;
Old Table PO_HEADER has two indexes on "BUYER_IDENTIFIER" and "SUPPLIER_IDENTIFIER" columns as follows:
create index PO_HEADER_BI_IDX on PO_HEADER("BUYER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
create index PO_HEADER_SI_IDX on PO_HEADER("SUPPLIER_IDENTIFIER") TABLESPACE MITRIX_INDEXES;
To test the performance of the query, I executed the query below on both tables. But to my surprise, the cost of the 2nd query is almost double that of the 1st one. Does anybody know why the query cost is higher on the partitioned table compared to the normal table? Thanks in advance.
select * from po_header where buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 56,941
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT'; --cost: 93,309
PO_HEADER with Global Index on buyer_identifier & supplier_identifier column
PO_HEADER_LP with Global Index on supplier_identifier column
PO_HEADER_LP with Local Index on supplier_identifier column
From your DDL I assume you have three big buyers (say 5M records each) and a bunch of smaller ones. In other words, this would be the correct setup for your list partitioning scheme.
You may verify whether it works by testing access on the buyer only:
EXPLAIN PLAN SET STATEMENT_ID = 'jara1' into plan_table FOR
select * from tab_lp where BUYER_ID = 1;
;
SELECT * FROM table(DBMS_XPLAN.DISPLAY('plan_table', 'jara1','ALL'));
------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6662K| 82M| 4445 (2)| 00:00:01 | | |
| 1 | PARTITION LIST SINGLE| | 6662K| 82M| 4445 (2)| 00:00:01 | KEY | KEY |
| 2 | TABLE ACCESS FULL | TAB_LP | 6662K| 82M| 4445 (2)| 00:00:01 | 2 | 2 |
------------------------------------------------------------------------------------------------
The same query on the non-partitioned table should produce a much higher cost. Why?
In the partitioned table the selected buyer (in your case GC84338293AA; I'm using surrogate keys) has its own partition.
So a full scan of this partition is the best access path.
select * from tab where BUYER_ID = 1;
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6596K| 81M| 14025 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| TAB | 6596K| 81M| 14025 (1)| 00:00:01 |
--------------------------------------------------------------------------
1 - filter("BUYER_ID"=1)
For the non-partitioned table (to get approximately one fourth of the data) the FULL TABLE SCAN is OK as well,
but of course it has a higher cost since all data must be scanned.
Note: if you see a lower cost here, an unrealistically low Rows count, and/or INDEX ACCESS,
then the cause of the problem is an underestimation of the cost. So don't worry: the old cost is too low, not the new one too high!
The next step is the access on both buyer and supplier. To get the answer you must provide additional information.
How selective is the supplier filter?
I.e. if the predicate buyer_identifier='GC84338293AA' returns say 5M records, how many records does the predicate with both columns return?
buyer_identifier='GC84338293AA' and supplier_identifier='GC75987723HT'
Is it 4M or 100 records?
If the complete predicate returns only a few records, then the local index on supplier is OK.
If it returns a large number of rows (say a quarter of the partition), you should stay with the FULL PARTITION SCAN and not use the index.
This is similar to my comment on the non-partitioned table.
Estimation of the supplier cardinality
In case the column SUPPLIER contains skewed data (which may fool the CBO into calculating an improper cost), you may explicitly define a histogram on this column.
I used the statement below, which calculates the histogram on the full data (100% is important for highly skewed data), for both the table and the partitions.
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'TAB_LP',granularity=>'all',estimate_percent => 100,METHOD_OPT => 'for columns SUPPLIER_ID size 254');
This worked for my test data, i.e. for suppliers with low cardinality an index access was chosen (on the local non-prefixed index), and for huge suppliers a full partition scan was used.
You can create a Local partitioned index using this script.
CREATE INDEX PO_HEADER_LOCAL_IDX ON PO_HEADER_LP
(BUYER_IDENTIFIER, SUPPLIER_IDENTIFIER)
LOCAL (
PARTITION GC66287246AA,
PARTITION GC43837235JK,
PARTITION GC84338293AA,
PARTITION DEFAULTBUID
);
It is also recommended to gather statistics on the newly created partitioned table using this script:
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA Name','PO_HEADER_LP');
Now you can generate the execution plan again of the following SQL:
select * from po_header_lp where buyer_identifier= 'GC84338293AA' and supplier_identifier='GC75987723HT';
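For instance, a minimal sketch of generating and displaying that plan (DBMS_XPLAN.DISPLAY reads the default PLAN_TABLE):
EXPLAIN PLAN FOR
select * from po_header_lp
where buyer_identifier = 'GC84338293AA'
  and supplier_identifier = 'GC75987723HT';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);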
Hope this will help you.

cassandra query on map in select clause

I am new to Cassandra and I am trying to read a row from the database which contains these values:
siteId | country | someMap
1 | US | {a:b, x:z}
2 | PR | {a:b, x:z}
I have also created an index on the table using: create index on columnfamily(keys(someMap));
But still, when I query with select * from table where siteId=1 and someMap contains key 'a'
it returns the entire map:
1 | US | {a:b, x:z}
Can somebody help me with what I should do to get the value as
1 | US | {a:b}
You cannot: even though internally each entry of a Map|List|Set is stored as a column, you can only retrieve the whole collection, not part of it. You are not asking Cassandra to give you the entry of the map containing X, but rather the rows whose map contains X.
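For reference, a hedged sketch of the schema and query involved (the table name sites is made up; the column names come from the question). The result of the last SELECT is always the full row, including the whole map:
CREATE TABLE sites (
  siteid int PRIMARY KEY,
  country text,
  somemap map<text, text>
);

CREATE INDEX ON sites (KEYS(somemap));

-- returns 1 | US | {a:b, x:z}; the map is not filtered down to the 'a' entry
-- (depending on the Cassandra version this may also require ALLOW FILTERING)
SELECT * FROM sites WHERE siteid = 1 AND somemap CONTAINS KEY 'a';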
HTH,
Carlo

Oracle Identify not unique values in a clob column of a table

I want to identify all rows whose content in a clob column is not unique.
The query I use is:
select
id,
clobtext
from
table t
where
(select count(*) from table innerT where dbms_lob.compare(innerT.clobtext, t.clobtext) = 0)>1
However, this query is very slow. Any suggestions to speed it up? I already tried to use the dbms_lob.getlength function to eliminate more candidates in the subquery, but it didn't really improve the performance (it feels about the same).
To make it clearer, here is an example:
table
ID | clobtext
1 | a
2 | b
3 | c
4 | d
5 | a
6 | d
After running the query, I'd like to get (order doesn't matter):
1 | a
4 | d
5 | a
6 | d
In the past I've generated checksums (in my C# code) for each clob.
Whilst this incurs a one-off increase in I/O (to generate the checksums),
subsequent scans will be quicker, and you can index the checksum value too.
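One way to do the equivalent inside Oracle is to hash each CLOB and compare the hashes instead of calling dbms_lob.compare pairwise. A hedged sketch (assumes EXECUTE on DBMS_CRYPTO; yourtable stands in for the question's table; since hash collisions are theoretically possible, you could re-verify matches with dbms_lob.compare):
select id, clobtext
from (
  select id,
         clobtext,
         -- rows sharing the same CLOB hash get the same dup_cnt
         count(*) over (
           partition by dbms_crypto.hash(clobtext, 3 /* DBMS_CRYPTO.HASH_SH1 */)
         ) as dup_cnt
  from yourtable
)
where dup_cnt > 1;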
Tom Kyte has a good PL/SQL example here:
Ask Tom

How should I range partition an index with a varchar2 column in Oracle? Is it a bad idea?

I am using Oracle 10g Enterprise edition.
A table in our Oracle database stores the soundex value representation of another text column. We are using a custom soundex implementation in which the soundex values are longer than those generated by traditional soundex algorithms (such as the one Oracle uses). That's really beside the point.
Basically I have a varchar2 column that has values containing a single character followed by a dynamic number of numeric values (e.g. 'A12345', 'S382771', etc). The table is partitioned by another column, but I'd like to add a partitioned index to the soundex column since it is often searched. When trying to add a range partitioned index using the first character of the soundex column it worked great:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
partition IDX_NAMES_SOUNDEX_PART_A values less than ('B'), -- 'A%'
partition IDX_NAMES_SOUNDEX_PART_B values less than ('C'), -- 'B%'
...
);
However, in order to more evenly distribute the size of the partitions, I want to define some partitions by the first two characters, like so:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by range (soundex) (
partition IDX_NAMES_SOUNDEX_PART_A5 values less than ('A5'), -- 'A0% - A4%'
partition IDX_NAMES_SOUNDEX_PART_A values less than ('B'), -- 'A5% - A9%'
partition IDX_NAMES_SOUNDEX_PART_B values less than ('C'), -- 'B%'
...
);
I'm not sure how to properly range partition using varchar2 columns. I'm sure this is a less than ideal choice, so perhaps someone can recommend a better solution. Here's a distribution of the soundex data in my table:
-----------------------------------
| SUBSTR(SOUNDEX,1,1) | COUNT |
-----------------------------------
| A | 6476349 |
| B | 854880 |
| D | 520676 |
| F | 1200045 |
| G | 280647 |
| H | 3048637 |
| J | 711031 |
| K | 1336522 |
| L | 348743 |
| M | 3259464 |
| N | 1510070 |
| Q | 276769 |
| R | 1263008 |
| S | 3396223 |
| V | 533844 |
| W | 555007 |
| Y | 348504 |
| Z | 1079179 |
-----------------------------------
As you can see, the distribution is not evenly spread, which is why I want to define range partitions using the first two characters instead of just the first character.
Suggestions?
Thanks!
What exactly is your question?
Don't you know how you can split your table into n equal parts to avoid skew?
You can do that with the analytic function percentile_disc().
Here is an SQL*Plus example with n=100. I admit that it isn't very sophisticated, but it will do the job.
set pages 0
set lines 200
drop table random_strings;
create table random_strings
as
select upper(dbms_random.string('A', 12)) rndmstr
from dual
connect by level < 1000;
spool parts
select 'select '||level||'/100,percentile_disc('||level||
'/100) within group (order by RNDMSTR) from random_strings;'
sql_statement
from dual
connect by level <= 100
/
spool off
This will output in file parts.lst:
select 1/100,percentile_disc(1/100) within group (order by RNDMSTR) from random_strings;
select 2/100,percentile_disc(2/100) within group (order by RNDMSTR) from random_strings;
select 3/100,percentile_disc(3/100) within group (order by RNDMSTR) from random_strings;
...
select 100/100,percentile_disc(100/100) within group (order by RNDMSTR) from random_strings;
Now you can run script parts.lst to get the partition values. Each partition will contain 1% of the data initially.
Script parts.lst will output:
,01 AJUDRRSPGMNP
,02 AOMJZQPZASQZ
,03 AWDQXVGLLUSJ
,04 BIEPUHAEMELR
....
,99 ZTMHDWTXUJAR
1 ZYVJLNATVLOY
Is the table being searched by the partitioning key in addition to the SOUNDEX value? Or is it being searched just by the SOUNDEX column?
If you are just trying to achieve an even distribution of data among partitions, have you considered using hash partitions rather than range partitions? Assuming you choose a power of 2 for the number of partitions, that should give you a pretty even distribution of data between partitions.
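For instance, a hedged sketch of a hash-partitioned global index on the soundex column (8 partitions is an arbitrary power of two; the names follow the question's convention):
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex)
global partition by hash (soundex)
partitions 8;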
Talk to me!
Can you tell me what your reason is for partitioning this table? It sounds like it is an OLTP table and may not need to be partitioned. We don't want to partition just to say we are partitioned. Tell me what you are trying to accomplish by partitioning this table and I can help you pick a correct partitioning scheme. Partitioning does not equal faster queries. It can actually cause your queries to be slower in some cases.
I see some of your additional thoughts above and I don't believe you need to partition your table. If your queries are going to be doing aggregates on entire partitions then you may want to partition. If you are going to have hundreds of millions of rows of data you may want to partition to help with DBA maintenance. If you just want your queries to run fast, then the primary key index will suffice. Please let me know.
Just create a global index on your desired columns.
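For example, a minimal non-partitioned version using the names from the question:
create index IDX_NAMES_SOUNDEX on NAMES_SOUNDEX (soundex);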
