How to arrange multiple partitions in Hive? - hadoop

Say I have an order table which contains multiple time columns (spend_time, expire_time, withdraw_time).
Usually I query the table on each of these columns independently, so how should I create the partitions?
order_no | spend_time | expire_time | withdraw_time | spend_amount
A001     | 2017/5/1   | 2017/6/1    | 2017/6/2      | 100
A002     | 2017/4/1   | 2017/4/19   | 2017/4/25     | 500
A003     | 2017/3/1   | 2017/3/19   | 2017/3/25     | 1000
The usual business requirement is to calculate the total spend_amount for a certain range of spend_time, expire_time, or withdraw_time, or a combination of the three.
But cross-combining the 3 time dimensions (each with about 1,000 partitions) could produce a huge number of partitions (1000*1000*1000). Is that OK and efficient?
My solution is to create 3 tables, each partitioned by a different one of the time columns. Is this an efficient way to solve the problem?
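For illustration, here is a minimal HiveQL sketch of that three-table idea (table and column details are made up; only the spend_time copy is shown). Each copy of the data is partitioned by the single date column its queries filter on, so partition pruning works for each dimension independently:

-- Hypothetical DDL: one copy of the orders data, partitioned by spend date.
CREATE TABLE orders_by_spend_time (
  order_no      STRING,
  expire_time   STRING,
  withdraw_time STRING,
  spend_amount  DECIMAL(10,2)
)
PARTITIONED BY (spend_date STRING);  -- ~1000 daily partitions, e.g. '2017-05-01'

-- A query on this dimension then prunes partitions directly:
SELECT SUM(spend_amount)
FROM orders_by_spend_time
WHERE spend_date BETWEEN '2017-04-01' AND '2017-05-01';

The orders_by_expire_time and orders_by_withdraw_time copies would look the same, with the roles of the columns swapped.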

Related

Is there a way to rank multiple columns in power query?

I am setting up a query where I need to rank multiple columns. I was able to sort the first column in descending order and insert an index column. However, I am not able to rank the other columns.
I have included an example below:
Table to show agent performance
Agent  | surveys | rank | outcalls | total calls | outcalls/total calls | rank
Dallas | 80%     | 1    | 50       | 80          | 62.5%                | ?
May    | 75%     | 2    | 90       | 100         | 90.0%                | ?
Summer | 60%     | 3    | 60       | 75          | 80.0%                | ?
So basically from the example above, I was able to add an index column that ranked the surveys. How can I rank the outcalls/total calls column while still maintaining the rank in the other columns?
In this case, a simple approach would be to sort on outcalls/total calls, add another index column, and then sort on the first rank column if you want to revert to your starting order.

Oracle insert into indexed table: loading 500 thousand rows takes more time than inserting 16 million rows

At first I tried a normal insert into the target table from the temporary table.
INSERT /*+ APPEND */ INTO RDW10DM.INV_ITEM_LW_DM
SELECT
*
FROM
RDW10PRD.TMP_MDS_RECLS_INV_ITEM_LW_DM
;
COMMIT;
It took only 17 minutes to load. The total count in temp table TMP_MDS_RECLS_INV_ITEM_LW_DM is 16,491,650.
Plan for Execution:
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 16M| 1290M| 4927 |
| 1 | LOAD AS SELECT | | | | |
| 2 | TABLE ACCESS FULL | TMP_MDS_RECLS_INV_ITEM_LW_DM | 16M| 1290M| 4927 |
--------------------------------------------------------------------------------------
Note: cpu costing is off
Then I tried to load one location at a time:
INSERT /*+ APPEND */ INTO RDW10DM.INV_ITEM_LW_DM
SELECT
*
FROM
RDW10PRD.TMP_MDS_RECLS_INV_ITEM_LW_DM
where LOC_KEY=222
;
COMMIT;
This took around 28 minutes to load. The total count in the temp table with the filter is 493,465.
Plan for execution:
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 492K| 38M| 4927 |
| 1 | LOAD AS SELECT | | | | |
|* 2 | TABLE ACCESS FULL | TMP_MDS_RECLS_INV_ITEM_LW_DM | 492K| 38M| 4927 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("TMP_MDS_RECLS_INV_ITEM_LW_DM"."LOC_KEY"=222)
Note: cpu costing is off
Indexes on the target table: (screenshot not reproduced)
Does anyone have any idea why this is happening?
My guess? The TMP table doesn't have an index.
Therefore, selecting all records and inserting them is faster than applying a filter on 16 million records.
As you can see, in your second execution plan the scanner is doing a FULL TABLE ACCESS, which slows down the query. Try adding an index on TMP_MDS_RECLS_INV_ITEM_LW_DM(LOC_KEY). It should boost your query's performance.
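If you want to test that suggestion, the index itself is a one-liner (the index name here is a placeholder):

-- Hypothetical index name; supports the LOC_KEY = 222 filter predicate.
CREATE INDEX RDW10PRD.TMP_RECLS_LOC_KEY_IX
  ON RDW10PRD.TMP_MDS_RECLS_INV_ITEM_LW_DM (LOC_KEY);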
Thank you everyone for your valuable thoughts.
I found the actual problem later. Since I had been doing frequent truncate-and-load cycles on the target table RDW10DM.INV_ITEM_LW_DM, the index pages might have become fragmented.
So I ran the query again after rebuilding the indexes and got the expected results.
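For anyone hitting the same symptom, the rebuild itself looks roughly like this (the index name is a placeholder; the real names can be read from ALL_INDEXES):

-- Rebuild an index on the target table after heavy truncate-and-load cycles.
ALTER INDEX RDW10DM.INV_ITEM_LW_DM_IX REBUILD;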

Efficient way to join by levenshtein in Hive or Impala

I have two tables, one with about 17K records (NLIST) and the other with 57K (FNAMES).
I would like to join the two by comparing the records using the Levenshtein distance.
Here is an example of the tables' content:
Table NLIST:
+------+-------------+
| ID | S_NAME |
+------+-------------+
| 1 | Avi |
| 2 | Moshe |
| 3 | David |
....
Table FNAMES:
+------+-------------+
| ID | NICKNAMES |
+------+-------------+
| 1 | Avile |
| 2 | Dudi |
| 3 | Moshiko |
| 4 | Avi |
| 5 | DAVE |
....
The above tables are just examples. In the real case the names column can include more than one word.
The required result should be:
+------+-------------+--------+
| ID | NICKNAMES | S_NAME |
+------+-------------+--------+
| 1 | Avile | Avi |
| 2 | Dudi | David |
| 3 | Moshiko | Moshe |
| 4 | Avi | Avi |
| 5 | DAVE | David |
...
Here is the code I use:
select FNAMES.NICKNAMES, NLIST.S_NAME
from FNAMES
LEFT OUTER JOIN NLIST
ON (true)
WHERE levenshtein(FNAMES.NICKNAMES, NLIST.S_NAME) <= 4
The above code ran for a very long time, so I stopped it.
How can I make it run in a reasonable time?
In addition, I think the acceptable Levenshtein distance depends on the length of the words. How can I find the optimal value for the distance (in this case I chose 4 arbitrarily)?
Hive table performance depends on various points:
Query engine
File format
Vectorization: set hive.vectorized.execution.enabled = true; set hive.vectorized.execution.reduce.enabled = true;
If you have a good server you can try Impala, which is definitely faster than Hive for this kind of query.
You can also fine-tune Impala to get an extra edge in executing this query; see Tuning Impala for Performance.
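Beyond engine tuning, the usual way to make this kind of fuzzy join feasible is to avoid the full 17K x 57K cross join by blocking on a cheap key first. A sketch (this is an assumption, not part of the original answer: it presumes matching pairs share a first letter, and it also replaces the fixed distance of 4 with a length-relative threshold, which speaks to the second question):

-- Join only within blocks sharing a lower-cased first character, so
-- levenshtein() runs on candidate pairs instead of every combination.
SELECT f.ID, f.NICKNAMES, n.S_NAME
FROM FNAMES f
JOIN NLIST n
  ON lower(substr(f.NICKNAMES, 1, 1)) = lower(substr(n.S_NAME, 1, 1))
WHERE levenshtein(lower(f.NICKNAMES), lower(n.S_NAME)) <= length(n.S_NAME) / 2;

The trade-off is recall: pairs whose first letters differ are never compared, so for multi-word names a looser blocking key (the first letter of each word, a phonetic code such as soundex(), or a length bucket) may be needed.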

Get one value for each date in SSRS

Using SSRS, I have data with duplicate values in Field1. I need to get only one value for each month.
Field1 | Date
-------+------------
30     | 01.01.1990
30     | 01.01.1990
30     | 01.01.1990
50     | 02.01.1990
50     | 02.01.1990
50     | 02.01.1990
50     | 02.01.1990
40     | 03.01.1990
40     | 03.01.1990
40     | 03.01.1990
It should be an SSRS expression giving the average value for each month, or maybe there are other ways to get the requested data with an SSRS expression. The requested data in the table:
30 | 01.01.1990
50 | 02.01.1990
40 | 03.01.1990
Hoping for help.
There is no SumDistinct function in SSRS, which is a real gap (although CountDistinct does exist), so you can't achieve what you want the easy way. You have two options:
Implement a new stored procedure with SELECT DISTINCT, returning a reduced set of fields to avoid the repeated data. You would then use this stored procedure to build a new dataset for your table (a sketch of such a query follows below). This way may not be applicable in your case, though.
The other option is to implement your own function, which saves the aggregation state and performs a distinct sum. Take a look at this page; it contains examples of the code you need.
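For the first option, the query inside the stored procedure could be as simple as this sketch (table and column names are placeholders for your real data source):

-- Placeholder names: collapse the duplicate rows so the report receives
-- one value per date and a plain aggregate works in the SSRS expression.
SELECT DISTINCT Field1, [Date]
FROM YourSourceTable;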

Subqueries and MySQL Cache for 18M+ row table

As this is my first post it seems I can only post one link, so I have listed the sites I'm referring to at the bottom. In a nutshell, my goal is to make the database return the results faster; I have tried to include as much relevant information as I could think of to help frame the questions at the bottom of the post.
Machine Info
8 processors
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
cache size : 6144 KB
cpu cores : 4
top - 17:11:48 up 35 days, 22:22, 10 users, load average: 1.35, 4.89, 7.80
Tasks: 329 total, 1 running, 328 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 87.4%id, 12.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8173980k total, 5374348k used, 2799632k free, 30148k buffers
Swap: 16777208k total, 6385312k used, 10391896k free, 2615836k cached
However, we are looking at moving the MySQL installation to a different machine in the cluster that has 256 GB of RAM.
Table Info
My MySQL table looks like:
CREATE TABLE ClusterMatches
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cluster_index INT,
matches LONGTEXT,
tfidf FLOAT,
INDEX(cluster_index)
);
It has approximately 18M rows, with 1M unique cluster_index values and 6K unique matches. The SQL query I am generating in PHP looks like:
SQL query
$sql_query="SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters."))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";
where $clusters contains a string of approximately 3,000 comma-separated cluster_index values. This query touches approximately 50,000 rows and takes approximately 15 s to run; when the same query is run again it takes approximately 1 s.
Usage
The content of the table can be assumed to be static.
Low number of concurrent users
The query above is currently the only query that will be run on the table
Subquery
Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time, I believe my subquery can be indexed.
mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 48528 | Using temporary; Using filesort |
| 2 | DERIVED | ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 53689 | Using where |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
According to this older article [Optimizing MySQL: Queries and Indexes][2], the bad ones to see in the Extra info are "using temporary" and "using filesort".
MySQL Configuration Info
Query cache is available, but effectively turned off as the size is currently set to zero
mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name | Value |
+---------------------------------+----------------------+
| bdb_cache_size | 8384512 |
| binlog_cache_size | 32768 |
| expire_logs_days | 0 |
| have_query_cache | YES |
| flush | OFF |
| flush_time | 0 |
| innodb_additional_mem_pool_size | 1048576 |
| innodb_autoextend_increment | 8 |
| innodb_buffer_pool_awe_mem_mb | 0 |
| innodb_buffer_pool_size | 8388608 |
| join_buffer_size | 131072 |
| key_buffer_size | 8384512 |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 18446744073709547520 |
| sort_buffer_size | 2097144 |
| table_cache | 64 |
| thread_cache_size | 0 |
| query_cache_limit | 1048576 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 0 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| read_rnd_buffer_size | 262144 |
+---------------------------------+----------------------+
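A note on the query cache: query_cache_type is ON but query_cache_size = 0 keeps the cache disabled, so enabling it for a test is a one-line change (MySQL 5.x; the 64 MB figure is an arbitrary starting point, not a recommendation):

-- Give the query cache some memory; a size of 0 disables it even when ON.
SET GLOBAL query_cache_size = 64 * 1024 * 1024;  -- 64 MB, illustrative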
Based on this article on [MySQL Database Performance tuning][3], I believe the values I need to tweak are:
table_cache
key_buffer
sort_buffer
read_buffer_size
record_rnd_buffer (for GROUP BY and ORDER BY terms)
Areas Identified for improvement - MySQL Query tweaks
Changing the datatype for matches to an int index pointing into another table: [MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table.][4]
Indexing the new match_index field so that the GROUP BY matches occurs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]
Tools
To tune performance I plan to use:
[Explain][6] making reference to [the output format][7]
[ab - Apache HTTP server benchmarking tool][8]
[Profiling][9] with [log data][10]
Future Database Size
The goal is to build a system that can have 1M unique cluster_index values, 1M unique match values, and approximately 3,000,000,000 table rows, with a response time for the query of around 0.5 s (we can add more RAM as necessary and distribute the database across the cluster).
Questions
I think we want to keep the entire recordset in RAM so that the query doesn't touch the disk. If we keep the entire database in the MySQL cache, does that eliminate the need for memcachedb?
Is trying to keep the entire database in the MySQL cache a bad strategy, as it's not designed to be persistent? Would something like memcachedb or redis be a better approach, and if so, why?
Is the temporary table "result" that is created by the query automatically destroyed when the query finishes?
Should we switch from InnoDB to MyISAM, [as it's good for read-heavy data whereas InnoDB is good for write-heavy][11]?
My cache doesn't appear to be on, as its size is zero in my [Query Cache Configuration][12]; why does the query currently run faster the second time I run it?
Can I restructure my query to eliminate "using temporary" and "using filesort"? Should I be using a join instead of a subquery?
How do you view the size of the MySQL [Data Cache][13]?
What sort of sizes for table_cache, key_buffer, sort_buffer, read_buffer_size, record_rnd_buffer would you suggest as a starting point?
Links
1: stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
2: databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
3: debianhelp.co.uk/mysqlperformance.htm
4: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
5: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
6: dev.mysql.com/doc/refman/5.0/en/explain.html
7: dev.mysql.com/doc/refman/5.0/en/explain-output.html
8: httpd.apache.org/docs/2.2/programs/ab.html
9: mtop.sourceforge.net/
10: dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
11: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
12: dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
13: dev.mysql.com/tech-resources/articles/mysql-query-cache.html
Changing the table
Based on the advice in this post on How to pick indexes for ORDER BY and GROUP BY queries, the table now looks like:
CREATE TABLE ClusterMatches
(
cluster_index INT UNSIGNED,
match_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup
(
match_index INT UNSIGNED NOT NULL PRIMARY KEY,
image_match TINYTEXT
);
Eliminating Subquery
The query without sorting the results by SUM(tfidf) looks like:
SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
Which eliminates using temporary and using filesort
explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 14938 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
Sorting Problem
However, if I add the ORDER BY SUM(tfidf) back in:
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total |
+-------------+--------------------+
| 868 | 0.11126546561718 |
| 4182 | 0.0238558370620012 |
| 2162 | 0.0216601379215717 |
| 1406 | 0.0191618576645851 |
| 4239 | 0.0168981291353703 |
| 1437 | 0.0160425212234259 |
| 2599 | 0.0156466849148273 |
| 394 | 0.0155945559963584 |
| 3116 | 0.0151005545631051 |
| 4028 | 0.0149106932803988 |
+-------------+--------------------+
10 rows in set (0.03 sec)
The result is suitably fast at this scale, BUT having the ORDER BY SUM(tfidf) means it uses temporary and filesort:
explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 65369 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
Possible Solutions?
I'm looking for a solution that doesn't use temporary or filesort, along the lines of:
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index
HAVING total>0.01 ORDER BY cluster_index;
where I don't need to hard-code a threshold for total. Any ideas?
