In Vertica 8, the "metadata" resource pool was introduced. The documentation describes it as:
The pool that tracks memory allocated for catalog data and storage data structures.
It doesn't seem essential, since the documentation indicates how to disable it using the EnableMetadataMemoryTracking parameter.
What is this pool used for? Since it consumes quite a lot of RAM (4 GB on our servers), can I disable it safely?
The metadata pool reflects the size of the Vertica catalog: it is RAM that the Vertica process has allocated dynamically for the catalog.
For example, say you have 32 GB of RAM in total. Vertica will use 95% of it (~30.4 GB) for the general pool. If you have a large catalog of ~3 GB (tons of objects), the Vertica process itself consumes a couple of GB, so it is using RAM that, according to the general pool, should be free for queries. That can cause starvation.
With the metadata pool, which dynamically borrows from the general pool the RAM needed for the catalog, your resource management will be more accurate.
BTW, why do you have a 4 GB catalog? That is rather large. How much RAM does the Vertica process consume when idle? Does it consume less after a restart and grow over time?
I created a simple script that creates 1000 tables with 100 int columns each, inserts 1 row and analyzes statistics. You can see how the catalog size grows with the number of objects and how that affects the metadata pool and the Vertica process RAM:
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata';
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
218 | v_vertica_node0001 | 108622 | 108622
218 | v_vertica_node0002 | 119596 | 119596
218 | v_vertica_node0003 | 122374 | 122374
(3 rows)
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata'; \! top -n 1 | grep vertica
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
513 | v_vertica_node0001 | 229210 | 229210
513 | v_vertica_node0002 | 281601 | 281601
513 | v_vertica_node0003 | 289407 | 289407
(3 rows)
476260 dbadmin 20 0 5391m 407m 39m S 109.2 2.6 21:25.64 vertica
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata'; \! top -n 1 | grep vertica
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
825 | v_vertica_node0001 | 352359 | 352359
825 | v_vertica_node0002 | 448032 | 448032
825 | v_vertica_node0003 | 456439 | 456439
(3 rows)
476260 dbadmin 20 0 5564m 554m 39m S 79.2 3.5 38:16.91 vertica
dbadmin=> select (select count(1) from tables),node_name,memory_size_kb,memory_size_actual_kb from resource_pool_status where pool_name ilike 'metadata'; \! top -n 1 | grep vertica
?column? | node_name | memory_size_kb | memory_size_actual_kb
----------+--------------------+----------------+-----------------------
1143 | v_vertica_node0001 | 489867 | 489867
1143 | v_vertica_node0002 | 627409 | 627409
1143 | v_vertica_node0003 | 635616 | 635616
(3 rows)
476260 dbadmin 20 0 5692m 711m 39m S 0.7 4.5 58:13.61 vertica
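To put a number on the growth shown in these outputs, here is a minimal Python sketch that derives the metadata pool growth per table from the node0001 figures. The linear extrapolation and the function names are assumptions made for illustration, not something Vertica documents.

```python
# Metadata pool size (KB) on v_vertica_node0001 versus table count,
# copied from the four query outputs above.
samples = [(218, 108622), (513, 229210), (825, 352359), (1143, 489867)]


def kb_per_table(points):
    """Average metadata growth in KB per added table, first to last sample."""
    (t0, kb0), (t1, kb1) = points[0], points[-1]
    return (kb1 - kb0) / (t1 - t0)


def projected_kb(points, tables):
    """Linear extrapolation of the pool size (an assumption, not a Vertica formula)."""
    t0, kb0 = points[0]
    return kb0 + kb_per_table(points) * (tables - t0)


growth = kb_per_table(samples)
print(f"{growth:.0f} KB per 100-column table")
print(f"projected pool at 5000 tables: {projected_kb(samples, 5000) / 1024 / 1024:.2f} GB")
```

Each test table here has 100 columns and analyzed statistics, so real schemas with narrower tables would likely grow more slowly.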
At first I tried a normal insert into the target table from a temporary table.
INSERT /*+ APPEND */ INTO RDW10DM.INV_ITEM_LW_DM
SELECT
*
FROM
RDW10PRD.TMP_MDS_RECLS_INV_ITEM_LW_DM
;
COMMIT;
It took only 17 min to load. The total row count in the temp table TMP_MDS_RECLS_INV_ITEM_LW_DM is 16491650.
Plan for Execution:
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 16M| 1290M| 4927 |
| 1 | LOAD AS SELECT | | | | |
| 2 | TABLE ACCESS FULL | TMP_MDS_RECLS_INV_ITEM_LW_DM | 16M| 1290M| 4927 |
--------------------------------------------------------------------------------------
Note: cpu costing is off
Then I tried to load one location (LOC_KEY) at a time:
INSERT /*+ APPEND */ INTO RDW10DM.INV_ITEM_LW_DM
SELECT
*
FROM
RDW10PRD.TMP_MDS_RECLS_INV_ITEM_LW_DM
where LOC_KEY=222
;
COMMIT;
This time it took around 28 min to load. The row count in the temp table with the filter applied is 493465.
Plan for execution:
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
--------------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 492K| 38M| 4927 |
| 1 | LOAD AS SELECT | | | | |
|* 2 | TABLE ACCESS FULL | TMP_MDS_RECLS_INV_ITEM_LW_DM | 492K| 38M| 4927 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("TMP_MDS_RECLS_INV_ITEM_LW_DM"."LOC_KEY"=222)
Note: cpu costing is off
Index in Target table:
Does anyone have any idea why this is happening?
My guess? The TMP table doesn't have an index.
Therefore, selecting all records and inserting them is faster than applying a filter on 16 million records.
As you can see in your second execution plan, the query does a TABLE ACCESS FULL, which slows it down. Try adding an index on TMP_MDS_RECLS_INV_ITEM_LW_DM(LOC_KEY). It should boost your query performance.
Thanks, everyone, for your valuable thoughts.
I found the actual problem later. Since I had been doing frequent truncate-and-load cycles on the target table RDW10DM.INV_ITEM_LW_DM, the index pages may have become fragmented.
So I ran the query again after rebuilding the indexes and got the expected results.
We do an initial bulk load of some tables (both, source and target are Oracle 11g). The process is as follows: 1. truncate, 2. drop indexes (the PK and a unique index), 3. bulk insert, 4. create indexes (again the PK and the unique index). Now I got the following error:
alter table TARGET_SCHEMA.MYBIGTABLE
add constraint PK_MYBIGTABLE primary key (MYBIGTABLE_PK)
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
So obviously the TEMP tablespace is too small for PK creation (FYI the table has 6 columns and about 2.2 billion records). So I did this:
explain plan for
select line_1,line_2,line_3,line_4,line_5,line_6,count(*) as cnt
from SOURCE_SCHEMA.MYBIGTABLE
group by line_1,line_2,line_3,line_4,line_5,line_6;
select * from table( dbms_xplan.display );
/*
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2274M| 63G| | 16M (2)| 00:05:06 |
| 1 | HASH GROUP BY | | 2274M| 63G| 102G| 16M (2)| 00:05:06 |
| 2 | TABLE ACCESS FULL| MYBIGTABLE | 2274M| 63G| | 744K (7)| 00:00:14 |
-----------------------------------------------------------------------------------------------
*/
Is this how to tell how much TEMP tablespace will be needed for PK creation (102 GB in my case)? Or would you make the estimate differently?
Additional: The PK only exists on the target system. But fair point, so I ran your query on the target PK:
explain plan for
select MYBIGTABLE_PK
from TARGET_SCHEMA.MYBIGTABLE
group by MYBIGTABLE_PK ;
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 13 | 3 (34)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 13 | 3 (34)| 00:00:01 |
| 2 | TABLE ACCESS FULL| MYBIGTABLE | 1 | 13 | 2 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
So how would I have to read this now?
This is a good question.
First, if you create the following primary key
alter table TARGET_SCHEMA.MYBIGTABLE
add constraint PK_MYBIGTABLE primary key (MYBIGTABLE_PK)
then you should query
explain plan for
select MYBIGTABLE_PK
from SOURCE_SCHEMA.MYBIGTABLE
group by MYBIGTABLE_PK
to get an estimate (make sure you gather stats first: exec dbms_stats.gather_table_stats('SOURCE_SCHEMA','MYBIGTABLE')).
Second, you can query V$TEMPSEG_USAGE to see how many temp blocks were consumed before the error was raised, and V$SESSION_LONGOPS to see how much of the total process had completed.
The Oracle docs suggest creating a dedicated temp tablespace for the process so that it does not disturb other operations.
Please post an edit if you find a more accurate solution.
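As a rough cross-check on the optimizer's TempSpc figure, here is a back-of-envelope sketch in Python. It assumes the sort behind the index build holds roughly one key plus one rowid per row. The 13-byte average key length, the 6-byte rowid and the 1.2 overhead factor are all placeholder assumptions, not measured values, so treat the result as an order of magnitude only.

```python
def estimate_index_sort_bytes(rows, avg_key_bytes, rowid_bytes=6, overhead=1.2):
    """Rough size of the data sorted during an index build.

    Every input except the row count is an assumption: replace
    avg_key_bytes with the real average length of the key column.
    """
    return int(rows * (avg_key_bytes + rowid_bytes) * overhead)


rows = 2_274_000_000  # row estimate taken from the plan above (2274M)
estimate = estimate_index_sort_bytes(rows, avg_key_bytes=13)
print(f"~{estimate / 1024**3:.1f} GiB of sort data")
```

Comparing this figure with what V$TEMPSEG_USAGE reports during a real run would show how far the assumptions are off.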
I am currently running into an issue with my Oracle instance. I have two simple select statements:
select * from dog_vets
and
select * from dog_statuses
and the following fiddle
My explain plan on dog_vets is as follows:
0 | Select Statement
1 | Table Access Full Scan dog_vets
my explain plan on dog_statuses is as follows:
ID|Operation | Name | Rows |Bytes | cost | time
0 | Select Statement | | 20G | 500M | 100000 | 999:99:17
1 | View | index%_join_001 | 20G | 500M | 100000 | 999:99:17
2 | Hash Join | | | | |
3 | Hash Join | | | | |
4 | Index fast full scan dog_statuses_check_up | | 20G | 500M | 100000 | 32:15:00
5 | Index fast full scan dog_statuses_sick| | 20G | 500M | 100000 | 35:19:00
To get this type of output execute the following statement:
explain plan for
select * from dog_vets;
OR
explain plan for
select * from dog_statuses;
and then
select * from table(dbms_xplan.display);
Now my question is: why do multiple indexes cause a view (materialized, I assume) to be created in the plan above, and what kind of performance hit am I taking on this type of query? As it stands, dog_vets has ~300 million records and dog_statuses has about 500 million. I have yet to get select * from dog_statuses to return in under 10 hours, primarily because the query dies before it completes.
DDL
In case sql fiddle dies:
create table dog_vets
(
name varchar2(50),
founded timestamp,
staff_count number
);
create table dog_statuses
(
check_up timestamp,
sick varchar2(1)
);
create index dog_vet_name
on dog_vets(name);
create index dog_status_check_up
on dog_statuses(check_up);
create index dog_status_sick
on dog_statuses(sick);
You could try to tell the optimizer to forget about indexes
SELECT /*+NO_INDEX(dog_statuses)*/ *
FROM dog_statuses
I'm having a performance issue when deploying an app developed on 10g XE in a client's 9i server. The same query produces completely different query plans depending on the server:
SELECT DISTINCT FOO.FOO_ID AS C0,
GEE.GEE_CODE AS C1,
TO_CHAR(FOO.SOME_DATE, 'DD/MM/YYYY') AS C2,
TMP_FOO.SORT_ORDER AS SORT_ORDER_
FROM TMP_FOO
INNER JOIN FOO ON TMP_FOO.FOO_ID=FOO.FOO_ID
LEFT JOIN BAR ON FOO.FOO_ID=BAR.FOO_ID
LEFT JOIN GEE ON FOO.GEE_ID=GEE.GEE_ID
ORDER BY SORT_ORDER_;
Oracle Database 10g Express Edition Release 10.2.0.1.0 - Production:
-------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 67 | 10 (30)| 00:00:01 |
| 1 | SORT UNIQUE | | 1 | 67 | 9 (23)| 00:00:01 |
| 2 | NESTED LOOPS OUTER | | 1 | 67 | 8 (13)| 00:00:01 |
|* 3 | HASH JOIN OUTER | | 1 | 48 | 7 (15)| 00:00:01 |
| 4 | NESTED LOOPS | | 1 | 44 | 3 (0)| 00:00:01 |
| 5 | TABLE ACCESS FULL | TMP_FOO | 1 | 26 | 2 (0)| 00:00:01 |
| 6 | TABLE ACCESS BY INDEX ROWID| FOO | 1 | 18 | 1 (0)| 00:00:01 |
|* 7 | INDEX UNIQUE SCAN | FOO_PK | 1 | | 0 (0)| 00:00:01 |
| 8 | TABLE ACCESS FULL | BAR | 1 | 4 | 3 (0)| 00:00:01 |
| 9 | TABLE ACCESS BY INDEX ROWID | GEE | 1 | 19 | 1 (0)| 00:00:01 |
|* 10 | INDEX UNIQUE SCAN | GEE_PK | 1 | | 0 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
Oracle9i Release 9.2.0.1.0 - 64bit Production:
----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost |
----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 98M| 6546M| | 3382K|
| 1 | SORT UNIQUE | | 98M| 6546M| 14G| 1692K|
|* 2 | HASH JOIN OUTER | | 98M| 6546M| 137M| 2874 |
| 3 | VIEW | | 2401K| 109M| | 677 |
|* 4 | HASH JOIN OUTER | | 2401K| 169M| 40M| 677 |
| 5 | VIEW | | 587K| 34M| | 24 |
|* 6 | HASH JOIN | | 587K| 34M| | 24 |
| 7 | TABLE ACCESS FULL| TMP_FOO | 8168 | 207K| | 10 |
| 8 | TABLE ACCESS FULL| FOO | 7188 | 245K| | 9 |
| 9 | TABLE ACCESS FULL | BAR | 409 | 5317 | | 1 |
| 10 | TABLE ACCESS FULL | GEE | 4084 | 89848 | | 5 |
----------------------------------------------------------------------------
As far as I can tell, indexes exist and are correct. What are my options to make Oracle 9i use them?
Update #1: TMP_FOO is a temporary table and it has no rows in this test. FOO is a regular table with 13,035 rows in my local XE; not sure why the query plan shows 1, perhaps it's realising that an INNER JOIN against an empty table won't require a full table scan :-?
Update #2: I've spent a couple of weeks trying everything and nothing provided a real enhancement: query rewriting, optimizer hints, changes in DB design, getting rid of temp tables... Finally, I got a copy of the same 9.2.0.1.0 unpatched Oracle version the customer has (with obvious architecture difference), installed it at my site and... surprise! In my 9i, all execution plans come instantly and queries take from 1 to 10 seconds to complete.
At this point, I'm almost convinced that the customer has a serious misconfiguration issue.
It looks like either you don't have data in your 10g Express database, or your statistics are not collected properly. In either case it looks to Oracle like there aren't many rows, and therefore an index-range scan is appropriate.
In your 9i database, the statistics look like they are collected properly and Oracle sees a 4-table join with lots of rows and without a where clause. In that case, since you haven't supplied a hint, Oracle builds an explain plan with the default ALL_ROWS optimizer behaviour: Oracle will find the plan that is the most performant at returning all rows through to the last. Here the HASH JOIN with full table scans is brutally efficient; it will return big sets of rows faster than an index NESTED LOOP join.
Maybe you want to use an index because you are only interested in the first few rows of the query. In that case use the hint /*+ FIRST_ROWS*/, which will help Oracle understand that you are more interested in the first-row response time than in the overall query time.
Maybe you want to use an index because you think this would result in a faster total query time. You can force an explain plan through the use of hints like USE_NL and USE_HASH, but most of the time you will see that, if the statistics are up to date, the optimizer will have picked the most efficient plan.
Update: I saw your update about TMP_FOO being a temporary table with no rows. The problem with temporary tables is that they have no stats, so my answer above doesn't apply perfectly to them. Since the temp table has no stats, Oracle has to make a guess (here it chooses, quite arbitrarily, 8168 rows), which results in an inefficient plan.
This would be a case where it could be appropriate to use hints. You have several options:
A mix of LEADING, USE_NL and USE_HASH hints can force a specific plan (LEADING to set the order of the joins and USE* to set the join method).
You could use the undocumented CARDINALITY hint to give additional information to the optimizer as described in an AskTom article. While the hint is undocumented, it is arguably safe to use. Note: on 10g+ the DYNAMIC_SAMPLING could be the documented alternative.
You can also set the statistics on the temporary table beforehand with the DBMS_STATS.set_table_stats procedure. This last option would be quite radical since it would potentially modify the plan of all queries against this temp table.
It could be that 9i is doing it exactly right. According to the stats posted, the Oracle 9i database believes it is dealing with a statement returning 98 million rows, whereas the 10G database thinks it will return 1 row. It could be that both are correct, i.e the amount of data in the 2 databases is very very different. Or it could be that you need to gather stats in either or both databases to get a more accurate query plan.
In general it is hard to tune queries when the target version is older and a different edition. You have no chance of tuning a query without realistic volumes of data, or at least realistic statistics.
If you have a good relationship with your client you could ask them to export their statistics using DBMS_STATS.EXPORT_SCHEMA_STATS(). Then you can import the stats using the matching IMPORT_SCHEMA_STATS procedure.
Otherwise you'll have to fake the numbers yourself using the DBMS_STATS.SET_TABLE_STATS() procedure.
You could add the following hints, which would "force" Oracle to use your indexes (if possible). Both hints go in a single comment, because Oracle only reads the first hint comment after the SELECT keyword:
Select /*+ index (FOO FOO_PK) index (GEE GEE_PK) */
From ...
Or try the FIRST_ROWS hint to indicate you're not going to fetch all of these estimated 98 million rows... Otherwise I doubt the indexes would make a huge difference, because you have no WHERE clause, so Oracle would have to read these tables anyway.
The customer had changed a default setting in order to support a very old third-party legacy application: the static parameter OPTIMIZER_FEATURES_ENABLE had been changed from the default value in 9i (9.2.0) to 8.1.7.
I made the same change in a local copy of 9i and I got the same problems: explain plans that take hours to be calculated and so on.
(Knowing this, I've asked a related question at ServerFault, but I believe this solves the original question.)
As this is my first post, it seems I can only post 1 link, so I have listed the sites I'm referring to at the bottom. In a nutshell, my goal is to make the database return the results faster. I have tried to include as much relevant information as I could think of to help frame the questions at the bottom of the post.
Machine Info
8 processors
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
cache size : 6144 KB
cpu cores : 4
top - 17:11:48 up 35 days, 22:22, 10 users, load average: 1.35, 4.89, 7.80
Tasks: 329 total, 1 running, 328 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 87.4%id, 12.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8173980k total, 5374348k used, 2799632k free, 30148k buffers
Swap: 16777208k total, 6385312k used, 10391896k free, 2615836k cached
However, we are looking at moving the MySQL installation to a different machine in the cluster that has 256 GB of RAM.
Table Info
My MySQL Table looks like
CREATE TABLE ClusterMatches
(
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
cluster_index INT,
matches LONGTEXT,
tfidf FLOAT,
INDEX(cluster_index)
);
It has approximately 18M rows, with 1M unique cluster_index values and 6K unique matches. The SQL query I am generating in PHP looks like this:
SQL query
$sql_query="SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM Test2_ClusterMatches WHERE `cluster_index` in (".$clusters."))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) DESC LIMIT 0, 10;";
where $clusters contains a string of approximately 3,000 comma-separated cluster_index values. This query touches approximately 50,000 rows and takes approximately 15s to run; when the same query is run again it takes approximately 1s.
Usage
The content of the table can be assumed to be static.
Low number of concurrent users
The query above is currently the only query that will be run on the table
Subquery
Based on this post [stackoverflow: Cache/Re-Use a Subquery in MySQL][1] and the improvement in query time I believe my subquery can be indexed.
mysql> EXPLAIN EXTENDED SELECT `matches`,sum(`tfidf`) FROM
(SELECT * FROM ClusterMatches WHERE `cluster_index` in (1,2,...,3000))
AS result GROUP BY `matches` ORDER BY sum(`tfidf`) ASC LIMIT 0, 10;
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 48528 | Using temporary; Using filesort |
| 2 | DERIVED | ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 53689 | Using where |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+---------------------------------+
According to this older article, [Optimizing MySQL: Queries and Indexes][2], the bad values to see in the Extra column are "using temporary" and "using filesort".
MySQL Configuration Info
Query cache is available, but effectively turned off as the size is currently set to zero
mysqladmin variables;
+---------------------------------+----------------------+
| Variable_name | Value |
+---------------------------------+----------------------+
| bdb_cache_size | 8384512 |
| binlog_cache_size | 32768 |
| expire_logs_days | 0 |
| have_query_cache | YES |
| flush | OFF |
| flush_time | 0 |
| innodb_additional_mem_pool_size | 1048576 |
| innodb_autoextend_increment | 8 |
| innodb_buffer_pool_awe_mem_mb | 0 |
| innodb_buffer_pool_size | 8388608 |
| join_buffer_size | 131072 |
| key_buffer_size | 8384512 |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 18446744073709547520 |
| sort_buffer_size | 2097144 |
| table_cache | 64 |
| thread_cache_size | 0 |
| query_cache_limit | 1048576 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 0 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| read_rnd_buffer_size | 262144 |
+---------------------------------+----------------------+
Based on this article on [Mysql Database Performance turning][3] I believe that the values I need to tweak are
table_cache
key_buffer
sort_buffer
read_buffer_size
record_rnd_buffer (for GROUP BY and ORDER BY terms)
Areas Identified for improvement - MySQL Query tweaks
Changing the datatype of matches to an int index pointing into another table. [MySQL will indeed use a dynamic row format if it contains variable length fields like TEXT or BLOB, which, in this case, means sorting needs to be done on disk. The solution is not to eschew these datatypes, but rather to split off such fields into an associated table.][4]
Indexing the new match_index field so that the GROUP BY on matches runs faster, based on the statement ["You should probably create indices for any field on which you are selecting, grouping, ordering, or joining."][5]
Tools
To tune performance I plan to use
[Explain][6] making reference to [the output format][7]
[ab - Apache HTTP server benchmarking tool][8]
[Profiling][9] with [log data][10]
Future Database Size
The goal is to build a system that can hold 1M unique cluster_index values, 1M unique match values and approx 3,000,000,000 table rows, with a response time for the query of around 0.5s (we can add more RAM as necessary and distribute the database across the cluster).
Questions
I think we want to keep the entire recordset in RAM so that the query doesn't touch the disk. If we keep the entire database in the MySQL cache, does that eliminate the need for memcachedb?
Is trying to keep the entire database in the MySQL cache a bad strategy, as it's not designed to be persistent? Would something like memcachedb or redis be a better approach, and if so, why?
Is the temporary table "result" that is created by the query automatically destroyed when the query finishes?
Should we switch from InnoDB to MyISAM, [as it's good for read-heavy data whereas InnoDB is good for write-heavy][11]?
My cache doesn't appear to be on, as its size is zero in my [Query Cache Configuration][12]. Why does the query currently run faster the second time I run it?
Can I restructure my query to eliminate "using temporary" and "using filesort"? Should I be using a join instead of a subquery?
How do you view the size of the MySQL [Data Cache][13]?
What sort of sizes for table_cache, key_buffer, sort_buffer, read_buffer_size and record_rnd_buffer would you suggest as a starting point?
Links
1: stackoverflow.com/questions/658937/cache-re-use-a-subquery-in-mysql
2: databasejournal.com/features/mysql/article.php/10897_1382791_4/Optimizing-MySQL-Queries-and-Indexes.htm
3: debianhelp.co.uk/mysqlperformance.htm
4: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
5: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
6: dev.mysql.com/doc/refman/5.0/en/explain.html
7: dev.mysql.com/doc/refman/5.0/en/explain-output.html
8: httpd.apache.org/docs/2.2/programs/ab.html
9: mtop.sourceforge.net/
10: dev.mysql.com/doc/refman/5.0/en/slow-query-log.html
11: 20bits.com/articles/10-tips-for-optimizing-mysql-queries-that-dont-suck/
12: dev.mysql.com/doc/refman/5.0/en/query-cache-configuration.html
13: dev.mysql.com/tech-resources/articles/mysql-query-cache.html
Changing the table
Based on the advice in this post on "How to pick indexes for order by and group by queries", the table now looks like this:
CREATE TABLE ClusterMatches
(
cluster_index INT UNSIGNED,
match_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (match_index,cluster_index,id,tfidf)
);
CREATE TABLE MatchLookup
(
match_index INT UNSIGNED NOT NULL PRIMARY KEY,
image_match TINYTEXT
);
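To get a feel for whether the restructured ClusterMatches table could stay in RAM at the target size, here is a rough Python sketch of the raw data volume. It uses the 4-byte storage size of MySQL INT and FLOAT columns and deliberately ignores index, page and per-row engine overhead, so the real footprint will be larger; the function name is hypothetical.

```python
# Storage sizes (bytes) of the column types used in the new ClusterMatches table.
COLUMN_BYTES = {"cluster_index": 4, "match_index": 4, "id": 4, "tfidf": 4}


def raw_table_bytes(row_count, column_bytes=COLUMN_BYTES):
    """Raw column data only; ignores index, page and per-row engine overhead."""
    return row_count * sum(column_bytes.values())


target_rows = 3_000_000_000  # future table size stated in the post
raw = raw_table_bytes(target_rows)
print(f"{raw / 1024**3:.1f} GiB of raw column data")
```

Even as a lower bound this is well above the 8 GB of the current machine, which is consistent with the plan to move to the 256 GB server.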
Eliminating Subquery
The query without sorting the results by the SUM(tfidf) looks like
SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
Which eliminates using temporary and using filesort
explain extended SELECT match_index, SUM(tfidf) FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index LIMIT 10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 14938 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+--------------------------+
Sorting Problem
However, if I add the ORDER BY SUM(tfidf):
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index in (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+-------------+--------------------+
| match_index | total |
+-------------+--------------------+
| 868 | 0.11126546561718 |
| 4182 | 0.0238558370620012 |
| 2162 | 0.0216601379215717 |
| 1406 | 0.0191618576645851 |
| 4239 | 0.0168981291353703 |
| 1437 | 0.0160425212234259 |
| 2599 | 0.0156466849148273 |
| 394 | 0.0155945559963584 |
| 3116 | 0.0151005545631051 |
| 4028 | 0.0149106932803988 |
+-------------+--------------------+
10 rows in set (0.03 sec)
The result is suitably fast at this scale BUT having the ORDER BY SUM(tfidf) means it uses temporary and filesort
explain extended SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY match_index
ORDER BY total DESC LIMIT 0,10;
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 65369 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
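For experimenting with the flattened query away from the production server, here is a small Python sketch that runs the same GROUP BY / ORDER BY / LIMIT shape against an in-memory SQLite database. SQLite is only a stand-in for MySQL here, so it says nothing about "Using temporary" or "Using filesort"; the helper name top_matches and the toy rows are hypothetical. It also binds the cluster list as parameters instead of concatenating it into the SQL string.

```python
import sqlite3


def top_matches(conn, clusters, limit=10):
    """Return (match_index, summed tfidf) pairs for the given clusters, best first."""
    placeholders = ",".join("?" for _ in clusters)
    sql = (
        "SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches "
        f"WHERE cluster_index IN ({placeholders}) "
        "GROUP BY match_index ORDER BY total DESC LIMIT ?"
    )
    return conn.execute(sql, (*clusters, limit)).fetchall()


# Toy data: (cluster_index, match_index, tfidf); values are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ClusterMatches (cluster_index INTEGER, match_index INTEGER, tfidf REAL)"
)
rows = [(1, 868, 0.5), (2, 868, 0.25), (1, 4182, 0.125), (3, 2162, 1.0)]
conn.executemany("INSERT INTO ClusterMatches VALUES (?, ?, ?)", rows)

result = top_matches(conn, [1, 2])
print(result)
```

With around 3,000 cluster values the placeholder count may exceed the bound-parameter limit of older SQLite builds, so this is for small experiments only.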
Possible Solutions?
I'm looking for a solution that doesn't use temporary or filesort, along the lines of
SELECT match_index, SUM(tfidf) AS total FROM ClusterMatches
WHERE cluster_index IN (1,2,3 ... 3000) GROUP BY cluster_index, match_index
HAVING total>0.01 ORDER BY cluster_index;
but where I don't need to hardcode a threshold for total. Any ideas?