Speeding up a postgres query (which works on 2 tables) - performance

I am doing, in postgresql, something like this:
select A.first,
count(B.second) as count,
array_agg(A.second) as second,
array_agg(A.third) as third,
array_agg(B.kids) as kids
from A join B on A.first=B.second
group by A.first;
And it's taking forever (also because the tables are pretty big). Limiting the output to 10 rows and looking at the explain analyze output told me there's a huge nested loop which takes most of the time.
Is there any way in which I can write this query (which I'll then use in CREATE TABLE AS to create a new table) to speed it up, while conserving the same output, which is what I want?
Thanks!

Ensure the column being used as a foreign key is indexed:
create index b_second on b(second);
Without such an index, every row of a would cause a table scan of b, which would make your query crawl.

Related

SQLite - Exploiting Sorted Indexes

This is probably simple, but I can't find the answer.
I'm trying to minimise the overhead of selecting records using ORDER BY
My understanding is that in...
SELECT gorilla, chimp FROM apes ORDER BY bananas LIMIT 10;
...the full set of matching records is retrieved so that the ORDER BY can be actioned, even if I only want the top ten records. This makes sense.
Trying to eliminate that overhead, I looked at the possibility of storing the records in a pre-defined order, but that would only work until insertions/deletions took place, upon which I would have to re-build the table. Not viable.
I found an option in SQLite (I assume it also exists in other SQLs) to create a sorted index (https://www.sqlite.org/lang_createindex.html)...
CREATE INDEX index_name ON apes (bananas DESC);
...which I ASSUME to mean that the index (not the table) is sorted in descending order and will remain so after updates.
My question is - how do I exploit this? The SQLite documentation is a bit meh in this regard. Is there some kind of "SELECT FROM index" or equivalent? Or does the fact that a sorted index exists on a column mean that any results from querying that column will be returned in the order of the index rather than the order of the column?
Or am I missing something entirely?
I'm working with SQLite3, queried by PHP 7.1
ORDER BY with LIMIT is a little bit more efficient than a plain ORDER BY because only the first few rows need to be completely sorted.
Anyway, for a single-column index, the sort order (ASC or DESC) is pointless because SQLite can step through an index either forwards or backwards.
Indexes are used automatically when SQLite estimates that they would be useful.
To check what actually happens, run EXPLAIN QUERY PLAN (or set .eqp on in the sqlite3 shell).
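As a minimal sketch of that check, the snippet below uses Python's built-in sqlite3 module rather than PHP. The table contents are invented for illustration, and the helper name top_ten_with_plan is hypothetical. It creates the index from the question, asks SQLite for the query plan, and runs the query. On typical builds the plan should mention the index rather than a temporary B-tree sort, but the exact wording differs between SQLite versions, so treat the plan text as something to read, not to parse.

```python
import sqlite3


def top_ten_with_plan():
    """Build a toy apes table, index bananas, and return (plan, rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE apes (gorilla TEXT, chimp TEXT, bananas INTEGER)")
    # Invented data: 1000 rows whose bananas values are a shuffled 0..999.
    conn.executemany(
        "INSERT INTO apes VALUES (?, ?, ?)",
        [(f"g{i}", f"c{i}", (i * 37) % 1000) for i in range(1000)],
    )
    conn.execute("CREATE INDEX index_name ON apes (bananas DESC)")

    query = "SELECT gorilla, chimp, bananas FROM apes ORDER BY bananas LIMIT 10"
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
    rows = conn.execute(query).fetchall()
    conn.close()
    return plan, rows


if __name__ == "__main__":
    plan, rows = top_ten_with_plan()
    print("\n".join(plan))
    print(rows)
```

Note that the query never names the index. There is no "SELECT FROM index"; the planner decides on its own whether to walk it.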

Adding Index To A Column Having Flag Values

I am a novice at tuning Oracle queries and need help.
If I have a sql query like:
select a.ID,a.name.....
from a,b,c
where a.id=b.id
and ....
and b.flag='Y';
then will adding an index on the FLAG column of table b help to tune the query in any way? The FLAG column has only 2 values, Y and N.
With a standard btree index, the SQL engine can find the entry or entries in the index for the specified value quickly because the index is a sorted, balanced tree, then use the physical address (the rowid) stored in the index to access the desired row in a second hop. It's like looking in the index of a book to find the page number. So that is:
Go to index with the key value you want to look up.
The index tells you the physical address in the table.
Go straight to that physical address.
That is nice and quick for something like a unique customer ID. It's still OK for something nonunique, like a customer ID in a table of orders, although the database has to go through the index entries and for each one go to the indicated address. That can still be faster than slogging through the entire table from top to bottom.
But for a column with only two distinct values, you can see that it is going to be more work going through all of the index entries for 'Y' for example, and for each one going to the indicated location in the table, than it would be to just forget the index and scan the whole table in one shot.
That's unless the values are unevenly distributed. If there are a million Y rows and ten N rows then an index will help you find those N rows fast but be no use for Y.
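The lookup steps and the effect of skew can be sketched with a toy model. This is not how Oracle stores a btree; it is a sorted Python list of (key, rowid) pairs standing in for one, and the names build_index and lookup are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(rows, column):
    """Return a sorted list of (key, rowid) pairs, a stand-in for a btree index."""
    return sorted((row[column], rowid) for rowid, row in enumerate(rows))


def lookup(index, rows, key):
    """Find key in the index, then fetch each matching row by its rowid."""
    # Step 1: binary-search the index for the range of entries with this key.
    lo = bisect_left(index, (key,))
    hi = bisect_right(index, (key, float("inf")))
    # Steps 2 and 3: each entry gives a rowid, and each rowid costs one row fetch.
    return [rows[rowid] for _, rowid in index[lo:hi]]


if __name__ == "__main__":
    # Invented data: 10 rows, only two of them flagged 'N'.
    rows = [{"id": i, "flag": "N" if i in (3, 7) else "Y"} for i in range(10)]
    index = build_index(rows, "flag")
    print(len(lookup(index, rows, "N")))  # few fetches: the index pays off
    print(len(lookup(index, rows, "Y")))  # nearly every row fetched one by one
```

The number of rows fetched through the index is the thing to watch: for the rare value it is tiny, while for the common value it approaches the size of the table, at which point a single sequential scan is less work.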
Adding an index to a column with only 2 values normally isn't very useful, because Oracle might just as well do a full table scan.
From your query it looks like it would be more useful to have an index on id, because that would help with the join a.id=b.id.
If you really want to get into tuning then learn to use "explain plan", as that will give you some indication of how much work Oracle needs to do for a query. Add (or remove) an index, then rerun the explain plan.

Oracle database help optimizing LIKE searches

I am on Oracle 11g and we have these 3 core tables:
Customer - CUSTOMERID|DOB
CustomerName - CUSTOMERNAMEID|CustomerID|FNAME|LNAME
Address - ADDRESSID|CUSTOMERID|STREET|CITY|STATE|POSTALCODE
I have about 60 million rows on each of the tables and the data is a mix of US and Canadian population.
I have a front-end application that calls a web service and they do a last name and partial zip search. So my query basically has
where CUSTOMERNAME.LNAME = ? and ADDRESS.POSTALCODE LIKE '?%'
They typically provide the first 3 digits of the zip.
The address table has an index on all street/city/state/zip and another one on state and zip.
I did try adding an index exclusively for the zip and forced oracle to use that index on my query but that didn't make any difference.
For returning about 100 rows (I have pagination to only return 100 at a time) it takes about 30 seconds which isn't ideal. What can I do to make this better?
The problem is that the filters you are applying are not very selective and they apply to different tables. This is bad for an old-fashioned btree index. If the content is very static you could try bitmap indexes. More precisely, a function-based bitmap join index on the first three letters of the last name and a bitmap join index on the postal code column. This assumes that very few people whose last name starts with certain letters live in an area with a certain postal code.
CREATE BITMAP INDEX ix_customer_custname ON customer(SUBSTR(cn.lname,1,3))
FROM customer c, customername cn
WHERE c.customerid = cn.customerid;
CREATE BITMAP INDEX ix_customer_postalcode ON customer(SUBSTR(a.postalcode,1,3))
FROM customer c, address a
WHERE c.customerid = a.customerid;
If you are successful you should see the two bitmap indexes becoming AND connected. The execution time should drop to a couple of seconds. It will not be as fast as a btree index.
Remarks:
You may have to experiment a bit to see whether it is more efficient to create one or two indexes and whether the functions are useful.
If you decide to do it function based you should include the exact same function calls in the where clause of your query. Otherwise the index will not be used.
DML operations will be considerably slower. This is only useful for tables with static data. Note that DML operations will block whole row "ranges". Concurrent DML operations will run into problems.
Response time will probably still be seconds, not instantaneous as with a BTREE index.
AFAIK this will work only on the enterprise edition. The syntax is untested because I do not have an enterprise db available at the moment.
If this is still not fast enough you can create a materialized view with customerid, last name and postal code and put a btree index on it. But that is kind of expensive, too.

Scan on DynamoDB table or Query on secondary global index or a local index (What's the best solution)

I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of emails. It has two attributes, the first called "Daily Points" and the second "TimeSpendInTheApp". I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron-triggered AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local secondary index on points can be costly, as I have to assign read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records, and on average it will have 50,000 records. What are my best options? Please suggest.
I am thinking my first option is to scan the whole table with a Filter Expression for records above a certain number of points (5000 for example). After this scan, if 50 or more records are found, simply sort the values and take the top 50 records. If the scan returns no results or very few, reduce the Filter Expression value (3000 for example) and do the same scan operation again. If a Filter Expression value (2500 for example) returns too many records, like 5000 or more, then raise the Filter Expression value. Is this even possible? I guess it would also need to handle pagination. Is it advisable to scan a table which has 50,000 records?
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as an index doesn't provide a solution for aggregation or sorting across the whole table.
I would export the data to Hive and run the queries there rather than writing code to determine the result, especially as it is going to be a batch job executed only once per day.
Something like below:-
Create Hive table:-
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
Queries:-
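-- ORDER BY gives a total order (SORT BY only orders rows within each reducer); LIMIT keeps the top 50.
SELECT dailyPoints, userId FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT timeSpendInTheApp, userId FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;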
Hive Reference
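If exporting to Hive is more machinery than a once-a-day job warrants, the scan-and-sort idea from the question can be sketched without any threshold guessing. This is a minimal sketch, not a recommendation over the Hive approach. The helper names scan_all and top_n are hypothetical, and scan is assumed to be any callable shaped like boto3's Table.scan, which accepts ExclusiveStartKey and returns "Items" plus "LastEvaluatedKey" while more pages remain. A full scan still reads every item in the table, so it consumes read capacity accordingly.

```python
import heapq


def scan_all(scan, **kwargs):
    """Yield every item from a paginated scan.

    `scan` is assumed to behave like boto3's Table.scan: it returns a dict
    with "Items" and, while more pages remain, "LastEvaluatedKey".
    """
    while True:
        page = scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return
        # Ask for the next page, starting where the previous one stopped.
        kwargs["ExclusiveStartKey"] = last_key


def top_n(items, attribute, n=50):
    """Return the n items with the largest value of `attribute`, largest first."""
    return heapq.nlargest(n, items, key=lambda item: item[attribute])
```

With a real table this would be used roughly as items = list(scan_all(table.scan)), followed by top_n(items, "Daily Points") and top_n(items, "TimeSpendInTheApp"), so that a single scan serves both rankings.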

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
http://76.12.87.164/cpd/testperformance.cfm
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
Thanks
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something' then there is no point in having an index. If you think of an index as specifying a particular order for rows, then sorting by x if you're searching for "%something" is useless: you're going to have to scan all rows anyway.
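The second exception can be illustrated with a toy model, where a sorted Python list stands in for an index on x. The function names are hypothetical. A prefix pattern such as 'acc%' can jump straight to the matching range, while a pattern with a leading wildcard has no starting point in the sort order and must inspect every value.

```python
from bisect import bisect_left


def prefix_search(sorted_values, prefix):
    """Range scan: jump to the first value >= prefix, stop at the first non-match."""
    matches = []
    i = bisect_left(sorted_values, prefix)
    while i < len(sorted_values) and sorted_values[i].startswith(prefix):
        matches.append(sorted_values[i])
        i += 1
    return matches


def contains_search(values, fragment):
    """Leading wildcard: the ordering does not help, so every value is inspected."""
    return [value for value in values if fragment in value]


if __name__ == "__main__":
    # Invented course names for illustration.
    names = sorted(["accounting basics", "advanced accounting", "biology", "cost accounting"])
    print(prefix_search(names, "acc"))
    print(contains_search(names, "accounting"))
```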
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
SELECT
*
FROM (
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY sq.name) AS Row,
sq.*
FROM (
SELECT
c.*,
p.providername,
p.school,
p.website,
p.type
FROM
cpd_COURSES c, cpd_PROVIDERS p
WHERE
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
(
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
)
) sq
) AS temp
WHERE
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid. The primary key on cpd_PROVIDERS.providerid is indexed automatically, but SQL Server does not create an index on the referencing column, so check that cpd_COURSES.providerid has one.
Additionally, the activatedYN column is a Y/N column and (according to my rule above about indexes that only reduce the search space by 50%) a Y/N column should not be indexed, either.
Finally, because you are searching with x LIKE '%accounting%', you don't need an index on name, title or keywords either, because it could not be used to narrow the search.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid and that the join columns on both sides are indexed.
SQL Server Specific
Because you're using SQL Server, the Management Studio has a number of tools to help you decide where you need to put indexes. The Database Engine Tuning Advisor (the SQL Server 2005 successor to the "Index Tuning Wizard") is usually pretty good at telling you what will give you good performance improvements. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid+keyword?
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
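A minimal sketch of that keyword table follows, using Python's built-in sqlite3 as a stand-in for SQL Server. One assumption differs from the suggestion above and is worth stating loudly: the composite key here leads with keyword rather than courseid, because the lookup is by keyword. The function names are hypothetical.

```python
import sqlite3


def build_keyword_table(course_keywords):
    """Create Course_Keywords and load (courseid, keyword) pairs into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Course_Keywords ("
        " courseid INTEGER NOT NULL,"
        " keyword VARCHAR(24) NOT NULL,"
        # Assumption: keyword first, so an exact-match lookup can seek on it.
        " PRIMARY KEY (keyword, courseid))"
    )
    conn.executemany(
        "INSERT INTO Course_Keywords (courseid, keyword) VALUES (?, ?)",
        course_keywords,
    )
    return conn


def courses_for_keyword(conn, keyword):
    """Exact-match lookup; no LIKE pattern is needed."""
    rows = conn.execute(
        "SELECT courseid FROM Course_Keywords WHERE keyword = ? ORDER BY courseid",
        (keyword,),
    )
    return [courseid for (courseid,) in rows]
```

The search then becomes an equality test that an index can satisfy, in place of three LIKE '%...%' predicates that each force a scan.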
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order, along with the page in the phonebook on which the person is listed, it would be a lot faster. These would be the nonclustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the Zip code lookup. My experience is that IN statements work great for small lists but the larger they get, you get better performance out of EXISTS as the query engine will stop searching for a specific value the first instance it runs into.
<CFIF zipcodes is not "">
EXISTS (
SELECT zipcode
FROM cpd_CODES_ZIPCODES
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
)
</CFIF>
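For reference, the distance expression inside that EXISTS clause is the spherical law of cosines with an Earth radius of 3963 miles, and dividing by 57.2958 converts degrees to radians. A minimal Python sketch of the same formula is below. The function name is hypothetical, and the clamp is an addition that the query does not have.

```python
import math

EARTH_RADIUS_MILES = 3963  # same constant as in the query


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines (degrees in, miles out)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon1 - lon2)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Rounding can push the value just outside [-1, 1], which acos rejects.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return EARTH_RADIUS_MILES * math.acos(cos_angle)
```

Because this expression has to be evaluated for every candidate row, it cannot use an index on latitude or longitude by itself, which is one reason the zip code lookup is expensive.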

Resources