Rails 3 Postgres uses single field index for query, but shouldn't it use the compound index? - ruby

So User has many :orders, which works like you expect. I also have a valid scope on order that should filter by ensuring the orders are in a set of whitelisted states (not canceled orders, for instance)
I've declared some indices on the orders table, and my schema.rb looks like:
add_index "orders", ["state"], :name => "index_orders_on_state"
add_index "orders", ["user_id", "state"], :name => "index_orders_on_user_id_and_state"
add_index "orders", ["user_id"], :name => "index_orders_on_user_id"
When I run puts user.orders.valid.explain I get this:
EXPLAIN for: SELECT "orders".* FROM "orders"
WHERE "orders"."user_id" = 1 AND
"orders"."state" IN ('pending', 'packed', 'shipped', 'in_transit', 'delivered', 'return_pending', 'returned')
Bitmap Heap Scan on orders (cost=4.60..154.88 rows=40 width=3323)
Recheck Cond: (user_id = 1)
Filter: ((state)::text = ANY ('{pending,packed,shipped,in_transit,delivered,return_pending,returned}'::text[]))
-> Bitmap Index Scan on index_orders_on_user_id (cost=0.00..4.59 rows=44 width=0)
Index Cond: (user_id = 1)
So given that I am searching on user_id and state, and I have a compound index on both of those fields, why is it not using the index_orders_on_user_id_and_state index? Or am I just reading this explain output wrong?
Is it doing two passes? One to find orders by user_id, and then another pass to check for state?
I need to run queries like this a lot, on a lot of records at once. So any way to keep it speedy is a very good thing.

The database system may decide not to use an index. For example, with MySQL, if the table is small, it may decide to do a full table scan. You can try inserting several million records and running the query again to see how the plan changes.

A pretty good explanation of the internal usage of postgres indexes is here:
the relevant part is
There are many reasons why the Postgres planner may choose to not use
an index. Most of the time, the planner chooses correctly, even if it
isn’t obvious why. It’s okay if the same query uses an index scan on
some occasions but not others. The number of rows retrieved from the
table may vary based on the particular constant values the query
retrieves. So, for example, it might be correct for the query planner
to use an index for the query select * from foo where bar = 1, and yet
not use one for the query select * from foo where bar = 2, if there
happened to be far more rows with “bar” values of 2. When this
happens, a sequential scan is actually most likely much faster than an
index scan, so the query planner has in fact correctly judged that the
cost of performing the query that way is lower.


Scan on DynamoDB table or Query on a global secondary index or a local index (What's the best solution)

I have an AWS DynamoDB table called "Users", whose hash key/primary key is "UserID", which consists of emails. It has two attributes, the first called "Daily Points" and the second "TimeSpendInTheApp". I need to run a query or scan on the table that will give me the top 50 users with the highest points and the top 50 users who have spent the most time in the app. This query will be executed only once a day by a cron AWS Lambda. I am trying to find the best solution for this query or scan. For me, cost is more important than speed or efficiency. Maintaining a global secondary index or a local secondary index on points can be costly, as I have to assign read and write units for those indexes, which I want to avoid. The "Users" table will have a maximum of 100,000 to 150,000 records, and on average it will have 50,000 records. What are my best options? Please suggest.
I am thinking my first option is to scan the whole table with a Filter Expression for records above a certain number of points (5000, for example). If the scan finds 50 or more records, simply sort the values and take the top 50. If the scan returns no results or very few, reduce the Filter Expression value (to 3000, for example) and run the same scan again. If a Filter Expression value (2500, for example) returns too many records, like 5000 or more, then raise the Filter Expression value. Is this even possible? I guess it would also need to handle pagination. Is it advisable to scan a table which has 50,000 records?
Any advice or suggestion will be helpful. Thanks in advance.
Firstly, creating indexes for the above use case doesn't simplify the process, as it doesn't provide a solution for aggregation or sorting.
I would export the data to Hive and run the queries there rather than writing code to determine the result, especially as it is going to be a batch job executed only once per day.
Something like below:-
Create Hive table:-
CREATE EXTERNAL TABLE hive_users(userId string, dailyPoints bigint, timeSpendInTheApp bigint)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "Users",
"dynamodb.column.mapping" = "userId:UserID,dailyPoints:Daily_Points,timeSpendInTheApp:TimeSpendInTheApp");
SELECT dailyPoints, userId FROM hive_users ORDER BY dailyPoints DESC LIMIT 50;
SELECT timeSpendInTheApp, userId FROM hive_users ORDER BY timeSpendInTheApp DESC LIMIT 50;
Hive Reference
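Since the job runs once a day over at most about 150,000 items, the scan-and-sort idea from the question can also be done directly, without an index or an export. Note that a Filter Expression is applied after the items are read, so it does not reduce the read capacity a Scan consumes; reading everything once and ranking in memory costs the same. A minimal sketch, assuming a boto3-style scan callable that returns "Items" and, while more pages remain, "LastEvaluatedKey". The helper names (scan_all, top_n), the attribute name DailyPoints, and the fake table are hypothetical and exist only to keep the example self-contained:

```python
import heapq

def scan_all(scan):
    """Yield every item, following LastEvaluatedKey pagination."""
    kwargs = {}
    while True:
        page = scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        kwargs["ExclusiveStartKey"] = last_key

def top_n(items, attribute, n=50):
    """Return the n items with the largest value of `attribute`."""
    return heapq.nlargest(n, items, key=lambda item: item.get(attribute, 0))

# Hypothetical stand-in for table.scan: serves 3 items per page.
def make_fake_scan(rows, page_size=3):
    def scan(ExclusiveStartKey=None):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["offset"]
        page = {"Items": rows[start:start + page_size]}
        if start + page_size < len(rows):
            page["LastEvaluatedKey"] = {"offset": start + page_size}
        return page
    return scan

rows = [{"UserID": f"user{i}@example.com", "DailyPoints": i * 10,
         "TimeSpendInTheApp": 100 - i} for i in range(10)]

# One pass over the table serves both rankings.
items = list(scan_all(make_fake_scan(rows)))
top_points = top_n(items, "DailyPoints", n=3)
top_time = top_n(items, "TimeSpendInTheApp", n=3)
print([u["UserID"] for u in top_points])
print([u["UserID"] for u in top_time])
```

With boto3 this would be called as scan_all(table.scan), so the table is read only once per day for both top-50 lists.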

How do I build efficient SQL filters?

After taking an advanced T-SQL performance/query tuning class, something that I thought I remembered hearing was that you can speed up some queries just a little bit if you put your date(time) filters first.
RunDate = '12/1/2015' AND
OtherFilters = etc...
But does this really only count if I have indexes in place on these columns I filter on for this table?
So to add to this just a little, should I be building my filters off of the indexes on any tables referenced in the query? Such that my first filters of the query are based on my indexes?
ID > 1000 AND
RunDate <= '1/1/2016' AND
OtherFilters = etc...
Where ID and RunDate are part of my indexes/primary key.
The order of filters in the WHERE clause does not matter. As long as you have an index on the fields, SQL Server knows how to use your filters.
Assume you have an index on (ID, RunDt) and both ID and RunDt appear in your WHERE clause. SQL Server first filters the data on ID and then, from that subset of rows, filters on RunDt.
This may change if you have other indexes, depending on the selectivity of your data.
Also, if you have a clustered index on RunDt, SQL Server will first filter on RunDt and then on ID.
You don't need to worry about the order of your filters in WHERE clause, as long as you have the right order of columns in your index definition.
TSQL is just a logical representation
The query optimizer will set the actual execution order that is most efficient
It messes up sometimes but for the most part it is spot on
If you have a clustered PK on ID then this will typically be done first
Appears even the OP is confused about the question
Can only answer the stated question
But does this really only count if I have indexes in place on these
columns I filter on for this table?
The order in the where does not matter for columns with indexes
The order in the where does not matter for columns without indexes
The order in the where does not matter
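The claim is easy to check on any engine that exposes its query plan. A small sketch using Python's built-in sqlite3 as a stand-in (this is SQLite, not SQL Server, and the runs table with its columns is made up for the illustration): both orderings of the same predicates produce the same plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, run_date TEXT, status TEXT)")
conn.execute("CREATE INDEX idx_runs_run_date ON runs (run_date)")
conn.executemany(
    "INSERT INTO runs (run_date, status) VALUES (?, ?)",
    [(f"2015-12-{d:02d}", "ok" if d % 2 else "failed") for d in range(1, 29)],
)

def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Same predicates, written in opposite order.
date_first = plan("SELECT * FROM runs WHERE run_date = '2015-12-01' AND status = 'ok'")
date_last = plan("SELECT * FROM runs WHERE status = 'ok' AND run_date = '2015-12-01'")
print(date_first)
print(date_last)
```

In SQL Server the equivalent check would be to compare the actual execution plans of the two statements.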

Index in sqlite causing me trouble and slow requesting

Hello, I have a table with more than 800,000 rows in SQLite.
I have indexes on each field I search on, but my query is SLOW:
SELECT "links".* FROM "links"
WHERE "links"."from_id_admin" = 'XXXX'
AND "links"."from_type" = 'Section'
ORDER BY category_rank DESC, rank DESC
It took 800ms (it returns only one row; all the time is spent on the index lookup).
I investigated further with "EXPLAIN QUERY PLAN" and here is the result:
"SEARCH TABLE links USING INDEX index_links_on_from_type (from_type=?)"
Weirdly, SQLite is using only the from_type index. The problem is that this index is not very selective (there are only 4 or 5 different values).
If I remove the from_type condition from the WHERE clause, my request is as fast as expected (2ms):
SELECT "links".*
FROM "links"
WHERE "links"."from_id_admin" = 'XXXXX'
ORDER BY category_rank DESC, rank DESC
Yeah. Less discrimination means 400x speed improvement. So my question is:
Is that normal behavior?
How can I avoid it?
Can I force the search pattern to lookup to the proper index?
Thanks for your answers ;-)
Ok, finally I found it:
My SQLite database was populated with a large amount of data (2 GB), and I never ran "ANALYZE" afterwards to gather the statistics the query planner uses to choose indexes.
So after a big change to your database, always run:
ANALYZE;
It took a second and a half, and then everything worked properly!
Good to know I guess ;-)
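For anyone who wants to reproduce this, here is a rough sketch with Python's built-in sqlite3. The table layout follows the question, but the second index name (index_links_on_from_id_admin) and the generated data are assumptions made for the example. It runs ANALYZE, inspects the plan, and also shows SQLite's INDEXED BY clause, which answers the "can I force the index" part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, from_id_admin TEXT,"
             " from_type TEXT, category_rank INTEGER, rank INTEGER)")
conn.execute("CREATE INDEX index_links_on_from_type ON links (from_type)")
# Hypothetical name for the index on from_id_admin.
conn.execute("CREATE INDEX index_links_on_from_id_admin ON links (from_id_admin)")

# Four from_type values (low selectivity), 5000 distinct from_id_admin values.
types = ["Section", "Page", "Article", "Category"]
conn.executemany(
    "INSERT INTO links (from_id_admin, from_type, category_rank, rank)"
    " VALUES (?, ?, ?, ?)",
    [(f"admin{i}", types[i % 4], i % 7, i) for i in range(5000)],
)

# Collect statistics so the planner knows how selective each index is.
conn.execute("ANALYZE")

query = ("SELECT * FROM links {hint} WHERE from_id_admin = 'admin42' "
         "AND from_type = 'Article' ORDER BY category_rank DESC, rank DESC")

def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

analyzed_plan = plan(query.format(hint=""))
# INDEXED BY makes SQLite use the named index or fail with an error.
forced_plan = plan(query.format(hint="INDEXED BY index_links_on_from_id_admin"))
rows = conn.execute(query.format(hint="")).fetchall()
print(analyzed_plan)
print(forced_plan)
```

INDEXED BY is best kept as a last resort; once the statistics are up to date the planner usually picks the selective index on its own.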

Ruby on Rails: Search one table where multiple rows must be present in another table

I'm trying to create a search where a single record must have multiple records in another table (linked by id's and has_many statements) in order to be included as a result.
I have tables users, skill_lists, skill_maps.
Users are mapped to individual skills through single entries in the skill_maps table. Many users can share a single skill, and a single user can have many skills through multiple entries in the skill_maps table.
User_id | Skill_list_id
2 | 9
2 | 15
3 | 9
user 2 has skills 9 and 15
user 3 has only skill 9
I'm trying to create a search that returns a hash of all users which have a set of skills. The set of required skill_ids appear as an array in the params.
Here's the code that I'm using:
skill_selection_user_ids = SkillMap.find_all_by_skill_list_id(params[:skill_ids]).map(&:user_id)
@results = User.find(:all, :conditions => {:id => skill_selection_user_ids})
The problem is that this returns all users that have ANY of these skills not users that have ALL of them.
Also, my users table is linked to the skill_lists table :through => :skill_maps and vice versa, so that I can call @user.skill_list etc...
I'm sure this is a real newbie question, I'm totally new to rails (and programming). I searched and searched for a solution but couldn't find anything. I don't really know how to explain the problem in a single search term.
I personally don't know how to do this using ActiveRecord's query interface. The easiest thing to do would be to retrieve lists of users who have each individual skill, and then take the intersection of those lists, perhaps using Set:
require 'set'
skills = [5, 10, 19] # for example
user_ids = skills.map { |s| Set.new(SkillMap.find_all_by_skill_list_id(s).map(&:user_id)) }.reduce(:&)
users = User.where(:id => user_ids.to_a)
For (likely) higher performance, you could "roll your own" SQL and let the DB engine do the work. I may be able to come up with some SQL for you, if you need high performance here. (Or if anyone else can, please edit this answer!)
By the way, you should probably put an index on skill_maps.skill_list_id to ensure good performance even if the skill_maps table gets very large. See the ActiveRecord::Migration documentation: http://api.rubyonrails.org/classes/ActiveRecord/Migration.html
You'll probably have to use some custom SQL to get the user IDs. I tested this query on a similar HABTM relationship and it seems to work:
SELECT DISTINCT(user_id) FROM skill_maps AS t1 WHERE (SELECT COUNT(skill_list_id) FROM skill_maps AS t2 WHERE t2.user_id = t1.user_id AND t2.skill_list_id IN (1,2,3)) = 3
The trick is in the subquery. For each row in the outer query, it finds a count of records for that row that match any of the skills that you're interested in. Then it checks whether that count matches the total number of skills you're interested in. If there's a match, then the user must possess all of the skills you searched for.
You could execute this in Rails using find_by_sql:
sql = 'SELECT DISTINCT(user_id) FROM skill_maps AS t1 WHERE (SELECT COUNT(skill_list_id) FROM skill_maps AS t2 WHERE t2.user_id = t1.user_id AND t2.skill_list_id IN (?)) = ?'
skill_ids = params[:skill_ids]
user_ids = SkillMap.find_by_sql([sql, skill_ids, skill_ids.size]).map(&:user_id)
Sorry if the table and column names aren't exactly right, but hopefully this is in the ballpark.
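The same "has ALL of these skills" check can also be written with GROUP BY and HAVING instead of a correlated subquery. A minimal sketch, using Python's built-in sqlite3 as a stand-in for the Rails app's database and the sample rows from the question; the function name users_with_all_skills is made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE skill_maps (user_id INTEGER, skill_list_id INTEGER)")
# Sample data from the question: user 2 has skills 9 and 15, user 3 only 9.
conn.executemany("INSERT INTO skill_maps VALUES (?, ?)", [(2, 9), (2, 15), (3, 9)])

def users_with_all_skills(conn, skill_ids):
    """Return ids of users who have every skill in skill_ids."""
    wanted = sorted(set(skill_ids))  # de-duplicate so the count comparison is exact
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        "SELECT user_id FROM skill_maps "
        f"WHERE skill_list_id IN ({placeholders}) "
        "GROUP BY user_id "
        "HAVING COUNT(DISTINCT skill_list_id) = ? "
        "ORDER BY user_id"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

both = users_with_all_skills(conn, [9, 15])
only_nine = users_with_all_skills(conn, [9])
print(both, only_nine)
```

A user is kept only when the number of distinct matching skills equals the number of skills requested, which is the same idea as the count comparison in the subquery above.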

Improve SQL Server 2005 Query Performance

I have a course search engine and when I try to do a search, it takes too long to show search results. You can try to do a search here
At that page you can also see the database tables and indexes, if any.
I'm not using Stored Procedures - the queries are inline using Coldfusion.
I think I need to create some indexes but I'm not sure what kind (clustered, non-clustered) and on what columns.
You need to create indexes on columns that appear in your WHERE clauses. There are a few exceptions to that rule:
If the column only has one or two unique values (the canonical example of this is "gender" - with only "Male" and "Female" the possible values, there is no point to an index here). Generally, you want an index that will be able to restrict the rows that need to be processed by a significant number (for example, an index that only reduces the search space by 50% is not worth it, but one that reduces it by 99% is).
If you are searching for x LIKE '%something' then there is no point in an index. If you think of an index as specifying a particular order for rows, then sorting by x if you're searching for "%something" is useless: you're going to have to scan all rows anyway.
So let's take a look at the case where you're searching for "keyword 'accounting'". According to your result page, the SQL that this generates is:
c.providerid = p.providerid AND
c.activatedYN = 'Y' AND
c.name like '%accounting%' OR
c.title like '%accounting%' OR
c.keywords like '%accounting%'
) sq
) AS temp
Row >= 1 AND Row <= 10
In this case, I will assume that cpd_COURSES.providerid is a foreign key to cpd_PROVIDERS.providerid. Note that SQL Server does not create an index on a foreign key column automatically (only the referenced primary key gets one), so an index on cpd_COURSES.providerid is worth adding for the join.
Additionally, the activatedYN column is a T/F column and (according to my rule above about restricting the possible values by only 50%) a T/F column should not be indexed, either.
Finally, because you're searching with an x LIKE '%accounting%' query, you don't need an index on name, title or keywords either, because it would never be used.
So the main thing you need to do in this case is make sure that cpd_COURSES.providerid actually is a foreign key for cpd_PROVIDERS.providerid and that it is indexed.
SQL Server Specific
Because you're using SQL Server, Management Studio has a number of tools to help you decide where you need to put indexes. The Database Engine Tuning Advisor (the SQL Server 2005 successor to the "Index Tuning Wizard") is usually pretty good at telling you what will give you a good performance improvement. You just cut'n'paste your query into it, and it'll come back with recommendations for indexes to add.
You still need to be a little bit careful with the indexes that you add, because the more indexes you have, the slower INSERTs and UPDATEs will be. So sometimes you'll need to consolidate indexes, or just ignore them altogether if they don't give enough of a performance benefit. Some judgement is required.
Is this the real live database data? 52,000 records is a very small table, relatively speaking, for what SQL 2005 can deal with.
I wonder how much RAM is allocated to the SQL server, or what sort of disk the database is on. An IDE or even SATA hard disk can't give the same performance as a 15K RPM SAS disk, and it would be nice if there was sufficient RAM to cache the bulk of the frequently accessed data.
Having said all that, I feel the " (c.name like '%accounting%' OR c.title like '%accounting%' OR c.keywords like '%accounting%') " clause is problematic.
Could you create a separate Course_Keywords table, with two columns "courseid" and "keyword" (varchar(24) should be sufficient for the longest keyword?), with a composite clustered index on courseid+keyword?
Then, to make the UI even more friendly, use AJAX to apply keyword validation & auto-completion when people type words into the keywords input field. This gives you the behind-the-scenes benefit of having an exact keyword to search for, removing the need for pattern-matching with the LIKE operator...
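A rough sketch of the keyword-table idea, using Python's built-in sqlite3 rather than SQL Server. The table and column names follow the suggestion above, the sample rows are invented, and the key is ordered (keyword, courseid) because lookups go by keyword; that ordering is a choice made for this sketch and differs from the courseid+keyword order mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (courseid INTEGER PRIMARY KEY, title TEXT)")
# WITHOUT ROWID stores rows in primary-key order, roughly like a clustered index.
conn.execute("CREATE TABLE course_keywords ("
             " keyword TEXT NOT NULL, courseid INTEGER NOT NULL,"
             " PRIMARY KEY (keyword, courseid)) WITHOUT ROWID")
conn.executemany("INSERT INTO courses VALUES (?, ?)",
                 [(1, "Intro to Accounting"), (2, "Tax Law"), (3, "Advanced Accounting")])
conn.executemany("INSERT INTO course_keywords VALUES (?, ?)",
                 [("accounting", 1), ("finance", 1), ("tax", 2), ("accounting", 3)])

# Exact-match lookup: no LIKE '%...%' pattern matching needed.
exact = conn.execute(
    "SELECT c.courseid, c.title FROM course_keywords k "
    "JOIN courses c ON c.courseid = k.courseid "
    "WHERE k.keyword = ? ORDER BY c.courseid",
    ("accounting",),
).fetchall()
print(exact)
```

Because the search term is compared with equality, the engine can seek straight to the matching keyword entries instead of scanning every course row.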
Using CF9? Try using Solr full text search instead of %xxx%?
You'll want to create indexes on the fields you search by. An index is a secondary list of your records presorted by the indexed fields.
Think of an old-fashioned printed yellow pages - if you want to look up a person by their last name, the phonebook is already sorted in that way - Last Name is the clustered index field. If you wanted to find phone numbers for people named Jennifer or the person with the phone number 867-5309, you'd have to search through every entry and it would take a long time. If there were an index in the back with all the phone numbers or first names listed in order, along with the page in the phonebook where the person is listed, it would be a lot faster. These would be the nonclustered indexes.
I would try changing your IN statements to an EXISTS query to see if you get better performance on the Zip code lookup. My experience is that IN statements work great for small lists but the larger they get, you get better performance out of EXISTS as the query engine will stop searching for a specific value the first instance it runs into.
<CFIF zipcodes is not "">
SELECT zipcode
WHERE zipcode = p.zipcode
AND 3963 * (ACOS((SIN(#getzipcodeinfo.latitude#/57.2958) * SIN(latitude/57.2958)) +
(COS(#getzipcodeinfo.latitude#/57.2958) * COS(latitude/57.2958) *
COS(longitude/57.2958 - #getzipcodeinfo.longitude#/57.2958)))) <= #radius#
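To make the IN-to-EXISTS rewrite concrete, here is a small sketch in Python's built-in sqlite3 (not SQL Server or ColdFusion). The in_radius flag is a made-up stand-in for the distance calculation above, and the rows are invented. It only shows that the two forms return the same rows; whether EXISTS is actually faster depends on the engine and its optimizer, so compare the execution plans on the real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE providers (providerid INTEGER PRIMARY KEY, zipcode TEXT)")
# in_radius is a hypothetical precomputed flag replacing the distance formula.
conn.execute("CREATE TABLE zipcodes (zipcode TEXT PRIMARY KEY, in_radius INTEGER)")
conn.executemany("INSERT INTO providers VALUES (?, ?)",
                 [(1, "10001"), (2, "10002"), (3, "90210"), (4, "10001")])
conn.executemany("INSERT INTO zipcodes VALUES (?, ?)",
                 [("10001", 1), ("10002", 1), ("90210", 0)])

in_sql = """
    SELECT p.providerid FROM providers p
    WHERE p.zipcode IN (SELECT zipcode FROM zipcodes WHERE in_radius = 1)
    ORDER BY p.providerid
"""
exists_sql = """
    SELECT p.providerid FROM providers p
    WHERE EXISTS (SELECT 1 FROM zipcodes z
                  WHERE z.zipcode = p.zipcode AND z.in_radius = 1)
    ORDER BY p.providerid
"""
in_ids = [row[0] for row in conn.execute(in_sql)]
exists_ids = [row[0] for row in conn.execute(exists_sql)]
print(in_ids, exists_ids)
```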
