Ruby on Rails: Search one table where multiple rows must be present in another table - ruby

I'm trying to create a search where a single record must have multiple records in another table (linked by id's and has_many statements) in order to be included as a result.
I have tables users, skill_lists, skill_maps.
users are mapped to individual skills through single entries in the skill_maps table. Many users can share a single skill, and a single user can have many skills through multiple entries in the skill_maps table.
e.g.
User_id | Skill_list_id
2 | 9
2 | 15
3 | 9
user 2 has skills 9 and 15
user 3 has only skill 9
I'm trying to create a search that returns a hash of all users who have a given set of skills. The set of required skill_ids appears as an array in the params.
Here's the code that I'm using:
skill_selection_user_ids = SkillMap.find_all_by_skill_list_id(params[:skill_ids]).map(&:user_id)
@results = User.find(:all, :conditions => {:id => skill_selection_user_ids})
The problem is that this returns all users that have ANY of these skills not users that have ALL of them.
Also, my users table is linked to the skill_lists table :through => :skill_maps and vice versa, so that I can call @user.skill_list etc.
I'm sure this is a real newbie question, I'm totally new to rails (and programming). I searched and searched for a solution but couldn't find anything. I don't really know how to explain the problem in a single search term.

I personally don't know how to do this using ActiveRecord's query interface. The easiest thing to do would be to retrieve lists of users who have each individual skill, and then take the intersection of those lists, perhaps using Set:
require 'set'
skills = [5, 10, 19] # for example
user_ids = skills.map { |s| Set.new(SkillMap.find_all_by_skill_list_id(s).map(&:user_id)) }.reduce(:&)
users = User.where(:id => user_ids.to_a)
For (likely) higher performance, you could "roll your own" SQL and let the DB engine do the work. I may be able to come up with some SQL for you, if you need high performance here. (Or if anyone else can, please edit this answer!)
By the way, you should probably put an index on skill_maps.skill_list_id to ensure good performance even if the skill_maps table gets very large. See the ActiveRecord::Migration documentation: http://api.rubyonrails.org/classes/ActiveRecord/Migration.html

You'll probably have to use some custom SQL to get the user IDs. I tested this query on a similar HABTM relationship and it seems to work:
SELECT DISTINCT(user_id) FROM skill_maps AS t1 WHERE (SELECT COUNT(skill_list_id) FROM skill_maps AS t2 WHERE t2.user_id = t1.user_id AND t2.skill_list_id IN (1,2,3)) = 3
The trick is in the subquery. For each row in the outer query, it finds a count of records for that row that match any of the skills that you're interested in. Then it checks whether that count matches the total number of skills you're interested in. If there's a match, then the user must possess all of the skills you searched for.
You could execute this in Rails using find_by_sql:
sql = 'SELECT DISTINCT(user_id) FROM skill_maps AS t1 WHERE (SELECT COUNT(skill_list_id) FROM skill_maps AS t2 WHERE t2.user_id = t1.user_id AND t2.skill_list_id IN (?)) = ?'
skill_ids = params[:skill_ids]
# find_by_sql returns SkillMap objects, so pull the ids out of them
user_ids = SkillMap.find_by_sql([sql, skill_ids, skill_ids.size]).map(&:user_id)
Sorry if the table and column names aren't exactly right, but hopefully this is in the ballpark.
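To see why comparing the count against the number of requested skills works, here is a minimal plain-Ruby sketch of the same logic, run against the sample rows from the question instead of a database. The method names (sample_skill_maps, users_with_all_skills) are made up for illustration, and the .uniq calls are an extra guard against duplicate rows that the SQL above does not have.

```ruby
# In-memory stand-in for the skill_maps table, using the question's sample rows.
def sample_skill_maps
  [
    { user_id: 2, skill_list_id: 9 },
    { user_id: 2, skill_list_id: 15 },
    { user_id: 3, skill_list_id: 9 }
  ]
end

# Returns the ids of users who have every skill in required_ids.
def users_with_all_skills(rows, required_ids)
  required = required_ids.uniq
  rows
    .select { |row| required.include?(row[:skill_list_id]) } # like IN (...)
    .group_by { |row| row[:user_id] }
    .select { |_user_id, matches| matches.map { |m| m[:skill_list_id] }.uniq.size == required.size }
    .keys
end

p users_with_all_skills(sample_skill_maps, [9, 15])
p users_with_all_skills(sample_skill_maps, [9])
```

With the sample rows, asking for skills 9 and 15 gives only user 2, while asking for skill 9 alone gives users 2 and 3.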

Related

Oracle duplicate field but still correct

So I built a query for my leadership team that was correct, but I don't understand why Oracle gave me the correct answer.
I have 3 tables that I needed to get data out of in order to get the total billed amount.
Here is my query (please forgive me, this is my 2nd post and I'm not sure how to properly format my queries):
select b.total_amount_billed as billed from t1.billing_information b
where b.billing_no in
(select h.billing_no
from t1.res_history h where h.res_seq_no in
(Select r.reservation_seq_no
from t1.res r where r.customer_order_no in ('THO40000') ))
So in the deepest select, I take the sequence numbers where my customer order number is THO40000; this query returns 2 sequence numbers.
The second subquery returns the billing numbers for my order from the history table where the sequence numbers match. In this case, both rows for this order use the same billing number, 312000.
The final select returns my total billed amount where it matches the billing numbers found, in my case $110.
The query works, but what I don't understand is why the result is not duplicated. Why does it not return 110 for each time it found 312000, giving me 2 records of 110? The billing number is a PK in the billing_information table. I'm not sure why it worked without me using the DISTINCT keyword on the query for the billing number.
Anyway, thanks for the help, I'll do my best to explain if you have questions!

You are being saved because you used IN to get the billing_no values to use, rather than an INNER JOIN between the two tables using b.billing_no = h.billing_no. A join would have duplicated the records, but your IN query is essentially this:
select b.total_amount_billed as billed
from t1.billing_information b
where b.billing_no in (312000, 312000);
If there is a single row in billing_information having billing_no equal to 312000, it is in the list, so the WHERE condition is true and it is included in the results. The fact that it is in the list twice doesn't make the IN condition "more true".
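The difference can be sketched in plain Ruby with in-memory rows standing in for the two tables. This is only an illustration of the semantics, not how Oracle executes anything; the method names and the res_seq_no values 1 and 2 are invented, and only the billing number 312000 and the amount 110 come from the question.

```ruby
# Hypothetical in-memory rows mirroring the question's example.
def billing_information
  [{ billing_no: 312000, total_amount_billed: 110 }]
end

def res_history
  [{ res_seq_no: 1, billing_no: 312000 }, { res_seq_no: 2, billing_no: 312000 }]
end

# IN semantics: each billing row is tested once against the list of values.
def billed_via_in(billing, history)
  billing_nos = history.map { |h| h[:billing_no] } # [312000, 312000]
  billing.select { |b| billing_nos.include?(b[:billing_no]) }
         .map { |b| b[:total_amount_billed] }
end

# Inner-join semantics: one output row per matching (billing, history) pair.
def billed_via_join(billing, history)
  billing.flat_map do |b|
    history.select { |h| h[:billing_no] == b[:billing_no] }
           .map { b[:total_amount_billed] }
  end
end

p billed_via_in(billing_information, res_history)
p billed_via_join(billing_information, res_history)
```

The IN version yields a single 110, while the join version yields 110 twice, once per matching history row.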

Rails 3 Postgres uses single field index for query, but shouldn't it use the compound index?

So User has many :orders, which works like you expect. I also have a valid scope on order that should filter by ensuring the orders are in a set of whitelisted states (not canceled orders, for instance)
I've declared some indices on the orders table, and my schema.rb looks like:
add_index "orders", ["state"], :name => "index_orders_on_state"
add_index "orders", ["user_id", "state"], :name => "index_orders_on_user_id_and_state"
add_index "orders", ["user_id"], :name => "index_orders_on_user_id"
When I run puts user.orders.valid.explain I get this:
EXPLAIN for: SELECT "orders".* FROM "orders"
WHERE "orders"."user_id" = 1 AND
"orders"."state" IN ('pending', 'packed', 'shipped', 'in_transit', 'delivered', 'return_pending', 'returned')
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on orders (cost=4.60..154.88 rows=40 width=3323)
Recheck Cond: (user_id = 1)
Filter: ((state)::text = ANY ('{pending,packed,shipped,in_transit,delivered,return_pending,returned}'::text[]))
-> Bitmap Index Scan on index_orders_on_user_id (cost=0.00..4.59 rows=44 width=0)
Index Cond: (user_id = 1)
So given that I am searching on user_id and state, and I have a compound index for both those fields, why is it not using the index_orders_on_user_id_and_state index? Or am I just reading this explain output wrong?
Is it doing two passes? One to find orders by user_id, and then another pass to check for state?
I need to run queries like this a lot, on a lot of records at once. So any way to keep it speedy is a very good thing.

The database system may decide not to use indexes. For example, with MySQL, if the table is small, it may decide to do a full table scan. You can try inserting several million records and running the query again to see how the plan changes.

A pretty good explanation of how Postgres uses indexes internally is here:
https://devcenter.heroku.com/articles/postgresql-indexes
The relevant part is:
There are many reasons why the Postgres planner may choose to not use
an index. Most of the time, the planner chooses correctly, even if it
isn’t obvious why. It’s okay if the same query uses an index scan on
some occasions but not others. The number of rows retrieved from the
table may vary based on the particular constant values the query
retrieves. So, for example, it might be correct for the query planner
to use an index for the query select * from foo where bar = 1, and yet
not use one for the query select * from foo where bar = 2, if there
happened to be far more rows with “bar” values of 2. When this
happens, a sequential scan is actually most likely much faster than an
index scan, so the query planner has in fact correctly judged that the
cost of performing the query that way is lower.

Oracle query with multiple tables

I am trying to display volunteer information along with each volunteer's duty and the performance it is allocated to.
However, when I run the query, it does not pick up the different dates for the same performance, and the availability_date values are mixed up. Is this the right query for it? I am not sure.
Could you give me some feedback?
Thanks.
Here is the query:
SELECT Production.name, performance.performance_date, volunteer_duty.availability_date, customer.name "Customer volunteer", volunteer.volunteerid, membership.name "Member volunteer", membership.membershipid
FROM Customer, Membership, Volunteer, volunteer_duty, duty, performance_duty, performance, production
WHERE
Customer.customerId (+) = Membership.customerId AND
Membership.membershipId = Volunteer.membershipId AND
volunteer.volunteerid = volunteer_duty.volunteerid AND
duty.dutyid = volunteer_duty.dutyid AND
volunteer_duty.dutyId = performance_duty.dutyId AND
volunteer_duty.volunteerId = performance_duty.volunteerId AND
performance_duty.performanceId = performance.performanceId AND
Performance.productionId = production.productionId
The query seems reasonable, in terms of it having what appear to be the appropriate join conditions between all the tables. It's not clear to me what issue you are having with the results; it might help if you explained in more detail and/or showed a relevant subset of the data.
However, since you say there is some issue related to availability_date, my first thought is that you want to have some condition on that column, to ensure that a volunteer is available for a given duty on the date of a given performance. This might mean simply adding volunteer_duty.availability_date = performance.performance_date to the query conditions.
My more general recommendation is to start writing the query from scratch, adding one table at a time, and using ANSI join syntax. This will make it clearer which conditions are related to which joins, and if you add one table at a time hopefully you will see the point at which the results are going wrong.
For instance, I'd probably start with this:
SELECT production.name, performance.performance_date
FROM production
JOIN performance ON production.productionid = performance.productionid
If that gives results that make sense, then I would go on to add a join to performance_duty and run that query. Et cetera.
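As a rough illustration of the availability_date suggestion above, here is a small plain-Ruby sketch that joins in-memory rows on the same keys as the query. All row values and method names are invented; only the column names come from the question, and I am assuming a volunteer can have several availability dates for the same duty.

```ruby
require 'date'

# Hypothetical rows; only the column names come from the question.
def volunteer_duties
  [
    { volunteerid: 1, dutyid: 10, availability_date: Date.new(2024, 5, 1) },
    { volunteerid: 1, dutyid: 10, availability_date: Date.new(2024, 5, 2) }
  ]
end

def performance_duties
  [{ performanceid: 100, dutyid: 10, volunteerid: 1 }]
end

def performances
  [{ performanceid: 100, performance_date: Date.new(2024, 5, 1) }]
end

# Joins the three lists on the same keys as the query. With match_dates set,
# it also requires availability_date == performance_date.
def assignments(match_dates:)
  volunteer_duties.flat_map do |vd|
    performance_duties
      .select { |pd| pd[:dutyid] == vd[:dutyid] && pd[:volunteerid] == vd[:volunteerid] }
      .flat_map { |pd| performances.select { |perf| perf[:performanceid] == pd[:performanceid] } }
      .select { |perf| !match_dates || perf[:performance_date] == vd[:availability_date] }
      .map do |perf|
        { volunteerid: vd[:volunteerid],
          performance_date: perf[:performance_date],
          availability_date: vd[:availability_date] }
      end
  end
end

p assignments(match_dates: false).size
p assignments(match_dates: true).size
```

Without the date condition, the performance is paired with both availability rows, including the one for a different day; with it, only the row whose availability_date matches the performance_date remains.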
I suggest that you write explicit JOINs instead of using the WHERE syntax.
Using INNER JOINs, the query you are describing could look like:
SELECT *
FROM volunteer v
INNER JOIN volunteer_duty vd ON (v.volunteerId = vd.volunteerId)
INNER JOIN performance_duty pd ON (vd.dutyId = pd.dutyId AND vd.volunteerId = pd.volunteerId)
INNER JOIN performance p ON (pd.performanceId = p.performanceId)

How to get around strategic eager loading in Datamapper?

I'm processing a ton of book records (12.5 million) with Ruby and Datamapper. On rare occasions I need to grab associated identifiers for a particular book record, but Datamapper is creating a select statement grabbing all the associated identifiers for all the book records. The query takes more than 2 minutes.
http://datamapper.org/why.html
The help document says this is "Strategic Eager Loading" and...
"The idea is that you aren't going to load a set of objects and use only an association in just one of them. This should hold up pretty well against a 99% rule.
When you don't want it to work like this, just load the item you want in it's own set. So DataMapper thinks ahead. We like to call it "performant by default". This feature single-handedly wipes out the "N+1 Query Problem"."
However, how do you load an item in its own set? I can't seem to find a way to specify that I really only want to query the identifiers for one of the book records.
If you are experiencing this issue, it might be because you are using Model.first() rather than Model.get(). See my comments under the question too.
As of DM 1.1.0...
Example using Model.first:
# this will create a select statement for one book record
book = Books.first(:author => 'Jane Austen')
# this will create select statement for all isbns associated with all books
# if there are a lot of books and identifiers, it will take forever
book.isbns.each do |isbn|
  # however, as expected it only iterates through related isbns
  puts isbn
end
This is the same behavior as using Book.all, and then selecting the associations on one
Example using Model.get:
# this will create a select statement for one book record
book = Books.get(2345)
# this will create a select statement only for the isbns of the book with primary key 2345
book.isbns.each do |isbn|
  puts isbn
end

Mongo multiple queries or database normalization

I'm using MongoDB for my database. The query that I'm currently working on revealed a possible deficiency in my schema. Below is the relevant layout of my collections. Note that games.players is an array of 2 players since the game is chess.
users {_id, username, ...}
games {_id, players[], ...}
msgs {_id, username, gameid, time, msg}
The data that I need is:
All msgs newer than a given timestamp, for the games which a user is in.
In a SQL database, my query would look similar to:
SELECT * FROM msgs WHERE time>=$time AND gameid IN
(SELECT _id FROM games WHERE players=$username);
But Mongo isn't a relational database, so it doesn't support sub-queries or joins. I see two possible solutions. Which would be better performance-wise and efficiency-wise?
Multiple Queries
  Select the games the user is in, then use $in to match msgs.gameid against them.
  Other?
Normalization
  Make users.games contain all games a user is in.
  Copy games.players to msgs.players by msgs.gameid
  etc.
I'm a relative newbie to MongoDB, but I find myself frequently using a combination of the two approaches. Some things - e.g. user names - are frequently duplicated to simplify queries used for display, but any time I need to do more than display information, I wind up writing multiple queries, sometimes 2 or 3 levels deep, using $in, to gather all the documents I need to work with for a given operation.

You can "normalize" yourself. I would add an array to users that lists the games the user is a member of:
users {_id, username, games={game1,game2,game3}}
Now you can query msgs where time >= $time and msgs.gameid "is in" users.games.
You will have to maintain the games list on each user.
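For what it's worth, the two-step lookup can be sketched in plain Ruby with arrays of hashes standing in for the collections. No Mongo driver is used here, and the sample documents and method names are hypothetical; the point is only the shape of the two steps (find the game ids, then filter msgs by those ids and by time).

```ruby
# Hypothetical in-memory documents standing in for the games and msgs collections.
def games
  [
    { _id: 1, players: %w[alice bob] },
    { _id: 2, players: %w[carol dave] }
  ]
end

def msgs
  [
    { _id: 1, username: 'bob',   gameid: 1, time: 100, msg: 'e4' },
    { _id: 2, username: 'alice', gameid: 1, time: 200, msg: 'e5' },
    { _id: 3, username: 'carol', gameid: 2, time: 300, msg: 'd4' }
  ]
end

# Step 1: ids of the games the user plays in.
def game_ids_for(username)
  games.select { |g| g[:players].include?(username) }.map { |g| g[:_id] }
end

# Step 2: messages in those games at or after `since` (the $in step).
def recent_msgs_for(username, since)
  ids = game_ids_for(username)
  msgs.select { |m| ids.include?(m[:gameid]) && m[:time] >= since }
end

p recent_msgs_for('alice', 150).map { |m| m[:_id] }
```

If users.games is kept up to date as suggested above, step 1 becomes a read of that array from the user document rather than a query on games.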
