Going .pluck crazy. What is a realistic limit on query with array - ruby

I've used .pluck(:id) quite often (and map before it) to get a set of record ids. This is usually to get a set of related model records (e.g., :people has_many :scores, as: :assessed).
Let's say I have 10,000 people, but a query on People limits it to, say, 1,000.
people_ids = people.pluck(:id) # people is a relation/scope
scores = Score.where(:assessed_type => 'People', :assessed_id => people_ids)
There would be more to the Score query, but my basic question is: is querying with an array of, say, 1,000 ids a bad idea?
I should point out that the filtered Score query would be used to get a new set of People. This is a filter on People.
I only have a few hundred records in my test DB, and that works fine - but there must be a point where Postgres or Rails is going to blow up. In production, I don't see going past 1,000 ids, since People is automatically filtered before this Score option is used.
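For what it's worth, here is a sketch of the two ways this query can be expressed, assuming a standard Rails/ActiveRecord setup on Postgres (the generated SQL shown in the comments is abbreviated):

# Sketch only. The pluck version expands into one big IN (...) list of bind values:
scores = Score.where(:assessed_type => 'People', :assessed_id => people_ids)
# SELECT "scores".* FROM "scores" WHERE "scores"."assessed_type" = 'People'
#   AND "scores"."assessed_id" IN (1, 2, 3, ...)

# Passing the relation itself makes ActiveRecord emit a subquery instead,
# so the ids never round-trip through Ruby:
scores = Score.where(:assessed_type => 'People', :assessed_id => people.select(:id))
# ... AND "scores"."assessed_id" IN (SELECT "people"."id" FROM "people" WHERE ...)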

Related

Eloquent: get average for both genders from a collection without querying the database again

I have this situation where I get a collection directly from the database and have introduced several filters already, except gender, as follows:
$workers = Worker::where('status','Active')->where('category','SomeCategory')->get();
In order to get the general age average, I get it like this:
$avg_age = $workers->avg('age');
So far so good.
Inside the $workers collection there is a gender column. Now I want to get the age average by gender. I have tried the following, which doesn't work:
$avg_age_women = $workers->where('gender','female')->avg();
But this does not work since the returned value is zero.
I know I could do the same like:
$workers_female = Worker::where('status','Active')->where('category','SomeCategory')->where('gender','female')->get();
But, if I can continue working with the $workers collection, I think that'd be more efficient.
Is there a way to get the average age of the female workers right from the original $workers collection? What am I missing?
You're not saying which column you want the avg of. Pass the column name to avg:
$avg_age_women = $workers->where('gender','female')->avg("age");

Renaming multiple items in a database

I have seeded a lot of data into my database in Rails 4. The data that I imported was entered by hand by a user of Gigabot (via the Gigabot API).
The problem I have is that I am trying to list "club nights", but I am getting lots of duplicates back because the names are similar but not identical. Is there any way I could group the items so that those whose names contain a certain word are grouped together?
Currently these are my only validations:
class Club < ActiveRecord::Base
  has_many :events
  validates :name, presence: true, uniqueness: true
  validates :location, presence: true
  validates :description, presence: true, uniqueness: true
end
Here is some example data that the table currently displays:
Name
DC10
Amnesia
Circo Loco # DC10
Sankeys
Sankeys Ibiza
Cocoon
Privilege Ibiza
Circoloco at Dc 10
Space
Space Ibiza
If you look at the above example you will see that some of the clubs are repeated. I would like to clean up the table so it would only have "DC10" as one club, with all the clubs that have DC10 in their name grouped under it.
So in the example above, instead of having 10 separate clubs there would be 6:
DC10,
Amnesia,
Space,
Sankeys,
Privilege,
Cocoon.
Have a look at the update_all method from ActiveRecord.
This will allow you to update all the values of fields in a collection. So now you just have to get a collection that you're certain fits together.
I suggest using something like SIMILAR TO in Postgres. So you could do something like:
pattern = '%DC10%' # This can be as advanced as you need it
collection = Club.where('name SIMILAR TO ?', pattern)
collection.update_all(name: 'DC10')
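Extending that idea, here is a sketch of how you might run the same update for several clubs at once. The pattern-to-name mapping is hypothetical and would need to be hand-tuned to your data:

# Hypothetical mapping of canonical name => SIMILAR TO pattern.
{
  'DC10'    => '%(DC10|Dc 10)%',
  'Sankeys' => '%Sankeys%',
  'Space'   => '%Space%'
}.each do |canonical, pattern|
  Club.where('name SIMILAR TO ?', pattern).update_all(name: canonical)
end
# Note: update_all bypasses validations, so the uniqueness validation on name
# won't stop duplicate rows; you would still need to merge or delete them afterwards.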
This sounds like a very difficult task. Most likely you won't be able to come up with a regex that can capture your intention.
For example, let's imagine you have a club Space and other entries:
Void # Space
Outer Space
Inner space
Alien in Outer Space
they all end in Space, but which ones should be regrouped? My example was a bit exaggerated, but it sounds like you are dealing with a lot of data, and cases like this one may occur.
Do you not have any other field which could help you group records together? Like GPS coordinates, city, etc.?

How can I fetch documents in a random order using MongoMapper?

I cannot use Array#shuffle since I don't fetch all documents (I only fetch up to twenty documents). How can I fetch random documents from a MongoDB database using MongoMapper (the equivalent of ORDER BY RAND() in MySQL)?
There's no technique similar to ORDER BY RAND(). And even in MySQL it is advised to avoid it (on large tables).
You could apply some common tricks, however.
For example, if you know the min and max values for your id, pick a random value within that range and get the next object:
db.collection.find({_id: {$gte: random_id}}).limit(1);
Repeat 20 times.
Or you could add a "random" field to each document yourself (and recalculate it every once in a while). This way you won't get truly random results with each query, but it'll be pretty cheap:
db.collection.find().sort({pseudo_random_field: 1}).limit(20)
// you can also skip some records here, but don't skip a lot.
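In MongoMapper terms, that second trick might look roughly like this. This is only a sketch; the Doc model and pseudo_random_field key are assumptions, not something MongoMapper provides out of the box:

class Doc
  include MongoMapper::Document
  key :pseudo_random_field, Float  # set this to rand in a before_save callback
end

# Cheap pseudo-random selection of 20 documents, using the precomputed field.
Doc.sort(:pseudo_random_field).limit(20).all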
Use skip and the Random class:
class Book
  include MongoMapper::Document
  key :title
  key :author
end

offset = Random.rand(0..(Book.count - 1)) # pick a random offset into the collection
Book.skip(offset).first

Mongo multiple queries or database normalization

I'm using MongoDB for my database. The query that I'm currently working on revealed a possible deficiency in my schema. Below is the relevant layout of my collections. Note that games.players is an array of 2 players since the game is chess.
users {_id, username, ...}
games {_id, players[], ...}
msgs {_id, username, gameid, time, msg}
The data that I need is:
All msgs newer than a given timestamp, for games which the user is in.
In a SQL database, my query would look similar to:
SELECT * FROM msgs WHERE time>=$time AND gameid IN
(SELECT _id FROM games WHERE players=$username);
But Mongo isn't a relational database, so it doesn't support sub-queries or joins. I see two possible solutions. Which would be better performance-wise and efficiency-wise?
Multiple Queries
Select the games the user is in, then use $in on msgs.gameid to match the messages.
Other?
Normalization
Make users.games contain all games a user is in.
Copy games.players to msgs.players, matched by msgs.gameid
etc.
I'm a relative newbie to MongoDB, but I find myself frequently using a combination of the two approaches. Some things - e.g. user names - are frequently duplicated to simplify queries used for display, but any time I need to do more than display information, I wind up writing multiple queries, sometimes 2 or 3 levels deep, using $in, to gather all the documents I need to work with for a given operation.
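For illustration, the multi-query approach might look like this with MongoMapper-style models (Game and Msg are assumed names, matching the collections above):

# 1) ids of the games the user plays in
game_ids = Game.where(:players => username).fields(:_id).all.map(&:id)

# 2) messages for just those games, newer than the given timestamp
msgs = Msg.where(:gameid.in => game_ids, :time.gte => since).all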
You can "normalize" yourself. I would add an array to users that list the games he is a member of;
users {_id, username, games={game1,game2,game3}}
Now you can do a query on msgs where time >= $time and gameid is in users.games.
You will have to maintain the games list on each user.
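With that array in place, the lookup collapses to a single find on users plus one query on msgs (again, a sketch with assumed model and key names):

user = User.first(:username => username)  # the users doc now carries its game ids
msgs = Msg.where(:gameid.in => user.games, :time.gte => since).all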

Querying MongoDB for last-items-before

Consider that I have two collections in MongoDB. One for products, with documents like:
{'_id': ObjectId('lalala'), 'title': 'Yellow banana'}
And another stores price changes with documents like:
{'product': DBRef('products', ObjectId('lalala')),
'since': datetime(2011, 4, 5),
'new_price': 150 }
One product may have many price changes. A price lasts until a new change with a later timestamp. I guess you've caught the idea.
Say I have 100 products. I want to query my DB to find out what the price of each product was at the moment of June 9, 2011. What is the most efficient (quickest) way to perform this query in MongoDB? Suppose I have no caching solution, or the cache is empty.
I thought about a group statement on the prices collection, where the reduce function would select the last since before the provided date, grouping by product.$id. But in that case I would not benefit from an index on the since field, and all documents would be scanned.
Any ideas?
I had a similar problem, but for GPS locations. I found the fastest way was to set up a query for each item, which is rather counter-intuitive if you're used to SQL databases.
Query for the item where its timestamp is less than or equal to the date you're looking for, sort by timestamp descending, and limit the result to 1. Repeat for each item. To really speed things up, run multiple queries in parallel to utilise all the cores on the MongoDB server.
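A sketch of that per-product query with MongoMapper-style models. Product and PriceChange, with product_id and since keys, are assumed names; the original documents use a DBRef instead:

target_date = Time.utc(2011, 6, 9)

prices_on_date = Product.all.map do |product|
  # Last change at or before the target date; a compound index on
  # [product_id, since] lets each of these queries use the index.
  PriceChange.where(:product_id => product.id, :since.lte => target_date)
             .sort(:since.desc)
             .limit(1)
             .first
end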
