Avoid Multiple Select on an Activerecord - ruby

I currently have the following code to find all the unique IDs of comments placed by a user. This works "fine". However, it's really slow for users with a lot of comments and I'm trying to figure out if there is a more elegant way of handling this as it doesn't seem to be the best solution.
def find_unique_user_grades
  @comments = []
  @environment.users.includes(:comments).map(&:comments).select do |comments|
    comments.select { |comment| @comments.push(comment.id) }
  end
  @comments.uniq!
end
I'm hoping someone can help me with this.

You should always prefer to do this in the database, not in Ruby. The code you've posted will load all users from the database, and then convert the raw row data to ActiveRecord objects, which is (comparatively) extremely expensive. You don't need any of that data to join through to comments. Then, you'll do the same for every user's comments (query and create ActiveRecord objects) and again, you don't need any of that to get at the comment's id column.
What you're after (assuming I've guessed correctly at the shape of your schema) is a simple join followed by a pluck. This will run a single query and return a single array of numbers, without any of the cost of loading users, creating objects, loading comments, creating objects, or iterating in Ruby.
Finally, it will also perform the distinct query in the database, where it can take advantage of any relevant indexes, rather than uniqing in Ruby.
The correct query is something close to:
@environment.users.joins(:comments).distinct.pluck('comments.id')
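To make concrete what that single query returns, here is a rough in-memory analogue in plain Ruby. The Struct definitions, the sample rows, and the unique_comment_ids helper are all hypothetical stand-ins rather than ActiveRecord; the only point is that a join, a pluck, and a distinct reduce to a filter, a single column, and a de-duplication:

```ruby
require "set"

# Hypothetical stand-ins for rows in the users and comments tables.
UserRow = Struct.new(:id, :environment_id)
CommentRow = Struct.new(:id, :user_id)

USERS = [UserRow.new(1, 10), UserRow.new(2, 10), UserRow.new(3, 99)].freeze
COMMENTS = [
  CommentRow.new(100, 1),
  CommentRow.new(101, 1),
  CommentRow.new(102, 2),
  CommentRow.new(103, 3)
].freeze

# In-memory analogue of the joins(:comments).distinct.pluck('comments.id') chain.
def unique_comment_ids(users, comments, environment_id)
  user_ids = users.select { |u| u.environment_id == environment_id }
                  .map(&:id)
                  .to_set
  comments.select { |c| user_ids.include?(c.user_id) } # the join
          .map(&:id)                                    # the pluck
          .uniq                                         # the distinct
end

p unique_comment_ids(USERS, COMMENTS, 10)
```

In the real query the database does this work itself, so none of these intermediate objects are ever built in Ruby.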

Related

Eloquent Eager Loading in Cursor (Lazy Collection)

I'm trying to export a large number of records from my database, but I need relationship data in order to build the export correctly. Ideally I would be able to use cursor() to get a Lazy Collection, but that won't load the relationships. I can't load the relationship within a loop, because that will create N+1 queries, and this could be hundreds of thousands of additional queries, which is unacceptable.
Here's what "works" (but runs out of memory):
Record::with('projects')->get()->map(function ($record) {
dd($record); // Shows the `projects` relationship
});
But when I use cursor()...
Record::with('projects')->cursor()->map(function ($record) {
dd($record); // Does NOT show the `projects` relationship
});
Is there a way to get a lazy collection that includes a record's relationship? I have looked in the documentation and it's not clear. Other suggestions have been to use chunk() which is unfortunately not a possibility in this situation.
EDIT: I shouldn't say chunk isn't a possibility, but it's a very expensive re-write. Currently, the data is structured with a lot of variability. So in order to construct the CSV for export, I need (for example) a header for the file. I currently grab that header by looping through all the records (the fields are stored in a JSONB field) and building out an array based on the fields present on those records.
I am also normalizing the data against those headers. So if one record has the field "address-1" but another record doesn't have that, the one that doesn't have it instead shows a blank value in the appropriate column. Otherwise, when inserting the row into the CSV, it doesn't respect the header.
These operations currently grab the entire data set and use a LazyCollection to map the header and normalize the records, and then feed it into the CSV one at a time. It would be ideal if I could grab relationships in a LazyCollection as well rather than having to rewrite the workflow.
According to this doc, cursor works at the database stage, whereas relations are loaded only after a method such as 'get' or 'first' has run.
So the code inside a cursor callback operates on a single database row, represented as a Model instance, before the overall result exists. That means it runs against the database row by row, without loading the relation (it iterates through your database records...).
If you can't use chunk, then I think you could manage your data in MySQL itself using raw expressions.

Is Laravel's 'pluck' method cheaper than a general 'get'?

I'm trying to dramatically cut down on pricey DB queries for an app I'm building, and thought I should perhaps just return IDs of a child collection (then find the related object from my React state), rather than returning the children themselves.
I suppose I'm asking, if I use 'pluck' to just return child IDs, is that more efficient than a general 'get', or would I be wasting my time with that?
Yes, the pluck method is just fine if you are trying to retrieve a single column from a table.
If you use the get() method, it will retrieve all the information about the child model, which can make querying and fetching the results a little slower.
So in my opinion, you are using a good method for retrieving the result.
Laravel also has other methods for select queries. Have a look at Selects.
Good practice when performing a DB select query in an application is to select only the columns that are necessary. If only the id column is needed, then select the id column instead of all columns. Otherwise, memory is wasted holding unused data. To be clear, pluck and get give the same result here:
Model::pluck('id')
// which is the same as
Model::select('id')->get()->pluck('id');
// which is the same as
Model::get(['id'])->pluck('id');
I know I'm a little late to the party, but I was wondering about this myself and decided to research it. It turns out that one method is faster than the other.
Using Model::select('id')->get() is faster than Model::get()->pluck('id').
This is because Illuminate\Support\Collection::pluck iterates over each returned Model and extracts the selected column(s) in a PHP foreach loop, whereas the first method is generally cheaper because the column selection happens in the database query instead.

Is there a way to sort a content query by the value of a field programmatically?

I'm working on a portal based on Orchard CMS. We're using Orchard to manage the "normal" content of the site, as well as to model what's essentially data for a small application embedded in it.
We figured that doing it that way is "recommended" for working in Orchard, and that it would save us duplicating a bunch of effort in features that Orchard already provides, mainly generating a good enough admin UI. This is also why we're using fields wherever possible.
However, for said application, the client wants to be able to display the data in the regular UI in a garden-variety datagrid that can be filtered, sorted, and paged.
I first tried to implement this by cobbling together a page with a bunch of form elements for the filtering, above a projection with filters bound to query string parameters. However, I ran into the following issues with this approach:
Filters for numeric fields crash when the value is missing - as would be pretty common to indicate that the given field shouldn't be considered when filtering. (This I could achieve by changing the implementation in the Orchard source, which would however make upgrading trickier later. I'd prefer to keep anything I haven't written untouched.)
It seems the sort order can only be defined in the administration UI, it doesn't seem to support tokens to allow for the field to sort by to be changed when querying.
So I decided to dump that approach and switched to trying to do this with just MVC controllers that access data using IContentQuery. However, there I found out that:
I have no clue how, if at all, it's possible to sort the query based on field values.
Or, for that matter, how / if I can filter.
I did take a look at the code of Orchard.Projections, however, how it handles sorting is pretty inscrutable to me, and there doesn't seem to be a straightforward way to change the sort order for just one query either.
So, is there any way to achieve what I need here with the rest of the setup (which isn't little) unchanged, or am I in a trap here, and I'll have to move every single property I wish to use for sorting / filtering into a content part and code the admin UI myself? (Or do something ludicrous, like create one query for every sortable property and direction.)
EDIT: Another thought I had was having my custom content part duplicate the fields that are displayed in the datagrids into Hibernate-backed properties accessible to query code, and whenever the content item is updated, copy values from these fields into the properties before saving. However, again, I'm not sure if this is feasible, and how I would be able to modify a content item just before it's saved on update.
Right so I have actually done a similar thing here to you. I ended up going down both approaches, creating some custom filters for projections so I could manage filters on the frontend. It turned out pretty cool but in the end projections lacked the raw querying power I needed (I needed to filter and sort based on joins to aggregated tables which I think I decided I didn't know how I could do that in projections, or if its nature of query building would allow it). I then decided to move all my data into a record so I could query and filter it. This felt like the right way to go about it, since if I was building a UI to filter records it made sense those records should be defined in code. However, I was sorting on users where each site had different registration data associated to users and (I think the following is a terrible affliction many Orchard devs suffer from) I wanted to build a reusable, modular system so I wouldn't have to change anything, ever!
Didn't really work out quite like I hoped, but to eventually answer the question in your title: yes, you can query fields. Orchard projections builds an index that it uses for querying fields. You can access these in HQL, get the ids of the content items, then call getmany to get them all. I did this several years ago, and I can't remember much, but I do remember having a distinctly unenjoyable time with it haha. So once you have an NHibernate session you can write your HQL:
select distinct civr.Id
from Orchard.ContentManagement.Records.ContentItemVersionRecord civr
join civr.ContentItemRecord cir
join cir.FieldIndexPartRecord fipr
join fipr.StringFieldIndexRecord sfir
This just shows you how to join to the field indexes. There are a few, one for each data type. This is the string one I'm joining here. They are all basically the same, with a PropertyName and a Value field. HQL allows you to add conditions to your join, so we can use that to join with the relevant field index records. If you have a field called Group attached directly to your content type, then it would look like this:
join fipr.StringFieldIndexRecord sfir
with sfir.PropertyName = 'MyContentType.Group.'
where sfir.Value = 'HR'
If your field is attached to a part, replace MyContentType with the name of your part. HQL is pretty awesome, and you can learn more here: https://docs.jboss.org/hibernate/orm/3.3/reference/en/html/queryhql.html But I dunno, it gave me a headache haha. At least HQL has documentation though, unlike Orchard's query layer. You can also always fall back to pure SQL when HQL won't do what you want; there is an option to write SQL queries from the NHibernate session.
Your other option is to index your content types with Lucene (easy if you are using fields), then filter and search by that. I quite liked using that, although sometimes indexes get corrupted or need to be rebuilt, so I've found it dangerous to rely on for something that populates pages regularly.
And pretty much whatever you do, one query to filter and sort, then another query to getmany on the contentmanager to get the content items is what you should accept is the way to go. Good luck!
You can use indexing and the Orchard Search API for this. Sebastien demoed something similar to what you're trying to achieve at Orchard Harvest recently: https://www.youtube.com/watch?v=7v5qSR4g7E0

Why in the world would I have_many relationships?

I just ran into an interesting situation about relationships and databases. I am writing a ruby app and for my database I am using postgresql. I have a parent object "user" and a related object "thingies" where a user can have one or more thingies. What would be the advantage of using a separate table vs just embedding data within a field in the parent table?
Example from ActiveRecord:
using a related table:
def change
create_table :users do |i|
i.text :name
end
create_table :thingies do |i|
i.integer :thingie
i.text :discription
end
end
class User < ActiveRecord::Base
has_many :thingies
end
class Thingie < ActiveRecord::Base
belongs_to :user
end
using an embedded data structure (multidimensional array) method:
def change
create_table :users do |i|
i.text :name
i.text :thingies, array: true # example contents: [[thingie,discription],[thingie,discription]]
end
end
class User < ActiveRecord::Base
end
Relevant Information
I am using heroku and heroku-postgres as my database. I am using their free option, which limits me to 10,000 rows. This seems to make me want to use the multidimensional array way, but I don't really know.
Embedding a data structure in a field can work for simple cases but it prevents you from taking advantage of relational databases. Relational databases are designed to find, update, delete and protect your data. With an embedded field containing its own wad-o-data (array, JSON, xml etc), you wind up writing all the code to do this yourself.
There are cases where the embedded field might be more suitable, but for this question I will use an example that highlights the advantages of the related table approach.
Imagine a User and Post example for a blog.
For an embedded post solution, you would have a table something like this (pseudocode; these are probably not valid DDL):
create table Users {
  id int auto_increment,
  name varchar(200),
  post text[][]
}
With related tables, you would do something like
create table Users {
  id int auto_increment,
  name varchar(200)
}
create table Posts {
  id int auto_increment,
  user_id int,
  content text
}
Object Relational Mapping (ORM) tools: With the embedded post, you will be writing the code manually to add posts to a user, navigate through existing posts, validate them, delete them etc. With the separate table design, you can leverage the ActiveRecord (or whatever object relational system you are using) tools for this which should keep your code much simpler.
Flexibility: Imagine you want to add a date field to the post. You can do it with an embedded field, but you will have to write code to parse your array, validate the fields, update the existing embedded posts etc. With the separate table, this is much simpler. In addition, lets say you want to add an Editor to your system who approves all the posts. With the relational example this is easy. As an example to find all posts edited by 'Bob' with ActiveRecord, you would just need:
Editor.find_by(name: 'Bob').posts
For the embedded side, you would have to write code to walk through every user in the database, parse every one of their posts and look for 'Bob' in the editor field.
Performance: Imagine that you have 10,000 users with an average of 100 posts each. Now you want to find all posts made on a certain date. With the embedded field, you must loop through every record, parse the entire array of all posts, extract the dates and check them against the one you want. This will chew up both CPU and disk I/O. With the separate table, you can easily index the date field and pull out the exact records you need without parsing every post from every user.
Standards: Using a vendor specific data structure means that moving your application to another database could be a pain. Postgres appears to have a rich set of data types, but they are not the same as MySQL, Oracle, SQL Server etc. If you stick with standard data types, you will have a much easier time swapping backends.
These are the main issues I see off the top of my head. I have made this mistake and paid the price for it, so unless there is a super-compelling reason to do otherwise, I would use the separate table.
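The performance point above can be sketched in plain Ruby. This is only an illustration under assumptions: the sample data, the Post struct, and both helper methods are hypothetical, and a Hash built with group_by stands in for a database index on the date column.

```ruby
require "date"

# Hypothetical row type for the separate-table design.
Post = Struct.new(:user_id, :content, :posted_on)

# Embedded design: each user row carries its own array of [content, date] pairs.
EMBEDDED_USERS = [
  { id: 1, posts: [["first", "2024-01-01"], ["second", "2024-01-02"]] },
  { id: 2, posts: [["hello", "2024-01-02"]] }
].freeze

# Separate-table design: one row per post.
POSTS = [
  Post.new(1, "first", Date.new(2024, 1, 1)),
  Post.new(1, "second", Date.new(2024, 1, 2)),
  Post.new(2, "hello", Date.new(2024, 1, 2))
].freeze

# Embedded: every user and every post has to be visited and parsed.
def embedded_posts_on(users, date)
  users.flat_map do |user|
    user[:posts].select { |_content, day| Date.parse(day) == date }
                .map { |content, _day| content }
  end
end

# Separate table: build the index once, then look up by key.
POSTS_BY_DATE = POSTS.group_by(&:posted_on).freeze

def indexed_posts_on(index, date)
  index.fetch(date, []).map(&:content)
end

target = Date.new(2024, 1, 2)
p embedded_posts_on(EMBEDDED_USERS, target)
p indexed_posts_on(POSTS_BY_DATE, target)
```

Both calls return the same posts, but the embedded version touches every stored post on each query, while the indexed version does a single keyed lookup.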
What if users John and Ann have the same thingies? The records will be duplicated, and if you decide to change the name of a thingie you will have to change two or more records. If the thingie is stored in a separate table, you have to change only one record. FYI https://en.wikipedia.org/wiki/Database_normalization
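A small plain-Ruby sketch of that duplication problem, assuming (as this answer does) that two users can share the same thingie. The hashes and the names in them are made up for illustration and are not the asker's schema:

```ruby
# Denormalized: the thingie's description is copied into every owner's row.
embedded = {
  "John" => [{ thingie: 1, description: "widget" }],
  "Ann"  => [{ thingie: 1, description: "widget" }]
}

# Normalized: thingies live in their own table; users reference them by id.
thingies = { 1 => { description: "widget" } }
user_thingie_ids = { "John" => [1], "Ann" => [1] }

# Renaming in the denormalized layout touches every copy.
embedded_updates = 0
embedded.each_value do |items|
  items.each do |item|
    next unless item[:thingie] == 1

    item[:description] = "gadget"
    embedded_updates += 1
  end
end

# Renaming in the normalized layout touches exactly one record.
thingies[1][:description] = "gadget"
normalized_updates = 1

puts "embedded updates: #{embedded_updates}, normalized updates: #{normalized_updates}"
```

With more owners per thingie, the embedded update count grows with the number of copies, while the normalized one stays at one.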
Benefits of one to many:
Easier ORM (Object Relational Mapping) integration. You can use it either way, but you have to define your tables with native sql. Having distinct tables is easier and you can make use of auto-generated mappings.
Your space limitation of 10,000 rows will go further with the one to many relationship in the case that 2 or more people can have the same "thingies."
Handle users and thingies separately. In some cases, you might only care about people or thingies, not their relationship with each other. Some examples: updating a username or thingy description, or getting a list of all thingies (or all users). Selecting from the single table can make these harder to work with.
Maintenance and manipulation is easier. In the case that a user or a thingy is updated (name change, email address update, etc), you only need to update 1 record in their table instead of writing update statements "where user_id=?".
Enforceable database constraints. What if a thingy is not owned by anyone? Is the user column now nullable? It would have to be in the single table case, so you could not enforce a simple "not nullable" username, for example.
There are a lot of reasons of course. If you are using a relational database, you should make use of the one to many by separating your objects (users and thingies) as separate tables. Considering your limitation on number of records and that the size of your dataset is small (under 10,000), you shouldn't feel the down side of normalized data.
The short truth is that there are benefits of both. You could, for example, get faster read times from the single table approach because you don't need complicated joins.
Here is a good reference with the pros/cons of both (normalized is the multiple table approach and denormalized is the single table approach).
http://www.ovaistariq.net/199/databases-normalization-or-denormalization-which-is-the-better-technique/
Besides the benefits others mentioned, there is also one thing about standards. If you are working on this app alone, then that's not a problem, but if someone else wants to change something, then the nightmare starts.
It may take that person a lot of time just to understand how it works, and modifying something like this will take even more time. This way, some simple improvement may be really time consuming. And at some point, you will be working with other people. So always code as if the person who ends up working with your code is a brutal psychopath who knows where you live.

LINQ Projection in Entity Framework

I posted a couple of questions about filtering in an eager loading query, and I guess the EF does not support filtering inside of the Include statement, so I came up with this.
I want to perform a simple query where I get a ChildProduct by SKU number, along with its PriceTiers filtered for IsActive.
Dim ChildProduct = ChildProductRepository.Query.
Where(Function(x) x.Sku = Sku).
Select(Function(x) New With {
.ChildProduct = x,
.PriceTiers = x.PriceTiers.
Where(Function(y) y.IsActive).
OrderBy(Function(y) y.QuantityStart)
}).Select(Function(x) x.ChildProduct).Single
Is there a more efficient way of doing this? Am I on the right track at all? It does work.
Another thing I really don't understand is why does this work? Do you just have to load an object graph and the EF will pick up on that and see that these collections belong to the ChildProduct even though they are inside of an anonymous type?
Also, what are the standards for formatting a long LINQ expression?
Is there a more efficient way of doing this? Am I on the right track at all?
Nope, that's about the way you do this in EF and yes, you're on the right track.
Another thing I really don't understand is why does this work?
This is considered to be a bit of a hack, but it works because EF analyzes the whole expression and generates one query (it would look about the same as if you just used Include, but with the PriceTiers collection filtered). As a result, you get your ChildProducts with the PriceTiers populated (and correctly filtered). Obviously, you don't need the PriceTiers property of your anonymous class (you discard it by just selecting x.ChildProduct), but adding it to the LINQ query tells EF to add the join and the extra where to the generated SQL. As a result, the ChildProduct contains all you need.
If this functionality is critical, create a stored procedure and link Entity Framework to it.

Resources