I'm trying to query a large amount of data (40K records), and I plan to query much larger datasets in the future. Eloquent seems to take a very long time to load this data, so I wondered if there is a faster way to process it. I'm checking the validity of my data, which means checking whether all fields are null.
I've used regular Eloquent calls. I don't think chunking the data is appropriate as I'm not planning on modifying the data in any way. I debated whether running a job every so often and calling the results of this job might be a better approach.
// Load the journal by ISSN, collect its outputs, and diff out duplicate DOIs
$journal = Journal::where('issn', $this->issn)->first();
$collection = $journal->outputs;
$collectionUnique = $collection->unique('doi');
$collectionDupes = $collection->diff($collectionUnique);
dd('Total Articles '.$this->getTotal(), 'Total Articles '.count($collection));
Just use the Query Builder!
Why should we use the Query Builder instead of Eloquent for lots of records? Here is the reason: the Query Builder is much faster than Eloquent.
Comparison (Eloquent vs. Query Builder): to insert 1,000 rows into a simple table, Eloquent takes about 1.2 seconds, while in that case the DB facade takes only about 800 milliseconds (ms).
Another comparison:
Eloquent ORM average response time (select operation):
Joins | Average (ms)
------+-------------
    1 |        162.2
    3 |       1002.7
    4 |       1540.0
Raw SQL average response time (select operation):
Joins | Average (ms)
------+-------------
    1 |        116.4
    3 |        130.6
    4 |        155.2
For more information: Laravel Eloquent vs Query Builder
Edit: your code should become:
$journal = DB::table('journals')->where('issn', $this->issn)->first();
And then, to keep using Collection methods (the simple way):
$journal = Collection::make($journal); // wrap the plain row so Collection methods are available
$collection = $journal->get('outputs'); // changed
$collectionUnique = $collection->unique('doi');
$collectionDupes = $collection->diff($collectionUnique);
dd('Total Articles '.$this->getTotal(), 'Total Articles '.count($collection));
Best performance: use queries and the Query Builder instead of collections, because doing the work in SQL is usually faster.
Please compare the timings of your original code and this version, and let me know in the comments :)
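For example, the duplicate-DOI check itself can be pushed down into SQL with the Query Builder. A minimal sketch, assuming an outputs table with journal_id and doi columns (your schema may differ):

$dupes = DB::table('outputs')
    ->select('doi', DB::raw('COUNT(*) as occurrences'))
    ->where('journal_id', $journal->id) // assumes outputs carries a journal_id foreign key
    ->groupBy('doi')
    ->having('occurrences', '>', 1)
    ->get();

This returns one row per duplicated DOI without ever hydrating 40K Eloquent models in PHP.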
I am running a PySpark application where we are comparing two large datasets of 3GB each. There are some differences in the datasets, which we are filtering via outer join.
mismatch_ids_row = (sourceonedf.join(sourcetwodf, on=primary_key, how='outer')
                    .where(condition)
                    .select(primary_key))
mismatch_ids_row.count()
So the output of the join is small, say 10 records. The shuffle partition count at this point is about 30, derived as amount of data / partition size (100 MB).
After this join, the two original datasets are each joined with the join result to filter out the relevant data for each dataframe.
df_1 = sourceonedf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
df_2 = sourcetwodf.join(mismatch_ids_row, on=primary_key, how='inner').dropDuplicates()
Here we drop duplicates, since the first join is an outer join and produces doubled rows where some values are null.
These two dataframes are then joined to do the column-level comparison and pinpoint exactly where the data is mismatched.
df = df_1.join(df_2, on=<some condition>, how="full_outer")
df.count()
The resultant dataset is then displayed with:
df.show()
The issue is that the first join, which carries the most data, uses a sort-merge join with about 30 shuffle partitions, which is fine since that dataset is fairly large.
After the first join completes, only about 10 mismatched rows remain, yet joining them back against 3 GB is a costly operation, and using a broadcast hint didn't help.
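(For reference, a broadcast hint in that join would normally look like the sketch below, using the standard pyspark.sql.functions.broadcast wrapper:)

from pyspark.sql.functions import broadcast

# Ship the ~10-row mismatch set to every executor
# instead of shuffling the 3 GB side of the join
df_1 = sourceonedf.join(broadcast(mismatch_ids_row),
                        on=primary_key, how='inner').dropDuplicates()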
The major issue, in my opinion, comes when the two small resultant datasets are joined in the second join to produce the result. There, too many shuffle partitions are killing the performance.
The application runs in client mode as a test Spark run, and the parameters are sufficient for it to run on the driver node.
Here is the DAG for the last operation: [DAG screenshot]
As an example:
data1 = [(335008138387,83165192,"yellow","2017-03-03",225,46),
(335008138384,83165189,"yellow","2017-03-03",220,4),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
data2 = [(335008138387,83165192,"yellow","2017-03-03",300,46),
(335008138384,83165189,"yellow","2017-03-03",220,10),
(335008138385,83165193,"yellow","2017-03-03",210,11),
(335008138386,83165194,"yellow","2017-03-03",230,12),
(335008138387,83165195,"yellow","2017-03-03",240,13),
(335008138388,83165196,"yellow","2017-03-03",250,14)
]
from pyspark.sql.types import (StructType, StructField,
                               LongType, IntegerType, StringType)

field = [
    StructField("row_num", LongType(), True),
    StructField("tripid", IntegerType(), True),
    StructField("car_type", StringType(), True),
    StructField("dates", StringType(), True),
    StructField("pickup_location_id", IntegerType(), True),
    StructField("trips", IntegerType(), True),
]
schema = StructType(field)
sourceonedf = spark.createDataFrame(data=data1,schema=schema)
sourcetwodf = spark.createDataFrame(data=data2,schema=schema)
They have just two differences; on a larger dataset, think of these as 10 or more differences.
df_1 will get rows from sourceonedf based on mismatch_ids_row, and so will df_2 from sourcetwodf. They are then joined to create another resultant dataframe, which outputs the data.
How can we optimize this piece of code so that the partitioning is optimal and it performs faster than it does now?
At this point the whole job takes ~500 seconds, when it could probably take about 200 seconds less. And why does show() take time as well? There are only 10 records, so it should print quickly if they are all in one partition, I guess.
Any suggestions are appreciated.
You should be able to go without df_1 and df_2. After the first 'outer' join you have all the data in that table already.
Cache the result of the first join (as you said, the dataframe is small):
# (Removed the select after the first join)
mismatch_ids_row = sourceonedf.join(sourcetwodf, on=primary_key, how='outer').where(condition)
mismatch_ids_row.cache()
mismatch_ids_row.count()
Then you should be able to create a self-join condition. When joining, use dataframe aliases for explicit control:
result_df = (
    mismatch_ids_row.alias('a')
    .join(mismatch_ids_row.alias('b'), on=<some condition>)
    .select(...)
)
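Since the cached mismatch set is tiny, the default 200 shuffle partitions are mostly overhead for this self-join. A minimal sketch of dialing them down (the right value is workload-dependent, and the setting affects every later shuffle in the session):

# With ~10 rows, a single shuffle partition is plenty for the self-join
spark.conf.set("spark.sql.shuffle.partitions", "1")

# Alternatively, squeeze the small cached dataframe into one partition explicitly
mismatch_small = mismatch_ids_row.coalesce(1)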
Having the query:
$events = $this->entityManager->createQueryBuilder()
->select('e')
->from('AppBundle:Event', 'e')
->where('e.member = :member')
->andWhere('e.isCancelled = false') // this one makes the query extremely slow
->andWhere('DATE(e.startsAt) = :date')
->setParameter('member', $this->member)
->setParameter('date', $this->date->format('Y-m-d'))
->setMaxResults(1)
->getQuery()
->useQueryCache(true)
->getResult()
;
Member is a ManyToOne relation; startsAt and isCancelled are indexed.
Now, AppBundle:Event has 500,000 rows in total. Only 271 of them belong to the Member, while about 450,000 have isCancelled = false.
Without e.isCancelled = false the query runs in a few milliseconds, but adding that condition stretches it to 4.5 minutes (localhost, MySQL 8.0).
My guess is that MySQL is checking isCancelled across all Event rows instead of only those assigned to the Member. How can I improve this query and get Doctrine to fetch only the non-cancelled events within the Member's results?
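One common remedy (a hedged suggestion with hypothetical table and column names, since the actual schema isn't shown) is a composite index that leads with the selective member column, so MySQL can narrow to the 271 rows before checking the flag. Note also that wrapping startsAt in DATE() generally prevents an index on that column from being used, so a range comparison is friendlier:

-- Hypothetical table/column names; adjust to the real schema.
-- Lets MySQL narrow by member first, then the flag, then the date.
ALTER TABLE event
    ADD INDEX idx_member_cancelled_starts (member_id, is_cancelled, starts_at);

-- And compare against a range instead of DATE(starts_at) = :date:
-- WHERE member_id = :member
--   AND is_cancelled = 0
--   AND starts_at >= :date AND starts_at < :date + INTERVAL 1 DAY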
I have a fairly large and complex query being generated by Yii. To generate this query we're utilizing CDbCriteria::with to eagerly load multiple related models, and we're using multiple scopes to limit the records returned. The generated query is roughly 700 lines long, but looks something like this:
SELECT `t`.`column1` as `t0_c0`,
`t`.`column2` as `t0_c1`,
`related1`.`column1` as `t1_c0`,
...
`related9`.`column5` as `t9_c4`
FROM `model` `t`
LEFT OUTER JOIN `other_model` `related1`
ON ( `t`.`other_model_id` = `related1`.`id` )
...
LEFT OUTER JOIN `more_models` `related9`
ON ( `t`.`more_models_id` = `related9`.`id` )
WHERE
...big long WHERE clause using all of related1 - related9 to filter model...
LIMIT 10
Our database has a not-insignificant amount of data, but nothing obscene either. In this case the model table has about 126,000 rows, every "related" model is a BELONGS_TO relationship, and there is an index on t.XXX_id, so the joins are fairly trivial. The problem is the complexity of the WHERE clause, with its multiple COALESCE, IF, and CASE expressions. Running the filter over our 126,000 rows takes 2.6 seconds -- far longer than we would like for an API endpoint.
The WHERE clause is divided into multiple different sections like so:
WHERE
( ... part 1 ... )
AND
( ... part 2 ... )
AND
( ... part 3 ... )
Each part corresponds to one of the scopes, and each part uses one or more related models.
One of the scopes filters on only a single related model, and in doing so narrows our table from 126,000 rows down to about 2,000. I found experimentally (in MySQL Workbench) that I could get our query from 2.6 seconds to 0.2 seconds by simply doing this:
SELECT `t`.`column1` as `t0_c0`,
`t`.`column2` as `t0_c1`,
`related1`.`column1` as `t1_c0`,
...
`related9`.`column5` as `t9_c4`
FROM
(
SELECT `model`.*
FROM `model`
LEFT OUTER JOIN `other_model`
ON ( `model`.`other_model_id` = `other_model`.`id` )
WHERE
( ... part 1 ... )
) `t`
LEFT OUTER JOIN `other_model` `related1`
ON ( `t`.`other_model_id` = `related1`.`id` )
...
LEFT OUTER JOIN `more_models` `related9`
ON ( `t`.`more_models_id` = `related9`.`id` )
WHERE
( ... part 2 ... )
AND
( ... part 3 ... )
LIMIT 10
This way, instead of running the very complex WHERE clause against all 126,000 rows of the original model table, we run the much simpler (and index-assisted) WHERE clause over those 126,000 rows and then apply the complex WHERE clause to only the ~2,000 relevant rows. The results of the two queries are identical, but using a subquery in the FROM clause makes the query run 13x faster.
The problem is, I have no idea how to do this in Yii. I know that I can use CDbCommand to build a query and even pass in raw SQL, but what I'll get back is an array of "rows" -- they won't be understood by Yii and properly converted to the right models.
Does Yii's ActiveRecord system have a way to say something like the following?
$criteria = new CDbCriteria;
$criteria->scopes = array("part1");
$subQuery = Model::model()->buildQuery($criteria);
$criteria = new CDbCriteria;
$criteria->scopes = array("part2", "part3");
$fullQuery = $subQuery->findAll($criteria);
Although not a perfect solution, I did find something that's almost as good. Break the original query into two:
1. Get the IDs of the models you wish to select in the FROM subquery.
2. Append a WHERE to the outer query with id IN (...).
If anyone is interested I'll hunt down the code I wrote for this to post in the answer as an example (a rough sketch of the idea is below), but so far this question has gotten very little attention, and once I found a pseudo-decent solution I moved on.
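A hypothetical sketch of that two-step approach (not my actual code; the scope and column names are placeholders, assuming Yii 1.x CDbCriteria):

// Step 1: fetch only the IDs that survive the cheap, selective scope
$criteria = new CDbCriteria;
$criteria->select = 'id';
$criteria->scopes = array('part1');
$ids = array();
foreach (Model::model()->findAll($criteria) as $row) {
    $ids[] = $row->id;
}

// Step 2: run the expensive scopes against only those IDs
$criteria = new CDbCriteria;
$criteria->scopes = array('part2', 'part3');
$criteria->addInCondition('t.id', $ids);
$models = Model::model()->findAll($criteria);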
I'm attempting to make a LINQ Where/Contains query quicker.
The data set contains 256,999 clients. Ids is just a simple list of GUIDs, and in this case it contains only 3 records.
The query below can take up to a minute to return the 3 records, because the logic goes through all 256,999 records to check each one against the list of 3.
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Where(x => ids.Contains(x.ClientId)).ToList();
I would like the query to instead check whether the three records are within the pool of 256,999, which in a way should be much quicker.
I don't want to do a loop, as the 3 records could be far more (thousands); the more loops, the more hits to the DB.
I don't want to grab all the DB records (256,999) and then do the query, as it would take nearly the same amount of time.
If I grab just the Ids for all 256,999 records from the DB, it takes about a second. This is where the Ids come from (a filtered, small, simple list).
Any Ideas?
Thanks
You've said "I don't want to grab all the db records (256,999) and then do the query as it would take nearly the same amount of time," but also "If I grab just the Ids for all the 256,999 from the DB it would take a second." So does this really take "just as long"?
returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(sql).Select(x => x.ClientId).ToList().Where(x => ids.Contains(x)).ToList();
Unfortunately, even if this is fast, it's not an answer, as you'll still need effectively the original query to actually extract the full records for the Ids matched :-(
So, adding an index is likely your best option.
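As a side note, and assuming the data is also reachable through the mapped context (which the raw-SQL setup here may not allow), LINQ to SQL translates Contains over a small local list into a SQL IN clause, so the filter runs in the database instead of in memory. A sketch with a hypothetical mapped Client class:

// Contains becomes "WHERE Id IN (...)" in the generated SQL,
// so only the matching rows are ever pulled from the database
var returnItems = context.GetTable<DataClass.Client>()
    .Where(c => ids.Contains(c.Id))
    .ToList();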
The reason the Id query is quicker is that only one field is returned and it's a single-table query.
The main query contains subqueries (below), so I get the Ids from a quick and easy query, then use the Ids to get the more detailed information.
SELECT Clients.Id as ClientId, Clients.ClientRef as ClientRef, Clients.Title + ' ' + Clients.Forename + ' ' + Clients.Surname as FullName,
[Address1], [Address2], [Address3], [Town], [County], [Postcode],
Clients.Consent AS Consent,
CONVERT(nvarchar(10), Clients.Dob, 103) as FormatedDOB,
CASE WHEN Clients.IsMale = 1 THEN 'Male' WHEN Clients.IsMale = 0 THEN 'Female' END As Gender,
CONVERT(nvarchar(10), Max(Assessments.TestDate), 103) as LastVisit,
CASE WHEN Max(Convert(integer, Assessments.Submitted)) = 1 THEN 'true' ELSE 'false' END AS Submitted,
CASE WHEN Max(Convert(integer, Assessments.GPSubmit)) = 1 THEN 'true' ELSE 'false' END AS GPSubmit,
CASE WHEN Max(Convert(integer, Assessments.QualForPay)) = 1 THEN 'true' ELSE 'false' END AS QualForPay,
Clients.UserIds AS LinkedUsers
FROM Clients
LEFT JOIN Assessments ON Clients.Id = Assessments.ClientId
LEFT JOIN Layouts ON Layouts.Id = Assessments.LayoutId
GROUP BY Clients.Id, Clients.ClientRef, Clients.Title, Clients.Forename, Clients.Surname, [Address1], [Address2], [Address3], [Town], [County], [Postcode], Clients.Consent, Clients.Dob, Clients.IsMale, Clients.UserIds
ORDER BY ClientRef
I was hoping there was an easier way to do the Contains part, as the pool of Ids is smaller than the main pool.
The way I've sped it up for now: I've done a String.Join on the list of Ids and added them to a WHERE clause within the main SQL. This has reduced the time to a second or so.
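For reference, that workaround looks roughly like this sketch (hypothetical variable names; the filter has to be spliced in before the GROUP BY, and inlining values is only tolerable here because the Ids are GUIDs produced by our own earlier query):

// Build an IN (...) filter from the already-filtered GUIDs
string idFilter = "WHERE Clients.Id IN ('" + string.Join("','", ids) + "') ";

// Splice the filter into the main SQL just before the GROUP BY
string filteredSql = sql.Replace("GROUP BY", idFilter + "GROUP BY");

returnItems = context.ExecuteQuery<DataClass.SelectClientsGridView>(filteredSql).ToList();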
Basically I have a QueryExpression that returns over 3,000 results, but I only need to use between 50 and 200 of them. If I were using normal SQL, I could use SELECT TOP 200...
Is there a way to do this in CRM using the QueryExpression or FetchXML?
In a QueryExpression:
QueryExpression query = new QueryExpression();
query.PageInfo = new PagingInfo();
query.PageInfo.Count = 200; // or 50, or whatever
query.PageInfo.PageNumber = 1;
In Fetch XML:
<fetch mapping='logical' page='1' count='200'>
...
@Matt basically said everything right.
This article expands on his answer.
What you essentially want to do is use the PageInfo property of QueryExpression.
That way you can limit the results or, even better, fetch more than 5,000 rows (the default limit). PageInfo acts as a paging indicator: how many rows a page has, which page you are on, and, most importantly, the PagingCookie, which is used for recursively reading more data (beyond the 5k limit).
https://msdn.microsoft.com/en-us/library/mt269606.aspx
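For completeness, a hedged sketch of that recursive read (standard CRM SDK types from Microsoft.Xrm.Sdk; the entity name, attribute, and service variable are placeholders):

using System.Collections.Generic;
using Microsoft.Xrm.Sdk;
using Microsoft.Xrm.Sdk.Query;

// Page through the results 200 at a time, carrying the PagingCookie forward
var query = new QueryExpression("account");   // placeholder entity name
query.ColumnSet = new ColumnSet("name");      // placeholder attribute
query.PageInfo = new PagingInfo { Count = 200, PageNumber = 1 };

var results = new List<Entity>();
while (true)
{
    EntityCollection page = service.RetrieveMultiple(query); // service: IOrganizationService
    results.AddRange(page.Entities);

    if (!page.MoreRecords)
        break;

    // Carry the cookie forward so the server resumes where it left off
    query.PageInfo.PageNumber++;
    query.PageInfo.PagingCookie = page.PagingCookie;
}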