I'm using Eloquent models to perform a fairly complex database migration, and the process runs out of memory. Can someone explain why this happens? Thank you!
Laravel version: "v8.52.0"
Test code:
public function handle()
{
    for ($i = 0; $i < 100; $i++) {
        Customer::chunkById(1000, function ($customers) use ($i) {
            $this->print_progress();

            foreach ($customers as $customer) {
                $customer->first_name = (string) $i;
                $customer->save();
            }
        });
    }
}
Output (memory usage):
usage: 27MB - peak: 27MB
usage: 33MB - peak: 33MB
usage: 39MB - peak: 39MB
...
...
usage: 491MB - peak: 491MB
usage: 496MB - peak: 496MB
PHP Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 20480 bytes) in /home/vagrant/code/billing/vendor/laravel/framework/src/Illuminate/Support/Str.php on line 855
PHP Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 32768 bytes) in /home/vagrant/code/billing/vendor/symfony/error-handler/Error/FatalError.php on line 1
Update: the memory leak was caused by Telescope. With Telescope turned off, no memory leak occurs.
You have a memory leak somewhere.
I'll assume the problem is not within the print_progress function, but please double-check it (or edit your question to include its content).
It's hard to give an accurate answer since many things can cause memory leaks, but try saveQuietly instead of save. Model events will not be dispatched, and dispatching them may be the cause of your problem.
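For reference, a minimal sketch of that change applied to the command from the question (everything else left as-is):

public function handle()
{
    for ($i = 0; $i < 100; $i++) {
        Customer::chunkById(1000, function ($customers) use ($i) {
            $this->print_progress();

            foreach ($customers as $customer) {
                $customer->first_name = (string) $i;
                // saveQuietly() persists the model without dispatching model events,
                // so listeners/observers cannot accumulate state across iterations.
                $customer->saveQuietly();
            }
        });
    }
}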
Also, check whether you are using Laravel Telescope, and if you are, disable it during these tests.
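If Telescope is installed, one way to rule it out (a rough sketch, assuming the package is registered as usual) is to disable it for that environment via TELESCOPE_ENABLED=false in .env, or to pause recording inside the command itself:

use Laravel\Telescope\Telescope;

public function handle()
{
    // Pause Telescope's entry recording for the duration of this command.
    Telescope::stopRecording();

    // ... run the migration loop from the question here ...

    Telescope::startRecording();
}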
First of all, your code looks quite strange; I understand if you are just testing things.
So, your logic is this:
Repeat the next step 100 times.
Each iteration queries the database for Customers in chunks of 1000 (so every iteration still walks over all of your Customers).
For each chunk, you set the first name of every Customer in it to the current $i index (as a string). This happens for all of your Customers, just in chunks.
So:
You are not filtering your query in any way, so chunkById may end up less performant than a plain chunk here; try that first.
If you are still running out of memory, reduce the 1000 to 500 or, personal recommendation, 200 or 100; never use more than that per chunk. Hydrating 1000 models into a collection is not very efficient.
You can have a code a little bit more readable or Laravel friendly by using Higher Order Messages:
public function handle()
{
    for ($i = 0; $i < 100; $i++) {
        Customer::chunk(100, function ($customers) use ($i) {
            $this->print_progress();
            $customers->each->update(['first_name' => (string) $i]);
        });
    }
}
But if you want maximum performance, you can skip chunking entirely and update the table directly; that will be nearly instant compared to letting PHP do the work:
public function handle()
{
    for ($i = 0; $i < 100; $i++) {
        // A single UPDATE query per iteration instead of loading any models
        Customer::query()->update(['first_name' => (string) $i]);
        $this->print_progress();
    }
}
But I am not sure whether you are just testing performance, so this last snippet may be of no use to you.
We are developing an API with Lumen.
Today we ran into a confusing problem while getting the collection of our "TimeLog" model.
We just wanted to get all time logs with additional information from the board model and the task model.
Each time log row has a board_id and a task_id; both are 1:1 relations.
This was our first attempt at getting the whole data set. It took a long time and sometimes timed out:
BillingController.php
public function byYear() {
    $timeLog = TimeLog::get();
    $resp = array();

    foreach ($timeLog->toArray() as $key => $value) {
        if (($timeLog[$key]->board_id && $timeLog[$key]->task_id) > 0) {
            array_push($resp, array(
                'board_title' => isset($timeLog[$key]->board->title) ? $timeLog[$key]->board->title : null,
                'task_title'  => isset($timeLog[$key]->task->title) ? $timeLog[$key]->task->title : null,
                'id'          => $timeLog[$key]->id
            ));
        }
    }

    return response()->json($resp);
}
The TimeLog.php where the relation has been made.
public function board()
{
    return $this->belongsTo('App\Board', 'board_id', 'id');
}

public function task()
{
    return $this->belongsTo('App\Task', 'task_id', 'id');
}
Our new way is like this:
BillingController.php
public function byYear() {
    $timeLog = TimeLog::join('oc_boards', 'oc_boards.id', '=', 'oc_time_logs.board_id')
        ->join('oc_tasks', 'oc_tasks.id', '=', 'oc_time_logs.task_id')
        ->join('oc_users', 'oc_users.id', '=', 'oc_time_logs.user_id')
        ->select('oc_boards.title AS board_title', 'oc_tasks.title AS task_title', 'oc_time_logs.id', 'oc_time_logs.time_used_sec', 'oc_users.id AS user_id')
        ->getQuery()
        ->get();

    return response()->json($timeLog);
}
We deleted the relations in TimeLog.php because we don't need them anymore. Now the load time is about 1 second, which is fine!
There are about 20k entries in the time log table.
My questions are:
Why is the first method so slow (what causes the timeout)?
What exactly does getQuery() do?
If you need more information just ask me.
--First Question--
One of the issues you are facing is holding that huge amount of data in memory all at once, i.e.:
$timeLog = TimeLog::get();
This is already enormous. Then, when you convert the collection to an array:
There is a loop through the collection.
Calling $timeLog->toArray() to initialize the loop is, as far as I understand, not efficient (I might not be entirely correct about this, though).
Thousands of queries are made to retrieve the related models (the classic N+1 problem).
So I would propose five approaches (one of which saves you from hundreds of queries), the last of which is efficient at returning a customized result:
Since you have a lot of data, chunk the result (ref: Laravel chunk), so you have this instead:
TimeLog::chunk(1000, function ($logs) {
    foreach ($logs as $log) {
        // Do the stuff here
    }
});
Another way is to use cursor (it runs only one query matching your conditions); internally, as I understand it, cursor uses Generators.
foreach (TimeLog::where([['board_id', '>', 0], ['task_id', '>', 0]])->cursor() as $timelog) {
    // do the other stuff here
}
This looks like the first approach, but you have already narrowed the query down to what you need:
TimeLog::where([['board_id', '>', 0], ['task_id', '>', 0]])->get();
Eager loading would give you the relationships you need on the fly, but it may also put more data in memory. Possibly the chunk method makes things easier to manage even when you eager-load related models:
TimeLog::with(['board', 'task'])
    ->where([['board_id', '>', 0], ['task_id', '>', 0]])
    ->get();
You can simply use a Transformer.
With a transformer, you can load related models in an elegant, clean and more controlled way even if the data set is huge, and a further benefit is that you can transform the result without having to worry about how to loop over it.
You can refer to this answer for a simple example of its use (a rough sketch is also shown below). However, in case you don't need to transform your response, you can take the other options.
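For illustration only, a rough transformer sketch assuming the league/fractal package that the linked answer is based on (the class and field names are just examples):

use League\Fractal\Manager;
use League\Fractal\Resource\Collection;
use League\Fractal\TransformerAbstract;

class TimeLogTransformer extends TransformerAbstract
{
    public function transform(TimeLog $timeLog)
    {
        // Shape each row without worrying about how the result set is looped over
        return [
            'id'          => (int) $timeLog->id,
            'board_title' => $timeLog->board ? $timeLog->board->title : null,
            'task_title'  => $timeLog->task ? $timeLog->task->title : null,
        ];
    }
}

// Usage: wrap the (filtered, eager-loaded) results and let Fractal build the output
$logs = TimeLog::with(['board', 'task'])
    ->where([['board_id', '>', 0], ['task_id', '>', 0]])
    ->get();

$manager  = new Manager();
$resource = new Collection($logs, new TimeLogTransformer());

return response()->json($manager->createData($resource)->toArray());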
This might not entirely solve the problem, but since the main issue you face is memory management, the methods above should be useful.
--Second question--
Based on the Laravel API docs you can see that:
getQuery() simply returns the underlying query builder instance. From what I can see, it is not needed in your example.
UPDATE
For question 1, since it seems you simply want to return the result as a response, it is honestly more efficient to paginate it. Laravel offers pagination, the simplest of which is simplePaginate(). The only caveat is that it issues a few extra queries against the database while keeping track of the last index; I guess it uses a cursor as well, but I'm not sure. In the end, something like this might be more ideal:
return TimeLog::paginate(1000);
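If you prefer the lighter variant mentioned above, the call would simply be:

return TimeLog::simplePaginate(1000);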
I have faced a similar problem. The main issue here is that Eloquent is really slow at massive tasks because it fetches all the results at the same time, so the short answer is to fetch row by row using PDO fetch.
Short example:
$db = DB::connection()->getPdo();

$query_sql = TimeLog::join('oc_boards', 'oc_boards.id', '=', 'oc_time_logs.board_id')
    ->join('oc_tasks', 'oc_tasks.id', '=', 'oc_time_logs.task_id')
    ->join('oc_users', 'oc_users.id', '=', 'oc_time_logs.user_id')
    ->select('oc_boards.title AS board_title', 'oc_tasks.title AS task_title', 'oc_time_logs.id', 'oc_time_logs.time_used_sec', 'oc_users.id AS user_id')
    ->toSql();

$query = $db->prepare($query_sql);
$query->execute();

$logs = array();
while ($log = $query->fetch()) {
    $log_filled = new TimeLog();
    // fill your model and push it into an array to convert it to JSON later
    array_push($logs, $log_filled);
}

return response()->json($logs);
I have a very large result set to process and so I'm using the chunk() method to reduce the memory footprint of the job. However, I only want to process a certain number of total results to prevent the job from running too long.
Currently I'm doing this, but it does not seem like an elegant solution:
$count = 0;
$max = 1000000;
$lists = Lists::whereReady(true);

$lists->chunk(1000, function (Collection $lists) use (&$count, $max) {
    if ($count >= $max)
        return;

    foreach ($lists as $list) {
        if ($count >= $max)
            break;

        $count++;
        // ...do stuff
    }
});
Is there a cleaner way to do this?
As of right now, I don't believe so.
There have been some issues and pull requests submitted to have chunk respect previously set skip/limits, but Taylor has closed them, saying it is expected behavior that chunk overwrites these.
There is currently an open issue in the laravel/internals repo where he said he'd take another look, but I don't think it is high on the priority list. I doubt it is something he would work on himself, but he may be more receptive to another pull request now.
Your solution looks fine, except for one thing. chunk() will end up reading your entire table, unless you return false from your closure. Currently, you are just returning null, so even though your "max" is set to 1000000, it will still read the entire table. If you return false from your closure when $count >= $max, chunk() will stop querying the database. It will cause chunk() to return false itself, but your example code doesn't care about the return of chunk() anyway, so that's okay.
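A minimal sketch of that change, keeping the rest of the original closure:

use Illuminate\Support\Collection; // chunk() passes an Eloquent collection, which extends this

$count = 0;
$max = 1000000;

Lists::whereReady(true)->chunk(1000, function (Collection $lists) use (&$count, $max) {
    foreach ($lists as $list) {
        if ($count >= $max) {
            // Returning false stops chunk() from issuing any further queries.
            return false;
        }

        $count++;
        // ...do stuff
    }
});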
Another option, assuming you're using sequential ids, would be to get the ending id and then add a where clause to your chunked query to get all the records with an id less than your max id. So, something like:
$max = 1000000;
$maxId = Lists::whereReady(true)->skip($max)->take(1)->value('id');

$lists = Lists::whereReady(true)->where('id', '<', $maxId);

$lists->chunk(1000, function (Collection $lists) {
    foreach ($lists as $list) {
        // ...do stuff
    }
});
Code is slightly cleaner, but it is still a hack, and requires one extra query (to get the max id).
I have a file with over 30,000 records and another with 41,000. Is there a best practice for seeding these using Laravel 4's db:seed command? A way to make the inserts faster?
Thanks for the help.
Don't be afraid, a 40K-row table is a fairly small one. I have a 1 million-row table and seeding went smoothly; I just had to add this before doing it:
DB::disableQueryLog();
Before disabling it, Laravel exhausted my entire PHP memory limit, no matter how much I gave it.
I read data from .txt files using fgets(), building each row array programmatically and executing:
DB::table($table)->insert($row);
One by one, which may be particularly slow.
My database server is PostgreSQL and the inserts took around 1.5 hours to complete, maybe because I was running in a VM with little memory. I will run a benchmark on a better machine one of these days.
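In hindsight, batching the rows before calling insert() would probably have sped this up considerably. A rough sketch under those assumptions; parse_line(), $pathToTxtFile and $table are hypothetical placeholders for whatever your seeder already uses:

DB::disableQueryLog();

$rows = array();
$handle = fopen($pathToTxtFile, 'r'); // $pathToTxtFile: assumed, defined elsewhere

while (($line = fgets($handle)) !== false) {
    $rows[] = parse_line($line); // hypothetical parser returning a column => value array

    if (count($rows) === 500) {
        DB::table($table)->insert($rows); // one multi-row INSERT instead of 500 single ones
        $rows = array();
    }
}

if (!empty($rows)) {
    DB::table($table)->insert($rows); // insert the remainder
}

fclose($handle);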
2018 Update
I ran into the same issue, and after two days of headache I was finally able to write a script that seeds 42K entries in less than 30 seconds!
You ask how?
1st Method
This method assumes that you already have a database with some entries in it (in my case, 42k entries) and you want to import them into another database. Export the table as a CSV file with header names, put the file into the public folder of your project, and then parse the file and insert all the entries into the new database one by one via a seeder.
So your seeder will look something like this:
<?php

use Illuminate\Database\Seeder;

class {TableName}TableSeeder extends Seeder
{
    /**
     * Run the database seeds.
     *
     * @return void
     */
    public function run()
    {
        $row = 1;

        if (($handle = fopen(base_path("public/name_of_your_csv_import.csv"), "r")) !== false) {
            while (($data = fgetcsv($handle, 0, ",")) !== false) {
                // skip the header row
                if ($row === 1) {
                    $row++;
                    continue;
                }
                $row++;

                $dbData = [
                    'col1' => '"'.$data[0].'"',
                    'col2' => '"'.$data[1].'"',
                    'col3' => '"'.$data[2].'"',
                    // ...and so on, for however many columns you have
                ];

                $colNames = array_keys($dbData);

                $createQuery = 'INSERT INTO locations ('.implode(',', $colNames).') VALUES ('.implode(',', $dbData).')';
                DB::statement($createQuery);

                $this->command->info($row);
            }

            fclose($handle);
        }
    }
}
Simple and Easy :)
2nd method
If you can modify your PHP settings and allocate a larger memory limit to a particular script, this method will work as well.
Basically, you need to focus on a few major steps:
Allocate more memory to the script
Turn off the query logger
Divide your data into chunks of 1000
Iterate through the data and use insert() to insert 1K rows at a time.
So if I combine all of the above-mentioned steps in a seeder, it will look something like this:
<?php

use Illuminate\Database\Seeder;

class {TableName}TableSeeder extends Seeder
{
    /**
     * Run the database seeds.
     *
     * @return void
     */
    public function run()
    {
        ini_set('memory_limit', '512M'); // allocate more memory
        DB::disableQueryLog();           // disable the query log

        // create chunks: an array of chunks, each chunk holding up to 1000 rows
        $data = [
            [
                ['col1' => 1, 'col2' => 1, 'col3' => 1, 'col4' => 1, 'col5' => 1],
                ['col1' => 1, 'col2' => 1, 'col3' => 1, 'col4' => 1, 'col5' => 1],
                // ...and so on, until 1000 entries
            ],
            [
                ['col1' => 1, 'col2' => 1, 'col3' => 1, 'col4' => 1, 'col5' => 1],
                ['col1' => 1, 'col2' => 1, 'col3' => 1, 'col4' => 1, 'col5' => 1],
                // ...and so on, until 1000 entries
            ],
            // ...and so on, for however many entries you have (I had 42000)
        ];

        // iterate and insert one chunk (up to 1000 rows) per query
        foreach ($data as $key => $d) {
            DB::table('locations')->insert($d);
            $this->command->info($key); // shows where the iterator is in the command line; best feeling in the world to see it rising, if you ask me :D
        }
    }
}
and VOILA you are good to go :)
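As a side note, the hand-built nested array can be avoided with array_chunk(); a minimal sketch, assuming the rows have already been collected into a flat $rows array of column => value maps:

ini_set('memory_limit', '512M');
DB::disableQueryLog();

// array_chunk() splits the flat $rows array into batches of 1000 rows each
foreach (array_chunk($rows, 1000) as $key => $chunk) {
    DB::table('locations')->insert($chunk);
    $this->command->info($key);
}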
I hope it helps
I was migrating from a different database and had to use raw SQL (loaded from an external file) with bulk insert statements (I exported the data via Navicat, which has an option to break up insert statements every 250KiB). E.g.:
$sqlStatements = array(
    "INSERT INTO `users` (`name`, `email`)
     VALUES
     ('John Doe','john.doe@gmail.com'),.....
     ('Jane Doe','jane.doe@gmail.com')",
    "INSERT INTO `users` (`name`, `email`)
     VALUES
     ('John Doe2','john.doe2@gmail.com'),.....
     ('Jane Doe2','jane.doe2@gmail.com')"
);
I then looped through the insert statements and executed using
DB::statement($sql).
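The loop itself is trivial; roughly:

foreach ($sqlStatements as $sql) {
    // Each statement already contains a multi-row INSERT, so one call inserts a whole batch.
    DB::statement($sql);
}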
I couldn't get inserts to work one row at a time. I'm sure there are better alternatives, but this at least worked while letting me keep everything within Laravel's migrations/seeding.
I had the same problem today. Disabling the query log wasn't enough; it looks like an event also gets fired for every query.
DB::disableQueryLog();
// DO INSERTS
// Reset events to free up memory.
DB::setEventDispatcher(new Illuminate\Events\Dispatcher());
It may be a simple question, but I can't find the answer: how can I tell that my Collection has no data?
I do $datas = Mage::getModel('zzz/zzz')->getCollection(). If I call $datas->getData() it returns an empty array, but how do I know whether my collection has no data without doing a foreach or getData()?
You should avoid using count() on your Collections. Here's why:
Mage_Core_Model_Resource_Db_Collection_Abstract (the collection model that almost all Magento collections inherit from) does not define count(), so calling count() on your collection will most likely end up in Varien_Data_Collection::count(), which is a very bad option, since it does a collection load() and then counts the loaded items:
/**
 * Retireve count of collection loaded items
 *
 * @return int
 */
public function count()
{
    $this->load();
    return count($this->_items);
}
With a large collection (especially an EAV collection), this results in loading ALL of your collection data, which can take a lot of time.
Instead, you should use the Varien_Data_Collection_Db::getSize() method, which runs a SQL query to fetch only the count; that is much more optimized than retrieving all the data needed for a collection load:
/**
 * Get collection size
 *
 * @return int
 */
public function getSize()
{
    if (is_null($this->_totalRecords)) {
        $sql = $this->getSelectCountSql();
        $this->_totalRecords = $this->getConnection()->fetchOne($sql, $this->_bindParams);
    }

    return intval($this->_totalRecords);
}
In addition to that, once loaded, a collection cannot be modified in any way. For example, you won't be able to apply additional filters or change the sort order at any point after using count().
So the correct answer should be:
$collection = Mage::getModel('zzz/zzz')->getCollection();
var_dump($collection->getSize());
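For example (a rough sketch using a Magento 1 product collection; the specific filter and sort are only illustrative):

$collection = Mage::getModel('catalog/product')->getCollection()
    ->addAttributeToFilter('status', 1);

// getSize() runs a SELECT COUNT(*) without loading the items,
// so the collection can still be filtered or sorted afterwards.
if ($collection->getSize() > 0) {
    $collection->addAttributeToSort('created_at', 'DESC');
    // ...load or iterate only when the data is actually needed
}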
You can easily just do an if statement like so:
if (!$datas->getData() || empty($datas->getData())) {
    // do things
}
In addition to the accepted answer, see these benchmarks:
Tested for 750 products
$collection->getData()
Total Incl. Wall Time (microsec): 67,567 microsecs
Total Incl. CPU (microsecs): 67,599 microsecs
Total Incl. MemUse (bytes): 11,719,168 bytes
Total Incl. PeakMemUse (bytes): 11,648,152 bytes
Number of Function Calls: 1,047
$collection->getSize()
Total Incl. Wall Time (microsec): 6,371 microsecs
Total Incl. CPU (microsecs): 4,402 microsecs
Total Incl. MemUse (bytes): 140,816 bytes
Total Incl. PeakMemUse (bytes): 96,000 bytes
Number of Function Calls: 191
$collection->count() or sizeof($collection)
Total Incl. Wall Time (microsec): 2,130,568 microsecs
Total Incl. CPU (microsecs): 2,080,617 microsecs
Total Incl. MemUse (bytes): 12,899,872 bytes
Total Incl. PeakMemUse (bytes): 13,002,256 bytes
Number of Function Calls: 101,073
So you should go with getSize().
From: https://magento.stackexchange.com/questions/179028/how-to-check-if-a-collection-has-items/179035#179035
/**
 * Retrieve collection all items count
 *
 * @return int
 */
$collection = Mage::getModel('aaa/bbb')->getCollection()->getSize();
This is the code that's used in pagination etc and is recommended.
where
/**
 * Retireve count of collection loaded items
 *
 * @return int
 */
public function count()
will be useful to check for loaded items data.
You can use:
$collection = Mage::getModel('zzz/zzz')->getCollection();
var_dump($collection->count());
Running a simple, standard PHP count() on the collection is fine here. As long as you have properly filtered your collection (which you should always have done before getting to the point of counting it), calling the ->count() method on the collection is fine as well. As soon as you manipulate the collection in any way, it will load regardless of the method you use: a standard PHP count(), calling the ->count() method on the object, or running through the collection with a foreach() will all load the collection in the same way as load(). In fact, if you trace the load() method back, you will see it actually runs a standard PHP foreach() to load the collection data.
So however you do it, you still can't count your collection until you know how many results came back from the database. The method above is fine, but it does mean extra DB calls: first to count, then to load. A better approach is simply to make your SELECT statements as specific as possible by narrowing them with WHERE clauses and so on. If you pull the select object from a collection, you have access to all of the Zend_Db_Select methods, e.g.:
$collection->getSelect()->where('......... = ?', $var);
Suppose the product collection is $pro_collection.
Now apply the following code:
<?php
if (isset($pro_collection) && count($pro_collection) > 0) {
    /* Your code here */
}
?>
Though this has nothing to do with PHP specifically, I use PHP in the following examples.
Let's say this is the 'normal' way of limiting results.
$db->users->find()->limit(10);
This is probably the fastest way, but there are some restrictions here... In the following example, I'll filter out all rows that have the same value for a certain column as the previous row:
$cursor = $db->users->find();

$prev = null;
$results = array();
foreach ($cursor as $row) {
    if ($row['coll'] != $prev['coll']) {
        $results[] = $row;
        $prev = $row;
    }
}
But you still want to limit the results to 10, of course. So you could use the following:
$cursor = $db->users->find();

$prev = null;
$results = array();
foreach ($cursor as $row) {
    if ($row['coll'] != $prev['coll']) {
        $results[] = $row;
        if (count($results) == 10) break;
        $prev = $row;
    }
}
Explanation: since the $cursor does not actually load the results from the database, breaking the foreach-loop will limit it just as the limit(...)-function does.
Just to be sure: does this really work the way I describe it, or are there performance issues I'm not aware of?
Thank you very much,
Tim
Explanation: since the $cursor does not actually load the results from the database, breaking the foreach-loop will limit it just as the limit(...)-function does.
This is not 100% true.
When you do the foreach, you're basically issuing a series of hasNext / getNext that is looping through the data.
However, underneath this layer, the driver is actually requesting and receiving batches of results. When you do a getNext the driver will seamlessly fetch the next batch for you.
You can control the batch size. The details in the documentation should help clarify what's happening.
In your second example, if you get to 10 and then break, there are two side effects:
The cursor remains open on the server (times out in 10 minutes, generally not a big impact).
You may have more data cached in $cursor. This cache will go away when $cursor goes out of scope.
In most cases, these side effects are "not a big deal". But if you're doing lots of this processing in a single process, you'll want to "clean up" to avoid having cursors hanging around.
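For completeness, a rough sketch with the legacy PHP Mongo driver used in the question; batchSize() and the explicit unset() are the only additions to the original loop:

$cursor = $db->users->find();
$cursor->batchSize(100); // how many documents the driver fetches per round trip

$prev = null;
$results = array();
foreach ($cursor as $row) {
    if ($prev === null || $row['coll'] != $prev['coll']) {
        $results[] = $row;
        if (count($results) == 10) break;
        $prev = $row;
    }
}

// Dropping the cursor frees the client-side batch cache and lets the driver
// close the server-side cursor instead of waiting for it to time out.
unset($cursor);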