Laravel multiple tasks simultaneously

I need to process several image files from a directory (an S3 directory). The process reads the id and type that are encoded in the filename (001_4856_0-P-0-A_.jpg) and stores that information in a database. The file is stored at the moment the process is invoked (I'm using cron and the scheduler, which works great).
The process itself works, but my problem is the number of files in the directory, because many more files are added every second. Processing takes about 0.19 sec per file, but about 15,000 files are added per minute, so I think running the same process about 10 - 40 times simultaneously could do the job.
I need some advice or ideas:
First, how to launch multiple instances of the original process at the same time.
Second, how to get only the filenames that have not already been selected, because the process takes the filenames with:
$recibidos = Storage::disk('s3recibidos');
// List the files first; counting the disk object itself does not count files.
$files = $recibidos->files();
if (count($files) <= 0) {
    // Log message: "No files to process"
    $lognofile = ['Archivos' => 'No hay archivos para procesar'];
    $orderLog->info('ImagesLog', $lognofile);
} else {
    if (Image::count() == 0) {
        $last_record = 1;
    } else {
        $last_record = Image::latest('id')->pluck('id')->first() + 1;
    }
    $i = $last_record;
    $fotos_sin_info = 0;
    foreach ($files as $file) {
        // Filename layout: <client id>_<number>_<type with dashes>_.jpg
        $datos = explode('_', $file);
        $tipos = str_replace('-', '', $datos[2]);
        Image::create([
            'client_id' => $datos[0],
            'tipo' => $tipos,
        ]);
        $recibidos->move($file, '/procesar/'.$i.'.jpg');
        $i++;
    }
}
But I haven't figured out how to retrieve only the files that have not been selected yet.
Thanks for your comments.

Multi-threaded programming in PHP is possible and has been discussed on SO: How can one use multi threading in PHP applications.
However, it is generally not the most obvious choice for standard applications. The right solution for your situation will depend on the exact use case.
Did you consider a solution using queues?
https://laravel.com/docs/5.6/queues
Or the scheduler?
https://laravel.com/docs/5.6/scheduling
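One way to read the queue suggestion: a single dispatcher lists the directory once and pushes each filename onto a queue, and several workers pull from it, so no two workers ever receive the same file. The following is a minimal in-process Python sketch of that pattern, not Laravel's queue API; `parse_filename`, `run_workers`, the worker count and the sample filenames are all hypothetical.

```python
import queue
import threading

def parse_filename(name):
    """Split '001_4856_0-P-0-A_.jpg' into client id and type, like the question's loop."""
    parts = name.split("_")
    return {"client_id": parts[0], "tipo": parts[2].replace("-", "")}

def run_workers(filenames, worker_count=4):
    """Process every filename exactly once using a shared queue and several workers."""
    jobs = queue.Queue()
    for name in filenames:
        jobs.put(name)  # the dispatcher enqueues each file a single time

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name = jobs.get_nowait()  # each item is handed to one worker only
            except queue.Empty:
                return
            record = parse_filename(name)
            with lock:
                results.append(record)  # stand-in for the database insert

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    files = ["001_4856_0-P-0-A_.jpg", "002_4857_0-P-0-B_.jpg", "003_4858_1-P-0-A_.jpg"]
    for row in sorted(run_workers(files), key=lambda r: r["client_id"]):
        print(row)
```

Because only the dispatcher lists the directory, the workers never need to work out which files are "not selected"; the queue does that bookkeeping.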


Laravel tagging overhead leaving behind significantly large reference sets using redis

I am using Laravel 9 with the Redis cache driver. However, I have an issue where the internal standard_ref and forever_ref maps that Laravel uses to manage tagged cache grow to more than 10MB.
These maps consist of numerous keys, 95% of which have already expired/decayed and no longer exist; the maps keep growing and have a TTL of -1 (never expire).
Other than "not using tags", has anyone else encountered and overcome this? I found this in the slow log of Redis Enterprise, which led me to realize this is happening.
I checked the keys via SCAN and can confirm it's a massive set of cache misses. It seems highly inefficient and expensive to constantly transmit 10MB back and forth to find one key within the map.
The following function quickly and efficiently removes expired keys from the SET data type that Laravel uses to manage tagged cache.
use Illuminate\Support\Facades\Cache;

function flushExpiredKeysFromSet(string $referenceKey) : void
{
    /** @var \Illuminate\Cache\RedisStore $store */
    $store = Cache::store()->getStore();
    // TTL returns -2 when a key no longer exists and -1 when it has no expiry.
    $lua = <<<LUA
    local keys = redis.call('SMEMBERS', '%s')
    local expired = {}
    for i, key in ipairs(keys) do
        local ttl = redis.call('ttl', key)
        if ttl == -2 or ttl == -1 then
            table.insert(expired, key)
        end
    end
    if #expired > 0 then
        redis.call('SREM', '%s', unpack(expired))
    end
    LUA;
    // The set name is interpolated into the script, so no KEYS are passed (numkeys = 0).
    $store->connection()->eval(sprintf($lua, $referenceKey, $referenceKey), 0);
}
To show the calls that this Lua script generates, from the sample above:
10:32:19.392 [0 lua] "SMEMBERS" "63c0176959499233797039:standard_ref{0}"
10:32:19.392 [0 lua] "ttl" "i-dont-expire-for-an-hour"
10:32:19.392 [0 lua] "ttl" "aa9465100adaf4d7d0a1d12c8e4a5b255364442d:i-have-expired{1}"
10:32:19.392 [0 lua] "SREM" "63c0176959499233797039:standard_ref{0}" "aa9465100adaf4d7d0a1d12c8e4a5b255364442d:i-have-expired{1}"
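The selection rule in the script can be modelled without Redis. The sketch below is a hypothetical in-memory Python model: `prune_reference_set` and the `ttl_of` callback are illustrative names, and only the TTL convention (-2 for a missing key, -1 for a key without expiry) is taken from Redis itself.

```python
def prune_reference_set(members, ttl_of):
    """Return (kept, removed), dropping members whose TTL is -2 or -1.

    ttl_of mimics the Redis TTL command: -2 means the key does not exist,
    -1 means the key exists but has no expiry.
    """
    removed = [key for key in members if ttl_of(key) in (-2, -1)]
    return members - set(removed), removed

if __name__ == "__main__":
    ttls = {"i-dont-expire-for-an-hour": 3600}  # keys missing here are treated as gone
    kept, removed = prune_reference_set(
        {"i-dont-expire-for-an-hour", "i-have-expired"},
        lambda key: ttls.get(key, -2),
    )
    print(kept, removed)
```

Note that, like the Lua condition, this model also drops members that report -1, so it is worth checking whether that is intended before running it against a set whose members have no expiry.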
I use a custom cache driver that wraps the RedisTaggedCache class. When cache is added to a tag, I dispatch a job that runs the above PHP function; a 24-hour cache lock ensures it is dispatched only once within that period.
Here is how I obtain the reference key that is later passed into the cleanup script.
public function dispatchTidyEvent(mixed $ttl)
{
    $referenceKeyType = $ttl === null ? self::REFERENCE_KEY_FOREVER : self::REFERENCE_KEY_STANDARD;
    $lock = Cache::lock('tidy:'.$referenceKeyType, 60 * 60 * 24);
    // if we were able to get a lock, then dispatch the event
    if ($lock->get()) {
        foreach (explode('|', $this->tags->getNamespace()) as $segment) {
            dispatch(new \App\Events\CacheTidyEvent($this->referenceKey($segment, $referenceKeyType)));
        }
    }
    // otherwise, we'll just let the lock live out its life to prevent repeating this numerous times per day
    return true;
}
Remember that a "cache lock" is simply a SET/GET, and Laravel already performs many of those on every request to manage its tags, so adding a lock to achieve this "once per day" concept adds only negligible overhead.

Spark Streaming: Micro batches Parallel Execution

We are receiving data in Spark Streaming from Kafka. Once execution has started, Spark Streaming executes only one batch at a time and the remaining batches start queuing up in Kafka.
Our data is independent and can be processed in parallel.
We tried multiple configurations with multiple executors, cores, back pressure and other settings, but nothing has worked so far. A lot of messages are queued, only one micro batch is processed at a time, and the rest remain in the queue.
We want to achieve parallelism at maximum, so that not any micro batch is queued, as we have enough resources available. How can we reduce the processing time by making maximum use of the resources?
// Start reading messages from Kafka and get DStream
final JavaInputDStream<ConsumerRecord<String, byte[]>> consumerStream = KafkaUtils.createDirectStream(
        getJavaStreamingContext(), LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, byte[]>Subscribe("TOPIC_NAME",
                sparkServiceConf.getKafkaConsumeParams()));
ThreadContext.put(Constants.CommonLiterals.LOGGER_UID_VAR, CommonUtils.loggerUniqueId());
JavaDStream<byte[]> messagesStream = consumerStream.map(new Function<ConsumerRecord<String, byte[]>, byte[]>() {
    private static final long serialVersionUID = 1L;
    @Override
    public byte[] call(ConsumerRecord<String, byte[]> kafkaRecord) throws Exception {
        return kafkaRecord.value();
    }
});
// Decode each binary message and generate JSON array
JavaDStream<String> decodedStream = messagesStream.map(new Function<byte[], String>() {
    private static final long serialVersionUID = 1L;
    @Override
    public String call(byte[] asn1Data) throws Exception {
        if(asn1Data.length > 0) {
            try (InputStream inputStream = new ByteArrayInputStream(asn1Data);
                    Writer writer = new StringWriter(); ) {
                ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(asn1Data);
                GZIPInputStream gzipInputStream = new GZIPInputStream(byteArrayInputStream);
                byte[] buffer = new byte[1024];
                ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
                int len;
                while((len = gzipInputStream.read(buffer)) != -1) {
                    byteArrayOutputStream.write(buffer, 0, len);
                }
                return new String(byteArrayOutputStream.toByteArray());
            } catch (Exception e) {
                //
                producer.flush();
                throw e;
            }
        }
        return null;
    }
});
// publish generated json gzip to kafka
cache.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    private static final long serialVersionUID = 1L;
    @Override
    public void call(JavaRDD<String> jsonRdd4DF) throws Exception {
        //Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
        if(!jsonRdd4DF.isEmpty()) {
            //JavaRDD<String> jsonRddDF = getJavaSparkContext().parallelize(jsonRdd4DF.collect());
            Dataset<Row> json = sparkSession.read().json(jsonRdd4DF);
            SparkAIRMainJsonProcessor airMainJsonProcessor = new SparkAIRMainJsonProcessor();
            airMainJsonProcessor.processAIRData(json, sparkSession);
        }
    }
});
getJavaStreamingContext().start();
getJavaStreamingContext().awaitTermination();
getJavaStreamingContext().stop();
Technology that we are using:
HDFS 2.7.1.2.5
YARN + MapReduce2 2.7.1.2.5
ZooKeeper 3.4.6.2.5
Ambari Infra 0.1.0
Ambari Metrics 0.1.0
Kafka 0.10.0.2.5
Knox 0.9.0.2.5
Ranger 0.6.0.2.5
Ranger KMS 0.6.0.2.5
SmartSense 1.3.0.0-1
Spark2 2.0.x.2.5
Statistics that we got from different experiments:
Experiment 1
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 48 Minutes
Experiment 2
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 8 Minutes
Experiment 3
spark.default.parallelism=12
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 7 Minutes
Experiment 4
spark.default.parallelism=16
num_executors=6
executor_memory=8g
executor_cores=12
100 Files processing time 10 Minutes
Please advise how we can process as much as possible in parallel so that nothing is queued.
I was facing the same issue, tried a few things to resolve it, and came to the following findings:
First of all, intuition says that one batch should be processed per executor. On the contrary, only one batch is processed at a time, while its jobs and tasks are processed in parallel.
Multiple batches can be processed concurrently by setting spark.streaming.concurrentJobs, but it's not documented and still needs a few fixes. One of the problems is with saving Kafka offsets. Suppose we set this parameter to 4 and 4 batches are processed in parallel: if the batches finish out of order, for example the 4th before the 3rd, which Kafka offsets should be committed? This parameter is quite useful if batches are independent.
spark.default.parallelism, because of its name, is sometimes assumed to make things parallel. Its true benefit is in distributed shuffle operations. Try different values and find the optimum for your jobs; you will see a considerable difference in processing time, depending on the shuffle operations involved. Setting it too high decreases performance, which is apparent from your experiment results too.
Another option is to use foreachPartitionAsync in place of foreach on the RDD. But I think foreachPartition is better, because with foreachPartitionAsync batches appear to be processed while their jobs are still queued or running. Maybe I didn't get its usage right, but it behaved the same way in my 3 services.
Setting spark.scheduler.mode to FAIR is worthwhile for jobs with lots of tasks, because round-robin assignment of tasks between jobs gives smaller tasks the opportunity to start receiving resources while bigger tasks are processing.
Tune your batch duration and input size so that the processing time always stays below the batch duration; otherwise you're going to see a long backlog of batches.
These are my findings and suggestions. However, there are many configurations and methods for streaming, and a set of operations that works for one application often doesn't work for another. Spark Streaming is all about learning and combining your experience and anticipation to arrive at an optimum configuration.
Hope it helps. It would be a great relief if someone could explain specifically how we can legitimately process batches in parallel.
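The point about keeping processing time below the batch duration can be checked with a small simulation. This is a hypothetical Python sketch, not Spark code: `backlog_after` is an illustrative name, batches are assumed to arrive at a fixed interval and to be processed strictly one at a time, and the numbers in the usage example are made up.

```python
def backlog_after(batches, batch_interval_s, processing_time_s):
    """Count batches not yet finished at the moment the last batch arrives."""
    free_at = 0.0  # time at which the single processing slot becomes free
    finish_times = []
    for i in range(batches):
        arrival = i * batch_interval_s
        start = max(arrival, free_at)  # wait for the previous batch to finish
        free_at = start + processing_time_s
        finish_times.append(free_at)
    last_arrival = (batches - 1) * batch_interval_s
    return sum(1 for t in finish_times if t > last_arrival)

if __name__ == "__main__":
    # Same arrival rate, fast versus slow processing.
    print(backlog_after(10, batch_interval_s=10, processing_time_s=5))
    print(backlog_after(10, batch_interval_s=10, processing_time_s=36))
```

With processing faster than the interval, only the batch that has just arrived is outstanding; once processing is slower than the interval, the count of unfinished batches keeps growing with every new arrival.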
"We want to achieve parallelism at maximum, so that not any micro batch is queued"
That's the thing about stream processing: you process the data in the order it was received. If you process your data at a rate slower than it arrives, it will be queued. Also, don't expect the processing of one record to suddenly be parallelized across multiple nodes.
From your screenshot, it seems your batch time is 10 seconds and your producer published 100 records over 90 seconds.
It took 36s to process 2 records and 70s to process 17 records. Clearly, there is some per-batch overhead. If this dependency is linear, it would take only 4:18 to process all 100 records in a single mini-batch thus beating your record holder.
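The 4:18 estimate follows from fitting a straight line through the two measurements (2 records in 36s, 17 records in 70s). A short Python sketch of that arithmetic, assuming as the answer does that the relationship is linear; `extrapolate_seconds` is a hypothetical helper name.

```python
def extrapolate_seconds(p1, p2, records):
    """Fit time = overhead + per_record * n through two (n, seconds) points."""
    (n1, t1), (n2, t2) = p1, p2
    per_record = (t2 - t1) / (n2 - n1)
    overhead = t1 - per_record * n1
    return overhead + per_record * records, overhead, per_record

if __name__ == "__main__":
    total, overhead, per_record = extrapolate_seconds((2, 36), (17, 70), 100)
    minutes, seconds = divmod(round(total), 60)
    print(f"overhead ~{overhead:.1f}s, ~{per_record:.2f}s per record, total {minutes}:{seconds:02d}")
```

The fixed overhead per batch (roughly half a minute under this fit) is what makes many small batches so much slower than one large one.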
Since your code is not complete, it's hard to tell what exactly takes so much time. Transformations in the code look fine but probably the action (or subsequent transformations) are the real bottlenecks. Also, what's with producer.flush() which wasn't mentioned anywhere in your code?
I was facing the same issue and I solved it using Scala Futures.
Here are some links that show how to use them:
https://alvinalexander.com/scala/how-use-multiple-scala-futures-in-for-comprehension-loop
https://www.beyondthelines.net/computing/scala-future-and-execution-context/
Also, this is a piece of my code where I used Scala Futures:
messages.foreachRDD{ rdd =>
  val f = Future {
    // sleep(100)
    val newRDD = rdd.map{message =>
      val req_message = message.value()
      (message.value())
    }
    println("Request messages: " + newRDD.count())
    var resultrows = newRDD.collect()//.collectAsList()
    processMessage(resultrows, mlFeatures: MLFeatures, conf)
    println("Inside scala future")
    1
  }
  f.onComplete {
    case Success(messages) => println("yay!")
    case Failure(exception) => println("On no!")
  }
}
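The idea in the snippet above is to hand each micro batch to a future so that the loop over batches does not block on the work. The following is a Python analog of that pattern using a thread pool, not Spark or Scala code; `process_batch` and `process_batches_concurrently` are hypothetical names and the per-batch work is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    """Stand-in for the per-batch work; here it only counts the messages."""
    return len(batch)

def process_batches_concurrently(batches, max_workers=4):
    """Submit every batch to a thread pool and collect results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_batch, batch) for batch in batches]
        # result() blocks until that batch is done and re-raises any worker exception
        return [f.result() for f in futures]

if __name__ == "__main__":
    print(process_batches_concurrently([["a", "b"], ["c"], []]))
```

Batches handled this way can finish in any order, so anything that depends on ordering (such as committing Kafka offsets) needs separate care.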
It's hard to tell without having all the details, but the general advice for tackling issues like this is to start with a very simple application, a "Hello world" kind: just read from the input stream and print the data to a log file. If that works, you have shown that the problem is in your application, and you can gradually add your functionality back until you find the culprit. If even the simplest app doesn't work, you know that the problem is in the configuration or in the Spark cluster itself. Hope this helps.

Using parfor and labSend/labReceive

I want to run two matlab scripts in parallel for a project and communicate between them. The purpose of this is to have one script do image analysis and sending the results to the other which will use it for more calculations (time consuming, but not related to the task of finding stuff in the images). Since both tasks are time consuming, and should preferably be done in real time, I believe that parallelization is necessary.
To get a feel for how this should be done I created a test script to find out how to communicate between the two scripts.
The first script takes user input using the built-in function input and sends it with labSend to the other script, which receives it and prints it.
function [blarg] = inputStuff(blarg)
    mpiInit(); %added because of error message, but do not work...
    for i=1:2
        labBarrier; % added because of error message
        inp = input('Enter a number to write');
        labSend(inp);
        if (inp == 0)
            break;
        else
            i = 1;
        end
    end
end
function [ blarg ] = testWrite( blarg )
    mpiInit(); % added because of error message, but does not help
    par = 0;
    if ( blarg == 0)
        par = 1;
    end
    for i = 1:10
        if (par == 1)
            labBarrier
            delta = labReceive();
            i = 1;
        else
            delta = input('Enter number to write');
        end
        if (delta == 0)
            break;
        end
        s = strcat('This lab no', num2str(labindex), '. Delta is = ')
        delta
    end
end
%%This is the file test_parfor.m
funlist = {@inputStuff, @testWrite};
matlabpool(2);
mpiInit(); % added because of error message, but does not help
parfor i=1:2
    funlist{i}(0);
end
matlabpool close;
Then, when the code is run, the following error message appears:
Starting matlabpool using the 'local' profile ... connected to 2 labs.
Error using parallel_function (line 589)
The MPI implementation has not yet been loaded. Please
call mpiInit.
Error stack:
testWrite.m at 11
Error in test_parfor (line 8)
parfor i=1:2
Calling the method mpiInit does not help... (Called as shown in the code above.)
None of the examples that MathWorks provides in the documentation or on its website shows this error or what to do about it.
Any help is appreciated!
You would typically use constructs such as labSend, labReceive and labBarrier within an spmd block, rather than a parfor block.
parfor is intended for implementing embarrassingly parallel algorithms, in other words algorithms that consist of multiple independent tasks that can be run in parallel, and do not require communication between tasks.
I'm stretching my knowledge here (perhaps someone more expert can correct me), but as I understand things, it does not set up an MPI ring for communication between workers, which is probably the explanation for the (rather uninformative) error message you're getting.
An spmd block enables communication between workers using labSend, labReceive and labBarrier. There are quite a few examples of using them all in the documentation.
Sam is right that the MPI functionality is not enabled during parfor, only during spmd. You need to do something more like this:
spmd
    funlist{labindex}(0);
end
(Sam is also quite right that the error message you saw is pretty unhelpful)

Magento foreach() orders getAllItems()

I am not sure why this loop is not working.
$orders = Mage::getSingleton('sales/order')->getCollection()
    ->addAttributeToSelect('*')
    ->addFieldToFilter('created_at', array('from'=>$from, 'to'=>$to))
    ->addAttributeToSort('increment_id', 'ASC')
;
foreach ($orders as $item) {
    $order_id = $item->increment_id;
    if (is_numeric($order_id)) $order = Mage::getModel('sales/order')->loadByIncrementId($order_id);
    if (is_object($order)) {
        echo "> O: ". $order_id ."<BR>";
        $items = $order->getAllItems();
        echo ">> O: ". $order_id ."<BR>";
    } else
        die("DIE ". var_dump($order));
}
die("<BR> DONE");
The output:
...
...
>> O: 100021819
> O: 100021820
>> O: 100021820
> O: 100021821
The loop never finishes, and it does not stop at the same order_id each time.
It always fails at $order->getAllItems()
These orders are either pending, processing or complete.
Is there something I should be checking for with $order->getAllItems(), since that's where it's failing?
Thanks.
Jon, I assume the problem you're talking about is your script ending unexpectedly, i.e., you see the output with a single >
> O: 100021821
but not the output with the double >>.
Because Magento is so customizable, it's impossible to accurately diagnose your problem with the information given. Something is happening in your system (a PHP error, an uncaught exception, etc.) that results in your script stopping. Turn on developer mode, set the PHP ini display_errors to 1 (ini_set('display_errors', 1);) and check your error log. Once you (or we) have the PHP error, it'll be a lot easier to help you.
My guess is you're running into a memory problem. The way PHP has implemented objects can lead to small memory leaks: objects don't clean up after themselves correctly. This means each time you go through the loop you're slowly consuming the total amount of memory that's allowed for a PHP request. For a system with a significant number of orders, I'd be surprised if the above code could get through everything before running out of memory.
If your problem is a memory problem, there's information on manually cleaning up after PHP's objects in this PDF. You should also consider splitting your actions into multiple requests, i.e. the first request handles orders 1 - 100, the next 101 - 200, etc.
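The suggestion to split the work into multiple requests comes down to generating fixed-size ranges of order ids. A minimal Python sketch of that bookkeeping, assuming sequential numeric ids; `id_batches` and the batch size of 100 are illustrative only.

```python
def id_batches(first_id, last_id, batch_size=100):
    """Yield inclusive (start, end) ranges covering first_id..last_id."""
    start = first_id
    while start <= last_id:
        end = min(start + batch_size - 1, last_id)
        yield (start, end)
        start = end + 1

if __name__ == "__main__":
    for start, end in id_batches(1, 250):
        print(f"request handles orders {start} - {end}")
```

Each range would then be handled by its own request, so the memory consumed by one range is released before the next one starts.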
What do you mean by "it fails"?
By the look of the output it doesn't fail there, as it outputs text on either side of the call to getAllItems().
Change:
$items = $order->getAllItems();
to:
foreach($order->getAllItems() as $orderItem) {
    echo $orderItem->getId() . "<br />";
}
and see what happens.
The script could be ending on a different order ID each time if the server has a low memory limit and the script quits when it runs out of resources.

Magento dataflow takes too long to load CSV file

I have a large CSV file containing inventory data to update (more than 35,000 rows). I created a method that extends Mage_Catalog_Model_Convert_Adapter_Productimport to do the inventory update, and I use an Advanced Profile that calls that method.
It works very well when I run the profile manually. The problem is that when I use an extension that runs the profile from a cron job, the system takes too long to load and parse the CSV file. I set the cron job to run every day at 6:15am, but the first row of the file isn't processed until 1:20pm the same day: it takes 7 hours to load the file.
That somehow makes the process stop in the middle, with less than 1/3 of the records processed. I've been frustrated trying to figure out why and trying to solve the problem, but no luck.
Any ideas would be appreciated.
Varien_File_Csv is the class that parses your CSV file.
It takes too much memory.
Here is a function that logs the current and peak memory usage with every message:
public function log($msg, $level = null)
{
    if (is_null($level)) $level = Zend_Log::INFO;
    $units = array('b', 'Kb', 'Mb', 'Gb', 'Tb', 'Pb');
    $m = memory_get_usage();
    $mem = @round($m / pow(1024, ($i = floor(log($m, 1024)))), 2).' '.$units[$i];
    $mp = memory_get_peak_usage();
    $memp = @round($mp / pow(1024, ($ip = floor(log($mp, 1024)))), 2).' '.$units[$ip];
    $msg = sprintf('(mem %4.2f %s, %4.2f %s) ', $mem, $units[$i], $memp, $units[$ip]).$msg;
    Mage::log($msg, $level, 'my_log.log', 1);
}
$MyClass->log('With every message I log the memory is closer to the sky');
You could split your CSV into smaller parts (each using the same filename in turn) and call the job multiple times. You'll need to make sure that a previous call is not still running when a newer one starts.
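Splitting the file can be done ahead of the import. The following is a minimal Python sketch of the splitting step only, assuming the CSV has a header row that every part must repeat; `split_csv`, the chunk size and the sample columns are hypothetical and not part of Magento.

```python
import csv
import io

def split_csv(text, rows_per_chunk):
    """Split CSV text into chunks that each repeat the header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # assumes the first row is the header
    chunks, current = [], []
    for row in reader:
        current.append(row)
        if len(current) == rows_per_chunk:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)  # keep the final, shorter chunk

    outputs = []
    for rows in chunks:
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
        outputs.append(buffer.getvalue())
    return outputs

if __name__ == "__main__":
    for part in split_csv("sku,qty\nA,1\nB,2\nC,3\n", 2):
        print(part)
```

Each returned string can be written to the filename the profile expects before the job is run for that part.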
Thanks
