Is bash on Windows implemented differently from native bash, specifically for loops? - bash

I ran the following command ad hoc on Macs in a Mac store:
time for x in {1..5000000}; do if ! (($x % 10000)); then echo $x; fi done
as a very rudimentary benchmark. It creates a list of the numbers from 1 to 5000000, checks whether each one is divisible by 10000, prints it if so, and times the whole process. I've been getting around 40 seconds on MacBook Airs and 32 seconds on MacBook Pros, all with 8th-gen Intel processors. One pattern I noticed is that it freezes for a long time before printing anything, presumably because it is creating the list from 1 to 5000000 and holding it in memory.
However, my friend who uses Windows reported faster times, on the order of 15 seconds, on a 5th-gen Core M processor with the Windows 10 native bash shell. I suspect it's because Windows bash treats for x in {1..5000000} as a generator. In that case the list would never be built in memory, since everything would only need to be stored in cache, which would make it faster. Can anyone confirm whether for loops in the bash interpreter behave the same or differently across the Windows implementation and the Linux/Mac implementations?
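As a rough illustration of the distinction the question draws, and not a statement about how either bash build is implemented: bash performs brace expansion before the loop starts, so {1..5000000} becomes five million separate words up front, whereas a counting loop such as for ((x=1; x<=5000000; x++)) produces one value at a time. The Python sketch below mirrors that difference with a list versus a range object. The function names are hypothetical, and the sizes it reports are Python's, not bash's.

```python
import sys


def multiples_eager(limit, step):
    """Build the whole list first (analogous to brace expansion), then filter."""
    numbers = list(range(1, limit + 1))  # every value exists in memory at once
    hits = [x for x in numbers if x % step == 0]
    return hits, sys.getsizeof(numbers)


def multiples_lazy(limit, step):
    """Produce values one at a time (analogous to a counting loop)."""
    counter = range(1, limit + 1)  # small fixed-size object, no list is built
    hits = [x for x in counter if x % step == 0]
    return hits, sys.getsizeof(counter)


eager_hits, eager_bytes = multiples_eager(500_000, 10_000)
lazy_hits, lazy_bytes = multiples_lazy(500_000, 10_000)
print("same results:", eager_hits == lazy_hits)
print("container bytes, eager vs lazy:", eager_bytes, lazy_bytes)
```

Both variants print the same multiples; only the up-front memory cost differs, which is the effect the long initial freeze hints at.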

Related

Julia drawing from standard normal distribution

I need to draw 53000000 observations from a standard normal distribution. My current code takes a long time to run in Julia (in fact, it's been running for the past twenty minutes) and I'm wondering if there's anything I can do to speed it up. Here's what I tried:
using Distributions
d = Normal()
shock = rand(d, 1, 53000000)
The code works instantaneously when I execute it in REPL (I am working in Juno/Atom), but lags at this point (drawing from the standard normal) when I step through using the debugger. So I think the debugger may be the real culprit here.
It may be that the 1/2 gig of memory used by the allocation of the variable shock is sometimes causing swapping when the debugger is loaded.
Try running this to see, in the debugger:
using Distributions, Base.Sys
println("Free memory is $(Int(Sys.free_memory()))")
d = Normal()
shock = rand(d, 1, 53000000)
println("shock uses $(sizeof(shock)) bytes.")
println("Free memory is $(Int(Sys.free_memory()))")
Are you close to running out of memory?
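The "1/2 gig" figure above can be sanity-checked with simple arithmetic. This is a minimal sketch that assumes each sample is stored as an 8-byte Float64, which is what sizeof(shock) would report in that case; the helper name is hypothetical.

```python
BYTES_PER_FLOAT64 = 8  # assumption: each sample is stored as a Float64


def array_bytes(rows, cols, bytes_per_element=BYTES_PER_FLOAT64):
    """Raw storage needed for a dense rows x cols array."""
    return rows * cols * bytes_per_element


shock_bytes = array_bytes(1, 53_000_000)
print(f"{shock_bytes} bytes = {shock_bytes / 2**30:.2f} GiB")
```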

Why does GNU parallel affect script speed?

I have some Fortran script. I compile with gfortran and then run as time ./a.out.
My script completes, and outputs the runtime as,
real 0m36.037s
user 0m36.028s
sys 0m0.004s
i.e. ~36 seconds
Now suppose I want to run this script multiple times, in parallel. For this I am using GNU Parallel.
Using the lscpu command tells me that I have 8 CPUs, with 2 threads per core and 4 cores per socket.
I create some file example.txt of the form,
time ./a.out
time ./a.out
time ./a.out
time ./a.out
...
which goes on for 8 lines.
I can then run these in parallel on 8 cores as,
parallel -j 8 :::: example.txt
In this case I would expect the runtime for each script to still be 36 seconds, and the total runtime to be ~36 seconds. However, in actuality what happens is the run time for each script roughly doubles.
If I instead run on 4 cores (-j 4), the problem disappears and each script goes back to taking 36 seconds to run.
What is the cause of this? I have heard talk in the past of 'overheads', but I am not sure exactly what is meant by this.
What is happening is that you have only one socket with 4 physical cores in it.
Those are the real cores of your machine.
The total number of CPUs you see as output of lscpu is calculated using the following formula: #sockets * #cores_per_socket * #threads_per_core.
In your case it is 1*4*2=8.
Threads per core act as a sort of virtual CPU, and they do not always perform like real CPUs, especially for compute-intensive processing (this feature is called hyperthreading).
Hence when you try to squeeze two threads onto each core, they end up being executed almost serially.
Take a look at this article for more info.
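The lscpu formula above, plus a deliberately crude model of the observed slowdown, can be sketched as follows. The model assumes hyperthreading gives no benefit at all for this compute-bound workload, which is a simplification; the function names are hypothetical.

```python
def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """CPU count as reported by lscpu."""
    return sockets * cores_per_socket * threads_per_core


def estimated_job_seconds(single_job_seconds, jobs, physical_cores):
    """Per-job runtime if jobs beyond the physical core count simply time-share.

    Assumption: hyperthreads add no throughput for this workload.
    """
    slowdown = max(1.0, jobs / physical_cores)
    return single_job_seconds * slowdown


print(logical_cpus(1, 4, 2))              # CPUs reported by lscpu
print(estimated_job_seconds(36.0, 8, 4))  # -j 8 on 4 physical cores
print(estimated_job_seconds(36.0, 4, 4))  # -j 4 on 4 physical cores
```

Under that assumption the model reproduces both observations: roughly doubled runtimes with -j 8 and unchanged runtimes with -j 4.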

Parallel-ForkManager, DBI. Faster than before forking, but still too slow

I have a very simple task on updating database.
my $pm = new Parallel::ForkManager(15);
for my $line (@lines) {
    # Parent: fork a child, then move on to the next line.
    my $pid = $pm->start and next;
    # Child: work on its own clone of the database handle.
    my $dbh2 = $dbh->clone();
    my $sth2 = $dbh2->prepare("update db1 set field1=? where field2 =?");
    my ($field1, $field2) = very_slow_subroutine();
    $sth2->execute($field1, $field2);
    $pm->finish;
}
$pm->wait_all_children;
I could just use $dbh2->do, but I doubt that is the reason for the slowness.
What is interesting is that it starts these 15 processes (or however many I specify) very fast, but right after that it slows down drastically. It is still noticeably faster than without forking, but I would expect more...
Edit:
The very_slow_subroutine is a sub which gets an answer from a web service. The service can answer in anything from a fraction of a second to several seconds, on timeout. I have to ask tens of thousands of times... that is the reason I would like to fork.
And if this matters -- I am on Linux.
Parallel::ForkManager doesn't magically make things faster, it just lets you run your code multiple times at the same time. In order to get the benefit out of it, you have to design your code for parallelism.
Think of it this way. It takes you 10 minutes to get to the store, shop, load your car, come back, and unload it. You need to get 5 loads. You alone can do it in 50 minutes. That is working in serial. 10 minutes * 5 trips one after the other = 50 minutes.
Let's say you get four friends to help. You all start off for the store at the same time. There's still 5 trips, and they still take 10 minutes, but because you did it in parallel the total time is only 10 minutes.
But it will never take less than 10 minutes, no matter how many trips you have to make or how many friends you get to help. That is why the process starts up fast, everybody gets into their cars and drives off to the store, but then nothing happens for a while because it still takes 10 minutes for everyone to do their job.
Same thing here. Your loop body takes X time to run. If you iterate through it Y times, it will take X * Y real world human time to run. If you run it in parallel Y times, ideally it will take just X time to run. Each parallel worker must still execute the full body of the loop taking X time.
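The X * Y reasoning can be written down as an idealized model. This sketch ignores fork and scheduling overhead and assumes every iteration takes the same time; the function name is hypothetical.

```python
import math


def ideal_wall_time(iterations, time_per_iteration, workers):
    """Best-case elapsed time when iterations are spread over workers."""
    rounds = math.ceil(iterations / workers)  # batches that must run back to back
    return rounds * time_per_iteration


# The shopping example: 5 trips of 10 minutes each.
print(ideal_wall_time(5, 10, 1))   # alone
print(ideal_wall_time(5, 10, 5))   # you plus four friends
print(ideal_wall_time(5, 10, 15))  # extra helpers cannot beat one full trip
```

The floor of one full iteration is the point of the analogy: adding workers beyond the number of iterations changes nothing.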
In order to speed things up further, you have to break up the big bottleneck of very_slow_subroutine and make that work in parallel. Your SQL is so simple that is where you should focus your efforts at optimization and parallelism.
Let's say the store is really close, it's only a 1 minute drive (this is your SQL UPDATE), but shopping, loading and unloading takes 9 minutes (this is very_slow_subroutine). What if instead you have 5 cars and 15 friends. You load 3 people into each car. Driving to and from the store will take the same time, but now three people are working together to do the shopping, loading and unloading taking only 4 minutes. Now each trip takes 5 minutes instead of 10.
This represents redesigning very_slow_subroutine to do its work in parallel. If it's just a big loop, you can put more workers on that loop. If it's a series of slow operations, you will have to redesign it to take advantage of parallel execution.
If you use too many workers you can clog up the system; it depends on what the bottleneck is. If it's CPU bound and you have 2 CPU cores, you'll probably see performance gains up to 3 to 5 workers ((cores * 2)+1 is a good rule of thumb), and after that performance will drop off as the CPU spends more time switching between processes than doing work. If the bottleneck is IO, or an external service as is often the case with database and network calls, you can see great efficiencies throwing many workers at the problem. While one process is waiting around for a disk or network operation, the others can be using your CPU.
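The rule of thumb quoted above is easy to tabulate. It is only a starting point for CPU-bound work, not a guarantee, and the function name is hypothetical.

```python
def suggested_workers(cpu_cores):
    """Starting worker count for CPU-bound jobs: (cores * 2) + 1."""
    return cpu_cores * 2 + 1


for cores in (1, 2, 4, 8):
    print(f"{cores} core(s): try up to {suggested_workers(cores)} workers")
```

For IO-bound work such as the web service calls in the question, the useful number of workers may be much higher, so measuring is still required.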
Whether parallelism can help depends on where your bottleneck is. If your CPU with 4 cores is the bottleneck, forking 4 processes might cause things to complete in about 1/4th the time in the best-case scenario, but spawning 15 processes is not going to improve things much more.
If, more likely, your bottleneck is in I/O, starting 15 processes that compete for the same I/O is not going to help much, although in cases where you have tons of memory to use as file cache, some improvement might be possible.
To explore the limits on your system, consider the following program:
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;
run(@ARGV);
sub run {
    # Maximum number of concurrent children; defaults to 2.
    my $count = @_ ? $_[0] : 2;
    my $pm = Parallel::ForkManager->new($count);
    for (1 .. 20) {
        $pm->start and next;   # parent continues the loop
        sleep 1;               # child: simulate one second of work
        $pm->finish;
    }
    $pm->wait_all_children;
}
My ancient laptop has a single CPU with 2 cores. Let's see what I get:
TimeThis : Command Line : perl sleeper.pl 1
TimeThis : Elapsed Time : 00:00:20.735
TimeThis : Command Line : perl sleeper.pl 2
TimeThis : Elapsed Time : 00:00:06.578
TimeThis : Command Line : perl sleeper.pl 4
TimeThis : Elapsed Time : 00:00:04.578
TimeThis : Command Line : perl sleeper.pl 8
TimeThis : Elapsed Time : 00:00:03.546
TimeThis : Command Line : perl sleeper.pl 16
TimeThis : Elapsed Time : 00:00:02.562
TimeThis : Command Line : perl sleeper.pl 20
TimeThis : Elapsed Time : 00:00:02.563
So, running with max 20 processes gives me a total run time over 2.5 seconds for sleeping one second 20 times.
On the other hand, with just one process, sleeping one second 20 times took just over 20 seconds. That is a huge improvement, but it also indicates a management overhead of more than 150% when you have 20 processes each sleeping for one second.
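The overhead figure can be reproduced from the two timings quoted above. This sketch assumes the ideal elapsed time with 20 workers is exactly one second (all 20 one-second sleeps running at once); the function names are hypothetical.

```python
# Elapsed seconds taken from the TimeThis output above (workers -> seconds).
measured = {1: 20.735, 20: 2.563}
IDEAL_SECONDS_WITH_20_WORKERS = 1.0  # assumption: all 20 sleeps overlap fully


def speedup(serial_seconds, parallel_seconds):
    """How many times faster the parallel run finished."""
    return serial_seconds / parallel_seconds


def overhead_fraction(elapsed_seconds, ideal_seconds):
    """Extra time beyond the ideal, as a fraction of the ideal."""
    return (elapsed_seconds - ideal_seconds) / ideal_seconds


gain = speedup(measured[1], measured[20])
overhead = overhead_fraction(measured[20], IDEAL_SECONDS_WITH_20_WORKERS)
print(f"speedup: {gain:.2f}x, overhead: {overhead:.1%}")
```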
This is in the nature of parallel programming. There are a lot of formal treatments out there on what you can expect, but Amdahl's Law is required reading.
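For reference, Amdahl's Law says that if a fraction p of the work can be parallelized across n workers, the best possible speedup is 1 / ((1 - p) + p / n). A minimal sketch, with a hypothetical function name and illustrative values of p that are not measurements from this thread:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative only: how the bound flattens as workers are added.
for n in (1, 2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallelizable, the bound can never exceed 10x however many workers are added, because the remaining 10% still runs serially.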

When/how to benefit from parallel processing of scripts?

I have a sequence of scripts to run on a computer with 1 physical & logical core.
I have tried running them in sequence, and also forking them with something like the bash script below. The background processes, running in parallel, actually took longer than running them in sequence.
My question is: Under what circumstances should a workload like this be run in parallel? That is, must I have particular hardware or can I do this more efficiently with the single-processor computer that I have?
#!/bin/bash
# Run them in sequence...
T1=$(date +%s)
python proc_test1.py
python proc_test2.py
python proc_test3.py
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
# Now fork them...
T1=$(date +%s)
python proc_test1.py &
python proc_test2.py &
python proc_test3.py &
wait
T2=$(date +%s)
T=$((T2-T1))
echo "Scripts took ${T} seconds."
exit 0
Parallel processing usually becomes beneficial when hardware is available. For example, if you had three logical CPUs, then three computations could occur simultaneously. Thus, ideally, if you ran proc_test1.py three times with forks with three available processors, all three would finish in the same time it would take to run just one instance of proc_test1.py.
In other words, given sufficient hardware, running three proc_test1.py's in serial will take three times as long as running them with forks.
Now, given that you just have one hardware cpu, it makes sense that the parallel jobs would run more slowly than the serial ones, as each python program will be competing with each other for cpu time. The cpu stopping one job and resuming another costs cpu time itself.
For example, say you had 6 oranges and two hands, and you had to hold each of the 6 oranges for 5 seconds. Say it takes you one second to pick up or swap out oranges. You could do this task in serial, and pick up two oranges at a time for five seconds before swapping for a new pair. This would take you
1 + 5 + 1 + 5 + 1 + 5
= 3 * 5 + 3 = 18
seconds to complete.
Now suppose the parallel analogy. Then all 6 oranges are begging to be picked up, and you holding one does not mean that you won't immediately drop it for an alternative. There isn't necessarily an upper bound on how long it will take you to complete the task as we have defined it, so suppose that you have to hold the oranges for at least 2.5 seconds before you swap them out, and you only swap in pairs. Then, it will take you
1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1 + 2.5 + 1
= 3 * 5 + 7 = 22
seconds to complete. Note that by "forking", it takes you 22% longer to hold 6 oranges for 5 seconds each. Since you have two hands, it still takes you 15 seconds to complete the task, but there is a variable overhead in switching time based on your strategy. Note that if you had 6 hands, it would take only 7 seconds to complete the task.
Thus when you have more processors, fork processes, otherwise you're just juggling jobs on limited hardware.
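The orange arithmetic above can be checked with a small model: total time is holding time plus one second per pick-up or swap. The function and parameter names are hypothetical, and the slice and swap counts are the ones used in the two sums above.

```python
def task_seconds(hold_slices, seconds_per_slice, swaps, seconds_per_swap=1.0):
    """Total time = time spent holding + time spent picking up or swapping."""
    return hold_slices * seconds_per_slice + swaps * seconds_per_swap


# Serial: three pairs, each held for the full 5 seconds, 3 pick-ups.
serial = task_seconds(hold_slices=3, seconds_per_slice=5.0, swaps=3)
# "Forked": six 2.5-second slices with 7 pick-ups or swaps in between.
interleaved = task_seconds(hold_slices=6, seconds_per_slice=2.5, swaps=7)
extra = interleaved / serial - 1

print(serial, interleaved, f"{extra:.0%} longer")
```

The holding time is 15 seconds in both cases; only the switching cost grows, which is the overhead a single CPU pays when it juggles several jobs.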
Your simple question is in practice extremely hard to answer in general. In real life I would always measure to see if reality agrees with my theory.
An example where reality did not agree with my theory was on my Intel Core i7. It has 4 cores and has hyperthreading. This would suggest that running 8 threads in parallel will be optimal: You will be using the 4 processing units and use 4 additional threads for keeping the pipelines filled.
However, Core i7 has 6MB of shared cache. It just so happened, that the working set of my data fit inside the 6MB. So I saw an extreme speedup by running 1 thread instead of 2 or even 8: Running more than 1 would simply flush the cache all the time. This would not have been true if the cache had not been shared, but it just shows that it is not simple to say when parallelizing will be faster.

Performance penalty of persistent variables in MATLAB

Recently I profiled some MATLAB code and I was shocked to see the following in a heavily used function:
5.76 198694 58 persistent CONSTANTS;
3.44 198694 59 if isempty(CONSTANTS) % initialize CONSTANTS
In other words, MATLAB spent about 9 seconds, over 198694 function calls, declaring the persistent CONSTANTS and checking if it has been initialized. That represents 13% of the total time spent in that function.
Do persistent variables really carry that much of a performance penalty in MATLAB? Or are we doing something terribly wrong here?
UPDATE
@Andrew I tried your sample script and I am very, very perplexed by the output:
time calls line
6 function has_persistent
6.48 200000 7 persistent CONSTANTS
1.91 200000 8 if isempty(CONSTANTS)
9 CONSTANTS = 42;
10 end
I tried the bench() command and it showed my machine in the middle range of the sample machines. I am running 64-bit Ubuntu on an Intel(R) Core(TM) i7 CPU with 4GB RAM.
That's the standard way of using persistent variables in Matlab. You're doing what you're supposed to. There will be noticeable overhead for it, but your timings do seem surprisingly high.
Here's a similar test I ran in 32-bit Matlab R2009b on a 3.0 GHz Intel Core 2 QX9650 machine under Windows XP x64. Similar results on other machines and versions. About 5x faster than your timings.
Test:
function call_has_persistent
    for i = 1:200000
        has_persistent();
    end
function has_persistent
    persistent CONSTANTS
    if isempty(CONSTANTS)
        CONSTANTS = 42;
    end
Results:
0.89 200000 7 persistent CONSTANTS
0.25 200000 8 if isempty(CONSTANTS)
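To compare the profiles on equal footing, it can help to convert each one to a cost per call. This sketch only sums the two profiled lines (the persistent declaration and the isempty check) quoted in this thread; the helper name and the labels are hypothetical.

```python
def microseconds_per_call(total_seconds, calls):
    """Average cost of one call, in microseconds."""
    return total_seconds / calls * 1e6


# (seconds on the two profiled lines, number of calls), as quoted in the thread.
profiles = {
    "asker, original function": (5.76 + 3.44, 198694),
    "asker, test script": (6.48 + 1.91, 200000),
    "answerer, test script": (0.89 + 0.25, 200000),
}

for label, (seconds, calls) in profiles.items():
    print(f"{label}: {microseconds_per_call(seconds, calls):.1f} us per call")
```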
What Matlab version, OS, and CPU are you running on? What does CONSTANTS get initialized with? Does Matlab's bench() output seem reasonable for your machine?
Your timings do seem high. There may be a bug or config issue there to fix. But if you really want to get Matlab code fast, the standard advice is to "vectorize" it: restructure the code so that it makes fewer function calls on larger input arrays, and makes use of Matlab's built-in vectorized functions instead of loops or control structures, to avoid having 200,000 calls to the function in the first place, if possible. Matlab has relatively high overhead per function or method call (see Is MATLAB OOP slow or am I doing something wrong? for some numbers), so you can often get more mileage by refactoring to eliminate function calls instead of making the individual function calls faster.
It may be worth benchmarking some other basic Matlab operations on your machine, to see if it's just "persistent" that seems slow. Also try profiling just this little call_has_persistent test script in isolation to see if the context of your function makes a difference.
