Profiling shell commands in Emacs

Profiling shell commands in Emacs - shell

Is there a way to profile the amount of time blocking on shell commands in emacs? Consider the following program:
(profiler-start 'cpu)
(shell-command "sleep 3")
(profiler-report)
(profiler-stop)
The profiler report will look something like this:
- command-execute 371 95%
- call-interactively 371 95%
- funcall-interactively 329 84%
- execute-extended-command 175 44%
- execute-extended-command--shorter 157 40%
- completion-try-completion 149 38%
- completion--nth-completion 149 38%
- completion--some 143 36%
- #<compiled 0x438307f1> 143 36%
- completion-pcm-try-completion 102 26%
- completion-pcm--find-all-completions 98 25%
completion-pcm--all-completions 98 25%
+ completion-pcm--merge-try 4 1%
completion-basic-try-completion 41 10%
+ sit-for 16 4%
- eval-expression 154 39%
- eval 154 39%
- profiler-start 154 39%
- debug 154 39%
- recursive-edit 141 36%
- command-execute 114 29%
- call-interactively 114 29%
- byte-code 107 27%
+ read--expression 64 16%
+ read-extended-command 43 11%
+ funcall-interactively 7 1%
+ byte-code 42 10%
+ ... 19 4%
As you can see the time spent is more or less evenly distributed. I'm interested in seeing output that tells me that I'm spending the significant part of the program blocking on the shell-command sleep 3, is this possible somehow? I am aware that sleep 3 is not heavy on my CPU - but I'm trying to figure out which shell commands are called from magit that is taking such a long time - so I'll also be interested in stuff that's IO-bound.

Note that profiler.el is a sampling profiler. You might want to try an instrumenting profiler such as elp.el if you are interested in the wall time.
In your case you may want to instrument magit by using M-x elp-instrument-package RET magit RET. After running your magit commands you can then take a look at the results using M-x elp-results RET.
For magit you would probably find that the function magit-process-file is taking up a lot of time. To further investigate the specific function calls you could then simply instrument that or any other function by adding an advice function logging the runtime together with the function's arguments to the messages buffer for each individual function call as follows.
(defun log-function-time (f &rest args)
(let ((time-start (current-time)))
(prog1
(apply f args)
(message "%s seconds used in (magit-process-file %s)"
(time-to-seconds (time-subtract (current-time) time-start))
args))))
(advice-add 'magit-process-file :around 'log-function-time)

Related

How do I order an event study (panel data) dataframe?

I have a big panel database ordered by month and subject with lots of time series variables. In this database several events occure to some of the subjects indicated by a dummy-like variable but with the number of the event.
So I have:
Month
Subject_id
event
Variable_1
Variable_2
01-01-1970
A
0
8%
13%
02-01-1970
A
1
9%
5%
...
...
...
...
...
12-01-1984
B
0
-2%
1%
01-01-1985
B
2
10%
7%
02-01-1985
B
3
26%
3%
I want to construct another database where I can have the months ordered by before-after the event like t-12; t-11; t-10...t; t+1;t+2...
Month
Event
Subject_id
Variable_1
Variable_2
t-1
1
A
8%
13%
t
1
A
9%
5%
...
...
...
...
...
t-1
2
B
-2%
1%
t
2
B
10%
7%
...
...
...
...
...
t-1
3
B
10%
7%
t
3
B
26%
3%
Note that january 1985 is, at the same time, t for event 2 of subject B and t-1 for event 3 of the same subject. For this reason, I couldn't been able to merge by subject and a t+-x column. Some subjects have more than one overlapping event.
How can I transform my data into this new dataframe (I don't care about loosing the subjects that do not have events) ?

How do I rank this data by percentage and total possible?

Given this set in Excel:
Group Enrolled Eligible Percent
A 0 76 0%
B 10 92 11%
C 0 38 0%
D 2 50 4%
E 0 111 0%
F 4 86 5%
G 3 97 3%
H 4 178 2%
I 2 77 3%
J 0 64 0%
K 0 37 0%
L 11 54 20%
Is there a way to sort (for charting) to achieve the following order?
Group Enrolled Eligible Percent
L 11 54 20%
B 10 92 11%
F 4 86 5%
D 2 50 4%
G 3 97 3%
I 2 77 3%
H 4 178 2%
K 0 37 0%
C 0 38 0%
J 0 64 0%
A 0 76 0%
E 0 111 0%
My goal is to rank/visualize using these criteria:
Percent desc (when Enrolled > 0)
Eligible asc (when Enrolled = 0)
After writing this question, the answer looks obvious: sort by Percent descending, then Eligible ascending (when Percent or Enrolled = 0). But I feel like I'm missing an obvious method/term to achieve the results I'm looking for.
Thanks.

With Google spreadsheet Query is the easy way. Goal 1:
=QUERY(A1:D13,"Select A,B,C,D Where B>0 Order By D desc,1")
Goal 2:
=QUERY(A1:D13,"Select A,B,C,D Where B=0 Order By C ,1")

The term you're missing is SORT
Here's the formula you are looking for:
=SORT(A1:D13,4,0,3,1)
Note:
Numbers should be formatted as Numbers.

Execution time slow down at beginning and raising later

This is execution time log
As you can see, it faster and faster until it used 1.5 sec for one iteraction
and then slower and slower
iter: 0/700000
loss:8.13768323263
speed: 4.878s / iter
iter: 1/700000
loss:4.69941059748
speed: 3.162s / iter
...
...
...
iter: 1560/700000
loss:2.16679636637
speed: 1.496s / iter
iter: 1561/700000
loss:2.9271744887
speed: 1.496s / iter
...
...
...
iter: 3698/700000
loss:1.47574504217
speed: 1.701s / iter
iter: 3699/700000
loss:1.75555475553
speed: 1.701s / iter
Using graph.finalize() to freeze graph
install tensorflow 1.0 from source, using jemalloc, build with XLA, SSE...etc
threads = tf.train.start_queue_runners(coord=coord, sess=sess)
sess.graph.finalize() # Graph is read-only after this statement.
and follow this github to implement image_reader and accumulate gradients (like iter_size in caffe), all OP are outside training loop
not sure if relevant
GPU memory grow slightly, from 5707 MiB to 5717MiB
GPU-util become low and strange
1% -> 59% -> 1% -> 99% -> 0% -> 54% -> 1% -> 48%
anyone met this situation before?
or have any suggestion to debug?

Optimizing a Vectorized Matlab Function

When i run profiler it tell me that the most time consuming code is the function vdist. Its a program that measures distance between two points on earth considering earth as an ellipsoid. The code looks standard and i don't know where and how it can be improved upon. The initial comments say, it has already been vectorized. Is there a counterpart to it in some other language which can be used as a MEX file. All i want is improvement in terms of time efficiency. Here is a link to the code from Matlab FEX.
http://www.mathworks.com/matlabcentral/fileexchange/8607-vectorized-geodetic-distance-and-azimuth-on-the-wgs84-earth-ellipsoid/content/vdist.m
The function is called from within a loop as- (You can find the function as its the most time consuming line here)
109 for i=1:polySize
110 % find the two vectors needed
11755 111 if i~=1
0.02 11503 112 if i<polySize
0.02 11251 113 p0=Polygon(i,:); p1=Polygon(i-1,:); p2=Polygon(i+1,:);
252 114 else
252 115 p0=Polygon(i,:); p1=Polygon(i-1,:); p2=Polygon(1,:); %special case for i=polySize
252 116 end
252 117 else
252 118 p0=Polygon(i,:); p1=Polygon(polySize,:); p2=Polygon(i+1,:); %special case for i=1
252 119 end
0.02 11755 120 Vector1=(p0-p1); Vector2=(p0-p2);
0.06 11755 121 if ~(isequal(Vector1,Vector2) || isequal(Vector1,ZeroVec) || isequal(Vector2,ZeroVec));
122 %determine normals and normalise and
0.17 11755 123 NV1=rotateVector(Vector1, pi./2); NV2=rotateVector(Vector2, -pi./2);
0.21 11755 124 NormV1=normaliseVector(NV1); NormV2=normaliseVector(NV2);
125 %determine rotation by means of the atan2 (because sign matters!)
11755 126 totalRotation = vectorAngle(NormV2, NormV1); % Bestimme den Winkel totalRotation zwischen den normierten Vektoren
11755 127 if totalRotation<10
11755 128 totalRotation=totalRotation*50;
11755 129 end
0.01 11755 130 for res=1:6
0.07 70530 131 U_neu=p0+NV1;
17.01 70530 132 [pos,a12] = vdist(p0(:,2),p0(:,1),U_neu(:,2),U_neu(:,1));
0.02 70530 133 a12=a12+1/6.*res*totalRotation;
70530 134 ddist=1852*safety_distance;
4.88 70530 135 [lat2,lon2] = vreckon(p0(:,2),p0(:,1),ddist, a12);
0.15 70530 136 extendedPoly(f,:)=[lon2,lat2];f=f+1;
< 0.01 70530 137 end
11755 138 end
11755 139 end

No matter how hard I study the code that's been posted, I can't see why the call to vdist is made inside the loop.
When I'm trying to optimise a block of code inside a loop one of the things I look for are statements which are invariant, that is which are the same at each call, and which can therefore be lifted out of the loop.
Looking at
130 for res=1:6
131 U_neu=p0+NV1;
132 [pos,a12] = vdist(p0(:,2),p0(:,1),U_neu(:,2),U_neu(:,1));
133 a12=a12+1/6.*res*totalRotation;
134 ddist=1852*safety_distance;
135 [lat2,lon2] = vreckon(p0(:,2),p0(:,1),ddist, a12);
136 extendedPoly(f,:)=[lon2,lat2];f=f+1;
137 end
I see
in l131 the variables p0, NV1 appear only on the rhs, and they only appear on the rhs elsewhere inside the loop, so this statement is loop-invariant and can be lifted out of the loop; only a small time saving perhaps;
in l134 again, I see another loop-invariant statement, which can again be lifted out of the loop for another small time saving;
but then I started to look very closely, and I can't see why l132, where the call to vdist is made, is inside the loop either. None of the values on the rhs of that assignment are modified in the loop (other than U_neu but I've already lifted that out of the loop).
Tidying up what was left a bit, this is what I ended up with:
U_neu=p0+NV1;
[pos,a12] = vdist(p0(:,2),p0(:,1),U_neu(:,2),U_neu(:,1));
ddist=1852*safety_distance;
for res=1:6
extendedPoly(f,:) = vreckon(p0(:,2),p0(:,1),ddist, a12+1/6.*res*totalRotation);
f=f+1;
end

An option will be to rewrite this FEX file in a way that you would be able to use it GPUs. A smooth way into it for example is a toolbox called Jacket.

Profiling a Haskell program

I have a piece of code that repeatedly samples from a probability distribution using sequence. Morally, it does something like this:
sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
xs <- sequence (replicate n dist)
return (sum xs)
Except that it's a bit more complicated. The actual code I'm interested in is the function likelihoodWeighting at this Github repo.
I noticed that the running time scales nonlinearly with n. In particular, once n exceeds a certain value it hits the memory limit, and the running time explodes. I'm not certain, but I think this is because sequence is building up a long list of thunks which aren't getting evaluated until the call to sum.
Once I get past about 100,000 samples, the program slows to a crawl. I'd like to optimize this (my feeling is that 10 million samples shouldn't be a problem) so I decided to profile it - but I'm having a little trouble understanding the output of the profiler.
Profiling
I created a short executable in a file main.hs that runs my function with 100,000 samples. Here's the output from doing
$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s
First things I notice - it allocates nearly 1.5 GB of heap, and spends 60% of its time on garbage collection. Is this generally indicative of too much laziness?
1,377,538,232 bytes allocated in the heap
1,195,050,032 bytes copied during GC
169,411,368 bytes maximum residency (12 sample(s))
7,360,232 bytes maximum slop
423 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2574 collections, 0 parallel, 2.40s, 2.43s elapsed
Generation 1: 12 collections, 0 parallel, 1.07s, 1.28s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.92s ( 1.94s elapsed)
GC time 3.47s ( 3.70s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.23s ( 0.23s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 5.63s ( 5.87s elapsed)
%GC time 61.8% (63.1% elapsed)
Alloc rate 716,368,278 bytes per MUT second
Productivity 34.2% of total user, 32.7% of total elapsed
Here are the results from
$ ./main +RTS -p
The first time I ran this, it turned out that there was one function being called repeatedly, and it turned out I could memoize it, which sped things up by a factor of 2. It didn't solve the space leak, however.
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 434 4 0.0 0.0 100.0 100.0
likelihoodWeighting AI.Probability.Bayes 445 1 0.0 0.3 100.0 100.0
distributionLW AI.Probability.Bayes 448 1 0.0 2.6 0.0 2.6
getSampleLW AI.Probability.Bayes 446 100000 20.0 50.4 100.0 97.1
bnProb AI.Probability.Bayes 458 400000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 457 400000 6.7 0.8 6.7 0.8
bnVals AI.Probability.Bayes 455 400000 20.0 6.3 26.7 7.1
bnParents AI.Probability.Bayes 456 400000 6.7 0.8 6.7 0.8
bnSubRef AI.Probability.Bayes 454 800000 13.3 13.5 13.3 13.5
weightedSample AI.Probability.Bayes 447 100000 26.7 23.9 33.3 25.3
bnProb AI.Probability.Bayes 453 100000 0.0 0.0 0.0 0.0
bnCond AI.Probability.Bayes 452 100000 0.0 0.2 0.0 0.2
bnVals AI.Probability.Bayes 450 100000 0.0 0.3 6.7 0.5
bnParents AI.Probability.Bayes 451 100000 6.7 0.2 6.7 0.2
bnSubRef AI.Probability.Bayes 449 200000 0.0 0.7 0.0 0.7
Here's a heap profile. I don't know why it claims the runtime is 1.8 seconds - this run took about 6 seconds.
Can anyone help me to interpret the output of the profiler - i.e. to identify where the bottleneck is, and provide suggestions for how to speed things up?

A huge improvement has already been achieved by incorporating JohnL's suggestion of using foldM in likelihoodWeighting. That reduced memory usage about tenfold here, and brought down the GC times significantly to almost or actually negligible.
A profiling run with the current source yields
probabilityIO AI.Util.Util 26.1 42.4 413 290400000
weightedSample.go AI.Probability.Bayes 16.1 19.1 255 131200080
bnParents AI.Probability.Bayes 10.8 1.2 171 8000384
bnVals AI.Probability.Bayes 10.4 7.8 164 53603072
bnCond AI.Probability.Bayes 7.9 1.2 125 8000384
ndSubRef AI.Util.Array 4.8 9.2 76 63204112
bnSubRef AI.Probability.Bayes 4.7 8.1 75 55203072
likelihoodWeighting.func AI.Probability.Bayes 3.3 2.8 53 19195128
%! AI.Util.Util 3.3 0.5 53 3200000
bnProb AI.Probability.Bayes 2.5 0.0 40 16
bnProb.p AI.Probability.Bayes 2.5 3.5 40 24001152
likelihoodWeighting AI.Probability.Bayes 2.5 2.9 39 20000264
likelihoodWeighting.func.x AI.Probability.Bayes 2.3 0.2 37 1600000
and 13MB memory usage reported by -s, ~5MB maximum residency. That's not too bad already.
Still, there remain some points we can improve. First, a relatively minor thing, in the grand scheme, AI.UTIl.Array.ndSubRef:
ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])
Reversing the list, and mapping (2^) over another list is inefficient, better is
ndSubRef = L.foldl' (\a d -> 2*a + d) 0
which doesn't need to keep the entire list in memory (probably not a big deal, since the lists will be short) as reversing it does, and doesn't need to allocate a second list. The reduction in allocation is noticeable, about 10%, and that part runs measurably faster,
ndSubRef AI.Util.Array 1.7 1.3 24 8000384
in the profile of the modified run, but since it takes only a small part of the overall time, the overall impact is small. There are potentially bigger fish to fry in weightedSample and likelihoodWeighting.
Let's add a bit of strictness in weightedSample to see how that changes things:
weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
go 1.0 (M.fromList fixed) (bnVars bn)
where
go w assignment [] = return (assignment, w)
go w assignment (v:vs) = if v `elem` vars
then
let w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let p = bnProb bn assignment (v,True)
x <- probabilityIO p
go w (M.insert v x assignment) vs
vars = map fst fixed
The weight parameter of go is never forced, nor is the assignment parameter, thus they can build up thunks. Let's enable {-# LANGUAGE BangPatterns #-} and force updates to take effect immediately, also evaluate p before passing it to probabilityIO:
go w assignment (v:vs) = if v `elem` vars
then
let !w' = w * bnProb bn assignment (v, fixed %! v)
in go w' assignment vs
else do
let !p = bnProb bn assignment (v,True)
x <- probabilityIO p
let !assignment' = M.insert v x assignment
go w assignment' vs
That brings a further reduction in allocation (~9%) and a small speedup (~%13%), but the total memory usage and maximum residence haven't changed much.
I see nothing else obvious to change there, so let's look at likelihoodWeighting:
func m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return $! x `seq` w `seq` M.adjust (+w) x m
In the last line, first, w is already evaluated in weightedSample now, so we don't need to seq it here, the key x is required to evaluate the updated map, so seqing that isn't necessary either. The bad thing on that line is M.adjust. adjust has no way of forcing the result of the updated function, so that builds thunks in the map's values. You can force evaluation of the thunks by looking up the modified value and forcing that, but Data.Map provides a much more convenient way here, since the key at which the map is updated is guaranteed to be present, insertWith':
func !m _ = do
(a, w) <- weightedSample bn fixed
let x = a ! e
return (M.insertWith' (+) x w m)
(Note: GHC optimises better with a bang-pattern on m than with return $! ... here). That slightly reduces the total allocation and doesn't measurably change the running time, but has a great impact on total memory used and maximum residency:
934,566,488 bytes allocated in the heap
1,441,744 bytes copied during GC
68,112 bytes maximum residency (1 sample(s))
23,272 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
The biggest improvement in running time to be had would be by avoiding randomIO, the used StdGen is very slow.
I am surprised how much time the bn* functions take, but don't see any obvious inefficiency in those.

I have trouble digesting these profiles, but I have gotten my ass kicked before because the MonadRandom on Hackage is strict. Creating a lazy version of MonadRandom made my memory problems go away.
My colleague has not yet gotten permission to release the code, but I've put Control.Monad.LazyRandom online at pastebin. Or if you want to see some excerpts that explain a fully lazy random search, including infinite lists of random computations, check out Experience Report: Haskell in Computational Biology.

I put together a very elementary example, posted here: http://hpaste.org/71919. I'm not sure if it's anything like your example.. just a very minimal thing that seemed to work.
Compiling with -prof and -fprof-auto and running with 100000 iterations yielded the following head of the profiling output (pardon my line numbers):
8 COST CENTRE MODULE %time %alloc
9
10 sample AI.Util.ProbDist 31.5 36.6
11 bnParents AI.Probability.Bayes 23.2 0.0
12 bnRank AI.Probability.Bayes 10.7 23.7
13 weightedSample.go AI.Probability.Bayes 9.6 13.4
14 bnVars AI.Probability.Bayes 8.6 16.2
15 likelihoodWeighting AI.Probability.Bayes 3.8 4.2
16 likelihoodWeighting.getSample AI.Probability.Bayes 2.1 0.7
17 sample.cumulative AI.Util.ProbDist 1.7 2.1
18 bnCond AI.Probability.Bayes 1.6 0.0
19 bnRank.ps AI.Probability.Bayes 1.1 0.0
And here are the summary statistics:
1,433,944,752 bytes allocated in the heap
1,016,435,800 bytes copied during GC
176,719,648 bytes maximum residency (11 sample(s))
1,900,232 bytes maximum slop
400 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.40s ( 1.41s elapsed)
GC time 1.08s ( 1.24s elapsed)
Total time 2.47s ( 2.65s elapsed)
%GC time 43.6% (46.8% elapsed)
Alloc rate 1,026,674,336 bytes per MUT second
Productivity 56.4% of total user, 52.6% of total elapsed
Notice that the profiler pointed its finger at sample. I forced the return in that function by using $!, and here are some summary statistics afterwards:
1,776,908,816 bytes allocated in the heap
165,232,656 bytes copied during GC
34,963,136 bytes maximum residency (7 sample(s))
483,192 bytes maximum slop
68 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.42s ( 2.44s elapsed)
GC time 0.21s ( 0.23s elapsed)
Total time 2.63s ( 2.68s elapsed)
%GC time 7.9% (8.8% elapsed)
Alloc rate 733,248,745 bytes per MUT second
Productivity 92.1% of total user, 90.4% of total elapsed
Much more productive in terms of GC, but not much changed on the time. You might be able to keep iterating in this profile/tweak fashion to target your bottlenecks and eke out some better performance.

I think your initial diagnosis is correct, and I've never seen a profiling report that's useful once memory effects kick in.
The problem is that you're traversing the list twice, once for sequence and again for sum. In Haskell, multiple list traversals of large lists are really, really bad for performance. The solution is generally to use some type of fold, such as foldM. Your sampleMean function can be written as
{-# LANGUAGE BangPatterns #-}
sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist
for example, traversing the list only once.
You can do the same sort of thing with likelihoodWeighting as well. In order to prevent thunks, it's important to make sure that the accumulator in your fold function has appropriate strictness.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Profiling shell commands in Emacs - shell

Related

How do I order an event study (panel data) dataframe?

How do I rank this data by percentage and total possible?

Execution time slow down at beginning and raising later

Optimizing a Vectorized Matlab Function

Profiling a Haskell program

Categories

Resources