I have the following dummy dataframe:
import pandas
tab_size = 300000
tab_de_merde = [[i*1, 0, i*3, [i*7%3, i*11%3] ] for i in range(tab_size)]
colnames = ['Id', "Date", "Account","Value"]
indexnames = ['Id']
df = pandas.DataFrame(tab_de_merde, columns=colnames).set_index(indexnames)
And I want to check if the column "Value" contains a 0.
I've tried 3 different solutions and I was wondering whether the third one (Python vectorization) is implemented correctly, since it doesn't seem to speed the code up.
%timeit df[[(0 in x) for x in df['Value'].values]]
#108 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df['Value'].apply(lambda x: 0 in x)]
#86.2 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def list_contains_element(np_array): return [(0 in x) for x in np_array]
%timeit df[list_contains_element(df['Value'].values)]
#106 ms ± 807 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I would be very glad if someone could help me understand better how to be faster with vector manipulation.
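For what it's worth, all three variants above iterate over the Python lists one element at a time, which is likely why the "vectorized" version is no faster. A minimal sketch of a genuinely vectorized check, assuming every list in "Value" has the same length (as in the dummy data above; np.vstack would not handle ragged lists):
import numpy as np
# Stack the fixed-length lists into a (tab_size, 2) integer array,
# then look for zeros with a single vectorized comparison.
values = np.vstack(df['Value'].values)
mask = (values == 0).any(axis=1)
result = df[mask]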
In the Julia package BenchmarkTools, there are macros like @btime and @belapsed that seem redundant to me, since Julia has the built-in @time and @elapsed macros, and these macros appear to serve the same purpose. So what's the difference between @time and @btime, and between @elapsed and @belapsed?
TLDR ;)
@time and @elapsed just run the code once and measure the time. This measurement may or may not include compilation time (depending on whether @time is run for the first time or a subsequent time) and includes the time needed to resolve global variables.
On the other hand, @btime and @belapsed perform a warm-up, so compilation time and global-variable resolution time (if $ is used) do not affect the measurement.
Details
To understand how this works, let's use @macroexpand (I am also stripping comment lines for readability):
julia> using MacroTools, BenchmarkTools
julia> MacroTools.striplines(@macroexpand1 @elapsed sin(x))
quote
    Experimental.@force_compile
    local var"#28#t0" = Base.time_ns()
    sin(x)
    (Base.time_ns() - var"#28#t0") / 1.0e9
end
Compilation of sin is not forced, so you get different results when running for the first time and on subsequent runs. For example:
julia> @time cos(x);
  0.110512 seconds (261.97 k allocations: 12.991 MiB, 99.95% compilation time)
julia> @time cos(x);
  0.000008 seconds (1 allocation: 16 bytes)
julia> @time cos(x);
  0.000006 seconds (1 allocation: 16 bytes)
The situation is different with @belapsed:
julia> MacroTools.striplines(@macroexpand @belapsed sin($x))
quote
    (BenchmarkTools).time((BenchmarkTools).minimum(begin
        local var"##314" = begin
            BenchmarkTools.generate_benchmark_definition(Main, Symbol[], Any[], [Symbol("##x#315")], (x,), $(Expr(:copyast, :($(QuoteNode(:(sin(var"##x#315"))))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), $(Expr(:copyast, :($(QuoteNode(nothing))))), BenchmarkTools.Parameters())
        end
        (BenchmarkTools).warmup(var"##314")
        (BenchmarkTools).tune!(var"##314")
        (BenchmarkTools).run(var"##314")
    end)) / 1.0e9
end
You can see that a minimum value is taken (the code is run several times).
Basically, most of the time you should use BenchmarkTools for measuring times when tuning your application.
Last but not least, try @benchmark:
julia> @benchmark sin($x)
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 13.714 ns … 51.151 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.814 ns ┊ GC (median): 0.00%
Time (mean ± σ): 14.089 ns ± 1.121 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▂▄ ▁▂ ▃ ▁ ▂
██▆▅██▇▅▄██▃▁▃█▄▃▁▅█▆▁▄▃▅█▅▃▁▄▇▆▁▁▁▁▁▆▄▄▁▁▃▄▇▃▁▃▁▁▁▆▅▁▁▁▆▅▅ █
13.7 ns Histogram: log(frequency) by time 20 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
I have a simulation program written in Julia that does something equivalent to this as part of its main loop:
# Some fake data
M = [randn(100,100) for m=1:100, n=1:100]
W = randn(100,100)
work = zip(W,M)
result = mapreduce(x -> x[1]*x[2], +,work)
In other words, a simple sum of weighted matrices. Timing the above code yields
0.691084 seconds (79.03 k allocations: 1.493 GiB, 70.59% gc time, 2.79% compilation time)
I am surprised by the large number of memory allocations, as this problem should be possible to do in place. To see whether it was my use of mapreduce that was wrong, I also tested the following equivalent implementation:
@time begin
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            res += W[m,n] * M[m,n]
        end
    end
end
which gave
0.442521 seconds (50.00 k allocations: 1.491 GiB, 70.81% gc time)
So, if I wrote this in C++ or Fortran it would be simple to do all of this in-place. Is this impossible in Julia? Or am I missing something here...?
It is possible to do it in place like this:
function ws(W, M)
    res = zeros(100,100)
    for m=1:100
        for n=1:100
            @. res += W[m,n] * M[m,n]
        end
    end
    return res
end
and the timing is:
julia> @time ws(W, M);
0.100328 seconds (2 allocations: 78.172 KiB)
Note that in order to perform this operation in-place I used broadcasting (I could also use loops, but it would be the same).
The problem with your code is that in the line:
res += W[m,n] * M[m,n]
you get two allocations: the multiplication W[m,n] * M[m,n] allocates a new matrix, and the addition res += ... allocates yet another matrix.
By using broadcasting with @. you perform the operation in place; see https://docs.julialang.org/en/v1/manual/mathematical-operations/#man-dot-operators for more explanation.
Additionally, note that I have wrapped the code inside a function. If you do not, then access to both W and M is type unstable, which also causes allocations; see https://docs.julialang.org/en/v1/manual/performance-tips/#Avoid-global-variables.
I'd like to add something to Bogumił's answer. The missing broadcast is the main problem, but in addition, the loop and the mapreduce variant differ in a fundamental semantic way.
The purpose of mapreduce is to reduce by an associative operation with identity element init in an unspecified order. This in particular also includes the (theoretical) option of running parts in parallel and doesn't really play well with mutation. From the docs:
The associativity of the reduction is implementation-dependent. Additionally, some implementations may reuse the return value of f for elements that appear multiple times in itr. Use mapfoldl or
mapfoldr instead for guaranteed left or right associativity and invocation of f for every value.
and
It is unspecified whether init is used for non-empty collections.
What the loop variant really corresponds to is a fold, which has a well-defined order and initial (not necessarily identity) element and can thus use an in-place reduction operator:
Like reduce, but with guaranteed left associativity. If provided, the keyword argument init will be used exactly once.
julia> @benchmark foldl((acc, (m, w)) -> (@. acc += m * w), $work; init=$(zero(W)))
BenchmarkTools.Trial: 45 samples with 1 evaluation.
Range (min … max): 109.967 ms … 118.251 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 112.639 ms ┊ GC (median): 0.00%
Time (mean ± σ): 112.862 ms ± 1.154 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▃█ ▁▄▃
▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄███▆███▄▁▄▁▁▄▁▁▄▁▁▁▁▁▄▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
110 ms Histogram: frequency by time 118 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark mapreduce(Base.splat(*), +, $work)
BenchmarkTools.Trial: 12 samples with 1 evaluation.
Range (min … max): 403.100 ms … 458.882 ms ┊ GC (min … max): 4.53% … 3.89%
Time (median): 445.058 ms ┊ GC (median): 4.04%
Time (mean ± σ): 440.042 ms ± 16.792 ms ┊ GC (mean ± σ): 4.21% ± 0.92%
▁ ▁ ▁ ▁ ▁ ▁ ▁▁▁ █ ▁
█▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁▁█▁█▁▁▁▁███▁▁▁▁▁█▁▁▁█ ▁
403 ms Histogram: frequency by time 459 ms <
Memory estimate: 1.49 GiB, allocs estimate: 39998.
Think of it this way: if you were to write the function as a parallel for loop with a (+) reduction, iteration would also have an unspecified order, and you'd have memory overhead for the necessary copying of the individual results to the accumulating thread.
Thus, there is a trade-off. In your example, allocation/copying dominates. In other cases, the mapped operation might dominate, and a parallel reduction (with unspecified order, but copying overhead) might be worth it.
Assume I have two pyomo sets A and B, which contain the following elements:
m.A = {1, 2, 3, 4, 5}
m.B = {'a', 'b', 'c', 'd', 5}
I want to check if A has some elements which are also in B.
EDIT:
Well, the following does not work:
if m.A & m.B is not None:
    raise ValueError
At least in my case, where m.A = [None] and m.B = ['some_string'], the if statement is still triggered, but bool(m.A & m.B) works.
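For reference, the reason the is-not-None test always fires: the intersection of two sets is itself a set (possibly empty), and a set object is never None, whereas bool() tests whether it is empty. A minimal sketch with plain Python sets (pyomo sets should behave analogously here):
a = {1, 2, 3}
b = {4, 5}
inter = a & b              # empty set, but still a set object
print(inter is not None)   # True: an empty set is not None
print(bool(inter))         # False: an empty set is falsy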
The most compact way you could achieve this is by using the & operator:
a = {1,2,3,4}
b = {4,5,6}
result = bool(a & b)
Speed comparison
Using the & operator:
%timeit bool(a & b)
297 ns ± 3.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Using the intersection method:
%timeit bool(a.intersection(b))
365 ns ± 27.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The two solutions are pretty similar; the second one most probably incurs overhead from the method call.
You are looking for intersection:
>>> A = {1,2,3,4,5}
>>> B = {'a','b','c','d',5}
>>> A.intersection(B)
set([5])
I've got a look-up problem that boils down to the following situation.
Three columns with positive integers. For some value i, which values in 'column_3' have a value in 'column_1' below i and a value in 'column_2' above i?
import numpy as np
rows = int(1e6)
i = 5e8
ts = np.zeros((rows,), dtype=[('column_1','int64'),('column_2','int64'),('column_3','int64')])
ts['column_1'] = np.random.randint(low=0,high=1e9,size=rows)
ts['column_2'] = np.random.randint(low=0,high=1e9,size=rows)
ts['column_3'] = np.random.randint(low=0,high=1e9,size=rows)
This is the operation I'd like to optimize:
%%timeit
a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
Is there anything I'm overlooking that could make this faster?
I'd be grateful for any advice!
Assigning your 3 arrays to A,B,C at creation as well:
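The setup being timed below is not shown; a plausible reconstruction that continues from the question's code (the names A, B, C and the creation order are my assumption) would be:
# Create the three columns as plain 1-D arrays, then copy them into the
# structured array so both layouts hold the same data for the timings.
A = np.random.randint(low=0, high=int(1e9), size=rows)
B = np.random.randint(low=0, high=int(1e9), size=rows)
C = np.random.randint(low=0, high=int(1e9), size=rows)
ts['column_1'] = A
ts['column_2'] = B
ts['column_3'] = C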
In [3]: %%timeit
...: a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
...:
22.5 ms ± 838 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %%timeit
...: a = C[(A < i)&(B > i)]
...:
...:
9.36 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using a,b,c = ts['column_1'],ts['column_2'],ts['column_3'] instead falls in between.
Those are variants and timings you can play with. As far as I can see, there are just minor differences due to indexing overhead, nothing like an order-of-magnitude difference.
I am aware that this may seem like a vague question, but I wonder if (e.g. in Python) it is preferable to store multiple values in separate variables, or to store them in logical groups (lists, arrays...).
In my particular case I am supposed to translate Matlab code into Python 2.7. It is a physically based model that digests 8 input variables and creates two large lists as output. I found that the original model has a huge number of variables that are calculated along the way (>100). As a rule of thumb: if one calculation is accessed more than once, it is stored in a new variable. A demonstrative example:
x = 3
y = 5
x2 = x**2
z = x2 + exp(y)
zz = (y+x)/x2
x**2 is used twice (for the calculation of z and zz), so it is stored as x2. Is this really faster than letting Python calculate x**2 twice? Also, would it be faster if I stored the values in lists? Like this:
x = [3, 5]
z = x[0]**2 + exp(x[1])
zz = sum(x)/x[0]**2
The organisation of variables into lists may come at the expense of code readability, but I would gladly accept that if it makes my code run faster.
There is no performance advantage I can see to keeping it in a list. On the contrary, putting it in a list makes it slower:
>>> %%timeit
...: x = 3
...: y = 5
...: x2 = x**2
...: z = x2 + exp(y)
...: zz = (y+x)/x2
...:
337 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %%timeit
...: x = [3, 5]
...: z = x[0]**2 + exp(x[1])
...: zz = sum(x)/x[0]**2
...:
716 ns ± 4.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now part of that is because you are calculating x**2 twice in the list version, but even fixing that issue doesn't make the list version faster:
>>> %%timeit
...: x = [3, 5]
...: x0 = x[0]**2
...: z = x0 + exp(x[1])
...: zz = sum(x)/x0
...:
502 ns ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you care about performance, another big issue is that you are defining ints and then converting them to floats. In MATLAB, x = 5 makes a float, while in Python it makes an integer. It is much faster to do everything with floats from the beginning, which you can do by just putting a . or .0 at the end of the number:
>>> %%timeit
...: x = 3.0
...: y = 5.0
...: x2 = x**2.0
...: z = x2 + exp(y)
...: zz = (y+x)/x2
...:
166 ns ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
If you were to use numpy arrays rather than lists, it is even worse, because you start with a list of floats, then have to convert both the numbers and the list, and then convert them back, all of which is slow:
>>> %%timeit
...: x = np.array([3., 5.])
...: x0 = x[0]**2.
...: z = x0 + np.exp(x[1])
...: zz = x.sum()/x0
...:
3.22 µs ± 8.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a general rule, avoid type conversions wherever possible, and avoid indexing when it doesn't help readability. If you have a bunch of values, then converting to numpy is useful. But for just two or three values it will hurt both speed and readability.
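As a rough illustration of the "bunch of values" case (a sketch with made-up sizes, not taken from the question): once there are many elements, a single vectorized NumPy expression amortizes the conversion overhead that hurts in the two-element example above.
import numpy as np
# With a million elements the arithmetic runs in a few C-level passes,
# instead of paying Python interpreter overhead per element.
x = np.random.rand(1000000)
y = np.random.rand(1000000)
z = x**2 + np.exp(y)
zz = (y + x) / x**2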