Assume I have two Pyomo sets A and B containing the following elements:
m.A = {1, 2, 3, 4, 5}
m.B = {'a', 'b', 'c', 'd', 5}
I want to check if A has any elements that are also in B.
EDIT:
Well, the following does not work:
if m.A & m.B is not None:
    raise ValueError
At least in my case, when m.A = [None] and m.B = ['some_string'], the if statement is still triggered, but bool(m.A & m.B) works.
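Presumably that is because m.A & m.B returns a set, and even an empty set is not None, so the comparison is always true. A minimal illustration with plain Python sets (not Pyomo objects):
a = {1, 2}
b = {3, 4}
print((a & b) is not None)  # True: the empty intersection is still a set, not None
print(bool(a & b))          # False: an empty set is falsy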
The most compact way you could achieve this is by using the & operator:
a = {1,2,3,4}
b = {4,5,6}
result = bool(a & b)
Speed comparison
Using the & operator:
%timeit bool(a & b)
297 ns ± 3.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Using the intersection method:
%timeit bool(a.intersection(b))
365 ns ± 27.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The two solutions perform pretty similarly; the second one most likely pays some overhead for the method lookup and call.
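If the difference really is call overhead, pre-binding the method should narrow the gap (a quick sketch, not timed above):
intersect = a.intersection  # look up the bound method once
%timeit bool(intersect(b))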
You are looking for intersection:
>>> A = {1,2,3,4,5}
>>> B = {'a','b','c','d',5}
>>> A.intersection(B)
set([5])
I have the below dummy dataframe:
import pandas
tab_size = 300000
tab_de_merde = [[i*1, 0, i*3, [i*7 % 3, i*11 % 3]] for i in range(tab_size)]
colnames = ['Id', 'Date', 'Account', 'Value']
indexnames = ['Id']
df = pandas.DataFrame(tab_de_merde, columns=colnames).set_index(indexnames)
And I want to check if the column "Value" contains a 0.
I've tried 3 different solutions and was wondering whether the third one (Python vectorization) is implemented correctly, since it doesn't seem to speed the code up at all.
%timeit df[[(0 in x) for x in df['Value'].values]]
#108 ms ± 418 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df[df['Value'].apply(lambda x: 0 in x)]
#86.2 ms ± 649 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def list_contains_element(np_array): return [(0 in x) for x in np_array]
%timeit df[list_contains_element(df['Value'].values)]
#106 ms ± 807 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I would be very glad if someone could help me better understand how to get more speed out of vectorized manipulation.
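For context, the closest thing to real vectorization I could come up with is stacking the lists into a 2D array once and testing membership there (just a sketch, assuming every entry of 'Value' has the same length, as in the dummy data above):
import numpy as np
vals = np.array(df['Value'].tolist())  # shape (300000, 2) integer array
mask = (vals == 0).any(axis=1)         # vectorized "contains 0" test per row
result = df[mask]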
After seeing a couple of tutorials on the internet about Julia parallelism, I decided to implement a small parallel snippet for computing the harmonic series.
The serial code is:
harmonic = function (n::Int64)
    x = 0
    for i in n:-1:1 # summing backwards to avoid rounding errors
        x += 1/i
    end
    x
end
And I made two parallel versions, one using the @distributed macro and another using the @everywhere macro (running julia -p 2, by the way):
@everywhere harmonic_ever = function (n::Int64)
    x = 0
    for i in n:-1:1
        x += 1/i
    end
    x
end

harmonic_distr = function (n::Int64)
    x = @distributed (+) for i in n:-1:1
        x = 1/i
    end
    x
end
However, when I run the above code and @time it, I don't get any speedup - in fact, the @distributed version runs significantly slower!
@time harmonic(10^10)
>>> 53.960678 seconds (29.10 k allocations: 1.553 MiB) 23.60306659488827
job = @spawn harmonic_ever(10^10)
@time fetch(job)
>>> 46.729251 seconds (309.01 k allocations: 15.737 MiB) 23.60306659488827
@time harmonic_distr(10^10)
>>> 143.105701 seconds (1.25 M allocations: 63.564 MiB, 0.04% gc time) 23.603066594889185
What completely and absolutely baffles me is the "0.04% gc time". I'm clearly missing something, and the examples I saw weren't for version 1.0.1 either (one, for example, used @parallel).
Your distributed version should be:
function harmonic_distr2(n::Int64)
    x = @distributed (+) for i in n:-1:1
        1/i # no x assignment here
    end
    x
end
The @distributed (+) loop accumulates the values of 1/i on every worker and then combines the partial sums on the master process.
Note that it is also generally better to use BenchmarkTools' @btime macro instead of @time for benchmarking:
julia> using Distributed; addprocs(4);
julia> @btime harmonic(1_000_000_000); # serial
1.601 s (1 allocation: 16 bytes)
julia> @btime harmonic_distr2(1_000_000_000); # parallel
754.058 ms (399 allocations: 36.63 KiB)
julia> @btime harmonic_distr(1_000_000_000); # your old parallel version
4.289 s (411 allocations: 37.13 KiB)
The parallel version is, of course, slower if run only on one process:
julia> rmprocs(workers())
Task (done) @0x0000000006fb73d0
julia> nprocs()
1
julia> @btime harmonic_distr2(1_000_000_000); # (not really) parallel
1.879 s (34 allocations: 2.00 KiB)
I've got a look-up problem that boils down to the following situation.
Three columns of positive integers. For some value i, which values in 'column_3' sit in rows where 'column_1' is below i and 'column_2' is above i?
import numpy as np
rows = int(1e6)  # must be an integer to be used as an array shape/size
i = 5e8
ts = np.zeros((rows,), dtype=[('column_1','int64'),('column_2','int64'),('column_3','int64')])
ts['column_1'] = np.random.randint(low=0,high=1e9,size=rows)
ts['column_2'] = np.random.randint(low=0,high=1e9,size=rows)
ts['column_3'] = np.random.randint(low=0,high=1e9,size=rows)
This is the operation I'd like to optimize:
%%timeit
a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
Is there anything I'm overlooking that could make this faster?
I would be grateful for any advice!
Assigning your 3 arrays to A,B,C at creation as well:
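Presumably something along these lines (the assignment itself is not shown, so this is a hypothetical reconstruction using the same random data):
A = np.random.randint(low=0, high=int(1e9), size=rows)
B = np.random.randint(low=0, high=int(1e9), size=rows)
C = np.random.randint(low=0, high=int(1e9), size=rows)
ts['column_1'], ts['column_2'], ts['column_3'] = A, B, C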
In [3]: %%timeit
...: a = ts[(ts['column_1'] < i)&(ts['column_2'] > i)]['column_3']
...:
22.5 ms ± 838 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %%timeit
...: a = C[(A < i)&(B > i)]
...:
...:
9.36 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using a,b,c = ts['column_1'],ts['column_2'],ts['column_3'] instead falls in between.
Those are variants and timings you can play with. As far as I can see, these are just minor differences due to indexing overhead, nothing like an order-of-magnitude difference.
I am aware that this may seem like a vague question, but I wonder if (e.g. in Python) it is preferable to store multiple values in separate variables, or to store them in logical groups (lists, arrays...).
In my specific case I am supposed to translate Matlab code into Python 2.7. It is a physically based model that digests 8 input variables and creates two large lists as output. I found that the original model has a huge number of variables that are calculated along the way (>100). As a rule of thumb: if one calculation is used more than once, it is stored in a new variable. A demonstrative example:
from math import exp  # needed for exp() below
x = 3
y = 5
x2 = x**2
z = x2 + exp(y)
zz = (y+x)/x2
x**2 is used twice (in the calculation of both z and zz), so it is stored as x2. Is this really faster than letting Python calculate x**2 twice? Also, would it be faster if I stored the values in a list, like this:
x = [3, 5]
z = x[0]**2 + exp(x[1])
zz = sum(x)/x[0]**2
The organisation of variables in lists may come at the expense of readability of the code, but I would gladly take that if it makes my code run faster.
There is no performance advantage I can see to keeping it in a list. On the contrary, putting it in a list makes it slower:
>>> %%timeit
...: x = 3
...: y = 5
...: x2 = x**2
...: z = x2 + exp(y)
...: zz = (y+x)/x2
...:
337 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> %%timeit
...: x = [3, 5]
...: z = x[0]**2 + exp(x[1])
...: zz = sum(x)/x[0]**2
...:
716 ns ± 4.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now part of that is because you are calculating x[0]**2 twice in the list version, but even fixing that issue doesn't make the list version faster:
>>> %%timeit
...: x = [3, 5]
...: x0 = x[0]**2
...: z = x0 + exp(x[1])
...: zz = sum(x)/x0
...:
502 ns ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you care about performance, another big issue is that you are defining ints and then converting them to floats. In MATLAB, x = 5 makes a float, while in Python it makes an integer. It is much faster to do everything with floats from the beginning, which you can do by just putting a . or .0 at the end of the number:
>>> %%timeit
...: x = 3.0
...: y = 5.0
...: x2 = x**2.0
...: z = x2 + exp(y)
...: zz = (y+x)/x2
...:
166 ns ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
If you were to use numpy arrays rather than lists, it is even worse, because you start with a list of floats, have to convert both the numbers and the list to numpy types, and then convert the results back, all of which is slow:
>>> %%timeit
...: x = np.array([3., 5.])
...: x0 = x[0]**2.
...: z = x0 + np.exp(x[1])
...: zz = x.sum()/x0
...:
3.22 µs ± 8.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As a general rule, avoid doing type conversions wherever possible, and avoid indexing when it doesn't help readability. If you have a bunch of values, then the conversion to numpy is useful. But for just two or three it is going to hurt speed and readability.
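To make the last point concrete: once there are many values, the one-time conversion cost is amortised and NumPy's vectorised arithmetic wins. A rough sketch of how one could check this (illustrative only, not timed above; the array size is arbitrary):
>>> import numpy as np
>>> values = [float(v) for v in range(100000)]
>>> %timeit [v**2.0 + 1.0 for v in values]   # pure-Python loop over a list
>>> arr = np.array(values)
>>> %timeit arr**2.0 + 1.0                   # single vectorised NumPy expression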
I need to draw random numbers following a distribution I chose.
Example: draw 7 numbers from 1 to 7 with those probabilities:
1: 0.3
2: 0.2
3: 0.15
4: 0.15
5: 0.1
6: 0.05
7: 0.05
Since in my actual application I potentially need to draw 1000 numbers, I need this to be as efficient as possible (ideally linear).
I know there is a function in MATLAB that draws random numbers from a normal distribution; is there any way to adapt it?
I think you can use randsample from the Statistics Toolbox too, as referenced here.
% Replace 7 with 1000 for the original problem
OUT = randsample([1:7], 7, true, [0.3 0.2 0.15 0.15 0.1 0.05 0.05])
numbers = 1:7;
probs = [.3 .2 .15 .15 .1 .05 .05];
N = 1000; % how many random numbers you want
cumProbs = cumsum(probs(:)); % will be used as thresholds
r = rand(1,N); % random numbers between 0 and 1
output = sum(bsxfun(@ge, r, cumProbs)) + 1; % how many thresholds are exceeded
You can use gendist from the MATLAB File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/34101-random-numbers-from-a-discrete-distribution/content/gendist.m
This generates 1000 random numbers:
gendist([.3,.2,.15,.15,.1,.05,.05],1000,1)
If you do not have randsample, you can use histc like it does internally, just without all the fluff:
N = 100;
nums = 1:7;
p = [.3 .2 .15 .15 .1 .05 .05];
cdf = [0 cumsum(p(:).'/sum(p))]; cdf(end) = 1; % p is the pdf
[~, isamps] = histc(rand(N,1),cdf);
out = nums(isamps);