So here's my question: I have a big 3-dimensional array, 100 GB in size as a zarr file (uncompressed, the array is more than twice that size). I have tried using the histogram from Dask to calculate it, but I get an error saying it can't do it because the file has tuples within tuples. I'm guessing that's the zarr file format rather than anything else?
Any thoughts?
Edit:
Yes, the bigger-computer thing wouldn't actually work...
I'm running a dask client on a single machine; it runs the calculation but just gets stuck somewhere.
I just tried mapping a function across the file with dask, but when I plot the result out I get something like this:
ValueError: setting an array element with a sequence.
Here's a version of the script:
import dask.array as da

def histo(img):
    return da.histogram(img, bins=255, range=[0, 255])

histo_1 = da.map_blocks(histo, fimg)
I am actually going to try to use it outside of the map function. I wonder whether, rather than the map function itself, the windowing from map_blocks actually causes the issue. Well, I'll let you know if it does or not...
Edit 2:
So I tried removing the map_blocks function as suggested, and this was my result:
[in] h, bins = da.histogram(fused_crop, bins=255, range=[0, 255])
[in] bins
[out] array([  0.,   1.,   2., ..., 253., 254., 255.])
[in] h.compute
[out] <bound method DaskMethodsMixin.compute of dask.array<sum-aggregate, shape=(255,), dtype=int64, chunksize=(255,), chunktype=numpy.ndarray>>
I'm going to try in another notebook and see if it still occurs.
Edit 3:
It's the strangest thing, but if I just declare the variable h, it comes out as one small element from the dask array?
Edit 4:
Strange: if I call the xarray histogram or the da.histogram function, they both fall over. If I use skimage.exposure.histogram it works, but it appears that the zarr file is unpacked before the histogram is calculated, which is a bit of a problem...
Update 7th June 2020 (with a solution for not big but annoyingly medium data): see below for the answer.
You probably want to use dask's function for this rather than map_blocks. For the latter, Dask expects the output of each call to be the same size as the input block, or a shape derived from the input block, instead of the one-dimensional fixed-size output of histogram.
h, bins = da.histogram(fused_crop, bins=255, range=[0, 255])
h.compute()
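For reference, if you did want to build the histogram chunk by chunk yourself, each call would have to return a fixed-size counts array that you then sum. A minimal sketch, using a random stand-in array instead of the real zarr data (shape and chunks are assumptions):
import numpy as np
import dask
import dask.array as da

# Random stand-in for the real zarr-backed array.
fimg = da.random.randint(0, 256, size=(64, 512, 512), chunks=(16, 256, 256))

# One fixed-size histogram per chunk, built lazily; np.histogram returns
# (counts, edges), so keep only the counts from each call.
partials = [
    dask.delayed(np.histogram)(blk, bins=255, range=(0, 255))
    for blk in fimg.to_delayed().ravel()
]
h = dask.delayed(sum)([p[0] for p in partials]).compute()  # shape (255,)
da.histogram effectively performs this per-block-then-sum reduction for you, which is why the one-liner above is the better route.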
Update 7th June 2020 (with a solution for not big but annoyingly medium data):
So unfortunately I got a bit ill around this time and it took a while for me to feel better. Then the pandemic happened and I was on full-time childcare duty. I tried lots of different options, and ultimately it came down to the following:
1) If just using x.compute(), the memory would very quickly fill up.
2) Using distributed would fill the hard drive with spill-to-disk and take hours, but it would hang and crash rather than finish, because (I'm guessing here, but based on the graph and the dask API) it would create a sub-histogram array for every chunk, and all of those would need to be merged at some point.
3) The chunking of my data was suboptimal, so the number of tasks was massive, but even when I improved the chunking I couldn't compute a histogram.
In the end I looked for a dynamic way of updating the histogram data, so I used Zarr, computing directly to it, since it allows concurrent reads and writes. As a reminder: my data is a zarr array in 3 dims (x, y, z), 300 GB uncompressed but about 100 GB compressed. On my 4-year-old laptop with 16 GB of RAM, the following worked (I should have said earlier that my data is 16-bit unsigned):
imgs = da.from_zarr(.....)  # the zarr array on disk
imgs2 = imgs.rechunk((a, b, c))  # individual chunk size per dimension
h, bins = da.histogram(imgs2, bins=255, range=[0, 65535])  # 255 bins over the 16-bit range
h_out = da.to_zarr(h, "histogram.zarr")
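For anyone wondering, the progress bar mentioned next is dask's local diagnostics one; roughly, it wraps the store step like this (a sketch, assuming the default single-machine scheduler):
from dask.diagnostics import ProgressBar

# The bar prints as da.to_zarr computes and stores the result.
with ProgressBar():
    da.to_zarr(h, "histogram.zarr")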
I ran the progress bar alongside the process, and getting a histogram from file took:
[########################################] | 100% Completed | 18min 47.3s
Which I don't think is too bad for a 300 GB array. Hopefully this helps someone else as well; thanks for the help earlier in the year, @mdurant.
I am generating random healpix maps from an input angular power spectrum Cl. If I use healpy.synalm, then healpy.alm2map, and finally test the map by running healpy.anafast on the generated map, the output and input power spectra do not agree; especially at high l, the output power spectrum lies above the input (see the plot produced by the code below). If I directly use healpy.synfast, I get an output power spectrum that agrees with the input. The same applies if I take the alms from healpy.synfast and generate a map from them using healpy.alm2map. When I look into the source code of synfast, it seems to just call synalm and alm2map, so I don't understand why their results disagree. My test code looks like this:
import numpy as np
import matplotlib.pyplot as plt
import classy
import healpy as hp
NSIDE = 32
A_s=2.3e-9
n_s=0.9624
h=0.6711
omega_b=0.022068
omega_cdm=0.12029
params = {
    'output': 'dCl, mPk',
    'A_s': A_s,
    'n_s': n_s,
    'h': h,
    'omega_b': omega_b,
    'omega_cdm': omega_cdm,
    'selection_mean': '0.55',
    'selection_magnification_bias_analytic': 'yes',
    'selection_bias_analytic': 'yes',
    'selection_dNdz_evolution_analytic': 'yes'}
cosmo = classy.Class()
cosmo.set(params)
cosmo.compute()
theory_cl=cosmo.density_cl()['dd']
data_map,data_alm=hp.synfast(theory_cl[0],NSIDE,alm=True)
data_cl=hp.anafast(data_map)
plt.plot(np.arange(len(data_cl)),data_cl,label="synfast")
data_map=hp.alm2map(data_alm,NSIDE)
data_cl=hp.anafast(data_map)
plt.plot(np.arange(len(data_cl)),data_cl,label="synfast using alm")
data_alm=hp.synalm(theory_cl[0])
data_map=hp.alm2map(data_alm,NSIDE)
data_cl=hp.anafast(data_map)
plt.plot(np.arange(len(data_cl)),data_cl,label="synalm")
plt.plot(np.arange(min(len(data_cl),len(theory_cl[0]))),theory_cl[0][:len(data_cl)],label="Theory")
plt.xlabel(r'$\ell$')
plt.ylabel(r'$C_\ell$')
plt.legend()
plt.show()
The offset becomes larger for lower NSIDE.
Thank you very much for your help.
Sorry, I missed that synfast knows about NSIDE, so its lmax is by default based on NSIDE, whereas synalm does not know about it and so takes the maximum l of the input spectrum as lmax. Setting lmax to 3 * NSIDE - 1 in synalm resolves the discrepancy.
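For concreteness, a minimal sketch of the fix using the variables from the code above:
# synfast defaults to lmax = 3*NSIDE - 1; giving synalm the same band
# limit makes both paths generate alms up to the same multipole.
lmax = 3 * NSIDE - 1
data_alm = hp.synalm(theory_cl[0], lmax=lmax)
data_map = hp.alm2map(data_alm, NSIDE)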
I want to create all possible combinations of 4 chemical elements out of a list of 9 and use them to create folders named after these combinations.
The desired list looks something like this:
{Cr, Hf, Mo, Nb, Ta, Ti, V, W, Zr}
What I want to get out of it, would be:
CrHfMoNb
CrHfMoTa
CrHfMoTi
CrHfMoV
...
TiVWZr
and so on for all 126 possible combinations, stored in a list or something similar so that I can use it as input for creating the folders. The combinations should be ordered alphabetically, so that Hf always comes after Cr and before Ti, for example.
I can use both Bash and Python; I prefer the simpler method. If the method could easily be adapted to a different number, like combinations of 5, that would be a big plus.
Python has "itertools" which includes a function to perform these kind of combinations for you.
combinations('ABCD', 2) returns AB AC AD BC BD CD
so you could do something like...
#!/usr/bin/python3.5
import itertools

output = []
for i in itertools.combinations(['Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr'], 4):
    output.append("".join(i))
print(sorted(output))
Which would produce all 126 combinations and sort them for you.
['CrHfMoNb', 'CrHfMoTa', 'CrHfMoTi', 'CrHfMoV', 'CrHfMoW', 'CrHfMoZr', 'CrHfNbTa', 'CrHfNbTi', 'CrHfNbV', 'CrHfNbW', 'CrHfNbZr', 'CrHfTaTi', 'CrHfTaV', 'CrHfTaW', 'CrHfTaZr', 'CrHfTiV', 'CrHfTiW', 'CrHfTiZr', 'CrHfVW', 'CrHfVZr', 'CrHfWZr', 'CrMoNbTa', 'CrMoNbTi', 'CrMoNbV', 'CrMoNbW', 'CrMoNbZr', 'CrMoTaTi', 'CrMoTaV', 'CrMoTaW', 'CrMoTaZr', 'CrMoTiV', 'CrMoTiW', 'CrMoTiZr', 'CrMoVW', 'CrMoVZr', 'CrMoWZr', 'CrNbTaTi', 'CrNbTaV', 'CrNbTaW', 'CrNbTaZr', 'CrNbTiV', 'CrNbTiW', 'CrNbTiZr', 'CrNbVW', 'CrNbVZr', 'CrNbWZr', 'CrTaTiV', 'CrTaTiW', 'CrTaTiZr', 'CrTaVW', 'CrTaVZr', 'CrTaWZr', 'CrTiVW', 'CrTiVZr', 'CrTiWZr', 'CrVWZr', 'HfMoNbTa', 'HfMoNbTi', 'HfMoNbV', 'HfMoNbW', 'HfMoNbZr', 'HfMoTaTi', 'HfMoTaV', 'HfMoTaW', 'HfMoTaZr', 'HfMoTiV', 'HfMoTiW', 'HfMoTiZr', 'HfMoVW', 'HfMoVZr', 'HfMoWZr', 'HfNbTaTi', 'HfNbTaV', 'HfNbTaW', 'HfNbTaZr', 'HfNbTiV', 'HfNbTiW', 'HfNbTiZr', 'HfNbVW', 'HfNbVZr', 'HfNbWZr', 'HfTaTiV', 'HfTaTiW', 'HfTaTiZr', 'HfTaVW', 'HfTaVZr', 'HfTaWZr', 'HfTiVW', 'HfTiVZr', 'HfTiWZr', 'HfVWZr', 'MoNbTaTi', 'MoNbTaV', 'MoNbTaW', 'MoNbTaZr', 'MoNbTiV', 'MoNbTiW', 'MoNbTiZr', 'MoNbVW', 'MoNbVZr', 'MoNbWZr', 'MoTaTiV', 'MoTaTiW', 'MoTaTiZr', 'MoTaVW', 'MoTaVZr', 'MoTaWZr', 'MoTiVW', 'MoTiVZr', 'MoTiWZr', 'MoVWZr', 'NbTaTiV', 'NbTaTiW', 'NbTaTiZr', 'NbTaVW', 'NbTaVZr', 'NbTaWZr', 'NbTiVW', 'NbTiVZr', 'NbTiWZr', 'NbVWZr', 'TaTiVW', 'TaTiVZr', 'TaTiWZr', 'TaVWZr', 'TiVWZr']
If you want them "neatly" just use...
#!/usr/bin/python3.5
import itertools

output = []
for i in itertools.combinations(['Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr'], 4):
    output.append("".join(i))
while output:
    print(output.pop(0))
which gives...
CrHfMoNb CrHfMoTa CrHfMoTi CrHfMoV CrHfMoW CrHfMoZr CrHfNbTa CrHfNbTi
CrHfNbV CrHfNbW CrHfNbZr CrHfTaTi CrHfTaV CrHfTaW CrHfTaZr CrHfTiV
CrHfTiW CrHfTiZr CrHfVW CrHfVZr CrHfWZr CrMoNbTa CrMoNbTi CrMoNbV
CrMoNbW CrMoNbZr CrMoTaTi CrMoTaV CrMoTaW CrMoTaZr CrMoTiV CrMoTiW
CrMoTiZr CrMoVW CrMoVZr CrMoWZr CrNbTaTi CrNbTaV CrNbTaW CrNbTaZr
CrNbTiV CrNbTiW CrNbTiZr CrNbVW CrNbVZr CrNbWZr CrTaTiV CrTaTiW
CrTaTiZr CrTaVW CrTaVZr CrTaWZr CrTiVW CrTiVZr CrTiWZr CrVWZr HfMoNbTa
HfMoNbTi HfMoNbV HfMoNbW HfMoNbZr HfMoTaTi HfMoTaV HfMoTaW HfMoTaZr
HfMoTiV HfMoTiW HfMoTiZr HfMoVW HfMoVZr HfMoWZr HfNbTaTi HfNbTaV
HfNbTaW HfNbTaZr HfNbTiV HfNbTiW HfNbTiZr HfNbVW HfNbVZr HfNbWZr
HfTaTiV HfTaTiW HfTaTiZr HfTaVW HfTaVZr HfTaWZr HfTiVW HfTiVZr HfTiWZr
HfVWZr MoNbTaTi MoNbTaV MoNbTaW MoNbTaZr MoNbTiV MoNbTiW MoNbTiZr
MoNbVW MoNbVZr MoNbWZr MoTaTiV MoTaTiW MoTaTiZr MoTaVW MoTaVZr MoTaWZr
MoTiVW MoTiVZr MoTiWZr MoVWZr NbTaTiV NbTaTiW NbTaTiZr NbTaVW NbTaVZr
NbTaWZr NbTiVW NbTiVZr NbTiWZr NbVWZr TaTiVW TaTiVZr TaTiWZr TaVWZr
TiVWZr
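Finally, since the end goal is creating folders named after these combinations, here is a sketch that also makes the directories and lets you change the combination size (the elements and size names are my own):
#!/usr/bin/python3.5
import itertools
import os

elements = ['Cr', 'Hf', 'Mo', 'Nb', 'Ta', 'Ti', 'V', 'W', 'Zr']
size = 4  # change to 5 for combinations of 5 elements

for combo in itertools.combinations(elements, size):
    os.makedirs("".join(combo), exist_ok=True)  # one folder per combination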
Not as short as the Python method, but still straightforward: create a shell array and cycle through it with four for loops, yielding the desired 126 lines:
ELARR=(Cr Hf Mo Nb Ta Ti V W Zr)
for ((i=0; i<${#ELARR[@]}; i++))
do for ((j=i+1; j<${#ELARR[@]}; j++))
do for ((k=j+1; k<${#ELARR[@]}; k++))
do for ((l=k+1; l<${#ELARR[@]}; l++))
do echo ${ELARR[i]}${ELARR[j]}${ELARR[k]}${ELARR[l]}
done
done
done
done
CrHfMoNb
CrHfMoTa
CrHfMoTi
CrHfMoV
CrHfMoW
.
.
.
TaTiWZr
TaVWZr
TiVWZr
It would be even shorter if you assigned the array's element count to a variable and used that, and mayhap used a shorter array name... To actually create the folders, replace the echo with mkdir -p on the same string.
I have a netcdf file with data as a function of lon, lat and time. I would like to calculate the total number of missing entries in each grid cell summed over the time dimension, preferably with CDO or NCO so I do not need to invoke R, Python etc.
I know how to get the total number of missing values
ncap2 -s "nmiss=var.number_miss()" in.nc out.nc
as I answered in this related question:
count number of missing values in netcdf file - R
and CDO can tell me the total summed over space with
cdo info in.nc
but I can't work out how to sum over time. Is there a way for example of specifying the dimension to sum over with number_miss in ncap2?
We added the missing() function to ncap2 to solve this problem elegantly as of NCO 4.6.7 (May 2017). To count missing values through time:
ncap2 -s 'mss_val=three_dmn_var_dbl.missing().ttl($time)' in.nc out.nc
Here ncap2 chains two methods together: missing(), followed by a total over the time dimension. The 2D variable mss_val is in out.nc. The response below does the same but sums over space and reports through time (because I misinterpreted the OP).
Old/obsolete answer:
There are two ways to do this with NCO/ncap2, though neither is as elegant as I would like. Either assemble the answer one record at a time by calling number_miss() on each record, or (my preference) use the boolean comparison function followed by the total operator along the axes of choice:
zender@aerosol:~$ ncap2 -O -s 'tmp=three_dmn_var_dbl;mss_val=tmp.get_miss();tmp.delete_miss();tmp_bool=(tmp==mss_val);tmp_bool_ttl=tmp_bool.ttl($lon,$lat);print(tmp_bool_ttl);' ~/nco/data/in.nc ~/foo.nc
tmp_bool_ttl[0]=0
tmp_bool_ttl[1]=0
tmp_bool_ttl[2]=0
tmp_bool_ttl[3]=8
tmp_bool_ttl[4]=0
tmp_bool_ttl[5]=0
tmp_bool_ttl[6]=0
tmp_bool_ttl[7]=1
tmp_bool_ttl[8]=0
tmp_bool_ttl[9]=2
or
zender@aerosol:~$ ncap2 -O -s 'for(rec=0;rec<time.size();rec++){nmiss=three_dmn_var_int(rec,:,:).number_miss();print(nmiss);}' ~/nco/data/in.nc ~/foo.nc
nmiss = 0
nmiss = 0
nmiss = 8
nmiss = 0
nmiss = 0
nmiss = 1
nmiss = 0
nmiss = 2
nmiss = 1
nmiss = 2
Even though you are asking for a non-Python solution, I would like to show you that it takes only one very short line to find the answer with the help of Python. The variable m_data has exactly the same shape as a variable with missing values read using the netCDF4 package. With the execution of a single np.sum command with the correct axis specified, you have your answer.
import numpy as np
import matplotlib.pyplot as plt
import netCDF4 as nc4
# Generate random data for this experiment.
data = np.random.rand(365, 64, 128)
# Masked data, this is how the data is read from NetCDF by the netCDF4 package.
# For this example, I mask all values less than 0.1.
m_data = np.ma.masked_array(data, mask=data<0.1)
# It only takes one operation to find the answer.
n_values_missing = np.sum(m_data.mask, axis=0)
# Just a plot of the result.
plt.figure()
plt.pcolormesh(n_values_missing)
plt.colorbar()
plt.xlabel('lon')
plt.ylabel('lat')
plt.show()
# Save a netCDF file of the results.
f = nc4.Dataset('test.nc', 'w', format='NETCDF4')
f.createDimension('lon', 128)
f.createDimension('lat', 64 )
n_values_missing_nc = f.createVariable('n_values_missing', 'i4', ('lat', 'lon'))
n_values_missing_nc[:,:] = n_values_missing[:,:]
f.close()
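With real data, m_data would come straight from the file instead of being generated. A sketch with hypothetical file and variable names ('in.nc', 'var'):
import numpy as np
import netCDF4 as nc4

# Hypothetical names; netCDF4 returns a masked array automatically when the
# variable defines a _FillValue or missing_value attribute.
with nc4.Dataset('in.nc') as f:
    m_data = f.variables['var'][:]  # masked array, shape (time, lat, lon)

# getmaskarray also handles the case where nothing is masked at all.
n_values_missing = np.sum(np.ma.getmaskarray(m_data), axis=0)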