dask histogram from zarr file (a big zarr file)

dask histogram from zarr file (a big zarr file) - image

So heres my question, I have a big 3dim array which is 100GB in size as a #zarr file (the array is more than twice the size). I have tried using the histogram from #Dask to calculate but I get an error saying that it cant do it because the file has tuples within tuples. Im guessing thats the zarr file formate rather than anything else?
any thoughts?
edit:
yes the bigger computer thing wouldnt actually work...
Im running a dask client on a single machine, it runsthe calculation but just gets stuck somewhere.
I just tried dask.map function across the file but when I plot it out I get something like this:
ValueError: setting an array element with a sequence.
heres a version of the script:
def histo(img):
return da.histogram(img, bins=255, range=[0, 255])
histo_1 = da.map_blocks(histo, fimg)
I am actually going to try and use it out side of the map function. I wonder rather than the map funtion, does the windowing from map blocks, actually cause the issue. well, ill let you know if it is or now....
edit 2
So I tried to remove the map blocks function as suggested and this was my result:
[in] h, bins =da.histogram(fused_crop, bins=255, range=[0, 255])
[in] bins
[out] array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.,
22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32.,
33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54.,
55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65.,
66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76.,
77., 78., 79., 80., 81., 82., 83., 84., 85., 86., 87.,
88., 89., 90., 91., 92., 93., 94., 95., 96., 97., 98.,
99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
110., 111., 112., 113., 114., 115., 116., 117., 118., 119., 120.,
121., 122., 123., 124., 125., 126., 127., 128., 129., 130., 131.,
132., 133., 134., 135., 136., 137., 138., 139., 140., 141., 142.,
143., 144., 145., 146., 147., 148., 149., 150., 151., 152., 153.,
154., 155., 156., 157., 158., 159., 160., 161., 162., 163., 164.,
165., 166., 167., 168., 169., 170., 171., 172., 173., 174., 175.,
176., 177., 178., 179., 180., 181., 182., 183., 184., 185., 186.,
187., 188., 189., 190., 191., 192., 193., 194., 195., 196., 197.,
198., 199., 200., 201., 202., 203., 204., 205., 206., 207., 208.,
209., 210., 211., 212., 213., 214., 215., 216., 217., 218., 219.,
220., 221., 222., 223., 224., 225., 226., 227., 228., 229., 230.,
231., 232., 233., 234., 235., 236., 237., 238., 239., 240., 241.,
242., 243., 244., 245., 246., 247., 248., 249., 250., 251., 252.,
253., 254., 255.])
[in] h.compute
[out] <bound method DaskMethodsMixin.compute of dask.array<sum-aggregate, shape=(255,), dtype=int64, chunksize=(255,), chunktype=numpy.ndarray>>
im going to try in another notebook and see if it still occurs.
edit 3
its the stranges thing, but if I just declare the variable h, it comes out as one small element from the dask array?
edit
Strange, if i call the xarray.hist or the da.hist function, they both fall over. If I use the skimage.exposure.histogram it works but it appears that the zarr file is unpacked before the histogram is a calculated. Which is a bit of a problem...
Update 7th June 2020 (with a solution for not big but annoyingly medium data) see below for answer.

You probably want to use dask's function for this rather than map_blocks. For the latter, Dask expects the output of each call to be the same size as the input block, or a shape derived from the input block, instead of the one-dimensional fixed-size output of histogram.
h, bins =da.histogram(fused_crop, bins=255, range=[0, 255])
h.compute()

Update 7th June 2020 (with a solution for not big but annoyingly medium data):
So unfortunately I got a bit ill around this time and it took a while for me to feel a bit better. Then the pandemic happened and I was on full childcare duty. I tried lots of different option and what ultimately, this looked like was that the following:
1) if just using x.compute, the memory would very quickly fill up.
2) Using distributed would fill the hard drive with spill to disk and take hours but would hang and crash and not do anything because...it would compute (im guessing here but based on the graph and dask api) it would create a sub histogram array for every chunk... that would all need to be merged at some point.
3) The chunking of my data was sub optimal so the amount of tasks was massive but even then I couldn't compute a histogram when i improved the chunking.
In the end I looked for a dynamic way of updating the histogram data. So I used Zarr to do it, by computing to it. Since it allows conccurrent reads and writing functions. As a reminder : my data is a zarr array in 3 dims x,y,z and uncompressed 300GB but compressed it's about 100GB. On my 4 yr old laptop with 16GB of ram using the following worked (I should have said my data was 16 bit unsigned:
imgs = da.from_zarr(.....)
imgs2 = imgs.rechunk((a,b,c)) ## individual chunk dim per dim
h, bins = da.histogram(imgs2, bins = 255, range=[0, 65535]) # binning to 256
h_out = da.to_zarr(h, "histogram.zarr")
I ran the progress bar alongside the process and to get a histogram from file took :
[########################################] | 100% Completed | 18min 47.3s
Which I dont think is too bad for a 300GB array. Hopefully this helps someone else as well, thanks for the help earlier in the year #mdurant.

Related

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I and where can I adjust the flow_style whether a flow-style map would trigger line wrapping?

What I think the most accurate approach is when you encounter a flow-style mapping in the dumping process is to first try to emit it to a buffer and then get the length of the buffer and if that combined with the column that you are in, actually emit block-style.
Any attempt to guesstimate the length of the output without actually trying to write that part of a tree is going to be hard, if not impossible to do without doing the actual emit. Among other things the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ( [1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]. So a simple calculation of the length of the scalars that are the keys and values and separating comma, colon and spaces is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
If you run the following, the initial dumped value for quote will be on one line.
The change_to_block method as presented changes all mappings/sequences that are too long
that are on one line.
import sys
import ruamel.yaml
yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
I have seen things you people wouldn't believe.,
Attack ships on fire off the shoulder of Orion.,
I watched C-beams glitter in the dark near the Tannhäuser Gate.,
]}
"""
class Blockify:
def __init__(self, width, only_first=False, verbose=0):
self._width = width
self._yaml = None
self._only_first = only_first
self._verbose = verbose
#property
def yaml(self):
if self._yaml is None:
self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
y.preserve_quotes = True
y.width = 2**16
return self._yaml
def __call__(self, d):
pass_nr = 0
changed = [True]
while changed[0]:
changed[0] = False
try:
s = self.yaml.dumps(d)
except AttributeError:
print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
sys.exit(1)
if self._verbose > 1:
print(s)
too_long = set()
max_ll = -1
for line_nr, line in enumerate(s.splitlines()):
if len(line) > self._width:
too_long.add(line_nr)
if len(line) > max_ll:
max_ll = len(line)
if self._verbose > 0:
print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
sys.stdout.flush()
new_d = self.yaml.load(s)
self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
d = new_d
pass_nr += 1
return d, s
#staticmethod
def change_to_block(d, too_long, changed, only_first):
if isinstance(d, dict):
if d.fa.flow_style() and d.lc.line in too_long:
d.fa.set_block_style()
changed[0] = True
return # don't convert nested flow styles, might not be necessary
# don't change keys if any value is changed
for v in d.values():
Blockify.change_to_block(v, too_long, changed, only_first)
if only_first and changed[0]:
return
if changed[0]: # don't change keys if value has changed
return
for k in d:
Blockify.change_to_block(k, too_long, changed, only_first)
if only_first and changed[0]:
return
if isinstance(d, (list, tuple)):
if d.fa.flow_style() and d.lc.line in too_long:
d.fa.set_block_style()
changed[0] = True
return # don't convert nested flow styles, might not be necessary
for elem in d:
Blockify.change_to_block(elem, too_long, changed, only_first)
if only_first and changed[0]:
return
blockify = Blockify(96, verbose=2) # set verbose to 0, to suppress progress output
yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output) # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
[Batty, Roy]:
- I have seen things you people wouldn't believe.
- Attack ships on fire off the shoulder of Orion.
- I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
[Batty, Roy]:
- I have seen things you people wouldn't believe.
- Attack ships on fire off the shoulder of Orion.
- I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that when using ruamel.yaml<0.18 the sequence [Batty, Roy] never will be in block style
because the tuple subclass CommentedKeySeq does never get a line number attached.

Keras Inception-v3 fine-tuning workaraound

I am trying to fine-tune Inception-v3, but no matter which layer I choose to freeze I get random predictions. I found that other people are having the same problem: https://github.com/keras-team/keras/issues/9214 . It seems that the problem comes from setting the BN layer to not trainable.
Now I am trying to get the output of the last layer I want to freeze and use it as an input to the following layers, which I will then train:
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
target_size=size,
interpolation="bilinear",
classes=["a", "b", "c","d"],
batch_size=1,
shuffle=False) base_model = InceptionV3(weights='imagenet', include_top=True, input_shape=(299, 299, 3))
model_features = Model(inputs=base_model.input, outputs=base_model.get_layer(
self.Inception_Fine_Tune_Layers[layer_freeze]).output)
#I want to use this as input
values_train = model_features.predict_generator(train_generator, verbose=1)
However, I get Memory error like this, although I have 12Gb, which is more than what I need:
....
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3268864 totalling 3.12MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 3489024 totalling 3.33MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 4211968 totalling 4.02MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5129472 totalling 4.89MiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 3.62GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 68719476736
InUse: 3886957312
MaxInUse: 3889054464
NumAllocs: 3709
MaxAllocSize: 8388608
Any suggestion how to fix that or another workaround to fine-tune Inception will be very helpful.

I can't tell if you're preprocessing your input properly from what you've provided. However, Keras provides functions for preprocessing that are specific to the pre-trained net, in this case Inception V3.
from keras.applications.inception_v3 import preprocess_input
Try adding this to your data generator as the preprocessing function like so...
train_generator = train_datagen.flow_from_directory(
os.path.join(directory, "train_data"),
preprocessing_function=preprocess_input, # <---
target_size=size,
interpolation="bilinear",
classes=["a", "b", "c","d"],
batch_size=1,
shuffle=False)
You should then be able to unfreeze all of the layers, or the select few that you want to train.
Hope that helps!

How to calculate number of missing values summed over time dimension in a netcdf file in bash

I have a netcdf file with data as a function of lon,lat and time. I would like to calculate the total number of missing entries in each grid cell summed over the time dimension, preferably with CDO or NCO so I do not need to invoke R, python etc.
I know how to get the total number of missing values
ncap2 -s "nmiss=var.number_miss()" in.nc out.nc
as I answered to this related question:
count number of missing values in netcdf file - R
and CDO can tell me the total summed over space with
cdo info in.nc
but I can't work out how to sum over time. Is there a way for example of specifying the dimension to sum over with number_miss in ncap2?

We added the missing() function to ncap2 to solve this problem elegantly as of NCO 4.6.7 (May, 2017). To count missing values through time:
ncap2 -s 'mss_val=three_dmn_var_dbl.missing().ttl($time)' in.nc out.nc
Here ncap2 chains two methods together, missing(), followed by a total over the time dimension. The 2D variable mss_val is in out.nc. The response below does the same but averages over space and reports through time (because I misinterpreted the OP).
Old/obsolete answer:
There are two ways to do this with NCO/ncap2, though neither is as elegant as I would like. Either call assemble the answer one record at a time by calling num_miss() with one record at a time, or (my preference) use the boolean comparison function followed by the total operator along the axes of choice:
zender#aerosol:~$ ncap2 -O -s 'tmp=three_dmn_var_dbl;mss_val=tmp.get_miss();tmp.delete_miss();tmp_bool=(tmp==mss_val);tmp_bool_ttl=tmp_bool.ttl($lon,$lat);print(tmp_bool_ttl);' ~/nco/data/in.nc ~/foo.nc
tmp_bool_ttl[0]=0
tmp_bool_ttl[1]=0
tmp_bool_ttl[2]=0
tmp_bool_ttl[3]=8
tmp_bool_ttl[4]=0
tmp_bool_ttl[5]=0
tmp_bool_ttl[6]=0
tmp_bool_ttl[7]=1
tmp_bool_ttl[8]=0
tmp_bool_ttl[9]=2
or
zender#aerosol:~$ ncap2 -O -s 'for(rec=0;rec<time.size();rec++){nmiss=three_dmn_var_int(rec,:,:).number_miss();print(nmiss);}' ~/nco/data/in.nc ~/foo.nc
nmiss = 0
nmiss = 0
nmiss = 8
nmiss = 0
nmiss = 0
nmiss = 1
nmiss = 0
nmiss = 2
nmiss = 1
nmiss = 2

Even though you are asking for another solution, I would like to show you that it takes only one very short line to find the answer with the help of Python. The variable m_data has exactly the same shape as a variable with missing values read using the netCDF4 package. With the execution of only one np.sum command with the correct axis specified, you have your answer.
import numpy as np
import matplotlib.pyplot as plt
import netCDF4 as nc4
# Generate random data for this experiment.
data = np.random.rand(365, 64, 128)
# Masked data, this is how the data is read from NetCDF by the netCDF4 package.
# For this example, I mask all values less than 0.1.
m_data = np.ma.masked_array(data, mask=data<0.1)
# It only takes one operation to find the answer.
n_values_missing = np.sum(m_data.mask, axis=0)
# Just a plot of the result.
plt.figure()
plt.pcolormesh(n_values_missing)
plt.colorbar()
plt.xlabel('lon')
plt.ylabel('lat')
plt.show()
# Save a netCDF file of the results.
f = nc4.Dataset('test.nc', 'w', format='NETCDF4')
f.createDimension('lon', 128)
f.createDimension('lat', 64 )
n_values_missing_nc = f.createVariable('n_values_missing', 'i4', ('lat', 'lon'))
n_values_missing_nc[:,:] = n_values_missing[:,:]
f.close()

SPSS: Syntax for Calculating FSCORE from rotated factors

I found the proper syntax to import my centriod factor extraction into SPSS and rotate it. (The semi-bumbling tale is here.
The next issue is this: because of limitations in SPSS on what subcommands can be used when reading a matrix in (only /ROTATE and /EXTRACTION are permitted), I can't get the factor scores. SPSS displays this error:
"Factor scores cannot be computed with matrix input."
I still need to find a way to get the FSCORE of the newly rotated factors of each factor for all cases by running a regression using the newly rotated factors and saving the regression values as a new variable (/SAVE REG(ALL).
Ideas are welcome. Thank you for your expertise!
Assets: Dataset A of 36 cases and 74 variables (the basis of the centriod factor extraction); centriod factor extraction matrix
Here's the SPSS syntax that almost does what I need - except it uses PCA extraction instead of centroid.
FACTOR
/VARIABLES VAR00001 VAR00002 VAR00003 VAR00004 VAR00005 VAR00006 VAR00007 VAR00008 VAR00009 VAR00010 VAR00011 VAR00012 VAR00013 VAR00014 VAR00015 VAR00016 VAR00017 VAR00018 VAR00019 VAR00020 VAR00021 VAR00022 VAR00023 VAR00024 VAR00025 VAR00026 VAR00027 VAR00028 VAR00029 VAR00030 VAR00031 VAR00032 VAR00033 VAR00034 VAR00035 VAR00036 VAR00037 VAR00038 VAR00039 VAR00040 VAR00041 VAR00042 VAR00043 VAR00044 VAR00045 VAR00046 VAR00047 VAR00048 VAR00049 VAR00050 VAR00051 VAR00052 VAR00053 VAR00054 VAR00055 VAR00056 VAR00057 VAR00058 VAR00059 VAR00060 VAR00061 VAR00062 VAR00063 VAR00064 VAR00065 VAR00066 VAR00067 VAR00068 VAR00069 VAR00070 VAR00071 VAR00072 VAR00073 VAR00074
/MISSING LISTWISE
/ANALYSIS VAR00001 VAR00002 VAR00003 VAR00004 VAR00005 VAR00006 VAR00007 VAR00008 VAR00009 VAR00010 VAR00011 VAR00012 VAR00013 VAR00014 VAR00015 VAR00016 VAR00017 VAR00018 VAR00019 VAR00020 VAR00021 VAR00022 VAR00023 VAR00024 VAR00025 VAR00026 VAR00027 VAR00028 VAR00029 VAR00030 VAR00031 VAR00032 VAR00033 VAR00034 VAR00035 VAR00036 VAR00037 VAR00038 VAR00039 VAR00040 VAR00041 VAR00042 VAR00043 VAR00044 VAR00045 VAR00046 VAR00047 VAR00048 VAR00049 VAR00050 VAR00051 VAR00052 VAR00053 VAR00054 VAR00055 VAR00056 VAR00057 VAR00058 VAR00059 VAR00060 VAR00061 VAR00062 VAR00063 VAR00064 VAR00065 VAR00066 VAR00067 VAR00068 VAR00069 VAR00070 VAR00071 VAR00072 VAR00073 VAR00074
/PRINT INITIAL CORRELATION SIG DET INV REPR AIC EXTRACTION ROTATION FSCORE
/FORMAT BLANK(.544)
/CRITERIA FACTORS(6) ITERATE(80)
/EXTRACTION PC <---Here's the rub.
/CRITERIA ITERATE(80) DELTA(0)
/ROTATION OBLIMIN
/SAVE REG(ALL)
/METHOD=CORRELATION.

I referred to the MATRIX command in my reply to your other post (Rotations). You will need to research the appropriate equations for performing this calculation and set up the matrix algebra within a MATRIX END - MATRIX code block. Easy once you have the math right. I'm too busy/lazy to research and write it but this should provide a fertile lead.

Smoothing measured data in MATLAB?

I have measured data from MATLAB and I'm wondering how to best smooth the data?
Example data (1st colum=x-data / second-colum=y-data):
33400 209.11
34066 210.07
34732 212.3
35398 214.07
36064 215.61
36730 216.95
37396 218.27
38062 219.52
38728 220.11
39394 221.13
40060 221.4
40726 222.5
41392 222.16
42058 223.29
42724 222.77
43390 223.97
44056 224.42
44722 225.4
45388 225.32
46054 225.98
46720 226.7
47386 226.53
48052 226.61
48718 227.43
49384 227.84
50050 228.41
50716 228.57
51382 228.92
52048 229.67
52714 230.02
53380 229.54
54046 231.19
54712 231.00
55378 231.5
56044 231.5
56710 231.79
57376 232.26
58042 233.12
58708 232.65
59374 233.51
60040 234.16
60706 234.21
The data in the second column should be monoton but it isn't. How to make it smooth?
I could probably invent a short algorithm myself but I think it's a better way to use an established and proven one... do you know a good way to somehow integrate the outliners to make it a monoton curve!?
Thanks in advance

Monotone in your case is always increasing!
See the options below (1. Cobb-Douglas; 2. Quadratic; 3. Cubic)
clear all
close all
load needSmooth.dat % Your data
x=needSmooth(:,1);
y=needSmooth(:,2);
n=length(x);
% Figure 1
logX=log(x);
logY=log(y);
Y=logY;
X=[ones(n,1),logX];
B=regress(Y,X);
a=exp(B(1,1));
b=B(2,1);
figure(1)
plot(x,y,'k*')
hold
for i=1:n-1
plot([x(i,1);x(i+1,1)],[a*x(i,1)^b;a*x(i+1,1)^b],'k-')
end
%Figure 2
X=[ones(n,1),x,x.*x];
Y=y;
B=regress(Y,X);
c=B(1,1);
b=B(2,1);
a=B(3,1);
figure(2)
plot(x,y,'k*')
hold
for i=1:n-1
plot([x(i,1);x(i+1,1)],[c+b*x(i,1)+a*x(i,1)^2; c+b*x(i+1,1)+a*x(i+1,1)^2],'k-')
end
%Figure 3
X=[ones(n,1),x,x.*x,x.*x.*x];
Y=y;
B=regress(Y,X);
d=B(1,1);
c=B(2,1);
b=B(3,1);
a=B(4,1);
figure(3)
plot(x,y,'k*')
hold
for i=1:n-1
plot([x(i,1);x(i+1,1)],[d+c*x(i,1)+b*x(i,1)^2+a*x(i,1)^3; d+c*x(i+1,1)+b*x(i+1,1)^2+a*x(i+1,1)^3],'k-')
end
There are also some cooked functions in Matlab such as "smooth" and "spline" that should also work in your case since your data is almost monotone.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

dask histogram from zarr file (a big zarr file) - image

Related

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

Keras Inception-v3 fine-tuning workaraound

How to calculate number of missing values summed over time dimension in a netcdf file in bash

SPSS: Syntax for Calculating FSCORE from rotated factors

Smoothing measured data in MATLAB?

Categories

Resources