So here's my question: I have a big 3-dimensional array stored as a Zarr file that is 100 GB on disk (uncompressed, the array is more than twice that size). I tried using the histogram function from Dask to calculate a histogram, but I get an error saying that it can't do it because the file has tuples within tuples. I'm guessing that's down to the Zarr file format rather than anything else?
Any thoughts?
Edit:
Yes, the bigger-computer idea wouldn't actually work...
I'm running a Dask client on a single machine; it runs the calculation but just gets stuck somewhere.
I just tried the map_blocks function across the file, but when I plot the result I get this:
ValueError: setting an array element with a sequence.
Here's a version of the script:
def histo(img):
    return da.histogram(img, bins=255, range=[0, 255])

histo_1 = da.map_blocks(histo, fimg)
I'm actually going to try using it outside of map_blocks. I wonder whether the windowing from map_blocks is what actually causes the issue. Well, I'll let you know whether it is or not...
Edit 2:
So I tried removing the map_blocks call as suggested, and this was my result:
[in] h, bins = da.histogram(fused_crop, bins=255, range=[0, 255])
[in] bins
[out] array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.,
22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32.,
33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54.,
55., 56., 57., 58., 59., 60., 61., 62., 63., 64., 65.,
66., 67., 68., 69., 70., 71., 72., 73., 74., 75., 76.,
77., 78., 79., 80., 81., 82., 83., 84., 85., 86., 87.,
88., 89., 90., 91., 92., 93., 94., 95., 96., 97., 98.,
99., 100., 101., 102., 103., 104., 105., 106., 107., 108., 109.,
110., 111., 112., 113., 114., 115., 116., 117., 118., 119., 120.,
121., 122., 123., 124., 125., 126., 127., 128., 129., 130., 131.,
132., 133., 134., 135., 136., 137., 138., 139., 140., 141., 142.,
143., 144., 145., 146., 147., 148., 149., 150., 151., 152., 153.,
154., 155., 156., 157., 158., 159., 160., 161., 162., 163., 164.,
165., 166., 167., 168., 169., 170., 171., 172., 173., 174., 175.,
176., 177., 178., 179., 180., 181., 182., 183., 184., 185., 186.,
187., 188., 189., 190., 191., 192., 193., 194., 195., 196., 197.,
198., 199., 200., 201., 202., 203., 204., 205., 206., 207., 208.,
209., 210., 211., 212., 213., 214., 215., 216., 217., 218., 219.,
220., 221., 222., 223., 224., 225., 226., 227., 228., 229., 230.,
231., 232., 233., 234., 235., 236., 237., 238., 239., 240., 241.,
242., 243., 244., 245., 246., 247., 248., 249., 250., 251., 252.,
253., 254., 255.])
[in] h.compute
[out] <bound method DaskMethodsMixin.compute of dask.array<sum-aggregate, shape=(255,), dtype=int64, chunksize=(255,), chunktype=numpy.ndarray>>
I'm going to try in another notebook and see if it still occurs.
Edit 3:
It's the strangest thing, but if I just evaluate the variable h, it comes out as one small element of the Dask array?
Edit:
Strange: if I call xarray.hist or the da.histogram function, they both fall over. If I use skimage.exposure.histogram it works, but it appears that the Zarr file is unpacked before the histogram is calculated, which is a bit of a problem...
Update 7th June 2020 (with a solution for not big but annoyingly medium data) see below for answer.
You probably want to call Dask's own histogram function directly rather than wrapping it in map_blocks. With map_blocks, Dask expects the output of each call to have the same shape as the input block, or a shape derived from it, whereas histogram returns a fixed-size one-dimensional output.
h, bins = da.histogram(fused_crop, bins=255, range=[0, 255])
h.compute()
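To make that concrete, here is a minimal sketch on synthetic data. The array `data`, its shape and its chunking are made up for illustration and are not the asker's `fused_crop`. Note that `compute` needs its parentheses: `h.compute` on its own just returns the bound method, which is what the output in the question shows.

```python
import numpy as np
import dask.array as da

# Synthetic stand-in for the real image stack (assumption: 8-bit values).
rng = np.random.default_rng(0)
data = rng.integers(0, 256, size=(20, 30, 40), dtype=np.uint8)
x = da.from_array(data, chunks=(10, 15, 20))

# da.histogram only builds a lazy graph: one histogram per chunk, then a sum.
h, bins = da.histogram(x, bins=255, range=[0, 255])

# Nothing is computed until compute() is actually called.
counts = h.compute()

# Cross-check against NumPy on the in-memory copy.
expected, _ = np.histogram(data, bins=255, range=(0, 255))
print(np.array_equal(counts, expected))
```

The result has a fixed length of 255 no matter how the input is chunked, which is why it does not fit the block-in, block-out shape that map_blocks assumes by default.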
Update 7th June 2020 (with a solution for not big but annoyingly medium data):
Unfortunately I got a bit ill around this time and it took a while for me to feel better. Then the pandemic happened and I was on full childcare duty. I tried lots of different options, and this is what I ultimately found:
1) With just x.compute(), the memory would fill up very quickly.
2) Using distributed would fill the hard drive with spill-to-disk and run for hours, then hang and crash without producing anything. I'm guessing here, based on the task graph and the Dask API, but it looked like it created a sub-histogram array for every chunk, all of which needed to be merged at some point.
3) The chunking of my data was suboptimal, so the number of tasks was massive, but even after I improved the chunking I couldn't compute a histogram.
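On point 2, the merge step works because histogram counts are additive: a histogram of each chunk over the same bin edges can simply be summed. A small NumPy-only sketch of that idea, using a made-up 16-bit array and an arbitrary split into four blocks (both are assumptions for illustration, not my real data):

```python
import numpy as np

# Synthetic 16-bit volume standing in for the real data (assumption).
rng = np.random.default_rng(1)
data = rng.integers(0, 65536, size=(8, 64, 64), dtype=np.uint16)

# Accumulate one small histogram per block into a running total.
total = np.zeros(255, dtype=np.int64)
for block in np.array_split(data, 4, axis=0):
    counts, _ = np.histogram(block, bins=255, range=(0, 65535))
    total += counts

# The running total matches a histogram of the whole array in one go.
whole, _ = np.histogram(data, bins=255, range=(0, 65535))
print(np.array_equal(total, whole))
```

Each partial result is only 255 integers, so only one block of image data has to be in memory at a time.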
In the end I looked for a dynamic way of updating the histogram data, so I used Zarr and computed the result straight into it, since it allows concurrent reads and writes. As a reminder: my data is a Zarr array in 3 dimensions (x, y, z); it is 300 GB uncompressed and about 100 GB compressed. On my 4-year-old laptop with 16 GB of RAM, the following worked (I should have said that my data is 16-bit unsigned):
imgs = da.from_zarr(.....)
imgs2 = imgs.rechunk((a, b, c))  # a, b, c: chunk size along each dimension
h, bins = da.histogram(imgs2, bins=255, range=[0, 65535])  # 255 bins over the 16-bit range
h_out = da.to_zarr(h, "histogram.zarr")
I ran the progress bar alongside the computation, and getting the histogram from the file took:
[########################################] | 100% Completed | 18min 47.3s
Which I don't think is too bad for a 300 GB array. Hopefully this helps someone else as well. Thanks for the help earlier in the year, @mdurant.
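For anyone who wants to try the same pattern without a 300 GB file, here is a small sketch of the rechunk, histogram and progress-bar steps on synthetic data. The array, the chunk sizes and the use of `compute()` instead of writing to Zarr are all assumptions made to keep the example self-contained. `ProgressBar` from `dask.diagnostics` works with the local schedulers, not with a distributed client.

```python
import numpy as np
import dask.array as da
from dask.diagnostics import ProgressBar

# Synthetic 16-bit stand-in for the Zarr-backed array (assumption).
rng = np.random.default_rng(0)
data = rng.integers(0, 65536, size=(16, 64, 64), dtype=np.uint16)
imgs = da.from_array(data, chunks=(1, 64, 64))  # many thin chunks

# Fewer, larger chunks mean fewer tasks; these sizes are illustrative only.
imgs2 = imgs.rechunk((8, 64, 64))

h, bins = da.histogram(imgs2, bins=255, range=[0, 65535])

# The context manager prints a text progress bar while the graph runs.
with ProgressBar():
    counts = h.compute()

print(counts.sum() == data.size)
```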
I have a set of data points that are supposed to define a smooth surface. X and Z are specified at intervals, and Y depends on X and Z.
The location of each point is rounded to within a fixed tolerance.
If one draws a curve (e.g. a spline) through an X or Z slice, the rounding causes the curve to zigzag or form steps.
For example, a set of linear values forming a straight line, when rounded to integers, forms a step:
1 1 2 2 3 instead of 1 1.5 2 2.5 3
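A quick way to see the problem numerically is to look at second differences of the two sequences above: they are zero for the straight line and alternate in sign for the rounded one. The helper name `second_differences` is just for illustration.

```python
# The two sequences from the example above.
exact = [1.0, 1.5, 2.0, 2.5, 3.0]
rounded = [1, 1, 2, 2, 3]


def second_differences(values):
    """Discrete second derivative: v[i-1] - 2*v[i] + v[i+1] for interior points."""
    return [
        values[i - 1] - 2 * values[i] + values[i + 1]
        for i in range(1, len(values) - 1)
    ]


print(second_differences(exact))    # [0.0, 0.0, 0.0]
print(second_differences(rounded))  # [1, -1, 1]
```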
What I would like to do is fair the data: convert the rounded values to real values that produce smooth curves, which in turn define a smooth surface without oscillations.
There are tools to do this in 2D. They look for oscillations in the second derivative. The problem in my case is that fairing along an X/Y curve can unfair an intersecting Y/Z curve. I need something that will fair along X/Y and Y/Z at the same time.
I was wondering if anyone knows of algorithms to do this.
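To make the requirement concrete, here is one possible formulation, offered as a sketch rather than an established fairing algorithm: minimise the sum of squared second differences in both grid directions at once, while keeping every faired value within half the rounding tolerance of its rounded value. It is solved below with plain projected gradient descent. Everything here is an assumption for illustration: the function names, the synthetic test surface, a complete rectangular grid, and uniform spacing in both directions (the sample data's Z stations are not evenly spaced, so a real version would need spacing-weighted differences).

```python
import numpy as np


def roughness(f):
    """Sum of squared second differences along both grid directions."""
    return float(sum((np.diff(f, n=2, axis=ax) ** 2).sum() for ax in (0, 1)))


def roughness_gradient(f):
    """Gradient of roughness(f) with respect to every grid value."""
    grad = np.zeros_like(f)
    for axis in (0, 1):
        g = np.moveaxis(grad, axis, 0)  # view into grad
        d = np.diff(np.moveaxis(f, axis, 0), n=2, axis=0)
        # Adjoint of the (1, -2, 1) stencil, times 2 from the square.
        g[:-2] += 2 * d
        g[1:-1] -= 4 * d
        g[2:] += 2 * d
    return grad


def fair_surface(y, tol, iterations=2000, step=1.0 / 64.0):
    """Smooth y in both directions, staying within tol/2 of the input."""
    y = np.asarray(y, dtype=float)
    lo, hi = y - tol / 2, y + tol / 2
    f = y.copy()
    for _ in range(iterations):
        # Gradient step on the roughness, then project back into the box.
        f = np.clip(f - step * roughness_gradient(f), lo, hi)
    return f


# Synthetic smooth surface, rounded to 1/16 to imitate the problem.
i, j = np.meshgrid(np.arange(12), np.arange(15), indexing="ij")
true = 100 + 0.37 * i + 0.0113 * j**2 + 0.005 * i * j
tol = 1 / 16
y = np.round(true / tol) * tol

faired = fair_surface(y, tol)
print(roughness(y), roughness(faired))
```

The step size of 1/64 is chosen to be no larger than the inverse of the largest curvature of this quadratic objective, so each iteration cannot increase the roughness. Because both directions enter the same objective, smoothing one family of curves cannot silently unfair the other.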
Sample Data Set
1392,175.8125,378
1296,161.0625,378
1128,231.375,725.5
1152,233.625,723.8125
1176,235.875,722.125
1200,238.25,720.4375
1224,240.5,718.875
1248,242.875,717.25
1272,245.375,715.625
1296,247.75,714.125
1320,250.25,712.5625
1344,252.8125,711.125
1368,255.375,709.5625
1392,258.0625,708.0625
1416,260.75,706.625
1440,263.5,705.25
1464,266.125,703.8125
1128,100.875,0
1152,101.75,0
1176,102.6875,0
1200,103.5625,0
1224,104.3125,0
1248,105.125,0
1272,105.875,0
1296,106.625,0
1320,107.25,0
1344,107.875,0
1368,108.625,0
1392,109.1875,0
1416,109.8125,0
1440,110.4375,0
1464,111.125,0
1128,115.5,3
1152,116.75,3
1176,117.8125,3
1200,118.875,3
1224,120.0625,3
1248,121.0625,3
1272,122,3
1296,123,3
1320,124,3
1344,125,3
1368,126,3
1392,126.875,3
1416,127.875,3
1440,128.875,3
1464,129.75,3
1128,122.25,6
1152,123.4375,6
1176,124.5625,6
1200,125.75,6
1224,126.875,6
1248,128,6
1272,129.125,6
1296,130.25,6
1320,131.375,6
1344,132.4375,6
1368,133.5,6
1392,134.5625,6
1416,135.625,6
1440,136.625,6
1464,137.625,6
1128,131.375,12
1152,132.6875,12
1176,133.9375,12
1200,135.25,12
1224,136.4375,12
1248,137.6875,12
1272,138.875,12
1296,140.25,12
1320,141.3125,12
1344,142.5,12
1368,143.75,12
1392,145,12
1416,146.1875,12
1440,147.375,12
1464,148.5,12
1128,144.9375,24
1152,146.25,24
1176,147.5,24
1200,148.75,24
1224,150.1875,24
1248,151.4375,24
1272,152.75,24
1296,154.125,24
1320,155.4375,24
1344,156.75,24
1368,158,24
1392,159.25,24
1416,160.5625,24
1440,161.875,24
1464,163.125,24
1128,161.625,48
1152,162.8125,48
1176,164.1875,48
1200,165.625,48
1224,166.9375,48
1248,168.3125,48
1272,169.75,48
1296,171.25,48
1320,172.6875,48
1344,174,48
1368,175.375,48
1392,176.75,48
1416,178.1875,48
1440,179.625,48
1464,181.0625,48
1128,168.75,72
1152,170.25,72
1176,171.75,72
1200,173.25,72
1224,174.875,72
1248,176.375,72
1272,178,72
1296,179.5,72
1320,181.1875,72
1344,182.75,72
1368,184.25,72
1392,185.8125,72
1416,187.5,72
1440,189.125,72
1464,190.75,72
1128,170.1875,96
1152,171.875,96
1176,173.5,96
1200,175.25,96
1224,176.875,96
1248,178.75,96
1272,180.5,96
1296,182.25,96
1320,184,96
1344,185.75,96
1368,187.5625,96
1392,189.3125,96
1416,191.125,96
1440,193,96
1464,194.9375,96
1128,169.0625,120
1152,170.8125,120
1176,172.75,120
1200,174.625,120
1224,176.5,120
1248,178.375,120
1272,180.3125,120
1296,182.625,120
1320,184.25,120
1344,186.25,120
1368,188.125,120
1392,190.1875,120
1416,192.125,120
1440,194.25,120
1464,196.3125,120
1128,166.625,144
1152,168.625,144
1176,170.625,144
1200,172.625,144
1224,174.75,144
1248,176.75,144
1272,178.8125,144
1296,181,144
1320,183.0625,144
1344,185.25,144
1368,187.4375,144
1392,189.5,144
1416,191.75,144
1440,194,144
1464,196.3125,144
1128,163.75,168
1152,165.75,168
1176,168,168
1200,170.1875,168
1224,172.4375,168
1248,174.625,168
1272,176.75,168
1296,179.125,168
1320,181.375,168
1344,183.625,168
1368,186,168
1392,188.625,168
1416,190.75,168
1440,193.125,168
1464,195.625,168
1128,160.5625,192
1152,162.875,192
1176,165.1875,192
1200,167.5,192
1224,169.875,192
1248,172.25,192
1272,174.625,192
1296,177.0625,192
1320,179.4375,192
1344,181.9375,192
1368,184.4375,192
1392,186.875,192
1416,189.5,192
1440,192.0625,192
1464,194.6875,192
1128,157.3125,216
1152,159.8125,216
1176,162.3125,216
1200,164.8125,216
1224,167.25,216
1248,169.6875,216
1272,172.3125,216
1296,174.875,216
1320,177.3125,216
1344,179.9375,216
1368,182.625,216
1392,185.25,216
1416,188,216
1440,190.8125,216
1464,193.5625,216
1128,154.1875,240
1152,156.75,240
1176,159.375,240
1200,161.9375,240
1224,164.5,240
1248,167.1875,240
1272,169.875,240
1296,172.625,240
1320,175.4375,240
1344,178.0625,240
1368,180.875,240
1392,183.6875,240
1416,186.5,240
1440,189.5,240
1464,192.4375,240
1128,151,264
1152,153.6875,264
1176,156.4375,264
1200,159.125,264
1224,161.875,264
1248,164.75,264
1272,167.5,264
1296,170.3125,264
1320,173.25,264
1344,176.125,264
1368,179.125,264
1392,182.0625,264
1416,185.125,264
1440,188.1875,264
1464,191.3125,264
1128,147.6875,288
1152,150.5,288
1176,153.3125,288
1200,156.1875,288
1224,159.125,288
1248,162.0625,288
1272,165.0625,288
1296,168.125,288
1320,171.1875,288
1344,174.25,288
1368,177.3125,288
1392,180.5,288
1416,183.6875,288
1440,186.9375,288
1464,190.25,288
1128,143.625,318
1152,146.6875,318
1176,149.6875,318
1200,152.75,318
1224,155.875,318
1248,159,318
1272,162.25,318
1296,165.625,318
1320,168.875,318
1344,172.125,318
1368,175.375,318
1392,178.75,318
1416,182.125,318
1440,185.6875,318
1464,189.1875,318
1128,139.875,348
1152,143.125,348
1176,146.3125,348
1200,149.5,348
1224,152.9375,348
1248,156.1875,348
1272,159.75,348
1296,163.1875,348
1320,166.625,348
1344,170.125,348
1368,173.75,348
1392,177.125,348
1416,180.75,348
1440,184.5,348
1464,188.25,348
1128,136.25,378
1152,139.6875,378
1176,143.125,378
1200,146.5625,378
1224,150.125,378
1248,153.6875,378
1272,157.375,378
1320,164.6875,378
1344,168.3125,378
1368,172.0625,378
1416,179.625,378
1440,183.5,378
1464,187.4375,378
1128,132.5625,415.25
1152,136.125,415.25
1176,139.8125,415.25
1200,143.4375,415.25
1224,147.1875,415.25
1248,151,415.25
1272,154.875,415.25
1296,158.6875,415.25
1320,162.625,415.25
1344,166.5625,415.25
1368,170.5,415.25
1392,174.375,415.25
1416,178.5625,415.25
1440,182.6875,415.25
1464,186.875,415.25
1128,130.9375,444
1152,134.625,444
1176,138.625,444
1200,142.1875,444
1224,146,444
1248,149.9375,444
1272,153.9375,444
1296,157.9375,444
1320,162,444
1344,166,444
1368,170.125,444
1392,174.25,444
1416,178.375,444
1440,182.625,444
1464,186.875,444
1128,130.8125,474
1152,134.625,474
1176,138.5625,474
1200,142.5,474
1224,146.4375,474
1248,150.5,474
1272,154.5625,474
1296,158.6875,474
1320,162.875,474
1344,166.9375,474
1368,171.25,474
1392,175.5,474
1416,179.75,474
1440,184,474
1464,188.375,474
1128,132.1875,501
1152,136.125,501
1176,140.125,501
1200,144.1875,501
1224,148.25,501
1248,152.375,501
1272,156.5,501
1296,160.75,501
1320,165,501
1344,169.25,501
1368,173.5625,501
1392,177.75,501
1416,182.3125,501
1440,186.75,501
1464,191.125,501
1128,135.25,528
1152,139.375,528
1176,143.375,528
1200,147.5,528
1224,151.625,528
1248,155.8125,528
1272,160,528
1296,164.25,528
1320,168.625,528
1344,172.875,528
1368,177.25,528
1392,181.625,528
1416,186.25,528
1440,190.625,528
1464,195.125,528
1128,142.125,564
1152,146.25,564
1176,150.4375,564
1200,154.5625,564
1224,158.6875,564
1248,162.9375,564
1272,167.1875,564
1296,171.5,564
1320,176.0625,564
1344,180.375,564
1368,184.875,564
1392,189.25,564
1416,193.75,564
1440,198.3125,564
1464,202.8125,564
1128,153.4375,600
1152,157.5,600
1176,161.75,600
1200,165.9375,600
1224,170.125,600
1248,174.25,600
1272,178.625,600
1296,182.9375,600
1320,187.25,600
1344,191.625,600
1368,196,600
1392,200.375,600
1416,204.75,600
1440,209.25,600
1464,213.375,600
1128,169.4375,636
1152,173.4375,636
1176,177.5625,636
1200,181.6875,636
1224,185.875,636
1248,190.0625,636
1272,194.25,636
1296,198.375,636
1320,202.375,636
1344,206.75,636
1368,211,636
1392,215.3125,636
1416,219.5,636
1440,223.8125,636
1464,228.0625,636
1128,190.25,672
1152,194.1875,672
1176,198.25,672
1200,202.125,672
1224,206.1875,672
1248,210.125,672
1272,214.1875,672
1296,218.1875,672
1320,222.1875,672
1344,226.25,672
1368,230.3125,672
1392,234.375,672
1416,238.375,672
1440,242.5,672
1464,246.5625,672
1128,207.0625,696
1152,211,696
1176,214.875,696
1200,218.75,696
1224,222.625,696
1248,226.5625,696
1272,230.4375,696
1296,234.3125,696
1320,238.125,696
1344,242,696
1368,245.75,696
1392,249.75,696
1416,253.5,696
1440,257.375,696
1464,261.125,696
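For anyone wanting to experiment with the data, here is a small sketch of loading it into a grid indexed by Z and X, so that curves in both directions can be read off. It assumes the columns are X, Y, Z in that order, and `load_grid` is a hypothetical helper name. Only the first few rows of two Z stations are embedded here; missing grid points come back as `None`.

```python
import csv
import io

# A few rows copied verbatim from the sample data set above.
SAMPLE = """\
1128,100.875,0
1152,101.75,0
1176,102.6875,0
1128,115.5,3
1152,116.75,3
1176,117.8125,3
"""


def load_grid(text):
    """Return (xs, zs, grid) with grid[k][i] = Y at (xs[i], zs[k]), or None."""
    points = {}
    for x, y, z in csv.reader(io.StringIO(text)):
        points[(float(x), float(z))] = float(y)
    xs = sorted({x for x, _ in points})
    zs = sorted({z for _, z in points})
    grid = [[points.get((x, z)) for x in xs] for z in zs]
    return xs, zs, grid


xs, zs, grid = load_grid(SAMPLE)
print(xs, zs)
print(grid)
```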