Mutate new column from random value in existing columns

I'm looking to mutate my data and create a new column which randomly selects a value from the existing data. My data looks something like:
individual  age_2010  age_2011  age_2012  age_2013
a           20        21        NA        21
b           33        34        35        36
c           76        NA        78        79
d           46        46        48        49
And I want it to look like:
individual  age_2010  age_2011  age_2012  age_2013  Random Sample
a           20        21        NA        21        21
b           33        34        35        36        36
c           76        NA        78        79        78
d           46        46        48        49        48
Is there any way to add a new column which includes a random figure from any of the previous age columns, and preferably keeping the data in wide form?

I think this is an easier approach:
d[, RandomSample := sample(na.omit(t(.SD)), 1), by = individual]
If handling the edge case where an individual has only NA values (discussed in the other answer below) is desired, and one wanted to follow this approach, we could do this:
f <- function(df) {
  s <- na.omit(t(df))
  ifelse(length(s) > 0, sample(s, 1), NA_real_)
}
d[, RandomSample := f(.SD), by = individual]
Or we could just wrap the original approach in tryCatch:
d[, RandomSample := tryCatch(sample(na.omit(t(.SD)), 1), error = \(e) NA), by = individual]

You can reshape longer, then do grouped sampling:
library(data.table)
# Sample data
d <- data.table(
  individual = c("a", "b", "c", "d"),
  age_2010   = c(20, 33, 76, 46),
  age_2011   = c(21, 34, NA, 46),
  age_2012   = c(NA, 35, 78, 48),
  age_2013   = c(21, 36, 79, 49)
)
d
#>    individual age_2010 age_2011 age_2012 age_2013
#> 1:          a       20       21       NA       21
#> 2:          b       33       34       35       36
#> 3:          c       76       NA       78       79
#> 4:          d       46       46       48       49
# Solution
d[, "Random Sample"] <- d |>
  melt("individual") |>            # go long
  (`[`)(!is.na(value),             # drop NAs
        .(x = sample(value, 1)),   # sampling
        keyby = .(individual)) |>  # grouping variable
  (`[[`)(2)                        # extract vector from frame
d
#>    individual age_2010 age_2011 age_2012 age_2013 Random Sample
#> 1:          a       20       21       NA       21            21
#> 2:          b       33       34       35       36            33
#> 3:          c       76       NA       78       79            76
#> 4:          d       46       46       48       49            49
Alternatively, you can also use apply(), which is less verbose but much slower:
d[, "Random Sample"] <- apply(d[, -1], 1, \(x) x |> na.omit() |> sample(1))
See the benchmark here for speed comparison. On just 40k observations, apply() needs 59 times longer and 8 times the memory.
# Make large sample data set
d_large <- d |>
  list() |>
  rep(1e4) |>
  rbindlist()

bench::mark(
  base = apply(d_large[, -1], 1, \(x) x |> na.omit() |> sample(1)),
  dt = d_large |>
    melt("individual") |>
    (`[`)(!is.na(value),
          .(x = sample(value, 1)),
          keyby = .(individual)) |>
    (`[[`)(2),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 base       617.86ms  617.9ms      1.62   103.3MB     12.9
#> 2 dt           6.96ms   10.5ms     80.9     13.1MB     47.3
Created on 2022-07-27 by the reprex package (v2.0.1)
Edit:
Here are versions that work with the edge case where all years are NA. In the first case I went for a join with the original table, which is a bit more expensive than the other version.
# Solution with data.table
d <- d |>
  melt("individual") |>                         # go long
  (`[`)(!is.na(value),                          # drop NAs
        .(`Random Sample` = sample(value, 1)),  # sampling
        keyby = .(individual)) |>               # grouping variable
  (`[`)(d)                                      # right join with original frame
Here I simply used purrr::possibly() to return NA when sampling a zero-length vector.
# Solution with apply
d[, "Random Sample"] <- apply(d[, -1], 1,
                              \(x) x |> na.omit() |> purrr::possibly(sample, NA)(1))

Algorithm for visiting all grid cells in pseudo-random order that has a guaranteed uniformity at any stage

Context:
I have a hydraulic erosion algorithm that needs to receive an array of droplet starting positions. I also already have a pattern-replicating algorithm, so I only need a good pattern to replicate.
The Requirements:
I need an algorithm that produces a set of n^2 entries, each of the form (x,y) or [index], describing cells in an nxn grid (where n = 2^i and i is any positive integer).
(As a set, this means that every cell is mentioned in exactly one entry.)
The pattern [created by the algorithm] should show little to no clustering of "visited" cells at any stage.
The cell (0,0) is as close to (n-1,n-1) as it is to (1,1); this wrap-around distance is what the definition of clustering relies on (see the sketch after this list).
Note
I was/am trying to find solutions through fractal-like patterns built through recursion, but at the time of writing my solution is a lookup table of a checkerboard pattern (list of black cells + list of white cells), which is bad, but yields fewer artifacts than an ordered list.
C, C++, C#, Java implementations (if any) are preferred
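To make that wrap-around distance concrete, here is a minimal sketch of the metric implied by the last requirement (Python used for illustration only; this is an interpretation of the question, not code from it):
def torus_dist2(ax, ay, bx, by, n):
    # wrap-around (toroidal) squared distance on an n x n grid:
    # opposite edges are treated as adjacent
    dx = min(abs(ax - bx), n - abs(ax - bx))
    dy = min(abs(ay - by), n - abs(ay - by))
    return dx * dx + dy * dy

n = 8
print(torus_dist2(0, 0, n - 1, n - 1, n))  # 2, the same as...
print(torus_dist2(0, 0, 1, 1, n))          # 2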
You can use a linear congruential generator to create an even distribution across your n×n space. For example, if you have a 64×64 grid, using a stride of 47 will create the pattern on the left below. (Run on jsbin) The cells are visited from light to dark.
That pattern does not cluster, but it is rather uniform. It uses a simple row-wide transformation where
k = (k + 47) mod (n * n)
x = k mod n
y = k div n
You can add a bit of randomness by making k the index of a space-filling curve such as the Hilbert curve. This will yield the pattern on the right. (Run on jsbin)
[images: left, the stride-47 pattern; right, the Hilbert-curve variant — see the jsbin links]
You can see the code in the jsbin links.
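The jsbin code itself is not reproduced here, but the stride walk is small enough to sketch; a minimal Python version (an illustration, assuming only a stride coprime to n*n so that the walk is a full permutation):
import math

def stride_order(n, stride=47):
    total = n * n
    # a stride coprime to n*n guarantees each cell is visited exactly once
    assert math.gcd(stride, total) == 1
    k = 0
    for _ in range(total):
        k = (k + stride) % total
        yield k % n, k // n  # (x, y) of the next cell to visit

cells = list(stride_order(64))
assert len(set(cells)) == 64 * 64  # all 4096 cells, no repeats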
I have solved the problem myself and am just sharing my solution:
Here are my outputs for i between 0 and 3:
power: 0
ordering:
0
matrix visit order:
0
power: 1
ordering:
0 3 2 1
matrix visit order:
0 3
2 1
power: 2
ordering:
0 10 8 2 5 15 13 7 4 14 12 6 1 11 9 3
matrix visit order:
0 12 3 15
8 4 11 7
2 14 1 13
10 6 9 5
power: 3
ordering:
0 36 32 4 18 54 50 22 16 52 48 20 2 38 34 6
9 45 41 13 27 63 59 31 25 61 57 29 11 47 43 15
8 44 40 12 26 62 58 30 24 60 56 28 10 46 42 14
1 37 33 5 19 55 51 23 17 53 49 21 3 39 35 7
matrix visit order:
0 48 12 60 3 51 15 63
32 16 44 28 35 19 47 31
8 56 4 52 11 59 7 55
40 24 36 20 43 27 39 23
2 50 14 62 1 49 13 61
34 18 46 30 33 17 45 29
10 58 6 54 9 57 5 53
42 26 38 22 41 25 37 21
The code:
public static int[] GetPattern(int power, int maxReturnSize = int.MaxValue)
{
    int sideLength = 1 << power;
    int cellsNumber = sideLength * sideLength;
    int[] ret = new int[cellsNumber];
    for ( int i = 0 ; i < cellsNumber && i < maxReturnSize ; i++ ) {
        // this loop's body can be used for per-request computation
        int x = 0;
        int y = 0;
        for ( int p = power - 1 ; p >= 0 ; p-- ) {
            int temp = (i >> (p * 2)) % 4; // 2 bits of the index, starting from the beginning
            int a = temp % 2;  // the first bit
            int b = temp >> 1; // the second bit
            x += a << power - 1 - p;
            y += (a ^ b) << power - 1 - p; // ^ is XOR
            // 00 => (0,0), 01 => (1,1), 10 => (0,1), 11 => (1,0), scaled to 2^p where 0 <= p
        }
        // to index
        int index = y * sideLength + x;
        ret[i] = index;
    }
    return ret;
}
I do admit that somewhere along the way the values got transposed, but it does not matter because of how it works.
After doing some optimization I came up with this loop body:
int x = 0;
int y = 0;
for ( int p = 0 ; p < power ; p++ ) {
    int temp = ( i >> ( p * 2 ) ) & 3;
    int a = temp & 1;
    int b = temp >> 1;
    x = ( x << 1 ) | a;
    y = ( y << 1 ) | ( a ^ b );
}
int index = y * sideLength + x;
(The code assumes that the C# optimizer, IL2CPP, and the C++ compiler will optimize the variables temp, a, and b out.)
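For reference, a sketch of the same mapping in Python (a translation for illustration, not the author's code); it reproduces the orderings listed above:
def get_pattern(power):
    side = 1 << power
    ret = []
    for i in range(side * side):
        x = y = 0
        for p in range(power):
            pair = (i >> (p * 2)) & 3  # two bits of the index, least significant pair first
            a = pair & 1
            b = pair >> 1
            x = (x << 1) | a
            y = (y << 1) | (a ^ b)
        ret.append(y * side + x)
    return ret

print(get_pattern(1))  # [0, 3, 2, 1]           -- matches "power: 1" above
print(get_pattern(2))  # [0, 10, 8, 2, 5, ...]  -- matches "power: 2" above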

What is the best way to generate a random pattern inside of a table

I've got a table (2d array), c x r. I need to generate a random pattern of connected cells inside of it, with no self-crossings and no diagonal moves. See the related picture for an example (ex. 1):
c = 6, r = 7, the pattern is shown in numbers.
I wrote a function for this and it works fine, but I'm looking for hard optimization. In the code below you can see that if the pattern gets into a dead end it just rebuilds itself from the start. That is very inefficient if the pattern length is close or equal to the number of cells, c*r (42 in the example). So some smart solution is needed, like moving the whole pattern symmetrically when it runs out of possible moves, or adding some analytics to the function so it never gets caught in dead ends. Again, for low values of c, r and patternLength my example works fine, but I'm looking for algorithmic perfection and high performance even on pretty high numbers.
function ClassLogic:generatePattern()
    --[[ subfunctions ]]
    -- choosing the next point for the pattern
    local move = function( seq )
        -- getting the last sequence point
        local last = seq[#seq]
        -- checking the nearness of walls
        local wallLeft, wallRight, wallUp, wallDown =
            (last.c == 1),
            (last.c == config.tableSize.c),
            (last.r == 1),
            (last.r == config.tableSize.r)
        -- checking the nearness of already sequenced points
        local spLeft, spRight, spUp, spDown =
            (utilities.indexOfTable( seq, { c = last.c - 1, r = last.r } ) ~= -1),
            (utilities.indexOfTable( seq, { c = last.c + 1, r = last.r } ) ~= -1),
            (utilities.indexOfTable( seq, { c = last.c, r = last.r - 1 } ) ~= -1),
            (utilities.indexOfTable( seq, { c = last.c, r = last.r + 1 } ) ~= -1)
        local leftRestricted  = (wallLeft or spLeft)
        local rightRestricted = (wallRight or spRight)
        local upRestricted    = (wallUp or spUp)
        local downRestricted  = (wallDown or spDown)
        if ( leftRestricted and rightRestricted and upRestricted and downRestricted ) then
            -- dead end
            print('d/e')
            return nil
        else
            -- go somewhere possible
            local possibleDirections = {}
            if (not leftRestricted)  then possibleDirections[#possibleDirections + 1] = 1 end
            if (not rightRestricted) then possibleDirections[#possibleDirections + 1] = 2 end
            if (not upRestricted)    then possibleDirections[#possibleDirections + 1] = 3 end
            if (not downRestricted)  then possibleDirections[#possibleDirections + 1] = 4 end
            local direction = possibleDirections[math.random( 1, #possibleDirections )]
            if (direction == 1) then
                -- next point is left
                return { c = last.c - 1, r = last.r }
            elseif (direction == 2) then
                -- next point is right
                return { c = last.c + 1, r = last.r }
            elseif (direction == 3) then
                -- next point is up
                return { c = last.c, r = last.r - 1 }
            elseif (direction == 4) then
                -- next point is down
                return { c = last.c, r = last.r + 1 }
            end
        end
    end
    --[[ subfunctions end ]]

    -- choose a random entry point
    local entry = { c = math.random( 1, config.tableSize.c ),
                    r = math.random( 1, config.tableSize.r ) }
    -- start the point sequence
    local pointSequence = { [1] = entry }
    -- building the pattern
    local succeed = false
    while (not succeed) do
        for i = 2, self.patternLength do
            local nextPoint = move( pointSequence )
            if (nextPoint ~= nil) then
                pointSequence[i] = nextPoint
                if (i == self.patternLength) then succeed = true end
            else
                pointSequence = { [1] = entry }
                break
            end
        end
    end
    return pointSequence
end
Any ideas or approaches on how this could be realized would be highly appreciated. Maybe a recursive backtracker, pathfinding, or random-walk algorithm?
The snake-style growing is not enough for good performance.
The main idea is to randomly modify the path being generated by adding small detours like the following:
- - 6 - -        - - 8 - -
- - 5 - -        - 6 7 - -
- - 4 1 -  ===>  - 5 4 1 -
- - 3 2 -        - - 3 2 -
- - - - -        - - - - -
(note the two additional cells added to the left of the 4-5 segment)
Such an implementation works very fast for area filling < 95%:
local function generate_path(W, H, L)
    -- W = field width (number of columns)  -- c = 1..W
    -- H = field height (number of rows)    -- r = 1..H
    -- L = path length, must be within range 1..W*H
    assert(L >= 1 and L <= W * H, "Path length is greater than field area")

    local function get_idx(x, y)
        return x >= 1 and x <= W and y >= 1 and y <= H and (y - 1) * W + x
    end

    local function get_x_y(idx)
        local x = (idx - 1) % W + 1
        local y = (idx - x) / W + 1
        return x, y
    end

    local function random_sort(array)
        for last = #array, 2, -1 do
            local pos = math.random(last)
            array[pos], array[last] = array[last], array[pos]
        end
    end

    local path_sum_x = 0
    local path_sum_y = 0
    local path_ctr = 0
    local is_unused = {}  -- [idx] = true/nil (or idx recently swapped with)

    local function mark_as_unused(idx, value)
        local x, y = get_x_y(idx)
        path_sum_x = path_sum_x - x
        path_sum_y = path_sum_y - y
        path_ctr = path_ctr - 1
        is_unused[idx] = value or true
    end

    local function mark_as_path(idx)
        local x, y = get_x_y(idx)
        path_sum_x = path_sum_x + x
        path_sum_y = path_sum_y + y
        path_ctr = path_ctr + 1
        is_unused[idx] = nil
    end

    for x = 1, W do
        for y = 1, H do
            is_unused[get_idx(x, y)] = true
        end
    end

    -- create a path of length 1 by selecting a random cell
    local idx = get_idx(math.random(W), math.random(H))
    mark_as_path(idx)
    local path = {first = idx, last = idx, [idx] = {}}
    -- path[idx] == {next=next_idx/nil, prev=prev_idx/nil}

    local function grow()
        local variants = {
            {dx=-1, dy=0, origin="last"},  {dx=1, dy=0, origin="last"},
            {dx=0, dy=-1, origin="last"},  {dx=0, dy=1, origin="last"},
            {dx=-1, dy=0, origin="first"}, {dx=1, dy=0, origin="first"},
            {dx=0, dy=-1, origin="first"}, {dx=0, dy=1, origin="first"}
        }
        random_sort(variants)
        for _, vector in ipairs(variants) do
            local x, y = get_x_y(path[vector.origin])
            local idx = get_idx(vector.dx + x, vector.dy + y)
            if is_unused[idx] then
                if vector.origin == 'first' then
                    -- add new first cell of the path
                    local old_first = path.first
                    path[old_first].prev = idx
                    path[idx] = {next = old_first}
                    path.first = idx
                else
                    -- add new last cell of the path
                    local old_last = path.last
                    path[old_last].next = idx
                    path[idx] = {prev = old_last}
                    path.last = idx
                end
                mark_as_path(idx)
                return true
            end
        end
    end

    local function shrink()
        if math.random(2) == 2 then
            -- remove the first cell of the path
            local old_first = path.first
            local new_first = assert(path[old_first].next)
            path[old_first] = nil
            path.first = new_first
            path[new_first].prev = nil
            mark_as_unused(old_first)
        else
            -- remove the last cell of the path
            local old_last = path.last
            local new_last = assert(path[old_last].prev)
            path[old_last] = nil
            path.last = new_last
            path[new_last].next = nil
            mark_as_unused(old_last)
        end
    end

    local function inflate()
        local variants = {}
        local idx1 = path.first
        repeat
            local idx4 = path[idx1].next
            if idx4 then
                local x1, y1 = get_x_y(idx1)
                local x4, y4 = get_x_y(idx4)
                local dx14, dy14 = x4 - x1, y4 - y1
                local dx, dy = dy14, dx14
                for side = 1, 2 do
                    dx, dy = -dx, -dy
                    local x2, y2 = x1 + dx, y1 + dy
                    local idx2 = get_idx(x2, y2)
                    local idx3 = get_idx(x2 + dx14, y2 + dy14)
                    if is_unused[idx2] and is_unused[idx3] then
                        table.insert(variants, {idx1, idx2, idx3, idx4})
                    end
                end
            end
            idx1 = idx4
        until not idx4
        if #variants > 0 then
            local idx1, idx2, idx3, idx4 =
                (table.unpack or unpack)(variants[math.random(#variants)])
            -- insert idx2 and idx3 between idx1 and idx4
            path[idx1].next = idx2
            path[idx2] = {prev = idx1, next = idx3}
            path[idx3] = {prev = idx2, next = idx4}
            path[idx4].prev = idx3
            mark_as_path(idx2)
            mark_as_path(idx3)
            return true
        end
    end

    local function euclid(dx, dy)
        return dx*dx + dy*dy
    end

    local function swap()
        local variants = {}
        local path_center_x = path_sum_x / path_ctr
        local path_center_y = path_sum_y / path_ctr
        local idx1 = path.first
        repeat
            local idx2 = path[idx1].next
            local idx3 = idx2 and path[idx2].next
            if idx3 then
                local x1, y1 = get_x_y(idx1)
                local x2, y2 = get_x_y(idx2)
                local x3, y3 = get_x_y(idx3)
                local dx12, dy12 = x2 - x1, y2 - y1
                local dx23, dy23 = x3 - x2, y3 - y2
                if dx12 * dx23 + dy12 * dy23 == 0 then
                    local x, y = x1 + dx23, y1 + dy23
                    local idx = get_idx(x, y)
                    local dist2 = euclid(x2 - path_center_x, y2 - path_center_y)
                    local dist = euclid(x - path_center_x, y - path_center_y)
                    if is_unused[idx] and dist2 < dist and is_unused[idx] ~= idx2 then
                        table.insert(variants, {idx1, idx2, idx3, idx})
                    end
                end
            end
            idx1 = idx2
        until not idx3
        if #variants > 0 then
            local idx1, idx2, idx3, idx =
                (table.unpack or unpack)(variants[math.random(#variants)])
            -- swap idx2 and idx
            path[idx1].next = idx
            path[idx] = path[idx2]
            path[idx3].prev = idx
            path[idx2] = nil
            mark_as_unused(idx2, idx)
            mark_as_path(idx)
            return true
        end
    end

    local actions = {grow, inflate, swap}
    repeat
        random_sort(actions)
        local success
        for _, action in ipairs(actions) do
            success = action()
            if success then
                break
            end
        end
        if not success and path_ctr < L then
            -- erase and rewind
            while path_ctr > 1 do
                shrink()
            end
        end
    until path_ctr >= L
    while path_ctr > L do
        shrink()
    end

    local pointSequence = {}
    local idx = path.first
    local step = 0
    repeat
        step = step + 1
        path[idx].step = step
        local x, y = get_x_y(idx)
        pointSequence[step] = {c = x, r = y}
        idx = path[idx].next
    until not idx

    local field = 'W = '..W..', H = '..H..', L = '..L..'\n'
    for y = 1, H do
        for x = 1, W do
            local c = path[get_idx(x, y)]
            field = field..(' '..(c and c.step or '-')):sub(-4)
        end
        field = field..'\n'
    end
    print(field)

    return pointSequence
end
Usage example:
math.randomseed(os.time())
local pointSequence = generate_path(6, 7, 10)
-- pointSequence = {[1]={r=r1,c=c1}, [2]={r=r2,c=c2},...,[10]={r=r10,c=c10}}
Result examples:
W = 5, H = 5, L = 10
- - - 9 10
- 6 7 8 -
- 5 4 1 -
- - 3 2 -
- - - - -
W = 5, H = 5, L = 19
15 16 17 18 19
14 1 2 3 4
13 12 11 6 5
- - 10 7 -
- - 9 8 -
W = 6, H = 7, L = 35
- 35 34 25 24 23
- - 33 26 21 22
- 31 32 27 20 19
- 30 29 28 - 18
- 1 10 11 12 17
3 2 9 8 13 16
4 5 6 7 14 15
W = 19, H = 21, L = 394
77 78 79 84 85 118 119 120 121 122 123 124 125 126 127 128 129 254 255
76 75 80 83 86 117 116 115 114 141 140 139 138 135 134 131 130 253 256
73 74 81 82 87 88 89 112 113 142 145 146 137 136 133 132 - 252 257
72 69 68 67 92 91 90 111 - 143 144 147 148 149 150 151 152 251 258
71 70 65 66 93 108 109 110 163 162 161 160 159 158 157 156 153 250 259
58 59 64 63 94 107 166 165 164 191 192 193 196 197 - 155 154 249 260
57 60 61 62 95 106 167 168 189 190 - 194 195 198 241 242 243 248 261
56 55 54 53 96 105 170 169 188 203 202 201 200 199 240 239 244 247 262
47 48 51 52 97 104 171 172 187 204 205 206 231 232 237 238 245 246 263
46 49 50 99 98 103 174 173 186 209 208 207 230 233 236 267 266 265 264
45 42 41 100 101 102 175 184 185 210 211 228 229 234 235 268 269 270 271
44 43 40 39 38 177 176 183 214 213 212 227 226 225 276 275 274 273 272
33 34 35 36 37 178 179 182 215 216 217 218 223 224 277 278 279 280 281
32 29 28 23 22 - 180 181 12 11 10 219 222 287 286 285 284 283 282
31 30 27 24 21 18 17 14 13 8 9 220 221 288 289 290 291 292 293
380 381 26 25 20 19 16 15 394 7 4 3 304 303 300 299 296 295 294
379 382 383 384 387 388 391 392 393 6 5 2 305 302 301 298 297 312 313
378 371 370 385 386 389 390 347 346 343 342 1 306 307 308 309 310 311 314
377 372 369 364 363 350 349 348 345 344 341 340 333 332 319 318 317 316 315
376 373 368 365 362 351 352 353 354 355 338 339 334 331 320 321 322 323 324
375 374 367 366 361 360 359 358 357 356 337 336 335 330 329 328 327 326 325

JPEG compression implementation in MATLAB

I'm working on an implementation of the JPEG compression algorithm in MATLAB. I've run into some issues when computing the discrete cosine transform (DCT) of the 8x8 image blocks (T = H * F * H_transposed, where H is the matrix containing the DCT coefficients of an 8x8 matrix, generated with dctmtx(8), and F is an 8x8 image block). The code is below:
jpegCompress.m
function y = jpegCompress(x, quality)
% y = jpegCompress(x, quality) compresses an image X based on 8 x 8 DCT
% transforms, coefficient quantization and Huffman symbol coding. Input
% quality determines the amount of information that is lost and
% compression achieved. y is the encoding structure containing fields:
%   y.size      size of x
%   y.numblocks number of 8 x 8 encoded blocks
%   y.quality   quality factor as percent
%   y.huffman   Huffman coding structure

narginchk(1, 2); % check number of input arguments
if ~ismatrix(x) || ~isreal(x) || ~isnumeric(x) || ~isa(x, 'uint8')
    error('The input must be a uint8 image.');
end
if nargin < 2
    quality = 1; % default value for quality
end
if quality <= 0
    error('Input parameter QUALITY must be greater than zero.');
end

m = [16  11  10  16  24  40  51  61   % default JPEG normalizing array
     12  12  14  19  26  58  60  55   % and zig-zag reordering pattern
     14  13  16  24  40  57  69  56
     14  17  22  29  51  87  80  62
     18  22  37  56  68 109 103  77
     24  35  55  64  81 104 113  92
     49  64  78  87 103 121 120 101
     72  92  95  98 112 100 103  99] * quality;

order = [1  9  2  3 10 17 25 18 11  4  5 12 19 26 33 ...
        41 34 27 20 13  6  7 14 21 28 35 42 49 57 50 ...
        43 36 29 22 15  8 16 23 30 37 44 51 58 59 52 ...
        45 38 31 24 32 39 46 53 60 61 54 47 40 48 55 ...
        62 63 56 64];

[xm, xn] = size(x);  % retrieve size of input image
x = double(x) - 128; % level shift input
t = dctmtx(8);       % compute 8 x 8 DCT matrix

% Compute DCTs of 8 x 8 blocks and quantize coefficients
y = blkproc(x, [8 8], 'P1 * x * P2', t, t');
y = blkproc(y, [8 8], 'round(x ./ P1)', m); % <== nearly all elements of y are zero after this step

y = im2col(y, [8 8], 'distinct'); % break 8 x 8 blocks into columns
xb = size(y, 2);                  % get number of blocks
y = y(order, :);                  % reorder column elements

eob = max(x(:)) + 1;              % create end-of-block symbol
r = zeros(numel(y) + size(y, 2), 1);
count = 0;
for j = 1:xb                      % process one block (one column) at a time
    i = find(y(:, j), 1, 'last'); % find last non-zero element
    if isempty(i)                 % check if there are no non-zero values
        i = 0;
    end
    p = count + 1;
    q = p + i;
    r(p:q) = [y(1:i, j); eob];    % truncate trailing zeros, add eob
    count = count + i + 1;        % and add to output vector
end
r((count + 1):end) = [];          % delete unused portion of r

y           = struct;
y.size      = uint16([xm xn]);
y.numblocks = uint16(xb);
y.quality   = uint16(quality * 100);
y.huffman   = mat2huff(r);
mat2huff is implemented as:
mat2huff.m
function y = mat2huff(x)
%MAT2HUFF Huffman encodes a matrix.
% Y = mat2huff(X) Huffman encodes matrix X using symbol
% probabilities in unit-width histogram bins between X's minimum
% and maximum values. The encoded data is returned as a structure
% Y:
%   Y.code  the Huffman-encoded values of X, stored in
%           a uint16 vector. The other fields of Y contain
%           additional decoding information, including:
%   Y.min   the minimum value of X plus 32768
%   Y.size  the size of X
%   Y.hist  the histogram of X
%
% If X is logical, uint8, uint16, uint32, int8, int16, or double,
% with integer values, it can be input directly to MAT2HUFF. The
% minimum value of X must be representable as an int16.
%
% If X is double with non-integer values --- for example, an image
% with values between 0 and 1 --- first scale X to an appropriate
% integer range before the call. For example, use
% Y = MAT2HUFF(255 * X) for 256 gray level encoding.
%
% NOTE: The number of Huffman code words is round(max(X(:))) -
% round(min(X(:))) + 1. You may need to scale input X to generate
% codes of reasonable length. The maximum row or column dimension
% of X is 65535.

if ~ismatrix(x) || ~isreal(x) || (~isnumeric(x) && ~islogical(x))
    error('X must be a 2-D real numeric or logical matrix.');
end

% Store the size of input x.
y.size = uint32(size(x));

% Find the range of x values and store its minimum value biased
% by +32768 as a uint16.
x = round(double(x));
xmin = min(x(:));
xmax = max(x(:));
pmin = double(int16(xmin));
pmin = uint16(pmin + 32768);
y.min = pmin;

% Compute the input histogram between xmin and xmax with unit
% width bins, scale to uint16, and store.
x = x(:)';
h = histc(x, xmin:xmax);
if max(h) > 65535
    h = 65535 * h / max(h);
end
h = uint16(h);
y.hist = h;

% Code the input matrix and store the result.
map = huffman(double(h));          % Make Huffman code map
hx = map(x(:) - xmin + 1);         % Map image
hx = char(hx)';                    % Convert to char array
hx = hx(:)';
hx(hx == ' ') = [];                % Remove blanks
ysize = ceil(length(hx) / 16);     % Compute encoded size
hx16 = repmat('0', 1, ysize * 16); % Pre-allocate modulo-16 vector
hx16(1:length(hx)) = hx;           % Make hx modulo-16 in length
hx16 = reshape(hx16, 16, ysize);   % Reshape to 16-character words
hx16 = hx16' - '0';                % Convert binary string to decimal
twos = pow2(15:-1:0);
y.code = uint16(sum(hx16 .* twos(ones(ysize, 1), :), 2))';
Why is the block processing step generating mostly null values?
It is likely that multiplying the quantization values you have by four is causing the DCT coefficients to go to zero: in the posted code the normalizing array m is scaled by the quality argument, so quality = 4 quadruples every divisor.
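A minimal numeric illustration of that effect (assuming quality = 4 was passed; 29 is the m(4,4) entry of the normalizing array above, and 35 stands in for a typical mid-frequency DCT coefficient):
m44 = 29        # m(4,4) from the normalizing array in the question
coeff = 35.0    # hypothetical mid-frequency DCT coefficient magnitude
print(round(coeff / m44))        # quality = 1: 35/29 rounds to 1, the coefficient survives
print(round(coeff / (4 * m44)))  # quality = 4: 35/116 rounds to 0, the coefficient is lost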

How to speed up Pandas multilevel dataframe shift by group?

I am trying to shift Pandas dataframe column data grouped by the first index level. Here is the demo code:
In [8]: df = mul_df(5,4,3)
In [9]: df
Out[9]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 -0.5505 0.7445 -0.3645
B001 0.9129 -1.0473 -0.5478
B002 0.8016 0.0292 0.9002
B003 2.0744 -0.2942 -0.7117
A0001 B000 0.7064 0.9636 0.2805
B001 0.4763 0.2741 -1.2437
B002 1.1563 0.0525 -0.7603
B003 -0.4334 0.2510 -0.0105
A0002 B000 -0.6443 0.1723 0.2657
B001 1.0719 0.0538 -0.0641
B002 0.6787 -0.3386 0.6757
B003 -0.3940 -1.2927 0.3892
A0003 B000 -0.5862 -0.6320 0.6196
B001 -0.1129 -0.9774 0.7112
B002 0.6303 -1.2849 -0.4777
B003 0.5046 -0.4717 -0.2133
A0004 B000 1.6420 -0.9441 1.7167
B001 0.1487 0.1239 0.6848
B002 0.6139 -1.9085 -1.9508
B003 0.3408 -1.3891 0.6739
In [10]: grp = df.groupby(level=df.index.names[0])
In [11]: grp.shift(1)
Out[11]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 NaN NaN NaN
B001 -0.5505 0.7445 -0.3645
B002 0.9129 -1.0473 -0.5478
B003 0.8016 0.0292 0.9002
A0001 B000 NaN NaN NaN
B001 0.7064 0.9636 0.2805
B002 0.4763 0.2741 -1.2437
B003 1.1563 0.0525 -0.7603
A0002 B000 NaN NaN NaN
B001 -0.6443 0.1723 0.2657
B002 1.0719 0.0538 -0.0641
B003 0.6787 -0.3386 0.6757
A0003 B000 NaN NaN NaN
B001 -0.5862 -0.6320 0.6196
B002 -0.1129 -0.9774 0.7112
B003 0.6303 -1.2849 -0.4777
A0004 B000 NaN NaN NaN
B001 1.6420 -0.9441 1.7167
B002 0.1487 0.1239 0.6848
B003 0.6139 -1.9085 -1.9508
The mul_df() code is attached here : How to speed up Pandas multilevel dataframe sum?
Now I want to do grp.shift(1) for a big dataframe.
In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop
5.23s is too slow. How can I speed it up?
(My computer configuration: Pentium Dual-Core T4200 @ 2.00GHz, 3.00GB RAM, Windows XP, Python 2.7.4, NumPy 1.7.1, pandas 0.11.0, numexpr 2.0.1, Anaconda 1.5.0 (32-bit))
How about shifting the whole DataFrame object and then setting the first row of every group to NaN?
dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
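Here is a small self-contained check of this trick (since mul_df is only linked above, a hand-built MultiIndex frame stands in; the shape and names are illustrative):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['A0000', 'A0001'], ['B000', 'B001', 'B002']],
                                 names=['STK_ID', 'RPT_Date'])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=['COL000', 'COL001'])

dfs = df.shift(1)
# positions where a new group starts: the cumulative sizes of all but the last group
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

# identical to the grouped shift
assert dfs.equals(df.groupby(level=0).shift(1))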
The problem is that the shift operation is not cython-optimized, so it involves a callback to Python. Compare this with:
In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop
In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop
I added an issue for this: https://github.com/pydata/pandas/issues/4095
Similar question, with an added answer that works for shifts in either direction and magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
Code (including test setup) is:
#
# the function to use in apply
#
def replace_shift_overlap(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)

#
# the apply
#
df = df.groupby(level=0).apply(replace_shift_overlap, 'tmpShift', shiftBy, np.nan)

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
EDIT: Note that the initial sort really eats into the effectiveness of this. So in some cases the original answer is more effective.
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 15, 30, 45, 43, 67, 22, 12, 14, 54],
                   'B': [13, 23, 18, 33, 48, 1, 7, 56, 66, 45, 32],
                   'C': [17, 27, 22, 37, 52, 77, 34, 21, 22, 90, 8],
                   'D': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'c']})
df
#> A B C D
#> 0 10 13 17 a
#> 1 20 23 27 a
#> 2 15 18 22 a
#> 3 30 33 37 a
#> 4 45 48 52 b
#> 5 43 1 77 b
#> 6 67 7 34 b
#> 7 22 56 21 c
#> 8 12 66 22 c
#> 9 14 45 90 c
#> 10 54 32 8 c
def groupby_shift(df, col, groupcol, shift_n, fill_na=np.nan):
    '''df:       dataframe
       col:      column to be shifted
       groupcol: grouping variable
       shift_n:  how much to shift
       fill_na:  how to fill nan values, default is np.nan
    '''
    rowno = list(df.groupby(groupcol).size().cumsum())
    lagged_col = df[col].shift(shift_n)
    na_rows = [i for i in range(shift_n)]   # rows to blank at the very top
    for i in rowno:
        if i == rowno[len(rowno) - 1]:      # skip the boundary after the last group
            continue
        else:
            new = [i + j for j in range(shift_n)]  # rows at the start of each group
            na_rows.extend(new)
    na_rows = list(set(na_rows))
    na_rows = [i for i in na_rows if i <= len(lagged_col) - 1]
    lagged_col.iloc[na_rows] = fill_na
    return lagged_col
df['A_lag_1'] = groupby_shift(df, 'A', 'D', 1)
df
#> A B C D A_lag_1
#> 0 10 13 17 a NaN
#> 1 20 23 27 a 10.0
#> 2 15 18 22 a 20.0
#> 3 30 33 37 a 15.0
#> 4 45 48 52 b NaN
#> 5 43 1 77 b 45.0
#> 6 67 7 34 b 43.0
#> 7 22 56 21 c NaN
#> 8 12 66 22 c 22.0
#> 9 14 45 90 c 12.0
#> 10 54 32 8 c 14.0

How to efficiently calculate a row in Pascal's triangle?

I'm interested in finding the nth row of Pascal's triangle (not a specific element but the whole row itself). What would be the most efficient way to do it?
I thought about the conventional way to construct the triangle by summing up the corresponding elements in the row above, which would take:
1 + 2 + .. + n = O(n^2)
Another way could be using the combination formula for a specific element:
c(n, k) = n! / (k!(n-k)!)
for each element in the row, which I guess would take more time than the former method, depending on how the combination is calculated. Any ideas?
>>> def pascal(n):
...     line = [1]
...     for k in range(n):
...         line.append(line[k] * (n - k) // (k + 1))
...     return line
...
>>> pascal(9)
[1, 9, 36, 84, 126, 126, 84, 36, 9, 1]
This uses the following identity:
C(n,k+1) = C(n,k) * (n-k) / (k+1)
So you can start with C(n,0) = 1 and then calculate the rest of the line using this identity, each time multiplying the previous element by (n-k) / (k+1).
A single row can be calculated as follows:
First compute 1.            -> N choose 0
Then N/1                    -> N choose 1
Then N*(N-1)/(1*2)          -> N choose 2
Then N*(N-1)*(N-2)/(1*2*3)  -> N choose 3
.....
Notice that you can compute the next value from the previous value by just multiplying by a single number and then dividing by another number.
This can be done in a single loop. Sample Python:
def comb_row(n):
    r = 0
    num = n
    cur = 1
    yield cur
    while r < n:
        r += 1
        cur = (cur * num) // r  # multiply first, then divide, so the result stays an integer
        yield cur
        num -= 1
The most efficient approach would be:
#include <vector>

std::vector<int> pascal_row(int n) {
    std::vector<int> row(n + 1);
    row[0] = 1; // first element is always 1
    for (int i = 1; i < n / 2 + 1; i++) { // progress up, until reaching the middle value
        row[i] = row[i - 1] * (n - i + 1) / i;
    }
    for (int i = n / 2 + 1; i <= n; i++) { // copy the inverse of the first part
        row[i] = row[n - i];
    }
    return row;
}
Here is a fast example implemented in Go that calculates from the outer edges of a row and works its way to the middle, assigning two values with a single calculation:
package main

import "fmt"

func calcRow(n int) []int {
    // row always has n + 1 elements
    row := make([]int, n+1, n+1)
    // set the edges
    row[0], row[n] = 1, 1
    // calculate values for the next n-1 columns
    for i := 0; i < int(n/2); i++ {
        x := row[i] * (n - i) / (i + 1)
        row[i+1], row[n-1-i] = x, x
    }
    return row
}

func main() {
    for n := 0; n < 20; n++ {
        fmt.Printf("n = %d, row = %v\n", n, calcRow(n))
    }
}
The output for 20 iterations takes about 1/4 of a millisecond to run:
n = 0, row = [1]
n = 1, row = [1 1]
n = 2, row = [1 2 1]
n = 3, row = [1 3 3 1]
n = 4, row = [1 4 6 4 1]
n = 5, row = [1 5 10 10 5 1]
n = 6, row = [1 6 15 20 15 6 1]
n = 7, row = [1 7 21 35 35 21 7 1]
n = 8, row = [1 8 28 56 70 56 28 8 1]
n = 9, row = [1 9 36 84 126 126 84 36 9 1]
n = 10, row = [1 10 45 120 210 252 210 120 45 10 1]
n = 11, row = [1 11 55 165 330 462 462 330 165 55 11 1]
n = 12, row = [1 12 66 220 495 792 924 792 495 220 66 12 1]
n = 13, row = [1 13 78 286 715 1287 1716 1716 1287 715 286 78 13 1]
n = 14, row = [1 14 91 364 1001 2002 3003 3432 3003 2002 1001 364 91 14 1]
n = 15, row = [1 15 105 455 1365 3003 5005 6435 6435 5005 3003 1365 455 105 15 1]
n = 16, row = [1 16 120 560 1820 4368 8008 11440 12870 11440 8008 4368 1820 560 120 16 1]
n = 17, row = [1 17 136 680 2380 6188 12376 19448 24310 24310 19448 12376 6188 2380 680 136 17 1]
n = 18, row = [1 18 153 816 3060 8568 18564 31824 43758 48620 43758 31824 18564 8568 3060 816 153 18 1]
n = 19, row = [1 19 171 969 3876 11628 27132 50388 75582 92378 92378 75582 50388 27132 11628 3876 969 171 19 1]
An easy way to calculate it is by noticing that the element of the next row can be calculated as a sum of two consecutive elements in the previous row.
[1, 5, 10, 10, 5, 1]
[1, 6, 15, 20, 15, 6, 1]
For example 6 = 5 + 1, 15 = 5 + 10, 1 = 1 + 0 and 20 = 10 + 10. This gives a simple algorithm to calculate the next row from the previous one.
def pascal(n):
    row = [1]
    for x in range(n):
        row = [l + r for l, r in zip(row + [0], [0] + row)]
        # print(row)
    return row

print(pascal(10))
In Scala, I would have done it as simply as this:
def pascal(c: Int, r: Int): Int = c match {
  case 0 => 1
  case `c` if c >= r => 1
  case _ => pascal(c - 1, r - 1) + pascal(c, r - 1)
}
I would call it inside this:
for (row <- 0 to 10) {
  for (col <- 0 to row)
    print(pascal(col, row) + " ")
  println()
}
resulting in:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
1 6 15 20 15 6 1
1 7 21 35 35 21 7 1
1 8 28 56 70 56 28 8 1
1 9 36 84 126 126 84 36 9 1
1 10 45 120 210 252 210 120 45 10 1
To explain step by step:
Step 1: We make sure that if our column is the first one, we always return 1.
Step 2: Each row X has X columns, so when the column index is greater than or equal to the row index we are at the last column and return 1.
Step 3: Otherwise we return the sum of the recursive pascal of the column just before the current one in the previous row, and the pascal of the same column in the previous row.
Good luck.
Let me build upon Shane's excellent work for an R solution. (Thank you, Shane!) His code for generating the triangle:
pascalTriangle <- function(h) {
  lapply(0:h, function(i) choose(i, 0:i))
}
This will allow one to store the triangle as a list. We can then index whatever row is desired, but remember to add 1 when indexing! For example, I'll grab the bottom row:
pt_with_24_rows <- pascalTriangle(24)
row_24 <- pt_with_24_rows[25] # add one
row_24[[1]] # prints the row
So, finally, make-believe I have a Galton board problem: I have the arbitrary challenge of finding out what percentage of beans has clustered in the center, say, bins 10 to 15 (out of 25).
sum(row_24[[1]][10:15])/sum(row_24[[1]])
Which turns out to be 0.7704771. All good!
In Ruby, the following code will print out the specific row of Pascal's triangle that you want:
def row(n)
  pascal = [1]
  if n < 1
    p pascal
    return pascal
  else
    n.times do |num|
      nextNum = ((n - num) / (num.to_f + 1)) * pascal[num]
      pascal << nextNum.to_i
    end
  end
  p pascal
end
Where calling row(0) returns [1] and row(5) returns [1, 5, 10, 10, 5, 1]
Here is another simple way to build a Pascal triangle dynamically, using VBA:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
Sub pascal()
    Dim book As Excel.Workbook
    Dim sht As Worksheet
    Set book = ThisWorkbook
    Set sht = book.Worksheets("sheet1")
    a = InputBox("Enter the Number", "Fill")
    For i = 1 To a
        For k = 1 To i
            If i >= 2 And k >= 2 Then
                sht.Cells(i, k).Value = sht.Cells(i - 1, k - 1).Value + sht.Cells(i - 1, k).Value
            Else
                sht.Cells(i, k).Value = 1
            End If
        Next k
    Next i
End Sub
I used a TI-84 Plus CE.
The –> in line 6 is the store-value button.
For loop syntax is:
:For(variable, beginning, end [, increment])
:Commands
:End
nCr syntax is:
:valueA nCr valueB
List indexes start at 1, so that's why I set it to R+1.
N = row
R = column
PROGRAM: PASCAL
:ClrHome
:ClrList L1
:Disp "ROW
:Input N
:For(R,0,N,1)
:N nCr R–>L1(R+1)
:End
:Disp L1
This is the fastest way I can think of to do this in programming (with a TI-84), but if you mean to calculate the row using pen and paper then just draw out the triangle, because doing factorials is a pain!
Here's an O(n) space-complexity solution in Python:
def generate_pascal_nth_row(n):
    result = [1] * n
    for i in range(n):
        previous_res = result.copy()
        for j in range(1, i):
            result[j] = previous_res[j - 1] + previous_res[j]
    return result
print(generate_pascal_nth_row(6))
#include <vector>
using std::vector;

class Solution {
public:
    int comb(int n, int r) {
        long long c = 1;
        for (int i = 1; i <= r; i++) { // computes C(n, r) = n! / (r! (n-r)!) incrementally
            c = (c * n) / i;
            n--;
        }
        return c;
    }
    vector<int> getRow(int n) {
        vector<int> v;
        for (int i = 0; i <= n; ++i) // a row has n + 1 entries, C(n, 0) .. C(n, n)
            v.push_back(comb(n, i));
        return v;
    }
};
Faster than 100% of submissions on LeetCode: https://leetcode.com/submissions/detail/406399031/
The most efficient way to calculate a row in Pascal's triangle is through convolution. First we choose the second row, (1, 1), to be the kernel; then, to get the next row, we only need to convolve the current row with the kernel.
So convolution of the kernel with the second row gives the third row, [1 1] * [1 1] = [1 2 1]; convolution with the third row gives the fourth, [1 2 1] * [1 1] = [1 3 3 1]; and so on.
This is a function in julia-lang (very similar to MATLAB):
function binomRow(n::Int64)
    # note: in recent Julia versions conv is provided by the DSP package
    baseVector = [1] # the first row is equal to 1
    kernel = [1, 1]  # this is the second row and the kernel
    row = zeros(n)
    for i = 1 : n
        row = baseVector
        baseVector = conv(baseVector, kernel) # convolution with the kernel
    end
    return row::Array{Int64,1}
end
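For comparison, a sketch of the same idea in Python, with numpy.convolve standing in for conv (an illustration; 64-bit integer arithmetic limits it to roughly n < 60 before overflow):
import numpy as np

def binom_row(n):
    row = np.array([1])
    for _ in range(n):
        row = np.convolve(row, [1, 1])  # each convolution advances one row
    return row

print(binom_row(4))  # [1 4 6 4 1]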
To find the nth row:
int res[] = new int[n + 1];
res[0] = 1;
for (int i = 1; i <= n; i++)
    for (int j = i; j > 0; j--) // walk right-to-left so earlier values are not clobbered
        res[j] += res[j - 1];
