RX LINQ partition the input stream

I have an input stream whose elements consist of Date, Depth and Area.
I want to plot the Area against the Depth, and therefore want to take out a window of Depth, e.g. between 1.0 and 100.0 m.
The problem is that I want to downsample the input stream, since there can be many inputs with close Depth values.
I want to partition the input into x bins; e.g. with 2 bins, all depth values between 1 and 50 are averaged in the first bin and those between 51 and 100 in the second bin.
I was thinking of something like this:
var q = from e in input
where (e.Depth > 1) && (e.Depth <= 100)
// here I need some way of partition the sequence into bins
// and averaging the elements.
The question "Split a collection into `n` parts with LINQ?" asks for something similar without Rx.

Modified answer as per your comment. steps = number of buckets.
var min = 1, max = 100;
var steps = 10;
var f = (max - min + 1) / steps; // The extra 1 is really an epsilon. #hack
var q = from e in input
where e.Depth > 1 && e.Depth <= 100
let x = e.Depth - min
group e by x < max ? (x - (x % f)) : ;
This is the function we're grouping by for the given e.Depth.
This probably won't work well with floating-point values (due to precision) unless you floor/ceil the selection, but then you might run out of integers, so you may need to scale a bit, for example: group e by Math.Floor((x - (x % f)) * scaleFactor).

This should do what you want
static int GetBucket(double value, double min, double max, int bucketCount)
{
return (int)((value - min) / (max - min) * bucketCount + 0.5);
}
var grouped = input.GroupBy(e => GetBucket(e.Depth, 1, 100, 50));
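As a rough illustration of the binning-and-averaging idea, here is a minimal batch sketch in Python rather than Rx. The names bin_index and average_by_bin are hypothetical, samples are assumed to be (depth, area) pairs, and the index is truncated (not rounded as in GetBucket above) so that exactly bin_count equal-width bins are produced.

```python
from collections import defaultdict


def bin_index(depth, lo, hi, bin_count):
    """Return the 0-based index of the equal-width bin that contains depth."""
    idx = int((depth - lo) / (hi - lo) * bin_count)
    return min(idx, bin_count - 1)  # depth == hi falls into the last bin


def average_by_bin(samples, lo, hi, bin_count):
    """samples: iterable of (depth, area). Returns {bin: (mean_depth, mean_area)}."""
    groups = defaultdict(list)
    for depth, area in samples:
        if lo <= depth <= hi:  # keep only the requested depth window
            groups[bin_index(depth, lo, hi, bin_count)].append((depth, area))
    return {
        b: (sum(d for d, _ in pts) / len(pts), sum(a for _, a in pts) / len(pts))
        for b, pts in sorted(groups.items())
    }
```

For a live stream the same bin_index function could serve as the grouping key, with the averaging done per group.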

Related

Scoring two sequences of ordered numbers for their similarity to one-another

How would I go about scoring two sequences of numbers such that
5, 8, 28, 31 (differences of 3, 20 and 3)
6, 9, 26, 29 (differences of 3, 17 and 3)
are considered similar "enough" but a sequence of
8, 11, 31, 34 (differences of 3, 20 and 3, errors of 3, 3, 3, 3)
is too dissimilar to allow?
The second set of numbers has absolute errors of
1, 1, 2, 2, and that is low "enough" to accept.
If that error was too high I'd like to be able to reject it.
To give a little background, these are indicators of time and when events arrived to a computer. The first sequence is the expected time of arrival and the second sequence is the actual times they arrived. Knowing that the sequence is at least in the correct order I need to be able to score the similarity to the expectation and accept or reject it by tweaking some sort of value.
If it were standard deviation for a set of numbers where order didn't matter I could just reject the second set based on its own standard deviation.
Since this is not the case I had the idea of measuring deviance and position error.
Position error shouldn't exceed 3, though this number should not be integer - it needs to be decimal as the numbers are more realistically floating point, or at least accurate to 6 decimal places.
It also needs to work equally well, or perhaps offer a variant in which a much longer series of numbers can be scored fairly.
In the longer series of numbers it is not likely that the position error will exceed 3, so the position error would still be fairly low.
This is a partial solution I have found using a series of Pearson's correlation coefficients, one for each position at which x fits into y. It uses the form of the equation that works off expected values. The comments describe it fairly well.
function getPearsonsCorrelation(x, y)
{
/**
* Pearsons can be calculated in an alternative fashion as
* p(x, y) = (E(xy) - E(x)*E(y))/sqrt[(E(x^2)-(E(x))^2)*(E(y^2)-(E(y))^2)]
* where p(x, y) is the Pearson's correlation result, E is a function referring to the expected value
* E(x) = var expectedValue = 0; for(var i = 0; i < x.length; i ++){ expectedValue += x[i]*p[i] }
* where p[i] is the probability of that value occurring; here every value is equally likely, so p[i] = 1/n
* hence E(x) = (sum of all x values)/n; the code below works with plain sums and multiplies through by n instead
* sqrt is the square root of the result in square brackets
* ^2 means to the power of two, or rather just square that value
**/
var maxdelay = y.length - x.length; // we will calculate Pearson's correlation coefficient at every location x fits into y
var xl = x.length
var results = [];
for(var d = 0; d <= maxdelay; d++){
var xy = [];
var x2 = [];
var y2 = [];
var _y = y.slice(d, d + x.length); // take just the segment of y at delay
for(var i = 0; i < xl; i ++){
xy.push(x[i] * _y[i]); // x*y array
x2.push(x[i] * x[i]); // x squareds array
y2.push(_y[i] * _y[i]); // y squareds array
}
var sum_x = 0;
var sum_y = 0;
var sum_xy = 0;
var sum_x2 = 0;
var sum_y2 = 0;
for(var i = 0; i < xl; i ++){
sum_x += x[i]; // sum of x, i.e. n * E(x)
sum_y += _y[i]; // sum of y, i.e. n * E(y)
sum_xy += xy[i]; // sum of x*y, i.e. n * E(xy)
sum_x2 += x2[i]; // sum of x squared, i.e. n * E(x^2)
sum_y2 += y2[i]; // sum of y squared, i.e. n * E(y^2)
}
var numerator = xl * sum_xy - sum_x * sum_y; // n^2 * (E(xy) - E(x)*E(y))
var denomLeftSide = xl * sum_x2 - sum_x * sum_x; // n^2 * (E(x^2) - (E(x))^2)
var denomRightSide = xl * sum_y2 - sum_y * sum_y; // n^2 * (E(y^2) - (E(y))^2)
var denom = Math.sqrt(denomLeftSide * denomRightSide); // the n^2 factors cancel against the numerator
var pearsonsCorrelation = numerator / denom;
results.push(pearsonsCorrelation);
}
return results;
}
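Correlation alone does not capture the per-event error described in the question, so a direct check on the absolute errors may be a useful complement. The following is only a sketch under stated assumptions: the function names and the tolerance parameter are hypothetical, both sequences are assumed to have equal length, and the value 2.5 in the usage note is an illustrative threshold, not one taken from the question.

```python
def absolute_errors(expected, actual):
    """Element-wise absolute difference between two equally long sequences."""
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")
    return [abs(a - e) for e, a in zip(expected, actual)]


def is_similar(expected, actual, tolerance):
    """Accept the sequence if no single arrival deviates by more than tolerance."""
    return max(absolute_errors(expected, actual), default=0.0) <= tolerance
```

With expected = [5, 8, 28, 31], the sequence [6, 9, 26, 29] has errors 1, 1, 2, 2 and is accepted at a tolerance of 2.5, while [8, 11, 31, 34] has errors of 3 throughout and is rejected. The tolerance can be any floating-point value.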

Can I easily skip pixels in Bresenham's line algorithm?

I have a program which is using Bresenham's line algorithm to scan pixels in a line. This is reading pixels rather than writing them, and in my particular case, reading them is costly.
I can however determine that some spans of pixels do not need to be read. It looks something like this:
Normal scan of all pixels:
*start
\
\
\
\
\
*end
Scan without reading all pixels:
*start
\
\
- At this point I know I can skip (for example) the next 100 pixels
in the loop. Crucially, I can't know this until I reach the gap.
\
*end
The gap in the middle is much quicker because I can just iterate over the pixels without reading them.
However, can I modify the loop in any way to just jump directly forward 100 pixels within the loop, calculating directly the required values 100 steps ahead in the line algorithm?
Bresenham's midpoint algorithm calculates the 'distance' of a point from a theoretical line going from (ax,ay) to (bx,by) by summing up the digital differences delta_x = (by-ay), delta_y = (ax-bx).
Thus, if one wants to skip 7 pixels, one has to add accum += 7*delta_x; then, dividing by delta_y, one can check how many pixels should have been moved in the y-direction, and by taking the remainder accum = accum % delta_y one should be able to continue at the proper position.
The nice thing is that the algorithm originated from the necessity of avoiding a division...
Disclaimer: the above may need to be adjusted by half a delta.
Your main loop looks essentially something like:
while (cnt > 0) // cnt is 1 + the biggest of abs(x2-x1) and abs(y2-y1)
{
ReadOrWritePixel(x, y);
k += n; // n is the smallest of abs(x2-x1) and abs(y2-y1)
if (k < m) // m is the biggest of abs(x2-x1) and abs(y2-y1)
{
// continuing a horizontal/vertical segment
x += dx2; // dx2 = sgn(x2-x1) or 0
y += dy2; // dy2 = sgn(y2-y1) or 0
}
else
{
// beginning a new horizontal/vertical segment
k -= m;
x += dx1; // dx1 = sgn(x2-x1)
y += dy1; // dy1 = sgn(y2-y1)
}
cnt--;
}
So, skipping some q pixels is equivalent to the following adjustments (unless I made a mistake somewhere):
cntnew = cntold - q
knew = (kold + n * q) % m
xnew = xold + ((kold + n * q) / m) * dx1 + (q - ((kold + n * q) / m)) * dx2
ynew = yold + ((kold + n * q) / m) * dy1 + (q - ((kold + n * q) / m)) * dy2
Note that / and % are integer division and modulo operators.
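The adjustment formulas above can be checked against the step-by-step loop with a small Python sketch. The variable names follow the loop above; the helper names step and skip are hypothetical, and all quantities are assumed to be non-negative integers with m > 0 and k < m, so that Python's // and % agree with C-style integer division and modulo.

```python
def step(x, y, k, n, m, dx1, dy1, dx2, dy2):
    """Advance the line by one pixel, exactly as one iteration of the main loop."""
    k += n
    if k < m:
        return x + dx2, y + dy2, k       # continuing a horizontal/vertical segment
    return x + dx1, y + dy1, k - m       # beginning a new segment


def skip(x, y, k, q, n, m, dx1, dy1, dx2, dy2):
    """Jump q pixels ahead in one go using the closed-form adjustments."""
    total = k + n * q
    diagonal = total // m                # number of steps that took the else branch
    straight = q - diagonal              # number of steps that took the if branch
    return (x + diagonal * dx1 + straight * dx2,
            y + diagonal * dy1 + straight * dy2,
            total % m)
```

The caller still has to subtract q from cnt, as in the first adjustment line above.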

Number distribution

Problem: We have x checkboxes and we want to check y of them evenly.
Example 1: select 50 checkboxes of 100 total.
[-]
[x]
[-]
[x]
...
Example 2: select 33 checkboxes of 100 total.
[-]
[-]
[x]
[-]
[-]
[x]
...
Example 3: select 66 checkboxes of 100 total:
[-]
[x]
[x]
[-]
[x]
[x]
...
But we're having trouble coming up with a formula to check them in code, especially once you go to 11/111 or something similar. Does anyone have an idea?
Let's first assume y is divisible by x. Then we denote p = y/x and the solution is simple. Go through the list, every p elements, mark 1 of them.
Now, let's say r = y%x is non-zero. Still p = y/x, where / is integer division. So, you need to:
In the first p-r elements, mark 1 element
In the last r elements, mark 2 elements
Note: This depends on how you define evenly distributed. You might want to spread the r sections with x+1 elements in between p-r sections with x elements, which indeed is again the same problem and could be solved recursively.
Alright so it wasn't actually correct. I think this would do though:
Regardless of divisibility:
if y > 2*x, then mark 1 element every p = y/x elements, x times.
if y < 2*x, then mark all, and do the previous step unmarking y-x out of y checkboxes (so like in the previous case, but x is replaced by y-x)
Note: This depends on how you define evenly distributed. You might want to change between p and p+1 elements for example to distribute them better.
Here's a straightforward solution using integer arithmetic:
void check(char boxes[], int total_count, int check_count)
{
int i;
for (i = 0; i < total_count; i++)
boxes[i] = '-';
for (i = 0; i < check_count; i++)
boxes[i * total_count / check_count] = 'x';
}
total_count is the total number of boxes, and check_count is the number of boxes to check.
First, it sets every box to unchecked. Then, it checks check_count boxes, scaling the counter to the number of boxes.
Caveat: this is left-biased rather than right-biased like in your examples. That is, it prints x--x-- rather than --x--x. You can turn it around by replacing
boxes[i * total_count / check_count] = 'x';
with:
boxes[total_count - (i * total_count / check_count) - 1] = 'x';
Correctness
Assuming 0 <= check_count <= total_count, and that boxes has space for at least total_count items, we can prove that:
No check marks will overlap. i * total_count / check_count increments by at least one on every iteration, because total_count >= check_count.
This will not overflow the buffer. The subscript i * total_count / check_count
Will be >= 0. i, total_count, and check_count will all be >= 0.
Will be < total_count. When n > 0 and d > 0:
(n * d - 1) / d < n
In other words, if we take n * d / d, and nudge the numerator down, the quotient will go down, too.
Therefore, (check_count - 1) * total_count / check_count will be less than total_count, with the assumptions made above. A division by zero won't happen because if check_count is 0, the loop in question will have zero iterations.
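For experimenting with the integer-arithmetic approach above, here is a small Python port. The function name check_evenly is hypothetical, and it returns a string of 'x' and '-' instead of filling a caller-supplied buffer; it assumes 0 <= check_count <= total_count, as in the proof above.

```python
def check_evenly(total_count, check_count):
    """Return a string with check_count 'x' marks spread over total_count boxes."""
    boxes = ['-'] * total_count
    for i in range(check_count):
        # Scale the counter to the number of boxes (integer division).
        boxes[i * total_count // check_count] = 'x'
    return ''.join(boxes)
```

For example, check_evenly(6, 2) gives x--x--, which shows the left bias mentioned in the caveat.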
Say number of checkboxes is C and the number of Xes is N.
Your example states that having C=111 and N=11 is your most troublesome case.
Try this: divide C/N. Call it D. Have index in the array as double number I. Have another variable as counter, M.
double D = (double)C / (double)N;
double I = 0.0;
int M = N;
while (M > 0) {
if (checkboxes[Round(I)].Checked) { // if we selected it, skip to next
I += 1.0;
continue;
}
checkboxes[Round(I)].Checked = true;
M --;
I += D;
if (Round(I) >= C) { // wrap around the end
I -= C;
}
}
Please note that Round(x) should return nearest integer value for x.
This one could work for you.
I think the key is to keep count of how many boxes you expect to have per check.
Say you want 33 checks in 100 boxes. 100 / 33 = 3.030303..., so you expect to have one check every 3.030303... boxes. That means every 3.030303... boxes, you need to add a check. 66 checks in 100 boxes would mean one check every 1.51515... boxes, 11 checks in 111 boxes would mean one check every 10.090909... boxes, and so on.
double count = 0;
double boxesPerCheck = (double)boxes / checks; // e.g. 100 / 33 = 3.0303...
for (int i = 0; i < boxes; i++) {
    count += 1;
    if (count >= boxesPerCheck) {
        checkboxes[i] = true;
        count -= boxesPerCheck; // subtract one check's worth of boxes but keep the remainder to track "partial boxes" so far
    }
}
You might rather use decimal as opposed to double for count, or there's a slight chance the last box will get skipped due to rounding errors.
A Bresenham-like algorithm is suitable for distributing checkboxes evenly. An output of 'x' corresponds to a Y-coordinate change. The initial err can be chosen as a random value in the range [0..places) to avoid bias.
def Distribute(places, stars):
    err = places // 2
    res = ''
    for i in range(0, places):
        err = err - stars
        if err < 0:
            res = res + 'x'
            err = err + places
        else:
            res = res + '-'
    print(res)

Distribute(24,17)
Distribute(24,12)
Distribute(24,5)
output:
x-xxx-xx-xx-xxx-xx-xxx-x
-x-x-x-x-x-x-x-x-x-x-x-x
--x----x----x---x----x--
Quick html/javascript solution:
<html>
<body>
<div id='container'></div>
<script>
var cbCount = 111;
var cbCheckCount = 11;
var cbRatio = cbCount / cbCheckCount;
var buildCheckCount = 0;
var c = document.getElementById('container');
for (var i=1; i <= cbCount; i++) {
// make a checkbox
var cb = document.createElement('input');
cb.type = 'checkbox';
test = i / cbRatio - buildCheckCount;
if (test >= 1) {
// check the checkbox we just made
cb.checked = 'checked';
buildCheckCount++;
}
c.appendChild(cb);
c.appendChild(document.createElement('br'));
}
</script>
</body></html>
Adapt code from one question's answer or another answer from earlier this month. Set N = x = number of checkboxes and M = y = number to be checked and apply formula (N*i+N)/M - (N*i)/M for section sizes. (Also see Joey Adams' answer.)
In Python 2, the adapted code is:
N=100; M=33; p=0;
for i in range(M):
    k = (N+N*i)/M
    for j in range(p,k-1): print "-",
    print "x",
    p=k
which produces
- - x - - x - - x - - x - - [...] x - - x - - - x where [...] represents 25 --x repetitions.
With M=66 the code gives
x - x x - x x - x x - x x - [...] x x - x x - x - x where [...] represents mostly xx- repetitions, with one x- in the middle.
Note, in C or Java: substitute for (i=0; i<M; ++i) in place of for i in range(M):, and for (j=p; j<k-1; ++j) in place of for j in range(p,k-1):.
Correctness: Note that exactly M boxes get checked, because print "x", is executed M times.
What about using a Fisher–Yates shuffle?
Make an array, shuffle it and pick the first n elements. You do not need to shuffle all of them, just the first n of the array. Shuffling can be found in most language libraries.
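A partial Fisher–Yates shuffle along those lines might look like the sketch below. The function name and the optional rng parameter are hypothetical. Note that this picks boxes at random, so the checks are uniformly distributed on average but not evenly spaced as in the examples in the question.

```python
import random


def pick_random_indices(total_count, check_count, rng=None):
    """Return check_count distinct box indices chosen uniformly at random."""
    rng = rng or random.Random()
    indices = list(range(total_count))
    for i in range(check_count):
        # Swap position i with a random position from the not-yet-fixed tail.
        j = rng.randrange(i, total_count)
        indices[i], indices[j] = indices[j], indices[i]
    return indices[:check_count]  # only the first check_count slots were shuffled
```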

Custom paging algorithm to calculate pages to display

I'm working on a custom data pager for a custom google maps control. The control needs to work out what range of pages to display. For example, if the user is on page 6 then the control must display pages 1 through to 10. If the user is on page 37, then the control must display pages 30 through to 40.
The variables I have available are:
X - Total results (points on the map)
Y - The current page size. i.e. the amount of points per page.
Z - The current page being displayed
Q - The number of page numbers to display (a const of 10)
I have come up with:
Starting Index = Z - (Z % Q)
Ending Index = Z - (Z % Q) + Q
This, however, doesn't work for when the current page is less than 10. It also doesn't figure out whether there is a max page reached, i.e. we always display a full range of 10. However, if we display the range 30-40 the final page could actually be 38.
If anyone can come up with a more elegant algorithm it would be appreciated.
It might be easier if you think in terms of chapters.
Say each set of pages is a chapter, and chapters are numbered starting from 0: 0, 1, 2, ...
Then the rth chapter has pages in the range
Qr + 1 <= page <= Q(r+1)
Now consider floor(page/Q). This is r if page is not a multiple of Q, otherwise it is r+1.
Given an r, you can find out the pages of the chapter as Lower = Qr + 1 and higher = min(max, Q(r+1)).
So you can do this.
if (Z < 1 || Z > max_page) { error;}
if (Z % Q == 0) {
r = Z/Q - 1; // integer division, gives floor.
}
else {
r = Z/Q; // floor.
}
Begin = Q*r + 1;
End = Min (Q*(r+1), max_page);
To get rid of the if, you can now replace it with
if (Z < 1 || Z > max_page) { error;}
r = (Z-1)/Q;
Begin = Q*r + 1;
End = Min (Q*(r+1), max_page);
This works because:
Qr + 1 <= Z <= Q(r+1) if and only if
Qr <= Z-1 <= Qr + (Q-1).
Thus floor((Z-1)/Q) = r.
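The branch-free version can be tried out with a short Python sketch. The function name page_range and its parameter names are hypothetical stand-ins for Z, Q and max_page above; pages are assumed to be numbered from 1.

```python
def page_range(current_page, pages_shown, max_page):
    """Return (begin, end) of the block of page numbers containing current_page."""
    if not 1 <= current_page <= max_page:
        raise ValueError("current page out of range")
    chapter = (current_page - 1) // pages_shown       # r = floor((Z-1)/Q)
    begin = pages_shown * chapter + 1                 # Q*r + 1
    end = min(pages_shown * (chapter + 1), max_page)  # clamp to the last real page
    return begin, end
```

With 38 pages in total and 10 page numbers shown, page 6 and page 10 both give (1, 10), and page 37 gives (31, 38).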
Here we go:
from math import ceil

# X, Y and Q are the variables from the question.
def lower(Z):
    return (Z - 1) // Q * Q + 1

def upper(Z):
    return min(int(ceil(X / Y)), ((Z - 1) // Q + 1) * Q)
// is integer division.

Tickmark algorithm for a graph axis

I'm looking for an algorithm that places tick marks on an axis, given a range to display, a width to display it in, and a function to measure a string width for a tick mark.
For example, given that I need to display between 1e-6 and 5e-6 and a width to display in pixels, the algorithm would determine that I should put tickmarks (for example) at 1e-6, 2e-6, 3e-6, 4e-6, and 5e-6. Given a smaller width, it might decide that the optimal placement is only at the even positions, i.e. 2e-6 and 4e-6 (since putting more tickmarks would cause them to overlap).
A smart algorithm would give preference to tickmarks at multiples of 10, 5, and 2. Also, a smart algorithm would be symmetric around zero.
As I didn't like any of the solutions I've found so far, I implemented my own. It's in C# but it can be easily translated into any other language.
It basically chooses from a list of possible steps the smallest one that displays all values, without leaving any value exactly in the edge, lets you easily select which possible steps you want to use (without having to edit ugly if-else if blocks), and supports any range of values. I used a C# Tuple to return three values just for a quick and simple demonstration.
private static Tuple<decimal, decimal, decimal> GetScaleDetails(decimal min, decimal max)
{
// Small margin so that round extreme values do not sit exactly on the edge of the chart
decimal epsilon = (max - min) / 1e6m;
max += epsilon;
min -= epsilon;
decimal range = max - min;
// Target number of values to be displayed on the Y axis (it may be less)
int stepCount = 20;
// First approximation
decimal roughStep = range / (stepCount - 1);
// Set best step for the range
decimal[] goodNormalizedSteps = { 1, 1.5m, 2, 2.5m, 5, 7.5m, 10 }; // keep the 10 at the end
// Or use these if you prefer: { 1, 2, 5, 10 };
// Normalize rough step to find the normalized one that fits best
decimal stepPower = (decimal)Math.Pow(10, -Math.Floor(Math.Log10((double)Math.Abs(roughStep))));
var normalizedStep = roughStep * stepPower;
var goodNormalizedStep = goodNormalizedSteps.First(n => n >= normalizedStep);
decimal step = goodNormalizedStep / stepPower;
// Determine the scale limits based on the chosen step.
decimal scaleMax = Math.Ceiling(max / step) * step;
decimal scaleMin = Math.Floor(min / step) * step;
return new Tuple<decimal, decimal, decimal>(scaleMin, scaleMax, step);
}
static void Main()
{
// Dummy code to show a usage example.
var minimumValue = data.Min();
var maximumValue = data.Max();
var results = GetScaleDetails(minimumValue, maximumValue);
chart.YAxis.MinValue = results.Item1;
chart.YAxis.MaxValue = results.Item2;
chart.YAxis.Step = results.Item3;
}
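The core of the step selection above (normalize a rough step, then pick the first "good" step that is large enough) can be sketched in Python as follows. This is not a full port: the function name nice_scale is hypothetical, floats are used instead of decimal, the epsilon margin is omitted, and it assumes hi > lo and step_count > 1.

```python
import math


def nice_scale(lo, hi, step_count=20, nice_steps=(1, 2, 5, 10)):
    """Return (scale_min, scale_max, step) with step taken from nice_steps * 10^k."""
    rough_step = (hi - lo) / (step_count - 1)             # first approximation
    power = 10.0 ** math.floor(math.log10(rough_step))    # decade of the rough step
    normalized = rough_step / power                       # expected to fall in [1, 10)
    nice = next(s for s in nice_steps if s >= normalized) # smallest step that fits
    step = nice * power
    return math.floor(lo / step) * step, math.ceil(hi / step) * step, step
```

For instance, a range of 3 to 97 with step_count=11 yields a step of 10 and limits of 0 and 100. As with the C# version, keeping 10 at the end of the list is what ensures a candidate is normally found.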
Take the longest of the segments about zero (or the whole graph, if zero is not in the range) - for example, if you have something on the range [-5, 1], take [-5,0].
Figure out approximately how long this segment will be, in ticks. This is just dividing the length by the width of a tick. So suppose the method says that we can put 11 ticks in from -5 to 0. This is our upper bound. For the shorter side, we'll just mirror the result on the longer side.
Now try to put in as many ticks as possible (up to 11), such that the marker for each tick is of the form i*10*10^n, i*5*10^n, or i*2*10^n, where n is an integer and i is the index of the tick. Now it's an optimization problem - we want to maximize the number of ticks we can put in, while at the same time minimizing the distance between the last tick and the end of the result. So assign a score for getting as many ticks as we can, less than our upper bound, and assign a score to getting the last tick close to n - you'll have to experiment here.
In the above example, try n = 1. We get 1 tick (at i=0). n = 2 gives us 1 tick, and we're further from the lower bound, so we know that we have to go the other way. n = 0 gives us 6 ticks, one at each integer point. n = -1 gives us 12 ticks (0, -0.5, ..., -5.0). n = -2 gives us 24 ticks, and so on. The scoring algorithm will give them each a score - higher means a better method.
Do this again for the i * 5 * 10^n, and i*2*10^n, and take the one with the best score.
(as an example scoring algorithm, say that the score is the distance to the last tick times the maximum number of ticks minus the number needed. This will likely be bad, but it'll serve as a decent starting point).
Funnily enough, just over a week ago I came here looking for an answer to the same question, but went away again and decided to come up with my own algorithm. I am here to share, in case it is of any use.
I wrote the code in Python to try and bust out a solution as quickly as possible, but it can easily be ported to any other language.
The function below calculates the appropriate interval (which I have allowed to be either 10**n, 2*10**n, 4*10**n or 5*10**n) for a given range of data, and then calculates the locations at which to place the ticks (based on which numbers within the range are divisible by the interval). I have not used the modulo % operator, since it does not work properly with floating-point numbers due to floating-point arithmetic rounding errors.
Code:
import math

def get_tick_positions(data: list):
    if len(data) == 0:
        return []
    retpoints = []
    data_range = max(data) - min(data)
    lower_bound = min(data) - data_range/10
    upper_bound = max(data) + data_range/10
    view_range = upper_bound - lower_bound
    num = lower_bound
    n = math.floor(math.log10(view_range) - 1)
    interval = 10**n
    num_ticks = 1
    while num <= upper_bound:
        num += interval
        num_ticks += 1
        if num_ticks > 10:
            if interval == 10 ** n:
                interval = 2 * 10 ** n
            elif interval == 2 * 10 ** n:
                interval = 4 * 10 ** n
            elif interval == 4 * 10 ** n:
                interval = 5 * 10 ** n
            else:
                n += 1
                interval = 10 ** n
            num = lower_bound
            num_ticks = 1
    if view_range >= 10:
        copy_interval = interval
    else:
        if interval == 10 ** n:
            copy_interval = 1
        elif interval == 2 * 10 ** n:
            copy_interval = 2
        elif interval == 4 * 10 ** n:
            copy_interval = 4
        else:
            copy_interval = 5
    first_val = 0
    prev_val = 0
    times = 0
    temp_log = math.log10(interval)
    if math.isclose(lower_bound, 0):
        first_val = 0
    elif lower_bound < 0:
        if upper_bound < -2*interval:
            if n < 0:
                copy_ub = round(upper_bound*10**(abs(temp_log) + 1))
                times = copy_ub // round(interval*10**(abs(temp_log) + 1)) + 2
            else:
                times = upper_bound // round(interval) + 2
        while first_val >= lower_bound:
            prev_val = first_val
            first_val = times * copy_interval
            if n < 0:
                first_val *= (10**n)
            times -= 1
        first_val = prev_val
        times += 3
    else:
        if lower_bound > 2*interval:
            if n < 0:
                copy_ub = round(lower_bound*10**(abs(temp_log) + 1))
                times = copy_ub // round(interval*10**(abs(temp_log) + 1)) - 2
            else:
                times = lower_bound // round(interval) - 2
        while first_val < lower_bound:
            first_val = times*copy_interval
            if n < 0:
                first_val *= (10**n)
            times += 1
    if n < 0:
        retpoints.append(first_val)
    else:
        retpoints.append(round(first_val))
    val = first_val
    times = 1
    while val <= upper_bound:
        val = first_val + times * interval
        if n < 0:
            retpoints.append(val)
        else:
            retpoints.append(round(val))
        times += 1
    retpoints.pop()
    return retpoints
When passing in the following three data-points to the function
points = [-0.00493, -0.0003892, -0.00003292]
... the output I get (as a list) is as follows:
[-0.005, -0.004, -0.003, -0.002, -0.001, 0.0]
When passing this:
points = [1.399, 38.23823, 8309.33, 112990.12]
... I get:
[0, 20000, 40000, 60000, 80000, 100000, 120000]
When passing this:
points = [-54, -32, -19, -17, -13, -11, -8, -4, 12, 15, 68]
... I get:
[-60, -40, -20, 0, 20, 40, 60, 80]
... which all seem to be a decent choice of positions for placing ticks.
The function is written to allow 5-10 ticks, but that could easily be changed if you so please.
Whether the list of data supplied contains ordered or unordered data it does not matter, since it is only the minimum and maximum data points within the list that matter.
This simple algorithm yields an interval that is 1, 2, or 5 times a power of 10, and the axis range gets divided into at least 5 intervals. The code sample is in Java:
protected double calculateInterval(double range) {
double x = Math.pow(10.0, Math.floor(Math.log10(range)));
if (range / x >= 5)
return x;
else if (range / (x / 2.0) >= 5)
return x / 2.0;
else
return x / 5.0;
}
This is an alternative, for minimum 10 intervals:
protected double calculateInterval(double range) {
double x = Math.pow(10.0, Math.floor(Math.log10(range)));
if (range / (x / 2.0) >= 10)
return x / 2.0;
else if (range / (x / 5.0) >= 10)
return x / 5.0;
else
return x / 10.0;
}
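To see the first variant (at least 5 intervals) in action, here is a Python port together with a hypothetical helper, tick_positions, that places ticks at multiples of the chosen interval. The helper is an assumption of this sketch and not part of the original answer; both functions assume a positive range.

```python
import math


def calculate_interval(value_range):
    """Port of the 5-interval variant: returns x, x/2 or x/5 for a power of ten x."""
    x = math.pow(10.0, math.floor(math.log10(value_range)))
    if value_range / x >= 5:
        return x
    if value_range / (x / 2.0) >= 5:
        return x / 2.0
    return x / 5.0


def tick_positions(lo, hi):
    """Ticks at every multiple of the interval that lies within [lo, hi]."""
    step = calculate_interval(hi - lo)
    first = math.ceil(lo / step)
    last = math.floor(hi / step)
    return [k * step for k in range(first, last + 1)]
```

Computing each tick as k * step, rather than adding the step repeatedly, avoids accumulating floating-point error along the axis.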
I've been using the jQuery flot graph library. It's open source and does axis/tick generation quite well. I'd suggest looking at its code and pinching some ideas from there.

Resources