How to automatically sort the x values in gnuplot

When I try to plot something with linespoints, unless the values that go to the x axis are already sorted within the file, each point gets connected to the point that is on the next line of the file:
Only when I sort the values in the file do I get the desired effect, which is that each point gets connected to the points with the next smaller and next larger x values:
Is there a way to do this within gnuplot, without having to sort the files in bash?

Gnuplot offers some smoothing filters which, as a first step, sort the data by their x-values. plot ... smooth unique first sorts the data points by their x-value and, for equal x-values, computes the average y-value. So if you are sure that the x-values are unique, you can use this option. Otherwise you must use an external tool or script to do the sorting, e.g. plot '< sort file.dat'.
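For example, a minimal script combining the two options above (assuming a two-column data file named file.dat):

# let gnuplot sort the points (and average duplicate x-values) itself
plot 'file.dat' smooth unique with linespoints

# or delegate the sorting to an external command; add sort's -n or -g flag
# if the x-values need numeric rather than lexicographic ordering
plot '< sort file.dat' with linespoints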

Related

How to Fix False Negative

Background:
I am making a program that detects the grid of a maze (like this). The way I do this is by getting the average color of each row/column and graphing it to locate the general grid lines (like this). With these grid lines I can group each row/column that is under the color threshold and map a line on the maze.
Problem:
What I am running into is a problem with certain mazes where there are no vertical lines. This will cause my algorithm to not detect a line and create errors as shown below.
Question:
What methods would you suggest for a fix to this problem?
Note: I was thinking something like pattern detection to fill in the missing data?
If your input maze is guaranteed to be based on a grid, like the images you show, then I would suggest a more deterministic approach.
It is probably sufficient to find one wall on each column. So instead of averaging all pixels in a column (which loses a lot of useful information), you can measure, for example, the longest consecutive run of black pixels. If this is much longer than the width of a wall, then you know it is the length of a wall and thus that the column lies on a grid line.
When you have done this for all columns, you get a discrete graph instead and you can choose a value somewhere in the middle of each peak for the actual column line.
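A minimal sketch of that measurement (assuming the maze is already a binary NumPy array img with walls as black/0 pixels; the threshold factor is an assumption):

import numpy as np

def grid_columns(img, wall_width):
    """Return the x-indices whose longest vertical run of black pixels
    is much longer than a single wall's width."""
    columns = []
    for x in range(img.shape[1]):
        col = img[:, x] == 0          # True where the pixel is black
        longest = run = 0             # longest consecutive run of black pixels
        for black in col:
            run = run + 1 if black else 0
            longest = max(longest, run)
        if longest > 3 * wall_width:  # "much longer than the wall width"
            columns.append(x)
    return columns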
Some grid lines might not have vertical walls at all though, but you can easily interpolate these when you have found at least 3 grid lines.
Another approach would be performing some signal processing and find the period of your function, but I think simple interpolation would be easier to implement and understand.
Edit: The interpolation can be done in different ways. In the easiest case, you assume that at least one column has a "neighbour", i.e., two detected columns that are adjacent in the grid, and that you detect the first and last column.
In this case, all you need to do is find the smallest distance between neighbours to find the grid cell width. You can also compare it with the cell height and choose whichever is smaller. Then, apply this width between the first and last columns to get all the columns.
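A rough sketch of that interpolation (assuming detected_cols is the sorted list of detected column positions and that the first and last grid lines were detected; comparing against the cell height is omitted):

def interpolate_columns(detected_cols):
    # smallest distance between detected neighbours ~ grid cell width
    gaps = [b - a for a, b in zip(detected_cols, detected_cols[1:])]
    width = min(gaps)
    first, last = detected_cols[0], detected_cols[-1]
    n_cells = round((last - first) / width)
    # lay that width out evenly between the first and last columns
    return [round(first + i * (last - first) / n_cells) for i in range(n_cells + 1)]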
Another approach, if you can't make this assumption, is to repeatedly propagate every column you detect with the same period throughout the grid, counting from the front and from the back, like so:
|_ _ _ _|_ _ _ _ _ _| => |_ _ _ _|_ _ _ _|_ _| => |_ _|_ _|_ _|_ _|_ _|
and repeating until no more edits are being made.
You could try implementing a Hough transform extraction. Its purpose is to detect imperfect instances of objects within a certain class of shapes, and with a little bit of tweaking you can make it extract/detect your maze grid.
A Hough transform extraction can group edge points into object candidates.
Here's the Wikipedia article, which gives in-depth explanations on how it works: https://en.wikipedia.org/wiki/Hough_transform#Detecting_lines
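A minimal OpenCV sketch of the idea (the file name and thresholds are placeholders, not values from the question):

import cv2
import numpy as np

img = cv2.imread("maze.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
edges = cv2.Canny(img, 50, 150)
# rho resolution 1 px, theta resolution 1 degree, accumulator threshold 150
lines = cv2.HoughLines(edges, 1, np.pi / 180, 150)
if lines is not None:
    for rho, theta in lines[:, 0]:
        # keep only near-vertical and near-horizontal lines as grid candidates
        if min(theta, abs(theta - np.pi / 2), abs(theta - np.pi)) < np.radians(5):
            print(rho, theta)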
I hope this helps :)
If I understand your approach correctly, you are counting the "whiteness" or "blackness" over each row/column and want to use those distributions to determine the grid.
What about extracting the grid lines as you planned to do and measuring them, to find a candidate value for the grid spacing sp? (The whole procedure is the same for rows and columns and can be done independently.)
From there you can create a candidate grid with that spacing.
Now measure the candidate the same way you measured the source:
Extract only the spikes in your graph as discrete values, and do that for the source image s and the candidate grid c. We are only interested in the coordinate axis, and we will have to offset said axis so that one of their respective spikes match.
Now, for each value x_s in your discrete value list for s, find the corresponding value x_c in c, with (x_s - sp/2) < x_c < (x_s + sp/2).
If there is at least one x_s that has no x_c, consider the test a fail (an abort criterion for later, or the sp candidate was way off).
Once all x_s values have a corresponding x_c, calculate their differences, adjust sp with the mean difference, and test the new candidate.
We are looking for the biggest sp that passes the test, and since smaller sp values could also pass (think: if sp is the grid spacing, sp/(2^x) will also pass the test), you can test whether sp*2 still passes, or you can look at how many x_c values have no x_s value.
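A rough sketch of that test-and-refine loop (assuming the spike positions have already been extracted into a sorted list; the least-squares refinement stands in for the mean-difference adjustment described above):

def test_spacing(spikes, sp):
    """Check whether candidate spacing sp explains the sorted spike positions;
    return a refined spacing, or None if the test fails."""
    origin = spikes[0]                     # offset the axes so the first spikes match
    ks, xs = [], []
    for x_s in spikes:
        k = round((x_s - origin) / sp)     # index of the nearest candidate grid line
        x_c = origin + k * sp
        if abs(x_s - x_c) >= sp / 2:       # no x_c within (x_s - sp/2, x_s + sp/2)
            return None
        ks.append(k)
        xs.append(x_s - origin)
    # refine sp from the matched pairs (least squares over x_s ~ k * sp)
    denom = sum(k * k for k in ks)
    return sum(k * x for k, x in zip(ks, xs)) / denom if denom else sp

def largest_spacing(spikes, sp0):
    # the biggest sp that passes: if sp passes, sp/2 would too, so keep doubling
    sp = test_spacing(spikes, sp0)
    while sp is not None:
        bigger = test_spacing(spikes, sp * 2)
        if bigger is None:
            break
        sp = bigger
    return sp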

How to make moving average using geopandas nearest neighbors?

I have a geodataframe ("GDF") with one column as "values", and another column as "geometry" (that in fact are actual geographical regions), so each row represents a region.
The "values" column is zero in many rows, and a large number in some rows.
I need to make a "moving average" or rolling average, using the nearest neighbors up to a certain "max_distance" (we can assume that the GDF has a locally projected CRS, so the max_distance has real meaning).
Thus, the averaged_values would have neither zero nor large values in most of the regions, but an average value.
One way to do it would be
for region in GDF:
averaged_values=sjoin_nearest(GDF,GDF,maxdistance=1000).values.mean()
But really I don't know how to proceed.
The expected output would be a geodataframe with 3 columns:
"values", "averaged_values", and "geometry".
Any ideas?
What you are trying to do is also called a spatial lag. The best way is to create a spatial weights matrix based on a set distance and compute the lag, both using the libpysal library, which is part of the geopandas ecosystem.
import libpysal
# create weights
W = libpysal.weights.DistanceBand.from_dataframe(gdf, threshold=1000)
# row-normalise weights
W.transform = "r"
# create lag
gdf["averaged_values"] = libpysal.weights.lag_spatial(W, gdf["values"])

How can I find the sum of the absolute value of the difference between two columns?

I am trying to find the Spearman's Footrule Distance between two columns in google sheets in only one cell. The Spearman's Footrule Distance basically finds the distance between two vectors by summing the absolute value of the differences between the elements at each index of the vector. For example, the distance between (-1, 0, 2) and (1, -1, 2) is (|-1-1| + |0--1| + |2-2|) = (|-2|+|1|+|0|) = 2+1+0 = 3. I have a formula that finds the sum of the difference between the two columns, but I can't figure out how to make it sum the absolute value of the difference.
So far, this is what I have: =SUMPRODUCT(B1:B3>A1:A3,B1:B3-A1:A3). This returns exactly what it is supposed to return, but I want it to be the absolute value of the difference. How can I achieve this?
Side Note: I have to find this in only one cell because my co-editors do not want any extra columns (hidden or not) in this particular spreadsheet.
spreadsheet (sample data in the first worksheet): https://docs.google.com/spreadsheets/d/12CXk-vzJxYaEhD1QsAXx25JRDDubnqV9zAvG2sc1ykw/edit#gid=64105883
You can do
=ARRAYFORMULA(SUM(ABS(A1:A3-B1:B3)))
ARRAYFORMULA is nice for applying an operation to multiple ranges/arrays and getting back the result as an array so that you can do things to that (like sum).
Enables the display of values returned from an array formula into multiple rows and/or columns and the use of non-array functions with arrays.
See docs here.
this would also work:
=ARRAYFORMULA(SUM(SUBSTITUTE(A1:A3-B1:B3, "-", )*1))

Divide grid (2D array) into random shaped parts?

The Problem
I want to divide a grid (2D array) into random shaped parts (think earth's tectonic plates).
Criteria are:
User inputs grid size (program should scale because this could be very large).
User inputs grid division factor (how many parts).
The grid is a rectangular-shaped hex grid; it is capped top and bottom and wraps around left and right.
No fragmentation of the parts.
No parts inside other parts.
No tiny or super-large parts.
Random shaped parts, that are not perfect circles, or strung-out snaking shapes.
My solution:
Create a method that can access/manipulate adjacent cells.
Randomly determine the size of each part (the sum of all the parts equals the size of the whole 2D array).
Fill the entire 2D array with the last part's id number.
For each part except the last:
Seed the current part id number in a random cell of the 2D array.
Iterate over the entire array and store the address of each cell adjacent to any cells already seeded with the current part id number.
Extract one of the stored addresses and fill that cell with the current part id number (and so the part starts to form).
Repeat until the part size is reached.
Note that to avoid parts with long strung-out "arms" or big holes inside them, I created two storage arrays: one for cells adjacent to just one cell with the current part id number, and the other for cells adjacent to more than one; I exhaust the latter before the former.
Running my solution gives the following:
Grid size: 200
width: 20
height: 10
Parts: 7
66633333111114444466
00033331111114444466
00003331111114444466
00003331111144444660
00000333111164444660
00000336111664422600
00000336615522222200
00006655555522222200
00006655555552222220
00066655555552222220
Part number: 0
Part size: 47
Part number: 1
Part size: 30
Part number: 2
Part size: 26
Part number: 3
Part size: 22
Part number: 4
Part size: 26
Part number: 5
Part size: 22
Part number: 6
Part size: 27
Problems with my solution:
The last part is always fragmented - in the case above there are three separate groups of sixes.
The algorithm will stall when parts form in cul-de-sacs and don't have room to grow to their full size (the algorithm does not allow forming parts over other parts, unless it's the last part, which is laid down over the entire 2D array at the start).
If I don't specify the part sizes before forming the 2D array, and just make do with specifying the number of parts and randomly generating the part sizes on the fly, this leaves open the possibility of tiny parts being formed that might as well not be there at all, especially when the 2D array is very large. My current part-size method limits the part sizes to between 10% and 40% of the total size of the 2D array. I may be okay with not specifying the part sizes if there is some super-elegant way to do this - the only control the user will have is the 2D array size and the number of parts.
Other ideas:
Form the parts in perfectly aligned squares, then run over the 2D array and randomly allow each part to encroach on other parts, warping them into random shapes.
Draw snaking lines across the grid and fill in the spaces created, maybe using some math like this: http://mathworld.wolfram.com/PlaneDivisionbyLines.html
Conclusion:
So here's the rub: I am a beginner programmer who is unsure if I'm tackling this problem in the right way. I can create some more "patch up" methods, that shift the fragmented parts together, and allow forming parts to "jump out" of the cul-de-sacs if they get stuck in them, but it feels messy.
How would you approach this problem? Is there some sexy math I could use to simplify things perhaps?
Thx
I did something similar for a game a few months back, though it was a rectangular grid rather than a hex grid. Still, the theory is the same, and it came up with nice contiguous areas of roughly equal size -- some were larger, some were smaller, but none were too small or too large. YMMV.
Make an array of pointers to all the spaces in your grid. Shuffle the array.
Assign the first N of them IDs -- 1, 2, 3, etc.
Until the array no longer points to any spaces without IDs,
Iterate through the array looking for spaces that do not have IDs
If the space has neighbors in the grid that DO have IDs, assign the space
the ID from a weighted random selection of the IDs of its neighbors.
If it doesn't have neighbors with IDs, skip to the next.
Once there are no spaces left without IDs, you have your map with sufficiently blobby areas.
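A compact sketch of that procedure on a plain square grid (a 4-neighbourhood without the question's hex layout or wrap-around, and at least one part, which are assumptions):

import random

def grow_regions(width, height, n_parts, seed=None):
    rng = random.Random(seed)
    grid = [[None] * width for _ in range(height)]
    cells = [(x, y) for y in range(height) for x in range(width)]
    rng.shuffle(cells)
    # assign the first N shuffled cells the IDs 0..N-1
    for part, (x, y) in enumerate(cells[:n_parts]):
        grid[y][x] = part
    remaining = set(cells[n_parts:])
    while remaining:
        for x, y in list(remaining):
            # IDs of already-assigned neighbours; duplicates make the later
            # random choice a frequency-weighted one
            neigh = [grid[ny][nx]
                     for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                     if 0 <= nx < width and 0 <= ny < height
                     and grid[ny][nx] is not None]
            if neigh:
                grid[y][x] = rng.choice(neigh)
                remaining.discard((x, y))
    return grid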
Here's what I'd do: use the Voronoi algorithm. First place some random points, then let the Voronoi algorithm generate the parts. To get an idea of what it looks like, consult this applet.
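On a discrete grid the same idea reduces to "assign every cell to its nearest random seed", for example (Euclidean distance on a square grid; the hex layout and wrap-around from the question would need a different distance function):

import random

def voronoi_regions(width, height, n_parts, seed=None):
    rng = random.Random(seed)
    seeds = [(rng.randrange(width), rng.randrange(height)) for _ in range(n_parts)]
    # each cell gets the id of the closest seed
    return [[min(range(n_parts),
                 key=lambda i: (x - seeds[i][0]) ** 2 + (y - seeds[i][1]) ** 2)
             for x in range(width)]
            for y in range(height)]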
As Rekin suggested, a Voronoi diagram plus some random perturbation will generally do a good job, and on a discretized space like you've got, is relatively easy to implement.
I just wanted to give some ideas about how to do the random perturbation. If you do it at the final resolution, then it's either going to take a very long time, or be pretty minimal. You might try doing a multi-resolution perturbation. So, start with a rather small grid, randomly seed, compute the Voronoi diagram. Then randomly perturb the borders - something like, for each pair of adjacent cells with different regions, push the region one way or the other. You might need to run a post-process to make sure you have no tiny islands.. a simple floodfill will work.
Then create a grid that's twice the size (in each direction), and copy your regions over. You can probably use nearest neighbor. Then perturb the borders again, and repeat until you reach your desired resolution.

Find the "largest" dense sub matrix in a large sparse matrix

Given a large sparse matrix (say 10k+ by 1M+), I need to find a subset, not necessarily contiguous, of the rows and columns that form a dense matrix (all non-zero elements). I want this sub matrix to be as large as possible (not the largest sum, but the largest number of elements) within some aspect-ratio constraints.
Are there any known exact or approximate solutions to this problem?
A quick scan on Google seems to give a lot of close-but-not-exactly results. What terms should I be looking for?
Edit: Just to clarify; the sub matrix need not be contiguous. In fact the row and column order is completely arbitrary, so adjacency is completely irrelevant.
A thought based on Chad Okere's idea
1. Order the rows from largest count to smallest count (not necessary, but might help perf)
2. Select two rows that have a "large" overlap
3. Add all other rows that won't reduce the overlap
4. Record that set
5. Add whatever row reduces the overlap by the least
6. Repeat at #3 until the result gets too small
7. Start over at #2 with a different starting pair
8. Continue until you decide the result is good enough
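A rough sketch of those steps (A is assumed to be a boolean NumPy matrix with at least two rows; the pair-selection, restart count and stopping size are assumptions):

import numpy as np

def greedy_overlap(A, min_cols=2, n_restarts=20, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(-A.sum(axis=1))            # step 1: densest rows first
    best_score, best = 0, None
    for _ in range(n_restarts):                   # steps 7-8: restart with new pairs
        i, j = rng.choice(order[:max(2, len(order) // 10)], 2, replace=False)
        rows, overlap = {i, j}, A[i] & A[j]       # step 2: a pair with large overlap
        while overlap.sum() >= min_cols:
            for r in order:                       # step 3: rows that keep the overlap
                if r not in rows and (A[r] & overlap).sum() == overlap.sum():
                    rows.add(r)
            score = len(rows) * int(overlap.sum())
            if score > best_score:                # step 4: record this candidate
                best_score, best = score, (sorted(rows), np.flatnonzero(overlap))
            rest = [r for r in order if r not in rows]
            if not rest:
                break
            r = max(rest, key=lambda r: int((A[r] & overlap).sum()))
            rows.add(r)                           # step 5: row that shrinks the overlap least
            overlap = overlap & A[r]              # step 6: repeat until too small
    return best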
I assume you want something like this. You have a matrix like
1100101
1110101
0100101
You want columns 1,2,5,7 and rows 1 and 2, right? That submatrix would be 4x2 with 8 elements. Or you could go with columns 2,5,7 and rows 1,2,3, which would be a 3x3 matrix.
If you want an 'approximate' method, you could start with a single non-zero element, then go on to find another non-zero element and add it to your list of rows and columns. At some point you'll run into a non-zero element that, if its row and column were added to your collection, would mean the collection is no longer entirely non-zero.
So for the above matrix, if you added 1,1 and 2,2 you would have rows 1,2 and columns 1,2 in your collection. If you tried to add 3,7 it would cause a problem because 3,1 is zero. So you couldn't add it. You could add 2,5 and 2,7 though, creating the 4x2 submatrix.
You would basically iterate until you can't find any more new rows and columns to add. That would get you to a local optimum. You could store the result and start again with another starting point (perhaps one that didn't fit into your current solution).
Then just stop when you can't find any more after a while.
That, obviously, would take a long time, but I don't know if you'll be able to do it any more quickly.
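A small sketch of that greedy growth from a single non-zero starting element (A is assumed to be a boolean NumPy matrix; the dense-array form is for illustration only, since the sizes in the question would call for a sparse representation):

import numpy as np

def grow_from(A, start):
    rows, cols = {start[0]}, {start[1]}
    nonzeros = list(zip(*np.nonzero(A)))
    changed = True
    while changed:                        # stop when no further element can be added
        changed = False
        for r, c in nonzeros:
            if r in rows and c in cols:
                continue
            cand_rows, cand_cols = rows | {r}, cols | {c}
            # take the element only if the enlarged submatrix stays all non-zero
            if A[np.ix_(sorted(cand_rows), sorted(cand_cols))].all():
                rows, cols, changed = cand_rows, cand_cols, True
    return sorted(rows), sorted(cols)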
I know you aren't working on this anymore, but I thought someone might have the same question as me in the future.
So, after realizing this is an NP-hard problem (by reduction to MAX-CLIQUE) I decided to come up with a heuristic that has worked well for me so far:
Given an N x M binary/boolean matrix, find a large dense submatrix:
Part I: Generate reasonable candidate submatrices
Consider each of the N rows to be a M-dimensional binary vector, v_i, where i=1 to N
Compute a distance matrix for the N vectors using the Hamming distance
Use the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to cluster vectors
Initially, each of the v_i vectors is a singleton cluster. Step 3 above (clustering) gives the order that the vectors should be combined into submatrices. So each internal node in the hierarchical clustering tree is a candidate submatrix.
Part II: Score and rank candidate submatrices
For each submatrix, calculate D, the number of elements in the dense submatrix obtained by eliminating any column with one or more zeros
Select the submatrix that maximizes D
I also had some considerations regarding the minimum number of rows that needed to be preserved from the initial full matrix, and I would discard any candidate submatrices that did not meet this criterion before selecting a submatrix with the maximum D value.
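A condensed sketch of Parts I and II with SciPy (the minimum-rows constraint is exposed as a parameter; A is assumed to be an N x M binary NumPy array with N >= 2):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def best_dense_submatrix(A, min_rows=2):
    n = A.shape[0]
    # Part I: Hamming distances between row vectors, then UPGMA clustering
    Z = linkage(pdist(A, metric="hamming"), method="average")
    members = {i: [i] for i in range(n)}           # leaves are singleton clusters
    best_D, best = 0, None
    for k, (a, b, _, _) in enumerate(Z):
        rows = members[int(a)] + members[int(b)]   # rows of this candidate submatrix
        members[n + k] = rows
        if len(rows) < min_rows:
            continue
        # Part II: drop every column containing a zero, score by element count
        dense_cols = np.flatnonzero(A[rows].all(axis=0))
        D = len(rows) * len(dense_cols)
        if D > best_D:
            best_D, best = D, (sorted(rows), dense_cols)
    return best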
Is this a Netflix problem?
MATLAB or some other sparse matrix libraries might have ways to handle it.
Is your intent to write your own?
Maybe the 1D approach for each row would help you. The algorithm might look like this:
Loop over each row
Find the index of the first non-zero element
Find the index of the non-zero row element with the largest span between non-zero columns in each row and store both.
Sort the rows from largest to smallest span between non-zero columns.
At this point I start getting fuzzy (sorry, not an algorithm designer). I'd try looping over each row, lining up the indexes of the starting point, looking for the maximum non-zero run of column indexes that I could.
You don't specify whether or not the dense matrix has to be square. I'll assume not.
I don't know how efficient this is or what its Big-O behavior would be. But it's a brute force method to start with.
EDIT. This is NOT the same as the problem below.. My bad...
But based on the last comment below, it might be equivalent to the following:
Find the furthest vertically separated pair of zero points that have no zero point between them.
Find the furthest horizontally separated pair of zero points that have no zeros between them ?
Then the horizontal region you're looking for is the rectangle that fits between these two pairs of points?
This exact problem is discussed in a gem of a book called "Programming Pearls" by Jon Bentley, and, as I recall, although there is a solution in one dimension, there is no easy answer for the 2-d or higher dimensional variants ...
The 1-D problem is, effectively, to find the largest sum of a contiguous subset of a list of numbers:
Iterate through the elements, keeping track of a running total from a specific previous element, and the maximum subtotal seen so far (along with the start and end elements that generated it). At each element, if the running subtotal is greater than the max total seen so far, the max seen so far and the end element are updated. If the running total goes below zero, the start element is reset to the current element and the running total is reset to zero.
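In code, that 1-D scan looks roughly like this (assuming a non-empty list of numbers):

def max_subarray(xs):
    """Largest sum of a contiguous subset, with its start and end indices."""
    best_sum, best_start, best_end = xs[0], 0, 0
    running, start = 0, 0
    for i, x in enumerate(xs):
        running += x
        if running > best_sum:
            best_sum, best_start, best_end = running, start, i
        if running < 0:            # a negative prefix can never help later sums
            running, start = 0, i + 1
    return best_sum, best_start, best_end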
The 2-D problem came from an attempt to design a visual image-processing algorithm that tried to find, within a stream of brightness values representing pixels in a 2-color image, the "brightest" rectangular area within the image, i.e., to find the contained 2-D sub-matrix with the highest sum of brightness values, where "brightness" was measured by the difference between the pixel's brightness value and the overall average brightness of the entire image (so many elements had negative values).
EDIT: To look up the 1-D solution I dredged up my copy of the 2nd edition of this book, and in it, Jon Bentley says "The 2-D version remains unsolved as this edition goes to print..." which was in 1999.
