Halcon - Select subregion with highest Y value - area

I have a region that contains multiple smaller regions. After using connection and select_shape to clean it up, I am left with one single region. Sometimes, though, 2 or 3 regions meet the criteria and are displayed.
I need a way to select the region that is positioned lowest in the picture, i.e. the one with the highest Y value. The only way that comes to my mind is to use the 'area_center' operator and then iterate through all found regions, but maybe there is a more elegant way?

You can use the operator smallest_rectangle1(Regions, Row1, Column1, Row2, Column2).
In this case, the tuple Row2 contains the largest Y (row) value of each region, i.e. the bottom edge of its bounding box.
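A minimal HDevelop sketch of that approach (hedged: Regions is assumed to hold the candidate regions after select_shape, and select_obj expects 1-based indices):

smallest_rectangle1 (Regions, Row1, Column1, Row2, Column2)
* sort the bottom rows ascending and take the last one, i.e. the region lowest in the image
tuple_sort_index (Row2, SortedIndices)
select_obj (Regions, LowestRegion, SortedIndices[|SortedIndices| - 1] + 1)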


Is there any option to do a FOR loop in Excel?

I have an Excel sheet in which I'm calculating the completion average of my Scrum tasks. I also have a Story Point (SP) item in the sheet. My calculation is:
Result = SP * percentage of completion --> This calculation is done for each row, and afterwards I sum up all results to get the total.
But sometimes I add a new task, and for each new task I have to add its calculation to the average result.
Is there any way to use a for loop in Excel?
for(int i=0;i<50;i++){ if(SP!=null && task!=null)(B+i)*(L+i)}
My calculation is like below:
AVERAGE((B4*L4+B5*L5+B6*L6+B7*L7+B8*L8+B9*L9+B10*L10)/SUM(B4:B10))
First of all, AVERAGE is not doing anything in your formula, since the argument you pass to it is just one single value. You already do an average calculation by dividing by the sum. That average is in fact a weighted average, and so you could not even achieve that with a plain AVERAGE function.
I see several ways to make this formula more generic, so it keeps working when you add rows:
1. Use SUMPRODUCT
=SUMPRODUCT(B4:B100,L4:L100)/SUM(B4:B100)
The row number 100 is chosen arbitrarily, but it should evidently encompass all data rows. If no data occurs below your table, it is safe to add a large margin. You want to avoid the situation where you think you are adding a line to the table but actually end up outside the range of the formula. Using proper Excel tables can help avoid this.
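For example (a hedged illustration: assuming your data lives in an Excel table named Tasks, with the story points in a column named SP and the completion percentage in a column named Done), a structured-reference version grows automatically as rows are added to the table:
=SUMPRODUCT(Tasks[SP],Tasks[Done])/SUM(Tasks[SP])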
2. Use an array formula
This would be a second resort for when the formula becomes more complicated and cannot be executed with a "simple" SUMPRODUCT. But the above would translate to this array formula:
=SUM(B4:B100*L4:L100)/SUM(B4:B100)
Once you have typed this in the formula bar, make sure to press Ctrl+Shift+Enter to enter it. Only then will it act as an array formula.
Again, the same remark about row number 100.
3. Use an extra column
Things get easy when you use an extra column for storing the product of B & L values for each row. So you would put in cell N4 the following formula:
=B4*L4
...and then copy that relative formula to the other rows. You can hide that column if you want.
Then the overall formula can be:
=SUM(N4:N100)/SUM(B4:B100)
With this solution you must take care to always copy a row when inserting a new row, as you need the N column to have the intermediate product formula also for any new row.

Pair-wise aggregation of input lines in Hadoop

A bunch of driving cars produce traces (sequences of ordered positions)
car_id order_id position
car1 0 (x0,y0)
car1 1 (x1,y1)
car1 2 (x2,y2)
car2 0 (x0,y0)
car2 1 (x1,y1)
car2 2 (x2,y2)
car2 3 (x3,y3)
car2 4 (x4,y4)
car3 0 (x0,y0)
I would like to compute the distance (path length) driven by the cars.
At the core, I need to process all records line by line pair-wise. If the
car_id of the previous line is the same as the current one then I need to
compute the distance to the previous position and add it up to the aggregated
value. If the car_id of the previous line is different from the current line
then I need to output the aggregate for the previous car_id, and initialize the
aggregate of the current car_id with zero.
What should the architecture of the Hadoop program look like? Is it possible to
achieve the following:
Solution (1):
(a) Every mapper computes the aggregated distance of the trace (per physical
block)
(b) Every mapper aggregates the distances further in case the trace was split
among multiple blocks and nodes
Comment: this solution requires knowing whether I am on the last record (line)
of the block. Is this information available at all?
Solution (2):
(a) The mappers read the data line by line (do no computations) and send the
data to the reducer based on the car_id.
(b) The reducers sort the data for individual car_ids based on order_id,
compute the distances, and aggregate them
Comment: high network load due to laziness of mappers
Solution (3):
(a) implement a custom reader that defines a logical record to be the whole
trace of one car
(b) each mapper computes the distances and the aggregate
(c) reducer is not really needed as everything is done by the mapper
Comment: high main memory costs as the whole trace needs to be loaded into main
memory (although only two lines are used at a time).
I would go with Solution (2), since it is the cleanest to implement and reuse.
You certainly want to sort based on car_id AND order_id, so you can compute the distances on the fly without loading them all up into memory.
Your concern about high network usage is valid; however, you can pre-aggregate the distances in a combiner.
What would that look like? Let's take some pseudo-code:
Mapper:
foreach record:
    emit((car_id, order_id), (x, y))
Combiner:
prev_order_id = null
foreach (car_id, order_id), (x, y):  // values of one car_id, sorted by order_id
    if (prev_order_id != null && prev_order_id + 1 == order_id):  // subsequent measures
        // compute the distance and emit it as the last possible order
        emit((car_id, MAX_VALUE), distance((prev_x, prev_y), (x, y)))
    else:
        // send the raw position to the reducer, since it is probably crossing block boundaries
        emit((car_id, order_id), (x, y))
    prev_order_id, prev_x, prev_y = order_id, x, y
The reducer then has two main parts:
compute the sum over subsequent measures, like the combiner did
sum over all existing sums, tagged with order_id = MAX_VALUE
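A rough sketch of that reducer, in the same pseudo-code style (hedged: emit, distance and MAX_VALUE are the placeholders used in the combiner above, and the values of each car_id are assumed to arrive sorted by order_id, e.g. via a secondary sort):
Reducer:
total = 0
prev_pos = null
foreach (car_id, order_id), value:  // values of one car_id, sorted by order_id
    if (order_id == MAX_VALUE):  // a distance pre-aggregated by a combiner
        total += value
    else:  // a raw (x, y) position that crossed a block boundary
        if (prev_pos != null):
            total += distance(prev_pos, value)
        prev_pos = value
emit(car_id, total)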
That's already the best effort you can get from a network-usage point of view.
From a software point of view, you'd better use Spark: your logic will be five lines instead of a hundred spread across three class files.
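To illustrate, a hedged PySpark sketch (assuming a DataFrame named traces with columns car_id, order_id, x and y; the variable and column names are illustrative):

from pyspark.sql import functions as F, Window

# distance from each position to the previous one of the same car, ordered by order_id
w = Window.partitionBy("car_id").orderBy("order_id")
step = F.sqrt(F.pow(F.col("x") - F.lag("x").over(w), 2) + F.pow(F.col("y") - F.lag("y").over(w), 2))
# sum the steps per car; the first position of each car has a null step, which sum ignores
path_lengths = traces.withColumn("step", step).groupBy("car_id").agg(F.sum("step").alias("path_length"))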
For your other question:
this solution requires knowing whether I am on the last record (line)
of the block. Is this information available at all?
Hadoop only guarantees that it does not split through records when reading; it may very well be that your record already touches two different blocks underneath. The way to find that out is basically to rewrite your input format to make this information available to your mappers, or, even better, to take your logic into account when splitting the blocks.

Corresp biplot in R - column scores plotted as 0

I am quite new to R. I am trying to do a correspondence analysis (corresp, MASS package) on summarized data. While the output shows row and column scores, the resulting biplot shows the column scores as zero, making the plot unreadable (all values are arranged by row scores in an expected manner, but flat along the column scores).
The code is:
corresp(some_data)
biplot(corresp(some_data, nf = 2))
I would be grateful for any suggestions as to what I'm doing wrong and how to amend this, thanks in advance!
Martin
(Links to images: the plot; the corresp results.)
As suggested here:
http://www.statsoft.com/textbook/correspondence-analysis
the biplot actually depicts distributions of the row/column variables over 2 extracted dimensions where the variables' dependency is "the sharpest".
It looks like in your case a good deal of the dependency is concentrated along just one dimension, while the second dimension is already much less significant.
It does not seem, however, that your relationships are weak. On the contrary, looking at your graph, one can observe the red (column) variable's intersection with 2 distinct regions of the other variable's values.
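One quick way to check how much each axis really contributes is to look at the canonical correlations that corresp returns (a hedged sketch; some_data is the data from the question):

library(MASS)
ca <- corresp(some_data, nf = 2)
ca$cor     # canonical correlations: a much smaller second value means the 2nd axis adds little
ca$rscore  # row scores drawn in the biplot
ca$cscore  # column scores drawn in the biplot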
Makes sense?
Regards,
Igor

Representing 2D data optimized by row vs by column vs flat

In D3 I need to visualize loading lab samples into plastic 2D plates of 8 rows x 12 columns or similar. Sometimes I load a row at a time, sometimes a column at a time, occasionally I load flat 1D 0..95, or in other orderings. Should the base D3 data() structure nest rows in columns (or vice versa), or should I keep it one-dimensional?
Representing the data optimized for columns (columns[rows[]]) makes the code complex when loading by rows, and vice versa. Representing it flat (0..95) is universal, but it requires calculating all row and column references for the 2D modes. I'd rather reference all orderings from a common base, but so far it's a win-lose proposition. I lean toward 1D flat and doing the math. Is there a win-win? Is there a way to parameterize or invert the ordering and have it optimized both ways?
I believe in your case the best implementation would be an Associative array (specifically, a hash table implementation of it). Keys would be coordinates and values would be your stored data. Depending on your programming language you would need to handle keys in one way or another.
Example:
[0,0,0] -> someData(1,2,3,4,5)
[0,0,1] -> someData(4,2,4,6,2)
[0,0,2] -> someData(2,3,2,1,5)
Using a simple associative array would give you great insertion speeds and reading speeds, however code would become a mess if some complex selection of data blocks is required. In that case, using some database could be reasonable (though slower than a hashmap implementation of associative array). It would allow you to query some specific data in batches. For example, you could get whole row (or several rows) of data using one simple query:
SELECT * FROM data WHERE x=1 AND y=2 ORDER BY z ASC
Or, let's say selecting a 2x2x2 cube from the middle of 3d data:
SELECT * FROM data WHERE x>=5 AND x<=6 AND y>=10 AND y<=11 AND z>=3 AND z<=4 ORDER BY x ASC, y ASC, z ASC
EDIT:
On second thought, if the size of the dimensions won't change during runtime, you should go with a one-dimensional array and do all the math yourself, as it is the fastest solution. If you initialize a 3-dimensional array as an array of arrays of arrays, every read/write of an element requires 2 additional hops in memory to find the required address. However, writing some function like:
int pos(int w, int h, int x, int y, int z) { return z*w*h + y*w + x; } // w, h - dimensions; x, y, z - position
would be inlined by most compilers and is pretty fast.
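In the D3/JavaScript setting of the question, the same flat-array idea for an 8 x 12 plate might look like this (a hedged sketch; nRows, nCols, wells and the helper names are purely illustrative):

const nRows = 8, nCols = 12;                        // plate dimensions
const wells = new Array(nRows * nCols).fill(null);  // flat 1D data, indices 0..95
const index = (row, col) => row * nCols + col;      // 2D -> flat
const rowOf = i => Math.floor(i / nCols);           // flat -> row
const colOf = i => i % nCols;                       // flat -> column
// e.g. pick out all cells of row 3, in column order, without restructuring the data:
const row3 = wells.filter((d, i) => rowOf(i) === 3);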

Optimal data structure for a table

Our team is working on the implementation of a table widget for a mobile platform (one of the applications is a mobile office like MS Excel).
We need to optimize the data structure for storing the table data (currently a simple 2-D array is used).
Could you please suggest an optimal data structure for storing table data? Below are some of the requirements for the data structure:
the size of the table can be up to 2^32 x 2^32;
the majority of table cells are empty (i.e. the table is sparse), so it is desirable not to store data for empty cells;
the interface of the data structure should support inserting/removing rows and columns;
the data structure should allow iterating through non-empty cells in both forward and backward direction;
cells of the table can be merged (i.e. one cell can span more than one row and/or column).
After thinking more about the problem with the row/column insertion/deletion, I've come up with something that looks promising.
First, create and maintain 2 sorted data structures (e.g. search trees) containing all horizontal and all vertical indices that have at least one non-empty cell.
For this table:
  A B C D E
1
2 *
3   %     #
4
5       $
You'd have:
A,B,D,E - used horizontal indices
2,3,5 - used vertical indices
Store those A,B,D,E,2,3,5 index values inside some kind of a node in the 2 aforementioned structures such that you can link something to it knowing the node's address in memory (again, a tree node fits perfectly).
In each cell (non-empty) have a pair of links to the index nodes describing its location (I'm using & to denote a link/reference to a node):
*: &2,&A
%: &3,&B
#: &3,&E
$: &5,&D
This is sufficient to define a table.
Now, how do we handle row/column insertion? We insert the new row/column index into the respective (horizontal or vertical) index data structure and update the index values after it (=to the right or below). Then we add new cells for this new row/column (if any) and link them to the appropriate index nodes.
For example, let's insert a row between rows 3 and 4 and add a cell with # in it at 4C (in the new row):
  A B C D E
1
2 *
3   %     #
4     #        <- new row 4
5              <- used to be row 4
6         $    <- used to be row 5
Your index structures are now:
A,B,C(new),D,E - used horizontal indices
2,3,4(new),6(used to be 5) - used vertical indices
The cells now link to the index nodes like this:
*: &2,&A - same as before
%: &3,&B - same as before
#: &3,&E - same as before
#: &4,&C - new cell linking to new index nodes 4 and C
$: &6,&D - used to be &5,&D
But look at the $ cell. It still points to the same two physical nodes as before, it's just that the vertical/row node now contains index 6 instead of index 5.
If there were 100 cell nodes below the $ cell, say occupying only 5 non-empty rows, you'd need to update only 5 indices in the row/vertical index data structure, not 100.
You can delete rows and columns in a similar fashion.
Now, to make this all useful, you also need to be able to locate every cell by its coordinates.
For that you can create another sorted data structure (again, possibly a search tree), where every key is a combination of the addresses of the index nodes and the value is the location of cell data (or the cell data itself).
With that, if you want to get to cell 3B, you find the nodes for 3 and B in the index data structures, take their addresses &3 and &B, combine them into &3*2^32+&B and use that as a key to locate the % cell in the 3rd data structure I've just defined. (Note: 2^32 is actually 2^(pointer size in bits) and can vary from system to system.)
Whatever happens to other cells, the addresses &3 and &B in the %'s cell links will remain the same, even if the indices of the % cell change from 3B to something else.
You may develop iteration on top of this easily.
Merging should be feasible too, but I haven't focused on it.
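A compact sketch of the index-node idea (hedged: Python for brevity, with plain sorted lists standing in for the search trees, linear scans instead of tree lookups, and a tuple of object ids instead of the pointer arithmetic described above; the class and method names are purely illustrative):

class IndexNode:
    def __init__(self, index):
        self.index = index                  # current row or column number; mutated on insert/delete

class SparseTable:
    def __init__(self):
        self.rows = []                      # IndexNodes of non-empty rows, kept sorted by index
        self.cols = []                      # IndexNodes of non-empty columns, kept sorted by index
        self.cells = {}                     # (id(row_node), id(col_node)) -> cell value

    def _node(self, nodes, index):
        for n in nodes:                     # a real implementation would search a tree in O(log n)
            if n.index == index:
                return n
        node = IndexNode(index)
        nodes.append(node)
        nodes.sort(key=lambda n: n.index)
        return node

    def set(self, row, col, value):
        r, c = self._node(self.rows, row), self._node(self.cols, col)
        self.cells[(id(r), id(c))] = value  # the key survives renumbering, as described above

    def get(self, row, col):
        r = next((n for n in self.rows if n.index == row), None)
        c = next((n for n in self.cols if n.index == col), None)
        return self.cells.get((id(r), id(c))) if r and c else None

    def insert_row(self, at):
        for n in self.rows:                 # only the few non-empty rows at/below the new row shift
            if n.index >= at:
                n.index += 1

t = SparseTable()
t.set(3, 2, "%")    # a cell at row 3, column 2
t.insert_row(2)     # insert a row above it: only the row node's index is updated
print(t.get(4, 2))  # -> "%"; the cell's key (the node identities) never changed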
I would suggest just storing key-value pairs, like you would in Excel. For example, think of how your Excel document has columns A - AA etc. and rows 1 - 256000 etc. So just store the values that have data as some type of key-value pairs.
For example:
someKeyValueStore = new KeyValueStore();
someData = new Cell("A1", "SomeValue");
someOtherData = new Cell("C2", "SomeOtherValue");
someKeyValueStore.AddKeyValuePair(someData);
someKeyValueStore.AddKeyValuePair(someOtherData);
In this case you don't have to care about empty cells at all. You just have access to the ones that are not empty. Of course you probably would want to keep track of the keys in a collection so you could easily see if you had a value for a particular key or not. But that is essentially the simplest way to handle it.
