Match closest value from two different files and print specific columns - bash

Hi guys, I have two files, each with N columns and M rows.
File1
1 2 4 6 8
20 4 8 10 12
15 5 7 9 11
File2
1 a1 b1 c5 d1
2 a1 b2 c4 d2
3 a2 b3 c3 d3
19 a3 b4 c2 d4
14 a4 b5 c1 d5
What I need is to find the closest value in column 1 and print specific columns in the output. For example, the output should be:
File3
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
Since 1 = 1, 19 is the closest to 20, and 14 is the closest to 15, those are the lines in the output.
How can I do this in awk or any other tool?
Help!
This is what I have so far:
echo "ARGIND == 1 {
s1[\$1]=\$1;
s2[\$1]=\$2;
s3[\$1]=\$3;
s4[\$1]=\$4;
s5[\$1]=\$5;
}
ARGIND == 2 {
bestdiff=-1;
for (v in s1)
if (bestdiff < 0 || (v-\$1)**2 <= bestdiff)
{
s11=s1[v];
s12=s2[v];
s13=s3[v];
s14=s4[v];
s15=s5[v];
bestdiff=(v-\$1)**2;
if (bestdiff < 2){
print \$0
print s11,s12,s13,s14,s15}}">diff.awk
awk -f diff.awk file2 file1
output:
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 1
14 a4 b5 c1 d5
1 2
1 1
14 15
I have no idea why the last three lines appear.

Here is what I ended up with as an answer:
function closest(b, i,    x, tmp, distance, found) { # extra parameters act as local variables
    distance = -1                          # -1 means no candidate has been seen yet
    for (x in b) {                         # loop over the array to get its keys
        tmp = (x+0 > i+0) ? x - i : i - x  # +0 forces a numeric comparison; tmp is the absolute difference
        if (distance < 0 || tmp < distance) { # if this key is closer than the best so far, update
            distance = tmp
            found = x                      # and save the closest key found so far
        }
    }
    return found                           # return the closest key
}
{                                          # runs for every line of both files (no condition)
    if (NR > FNR) {                        # NR > FNR only once we are past the first file
        b[$1] = $0                         # second file: array with $1 as key
    } else {
        akeys[max++] = $1                  # remember the first-file keys in input order, as for (x in array) does not guarantee the order
        a[$1] = $0                         # first file: array with $1 as key
    }
}
END {                                      # both files are parsed, print the result
    for (i = 0; i < max; i++) {            # loop over the first-file keys in input order
        print a[akeys[i]]                  # print the line from the first file
        if (akeys[i] in b) {               # if the same key exists in the second file
            print b[akeys[i]]              # then print it
        } else {
            bindex = closest(b, akeys[i])  # otherwise find the closest key in the second file
            print b[bindex]                # print what we found
        }
    }
}
I hope the comments are enough to make this clear; feel free to comment if needed.
Warning: this may become really slow if you have a large number of lines in the second file, as the second array is scanned for each key of the first file that is not present in the second file.
Given your sample inputs a1 and a2:
$ mawk -f closest.awk a1 a2
1 2 4 6 8
1 a1 b1 c5 d1
20 4 8 10 12
19 a3 b4 c2 d4
15 5 7 9 11
14 a4 b5 c1 d5
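The slowness mentioned in the warning comes from scanning every key of the second file for each lookup. One way around it is to sort the keys once and binary-search them. Below is a minimal Python sketch of that idea, not a drop-in replacement for the awk script; it assumes numeric first columns and whitespace-separated fields, and the names `closest_key` and `match_closest` are made up for illustration.

```python
from bisect import bisect_left

def closest_key(sorted_keys, target):
    """Return the key in sorted_keys nearest to target (ties go to the smaller key)."""
    pos = bisect_left(sorted_keys, target)
    if pos == 0:
        return sorted_keys[0]
    if pos == len(sorted_keys):
        return sorted_keys[-1]
    before, after = sorted_keys[pos - 1], sorted_keys[pos]
    return before if target - before <= after - target else after

def match_closest(lines1, lines2):
    """For each line of lines1, emit it followed by the lines2 line whose first column is nearest."""
    by_key = {float(line.split()[0]): line for line in lines2}
    keys = sorted(by_key)  # sort once, then each lookup is O(log n)
    out = []
    for line in lines1:
        out.append(line)
        out.append(by_key[closest_key(keys, float(line.split()[0]))])
    return out

file1 = ["1 2 4 6 8", "20 4 8 10 12", "15 5 7 9 11"]
file2 = ["1 a1 b1 c5 d1", "2 a1 b2 c4 d2", "3 a2 b3 c3 d3",
         "19 a3 b4 c2 d4", "14 a4 b5 c1 d5"]
print("\n".join(match_closest(file1, file2)))
```

With the sample data from the question this prints the same six lines as the awk version.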

Related

Find linear trend up to the maximum value using awk

I have a datafile as below:
ifile.txt
-10 /
-9 /
-8 /
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
5 40
6 35
7 30
8 25
9 /
10 /
Here "/" are the missing values. I would like to compute the linear trend up to the maximum value in the y-axis (i.e. up to the value "41" in 2nd column). So it should calculate the trend from the following data:
-7 3
-6 4
-5 13
-4 16
-3 17
-2 23
-1 26
0 29
1 32
2 35
3 38
4 41
Other (x, y) pairs won't be considered because the y values after (4, 41) are less than 41.
The following script is working fine for all values:
awk '!/\//{sx+=$1; sy+=$2; c++;
sxx+=$1*$1; sxy+=$1*$2}
END {det=c*sxx-sx*sx;
print (det?(c*sxy-sx*sy)/det:"DIV0")}' ifile.txt
But I am not able to do it only up to the maximum value.
For the given example the result will be 3.486
Updated based on your comments. I assumed your trend calculations were good and used them:
$ awk '
$2!="/" {
b1[++j]=$1 # buffer them up until or if used
b2[j]=$2
if(max=="" || $2>max) { # once a bigger than current max found
max=$2 # new champion
for(i=1;i<=j;i++) { # use all so far buffered values
# print b1[i], b2[i] # debug to see values used
sx+=b1[i] # Your code from here on
sy+=b2[i]
c++
sxx+=b1[i]*b1[i]
sxy+=b1[i]*b2[i]
}
j=0 # buffer reset
delete b1
delete b2
}
}
END {
det=c*sxx-sx*sx
print (det?(c*sxy-sx*sy)/det:"DIV0")
}' file
For data:
0 /
1 1
2 2
3 4
4 3
5 5
6 10
7 7
8 8
with debug print uncommented program would output:
1 1
2 2
3 4
4 3
5 5
6 10
1.51429
You can update the sums only when $2 > max and save the intermediate rows in variables, for example using associative arrays:
awk '
$2 == "/" {next}
$2 > max {
# update max if $2 > max
max = $2;
# add all elements of a1 to a and b1 to b
for (k in a1) {
a[k] = a1[k]; b[k] = b1[k]
}
# add the current row to a, b
a[NR] = $1; b[NR] = $2;
# reset a1, b1
delete a1; delete b1;
next;
}
# if $2 <= max, then set a1, b1
{ a1[NR] = $1; b1[NR] = $2 }
END{
for (k in a) {
#print k, a[k], b[k]
sx += a[k]; sy += b[k]; sxx += a[k]*a[k]; sxy += a[k]*b[k]; c++
}
det=c*sxx-sx*sx;
print (det?(c*sxy-sx*sy)/det:"DIV0")
}
' ifile.txt
#3.48601
Or calculate sx, sy etc directly instead of using arrays:
awk '
$2 == "/" {next}
$2 > max {
# update max if $2 > max
max = $2;
# add the current Row plus the cached values
sx += $1+sx1; sy += $2+sy1; sxx += $1*$1+sxx1; sxy += $1*$2+sxy1; c += 1+c1
# reset the cached variables
sx1 = 0; sy1 = 0; sxx1 = 0; sxy1 = 0; c1 = 0;
next;
}
# if $2 <= max, then calculate and cache the values
{ sx1 += $1; sy1 += $2; sxx1 += $1*$1; sxy1 += $1*$2; c1++ }
END{
det=c*sxx-sx*sx;
print (det?(c*sxy-sx*sy)/det:"DIV0")
}
' ifile.txt
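For comparison, the same computation can be written without the buffering trick by first locating the maximum and then summing over the rows up to it. This is only a sketch in Python under the same assumptions as the awk answers ("/" marks a missing value, and the first occurrence of the maximum is the cut-off); the function name `slope_up_to_max` is hypothetical.

```python
def slope_up_to_max(text):
    """Least-squares slope over the rows up to and including the first maximum y.

    text holds one "x y" pair per line; a y of "/" marks a missing value.
    Returns None when the slope is undefined (the awk scripts print DIV0).
    """
    points = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1] != "/":
            points.append((float(fields[0]), float(fields[1])))
    if not points:
        return None
    # max() returns the first maximal element, matching the strict ">" in the awk code
    peak = max(range(len(points)), key=lambda i: points[i][1])
    used = points[:peak + 1]
    n = len(used)
    sx = sum(x for x, _ in used)
    sy = sum(y for _, y in used)
    sxx = sum(x * x for x, _ in used)
    sxy = sum(x * y for x, y in used)
    det = n * sxx - sx * sx
    return (n * sxy - sx * sy) / det if det else None

sample = """0 /
1 1
2 2
3 4
4 3
5 5
6 10
7 7
8 8"""
print(round(slope_up_to_max(sample), 5))  # 1.51429
```

Unlike the streaming awk versions, this keeps all rows in memory, which is fine for small files but is a trade-off to be aware of.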

Algorithm for visiting all grid cells in pseudo-random order that has a guaranteed uniformity at any stage

Context:
I have a hydraulic erosion algorithm that needs to receive an array of droplet starting positions. I also already have a pattern replicating algorithm, so I only need a good pattern to replicate.
The Requirements:
I need an algorithm that produces a set of n^2 entries, each of the form (x,y) or [index], that describe cells in an n×n grid (where n = 2^i and i is any positive integer).
(Being a set means that every cell is mentioned in exactly one entry.)
The pattern created by the algorithm should show little to no clustering of "visited" cells at any stage.
The cell (0,0) is as close to (n-1,n-1) as it is to (1,1), that is, the grid wraps around; this matters for the definition of clustering.
Note
I was/am trying to find solutions through fractal-like patterns built through recursion, but at the time of writing this, my solution is a lookup table of a checkerboard pattern(list of black cells + list of white cells) (which is bad, but yields fewer artifacts than an ordered list)
C, C++, C#, Java implementations (if any) are preferred
You can use a linear congruential generator to create an even distribution across your n×n space. For example, if you have a 64×64 grid, using a stride of 47 will create the pattern on the left below. (Run on jsbin) The cells are visited from light to dark.
That pattern does not cluster, but it is rather uniform. It uses a simple row-wide transformation where
k = (k + 47) mod (n * n)
x = k mod n
y = k div n
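The three formulas above can be sketched in Python as follows. The function name `stride_order` is made up for illustration, and the coprimality check is an addition of this sketch: stepping by a fixed stride modulo n*n only reaches every cell when the stride shares no factor with n*n, which for a power-of-two n means any odd stride.

```python
from math import gcd

def stride_order(n, stride=47):
    """Yield (x, y) for every cell of an n x n grid, stepping by `stride` modulo n*n."""
    total = n * n
    if gcd(stride, total) != 1:
        raise ValueError("stride must be coprime with n*n to visit every cell")
    k = 0
    for _ in range(total):
        k = (k + stride) % total   # k = (k + 47) mod (n * n)
        yield k % n, k // n        # x = k mod n, y = k div n

cells = list(stride_order(64))
print(cells[:3])
```

Each cell of the 64×64 grid appears exactly once in `cells`, in the order it would be visited.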
You can add a bit of randomness by making k the index of a space-filling curve such as the Hilbert curve. This will yield the pattern on the right. (Run on jsbin)
You can see the code in the jsbin links.
I have solved the problem myself and am just sharing my solution.
Here are my outputs for i between 0 and 3:
power: 0
ordering:
0
matrix visit order:
0
power: 1
ordering:
0 3 2 1
matrix visit order:
0 3
2 1
power: 2
ordering:
0 10 8 2 5 15 13 7 4 14 12 6 1 11 9 3
matrix visit order:
0 12 3 15
8 4 11 7
2 14 1 13
10 6 9 5
power: 3
ordering:
0 36 32 4 18 54 50 22 16 52 48 20 2 38 34 6
9 45 41 13 27 63 59 31 25 61 57 29 11 47 43 15
8 44 40 12 26 62 58 30 24 60 56 28 10 46 42 14
1 37 33 5 19 55 51 23 17 53 49 21 3 39 35 7
matrix visit order:
0 48 12 60 3 51 15 63
32 16 44 28 35 19 47 31
8 56 4 52 11 59 7 55
40 24 36 20 43 27 39 23
2 50 14 62 1 49 13 61
34 18 46 30 33 17 45 29
10 58 6 54 9 57 5 53
42 26 38 22 41 25 37 21
the code:
public static int[] GetPattern(int power, int maxReturnSize = int.MaxValue)
{
    int sideLength = 1 << power;
    int cellsNumber = sideLength * sideLength;
    int[] ret = new int[cellsNumber];
    for ( int i = 0 ; i < cellsNumber && i < maxReturnSize ; i++ ) {
        // this loop's body can be used for per-request computation
        int x = 0;
        int y = 0;
        for ( int p = power - 1 ; p >= 0 ; p-- ) {
            int temp = (i >> (p * 2)) % 4; // 2 bits of the index, starting from the beginning
            int a = temp % 2; // the first bit
            int b = temp >> 1; // the second bit
            x += a << (power - 1 - p);
            y += (a ^ b) << (power - 1 - p); // ^ is XOR
            // 00 => (0,0), 01 => (1,1), 10 => (0,1), 11 => (1,0), scaled to 2^p where 0<=p
        }
        // to index
        int index = y * sideLength + x;
        ret[i] = index;
    }
    return ret;
}
I do admit that somewhere along the way the values got transposed, but it does not matter because of how it works.
After doing some optimization I came up with this loop body:
int x = 0;
int y = 0;
for ( int p = 0 ; p < power ; p++ ) {
    int temp = ( i >> ( p * 2 ) ) & 3;
    int a = temp & 1;
    int b = temp >> 1;
    x = ( x << 1 ) | a;
    y = ( y << 1 ) | ( a ^ b );
}
int index = y * sideLength + x;
(The code assumes that the C# optimizer, IL2CPP, and the C++ compiler will optimize the variables temp, a, and b away.)
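For readers who want to experiment with the pattern outside C#, here is a direct Python port of the optimized loop body. It is only a sketch for checking the visit order; the name `get_pattern` is chosen here and the `maxReturnSize` parameter of the original is left out.

```python
def get_pattern(power):
    """Return the visit order (list of cell indices) for a 2^power x 2^power grid."""
    side = 1 << power
    order = []
    for i in range(side * side):
        x = y = 0
        for p in range(power):
            pair = (i >> (2 * p)) & 3   # next two bits of the step number, lowest first
            a = pair & 1
            b = pair >> 1
            x = (x << 1) | a            # low bits of i end up in the high bits of x and y
            y = (y << 1) | (a ^ b)
        order.append(y * side + x)
    return order

print(get_pattern(2))
```

For power 2 this prints the same ordering as listed above: 0 10 8 2 5 15 13 7 4 14 12 6 1 11 9 3.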

File processing: Combining multiple files with different number of columns and rows

I have multiple tab-delimited files where only the first two columns are in common. I'm trying to combine them into one tab-delimited file.
Example: let's say we have 3 files (file1, file2, file3) that we want to combine into file4.
(row and column names are just for demonstration purposes and are not included in any of the files)
Input files =>
File1: 2 rows(r1,r2), 3 columns(c1,c2,c3)
c1 c2 c3
r1 a b c
r2 d e f
File2: 3 rows(r3,r4,r5), 3 columns(c1,c2,c4)
c1 c2 c4
r3 1 2 3
r4 4 5 6
r5 7 8 9
File3: 1 row(r6), 4 columns(c1, c2, c5, c6)
c1 c2 c5 c6
r6 w x y z
Output file =>
for all 3 files, the 2 first columns (c1, c2) have the same name
File4:
c1 c2 c3 c4 c5 c6
r1 a b c - - -
r2 d e f - - -
r3 1 2 - 3 - -
r4 4 5 - 6 - -
r5 7 8 - 9 - -
r6 w x - - y z
What I'm trying to do is: for each of the files add the needed empty columns so that all files have the same number of columns then reorder the columns with "awk" then use "cat" to stack them vertically. But I don't know if this is the best way or there is a more efficient way to do it.
Thanks,
The following does the task. It builds up a matrix, entry, which is indexed by the row and column names.
awk '(FNR==1) {
for(i=1;i<=NF;++i) {
if (!($i in columns)) { column_order[++cn] = $i; columns[$i] }
c[i+1]=$i
}
next
}
!($1 in rows) { row_order[++rn] = $1; rows[$1] }
{ for(i=2;i<=NF;++i) entry[$1,c[i]]=$i }
END {
s="";for(j=1;j<=cn;++j) s=s OFS column_order[j]; print s
for(i=1;i<=rn;++i) {
row_name=row_order[i]
s=row_name
for(j=1;j<=cn;++j) {
col_name = column_order[j]
s=s OFS ((row_name,col_name) in entry ? entry[row_name,col_name] : "-")
}
print s
}
}' file1 file2 file3 file4 ... filen
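The same bookkeeping (column order, row order, and a matrix keyed by row and column name) can be sketched in Python. This is an illustration of the idea rather than a translation of the awk script: it takes already-parsed tables instead of reading files, and `merge_tables` and its argument layout are assumptions made for this example.

```python
def merge_tables(tables, missing="-"):
    """Merge tables given as (column_names, rows); each row is [row_name, value, ...]."""
    columns, row_names, entry = [], [], {}
    for header, rows in tables:
        for name in header:                 # remember columns in first-seen order
            if name not in columns:
                columns.append(name)
        for row in rows:
            if row[0] not in row_names:     # remember rows in first-seen order
                row_names.append(row[0])
            for name, value in zip(header, row[1:]):
                entry[row[0], name] = value  # matrix indexed by (row, column) name
    merged = [[r] + [entry.get((r, c), missing) for c in columns] for r in row_names]
    return columns, merged

tables = [
    (["c1", "c2", "c3"], [["r1", "a", "b", "c"], ["r2", "d", "e", "f"]]),
    (["c1", "c2", "c4"], [["r3", "1", "2", "3"], ["r4", "4", "5", "6"], ["r5", "7", "8", "9"]]),
    (["c1", "c2", "c5", "c6"], [["r6", "w", "x", "y", "z"]]),
]
columns, merged = merge_tables(tables)
print("\t" + "\t".join(columns))
for row in merged:
    print("\t".join(row))
```

Cells that no input file provides are filled with "-", as in File4 from the question.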

Logic: Applying gravity to a vector

There is a method called gravity(Vector[] vector). The vector contains a sequence of numbers. The gravity function should return a new vector after applying gravity, as explained below.
Assume 0's are air and 1's are brick. When gravity is applied the bricks should fall down to the lowest level.
Let vector = [3, 7, 8]
Converting this to binary we get:
0 0 1 1 for 3
0 1 1 1 for 7
1 0 0 0 for 8
Applying gravity:
0 0 0 0 which is 0
0 0 1 1 which is 3
1 1 1 1 which is 15
So the gravity function should return [0, 3, 15].
I hope the explanation is clear. I tried a lot but couldn't figure out the logic for this. One thing I observed is that the sum of the numbers in the vector stays the same before and after applying gravity.
That is,
3 + 7 + 8 = 18 = 0 + 3 + 15 for the above case.
I think it is as simple as counting the total number of '1' bits at each position.
Let N be the input vector size and b be the longest binary length among the input elements.
Pre-compute the total number of '1' bits at each position, stored in count[]: O(N*b)
Run the gravity function, that is, regenerate N numbers from count[]: O(N*b)
Total run time is O(N*b)
Below is the sample code in C++
#include<bits/stdc++.h>
using namespace std;
int v[5] = {3,9,7,8,5};
int cnt[5] = {0};
vector<int> ans;
vector<int> gravity(){
vector<int> ret;
for(int i=0; i<5;i++){
int s = 0;
for(int j=0; j<5;j++)
if(cnt[j]){
s += (1<<j); cnt[j]--;
}
ret.push_back(s);
}
return ret;
}
int main(){
// precompute sum of 1 of each bit
for(int i=0, j=0, tmp=v[i]; i<5; i++, j=0, tmp=v[i]){
while(tmp){
if(tmp&1) cnt[j]++;
tmp >>= 1; j++;
}
}
ans = gravity();
for(int i=ans.size()-1; i>=0; i--) printf("%d ", ans[i]);
return 0;
}
The output is as follows:
Success time: 0 memory: 3272 signal:0
0 1 1 15 15
Start at the bottom. Any bricks in the row on top of that one will fall down except where there is already a brick on the bottom. So, the new bottom row is:
bottom_new = bottom_old OR top_old
The new top is:
top_new = bottom_old AND top_old
That is, there will be a brick in the new bottom row if there was a brick in either row, but there's only going to be a brick in the new top row if there was a brick in both rows.
Then you just work your way up the stack, with the new top row becoming the old bottom row for the next step.
The only solution I can think of so far uses nested for loops:
v is the input vector of N integers
D is the number of bits in each integer
c keeps track of the bottom-most free space where a brick can fall
The algorithm checks if the ith bit in the number n is set using (n & (1<<i)), which works in most C-like languages.
The algorithm in C:
for (int j = 0; j < D; ++j) {
    int bit = 1 << j;
    int c = N - 1;              // bottom-most free slot in bit column j
    for (int i = N - 1; i >= 0; --i) {
        if (v[i] & bit) {       // if bit j of number v[i] is set...
            v[i] ^= bit;        // clear bit j in v[i] using XOR
            v[c] ^= bit;        // set bit j in v[c], the lowest free slot, using XOR
            c -= 1;             // the next free slot is one row up
        }
    }
}
If N is small and known in advance, you could work out the truth tables for the values of each digit and get the correct result using only bitwise operations and no loops.
Solution:
So I found a solution which, I guess, needs recursion, though I don't know the condition for stopping the recursion.
The vector v = [3, 7, 8] is so simple that it cannot show why recursion is required, so I am considering a new vector v = [3, 9, 7, 8, 5].
In binary form :
0 0 1 1 - a4
1 0 0 1 - a3
0 1 1 1 - a2
1 0 0 0 - a1
0 1 0 1 - a0
Iteration 1 :
0 0 0 0 - b7 (b7 = a4 AND b5)
0 0 1 1 - b6 (b6 = a4 OR b5)
0 0 0 0 - b5 (b5 = a3 AND b3) ignore this
1 0 0 1 - b4 (b4 = a3 OR b3)
0 0 0 0 - b3 (b3 = a2 AND b1) ignore this
0 1 1 1 - b2 (b2 = a2 OR b1)
0 0 0 0 - b1 (b1 = a0 AND a1) ignore this
1 1 0 1 - b0 (b0 = a0 OR a1)
Intermediate vector = [b7, b6, b4, b2, b0] = [0, 3, 9, 7, 13]
Iteration 2 :
0 0 0 0 - c7 (c7 = b4 AND c5)
0 0 0 1 - c6 (c6 = b4 OR c5)
0 0 0 1 - c5 (c5 = b3 AND c3) ignore this
0 0 1 1 - c4 (c4 = b3 OR c3)
0 0 0 1 - c3 (c3 = b2 AND c1) ignore this
1 1 0 1 - c2 (c2 = b2 OR c1)
0 1 0 1 - c1 (c1 = b0 AND b1) ignore this
1 1 1 1 - c0 (c0 = b0 OR b1)
Intermediate vector = [c7, c6, c4, c2, c0] = [0, 1, 3, 13, 15]
Iteration 3 :
0 0 0 0 - d7 (d7 = c4 AND d5)
0 0 0 1 - d6 (d6 = c4 OR d5)
0 0 0 1 - d5 (d5 = c3 AND d3) ignore this
0 0 0 1 - d4 (d4 = c3 OR d3)
0 0 0 1 - d3 (d3 = c2 AND d1) ignore this
1 1 1 1 - d2 (d2 = c2 OR d1)
1 1 0 1 - d1 (d1 = c0 AND c1) ignore this
1 1 1 1 - d0 (d0 = c0 OR c1)
Resultant vector = [d7, d6, d4, d2, d0] = [0, 1, 1, 15, 15]
I got this solution by going backwards through the vector.
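Each iteration above is one bottom-to-top pass of OR/AND swaps, which is a bubble-sort pass applied to every bit column at once. Read that way, a stopping condition follows: a pass that changes nothing means every column is settled, and N-1 passes are always enough for N rows. Here is a minimal Python sketch under that reading, with vector[0] taken as the top row as in the example; the function name mirrors the question and nothing else is taken from it.

```python
def gravity(vector):
    """Let the 1-bits fall; vector[0] is the top row, vector[-1] the bottom row."""
    rows = list(vector)
    for _ in range(len(rows) - 1):              # at most N-1 passes are ever needed
        changed = False
        for i in range(len(rows) - 1, 0, -1):   # walk from the bottom pair upwards
            lower = rows[i] | rows[i - 1]       # a brick lands below if either row had one
            upper = rows[i] & rows[i - 1]       # a brick stays above only if both rows had one
            if (lower, upper) != (rows[i], rows[i - 1]):
                changed = True
            rows[i], rows[i - 1] = lower, upper
        if not changed:                         # a pass that moves nothing means we are done
            break
    return rows

print(gravity([3, 7, 8]))        # [0, 3, 15]
print(gravity([3, 9, 7, 8, 5]))  # [0, 1, 1, 15, 15]
```

The intermediate states for [3, 9, 7, 8, 5] are [0, 3, 9, 7, 13] and [0, 1, 3, 13, 15], matching iterations 1 and 2 above.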
Another solution:
Construct a multidimensional array with all the bits of all the elements in the vector, i.e. if v = [3,7,8], construct a 3x4 array and store all the bits.
Count the number of 1's in each column and store the count.
Fill each column with count number of 1's starting from the bottom bit.
This approach is simple but requires construction of large matrices.

Update DataTable from another table with LINQ

I have 2 DataTables that look like this:
DataTable 1:
cheie_primara cheie_secundara judet localitate
1 11 A
2 22 B
3 33 C
4 44 D
5 55 A
6 66 B
7 77 C
8 88 D
9 99 A
DataTable 2:
ID_CP BAN JUDET LOCALITATE ADRESA
1 11 A aa random
2 22 B ss random
3 33 C ee random
4 44 D xx random
5 55 A rr random
6 66 B aa random
7 77 C ss random
8 88 D ee random
9 99 A xx random
and I want to update DataTable 1 with the field["LOCALITATE"] using the matching keys DataTable1["cheie_primara"] and DataTable2["ID_CP"].
Like this:
cheie_primara cheie_secundara judet localitate
1 11 A aa
2 22 B ss
3 33 C ee
4 44 D xx
5 55 A rr
6 66 B aa
7 77 C ss
8 88 D ee
9 99 A xx
Is there a LINQ method to update DataTable1?
Thanks!
This is working:
DataTable1.AsEnumerable()
.Join( DataTable2.AsEnumerable(),
dt1_Row => dt1_Row.ItemArray[0],
dt2_Row => dt2_Row.ItemArray[0],
(dt1_Row, dt2_Row) => new { dt1_Row, dt2_Row })
.ToList()
.ForEach(o =>
o.dt1_Row.SetField(3, o.dt2_Row.ItemArray[3]));
If you want to use LINQ query syntax, here's how I'd go about it (this assumes strongly typed DataTables, where the columns are exposed as properties):
var a = (from d1 in DataTable1
join d2 in DataTable2 on d1.cheie_primara equals d2.ID_CP
select new {d1, d2.LOCALITATE}).ToList();
a.ForEach(b => b.d1.localitate = b.LOCALITATE);

Resources