Find the closest values: Multiple columns conditions - bash

Following my first question here, I want to extend the condition: find the closest values between two different files on the first and second columns, and print specific columns.
File1
1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1
File2
1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j
So, for example, for each $1 in file 1 I need to find the closest value of $1 in file 2, but then also search for the closest value of $2 among those.
Output:
1 2 a1*
1 2 b*
1 4 b1
1 4 d
8 5 c1
9 5 g
* The first line comes from file 1 and the second from file 2: for the 1st column (of file 1) the closest value (in the 1st column of file 2) is 1, and the 2nd condition is that it must also be the closest value for the second column, which in this case is 2. I print $1,$2,$5 from file 1 and $1,$2,$4 from file 2.
The same procedure applies to the other output lines.
The solution to find the closest value is in my other post and was given by @Tensibai.
But any solution will work.
Thanks!

Sounds a little convoluted but works:
function closest(array, searched) {
    distance = 999999 # this should be higher than any possible distance, to avoid returning null
    split(searched, skeys, OFS)
    # Get the first part of the key
    for (x in array) { # loop over the array to get its keys
        split(x, mkeys, OFS) # split the array key
        tmp = (mkeys[1]+0 > skeys[1]+0) ? mkeys[1] - skeys[1] : skeys[1] - mkeys[1] # +0 to force numeric comparison; the ternary computes the absolute difference between the key and the target
        if (tmp < distance) { # if the distance is less than the preceding one, update
            distance = tmp
            found1 = mkeys[1] # and save the key actually found closest
        }
    }
    # At this point we have the first part of the key; redo the work for the second part
    distance = 999999
    for (x in array) {
        split(x, mkeys, OFS)
        if (mkeys[1] == found1) { # filter on the first part of the key
            tmp = (mkeys[2]+0 > skeys[2]+0) ? mkeys[2] - skeys[2] : skeys[2] - mkeys[2] # +0 to force numeric comparison; absolute difference again
            if (tmp < distance) { # if the distance is less than the preceding one, update
                distance = tmp
                found2 = mkeys[2] # and save the key actually found closest
            }
        }
    }
    # Now we have the second field too
    return (found1 OFS found2) # return the combined key from our two searches
}
{
    if (NR > FNR) { # if we changed file (FNR, the per-file record number, is less than NR, the overall record number)
        b[($1 OFS $2)] = $4 # make an array with "$1 OFS $2" as key and $4 as value
    } else {
        key = ($1 OFS $2) # build the key once to avoid recomputing it later
        akeys[max++] = key # store the array keys to preserve order at the end, as for (x in array) does not guarantee the order
        a[key] = $5 # make an array with the key stored previously and $5 as value
    }
}
END { # now that we have parsed the two files, print the result
    for (i = 0; i < max; i++) { # loop over the keys by numeric index, keeping the input order
        print akeys[i], a[akeys[i]] # print the value for the first array (key then value)
        if (akeys[i] in b) { # if the same key exists in the second file
            print akeys[i], b[akeys[i]] # then print it
        } else {
            bindex = closest(b, akeys[i]) # call the function to find the closest key in the second file
            print bindex, b[bindex] # print what we found
        }
    }
}
Note I'm using OFS to combine the fields, so if you change it for output the script will behave consistently.
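Since the keys are built and printed with OFS throughout, a run with a different separator only needs a -v assignment on the command line, for example (a sketch, using the closest2.awk name from the run below):
$ mawk -v OFS=',' -f closest2.awk f1 f2
would build the keys and print the results comma-separated, e.g. 1,2,a1.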
WARNING: This should do for relatively short files, but since the array from the second file is traversed twice, each search takes twice as long. END OF WARNING
There's room for a better search algorithm if your files are sorted (but that was not the case in the previous question, and you wished to keep the order from the file). A first improvement in that case: break out of the for loop as soon as the distance starts to grow again.
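For sorted input that early break could look like the following minimal sketch (not part of the script above; closest_sorted, bkeys and bmax are hypothetical names, assuming the first-column keys of file 2 were collected in ascending order into a numeric-indexed array bkeys[1..bmax]):
function closest_sorted(bkeys, bmax, target,    i, d, best, bestkey) {
    best = 999999              # higher than any possible distance
    for (i = 1; i <= bmax; i++) {
        d = bkeys[i] - target
        if (d < 0) d = -d      # absolute distance to the target
        if (d < best) {        # still getting closer: remember this key
            best = d
            bestkey = bkeys[i]
        } else if (d > best) {
            break              # keys are sorted, so from here the distance only grows
        }
    }
    return bestkey
}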
Output from your sample files:
$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g

Related

How to extract vectors from a given condition matrix in Octave

I have a matrix with two columns. The first column is the data that I want to group into vectors, while the second column is the group information.
A =
1 1
2 1
7 2
9 2
7 3
10 3
13 3
1 4
5 4
17 4
1 5
6 5
The results that I seek are:
A1 =
1
2
A2 =
7
9
A3 =
7
10
13
A4=
1
5
17
A5 =
1
6
As an illustration, I used the eval function, but it didn't give the results I wanted.
Assuming that you don't actually need individually named separate variables, the following will put the values into separate cells of a cell array, each of which can be an arbitrary size and which can then be retrieved using cell index syntax. It makes use of logical indexing, so that each iteration of the for loop assigns to that cell in B just the values from the first column of A that have the correct number in the second column of A.
num_cells = max (A(:,2));
B = cell (num_cells,1);
for idx = 1:num_cells
  B{idx} = A((A(:,2)==idx),1);  % curly braces assign the contents of the cell, not a sub-cell
end
B =
{
[1,1] =
1
2
[2,1] =
7
9
[3,1] =
7
10
13
[4,1] =
1
5
17
[5,1] =
1
6
}
Cell arrays are accessed a bit differently than normal numeric arrays. Array indexing (with ()) will return another cell, e.g.:
>> B(1)
ans =
{
[1,1] =
1
2
}
To get the contents of the cell so that you can work with them like any other variable, index them using {}.
>> B{1}
ans =
1
2
How it works:
Use max(A(:,2)) to find out how many array elements are going to be needed. A(:,2) uses subscript notation to indicate every value of A in column 2.
Create an empty cell array B with the right number of cells to contain the separated parts of A. This isn't strictly necessary, but with large amounts of data, things can slow down a lot if you keep adding on to the end of an array. Pre-allocating is usually better.
For each iteration of the for loop, it determines which elements in the 2nd column of A have the value matching the value of idx. This returns a logical array. For example, for the third time through the for loop, idx = 3, and:
>> A_index3 = A(:,2)==3
A_index3 =
0
0
0
0
1
1
1
0
0
0
0
0
That is a logical array of trues/falses indicating which elements equal 3. You are allowed to mix logical and subscript indexing. So using this we can retrieve just those values from the first column:
A(A_index3, 1)
ans =
7
10
13
We get the same result if we do it in a single line, without the A_index3 intermediate placeholder:
>> A(A(:,2)==3, 1)
ans =
7
10
13
Putting this in a for loop, with 3 replaced by the loop variable idx and the answer assigned to the idx-th cell of B, we get all of the values separated into different cells.

Can AWK array be used to get largest cluster in correlation matrix?

I have a matrix that describes correlation between items A-K, where 1=correlated and 0=uncorrelated.
Is there an easy way to extract the largest cluster from the data? In other words, the cluster with the most correlated elements. Below is some sample data:
# A B C D E F G H I J K
A 1 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 0 1 0 0 0 1 1
E 1 1 1 0 1 0 0 1 1 0 0
F 1 1 1 1 0 1 0 0 1 0 1
G 1 1 1 0 0 0 1 0 0 0 0
H 1 1 1 0 1 0 0 1 1 1 1
I 1 1 1 0 1 1 0 1 1 0 0
J 1 1 1 1 0 0 0 1 0 1 0
K 1 1 1 1 0 1 0 1 0 0 1
Swapping a few columns/rows by eye, the expected result would be the top left of the matrix, which is a cluster of size 6 that contains: {A, B, C, D, F, K}
I know awk isn't the most user-friendly for this application, but I'm keen on using awk since this will integrate into a larger awk script. That being said, I'm not completely immovable on the language.
Not sure where to start, but here's a more complex version of what I'm thinking, in Python:
https://stats.stackexchange.com/questions/138325/clustering-a-correlation-matrix
Assumptions:
all matrices are symmetric (ie, square; equal to its transpose; matrix[x,y]=matrix[y,x])
matrix[x,x]=1 for all x
all matrix entries are 0 or 1
not interested in 1-element clusters
not interested in permutations of the same cluster (ie, A,B is the same as B,A)
since we don't have to worry about permutations, we can process elements in the order in which they show up in the matrix (eg, we process A,B,C and ignore the equivalents A,C,B, B,A,C, B,C,A, C,A,B and C,B,A); this lets us focus on just the top/right half of the matrix (above the identity/diagonal), working from left to right, which greatly reduces the number of combinations we need to evaluate
as demonstrated in the question, elements that make up a cluster can be shifted up/left in the matrix so as to fill the top/left of the matrix with 1's (this comes into play during processing where for each new element we merely need to test the equivalent of the new column/row added to this top/left portion of the matrix)
Regarding the last assumption ... assume we have cluster A,D and we now want to test A,D,F; we just need to test the new column/row entries (?):
Current Cluster New Cluster ?
A D A D F
A 1 1 A 1 1 ? # if matrix is symmetric then only need to test
D 1 1 D 1 1 ? # the new column *OR* the new row, not both;
F ? ? 1 # bottom/right == 1 == matrix[F,F] per earlier assumption
One idea using a recursive function and two GNU awk features: a) arrays of arrays (aka multi-dimensional arrays) and b) PROCINFO["sorted_in"] for custom sorting of clusters to stdout:
awk '
######
# load matrix into memory
FNR==1 { n=(NF-1) # number of elements to be processed
for (i=2;i<=NF;i++)
label[i-1]=$i # save labels
next
}
{ for (i=2;i<=NF;i++)
m[FNR-1][i-1]=$i # populate matrix array m[row#][column#]
}
######
# define our recursive function
function find_cluster(cluster, i, clstrcount, stackseq, j, k, corrcount) {
# cluster : current working cluster (eg, "A,B,C")
# i : index of latest element (eg, for "A,B,C" the latest element is "C", so i = 3)
# clstrcount : number of elements in current cluster
# stackseq : sequence number of stack[] array
# : stack[] contains list of indexes for current cluster (for "A,B,C" stack = "1,2,3")
# j,k,corrcount : declaring additional variables as "local" to this invocation of the function
clstrcount++ # number of elements to be processed at this call/level
for (j=i+1;j<=n;j++) { # process all elements/indexes greater than i
corrcount=1 # reset correlation count; always start with 1 since m[j][j]=1
# check the new column/row added to the top/left of the matrix to see if it extends the current cluster (ie, all entries are "1")
for (k in stack) { # loop through element/indexes in stack
if (m[stack[k]][j]) # check column entries
corrcount++
if (m[j][stack[k]]) # check row entries; not necessary if matrix is symmetric but we will add it here to show the m[][] references
corrcount++
}
if (corrcount == (stackseq*2 +1) ) { # if we have all "1"s we have a new cluster of size clstrcount
stack[++stackseq]=j # "push" current element/index on stack; increment stack seq/index
cluster=cluster "," label[j] # add current element/label to cluster
max= (clstrcount>max) ? clstrcount : max # new max(cluster count) ?
clusters[clstrcount][++clstrseq]=cluster # add new cluster to our master list: clusters[cluster_count][seq]
find_cluster(cluster, j, clstrcount, stackseq) # recursive call to check for next element(s)
delete stack[stackseq--] # back from recursive call so "pop" current element (j) from stack
gsub(/[,][^,]+$/,"",cluster) # remove current element/label from cluster to make way for next element/label to be tested
}
}
}
######
# start looking for clusters of size 2+
END { max=2 # not interested in clusters of "1"
clstrseq=0 # init the clusters[...][seq] sequence seed once; it must not be reset per element or earlier clusters would be overwritten
for (i=1;i<n;i++) { # loop through list of elements
clstrcount=1 # init cluster count = 1
cluster=label[i] # reset cluster to current element/label
stackseq=1 # reset stack[seq] sequence seed
stack[stackseq]=i # "push" current element on stack
find_cluster(cluster, i, clstrcount, stackseq) # start recursive calls looking for next element in cluster
}
######
# for now just display clusters with size > 2; adjust this next line to add/remove cluster sizes from stdout
if (max>2) # print list of clusters with length > 2
for (i=max;i>2;i--) { # print from largest to smallest and ...
PROCINFO["sorted_in"]="#val_str_asc" # in alphabetical order
printf "####### clusters of size %s:\n", i
for (j in clusters[i]) # loop through all entries for clusters of size "i"
print clusters[i][j]
}
}
' matrix.dat
NOTE: The current version is (admittedly) a bit verbose ... the result of jotting down a first-pass solution as I was working through the details; with some further analysis it may be possible to reduce the code; having said that, the time it takes to find all 2+ sized clusters in this 11-element matrix isn't too bad:
real 0m0.084s
user 0m0.031s
sys 0m0.046s
This generates:
####### clusters of size 6:
A,B,C,D,F,K
A,B,C,E,H,I
####### clusters of size 5:
A,B,C,D,F
A,B,C,D,J
A,B,C,D,K
A,B,C,E,H
A,B,C,E,I
A,B,C,F,I
A,B,C,F,K
A,B,C,H,I
A,B,C,H,J
A,B,C,H,K
A,B,D,F,K
A,B,E,H,I
A,C,D,F,K
A,C,E,H,I
B,C,D,F,K
B,C,E,H,I
####### clusters of size 4:
A,B,C,D
A,B,C,E
A,B,C,F
A,B,C,G
A,B,C,H
A,B,C,I
A,B,C,J
A,B,C,K
A,B,D,F
A,B,D,J
A,B,D,K
A,B,E,H
A,B,E,I
A,B,F,I
A,B,F,K
A,B,H,I
A,B,H,J
A,B,H,K
A,C,D,F
A,C,D,J
A,C,D,K
A,C,E,H
A,C,E,I
A,C,F,I
A,C,F,K
A,C,H,I
A,C,H,J
A,C,H,K
A,D,F,K
A,E,H,I
B,C,D,F
B,C,D,J
B,C,D,K
B,C,E,H
B,C,E,I
B,C,F,I
B,C,F,K
B,C,H,I
B,C,H,J
B,C,H,K
B,D,F,K
B,E,H,I
C,D,F,K
C,E,H,I
####### clusters of size 3:
A,B,C
A,B,D
A,B,E
A,B,F
A,B,G
A,B,H
A,B,I
A,B,J
A,B,K
A,C,D
A,C,E
A,C,F
A,C,G
A,C,H
A,C,I
A,C,J
A,C,K
A,D,F
A,D,J
A,D,K
A,E,H
A,E,I
A,F,I
A,F,K
A,H,I
A,H,J
A,H,K
B,C,D
B,C,E
B,C,F
B,C,G
B,C,H
B,C,I
B,C,J
B,C,K
B,D,F
B,D,J
B,D,K
B,E,H
B,E,I
B,F,I
B,F,K
B,H,I
B,H,J
B,H,K
C,D,F
C,D,J
C,D,K
C,E,H
C,E,I
C,F,I
C,F,K
C,H,I
C,H,J
C,H,K
D,F,K
E,H,I

Find mean and maximum in 2nd column for a selection in 1st column

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum values in the 2nd column for some selection in the 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly, if I choose a selection range of 3 numbers, then the desired output will be
ofile.dat
1-3 40 19.37
4-6 50 18.7
I have the following script, which does the calculation for single values in the 1st column. But I am looking for multiple selections from the 1st column.
awk '{
    if (a[$1] < $2) { a[$1]=$2 }   # track the maximum of $2 per $1
    b[$1]+=$2; c[$1]++             # track the sum and count per $1
}
END {
    for (i in b)
        printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i]
}' ifile.dat
The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...
The following awk script will provide basic descriptive statistics with grouping.
I'd suggest looking into a more robust solution (Python, Perl, R, ...) that supports additional measures and more flexibility; there's no point reinventing the wheel.
The grouping logic is 1-1000, 1001-2000, ..., as per the comment above. The code is verbose for clarity.
awk '
{
    # Total counter
    nn++
    # Group id
    gsize = 1000
    gid = int(($1-1)/gsize)
    v = $2
    # Set up a new group, if needed
    if ( !n[gid] ) {
        n[gid] = 0
        sum[gid] = 0
        max[gid] = min[gid] = v
        name[gid] = (gid*gsize + 1) "-" ((gid+1)*gsize)
    }
    if ( v > max[gid] ) max[gid] = v
    sum[gid] += v
    n[gid]++
}
END {
    # Print all groups
    for (gid in name) {
        printf "%-20s %4d %6.1f %5.1f\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn
    }
}
' ifile.dat
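A quick check of the grouping arithmetic: for $1 = 1000, gid = int((1000-1)/1000) = 0, so the row falls into group 1-1000; for $1 = 1001, gid = int((1001-1)/1000) = 1, i.e. group 1001-2000.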
Could you please try the following, tested and written with the shown samples only.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
    c[++count]=$1               # remember each distinct $1, in sorted order
}
{
    a[$1]=(a[$1]>$2?a[$1]:$2)   # maximum of $2 per $1
    d[$1]+=$2                   # sum of $2 per $1
    e[$1]++                     # count of rows per $1
    till=$1                     # last (largest) $1 seen
}
END{
    for(i=1;i<=till;i+=(range+1)){
        for(j=i;j<=i+range;j++){
            max=max>a[c[j]]?max:a[c[j]]
            total+=d[c[j]]
            occr+=e[c[j]]
        }
        print i"-"i+range,max,occr?total/occr:0
        occr=total=max=""
    }
}
'
For the shown samples, the output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept the range variable at 1 since each group here covers 2 consecutive values in the 1st column; in your case, with groups 1-1000, 1001-2000 and so on, set the range variable to 999 instead.
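To see why 999 works: in the END block, i starts at 1 and advances by range+1 each pass, so the labels printed by i"-"i+range become 1-1000, then 1001-2000, and so on.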

How to separate lines depending on the value in column 1

I have a text file that contains the following (a, b, c, d, etc. stand for some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to separate lines with some characters depending on the content of the first field, knowing that those numbers will always be increasing, but may vary as well? The separation would occur when the first field increments, going from 1 to 2, then 2 to 6, etc.
The output would be like this (here I would like to use ---------- as a separation):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.
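For example, with the sample data saved to a file (the name input.txt is just for illustration):
$ awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }' input.txt
This prints the lines unchanged, inserting the dashed separator each time the value in the first column changes.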

Bash/Nawk whitespace problems

I have 100 datafiles, each with 1000 rows, and they all look something like this:
0 0 0 0
1 0 1 0
2 0 1 -1
3 0 1 -2
4 1 1 -2
5 1 1 -3
6 1 0 -3
7 2 0 -3
8 2 0 -4
9 3 0 -4
10 4 0 -4
.
.
.
999 1 47 -21
1000 2 47 -21
I have developed a script which is supposed to take the square of each value in columns 2, 3 and 4, then sum them and take the square root.
Like so:
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
It then calculates the square of that value, and averages these numbers over all data files to output the average "calc" for each row and the average "fluc" for each row.
The meaning of these numbers is this:
The first number is the step number, the next three are coordinates on the x, y and z axis respectively. I am trying to find the distance the "steps" have taken me from the origin, this is calculated with the formula r = sqrt(x^2 + y^2 + z^2). Next I need the fluctuation of r, which is calculated as f = r^4 or f = (r^2)^2.
These must be averages over the 100 data files, which leads me to:
r = r + sqrt(x^2 + y^2 + z^2)
avg = r/s
and similarly for f, where s is the number of data files read, which I figure out using sum=$(ls -l *.data | wc -l).
Finally, my last calculation is the deviation between the expected r and the average r, which is calculated as stddev = sqrt(fluc - (r^2)^2) outside of the loop using final values.
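(As a concrete check with the row 4 1 1 -2: r = sqrt(1^2 + 1^2 + (-2)^2) = sqrt(6) ≈ 2.449, and f = (r^2)^2 = 36.)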
The script I created is:
#!/bin/bash
sum=$(ls -l *.data | wc -l)
paste -d"\t" *.data | nawk -v s="$sum" '{
for(i=0;i<=s-1;i++)
{
t1 = 2+(i*4)
t2 = 3+(i*4)
t3 = 4+(i*4)
temp = ($t1*$t1) + ($t2*$t2) + ($t3*$t3)
calc = $calc + sqrt ($temp)
fluc = $fluc + ($calc*$calc)
}
stddev = sqrt(($calc^2) - ($fluc))
print $1" "calc/s" "fluc/s" "stddev
temp=0
calc=0
stddev=0
}'
Unfortunately, part way through I receive an error:
nawk: cmd. line:9: (FILENAME=- FNR=3) fatal: attempt to access field -1
I am not experienced enough with awk to be able to figure out exactly where I am going wrong, could someone point me in the right direction or give me a better script?
The expected output is one file with:
0 0 0 0
1 (calc for all 1's) (fluc for all 1's) (stddev for all 1's)
2 (calc for all 2's) (fluc for all 2's) (stddev for all 2's)
.
.
.
The following script should do what you want. The only thing that might not work yet is the choice of delimiters. In your original script you seem to have tabs. My solution assumes spaces. But changing that should not be a problem.
It simply pipes the data of all files sequentially into nawk without counting the files first; as I understand it, that count is not required. Instead of trying to keep track of positions in the file, it uses arrays to store separate statistical data for each step. At the end it iterates over all step indexes found and outputs them. Since the iteration is not sorted, there is another pipe into a Unix sort call which handles that.
#!/bin/bash
# pipe the data of all files into the nawk processor
cat *.data | nawk '
BEGIN {
FS=" " # set the delimiter for the columns
}
{
step = $1 # step is in column 1
temp = $2*$2 + $3*$3 + $4*$4
# use arrays indexed by step to store data
calc[step] = calc[step] + sqrt (temp)
fluc[step] = fluc[step] + calc[step]*calc[step]
count[step] = count[step] + 1 # count the number of samples seen for a step
}
END {
# iterate over all existing steps (this is not sorted!)
for (i in count) {
stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
}
}' | sort -n -k 1 # that's why we sort here: first column "-k 1" and numerically "-n"
EDIT
As suggested by @edmorton, awk can take care of loading the files itself. The following enhanced version removes the call to cat and instead passes the file pattern as a parameter to nawk. Also, as suggested by @NictraSavios, the new version introduces special handling for the output of the statistics of the last step. Note that the statistics are still gathered for all steps. It's a little difficult to suppress this while reading the data, since at that point we don't know yet which step will be the last. Although that could be done with some extra effort, you would probably lose a lot of robustness in your data handling, since right now the script does not make any assumptions about:
the number of files provided,
the order of the files processed,
the number of steps in each file,
the order of the steps in a file,
the completeness of steps as a range without "holes".
Enhanced script:
#!/bin/bash
nawk '
BEGIN {
FS=" " # set the delimiter for the columns (not really required for space which is the default)
maxstep = -1
}
{
step = $1 # step is in column 1
temp = $2*$2 + $3*$3 + $4*$4
# remember maximum step for selected output
if (step > maxstep)
maxstep = step
# use arrays indexed by step to store data
calc[step] = calc[step] + sqrt (temp)
fluc[step] = fluc[step] + calc[step]*calc[step]
count[step] = count[step] + 1 # count the number of samples seen for a step
}
END {
# iterate over all existing steps (this is not sorted!)
for (i in count) {
stddev = sqrt((calc[i] * calc[i]) + (fluc[i] * fluc[i]))
if (i == maxstep)
# handle the last step in a special way
print i" "calc[i]/count[i]" "fluc[i]/count[i]" "stddev
else
# this is the normal handling
print i" "calc[i]/count[i]
}
}' *.data | sort -n -k 1 # that's why we sort here: first column "-k 1" and numerically "-n"
You could also use:
awk -f c.awk *.data
where c.awk is
{
    j=FNR                            # row number within the current file
    temp=$2*$2+$3*$3+$4*$4
    calc[j]=calc[j]+sqrt(temp)       # accumulate r = sqrt(x^2+y^2+z^2) per row, across files
    fluc[j]=fluc[j]+calc[j]*calc[j]
}
END {
    N=ARGIND                         # number of files processed (GNU awk specific)
    for (i=1; i<=FNR; i++) {
        stdev=sqrt(fluc[i]-calc[i]*calc[i])
        print i-1,calc[i]/N,fluc[i]/N,stdev
    }
}
