awk: math operations of multi-column data in multiple CSV files - bash

I am working on a bash script that loops over multi-column data files and executes embedded AWK code to operate on the multi-column data.
#!/bin/bash
home="$PWD"
# folder with the outputs
rescore="${home}"/rescore
# folder with the folders to analyse
storage="${home}"/results
while read -r d; do
awk -F ", *" ' # set field separator to comma, followed by 0 or more whitespaces
FNR==1 {
if (n) { # calculate the results of previous file
f= # apply this equation to rescore data using values of $3 and $2
f[suffix] = f # store the results in the array
n=$1 # take ID of the column
}
prefix=suffix=FILENAME
sub(/_.*/, "", prefix)
sub(/\/[^\/]+$/, "", suffix)
sub(/^.*_/, "", suffix)
n = 1 # count of samples
min = 0 # lowest value of $3 (assuming all $3 < 0)
}
FNR > 1 {
s += $3
s2 += $3 * $3
++n
if ($3 < min) min = $3 # update the lowest value
}
print "ID" prefix, rescoring
for (i in n)
printf "%s %.2f\n", i, f[i]
}' "${d}_"*/input.csv > "${rescore}/"${d%%_*}".csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')
Briefly, the workflow should process each line of the input.csv located inside the ${d} folder, which has been correctly identified by my bash script:
# input.csv located in the folder 10V1_cne_lig12
ID, POP, dG
1, 142, -5.6500 # this is dG(min)
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
My AWK script is expected to process each line of each CSV file and reduce it to two output columns, keeping in the output: i) the number from the first column of input.csv (the ID of the processed line) plus the name of the folder ($d) containing the CSV file, and ii) the result of the math operation (f) applied to the POP and dG columns of input.csv:
f(ID) = sqrt(((dG(ID)+10)/10)^2 + ((POP(ID)-240)/240)^2)
where dG(ID) is the value of dG ($3) in the "rescored" line of input.csv, and POP(ID) is its POP value ($2). Eventually, the output.csv containing the information for one input.csv should be in the following format:
# output.csv
ID, rescore value
1 10V1_cne_lig12, f(ID1)
2 10V1_cne_lig12, f(ID2)
3 10V1_cne_lig12, f(ID3)
4 10V1_cne_lig12, f(ID4)
While the bash part of my code (which loops over the CSVs in the distinct directories) works correctly, I am stuck with the AWK code, which does not correctly assign the ID of each line, so I cannot apply the math operation shown above to the $2 and $3 columns of the line with a given ID.

given the input file: folder/file
ID, POP, dG
1, 142, -5.6500
2, 10, -5.5000
3, 2, -4.9500
4, 150, -4.1200
this script
$ awk -F', *' -v OFS=', ' '
FNR==1 {path=FILENAME; sub(/\/[^/]+$/,"",path); print $1,"rescore value"; next}
{print $1" "path, sqrt((($3+10)/10)^2+(($2-240)/240)^2)}' folder/file
will produce
ID, rescore value
1 folder, 0.596625
2 folder, 1.05873
3 folder, 1.11285
4 folder, 0.697402
Not sure what the rest of your code does, but I guess you can integrate it in.
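In case it helps to put it back together: a sketch of how that snippet might slot into the bash loop from the question (untested; it assumes the same directory layout, find pipeline and rescore variable as the question's script):
while read -r d; do
    awk -F', *' -v OFS=', ' '
        FNR==1 { path=FILENAME; sub(/\/[^/]+$/, "", path); next }    # folder of the current CSV; skip its header line
        !hdr { print "ID", "rescore value"; hdr=1 }                   # print the header once per output file
        { print $1" "path, sqrt((($3+10)/10)^2 + (($2-240)/240)^2) }  # f(ID) from the question
    ' "${d}_"*/input.csv > "${rescore}/${d%%_*}.csv"
done < <(find . -maxdepth 1 -type d -name '*_*_*' | awk -F '[_/]' '!seen[$2]++ {print $2}')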

Related

AWK for loop with two input files

I have the following files:
A-111.txt
A-311.txt
B-111.txt
B-311.txt
C-111.txt
C-312.txt
D-112.txt
D-311.txt
I want to merge lines of files with the same basename (same letter before the dash) if there is a match in column 4. I have many files, so I want to do it in a loop.
So far I have this:
for f1 in *-1**.txt; do f2="${f1/-1/-3}"; awk -F"\t" 'BEGIN { OFS=FS } NR==FNR { a[$4,$4]=$0 ; next } ($4,$4) in a { print a[$4,$4],$0 }' "$f1" "$f2" > $f1_merged.txt; done
It works for the A and B files as intended, but not for the C and D files.
Can someone help me improve the code, please?
EDIT - here's the above code formatted legibly:
for f1 in *-1**.txt; do
f2="${f1/-1/-3}"
awk -F"\t" '
BEGIN {
OFS = FS
}
NR == FNR {
a[$4, $4] = $0
next
}
($4, $4) in a {
print a[$4, $4], $0
}
' "$f1" "$f2" > $f1_merged.txt
done
EDIT - after Ed Morton kindly formatted my code, the error is:
awk: cmd. line:7: fatal: cannot open file 'C-311.txt' for reading (No such file or directory)
awk: cmd. line:7: fatal: cannot open file 'D-312.txt' for reading (No such file or directory)
EDIT - all lines, not only the first one, should be compared
Input file A-111.txt
ID      Chr  bp        db_SNP      REF  ALT
A-111   1    4367323   rs1490413   G    A
A-111   1    12070292  rs730123    G    A
A-111   22   47836412  rs2040411   G    A
A-111   22   49876931  rs4605480   T    C
Input file A-311.txt
ID      Chr  bp        db_SNP      REF  ALT
A-311   Y    17053771  rs17269816  C    T
A-311   Y    22665262  rs2196155   A    G
A-311   1    4367323   rs1490413   G    A
A-311   1    12070292  rs730123    G    A
Desired output file
ID      Chr  bp        db_SNP      REF  ALT  ID      Chr  bp        db_SNP      REF  ALT
A-111   1    4367323   rs1490413   G    A    A-311   1    4367323   rs1490413   G    A
A-111   1    12070292  rs730123    G    A    A-311   1    12070292  rs730123    G    A
Would you please try the following:
#!/bin/bash
prefix="ref_" # prefix to declare array variable names
declare -A bases # array to count files for the basename
for f in *-[0-9]*.txt; do # loop over the target files
base=${f%%-*} # extract the basename
declare -n ref="$prefix$base" # indirect reference to an array named "$base"
ref+=("$f") # create a list of filenames for the basename
(( bases[$base]++ )) # count the number of files for the basename
done
for base in "${!bases[#]}"; do # loop over the basenames
if (( ${bases[$base]} == 2 )); then # check if the number of files are two
declare -n ref="$prefix$base" # indirect reference
awk -F'\t' -v OFS='\t' '
NR==FNR { # read 1st file
f0[$4] = $0 # store the record keyed by $4
next
}
$4 in f0 { # read 2nd file and check if f0[$4] is defined
print f0[$4], $0 # if match, merge the records and print
}' "${ref[0]}" "${ref[1]}" > "${base}_merged.txt"
fi
done
First extract the basenames such as "A", "B", ..., then create a list
of associated filenames. For instance, the array "A" will be assigned
('A-111.txt' 'A-311.txt'). At the same time, the array bases counts
the files for each basename.
Then loop over the basenames, make sure the number of associated files
is two, and compare the 4th columns of the files. If they match, concatenate
the records to generate a new file.
The awk script searches the 4th field across the lines; on a match, it pastes together the corresponding lines of the two files.
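For comparison, a smaller change to the original loop would also avoid the "cannot open file" errors: instead of deriving the second filename with ${f1/-1/-3} (which assumes the -3 file carries the same trailing digits, and so produces the non-existent names C-311.txt and D-312.txt), let the shell glob for it. A rough sketch, assuming each basename has at most one -3xx.txt partner:
for f1 in *-1*.txt; do
    base=${f1%%-*}                    # e.g. "C" from "C-111.txt"
    f2=( "$base"-3*.txt )             # glob instead of assuming identical suffix digits
    [[ -e ${f2[0]} ]] || continue     # skip basenames that have no -3xx file
    awk -F'\t' -v OFS='\t' '
        NR == FNR { a[$4] = $0; next }   # store lines of the first file keyed by column 4
        $4 in a { print a[$4], $0 }      # merge on a matching column 4
    ' "$f1" "${f2[0]}" > "${base}_merged.txt"
done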

Loop to create a DF from values in bash

I'm creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all values between two values:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
And create files from 6 to 10, from 11 to 15, ... from 95 to 100. I was thinking of creating a loop for this with something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can I convert this into a loop and create files with these specific values?
While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, ' NR > 1 {
i=int((($6 - 1) / 5))
fname=(i*5) "_" (i+1)*5 ".txt"
print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" by subtracting one from the value in column six and dividing by five. The filename is then constructed by multiplying that index (and its increment) by five. The whole line is then printed to that file.
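For example, a value of 19.99 in column six gives i = int((19.99 - 1) / 5) = 3, so that line is written to 15_20.txt.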
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for((i=0; i <= 19; i++))
do
floor=$((i * 5))
ceiling=$(( (i+1) * 5))
awk -F, -v floor="$floor" -v ceiling="$ceiling" \
'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
> "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file.
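For example, on the second iteration (i=1), floor is 5 and ceiling is 10, so lines whose sixth column is at least 5 and below 10 end up in 5_10.txt.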

Merging two files column and row-wise in bash

I would like to merge two files column- and row-wise, but am having difficulty doing so with bash. Here is what I would like to do.
File1:
1 2 3
4 5 6
7 8 9
File2:
2 3 4
5 6 7
8 9 1
Expected output file:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1
This is just an example. The actual files are two 1000x1000 data matrices.
Any thoughts on how to do this? Thanks!
Or use paste + awk
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }'
Note that this script adds a trailing space after the last value. This can be avoided with a more complicated awk script or by piping the output through an additional command, e.g.
paste file1 file2 | awk '{ n=NF/2; for(i=1; i<=n; i++) printf "%s/%s ", $i, $(i+n); printf "\n"; }' | sed 's/ $//'
An awk solution without the additional sed. Thanks to Jonathan Leffler. (I knew it was possible but was too lazy to think about it.)
awk '{ n=NF/2; pad=""; for(i=1; i<=n; i++) { printf "%s%s/%s", pad, $i, $(i+n); pad=" "; } printf "\n"; }'
paste + perl version that works with an arbitrary number of columns without having to hold an entire file in memory:
paste file1.txt file2.txt | perl -MList::MoreUtils=pairwise -lane '
my @a = @F[0 .. (@F/2 - 1)]; # The values from file1
my @b = @F[(@F/2) .. $#F]; # The values from file2
print join(" ", pairwise { "$a/$b" } @a, @b); # Merge them together again'
It uses the non-standard but useful List::MoreUtils module; install through your OS package manager or favorite CPAN client.
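For example (package names vary by distribution), something like:
cpan List::MoreUtils                          # via the CPAN client
sudo apt-get install liblist-moreutils-perl   # Debian/Ubuntu package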
Assumptions:
no blank lines in files
both files have the same number of rows
both files have the same number of fields
no idea how many rows and/or fields we'll have to deal with
One awk solution:
awk '
# first file (FNR==NR):
FNR==NR { for ( i=1 ; i<=NF ; i++) # loop through fields
{ line[FNR,i]=$(i) } # store field in array; array index = row number (FNR) + field number (i)
next # skip to next line in file
}
# second file:
{ pfx="" # init printf prefix as empty string
for ( i=1 ; i<=NF ; i++) # loop through fields
{ printf "%s%s/%s", # print our results:
pfx, line[FNR,i], $(i) # prefix, corresponding field from file #1, "/", current field
pfx=" " # prefix for rest of fields in this line is a space
}
printf "\n" # append linefeed on end of current line
}
' file1 file2
NOTES:
remove comments to declutter code
memory usage will climb as the size of the matrix increases (probably not an issue for the smallish fields and the OP's comment about a 1000 x 1000 matrix)
The above generates:
1/2 2/3 3/4
4/5 5/6 6/7
7/8 8/9 9/1

How to merge two .txt files into one, matching each line by timestamp, using awk

Summary:
I currently have two .txt files imported from a survey system I am testing. Column 1 of each data file is a timestamp of the format "HHMMSS.SSSSSS". In file1, there is a second column of field intensity readings. In file2, there are two additional columns of positional information. I'm attempting to write a script that matches data points between these files by lining the timestamps up. The issue is that at no point are any of the timestamps the exact same value. The script must be able to match data points (lines in each .txt file) based on the timestamp of its closest counterpart in the other file (i.e. the time 125051.354948 from file1 should "match" the nearest timestamp in file2, which is 125051.112784).
If anyone with a little bit more awk/sed/join/regex/Unix knowledge could point me in the right direction, I would be very appreciative.
What I have so far:
(Please note that the exact syntax shown here may not make sense for the sample .txt files attached in this question; more extensive versions of these files, with more columns, were being used for testing scripts.)
I'm new to awk/Unix/shell scripting so please bear with me if some of these trial solutions don't work or don't make a whole lot of sense.
I have already attempted some solutions posted here on stack overflow using join, but it doesn't seem to want to properly sort or join either of these files:
{
join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
} | sort -k 1
Result: only outputs a similar version of the original file2
I attempted to reconfigure existing awk solutions that were posted here as well:
awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$3]=$2; next} {print $1, (v[$3] ? v[$3] : 0)}' file1 file2 > file3
awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$1]=$2; next} {print $1, (v[$1] ? v[$1] : 0)}' file1 file2 > file3
Result: both of these awk commands result in the output of file2's data with nothing from file1 included (or so it seems).
awk -F '
FNR == NR {
time[$3]
next
}
{ for(i in time)
if(index($3, i) == 1) {
print
next
}
}' file1 file2 > file3
Result: keeps returning a syntax error regarding the "." of ".txt"
I looked into integrating some sort of regex or split command to the script... but was confused as to how to proceed and didn't come up with anything of substance.
Sample Data
$ cat file1.txt
125051.354948 058712.429
125052.352475 058959.934
125054.354322 058842.619
125055.352671 058772.045
125057.351794 058707.281
125058.352678 058758.959
$ cat file2.txt
125050.105886 4413.34358 07629.87620
125051.112784 4413.34369 07629.87606
125052.100811 4413.34371 07629.87605
125053.097826 4413.34373 07629.87603
125054.107361 4413.34373 07629.87605
125055.107038 4413.34375 07629.87604
125056.093783 4413.34377 07629.87602
125057.097928 4413.34378 07629.87603
125058.098475 4413.34378 07629.87606
125059.095787 4413.34376 07629.87602
Expected Result:
(Format: Column1File1 Column1File2 Column2File1 Column2File2 Column3File2)
$ cat file3.txt
125051.354948 125051.112784 058712.429 4413.34358 07629.87620
125052.352475 125052.100811 058959.934 4413.34371 07629.87605
125054.354322 125054.107361 058842.619 4413.34373 07629.87605
125055.352671 125055.107038 058772.045 4413.34375 07629.87604
125057.351794 125057.097928 058707.281 4413.34378 07629.87603
125058.352678 125058.098475 058758.959 4413.34378 07629.87606
As shown, not every data point from each file will find a match. Only pairs of lines that have the most proximal timestamps to one another will be written to the new file.
As previously mentioned, my current solutions result in file3 being entirely blank, or containing information from only one of the two files (but not both).
Please try the following:
awk '
# find the closest element in "a" to val and return the index
function binsearch(a, val, len,
low, high, mid) {
if (val < a[1])
return 1
if (val > a[len])
return len
low = 1
high = len
while (low <= high) {
mid = int((low + high) / 2)
if (val < a[mid])
high = mid - 1
else if (val > a[mid])
low = mid + 1
else
return mid
}
return (val - a[low]) < (a[high] - val) ? high : low
}
NR == FNR {
time[FNR] = $1
position[FNR] = $2
intensity[FNR] = $3
len++
next
}
{
i = binsearch(time, $1, len)
print $1 " " time[i] " " $2 " " position[i] " " intensity[i]
}
' file2.txt file1.txt
Result:
125051.354948 125051.112784 058712.429 4413.34369 07629.87606
125052.352475 125052.100811 058959.934 4413.34371 07629.87605
125054.354322 125054.107361 058842.619 4413.34373 07629.87605
125055.352671 125055.107038 058772.045 4413.34375 07629.87604
125057.351794 125057.097928 058707.281 4413.34378 07629.87603
125058.352678 125058.098475 058758.959 4413.34378 07629.87606
Note that the 4th and 5th values in your expected result may be wrongly copy-and-pasted.
[How it works]
The key is the binsearch function, which finds the closest value in the array and returns its index. I won't describe the algorithm in detail because it is a common "binary search" technique.
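For example, with file1's first timestamp 125051.354948, the search narrows down to the neighbouring file2 entries 125051.112784 and 125052.100811 and returns the index of 125051.112784, the closer of the two; that is the pairing shown on the first line of the result above.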
#!/bin/bash
if [[ $# -lt 2 ]]; then
echo "wrong args, it should be $0 file1 file2"
exit 0
fi
# clear blanks, add an extra column 'm' to file1, merge file1, file2, sort
{ awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 | \
\
awk '# record lines and fields in to a
{a[NR] = $0; a[NR,1] = $1; a[NR,2] = $2; a[NR,3] = $3}
END{
for(i=1; i<= NR; ++i){
# 3rd filed of file1 is "m"
if(a[i, 3] == "m"){
# get the difference of column 1 between the current record and the previous / next record
prevDiff = (i-1) in a && a[i-1,3] == "m" ? -1 : a[i,1] - a[i-1,1]
nextDiff = (i+1) in a && a[i+1,3] == "m" ? -1 : a[i+1,1] - a[i,1]
# compare differences, choose the close one and print.
if(prevDiff !=-1 && (nextDiff == -1 || prevDiff < nextDiff))
print a[i,1], a[i-1, 1], a[i, 2], a[i-1, 2], a[i-1, 3]
else if(nextDiff !=-1 && (prevDiff == -1 || nextDiff < prevDiff))
print a[i,1], a[i+1, 1], a[i, 2], a[i+1, 2], a[i+1, 3]
else
print a[i]
}
}
}'
Output of { awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 is:
125050.105886 4413.34358 07629.87620
125051.112784 4413.34369 07629.87606
125051.354948 058712.429 m
125052.100811 4413.34371 07629.87605
125052.352475 058959.934 m
125053.097826 4413.34373 07629.87603
125054.107361 4413.34373 07629.87605
125054.354322 058842.619 m
125055.107038 4413.34375 07629.87604
125055.352671 058772.045 m
125056.093783 4413.34377 07629.87602
125057.097928 4413.34378 07629.87603
125057.351794 058707.281 m
125058.098475 4413.34378 07629.87606
125058.352678 058758.959 m
125059.095787 4413.34376 07629.87602
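Assuming the script above is saved as, say, merge_by_time.sh (hypothetical name) and made executable, it would be invoked as:
./merge_by_time.sh file1.txt file2.txt > file3.txt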

bash update huge csv file with values from another large csv file

I need to update selected rows of a huge csv file (20M rows) with data from another big csv file (30K rows).
The file to be updated is 1.csv:
1120120031,55121
1120127295,55115
6135062894,55121
6135063011,55215
4136723818,55215
6134857289,55215
4430258714,55121
The updating file is 2.csv:
112012 ,55615
6135062,55414
6135063,55514
995707 ,55721
The expected output is 1_MOD.csv:
1120120031,55621
1120127295,55615
6135062894,55421
6135063011,55515
4136723818,55215
6134857289,55215
4430258714,55121
Modifications:
if $1 in 2.csv matches a leading substring of $1 in 1.csv (rows 1 & 2), then update $2 in 1.csv as per the 3rd char in $2 of the matched row of 2.csv;
match the maximum-length string (rows 3 & 4);
unmatched rows remain unchanged (rows 5 to 7).
So far I managed to test sed in a while loop, but the script would take about 31 days to complete. I believe there is a better way, such as reading 2.csv into an awk array and updating 1.csv with that array, something I could not do as my awk knowledge is limited.
Thanks
Using awk, reading in 2.csv, and using the first field as a pattern.
BEGIN {
FS = " *, *";
OFS = ",";
}
NR==FNR {
# Ensure there are no special characters in $1
if ($1 ~ /^[[:digit:]]+$/)
a[$1] = substr($2, 3, 1);
next;
} {
for (n in a)
if ($1 ~ "^"n) {
$2 = substr($2, 1, 2) a[n] substr($2, 4, length($2) - 3);
break;
}
} 1
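Assuming the program above is saved as, say, update.awk (hypothetical filename), it would be invoked with 2.csv first so that the NR==FNR block builds the lookup array:
awk -f update.awk 2.csv 1.csv > 1_MOD.csv
One caveat: for (n in a) visits keys in an unspecified order, so if the longest-match requirement from the question matters, the loop would need to remember the longest matching key instead of breaking on the first match.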
