Adding column values from multiple different files - bash

I have ~100 files and I would like to perform an arithmetic operation (e.g. summing) on the second column of the files, such that the first-row value of one file is added to the first-row value of the next file, and so on for every row of column 2 in each file.
My actual files have ~30,000 rows, so any kind of manual manipulation of the rows is not feasible.
fileA
1 1
2 100
3 1000
4 15000
fileB
1 7
2 500
3 6000
4 20000
fileC
1 4
2 300
3 8000
4 70000
output:
1 12
2 900
3 15000
4 105000
I used the script below and ran it as script.sh listofnames.txt (all the files have the same name but live in different directories, so I use $line, read from the file containing the list of directory names, to refer to them). This gives me a syntax error, and I am looking for another way to define the "sum".
while IFS='' read -r line || [[ -n "$line" ]]; do
awk '{"'$sum'"+=$3; print $1,$2,"'$sum'"}' ../$line/file.txt >> output.txt
echo $sum
done < "$1"
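The syntax error most likely comes from splicing the (empty) shell variable $sum into the awk program: with $sum unset, awk is handed ""+=$3, and you cannot assign to a string constant. A minimal corrected sketch of that loop, keeping the running total inside awk instead of the shell (note it still only sums down column 2 within each file per iteration, which is not the across-files sum the answers below produce):

while IFS='' read -r line || [[ -n "$line" ]]; do
    # keep the running total as an awk variable, not a shell variable
    awk '{sum += $2; print $1, $2, sum}' ../"$line"/file.txt >> output.txt
done < "$1"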

$ paste fileA fileB fileC | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
1 12
2 900
3 15000
4 105000
or if you wanted to do it all in awk:
$ awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' fileA fileB fileC
1 12
2 900
3 15000
4 105000
If you have a list of directories in a file named "foo" and every file you're interested in in every directory is named "bar" then you can do:
IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
cmd "${files[#]}"
where cmd is awk or paste or anything else you want to run on those files. Look:
$ cat foo
abc
def
ghi klm
$ IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
$ awk 'BEGIN{ for (i=1;i<ARGC;i++) print "<" ARGV[i] ">"; exit}' "${files[@]}"
<abc/bar>
<def/bar>
<ghi klm/bar>
So if your files are all named file.txt and your directory names are stored in listofnames.txt then your script would be:
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' listofnames.txt) )
followed by whichever of these you prefer:
paste "${files[#]}" | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' "${files[#]}"
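Put together as a script that takes the list file as its first argument, matching the original script.sh listofnames.txt invocation (a sketch, assuming every directory listed in listofnames.txt contains a file.txt):

#!/usr/bin/env bash
# usage: script.sh listofnames.txt
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' "$1") )
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR; i++) print key[i], sum[i]}' "${files[@]}" > output.txt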

Related

Complex csv question: how to generate a final csv after comparing multiple csvs in the following manner, using shell scripting?

Assume
file1.csv
Schemaname.tablename.columns
exam1
exam2
filetomatch.csv
exam1
exam2
exam4
exam5
exam6
To match the results I used (one result csv is produced each time):
awk 'NR==FNR{a[$1];next} ($1) in a' file1.csv filetomatch.csv >> result.csv
result:
exam 1
exam 2
I have n files to compare to filetomatch.csv.
I need the output to be as follows:
file matchedcolumns
file1 exam 1
exam 2
file2 exam 4
.
.
.
filen exam 2
exam 3
and so on..
How can I concatenate the result.csvs each time, with the file name as the first field?
Also, is there a way to show the null columns as well?
How can I add null values using this?
Example
File1 Column1
File1 Column1
File2 null
File3 column3
and so on
>> result.csv should be doing the concatenation for you.
for example, create test files
$ for i in {1..4}; do echo $i > file$i.txt; done
$ head file?.txt
==> file1.txt <==
1
==> file2.txt <==
2
==> file3.txt <==
3
==> file4.txt <==
4
Run some awk script on all the files, print the filename as part of the output, and concatenate the results:
$ for f in file{1..4}.txt; do awk '{print FILENAME, $0}' "$f" >> results.csv; done
$ cat results.csv
file1.txt 1
file2.txt 2
file3.txt 3
file4.txt 4
I found these two useful:
awk 'NR==FNR{a[$1];next}($1) in a{ print FILENAME, ($1) }' file1.csv filetomatch.csv
Merge the common values in a column:
awk -F, '{ if (f == $1) { for (c=0; c <length($1) ; c++) printf " "; print FS $2 FS $3 } else { print $0 } } { f = $1 }' file.csv
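To cover the remaining part of the question (a row per file with the file name as the first field, plus a null placeholder when nothing matches), here is a rough sketch, assuming the files to compare are named file1.csv, file2.csv, ... (the [0-9] in the glob keeps filetomatch.csv itself out of the loop):

for f in file[0-9]*.csv; do
    awk -v fname="$f" '
        NR==FNR { a[$1]; next }                      # load filetomatch.csv
        $1 in a { found = 1; print fname, $1 }       # matched column
        END     { if (!found) print fname, "null" }  # no match at all -> placeholder row
    ' filetomatch.csv "$f" >> result.csv
done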

Using bash to query a large tab delimited file

I have a list of names and IDs (50 entries)
cat input.txt
name ID
Mike 2000
Mike 20003
Mike 20002
And there is a huge zipped file (13GB)
zcat clients.gz
name ID comment
Mike 2000 foo
Mike 20002 bar
Josh 2000 cake
Josh 20002 _
My expected output is
NR name ID comment
1 Mike 2000 foo
3 Mike 20002 bar
Each $1"\t"$2 of clients.gz is a unique identifier. Some entries from input.txt might be missing from clients.gz, so I would like to add the NR column to my output to find out which ones are missing. I would like to use zgrep; awk takes a very long time (because I had to zcat to uncompress the zipped file, I assume?).
I know that zgrep 'Mike\t2000' does not work. The NR issue I can fix with awk FNR, I imagine.
So far I have:
awk -v q="'" '
NR > 1 {
print "zcat clients.gz | zgrep -w $" q$0q
}' input.txt |
bash > subset.txt
$ cat tst.awk
BEGIN { FS=OFS="\t" }                                 # tab-separated input and output
{ key = $1 FS $2 }                                    # name<TAB>ID identifies a record
NR == FNR { map[key] = (NR>1 ? NR-1 : "NR"); next }   # 1st file (input.txt): remember its line number (header gets "NR")
key in map { print map[key], $0 }                     # 2nd input (the zcat stream): print only records whose key was seen
$ zcat clients.gz | awk -f tst.awk input.txt -        # the trailing "-" makes awk read the pipe as its second input
NR name ID comment
1 Mike 2000 foo
3 Mike 20002 bar
With GNU awk and bash:
awk 'BEGIN{FS=OFS="\t"}
  # process input.txt
  NR==FNR{
    a[$1,$2]=$1 FS $2
    line[$1,$2]=NR-1
    next
  }
  # process <(zcat clients.gz)
  {
    $4=a[$1,$2]
    if(FNR==1)
      line[$1,$2]="NR"
    if($4!="")
      print line[$1,$2],$1,$2,$3
  }' input.txt <(zcat clients.gz)
Output:
NR name ID comment
1 Mike 2000 foo
3 Mike 20002 bar
As one line:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1,$2]=$1 FS $2; line[$1,$2]=NR-1; next} {$4=a[$1,$2]; if(FNR==1) line[$1,$2]="NR"; if($4!="")print line[$1,$2],$1,$2,$3}' input.txt <(zcat clients.gz)
See: Joining two files based on two key columns awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
[EDIT]
I had misunderstood where the prepended line numbers come from. Corrected.
Would you try the following:
declare -A num  # associates each pattern with its line number
mapfile -t ary < <(tail -n +2 input.txt)
pat=$(IFS='|'; echo "${ary[*]}")
for ((i=0; i<${#ary[@]}; i++)); do num[${ary[i]}]=$((i+1)); done
printf "%s\t%s\t%s\t%s\n" "NR" "name" "ID" "comment"
zgrep -E -w "$pat" clients.gz | while IFS= read -r line; do
printf "%d\t%s\n" "${num[$(cut -f 1-2 <<<"$line")]}" "$line"
done
Output:
NR name ID comment
1 Mike 2000 foo
3 Mike 20002 bar
The second and third lines generate a search pattern such as Mike 2000|Mike 20003|Mike 20002 from input.txt.
The line for ((i=0; i<${#ary[@]}; i++)); do .. creates a map from
each pattern to its line number.
The expression "${num[$(cut -f 1-2 <<<"$line")]}" retrieves the line
number from the 1st and 2nd fields of the output.
If the performance is still not satisfactory, please consider ripgrep, which is much faster than grep or zgrep.
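For example, a rough sketch of that idea (assuming input.txt is tab-delimited like clients.gz; this only speeds up the filtering step, you would still need a numbering step like the ones above):

tail -n +2 input.txt > patterns.txt          # one "name<TAB>ID" literal pattern per line, header skipped
zcat clients.gz | rg -F -w -f patterns.txt   # -F fixed strings, -w whole words, -f read patterns from file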

Bash: compare 2 files and show the unique content of one file with 'hierarchy'

So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the unique lines in file 1, I use the 'comm -23' command after running 'sort -u' on these 2 files. Additionally, I would like '11 a' and '123 a' in file 2 to be treated as subsets of '1 a' in file 1; similarly, '445 d' is a subset of '44 d'. These subsets are considered the same as their superset. So the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... So here is my code
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) >output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[@]}"; do
    awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: is there any way I could show the row numbers from the original file in the output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {                          # 1st file (file2): remember each leading number per letter
    vals[$2][$1]
    next
}
$2 in vals {                       # 2nd file (file1): this letter also appears in file2
    for (i in vals[$2]) {
        if ( index(i,$1) == 1 ) {  # a file2 number starts with this number -> covered, skip it
            next
        }
    }
}
{ print FNR, $0 }                  # otherwise print the row number and the unique line
$ awk -f tst.awk file2 file1
2 2 b
3 3 c
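If your awk isn't GNU awk (so no true arrays of arrays), a sketch of the same idea that keeps the file2 numbers for each letter in one space-separated string:

$ cat tst_posix.awk
NR==FNR { nums[$2] = nums[$2] " " $1; next }   # file2: collect numbers per letter
{
    n = split(nums[$2], arr, " ")
    for (j = 1; j <= n; j++)
        if (index(arr[j], $1) == 1) next       # a file2 number starts with $1 -> covered
    print FNR, $0                              # otherwise unique to file1
}
$ awk -f tst_posix.awk file2 file1
2 2 b
3 3 c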

copying columns from different files into a single file using awk

I have more than 500 files, each having two columns, "Gene short name" and "FPKM" values. The number of rows is the same and the "Gene short name" column is common to all the files. I want to create a matrix keeping the first column as the gene short name (it can be taken from any of the files) and the remaining columns as the FPKM values.
I have used this command, which works well, but how can I use it for 500 files?
paste -d' ' <(awk -F'\t' '{print $1}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 72_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 75_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 78_genes.fpkm.txt) > col.txt
sample data (files are tab separated):
head 69_genes.fpkm.txt
gene_short_name FPKM
DDX11L1 0.196141
MIR1302-2HG 0.532631
MIR1302-2 0
WASH7P 4.51437
Expected outcome
gene_short_name FPKM FPKM FPKM FPKM
DDX11L1 0.196141 0.206591 0.0201256 0.363618
MIR1302-2HG 0.532631 0.0930007 0.0775838 0
MIR1302-2 0 0 0 0
WASH7P 4.51437 3.31073 3.23326 1.05673
MIR6859-1 0 0 0 0
FAM138A 0.505155 0.121703 0.105235 0
OR4G4P 0.0536387 0 0 0
OR4G11P 0 0 0 0
OR4F5 0.0390888 0.0586067 0 0
Also, I want to change the name "FPKM" to "filename_FPKM".
Given the input
$ cat a.txt
a 1
b 2
c 3
$ cat b.txt
a I
b II
c III
$ cat c.txt
a one
b two
c three
you can loop:
cut -f1 a.txt > result.txt
for f in a.txt b.txt c.txt
do
cut -f2 "$f" | paste result.txt - > tmp.txt
mv {tmp,result}.txt
done
$ cat result.txt
a 1 I one
b 2 II two
c 3 III three
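With ~500 files you wouldn't list them by hand; here is a sketch assuming they all match the *_genes.fpkm.txt glob, share the same row order, and that each FPKM header should become filename_FPKM as asked above:

cut -f1 69_genes.fpkm.txt > result.txt            # gene_short_name column from any one file
for f in *_genes.fpkm.txt; do
    # take column 2, rename its header to <filename>_FPKM, and append it
    cut -f2 "$f" | sed "1s/.*/${f}_FPKM/" | paste result.txt - > tmp.txt
    mv tmp.txt result.txt
done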
In awk, using @Micha's data for clarity:
$ awk '
BEGIN { FS=OFS="\t" }         # set the field separators
FNR==1 {
    $2=FILENAME "_" $2        # on first record of each file rename $2
}
NR==FNR {                     # process the first file
    a[FNR]=$0                 # hash whole record to a
    next
}
{                             # process other files
    a[FNR]=a[FNR] OFS $2      # add $2 to the end of the record
}
END {                         # in the end
    for(i=1;i<=FNR;i++)       # print all records
        print a[i]
}' a.txt b.txt c.txt
Output:
a a.txt_1 b.txt_I c.txt_one
b 2 II two
c 3 III three
(The FILENAME prefix lands on a data value in the first row only because these toy files have no header line; with the real *_genes.fpkm.txt files, the FNR==1 block renames the FPKM header instead.)

Subtracting row values from two different text files

I have two text files, and each file has one column with several rows:
FILE1
a
b
c
FILE2
d
e
f
I want to create a file that has the following output:
a - d
b - e
c - f
All the entries are meant to be numbers (decimals). I am completely stuck and do not know how to proceed.
Using paste seems like the obvious choice, but unfortunately you can't specify a multi-character delimiter. To get around this, you can pipe the output to sed:
$ paste -d- file1 file2 | sed 's/-/ - /'
a - d
b - e
c - f
Paste joins the two files together and sed adds the spaces around the -.
If your desired output is the result of the subtraction, then you could use awk:
paste file1 file2 | awk '{ print $1 - $2 }'
given:
$ cat /tmp/a.txt
1
2
3
$ cat /tmp/b.txt
4
5
6
awk is a good bet to process the two files and do arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""]+$1 }' /tmp/a.txt /tmp/b.txt
5
7
9
Or, if you want the strings rather than arithmetic:
$ awk 'FNR==NR { a[FNR""] = $0; next } { print a[FNR""] " - "$0 }' /tmp/a.txt /tmp/b.txt
1 - 4
2 - 5
3 - 6
Another solution using while and file descriptors:
while read -r line1 <&3 && read -r line2 <&4
do
    #printf '%s - %s\n' "$line1" "$line2"
    printf '%s\n' $(($line1 - $line2))
done 3<f1.txt 4<f2.txt
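Note that bash's $(( )) arithmetic is integer-only, and the question says the entries are decimals; a sketch of the same loop delegating the subtraction to bc instead:

while read -r line1 <&3 && read -r line2 <&4
do
    bc <<< "$line1 - $line2"    # bc handles decimal (non-integer) values
done 3<f1.txt 4<f2.txt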
