copying columns from different files into a single file using awk - bash

I have more than 500 files, each with two columns: "Gene short name" and "FPKM". The number of rows is the same in every file, and the "Gene short name" column is identical across all files. I want to build a matrix whose first column is the gene short name (it can be taken from any of the files) and whose remaining columns hold the FPKM values, one column per file.
I have used this command, which works well, but how can I use it for 500 files?
paste -d' ' <(awk -F'\t' '{print $1}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 69_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 72_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 75_genes.fpkm.txt) \
<(awk -F'\t' '{print $2}' 78_genes.fpkm.txt) > col.txt
sample data (files are tab separated):
head 69_genes.fpkm.txt
gene_short_name FPKM
DDX11L1 0.196141
MIR1302-2HG 0.532631
MIR1302-2 0
WASH7P 4.51437
Expected outcome
gene_short_name FPKM FPKM FPKM FPKM
DDX11L1 0.196141 0.206591 0.0201256 0.363618
MIR1302-2HG 0.532631 0.0930007 0.0775838 0
MIR1302-2 0 0 0 0
WASH7P 4.51437 3.31073 3.23326 1.05673
MIR6859-1 0 0 0 0
FAM138A 0.505155 0.121703 0.105235 0
OR4G4P 0.0536387 0 0 0
OR4G11P 0 0 0 0
OR4F5 0.0390888 0.0586067 0 0
Also, I want to change the name "FPKM" to "filename_FPKM".

Given the input
$ cat a.txt
a 1
b 2
c 3
$ cat b.txt
a I
b II
c III
$ cat c.txt
a one
b two
c three
you can loop:
cut -f1 a.txt > result.txt
for f in a.txt b.txt c.txt
do
cut -f2 "$f" | paste result.txt - > tmp.txt
mv {tmp,result}.txt
done
$ cat result.txt
a 1 I one
b 2 II two
c 3 III three
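As an alternative to rewriting result.txt once per file, the same result can be produced in one pass by pasting all files side by side and keeping the first column plus every even-numbered column. This is only a sketch: it assumes the files are tab-separated, have exactly two columns, and list the keys in the same order. paste opens all inputs at once, so with several hundred files the per-process open-file limit (see `ulimit -n`) may matter.

```shell
# Recreate the sample inputs from above in a scratch directory (tab-separated)
cd "$(mktemp -d)"
printf 'a\t1\nb\t2\nc\t3\n'         > a.txt
printf 'a\tI\nb\tII\nc\tIII\n'      > b.txt
printf 'a\tone\nb\ttwo\nc\tthree\n' > c.txt

# After paste, columns are: key, value, key, value, ...
paste a.txt b.txt c.txt |
awk 'BEGIN { FS = OFS = "\t" }
     {
         out = $1                        # key from the first file
         for (i = 2; i <= NF; i += 2)    # every even column is a value
             out = out OFS $i
         print out
     }' > result.txt

cat result.txt
```

For the real data, the three file names would be replaced by a glob such as *_genes.fpkm.txt.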

In awk, using @Micha's data for clarity:
$ awk '
BEGIN { FS=OFS="\t" } # set the field separators
FNR==1 {
$2=FILENAME "_" $2 # on first record of each file rename $2
}
NR==FNR { # process the first file
a[FNR]=$0 # hash whole record to a
next
}
{ # process other files
a[FNR]=a[FNR] OFS $2 # add $2 to the end of the record
}
END { # in the end
for(i=1;i<=FNR;i++) # print all records
print a[i]
}' a.txt b.txt c.txt
Output:
a a.txt_1 b.txt_I c.txt_one
b 2 II two
c 3 III three
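Applied to the files from the question, the awk approach above might look like the following sketch. It assumes every file matches the glob *_genes.fpkm.txt and has a header line; the output name matrix.tsv and the two-gene sample files are made up for illustration. The suffix is stripped from FILENAME so the header reads 69_FPKM rather than 69_genes.fpkm.txt_FPKM; drop the sub() call if the full file name is wanted.

```shell
# Two small sample files in a scratch directory (values taken from the question)
cd "$(mktemp -d)"
printf 'gene_short_name\tFPKM\nDDX11L1\t0.196141\nWASH7P\t4.51437\n' > 69_genes.fpkm.txt
printf 'gene_short_name\tFPKM\nDDX11L1\t0.206591\nWASH7P\t3.31073\n' > 72_genes.fpkm.txt

awk '
BEGIN { FS = OFS = "\t" }
FNR == 1 {                                   # header line of each file
    name = FILENAME
    sub(/_genes\.fpkm\.txt$/, "", name)      # 69_genes.fpkm.txt becomes 69
    $2 = name "_" $2                         # FPKM becomes 69_FPKM
}
NR == FNR { row[FNR] = $0; next }            # first file: keep both columns
{ row[FNR] = row[FNR] OFS $2 }               # other files: append column 2
END { for (i = 1; i <= FNR; i++) print row[i] }
' *_genes.fpkm.txt > matrix.tsv

cat matrix.tsv
```

The shell expands the glob in lexicographic order, so a file named 100_genes.fpkm.txt would come before 69_genes.fpkm.txt; the renamed headers make the column order easy to check.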

Related

Most efficient way to use first column of file1 as prefix for all lines in file2

I have two files with the same number of lines. I want each value in column 1 of file 1 to become the prefix of the corresponding line in file 2, with the characters of that file 2 line separated by single spaces.
File 2 is very large and has more than 70 million columns.
Example:
Input file 2
10000
10019
Input file 1
Ind1
Ind2
Output
Ind1 1 0 0 0 0
Ind2 1 0 0 1 9
Q: How can this be done efficiently?
EDIT I : I already looked for solutions to add different prefixes to each line e.g. here but was unable to adjust the solution so that I can iterate over the values of the first column of another file.
EDIT II :
Using the answer from @Gilles I came up with this:
awk ' { print $1 } ' file1 <(sed 's/./& /g' file2) > output
EDIT III:
I tried the solution from @Gilles and it failed with a regex input buffer error:
paste -d ' ' file1 <(sed 's/./& /g' file2) > file3
sed: regex input buffer length larger than INT_MAX
I also tried the solution from @Ed Morton, but this ended in an OOM error. AWK states that it needs 1.8TB RAM for this operation.
awk '{head=$0} (getline tail < "file2") > 0{gsub(/./," &",tail); print head tail}' file1 > file3
awk: cmd. line:1: (FILENAME=file1.txt FNR=1) fatal: builtin.c:3058:sub_common: buf: cannot reallocate 1949336035328 bytes of memory: Cannot allocate memory.
Using sed & paste:
$ paste -d ' ' file1 <(sed 's/./& /g' file2)
Ind1 1 0 0 0 0
Ind2 1 0 0 1 9
Using any awk:
$ awk '{head=$0} (getline tail < "file2") > 0{gsub(/./," &",tail); print head tail}' file1
Ind1 1 0 0 0 0
Ind2 1 0 0 1 9
or if the whole contents of file1 fit in memory:
$ awk 'NR==FNR{a[FNR]=$0; next} {gsub(/./," &"); print a[FNR] $0}' file1 file2

Bash: compare 2 files and show the unique content of one file with 'hierarchy'

So basically, these are two files I need to compare
file1.txt
1 a
2 b
3 c
44 d
file2.txt
11 a
123 a
3 b
445 d
To show the lines unique to file 1, I run 'comm -23' after applying 'sort -u' to both files. Additionally, I would like '11 a' and '123 a' in file 2 to count as subsets of '1 a' in file 1; similarly, '445 d' is a subset of '44 d'. A subset is considered the same as its superset. So the desired output is
2 b
3 c
I'm a beginner and my loop is way too slow... So here is my code
comm -23 <( awk '{print $1,$2}' file1.txt | sort -u ) <( awk '{print $1,$2}' file2.txt | sort -u ) >output.txt
array=($( awk -F ',' '{print $1}' file1.txt ))
for i in "${array[@]}";do
awk -v pattern="$i" 'match($0, "^" pattern)' output.txt > repeat.txt
done
comm -23 <( cat output.txt | sort -u ) <( cat repeat.txt | sort -u )
Anyone got any good ideas?
Another question: is there a way to show the row numbers from the original file in the output? For example,
(row num from file 1)
2 2 b
3 3 c
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR {
vals[$2][$1]
next
}
$2 in vals {
for (i in vals[$2]) {
if ( index(i,$1) == 1 ) {
next
}
}
}
{ print FNR, $0 }
$ awk -f tst.awk file2 file1
2 2 b
3 3 c
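Where GNU awk is not available, the same idea can be sketched without arrays of arrays by keeping the numbers for each letter in one space-joined string and splitting it on lookup. This is an illustrative variant, not part of the answer above; it assumes the first column never contains spaces, and unique.txt is a made-up output name.

```shell
# Sample inputs copied from the question (space-separated)
cd "$(mktemp -d)"
printf '%s\n' '1 a' '2 b' '3 c' '44 d'     > file1.txt
printf '%s\n' '11 a' '123 a' '3 b' '445 d' > file2.txt

awk '
NR == FNR {                        # first file on the command line: file2.txt
    vals[$2] = vals[$2] " " $1     # collect its numbers per letter, space-joined
    next
}
{
    n = split(vals[$2], nums, " ")     # n is 0 when the letter is absent from file2.txt
    for (j = 1; j <= n; j++)
        if (index(nums[j], $1) == 1)   # a file2 number starts with the file1 number
            next                       # treat it as a subset and skip this line
    print FNR, $0                      # FNR is the row number within file1.txt
}' file2.txt file1.txt > unique.txt

cat unique.txt
```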

Adding column values from multiple different files

I have ~100 files and I would like to do an arithmetical operation (e.g. sum them up) on the second column of the files, such that I add the value of first row of one file to the first row value of second file and so on for all rows of column 2 in each file.
In my actual files I have ~30 000 rows so any kind of manual manipulation with the rows is not possible.
fileA
1 1
2 100
3 1000
4 15000
fileB
1 7
2 500
3 6000
4 20000
fileC
1 4
2 300
3 8000
4 70000
output:
1 12
2 900
3 15000
4 105000
I used the loop below and ran it as: script.sh listofnames.txt. All the files have the same name but sit in different directories, so I refer to each one through $line, which is read from a file listing the directory names. This gives me a syntax error, and I am looking for another way to define the "sum".
while IFS='' read -r line || [[ -n "$line" ]]; do
awk '{"'$sum'"+=$3; print $1,$2,"'$sum'"}' ../$line/file.txt >> output.txt
echo $sum
done < "$1"
$ paste fileA fileB fileC | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
1 12
2 900
3 15000
4 105000
or if you wanted to do it all in awk:
$ awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' fileA fileB fileC
1 12
2 900
3 15000
4 105000
If you have a list of directories in a file named "foo" and every file you're interested in in every directory is named "bar" then you can do:
IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
cmd "${files[@]}"
where cmd is awk or paste or anything else you want to run on those files. Look:
$ cat foo
abc
def
ghi klm
$ IFS=$'\n' files=( $(awk '{print $0 "/bar"}' foo) )
$ awk 'BEGIN{ for (i=1;i<ARGC;i++) print "<" ARGV[i] ">"; exit}' "${files[@]}"
<abc/bar>
<def/bar>
<ghi klm/bar>
So if your files are all named file.txt and your directory names are stored in listofnames.txt then your script would be:
IFS=$'\n' files=( $(awk '{print $0 "/file.txt"}' listofnames.txt) )
followed by whichever of these you prefer:
paste "${files[@]}" | awk '{sum=0; for (i=2;i<=NF;i+=2) sum+=$i; print $1, sum}'
awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR;i++) print key[i], sum[i]}' "${files[@]}"
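One caveat with building the list through an unquoted $(...): the result still undergoes pathname expansion, and the IFS assignment stays in effect for the rest of the shell session. A possible alternative, sketched here under the assumption of one directory name per line, reads the list line by line into the positional parameters. The directory names and the sums.txt output name are only for illustration; in bash 4 or later, `mapfile -t files < listofnames.txt` followed by a suffix step would be the array equivalent.

```shell
# Scratch layout: three directories (one with a space in its name), each holding file.txt
cd "$(mktemp -d)"
mkdir abc def "ghi klm"
printf '1 1\n2 100\n' > abc/file.txt
printf '1 7\n2 500\n' > def/file.txt
printf '1 4\n2 300\n' > "ghi klm/file.txt"
printf '%s\n' abc def "ghi klm" > listofnames.txt

set --                                 # start with an empty argument list
while IFS= read -r dir; do             # one directory name per line, spaces preserved
    set -- "$@" "$dir/file.txt"        # append the path as a single argument
done < listofnames.txt

awk '{key[FNR]=$1; sum[FNR]+=$2} END{for (i=1; i<=FNR; i++) print key[i], sum[i]}' "$@" > sums.txt
cat sums.txt
```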

Sort a file like another file

I have 2 text files :
1st file :
1 C
1 D
1 B
1 A
2nd file :
B
C
D
A
I want to sort first file like this:
1 B
1 C
1 D
1 A
Can you help me with a script in bash (or a command)?
I solved the sort problem (I eliminated the first column) and used this script:
awk 'FNR == NR { lineno[$1] = NR; next}
{print lineno[$1], $0;}' ids.txt resultpartial.txt | sort -k 1,1n | cut -d' ' -f2-
Now I want to add ( first column like before)
1 .....
What about simply ignoring the first file and doing this?
echo -n > result-file.txt # empty result file if already created
while read line; do
echo "1 $line" >> result-file.txt
done < file2.txt
This makes sense only if your files always have this specific format.
Assuming that the "sort" field contains no duplicated values:
awk 'FNR==NR {line[$2] = $0; next} {print line[$1]}' file1 file2
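If the sort field can repeat in file1, the lookup above keeps only the last line per key. A rank-based variant, similar in spirit to the script in the question, is sketched below: each line of file1 is prefixed with the position of its key in file2, sorted on that position, and the prefix is removed again. The extra "2 B" row and the sorted.txt output name are made up for illustration, and `sort -s` (stable sort) is assumed to be available, as it is in GNU and BSD sort.

```shell
cd "$(mktemp -d)"
printf '%s\n' '1 C' '1 D' '1 B' '1 A' '2 B' > file1   # last row is an added duplicate key
printf '%s\n' B C D A > file2

awk 'FNR == NR { rank[$1] = FNR; next }   # file2: remember the position of each key
     { print rank[$2] "\t" $0 }           # file1: prefix each line with the rank of its key
' file2 file1 |
sort -s -k1,1n |                          # stable numeric sort on the rank
cut -f2- > sorted.txt                     # drop the rank column again

cat sorted.txt
```

Lines whose key does not appear in file2 get an empty rank, which sorts as zero, so they end up at the top of the output.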

I need to merge two files and create a new files

input 1
1 10611 2 122 C:0.983607 G:0.0163934
input 2
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07
here
Here the 1st and 2nd columns match in both files, the value before the ':' in the 5th column of the first file equals the 4th column of the second file, and the value before the ':' in the 6th column of the first file equals the 5th column of the second file. The output is created based on this match; the input and output lines should make this clear. Both files are .gz files.
output
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
Here's one way using awk:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' input1 input2
Result:
1 10611 rs146752890 C G 100 PASS AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;REF=0.983607;ALT=0.0163934;
So for compressed files, try:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $0 ";" c[$1,$2,$4,$5] }' <(gzip -dc input1.gz) <(gzip -dc input2.gz) | gzip > output.gz
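In a shell without process substitution, one possible workaround is to let awk read the first compressed file itself through a `command | getline` loop in the BEGIN block and pipe the second file in on standard input. The sketch below assumes whitespace-separated fields; the INFO column of the sample line is shortened for readability, and the array name freq is arbitrary.

```shell
# Tiny compressed sample inputs in a scratch directory
cd "$(mktemp -d)"
printf '1 10611 2 122 C:0.983607 G:0.0163934\n'            | gzip > input1.gz
printf '1 10611 rs146752890 C G 100 PASS AC=184;AF=0.08\n' | gzip > input2.gz

gzip -dc input2.gz | awk '
BEGIN {
    cmd = "gzip -dc input1.gz"
    while ((cmd | getline line) > 0) {     # read input1 line by line
        split(line, f, " ")                # split on runs of whitespace
        split(f[5], a, ":")                # allele and frequency of the 5th column
        split(f[6], b, ":")                # allele and frequency of the 6th column
        freq[f[1], f[2], a[1], b[1]] = "REF=" a[2] ";ALT=" b[2] ";"
    }
    close(cmd)
}
($1, $2, $4, $5) in freq { print $0 ";" freq[$1, $2, $4, $5] }
' | gzip > output.gz

gzip -dc output.gz
```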
EDIT:
From the comments below, try this:
awk 'FNR==NR { split($5,a,":"); split($6,b,":"); c[$1,$2,a[1],b[1]]="REF=" a[2] ";ALT=" b[2] ";"; next } ($1,$2,$4,$5) in c { print $1, $2, $3, $4, $5, $6, $7, c[$1,$2,$4,$5] $8 ";" }' file1 file2
Result:
1 10611 rs146752890 C G 100 PASS REF=0.983607;ALT=0.0163934;AC=184;RSQ=0.8228;AVGPOST=0.9640;AN=2184;ERATE=0.0031;VT=SNP;AA=.;THETA=0.0127;LDAF=0.0902;SNPSOURCE=LOWCOV;AF=0.08;ASN_AF=0.08;AMR_AF=0.14;AFR_AF=0.08;EUR_AF=0.07;
This should work (assuming you have enough disk space to store the expanded .gz files):
zcat 1 | awk '{print $1$2,$0}' | sort > new1
zcat 2 | awk '{print $1$2,$0}' | sort > new2
join new1 new2 -11 -21 -o "2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 1.6 1.7"|sed 's/ C:/;REF=/'|sed 's/ G:/;ALT=/' > output
