Merging multiple files based on file1 keys - bash

I am trying to merge multiple files based on the keys in a main file.
My main file, which holds the keys I want to compare against, looks like this:
cat files.txt
1
2
3
4
5
6
7
8
9
10
11
The other input files look like this:
cat f1.txt
1 : 20
3 : 40
5 : 40
7 : 203
cat f2.txt
3 : 45
4 : 56
9 : 23
I want output like this:
f1 f2 ....
1 20 NA
2 NA NA
3 40 45
4 56 NA
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 23 NA
10 NA NA
11 NA NA
I tried this, but I am not able to print the non-matching keys:
awk -F':' 'NF>1{a[$1] = a[$1]$2}END{for(i in a){print i""a[i]}}' files.txt *.txt
1 20
3 40 45
4 56
5 40
7 203
9 23
Can someone please tell me what is missing here?
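The core gap in the attempt above is that the keys coming from files.txt (lines where NF==1 under -F':') are never stored, so the END loop only ever sees matched keys. A minimal sketch of that one fix, recreating the sample files; it prints NA for unmatched keys but does not split values into per-file columns:

```shell
# Recreate the sample inputs from the question.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 > files.txt
printf '%s\n' '1 : 20' '3 : 40' '5 : 40' '7 : 203' > f1.txt
printf '%s\n' '3 : 45' '4 : 56' '9 : 23' > f2.txt

# Seed every key from files.txt first, then append matched values;
# the END loop walks the seeded keys, so unmatched ones print "NA".
awk -F':' '
    NF == 1 { keys[++n] = $1; seen[$1]; next }      # key-only lines: seed
    { k = $1 + 0; if (k in seen) a[k] = a[k] $2 }   # data lines: append value
    END {
        for (i = 1; i <= n; i++) {
            key = keys[i]
            print key (a[key] == "" ? " NA" : a[key])
        }
    }' files.txt f1.txt f2.txt > merged.txt
cat merged.txt
```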

A more involved GNU awk solution (covers any number of files, within system resources):
awk 'BEGIN{
    PROCINFO["sorted_in"]="#ind_num_asc"; h=" "
    for(i=2;i<ARGC;i++) h=(i==2)? h ARGV[i]: h OFS ARGV[i]
    print h
}
NR==FNR{ a[$1]; next }
{ b[ARGIND][$1]=$3 }
END{
    for(i in a) {
        printf("%d",i)
        for(j in b) printf("%s%s",OFS,(i in b[j])? b[j][i] : "NA")
        print ""
    }
}' files.txt *.txt
An exemplary output:
f1 f2
1 20 NA
2 NA NA
3 40 45
4 NA 56
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 NA 23
10 NA NA
11 NA NA
PROCINFO["sorted_in"]="#ind_num_asc" - sets the array traversal order (numeric indices, in ascending order)
for(i=2;i<ARGC;i++) h=(i==2)? h ARGV[i]: h OFS ARGV[i] - iterates through the script's file arguments, collecting the filenames for the header.
ARGC and ARGV make the command-line arguments available to your program.
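A quick way to see ARGC/ARGV in action: a BEGIN-only program exits before awk tries to read the named files, so they need not even exist:

```shell
# Print every command-line argument awk sees; ARGV[0] is "awk" itself
# and file arguments start at ARGV[1].
awk 'BEGIN { for (i = 0; i < ARGC; i++) print i, ARGV[i] }' a.txt b.txt > argv.out
cat argv.out
```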

$ cat awk-file
NR==FNR{
    l=NR
    next
}
NR==FNR+l{
    split(FILENAME,f1,".")
    a[$1]=$3
    next
}
NR==FNR+l+length(a){
    split(FILENAME,f2,".")
    b[$1]=$3
    next
}
END{
    print "",f1[1],f2[1]
    for(i=1;i<=l;i++){
        print i,(a[i]!="")?a[i]:"NA",(b[i]!="")?b[i]:"NA"
    }
}
$ awk -v OFS='\t' -f awk-file files.txt f1.txt f2.txt
f1 f2
1 20 NA
2 NA NA
3 40 45
4 NA 56
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 NA 23
10 NA NA
11 NA NA
I have modified the answer for your follow-up question.
If you have a 3rd, 4th, ... nth file, add a new block like this for each:
NR==FNR+l+length(a)+...+length(n){
split(FILENAME,fn,".")
n[$1]=$3
}
And in your END block:
END{
print "",f1[1],f2[1],...,fn[1]
for(i=1;i<=l;i++){
print i,(a[i]!="")?a[i]:"NR",(b[i]!="")?b[i]:"NR",...,(n[i]!="")?n[i]:"NR"
}
}

$ cat tst.awk
ARGIND < (ARGC-1) { map[ARGIND,$1] = $NF; next }
FNR==1 {
    printf "%-2s", ""
    for (fileNr=1; fileNr<ARGIND; fileNr++) {
        fileName = ARGV[fileNr]
        sub(/\.txt$/,"",fileName)
        printf "%s%s", OFS, fileName
    }
    print ""
}
{
    printf "%-2s", $1
    for (fileNr=1; fileNr<ARGIND; fileNr++) {
        printf "%s%s", OFS, ((fileNr,$1) in map ? map[fileNr,$1] : "NA")
    }
    print ""
}
$ awk -f tst.awk f1.txt f2.txt files.txt
f1 f2
1 20 NA
2 NA NA
3 40 45
4 NA 56
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 NA 23
10 NA NA
11 NA NA
The above uses GNU awk for ARGIND; with other awks, just add the line FNR==1{ARGIND++} at the start of the script.
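A hedged, portable sketch of that tweak, with the file counter maintained by hand instead of ARGIND (shortened sample data, header printing omitted for brevity):

```shell
printf '%s\n' 1 2 3 > files.txt               # shortened key list
printf '%s\n' '1 : 20' '3 : 40' > f1.txt
printf '%s\n' '3 : 45' > f2.txt

# tst.awk's idea, but with fileNr incremented manually instead of ARGIND,
# so it also runs on non-GNU awks.
awk '
    FNR == 1 { fileNr++ }
    fileNr < (ARGC - 1) { map[fileNr, $1] = $NF; next }
    {
        printf "%s", $1
        for (f = 1; f < ARGC - 1; f++)
            printf " %s", ((f, $1) in map ? map[f, $1] : "NA")
        print ""
    }' f1.txt f2.txt files.txt > merged2.txt
cat merged2.txt
```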

Using awk and sort -n for sorting the output:
$ awk -F" *: *" '
NR==FNR {
    a[$1]; next
}
FNR==1 {
    for(i in a)
        a[i]=a[i] " NA"
    h=h OFS FILENAME
}
{
    match(a[$1]," NA")
    a[$1]=substr(a[$1],1,RSTART-1) OFS $2 substr(a[$1],RSTART+RLENGTH)
}
END {
    print h
    for(i in a)
        print i a[i]
}' files f1 f2 | sort -n
f1 f2
1 20 NA
2 NA NA
3 40 45
4 56 NA
5 40 NA
6 NA NA
7 203 NA
8 NA NA
9 23 NA
10 NA NA
11 NA NA
Pitfalls: 1. sort can fail to keep the header on top in certain situations. 2. Since NA is replaced with the value of $2, your data can't contain strings starting with NA. That could probably be circumvented by matching / NA( |$)/ instead, but it would require a lot more checking in the code, so choose your NA placeholder carefully. :D
Edit:
Running it on, for example, four files:
$ awk '...' files f1 f2 f1 f2 | sort -n
1 20 20 NA NA
2 NA NA NA NA
3 40 45 40 45
4 56 56 NA NA
5 40 40 NA NA
6 NA NA NA NA
7 203 203 NA NA
8 NA NA NA NA
9 23 23 NA NA
10 NA NA NA NA
11 NA NA NA NA
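The first pitfall can be demonstrated directly: under sort -n, a non-numeric header line counts as sort key 0, so it only stays on top while every data key is positive (made-up two-column lines for the demo):

```shell
# Under `sort -n`, a non-numeric header sorts with key 0.
printf '%s\n' 'f1 f2' '2 NA' '1 20' | sort -n > good.out   # keys all > 0
printf '%s\n' 'f1 f2' '0 NA' '-1 20' | sort -n > bad.out   # keys <= 0 exist
head -n 1 good.out bad.out
```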

You can use the script below.
FILESPATH points to the directory containing your input files (f1.txt, f2.txt, ...).
INPUT is the main input file (files.txt).
script.sh
FILESPATH=/home/ubuntu/work/test/
INPUT=/home/ubuntu/work/files.txt
i=0
while read line
do
    FILES[$i]="$line"
    (( i++ ))
done < <(ls $FILESPATH/*.txt)

for file in "${FILES[@]}"
do
    echo -n " ${file##*/}"
done
echo ""

while IFS= read -r var
do
    echo -n "$var "
    for file in "${FILES[@]}"
    do
        VALUE=`grep "$var " $file | cut -d ' ' -f3`
        if [ ! -z "$VALUE" ]; then
            echo -n "$VALUE "
        else
            echo -n "NA "
        fi
    done
    echo ""
done < "$INPUT"
You can use printf instead of echo to get better formatting of the output.
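A hedged sketch of that printf suggestion; the column widths and file name table.txt are arbitrary choices for the demo:

```shell
# printf pads each field to a fixed width, which echo -n cannot do.
printf '%-4s%-6s%-6s\n' ''  f1 f2  > table.txt
printf '%-4s%-6s%-6s\n' 1   20 NA >> table.txt
printf '%-4s%-6s%-6s\n' 10  NA NA >> table.txt
cat table.txt
```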

This can be done with a simple loop and echo statements.
#!/bin/bash
NA=" NA"
i=0
#print header module start
header[i]=" "
for file in `ls f[0-9].txt`
do
    first_part=`echo $file | cut -d. -f1`
    i=$((i+1))
    header[i]=$first_part
done
echo ${header[@]}
#print header module end
#print elements start
for element in `cat files.txt`
do
    var=$element
    for file in `ls f[0-9].txt`
    do
        var1=`grep -w ${element} $file`
        if [[ ! -z $var1 ]]; then
            field2=`echo $var1 | cut -d":" -f2`
            var="$var$field2"
        else
            var="$var$NA"
        fi
    done
    echo $var
done
#print elements end

Related

Insert rows using awk

How can I insert a row using awk?
My file looks like this:
1 43
2 34
3 65
4 75
I would like to insert three rows with "?", so my desired file looks like this:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I am trying with the script below.
awk '{if(NR<=3){print "NR ?"}} {printf" " NR $2}' file.txt
Here's one way to do it:
$ awk 'BEGIN{s=" "; for(c=1; c<4; c++) print c s "?"}
{print c s $2; c++}' ip.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
$ awk 'BEGIN {printf "1 ?\n2 ?\n3 ?\n"} {printf "%d", $1 + 3; printf " %s\n", $2}' file.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
You could also add the 3 lines before awk, e.g.:
{ seq 3; cat file.txt; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t'
Output:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I would do it the following way using GNU AWK. Let file.txt's content be
1 43
2 34
3 65
4 75
then
awk 'BEGIN{OFS=" "}NR==1{print 1,"?";print 2,"?";print 3,"?"}{print NR+3,$2}' file.txt
output
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
Explanation: I set the output field separator (OFS) to a space. For the 1st row I print three lines, each consisting of a sequential number and ? separated by the output field separator. You might elect to do this using a for loop, especially if you expect the requirement to change here. For every input line I print the row number plus 3 (to keep the order) and the 2nd column ($2). Thanks to the use of OFS, you would need to make only one change if the requirement regarding the separator were altered. Note that a construct like
{if(condition){dosomething}}
might be written in GNU AWK in a more concise manner as
(condition){dosomething}
(tested in gawk 4.2.1)
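As a small, hedged illustration of that equivalence (the file name nums.txt and the threshold are made up for the demo):

```shell
printf '%s\n' 10 25 3 > nums.txt
# The explicit if-form...
awk '{ if ($1 > 5) print $1 }' nums.txt > if_form.out
# ...and the equivalent pattern{action} form select the same lines.
awk '$1 > 5 { print $1 }' nums.txt > pattern_form.out
cat pattern_form.out
```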

Select first two columns from tab-delimited text file and substitute with '_' character

I have a sample input file as follows
RF00001 1c2x C 3 118 77.20 1.6e-20 1 119 f29242
RF00001 1ffk 9 1 121 77.40 1.4e-20 1 119 8e2511
RF00001 1jj2 9 1 121 77.40 1.4e-20 1 119 f29242
RF00001 1k73 B 1 121 77.40 1.4e-20 1 119 8484c0
RF00001 1k8a B 1 121 77.40 1.4e-20 1 119 93c090
RF00001 1k9m B 1 121 77.40 1.4e-20 1 119 ebeb30
RF00001 1kc8 B 1 121 77.40 1.4e-20 1 119 bdc000
I need to extract the second and third columns from the text file and substitute the tab with '_'.
Desired output file:
1c2x_C
1ffk_9
1jj2_9
1k73_B
1k8a_B
1k9m_B
1kc8_B
I am able to print the two columns with:
awk -F" " '{ print $2,$3 }' input.txt
but unable to substitute the tab with '_' with the following command
awk -F" " '{ print $2,'_',$3 }' input.txt
Could you please try the following.
awk '{print $2"_"$3}' Input_file
2nd solution:
awk 'BEGIN{OFS="_"} {print $2,$3}' Input_file
3rd solution: Adding a sed solution.
sed -E 's/[^ ]* +([^ ]*) +([^ ]*).*/\1_\2/' Input_file
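A quick sanity check that the awk and sed variants agree, using two rows shaped like the question's input; space-delimited here for the demo rather than tabs, since the sed pattern matches literal spaces:

```shell
printf '%s\n' 'RF00001 1c2x C 3 118' 'RF00001 1ffk 9 1 121' > input.txt

# Join columns 2 and 3 with an underscore, two ways.
awk '{ print $2 "_" $3 }' input.txt > awk.out
sed -E 's/[^ ]* +([^ ]*) +([^ ]*).*/\1_\2/' input.txt > sed.out
cat awk.out
```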

remove lines based on value of two columns

I have a huge file (my_file.txt) with ~ 8,000,000 lines that looks like this:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs374183434 0 NA -2.22383195384362
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
I want to find the duplicates based on the first three columns and then remove the line with the lower value in the 7th column. The first part I can accomplish with:
awk -F"\t" '!seen[$2, $3]++' my_file.txt
But I don't know how to do the part about removing the duplicate with a lower value, the desired output would be this one:
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
Speed is an issue so I could use awk, sed or another bash command
Thanks
$ awk '(i=$1 FS $2 FS $3) && !(i in seventh) || seventh[i] < $7 {seventh[i]=$7; all[i]=$0} END {for(i in all) print all[i]}' my_file.txt
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
1 13116 13116 rs62635286 0 NA -2.87540758021667
1 13118 13118 rs200579949 0 NA -2.87540758021667
1 13110 13110 rs540538026 0 NA -1.33177622457982
Thanks to #fedorqui for the advanced indexing. :D
Explained:
(i=$1 FS $2 FS $3) && !(i in seventh) || $7 > seventh[i] {
                     # set the index to the first 3 fields
                     # AND the index is not yet stored in the array
                     # OR the seventh field is greater than the previous value for the same index:
    seventh[i]=$7    # new biggest value
    all[i]=$0        # store that record
}
END {
    for(i in all)    # for all stored records with the biggest seventh value
        print all[i] # print them
}
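Putting the pieces together, a runnable sketch on three rows from the question (space-delimited here for the demo); the +0 coercions are a defensive addition to force numeric comparison of the negative values:

```shell
cat > my_file.txt <<'EOF'
1 13110 13110 rs540538026 0 NA -1.33177622457982
1 13013178 13013178 rs374183434 0 NA -2.22383195384362
1 13013178 13013178 rs11122075 0 NA -1.57404917386838
EOF

# Keep, per (col1,col2,col3) key, the row with the largest 7th column.
awk '
    { i = $1 FS $2 FS $3 }                         # composite key
    !(i in seventh) || $7 + 0 > seventh[i] + 0 {   # first sight, or bigger $7
        seventh[i] = $7
        all[i] = $0
    }
    END { for (i in all) print all[i] }' my_file.txt > dedup.out
cat dedup.out
```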

Print only '+' or '-' if string matches (with two conditions)

I would like to add two additional conditions to my current code: print '+' if in File2 field 5 is greater than 35 and field 7 is also greater than 90.
Code:
while read -r line
do
grep -q "$line" File2.txt && echo "$line +" || echo "$line -"
done < File1.txt
Input file 1:
HAPS_0001
HAPS_0002
HAPS_0005
HAPS_0006
HAPS_0007
HAPS_0008
HAPS_0009
HAPS_0010
Input file 2 (tab-delimited):
Query DEG_ID E-value Score %Identity %Positive %Matching_Len
HAPS_0001 protein:plasmid:149679 3.00E-67 645 45 59 91
HAPS_0002 protein:plasmid:139928 4.00E-99 924 34 50 85
HAPS_0005 protein:plasmid:134646 3.00E-98 915 38 55 91
HAPS_0006 protein:plasmid:111988 1.00E-32 345 33 54 86
HAPS_0007 - - 0 0 0 0
HAPS_0008 - - 0 0 0 0
HAPS_0009 - - 0 0 0 0
HAPS_0010 - - 0 0 0 0
Desired output (tab-delimited):
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
Thanks!
This should work:
$ awk '
BEGIN {FS = OFS = "\t"}
NR==FNR {if($5>35 && $7>90) a[$1]++; next}
{print (($1 in a) ? $0 FS "+" : $0 FS "-")}' f2 f1
HAPS_0001 +
HAPS_0002 -
HAPS_0005 +
HAPS_0006 -
HAPS_0007 -
HAPS_0008 -
HAPS_0009 -
HAPS_0010 -
join file1.txt <( tail -n +2 file2.txt) | awk '
$2 = ($5 > 35 && $7 > 90)?"+":"-" { print $1, $2 }'
You don't care about the second field in the output, so overwrite it with the appropriate sign for the output.
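A minimal, hedged recreation of the two-pass awk approach, with tiny made-up sample files (header row and tab delimiters omitted for brevity):

```shell
printf '%s\n' HAPS_0001 HAPS_0007 > File1.txt
cat > File2.txt <<'EOF'
HAPS_0001 p1 3e-67 645 45 59 91
HAPS_0007 - - 0 0 0 0
EOF

# Pass 1 (File2): record IDs meeting both thresholds ($5>35 and $7>90).
# Pass 2 (File1): flag each ID with + or - based on that lookup.
awk 'NR == FNR { if ($5 > 35 && $7 > 90) ok[$1]; next }
     { print $1, ($1 in ok ? "+" : "-") }' File2.txt File1.txt > flags.out
cat flags.out
```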

Shell script to find common values and write in particular pattern with subtraction math to range pattern

I want a shell script that finds the common values in two files, writes them to a new file in a range pattern, and has the first value of each range reduced by 1.
$ cat file1
2
3
4
6
7
8
10
12
13
16
20
21
22
23
27
30
$ cat file2
2
3
4
8
10
12
13
16
20
21
22
23
27
The script that works so far:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 |
sort |
awk 'NR==1 {s=l=$1; next}
     $1!=l+1 {if(l == s) print l; else print s ":" l; s=$1}
     {l=$1}
     END {if(l == s) print l; else print s ":" l}'
Script output:
2:4
8
10
12:13
16
20:23
27
Desired output:
1:4
8
10
11:13
16
19:23
27
Similar to sputnick's, except using comm to find the intersection of the file contents.
comm -12 <(sort file1) <(sort file2) |
sort -n |
awk '
function print_range() {
if (start != prev)
printf "%d:", start-1
print prev
}
FNR==1 {start=prev=$1; next}
$1 > prev+1 {print_range(); start=$1}
{prev=$1}
END {print_range()}
'
1:4
8
10
11:13
16
19:23
27
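The comm step on its own can be sketched like this (tiny made-up files; comm requires both inputs to be sorted, which is why the answer sorts first):

```shell
printf '%s\n' 2 3 4 8 > file1
printf '%s\n' 2 3 8 9 > file2
sort file1 > f1.sorted
sort file2 > f2.sorted
comm -12 f1.sorted f2.sorted > common.out   # -12: keep only lines in both
cat common.out
```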
Try doing this:
awk 'NR==FNR{x[$1]=1} NR!=FNR && x[$1]' file1 file2 |
sort |
awk 'NR==1 {s=l=$1; next}
     $1!=l+1 {if(l == s) print l; else print s-1 ":" l; s=$1}
     {l=$1}
     END {if(l == s) print l; else print s-1 ":" l}'
