Compare two file columns (unsorted files) - bash

Input File 1
A1 123 AA
B1 123 BB
C2 44 CC1
D1 12 DD1
E1 11 EE1
Input File 2
A sad21 1
DD1 124f2 2
CC 123tges 3
BB 124sdf 4
AA 1asrf 5
Output File
A1 123 AA 1asrf 5
B1 123 BB 124sdf 4
D1 12 DD1 124f2 2
Making of Output file
We check the 3rd column of Input File 1 against the 1st column of Input File 2.
If they match, we print the combined line in the Output file.
Note :
The files are not sorted
I tried :
join -t, A B | awk -F "\t" 'BEGIN{OFS="\t"} {if ($3==$4) print $1,$2,$3,$4,$6}'
But this does not work, as the files are unsorted, so the condition ($3==$4) won't hold all the time. Please help.

nawk 'FNR==NR{a[$3]=$0;next}{if($1 in a){p=$1;$1="";print a[p],$0}}' file1 file2
tested below:
> cat file1
A1 123 AA
B1 123 BB
C2 44 CC1
D1 12 DD1
E1 11 EE1
> cat file2
A sad21 1
DD1 124f2 2
CC 123tges 3
BB 124sdf 4
AA 1asrf 5
> awk 'FNR==NR{a[$3]=$0;next}{if($1 in a){p=$1;$1="";print a[p],$0}}' file1 file2
D1 12 DD1 124f2 2
B1 123 BB 124sdf 4
A1 123 AA 1asrf 5
>
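The FNR==NR two-pass idiom above can be verified end to end. Here is a self-contained sketch (the temp directory and printf-built files are demo scaffolding); note that instead of emptying $1, which leaves a stray leading separator in $0, this variant strips the key field with sub() so the output keeps single spaces:

```shell
# Recreate the question's inputs in a temp dir (demo scaffolding only).
dir=$(mktemp -d)
printf '%s\n' 'A1 123 AA' 'B1 123 BB' 'C2 44 CC1' 'D1 12 DD1' 'E1 11 EE1' > "$dir/file1"
printf '%s\n' 'A sad21 1' 'DD1 124f2 2' 'CC 123tges 3' 'BB 124sdf 4' 'AA 1asrf 5' > "$dir/file2"
# Pass 1 (FNR==NR holds only while reading file1): index each line by field 3.
# Pass 2: when field 1 of a file2 line is a known key, print the stored file1
# line followed by the rest of the file2 line (key field removed via sub()).
result=$(awk 'FNR==NR { a[$3] = $0; next }
              $1 in a { k = $1; sub(/^[^ ]+ /, ""); print a[k], $0 }' \
         "$dir/file1" "$dir/file2")
printf '%s\n' "$result"
rm -rf "$dir"
```

Output rows come out in file2 order, matching the transcript above.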

You can use join, but you need to sort on the key field first and tell join that the key in the first file is column 3 (-1 3):
join -1 3 <(sort -k 3,3 file1) <(sort file2)
This will get you the correct fields. Output (formatted with column -t):
AA A1 123 1asrf 5
BB B1 123 124sdf 4
DD1 D1 12 124f2 2
To get the same column ordering listed in the question, you need to specify the output format:
join -1 3 -o 1.1,1.2,1.3,2.2,2.3 <(sort -k 3,3 file1) <(sort file2)
i.e. file 1 fields 1 through 3 then file 2 fields 2 and 3. Output (again with column -t):
A1 123 AA 1asrf 5
B1 123 BB 124sdf 4
D1 12 DD1 124f2 2
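The same join pipeline can be written without process substitution, which is useful under plain sh; explicit sorted temp files stand in for the <(...) redirections (file names and the temp dir are demo scaffolding):

```shell
# Same data as the question (demo scaffolding).
dir=$(mktemp -d)
printf '%s\n' 'A1 123 AA' 'B1 123 BB' 'C2 44 CC1' 'D1 12 DD1' 'E1 11 EE1' > "$dir/file1"
printf '%s\n' 'A sad21 1' 'DD1 124f2 2' 'CC 123tges 3' 'BB 124sdf 4' 'AA 1asrf 5' > "$dir/file2"
# join needs both inputs sorted on the key: column 3 of file1, column 1 of file2.
sort -k 3,3 "$dir/file1" > "$dir/file1.sorted"
sort -k 1,1 "$dir/file2" > "$dir/file2.sorted"
# -1 3: key is field 3 of the first file; -o picks the output column order.
result=$(join -1 3 -o 1.1,1.2,1.3,2.2,2.3 "$dir/file1.sorted" "$dir/file2.sorted")
printf '%s\n' "$result"
rm -rf "$dir"
```

The rows come out in key order (AA, BB, DD1) because join emits matches as it walks the sorted inputs.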

perl -F'/\t/' -anle 'BEGIN{$f=1}if($f==1){$H{$F[2]}=$_;$f++ if eof}else{$l=$H{$F[0]};print join("\t",$l,@F[1..$#F]) if defined$l}' f1.txt f2.txt
or shorter
perl -F'/\t/' -anle'$f?($l=$H{$F[0]})&&print(join"\t",$l,@F[1..$#F]):($H{$F[2]}=$_);eof&&$f++' f1.txt f2.txt

One way using awk:
awk 'BEGIN { FS=OFS="\t" } FNR==NR { array[$1]=$2 OFS $3; next } { if ($3 in array) print $0, array[$3] }' file2.txt file1.txt
Results:
A1 123 AA 1asrf 5
B1 123 BB 124sdf 4
D1 12 DD1 124f2 2

This might work for you (GNU sed):
sed 's|\(\S*\)\(.*\)|/\\s\1$/s/$/\2/p|' file2 | sed -nf - file1
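No explanation came with that one-liner, so here is a hedged sketch of the trick: the first sed rewrites each file2 line into a sed command of the form /\sKEY$/s/$/ REST/p, and the second sed (-nf -) runs that generated script against file1, appending the remainder to any line ending in the key and printing it. Note that \s and \S are GNU sed extensions:

```shell
# Show the sed editing command generated from one file2 line (GNU sed).
gen=$(printf '%s\n' 'AA 1asrf 5' | sed 's|\(\S*\)\(.*\)|/\\s\1$/s/$/\2/p|')
printf '%s\n' "$gen"
```

Here \1 captures the key ("AA") and \2 the rest of the line, so the generated command appends " 1asrf 5" to every file1 line ending in whitespace followed by AA.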

Insert rows using awk

How can I insert a row using awk?
My file looks as:
1 43
2 34
3 65
4 75
I would like to insert three rows with "?" so that my desired file looks like:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
I am trying with the below script.
awk '{if(NR<=3){print "NR ?"}} {printf" " NR $2}' file.txt
Here's one way to do it:
$ awk 'BEGIN{s=" "; for(c=1; c<4; c++) print c s "?"}
{print c s $2; c++}' ip.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
$ awk 'BEGIN {printf "1 ?\n2 ?\n3 ?\n"} {printf "%d", $1 + 3; printf " %s\n", $2}' file.txt
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
You could also add the 3 lines before awk, e.g.:
{ seq 3; cat file.txt; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t'
Output:
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
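That seq+cat pipeline can be run end to end as follows (file.txt is recreated inline for the demo; the output is tab-separated because of OFS='\t'):

```shell
# Recreate the sample input (demo scaffolding).
tmp=$(mktemp)
printf '%s\n' '1 43' '2 34' '3 65' '4 75' > "$tmp"
# seq 3 supplies three one-field lines; for NR <= 3 a "?" is appended as $2.
# The assignment $1 = NR renumbers every line and, being non-zero, also acts
# as a true pattern, so every rebuilt record is printed with tab as OFS.
result=$( { seq 3; cat "$tmp"; } | awk 'NR <= 3 { $2 = "?" } $1 = NR' OFS='\t' )
printf '%s\n' "$result"
rm -f "$tmp"
```

Assigning to a field forces awk to rebuild the record with OFS, which is why the whole output comes out tab-delimited.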
I would do it the following way using GNU AWK. Let file.txt content be
1 43
2 34
3 65
4 75
then
awk 'BEGIN{OFS=" "}NR==1{print 1,"?";print 2,"?";print 3,"?"}{print NR+3,$2}' file.txt
output
1 ?
2 ?
3 ?
4 43
5 34
6 65
7 75
Explanation: I set the output field separator (OFS) to a single space. For the 1st row I additionally print three lines, each consisting of a sequential number and ? separated by the output field separator. You might elect to do this using a for loop, especially if you expect the requirement to change here. For every line I print the row number plus 3 (to keep the numbering in order) and the 2nd column ($2). Thanks to the use of OFS, you would need to make only one change if the requirement regarding the separator were altered. Note that a construct like
{if(condition){dosomething}}
might be written in GNU AWK in more concise manner as
(condition){dosomething}
(tested in gawk 4.2.1)

Bash: Output columns from array consisting of two columns

Problem
I am writing a bash script and I have an array, where each value consists of two columns. It looks like this:
for i in "${res[@]}"; do
echo "$i"
done
#Stream1
0 a1
1 b1
2 c1
4 d1
6 e1
#Stream2
0 a2
1 b2
3 c2
4 d2
9 f2
...
I would like to combine the output from this array into a larger table, and multiplex the indices. Furthermore, I would like to format the top row by inserting comment #Sec.
I would like the result to be something like this:
#Sec Stream1 Stream2
0 a1 a2
1 b1 b2
2 c1
3 c2
4 d1 d2
6 e1
9 f2
The insertion of #Sec and the removal of the # before the Stream keyword are not necessary, but desired if not too difficult.
Tried Solutions
I have tried piping to column and awk, but have not been able to produce the desired results.
EDIT
res is an array in a bash script. It is quite large, so I will only provide a short selection. Running echo "$(typeset -p res)" produces the following output:
declare -a res='([1]="#Stream1
0 3072
1 6144
2 5120
3 1024
5 6144
..." [2]="#Stream2
0 3072
1 5120
2 4096
3 3072
53 3072
55 1024
57 2048")'
As for the 'result', my initial intention was to assign the resulting table to a variable and use it in another awk script to calculate the moving averages for specified indices, and plot the results. This will be done for ~20 different files. However I am open to other solutions.
The number of streams may vary from 10 to 50, with each stream having from 100 to 300 rows.
You may use this awk solution:
cat tabulate.awk
NF == 1 {
h = h OFS substr($1, 2)
++numSec
next
}
{
keys[$1]
map[$1,numSec] = $2
}
END {
print h
for (k in keys) {
printf "%s", k
for (i=1; i<=numSec; ++i)
printf "\t%s", map[k,i]
print ""
}
}
Then use it as:
awk -v OFS='\t' -v h='#Sec' -f tabulate.awk file
#Sec Stream1 Stream2
0 a1 a2
1 b1 b2
2 c1
3 c2
4 d1 d2
6 e1
9 f2
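One caveat worth hedging: for (k in keys) traverses the array in an unspecified order in POSIX awk, so the sorted output above is not guaranteed (GNU awk users can force it with PROCINFO["sorted_in"] = "@ind_num_asc" in the END block). A portable fix is to sort the data rows after awk while keeping the header line first; the printf-built rows below merely simulate hash-order output:

```shell
# Simulated out-of-order tabulate.awk output, then: read off the header line,
# print it, and numerically sort the remaining rows.
result=$(printf '%s\n' '#Sec Stream1 Stream2' '4 d1 d2' '0 a1 a2' '9 f2' |
  { IFS= read -r header; printf '%s\n' "$header"; sort -n; })
printf '%s\n' "$result"
```

In the real pipeline you would append | { IFS= read -r h; printf '%s\n' "$h"; sort -n; } after the awk invocation.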

How to modify a bash script after each run?

The following script1.sh uses the 1st (key) and 2nd (value) columns of every file and prints some output based on some code in other_scripts.sh. script2.sh simply runs script1.sh and other_scripts.sh together.
Now, is it possible to extend the same process to the 1st and 3rd columns (output.n2) and repeat the process for the remaining columns (output.nn)?
Note: I have around 20k columns. Every file has exactly the same number of columns.
$ cat file1
s n1 n2 n3
s1 2 3 4
s2 3 4 5
s3 0 1 4
s4 9 8 7
$ cat file2
s n1 n2 n3
s1 12 13 14
s2 13 14 15
s3 10 11 14
s4 19 18 17
$ cat file3
s n1 n2 n3
s1 12 33 44
s2 13 43 54
s3 10 13 44
s4 19 83 74
$ cat filen
s n1 n2 n3
s1 25 33 40
s2 35 43 50
s3 50 13 40
s4 95 83 70
script1.sh
awk '{print $1"\t"$2}' file1 | awk '{print $1"\t""file1""\t"$2}' >> r.1
awk '{print $1"\t"$2}' file2 | awk '{print $1"\t""file2""\t"$2}' >> r.1
awk '{print $1"\t"$2}' file3 | awk '{print $1"\t""file3""\t"$2}' >> r.1
awk '{print $1"\t"$2}' filen | awk '{print $1"\t""filen""\t"$2}' >> r.1
other_scripts.sh
grep file r.1 |awk '{print $1"\t"$2"\t"$3*100}' > output.n1
rm r.1
script2.sh
sh script1.sh
sh other_scripts.sh
output.n1
s1 file1 200
s2 file1 300
s3 file1 0
s4 file1 900
s1 file2 1200
s2 file2 1300
s3 file2 1000
s4 file2 1900
s1 file3 1200
s2 file3 1300
s3 file3 1000
s4 file3 1900
s1 filen 2500
s2 filen 3500
s3 filen 5000
s4 filen 9500
Try this script. It reproduces exactly the output you required with the input you provided.
#!/bin/bash
NUMBER_OF_COLUMNS=3
NUMBER_OF_FILES=4 # Assuming the files are like file{n}
for coln in `seq 1 $NUMBER_OF_COLUMNS`; do
for filen in `seq 1 $NUMBER_OF_FILES`; do
awk -v n=$coln -v filen=$filen 'NR>1{printf"%s\tfile%i\t%s\n", $1, filen, $(1+n)*100}' file$filen >> output.$coln
done
done
Copy it into a file, make it executable (chmod +x <name of the file>) and run it (./<name of the file>) inside the directory containing all your files. Remember to set NUMBER_OF_COLUMNS and NUMBER_OF_FILES to the right values.
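With ~20k columns, the nested loops above re-read every file once per column. A hedged one-pass sketch (my output names output.1, output.2, ... differ from the question's output.n1 convention) reads each file once and fans every column out to its own file; very wide files may need gawk or periodic close() calls to stay under the open-file limit:

```shell
# Demo scaffolding: two small input files in a temp dir.
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 's n1 n2 n3' 's1 2 3 4' 's2 3 4 5'       > file1
printf '%s\n' 's n1 n2 n3' 's1 12 13 14' 's2 13 14 15' > file2
# Skip each header (FNR > 1); send column c of every file to output.(c-1),
# scaled by 100 as in other_scripts.sh.
awk 'FNR > 1 { for (c = 2; c <= NF; c++)
                 printf "%s\tfile%s\t%s\n", $1, "", "" }' /dev/null
awk 'FNR > 1 { for (c = 2; c <= NF; c++)
                 printf "%s\t%s\t%s\n", $1, FILENAME, $c * 100 > ("output." (c-1)) }' \
    file1 file2
result=$(cat output.1)
printf '%s\n' "$result"
cd / && rm -rf "$dir"
```

Each output.X receives its rows grouped by source file, matching the order of the original script.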

Awk - Control when my $# variables are expanded to merge two files with variable number of columns

My bash script is calling an awk script that nicely merges two files:
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t"}
FNR==NR{hash1['"\$${mapfieldfile2}"']=$1 FS $3 FS $4 FS $5 FS $6;next}
('"\$${mapfieldfile1}"' in hash1){ print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
However I want a more general version, where I don't have to hardcode the columns I want to print; I simply want to print everything except my id column. Replacing $1 FS $3 FS $4 FS $5 FS $6 with $0 "almost" does the job, except that it repeats the id column. I have been trying to dynamically create a string similar to $1 FS $3 FS $4 FS $5 FS $6, but I am getting the literal strings $1 $3 $4 $5 $6 in the merged file, as opposed to their expanded values. There are also smaller side effects: I am adding a tab in the middle and losing some headers. Below are the code and example files.
I would like to find the solution to my merge and also understand what I am doing wrong and why my variables are not expanding.
I appreciate any help!
mapfieldfile1=1
mapfieldfile2=2
awk -v FS="\t" 'BEGIN {OFS="\t";strfields=""}
FNR==NR{for(i=1;i<=NF;i++) if(i!='"${mapfieldfile2}"') {strfields=strfields" "FS" $"i};
hash1['"\$${mapfieldfile2}"']=strfields;strfields="";next}
('"\$${mapfieldfile1}"' in hash1){print $0, hash1['"\$${mapfieldfile1}"']}' file2 file1
$ cat file1
sampleid s1 s2 s3 s4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
$ cat file2
a0 sampleid a1 a2 a3 a4
a0 1 a a a a4
a0 2 b b b a4
a0 3 c c c a4
a0 5 e e e a4
$ cat first_code_result.txt (good one!)
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
$ cat second_code_result.txt
sampleid s1 s2 s3 s4 $1 $3 $4 $5 $6
1 1 1 1 1 $1 $3 $4 $5 $6
2 2 2 2 2 $1 $3 $4 $5 $6
3 3 3 3 3 $1 $3 $4 $5 $6
Try this (untested):
awk -v mf1="$mapfieldfile1" -v mf2="$mapfieldfile2" '
BEGIN {FS=OFS="\t"}
FNR==NR{sub(/\t[^\t]+/,""); hash1[$mf2]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
Don't let shell variables expand within awk scripts; use a regexp to remove fields from the record. I don't know why the script you haven't shown us is printing literal $3, etc., but you must be including them in a string. You'd have to post that script for help debugging it.
Check where mf1 vs mf2 should appear, I got confused reading your scripts.
EDIT: I had to tweak it, as the version above was deleting $2 before using it:
$ awk -v mf1="1" -v mf2="2" '
BEGIN {FS=OFS="\t"}
FNR==NR{key=$mf2; sub(/\t[^\t]+/,""); hash1[key]=$0; next}
($mf1 in hash1){ print $0, hash1[$mf1]}
' file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
Note that the sub() above relies on the key field being $2 and FS being a tab. If you need a more general solution let us know.
Here's a version that'll do what you want for any key field values and will work in any awk, it just requires the FS to be a tab or some other fixed string (i.e. not a regexp):
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
key = $mf2
val = ""
nf = 0
for (i=1; i<=NF; i++) {
if (i != mf2) {
val = (nf++ ? val FS : "") $i
}
}
hash1[key] = val
next
}
$mf1 in hash1 { print $0, hash1[$mf1] }
$ awk -v mf1="1" -v mf2="2" -f tst.awk file2 file1
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
If your files are already sorted, the default output of join is what you want:
$ join -t$'\t' -11 -22 file1 file2
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4
or, after prettying with column
$ join -t$'\t' -11 -22 file1 file2 | column -t
sampleid s1 s2 s3 s4 a0 a1 a2 a3 a4
1 1 1 1 1 a0 a a a a4
2 2 2 2 2 a0 b b b a4
3 3 3 3 3 a0 c c c a4

Mixing several files column by column in bash

I would like to merge four .txt files into a single file. However, the idea is not a simple concatenation but an 'interlacement' of the input files: file1 supplies the first three columns, and files 2-4 are then pasted column by column in subsequent order. Thus we have:
file1:
file1 <- ' AX-1 1 125
AX-2 2 456
AX-3 3 3445'
file1 <- read.table(text=file1, header=F)
write.table(file1, "file1.txt", col.names=F, row.names=F, quote=F)
file2:
file2 <- ' AX-1 AA AB AA
AX-2 AA AA AB
AX-3 BB NA AB'
file2 <- read.table(text=file2, header=F)
write.table(file2, "file2.txt", col.names=F, row.names=F, quote=F)
file3:
file3 <- ' AX-1 0.20 -0.89 0.005
AX-2 0 -0.56 -0.003
AX-3 1.2 0.002 0.005'
file3 <- read.table(text=file3, header=F)
write.table(file3, "file3.txt", col.names=F, row.names=F, quote=F)
file4:
file4 <- ' AX-1 1 0 0.56
AX-2 0 0.56 0
AX-3 1 0 0.55'
file4 <- read.table(text=file4, header=F)
write.table(file4, "file4.txt", col.names=F, row.names=F, quote=F)
Where my expected out file could be something like:
out <- 'AX-1 1 125 AA 0.2 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AA 0.005 0.55'
out <- read.table(text=out, header=F)
write.table(out, "out.txt", col.names=F, row.names=F, quote=F)
Thus, in the out file: columns 1-3 are file1, columns 4, 7 and 10 come from file2, columns 5, 8 and 11 come from file3, and columns 6, 9 and 12 come from file4.
I have an idea of how to do it in R, but my original files are too large and it would take a lot of time. I would be grateful if someone had an idea of how to perform it directly in bash.
This should work (here a1..a4 are the question's file1..file4, which are already sorted on the join key):
$ join a1 a2 | join - a3 | join - a4 | awk '{printf "%s %s %s %s %s %s %s %s %s %s %s %s\n", $1, $2, $3, $4, $7, $10, $5, $8, $11, $6, $9, $12}'
AX-1 1 125 AA 0.20 1 AB -0.89 0 AA 0.005 0.56
AX-2 2 456 AA 0 0 AA -0.56 0.56 AB -0.003 0
AX-3 3 3445 BB 1.2 1 NA 0.002 0 AB 0.005 0.55
Try this:
paste file1 file2 file3 file4 | awk '{ print $1 " " $2 " " $3 " " $5 " " $9 " " $13 " " $6 " " $10 " " $14 " " $7 " " $11 " " $15 }'
This works only if your files have their rows in the same order; otherwise the join suggested by Mauro is the better choice.
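The paste-based variant can be checked end to end against the question's data (the temp directory and printf-built files are demo scaffolding):

```shell
# Recreate file1..file4 from the question (demo scaffolding).
dir=$(mktemp -d)
printf '%s\n' 'AX-1 1 125'  'AX-2 2 456'  'AX-3 3 3445'                      > "$dir/file1"
printf '%s\n' 'AX-1 AA AB AA' 'AX-2 AA AA AB' 'AX-3 BB NA AB'                > "$dir/file2"
printf '%s\n' 'AX-1 0.20 -0.89 0.005' 'AX-2 0 -0.56 -0.003' 'AX-3 1.2 0.002 0.005' > "$dir/file3"
printf '%s\n' 'AX-1 1 0 0.56' 'AX-2 0 0.56 0' 'AX-3 1 0 0.55'                > "$dir/file4"
# paste glues the four files side by side (15 whitespace-separated fields);
# awk then picks the interleaved order: the file1 triple, then column 2 of
# each of files 2-4, then column 3 of each, then column 4 of each.
result=$(cd "$dir" && paste file1 file2 file3 file4 |
  awk '{ print $1, $2, $3, $5, $9, $13, $6, $10, $14, $7, $11, $15 }')
printf '%s\n' "$result"
rm -rf "$dir"
```

Note the third output row ends in AB (from file2's third row), matching the join answer above; the AA in the question's expected output appears to be a typo.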
