awk merge two columns by key, joining values - bash

These are my two input files:
file1.txt
1 34
2 55
3 44
6 77
file2.txt
1 12
2 7
5 32
And I wish my output to be:
1 34 12
2 55 0
3 44 0
5 0 32
6 77 0
I need to do this in awk and although I was able to merge files, I do not know how to do it without losing info...
awk -F"\t" 'NR==FNR {h[$1] = $2; next }{print $1,$2,h[$2]}' file1.txt file2.txt > try.txt
awk '{ if ($3 !="") print $1,$2,$3; else print $1,$2,"0";}' try.txt > output.txt
And the output is:
1 34 12
2 55 7
3 44 0
6 77 0
Sorry, I know this must be very easy, but I am quite new in this world! Please I need help!!! Thanks in advance!!

This command gives you the desired output:
awk 'NR==FNR{a[$1]=$2;next}
{if($1 in a){print $0,a[$1];delete a[$1]}
else print $0,"0"}
END{for(x in a)print x,"0",a[x]}' file2 file1|sort -n|column -t
Note that I used sort and column to sort and format the output.
Output (I guess the 2 55 0 in your expected output was a typo):
1 34 12
2 55 7
3 44 0
5 0 32
6 77 0
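The same merge can also be sketched without the delete trick, by recording every key and looking up each file's value at the end. This is only an illustration: merge_by_key is a made-up helper name, and the sketch assumes both files are non-empty and that each key appears at most once per file.

```shell
# merge_by_key FILE1 FILE2: hypothetical helper, not from the answer above.
# Assumes both files are non-empty "key value" lines with unique keys per file.
merge_by_key() {
    awk '
        FNR == 1 { fileNo++ }                 # 1 while reading FILE1, 2 for FILE2
        { val[$1, fileNo] = $2; keys[$1] }    # remember the value and the key
        END {
            for (k in keys)                   # unordered, hence the sort below
                print k, ((k, 1) in val ? val[k, 1] : 0), ((k, 2) in val ? val[k, 2] : 0)
        }
    ' "$1" "$2" | sort -n
}

tmp=$(mktemp -d)
printf '1 34\n2 55\n3 44\n6 77\n' > "$tmp/file1.txt"
printf '1 12\n2 7\n5 32\n'        > "$tmp/file2.txt"
merge_by_key "$tmp/file1.txt" "$tmp/file2.txt"
rm -rf "$tmp"
```

With the sample files from the question this should print the same five lines as the output above, separated by single spaces.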

Here is another way using join and awk:
join -a1 -a2 -o1.1 2.1 1.2 2.2 -e0 file1 file2 | awk '{print ($1?$1:$2),$3,$4}' OFS='\t'
1 34 12
2 55 7
3 44 0
5 0 32
6 77 0
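-a also prints the unpairable lines from the given file (here from both files, via -a1 -a2).
-o specifies the output format, as a list of file.field entries.
-e specifies what should be printed in place of fields that do not exist.
awk just completes the final formatting: it prints whichever of the two key columns is non-zero, followed by the two values.
Note that join expects both files to be sorted on the join field.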
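As a variation, the join field itself can be requested with field 0 in the -o list, which makes the awk step unnecessary. The sketch below is an illustration under assumptions: outer_join is a hypothetical wrapper name, and it sorts copies of the inputs first because join needs them sorted in the same collation.

```shell
# outer_join FILE1 FILE2: hypothetical wrapper, not from the answer above.
outer_join() {
    tmpdir=$(mktemp -d) || return 1
    # join needs both inputs sorted on the key with the same collation
    LC_ALL=C sort -k1,1 "$1" > "$tmpdir/a"
    LC_ALL=C sort -k1,1 "$2" > "$tmpdir/b"
    # -o 0 prints the join field, whichever file the line came from;
    # -e 0 fills the fields that are missing on unpaired lines
    LC_ALL=C join -a1 -a2 -e 0 -o 0,1.2,2.2 "$tmpdir/a" "$tmpdir/b" | sort -n
    rm -rf "$tmpdir"
}

tmp=$(mktemp -d)
printf '1 34\n2 55\n3 44\n6 77\n' > "$tmp/file1"
printf '1 12\n2 7\n5 32\n'        > "$tmp/file2"
outer_join "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

The trailing sort -n only restores numeric key order, since the inputs are sorted lexically for join.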

Related

Reshape table and complete voids with NA (or -999) using bash

I'm trying to create a table based on the ASCII below. What I need is to arrange the numbers from the 2nd column in a matrix. The first and third columns of the ASCII give columns and rows in the new matrix. The new matrix needs to be fully populated, so it is necessary to complete missing positions on the new table with NA (or -999).
This is what I have
$ cat infile.txt
1 68 2
1 182 3
1 797 4
2 4 1
2 70 2
2 339 3
2 1396 4
3 12 1
3 355 3
3 1854 4
4 7 1
4 85 2
4 333 3
5 9 1
5 68 2
5 182 3
5 922 4
6 10 1
6 70 2
and what I would like to have:
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
I can only use standard UNIX commands (e.g. awk, sed, grep, etc).
So What I have so far...
I can mimic a 2d array in bash
irows=(`awk '{print $1 }' infile.txt`) # rows positions
jcols=(`awk '{print $3 }' infile.txt`) # columns positions
values=(`awk '{print $2 }' infile.txt`) # values
declare -A matrix # the new matrix
nrows=(`sort -k3 -n in.txt | tail -1 | awk '{print $3}'`) # numbers of rows
ncols=(`sort -k1 -n in.txt | tail -1 | awk '{print $1}'`) # numbers of columns
nelem=(`echo "${#values[@]}"`) # number of elements I want to pass to the new matrix
# Creating a matrix (i,j) with -999
for ((i=0;i<=$((nrows-1));i++)) do
for ((j=0;j<=$((ncols-1));j++)) do
matrix[$i,$j]=-999
done
done
and even print on the screen
for ((i=0;i<=$((nrows-1));i++)) do
for ((j=0;j<=$((ncols-1));j++)) do
printf " %i" ${matrix[$i,$j]}
done
echo
done
But when I tried to assign the elements, something goes wrong
for ((i=0;i<=$((nelem-1));i++)) do
matrix[${irows[$i]},${jcols[$i]}]=${values[$i]}
done
Thanks in advance for any help with this, really.
A solution in plain bash that simulates a 2D array with an associative array could look like this (note that row and column counts are not hard-coded and the code works with any permutation of the input lines, provided that each line has the format specified in the question):
$ cat printmat
#!/bin/bash
declare -A mat
nrow=0
ncol=0
while read -r col elem row; do
mat[$row,$col]=$elem
if ((row > nrow)); then nrow=$row; fi
if ((col > ncol)); then ncol=$col; fi
done
for ((row = 1; row <= nrow; ++row)); do
for ((col = 1; col <= ncol; ++col)); do
elem=${mat[$row,$col]}
if [[ -z $elem ]]; then elem=NA; fi
if ((col == ncol)); then elem+=$'\n'; else elem+=$'\t'; fi
printf "%s" "$elem"
done
done
$ ./printmat < infile.txt prints out
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
Any time you find yourself writing a loop in shell just to manipulate text, you have the wrong approach. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for many of the reasons why.
Using any awk in any shell on every UNIX box:
$ cat tst.awk
{
vals[$3,$1] = $2
numRows = ($3 > numRows ? $3 : numRows)
numCols = ($1 > numCols ? $1 : numCols)
}
END {
OFS = "\t"
for (rowNr=1; rowNr<=numRows; rowNr++) {
for (colNr=1; colNr<=numCols; colNr++) {
val = ((rowNr,colNr) in vals ? vals[rowNr,colNr] : "NA")
printf "%s%s", val, (colNr < numCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk infile.txt
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
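For readers new to awk's multi-subscript arrays: a subscript like [$3,$1] is stored as one string, with the parts joined by the built-in SUBSEP variable, and (row,col) in array tests for a cell without creating it. A small sketch of that mechanism (list_cells is a hypothetical helper name used only for illustration):

```shell
# list_cells: hypothetical helper that reads "col value row" lines on stdin.
list_cells() {
    awk '
        { vals[$3, $1] = $2 }               # the key is really $3 SUBSEP $1
        END {
            for (k in vals) {
                split(k, idx, SUBSEP)       # recover the two subscripts
                print "row " idx[1] " col " idx[2] " = " vals[k]
            }
            # "in" checks for a cell without creating it
            print "row 1 col 1 " (((1, 1) in vals) ? "present" : "missing")
        }
    ' | sort
}

printf '1 68 2\n2 4 1\n' | list_cells
```

The sort at the end is there only because the order of a for (k in vals) loop is unspecified.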
Here is one way to get you started. Note that this is not intended to be "the" answer but to encourage you to try to learn the toolkit.
$ join -a1 -e NA -o2.2 <(printf "%s\n" {1..4}"_"{1..6}) \
<(awk '{print $3"_"$1,$2}' file | sort -n) |
pr -6at
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
This works; however, the row and column counts are hard-coded, which is not the proper way to do it.
The preferred solution is to fill an awk 2D array with the data and print it in matrix form at the end.

How to use this awk command without affecting the header

Good night. I have these two files:
File 1 - with phenotype information; the first column holds the IDs, and the original file has 400 rows:
ID a b c d
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - with SNPs information, the original file has 400 lines and 42,000 characters per line.
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
217 2 0 2 1
218 0 2 0 2
And I need to remove from file 2 individuals that do not appear in the file 1, for example:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
I used this code:
awk 'NR==FNR{a[$1]; next}$1 in a{print $0}' file2 file1 > file3
and I can get this output(file 3):
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
but I lose the header, how do I not lose the header?
To keep the header of the second file, add a condition{action} like this:
awk 'NR==FNR {a[$1]; next}
FNR==1 {print $0; next} # <= this will print the header of file2.
$1 in a {print $0}' file1 file2
NR holds the total number of records read so far, while FNR is the record number within the file currently being processed, so NR==FNR is true only while the first file is being read. The next statements are also important: they skip to the next record so that the remaining actions are not tried.
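To see the two counters side by side, a throwaway sketch like the following can help. The helper name show_counters and the tiny sample files are made up for illustration; only the first column of each line is echoed.

```shell
# show_counters FILE...: hypothetical helper that prints NR and FNR per line.
show_counters() {
    awk '{ print "NR=" NR, "FNR=" FNR, (NR == FNR ? "first-file" : "later-file"), $1 }' "$@"
}

tmp=$(mktemp -d)
printf 'ID\n215\n'      > "$tmp/file1"
printf 'ID\n215\n217\n' > "$tmp/file2"
show_counters "$tmp/file1" "$tmp/file2"
rm -rf "$tmp"
```

FNR drops back to 1 on the header of the second file while NR keeps counting, which is what the FNR==1 rule relies on. One caveat: if the first file is empty, NR==FNR also holds for the second file.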

filtering fields based on certain values

I wish you all a very happy New Year.
I have a file that looks like this(example): There is no header and this file has about 10000 such rows
123 345 676 58 1
464 222 0 0 1
555 22 888 555 1
777 333 676 0 1
555 444 0 58 1
PROBLEM: I only want those rows where both field 3 and 4 have a non zero value i.e. in the above example row 1 & row 3 should be included and rest should be excluded. How can I do this?
The output should look like this:
123 345 676 58 1
555 22 888 555 1
Thanks.
awk is perfect for this kind of stuff:
awk '$3 && $4' input.txt
This will give you the output that you want.
$3 && $4 is a filter. $3 is the value of the 3rd field and $4 is the value of the fourth. A value of 0 (or an empty field) is evaluated as false; anything else is evaluated as true. If negative values can occur and should be excluded as well, then you need to be more precise:
awk '$3 > 0 && $4 > 0' input.txt
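The difference between the two filters only shows up with negative numbers. A small sketch with one invented negative row (the third sample line is not from the question) makes this visible; nonzero_rows, positive_rows and sample are hypothetical names.

```shell
# Two hypothetical wrappers around the filters from the answer above.
nonzero_rows()  { awk '$3 && $4' "$@"; }          # keeps any non-zero value
positive_rows() { awk '$3 > 0 && $4 > 0' "$@"; }  # keeps strictly positive values

# Sample data; the row with -676 is invented for the demonstration.
sample() { printf '123 345 676 58 1\n464 222 0 0 1\n777 333 -676 58 1\n'; }

echo "non-zero:"; sample | nonzero_rows
echo "positive:"; sample | positive_rows
```

The first filter keeps the row containing -676, the second drops it.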

printing selected rows from a file using awk

I have a text file with data in the following format.
1 0 0
2 512 6
3 992 12
4 1536 18
5 2016 24
6 2560 29
7 3040 35
8 3552 41
9 4064 47
10 4576 53
11 5088 59
12 5600 65
13 6080 71
14 6592 77
15 7104 83
I want to print all the lines where $1 > 1000.
awk 'BEGIN {$1 > 1000} {print " " $1 " "$2 " "$3}' graph_data_tmp.txt
This doesn't seem to give the output that I am expecting. What am I doing wrong?
You can do this:
awk '$1>1000 {print $0}' graph_data_tmp.txt
print $0 prints the whole line.
If you instead want to print the lines after the 1000th line (row), replace $1 with NR. NR is the number of the current record (row).
awk 'NR>1000 {print $0}' graph_data_tmp.txt
All you need is:
awk '$1>1000' file
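As for why the original attempt printed everything: a BEGIN block runs once before any input is read, so a comparison placed there is evaluated against an empty record and its result is thrown away, and the following action without a pattern then runs for every line. The condition has to be the pattern in front of the action. A sketch with the column and threshold passed in as awk variables (rows_over is a hypothetical helper; filtering on column 2 is an assumption, since the sample's first column never exceeds 15):

```shell
# rows_over COL MIN: hypothetical helper, prints stdin lines whose field COL > MIN.
rows_over() {
    # a pattern with no action prints the matching line
    awk -v col="$1" -v min="$2" '$col > min'
}

printf '3 992 12\n4 1536 18\n5 2016 24\n' | rows_over 2 1000
```

Here only the lines starting with 4 and 5 pass the filter.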

use bash to combine values of the same name

Input file:
AAA 2 3 4 5
BBB 3 4 5
AAA 23 21 34
BBB 4 5 62
I want the output to be:
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
I feel that I should use awk and sed but not sure how to realize it. Does anyone have any good ideas? Thanks.
This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]* \)\(.*\)\n\1/\1\2 /;ta;P;D'
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
Or with GNU awk (asort is a gawk extension):
awk '{if($1 in a){line=$0;sub(/[^ ]* /,"",line);a[$1]=a[$1]" "line;next};a[$1]=$0}END{n=asort(a);for(i=1;i<=n;i++)print a[i]}' file
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
Here is an awk one-liner to solve the above problem (the order of the output lines is not guaranteed; pipe the result through sort if it matters):
awk '{line=$2;for(i=3; i<=NF; i++) line=line " " $i; arr[$1]=arr[$1] " " line} END{for (val in arr) print val arr[val]}' file
Using bash version 4's associative arrays
$ declare -A vals
$ while read key nums; do vals[$key]+="$nums "; done < filename
$ for key in "${!vals[@]}"; do printf "%s %s\n" "$key" "${vals[$key]}"; done
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
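Neither a for (key in array) loop in awk nor a bash associative array guarantees any particular output order. If the names should come out in the order they first appear, one option is to keep a separate list of keys. A sketch in portable awk (combine_by_name is a hypothetical helper name):

```shell
# combine_by_name [FILE...]: hypothetical helper; joins the values of lines
# that share the same first field, keeping first-seen order of the names.
combine_by_name() {
    awk '
        {
            key = $1
            if (!seen[key]++) order[++n] = key        # remember first appearance
            for (i = 2; i <= NF; i++)
                rest[key] = rest[key] " " $i          # append each value
        }
        END { for (j = 1; j <= n; j++) print order[j] rest[order[j]] }
    ' "$@"
}

printf 'AAA 2 3 4 5\nBBB 3 4 5\nAAA 23 21 34\nBBB 4 5 62\n' | combine_by_name
```

Because every value is appended with a leading space, the output has single spaces throughout and no trailing blank.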

Resources