Input file:
AAA 2 3 4 5
BBB 3 4 5
AAA 23 21 34
BBB 4 5 62
I want the output to be:
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
I feel that I should use awk and sed but not sure how to realize it. Does anyone have any good ideas? Thanks.
This might work for you:
sort -sk1,1 file | sed ':a;$!N;s/^\([^ ]* \)\(.*\)\n\1/\1\2/;ta;P;D'
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
or gnu awk;
awk '{if($1 in a){line=$0;sub(/[^ ]* /,"",line);a[$1]=a[$1]line;next};a[$1]=$0}END{n=asort(a);for(i=1;i<=n;i++)print a[i]}' file
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
Here is an awk 1 liner to solve above problem:
awk '{line=$2;for(i=3; i<=NF; i++) line=line " " $i; arr[$1]=arr[$1] " " line} END{for (val in arr) print val, arr[val]}' file
Using bash version 4's associative arrays
$ declare -A vals
$ while read key nums; do vals[$key]+="$nums "; done < filename
$ for key in "${!vals[#]}"; do printf "%s %s\n" "$key" "${vals[$key]}"; done
AAA 2 3 4 5 23 21 34
BBB 3 4 5 4 5 62
Related
I'm trying to create a table based on the ASCII bellow. What I need is to arrange the numbers from the 2nd column in a matrix. The first and third columns of the ASCII give columns and rows in the new matrix. The new matrix needs to be fully populated, so it is necessary to complete missing positions on the new table with NA (or -999).
This is what I have
$ cat infile.txt
1 68 2
1 182 3
1 797 4
2 4 1
2 70 2
2 339 3
2 1396 4
3 12 1
3 355 3
3 1854 4
4 7 1
4 85 2
4 333 3
5 9 1
5 68 2
5 182 3
5 922 4
6 10 1
6 70 2
and what I would like to have:
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
I can only use standard UNIX commands (e.g. awk, sed, grep, etc).
So What I have so far...
I can mimic a 2d array in bash
irows=(`awk '{print $1 }' infile.txt`) # rows positions
jcols=(`awk '{print $3 }' infile.txt`) # columns positions
values=(`awk '{print $2 }' infile.txt`) # values
declare -A matrix # the new matrix
nrows=(`sort -k3 -n in.txt | tail -1 | awk '{print $3}'`) # numbers of rows
ncols=(`sort -k1 -n in.txt | tail -1 | awk '{print $1}'`) # numbers of columns
nelem=(`echo "${#values[#]}"`) # number of elements I want to pass to the new matrix
# Creating a matrix (i,j) with -999
for ((i=0;i<=$((nrows-1));i++)) do
for ((j=0;j<=$((ncols-1));j++)) do
matrix[$i,$j]=-999
done
done
and even print on the screen
for ((i=0;i<=$((nrows-1));i++)) do
for ((j=0;j<=$((ncols-1));j++)) do
printf " %i" ${matrix[$i,$j]}
done
echo
done
But when I tried to assign the elements, something gets wrong
for ((i=0;i<=$((nelem-1));i++)) do
matrix[${irows[$i]},${jcols[$i]}]=${values[$i]}
done
Thanks in advance for any help with this, really.
A solution in plain bash by simulating a 2D array with an associative array could be something like that (Notice that row and column counts are not hard coded and the code works with any permutation of input lines provided that each line has the format specified in the question):
$ cat printmat
#!/bin/bash
declare -A mat
nrow=0
ncol=0
while read -r col elem row; do
mat[$row,$col]=$elem
if ((row > nrow)); then nrow=$row; fi
if ((col > ncol)); then ncol=$col; fi
done
for ((row = 1; row <= nrow; ++row)); do
for ((col = 1; col <= ncol; ++col)); do
elem=${mat[$row,$col]}
if [[ -z $elem ]]; then elem=NA; fi
if ((col == ncol)); then elem+=$'\n'; else elem+=$'\t'; fi
printf "%s" "$elem"
done
done
$ ./printmat < infile.txt prints out
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
Any time you find yourself writing a loop in shell just to manipulate text you have the wrong approcah. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for many of the reasons why.
Using any awk in any shell on every UNIX box:
$ cat tst.awk
{
vals[$3,$1] = $2
numRows = ($3 > numRows ? $3 : numRows)
numCols = $1
}
END {
OFS = "\t"
for (rowNr=1; rowNr<=numRows; rowNr++) {
for (colNr=1; colNr<=numCols; colNr++) {
val = ((rowNr,colNr) in vals ? vals[rowNr,colNr] : "NA")
printf "%s%s", val, (colNr < numCols ? OFS : ORS)
}
}
}
.
$ awk -f tst.awk infile.txt
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
here is one way to get you started. Note that this is not intended to be "the" answer but to encourage you to try to learn the toolkit.
$ join -a1 -e NA -o2.2 <(printf "%s\n" {1..4}"_"{1..6}) \
<(awk '{print $3"_"$1,$2}' file | sort -n) |
pr -6at
NA 4 12 7 9 10
68 70 NA 85 68 70
182 339 355 333 182 NA
797 1396 1854 NA 922 NA
works, however, row and column counts are hard coded, which is not the proper way to do it.
Preferred solution will be filling up an awk 2D array with the data and print it in matrix form at the end.
Well, I have several files and I'm having trouble to join them with awk.
Here are three sample files:
FileA
2 2 31771 15 5327353 4.73E-04 1 1
2 2 40642 23 27428869 -8.29E-04 1 1
2 2 21517 7 89002990 -2.41E-04 1 1
2 2 33787 16 44955568 2.86E-05 1 1
FileB
2 2 31771 15 5327353 5.07E-04 1 1
2 2 40642 23 27428869 5.45E-04 1 1
2 2 21517 7 89002990 1.85E-04 1 1
2 2 33787 16 44955568 3.73E-04 1 1
FileC
2 2 31771 15 5327353 4.28E-04 1 1
2 2 40642 23 27428869 -7.55E-04 1 1
2 2 21517 7 89002990 -2.01E-04 1 1
2 2 33787 16 44955568 3.09E-05 1 1
Each file has 8 columns, but I do not need columns 1, 2, 7 and 8. Columns 3, 4, and 5 are common to all files, these being perhaps the reference columns for joining the files, and column 6 would be the different information of each file. In general, my final file would look like this:
Finalfile
31771 15 5327353 4.73E-04 5.07E-04 4.28E-04
40642 23 27428869 -8.29E-04 5.45E-04 -7.55E-04
21517 7 89002990 -2.41E-04 1.85E-04 -2.01E-04
33787 16 44955568 2.86E-05 3.73E-04 3.09E-05
I tried the following commands:
awk 'NR==FNR{a[$3]=$6;next}{print $3"\t"$4"\t"$5"\t"$6"\t"a[$3]}' FileA FileB FileC > Finalfile
But unfortunately it only works with two files, and I get something like this:
2 2 31771 15 5327353 4.73E-04 5.07E-04
2 2 40642 23 27428869 -8.29E-04 5.45E-04
2 2 21517 7 89002990 -2.41E-04 1.85E-04
2 2 33787 16 44955568 2.86E-05 3.73E-04
Can someone please help? Remember that there are multiple files (10), not just three. Thank you very much!
Give this a try:
awk '{a[$3FS$4FS$5]=a[$3FS$4FS$5]""$6FS}END{for (i in a){print i, a[i]}}' file*
A cleaner version (Thanks #james-brown):
awk '{ a[$3 OFS $4 OFS $5 FS] = a[$3 OFS $4 OFS $5 FS] ( a[$3 OFS $4 OFS $5 FS] == "" ? "" : OFS) $6 }
END{ for (i in a){print i,a[i]} }' OFS="\t" file*
Output
33787 16 44955568 2.86E-05 3.73E-04 3.09E-05
21517 7 89002990 -2.41E-04 1.85E-04 -2.01E-04
40642 23 27428869 -8.29E-04 5.45E-04 -7.55E-04
31771 15 5327353 4.73E-04 5.07E-04 4.28E-04
paste + awk approach(with "pretty" output):
paste fileA fileB fileC | awk '{print $3,$4,$5,$6,$14,$22}' | column -tx
The output:
31771 15 5327353 4.73E-04 5.07E-04 4.28E-04
40642 23 27428869 -8.29E-04 5.45E-04 -7.55E-04
21517 7 89002990 -2.41E-04 1.85E-04 -2.01E-04
33787 16 44955568 2.86E-05 3.73E-04 3.09E-05
I have a bunch of files from simulation output, all with the same number of rows and fields.
What I need to do is to combine them, so that I get only one file with the numbers summed up, which basically resembles the addition of several matrices.
Example:
File1.txt
1 1 1
1 1 1
1 1 1
File2.txt
2 2 2
2 2 2
2 2 2
File3.txt
3 3 3
3 3 3
3 3 3
required output
6 6 6
6 6 6
6 6 6
I'm going to integrate this into some larger Shell-script, therefore I would prefer a solution in awk, though other languages are welcome as well.
awk '{for(i=1;i<=NF;i++)a[FNR,i]=$i+a[FNR,i]}
END{for(i=1;i<=FNR;i++)
for(j=1;j<=NF;j++)printf "%s%s", a[i,j],(j==NF?"\n":FS)}' f1 f2 f3
input files could be more than 3
test with your data:
kent$ head f[1-3]
==> f1 <==
1 1 1
1 1 1
1 1 1
==> f2 <==
2 2 2
2 2 2
2 2 2
==> f3 <==
3 3 3
3 3 3
3 3 3
kent$ awk '{for(i=1;i<=NF;i++)a[FNR,i]=$i+a[FNR,i]}END{for(i=1;i<=FNR;i++)for(j=1;j<=NF;j++)printf "%s%s", a[i,j],(j==NF?"\n":FS)}' f1 f2 f3
6 6 6
6 6 6
6 6 6
Quick hack:
paste f1 f2 f3 | awk '{for(i=1;i<=m;i++)printf "%d%s",$i+$(i+m)+$(i+2*m),i==m?ORS:OFS}' m=3
These are my two imput files:
file1.txt
1 34
2 55
3 44
6 77
file2.txt
1 12
2 7
5 32
And I wish my output to be:
1 34 12
2 55 0
3 44 0
5 0 32
6 77 0
I need to do this in awk and although I was able to merge files, I do not know how to do it without losing info...
awk -F"\t" 'NR==FNR {h[$1] = $2; next }{print $1,$2,h[$2]}' file1.txt file2.txt > try.txt
awk '{ if ($3 !="") print $1,$2,$3; else print $1,$2,"0";}' try.txt > output.txt
And the output is:
1 34 12
2 55 7
3 44 0
6 77 0
Sorry, I know this must be very easy, but I am quite new in this world! Please I need help!!! Thanks in advance!!
this command gives you the desired output:
awk 'NR==FNR{a[$1]=$2;next}
{if($1 in a){print $0,a[$1];delete a[$1]}
else print $0,"0"}
END{for(x in a)print x,"0",a[x]}' file2 file1|sort -n|column -t
note that I used sort and column to sort & format the output.
output: (note I guess the 2 55 0 was a typo in your expected output)
1 34 12
2 55 7
3 44 0
5 0 32
6 77 0
Here is another way using join and awk:
join -a1 -a2 -o1.1 2.1 1.2 2.2 -e0 file1 file2 | awk '{print ($1?$1:$2),$3,$4}' OFS='\t'
1 34 12
2 55 7
3 44 0
5 0 32
6 77 0
-a switch allows to join on un-pairable lines.
-o builds our output format
-e allows to specify what should be printed for values that do not exist
awk just completes the final formatting.
I have a list like the below:
1 . Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 . Sam 3 4 56 6 89
3 . Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 . Pig 2 5 67 2 21
(except the real list is 40 million lines long).
There are repeated elements in the second column (i.e. the ".")
I want to replace these with unique identifers (e.g. ".1", ".2", ".3"...".n")
I tried to do this with a bash loop / sed combination, but it didn't work...
Failed attempt:
for i in 1..4
do
sed -i "s_//._//."$i"_"$i""
done
(Essentially, I was trying to get sed to replace each n th "." with ".n", but this didn't work).
Here's a way to do it with awk (assuming your file is called input:
$ awk '$2=="."{$2="."++counter}{print}' input
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21
The awk program replaces the second column ($2) by a string formed by concatenating . and a pre-incremented counter (++counter) if the second column was exactly .. It then prints out all the columns it got (with $2 modified or not) ({print}).
Plain bash alternative:
c=1
while read -r a b line ; do
if [ "$b" == "." ] ; then
echo "$a ."$((c++))" $line"
else
echo "$a $b $line"
fi
done < input
Since your question is tagged sed and bash, here are a few examples for completeness.
Bash only
Use parameter expansion. The second column will be unique, but not sequential:
i=1; while read line; do echo ${line/\./.$((i++))}; done < input
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .3 Sam 3 4 56 6 89
3 .4 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .6 Pig 2 5 67 2 21
Bash + sed
sed cannot increment variables, it has to be done externally.
For each line, increment $i if line contains a ., then let sed append $i after the .
i=0
while read line; do
[[ $line == *.* ]] && i=$((i+1))
sed "s#\.#.$i#" <<<"$line"
done < input
Output:
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21
you can use this command:
awk '{gsub(/\./,c++);print}' filename
Output:
1 0 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 2 Sam 3 4 56 6 89
3 3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 5 Pig 2 5 67 2 21