Transpose Columns in a single comma separated row conditionally - shell

I have an input file that looks like this:
aaa 111
aaa 222
aaa 333
bbb 444
bbb 555
I want to create a transposed output file that looks like this:
aaa 111,222,333
bbb 444,555
How can I do this using awk, sed, etc?

One way using awk:
$ awk '{a[$1]=a[$1]?a[$1]","$2:$2}END{for(k in a)print k,a[k]}' file
aaa 111,222,333
bbb 444,555
And if your implementation of awk doesn't support the ternary operator then:
$ awk 'a[$1]{a[$1]=a[$1]","$2;next}{a[$1]=$2}END{for(k in a)print k,a[k]}' file
aaa 111,222,333
bbb 444,555
Your new file does not cause any problems for the script. What output are you getting? I suspect it's a line-ending issue. Run dos2unix file to fix the line endings.
$ cat file
APM00065101435 189
APM00065101435 190
APM00065101435 191
APM00065101435 390
190104555 00C7
190104555 00D1
190104555 00E1
190104555 0454
190104555 0462
$ awk '{a[$1]=a[$1]?a[$1]","$2:$2}END{for(k in a)print k,a[k]}' file
APM00065101435 189,190,191,390
190104555 00C7,00D1,00E1,0454,0462
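If the order of the keys matters, note that for (k in a) visits keys in an unspecified order. A minimal sketch that keeps keys in first-seen order, assuming whitespace-separated input with the key in column 1 (the function name group_by_key and the array names are only illustrative):

```shell
# Group second-column values by first-column key, keeping keys in
# first-seen order (plain "for (k in a)" order is unspecified in awk).
group_by_key() {
    awk '
        !($1 in a) { order[++n] = $1; a[$1] = $2; next }   # first time this key is seen
                   { a[$1] = a[$1] "," $2 }                 # append later values
        END        { for (i = 1; i <= n; i++) print order[i], a[order[i]] }
    ' "$@"
}

printf '%s\n' 'aaa 111' 'aaa 222' 'aaa 333' 'bbb 444' 'bbb 555' | group_by_key
```

This version tests key membership with in rather than the truth of the stored value, so a first value of 0 should not get dropped.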

Code for GNU sed:
I asked a question about this and got a very good and useful answer from potong. Both variants assume that lines with the same key are adjacent:
sed -r ':a;$!N;s/^(([^ ]+ ).*)\n\2/\1,/;ta;P;D' file
sed -r ':a;$!N;s/^((\S+\s).*)\n\2/\1,/;ta;P;D' file

Related

Remove lines contain pattern exclude another pattern bash

I have a file
$ cat File
ce5 xxx 123
ed9 myself,yyy,fail? -
f27 xxx,fail? 145
105 yyy,fail? -
I want to remove all the lines containing string ",fail?" but not "myself" in bash.
Expected output
$ cat File
ce5 xxx 123
ed9 myself,yyy,fail? -
I can grep the lines but not sure how to remove them
cat File | grep -v "myself" | grep ",fail?"
f27 xxx,fail? 145
105 yyy,fail? -
I think you can't do such things (easily) with grep.
Keep lines containing myself and delete the other lines containing ,fail? with sed:
sed '/myself/!{/,fail?/d;}' File
With awk:
awk '! /,fail\?/ || /myself/' File
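Since ? is a regex metacharacter, another option is to avoid regex matching altogether. A small sketch using awk's index() for plain substring search (the function name drop_failed is made up for illustration):

```shell
# Keep a line unless it contains the literal text ",fail?" and lacks "myself".
# index() does a plain substring search, so "?" needs no escaping.
drop_failed() {
    awk 'index($0, ",fail?") == 0 || index($0, "myself") > 0' "$@"
}

printf '%s\n' 'ce5 xxx 123' 'ed9 myself,yyy,fail? -' 'f27 xxx,fail? 145' '105 yyy,fail? -' | drop_failed
```

To update File itself, redirect the output to a temporary file and move it over the original.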

how to extract lines between two patterns only with awk?

$ awk '/abc/{flag=1; next} /edf/{flag=0} flag' file
This prints every block of lines between abc and edf, but I only need the lines from the first block.
input:
abc
111
222
edf
333
444
abc
555
666
edf
output:
111
222
So I'm assuming you want to print the matching lines only for the first occurrence.
For that you can use an additional variable and set it once flag goes back to 0:
$ cat file
abc
111
222
edf
333
444
abc
555
666
edf
$ awk '/abc/{flag=1; next} /edf/{if(flag) got1stoccurence=1; flag=0} flag && !got1stoccurence' file
111
222
If you only want the first set of output, then:
awk '/abc/{flag=1; next} /edf/{if (flag == 1) exit} flag' file
Or:
awk '/abc/{flag++; next} /edf/{if (flag == 1) flag++} flag == 1' file
There are other ways to do it too, no doubt. The first is simple and to the point. The second is more flexible if you also want to process the first group of lines appearing between another pair of patterns.
Note that if the input file contains:
xyz
edf
pqr
abc
111
222
edf
It is important not to do anything about the first edf; it is an uninteresting line because no abc line has been read yet.
Using getline with while:
$ awk '/abc/ { while(getline==1 && $0!="edf") print; exit }' file
111
222
Look for /abc/; once it is found, records are output in the while loop until edf is found.
$ awk '/edf/{exit} f; /abc/{f=1}' file
111
222
If it was possible for edf to appear before abc in your input then it'd be:
$ awk 'f{if (/edf/) exit; print} /abc/{f=1}' file
111
222
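The same idea can be wrapped so that the start and end patterns are passed in. This is only a sketch: the function name first_block and the variables s, e and f are illustrative, and both patterns are treated as regexes.

```shell
# Print the lines strictly between the first START line and the next END line.
# usage: first_block START END [file...]
first_block() {
    start=$1; end=$2; shift 2
    awk -v s="$start" -v e="$end" '
        f && $0 ~ e { exit }      # end marker after a start marker: stop reading
        f                         # inside the block: print the line
        $0 ~ s { f = 1 }          # start marker: begin printing from the next line
    ' "$@"
}

printf '%s\n' xyz edf pqr abc 111 222 edf 333 abc 555 edf | first_block abc edf
```

An edf that appears before any abc is ignored because f is still 0 at that point.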

gawk use to replace a line containing a pattern with multiple lines using variable

I am trying to replace a line containing the Pattern using gawk, with a set of lines. Let's say, file aa contains
aaaa
ccxyzcc
aaaa
ddxyzdd
I'm using gawk to replace all lines containing xyz with a set of lines 111\n222, my changed contents would contain:
aaaa
111
222
aaaa
111
222
But, if I use:
gawk -v nm2="111\n222" -v nm1="xyz" '{ if (/nm1/) print nm2;else print $0}' "aa"
The changed content shows:
aaaa
ccxyzcc
aaaa
ddxyzdd
I need the entire lines that contain xyz, i.e. lines ccxyzcc and ddxyzdd, to be replaced with 111 followed by 222. Please help.
The problem with your code is that /nm1/ matches the literal text nm1, not the value of the nm1 variable:
$ gawk -v nm2="111\n222" -v nm1="xyz" '$0 ~ nm1{print nm2; next} 1' aa
aaaa
111
222
aaaa
111
222
Thanks to @fedorqui for the suggestion: next can be avoided by overwriting the matching input line with the required text:
gawk -v nm2="111\n222" -v nm1="xyz" '$0 ~ nm1{$0=nm2} 1' aa
Solution with GNU sed
$ nm1='xyz'
$ nm2='111\n222'
$ sed "/$nm1/c $nm2" aa
aaaa
111
222
aaaa
111
222
The c command deletes each line matching the pattern and outputs the given text in its place.
When using awk's ~ operator, you don't need to provide a literal regex on the right-hand side; a variable holding the pattern works too.
Your command, with the syntax corrected, would be something like:
gawk -v nm2="111\n222" -v nm1="xyz" '{ if ( $0 ~ nm1 ) print nm2;else print $0}' input-file
which produces the output:
aaaa
111
222
aaaa
111
222
This is how I'd do it:
$ cat aa
aaaa
ccxyzcc
aaaa
ddxyzdd
$ awk '{gsub(/.*xyz.*/, "111\n222")}1' aa
aaaa
111
222
aaaa
111
222
$
Passing variables as patterns to awk is always a bit tricky.
awk -v nm2='111\n222' '{if ($1 ~ /xyz/){ print nm2 } else {print}}' aa
gives the desired output, but the 'xyz' pattern is now hard-coded.
Splicing nm1 into the program as a shell variable will also work:
nm1=xyz
awk -v nm2='111\n222' '{if ($1 ~ /'$nm1'/){ print nm2 } else {print}}' aa
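Splicing a shell variable into the awk program text breaks as soon as the value contains a slash or a quote. One alternative is to hand the values over through the environment and read them with ENVIRON. A sketch, where the function name replace_lines and the variable names AWK_PAT and AWK_REP are arbitrary choices:

```shell
# Replace every line matching PATTERN with REPLACEMENT.
# usage: replace_lines PATTERN REPLACEMENT [file...]
replace_lines() {
    pat=$1; rep=$2; shift 2
    AWK_PAT="$pat" AWK_REP="$rep" awk '
        $0 ~ ENVIRON["AWK_PAT"] { print ENVIRON["AWK_REP"]; next }   # swap the whole line
        { print }                                                     # pass other lines through
    ' "$@"
}

printf '%s\n' aaaa ccxyzcc aaaa ddxyzdd | replace_lines xyz "$(printf '111\n222')"
```

Unlike -v, ENVIRON values are not scanned for escape sequences, so the replacement has to contain a real newline (here produced by printf) rather than the two characters \n.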

How to print columns one after the other in bash?

Is there a better way to print two or more columns into one column? For example:
input.file
AAA 111
BBB 222
CCC 333
output:
AAA
BBB
CCC
111
222
333
I can only think of:
cut -f1 input.file >output.file;cut -f2 input.file >>output.file
But it's not good if there are many columns, or when I want to pipe the output to other commands like sort.
Any other suggestions? Thank you very much!
With awk
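awk '{
    if (maxc < NF) maxc = NF                    # remember the widest row
    for (i = 1; i <= NF; i++)                   # append each field to its column
        a[i] = (a[i] != "" ? a[i] RS $i : $i)
}
END {
    for (i = 1; i <= maxc; i++) print a[i]      # print the columns one after another
}' input.file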
You can use a GNU awk array of arrays to store all the data and print it later on.
If the number of columns is constant, this works for any number of columns:
gawk '{for (i=1; i<=NF; i++) # loop over columns
    data[i][NR]=$i # store in data[column][line]
}
END {for (i=1;i<=NF;i++) # loop over columns
    for (j=1;j<=NR;j++) # loop over lines
        print data[i][j] # print line j of column i
}' file
Note NR stands for number of records (that is, number of lines here) and NF stands for number of fields (that is, the number of fields in a given line).
If the number of columns changes over rows, then we should use yet another array, in this case to store the number of columns for each row. But in the question I don't see a request for this, so I am leaving it for now.
See a sample with three columns:
$ cat a
AAA 111 123
BBB 222 234
CCC 333 345
$ gawk '{for (i=1; i<=NF; i++) data[i][NR]=$i} END {for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print data[i][j]}' a
AAA
BBB
CCC
111
222
333
123
234
345
If the number of columns is not constant, using an array to store the number of columns for each row helps to keep track of it:
$ cat sc.wk
{for (i=1; i<=NF; i++)
    data[i][NR]=$i
columns[NR]=NF
if (NF>maxnf) maxnf=NF
}
END {for (i=1;i<=maxnf;i++)
    for (j=1;j<=NR;j++)
        print (i<=columns[j] ? data[i][j] : "-")
}
$ cat a
AAA 111 123
BBB 222
CCC 333 345
$ gawk -f sc.wk a
AAA
BBB
CCC
111
222
333
123
-
345
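Arrays of arrays are a gawk extension. For other awks, roughly the same thing can be sketched with a composite subscript, which standard awk supports. The names columns_in_sequence, cell and maxnf below are illustrative, and "-" is kept as the placeholder for missing fields:

```shell
# Print column 1 of every row, then column 2 of every row, and so on.
# Rows may have different numbers of fields; gaps are shown as "-".
columns_in_sequence() {
    awk '
        {
            for (i = 1; i <= NF; i++) cell[i, NR] = $i   # composite key: column, row
            if (NF > maxnf) maxnf = NF                    # widest row seen so far
        }
        END {
            for (i = 1; i <= maxnf; i++)                  # each column in turn
                for (j = 1; j <= NR; j++)                 # each row, top to bottom
                    if ((i, j) in cell) print cell[i, j]; else print "-"
        }
    ' "$@"
}

printf '%s\n' 'AAA 111 123' 'BBB 222' 'CCC 333 345' | columns_in_sequence
```

Because the output goes to standard output, it can be piped straight into sort or any other command.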
awk '{print $1;list[i++]=$2}END{for(j=0;j<i;j++){print list[j];}}' input.file
Output
AAA
BBB
CCC
111
222
333
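A simpler solution, if the output order does not matter and your awk treats a multi-character RS as a regex (gawk and mawk do), would be: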
awk -v RS="[[:blank:]\t\n]+" '1' input.file
Expects tab as delimiter:
$ cat <(cut -f 1 asd) <(cut -f 2 asd)
AAA
BBB
CCC
111
222
333
Since the order is of no importance:
$ awk 'BEGIN {RS="[ \t\n]+"} 1' file
AAA
111
BBB
222
CCC
333
Ugly, but it works:
for i in {1..2} ; do awk -v p="$i" '{print $p}' input.file ; done
Change {1..2} to {1..n}, where n is the number of columns in the input file.
Explanation:
The shell variable i is passed to awk as the awk variable p. i runs from 1 to n, and at each step awk prints the i-th column of the file.
This will work for an arbitrary number of space-separated columns. Note that sort -u sorts the values and removes duplicates, so the original order is not preserved:
awk '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file
If space is not the separator, say ":" is the separator instead:
awk -F: '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file

Replace line after match

Given this file
$ cat foo.txt
AAA
111
BBB
222
CCC
333
I would like to replace the first line after BBB with 999. I came up with this command
awk '/BBB/ {f=1; print; next} f {$1=999; f=0} 1' foo.txt
but I am curious about any shorter commands with either awk or sed.
This might work for you (GNU sed)
sed '/BBB/!b;n;c999' file
If a line contains BBB, print that line and then change the following line to 999.
!b negates the preceding address (regexp) and branches to the end of the script, so non-matching lines skip the remaining commands. n prints the current line and then reads the next one into the pattern space. c changes the current line to the string following the command.
This is somewhat shorter:
awk 'f{$0="999";f=0}/BBB/{f=1}1' file
f {$0="999";f=0} if f is true, set line to 999 and f to 0
/BBB/ {f=1} if pattern match set f to 1
1 print all lines, since 1 is always true.
You can also use sed, which is shorter:
sed '/BBB/{n;s/.*/999/}' file
$ awk '{print (f?999:$0); f=0} /BBB/{f=1}' file
AAA
111
BBB
999
CCC
333
awk '/BBB/{print;getline;$0="999"}1' your_file
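This works on macOS (BSD sed), but note that it inserts 999 after the BBB line rather than replacing the following line:
sed 's/\(BBB\)/\1\
999/' file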
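For reuse, the awk approach can be wrapped with the pattern and the replacement passed in as arguments. This is just a sketch: the function name replace_after and the variables pat, new, hit and out are made up here.

```shell
# Replace the line that follows each line matching PATTERN with TEXT.
# usage: replace_after PATTERN TEXT [file...]
replace_after() {
    pat=$1; new=$2; shift 2
    awk -v pat="$pat" -v new="$new" '
        {
            out = (hit ? new : $0)    # replace only if the previous line matched
            hit = ($0 ~ pat)          # test the original line, not the replacement
            print out
        }
    ' "$@"
}

printf '%s\n' AAA 111 BBB 222 CCC 333 | replace_after BBB 999
```

The match is tested against the original input line, so a replacement text that happens to match the pattern does not trigger a further replacement.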
