Shell - Concatenate rows in Column1 If Column 2 has duplicates - shell

I am new to shell programming and am stuck on the following problem.
I want to concatenate the Column A values if and only if the Column B value is the same.
Here is the sample input:
Col A Col B
AAA www.google.com
BBB www.google.com
CCC www.gmail.com
DDD www.yahoo.com
Expected Output
Col A Col B
AAA,BBB www.google.com
CCC www.gmail.com
DDD www.yahoo.com
I am using the below Awk command to segregate the duplicate entries,
awk 'NR == 1 {p=$2; next} p == $2 { printf "%s,%s\n",$1,$2} {p=$2}' FS="," Input.csv
But I am not able to get the duplicates segregated.
Any suggestions or pointers will be highly appreciated.

If you are not concerned about the order of the output lines (that is, they need not appear in the same order as in Input_file), the following may help:
awk 'FNR==1{print;next} {a[$2]=a[$2]?a[$2] "," $1:$1} END{for(i in a){print a[i],i}}' OFS="\t" Input_file
Output will be as follows:
Col A Col B
CCC www.gmail.com
DDD www.yahoo.com
AAA,BBB www.google.com
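If the original order does matter, one possible sketch is to also remember the order in which each Column B value first appears. The helper name concat_by_col2 and the extra order array are assumptions made for illustration; they are not part of the answer above.

```shell
# concat_by_col2 is a hypothetical helper name; it reads the table on stdin.
concat_by_col2() {
    awk '
        FNR == 1 { print; next }          # pass the header through unchanged
        {
            if ($2 in a) {
                a[$2] = a[$2] "," $1      # key seen before: append to its list
            } else {
                a[$2] = $1                # new key: start its list
                order[++n] = $2           # and remember when it first appeared
            }
        }
        END { for (i = 1; i <= n; i++) print a[order[i]], order[i] }
    '
}

printf '%s\n' 'Col A Col B' 'AAA www.google.com' 'BBB www.google.com' \
              'CCC www.gmail.com' 'DDD www.yahoo.com' | concat_by_col2
```

With the sample input this should print the rows in the order shown under "Expected Output" in the question.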

Related

Print all lines between two patterns, exclusive, first instance only (in sed, AWK or Perl) [duplicate]

This question already has answers here:
How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?
(9 answers)
Closed 3 years ago.
Using sed, AWK (or Perl), how do you print all lines between (the first instance of) two patterns, exclusive of the patterns? [1]
That is, given as input:
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
Or possibly even:
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
fff
PATTERN1
ggg
hhh
iii
PATTERN2
jjj
I would expect, in both cases:
bbb
ccc
ddd
[1] A number of users voted to close this question as a duplicate of this one. In the end, I provided a gist that proves they are different. The question is also superficially similar to a number of others, but there is no exact match, and none of them are of high quality. As I believe that this specific problem is the one most commonly faced, it deserves a clear formulation and a set of correct, clear answers.
If you have GNU sed (tested using version 4.7 on Mac OS X), the simplest solution could be:
sed '0,/PATTERN1/d;/PATTERN2/Q'
Explanation:
The d command deletes from line 1 to the line matching /PATTERN1/ inclusive.
The Q command then exits without printing on the first line matching /PATTERN2/.
If the file has only one instance of the pattern pair, or if you don't mind extracting all of them, and you want a solution that doesn't depend on a GNU extension, this works:
sed -n '/PATTERN1/,/PATTERN2/{//!p}'
Explanation:
Note that the empty regular expression // repeats the last regular expression match.
With awk (assumes that PATTERN1 and PATTERN2 are always present in pairs and that neither of them occurs inside a pair)
$ cat ip.txt
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
fff
PATTERN1
ggg
hhh
iii
PATTERN2
jjj
$ awk '/PATTERN2/{exit} f; /PATTERN1/{f=1}' ip.txt
bbb
ccc
ddd
/PATTERN1/{f=1} set flag if /PATTERN1/ is matched
/PATTERN2/{exit} exit if /PATTERN2/ is matched
f; print input line if flag is set
Generic solution, where the block required can be specified
$ awk -v b=1 '/PATTERN2/ && c==b{exit} c==b; /PATTERN1/{c++}' ip.txt
bbb
ccc
ddd
$ awk -v b=2 '/PATTERN2/ && c==b{exit} c==b; /PATTERN1/{c++}' ip.txt
ggg
hhh
iii
This might work for you (GNU sed):
sed -n '/PATTERN1/{:a;n;/PATTERN2/q;p;$!ba}' file
This prints only the lines between the first set of delimiters, or if the second delimiter does not exist, to the end of the file.
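If an unterminated block should print nothing instead of running to the end of the file, one option is to buffer the lines and emit them only once the closing pattern is seen. This is a minimal awk sketch under that assumption; first_block is a hypothetical helper name, and PATTERN1/PATTERN2 are the placeholders from the question.

```shell
# first_block is a hypothetical helper; it reads the named files, or stdin if none are given.
first_block() {
    awk '
        inside && /PATTERN2/  { printf "%s", buf; exit }  # block closed: emit it and stop
        inside                { buf = buf $0 "\n" }       # collect lines inside the block
        !inside && /PATTERN1/ { inside = 1 }              # start collecting after the opener
    ' "$@"
}

printf '%s\n' aaa PATTERN1 bbb ccc ddd PATTERN2 eee | first_block
```

The trade-off is memory: the whole block is held in buf until PATTERN2 arrives.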
I attempted twice to answer, but the questions switched hold/duplicate statuses..
Borrowing input from @Sundeep and adding the answer which I shared in the question comments.
Using awk
awk -v x=0 -v y=1 ' /PATTERN1/&&y { x=1;next } /PATTERN2/&&y { x=0;y=0; next } x ' file
with Perl
perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if $x++ <1 } '
Results:
$ cat ip.txt
aaa
PATTERN1
bbb
ccc
ddd
PATTERN2
eee
PATTERN1
2
46
PATTERN2
xyz
$
$ awk -v x=0 -v y=1 ' /PATTERN1/&&y { x=1;next } /PATTERN2/&&y { x=0;y=0; next } x ' ip.txt
bbb
ccc
ddd
$ perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if $x++ <1 } ' ip.txt
bbb
ccc
ddd
$
To make it generic:
awk: here y is the number of the block to print
awk -v x=0 -v y=2 ' /PATTERN1/ { x++;next } /PATTERN2/ { if(x==y) exit } x==y ' ip.txt
2
46
perl: compare ++$x against the number of the occurrence you want (here it is 2)
perl -0777 -ne ' while( /PATTERN1.*?\n(.+?)^[^\n]*?PATTERN2/msg ) { print $1 if ++$x==2 } ' ip.txt
2
46
Adding a few more possible solutions, just for fun :) and without claiming that they are better than the usual ones. All were written and tested in GNU awk, and only against the given examples.
1st Solution:
awk -v RS="" -v FS="PATTERN2" -v ORS="" '$1 ~ /\nPATTERN1\n/{sub(/.*PATTERN1\n/,"",$1);print $1}' Input_file
2nd solution:
awk -v RS="" -v ORS="" 'match($0,/PATTERN1[^(PATTERN2)]*/){val=substr($0,RSTART,RLENGTH);gsub(/^PATTERN1\n|^$\n/,"",val);print val}' Input_file
3rd solution:
awk -v RS="" -v OFS="\n" -v ORS="" 'sub(/PATTERN2.*/,"") && sub(/.*PATTERN1/,"PATTERN1"){$1=$1;sub(/^PATTERN1\n/,"")} 1' Input_file
In all above codes output will be as follows.
bbb
ccc
ddd
Using GNU sed:
sed -nE '/PATTERN1/{:s n;/PATTERN2/q;p;bs}'
-n suppresses automatic printing, so only the lines explicitly printed by the p command appear in the output.
An address applies only to the single command that follows it, so the commands must be grouped with {}.
The n command (next) skips the PATTERN1 line by loading the next line. If that line matches PATTERN2, sed quits immediately; otherwise it prints the line and branches back to the label s to read the next one.

How to print columns one after the other in bash?

Is there a better method to print two or more columns into one column? For example:
input.file
AAA 111
BBB 222
CCC 333
output:
AAA
BBB
CCC
111
222
333
I can only think of:
cut -f1 input.file >output.file;cut -f2 input.file >>output.file
But it's not good if there are many columns, or when I want to pipe the output to other commands like sort.
Any other suggestions? Thank you very much!
With awk
awk '{if(maxc<NF)maxc=NF;                        # remember the widest row
for(i=1;i<=NF;i++)a[i]=(a[i]!=""?a[i] RS:"") $i  # append field i to column i
}
END{
for(i=1;i<=maxc;i++)print a[i]
}' input.file
You can use a GNU awk array of arrays to store all the data and print it later on.
If the number of columns is constant, this works for any amount of columns:
gawk '{for (i=1; i<=NF; i++) # loop over columns
data[i][NR]=$i # store in data[column][line]
}
END {for (i=1;i<=NF;i++) # loop over columns
for (j=1;j<=NR;j++) # loop over lines
print data[i][j] # print the given field
}' file
Note NR stands for number of records (that is, number of lines here) and NF stands for number of fields (that is, the number of fields in a given line).
If the number of columns changes over rows, then we should use yet another array, in this case to store the number of columns for each row. But in the question I don't see a request for this, so I am leaving it for now.
See a sample with three columns:
$ cat a
AAA 111 123
BBB 222 234
CCC 333 345
$ gawk '{for (i=1; i<=NF; i++) data[i][NR]=$i} END {for (i=1;i<=NF;i++) for (j=1;j<=NR;j++) print data[i][j]}' a
AAA
BBB
CCC
111
222
333
123
234
345
If the number of columns is not constant, using an array to store the number of columns for each row helps to keep track of it:
$ cat sc.wk
{for (i=1; i<=NF; i++)
data[i][NR]=$i
columns[NR]=NF
if (NF>maxc) maxc=NF
}
END {for (i=1;i<=maxc;i++)
for (j=1;j<=NR;j++)
print (i<=columns[j] ? data[i][j] : "-")
}
$ cat a
AAA 111 123
BBB 222
CCC 333 345
$ gawk -f sc.wk a
AAA
BBB
CCC
111
222
333
123
-
345
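Arrays of arrays are a GNU awk extension. Where only a POSIX awk is available, a roughly equivalent sketch can use the classic data[column, line] subscript instead. The helper name columns_stacked is an assumption for illustration, and the "-" filler simply mirrors the output above.

```shell
# columns_stacked is a hypothetical helper; it reads the named files, or stdin if none are given.
columns_stacked() {
    awk '
        {
            if (NF > maxc) maxc = NF          # widest row seen so far
            for (i = 1; i <= NF; i++)
                data[i, NR] = $i              # subscript is column SUBSEP line
        }
        END {
            for (i = 1; i <= maxc; i++)       # loop over columns
                for (j = 1; j <= NR; j++)     # loop over lines
                    if ((i, j) in data) print data[i, j]
                    else print "-"            # this row has no such column
        }
    ' "$@"
}

printf '%s\n' 'AAA 111 123' 'BBB 222' 'CCC 333 345' | columns_stacked
```

Testing membership with (i, j) in data avoids having to keep a separate per-row column count.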
awk '{print $1;list[i++]=$2}END{for(j=0;j<i;j++){print list[j];}}' input.file
Output
AAA
BBB
CCC
111
222
333
A simpler solution, if the values may be printed row by row (AAA, 111, BBB, ...) instead of column by column, would be
awk -v RS="[[:blank:]\t\n]+" '1' input.file
Expects tab as delimiter:
$ cat <(cut -f 1 asd) <(cut -f 2 asd)
AAA
BBB
CCC
111
222
333
Since the order is of no importance:
$ awk 'BEGIN {RS="[ \t\n]+"} 1' file
AAA
111
BBB
222
CCC
333
Ugly, but it works-
for i in {1..2} ; do awk -v p="$i" '{print $p}' input.file ; done
Change the {1..2} to {1..n} where 'n' is the number of columns in the input file
Explanation-
We pass the shell loop variable i to awk as the awk variable p. i runs from 1 to n, and at each step awk prints the i-th column of the file.
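To avoid hard-coding n, the column count can be read from the file first. This is only a sketch: print_columns is a hypothetical helper name, and it assumes a regular file, since the file is read once per column.

```shell
# print_columns is a hypothetical helper; usage: print_columns input.file
print_columns() {
    file=$1
    # The widest row decides how many passes to make.
    n=$(awk 'NF > max { max = NF } END { print max + 0 }' "$file")
    i=1
    while [ "$i" -le "$n" ]; do
        # Skip rows that are too short to have column i.
        awk -v p="$i" 'p <= NF { print $p }' "$file"
        i=$((i + 1))
    done
}
```

Because it prints to standard output, the result can be piped straight into sort or another command.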
This will work for an arbitrary number of space-separated columns (note that sort -u also sorts the values and removes duplicates)
awk '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file
If space is not the separator, say ":" is the separator instead:
awk -F: '{for (A=1;A<=NF;A++) printf("%s\n",$A);}' input.file | sort -u > output.file

Print lines whose 1st and 4th column differ

I have a file with a bunch of lines of this form:
12 AAA 423 12 BBB beta^11 + 3*beta^10
18 AAA 1509 18 BBB -2*beta^17 - beta^16
18 AAA 781 12 BBB beta^16 - 5*beta^15
Now I would like to print only lines where the 1st and the 4th column differ (the columns are space-separated) (the values AAA and BBB are fixed). I know I can do that by getting all possible values in the first column and then use:
for i in $values; do
cat file.txt | grep "^$i" | grep -v " $i BBB"
done
However, this runs through the file once for every distinct value in the first column. Is there a simple way to do this in a single pass? I think I can do the comparison; my main problem is that I have no idea how to extract the space-separated columns.
This is something quite straightforward for awk:
awk '$1 != $4' file
With awk, you refer to the first field with $1, the second with $2 and so on. This way, you can compare the first and the fourth with $1 != $4. If this is true (that is, $1 and $4 differ), awk performs its default action: print the current line.
For your sample input, this works:
$ awk '$1 != $4' file
18 AAA 781 12 BBB beta^16 - 5*beta^15
Note that you can define a different field separator with -v FS="...". This tells awk that the fields in your lines are separated by tabs, commas, and so on. Altogether it would look like this: awk -v FS="\t" '$1 != $4' file.
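As a small illustration of both points, here is a sketch on made-up sample data (not from the question). The first command uses a tab separator so that a space inside a field does not shift the column numbers. The second shows that fields which look numeric are compared as numbers, and that appending "" forces a string comparison if that is what you need.

```shell
# Made-up tab-separated sample with a space inside the second field.
printf '12\tAAA x\t423\t12\n18\tAAA y\t781\t12\n' |
    awk -v FS='\t' '$1 != $4'

# 012 and 12 are numerically equal, so a plain $1 != $4 would print nothing here;
# concatenating "" makes awk compare the two fields as strings instead.
echo '012 AAA 1 12' | awk '($1 "") != ($4 "")'
```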

How can I extract a subset from a column/field using awk?

I wondered how can I extract a subset from a column/field using awk?
Here is the input file test.txt:
aaa bbb ccc=0.7707;ddd=0.21
I would like to be able to extract figure "0.21" from the 3rd column, and output it with the 1st and 2nd columns:
aaa bbb 0.21
I have tried and used the code below but failed:
awk 'BEGIN { OFS = "\t" } { $4 = /^ddd=(+\d)/ ; print $1,$2,$4 }' test.txt
Please help!
Many thanks,
TP
You can specify multiple delimiters using the -F flag or setting FS in the BEGIN block. For example:
echo "aaa bbb ccc=0.7707;ddd=0.21" | awk -F "[ =]" '{ print $1, $2, $NF }'
Results:
aaa bbb 0.21
You could use gsub:
awk 'BEGIN { OFS = "\t" } { gsub(/.*=/, "", $3); print $1,$2,$3 }' test.txt
For your input, it'd give:
aaa bbb 0.21
Another awk
awk '{split($3,a,"=");print $1,$2,a[3]}' test.txt
aaa bbb 0.21
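The answers above rely on the wanted value being last in the field. If the third column is assumed to be a list of key=value pairs separated by ";", a sketch that looks the value up by key name could be written as follows. The helper name extract_key is hypothetical.

```shell
# extract_key is a hypothetical helper; usage: extract_key KEY < test.txt
extract_key() {
    awk -v key="$1" '{
        n = split($3, pairs, ";")            # ccc=0.7707 and ddd=0.21
        for (i = 1; i <= n; i++) {
            split(pairs[i], kv, "=")         # kv[1] is the name, kv[2] the value
            if (kv[1] == key) print $1, $2, kv[2]
        }
    }'
}

echo 'aaa bbb ccc=0.7707;ddd=0.21' | extract_key ddd
```

Lines whose third column lacks the requested key produce no output.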

tail file till last duplicate line in a file using bash

Hey everyone! How can I find, in a simple way, the line number of the last duplicate in a file?
I need to take the tail of the file after the last duplicate. Example:
hhhh
str1
str2
hhhh
str1
hhh
**str1
str2
str3**
I need only the bold part after hhh (str1, str2, str3). Thanks in advance!
Give this a try:
awk '{if (a[$0]) accum = nl = ""; else {a[$0]=1;accum = accum nl $0; nl = "\n"}} END { print accum}' inputfile
Given this input:
aaa
b
c
aaa
d
e
f
aaa
b
aaa
g
h
i
This is the output:
g
h
i
Taking the sample from Dennis:
$ gawk -vRS="aaa" 'END{print}' file
g
h
i
Here's another way if you don't know the value beforehand, although it is not as elegant as a single awk script.
var=$(sort file| uniq -c|sort -n | tail -1| awk '{print $2}')
gawk -vRS="$var" 'END{print}' file
Still, this only picks the line that occurs most frequently. It does not get the "last duplicate", whatever that means.
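If "last duplicate" is read the same way as in the first awk answer (the last line whose text has already appeared earlier in the file), the line number the question asks for can be computed in one pass and handed to tail. This is a sketch under that interpretation; tail_after_last_dup is a hypothetical helper name, and it assumes a regular file because the file is read twice.

```shell
# tail_after_last_dup is a hypothetical helper; usage: tail_after_last_dup inputfile
tail_after_last_dup() {
    file=$1
    # Line number of the last line whose text already appeared earlier (0 if none).
    last=$(awk 'seen[$0]++ { last = NR } END { print last + 0 }' "$file")
    # Print everything after that line.
    tail -n "+$((last + 1))" "$file"
}
```

On the aaa/b/c sample above this should print g, h and i, matching the first answer.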

Resources