Extract lines having same second column but different third column - sorting

I have a file having strings in 3 columns as below.
a b x
a b y
a b z
a c x
a d y
I want to extract all the lines having same second column but different third column. The output I am expecting for the above example is
a b x
a b y
a b z
I tried uniq -f2 and sort -u -k2, but neither works as I expect. Any suggestions, please?

awk '
    seen[$2]++ {
        if (!seen[$2,$3]++) {
            printf "%s%s\n", first[$2], $0
        }
        delete first[$2]
        next
    }
    { first[$2] = $0 ORS }
' file
a b x
a b y
a b z
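Note that the above works in any awk, does not hold the whole input file in memory, doesn't rely on external tools for pre- or post-processing, and prints the output lines in the order they appeared in the input. It does assume that the first line of each group is not repeated verbatim: the second/third column pair of that first line is never recorded in seen, so an exact repeat of it would be printed as if its third column differed.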

awk to the rescue!
First make sure all records are unique:
$ sort file | uniq |
awk '{c[$2]++; a[$2]=a[$2]?a[$2]RS$0:$0}
END{for(k in a) if(c[k]>1) print a[k]}'
a b x
a b y
a b z
Explanation: keep the counter of second field occurrences and aggregate the records. At the end print the records for which the counter is greater than one.
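The same counting idea can be done in two passes over the file, which keeps the input order and tolerates repeated lines without a sort step. This is only a sketch: it assumes the input is a regular file that can be read twice, and the function name diff_third_col is made up for illustration.

```shell
# Hypothetical helper: print lines whose second column occurs with
# more than one distinct third column. Reads the file twice.
diff_third_col() {
    awk '
        NR == FNR {                          # pass 1: count distinct $3 values per $2
            if (!seen[$2, $3]++) distinct[$2]++
            next
        }
        distinct[$2] > 1 && !printed[$0]++   # pass 2: print qualifying lines once each
    ' "$1" "$1"
}

sample=$(mktemp)
printf '%s\n' 'a b x' 'a b y' 'a b z' 'a c x' 'a d y' > "$sample"
diff_third_col "$sample"
rm -f "$sample"
```

For the sample from the question this prints the three "a b" lines. The trade-off is that the file is read twice, and memory grows with the number of distinct column pairs and printed lines rather than staying constant.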

Related

Sort & Uniq the values on specific column

I have data delimited by ":"
AA:w_c;w_c;r_c:1;3
BB:sync;sync:4
CC:t_wak;t_wak:6;7;8
I need to reduce column 2 to its unique values and print the line only if a single unique value remains. If there is more than one unique value, the line should be printed to another file.
I tried this:
#!/bin/bash
sort -u -t : -k2,2 file >> txt
awk -F: '{gsub(";"," ",$3)}1' txt
Output:
BB:sync;sync:4
CC t_wak;t_wak 6 7 8
AA w_c;w_c;r_c 1 3
I am trying to sort and uniq the values in column 2 and write the result to another file called "txt". Then I use awk to replace the ; with a space in column 3, but the code above is not working.
Desired Output 1:
BB:sync:4
CC:t_wak:6 7 8
These two lines are the output that should be printed, because column 2 contains only one unique value.
The line below should go to another file, because column 2 contains more than one unique value.
Desired output 2:
AA:w_c;r_c:1;3
w_c
r_c
Column 2 should contain only one value; if there is more than one, the line must be printed to another file with the values listed as shown above.
This quick solution should work for the example:
awk 'BEGIN{FS=OFS=":"}
{
    split($2, a, ";")
    v=""; delete u
    for(i=1;i<=length(a);i++){
        if( ++u[a[i]]<2)
            v=v (i==1?"":";") a[i]
    }
    $2=v
    if(length(u)>1){
        print > "output2.txt"
        next
    }
}7' input
Let's do a test:
kent$ awk 'BEGIN{FS=OFS=":"}
{
    split($2, a, ";")
    v=""; delete u
    for(i=1;i<=length(a);i++){
        if( ++u[a[i]]<2)
            v=v (i==1?"":";") a[i]
    }
    $2=v
    if(length(u)>1){
        print > "output2.txt"
        next
    }
}7' f
BB:sync:4
CC:t_wak:6;7;8
kent$ cat output2.txt
AA:w_c;r_c:1;3
If you also want each column 2 value listed in output2.txt:
awk 'BEGIN{FS=OFS=":";out2="output2.txt"}
{
    split($2, a, ";")
    v=""; delete u
    for(i=1;i<=length(a);i++){
        if( ++u[a[i]]<2)
            v=v (i==1?"":";") a[i]
    }
    $2=v
    if(length(u)>1){
        print > out2
        for(x in u)
            print x > out2
        next
    }
}7' input
Then you'll get:
kent$ cat output2.txt
AA:w_c;r_c:1;3
w_c
r_c
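Two details of the question are not covered by the scripts above: the first desired output also replaces the ; in column 3 with spaces, and for(x in u) visits the values in an unspecified order, so w_c and r_c are not guaranteed to come out in input order. A variant that handles both might look like the sketch below. The function name split_by_col2 and the vals array are made up for illustration, and the sketch assumes the three-column layout shown in the question.

```shell
# Hypothetical helper: $1 = input file, $2 = file for lines with several values in column 2.
split_by_col2() {
    awk -v out2="$2" 'BEGIN { FS = OFS = ":" }
    {
        n = split($2, a, ";")
        split("", u); v = ""; cnt = 0        # reset per-line state (portable array clear)
        for (i = 1; i <= n; i++)
            if (!u[a[i]]++) {                # first time this value is seen on the line
                v = v (cnt ? ";" : "") a[i]
                vals[++cnt] = a[i]           # remember unique values in input order
            }
        $2 = v
        if (cnt > 1) {                       # several unique values: send to the second file
            print > out2
            for (i = 1; i <= cnt; i++) print vals[i] > out2
            next
        }
        gsub(/;/, " ", $3)                   # single value: also tidy column 3
        print
    }' "$1"
}

workdir=$(mktemp -d)
printf '%s\n' 'AA:w_c;w_c;r_c:1;3' 'BB:sync;sync:4' 'CC:t_wak;t_wak:6;7;8' > "$workdir/input"
split_by_col2 "$workdir/input" "$workdir/output2.txt"
cat "$workdir/output2.txt"
rm -rf "$workdir"
```

Counting the values returned by split avoids calling length() on an array, which not every awk supports.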

Combining 2 lines together but "interlaced"

I have 2 lines from an output as follow:
a b c
x y z
I would like to pipe both lines from the last command into a script that would combine them "interlaced", like this:
a x b y c z
The solution should work for a random number of columns from the output, such as:
a b c d e
x y z x y
Should result in:
a x b y c z d x e y
So far I have tried awk, perl, sed, etc., but without success. All I can do is put the output on one line, but it won't be "interlaced":
$ echo -e 'a b c\nx y z' | tr '\n' ' ' | sed 's/$/\n/'
a b c x y z
Keep fields of odd numbered records in an array, and update the fields of even numbered records using it. This will interlace each pair of successive lines in input.
prog | awk 'NR%2{split($0,a);next} {for(i in a)$i=(a[i] OFS $i)} 1'
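To see the one-liner handle more than one pair of lines, it can be wrapped in a small function and fed four lines. The wrapper name interlace is hypothetical and the sample input is made up; the awk program itself is unchanged apart from comments.

```shell
# Hypothetical wrapper around the one-liner above; reads stdin.
interlace() {
    awk 'NR % 2 { split($0, a); next }          # odd line: remember its fields
         { for (i in a) $i = (a[i] OFS $i) } 1  # even line: prefix each field, then print
    '
}

printf '%s\n' 'a b c' 'x y z' '1 2' '3 4' | interlace
```

This prints "a x b y c z" followed by "1 3 2 4". Because split() clears the array before refilling it, fields from an earlier, longer pair do not leak into a later, shorter one.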
Here's a 3 step solution:
$ # get one argument per line
$ printf 'a b c\nx y z' | xargs -n1
a
b
c
x
y
z
$ # split the lines into 2 columns and combine them side by side
$ printf 'a b c\nx y z' | xargs -n1 | pr -2ts' '
a x
b y
c z
$ # combine all input lines into single line
$ printf 'a b c\nx y z' | xargs -n1 | pr -2ts' ' | paste -sd' '
a x b y c z
$ printf 'a b c d e\nx y z 1 2' | xargs -n1 | pr -2ts' ' | paste -sd' '
a x b y c z d 1 e 2
Could you please try the following? It joins every 2 lines in an "interlaced" fashion.
awk '
FNR%2!=0 && FNR>1{
  for(j=1;j<=NF;j++){
    printf("%s%s",a[j],j==NF?ORS:OFS)
  }
  delete a
}
{
  for(i=1;i<=NF;i++){
    a[i]=(a[i]?a[i] OFS:"")$i
  }
}
END{
  for(j=1;j<=NF;j++){
    printf("%s%s",a[j],j==NF?ORS:OFS)
  }
}' Input_file
Here is a simple awk script
script.awk
NR == 1 { split($0, inArr1) }   # read fields from the 1st line into inArr1
NR == 2 {
    split($0, inArr2)           # read fields from the 2nd line into inArr2
    # output interlaced fields from inArr1 and inArr2; ORS after the last pair ends the line
    for (i = 1; i <= NF; i++) printf("%s%s%s%s", inArr1[i], OFS, inArr2[i], (i < NF ? OFS : ORS))
}
input.txt
a b c d e
x y z x y
running:
awk -f script.awk input.txt
output:
a x b y c z d x e y
Multiline awk solution:
interlaced.awk
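{
    a[NR] = $0
}
END {
    n = split(a[1], b)
    split(a[2], c)
    for (i = 1; i <= n; i++) {   # indexed loop: for (i in b) has no guaranteed order
        printf "%s%s %s", (i == 1 ? "" : OFS), b[i], c[i]
    }
    print ""                     # end the line with a single ORS
}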
Run it like this:
foo_program | awk -f interlaced.awk
Perl will do the job. It was invented for this type of task.
echo -e 'a b c\nx y z' | \
perl -MList::MoreUtils=mesh -e \
'@f=mesh @{[split " ", <>]}, @{[split " ", <>]}; print "@f"'
 
a x b y c z
You can of course print out the meshed output any way you want.
Check out http://metacpan.org/pod/List::MoreUtils#mesh
You could even make it into a shell function for easy use:
function meshy {
perl -MList::MoreUtils=mesh -e \
'@f=mesh @{[split " ", <>]}, @{[split " ", <>]}; print "@f"'
}
$ echo -e 'X Y Z W\nx y z w' |meshy
X x Y y Z z W w
$
Ain't Perl grand?
This might work for you (GNU sed):
sed -E 'N;H;x;:a;s/\n(\S+\s+)(.*\n)(\S+\s+)/\1\3\n\2/;ta;s/\n//;s// /;h;z;x' file
Process two lines at a time. Append the two lines in the pattern space to the hold space, which introduces a newline in front of them. Using pattern matching and back references, nibble away at the front of each of the two lines and move each pair to the front. When the pattern eventually fails to match, remove the first newline and replace the second with a space. Copy the amended line to the hold space, clear the pattern space ready for the next pair of lines (if any), and print.

Matching contents of one file with another and returning second column

So I have two txt files
file1.txt
s
j
z
z
e
and file2.txt
s h
f a
j e
k m
z l
d p
e o
What I want to do is match each letter of file1 against the first column of file2 and return the second column of file2. For example, the expected output would be
h
e
l
l
o
I'm trying to use join file1.txt file2.txt, but that just prints out the entire second file. I'm not sure how to fix this. Thank you.
This is an awk classic:
$ awk 'NR==FNR{a[$1]=$2;next}{print a[$1]}' file2 file1
h
e
l
l
o
Explained:
$ awk '
NR==FNR {          # first argument (file2): build the lookup table
    a[$1]=$2       # hash the records: first field is the key, second is the value
    next
}
{                  # second argument (file1)
    print a[$1]    # output the value stored for this key
}' file2 file1
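With the script above, a key from file1 that does not appear in file2 prints an empty line. If that matters, the lookup can test for the key first. The sketch below assumes a placeholder string is wanted for missing keys; the function name lookup_col2 and the "NA" placeholder are illustrative choices, not part of the original answer.

```shell
# Hypothetical helper: $1 = keys file, $2 = two-column map file, $3 = placeholder text.
lookup_col2() {
    awk -v missing="$3" '
        NR == FNR { map[$1] = $2; next }               # map file: key -> value
        { print (($1 in map) ? map[$1] : missing) }    # keys file: value or placeholder
    ' "$2" "$1"
}

dir=$(mktemp -d)
printf '%s\n' 's h' 'j e' 'e o' > "$dir/map"
printf '%s\n' 's' 'q' 'e' > "$dir/keys"
lookup_col2 "$dir/keys" "$dir/map" NA
rm -rf "$dir"
```

Here the key q has no entry in the map, so the three output lines are h, NA and o. Using the in operator also avoids creating an empty array element for every missing key.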

How to repeat lines in bash and paste with different columns?

Is there a short way in bash to repeat the first line of a file as often as needed to paste it with another file, in the manner of a Kronecker product (for the mathematicians among you)?
What I mean is, I have a file A:
a
b
c
and a file B:
x
y
z
and I want to merge them as follows:
a x
a y
a z
b x
b y
b z
c x
c y
c z
I could probably write a script, read the files line by line and loop over them, but I am wondering if there a short one-line command that could do the same job. I can't think of one and as you can see, I am also lacking some keywords to search for. :-D
Thanks in advance.
You can use this one-liner awk command:
awk 'FNR==NR{a[++n]=$0; next} {for(i=1; i<=n; i++) print $0, a[i]}' file2 file1
a x
a y
a z
b x
b y
b z
c x
c y
c z
Breakdown:
NR == FNR {              # while processing the first file in the list (file2)
    a[++n]=$0            # store the row in array 'a' under an incrementing index
    next                 # move to the next record
}
{                        # while processing the second file (file1)
    for(i=1; i<=n; i++)  # iterate over array a
        print $0, a[i]   # print the current row and the array element
}
An alternative to awk:
join <(sed 's/^/_\t/' file1) <(sed 's/^/_\t/' file2) | cut -d' ' -f2-
This adds a fake key so that join matches every record of file1 with every record of file2, then trims the key afterwards.
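The fake-key trick can also be written without bash process substitution and without relying on sed expanding \t in the replacement, which may not work in every sed. The sketch below uses a space after the key (join splits on blanks by default) and temporary files; the function name cross_lines is made up, and it assumes each input line is a single word, as in the question.

```shell
# Hypothetical helper: pair every line of $1 with every line of $2.
cross_lines() {
    tmp1=$(mktemp)
    tmp2=$(mktemp)
    sed 's/^/_ /' "$1" > "$tmp1"              # prefix the same fake key "_" to every line
    sed 's/^/_ /' "$2" > "$tmp2"
    join "$tmp1" "$tmp2" | cut -d' ' -f2-     # all keys match; then drop the key column
    rm -f "$tmp1" "$tmp2"
}

dir=$(mktemp -d)
printf '%s\n' a b c > "$dir/A"
printf '%s\n' x y z > "$dir/B"
cross_lines "$dir/A" "$dir/B"
rm -rf "$dir"
```

Since every line carries the same key, the inputs count as sorted for join, and each line of the first file is paired with all lines of the second in order.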

complex line copying&modifying on-the-fly with grep or sed

Is there a way to do the following with either grep or sed: read each line of a file, copy it twice, and modify each copy?
Original line:
X Y Z
A B C
New lines:
Y M X
Y M Z
B M A
B M C
where X, Y, Z, M are all integers, and M is a fixed integer (i.e. 2) we inject while copying! I suppose a solution (if any) will be so complex that people (including me) will start bleeding after seeing it!
$ awk -v M=2 '{print $2,M,$1; print $2,M,$3;}' file
Y 2 X
Y 2 Z
B 2 A
B 2 C
How it works
-v M=2
This defines the variable M to have value 2.
print $2,M,$1
This prints the second column, followed by M, followed by the first column.
print $2,M,$3
This prints the second column, followed by M, followed by the third column.
Extended Version
Suppose that we want to handle an arbitrary number of columns in which we print all columns between first and last, followed by M, followed by the first, and then print all columns between first and last, followed by M, followed by the last. In this case, use:
awk -v M=2 '{for (i=2;i<NF;i++)printf "%s ",$i; print M,$1; for (i=2;i<NF;i++)printf "%s ",$i; print M,$NF;}' file
As an example, consider this input file:
$ cat file2
X Y1 Y2 Z
A B1 B2 C
The above produces:
$ awk -v M=2 '{for (i=2;i<NF;i++)printf "%s ",$i; print M,$1; for (i=2;i<NF;i++)printf "%s ",$i; print M,$NF;}' file2
Y1 Y2 2 X
Y1 Y2 2 Z
B1 B2 2 A
B1 B2 2 C
The key change to the code is the addition of the following command:
for (i=2;i<NF;i++)printf "%s ",$i
This command prints all columns from i=2, the column after the first, to i=NF-1, the column before the last. The code is otherwise similar.
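Since the same loop runs twice per line, one option is to build the middle columns once and reuse them. This is a sketch of that idea rather than the answer's own code; the function name inject_middle and the variable mid are made up for illustration.

```shell
# Hypothetical helper: $1 = integer to inject (M), $2 = input file.
inject_middle() {
    awk -v M="$1" '{
        mid = ""
        for (i = 2; i < NF; i++) mid = mid $i OFS   # columns between first and last
        print mid M, $1                             # middle columns, M, first column
        print mid M, $NF                            # middle columns, M, last column
    }' "$2"
}

sample=$(mktemp)
printf '%s\n' 'X Y1 Y2 Z' 'A B1 B2 C' > "$sample"
inject_middle 2 "$sample"
rm -f "$sample"
```

For this sample the output matches the four lines shown above for file2, and the same function also covers the plain three-column case.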
Sure; with GNU sed you can write the following (M is a literal here, so put the integer you want in its place):
sed 's/\(.*\) \(.*\) \(.*\)/\2 M \1\n\2 M \3/'
With bash builtin commands:
m=2; while read -r a b c; do echo "$b $m $a"; echo "$b $m $c"; done < file
Output:
Y 2 X
Y 2 Z
B 2 A
B 2 C
