Print lines where first column matches, second column different - bash

In a text file, how do I print only the lines where the first column is duplicated but the second column differs? I want to reconcile these differences. Possibly using awk/sed/bash?
Input:
Jon AAA
Jon BBB
Ellen CCC
Ellen CCC
Output:
Jon AAA
Jon BBB
Note that the real file is not sorted.
Thanks for any help.

This should do it (I broke the one-liner into 3 lines for readability):
awk '!($1 in a) {a[$1]=$2;next}
$1 in a && $2!=a[$1]{p[$1 FS $2];p[$1 FS a[$1]]}
END{for(x in p)print x}' file
Line 1 saves $1 and $2 into array a the first time a name is seen.
Line 2: for an already-seen $1 with a different $2, both lines (the current one and the first-seen one) are stored as indices of array p, so the same $1,$2 combination won't be printed multiple times.
The END block prints the indices of array p.
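For instance, with the sample input above this prints exactly the desired lines (note that for (x in p) traverses the array in an unspecified order, so the two lines may come out in either order):
$ awk '!($1 in a){a[$1]=$2;next} $1 in a && $2!=a[$1]{p[$1 FS $2];p[$1 FS a[$1]]} END{for(x in p)print x}' file
Jon AAA
Jon BBB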

sort file | uniq -u
will print only the lines that occur exactly once. For the sample input this removes the repeated Ellen CCC lines and keeps both Jon lines, though note it would also keep a name that appears just once.

This might work for you:
sort file | uniq -u | rev | uniq -Df1 | rev
This sorts the file, removes fully duplicate lines, reverses each line, keeps only the lines that are still duplicates once the first (reversed) field is skipped, i.e. lines that share the same original first column, and then reverses the lines back to their original orientation.
This will drop duplicate lines and lines with singleton keys.
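A step-by-step trace with the sample input (the rev trick is needed because uniq can skip leading fields but not trailing ones):
$ sort file | uniq -u
Jon AAA
Jon BBB
$ sort file | uniq -u | rev
AAA noJ
BBB noJ
$ sort file | uniq -u | rev | uniq -Df1 | rev
Jon AAA
Jon BBB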

Just a normal unique filter should work:
awk '!a[$0]++' test

Related

get unique combination of values of two columns

I can get the unique values from a column using the command below:
cut -d',' -f3 file.txt | uniq -c
This gives me unique values in field 3.
But if I want to get unique combinations of two fields, how can I get that?
input
A,B,C
B,C,D
D,B,C
H,C,D
K,C,D
output
2 B,C
3 C,D
You can specify a range of fields using -f2-3 or -f2,3:
cut -d',' -f2-3 file.txt | sort | uniq -c
uniq does not detect repeated lines unless they are adjacent, so the input should be sorted before running uniq.
Output
2 B,C
3 C,D
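To see why the sort matters: without it, uniq -c only merges adjacent repeats, so the same combination is counted in separate runs. With the sample input:
$ cut -d',' -f2-3 file.txt | uniq -c
      1 B,C
      1 C,D
      1 B,C
      2 C,D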
Another option that may give you greater flexibility in processing the input is awk. You can use a concatenation of the fields at issue as an array index to count the occurrences of each unique combination, then output the results in the END rule, e.g.
awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' file
Example Use/Output
With your example input file you would have:
$ awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' input
3 C,D
2 B,C
awk arrays are associative rather than indexed, but you can preserve the order of appearance using a 3rd array if needed. Or you can simply pipe the output to sort for whatever order you like.
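A sketch of that order-preserving variant: the third array (order, a name chosen here for illustration) records each key the first time it is seen, and the END rule walks it in insertion order:
awk -F, '!(($2","$3) in a){order[++n]=$2","$3}
{a[$2","$3]++}
END{for(i=1;i<=n;i++) print a[order[i]], order[i]}' file.txt
2 B,C
3 C,D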

match awk column value to a column in another file

I need to know if I can match an awk value while I am inside a piped command, like below:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
From here I need to check whether the computed value $4*10^10+$6 is present in (matches) any value of a column in another file. If it is present, then print; otherwise just move forward.
File where value needs to be matched is as below:
a,b,c,d,e
1,2,30000000000,3,4
I need to match with the 3rd column of the above file.
I would ideally like this check to be in the same command, because if it is not applied, this prints more than 100 million rows (a very large file).
I have already read this question.
Adding more info:
Breaking my command into parts
part1-command:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "Something:"
part1-output(just showing 1 iteration output):
Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463
part2-command Now I use awk
awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
part2-command output: currently the values below are printed (see how I computed 1*10^10+10588429 and got 10010588429):
1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413
Now I need to put a check here (within the command, near awk) to print only if 10010588429 is present in another file (say another_file.csv, as below):
another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
output should only be
1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413
So for every row of the awk output, we check for an entry in column C of file2.
Using the associative array approach from the previous question, put a hyphen in place of the first file name to direct awk to read the input stream; a FS=',' assignment between the two file arguments switches the field separator before the comma-delimited lookup file is read.
Example:
grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]'
'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}
NR==FNR {
query[$4*10^10+$6]=$4*10^10+$6;
out[$4*10^10+$6]=$4 FS $6 FS $4*10^10+$6 FS $8;
next
}
query[$3]==$3 {
print out[$3]
}' - another_file.csv > output.csv
More info on the merging process in the answer cited in the question:
Using AWK to Process Input from Multiple Files
I'll post a template which you can utilize for your computation
awk 'BEGIN {FS=OFS=","}
NR==FNR {lookup[$3]; next}
/sometext/ {c=4}
c&&c--&&/somemoretext/ {value= # implement your computation here
if(value in lookup)
print "what you want"}' lookup.file FS=':' grep.files...
Here awk loads the values of the third column of the first file (which is comma-delimited) into the lookup array (a hashmap in disguise). For the next set of files it sets the delimiter to : and, much like grep -A3, looks within 3 lines of the first pattern for the second pattern, does the computation, and prints what you want.
In awk you also have more control over which column your pattern matches; here I replicated the grep usage.
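As an illustration only, here is the template filled in with the computation from the question (using the question's hypothetical command and file names; $3+0 strips the stray leading space in the CSV column):
somebinaryGivingOutputToSTDOUT |
awk 'BEGIN {FS=OFS=","; print "Col1,Col2,Col3,Col4"}
NR==FNR {lookup[$3+0]; next}        # load column C of another_file.csv
/sometext/ {c=4}                    # open a 3-line window, like grep -A3
c&&c--&&/somemoretext/ {value=$4*10^10+$6
    if (value in lookup)
        print $4,$6,value,$8}' another_file.csv FS='[:|]' -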
This is another simplified example to focus on the core of the problem.
awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' |
awk 'NR==FNR{lookup[$1]; next}
$1 in lookup' perfect.numbers -
The first process creates 1000 random records, and the second one filters those whose first field is in the lookup table.
28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929
for the lookup file
$ head perfect.numbers
6
28
496
8128
the piped data is substituted as the second file at -.
You can pipe your grep or awk output into a while read loop, which gives you some degree of freedom. There you can decide whether to forward a line:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" |
while read -r LINE; do
    COMPUTED=$(echo "$LINE" | awk -F '[:|]' '{print $4*10^10+$6}')
    if grep -qF "$COMPUTED" /the/file/to/search; then
        echo "$LINE"
    fi
done
Bear in mind that this spawns an awk and a grep for every input line, so it will crawl on a 100-million-row stream; the pure-awk approaches above scale far better.

print 3 consecutive columns after specific string from CSV

I need to print 2 columns after a specific string (in my case it is 64). There can be multiple instances of 64 within the same CSV row, but the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should be on its own line, and unique. The problem is that the specific string does not fall in the same column for all rows. The rows contain fairly dynamic data and there is no CSV header. Below is a sample input file (it is just a sample; the actual file has approx. 300 columns and 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have searched many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of its own line, and the second prints the first three columns whenever a line leads with 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
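For the curious, one such single-awk variant (a sketch; it repeatedly matches the ,64,col,col shape instead of reshaping the lines first, and like the two-command version it does not deduplicate):
awk '{ while (match($0, /(^|,)64,[^,]+,[^,]+/)) {
         s = substr($0, RSTART, RLENGTH); sub(/^,/, "", s)
         print s
         $0 = substr($0, RSTART + RLENGTH)
     } }' file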
Though the sample data in the question contains redundant lines, karakfa (see below) reminds me that the question asks for unique entries. This version uses the keys of an associative array to keep track of duplicate records:
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
(Note that this prints every occurrence; it does not filter duplicates.)
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
This assumes columns are separated by blanks or commas. The sort -u provides the uniqueness (also possible in sed, but it would need another "simple" action of the same kind added in this case).
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
P.S. Your sample output doesn't match the given input.

Checking number prefix

I'm having some trouble with my script.
I am currently using:
awk '{anum=substr($1,3,22); sub(/^0+/, "", anum); print anum}' file1 | grep -nf file2 | cut -d: -f1 | awk 'FNR==NR{a[$1];next};FNR in a' - file1
file1
5000000000009855892590xxxx xxx
5000000000000068582654xxxx xxx
5000000000009855892580xxxx xxx
5000000000000765432100xxxx xxx
file2
9855892588
985589259
8265
76543210
I am getting the output below when using the two files above (file1 and file2):
5000000000009855892590xxxx xxx
5000000000000068582654xxxx xxx
5000000000000765432100xxxx xxx
But my expected output is just:
5000000000009855892590xxxx xxx
5000000000000765432100xxxx xxx
My problem is that it matches 8265 in the middle of 5000000000000068582654xxxx, which is wrong. What can I use in place of grep -nf to meet my condition? The numbers in file2 should match a prefix (or the whole) of the 3rd through 22nd digits of file1 (without leading zeros).
This will work for your example, but since I'm not really sure exactly how you determine what's valid, it may not be very robust.
gawk 'NR==FNR{a[$1]=$1;next}{match($0,/0+([1-9][0-9]+)0/,b)}a[b[1]]' file{2,1}
5000000000009855892590xxxx xxx
5000000000000765432100xxxx xxx
It creates an array of all the first fields of the first file (file2), then matches what I guessed is your valid string in the second file. If that string has been saved in the array, the line is printed.
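For reference, it is gawk's three-argument match() that fills the array b with the capture groups; a quick check against the first line of file1:
$ echo '5000000000009855892590xxxx' | gawk '{ match($0, /0+([1-9][0-9]+)0/, b); print b[1] }'
985589259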
A version that doesn't need gawk:
awk 'NR==FNR{a[$1]=$1;next}{n=substr($1,3,22);sub(/^0+/, "", n)
for(i in a)if(n~"^"a[i])print}' file2 file1
Same start as the other answer; then strip the prefix from the line as the OP did, and for each saved element check whether the extracted number starts with it.
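Since the file2 values are all digits, the dynamic regex "^"a[i] is safe here; a regex-free sketch of the same idea uses index() to test for a prefix (with a break added so a line prints at most once even if several prefixes match):
awk 'NR==FNR{a[$1];next}
{n=substr($1,3,22); sub(/^0+/,"",n)
for(i in a) if(index(n,i)==1){print; break}}' file2 file1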

How to remove duplicates by column (inverse ordering)

I've looked for this here but did not find this exact case. Sorry if it is a duplicate, but I couldn't find it.
I have a huge file in Debian that contains 4 columns separated by "#", with the following format:
username#source#date#time
For example:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I want to print unique rows based on the first two columns, and when duplicates are found, only the last event should be printed, based on date/time. With the list above, the result should be:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I have tested it using two commands:
cat file | sort -u -t# -k1,2
cat file | sort -r -u -t# -k1,2
But both of them print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40 --> Wrong line, it is older than the duplicate one
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
Is there any way to do it?
Thanks!
This should work:
tac file | awk -F# '!a[$1,$2]++' | tac
tac reverses the file so the latest event for each username#source pair comes first, awk keeps only the first occurrence of each pair, and the final tac restores the original order.
Output
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
First, you need to sort the input file so that, for duplicate username#source pairs, the times are ordered. Best is to sort in reverse, so the last event comes first. This can be done with a simple sort:
sort -r < yourfile
This produces the following from your input:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A222222#Juniper#2014-08-07#14:31:40
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
reverse-ordered lines, where for each username#source combination the latest event comes first.
Next, you need to filter the sorted lines to keep only the first event per combination. This can be done with several tools, such as awk, uniq, or perl.
So, the solution:
sort -r <yourfile | uniq -w16
(uniq -w16 compares only the first 16 characters, which works here because username and source are fixed width)
or
sort -r <yourfile | awk -F# '!seen[$1,$2]++'
or
sort -r yourfile | perl -F'#' -lanE 'say $_ unless $seen{"$F[0],$F[1]"}++'
all of the above will print:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
Finally, you can re-sort the unique lines however you want or need.
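For example, to end up sorted ascending on the first two fields:
sort -r yourfile | awk -F# '!seen[$1,$2]++' | sort -t'#' -k1,2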
Another option is a single pass of awk that remembers the last record per key and the order in which keys first appear:
awk -F\# '{ p = (($1 FS $2) in a); a[$1 FS $2] = $0 }
!p { keys[++k] = $1 FS $2 }
END { for (k = 1; k in keys; ++k) print a[keys[k]] }' file
Output:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
If you know for a fact that the first column is always 7 characters long, and the second column is also 7 characters long, you can extract unique lines considering only the first 16 characters with:
uniq -w 16 file
Since you want the later duplicate, you can reverse the data using tac prior to uniq and then reverse the output again:
tac file | uniq -w 16 | tac
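With the sample data, where the duplicate pair happens to be adjacent already, this yields the desired result:
$ tac file | uniq -w 16 | tac
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30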
Update: As commented below, uniq needs the lines to be sorted. In that case this starts to become contrived, and the awk-based suggestions are better. Something like this would still work, though (the stable sort keeps equal keys in input order, so after tac the latest event leads each group):
sort -s -t"#" -k1,2 file | tac | uniq -w 16 | tac
