I can get the unique values from a column using the command below:
cut -d',' -f3 file.txt | uniq -c
This gives me the unique values in field 3.
But how can I get the unique combinations of two fields?
input
A,B,C
B,C,D
D,B,C
H,C,D
K,C,D
output
2 B,C
3 C,D
You can specify a range of fields with -f 2-3, or a list of fields with -f 2,3:
cut -d',' -f2-3 file.txt | sort | uniq -c
uniq only detects repeated lines when they are adjacent, so the input must be sorted before it reaches uniq.
Output
2 B,C
3 C,D
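The adjacency requirement is easy to see by running the same pipeline with and without sort. This is a minimal sketch using the sample input from the question; the variable names input, unsorted and sorted are only for illustration.

```shell
# Sample input copied from the question.
input='A,B,C
B,C,D
D,B,C
H,C,D
K,C,D'

# Without sort: the two "B,C" lines are not adjacent, so uniq counts them separately.
unsorted=$(printf '%s\n' "$input" | cut -d',' -f2-3 | uniq -c)

# With sort: equal lines become adjacent and are counted together.
sorted=$(printf '%s\n' "$input" | cut -d',' -f2-3 | sort | uniq -c)

printf 'without sort:\n%s\nwith sort:\n%s\n' "$unsorted" "$sorted"
```

Without sort, B,C is reported twice with a count of 1 each; with sort, the counts are 2 for B,C and 3 for C,D.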
Another option, which you may find more flexible for processing the input, is awk. You can use the concatenation of the fields in question as an array index to count the occurrences of each unique combination, then print the results in the END rule, e.g.
awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' file
Example Use/Output
With your example data in a file named input, you would get:
$ awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' input
3 C,D
2 B,C
awk arrays are associative rather than indexed, so for (i in a) visits the keys in an unspecified order. You can preserve the order of appearance with an additional array if needed, or simply pipe the output to sort for whatever order you like.
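Preserving the order of appearance could be sketched as follows. The array names count and order and the counter n are illustrative, not part of the original answer; the input is the sample from the question.

```shell
result=$(
  printf '%s\n' 'A,B,C' 'B,C,D' 'D,B,C' 'H,C,D' 'K,C,D' |
  awk -F, '
  {
      key = $2 "," $3
      if (!(key in count)) order[++n] = key   # remember each key the first time it is seen
      count[key]++
  }
  END {
      # walk the keys in first-seen order instead of using for (k in count)
      for (i = 1; i <= n; i++) print count[order[i]], order[i]
  }'
)
printf '%s\n' "$result"
```

Here B,C is printed before C,D because it appears first in the input, regardless of how awk stores the array internally.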
I need to know whether I can match a value computed in awk against another file while I am inside a piped command, like the one below:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
From here I need to check whether the computed value $4*10^10+$6 is present in (matches) any value in a column of another file. If it is present, print the row; otherwise just move on.
The file to match against looks like this:
a,b,c,d,e
1,2,30000000000,3,4
I need to match against the 3rd column of the above file.
Ideally this would all be in the same command, because without this check it prints more than 100 million rows (a very large file).
I have already read this question.
Adding more info:
Breaking my command into parts
part1-command:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "Something:"
part1-output(just showing 1 iteration output):
Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463
part2-command (now I use awk):
awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
part2-command output: currently the values below are printed (note how 1*10^10+10588429 gives 10010588429):
1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413
Now I need to add a check (within the command, near awk) so that a row is printed only if its computed value, e.g. 10010588429, is present in another file (say another_file.csv, as below):
another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
output should only be
1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413
So for every row that awk produces, we check for a matching entry in column C of the second file.
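Using the associative array approach from the previous question, put a hyphen in place of the first file name to make awk read the input stream. Because the stream and the CSV file use different separators, FS is reset to a comma between the two inputs, and spaces are stripped from the CSV column before the lookup. Note that this holds every row of the stream in memory until the CSV file is read.
Example:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' '
BEGIN { OFS = ","; print "Col1,Col2,Col3,Col4" }
NR == FNR {                   # first input: the piped stream ("-")
    key = sprintf("%.0f", $4 * 10^10 + $6)
    out[key] = $4 OFS $6 OFS key OFS $8
    next
}
{ k = $3; gsub(/ /, "", k) }  # second input: another_file.csv, comma separated
k in out { print out[k] }
' - FS=',' another_file.csv > output.csv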
More info on the merging process in the answer cited in the question:
Using AWK to Process Input from Multiple Files
Here is a template that you can adapt for your computation:
awk 'BEGIN {FS=OFS=","}
NR==FNR {lookup[$3]; next}
/sometext/ {c=4}
c&&c--&&/somemoretext/ {value= # implement your computation here
if(value in lookup)
print "what you want"}' lookup.file FS=':' grep.files...
Here awk loads the values from the third column of the first file (which is comma delimited) into the lookup array (a hash map in disguise). For the remaining files it sets the delimiter to ':' and, like grep -A3, looks for the second pattern on the line matching the first pattern and on the 3 lines after it, does the computation, and prints what you want.
With awk you also have more control over which column a pattern matches; here I simply replicated the grep example.
This is another simplified example to focus on the core of the problem.
awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' |
awk 'NR==FNR{lookup[$1]; next}
$1 in lookup' perfect.numbers -
The first process creates 1000 random records, and the second keeps those whose first field is in the lookup table, for example:
28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929
given this lookup file:
$ head perfect.numbers
6
28
496
8128
The piped data is substituted for the second file at -.
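Applied to the sample data from the question, the lookup-first approach might look like the sketch below. Several things here are assumptions: the printf stands in for the output of the real grep pipeline, the Something:38 prefix on the second and third stream lines is filled in only so that they parse like the one line the question shows, the lookup file is written to a temporary file, and spaces are stripped from the lookup column because the sample CSV has a space before each value.

```shell
# Temporary copy of another_file.csv from the question.
lookup_file=$(mktemp)
cat > "$lookup_file" <<'EOF'
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
EOF

result=$(
  # Stand-in for: somebinaryGivingOutputToSTDOUT | grep -A3 ... | grep ...
  printf '%s\n' \
    'Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463' \
    'Something:38|Something1:3|Something2:12394810|Something3:1491539456372359082' \
    'Something:38|Something1:1|Something2:10588430|Something3:1491539456372366413' |
  awk '
  BEGIN { OFS = ","; print "Col1,Col2,Col3,Col4" }
  NR == FNR {                      # first input: the comma-separated lookup file
      key = $3
      gsub(/[[:space:]]/, "", key) # drop the stray space before the value
      lookup[key]
      next
  }
  {                                # second input: the stream, split on ":" or "|"
      value = sprintf("%.0f", $4 * 10^10 + $6)
      if (value in lookup) print $4, $6, value, $8
  }' FS=',' "$lookup_file" FS='[:|]' -
)
printf '%s\n' "$result"
rm -f "$lookup_file"
```

Only the small lookup file is held in memory; the stream is filtered row by row, which matters when the stream has more than 100 million rows.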
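You can pipe your grep or awk output into a while read loop, which gives you some degree of freedom. There you can decide whether to forward each line. Note that this starts an awk and a grep process for every input line, so it will be slow on very large inputs:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | while IFS= read -r LINE; do
    # compute only the value to look up, not the whole output row
    COMPUTED=$(printf '%s\n' "$LINE" | awk -F '[:|]' '{printf "%.0f\n", $4*10^10+$6}')
    if grep -qwF -- "$COMPUTED" /the/file/to/search; then
        printf '%s\n' "$LINE"
    fi
done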
I need to print the 2 columns after a specific string (in my case 64). There can be multiple instances of 64 within the same CSV row, but the next instance will not occur within 3 columns of the previous one. Each instance should be printed on its own line, and the output should be unique. The problem is that the string does not fall in the same column in every row: each row has dynamic data and the CSV has no header. Say the input file is as below (just a sample; the actual file has approx. 300 columns and 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance concerns. I have searched many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of a line, and the second prints the first three columns of every line that begins with 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
This assumes columns are separated by blanks or commas.
It needs sort -u for uniqueness (this is possible in sed too, but it would mean adding another "simple" action of the same kind).
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.
I've been looking for this here but did not find the exact case. Sorry if it is a duplicate, but I couldn't find it.
I have a huge file in Debian that contains 4 columns separated by "#", in the following format:
username#source#date#time
For example:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I want to print unique rows based on the first two columns and, where duplicates are found, print the latest event by date/time. With the list above, the result should be:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
I have tested it using two commands:
cat file | sort -u -t# -k1,2
cat file | sort -r -u -t# -k1,2
But both of them print the following:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-07#14:31:40 --> Wrong line, it is older than the duplicate one
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
Is there any way to do it?
Thanks!
This should work
tac file | awk -F# '!a[$1,$2]++' | tac
Output
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
First, you need to sort the input file to fix the order of the lines, so that for duplicate username#source pairs the times are ordered. It is best to sort in reverse, so the last event comes first. This can be done with a simple sort, like:
sort -r < yourfile
From your input this will produce:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A222222#Juniper#2014-08-07#14:31:40
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
i.e. reverse-ordered lines, where for each username#source combination the latest event comes first.
Next, you need to filter the sorted lines to keep only the first event for each combination. This can be done with several tools, such as awk, uniq or perl.
So, the solutions:
sort -r <yourfile | uniq -w16
or
sort -r <yourfile | awk -F# '!seen[$1,$2]++'
or
sort -r yourfile | perl -F'#' -lanE 'say $_ unless $seen{"$F[0],$F[1]"}++'
All of the above will print:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Windows#2014-08-08#10:27:30
A111111#Juniper#2014-08-10#14:32:55
Finally, you can re-sort the unique lines however you need.
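The sort-then-filter idea can also be written with explicit sort keys, so that it does not depend on whole-line ordering or on fixed column widths. This is a sketch under assumptions: the key layout (fields 1-2 ascending, fields 3-4 descending) and LC_ALL=C are my choices, and the input is the sample from the question.

```shell
result=$(
  printf '%s\n' \
    'A222222#Windows#2014-08-18#10:47:16' \
    'A222222#Juniper#2014-08-07#14:31:40' \
    'A222222#Juniper#2014-08-08#09:15:34' \
    'A111111#Juniper#2014-08-10#14:32:55' \
    'A111111#Windows#2014-08-08#10:27:30' |
  # Group by username#source, newest date/time first within each group.
  LC_ALL=C sort -t'#' -k1,2 -k3,4r |
  # Keep only the first (newest) line of each group.
  awk -F'#' '!seen[$1,$2]++'
)
printf '%s\n' "$result"
```

The result comes out ordered by username and source rather than in the original file order, so a final re-sort may still be needed.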
awk -F\# '{ p = ($1 FS $2 in a ); a[$1 FS $2] = $0 }
!p { keys[++k] = $1 FS $2 }
END { for (k = 1; k in keys; ++k) print a[keys[k]] }' file
Output:
A222222#Windows#2014-08-18#10:47:16
A222222#Juniper#2014-08-08#09:15:34
A111111#Juniper#2014-08-10#14:32:55
A111111#Windows#2014-08-08#10:27:30
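The command above keeps the last line seen for each key, so it relies on the file already being in chronological order within each username#source pair. If that cannot be assumed, a variant that compares the date and time explicitly might look like this sketch. The array names latest, line and order are illustrative, and the two A222222#Juniper lines are deliberately swapped relative to the question's sample to show that input order no longer matters.

```shell
result=$(
  printf '%s\n' \
    'A222222#Windows#2014-08-18#10:47:16' \
    'A222222#Juniper#2014-08-08#09:15:34' \
    'A222222#Juniper#2014-08-07#14:31:40' \
    'A111111#Juniper#2014-08-10#14:32:55' \
    'A111111#Windows#2014-08-08#10:27:30' |
  awk -F'#' '
  {
      key = $1 FS $2
      stamp = $3 " " $4            # YYYY-MM-DD HH:MM:SS compares correctly as a string
      if (!(key in latest)) order[++n] = key
      if (!(key in line) || stamp > latest[key]) {
          latest[key] = stamp
          line[key] = $0
      }
  }
  END { for (i = 1; i <= n; i++) print line[order[i]] }'
)
printf '%s\n' "$result"
```

Keys are printed in the order they first appear, each with its newest event.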
If you know for a fact that the first column is always 7 characters long and the second column is also 7 characters long, you can extract unique lines by comparing only the first 16 characters (7 + 1 + 7 + 1, separators included) with:
uniq -w 16 file
Since you want the last of each set of duplicates, you can reverse the data with tac before uniq and then reverse the output again:
tac file | uniq -w 16 | tac
Update: As commented below, uniq needs duplicate lines to be adjacent, which means sorting first. At that point this starts to become contrived, and the awk-based suggestions are better. Something like this would still work, though:
sort -s -t"#" -k1,2 file | tac | uniq -w 16 | tac