Unix: Get the latest entry from the file - shell

I have a file containing names and dates. For each name I want to keep only the entry with the latest date. How do I do it?
For example:
>cat user.txt
"a","03-May-13
"b","13-May-13
"a","13-Aug-13
"a","13-May-13
I am using the command sort -u user.txt, which gives the following output:
"a","11-May-13
"a","13-Aug-13
"a","13-May-13
"b","13-May-13
but I want the following output:
"a","13-Aug-13
"b","13-May-13
Can someone help?
Thanks.

Try this:
sort -t, -k2 user.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Explanation:
Sort the entries by the date field in ascending order and pipe the sorted result to awk, which uses the first field as a key; each later entry overwrites the earlier one, so only the last (latest) entry for each key survives and is printed at the end.
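Note that this first pipeline compares the dates as plain strings, so "13-Aug-13" sorts before "13-May-13" (A before M). On the sample user.txt it would therefore keep the May entry for "a" (the for (e in a) output order is unspecified):
"a","13-May-13
"b","13-May-13
This is what prompts the edit below.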
EDIT
Okay, so the entries can't be sorted lexicographically after all: each date needs to be converted to a timestamp so it can be compared numerically. Use the following:
awk -F",\"" '{ cmd=" date --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
If you are using macOS, use gdate (GNU date from coreutils) instead:
awk -F",\"" '{ cmd=" gdate --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
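If GNU awk is available, here is a minimal single-process sketch using its mktime() builtin instead of spawning date once per line. It assumes gawk and the exact "DD-Mon-YY" format shown above, and it prints the whole latest line per name (output order unspecified):
gawk -F',"' '
BEGIN {
    # map month names to numbers: Jan=1 ... Dec=12
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i in m) mon[m[i]] = i
}
{
    split($2, d, "-")    # d[1]=day, d[2]=month name, d[3]=two-digit year
    ts = mktime("20" d[3] " " mon[d[2]] " " d[1] " 0 0 0")
    if (ts > best[$1]) { best[$1] = ts; line[$1] = $0 }
}
END { for (k in line) print line[k] }' user.txt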

I think you need to sort by year, month, and day.
Can you try this?
awk -F"\"" '{print $2"-"$4}' data.txt | sort -t- -k4 -k3M -k2 | awk -F- '{kv[$1]=$2"-"$3"-"$4}END{for(k in kv){print k,kv[k]}}'
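For the sample data, this should print (space-separated, quotes stripped; the for (k in kv) output order is unspecified):
a 13-Aug-13
b 13-May-13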

For me this is doing the job: I sort on the month and then apply the logic that @neevek used. So far I have not found a case that breaks it, but I am not sure this is a foolproof solution.
sort -t- -k2 -M user1.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Can someone tell me if this solution has any issues?
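One case that does seem to break it: -M compares only the month names, so a later year with an earlier month loses. A minimal reproduction, with hypothetical two-line contents for user1.txt:
printf '%s\n' '"a","13-Aug-13' '"a","13-Feb-14' > user1.txt
sort -t- -k2 -M user1.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
"a","13-Aug-13
Here "13-Feb-14" is the newer date, but Feb sorts before Aug, so the stale August entry wins.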

How about this?
grep `cut -d'"' -f4 user.txt | sort -t- -k 3 -k 2M -k 1n | tail -1` user.txt
Explanation: extract the dates with cut (the fourth field when splitting on double quotes), sort them by year, then month, then day, take the latest with tail -1, and finally grep for that date in the original file.
Edit: fixed to sort by month.

Related

Cut and sort delimited dates from stdout via pipe

I am trying to split some strings from stdout to extract the dates, but there are two cases.
full.20201004T033103Z.vol93.difftar.gz
full.20201007T033103Z.vol94.difftar.gz
These should produce 20201007T033103Z, the date nearest to now (the newest).
Or:
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200929T033103Z.to.20200908T033103Z.vol10.difftar.gz
Here it should take the second date (after .to.), not the first one, and print only the newest date: 20200908T033103Z.
What I tried:
cat dates_file | awk -F '.to.' 'NF > 1 {print $2}' | cut -d\. -f1 | sort -r -t- -k3.1,3.4 -k2,2 | head -1
This only works for the second case and does not cover the first; I am also not sure about the date-sorting logic.
Here is some sample data:
full.20201004T033103Z.vol93.difftar.gz
full.20201004T033103Z.vol94.difftar.gz
full.20201004T033103Z.vol95.difftar.gz
full.20201004T033103Z.vol96.difftar.gz
full.20201004T033103Z.vol97.difftar.gz
full.20201004T033103Z.vol98.difftar.gz
full.20201004T033103Z.vol99.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.manifest
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol10.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol11.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol12.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol13.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol14.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol15.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol16.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol17.difftar.gz
To get the most recent date from your sample data you can use this awk:
awk '{
# strip the prefix: everything up to ".to." (inc files) or up to the first "." (full files)
sub(/^(.*\.to|[^.]+)\./, "")
# drop the trailing ".volN"/".manifest" part and the T/Z letters
gsub(/\..+$|[TZ]/, "")
}
# the cleaned values are fixed-width digit strings, so a plain comparison keeps the newest
$0 > max {
max = $0
}
END {
print max
}' file
20201004033103
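Under the same assumptions (fixed-width timestamps, so lexicographic order is chronological; a sed with -E support), a shorter sed/sort sketch gives the same answer while keeping the T and Z:
sed -E 's/^(.*\.to|[^.]+)\.//; s/\..*$//' file | sort | tail -1
20201004T033103Z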

Use of awk filter to get student record details in descending order of total score

Student details are stored in a file as follows:
Roll_no,name,socre1,score2
101,ABC,50,55
102,XYZ,48,54
103,CWE,42,34
104,ZSE,65,72
105,FGR,31,45
106,QWE,68,45
Q. Write the Unix command to display the Roll_no and name of every student whose total score is greater than 100. The student details are to be displayed sorted in descending order of total score.
The total score is calculated as follows:
totalscore = score1 + score2
The file also contains the header (Roll_no,name,socre1,score2).
My solution:
awk 'BEGIN {FS=",";OFS=" "} {if(NR>1){if($3+$4>100){s[$1]=$2}}} END{for (i in s) {print i,h[i]}}' stu.txt| sort -rk 2n
I can't work out how to sort according to the total score. Please help!
Expected output:
104 ZSE
106 QWE
101 ABC
102 XYZ
Could you please try the following. To keep the calculation simple: first get the total for every line where it is greater than 100, then sort in reverse order by that total, then print only the first 2 columns with cut.
awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |
sort -t, -nr -k3 |
cut -d',' -f 1-2
Or, if you want the output space-delimited, try the following.
awk 'BEGIN{FS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file |
sort -nr -k3 |
cut -d' ' -f 1-2
Explanation: a detailed explanation of the above.
awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file | ##Start awk, setting FS and OFS to comma. When the sum of the 3rd and 4th columns is greater than 100, print the 1st and 2nd fields along with that sum, and pass the output on to the next command.
sort -t, -nr -k3 | ##Sort numerically in reverse order on the 3rd column, with comma as the delimiter, and send the output on to the next command.
cut -d',' -f 1-2 ##Keep only the first 2 fields (roll number and name), using comma as the delimiter.
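For the sample stu.txt in the question, this first pipeline should print:
104,ZSE
106,QWE
101,ABC
102,XYZ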
Or:
sort -t, -nr -k3 < <(awk 'BEGIN{FS=OFS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file) |
cut -d',' -f 1-2
Or, if you need the output space-delimited, try the following.
sort -nr -k3 < <(awk 'BEGIN{FS=","} $3+$4>100{print $1,$2,$3+$4}' Input_file) |
cut -d' ' -f 1-2
$ awk 'BEGIN {OFS=FS=","}
NR==1 {print $0, "total"; next}
{if(($5=$3+$4)>100) print | "sort -t, -k5nr"}' file
Roll_no,name,socre1,score2,total
104,ZSE,65,72,137
106,QWE,68,45,113
101,ABC,50,55,105
102,XYZ,48,54,102
Without the header and the individual scores:
$ awk 'BEGIN{OFS=FS=","}
NR>1 && ($3+=$4)>100{print $1,$2,$3}' file | sort -t, -k3nr
104,ZSE,137
106,QWE,113
101,ABC,105
102,XYZ,102
or
$ awk 'BEGIN{OFS=FS=","}
NR>1 && ($3+=$4)>100 && NF--' file | sort -t, -k3nr
104,ZSE,137
106,QWE,113
101,ABC,105
102,XYZ,102
Without the final score, and not comma-delimited:
$ awk -F, 'NR>1 && ($3+=$4)>100 && NF--' file | sort -k3nr | cut -d' ' -f1,2
104 ZSE
106 QWE
101 ABC
102 XYZ
This reads as written:
if the line number is greater than one (skip the header), AND
if field 3 + field 4 > 100 (the sum is assigned back to field 3), then,
since both conditions are satisfied, decrement the field count so that the last field won't be printed;
sort the results on the third field, numerically and in reverse;
cut keeps only the first two fields, dropping the total.
You were close; the END loop printed h[i] (an array that was never assigned) instead of s[i]:
awk 'BEGIN {FS=OFS=","} {if(NR>1){if($3+$4>100){s[$1]=$2}}} END{for (i in s) {print i,s[i]}}' stu.txt| sort -rk 2n

uniq sort parsing

I have a file with fields separated by ";", like this:
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
I would like to identify all non-unique IP addresses (3rd column). With the example the extract should be:
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
The output should be sorted by IP address (3rd column),
using simple commands like cat, uniq, sort, and awk (not Perl, not Python, only shell).
Any idea?
$ awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file|sort -t";" -k3
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
awk picks all lines whose third field is duplicated (the file is read twice: the first pass counts, the second filters);
sort then sorts by IP.
You can also try this solution using grep, cut, sort, uniq, and a casual process substitution in the middle.
grep -f <(cut -d ';' -f3 file | sort | uniq -d) file | sort -t ';' -k3
It is not really elegant (I actually prefer the awk answer given above), but I think worth sharing, since it accomplishes what you want.
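One caveat worth noting: grep -f treats each extracted IP as a regular expression, so the dots match any character, and an IP that happens to appear in another field would also match. Adding -F at least makes the patterns literal (a sketch; the substring-match caveat remains):
grep -F -f <(cut -d ';' -f3 file | sort | uniq -d) file | sort -t ';' -k3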
Here is another awk-assisted pipeline:
$ awk -F';' '{print $0 "\t" $3}' file | sort -sk2 | uniq -Df1 | cut -f1
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
A single pass, so no special caching; it also keeps the original input order within each IP group (sort -s is stable). Assumes tab doesn't appear in the fields.
This is very similar to Kent's answer, but with a single pass through the file. The tradeoff is memory: you need to store the lines to keep. This uses GNU awk for the PROCINFO variable.
awk -F';' '
{count[$3]++; lines[$3] = lines[$3] $0 ORS}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (key in count)
if (count[key] > 1)
printf "%s", lines[key]
}
' file
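For the sample file, this should print the duplicated-IP groups in ascending IP (string) order:
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8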
The equivalent Perl:
perl -F';' -lane '
$count{$F[2]}++; push @{$lines{$F[2]}}, $_
} END {
print join $/, @{$lines{$_}}
for sort grep {$count{$_} > 1} keys %count
' file
awk + sort + uniq + cut:
$ awk -F ';' '{print $0,$3}' <file> | sort -k2 | uniq -D -f1 | cut -d' ' -f1
sort + awk
$ sort -t';' -k3,3 <file> | awk -F ';' '($3==k){c++;b=b"\n"$0}($3!=k){if (c>1) print b;c=1;k=$3;b=$0}END{if(c>1)print b}'
awk
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{for (i in k) if(k[i]>1) for(j=1;j<=k[i];j++) print b[i"_"j] }' <file>
This buffers the full file (just as sort does) and keeps track of how many times each key appears. At the end, if a key appeared more than once, its full set of lines is printed.
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
If you want it sorted:
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{ asorti(k,l);
for (i=1; i in l; i++) if(k[l[i]]>1) for(j=1;j<=k[l[i]];j++) print b[l[i]"_"j] }' <file>
(asorti requires GNU awk; the indexed loop walks the sorted keys in order.)

How to count duplicates in Bash Shell

Hello guys, I want to count how many duplicates there are in a column of a file and put the number next to them. I use awk and sort like this:
awk -F '|' '{print $2}' FILE | sort | uniq -c
but the count (from uniq -c) appears on the left side of the duplicates.
Is there any way to put the count on the right side instead of the left, using my code?
Thanks for your time!
Ideally you should show us your Input_file so that we could craft a single command for this requirement; since you haven't shown it, here is a solution built on top of your own command. The final awk reprints fields 2 through NF first, so values that themselves contain spaces stay intact:
awk -F '|' '{print $2}' FILE | sort | uniq -c | awk '{for(i=2;i<=NF;i++){printf("%s ",$i)};printf("%s%s",$1,RS)}'
You can just use awk to reverse the output, like below:
awk -F '|' '{print $2}' FILE | sort | uniq -c | awk '{print $2" "$1}'
awk -F '|' '{print $2}' FILE | sort | uniq -c| awk '{a=$1; $1=""; gsub(/^ /,"",$0);print $0,a}'
You can use awk to compute the counts itself, so your command can be simplified as follows:
awk -F '|' '{a[$2]++}END{for(i in a) print i,a[i]}' FILE | sort
Check this command:
awk -F '|' '{c[$2]++} END{for (i in c) print i, c[i]}' FILE | sort
Using awk to do the counting is enough. If you do not want the result sorted by the column value, drop the final pipe and sort.
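A quick demo with made-up input (the data here is hypothetical: three firefox rows and one chrome row):
printf '1|firefox\n2|chrome\n3|firefox\n4|firefox\n' |
awk -F '|' '{c[$2]++} END{for (i in c) print i, c[i]}' | sort
chrome 1
firefox 3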

Find the first or third column

The following command works as expected. What I need to extract is the thread ID, which appears in either the first or the third column.
# tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)"
2856057 Connect root@localhost on
111116 5:14:01 2856094 Connect root@localhost on
If the line starts with a date, select the third column, i.e. 2856094; otherwise select the first column, i.e. 2856057.
Expected output:
2856057
2856094
Another way to look at it is that you always take the fourth column when counting from the right:
awk '{ print $(NF-3) }'
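For the two sample lines this picks the right column either way:
$ printf '%s\n' '2856057 Connect root@localhost on' '111116 5:14:01 2856094 Connect root@localhost on' | awk '{ print $(NF-3) }'
2856057
2856094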
Otherwise, if the date is really the only reliable indicator, try this:
awk -v Date=$(date "+%y%m%d") '$1 == Date { print $3; next } { print $1 }'
If your data really is that regular (i.e. all the columns are fixed width), then you could use cut:
tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)" | cut -c17-23
This might work for you:
tail -1000 general.log | sed -e '/abc\|slave_user/d;/ Connect.*/!d;s///;s/.* //'
Here the empty s/// reuses the preceding / Connect.*/ pattern to delete everything from " Connect" onward, and s/.* // then strips everything up to the last remaining space, leaving the thread ID.
Use awk's built-in variable NF to capture the number of fields. If it equals 6, print the 3rd column; otherwise print the 1st column.
awk 'NF==6{ print $3;next } { print $1 }' INPUT_FILE
Without knowing the format of the file, maybe try:
$ tail -1000 general.log | grep Connect | egrep -v "(abc|slave_user)" | awk '{if ($3 == "root@localhost"){print $1;}else{print $3}}'
Or maybe this, which is simpler:
$ awk '/Connect/ {if ($3 == "root@localhost"){print $1;}else{print $3}}' general.log
I tried. If I'm wrong, or there is a better way, I too will learn it in time. :)
Maybe this, using int()? int($3) gives 0 when $3 is the non-numeric "root@localhost", so the test picks the right column:
$ awk '/Connect/ {if (!int($3)){print $1;}else{print $3}}' general.log
