How to count duplicates in Bash Shell - bash

Hello guys, I want to count how many duplicates there are in a column of a file and put the number next to them. I use awk and sort like this:
awk -F '|' '{print $2}' FILE | sort | uniq -c
but the count (from the uniq -c) appears at the left side of the duplicates.
Is there any way to put the count on the right side instead of the left, using my code?
Thanks for your time!

Though I believe you should show us your Input_file so that we could create a single command for this requirement; since you haven't shown an Input_file, I am trying to solve it with your command itself.
awk -F '|' '{print $2}' FILE | sort | uniq -c | awk '{for(i=2;i<=NF;i++){printf("%s ",$i)};printf("%s%s",$1,RS)}'
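For illustration, with a small hypothetical Input_file (the real one was not shown) whose 2nd pipe-delimited field holds the values being counted:
$ cat FILE
foo|chrome|x
bar|firefox|y
baz|chrome|z
$ awk -F '|' '{print $2}' FILE | sort | uniq -c
      2 chrome
      1 firefox
$ awk -F '|' '{print $2}' FILE | sort | uniq -c | awk '{for(i=2;i<=NF;i++){printf("%s ",$i)};printf("%s%s",$1,RS)}'
chrome 2
firefox 1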

You can just use awk to reverse the output like below:
awk -F '|' '{print $2}' FILE | sort | uniq -c | awk '{print $2" "$1}'

awk -F '|' '{print $2}' FILE | sort | uniq -c| awk '{a=$1; $1=""; gsub(/^ /,"",$0);print $0,a}'

You can use awk to calculate the number of duplicates, so your command can be simplified as follows:
awk -F '|' '{a[$2]++}END{for(i in a) print i,a[i]}' FILE | sort

Check this command:
awk -F '|' '{c[$2]++} END{for (i in c) print i, c[i]}' FILE | sort
Using awk to do the counting is enough. If you do not want the output sorted by browser, remove the pipe and the sort.
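With the same hypothetical FILE as in the earlier example (two chrome lines, one firefox), this should print:
chrome 2
firefox 1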

Related

How do I remove the header in the df command?

I'm trying to write a bash command that will sort all volumes by the amount of data they have used, and I tried using:
df | awk '{print $1 | "sort -r -k3 -n"}'
Output:
map
devfs
Filesystem
/dev/disk1s5
/dev/disk1s2
/dev/disk1s1
But this also shows the header called Filesystem.
How do I remove that?
For your specific case, i.e. using awk, @codeforester's answer (using awk's NR (Number of Records) variable) is the best.
In a more general case, in order to remove the first line of any output, you can use the tail -n +N option in order to output starting with line N:
df | tail -n +2 | other_command
This will remove the first line in df output.
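So, for the original goal (filesystems ordered by used space, without the header), one possible sketch is:
# -P keeps each entry on one line; sort numerically (descending) on the Used column
df -P | tail -n +2 | sort -rn -k3,3 | awk '{print $1}'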
Skip the first line, like this:
df | awk 'NR>1 {print $1 | "sort -r -k3 -n"}'
I normally use one of these options, if I have no reason to use awk:
df | sed 1d
The 1d option to sed says delete the first line, then print everything else.
df | tail -n+2
The -n+2 option to tail says to start at line 2 and print everything until end of input.
I suspect sed is faster than awk or tail, but I can't prove it.
EDIT
If you want to use awk, this will print every line except the first:
df | awk '{if (FNR>1) print}'
FNR is the File Record Number, i.e. the line number within the current input file. If it is greater than 1, print the input line.
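With a single file (or a pipe) FNR and NR behave the same; the difference only shows up with several input files, e.g. with two hypothetical files:
awk 'FNR>1' file1 file2   # skips the first line of each file
awk 'NR>1'  file1 file2   # skips only the very first line of the first file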
Count the lines in the output of df with wc, then subtract one line to output a headerless df with tail ...
LINES=$(df|wc -l)
LINES=$((${LINES}-1))
df | tail -n ${LINES}
OK - I see one-liners - here is mine ...
DF_HEADERLESS=$(LINES=$(df|wc -l); LINES=$((${LINES}-1));df | tail -n ${LINES})
And for formatted output, let printf loop over it ...
printf "%s\t%s\t%s\t%s\t%s\t%s\n" ${DF_HEADERLESS} | awk '{print $1 | "sort -r -k3 -n"}'
This might help with GNU df and GNU sort:
df -P | awk 'NR>1{$1=$1; print}' | sort -r -k3 -n | awk '{print $1}'
With GNU df and GNU awk:
df -P | awk 'NR>1{array[$3]=$1} END{PROCINFO["sorted_in"]="#ind_num_desc"; for(i in array){print array[i]}}'
Documentation: 8.1.6 Using Predefined Array Scanning Orders with gawk
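A commented sketch of the same idea (note that $3, the Used column, is the array index here, so two filesystems with identical usage would collide):
df -P | awk 'NR>1{array[$3]=$1}                   # map the Used value to the filesystem name
    END{PROCINFO["sorted_in"]="#ind_num_desc"     # iterate indices numerically, descending
        for(i in array) print array[i]}'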
Removing something from a command output can be done very simply, using grep -v, so in your case:
df | grep -v "Filesystem" | ...
(You can do your awk at the ...)
When you're not sure about upper or lower case, you might add -i:
df | grep -i -v "FiLeSyStEm" | ...
(The mixed upper/lower case is meant as a clarifying joke :-) )

How can I send the last column of the first line to standard output?

For example
The file TEMPFILE.TXT contains this:
PROC-|STUFF_THINGS|MORE STUFF|PING|AUTOSYS
PROC-|ASTUFF_THINGS_XX_2|Print-Wire|AUTONON
I only want to print AUTOSYS to standard output.
Use awk:
awk -F'|' 'NR==1 {print $NF; exit}' file
If you don't mind hardcoding the number of columns, then:
head -1 file | cut -d'|' -f5
Column-count agnostic approach, but more round-about and expensive:
head -1 file | rev | cut -f1 -d'|' | rev
In all these, we are only reading the first line of the file.
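For example, against the sample TEMPFILE.TXT from the question:
$ awk -F'|' 'NR==1 {print $NF; exit}' TEMPFILE.TXT
AUTOSYS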
You can try :
while read -r line; do echo "${line##*|}"; break; done < TEMPFILE.TXT

uniq sort parsing

I have one file with field separated by ";", like this:
test;group;10.10.10.10;action2
test2;group;10.10.13.11;action1
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test5;group2;10.10.10.12;action5
test6;group4;10.10.13.11;action8
I would like to identify all non-unique IP addresses (3rd column). With the example the extract should be:
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
Sorted by IP address (3rd column).
Using simple commands like cat, uniq, sort, awk (not Perl, not Python, only shell).
Any idea?
$ awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file|sort -t";" -k3
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
awk picks all duplicated ($3) lines
sort sorts by ip
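As a commented sketch, the idiom reads the file twice (hence "file file" on the command line):
awk -F';' '
    NR==FNR { a[$3]++; next }   # 1st pass: count how often each IP ($3) occurs
    a[$3] > 1                   # 2nd pass: a bare condition prints lines whose IP occurred more than once
' file file | sort -t";" -k3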
You can also try this solution using grep, cut, sort, uniq, and a casual process substitution in the middle.
grep -f <(cut -d ';' -f3 file | sort | uniq -d) file | sort -t ';' -k3
It is not really elegant (I actually prefer the awk answer given above), but I think worth sharing, since it accomplishes what you want.
Here is another awk-assisted pipeline:
$ awk -F';' '{print $0 "\t" $3}' file | sort -sk2 | uniq -Df1 | cut -f1
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
Single pass, no special caching; it also keeps the original order within each group (stable sorting). Assumes a tab doesn't appear in the fields.
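In other words, it is a decorate-sort-undecorate pattern; a commented sketch of the same pipeline:
awk -F';' '{print $0 "\t" $3}' file |   # decorate: append the IP as an extra, tab-separated key
  sort -sk2 |                           # stable sort on that key (whitespace field 2)
  uniq -Df1 |                           # print all duplicated lines (-D), skipping the 1st field when comparing (-f1)
  cut -f1                               # undecorate: drop the appended key again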
This is very similar to Kent's answer, but with a single pass through the file. The tradeoff is memory: you need to store the lines to keep. This uses GNU awk for the PROCINFO variable.
awk -F';' '
{count[$3]++; lines[$3] = lines[$3] $0 ORS}
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (key in count)
if (count[key] > 1)
printf "%s", lines[key]
}
' file
The equivalent perl
perl -F';' -lane '
$count{$F[2]}++; push @{$lines{$F[2]}}, $_
} END {
print join $/, @{$lines{$_}}
for sort grep {$count{$_} > 1} keys %count
' file
awk + sort + uniq + cut:
$ awk -F ';' '{print $0,$3}' <file> | sort -k2 | uniq -D -f1 | cut -d' ' -f1
sort + awk
$ sort -t';' -k3,3 <file> | awk -F ';' '($3==k){c++;b=b"\n"$0}($3!=k){if (c>1) print b;c=1;k=$3;b=$0}END{if(c>1)print b}'
awk
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{for (i in k) if(k[i]>1) for(j=1;j<=k[i];j++) print b[i"_"j] }' <file>
This buffers the full file (same as sort does) and keeps track of how many times each key (the 3rd field) appears. At the end, if a key appears more than once, the full set of lines for that key is printed.
test2;group;10.10.13.11;action1
test6;group4;10.10.13.11;action8
test;group;10.10.10.10;action2
test3;group3;10.10.10.10;action3
tes4;group;10.10.10.10;action4
If you want it sorted (asorti requires GNU awk):
$ awk -F ';' '{b[$3"_"++k[$3]]=$0; }
END{ asorti(k,l);
for (i in l) if(k[l[i]]>1) for(j=1;j<=k[l[i]];j++) print b[l[i]"_"j] }' <file>

Bash string replace on command result

I have a simple bash script which is getting the load average using uptime and awk, for example
LOAD_5M=$(uptime | awk -F'load averages:' '{ print $2}' | awk '{print $2}')
However this includes a ',' at the end of the load average
e.g.
0.51,
So I have then replaced the comma with a string replace like so:
LOAD_5M=${LOAD_5M/,/}
I'm not an awk or bash whizz-kid, so while this gives me the result I want, I am wondering if there is a more succinct way of writing this, either by:
Using awk to get the load average without the comma, or
Stripping the comma in a single line
You can do that in the same awk command:
uptime | awk -F 'load averages?: *' '{split($2, a, ",? "); print a[2]}'
1.32
The 5 min load is available in /proc/loadavg. You can simply use cut:
cut -d' ' -f2 /proc/loadavg
With awk you can issue:
awk '{print $2}' /proc/loadavg
If you are not working on Linux, the file /proc/loadavg will not be present. In this case I would suggest using sed, like this:
uptime | sed 's/.*, \(.*\),.*,.*/\1/'
uptime | awk -F'load average:' '{ print $2}' | awk -F, '{print $2}'
0.38
(My uptime output has 'load average:' singular)
The load average numbers are always the last 3 fields in the 'uptime' output so:
IFS=' ,' read -a uptime_fields <<<"$(uptime)"
LOAD_5M=${uptime_fields[@]: -2:1}
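A quick sanity check (the numbers are only illustrative); the last three fields are the 1-, 5- and 15-minute averages, so -2:1 picks the middle one:
echo "${uptime_fields[@]: -3:3}"   # e.g. 0.51 0.40 0.35
echo "$LOAD_5M"                    # e.g. 0.40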

Unix: Get the latest entry from the file

I have a file with names and times. For each name I want to keep only the entry with the latest time. How do I do it?
for example:
>cat user.txt
"a","03-May-13
"b","13-May-13
"a","13-Aug-13
"a","13-May-13
I am using command sort -u user.txt. It is giving the following output:
"a","11-May-13
"a","13-Aug-13
"a","13-May-13
"b","13-May-13
but I want the following output.
"a","13-Aug-13
"b","13-May-13
Can someone help?
Thanks.
Try this:
sort -t, -k2 user.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Explanation:
Sort the entries by the date field in ascending order and pipe the sorted result to awk, which simply uses the first field as a key, so only the last of the entries sharing a key is kept and finally output.
EDIT
Okay, so the entries can't simply be sorted lexicographically. The date needs to be converted to a timestamp so it can be compared numerically; use the following:
awk -F",\"" '{ cmd=" date --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
If you are using MacOS, use gdate instead:
awk -F",\"" '{ cmd=" gdate --date " $2 " +%s "; cmd | getline ts; close(cmd); print ts, $0, $2}' user.txt | sort -k1 | awk -F"[, ]" '{a[$2]=$3}END{for(e in a){print e, a[e]}}' OFS=","
I think you need to sort by year, month and day.
Can you try this:
awk -F"\"" '{print $2"-"$4}' data.txt | sort -t- -k4 -k3M -k2 | awk -F- '{kv[$1]=$2"-"$3"-"$4}END{for(k in kv){print k,kv[k]}}'
For me this is doing the job. I am sorting on the month and then applying the logic that @neevek used. So far I have been unable to find a case that breaks it, but I am not sure if this is a foolproof solution.
sort -t- -k2 -M user1.txt | awk -F, '{a[$1]=$2}END{for(e in a){print e, a[e]}}' OFS=","
Can someone tell me if this solution has any issues?
How about this?
grep `cut -d'"' -f4 user.txt | sort -t- -k 3 -k 2M -k 1n | tail -1` user.txt
Explaining: extract the date (the fourth field when cutting on double quotes), sort it by year, month and day, take the latest one with tail -1, and then grep for that date in the file.
edit: fixed to sort via month.
