How can I count and display only the words that are repeated more than once using unix commands? - bash

I am trying to count and display only the words that are repeated more than once in a file. The basic idea is:
You are given a file with names and characters like commas, colons, slashes, etc.
Use the cut command to display only the first names in the file (other commands are also allowed).
Count and then display only the names repeated more than once.
I got to the point of counting and displaying all the names. However, I haven't found a way to display and to count only those names repeated more than once.
Here is a section of the file:
user1:x:80:200:Mia,Spurs:/home/user1:/bin/bash
user2:x:80:200:Martha,Dalton:/home/user2:/bin/bash
user3:x:80:200:Lucy,Carlson:/home/user3:/bin/bash
user4:x:80:200:Carl,Bingo:/home/user4:/bin/bash
Here is what I have been able to do:
Daniel@Daniel-MacBook-Pro Files % cut -d ":" -f 5-5 file1 | cut -d "," -f 1-1 | sort -n | uniq -c
1 Mia
3 Martha
1 Lucy
1 Carl
1 Jessi
1 Joke
1 Jim
2 Race
1 Sem
1 Shirly
1 Susan
1 Tim

You can filter out the rows with count 1 with grep.
cut -d ":" -f 5 file1 | cut -d "," -f 1 | sort | uniq -c | grep -v '^ *1 '
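An alternative to the grep filter, as a minimal sketch: compare the count column numerically with awk, assuming awk is among the allowed commands. The function name repeated_first_names and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print only first names that occur more than once.
# Reads passwd-style lines from the given files, or from stdin if none.
repeated_first_names() {
  cut -d ':' -f 5 "$@" |   # field 5 holds "First,Last"
    cut -d ',' -f 1 |      # keep the first name only
    sort | uniq -c |       # count identical adjacent names
    awk '$1 > 1'           # keep rows whose count exceeds 1
}

# Made-up sample data in the same layout as the question's file
printf '%s\n' \
  'user1:x:80:200:Mia,Spurs:/home/user1:/bin/bash' \
  'user2:x:80:200:Martha,Dalton:/home/user2:/bin/bash' \
  'user5:x:80:200:Martha,Stone:/home/user5:/bin/bash' |
  repeated_first_names
```

If the counts are not needed, sort | uniq -d lists each repeated name once.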

Related

awk length is counting +1

I'm trying, as an exercise, to output how many words exist in the dictionary for each possible length.
Here is my code:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
Here is the output:
...
1799 5
427 4
81 3
1 2
My problem is that awk length counts one more letter for each word in my file. The right output should have been:
1799 4
427 3
81 2
1 1
I checked my file and it does not contain any space after the word:
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
...
So I guess awk is counting the newline as a character, despite the fact it is not supposed to.
Is there any solution? Or something I'm doing wrong?
I'm going to venture a guess: your awk expects Unix-style newlines (LF), but your dico.txt has Windows-style ones (CR+LF). The extra CR would easily give you the +1 on all lengths.
I took your four words:
$ cat dico.txt
ABAISSA
ABAISSABLE
ABAISSABLES
ABAISSAI
And ran your line:
$ awk '{print length}' dico.txt | sort -nr | uniq -c
1 11
1 10
1 8
1 7
So far so good. Now the same, but with dico.txt converted to Windows newlines:
$ cat dico.txt | todos > dico_win.txt
$ awk '{print length}' dico_win.txt | sort -nr | uniq -c
1 12
1 11
1 9
1 8
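If that guess is right, removing the carriage returns before measuring should bring the lengths back in line. A minimal sketch, assuming the only stray characters are CRs; the function name lengths_without_cr is made up for illustration:

```shell
# Hypothetical helper: print each line's length after dropping any CR
# characters, so CR+LF and LF files give the same result.
lengths_without_cr() {
  tr -d '\r' | awk '{ print length }'
}

# Two of the question's words with Windows-style line endings
printf 'ABAISSA\r\nABAISSAI\r\n' | lengths_without_cr   # prints 7 and 8
```

For the original pipeline that would be lengths_without_cr < dico.txt | sort -nr | uniq -c.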

Sort output in bash script by number of occurrences

So I have some text output with HTTP status codes in one column and an IP address in the other. I want to sort it by number of occurrences, so that
1 2 1 3 4 5 4 4
Looks like
4 4 4 1 1 2 3 5
This applies to the second column (the status codes); the IP addresses don't need to be in any particular order.
Since 4 is the most common value it should come first, then 1, and so forth.
However, all I can find is how to use uniq to count the occurrences, which removes the duplicates and prefixes a count to each row.
As far as I can tell, the regular sort command does not support this either.
Any help would be appreciated.
You can still use sort | uniq -c, then expand each count back out by printing the value that many times in a loop:
tr ' ' '\n' < file \
  | sort | uniq -c | sort -k1,1nr -k2n \
  | while read -r times status; do
      for i in $(seq 1 "$times"); do
        printf '%s ' "$status"
      done
    done
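The same idea as a self-contained sketch that reads from stdin and uses a plain counter instead of seq (which is not available everywhere). The function name sort_by_frequency is made up for illustration, and the input is assumed to be single-space separated as in the question:

```shell
# Hypothetical helper: reorder space-separated values so the most frequent
# come first; ties are broken by ascending numeric value.
sort_by_frequency() {
  tr ' ' '\n' | sort | uniq -c | sort -k1,1nr -k2,2n |
    while read -r times status; do
      i=0
      while [ "$i" -lt "$times" ]; do   # repeat the value "times" times
        printf '%s ' "$status"
        i=$((i + 1))
      done
    done
  echo
}

echo '1 2 1 3 4 5 4 4' | sort_by_frequency   # prints: 4 4 4 1 1 2 3 5
```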

Minimal two column numeric input data for `sort` example, with distinct permutations

What's the least number of rows of two-column numeric input needed to produce four unique sort outputs for the following four options:
1. -sn -k1
2. -sn -k2
3. -sn -k1 -k2
4. -sn -k2 -k1 ?
Here's a 6 row example, (with 4 unique outputs):
6 5
3 7
6 3
2 7
4 4
5 2
As a convenience, a function to count those four outputs given 2 columns of numbers, (requires the moreutils pee command), which prints the number of unique outputs:
# Usage: foo c1_1 c2_1 c1_2 c2_2 ...
foo() { echo "$@" | tr -s '[:space:]' '\n' | paste - - | \
  pee "sort -sn -k1 | md5sum" \
      "sort -sn -k2 | md5sum" \
      "sort -sn -k1 -k2 | md5sum" \
      "sort -sn -k2 -k1 | md5sum" | \
  sort -u | wc -l ; }
So to count the unique permutations of this input:
8 5
3 1
8 3
Run this:
foo 8 5 3 1 8 3
Output:
2
(Only two unique outputs. Not enough...)
Note: This question was inspired by the obscurity of the current version of the sort manual, specifically COLUMNS=65 man sort | grep -A 17 KEYDEF | sed 3,18d. The info sort page's treatment of KEYDEFs is much better.
KEYDEFs are more useful than they might first seem. The -u or --unique switch works nicely with the KEYDEFs, and in effect allows sort to delete unwanted redundant lines, and therefore can furnish a more concise substitute for certain sed or awk scripts and similar pipelines.
I can do it in 3 by varying the whitespace:
1 1
2 1
1 2
Your foo function doesn't produce this kind of output, but since it was only a "convenience" and not a part of the question proper, I declare this answer correct and minimal!
Sneakier version:
2 1
11 1
2 2
(The last line contains a tab; the others don't.)
With the -s option, I can't exploit non-numeric comparisons, but then I can exploit the stability of the sort:
1 2
2 1
1 1
The 1 1 line goes above both of the others if both fields are compared numerically, regardless of which comparison is done first. The ordering of the two comparisons determines the ordering of the other two lines.
On the other hand, if one of the fields isn't used for comparison, the 1 1 line stays below one of the other lines (and which one that is depends on which field is used for comparison).
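The stability argument can be checked without moreutils. A minimal sketch, assuming a sort that supports -s (GNU and BSD sort do); the function name count_sort_outputs is made up for illustration and is not the question's foo:

```shell
# Hypothetical helper: read two-column rows on stdin and print how many
# distinct outputs the four sort invocations from the question produce.
count_sort_outputs() (
  input=$(cat)
  for keys in "-k1" "-k2" "-k1 -k2" "-k2 -k1"; do
    # $keys is deliberately unquoted so "-k1 -k2" splits into two options
    printf '%s\n' "$input" | sort -sn $keys | paste -sd'|' -
  done | sort -u | wc -l | tr -d ' '
)

printf '1 2\n2 1\n1 1\n' | count_sort_outputs   # prints 4
printf '8 5\n3 1\n8 3\n' | count_sort_outputs   # prints 2
```

Each sorted result is joined onto one line with paste so that sort -u can compare whole outputs rather than individual rows.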

Finding all punctuation in a text file & print count

I have come close to counting all occurrences of punctuation; however, punctuation characters that are right next to each other get counted as one.
Like so:
cat filename.txt |
tr -sc '[:punct:]' '\n' |
sort |
uniq -c |
sort -bnr
Which prints something like this:
15 ,
9 !
5 .
2 ;
2 !"
2 '
1 -
1 --
1 :
1 ?
It is clearly only counting punctuation, but how would I separate those that are right next to each other?
This:
tr -sc '[:punct:]' '\n'
Basically, what you do here is replace every non-punctuation character with \n (squeezing the repeats). So when there is no such character between two punctuation characters, they end up next to each other on the same line.
You want something like this:
cat filename.txt | tr -cd '[:punct:]' | fold -w 1 | sort | uniq -c | sort -bnr
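Another way to get one punctuation character per line, as a minimal sketch: grep -o prints every match on its own line. This assumes a grep that supports -o (GNU and BSD grep do; it is not required by POSIX). The function name count_punct is made up for illustration:

```shell
# Hypothetical helper: count each punctuation character separately.
count_punct() {
  grep -o '[[:punct:]]' |   # one punctuation character per output line
    sort | uniq -c | sort -rn
}

printf 'Wait -- what?!\n' | count_punct
```

Here the two dashes are counted as two occurrences of - rather than one occurrence of --.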

bash - extracting lines that contain only 3 columns

I have a file that includes the following lines:
2 | blah | blah
1 | blah | blah
3 | blah
2 | blah | blah
1
1 | high | five
3 | five
I want to extract only the lines that have 3 columns (3 fields, 2 separators...)
I want to pipe the result into the following commands:
| sort -nbsk1 | cut -d "|" -f1 | uniq -d
So in the end I will get only:
2
1
Any suggestions?
It's part of a homework assignment; we are not allowed to use awk, sed, and some other commands (grep, tr, and the commands written above can be used).
Thanks
Since you said grep is allowed:
grep -E '^([^|]*\|){2}[^|]*$' file
grep '.*|.*|.*' will select lines with at least three fields and two separators.
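Putting the exact-match filter in front of the question's pipeline, as a minimal sketch on the sample lines. The function name three_fields is made up for illustration:

```shell
# Hypothetical helper: keep lines with exactly two '|' separators.
three_fields() {
  grep -E '^([^|]*\|){2}[^|]*$' "$@"
}

# The sample lines from the question
printf '%s\n' \
  '2 | blah | blah' \
  '1 | blah | blah' \
  '3 | blah' \
  '2 | blah | blah' \
  '1' \
  '1 | high | five' \
  '3 | five' |
  three_fields | sort -nbsk1 | cut -d '|' -f1 | uniq -d
```

On this input the pipeline prints 1 and 2, each followed by the space that precedes the first separator.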
