Bash - Count number of occurrences in textfile and display in descending order - bash

I want to count how many times each word occurs in a text file and display the counts in descending order.
So far I have :
cat sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr
This mostly gives me the output I want, except that it includes special characters such as commas, full stops, exclamation marks and hyphens.
How can I modify the existing command so that it does not include the special characters mentioned above?

You can use tr -d with a string containing the characters you wish to delete.
Example:
$ echo "abc, def. ghi! boss-man" | tr -d ',.!'
abc def ghi boss-man
Or use a POSIX character class, keeping in mind that boss-man, for example, would become bossman. Quote the class so the shell cannot expand it as a glob:
$ echo "abc, def. ghi! boss-man" | tr -d '[:punct:]'
abc def ghi bossman
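Putting the tr -d step into the pipeline from the question could look like the sketch below. The function name word_freq is hypothetical, and the trailing awk is only there to strip the padding that uniq -c adds; hyphenated words are joined, as noted above.

```shell
# word_freq: hypothetical helper (not from the original thread).
# Reads text on stdin and prints "count word" lines, most frequent first.
word_freq() {
    tr -d '[:punct:]' |          # delete commas, full stops, !, hyphens, ...
    tr -s '[:space:]' '\n' |     # put one word on each line
    sed '/^$/d' |                # drop empty lines
    sort | uniq -c |             # count identical words
    sort -k1,1nr -k2,2 |         # highest count first, ties alphabetically
    awk '{print $1, $2}'         # strip the padding added by uniq -c
}
```

Usage would be something like word_freq < sample.txt.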
Side note: You can have a lot more control and speed by using awk for this:
$ echo "one two one! one. oneone
two two three two-one three" |
awk 'BEGIN{RS="[^[:alpha:]]"}
/[[:alpha:]]/ {seen[$1]++}
END{for (e in seen) print seen[e], e}' |
sort -k1,1nr -k2,2
4 one
4 two
2 three
1 oneone

How about first extracting words with grep:
grep -o "\w\+" sample.txt | sort | uniq -c | sort -nr

Related

remove whitespace from piped output

In a text file I have some tags written in the notation :foo. To get an overview of the tags in the file, I want to list all of them.
This is done via
grep -o -e ":[a-z]*\( \|$\)" file.txt | sort | uniq
Now I get duplicates because of the whitespace or newline character at the end.
:movie <-- only newline
:movie <-- whitespace and newline
:read
:read
I want to avoid the duplicates. But I could not figure out how. I tried with | tr -d '[:space:]', but this leads only to a concatenation of all pipe output...
Example of the file.txt
Avengers: Infinity War :movie
Yojimbo 1961 :movie nippon
Some test lines (there is a space after the first :space, you can see it if you highlight the data with your mouse):
$ cat file
with :space
with :space too
without :space
test: this
With grep, sort and uniq:
$ grep -o ":[a-z]\+" file | sort | uniq
:space
With awk (well, gawk and mawk at least):
$ awk 'BEGIN{RS="[" FS "|" RS "]+"}/:[a-z]/&&!a[$0]++' file
:space
Each word is its own record and we pick the first instance of every colon-starting word. RS="[" FS "|" RS "]+" could be written otherwise but it is in this form to emphasize any combination of FS and RS.
You can use Perl regexp and word matching:
grep -oP ':\w+' file.txt | sort | uniq
or, just match non-space characters:
grep -o ':[^ ]*' file.txt | sort | uniq
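As a rough sketch of the same idea, the match can be limited to the colon and the letters that follow it, so trailing whitespace never reaches sort. The function name list_tags is hypothetical, and grep -o is assumed to be available (it is in GNU and BSD grep, though it is not POSIX).

```shell
# list_tags: hypothetical helper (not from the original thread).
# Prints each distinct :tag found on stdin, one per line.
list_tags() {
    grep -oE ':[a-z]+' |   # print only the colon plus the letters after it
    sort -u                # sort and drop duplicates in one step
}
```

Because the pattern requires at least one letter after the colon, a title such as "Avengers:" is not reported as a tag. Usage would be list_tags < file.txt.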
Since you haven't provided a sample Input_file, and I don't have zsh at hand, I couldn't test this. Try the following and let me know if it helps.
awk '/:[a-z]*/{sub(/ +$/,"");} !a[$0]++' Input_file | sort
You can try with sed
sed 's/.*\(:[a-z]*\).*/\1/' file.txt | sort | uniq

One liner command to extract #of ID occurrences in a very long file

I have a really huge file (millions of lines) with the following format:
Timestamp, ID, GUID
Example:
2014-04-14 23:59:59,754 2294 123B24C6452231DC1770FE37E6F3D51168
2014-04-14 23:59:59,757 102254 B9E0CE6C9F67745326F9FD07C5B31B4E1D65
ID is a number which can be any from single digit and up to 6 digits.
GUID has a constant length (as above).
I would like to get #of occurrences for each ID in the file.
Output should look something like:
Count, ID
8 2294
15 102254
...
I am trying to get this with a single grep combined with uniq and sort, without much success.
I would appreciate any help.
If there are single spaces in between the fields (as in your example) rather than commas (as in your format), then you could use:
cut -d' ' -f3 hugefile | sort | uniq -c
Another alternative, if the separator might be several spaces:
awk '{print $3}' hugefile | sort | uniq -c
You could also do all the work inside the awk program (untested):
awk '{c[$3]++} END { for (n in c) print c[n], n }' hugefile
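A minimal sketch of the all-in-awk variant, with a final sort added because awk's for (n in c) loop visits keys in an unspecified order. The function name count_ids is hypothetical, and the ID is assumed to be the third whitespace-separated field, as in the example lines.

```shell
# count_ids: hypothetical helper (not from the original thread).
# Reads log lines on stdin and prints "count ID", most frequent ID first.
count_ids() {
    awk '{ c[$3]++ }                           # field 3 is assumed to be the ID
         END { for (n in c) print c[n], n }' |
    sort -k1,1nr -k2,2n                        # by count descending, then by ID
}
```

Only the awk array and the short list of distinct IDs are held in memory or sorted, which may matter for a file with millions of lines. Usage would be count_ids < hugefile.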
You can use this:
grep -Po '(?<= )[0-9]+ ' yourfile | sort | uniq -c

Connecting Wget and Sed Commands in One Script?

I use 3 commands (wget, sed, and a tr/sort pipeline) that all work on the command line to produce a list of the most common words. I run them one after another, saving the output of sed to a file for the tr/sort pipeline. Now I need to graduate to writing a script that combines these 3 commands. So: 1) wget downloads a file, which I feed to 2) sed -e 's/<[^>]*>//g' wget-file.txt, and the output of that goes to 3)
cat sed-output.txt | tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c |
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt
I'm aware of the problem/debate about using regex to remove HTML tags, but these 3 commands are working for me for the moment. So thanks in helping pull this together.
Using awk.
wget -O- http://down.load/file | awk '{
gsub(/<[^>]*>/, "") # strip HTML tags
$0 = tolower($0) # convert everything to lowercase
gsub(/[^a-z]+/, " ") # replace each run of non-letter characters with a single space
for (i = 1; i <= NF; i++) a[$i]++ # count each word in array a
}
END { for (i in a) print a[i], i | "sort -nr | head -100" }' # print the counts, sorted, top 100 only
This command should do the job:
wget -O- http://down.load/file | sed -e 's/<[^>]*>//g' | \
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | \
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt
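For a script file, one possible layout is to wrap the tag stripping and counting in a function and leave the download outside it, so the counting part can be tried on any HTML without network access. The function name top_words is hypothetical; its optional argument plays the role of ${1:-100} in the pipeline above.

```shell
#!/bin/sh
# top_words: hypothetical helper (not from the original thread).
# Reads HTML on stdin and prints the N most common words (default 100).
top_words() {
    sed -e 's/<[^>]*>//g' |        # strip HTML tags (crude, as the question admits)
    tr -cs "A-Za-z'" '\n' |        # turn every run of other characters into a newline
    tr A-Z a-z |                   # fold to lowercase
    sort | uniq -c |               # count identical words
    sort -k1,1nr -k2 |             # highest count first
    sed "${1:-100}q"               # stop after N lines
}
```

With the placeholder URL from the answer above, a call might be: wget -O- http://down.load/file | top_words 100 > words-list.txt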

Unix uniq command to CSV file

I have a text file (list.txt) containing single and multi-word English phrases. My goal is to do a word count for each word and write the results to a CSV file.
I have figured out the command to write the amount of unique instances of each word, sorted from largest to smallest. That command is:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less > output.txt
The problem is the way the new file (output.txt) is formatted. There are 3 leading spaces, followed by the number of occurrences, followed by a space, followed by the word. Then on to a next line. Example:
9784 the
6368 and
4211 for
2929 to
What would I need to do in order to get the results in a more desired format, such as CSV? For example, I'd like it to be:
9784,the
6368,and
4211,for
2929,to
Even better would be:
the,9784
and,6368
for,4211
to,2929
Is there a way to do this with a Unix command, or do I need to do some post-processing within a text editor or Excel?
Use awk as follows:
> cat input
9784 the
6368 and
4211 for
2929 to
> cat input | awk '{ print $2 "," $1}'
the,9784
and,6368
for,4211
to,2929
Your full pipeline will be:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2 "," $1}' > output.txt
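The same pipeline could be wrapped in a small function, sketched here under a few assumptions: the name word_counts_csv is hypothetical, ties are broken alphabetically to make the order predictable, and the NF == 2 guard skips the empty "word" that tr can produce when the input starts with a non-letter.

```shell
# word_counts_csv: hypothetical helper (not from the original thread).
# Reads text on stdin and prints word,count lines, most frequent first.
word_counts_csv() {
    tr 'A-Z' 'a-z' |                      # fold to lowercase
    tr -sc 'a-z' '\n' |                   # one word per line
    sort | uniq -c |                      # count identical words
    sort -k1,1nr -k2,2 |                  # highest count first, ties alphabetically
    awk 'NF == 2 { print $2 "," $1 }'     # emit word,count and skip empty words
}
```

Usage would be word_counts_csv < list.txt > output.txt.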
Use sed to strip the leading spaces and replace the first remaining space with a comma:
cat extra_set.txt | sort -i | uniq -c | sort -nr | sed 's/^ *//; s/ /,/'

Bash: creating a pipeline to list top 100 words

Ok, so I need to create a command that lists the 100 most frequent words in any given file, in a block of text.
What I have at the moment:
$ alias words='tr " " "\012" <hamlet.txt | sort -n | uniq -c | sort -r | head -n 10'
outputs
$ words
14 the
14 of
8 to
7 and
5 To
5 The
5 And
5 a
4 we
4 that
I need it to output in the following format:
the of to and To The And a we that
((On that note, how would I tell it to print the output in all caps?))
And I need to change it so that I can pipe 'words' to any file, so instead of having the file specified within the pipe, the initial input would name the file & the pipe would do the rest.
Okay, taking your points one by one, though not necessarily in order.
You can change words to use standard input just by removing the <hamlet.txt bit since tr will take its input from standard input by default. Then, if you want to process a specific file, use:
cat hamlet.txt | words
or:
words <hamlet.txt
You can remove the effects of capital letters by making the first part of the pipeline:
tr '[A-Z]' '[a-z]'
which will lower-case your input before doing anything else.
Lastly, take that entire pipeline (with the modifications suggested above) and pass it through one more command:
| awk '{printf "%s ", $2}END{print ""}'
This prints the second field of each line (the word) followed by a space, then prints an empty string with a terminating newline at the end.
For example, the following script words.sh will give you what you need:
tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf "%s ", $2}END{print ""}'
(on one line: I've split it for readability) as per the following transcript:
pax> echo One Two two Three three three Four four four four | ./words.sh
four three two
You can achieve the same end with the following alias:
alias words="tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf \"%s \", \$2}END{print \"\"}'"
(again, one line) but, when things get this complex, I prefer a script, if only to avoid interminable escape characters :-)
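A shell function is a third option that avoids the nested quoting of the alias. The sketch below is one way to write it, not the pipeline above verbatim: the name words is kept from the question, the optional argument N (default 100) is an addition, and sort -k1,1nr is swapped in so counts are compared numerically.

```shell
# words: sketch of a function-based variant (assumption: N is optional, default 100).
# Reads text on stdin and prints the N most frequent words on one line.
words() {
    tr '[:upper:]' '[:lower:]' |                # fold case so "The" and "the" match
    tr -s '[:space:]' '\n' |                    # one word per line
    sort | uniq -c |                            # count identical words
    sort -k1,1nr -k2,2 |                        # highest count first, ties alphabetically
    head -n "${1:-100}" |                       # keep the top N
    awk '{printf "%s ", $2} END {print ""}'     # join the words on a single line
}
```

Called as words 3 < hamlet.txt, it prints the three most frequent words followed by a trailing space. For all-caps output, swapping the two character classes in the first tr should be enough.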
