Unix uniq command to CSV file - bash

I have a text file (list.txt) containing single and multi-word English phrases. My goal is to do a word count for each word and write the results to a CSV file.
I have figured out the command that writes the number of occurrences of each unique word, sorted from largest to smallest. That command is:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | less > output.txt
The problem is the way the new file (output.txt) is formatted. There are 3 leading spaces, followed by the number of occurrences, followed by a space, followed by the word. Then on to the next line. Example:
9784 the
6368 and
4211 for
2929 to
What would I need to do in order to get the results in a more desired format, such as CSV? For example, I'd like it to be:
9784,the
6368,and
4211,for
2929,to
Even better would be:
the,9784
and,6368
for,4211
to,2929
Is there a way to do this with a Unix command, or do I need to do some post-processing within a text editor or Excel?

Use awk as follows:
> cat input
9784 the
6368 and
4211 for
2929 to
> cat input | awk '{ print $2 "," $1}'
the,9784
and,6368
for,4211
to,2929
Your full pipeline will be:
$ tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -n -r | awk '{ print $2 "," $1}' > output.txt
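If you prefer, the counting and the CSV output can also be done in a single awk pass instead of the uniq/awk pair; a minimal sketch, assuming the same list.txt and the word-first layout from the question:
tr 'A-Z' 'a-z' < list.txt | tr -sc 'a-z' '\n' |
  awk 'NF { count[$1]++ } END { for (w in count) print w "," count[w] }' |
  sort -t, -k2,2nr > output.txt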

Use sed to strip the leading spaces and replace the remaining space with a comma:
sort extra_set.txt | uniq -c | sort -nr | sed 's/^ *//; s/ /,/'
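If you want the word first (the "even better" layout from the question above), the same idea works with a capture group that swaps the two fields; a sketch, assuming a sed that supports -E:
tr 'A-Z' 'a-z' < list.txt | tr -sc 'A-Za-z' '\n' | sort | uniq -c | sort -nr |
  sed -E 's/^ *([0-9]+) (.*)/\2,\1/' > output.txt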

Related

bash sort / uniq -c: how to use tab instead of space as delimiter in output?

I have a file strings.txt listing strings, which I am processing like this:
sort strings.txt | uniq -c | sort -n > uniq.counts
So the resulting file uniq.counts will list the unique strings sorted in ascending order by their counts, something like this:
1 some string with spaces
5 some-other,string
25 most;frequent:string
Note that strings in strings.txt may contain spaces, commas, semicolons and other separators, except for the tab. How can I get uniq.counts to be in this format:
1<tab>some string with spaces
5<tab>some-other,string
25<tab>most;frequent:string
You can do:
sort strings.txt | uniq -c | sort -n | sed -E 's/^ *//; s/ /\t/' > uniq.counts
sed first removes all leading spaces at the beginning of the line (before the counts) and then replaces the space after the count with a tab character.
You can simply pipe the output of the sort, etc. to sed before writing to uniq.counts, e.g. add:
| sed -e 's/^ *\([0-9][0-9]*\) \(.*\)$/\1\t\2/' > uniq.counts
The full expression would be:
$ sort strings.txt | uniq -c | sort -n | \
sed -e 's/^ *\([0-9][0-9]*\) \(.*\)$/\1\t\2/' > uniq.counts
(line continuation included for clarity)
With GNU sed:
sort strings.txt | uniq -c | sort -n | sed -r 's/^ *([0-9]+) /\1\t/' > uniq.counts
Output to uniq.counts:
1<tab>some string with spaces
5<tab>some-other,string
25<tab>most;frequent:string
If you want to edit your file "in place" use sed's option -i.
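For example, to convert an existing uniq.counts in place (GNU sed syntax; BSD sed needs -i '' instead):
sed -i -r 's/^ *([0-9]+) /\1\t/' uniq.counts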

Find unique words

Suppose there is one file.txt containing the following text:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that extracts the unique words that appear before the first /.
So as output, I want ABC and EFG written to one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
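For example, assuming you want the results in a file called unique_words.txt (the name is only illustrative):
cut -d '/' -f 1 file.txt | sort -u | tee unique_words.txt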
Try this:
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
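As a purely illustrative sketch of that, the following writes each line of file.txt into a separate output file named after its prefix (ABC.txt, EFG.txt, ...), with the file name built inside awk:
awk -F'/' '{ print $0 > ($1 ".txt") }' file.txt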
Another easy way:
cut -d"/" -f1 file.txt | uniq > out.txt
(Note that uniq only removes adjacent duplicates, so this works here because the repeated prefixes are on consecutive lines; for arbitrary input, pipe through sort first.)
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
The cut command grabs everything up to the first slash /; sort -u then sorts those strings, removes any duplicates, and the result is written to newfile.txt.

How do I pipe commands inside for loop in bash

I am writing a bash script to iterate through file lines with given value.
The command I am using to list the possible values is:
cat file.csv | cut -d';' -f2 | sort | uniq | head
When I use it in for loop like this it stops working:
for i in $( cat file.csv | cut -d';' -f2 | sort | uniq | head )
do
# do something else with these lines
done
How can I use piped commands in for loop?
You can use this awk command to get the sum of the 3rd column for each unique value of the 2nd column:
awk -F ';' '{sums[$2]+=$3} END{for (i in sums) print i ":", sums[i]}' file.csv
Input data:
asd;foo;0
asd;foo;2
asd;bar;1
asd;foo;4
Output:
foo: 6
bar: 1
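If you do want to keep the loop structure from the question, a while read loop avoids the word-splitting problems of iterating over $( ... ) with for; a minimal sketch, assuming the same file.csv:
while IFS= read -r value; do
    # do something else with each unique value from column 2
    echo "processing: $value"
done < <(cut -d';' -f2 file.csv | sort | uniq | head)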

Connecting Wget and Sed Commands in One Script?

I use 3 commands (wget, sed, and a tr/sort pipeline) that all work on the command line to produce a most-common-words list. I run the commands sequentially, saving the output from sed to use in the tr/sort command. Now I need to graduate to writing a script that combines these 3 commands. So, 1) wget downloads a file, which I feed into 2) sed -e 's/<[^>]*>//g' wget-file.txt, and that output > goes to 3)
cat sed-output.txt | tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c |
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt
I'm aware of the problem/debate about using regex to remove HTML tags, but these 3 commands are working for me for the moment. So thanks in helping pull this together.
Using awk:
wget -O- http://down.load/file | awk '{
  gsub(/<[^>]*>/,"")             # strip HTML tags
  $0=tolower($0)                 # convert everything to lowercase
  gsub(/[^a-z]+/," ")            # replace runs of non-letter characters with a space
  for (i=1;i<=NF;i++) a[$i]++    # count each word in array a
} END {
  for (i in a) print a[i],i | "sort -nr|head -100"   # print the counts, sorted, top 100 only
}'
This command should do the job:
wget -O- http://down.load/file | sed -e 's/<[^>]*>//g' | \
tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | \
sort -k1,1nr -k2 | sed ${1:-100}q > words-list.txt
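To turn that into a reusable script, you could save the pipeline in a file and pass the URL and an optional word limit as arguments; a sketch, assuming the script is saved as wordlist.sh (the name and argument order are just an example):
#!/bin/bash
# usage: ./wordlist.sh URL [count]   (count defaults to 100)
url=$1
count=${2:-100}
wget -O- "$url" | sed -e 's/<[^>]*>//g' | \
  tr -cs A-Za-z\' '\n' | tr A-Z a-z | sort | uniq -c | \
  sort -k1,1nr -k2 | sed "${count}q" > words-list.txt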

How to sort in bash?

I have several lines in file, just like below:
/adbc/eee/ddd/baa/
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/fff/ddd/c/
/adbc/ccc/ddd/bf/
/adbc/ccc/ddd/bc/
The sort must first extract the string before the last / (the final path component), that is:
baa
avfff
b
c
bf
bc
and then sort by the first character, then by the length of the string, and then alphabetically.
The expected result is
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/ccc/ddd/bc/
/adbc/ccc/ddd/bf/
/adbc/eee/ddd/baa/
/adbc/fff/ddd/c/
You could use awk as a pre-processing step to add 3 columns, all based on the field of interest, feed the result to sort, and then use cut to discard the extra fields:
awk -F'/' -v OFS="/" '{x=substr($(NF-1), 1, 1);
print(x, length($(NF-1)), $(NF-1), $0)}' file.txt |
sort -k1,1 -k2,2n -k3,3 -t'/' |
cut -f4- -d'/'
/adbc/fff/ddd/ccc/avfff/
/adbc/ccc/ddd/b/
/adbc/ccc/ddd/bc/
/adbc/ccc/ddd/bf/
/adbc/eee/ddd/baa/
/adbc/fff/ddd/c/
cat sortthisfile | while read -r line
do
field=$( echo "$line" | sed -e 's:/$::' -e 's:.*/::' )
firstchar=${field:0:1}
fieldlen=${#field}
echo "${firstchar},${fieldlen},${field},${line}"
done | sort -k1,1 -k2,2n -k3,3 -t, | sed 's:^[^,]*,[^,]*,[^,]*,::'
Obviously, sortthisfile is the name of your file.
