Bash: creating a pipeline to list top 100 words

OK, so I need to create a command that lists the 100 most frequent words in any given file (a block of text).
What I have at the moment:
$ alias words='tr " " "\012" <hamlet.txt | sort -n | uniq -c | sort -r | head -n 10'
which, when run, outputs:
$ words
14 the
14 of
8 to
7 and
5 To
5 The
5 And
5 a
4 we
4 that
I need it to output in the following format:
the of to and To The And a we that
((On that note, how would I tell it to print the output in all caps?))
And I need to change it so that I can pipe any file to 'words': instead of having the file specified within the pipeline, the initial input would name the file and the pipeline would do the rest.

Okay, taking your points one by one, though not necessarily in order.
You can change words to use standard input just by removing the <hamlet.txt bit since tr will take its input from standard input by default. Then, if you want to process a specific file, use:
cat hamlet.txt | words
or:
words <hamlet.txt
You can remove the effects of capital letters by making the first part of the pipeline:
tr '[A-Z]' '[a-z]'
which will lower-case your input before doing anything else.
Lastly, take that entire pipeline (with the suggested modifications above) and pass it through one more command:
| awk '{printf "%s ", $2}END{print ""}'
This prints the second field of each line (the word) followed by a space, then prints an empty string with a terminating newline at the end.
For example, the following script words.sh will give you what you need:
tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf "%s ", $2}END{print ""}'
(on one line: I've split it for readability) as per the following transcript:
pax> echo One Two two Three three three Four four four four | ./words.sh
four three two
You can achieve the same end with the following alias:
alias words="tr '[A-Z]' '[a-z]' | tr ' ' '\012' | sort -n | uniq -c | sort -r
| head -n 3 | awk '{printf \"%s \", \$2}END{print \"\"}'"
(again, one line) but, when things get this complex, I prefer a script, if only to avoid interminable escape characters :-)
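As for the parenthetical question about all caps: since the pipeline lower-cases everything up front, one more tr tacked onto the very end will upper-case the final output, e.g.:
pax> echo One Two two Three three three Four four four four | ./words.sh | tr '[a-z]' '[A-Z]'
FOUR THREE TWO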

Related

Bash: check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space-separated substring on its own line.
tr -d '[:punct:]' to remove punctuation.
sort and uniq to make a sorted file to use with comm, which is used with the -i flag to make it case-insensitive.
But somehow this doesn't work. I've looked around online and found similar questions; however, I wasn't able to figure out what I was doing wrong. Most answers to those questions work with two files which are already sorted and stripped of newlines, spaces, and punctuation, while my file_1 may contain any of those to begin with.
Desired output:
is
a
file
several
lines
of
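Side note on the attempt above: GNU comm has no -i option (BSD comm uses -f for case folding), which is one reason that pipeline fails. The comm idea does work if both inputs are lowercased and sorted first, though the output then comes out alphabetically sorted rather than in file order; a rough sketch:
comm -23 \
  <(tr -d '[:punct:]' < file_1.txt | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | sort -u) \
  <(tr '[:upper:]' '[:lower:]' < file_2.txt | sort -u)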
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in `cat file1 | tr -d '[:punct:]'`; do grep -wq $A file2 || echo $A; done
Flags used for grep: -q for quiet (we don't need the output), -w for whole-word match.
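If the comparison should also ignore case (as the sample output suggests, though that is my assumption rather than the answer's), add -i and quote the variable:
for A in $(tr -d '[:punct:]' < file1); do grep -wqi "$A" file2 || echo "$A"; done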
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline takes the original file, replaces spaces with newlines, converts to lowercase, then uses grep to filter out (-v) whole words (-w), case-insensitively (-i), using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2

Bash - Count number of occurrences in text file and display in descending order

I want to count the number of occurrences of each word in a text file and display them in descending order.
So far I have :
cat sample.txt | tr ' ' '\n' | sort | uniq -c | sort -nr
This mostly gives me satisfying output, except that it includes special characters like commas, full stops, exclamation marks and hyphens.
How can I modify the existing command so that it does not include the special characters mentioned above?
You can use tr with a composite string of the characters you wish to delete.
Example:
$ echo "abc, def. ghi! boss-man" | tr -d ',.!'
abc def ghi boss-man
Or, use a POSIX character class knowing that boss-man for example would become bossman:
$ echo "abc, def. ghi! boss-man" | tr -d [:punct:]
abc def ghi bossman
Side note: You can have a lot more control and speed by using awk for this:
$ echo "one two one! one. oneone
two two three two-one three" |
awk 'BEGIN{RS="[^[:alpha:]]"}
/[[:alpha:]]/ {seen[$1]++}
END{for (e in seen) print seen[e], e}' |
sort -k1,1nr -k2,2
4 one
4 two
2 three
1 oneone
How about first extracting words with grep:
grep -o "\w\+" sample.txt | sort | uniq -c | sort -nr

How to remove all but the last 3 parts of FQDN?

I have a list of IP lookups and I wish to remove all but the last 3 parts, so:
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
would become
163data.com.cn
I have spent hours searching for clues, including parameter substitution, but the closest I got was:
$ string="98.254.237.114.broad.lyg.js.dynamic.163data.com.cn"
$ string1=${string%.*.*.*}
$ echo $string1
Which gives me the inverted answer of:
98.254.237.114.broad.lyg.js.dynamic
which is everything but the last 3 parts.
A script to do a list would be better than just the static example I have here.
Using CentOS 6, I don't mind if it by using sed, cut, awk, whatever.
Any help appreciated.
Thanks. Now that I have working answers, may I ask a follow-up: how would I then process the resulting list so that, if the last part (after the last '.') is 3 characters, e.g. .com, .net etc., only the last 2 parts are kept?
If this is against protocol, please advise how to post a follow-up question.
If parameter expansion inside another parameter expansion is supported, you can use this:
$ s='98.254.237.114.broad.lyg.js.dynamic.163data.com.cn'
$ # removing last three fields
$ echo "${s%.*.*.*}"
98.254.237.114.broad.lyg.js.dynamic
$ # pass output of ${s%.*.*.*} plus the extra . to be removed
$ echo "${s#${s%.*.*.*}.}"
163data.com.cn
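The expansions above act on a single variable; to run them over a whole file (one name per line, here assumed to be called ip.txt as in the example further down), a plain while-read loop works:
while IFS= read -r s; do
    echo "${s#${s%.*.*.*}.}"
done < ip.txt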
You can also reverse the line, take the required fields, and then reverse again; this makes it easier to change the field numbers.
$ echo "$s" | rev | cut -d. -f1-3 | rev
163data.com.cn
$ echo "$s" | rev | cut -d. -f1-4 | rev
dynamic.163data.com.cn
$ # and easy to use with file input
$ cat ip.txt
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
foo.bar.123.baz.xyz
a.b.c.d.e.f
$ rev ip.txt | cut -d. -f1-3 | rev
163data.com.cn
123.baz.xyz
d.e.f
echo $string | awk -F. '{ if (NF == 2) { print $0 } else { print $(NF-2)"."$(NF-1)"."$NF } }'
NF is the total number of fields separated by ".", so we want the last field ($NF), the last but one ($(NF-1)), and the last but two ($(NF-2)).
$ echo $string | awk -F'.' '{printf "%s.%s.%s\n",$(NF-2),$(NF-1),$NF}'
163data.com.cn
Brief explanation:
Set the field separator to '.'.
Print only the last 3 fields using $(NF-2), $(NF-1), and $NF.
And there's also another option you may try,
$ echo $string | awk -v FPAT='[^.]+.[^.]+.[^.]+$' '{print $NF}'
163data.com.cn
It sounds like this is what you need:
awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
e.g.
$ echo "$string" | awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
163data.com.cn
but with just 1 sample input/output it's just a guess.
Regarding your follow-up question, this might be what you're asking for:
$ echo "$string" | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
163data.com.cn
$ echo 'www.google.com' | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
google.com
A version which uses only the shell and expr:
echo $(expr "$string" : '.*\.\(.*\..*\..*\)')
To use it with a file you can iterate with xargs:
File:
head list.dat
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
98.254.34.56.broad.kkk.76onepi.co.cn
98.254.237.114.polst.a65dal.com.cn
iterating the whole file:
cat list.dat | xargs -I^ -L1 expr "^" : '.*\.\(.*\..*\..*\)'
Note: it won't be very efficient at large scale, so you will need to decide for yourself whether it is good enough.
Regexp explanation:
.* \. \( .* \. .* \. .* \)
The leading .* plus the \. that follows it match the rest of the name and the final dot, which we are not interested in (everything outside the brackets); each \. inside the group marks one of the dots separating the specific number of words we want; and the brackets \( \) enclose the part that gets extracted: the last three dot-separated parts.
details:
http://tldp.org/LDP/abs/html/string-manipulation.html -> Substring Extraction

Bash: Formatting multi-line function results alongside each other

I have three functions that digest an access.log file on my server.
hitsbyip() {
    cat $ACCESSLOG | awk '{ print $1 }' | uniq -c | sort -nk1 | uniq
}
hitsbyhour() {
    cat $ACCESSLOG | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c
}
hitsbymin() {
    hr=$1
    grep "2015:${hr}" $ACCESSLOG | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":"$3}' | sort -nk1 -nk2 | uniq -c
}
They all work fine when used on their own. All three output 2 small columns of data.
Now I am looking to create another function called report which will simply use printf and its formatting possibilities to print 3 columns of data, with headers, each column being the result of one of my three functions. Something like this:
report() {
    printf "%-30b %-30b %-30b\n" `hitsbyip` `hitsbyhour` `hitsbymin 10`
}
The thing is that the format is not what I want; it prints out the columns horizontally instead of side by side.
Any help would be greatly appreciated.
Once you use paste to combine the output of the three commands into a single stream, then you can operate line-by-line to format those outputs.
while IFS=$'\t' read -r by_ip by_hour by_min; do
    printf '%-30b %-30b %-30b\n' "$by_ip" "$by_hour" "$by_min"
done < <(paste <(hitsbyip) <(hitsbyhour) <(hitsbymin 10))
Elements to note:
<() syntax is process substitution; it generates a filename (typically of the form /dev/fd/## on platforms with such support) which will, when read, yield the output of the command given.
paste takes a series of filenames and puts the output of each alongside the others.
Setting IFS=$'\t' while reading ensures that we read content as tab-separated values (the format paste creates). See BashFAQ #1 for details on using read.
Putting quotes on the arguments to printf ensures that we pass each value assembled by read as a single value to printf, rather than letting them be subject to string-splitting and glob-expansion as unquoted values.
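If it helps to see what the paste stage produces on its own, here is a toy illustration with made-up three-line outputs standing in for the three functions:
$ paste <(printf '%s\n' 1 2 3) <(printf '%s\n' a b c) <(printf '%s\n' x y z)
1	a	x
2	b	y
3	c	z
The columns are tab-separated, which is why the read loop sets IFS=$'\t'.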

How to sort groups of lines?

In the following example, there are 3 elements that have to be sorted:
"[aaa]" and the 4 lines (always 4) below it form a single unit.
"[kkk]" and the 4 lines (always 4) below it form a single unit.
"[zzz]" and the 4 lines (always 4) below it form a single unit.
Only groups of lines following this pattern should be sorted; anything before "[aaa]" and after the 4th line of "[zzz]" must be left intact.
from:
This sentence and everything above it should not be sorted.
[zzz]
some
random
text
here
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
And neither should this one and everything below it.
to:
This sentence and everything above it should not be sorted.
[aaa]
bla
blo
blu
bli
[kkk]
1
44
2
88
[zzz]
some
random
text
here
And neither should this one and everything below it.
Maybe not the fastest :) [1] but it will do what you want, I believe:
for line in $(grep -n '^\[.*\]$' sections.txt |
              sort -k2 -t: |
              cut -f1 -d:); do
    tail -n +$line sections.txt | head -n 5
done
Here's a better one:
for pos in $(grep -b '^\[.*\]$' sections.txt |
             sort -k2 -t: |
             cut -f1 -d:); do
    tail -c +$((pos+1)) sections.txt | head -n 5
done
[1] The first one is something like O(N^2) in the number of lines in the file, since it has to read all the way to the section for each section. The second one, which can seek immediately to the right character position, should be closer to O(N log N).
[2] This takes you at your word that there are always exactly five lines in each section (header plus four following), hence head -n 5. However, it would be really easy to replace that with something which read up to but not including the next line starting with a '[', in case that ever turns out to be necessary.
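A rough sketch of that variant, should section lengths ever vary: replace head -n 5 with a small awk that prints the header line and then everything up to, but not including, the next line starting with '[':
tail -c +$((pos+1)) sections.txt | awk 'NR==1{print; next} /^\[/{exit} {print}'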
Preserving start and end requires a bit more work:
# Find all the sections
mapfile indices < <(grep -b '^\[.*\]$' sections.txt)
# Output the prefix
head -c+${indices[0]%%:*} sections.txt
# Output sections, as above
for pos in $(printf %s "${indices[@]}" |
             sort -k2 -t: |
             cut -f1 -d:); do
    tail -c +$((pos+1)) sections.txt | head -n 5
done
# Output the suffix
tail -c+$((1+${indices[-1]%%:*})) sections.txt | tail -n+6
You might want to make a function out of that, or a script file, changing sections.txt to $1 throughout.
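For what it's worth, wrapped up as a function it might look like this (the same commands as above, just parameterised on $1; a sketch, not tested here):
sort_sections() {
    local file=$1
    # find all the section headers with their byte offsets
    mapfile indices < <(grep -b '^\[.*\]$' "$file")
    # output the prefix before the first section
    head -c "${indices[0]%%:*}" "$file"
    # output the sections in sorted order
    for pos in $(printf %s "${indices[@]}" | sort -k2 -t: | cut -f1 -d:); do
        tail -c +$((pos+1)) "$file" | head -n 5
    done
    # output the suffix after the last section
    tail -c +$((1 + ${indices[-1]%%:*})) "$file" | tail -n +6
}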
Assuming that other lines do not contain a [ in them:
header=`grep -n 'This sentence and everything above it should not be sorted.' sortme.txt | cut -d: -f1`
footer=`grep -n 'And neither should this one and everything below it.' sortme.txt | cut -d: -f1`
head -n $header sortme.txt #print header
head -n $(( footer - 1 )) sortme.txt | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
#cat sortme.txt | head -n $(( footer - 1 )) | tail -n +$(( header + 1 )) | tr '\n[' '[\n' | sort | tr '\n[' '[\n' | grep -v '^\[$' #sort lines between header & footer
tail -n +$footer sortme.txt #print footer
This serves the purpose.
Note that the main sorting work is done by the 4th command only: swapping '\n' and '[' folds each bracketed section onto a single line (the only '[' characters are at the start of the section headers), sort then orders those whole sections, and swapping back restores the original line layout; the final grep drops the stray '[' line the transformation leaves behind. The other lines are there to preserve the header and footer.
I am also assuming that there are no other lines between the header and the first "[section]".
This might work for you (GNU sed & sort):
sed -i.bak '/^\[/!b;N;N;N;N;s/\n/UnIqUeStRiNg/g;w sort_file' file
sort -o sort_file sort_file
sed -i -e '/^\[/!b;R sort_file' -e 'd' file
sed -i 's/UnIqUeStRiNg/\n/g' file
Sorted file will be in file and original file in file.bak.
This will present all lines beginning with [ and the 4 lines following each of them, in sorted order.
UnIqUeStRiNg can be any unique string not containing a newline, e.g. \x00
