How to (optimally) pick a single normalized random word from a file with bash / sed / shuf? - bash

I'm looking to remove any non-alphabetic (English) characters and make the output lower-case from /usr/share/dict/words. Here's what I have so far:
sed "$(shuf -i "1-$(cat /usr/share/dict/words | wc -l)" -n 1)q;d" /usr/share/dict/words | tr '[:upper:]' '[:lower:]' | sed 's/[^-a-z]//g'
This works fine, but is it possible to do it all in one sed command?
EDIT: The American word file looks like this:
A
A's
AMD
AMD's
AOL
AOL's
AWS
AWS's
Aachen
Aachen's
I'm looking to make this lower-case and remove any non-alphabetic characters (as mentioned in my original question). The solution I have works fine but I'm hoping to reduce the number of commands (maybe just sed?). Output of the above would then be:
a
as
amd
amds
aol
aols
aws
awss
aachen
aachens

You don't need sed and wc -- shuf can shuffle the lines of a file.
tr can remove the non-alphabetic characters, so again you don't need sed:
shuf -n1 /usr/share/dict/words | tr -dc '[:alpha:]' | tr '[:upper:]' '[:lower:]'
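If you need this more than once, you might wrap the pipeline in a small function (a sketch; the name random_word is just a choice here). Note that tr -dc also deletes the line's trailing newline, so the sketch adds one back:
random_word() {
  # pick one random line, strip non-alphabetic characters, lowercase it
  shuf -n1 /usr/share/dict/words |
    tr -dc '[:alpha:]' |
    tr '[:upper:]' '[:lower:]'
  echo    # restore the newline that tr -dc removed
}
random_word   # e.g. prints "aachens"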

This single awk command should do the job:
awk '{gsub(/[^[:alpha:]]+/, ""); print tolower($0)}' file
a
as
amd
amds
aol
aols
aws
awss
aachen
aachens
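If the real goal is a single random, normalized word from one command, the random pick can be folded into awk itself with reservoir sampling (a sketch; srand() seeds from the clock, so two runs within the same second return the same word):
awk 'BEGIN{srand()}
     rand()*NR < 1 {w=$0}
     END{gsub(/[^[:alpha:]]+/, "", w); print tolower(w)}' /usr/share/dict/words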

This might work for you (GNU sed and shuf):
shuf -n1 /usr/share/dict/words | sed 's/[^[:alpha:]-]//g;s/.*/\L&/'
Choose a random line, remove any non-alpha (except hyphen) characters and lowercase the result.
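And if the aim was literally one sed invocation, the random line number can be computed up front and the normalization attached to that address (a sketch; GNU sed for \L):
n=$(wc -l < /usr/share/dict/words)
sed -n "$(shuf -i "1-$n" -n1){s/[^[:alpha:]-]//g; s/.*/\L&/p; q}" /usr/share/dict/words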

Related

Counting number of different words in a txt file in Bash

Well, I don't know much about bash programming, I'm new to it, so I'm struggling to write code that iterates over all the lines in a txt file and counts how many different words there are.
Example: If a txt file has "Nory was a Catholic because her mother was a Catholic"
So the result must be 7
$ grep -o '[^[:space:]]*' file | sort -u | wc -l
7
Sure. I assume you are ok with defining "words" as things that are separated by space? In which case, try something like this:
cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u | wc -l
This command says:
Dump contents of filename
Replace multiple spaces with a single space
Replace spaces with newline
Sort and "uniquify" the list
Print out the count of lines
Per the comment, you can technically get away without using cat if you'd like, with the following:
sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" filename | sort -u | wc -l
Further, from another comment, you could optionally use tr (importantly with its -s flag to handle repeated spaces) instead of sed, with something like:
tr -s " " "\n" < filename | sort -u | wc -l
The moral of the story is there are several ways this kind of thing can be accomplished, not to mention the other full answers that are given here :-) My personal favorite answer at this point is Ed Morton's which I've upvoted accordingly.
You could also lowercase the text so that words are compared regardless of case.
Also, filter words with the [:alnum:] character class rather than [a-zA-Z0-9_], which is only valid for US-ASCII and will fail dramatically with Greek or Turkish text.
#!/usr/bin/env bash
echo "The uniq words are the words that appears at least once, regardless of casing." |
# Turn text to lowercase
tr '[:upper:]' '[:lower:]' |
# Split alphanumeric with newlines
tr -sc '[:alnum:]' '\n' |
# Sort uniq words
sort -u |
# Count lines of unique words
wc -l
I would do it like so, with comments:
echo "Nory was a Catholic because her mother was a Catholic" |
# tr replace
# -s - squeeze
# -c - complementary
# [a-zA-Z0-9_] - all letters, numbers and underscore
# but complementary set, so all non letters, not numbers and not underscores.
# replace them by newline
# (no brackets around the set: tr would treat [ and ] as literal members)
tr -sc 'a-zA-Z0-9_' '\n' |
# and sort unique and display count
sort -u | wc -l
Tested on repl bash.
Decided to use [a-zA-Z0-9_], because this is how GNU sed \w extension matches a word.
cat yourfile.txt | xargs -n1 | sort | uniq -c > youroutputfile.txt
xargs -n1 = put one word per line
sort = sorts
uniq -c = counts occurrences of distinct values

Asterisk in bash variable

I have a file whose contents I'm retrieving this way.
Command
cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//'
Output
HLA-A*30:02,HLA-B*18:01,HLA-C*05:01
But when I try to save this in a variable, the asterisk and a letter disappear. I've tried several ways, adding/removing commas etc., and I'm still not able to print it properly.
hla=`cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//'`
echo $hla
HLA-05:01,HLA-18:01,HLA-30:02
echo "$hla"
HLA-05:01,HLA-18:01,HLA-30:02
There are multiple errors here, most of which will be aptly diagnosed by http://shellcheck.net/ without any human intervention.
You really should single-quote your regular expressions unless you specifically require the shell to perform wildcard expansion and whitespace tokenization on the regex before executing the command.
The obsolescent `command` in backticks introduces some unfortunate additional shell handling on the string inside the backticks. The solution since the 1990s is to prefer the $(command) syntax for command substitution, which does not exhibit this problem.
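A minimal demonstration of that extra backslash handling, using a hypothetical one-line input (and assuming no file names in the current directory match the unquoted globs):
printf 'A*30:02\n' | grep -o '[A-Z]\*[0-9]*:[0-9]*'      # prints A*30:02
x=`printf 'A*30:02\n' | grep -o [A-Z]\\*[0-9]*:[0-9]*`   # backticks reduce \\* to \*
echo "$x"                                                # the inner shell then eats the \ : prints 30:02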
The cat is useless; grep knows full well how to read a file.
Try this refactored code:
hla=$(grep -o '[A-Z]\*[0-9]*:[0-9]*' 2018_02_15_09_01_08_result.tsv |
    sort -u | sed -e 's/^/HLA-/' | tr '\n' ',' | sed '$ s/.$//')
echo "$hla"
The double quotes around the variable interpolation in the echo are necessary and useful. Notice also the line wraps for legibility and the use of sort -u in preference to sort | uniq; in general, try to reduce the number of processes (the trailing tr '\n' ',' | sed '$ s/.$//' simply joins the lines with commas and drops the final one, so it invites a simplification too). Perhaps the simplest fix would be to refactor all of this into a single Awk script, but without access to the input, it's hard to tell you in more detail what that might look like.
(Also, are you really sure you need to capture the value in a variable? Often variable=value; echo "$variable" is just an obscure and inefficient way to say echo "value". And variable=$(command); echo "$variable" is better written as simply command; capturing a command's standard output just so you can print it to standard output is a pure waste of cycles, unless you are planning to do something more with that variable's value.)
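For what it's worth, a sketch of that simplification: paste -sd, joins the lines with commas directly, replacing the tr | sed pair (assuming the sample output shown above is representative):
hla=$(grep -o '[A-Z]\*[0-9]*:[0-9]*' 2018_02_15_09_01_08_result.tsv |
    sort -u | sed 's/^/HLA-/' | paste -sd, -)
echo "$hla"   # HLA-A*30:02,HLA-B*18:01,HLA-C*05:01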
I've solved it by saving the output of the command with a redirection:
cat 2018_02_15_09_01_08_result.tsv |
grep -o [A-Z]\\*[0-9]*:[0-9]* |
sort | uniq |
sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//' > out_file
hla=`cat out_file`
echo $hla
which gets me the expected HLA-A*30:02,HLA-B*18:01,HLA-C*05:01. Not the ideal solution, but it works.

reverse a file in Unix shell

I have a file parse.txt
parse.txt contains the following
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0
and I want output.txt to contain the fields in reverse order (last to first), generated from parse.txt:
julie/hello/3.0,julie/hello/2.0,whitney/hello/1.0,remo/hello3/3.0,remo/hello2/2.0,remo/hello/1.0
I have tried the following code:
tail -r parse.txt
You can use the surprisingly helpful tac from GNU Coreutils.
tac -s "," parse.txt > newparse.txt
tac by default will "cat" the file to standard out, reversing the lines. By specifying the separator using the -s flag, you can simply reverse your fields as desired.
(You may need to do a post-processing step to get the commas to work out correctly, which can be another step in your pipeline.)
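One concrete way to do that cleanup is to turn the trailing newline into a separator before tac sees it, and convert the final comma back afterwards (a sketch):
tr '\n' ',' < parse.txt | tac -s ',' | sed 's/,$/\n/' > newparse.txt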
I like the tac solution; it's tight and elegant, but as Micah pointed out, tac is part of GNU Coreutils, which means that it's not available by default in FreeBSD, OSX, Solaris, etc.
This can be done in pure bash, no external tools required.
#!/usr/bin/env bash
unset comma
read foo < parse.txt
bar=(${foo//,/ })
for (( count="${#bar[@]}"; --count >= 0; )); do
printf "%s%s" "$comma" "${bar[$count]}"
comma=","
done
This obviously only handles one line, per your sample input. You can wrap it in something if you need to handle multiple lines of input.
The logic here is that we can convert the input into an array by replacing commas with spaces. Of course, if our input data included spaces, this would have to be adjusted. Once we have the array, we simply step backwards through it, printing each record.
Note that this does not include a terminating newline. If you want one, you can add it with:
printf '\n'
as a final line.
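An alternative sketch that splits on commas with read -a, which sidesteps the glob expansion that the unquoted ${foo//,/ } above can trigger:
IFS=, read -ra bar < parse.txt
out=
for ((i=${#bar[@]}-1; i>=0; i--)); do
  out+="${bar[i]},"
done
printf '%s\n' "${out%,}"   # drop the trailing comma, add the newline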
perl -F, -lane 'print join ",", reverse @F' parse.txt > output.txt
You can use this awk command:
awk -v RS=, '{a[++i]=$1} END{for (k=i; k>=1; k--) printf a[k] (k>1?RS:ORS)}' parse.txt
julie/hello/3.0,julie/hello/2.0,whitney/hello/1.0,remo/hello3/3.0,remo/hello2/2.0,remo/hello/1.0
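A slightly safer variant passes the data through a %s format, since printf treats its first argument as a format string and a literal % in a field would break the version above:
awk -v RS=, '{a[++i]=$1} END{for (k=i; k>=1; k--) printf "%s%s", a[k], (k>1?RS:ORS)}' parse.txt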
The question is tagged unix and you have mentioned tail -r, which suggests you might not be using Linux (with the full GNU toolchain) but instead some "real" Unix (BSD variant), e.g. OS X.
As such, the tac command is not available, but as mentioned in the question, tail -r is. So you can use the following:
$ tr ',' '\n' < parse.txt | tail -r | tr '\n' ',' | sed 's/,$//'
julie/hello/3.0,julie/hello/2.0,whitney/hello/1.0,remo/hello3/3.0,remo/hello2/2.0,remo/hello/1.0
$
Notes:
This only works for files that have one line, as we are relying on converting commas to newlines and back. If there is more than one line, then the newlines in between will get converted to commas by the second tr.
The final sed removes the trailing comma, which the second tr produced from the file's trailing newline.
Emulating tac with sed (each new line is prepended to the accumulating hold space, and only the final pattern space is printed):
tr , '\n' <parse.txt | sed '1!G; h; $!d' | paste -sd ,
Alternatively, if you don't have paste:
tr , '\n' <parse.txt | sed '1!G; h; $!d' | tr '\n' , | sed 's/,$//'
Output:
julie/hello/3.0,julie/hello/2.0,whitney/hello/1.0,remo/hello3/3.0,remo/hello2/2.0,remo/hello/1.0
You can use any language to do that
xargs ruby -e "puts ARGV[0].split(',').reverse.join(',')" < parse.txt
Reversing can be done with tac (cat spelled backwards). As noted in the comments, this reverses the lines, which is not what the OP asked for.
tac filename
You can still use tac if you tell it to split not on linefeeds but on the field separator, here a comma:
echo "a,b,c" | tr '\n' ',' | tac -s "," | sed 's/,$/\n/'

Add Tab Separator to Grep

I am new to grep and awk, and I would like to create tab separated values in the "frequency.txt" file output (this script looks at a large corpus and then outputs each individual word and how many times it is used in the corpus - I modified it for the Khmer language). I've looked around ( grep a tab in UNIX ), but I can't seem to find an example that makes sense to me for this bash script (I'm too much of a newbie).
I am using this bash script in cygwin:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
-e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
-e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
-e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
-e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
tr [:upper:] [:lower:] | \
sort | \
uniq -c | \
sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well) between the frequency and the term?
Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):
ព្រះ​វិញ្ញាណ​នឹង​ប្រពន្ធ​ថ្មោង​ថ្មី​ពោល​ថា
អញ្ជើញ​មក ហើយ​អ្នក​ណា​ដែល​ឮ​ក៏​ថា
អញ្ជើញ​មក​ដែរ អ្នក​ណា​ដែល​ស្រេក
នោះ​មាន​តែ​មក ហើយ​អ្នក​ណា​ដែល​ចង់​បាន
មាន​តែ​យក​ទឹក​ជីវិត​នោះ​ចុះ
ឥត​ចេញ​ថ្លៃ​ទេ។
Here is an example output of frequency.txt as it is now (frequency and then term):
25605 នឹង
25043 ជា
22004 បាន
20515 នោះ
I want the output frequency.txt to look like this (where TAB is an actual tab character):
25605TABនឹង
25043TABជា
22004TABបាន
20515TABនោះ
Thanks for your help!
You should be able to replace the whole lengthy sed command with this:
tr -d 'a-zA-Z0-9«»:;.,()?។”“|០១២៣៤៥៦៧៨៩-' | tr '\t' ' '
(The hyphen goes last so tr does not read it as a range, and the brackets are dropped because tr would treat [ and ] as literal members of the set. Caveat: tr works on bytes, so the multibyte Khmer characters in this set are not handled safely; keeping those deletions in sed is more robust.)
Comments:
's/​/ /g' - the pattern looks empty, but it actually contains the invisible word-separator character mentioned in the question; this replaces it with a regular space
's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' - the pipe characters don't delimit alternatives inside square brackets, they are literal (and more than one is redundant); the equivalent would be 's/[«»:;.,()?។”“|-]//g' (keeping one pipe in case you really want to delete them, and moving the hyphen last so it isn't taken as a range)
's/ /\n/g' - earlier, you replaced tabs with spaces, now you're replacing the spaces with newlines
You should be able to have the tabs you want by inserting this in your pipeline right after the uniq:
sed 's/^ *\([0-9]\+\) /\1\t/'
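With that inserted, the end of the script's pipeline becomes (a sketch of just the affected part, in the script's own style):
sort | \
uniq -c | \
sed 's/^ *\([0-9]\+\) /\1\t/' | \
sort -rn > frequency.txt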
If you want the AWK command to output a tab:
awk 'BEGIN{OFS="\t"} {print $2, $1}'
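Applied to the last line of the script, that would look like this (frequency_tab.txt is just a hypothetical output name):
grep -Fwf dictionary.txt frequency.txt | awk 'BEGIN{OFS="\t"} {print $2, $1}' > frequency_tab.txt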
What about redirecting the awk output to a file with ">"?
The following script should get you where you need to go. The pipe to tee will let you see output on the screen while at the same time writing the output to ./outfile
#!/bin/sh
sed ':a;N;s/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩\n«»:;.,()?”“-]//g;ta' < dictionary.txt | \
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
END{for(item in a)printf "%s\t%d ", item, a[item]}' | \
tee ./outfile

shell replace cr\lf by comma

I have input.txt
1
2
3
4
5
I need output.txt to look like this:
1,2,3,4,5
How to do it?
Try this:
tr '\n' ',' < input.txt > output.txt
With sed, you could use:
sed -e 'H;${x;s/\n/,/g;s/^,//;p;};d'
The H appends the pattern space to the hold space (saving the current line in the hold space). The ${...} surrounds actions that apply to the last line only. Those actions are: x swap hold and pattern space; s/\n/,/g substitute embedded newlines with commas; s/^,// delete the leading comma (there's a newline at the start of the hold space); and p print. The d deletes the pattern space - no printing.
You could also use, therefore:
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
The -n suppresses default printing so the final d is no longer needed.
This solution assumes that the CRLF line endings are the local native line ending (so you are working on DOS) and that sed will therefore generate the local native line ending in the print operation. If you have DOS-format input but want Unix-format (LF only) output, then you have to work a bit harder - but you also need to stipulate this explicitly in the question.
It worked OK for me on MacOS X 10.6.5 with the numbers 1..5, and 1..50, and 1..5000 (23,893 characters in the single line of output); I'm not sure that I'd want to push it any harder than that.
In response to @Jonathan's comment on @eumiro's answer:
tr -s '\r\n' ',' < input.txt | sed -e 's/,$/\n/' > output.txt
tr and sed used to be very good, but when it comes to file parsing and regex you can't beat perl.
(Not sure why people think that sed and tr are closer to the shell than perl...)
perl -pe 's/\n/,/' your_file
If you want pure shell to do it, then look at string substitution:
${string//substring/replacement}
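For example, a pure-bash sketch along those lines, reading the whole file into a variable and replacing the newlines via parameter expansion:
data=$(< input.txt)                       # $(< file) strips the trailing newline
printf '%s\n' "${data//$'\n'/,}" > output.txt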
Use the paste command. Here it is with a pipe (printf rather than echo, since bash's echo does not expand \n by default):
printf '1\n2\n3\n4\n5\n' | paste -s -d, /dev/stdin
Here it is with a file:
printf '1\n2\n3\n4\n5\n' > /tmp/input.txt
paste -s -d, /tmp/input.txt
Per the man page, -s concatenates all lines and -d defines the delimiter character.
Awk versions:
awk '{printf("%s,",$0)}' input.txt
awk 'BEGIN{ORS=","} {print $0}' input.txt
Output - 1,2,3,4,5,
Since you asked for 1,2,3,4,5 as compared to 1,2,3,4,5, (note the trailing comma after 5; most of the solutions above include it), here are two more versions with Awk (plus wc and sed) that get rid of the last comma:
i='input.txt'; awk -v c=$(wc -l $i | cut -d' ' -f1) '{printf("%s",$0);if(NR<c){printf(",")}}' $i
awk '{printf("%s,",$0)}' input.txt | sed 's/,\s*$//'
printf "1\n2\n3" | tr '\n' ','
if you want to output that to a file just do
printf "1\n2\n3" | tr '\n' ',' > myFile
if you have the content in a file do
cat myInput.txt | tr '\n' ',' > myOutput.txt
python version:
python -c 'import sys; print(",".join(sys.stdin.read().splitlines()))'
Doesn't have the trailing comma problem (because join works that way), and splitlines splits data on native line endings (and removes them).
cat input.txt | sed -e 's|$|,|' | xargs | tr -d ' '
(xargs with no arguments joins the lines through echo; tr then removes the spaces it inserted. Like some of the solutions above, this leaves a trailing comma.)
