bash shell script for mac to generate word list from a file? - macos

Is there a shell script that runs on a Mac to generate a word list from a text file, listing the unique words? Even better if it could sort by frequency...
Sorry, I forgot to mention: yes, I'd prefer a bash one, as I'm using a Mac now...
Oh, and my file is in French... (Basically I'm reading a novel and learning French, so I'm trying to generate a word list to help myself.) I hope this is not a problem?

If I understood you correctly, you need something like this:
cat <filename> | sed -e 's/ /\n/g' | sort | uniq -c
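Note that treating \n in the replacement as a newline is a GNU sed feature; the BSD sed that ships with macOS would insert a literal n instead. A tr-based equivalent that should behave the same on both (splitting on spaces and squeezing repeats):
cat <filename> | tr -s ' ' '\n' | sort | uniq -c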

This command will do it:
cat file.txt | tr "\"' " '\n' | sort -u
Here sort -u lists each word only once; if you also want frequency counts, use sort | uniq -c instead (thanks to Hank Gay):
cat file.txt | tr "\"' " '\n' | sort | uniq -c

Just answering my own question to jot down the final version I'm using:
tr -cs "[:alpha:]" "\n" < FileIn.txt | sort | uniq -c | awk '{print $2","$1}' >> FileOut.csv
some notes:
tr can be used directly to do the replacement.
since I'm interested in creating a word list for my French vocabulary, I used [:alpha:]
awk is used to insert a comma, so that the output is a CSV file, which is easier for me to upload...
thanks again to everyone helping me.
sorry I didn't make it clear at the beginning that I'm using a Mac and expecting a bash script.
cheers.
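One caveat, since the file is French: what [:alpha:] matches depends on your locale, so accented letters may get split on unless a UTF-8 locale is active (and tr's handling of multibyte characters varies between implementations). Here is a minimal sketch of the whole thing as a reusable script; the script name and locale value are just examples, adjust for your system:

#!/bin/bash
# wordlist.sh - build a frequency-sorted CSV word list from a text file
# usage: ./wordlist.sh FileIn.txt > FileOut.csv
export LC_ALL=fr_FR.UTF-8           # example locale, so [:alpha:] can match accented letters
tr -cs '[:alpha:]' '\n' < "$1" |    # split into one word per line
tr '[:upper:]' '[:lower:]' |        # fold case so "Le" and "le" count as one word
sort | uniq -c | sort -rn |         # count occurrences, most frequent first
awk '{print $2","$1}'               # emit word,count as CSV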

Related

Asterisk in bash variable

I have a file that contains info which I'm retrieving this way:
Command
cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//'
Output
HLA-A*30:02,HLA-B*18:01,HLA-C*05:01
But when I try to save this in a variable, the asterisk and a letter disappear. I've tried several ways, adding/removing commas etc., and I'm still not able to print it properly.
hla=`cat 2018_02_15_09_01_08_result.tsv | grep -o [A-Z]\\*[0-9]*:[0-9]* | sort | uniq | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//'`
echo $hla
HLA-05:01,HLA-18:01,HLA-30:02
echo "$hla"
HLA-05:01,HLA-18:01,HLA-30:02
There are multiple errors here, most of which will be aptly diagnosed by http://shellcheck.net/ without any human intervention.
You really should single-quote your regular expressions unless you specifically require the shell to perform wildcard expansion and whitespace tokenization on the regex before executing the command.
The obsolescent `command` in backticks introduces some unfortunate additional shell handling on the string inside the backticks. The solution since the 1990s is to prefer the $(command) syntax for command substitution, which does not exhibit this problem.
The cat is useless; grep knows full well how to read a file.
Try this refactored code:
hla=$(grep -o '[A-Z]\*[0-9]*:[0-9]*' 2018_02_15_09_01_08_result.tsv |
sort -u | sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//')
echo "$hla"
The double quotes around the variable interpolation in the echo are necessary and useful. Notice also the line wraps for legibility and the use of sort -u in preference to sort | uniq; in general, try to reduce the number of processes (once I understand what the sed | tr | sed does, I can probably propose a simplification for that, too). Perhaps the simplest fix would be to refactor all of this into a single Awk script, but without access to the input, it's hard to tell you in more detail what that might look like.
(Also, are you really sure you need to capture the value in a variable? Often variable=value; echo "$variable" is just an obscure and inefficient way to say echo "value". And variable=$(command); echo "$variable" is better written as simply command; capturing the command's standard output just so you can print it to standard output is a pure waste of cycles, unless you are planning to do something more with that variable's value.)
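For the record, here is a minimal demonstration of the quoting problem (the filenames A30 and data.tsv are hypothetical). An unquoted pattern is treated as a glob by the shell first, so what grep receives depends on what happens to be in the current directory; when nothing matches, the pattern is passed through unchanged, which is why unquoted regexes sometimes appear to work:
touch A30
grep -o [A-Z]*[0-9]* data.tsv     # the shell expands the glob: grep actually receives the pattern A30
grep -o '[A-Z]*[0-9]*' data.tsv   # quoted: grep receives the regex verbatim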
I've solved it by saving the output of the command with a redirection:
cat 2018_02_15_09_01_08_result.tsv |
grep -o [A-Z]\\*[0-9]*:[0-9]* |
sort | uniq |
sed -e 's/^/HLA-/' |tr '\n' ',' | sed '$ s/.$//' > out_file
hla=`cat out_file`
echo $hla
which gets me the expected HLA-A*30:02,HLA-B*18:01,HLA-C*05:01. Not the ideal solution, but it works.

using cut on a line having multiple instances of the same delimiter - unix

I am trying to write a generic script which can take different file names as input.
This is just a small part of my bash script.
For example, let's say folder 444-55 has 2 files:
qq.filter.vcf
ee.filter.vcf
I want my output to be -
qq
ee
I tried this and it worked -
ls /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf | sort | cut -f1 -d "." | xargs -n 1 basename
But let's say I have a folder like this -
/data2/delivery/Stack_overflow/de.1111_2222_3333_23/secondary/444-55/*.filter.vcf
My script's output would then be
de
de
How can I make it generic?
Thank you so much for your help.
Something like this in a script will "cut" it:
for i in /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf
do
basename "$i" | cut -f1 -d.
done | sort
advantages:
it does not parse the output of ls, which is frowned upon
it cuts after applying the basename treatment, so the cut ignores the full path
it sorts last, so the output is guaranteed to be sorted by prefix
Just move the basename call earlier in the pipeline:
printf "%s\n" /data2/delivery/Stack_overflow/1111_2222_3333_23/secondary/444-55/*.filter.vcf |
xargs -n 1 basename |
sort |
cut -f1 -d.
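If you want to avoid the external basename and cut calls altogether, here is a sketch using only bash parameter expansion (same example path as above):
for i in /data2/delivery/Stack_overflow/de.1111_2222_3333_23/secondary/444-55/*.filter.vcf
do
  name=${i##*/}        # strip the directory part, like basename
  echo "${name%%.*}"   # strip everything from the first dot onward
done | sort
Because the directory part is stripped first, the dot in de.1111_2222_3333_23 can no longer confuse the extraction.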

How to find most frequent string in file

I have a question about a bash script. Let's say there is a file which contains lines, and each line has a path to a file and a date. The problem is how to find the most frequent path.
Thanks in advance.
Here's a suggestion
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
# cut -d' ' -f1   select the file column
# sort            group identical paths together
# uniq -c         count each path
# sort -rn        sort by count, descending
# head -n1        print the top result
Example use:
$ cat file.txt
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileB jan:17:13:46:27:2015
/home/admin/fileC jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
/home/admin/fileA jan:17:13:46:27:2015
$ cut -d' ' -f1 file.txt | sort | uniq -c | sort -rn | head -n1
3 /home/admin/fileA
You can strip the 3 out of the final result with another cut.
Reverse the lines, cut the beginning (the 20-character date plus the separating space, hence -b 22-), reverse them again, then sort and count unique lines:
cat file.txt | rev | cut -b 22- | rev | sort | uniq -c
If you're absolutely sure you won't have whitespace in your paths, you can avoid rev altogether:
cat file.txt | cut -d " " -f 1 | sort | uniq -c
If the output is too long to inspect visually, aioobe's suggestion of following this with sort -rn | head -n1 will serve you well.
It's worth noting, as aioobe mentioned, that many unix commands optionally take a file argument. By using it, you can avoid the extra cat command at the beginning, supplying the file directly to the next command:
cat file.txt | rev | ... vs rev file.txt | ...
While I personally find the first option both easier to remember and understand, the second is preferred by many (most?) people, as it saves system resources (specifically, the memory and references used by an additional process) and can have better performance in some specific use cases. Wikipedia's cat article discusses this in detail.
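If you'd rather avoid the double sort entirely, a single awk pass can keep the counts itself. A minimal sketch, assuming the paths contain no spaces:
awk '{ count[$1]++ }                                # tally each path
     END { for (p in count)                         # find the biggest tally
             if (count[p] > max) { max = count[p]; best = p }
           print max, best }' file.txt
On the example above this prints 3 /home/admin/fileA.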

How to remove the last character from a bash grep output

COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2`
outputs something like this
"Abc Inc";
What I want to do is remove the trailing ";" as well. How can I do that? I am a beginner at bash. Any thoughts or suggestions would be helpful.
This will remove the last character contained in your COMPANY_NAME var regardless if it is or not a semicolon:
echo "$COMPANY_NAME" | rev | cut -c 2- | rev
I'd use sed 's/;$//'. eg:
COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2 | sed 's/;$//'`
foo="hello world"
echo ${foo%?}
hello worl
I'd use head --bytes -1, or head -c-1 for short. (Note that negative byte counts are a GNU coreutils extension; the BSD head on macOS doesn't support them.)
COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2 | head --bytes -1`
head outputs only the beginning of a stream or file. Typically it counts lines, but it can be made to count characters/bytes instead. head --bytes 10 will output the first ten characters, but head --bytes -10 will output everything except the last ten.
NB: you may have issues if the final character is multi-byte, but a semi-colon isn't
I'd recommend this solution over sed or cut because
It's exactly what head was designed to do, thus less command-line options and an easier-to-read command
It saves you having to think about regular expressions, which are cool/powerful but often overkill
It saves your machine having to think about regular expressions, so will be imperceptibly faster
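For example (GNU coreutils; the sample string stands in for the grep | cut output):
printf '"Abc Inc";' | head --bytes -1
"Abc Inc"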
I believe the cleanest way to strip a single character from a string with bash is:
echo ${COMPANY_NAME:: -1}
but I haven't been able to embed the grep piece within the curly braces, so your particular task becomes a two-liner:
COMPANY_NAME=$(grep "company_name" file.txt); COMPANY_NAME=${COMPANY_NAME:: -1}
This will strip any character, semicolon or not, but can get rid of the semicolon specifically, too.
To remove ALL semicolons, wherever they may fall:
echo ${COMPANY_NAME//;/}
To remove only a semicolon at the end:
echo ${COMPANY_NAME%;}
Or, to remove a whole run of semicolons from the end (this form needs shopt -s extglob):
echo ${COMPANY_NAME%%+(;)}
For great detail and more on this approach, The Linux Documentation Project covers a lot of ground at http://tldp.org/LDP/abs/html/string-manipulation.html
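A quick side-by-side of these expansions on a made-up value:
s='"Abc Inc";;'
echo "${s%;}"       # one semicolon stripped from the end: "Abc Inc";
shopt -s extglob
echo "${s%%+(;)}"   # the whole trailing run stripped: "Abc Inc"
echo "${s//;/}"     # every semicolon removed, wherever it appears: "Abc Inc"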
Using sed, if you don't know what the last character actually is:
$ grep company_name file.txt | cut -d '=' -f2 | sed 's/.$//'
"Abc Inc"
Don't abuse cats. Did you know that grep can read files, too?
The canonical approach would be this:
grep "company_name" file.txt | cut -d '=' -f 2 | sed -e 's/;$//'
The smarter approach would use a single perl or awk statement, which can filter and apply several transformations at once. For example, something like this:
COMPANY_NAME=$( perl -ne '/company_name=(.*);/ && print $1' file.txt )
You don't have to chain so many tools; just one awk command does the job:
COMPANY_NAME=$(awk -F"=" '/company_name/{gsub(/;$/,"",$2) ;print $2}' file.txt)
In Bash using only one external utility:
IFS='= ' read -r discard COMPANY_NAME <<< $(grep "company_name" file.txt)
COMPANY_NAME=${COMPANY_NAME/%?}
Assuming the quotation marks are actually part of the output, couldn't you just use the -o switch to return everything between the quote marks?
COMPANY_NAME='"ABC Inc";'
echo "$COMPANY_NAME" | grep -o '".*"'
You can strip the beginning and end of a string by N characters using this bash construct, as someone said already:
$ fred=abcdefg.rpm
$ echo ${fred:1:-4}
bcdefg
HOWEVER, this is not supported in older versions of bash (negative lengths need bash 4.2 or later), as I discovered just now writing a script for a Red Hat EL6 install process. This is the sole reason for posting here.
A hacky way to achieve this on older bash is to use sed with an extended regex, like this (use -E instead of -r with BSD sed):
$ fred=abcdefg.rpm
$ echo $fred | sed -re 's/^.(.*)....$/\1/g'
bcdefg
Some refinements to the answer above. To remove more than one character, add multiple question marks. For example, to remove the last two characters from the variable $SRC_IP_MSG, you can use:
SRC_IP_MSG=${SRC_IP_MSG%??}
Or cut again, this time on the semicolon:
cat file.txt | grep "company_name" | cut -d '=' -f 2 | cut -d ';' -f 1
I am not finding that sed 's/;$//' works; it doesn't trim anything, though I'm wondering whether that's because the character I'm trying to trim off happens to be a "$" (a literal $ would need escaping in the regex, as in sed 's/\$$//'). What does work for me is sed 's/.\{1\}$//'.

How to reverse lines of a text file?

I'm writing a small shell script that needs to reverse the lines of a text file. Is there a standard filter command to do this sort of thing?
My specific application is that I'm getting a list of Git commit identifiers, and I want to process them in reverse order:
git log --pretty=oneline work...master | grep -v DEBUG: | cut -d' ' -f1 | reverse
The best I've come up with is to implement reverse like this:
... | cat -b | sort -rn | cut -f2-
This uses cat to number every line, then sort to sort them in descending numeric order (which ends up reversing the whole file), then cut to remove the unneeded line number.
The above works for my application, but may fail in the general case because cat -b only numbers nonblank lines.
Is there a better, more general way to do this?
In GNU coreutils, there's tac(1)
There is a command for your purpose:
tail -r file.txt
It prints the lines of file.txt in reverse order!
The -r flag is non-standard and may not work on all systems; it works e.g. on macOS.
Beware: the number of lines tail -r can handle may be limited. It works in most cases, but when working with huge files, be careful and check.
The answer is not 42 but tac.
Edit: slower and more memory-hungry, using sed:
sed 'x;1!H;$!d;x'
and even longer
perl -e'print reverse<>'
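In case the sed is cryptic, it accumulates the reversed file in the hold space:
# x     swap: the current line goes into hold, the accumulated text into pattern
# 1!H   except on line 1, append the accumulation below the current line in hold
# $!d   unless this is the last line, print nothing and start the next cycle
# x     on the last line, swap the finished reversal back into pattern so it prints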
Similar to the sed example above, using perl - maybe more memorable (depending on how your brain is wired):
perl -e 'print reverse <>'
"cat -b only numbers nonblank lines"
If that's the only issue you want to avoid, then why not use cat -n to number all the lines?
: "#(#)$Id: reverse.sh,v 1.2 1997/06/02 21:45:00 johnl Exp $"
#
# Reverse the order of the lines in each file
awk '{ printf("%d:%s\n", NR, $0); }' "$@" |
sort -t: -k1,1nr |
sed 's/^[0-9][0-9]*://'
Works like a charm for me...
In this case, just use --reverse:
$ git log --reverse --pretty=oneline work...master | grep -v DEBUG: | cut -d' ' -f1
Note that rev reverses the characters within each line, not the order of the lines, so it does not answer this question:
rev <name of your text file.txt>
You can even do this:
echo <whatever you want to type>|rev
but either way you get each line mirrored, not the file reversed.
awk '{a[i++]=$0}END{for(;i-->0;)print a[i]}'
Faster than sed, and compatible with embedded devices like OpenWrt.
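If you'd rather stay in pure bash (4.0+ for mapfile) with no external tools at all, a minimal sketch:
mapfile -t lines < file.txt                     # slurp all lines into an array
for ((i = ${#lines[@]} - 1; i >= 0; i--)); do
  printf '%s\n' "${lines[i]}"                   # print from the last index down
done
Like the awk version, this holds the whole file in memory, so it only suits files that fit comfortably in RAM.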
