Grep Spellchecker - shell

I am trying to write a simple shell script that takes a text file as input and checks all non-punctuated words against a dictionary (english.txt). It should return all non-matching (misspelled) words. I am using grep but it does not seem to successfully match all the lines in english.txt. I have included my code below.
#!/bin/bash
cat "$1" |                      # read the input file
tr ' \t' '\n\n' |               # put each whitespace-separated word on its own line
sed -e "/'/d" |                 # drop lines containing apostrophes
tr -d '[:punct:]' |             # strip punctuation
tr -cd '[:alpha:]\n' |          # keep only letters and newlines
sed -e "/^$/d" |                # remove empty lines
grep -v -i -w -f english.txt    # print words that match no dictionary line
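For reference, the final grep stage is meant to print only the words that match no line of english.txt as a whole word, case-insensitively. A toy check of just that stage, using a two-word inline dictionary rather than english.txt (an illustration of the flags, not a fix for the question):
printf 'hello\nhelo\n' | grep -v -i -w -f <(printf 'hello\nworld\n')
# prints only "helo", since it matches no dictionary word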

Related

GNU parallel with custom script doing string comparison

The following script.sh compares part of a string (coming from stdin by cat-ing a CSV file) to a defined reference string and reports the differences in a certain format:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
while read line; do
line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
echo "$(echo ${line:0:35}, $output)"
done < "${1:-/dev/stdin}"
It is intended to be executed on a number of rows from a very large file in the format
XYZ,ABMDEFG
and it works well when I use it in a pipe:
cat large_file | ./find_something.sh
However, when I try to use it with parallel, I get this error:
$ cat large_file | parallel ./find_something.sh
./find_something.sh: line 9: XYZ, ABMDEFG : No such file or directory
What is causing this? Is parallel supposed to work for something like this, if I want to redirect the output to a single file afterwards?
Less important side note: I'm rather proud of my string comparison method, but if someone has a faster way to go from comparing ABCDEFG against XYZ,ABMDEFG to the output XYZ,C3M, I'd be happy to hear that, too.
Edit:
I should have said, I also want to preserve the order of each line in the output, corresponding to the input. Is that possible using parallel?
Your script accepts its input from a file (defaulting to stdin), whereas parallel will pass input as arguments, not via stdin. In that sense, parallel is closer to xargs.
Presumably, you want each of the lines in large_file to be processed as a unit, possibly in parallel.
That means you need your script to only process one such line at a time, and let parallel call your script many times, once for each line.
So your script should look like this:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
line="$1"
line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
echo "$(echo ${line:0:35}, $output)"
Then you can redirect to a file as follows:
cat large_file | parallel ./find_something.sh > output_file
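If you also need the output lines in the same order as the corresponding input lines (per the edit), GNU parallel's -k option preserves that ordering, so the same call becomes (a sketch):
cat large_file | parallel -k ./find_something.sh > output_file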
An alternative is to keep the while read loop unchanged and let GNU parallel split stdin into chunks with --pipe, feeding each chunk to one copy of the function; here, -k likewise keeps the output in input order:
#!/usr/bin/env bash
doit() {
    reference="ABCDEFG"
    ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
    while read line; do
        line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
        output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
        echo "$(echo ${line:0:35}, $output)"
    done
}
export -f doit
cat large_file | parallel --pipe -k doit
#or
parallel --pipepart -a large_file --block -10 -k doit
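Both invocations write to stdout, so collecting the ordered result into a single file is just a redirection, e.g.:
parallel --pipepart -a large_file --block -10 -k doit > output_file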

grep search with filename as parameter

I'm working on a shell script.
OUT=$1
Here, the OUT variable is my filename.
I'm using grep search as follows:
l=`grep "$pattern " -A 15 $OUT | grep -w $i | awk '{print $8}'|tail -1 | tr '\n' ','`
The issue is that the filename parameter I must pass is test.log. However, I have the folder structure:
test.log
test.log.001
test.log.002
I would ideally like to pass the filename as test.log and have it search all of the log files. I know the usual way is to use test.log.* on the command line, but I'm having difficulty replicating that inside the shell script.
My efforts:
var=$'.*'
l=`grep "$pattern " -A 15 $OUT$var | grep -w $i | awk '{print $8}'|tail -1 | tr '\n' ','`
However, I did not get the desired result.
Hopefully this will get you closer:
#!/bin/bash
for f in "${1}*"; do
grep "$pattern" -A15 "$f"
done | grep -w $i | awk 'END{print $8}'
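Applied to the original one-liner, the same idea is to keep the variable quoted but put the glob outside the quotes so the shell can expand it (a sketch, assuming $pattern and $i are set as in the question):
l=$(grep "$pattern " -A 15 "$OUT"* | grep -w "$i" | awk '{print $8}' | tail -1 | tr '\n' ',')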

echo -e cat: argument line too long

I have a bash script that merges a huge list of text files and filters the result. However, I run into an 'argument line too long' error due to the huge list.
echo -e "`cat $dir/*.txt`" | sed '/^$/d' | grep -v "\-\-\-" | sed '/</d' | tr -d \' | tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' | sort -us -o $output
I have seen some similar answers here, and I know I could fix it by using find and cat-ing the files first. However, I would like to know the best way to keep this as a one-liner using echo -e and cat without breaking the code and without hitting the 'argument line too long' error. Thanks.
First, with respect to the most immediate problem: Using find ... -exec cat -- {} + or find ... -print0 | xargs -0 cat -- will prevent more arguments from being put on the command line to cat than it can handle.
The more portable (POSIX-specified) alternative to echo -e is printf '%b\n'; this is available even in configurations of bash where echo -e prints -e on output (as when the xpg_echo and posix flags are set).
However, if you use read without the -r argument, the backslashes in your input string are removed, so neither echo -e nor printf %b will be able to process them later.
Fixing this can look like:
while IFS= read -r line; do
    printf '%b\n' "$line"
done \
    < <(find "$dir" -name '*.txt' -exec cat -- '{}' +) \
    | sed [...]
That said, you do not need echo or cat here at all: grep can read the files itself, and -h suppresses the filename prefixes it would otherwise add when given multiple files:
grep -hv '^$' "$dir"/*.txt | grep -v "\-\-\-" | sed '/</d' | tr -d \' \
| tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' \
| sort -us -o "$output"
If you think about it some more you can probably get rid of a lot more stuff and turn it into a single sed and sort, roughly:
sed -e '/^$/d' -e '/---/d' -e '/</d' -e 's,['\''\\/<>(){}!?~;.:+`*_ͱ-],,g' \
-e 's/  */ /g' -e 's/^[ \t]*//' "$dir"/*.txt | sort -us -o "$output"
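And if it is the sheer number of *.txt files that eventually overflows the limit, the find approach from above can feed the same filter chain, so that neither the file contents nor an oversized glob ends up on one command line (a sketch combining the two ideas):
find "$dir" -name '*.txt' -exec cat -- '{}' + \
| sed '/^$/d' | grep -v "\-\-\-" | sed '/</d' | tr -d \' \
| tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' \
| sort -us -o "$output"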

Bash script builds correct $cmd but fails to execute complex stream

This short script scrapes some log files daily to create a simple extract. It works from the command line, and when I echo $cmd and copy/paste it, it also works. But it breaks when I try to execute it from the script itself.
I know this is a nightmare of patterns that I could probably improve, but am I missing something simple to just execute this correctly?
#!/bin/bash
priorday=$(date --date yesterday +"%Y-%m-%d")
outputfile="/home/CCHCS/da14/$priorday""_PROD_message_processing_times.txt"
cmd="grep 'Processed inbound' /home/rules/care/logs/RootLog* | cut -f5,6,12,16,18 -d\" \" | grep '^"$priorday"' | sed 's/\,/\./' | sed 's/ /\t/g' | sed -r 's/([0-9]+\-[0-9]+\-[0-9]+)\t/\1 /' | sed 's/ / /g' | sort >$outputfile"
printf "command to execute:\n"
echo $cmd
printf "\n"
$cmd
output:
./make_log_extract.sh command to execute: grep 'Processed inbound' /home/rules/care/logs/RootLog.log /home/rules/care/logs/RootLog.log.1
/home/rules/care/logs/RootLog.log.10
/home/rules/care/logs/RootLog.log.11
/home/rules/care/logs/RootLog.log.12
/home/rules/care/logs/RootLog.log.2
/home/rules/care/logs/RootLog.log.3
/home/rules/care/logs/RootLog.log.4
/home/rules/care/logs/RootLog.log.5
/home/rules/care/logs/RootLog.log.6
/home/rules/care/logs/RootLog.log.7
/home/rules/care/logs/RootLog.log.8
/home/rules/care/logs/RootLog.log.9 | cut -f5,6,12,16,18 -d" " | grep
'^2014-01-30' | sed 's/\,/./' | sed 's/ /\t/g' | sed -r
's/([0-9]+-[0-9]+-[0-9]+)\t/\1 /' | sed 's/ / /g' | sort
/home/CCHCS/da14/2014-01-30_PROD_message_processing_times.txt
grep: 5,6,12,16,18: No such file or directory
As grebneke comments, do not store the command and then execute it.
What you can do is execute the pipeline directly but print it first: see Bash: Print each command before executing?
priorday=$(date --date yesterday +"%Y-%m-%d")
outputfile="/home/CCHCS/da14/$priorday""_PROD_message_processing_times.txt"
set -o xtrace # <-- set printing mode "on"
grep 'Processed inbound' /home/rules/care/logs/RootLog* | cut -f5,6,12,16,18 -d' ' | grep "^$priorday" | sed 's/\,/\./' | sed 's/ /\t/g' | sed -r 's/([0-9]+\-[0-9]+\-[0-9]+)\t/\1 /' | sed 's/ / /g' | sort > "$outputfile"
set +o xtrace # <-- revert to normal
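If the pipeline was only put into $cmd so it could be logged or reused, a shell function gives the same effect without flattening it into a string (a sketch; extract_times is just an illustrative name):
priorday=$(date --date yesterday +"%Y-%m-%d")
outputfile="/home/CCHCS/da14/${priorday}_PROD_message_processing_times.txt"
extract_times() {
    grep 'Processed inbound' /home/rules/care/logs/RootLog* \
        | cut -f5,6,12,16,18 -d' ' \
        | grep "^$priorday" \
        | sed 's/\,/\./' | sed 's/ /\t/g' \
        | sed -r 's/([0-9]+\-[0-9]+\-[0-9]+)\t/\1 /' | sed 's/ / /g' \
        | sort > "$outputfile"
}
extract_times    # the function name documents what the string was meant to do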

Use each line of piped output as parameter for script

I have an application (myapp) that gives me a multiline output
result:
abc|myparam1|def
ghi|myparam2|jkl
mno|myparam3|pqr
stu|myparam4|vwx
With grep and sed I can get my parameters as below
myapp | grep '|' | sed -e 's/^[^|]*//' | sed -e 's/|.*//'
But I then want these myparamx values as parameters of a script that should be executed once for each value:
myscript.sh myparam1
myscript.sh myparam2
etc.
Any help greatly appreciated
Please see xargs. For example:
myapp | grep '|' | sed -e 's/^[^|]*//' | sed -e 's/|.*//' | xargs -n 1 myscript.sh
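If the extracted values could ever contain spaces, xargs -I treats each input line as a single argument (a sketch, with the extraction stage unchanged):
myapp | grep '|' | sed -e 's/^[^|]*//' | sed -e 's/|.*//' | xargs -I {} myscript.sh {}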
Maybe this can help -
myapp | awk -F"|" '{ print $2 }' | while read -r line; do /path/to/script/ "$line"; done
I like the xargs -n 1 solution from Dark Falcon, and while read is the classical tool for this kind of thing, but just for completeness:
myapp | awk -F'|' '{print "myscript.sh", $2}' | bash
As a side note, speaking of extracting the 2nd field, you could also use cut:
myapp | cut -d'|' -f 2 # fields are numbered from 1, so -f 2 is the second field
