How to find all non-dictionary words in a file in bash/zsh? - bash

I'm trying to find all words in a file that don't exist in the dictionary. If I look for a single word the following works
b=ther; look $b | grep -i "^$b$" | ifne -n echo $b => ther
b=there; look $b | grep -i "^$b$" | ifne -n echo $b => [no output]
However, if I try to run a "while read" loop:
while read a; do look $a | grep -i "^$a$" | ifne -n echo "$a"; done < <(tr -s '[[:punct:][:space:]]' '\n' <lotr.txt |tr '[:upper:]' '[:lower:]')
The output seems to contain all (?) words in the file. Why doesn't this loop only output non-dictionary words?

Regarding ifne
If stdin is non-empty, ifne -n simply passes it through to stdout, so your loop prints the grep match for every dictionary word in addition to the echo for every non-dictionary word. From the manpage:
-n Reverse operation. Run the command if the standard input is empty
Note that if the standard input is not empty, it is passed through
ifne in this case.
strace on ifne confirms this behavior.
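A quick way to see this at the prompt (a sketch; ifne comes from moreutils):
printf 'hello\n' | ifne -n echo empty   # stdin not empty: "hello" is passed through
printf '' | ifne -n echo empty          # stdin empty: the command runs and prints "empty"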
Alternative
Perhaps, as an alternative:
#!/bin/bash -e

export PATH=/bin:/sbin:/usr/bin:/usr/sbin

while read a; do
    look "$a" | grep -qi "^$a$" || echo "$a"
done < <(
    tr -s '[[:punct:][:space:]]' '\n' < lotr.txt \
    | tr '[A-Z]' '[a-z]' \
    | sort -u \
    | grep .
)
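If the per-word look lookups feel slow on a large file, a set-difference with comm is another option. This is only a sketch; it assumes the dictionary lives at /usr/share/dict/words (the word list look consults by default on most systems):
#!/bin/bash
# Print words from lotr.txt that are not in the system word list.
# comm -23 keeps lines unique to the first (sorted) input.
dict=/usr/share/dict/words   # assumption: adjust to your system's word list
tr -s '[[:punct:][:space:]]' '\n' < lotr.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort -u \
    | grep . \
    | comm -23 - <(tr '[:upper:]' '[:lower:]' < "$dict" | sort -u)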

Related

how to awk pattern as variable and loop the result?

I assign a keyword to a variable and need to awk from a file using this variable in a loop. The file has millions of lines.
I have tried the code below.
DEVICE="DEV2"
while read -r line
do
echo $line
X_keyword=`echo $line | cut -d ',' -f 2 | grep -w "X" | cut -d '=' -f2`
echo $X_keyword
done <<< "$(grep -w $DEVICE $config)"
log="Dev2_PRT.log"
while read -r file
do
VALUE=`echo $file | cut -d '|' -f 1`
HEADER=`echo $VALUE | cut -c 1-4`
echo $file
if [[ $HEADER = 'PTR:' ]]; then
VALUE=`echo $file | cut -d '|' -f 4`
echo $VALUE
XCOORD+=($VALUE)
((X++))
fi
done <<< "awk /$X_keyword/ $log"
Expected result:
The log file contains lots of lines like these:
PTR:1|2|3|4|X_keyword
PTR:1|2|3|4|Y_rest .....
Filter on the X_keyword and get field number 4.
Unfortunately your shell script is simply the wrong approach to this problem (see https://unix.stackexchange.com/q/169716/133219 for some of the reasons why) so you should set it aside and start over.
To demonstrate the solution, let's create a sample input file:
$ seq 10 | tee file
1
2
3
4
5
6
7
8
9
10
and a shell variable to hold a regexp that's a bracket expression matching the characters 5, 6, or 7:
$ var='[567]'
Now, given the above input, here is how to g/re/p a pattern held in a variable and count the matches:
$ awk -v re="$var" '$0~re{print; c++} END{print "---" ORS c+0}' file
5
6
7
---
3
If that's not all you need then please edit your question to clarify your requirements and provide concise, testable sample input and expected output.
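Applied to the sample log lines in the question (a hypothetical adaptation: it assumes the goal is to collect field 4 of every |-delimited line whose last field matches the keyword), the same idea looks like:
#!/usr/bin/env bash
keyword="X_keyword"   # assumption: the pattern you want to match
log="Dev2_PRT.log"

# -F'|' splits on |; for lines whose last field matches the keyword, print field 4.
mapfile -t XCOORD < <(awk -F'|' -v re="$keyword" '$NF ~ re {print $4}' "$log")

echo "found ${#XCOORD[@]} values: ${XCOORD[*]}"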

why shell for expression cannot parse xargs parameter correctly

I have a blacklist that stores a tag id list, e.g. 1-3,7-9, which actually represents 1,2,3,7,8,9. I can expand it with the shell below:
for i in {1..3,7..9}; do for j in {$i}; do echo -n "$j,"; done; done
1,2,3,7,8,9
but first I should convert - to ..
echo -n "1-3,7-9" | sed 's/-/../g'
1..3,7..9
then put it into the for expression as a parameter:
echo -n "1-3,7-9" | sed 's/-/../g' | xargs -I # for i in {#}; do for j in {$i}; do echo -n "$j,"; done; done
zsh: parse error near `do'
echo -n "1-3,7-9" | sed 's/-/../g' | xargs -I # echo #
1..3,7..9
but the for expression cannot parse it correctly. Why is that?
Because you didn't do anything to stop the outermost shell from interpreting the special keywords and characters (do, for, $, etc.) that you mean to be run by xargs.
xargs isn't a shell built-in; it gets the command line to run for each element on stdin from its arguments. Just as with any other program, if you want ; or any other sequence special to bash inside an argument, you need to escape it somehow.
It seems like what you really want here is to invoke a command (your nested for loops) in a subshell for each input element.
I've come up with this; it seems to do the job:
echo -n "1-3,7-9" \
| sed 's/-/../g' \
| xargs -I # \
bash -c "for i in {#}; do for j in {\$i}; do echo -n \"\$j,\"; done; done;"
which gives:
{1..3},{7..9},
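The inner loop still prints literal brace expressions because brace expansion happens before parameter expansion inside that bash -c. A sketch that forces a second expansion pass with eval:
echo -n "1-3,7-9" \
| sed 's/-/../g' \
| xargs -I # \
    bash -c 'for i in {#}; do for j in $(eval "echo {$i}"); do echo -n "$j,"; done; done'
which should give:
1,2,3,7,8,9,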
You could use the shell below to achieve this:
# on macOS the newline needs special treatment
echo "1-3,7-9" | sed -e 's/-/../g' -e $'s/,/\\\n/g' | xargs -I# echo 'for i in {#}; do echo -n "$i,"; done' | bash
1,2,3,7,8,9,%
#Linux
echo "1-3,7-9" | sed -e 's/-/../g' -e 's/,/\n/g' | xargs -I# echo 'for i in {#}; do echo -n "$i,"; done' | bash
1,2,3,7,8,9,
but this way is a little complicated; maybe awk is more intuitive:
# awk
echo "1-3,7-9,11,13-17" | awk '{n=split($0,a,","); for(i=1;i<=n;i++){m=split(a[i],a2,"-");for(j=a2[1];j<=a2[m];j++){print j}}}' | tr '\n' ','
1,2,3,7,8,9,11,13,14,15,16,17,%
echo -n "1-3,7-9" | perl -ne 's/-/../g;$,=",";print eval $_'

Remove all chars that are not a digit from a string

I'm trying to make a small function that removes all the chars that are not digits.
123a45a ---> will become ---> 12345
I've come up with:
temp=$word | grep -o [[:digit:]]
echo $temp
But instead of 12345 I get 1 2 3 4 5. How do I get rid of the spaces?
Pure bash:
word=123a45a
number=${word//[^0-9]}
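Wrapped into the small function the question asks for (just a sketch; the name digits_only is illustrative):
digits_only() {
    local word=$1
    printf '%s\n' "${word//[^0-9]/}"   # drop every non-digit character
}

digits_only "123a45a"   # prints 12345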
Here's a pure bash solution
var='123a45a'
echo ${var//[^0-9]/}
12345
Is this what you are looking for?
kent$ echo "123a45a"|sed 's/[^0-9]//g'
12345
grep & tr
echo "123a45a"|grep -o '[0-9]'|tr -d '\n'
12345
I would recommend using sed or perl instead:
temp="$(sed -e 's/[^0-9]//g' <<< "$word")"
temp="$(perl -pe 's/\D//g' <<< "$word")"
Edited to add: If you really need to use grep, then this is the only way I can think of:
temp="$( grep -o '[0-9]' <<< "$word" \
| while IFS= read -r ; do echo -n "$REPLY" ; done
)"
. . . but there's probably a better way. (It uses grep -o, like your solution, then runs over the lines that it outputs and re-outputs them without line-breaks.)
Edited again to add: Now that you've mentioned that you use can use tr instead, this is much easier:
temp="$(tr -cd 0-9 <<< "$word")"
What about using sed?
$ echo "123a45a" | sed -r 's/[^0-9]//g'
12345
As I read it, you are only allowed to use grep and tr, so this can do the trick:
$ echo "123a45a" | grep -o '[[:digit:]]' | tr -d '\n'
12345
In your case:
temp=$(echo "$word" | grep -o '[[:digit:]]' | tr -d '\n')
tr will also work:
echo "123a45a" | tr -cd '[:digit:]'
# output: 12345
grep returns the result on separate lines:
$ echo -e "$temp"
1
2
3
4
5
So you cannot remove those separators during the filtering, but you can afterwards, by transforming $temp like this:
temp=`echo $temp | tr -d ' '`
$ echo "$temp"
12345

Too many arguments error in shell script

I am trying a simple shell script like the following:
#!/bin/bash
up_cap=$( cat result.txt | cut -d ":" -f 6,7 | sort -n | cut -d " " -f 2 | sort -n)
down_cap=$( cat result.txt | cut -d : -f 6,7 | sort -n | cut -d " " -f 6| sort -n)
for value in "${down_cap[@]}"; do
    if [ $value > 80000 ]; then
        cat result.txt | grep -B 1 "$value"
    fi
done
echo " All done, exiting"
when I execute the above script as ./script.sh, I get the error:
./script.sh: line 5: [: too many arguments
All done, exiting
I have googled enough, and still not able to rectify this.
You want
if [ "$value" -gt 80000 ]; then
Use -gt to check whether A is bigger than B, not >. I merely added the quotation marks to prevent the script from failing in case $value is empty.
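A quick demonstration of the difference: inside [ ], an unescaped > is treated as output redirection, while -gt does a numeric comparison.
[ 70000 > 80000 ] && echo "looks true"     # prints "looks true": > only redirects (it even creates
                                           # a file named 80000) and the test checks that "70000" is non-empty
[ 70000 -gt 80000 ] || echo "not greater"  # prints "not greater": the intended numeric test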
Try to declare variable $value explicitly:
declare -i value
So, with dominikh's additions and mine, the code should look like this:
#!/bin/bash
up_cap=$( cat result.txt | cut -d ":" -f 6,7 | sort -n | cut -d " " -f 2 | sort -n)
down_cap=$( cat result.txt | cut -d : -f 6,7 | sort -n | cut -d " " -f 6| sort -n)
for value in "${down_cap[@]}"; do
    declare -i value
    if [ "$value" -gt 80000 ]; then
        cat result.txt | grep -B 1 "$value"
    fi
done
echo " All done, exiting"

How to assign output of multiple shell commands to variables when using tee?

I want to tee and get the results from multiple shell commands connected in a pipeline. I made a simple example to explain the point. Suppose I want to count the occurrences of 'a', 'b' and 'c'.
echo "abcaabbcabc" | tee >(tr -dc 'a' | wc -m) >(tr -dc 'b' | wc -m) >(tr -dc 'c' | wc -m) > /dev/null
Then I tried to assign the result from each count to a shell variable, but they all end up empty.
echo "abcaabbcabc" | tee >(A=$(tr -dc 'a' | wc -m)) >(B=$(tr -dc 'b' | wc -m)) >(C=$(tr -dc 'c' | wc -m)) > /dev/null && echo $A $B $C
What is the right way to do it?
The assignments in your attempt happen inside the process substitutions, which run in subshells, so they never reach the parent shell. Use files instead; they are the single most reliable solution. Each of the commands may take a different amount of time to run, and there is no easy way to synchronize command redirections, so the most reliable way is to use a separate "entity" to collect all the data:
tmpa=$(mktemp) tmpb=$(mktemp) tmpc=$(mktemp)
trap 'rm "$tmpa" "$tmpb" "$tmpc"' EXIT
echo "abcaabbcabc" |
tee >(tr -dc 'a' | wc -m > "$tmpa") >(tr -dc 'b' | wc -m > "$tmpb") |
tr -dc 'c' | wc -m > "$tmpc"
A=$(<"$tmpa")
B=$(<"$tmpb")
C=$(<"$tmpc")
rm "$tmpa" "$tmpb" "$tmpc"
trap '' EXIT
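For the example input the collected values come out as follows (consistent with the letter counts shown further down):
echo "A=$A B=$B C=$C"   # -> A=4 B=4 C=3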
Second way:
You can prepend the data from each stream with a custom prefix. Then sort all lines (basically, buffer them) on the prefix and then read them. The example script will generate only a single number from each process substitution, so it's easy to do:
read -r A B C < <(
echo "abcaabbcabc" |
tee >(
tr -dc 'a' | wc -m | sed 's/^/A /'
) >(
tr -dc 'b' | wc -m | sed 's/^/B /'
) >(
tr -dc 'c' | wc -m | sed 's/^/C /'
) >/dev/null |
sort |
cut -d' ' -f2 |
paste -sd' '
)
echo A="$A" B="$B" C="$C"
Using temporary files with flock to synchronize the output of child processes could look like this:
tmpa=$(mktemp) tmpb=$(mktemp) tmpc=$(mktemp)
trap 'rm "$tmpa" "$tmpb" "$tmpc"' EXIT
echo "abcaabbcabc" |
(
flock 3
flock 4
flock 5
tee >(
tr -dc 'a' | wc -m |
{ sleep 0.1; cat; } > "$tmpa"
# unblock main thread
flock -u 3
) >(
tr -dc 'b' | wc -m |
{ sleep 0.2; cat; } > "$tmpb"
# unblock main thread
flock -u 4
) >(
tr -dc 'c' | wc -m |
{ sleep 0.3; cat; } > "$tmpc"
# unblock main thread
flock -u 5
) >/dev/null
# wait for subprocesses to finish
# need to re-open the files to block on them
(
flock 3
flock 4
flock 5
) 3<"$tmpa" 4<"$tmpb" 5<"$tmpc"
) 3<"$tmpa" 4<"$tmpb" 5<"$tmpc"
A=$(<"$tmpa")
B=$(<"$tmpb")
C=$(<"$tmpc")
declare -p A B C
You can use this full-featured letter frequency analysis:
#!/usr/bin/env bash
declare -A letter_frequency
while read -r v k; do
    letter_frequency[$k]="$v"
done < <(
    grep -o '[[:alnum:]]' <<<"abcaabbcabc" |
    sort |
    uniq -c
)
for k in "${!letter_frequency[@]}"; do
    printf '%c = %d\n' "$k" "${letter_frequency[$k]}"
done
Output:
c = 3
b = 4
a = 4
Or to only assign $A, $B and $C as in your example:
#!/usr/bin/env bash
{
    read -r A _
    read -r B _
    read -r C _
} < <(
    grep -o '[[:alnum:]]' <<<"abcaabbcabc" |
    sort |
    uniq -c
)
printf 'a=%d\nb=%d\nc=%d\n' "$A" "$B" "$C"
grep -o '[[:alnum:]]': put each alphanumeric character on its own line
sort: sort the lines of characters
uniq -c: count each instance and output the count and the character for each
< <( command group; ): the output of this command group is fed to the stdin of the command group before it
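Run as written, the snippet above should print (counts taken from the same frequency table shown earlier):
a=4
b=4
c=3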
If you need to count occurrences of non-printable characters, newlines, spaces, or tabs, you have to make all these commands output and consume null-delimited lists. It can certainly be done with the GNU versions of these tools; I leave it to you as an exercise.
Solution to count arbitrary characters except null:
As demonstrated, it also works with Unicode.
#!/usr/bin/env bash
declare -A character_frequency
declare -i v
while read -d '' -r -N 8 v && read -r -d '' -N 1 k; do
    character_frequency[$k]="$v"
done < <(
    grep --only-matching --null-data . <<<$'a¹bc✓ ✓\n\t\t\u263A☺ ☺ aabbcabc' |
    head --bytes -2 | # trim the newline added by grep
    sort --zero-terminated | # sort null delimited list
    uniq --count --zero-terminated # count occurrences of char (null delim)
)
for k in "${!character_frequency[@]}"; do
    printf '%q = %d\n' "$k" "${character_frequency[$k]}"
done
Output:
$'\n' = 1
$'\t' = 2
☺ = 3
\ = 7
✓ = 2
¹ = 1
c = 3
b = 4
a = 4
