GNU parallel with custom script doing string comparison - bash

The following script.sh compares part of a string (coming from stdin by cat-ing a CSV file) to a defined string and reports the differences in a certain format:
#!/usr/bin/env bash
reference="ABCDEFG"
# Put each character of the reference on its own line
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
while read -r line; do
    line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
    # Pair the characters up, keep the mismatching positions, and reformat them as e.g. C3M
    output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
    echo "${line:0:35}, $output"
done < "${1:-/dev/stdin}"
It is intended to be executed on a number of rows from a very large file in the format
XYZ,ABMDEFG
and it works well when I use it in a pipe:
cat large_file | ./find_something.sh
However, when I try to use it with parallel, I get this error:
$ cat large_file | parallel ./find_something.sh
./find_something.sh: line 9: XYZ, ABMDEFG : No such file or directory
What is causing this? Is parallel supposed to work for something like this, if I want to redirect the output to a single file afterwards?
Less important side note: I'm rather proud of my string comparison method, but if someone has a faster way to get from comparing ABCDEFG and XYZ,ABMDEFG to the output XYZ,C3M, I'd be happy to hear that, too.
Edit:
I should have said, I also want to preserve the order of each line in the output, corresponding to the input. Is that possible using parallel?

Your script accepts its input from a file (defaulting to stdin), whereas parallel will pass input as arguments, not via stdin. In that sense, parallel is closer to xargs.
Presumably, you want each of the lines in large_file to be processed as a unit, possibly in parallel.
That means you need your script to only process one such line at a time, and let parallel call your script many times, once for each line.
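You can see the difference with a toy command; {} is parallel's placeholder for the current input line (a minimal illustration; without -k the output order may vary):
$ printf 'one\ntwo\n' | parallel echo got argument: {}
got argument: one
got argument: two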
So your script should look like this:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
# The line to process now arrives as the first argument, not on stdin
line="$1"
line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
echo "${line:0:35}, $output"
Then you can redirect to a file as follows:
cat large_file | parallel ./find_something.sh > output_file
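Regarding the edit: pass -k (--keep-order) as well and parallel will emit the results in the same order as the input lines, even though they are processed concurrently:
cat large_file | parallel -k ./find_something.sh > output_file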

-k keeps the order.
#!/usr/bin/env bash
doit() {
    reference="ABCDEFG"
    ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
    while read -r line; do
        line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
        output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
        echo "${line:0:35}, $output"
    done
}
export -f doit
cat large_file | parallel --pipe -k doit
# or
parallel --pipepart -a large_file --block -10 -k doit
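For reference: --pipe chops stdin into blocks and feeds each block to a doit instance on its stdin, while --pipepart does the same reading directly from the file, which is considerably faster; --block -10 asks for roughly 10 blocks per job slot rather than a fixed block size.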

Related

How do I delete lines from my bash history matching a specific pattern?

I can get a list of the line numbers matching a specific pattern, such as those containing the word "function":
history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g'
If I do history -d on that, it says "bad pattern". I don't know if that's because it's a list, or because the values are strings rather than numbers?
history -d (history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
Quick answer:
while read n; do history -d $n; done < <(history | tac | awk '/function/{print $1}')
Explanation:
The history command accepts only a single offset when using the -d flag. On top of that, when you delete an entry, all the entries after it are renumbered. For this reason we reverse the output of history using tac and process the lines from last to first. The short awk line simply replaces the grep and sed commands to pick out the history offset.
We do not use a full pipeline, as that creates subshells in which history -d $n would not take effect. This is nicely explained in: Why can't I delete multiple entries from bash history with this loop
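A minimal interactive illustration of that subshell issue (the counter name is arbitrary):
n=0
history | while read -r _; do n=$((n+1)); done
echo "$n"    # still 0: the loop ran in a subshell
while read -r _; do n=$((n+1)); done < <(history)
echo "$n"    # the real count: process substitution keeps the loop in the current shell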
Note: If you want to push this to your history file ($HISTFILE), you have to use history -w
Warning: When you have multiline commands in your history the story becomes very complicated and strongly depends on various options that have been set. See [U&L] When is a multiline history entry (aka lithist) in bash possible? for the nasty bits.
You can delete one history entry or a range of entries, but not a list. Your matches are likely to be spread out, so the range option is out.
The multiple sed commands to extract the history offsets can be simplified into one:
sed -E 's/^ *([0-9]*).*$/\1/'
One problem with history is that it can have multiline entries, like:
741 source <(history | \
grep function | \
sed -E 's/^ *([0-9]*).*$/\1/' | \
sort -rn | \
xargs -n1 echo history -d)
If your grep matches on function above, your sed will not be able to extract the history offset number, so we need to make that possible. One way is to remove all newlines and re-add them only in front of lines that contain a history offset. This is one way to do it (it can probably be done more simply):
awk '/^ {0,4}[0-9]+/ {
    printf("\n%s",$0);   # offset line: start a new output line
}
!/^ {0,4}[0-9]+/{
    printf(" %s",$0);    # continuation line: join it onto the previous one
}
END{
    printf("\n")
}'
We can then produce a number of history -d commands with xargs. xargs can't run the built-in history directly, so I've just used it to produce input for the built-in source using process substitution:
source <(history | \
    awk '/^ {0,4}[0-9]+/ {
        printf("\n%s",$0);
    }
    !/^ {0,4}[0-9]+/{
        printf(" %s",$0);
    }
    END{
        printf("\n")
    }' | \
    grep function | \
    sed -E 's/^ *([0-9]*).*$/\1/' | \
    sort -rn | \
    xargs -n1 echo history -d)
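Note the sort -rn: as explained above, each deletion renumbers the entries after it, so the offsets must be deleted from the highest to the lowest for the remaining ones to stay valid.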
@kvantour gives nice alternatives to the grep + sed + sort -rn. Using those, my blob above could be simplified to:
source <(history | \
    awk '/^ {0,4}[0-9]+/ {
        printf("\n%s",$0);
    }
    !/^ {0,4}[0-9]+/{
        printf(" %s",$0);
    }
    END{
        printf("\n")
    }' | \
    awk '/function/ {print "history -d",$1}' | \
    tac)
You need to store the offset in a variable and then pass it to history -d.
$ history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g'
1077
$ var=$( history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g')
$ history -d $var
However, as there can be many occurrences of the pattern, I would use a loop. Note the tac at the end: each deletion renumbers the entries after it, so we delete from the highest offset downwards.
$ var=$( history | grep function | sed -e 's/^\(.\{5\}\).*/\1/' | sed 's/^ *//g' | tac)
$ for i in $var
> do
>     history -d $i
> done
$ history -w
If the line you want to delete has already been written to your $HISTFILE (which typically happens when you end a session by default), you will need to write back to $HISTFILE, or the line will reappear when you open a new session.
After the deletion, you need to reload your .bashrc by executing
$ cd
$ source .bashrc
However, there are cases in which the lines won't be deleted: if you have set PROMPT_COMMAND to history -a, each command is written to the history file immediately, rather than on exit as under the normal configuration.
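For reference, that configuration is typically a line like this in .bashrc (a common setup, not necessarily yours):
# append each command to $HISTFILE as soon as the next prompt is drawn
PROMPT_COMMAND='history -a'
With that in place the entry is already on disk by the time you delete it from the in-memory list, so follow up with history -w to overwrite $HISTFILE.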

Grep Spellchecker

I am trying to write a simple shell script that takes a text file as input and checks all non-punctuated words against a dictionary (english.txt). It should return all non-matching (misspelled) words. I am using grep but it does not seem to successfully match all the lines in english.txt. I have included my code below.
#!/bin/bash
cat "$1" |
tr ' \t' '\n\n' |       # put each whitespace-separated word on its own line
sed -e "/'/d" |         # drop words containing apostrophes
tr -d '[:punct:]' |     # strip punctuation
tr -cd '[:alpha:]\n' |  # keep only letters and newlines
sed -e "/^$/d" |        # remove empty lines
grep -v -i -w -f english.txt
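No answer is recorded here, but one likely culprit, offered as a guess: with -f, grep treats each line of english.txt as a regular expression, so dictionary entries containing regex metacharacters (or stray carriage returns) will not match literally, and -w only requires the dictionary word to appear somewhere in the line as a word. Since the pipeline already yields one purely alphabetic word per line, matching literally against whole dictionary lines is both stricter and much faster; only the last stage needs to change:
grep -v -i -x -F -f english.txt   # -F: fixed strings, -x: match the whole line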

echo -e cat: argument line too long

I have a bash script that merges a huge list of text files and filters the result. However, I encounter an 'argument line too long' error due to the huge list.
echo -e "`cat $dir/*.txt`" | sed '/^$/d' | grep -v "\-\-\-" | sed '/</d' | tr -d \' | tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' | sort -us -o $output
I have seen some similar answers here and I know I could fix this by using find and cat-ing the files first. However, I would like to know the best way to run such a one-liner using echo -e and cat without breaking the code and without hitting the 'argument line too long' error. Thanks.
First, with respect to the most immediate problem: Using find ... -exec cat -- {} + or find ... -print0 | xargs -0 cat -- will prevent more arguments from being put on the command line to cat than it can handle.
The more portable (POSIX-specified) alternative to echo -e is printf '%b\n'; this is available even in configurations of bash where echo -e prints -e on output (as when the xpg_echo and posix flags are set).
However, if you use read without the -r argument, the backslashes in your input string are removed, so neither echo -e nor printf %b will be able to process them later.
Fixing this can look like:
while IFS= read -r line; do
printf '%b\n' "$line"
done \
< <(find "$dir" -name '*.txt' -exec cat -- '{}' +) \
| sed [...]
That said, you do not need the echo -e "`cat ...`" round-trip at all; the filters can read the files directly (note the added -h, which stops grep from prefixing each line with the file name when given several files):
grep -hv '^$' "$dir"/*.txt | grep -v "\-\-\-" | sed '/</d' | tr -d \' \
    | tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' \
    | sort -us -o "$output"
If you think about it some more, you can probably get rid of a lot more stuff and turn it into a single sed plus a sort, roughly (the character deletion becomes one bracket expression; a comma is used as the s-command delimiter to avoid clashing with the / being deleted):
sed -e '/^$/d' -e '/---/d' -e '/</d' -e "s,['\\\\/<>(){}!?~;.:+\`*_ͱ-],,g" \
    -e 's/  */ /g' -e 's/^[ \t]*//' "$dir"/*.txt | sort -us -o "$output"

Bash: set a shell variable in the middle of the pipeline

I have text coming from some command (in this example, echo -e "10 ABC \n5 DEF \n87 GHI"). The text goes through a pipeline and I get the wanted output (in this example, GHI). The wanted output is then sent to the following pipeline step (in this example, | xargs -I {} grep -w {} FILES |).
My question is:
I want to capture this intermediate output in a variable before it is sent to the following step. How can I do this?
Example:
echo -e "10 ABC \n5 DEF \n87 GHI" |
sort -nr -k1 |
head -n1 |
cut -f 2 | # Wanted output comes here. I want to append it to a variable before it goes to `grep`
xargs -I {} grep -w {} FILES |
# FOLLOWING ANALYSIS
You can't set a shell variable in the middle of the pipeline, but you can send the output to a file using the tee command, and then read that file later.
echo -e "10 ABC \n5 DEF \n87 GHI" |
sort -nr -k1 |
head -n1 |
cut -f 2 |
tee intermediate.txt |
xargs -I {} grep -w {} FILES |
# FOLLOWING ANALYSIS
# intermediate.txt now contains 87 GHI
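To pick the value up once the pipeline has finished, read the file back (myvar is just an illustrative name; $(<file) is bash shorthand for $(cat file)):
myvar=$(< intermediate.txt)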
How about something like this? Bear in mind that MYVAR only exists inside the while loop, since each stage of a pipeline runs in a subshell:
echo -e "10 ABC \n5 DEF \n87 GHI" | sort -nr -k1 | head -n1 | cut -f 2 | while read MYVAR; do echo "intermediate value: $MYVAR"; echo $MYVAR | xargs -I {} grep -w {} FILES; done
Insert it into the stream. I take it you're just looking to add the contents of a variable to every line from the stream? This prepends the contents of $example,
i.e.
example="A String"
echo -e "10 ABC \n5 DEF \n87 GHI" |
sort -nr -k1 |
head -n1 |
cut -f 2 |
sed "s/^/$example/" |
xargs -I {} grep -w {} FILES |
# FOLLOWING ANALYSIS
Use sed "s/$/$example/" to append instead. Note the double quotes in both cases: they are needed because $example contains a space.
NB: I tend to do a lot of things this way in bash, but a long pipeline of cuts, seds, heads, etc. does suggest it may be time to break out awk or perl.
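For instance, the sort | head | cut | sed chain above could collapse into a single awk call (a sketch, assuming the first field is always a positive number):
example="A String"
echo -e "10 ABC \n5 DEF \n87 GHI" |
awk -v prefix="$example" '$1 > max { max = $1; val = $2 }  # remember the row with the largest key
                          END { print prefix val }' |
xargs -I {} grep -w {} FILES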

Use each line of piped output as parameter for script

I have an application (myapp) that gives me a multiline output
result:
abc|myparam1|def
ghi|myparam2|jkl
mno|myparam3|pqr
stu|myparam4|vwx
With grep and sed I can get my parameters as below
myapp | grep '|' | sed -e 's/^[^|]*//' | sed -e 's/|.*//'
But I then want these myparamx values passed as parameters to a script, executed once per value:
myscript.sh myparam1
myscript.sh myparam2
etc.
Any help greatly appreciated
Please see xargs. For example:
myapp | grep '|' | sed -e 's/^[^|]*//' | sed -e 's/|.*//' | xargs -n 1 myscript.sh
Maybe this can help:
myapp | awk -F"|" '{ print $2 }' | while read -r line; do /path/to/script "$line"; done
I like the xargs -n 1 solution from Dark Falcon, and while read is the classical tool for this kind of thing, but just for completeness:
myapp | awk -F'|' '{print "myscript.sh", $2}' | bash
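One caveat with piping generated commands into bash: the extracted field is interpolated unquoted into a command line, so this is only safe when the values contain no shell metacharacters; otherwise prefer the xargs or while read variants.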
As a side note, speaking of extracting the 2nd field, you could also use cut:
myapp | cut -d'|' -f 2 # -f 2 => second field; cut numbers fields starting from 1
