echo -e cat: argument line too long - bash

I have bash script that would merge a huge list of text files and filter it. However I'll encounter 'argument line too long' error due to the huge list.
echo -e "`cat $dir/*.txt`" | sed '/^$/d' | grep -v "\-\-\-" | sed '/</d' | tr -d \' | tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' | sort -us -o $output
I have seen some similar answers here and i know i could rectify it using find and cat the files 1st. However, i would i like to know what is the best way to run a one liner code using echo -e and cat without breaking the code and to avoid the argument line too long error. Thanks.

First, with respect to the most immediate problem: Using find ... -exec cat -- {} + or find ... -print0 | xargs -0 cat -- will prevent more arguments from being put on the command line to cat than it can handle.
The more portable (POSIX-specified) alternative to echo -e is printf '%b\n'; this is available even in configurations of bash where echo -e prints -e on output (as when the xpg_echo and posix flags are set).
However, if you use read without the -r argument, the backslashes in your input string are removed, so neither echo -e nor printf %b will be able to process them later.
Fixing this can look like:
while IFS= read -r line; do
printf '%b\n' "$line"
done \
< <(find "$dir" -name '*.txt' -exec cat -- '{}' +) \
| sed [...]

grep -v '^$' $dir/*.txt | grep -v "\-\-\-" | sed '/</d' | tr -d \' \
| tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' \
| sort -us -o $output
If you think about it some more you can probably get rid of a lot more stuff and turn it into a single sed and sort, roughly:
sed -e '/^$/d' -e '/\-\-\-/d' -e '/</d' -e 's/\'\\\/<>(){}!?~;.:+`*-_ͱ//g' \
-e 's/ / /g' -e 's/^[ \t]*//' $dir/*.txt | sort -us -o $output

Related

Bash grep -P with a list of regexes from a file

Problem: hundreds of thousands of files in hundreds of directories must be tested against a number of PCRE regexp to count and categorize files and to determine which of regex are more viable and inclusive.
My approach for a single regexp test:
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s' |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt
find | xargs allows to sidestep "the too many arguments" error for grep
grep -Pazo is the one doing the heavy lifing -P is for PCRE regex -a is to make sure files are read as text and -z -o are simply because it doesn't work otherwise with the filebase I have
tr -d '\000' is to make sure the output is not binary
fgrep -a is to get only the line with the filename
sed is to counteract sure the grep's awesome habit of appending trailing lines to each other (basically removes everything in a line before the filepath)
cut -d: -f1 cuts off the filepath only
wc -l counts the result size of the matched filelist
Result is a file with 10k+ lines like these: unsorted/./2020.03.02/68091ec4-cf04-4843-a4b2-95420756cd53 which is what I want in the end.
Obviously this is not very good, but this works fine for something made out of sticks and dirt. My main objective here is to test concepts and regex, not count for further scaling or anything, really.
So, since grep -P does not support -f parameter, I tried using the while read loop:
(while read regexline ;
do echo "$regexline" ;
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo "$regexline" |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt |
sed 's/^ *//' ;
done) < regex_1.txt
And as you can imagine - it fails spectacularly: zero matches for everything.
I've experimented with the quotemarks in the grep, with the loop type etc. Nothing.
Any help with the current code or suggestions on how to do this otherwise are very appreciated.
Thank you.
P.S. Yes, I've tried pcregrep, but it returns zero matches even on a single pattern. Dunno why.
You could do this which will be impossible slow:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
while IFS= read -r regexline; do
grep -Pazo "$regexline" "$file"
done < regex_1.txt
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or for each line:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
while IFS= read -r line; do
while IFS= read -r regexline; do
if grep -Pazo "$regexline" <<<"$line"; then
break
fi
done < regex_1.txt
done |
tr -d '\000' | fgrep -a unsorted_test... blablabl
Or maybe with xargs.
But I believe just join the regular expressions from the file with |:
find unsorted_test/. -type f -print0 |
{
regex=$(< regex_1.txt paste -sd '|')
# or maybe with braces
# regex=$(< regex_1.txt sed 's/.*/(&)/' | paste -sd '|')
xargs -0 grep -Pazo "$regex"
} |
....
Notes:
To read lines from file use IFS= read -r line. The -d '' option to read is bash syntax.
Lines with spaces, tabs and comments only after pipe are ignored. You can just put your commands on separate lines.
Use grep -F instead of deprecated fgrep.

why shell for expression cannot parse xargs parameter correctly

I have a black list to save tag id list, e.g. 1-3,7-9, actually it represents 1,2,3,7,8,9. And could expand it by below shell
for i in {1..3,7..9}; do for j in {$i}; do echo -n "$j,"; done; done
1,2,3,7,8,9
but first I should convert - to ..
echo -n "1-3,7-9" | sed 's/-/../g'
1..3,7..9
then put it into for expression as a parameter
echo -n "1-3,7-9" | sed 's/-/../g' | xargs -I # for i in {#}; do for j in {$i}; do echo -n "$j,"; done; done
zsh: parse error near `do'
echo -n "1-3,7-9" | sed 's/-/../g' | xargs -I # echo #
1..3,7..9
but for expression cannot parse it correctly, why is so?
Because you didn't do anything to stop the outermost shell from picking up the special keywords and characters ( do, for, $, etc ) that you mean to be run by xargs.
xargs isn't a shell built-in; it gets the command line you want it to run for each element on stdin, from its arguments. just like any other program, if you want ; or any other sequence special to be bash in an argument, you need to somehow escape it.
It seems like what you really want here, in my mind, is to invoke in a subshell a command ( your nested for loops ) for each input element.
I've come up with this; it seems to to the job:
echo -n "1-3,7-9" \
| sed 's/-/../g' \
| xargs -I # \
bash -c "for i in {#}; do for j in {\$i}; do echo -n \"\$j,\"; done; done;"
which gives:
{1..3},{7..9},
Could use below shell to achieve this
# Mac newline need special treatment
echo "1-3,7-9" | sed -e 's/-/../g' -e $'s/,/\\\n/g' | xargs -I# echo 'for i in {#}; do echo -n "$i,"; done' | bash
1,2,3,7,8,9,%
#Linux
echo "1-3,7-9" | sed -e 's/-/../g' -e 's/,/\n/g' | xargs -I# echo 'for i in {#}; do echo -n "$i,"; done' | bash
1,2,3,7,8,9,
but use this way is a little complicated maybe awk is more intuitive
# awk
echo "1-3,7-9,11,13-17" | awk '{n=split($0,a,","); for(i=1;i<=n;i++){m=split(a[i],a2,"-");for(j=a2[1];j<=a2[m];j++){print j}}}' | tr '\n' ','
1,2,3,7,8,9,11,13,14,15,16,17,%
echo -n "1-3,7-9" | perl -ne 's/-/../g;$,=",";print eval $_'

Sed error "command a expects \ followed by text"

Here is my script:
openscad $1 -D generate=1 -o $1.csg 2>&1 >/dev/null |
sed 's/ECHO: \"\[LC\] //' |
sed 's/"$//' |
sed '$a;' >./2d_$1
That output:
sed: 1: "$a;": command a expects \ followed by text
Your version of sed is not GNU sed which allows what you use. You need to write:
openscad $1 -D generate=1 -o $1.csg 2>&1 >/dev/null |
sed 's/ECHO: \"\[LC\] //' |
sed 's/"$//' |
sed '$a\
;' >./2d_$1
Also, three copies of sed is a little excessive (to be polite); one suffices:
openscad $1 -D generate=1 -o $1.csg 2>&1 >/dev/null |
sed -e 's/ECHO: \"\[LC\] //' \
-e 's/"$//' \
-e '$a\' \
-e ';' >./2d_$1
or:
openscad $1 -D generate=1 -o $1.csg 2>&1 >/dev/null |
sed -e 's/ECHO: \"\[LC\] //' -e 's/"$//' -e '$a\' -e ';' >./2d_$1

Grep Spellchecker

I am trying to write a simple shell script that takes a text file as input and checks all non-punctuated words against a dictionary (english.txt). It should return all non-matching (misspelled) words. I am using grep but it does not seem to successfully match all the lines in english.txt. I have included my code below.
#!/bin/bash
cat $1 |
tr ' \t' '\n\n' |
sed -e "/'/d" |
tr -d '[:punct:]' |
tr -cd '[:alpha:]\n' |
sed -e "/^$/d" |
grep -v -i -w -f english.txt

Bash script builds correct $cmd but fails to execute complex stream

This short script scrapes some log files daily to create a simple extract. It works from the command line and when I echo the $cmd and copy/paste, it also works. But it will breaks when I try to execute from the script itself.
I know this is a nightmare of patterns that I could probably improve, but am I missing something simple to just execute this correctly?
#!/bin/bash
priorday=$(date --date yesterday +"%Y-%m-%d")
outputfile="/home/CCHCS/da14/$priorday""_PROD_message_processing_times.txt"
cmd="grep 'Processed inbound' /home/rules/care/logs/RootLog* | cut -f5,6,12,16,18 -d\" \" | grep '^"$priorday"' | sed 's/\,/\./' | sed 's/ /\t/g' | sed -r 's/([0-9]+\-[0-9]+\-[0-9]+)\t/\1 /' | sed 's/ / /g' | sort >$outputfile"
printf "command to execute:\n"
echo $cmd
printf "\n"
$cmd
ouput:
./make_log_extract.sh command to execute: grep 'Processed inbound' /home/rules/care/logs/RootLog.log /home/rules/care/logs/RootLog.log.1
/home/rules/care/logs/RootLog.log.10
/home/rules/care/logs/RootLog.log.11
/home/rules/care/logs/RootLog.log.12
/home/rules/care/logs/RootLog.log.2
/home/rules/care/logs/RootLog.log.3
/home/rules/care/logs/RootLog.log.4
/home/rules/care/logs/RootLog.log.5
/home/rules/care/logs/RootLog.log.6
/home/rules/care/logs/RootLog.log.7
/home/rules/care/logs/RootLog.log.8
/home/rules/care/logs/RootLog.log.9 | cut -f5,6,12,16,18 -d" " | grep
'^2014-01-30' | sed 's/\,/./' | sed 's/ /\t/g' | sed -r
's/([0-9]+-[0-9]+-[0-9]+)\t/\1 /' | sed 's/ / /g' | sort
/home/CCHCS/da14/2014-01-30_PROD_message_processing_times.txt
grep: 5,6,12,16,18: No such file or directory
As grebneke comments, do not store the command and then execute it.
What you can do is to execute it but firstly print it: Bash: Print each command before executing?
priorday=$(date --date yesterday +"%Y-%m-%d")
outputfile="/home/CCHCS/da14/$priorday""_PROD_message_processing_times.txt"
set -o xtrace # <-- set printing mode "on"
grep 'Processed inbound' /home/rules/care/logs/RootLog* | cut -f5,6,12,16,18 -d\" \" | grep '^"$priorday"' | sed 's/\,/\./' | sed 's/ /\t/g' | sed -r 's/([0-9]+\-[0-9]+\-[0-9]+)\t/\1 /' | sed 's/ / /g' | sort >$outputfile"
set +o xtrace # <-- revert to normal

Resources