Bash grep -P with a list of regexes from a file

Problem: hundreds of thousands of files in hundreds of directories must be tested against a number of PCRE regexes, to count and categorize the files and to determine which of the regexes are more viable and inclusive.
My approach for a single-regexp test:
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s' |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt
find | xargs lets me sidestep the "too many arguments" error for grep
grep -Pazo is the one doing the heavy lifting: -P is for PCRE regexes, -a is to make sure files are read as text, and -z -o are simply because it doesn't work otherwise with the file base I have
tr -d '\000' is to make sure the output is not binary
fgrep -a is to get only the line with the filename
sed is to counteract grep's awesome habit of appending trailing lines to each other (basically it removes everything in a line before the file path)
cut -d: -f1 keeps only the file path
wc -l counts the size of the matched file list
The result is a file with 10k+ lines like this: unsorted/./2020.03.02/68091ec4-cf04-4843-a4b2-95420756cd53, which is what I want in the end.
Obviously this is not very good, but it works fine for something made out of sticks and dirt. My main objective here is to test concepts and regexes, not to count for further scaling or anything, really.
So, since grep -P does not support the -f parameter, I tried using a while read loop:
(while read regexline ; do
    echo "$regexline" ;
    find unsorted_test/. -type f -print0 |
        xargs -0 grep -Pazo "$regexline" |
        tr -d '\000' |
        fgrep -a unsorted_test |
        sed 's/^.*unsorted/unsorted/' |
        cut -d: -f1 > matched_files_unsorted_test000.txt ;
    wc -l matched_files_unsorted_test000.txt |
        sed 's/^ *//' ;
done) < regex_1.txt
And as you can imagine, it fails spectacularly: zero matches for everything.
I've experimented with the quote marks in the grep, with the loop type, etc. Nothing.
Any help with the current code or suggestions on how to do this otherwise are very appreciated.
Thank you.
P.S. Yes, I've tried pcregrep, but it returns zero matches even on a single pattern. Dunno why.

You could do this, which will be impossibly slow:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r regexline; do
        grep -Pazo "$regexline" "$file"
    done < regex_1.txt
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or for each line:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r line; do
        while IFS= read -r regexline; do
            if grep -Pazo "$regexline" <<<"$line"; then
                break
            fi
        done < regex_1.txt
    done < "$file"
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or maybe with xargs.
But I believe the best approach is to just join the regular expressions from the file with |:
find unsorted_test/. -type f -print0 |
{
    regex=$(< regex_1.txt paste -sd '|')
    # or maybe with each pattern wrapped in parentheses:
    # regex=$(< regex_1.txt sed 's/.*/(&)/' | paste -sd '|')
    xargs -0 grep -Pazo "$regex"
} |
....
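To see what the joining step produces, here is a quick demo with a hypothetical three-pattern file (regex_1_demo.txt and the patterns are made up):
printf '%s\n' 'foo.*' 'bar[0-9]+' 'baz' > regex_1_demo.txt   # hypothetical patterns
paste -sd '|' regex_1_demo.txt                               # prints: foo.*|bar[0-9]+|baz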
Notes:
To read lines from a file use IFS= read -r line. The -d '' option to read is bash syntax.
A newline right after a pipe continues the command on the next line, and lines containing only spaces, tabs, or comments in between are ignored, so you can just put your commands on separate lines.
Use grep -F instead of deprecated fgrep.
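If you also want per-regex counts (to judge which regexes are "more viable"), here is a minimal sketch, untested against your data: it assumes one PCRE per line in regex_1.txt with no CRLF line endings (CRLF would silently break every pattern), and it uses grep -l to list matching file names directly instead of the tr/fgrep/sed/cut post-processing:
while IFS= read -r regexline; do
    [ -n "$regexline" ] || continue                  # skip blank lines
    count=$(find unsorted_test/. -type f -print0 |
        xargs -0 grep -laPz --null -- "$regexline" | # -l lists matching file names, NUL-terminated
        tr -s '\000' '\n' | wc -l)
    printf '%s\t%s\n' "$count" "$regexline"
done < regex_1.txt | sort -rn                        # most inclusive regexes first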

Related

How to use bash to get unique dates from list of file names

I have a large number of file names. I need to create a bash script that gets all of the unique dates from the file names.
Example:
input:
opencomposition_dxxx_20201123.csv.gz
opencomposition_dxxv_20201123.csv.gz
opencomposition_dxxu_20201123.csv.gz
opencomposition_sxxv_20201123.csv.gz
opencomposition_sxxe_20211223.csv.gz
opencomposition_sxxe_20211224.csv.gz
opencomposition_sxxe_20211227.csv.gz
opencomposition_sxxesgp_20230106.csv.gz
output:
20201123 20211223 20211224 20211227 20230106
Code:
for asof_dt in `find -H ./ -maxdepth 1 -nowarn -type f -name *open*.gz \
        | sort -r | cut -f3 -d "_" | cut -f1 -d"." | uniq`; do
    echo $asof_dt
done
Error:
line 20: /bin/find: Argument list too long
Like this (GNU grep). Note that you need to quote the glob ('*open*.gz'); if you don't, the shell tries to expand the wildcard itself, which is exactly what produces the Argument list too long error:
find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' |
grep -oP '_\K\d{8}(?=\.csv)' |
sort -u
Output
20201123
20211223
20211224
20211227
20230106
The regular expression matches as follows:
Node      Explanation
_         matches the literal _
\K        resets the start of the match (what is Kept), a shorter alternative to using a
          look-behind assertion (see perlmonks "look arounds" and "Support of \K in regex")
\d{8}     digits (0-9), 8 times
(?=       look ahead to see if there is:
  \.        a literal .
  csv       the literal string 'csv'
)         end of look-ahead
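A quick sanity check of the \K and look-ahead behaviour against one of the sample names:
echo 'opencomposition_sxxe_20211223.csv.gz' | grep -oP '_\K\d{8}(?=\.csv)'
# prints: 20211223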
Using tr:
find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' | tr -d 'a-z_.' | sort -u
If filenames don't contain newline characters, a quick-and-dirty method, similar to your attempt, might be
printf '%s\n' open*.gz | cut -d_ -f3 | cut -d. -f1 | sort -u
Note that printf is a bash builtin command, and Argument list too long does not apply to bash builtins, only to external commands that have to be exec'd.
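You can confirm that printf is a builtin with type:
type printf    # printf is a shell builtin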

Bash loop produces empty files [duplicate]

This question already has answers here:
How can I use a file in a command and redirect output to the same file without truncating it?
(14 answers)
Closed 4 years ago.
I'm trying to run the following loop but unfortunately it creates empty files.
 for f in *.tex; do cut -d "&" -s -f1,2,4 $f | sed "s/$/\\\\\\\\/g" | sed "s/Reg. year/\$year/g" | sed "s/=\([0-9]\{4\}\)/^\{\1\}\$/g" | sed "/Counterfactual/d" | sed "/Delta/d" | sed "/{2014}/d" | sed "/^\s*&\s*\&/d" > $f; done;
When I run the command on a single file (replacing $f with the file name), it works well.
You are redirecting into the same file the pipeline reads from: > $f truncates $f before cut gets a chance to read it, which is why every file comes out empty (that is what the duplicate linked above covers). You can put all the commands after do in curly brackets:
{ cut ... | sed ... | ... ; }
Or you can use xargs:
find ./ -type f -name "*.tex" -print0 | xargs -0 cut -d "&" -s -f1,2,4 | sed ... | ...
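A minimal sketch of the usual fix for the truncation, writing to a temporary file and only replacing the original afterwards (it reuses the exact command chain from the question):
for f in *.tex; do
    tmp=$(mktemp) || exit 1
    cut -d "&" -s -f1,2,4 "$f" \
        | sed "s/$/\\\\\\\\/g" \
        | sed "s/Reg. year/\$year/g" \
        | sed "s/=\([0-9]\{4\}\)/^\{\1\}\$/g" \
        | sed "/Counterfactual/d" | sed "/Delta/d" | sed "/{2014}/d" | sed "/^\s*&\s*\&/d" > "$tmp" \
        && mv -- "$tmp" "$f"   # replace the original only after the pipeline has finished
done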

echo -e cat: argument line too long

I have a bash script that merges a huge list of text files and filters them. However, I run into an 'argument list too long' error due to the huge list.
echo -e "`cat $dir/*.txt`" | sed '/^$/d' | grep -v "\-\-\-" | sed '/</d' | tr -d \' | tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' | sort -us -o $output
I have seen some similar answers here, and I know I could fix it by using find to cat the files first. However, I would like to know the best way to keep this as a one-liner using echo -e and cat, without breaking the code and while avoiding the argument list too long error. Thanks.
First, with respect to the most immediate problem: Using find ... -exec cat -- {} + or find ... -print0 | xargs -0 cat -- will prevent more arguments from being put on the command line to cat than it can handle.
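Concretely, assuming the .txt files sit directly in $dir (the -maxdepth 1 below is an assumption about your layout), the cat stage becomes:
find "$dir" -maxdepth 1 -type f -name '*.txt' -exec cat -- '{}' +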
The more portable (POSIX-specified) alternative to echo -e is printf '%b\n'; this is available even in configurations of bash where echo -e prints -e on output (as when the xpg_echo and posix flags are set).
However, if you use read without the -r argument, the backslashes in your input string are removed, so neither echo -e nor printf %b will be able to process them later.
Fixing this can look like:
while IFS= read -r line; do
printf '%b\n' "$line"
done \
< <(find "$dir" -name '*.txt' -exec cat -- '{}' +) \
| sed [...]
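For reference, the %b format is what makes printf expand backslash escapes the way echo -e would:
printf '%b\n' 'col1\tcol2'   # prints col1 and col2 separated by a real tab
echo -e 'col1\tcol2'         # same output, where echo -e is supported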
That said, you don't need echo -e or cat here at all; the filtering tools can read the files themselves. (Note that a glob like $dir/*.txt is subject to the same argument-list limit, so for a truly huge directory feed the list through find as above.)
grep -v '^$' $dir/*.txt | grep -v "\-\-\-" | sed '/</d' | tr -d \' \
| tr -d '\\\/<>(){}!?~;.:+`*-_ͱ' | tr -s ' ' | sed 's/^[ \t]*//' \
| sort -us -o $output
If you think about it some more, you can probably get rid of a lot more stuff and turn it into a single sed and sort, roughly:
sed -e '/^$/d' -e '/---/d' -e '/</d' \
    -e "s%['\\\\/<>(){}!?~;.:+\`*_ͱ-]%%g" \
    -e 's/  */ /g' -e 's/^[ \t]*//' "$dir"/*.txt | sort -us -o "$output"

How to get "wc -l" to print just the number of lines without file name?

wc -l file.txt
outputs number of lines and file name.
I need just the number itself (not the file name).
I can do this
wc -l file.txt | awk '{print $1}'
But maybe there is a better way?
Try this way:
wc -l < file.txt
cat file.txt | wc -l
According to the man page (for the BSD version, I don't have a GNU version to check):
If no files are specified, the standard input is used and no file name is displayed. The prompt will accept input until receiving EOF, or [^D] in most environments.
To do this without the leading space, why not:
wc -l < file.txt | bc
Comparison of Techniques
I had a similar issue attempting to get a character count without the leading whitespace provided by wc, which led me to this page. After trying out the answers here, the following are the results from my personal testing on Mac (BSD Bash). Again, this is for character count; for line count you'd do wc -l. echo -n omits the trailing line break.
FOO="bar"
echo -n "$FOO" | wc -c # " 3" (x)
echo -n "$FOO" | wc -c | bc # "3" (√)
echo -n "$FOO" | wc -c | tr -d ' ' # "3" (√)
echo -n "$FOO" | wc -c | awk '{print $1}' # "3" (√)
echo -n "$FOO" | wc -c | cut -d ' ' -f1 # "" for -f < 8 (x)
echo -n "$FOO" | wc -c | cut -d ' ' -f8 # "3" (√)
echo -n "$FOO" | wc -c | perl -pe 's/^\s+//' # "3" (√)
echo -n "$FOO" | wc -c | grep -ch '^' # "1" (x)
echo $( printf '%s' "$FOO" | wc -c ) # "3" (√)
I wouldn't rely on the cut -f* method in general since it requires that you know the exact number of leading spaces that any given output may have. And the grep one works for counting lines, but not characters.
bc is the most concise, and awk and perl seem a bit overkill, but they should all be relatively fast and portable enough.
Also note that some of these can be adapted to trim surrounding whitespace from general strings, as well (along with echo `echo $FOO`, another neat trick).
How about
wc -l file.txt | cut -d' ' -f1
i.e. pipe the output of wc into cut (where the delimiter is a space, picking just the first field)
How about
grep -ch "^" file.txt
Obviously, there are a lot of solutions to this.
Here is another one though:
wc -l somefile | tr -d "[:alpha:][:blank:][:punct:]"
This outputs only the number of lines, but the trailing newline character (\n) is still present; if you don't want that either, replace [:blank:] with [:space:].
Another way to strip the leading spaces without invoking an external command is to use arithmetic expansion $((exp)):
echo $(($(wc -l < file.txt)))
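The same trick is handy for capturing the count in a variable with no external trimming command:
lines=$(( $(wc -l < file.txt) ))   # the arithmetic context strips the surrounding whitespace
echo "$lines"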
Another approach: first find all the files in the directory, then count them with awk's NR (Number of Records) variable. Note that this counts files, not the lines inside them:
find <directory path> -type f | awk 'END{print NR}'
Example: find /tmp/ -type f | awk 'END{print NR}'
This works for me, using the normal wc -l and sed to strip any character that is not a number:
wc -l big_file.log | sed -E "s/([a-z\-\_\.]|[[:space:]]*)//g"
# 9249133
