How to use bash to get unique dates from a list of file names

I have a large number of file names. I need to create a bash script that gets all of the unique dates from the file names.
Example:
input:
opencomposition_dxxx_20201123.csv.gz
opencomposition_dxxv_20201123.csv.gz
opencomposition_dxxu_20201123.csv.gz
opencomposition_sxxv_20201123.csv.gz
opencomposition_sxxe_20211223.csv.gz
opencomposition_sxxe_20211224.csv.gz
opencomposition_sxxe_20211227.csv.gz
opencomposition_sxxesgp_20230106.csv.gz
output:
20201123 20211223 20211224 20211227 20230106
Code:
for asof_dt in `find -H ./ -maxdepth 1 -nowarn -type f -name *open*.gz
| sort -r | cut -f3 -d "_" | cut -f1 -d"." | uniq`; do
echo $asof_dt
done
Error:
line 20: /bin/find: Argument list too long

Like this (GNU grep):
You need to quote the glob: '*open*.gz'. Otherwise the shell tries to expand the wildcard * itself before find runs, and with this many matching files the expanded command line exceeds the argument-length limit, which is what produces the error.
find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' |
grep -oP '_\K\d{8}(?=\.csv)' |
sort -u
Output
20201123
20211223
20211224
20211227
20230106
The regular expression matches as follows:

Node      Explanation
_         a literal _
\K        resets the start of the match (what is Kept), a shorter alternative to using a look-behind assertion (see perlmonks "look arounds" and "Support of \K in regex")
\d{8}     digits (0-9), 8 times
(?=       look ahead to see if there is:
\.        a literal .
csv       'csv'
)         end of look-ahead

Using tr:
find -H ./ -maxdepth 1 -nowarn -type f -name '*open*.gz' | tr -d 'a-z_./' | sort -u
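To see what tr is doing, feed it a single path the way find prints it (the leading ./ is why / is included in the delete set):
printf '%s\n' ./opencomposition_dxxx_20201123.csv.gz | tr -d 'a-z_./'
# prints 20201123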

If filenames don't contain newline characters, a quick-and-dirty method, similar to your attempt, might be
printf '%s\n' open*.gz | cut -d_ -f3 | cut -d. -f1 | sort -u
Note that printf is a bash builtin command, and the "argument list too long" limit does not apply to bash builtins.
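A pure-bash sketch of the same idea, collecting the dates in an associative array instead of piping through cut and sort (assumes bash 4+ for declare -A, that at least one file matches the glob, and that the names follow the prefix_tag_YYYYMMDD.csv.gz pattern shown above):
declare -A seen
for f in open*.gz; do
    d=${f##*_}    # drop everything up to the last underscore -> 20201123.csv.gz
    d=${d%%.*}    # drop the extensions                       -> 20201123
    seen[$d]=1
done
printf '%s\n' "${!seen[@]}" | sort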

Related

Bash grep -P with a list of regexes from a file

Problem: hundreds of thousands of files in hundreds of directories must be tested against a number of PCRE regexes to count and categorize the files and to determine which of the regexes are more viable and inclusive.
My approach for a single regexp test:
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo '(?P<message>User activity exceeds.*?\:\s+(?P<user>.*?))\s' |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt
find | xargs makes it possible to sidestep the "too many arguments" error for grep
grep -Pazo is the one doing the heavy lifting: -P is for PCRE regexes, -a is to make sure files are read as text, and -z -o are there simply because it doesn't work otherwise with the filebase I have
tr -d '\000' is to make sure the output is not binary
fgrep -a is to get only the line with the filename
sed is to counteract grep's habit of appending trailing lines to each other (basically it removes everything in a line before the filepath)
cut -d: -f1 keeps only the filepath
wc -l counts the size of the matched file list
Result is a file with 10k+ lines like these: unsorted/./2020.03.02/68091ec4-cf04-4843-a4b2-95420756cd53 which is what I want in the end.
Obviously this is not very good, but this works fine for something made out of sticks and dirt. My main objective here is to test concepts and regex, not count for further scaling or anything, really.
So, since grep -P does not support -f parameter, I tried using the while read loop:
(while read regexline ;
do echo "$regexline" ;
find unsorted_test/. -type f -print0 |
xargs -0 grep -Pazo "$regexline" |
tr -d '\000' |
fgrep -a unsorted_test |
sed 's/^.*unsorted/unsorted/' |
cut -d: -f1 > matched_files_unsorted_test000.txt ;
wc -l matched_files_unsorted_test000.txt |
sed 's/^ *//' ;
done) < regex_1.txt
And as you can imagine - it fails spectacularly: zero matches for everything.
I've experimented with the quotemarks in the grep, with the loop type etc. Nothing.
Any help with the current code or suggestions on how to do this otherwise are very appreciated.
Thank you.
P.S. Yes, I've tried pcregrep, but it returns zero matches even on a single pattern. Dunno why.
You could do this, which will be impossibly slow:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r regexline; do
        grep -Pazo "$regexline" "$file"
    done < regex_1.txt
done |
tr -d '\000' | fgrep -a unsorted_test... blablabla
Or for each line:
find unsorted_test/. -type f -print0 |
while IFS= read -d '' -r file; do
    while IFS= read -r line; do
        while IFS= read -r regexline; do
            if grep -Pazo "$regexline" <<<"$line"; then
                break
            fi
        done < regex_1.txt
    done < "$file"
done |
tr -d '\000' | fgrep -a unsorted_test... blablabl
Or maybe with xargs.
But I believe the best approach is to just join the regular expressions from the file with |:
find unsorted_test/. -type f -print0 |
{
    regex=$(< regex_1.txt paste -sd '|')
    # or maybe with each alternative wrapped in parentheses
    # regex=$(< regex_1.txt sed 's/.*/(&)/' | paste -sd '|')
    xargs -0 grep -Pazo "$regex"
} |
....
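To illustrate what the joined pattern ends up looking like, here is a tiny demo with two made-up lines standing in for regex_1.txt (the file name and patterns are hypothetical):
printf '%s\n' 'User activity exceeds.*' 'quota reached \d+' > /tmp/regex_demo.txt
paste -sd '|' /tmp/regex_demo.txt
# -> User activity exceeds.*|quota reached \d+
sed 's/.*/(&)/' /tmp/regex_demo.txt | paste -sd '|'
# -> (User activity exceeds.*)|(quota reached \d+)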
Notes:
To read lines from a file use IFS= read -r line. The -d '' option to read is bash syntax.
After a pipe, lines containing only spaces, tabs, or comments are ignored, so you can just put each command of the pipeline on its own line.
Use grep -F instead of deprecated fgrep.

Displaying the result of two grep commands in bash

I am trying to find the number of files in a directory matching two different patterns in the filenames. I don't want a single combined count; I want to display both results together.
Command 1: find | grep ".coded" | wc -l (output: 4533)
Command 2: find | grep ".read" | wc -l (output: 654)
Output sought: 4533 | 654 on one line
Any suggestions? Thanks!
With the bash shell using process substitution and pr
pr -mts' | ' <(find | grep "\.coded" | wc -l) <(find | grep "\.read" | wc -l)
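A minimal illustration of those pr options with two single-line inputs standing in for the real counts (GNU pr; -m merges the inputs side by side, -t suppresses headers, -s' | ' sets the separator; the numbers are hypothetical):
pr -mts' | ' <(echo 4533) <(echo 654)
# should print: 4533 | 654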
With GNU find, you can use -printf to print whatever you want, for example a c for each file matching .coded and an "r" for each file matching .read, and then use awk to count how many of each you have:
find -type f \
\( -name '*.coded*' -printf 'c\n' \) \
-o \
\( -name '*.read*' -printf 'r\n' \) \
| awk '{ ++a[$0] } END{ printf "%d | %d\n", a["c"], a["r"] }'
By the way, your grep patterns match Xcoded and Yread, or really anything in place of the period; if you mean a literal period, it has to be escaped, as in '\.coded' and '\.read'. Also, if your filenames contain line breaks, your counts will be off.
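A minimal alternative along the same lines, with the periods escaped and grep -c doing the counting (note it walks the tree twice, unlike the single-pass find/awk version above):
printf '%s | %s\n' "$(find . -type f | grep -c '\.coded')" "$(find . -type f | grep -c '\.read')"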

integer expression expected while running the bash script

While running my script below from Jenkins's "execute shell" option, I'm getting -- [: 1 2 3 4 5 : integer expression expected. I tried using the > symbol too, without any luck; I'm not sure where exactly I went wrong.
Any help will be really helpful.
#!/bin/bash
declare -a folders
declare -a folders_req
db_ver=<the value which I got from my DB with trimmed leading & trailing spaces, like below>
#db_ver=`echo $( get_value ) |sed -e 's/\-//g' | grep -oP '(?<=DESCRIPTION)(\s+)?([^ ]*)' | sed -e 's/^[[:space:]]//g' | sed -e's/[[:space:]]*$//' | tr '\n' ' '| cut -d '/' -f2`
scripts_db_dir=`ls -td -- */ | head -1 | cut -d '/' -f1| sed -e 's/^[[:space:]]//g'`
cd ${scripts_db_dir}
folders=`ls -d */ | sed 's/\///g' | sed -e 's/^[[:space:]]//g' | sed -e's/[[:space:]]*$//' | tr '\n' ' '`
for i in "${folders[#]}"; do
if [ "${i}" -gt "${db_ver}" ]; then
echo "inside loop: $i"
folders_req+=("$i")
fi
done
#echo "$i"
#echo ${folders_req[@]}
scripts_db_dir contains directories named like 1 2 3 4 5
Your folders variable should be initialized as an array and not as a string, e.g.:
folders=($(ls -d */ | sed 's/\///g' | sed -e 's/^[[:space:]]//g' | sed -e's/[[:space:]]*$//' | tr '\n' ' '))
Given the various comments regarding "parsing ls is bad", consider using find instead:
find * -maxdepth 1 -type d -name '[0-9]*' -print
where:
-maxdepth 1 - searches only the current directory, no sub directories
-type d - looks only for directories
-name '[0-9]*' (or '[[:digit:]]*') - matches only items consisting of all digits
-print - just print the results
Thus:
folders=($(find * -maxdepth 1 -type d -name '[0-9]*' -print))
or just:
for i in $(find * -maxdepth 1 -type d -name '[0-9]*' -print); do
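Putting it together, a minimal sketch of the corrected loop (assumes GNU find for -printf, directory names without whitespace, and that db_ver holds a plain integer):
folders=( $(find . -maxdepth 1 -type d -name '[0-9]*' -printf '%f\n') )
folders_req=()
for i in "${folders[@]}"; do
    if [ "$i" -gt "$db_ver" ]; then    # integer comparison now works: $i is a single directory name
        folders_req+=("$i")
    fi
done
echo "${folders_req[@]}"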

How can I count the number of words in a directory recursively?

I'm trying to calculate the number of words written in a project. There are a few levels of folders and lots of text files within them.
Can anyone help me find out a quick way to do this?
bash or vim would be good!
Thanks
use find to scan the dir tree and wc will do the rest
$ find path -type f | xargs wc -w | tail -1
last line gives the totals.
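If the file names may contain spaces, a null-delimited variant of the same idea (it shares the caveat, discussed in the next answer, that xargs may split the list and produce more than one total line):
find path -type f -print0 | xargs -0 wc -w | tail -1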
tldr;
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
Explanation:
The find . -type f -exec wc -w {} + will run wc -w on all the files (recursively) contained by . (the current working directory). find executes wc as few times as possible, but as many times as necessary to stay within ARG_MAX, the system's command-length limit. When the combined length of the file names exceeds ARG_MAX, find invokes wc -w more than once, giving multiple total lines:
$ find . -type f -exec wc -w {} + | awk '/total/{print $0}'
8264577 total
654892 total
1109527 total
149522 total
174922 total
181897 total
1229726 total
2305504 total
1196390 total
5509702 total
9886665 total
Isolate these partial sums by printing only the first whitespace-delimited field of each total line:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}'
8264577
654892
1109527
149522
174922
181897
1229726
2305504
1196390
5509702
9886665
paste the partial sums with a + delimiter to give an infix summation:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+
8264577+654892+1109527+149522+174922+181897+1229726+2305504+1196390+5509702+9886665
Evaluate the infix summation using bc, which supports both infix expressions and arbitrary precision:
$ find . -type f -exec wc -w {} + | awk '/total/{print $1}' | paste -sd+ | bc
30663324
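The paste and bc steps can also be folded into awk, summing the partial totals directly; matching the literal word total in the second field is slightly safer than matching /total/ anywhere in the line (wc only prints a total line when given more than one file):
find . -type f -exec wc -w {} + | awk '$2 == "total" { s += $1 } END { print s }'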
References:
https://www.cyberciti.biz/faq/argument-list-too-long-error-solution/
https://www.in-ulm.de/~mascheck/various/argmax/
https://linux.die.net/man/1/find
https://linux.die.net/man/1/wc
https://linux.die.net/man/1/awk
https://linux.die.net/man/1/paste
https://linux.die.net/man/1/bc
You could find and print all the content and pipe to wc:
find path -type f -exec cat {} \; -exec echo \; | wc -w
Note: the -exec echo \; is needed in case a file doesn't end with a newline character; without it, the last word of one file and the first word of the next would run together and be counted as one.
Or you could find and wc and use awk to aggregate the counts:
find . -type f -exec wc -w {} \; | awk '{ sum += $1 } END { print sum }'
If there's one thing I've learned from all the bash questions on SO, it's that a filename with a space will mess you up. This script will work even if you have whitespace in the file names.
#!/usr/bin/env bash
shopt -s globstar
count=0
for f in **/*.txt
do
words=$(wc -w "$f" | awk '{print $1}')
count=$(($count + $words))
done
echo $count
Assuming you don't need to count words recursively and you want to include all the files in the current directory, you can use a simple approach such as:
wc -w *
10 000292_0
500 000297_0
510 total
If you want to count the words only for files with a specific extension in the current directory, you could try:
cat *.txt | wc -w

Count occurrence of files with an odd number characters in the filename

I'm trying to write a script that counts all the files on the system whose name has an odd number of characters, counting only the name, not the extension.
Can somebody help me?
I've done this but it doesn't work:
find /usr/lib -type f | cut -f 1 -d '.' | rev | cut -f 4 -d '/' | rev | wc -m
With this I count all the characters of all the files, but how do I count the number of characters of each file's name?
The following awk command will print out the number of files with an odd number of characters in their name.
find /usr/lib -type f | awk -F/ '{gsub(/\.[^\.]*$/,"",$NF);if(length($NF)%2!=0)i++}END{print i}'
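To see what the awk program does to a single path (a hypothetical file name), note that the gsub strips the final extension from the last /-separated field before its length is tested:
printf '%s\n' /usr/lib/libfoo.so | awk -F/ '{gsub(/\.[^\.]*$/,"",$NF); print $NF, length($NF)}'
# -> libfoo 6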
Print all the file names with an odd number of characters,
find /usr/lib -type f | xargs -i basename {} | cut -d . -f 1 | grep -Pv '^(..)+$'
Pipe to wc -l to count them.
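Or let grep do the counting itself: combined with -v, the -c option prints the number of non-matching (odd-length) names directly:
find /usr/lib -type f | xargs -i basename {} | cut -d . -f 1 | grep -Pvc '^(..)+$'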
