Find and count compressed files by extension - bash

I have a bash script that counts compressed files by file extension and prints the count.
#!/bin/bash
FIND_COMPRESSED=$(find . -type f | sed -e 's/.*\.//' | sort | uniq -c | sort -rn | grep -Ei '(deb|tgz|tar|gz|zip)$')
COUNT_LINES=$($FIND_COMPRESSED | wc -l)
if [[ $COUNT_LINES -eq 0 ]]; then
echo "No archived files found!"
else
echo "$FIND_COMPRESSED"
fi
However, the script works only if there are NO files with .deb, .tar, .gz, .tgz or .zip extensions.
If there are some, say test.zip and test.tar in the current folder, I get this error:
./arch.sh: line 5: 1: command not found
Yet, if I copy the command from the FIND_COMPRESSED variable into COUNT_LINES, everything works fine.
#!/bin/bash
FIND_COMPRESSED=$(find . -type f | sed -e 's/.*\.//' | sort | uniq -c | sort -rn | grep -Ei '(deb|tgz|tar|gz|zip)$')
COUNT_LINES=$(find . -type f | sed -e 's/.*\.//' | sort | uniq -c | sort -rn | grep -Ei '(deb|tgz|tar|gz|zip)$'| wc -l)
if [[ $COUNT_LINES -eq 0 ]]; then
echo "No archived files found!"
else
echo "$FIND_COMPRESSED"
fi
What am I missing here?

When you use the variable like that inside $( ), bash expands it and tries to execute its first word as a command, which is why it fails as soon as the variable has contents (the first word here is the count 1, hence 1: command not found). When it's empty, wc simply sees no input, returns 0, and the script marches on.
Thus, you need to change that line to this:
COUNT_LINES=$(echo "$FIND_COMPRESSED" | wc -l)
But, while we're at it, you can also simplify the other line with something like this:
FIND_COMPRESSED=$(find . -type f \( -iname "*deb" -or -iname "*tgz" -or -iname "*tar*" \)) #etc
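For what it's worth, here is a minimal sketch of the whole script with the empty/non-empty test done directly on the variable (same find/sed/grep pipeline as in the question, untested):
#!/bin/bash
# Count files per extension and keep only the archive extensions
FIND_COMPRESSED=$(find . -type f | sed -e 's/.*\.//' | sort | uniq -c | sort -rn | grep -Ei '(deb|tgz|tar|gz|zip)$')

# -z is true when the variable is empty, i.e. nothing matched
if [[ -z "$FIND_COMPRESSED" ]]; then
    echo "No archived files found!"
else
    echo "$FIND_COMPRESSED"
fi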

You can do:
mapfile FIND_COMPRESSED < <(find . -type f -regextype posix-extended -regex ".*(deb|tgz|tar|gz|zip)$" -exec bash -c '[[ "$(file {})" =~ compressed ]] && echo {}' \;)
COUNT_LINES=${#FIND_COMPRESSED[@]}
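Note that without -t, mapfile keeps the trailing newline in each array element; ${#FIND_COMPRESSED[@]} still gives the right count, but add -t if you intend to reuse the stored names later.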

Related

find and grep / zgrep / lzgrep progress bar

I would like to add a progress bar to this command line:
find . \( -iname "*.bz" -o -iname "*.zip" -o -iname "*.gz" -o -iname "*.rar" \) -print0 | while read -d '' file; do echo "$file"; lzgrep -a stringtosearch\.anything "$file"; done
The progress bar should be calculated from the total size of the compressed files (not from each single file).
Of course, it can be a script too.
I would also like to add other progress bars, if possible:
The total number of files processed (example 3 out of 21)
The percentage of progress of the single file
Can anybody help me please?
Here is an example of how it should look (example from here):
tar cf - /folder-with-big-files -P | pv -s $(du -sb /folder-with-big-files | awk '{print $1}') | gzip > big-files.tar.gz
Multiple progress bars (example from here):
pv -cN orig < foo.tar.bz2 | bzcat | pv -cN bzcat | gzip -9 | pv -cN gzip > foo.tar.gz
Thanks,
This is the first time I've ever heard of pv and it's not on any machine I have access to but assuming it needs to know a total at startup and then a number on each iteration of a command, you could do something like this to get a progress bar per file processed:
IFS= readarray -d '' files < <(find . -whatever -print0)
printf '%s\n' "${files[@]}" | pv -s "${#files[@]}" | command
The first line gives you an array of files so you can then use "${#files[@]}" to provide pv its initial total value (it looks like you use -s value for that?) and then do whatever you normally do to get progress as each file is processed.
I don't see any way to tell pv that the pipe it's reading from is NUL-terminated rather than newline-terminated so if your files can have newlines in their names then you'd have to figure out how to solve that problem.
To additionally get progress on a single file you might need something like:
IFS= readarray -d '' files < <(find . -whatever -print0)
printf '%s\n' "${files[@]}" |
pv -s "${#files[@]}" |
xargs -n 1 -I {} sh -c 'pv {} | command'
I don't have pv so all of the above is untested so check the syntax, especially since I've never heard of pv :-).
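One note in case it helps: as far as I can tell pv measures bytes by default, so a file count passed with -s probably only lines up if you also enable its line mode, e.g. pv -l -s "${#files[@]}", which makes it count newline-terminated lines instead of bytes.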
Thanks to Max C., I found a solution for the main question:
find ./ -type f \( -iname '*.gz' -o -iname '*.bz' \) | (tot=0; while read fname; do s=$(stat -c%s "$fname"); if [ ! -z "$s" ] ; then echo "$fname"; tot=$(($tot+$s)); fi; done; echo $tot) | tac | (read size; xargs -i{} cat "{}" | pv -s $size | lzgrep -a something -)
But this works only for gz and bz files; now I have to develop it to use a different tool according to the extension.
I'm going to try Ed's solution too.
Thanks to Ed and Max C., here is version 0.2.
This version works with zgrep, but not with lzgrep. :-\
#!/bin/bash
echo -n "collecting dump... "
IFS= readarray -d '' files < <(find . \( -iname "*.bz" -o -iname "*.gz" \) -print0)
echo done
echo "Calculating archives size..."
tot=0
for line in "${files[@]}"; do
    s=$(stat -c%s "$line")
    if [ ! -z "$s" ]
    then
        tot=$(($tot+$s))
    fi
done
(for line in "${files[@]}"; do
    s=$(stat -c%s "$line")
    if [ ! -z "$s" ]
    then
        echo "$line"
    fi
done
) | xargs -i{} sh -c 'echo Processing file: "{}" 1>&2 ; cat "{}"' | pv -s $tot | zgrep -a anything -
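To extend this to other compressors (the follow-up above about picking a different tool per extension), one option is to choose the decompression command from the file name instead of relying on zgrep. A minimal untested sketch along those lines; the search string, the extensions and the tool names are assumptions, and it only shows per-file progress (the overall total would still come from summing the sizes as in version 0.2):
#!/bin/bash
# Sketch: pick a decompression tool per extension instead of relying on zgrep.
pattern='anything'          # assumed search string, as in the script above

find . \( -iname '*.gz' -o -iname '*.bz2' -o -iname '*.xz' \) -type f -print0 |
while IFS= read -r -d '' f; do
    echo "Processing file: $f" >&2
    case "$f" in
        *.gz)  decomp='gunzip -c'  ;;
        *.bz2) decomp='bunzip2 -c' ;;
        *.xz)  decomp='unxz -c'    ;;
    esac
    # pv reads the compressed file (so it knows the size for the bar),
    # the decompressor reads the piped bytes, grep searches the plain text
    pv "$f" | $decomp | grep -a "$pattern"
done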

execute an if statement on every folder

I have, for example, 3 files (it could be 1 or it could be 30) like this:
name_date1.tgz
name_date2.tgz
name_date3.tgz
When extracted it will look like:
name_date1/data/info/
name_date2/data/info/
name_date3/data/info/
Here is how it looks inside each folder:
name_date1/data/info/
you.log
you.log.1.gz
you.log.2.gz
you.log.3.gz
name_date2/data/info/
you.log
name_date3/data/info/
you.log
you.log.1.gz
you.log.2.gz
What I want to do is concatenate all the you.log* files from each folder, and then concatenate all of those results into one single file.
1st step: extract all the folders
for a in *.tgz
do
    a_dir=${a%.tgz}
    mkdir "$a_dir" 2>/dev/null
    tar -xvzf "$a" -C "$a_dir" >/dev/null
done
2nd step: executing an if statement on each folder available and cat everything
myarray=(`find */data/info/ -maxdepth 1 -name "you.log.*.gz"`)
ls -d */ | xargs -I {} bash -c "cd '{}' &&
if [ ${#myarray[#]} -gt 0 ];
then
find data/info -name "you.log.*.gz" -print0 | sort -z -rn -t. -k4 | xargs -0 zcat | cat - data/info/you.log > youfull1.log
else
cat - data/info/you.log > youfull1.log
fi "
cat */youfull1.log > youfull.log
My issue: when I put multiple name_date*.tgz files, it gives me this error:
gzip: stdin: unexpected end of file
Even with the error, I still have all my files concatenated, but why the error message?
But when I put only one .tgz file, I don't have any issue, regardless of the number of you.log files.
Any suggestion, please?
Try something simpler. There is no need for myarray. Pass files one at a time as they come in and decide what to do with each one. Try:
find */data/info -type f -maxdepth 1 -name "you.log*" -print0 |
sort -z |
xargs -0 -n1 bash -c '
if [[ "${1##*.}" == "gz" ]]; then
zcat "$1";
else
cat "$1";
fi
' --
If you have to iterate over directories, don't use ls, still use find.
find . -maxdepth 1 -type d -name 'name_date*' -print0 |
sort -z |
while IFS= read -r -d '' dir; do
cat "$dir"/data/info/you.log
find "$dir"/data/info -type f -maxdepth 1 -name 'you.log.*.gz' -print0 |
sort -z -t'.' -n -k3 |
xargs -r -0 zcat
done
or (if you have to) with xargs, which should give you the idea how it's used:
find . -maxdepth 1 -type d -name 'name_date*' -print0 |
sort -z |
xargs -0 -n1 bash -c '
cat "$1"/data/info/you.log
find "$1"/data/info -type f -maxdepth 1 -name "you.log.*.gz" -print0 |
sort -z -t"." -n -k3 |
xargs -r -0 zcat
' --
Use the -t option with xargs to see what it's doing.
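Note that both loops write everything to stdout, so the single combined file from the question is just one redirection after the final done (e.g. done > youfull.log) instead of a youfull1.log per directory that has to be concatenated afterwards.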

How to print only results different from zero?

I've got this script. I would like to print only the non-zero results.
My environment is OS X.
find /PATH/ -type f -exec basename "{}" \; | grep -i "Word" | wc -l
First, here is a much faster find command that will do the same thing:
find /PATH/ -type f -iname '*Word*' | wc -l
Now, you can put this optimized command into an if statement:
if [[ `find /PATH/ -type f -iname '*Word*' | wc -l` -gt 0 ]]; then
    find /PATH/ -type f -iname '*Word*' | wc -l
fi
To run the command just once, save the result into a variable:
count=`find /PATH/ -type f -iname '*Word*' | wc -l`
if [[ $count -gt 0 ]]; then
    echo $count
fi
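A short equivalent, if you prefer it, is an arithmetic test on the same variable (plain bash, untested):
count=$(find /PATH/ -type f -iname '*Word*' | wc -l)
(( count > 0 )) && echo "$count"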
You can use grep -v to remove output that consists of just zero (with spaces before it, 'cause that's what wc prints). With @joanis' optimization of the search, that gives:
find /PATH/ -type f -iname '*Word*' | wc -l | grep -v '^ *0$'
When you count selected records, you do not have to filter on 0 hits.
This command shows all basenames that appear once or more.
find . -type f -iname '*Word*' -printf "%f\n" | sort | uniq -c
You might want to add | sort -n at the end to see which filename occurs most often.
Maybe you wanted something else: how often Word occurs in the different files.
grep -Rci while | grep -v ":0$"
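Here -R recurses into the directories, -i ignores case and -c prints a match count per file, so the second grep simply hides every file whose count is 0; replace while with whatever word you are actually searching for.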

List files (recursively) which have no matching pair

I have a set of files in multiple directories. Most of them have a related pair with a different extension and the same base name. The related files are always within the same directory. I need to list only files (and path) without pairs within a directory including all sub directories. How can I do that in bash?
file1.xxx
file1.yyy
file2.xxx
file2.zzz
file3.xxx
file3.aaa
file4.xxx
Any help is much appreciated!
You could use find and pipe to perl to sort the data
find . -type f -print0 |\
perl -0 -l012 -ne 'if(/.*\/(.*)\./){$x{$1}++;$y{$1}=$_}
}{for(keys %x){print $y{$_} if $x{$_}==1}'
This adds the name with no suffix to a hash and increments it for each match, whilst adding the full line to another hash with the same key.
In the end it just checks which have a single match and prints.
As the filenames are null delimited it should work with all filenames.
You can list all the files under your directory and then count how many files in the same directory tree share the same path name (excluding the extension).
If a file matches one name or fewer (only itself), that means it has no "companion" files:
for f in $(find -type f); do
    c=$(find -wholename "$(echo $f | rev | cut --complement -d . -f 1 | rev).*" | wc -l);
    if [ "$c" -le "1" ]; then echo $f; fi;
done
Edit:
It might be more readable if the pattern composition is performed on a separate line:
for f in $(find -type f); do
    compPattern="$(echo $f | rev | cut --complement -d . -f 1 | rev).*"
    c=$(find -wholename "$compPattern" | wc -l);
    if [ "$c" -le "1" ]; then echo $f; fi;
done
Edit (2)
To avoid parsing the output of the find you can use read:
find -type f | while read f; do
    if [ $(find -wholename "$(echo $f | rev | cut --complement -d . -f 1 | rev).*" | wc -l) -le "1" ]; then echo $f; fi;
done
Edit(3)
To handle special chars, spaces etc. you can use the following.
while IFS= read -r -d '' f ; do
    c=$(find -wholename "$(echo $f | rev | cut --complement -d . -f 1 | rev).*" | wc -l);
    if [ "$c" -le "1" ]; then echo $f; fi;
done < <(find -type f -print0)
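Side note: the rev | cut pipeline only strips the last extension, which bash parameter expansion can do without the extra processes. A sketch of the same Edit (3) loop using ${f%.*} (untested, same -wholename counting logic as above):
while IFS= read -r -d '' f; do
    # ${f%.*} drops the shortest suffix starting at the last dot
    c=$(find -wholename "${f%.*}.*" | wc -l)
    if [ "$c" -le 1 ]; then echo "$f"; fi
done < <(find -type f -print0)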

Bash : Find and Remove duplicate files from different folders

I have two folders with some common files, and I want to delete the duplicate files from the xyz folder.
folder1:
/abc/file1.csv
/abc/file2.csv
/abc/file3.csv
/abc/file4.csv
folder2:
/xyz/file1.csv
/xyz/file5.csv
I want to compare both folders and remove the duplicates from the /xyz folder, so that what remains is: file5.csv
For now I am using:
find "/xyz" "/abc" "/abc" -printf '%P\n' | sort | uniq -u | -exec rm {} \;
But it is failing with this reason: if -exec is not a typo you can run the following command to lookup the package that contains the binary:
command-not-found -exec
-bash: -exec: command not found
-exec is an option to find; you had already exited the find command when you started the pipes.
Try xargs instead; it takes all the data from stdin and appends it to the program's arguments.
UNTESTED
find "/xyz" "/abc" "/abc" -printf '%P\n' | sort | uniq -u | xargs rm
Find every file in the ./234 and ./123 directories, get the filename with -printf, sort them, let uniq -d give the list of duplicates, put the path back with sed (using the ./123 directory as the one to delete the duplicates from), and pass the files to xargs rm.
Command:
find ./234 ./123 -type f -printf '%P\n' | sort | uniq -d | sed 's/^/.\/123\//g' | xargs rm
sed isn't needed if you are in the ./123 directory and use full paths for the folders in find.
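Translated to the directories from the question, that would look something like this (untested; add NUL- or newline-safe options to sort, uniq and xargs if the names can contain spaces):
find /abc /xyz -type f -printf '%P\n' | sort | uniq -d | sed 's|^|/xyz/|' | xargs rm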
Another approach: just find the files in abc and attempt to remove them from xyz:
UNTESTED
find /abc -type f -printf 'rm -f /xyz/%P\n' | sh
Remove Duplicate Files From Particular Directory
FileList=$(ls)
for D1 in $FileList; do
    if [[ -f $D1 ]]; then
        for D2 in $FileList; do
            if [[ -f $D2 ]]; then
                if [[ $D1 == $D2 ]]; then
                    : 'Skip Original File'
                else
                    if [[ $(md5sum "$D1" | cut -d'=' -f 2 | cut -d ' ' -f 1) == $(md5sum "$D2" | cut -d'=' -f 2 | cut -d ' ' -f 1) ]]; then
                        echo "Duplicate File Found : $D2"
                        rm -rf "$D2"
                    fi #Detect Duplicate Using MD5
                fi #Skip Original File
            fi #D2 File available Then Next
        done
    fi #D1 File available Then Next
done
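For a directory with many files, hashing every pair gets slow; the same idea can be done in one pass with an associative array keyed on the checksum. A minimal sketch (bash 4+, md5sum assumed to be available; keeps the first file seen for each checksum):
#!/bin/bash
# One pass: remember the first file seen for each MD5 checksum,
# remove any later file that hashes to the same value.
declare -A seen
for f in ./*; do
    [[ -f $f ]] || continue
    sum=$(md5sum "$f" | cut -d ' ' -f 1)
    if [[ -n ${seen[$sum]} ]]; then
        echo "Duplicate File Found : $f (same content as ${seen[$sum]})"
        rm -f -- "$f"
    else
        seen[$sum]=$f
    fi
done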

Resources