How to speed up checking if a file exists in bash

I'm new to Bash and wrote a script to check my photo files, but it's very slow and gives a few empty returns when checking 17,000+ photos. Is there any way to use all 4 CPUs to run this script and speed it up?
Please help.
#!/bin/bash
readarray -t array < ~/Scripts/ourphotos.txt
totalfiles="${#array[@]}"
echo $totalfiles
i=0
ii=0
check1=""
while :
do
    check=${array[$i]}
    if [[ ! -r $( echo $check ) ]] ; then
        if [ $check = $check1 ]; then
            echo "empty "$check
        else
            unset array[$i]
            ii=$((ii + 1 ))
        fi
    fi
    if [ $totalfiles = $i ]; then
        break
    fi
    i=$(( i + 1 ))
done
if [ $ii -gt "1" ]; then
    notify-send -u critical $ii" files have been deleted or are unreadable"
fi

It's a filesystem operation, so multiple cores will hardly help.
Simplification might, though:
while read -r file; do
    i=$((i+1)); [ -e "$file" ] || ii=$((ii+1));
done < "$HOME/Scripts/ourphotos.txt"
#...
Two points:
you don't need to keep the whole file in memory (no arrays needed)
$( echo $check ) forks a process. You generally want to avoid forking and exec'ing in loops.
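Putting the two points together, here is a minimal complete sketch of the same check, assuming the list of paths lives at ~/Scripts/ourphotos.txt as in the question; the final notify-send is kept from the original script:
#!/bin/bash
# Count list entries that do not name a readable file, without loading the list into an array.
missing=0
while IFS= read -r file; do
    [ -r "$file" ] || missing=$((missing + 1))
done < "$HOME/Scripts/ourphotos.txt"
if [ "$missing" -gt 0 ]; then
    notify-send -u critical "$missing files have been deleted or are unreadable"
fi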

This is an old question, but a common problem lacking an evidence-based solution.
awk '{print "[ -e "$1" ] && echo "$2}' | parallel # 400 files/s
awk '{print "[ -e "$1" ] && echo "$2}' | bash # 6000 files/s
while read file; do [ -e $file ] && echo $file; done # 12000 files/s
xargs find # 200000 files/s
parallel --xargs find # 250000 files/s
xargs -P2 find # 400000 files/s
xargs -P96 find # 800000 files/s
I tried this on a few different systems and the results were not consistent, but xargs -P (parallel execution) was consistently the fastest. I was surprised that xargs -P was faster than GNU parallel (not reported above, but sometimes much faster), and I was surprised that parallel execution helped so much — I thought that file I/O would be the limiting factor and parallel execution wouldn't matter much.
Also noteworthy is that xargs find is about 20x faster than the accepted solution, and much more concise. (It works as an existence check because find, given an existing file as an argument, prints its path, while for a missing path it prints nothing and writes an error to stderr, so the number of output lines is the number of existing files.) For example, here is a rewrite of the OP's script:
#!/bin/bash
total=$(wc -l ~/Scripts/ourphotos.txt | awk '{print $1}')
# tr '\n' '\0' | xargs -0 handles spaces and other funny characters in filenames
found=$(cat ~/Scripts/ourphotos.txt | tr '\n' '\0' | xargs -0 -P4 find | wc -l)
if [ $total -ne $found ]; then
    ii=$(expr $total - $found)
    notify-send -u critical "$ii files have been deleted or are unreadable"
fi

Related

Creating file takes time in bash

I have a bash script in which I do string substitutions, taking input values from different source files to build one complete string record. I have to create 5L (500,000) such records in a file within 5 minutes, on the go (records need to be written to the file as soon as they are created); however, the script is very slow (20k records in 5 minutes). Below is the script I used.
#!/bin/bash
sampleRecod="__TIME__-0400 INFO 639582 truefile?apikey=__API_KEY__json||__STATUS__|34|0||0|0|__MAINSIZE__|1|"
count=0;
license_array=(`cat license.txt | xargs`)
status_array=(`cat status.json | xargs`)
error_array=(`cat 403.json | xargs`)
finalRes="";
echo $(date +"%Y-%m-%dT%H:%M:%S.%3N")
while true; do
    time=$(date +'%Y-%m-%dT%T.%3N')
    line=${license_array[`shuf -i 0-963 -n 1`]}
    status=${status_array[`shuf -i 0-7 -n 1`]}
    responseMainPart=$(shuf -i 100-999 -n 1)
    if [ $status -eq 403 ] || [ $status -eq 0 ]
    then
        responseMainPart=${error_array[`shuf -i 0-3 -n 1`]}
    fi
    result=$(echo "$sampleRecod" | sed "s/__TIME__/$time/g")
    result=$(echo "$result" | sed "s/__KEY__/$line/g")
    result=$(echo "$result" | sed "s/__STATUS__/$status/g")
    result=$(echo "$result" | sed "s/__MAIN_SIZE__/$responseMainPart/g")
    finalRes+="${result} \n";
    count=$((count+1))
    if [ $count -eq 1000 ]
    then
        #echo "got count";
        count=0;
        echo -e $finalRes >> new_data_1.log;
        finalRes="";
    fi
done
echo -e $finalRes >> new_data_1.log;
echo $(date +"%Y-%m-%dT%H:%M:%S.%3N")
Can anyone suggest how I can optimize this? The files I retrieve values from do not have many lines either.
I have tried replacing shuf with sed, but it didn't help much.
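No answer was captured here, but the dominant cost is the handful of forks per record (four echo|sed pipelines plus several shuf calls). A hedged sketch of the substitution part of the inner loop using only bash built-ins; the variable and placeholder names follow the question, and the index ranges are assumed from the shuf calls there:
# Pick random indices with $RANDOM instead of forking shuf (ranges taken from the question).
line=${license_array[RANDOM % 964]}
status=${status_array[RANDOM % 8]}
responseMainPart=$((RANDOM % 900 + 100))

# Do every substitution with parameter expansion instead of piping echo through sed.
result=$sampleRecod
result=${result//__TIME__/$time}
result=${result//__KEY__/$line}
result=${result//__STATUS__/$status}
result=${result//__MAIN_SIZE__/$responseMainPart}
The 403 branch, the accumulation into finalRes, the 1000-record batching, and the date call can stay as in the question; dropping the per-record sed and shuf forks alone should account for most of the speedup.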

Bash - Extract Matching String from GZIP Files Is Running Very Slow

Complete novice in Bash here. I'm trying to iterate through 1000 gzip files; maybe GNU parallel is the solution?
#!/bin/bash
ctr=0
echo "file_name,symbol,record_count" > $1
dir="/data/myfolder"
for f in "$dir"/*.gz; do
    gunzip -c $f | while read line;
    do
        str=`echo $line | cut -d"|" -f1`
        if [ "$str" == "H" ]; then
            if [ $ctr -gt 0 ]; then
                echo "$f,$sym,$ctr" >> $1
            fi
            ctr=0
            sym=`echo $line | cut -d"|" -f3`
            echo $sym
        else
            ctr=$((ctr+1))
        fi
    done
done
Any help to speed up the process would be greatly appreciated!
#!/bin/bash
ctr=0
export ctr
echo "file_name,symbol,record_count" > $1
dir="/data/myfolder"
export dir
doit() {
    f="$1"
    gunzip -c $f | while read line;
    do
        str=`echo $line | cut -d"|" -f1`
        if [ "$str" == "H" ]; then
            if [ $ctr -gt 0 ]; then
                echo "$f,$sym,$ctr"
            fi
            ctr=0
            sym=`echo $line | cut -d"|" -f3`
            echo $sym >&2
        else
            ctr=$((ctr+1))
        fi
    done
}
export -f doit
parallel doit ::: *gz 2>&1 > $1
The Bash while read loop is probably your main bottleneck here. Calling multiple external processes for simple field splitting will exacerbate the problem. Briefly,
while IFS="|" read -r first second third rest; do ...
leverages the shell's built-in field splitting functionality, but you probably want to convert the whole thing to a simple Awk script anyway.
echo "file_name,symbol,record_count" > "$1"
for f in "/data/myfolder"/*.gz; do
gunzip -c "$f" |
awk -F "\|" -v f="$f" -v OFS="," '
/H/ { if(ctr) print f, sym, ctr
ctr=0; sym=$3;
print sym >"/dev/stderr"
next }
{ ++ctr }'
done >>"$1"
This vaguely assumes that printing the lone sym is just for diagnostics. It should hopefully not be hard to see how this can be refactored if this is an incorrect assumption.
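Since the question explicitly asks about GNU parallel: each .gz file is processed independently, so the per-file work can be spread across cores. A hedged sketch, assuming GNU parallel is installed and that the diagnostic sym output can be dropped:
echo "file_name,symbol,record_count" > "$1"
# Each job decompresses one file and prints its CSV rows; parallel groups output per job,
# so rows from different files do not interleave.
doit() {
    gunzip -c "$1" |
    awk -F "\|" -v f="$1" -v OFS="," '
        $1 == "H" { if (ctr) print f, sym, ctr; ctr = 0; sym = $3; next }
        { ++ctr }'
}
export -f doit
parallel doit ::: /data/myfolder/*.gz >> "$1"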

Creating a shell script with diff function to compare multiple files

I have five different files, all in different directories. I want to check which files match and also find the unique ones.
I am not sure how I should handle this.
You can look at the output of
cksum "path1/file1" "path2/f2" "p3/f3" "p4/f4" "p5/f5" | sort
You can also make a script looping through the files with
files=("path1/file1" "path2/f2" "p3/f3" "p4/f4" "p5/f5")
for i in {0..4}; do
    ((j=$i+1))
    while [ $j -le 4 ]; do
        diff "${files[i]}" "${files[j]}" >/dev/null
        if [ $? -eq 0 ]; then
            echo "${files[i]} and ${files[j]} are the same."
        else
            echo "${files[i]} and ${files[j]} are different."
        fi
        ((j++))
    done
done
You can use cksum or md5sum to detect identical files:
find . -type f | while read f; do md5sum "$f"; done > tmp.txt
cat tmp.txt | cut -d" " -f1 | while read c
do
    n=`grep $c tmp.txt | wc -l`
    if [ "$n" != "1" ]; then
        grep $c tmp.txt
    fi
done | sort -u
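A shorter variant of the same idea, assuming GNU coreutils (an MD5 digest is 32 hex characters, so uniq can group on that prefix alone):
# List every file whose content appears more than once, grouped by identical MD5 digest.
find . -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate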

Bash: Native way to check if an entry is one line?

I have a find script that automatically opens a file if just one file is found. The way I currently handle it is by doing a word count on the number of lines of the search results. Is there an easier way to do this?
if [ "$( cat "$temp" | wc -l | xargs echo )" == "1" ]; then
edit `cat "$temp"`
fi
EDITED - here is the context of the whole script.
term="$1"
temp=".aafind.txt"
find src sql common -iname "*$term*" | grep -v 'src/.*lib' >> "$temp"
if [ ! -s "$temp" ]; then
    echo "ø - including lib..." 1>&2
    find src sql common -iname "*$term*" >> "$temp"
fi
if [ "$( cat "$temp" | wc -l | xargs echo )" == "1" ]; then
    # just open it in an editor
    edit `cat "$temp"`
else
    # format output
    term_regex=`echo "$term" | sed "s%\*%[^/]*%g" | sed "s%\?%[^/]%g" `
    cat "$temp" | sed -E 's%//+%/%' | grep --color -E -i "$term_regex|$"
fi
rm "$temp"
Unless I'm misunderstanding, the variable $temp contains one or more filenames, one per line, and if there is only one filename it should be edited?
[ $(wc -l <<< "$temp") = "1" ] && edit "$temp"
If $temp is a file containing filenames:
[ $(wc -l < "$temp") = "1" ] && edit "$(cat "$temp")"
Several of the results here will read through an entire file, whereas one can stop and have an answer after one line and one character:
if { IFS='' read -r result && ! read -n 1 _; } <file; then
    echo "Exactly one line: $result"
else
    echo "Either no valid content at all, or more than one line"
fi
For safely reading from find, if you have GNU find and bash as your shell, replace <file with < <(find ...) in the above. Even better, in that case, is to use NUL-delimited names, such that filenames with newlines (yes, they're legal) don't trip you up:
if { IFS='' read -r -d '' result && ! read -r -d '' -n 1 _; } \
        < <(find ... -print0); then
    printf 'Exactly one file: %q\n' "$result"
else
    echo "Either no results, or more than one"
fi
Well, given that you are storing these results in the file $temp this is a little easier:
[ "$( wc -l < $temp )" -eq 1 ] && edit "$( cat $temp )"
Instead of $( cat $temp ) you can use $( < $temp ), but it might take away some readability if you are not very familiar with redirection 8)
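For example, the same one-liner restated with both redirection shortcuts:
[ "$( wc -l < "$temp" )" -eq 1 ] && edit "$(< "$temp")"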
If you want to test whether the file is empty or not, test -s does that.
if [ -s "$temp" ]; then
    edit `cat "$temp"`
fi
(A non-empty file contains at least one line of data. Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline will report one fewer line.)
If you genuinely want a line count of exactly one, then yes, it can be simplified substantially:
if [ $( wc -l <"$temp" ) = 1 ]; then
    edit `cat "$temp"`
fi
You can use arrays:
x=($(find . -type f))
[ "${#x[*]}" -eq 1 ] && echo "just one" || echo "many"
But you might have problems in case of filenames with whitespace, etc. (see the safer variant below).
Still, something like this would be a native way.
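A variant of the same array idea that avoids the whitespace problem, assuming bash 4.4+ for readarray -d:
# Read NUL-delimited names from find into an array, so spaces and newlines in filenames are safe.
readarray -d '' x < <(find . -type f -print0)
[ "${#x[@]}" -eq 1 ] && echo "just one" || echo "many"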
No, this is the way, though you're making it over-complicated:
if [ "`wc -l $temp | cut -d' ' -f1`" = "1" ]; then
    edit "$temp";
fi
What's complicating it is:
a useless use of cat,
a needless use of xargs,
and I'm not sure if you really want the edit `cat "$temp"`, which edits the file(s) named by the content of $temp.

preventing wildcard expansion in bash script

I've searched here, but still can't find the answer to my globbing problems.
We have files "file.1" through "file.5", and each one should contain the string "completed" if our overnight processing went ok.
I figure it's a good thing to first check that there are some files, then I want to grep them to see if I find 5 "completed" strings. The following innocent approach doesn't work:
FILES="/mydir/file.*"
if [ -f "$FILES" ]; then
    COUNT=`grep completed $FILES`
    if [ $COUNT -eq 5 ]; then
        echo "found 5"
    else
        echo "no files?"
    fi
fi
Thanks for any advice....Lyle
Per http://mywiki.wooledge.org/BashFAQ/004, the best approach to counting files is to use an array (with the nullglob option set):
shopt -s nullglob
files=( /mydir/file.* )
count=${#files[@]}
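To finish the check from the question, the count can then be compared directly:
(( count == 5 )) && echo "found 5" || echo "no files?"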
If you want to collect the names of those files, you can do it like so (assuming GNU grep):
completed_files=()
while IFS='' read -r -d '' filename; do
    completed_files+=( "$filename" )
done < <(grep -l -Z completed /dev/null /mydir/file.*)
(( ${#completed_files[@]} == 5 )) && echo "Exactly 5 files completed"
This approach is somewhat verbose, but guaranteed to work even with highly unusual filenames.
try this:
[[ $(grep -l 'completed' /mydir/file.* | grep -c .) == 5 ]] || echo "Something is wrong"
It will print "Something is wrong" if it doesn't find 5 completed files.
Corrected to add the missing -l; here is the explanation:
$ grep -c completed file.*
file.1:1
file.2:1
file.3:0
$ grep -l completed file.*
file.1
file.2
$ grep -l completed file.* | grep -c .
2
$ grep -l completed file.* | wc -l
2
You can prevent globbing by quoting the expansion:
echo "$FILES"
but it seems you have a different problem
