Bash shell script to find missing files from filename

I have a folder that should contain 1485 files, named PA0001.png, PA0002.png ... up to PA1485.png
Some of them are missing and I'd like to write a shell script able to identify the missing ones and print them, as a list, in a .txt file (preferably without the leading string PA and the .png extension, but with the leading zeroes, if any)
I have no clue on how to proceed though, maybe using awk? But I'm still quite of a noob... Any help would be much appreciated!

You can list the sequence numbers of the missing files with a Bash loop:
# Redirect all subsequent output to file.txt
exec > file.txt
for ((i=1 ; i<=1485 ; i++)) ; do
# Convert to 4 digit zero padded
printf -v id '%04d' $i
if [ ! -f "PA$id.png" ] ; then
echo $id
fi
done

Here's a slight refactoring of the existing answer, with explanations in the comments.
# Assign each number in the sequence to i; loop until we have done them all
for ((i=1 ; i<=1485 ; i++)) ; do
# Format the number with padding for the file name part
printf -v id '%04d' "$i"
# If a file with this name does not exist,
if [ ! -f "PA$id.png" ] ; then
# Print it to standard output
echo "$id"
fi
# Redirect the loop's standard output to a file
done >missing.txt

You can do exactly this without a single Bash loop:
#!/usr/bin/env bash
{
find . \
-maxdepth 1 \
-regextype posix-extended \
-regex '.*/PA[[:digit:]]{4}\.png' \
-printf '%f\n'
printf 'PA%04d.png\n' {1..1485}
} | sort | uniq --unique
It combines the list of existing files with the list of expected file names, then sorts the result and prints only the entries that occur once.
Assuming no matching file exists outside the expected range, those entries appear only in the expected list, so they are the missing files.
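A variant of the same idea, sketched with comm instead of uniq. comm -13 prints only the lines that are unique to its second input, so a stray file outside the expected range cannot show up in the result, and the output carries only the zero-padded numbers. The helper names existing_ids, expected_ids and list_missing are made up for this sketch, and it assumes a Bash with process substitution.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the answers above.

# Print the numbers of the PAxxxx.png files present in directory $1.
existing_ids() {
    local f
    for f in "$1"/PA[0-9][0-9][0-9][0-9].png; do
        [ -e "$f" ] || continue        # the glob matched nothing
        f=${f##*/PA}                   # strip the directory and the PA prefix
        printf '%s\n' "${f%.png}"      # strip the extension
    done
}

# Print 0001 .. $1, zero-padded to four digits.
expected_ids() {
    local i
    for ((i = 1; i <= $1; i++)); do printf '%04d\n' "$i"; done
}

# Print the expected numbers that have no file in directory $1.
list_missing() {
    comm -13 <(existing_ids "$1" | sort) <(expected_ids "$2" | sort)
}
```

For the question as asked, this would be called as list_missing . 1485 > missing.txt.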

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files that share an ID and, if possible, move the original files to another directory, e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that perhaps a solution would be to place all files which will be processed by the code in a new directory and then, in a second step, move the files with the appended "merged" back to the original dir or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can think of how it should work, but I can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more of the same.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!

while read -r line;
do
if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt 1 ]]
then
cat "${line}"_*.txt >> "${line}_merged.txt"
fi
done <<< "$(for i in *_*.txt;do echo "$i";done | awk -F_ '{ print $1 }' | sort -u)"
List the files matching *_*.txt, pipe the names into awk to print the part before the first "_", and reduce the result to unique prefixes. Feed the prefixes to a while loop. For each prefix, use find to count the matching files and, if there is more than one, cat the files with that prefix into a merged file.

for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done

A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative: it handles one file at a time. To detect whether a prefix has only one file, we have to look at all the files at once. So first, an awk script joins the filenames that share a prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[@]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[@]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
This runs the command echo with the variable IDList set to the empty string in its environment, and with a single argument equal to S<value of k>_S*.txt. Nothing is assigned in the current shell.
Filename expansion (i.e. * -> list of files) is not performed inside " double quotes.
To assign the output of a command to a variable, use command substitution: var=$( something | something_else )
IDList${k+n}_S*.txt
${var+word} is a parameter expansion; it does not add two variables together. It expands to word when var is set and to nothing when var is unset. See shell parameter expansion in the Bash manual; ${var-word} is the similar form that expands to word only when var is unset.
To add two numbers, use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
A bare $1 does nothing useful here. To print the first field, use {print $1}.
Remember to check your scripts with http://shellcheck.net
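The three points above can be checked with a short script. This is only an illustration; the variable names alt, sum and prefix are made up for it, and the file name is taken from the question's sample listing.

```bash
#!/usr/bin/env bash
k=100
n=1

# ${k+n} does not add anything: it expands to the literal word "n" because k is set.
alt="${k+n}"

# Arithmetic expansion performs the addition.
sum=$((k + n))

# Command substitution captures the output of a pipeline in a variable.
prefix=$(printf '%s\n' "S${k}_R1.txt" | awk -F'[_.]' '{ print $1 }')

printf '%s %s %s\n' "$alt" "$sum" "$prefix"    # prints: n 101 S100
```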

A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done

Update numbers in filenames

I have a set of filenames which are ordered numerically like:
13B12363_1B1_0.png
13B12363_1B1_1.png
13B12363_1B1_2.png
13B12363_1B1_3.png
13B12363_1B1_4.png
13B12363_1B1_5.png
13B12363_1B1_6.png
13B12363_1B1_7.png
13B12363_1B1_8.png
13B12363_1B1_9.png
13B12363_1B1_10.png
[...]
13B12363_1B1_495.png
13B12363_1B1_496.png
13B12363_1B1_497.png
13B12363_1B1_498.png
13B12363_1B1_499.png
After some postprocessing, I removed some files and I would like to update the ordering number and replace the actual number by its new position. Looking at this previous question I end up doing something like:
(1) ls -v | cat -n | while read n f; do mv -i $f ${f%%[0-9]+.png}_$n.png; done
However, this command does not recognize the "ordering number + png" part and just appends the new number at the end of the filename, giving something like 13B12363_1B1_10.png_9.png
On the other hand, if I do:
(2) ls -v * | cat -n | while read n f; do mv $f ${f%.*}_$n.png; done
The ordering number is added without issues. Like 13B12363_1B1_10_9.png
So, for (1) it seems I am not specifying the digit correctly but I am not able to find the correct syntax. So far I tried [0-9], [0-9]+, [[:digits:]] and [[:digits:]]+. Which should be the proper one?
Additionally, in (2) I am wondering how I should specify rename (CentOS version) to remove the numbers between the second and the third underscore. Here I have to say that I have some filenames like 20B12363_22_10_9.png, so I should somehow specify second and third underscore.

Using Bash's built-in regex matching (POSIX extended regular expressions) and a null-delimited list of files.
Tested with sample
#!/usr/bin/env bash
prename=$1
# Bash setting to return empty result if no match found
shopt -s nullglob
# Create a temporary directory to prevent file rename collisions
tmpdir=$(mktemp -d) || exit 1
# Add a trap to remove the temporary directory on EXIT
trap 'rmdir -- "$tmpdir"' EXIT
# Initialize file counter
n=0
# Generate null delimited list of files
printf -- %s\\0 "${prename}_"*'.png' |
# Sort the null delimited list on 3rd field numeric order with _ separator
sort --zero-terminated --field-separator=_ --key=3n |
# Iterate the null delimited list
while IFS= read -r -d '' f; do
# If Bash Regex match the file name AND
# file has a different sequence number
if [[ "$f" =~ (.*)_([0-9]+)\.png$ ]] && [[ ${BASH_REMATCH[2]} -ne $n ]]; then
# Use captured Regex match group 1 to rename file with incrementing counter
# and move it to the temporary folder to prevent rename collision with
# existing file
echo mv -- "$f" "$tmpdir/${BASH_REMATCH[1]}_$((n)).png"
fi
# Increment file counter
n=$((n+1))
done
# Move back the renamed files in place
mv --no-clobber -- "$tmpdir"/* ./
# $tmpdir removal is automatic on EXIT
# If something goes wrong, some files remain in it and it is not deleted
# so these can be dealt with manually
Remove the echo if the result matches your expectations.
Output from the sample
mv -- 13B12363_1B1_495.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_11.png
mv -- 13B12363_1B1_496.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_12.png
mv -- 13B12363_1B1_497.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_13.png
mv -- 13B12363_1B1_498.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_14.png
mv -- 13B12363_1B1_499.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_15.png

Do not parse ls.
read interprets backslashes and splits on IFS unless you use IFS= read -r; see the BashFAQ entry on how to read a stream line by line.
In the ${f%%pattern} expansion, the pattern is a glob, not a regex, and the rules differ: + means a literal +.
You could shopt -s extglob and then use ${f%%+([0-9]).png}. Or write a loop. Or match the _ too and do f=${f%%.png}; f="${f%_[0-9]*}_".
Or something along these lines (untested):
find . -maxdepth 1 -mindepth 1 -type f -name '13B12363_1B1_*.png' |
sort -t_ -n -k3 |
sed 's/\(.*_\)[0-9]*\.png$/&\t\1/' |
{
n=1;
while IFS=$'\t' read -r from to; do
echo mv "$from" "$to$((n++)).png";
done;
}
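The extglob suggestion from this answer can be tried on a single name. A minimal sketch, assuming extglob is enabled with shopt -s extglob before the expansion is used; the variable names plain and stem are made up for it.

```bash
#!/usr/bin/env bash
shopt -s extglob    # enables +(...) in glob patterns

f=13B12363_1B1_10.png

# In a glob, a bare + is a literal plus sign, so this pattern matches nothing
# and the name comes back unchanged.
plain=${f%%[0-9]+.png}

# +([0-9]) means "one or more digits"; %% removes the longest matching suffix.
stem=${f%%+([0-9]).png}

printf '%s\n%s\n' "$plain" "$stem"
# prints:
# 13B12363_1B1_10.png
# 13B12363_1B1_
```

The new name can then be built as "${stem}${n}.png" for whatever counter n the renaming loop maintains.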

Another alternative, with perl:
perl -e 'while(<@ARGV>){$o=$_;s/\d+(?=\D*$)/$i++.".renamed"/e;die if -e $_;rename $o,$_}while(<*.renamed>){$o=$_;s/\.renamed$//;die if -e $_;rename $o,$_}' $(ls -v|sed -E "s/$|^/'/g"|paste -sd ' ' -)
This solution should avoid rename collisions by first renaming the files with an extra ".renamed" extension and then removing that extension as the last step. There are also checks to detect rename collisions.
Anyways, please backup your data before trying :)
The perl script unrolled and explained:
while(<@ARGV>){ # loop through arguments.
# filenames are passed to "$_" variable
# save old file name
$o=$_;
# if not using variable, regex replacement (s///) uses topic variable ($_)
# e flag ==> evals the replacement
s/\d+(?=\D*$)/$i++.".renamed"/e; # works on $_
# Detect rename collision
die if -e $_;
rename $o,$_
}
while(<*.renamed>){
$o=$_;
s/\.renamed$//; # remove .renamed extension
die if -e $_;
rename $o,$_
}
The regex:
\d+ # one or more digits
(?=\D*$) # followed by 0 or more non-digits and end of string

How to compare 2 files word by word and storing the different words in result output file

Suppose there are two files:
File1.txt
My name is Anamika.
File2.txt
My name is Anamitra.
I want result file storing:
Result.txt
Anamika
Anamitra
I use putty so can't use wdiff, any other alternative.

Not my greatest script, but it works. Others might come up with something more elegant.
#!/bin/bash
if [ $# != 2 ]
then
echo "Arguments: file1 file2"
exit 1
fi
file1=$1
file2=$2
# Do this for both files
for F in $file1 $file2
do
if [ ! -f $F ]
then
echo "ERROR: $F does not exist."
exit 2
else
# Create a temporary file with every word from the file
for w in $(cat $F)
do
echo $w >> ${F}.tmp
done
fi
done
# Compare the temporary files, since they are now 1 word per line
# The grep keeps only the diff lines that start with < or >
# The awk keeps only the word (i.e. removes < or >)
# The sed removes any character that is not alphanumeric,
# for example a . at the end
diff "${file1}.tmp" "${file2}.tmp" | grep -E '^[<>]' | awk '{print $2}' | sed 's/[^a-zA-Z0-9]//g' > Result.txt
# Cleanup!
rm -f ${file1}.tmp ${file2}.tmp
This relies on how a for loop iterates over the output of $(cat file): it loops over each word, NOT each line as beginners in bash tend to believe. Here that is actually useful, since it transforms the files into 1 word per line.
Ex: file content == This is a sentence.
After the for loop is done, the temporary file will contain:
This
is
a
sentence.
Then it is trivial to run diff on the files.
One last detail: your sample output did not include a . at the end, hence the sed command to keep only alphanumeric characters.
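The same one-word-per-line idea can also be sketched without temporary files, using tr to split the text and process substitution to feed diff. The helper names words and word_diff are made up for this sketch, and it assumes Bash plus the usual tr, diff and sed utilities.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the answer above.

# words FILE: print the file one word per line, with punctuation removed.
words() {
    tr -s '[:space:]' '\n' < "$1" | tr -d '[:punct:]'
}

# word_diff FILE1 FILE2: print the words that differ between the two files.
word_diff() {
    # diff marks differing lines with "< " or "> "; sed keeps only those
    # lines and strips the marker.
    diff <(words "$1") <(words "$2") | sed -n 's/^[<>] //p'
}
```

With the two sample files from the question, word_diff File1.txt File2.txt > Result.txt would write Anamika and Anamitra, one per line.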

Bash: Correct way to store result of command in array [duplicate]

This question already has answers here:
How can I store the "find" command results as an array in Bash
(8 answers)
Closed 4 years ago.
How do I put the result of find $1 into an array?
In for loop:
for /f "delims=/" %%G in ('find $1') do %%G | cut -d\/ -f6-

I want to cry.
In bash:
file_list=()
while IFS= read -d $'\0' -r file ; do
file_list=("${file_list[@]}" "$file")
done < <(find "$1" -print0)
echo "${file_list[@]}"
file_list is now an array containing the results of find "$1"
What's special about "field 6"? It's not clear what you were attempting to do with your cut command.
Do you want to cut each file after the 6th directory?
for file in "${file_list[@]}" ; do
echo "$file" | cut -d/ -f6-
done
But why "field 6"? Can I presume that you actually want to return just the last element of the path?
for file in "${file_list[@]}" ; do
echo "${file##*/}"
done
Or even
echo "${file_list[@]##*/}"
Which will give you the last path element for each path in the array. You could even do something with the result
for file in "${file_list[@]##*/}" ; do
echo "$file"
done
Explanation of the bash program elements:
(One should probably use the builtin readarray instead)
find "$1" -print0
Find stuff and 'print the full file name on the standard output, followed by a null character'. This is important as we will split that output by the null character later.
<(find "$1" -print0)
"Process Substitution" : The output of the find subprocess is read in via a FIFO (i.e. the output of the find subprocess behaves like a file here)
while ...
done < <(find "$1" -print0)
The output of the find subprocess is read by the while command via <
IFS= read -d $'\0' -r file
This is the while condition:
read
Read one line of input (from the find command). The return value of read is 0 unless EOF is encountered, at which point the while loop exits.
-d $'\0'
...taking the null character as the delimiter (see QUOTING in the bash manpage). This matches the null characters that find emits because of -print0.
-r
backslash is not considered an escape character as it may be part of the filename
file
Result (first word actually, which is unique here) is put into variable file
IFS=
The command is run with IFS, the special variable that contains the characters on which read splits input into words, set to the empty string, because we don't want to split.
And inside the loop:
file_list=("${file_list[@]}" "$file")
Inside the loop, the file_list array is just grown by $file, suitably quoted.
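The readarray builtin mentioned in this answer can replace the whole loop. A minimal sketch, assuming Bash 4.4 or newer (where readarray accepts -d) and GNU sort for the -z option; the demo directory and its two file names are made up so the example is self-contained.

```bash
#!/usr/bin/env bash
# Throwaway directory with two files, one of them containing a space.
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b c.txt"

# -d '' splits the input on NUL bytes; -t drops the delimiter from each element.
# sort -z only makes the order predictable for the demo.
readarray -d '' -t file_list < <(find "$demo" -type f -print0 | sort -z)

printf '%d entries\n' "${#file_list[@]}"    # prints: 2 entries
printf '%s\n' "${file_list[@]##*/}"         # prints a.txt, then "b c.txt"

rm -rf "$demo"
```

In a script that takes the directory as its first argument, find "$1" -print0 would take the place of the demo directory.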

arrayname=( $(find $1) )
I don't understand your loop question. If you want to know how to work with that array, in bash you can loop through all array elements like this:
for element in $(seq 0 $((${#arrayname[@]} - 1)))
do
echo "${arrayname[$element]}"
done

This is probably not 100% foolproof, but it will probably work 99% of the time (I used the GNU utilities; the BSD utilities won't work without modifications; also, this was done using an ext4 filesystem):
declare -a BASH_ARRAY_VARIABLE=$(find <path> <other options> -print0 | sed -e 's/\x0$//' | awk -F'\0' 'BEGIN { printf "("; } { for (i = 1; i <= NF; i++) { printf "%c"gensub(/"/, "\\\\\"", "g", $i)"%c ", 34, 34; } } END { printf ")"; }')
Then you would iterate over it like so:
for FIND_PATH in "${BASH_ARRAY_VARIABLE[@]}"; do echo "$FIND_PATH"; done
Make sure to enclose $FIND_PATH inside double-quotes when working with the path.

Here's a simpler pipeless version, based on the version of user2618594
declare -a names=$(echo "("; find <path> <other options> -printf '"%p" '; echo ")")
for nm in "${names[@]}"
do
echo "$nm"
done

To loop through a find, you can simply use find:
for file in "`find "$1"`"; do
echo "$file" | cut -d/ -f6-
done
It was what I got from your question.

Find number of files with prefixes in bash

I've been trying to count all files with a specific prefix and then if the number of files with the prefix does not match the number 5 I want to print the prefix.
To achieve this, I wrote the following bash script:
#!/bin/bash
for filename in $(ls); do
name=$(echo $filename | cut -f 1 -d '.')
num=$(ls $name* | wc -l)
if [$num != 5]; then
echo $name
fi
done
But I get this error (repeatedly):
./check_uneven_number.sh: line 5: [1: command not found
Thank you!

The if statement takes a command, runs it, and checks its exit status. Left bracket ([) by itself is a command, but you wrote [$num. The shell expands $num to 1, creating the word [1, which is not a command.
if [ $num != 5 ]; then
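A small check of the point that [ is an ordinary command name. This is only an illustration; the variable result is made up for it.

```bash
#!/usr/bin/env bash
num=1

# [ is a command, so the shell needs whitespace to separate it from its arguments.
type [        # in Bash this reports that [ is a shell builtin

# With the spaces in place (and the variable quoted), the test runs as intended.
# Writing [$num instead makes the shell look for a command named "[1".
if [ "$num" != 5 ]; then
    result="not five"
fi
echo "$result"    # prints: not five
```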

Your code loops over file names, not prefixes; so if there are three file names with a particular prefix, you will get three warnings, instead of one.
Try this instead:
# Avoid pesky ls
printf '%s\n' * |
# Trim to just prefixes
cut -d . -f 1 |
# Reduce to unique
sort -u |
while IFS='' read -r prefix; do
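# Pay attention to quoting; %.0s consumes one argument and prints nothing,
# so exactly one dot is printed per matching file
num=$(printf '%.0s.' "$prefix"* | wc -c)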
# Pay attention to spaces
if [ "$num" -ne 5 ]; then
printf '%s\n' "$prefix"
fi
done
Personally, I'd prefer case over the clunky if here, but it takes some getting used to.
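What the case variant might look like, as a sketch only. The function name report_uneven is made up, and the sketch assumes every file name contains a dot and that the prefix is the part before the first dot; it counts matches with an array instead of wc.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each prefix in directory $1 whose number of
# files differs from the expected count $2.
report_uneven() {
    local dir=$1 expected=$2 prefix files
    (
        cd "$dir" || exit
        shopt -s nullglob                  # an unmatched glob expands to nothing
        printf '%s\n' * | cut -d . -f 1 | sort -u |
        while IFS= read -r prefix; do
            [ -n "$prefix" ] || continue   # skip the empty line of an empty dir
            files=("$prefix".*)            # all files named prefix.something
            case ${#files[@]} in
                "$expected") ;;                  # expected count: stay quiet
                *) printf '%s\n' "$prefix" ;;    # anything else: report it
            esac
        done
    )
}
```

For the question as asked, report_uneven . 5 would print the prefixes that do not have exactly five files.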
