Find files which share part of a filename - bash

In my current directory there are many files. Some of the files share part of their filename.
e.g.:
XGAE_537493_GSR.FITS
TGFE_537493_RRF.FITS
EGRE_537497_HDR.FITS
TRTE_537497_YUH.FITS
TRXX_537499_YDF.FITS
.
.
Files 1 & 2 would be a match, as would files 3 & 4. File 5 has no match. Therefore, files 1,2,3 and 4 would be moved.
I want to move the files which share part of their filename, in order to separate them from the ones that don't.
I was attempting to do this using bash. I googled but couldn't locate websites that were quite describing the process I need. So far in pseudo-code I have:
FOR F IN *
IF ${FILE:5:10} MATCHES ANY OTHER ${FILE:5:10}
MOVE ALL MATCHES TO ANOTHER DIRECTORY
Any information to help me move in the right direction would be appreciated.

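Try this:
for f in ./*.FITS ; do
    [ -e "$f" ] || continue   # skip files that an earlier iteration already moved
    middleBit=$(echo "$f" | cut -d'_' -f 2)   # field 2 is the shared number
    count=$(ls ./*_"$middleBit"_*.FITS | wc -l)
    if [ "$count" -gt 1 ]   # more than one file carries this number
    then
        mv ./*_"$middleBit"_*.FITS ./somewhere
    fi
done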

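Using an associative array in Bash 4 you can do it easily. Remember the first file seen for each key, so that it is moved as well once a second file with the same key turns up:
#!/bin/bash
declare -A first
for f in *.FITS; do
    k="${f:5:6}"
    if [[ ${first[$k]+x} ]]; then
        [[ -e ${first[$k]} ]] && mv "${first[$k]}" /dest/   # first file of the group
        mv "$f" /dest/
    else
        first[$k]=$f
    fi
done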
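
For shells without associative arrays, the same grouping can be sketched with globs alone. This is only an illustration: the function name move_shared and its destination argument are hypothetical, and it assumes names of the form PREFIX_NUMBER_SUFFIX.FITS in the current directory.

```shell
# move_shared DEST: move every *.FITS file in the current directory whose
# middle field (between the first and second underscore) occurs in more
# than one file name. Hypothetical helper, not part of the answers above.
move_shared() (
    dest=$1
    mkdir -p "$dest" || exit
    for f in *_*_*.FITS; do
        [ -e "$f" ] || continue       # already moved, or the glob matched nothing
        key=${f#*_}                   # drop everything up to the first underscore
        key=${key%%_*}                # drop everything from the next underscore on
        set -- *_"$key"_*.FITS        # all names that carry this key
        if [ "$#" -gt 1 ]; then
            mv -- "$@" "$dest"/
        fi
    done
)
```

Calling move_shared matched in the directory from the question would leave only TRXX_537499_YDF.FITS behind. The glob is re-evaluated for every file, so this is slower than the array version on very large directories.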

If your file name structure is fixed, you can scan the names and find duplicates in a sub-field of the file name with awk.
For example:
$ ls -1 | awk -F_ 'NF==3{f[$2]=(a[$2]++?f[$2] ORS $0:$0)}
END{for(k in f) if(a[k]>1) print f[k]} '
TGFE_537493_RRF.FITS
XGAE_537493_GSR.FITS
You can then pipe the results to a cp command:
$ ... | xargs -I file cp file file.DUP
which copies each duplicate to a name with the suffix .DUP, or
$ ... | xargs -I file mv file anotherlocation/
which moves them to anotherlocation.

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files with identical IDs (and, if possible, place the original files in another directory). Example output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that perhaps a solution would be to place all files which will be processed by the code in a new directory and then, in a second step, move the files with the appended "merged" back to the original dir or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can think how it should be but can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more of the same.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
while read -r line;
do
    if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt 1 ]]
    then
        cat "${line}"_*.txt >> "${line}_merged.txt"
    fi
done <<< "$(for i in *_*.txt;do echo "$i";done | awk -F_ '{ print $1 }' | sort -u)"
List the files matching *_*.txt and pipe the names into awk, which prints the part before the first "_"; sort -u reduces that to one line per prefix. The while loop reads each prefix, uses find to count the files that start with it and, if there is more than one, concatenates them into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative: it handles one file at a time. To detect whether a prefix has only one file, we have to look at all the files at once. So first, an awk script that joins the list of filenames sharing a common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[@]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[@]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
This executes the command echo, with the environment variable IDList exported and set to the empty string, and passes it one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (i.e. * -> list of files) is not performed inside " double quotes.
To assign the output of a command to a variable, use command substitution: var=$( something something | something )
IDList${k+n}_S*.txt
${var+pattern} is a parameter expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.
To add two numbers, use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
A bare $1 is not a valid action here. To print the first field, use {print $1}.
Remember to check your scripts with http://shellcheck.net
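
The three points above can be seen side by side in a short sketch. The values of k and n, the sample file name and the variable names are made up for illustration.

```shell
k=100
n=5

# Arithmetic expansion adds the two numbers.
sum=$((k + n))

# Command substitution captures the output of a pipeline in a variable;
# awk prints the first field, using "_" and "." as field separators.
id_list=$(printf '%s\n' "S${k}_R1.txt" | awk -F'[_.]' '{print $1}')

# ${var+word} expands to word only when var is set.
unset maybe
when_unset=${maybe+replacement}
maybe=anything
when_set=${maybe+replacement}

printf 'sum=%s id_list=%s when_unset=[%s] when_set=[%s]\n' \
    "$sum" "$id_list" "$when_unset" "$when_set"
```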
A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done

Bash shell script to find missing files from filename

I have a folder that should contain 1485 files, named PA0001.png, PA0002.png ... up to PA1485.png
Some of them are missing and I'd like to write a shell script able to identify the missing ones and print them, as a list, in a .txt file (preferably without the leading string PA and the .png extension, but with the leading zeroes, if any)
I have no clue on how to proceed though, maybe using awk? But I'm still quite of a noob... Any help would be much appreciated!
You can get the sequence numbers of the missing files with a bash loop:
# Redirect all output to a file, as the question asks
exec > file.txt
for ((i=1 ; i<=1485 ; i++)) ; do
# Convert to 4 digit zero padded
printf -v id '%04d' $i
if [ ! -f "PA$id.png" ] ; then
echo $id
fi
done
Here's a slight refactoring of the existing answer, with explanations in the comments.
# Assign each number in the sequence to i; loop until we have done them all
for ((i=1 ; i<=1485 ; i++)) ; do
# Format the number with padding for the file name part
printf -v id '%04d' "$i"
# If a file with this name does not exist,
if [ ! -f "PA$id.png" ] ; then
# Print it to standard output
echo "$id"
fi
# Redirect the loop's standard output to a file
done >missing.txt
You can do this without a single Bash loop:
#!/usr/bin/env bash
{
find . \
-maxdepth 1 \
-regextype posix-extended \
-regex '.*/PA[[:digit:]]{4}\.png' \
-printf '%f\n'
printf 'PA%04d.png\n' {1..1485}
} | sort | uniq --unique
It combines the list of existing files with the list of expected file names, then sorts the result and prints only the names that occur once. Every existing file in the expected range occurs twice, so the names printed are the missing files (together with any PA file whose number lies outside 1 to 1485).
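
A variant of the same idea can be sketched with comm, which reports only the names that are expected but absent and prints the bare zero-padded numbers the question asks for. The function name missing_ids and its two arguments are hypothetical; the sketch assumes seq, comm and mktemp are available.

```shell
# missing_ids DIR MAX: print the zero-padded ids in 1..MAX for which
# DIR contains no PA<id>.png. Hypothetical helper for illustration.
missing_ids() (
    cd "$1" || exit
    existing=$(mktemp) || exit
    # Names that are actually present; the glob expands in sorted order.
    for f in PA[0-9][0-9][0-9][0-9].png; do
        if [ -f "$f" ]; then printf '%s\n' "$f"; fi
    done > "$existing"
    # comm -23 keeps the lines found only in the first input (the expected names).
    seq -f 'PA%04g.png' 1 "$2" | comm -23 - "$existing" | sed 's/^PA//; s/\.png$//'
    rm -f "$existing"
)
```

For the question's case this would be called as missing_ids . 1485 > missing.txt.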

Append wc lines to filename

Title says it all. I've managed to get just the lines with this:
lines=$(wc file.txt | awk {'print $1'});
But I could use an assist appending this to the filename. Bonus points for showing me how to loop this over all the .txt files in the current directory.
find -name '*.txt' -execdir bash -c \
'mv -v "$0" "${0%.txt}_$(wc -l < "$0").txt"' {} \;
where
the bash command is executed for each (\;) matched file;
{} is replaced by the currently processed filename and passed as the first argument ($0) to the script;
${0%.txt} deletes shortest match of .txt from back of the string (see the official Bash-scripting guide);
wc -l < "$0" prints only the number of lines in the file (see answers to this question, for example)
Sample output:
'./file-a.txt' -> 'file-a_5.txt'
'./file with spaces.txt' -> 'file with spaces_8.txt'
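
The two expansions explained above can also be tried on their own, without renaming anything. This is a minimal sketch; the function name name_with_count is hypothetical.

```shell
# name_with_count FILE: print the name FILE would get with its line count
# inserted before the .txt extension. Nothing is renamed.
name_with_count() (
    file=$1
    # Reading from stdin makes wc print the number only, without the file name;
    # tr removes the padding that some wc implementations add.
    count=$(wc -l < "$file" | tr -d ' ')
    # ${file%.txt} removes the shortest trailing match of ".txt".
    printf '%s_%s.txt\n' "${file%.txt}" "$count"
)
```

Printing the target name first is a cheap dry run before committing to mv.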
You could use the rename command, which is actually a Perl script, as follows:
rename --dry-run 'my $fn=$_; open my $fh,"<$_"; while(<$fh>){}; $_=$fn; s/.txt$/-$..txt/' *txt
Sample Output
'tight_layout1.txt' would be renamed to 'tight_layout1-519.txt'
'tight_layout2.txt' would be renamed to 'tight_layout2-1122.txt'
'tight_layout3.txt' would be renamed to 'tight_layout3-921.txt'
'tight_layout4.txt' would be renamed to 'tight_layout4-1122.txt'
If you like what it says, remove the --dry-run and run again.
The script counts the lines in the file without using any external processes and then renames them as you ask, also without using any external processes, so it is quite efficient.
Or, if you are happy to invoke an external process to count the lines, and avoid the Perl method above:
rename --dry-run 's/\.txt$/-`grep -ch "^" "$_"` . ".txt"/e' *txt
Use rename command
for file in *.txt; do
lines=$(wc ${file} | awk {'print $1'});
rename s/$/${lines}/ ${file}
done
#!/bin/bash
files=$(find . -maxdepth 1 -type f -name '*.txt' -printf '%f\n')
for file in $files; do
lines=$(wc $file | awk {'print $1'});
extension="${file##*.}"
filename="${file%.*}"
mv "$file" "${filename}${lines}.${extension}"
done
You can adjust maxdepth accordingly.
You can also do it like this:
for file in "path_to_file"/'your_filename_pattern'
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
example:
for file in /oradata/SCRIPTS_EL/text*
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
This would work, but there are definitely more elegant ways.
for i in *.txt; do
mv "$i" ${i/.txt/}_$(wc $i | awk {'print $1'})_.txt;
done
Result would put the line numbers nicely before the .txt.
Like:
file1_1_.txt
file2_25_.txt
You could use grep -c '^' to get the number of lines, instead of wc and awk:
for file in *.txt; do
[[ ! -f $file ]] && continue # skip over entries that are not regular files
#
# move file.txt to file.txt.N where N is the number of lines in file
#
# this naming convention has the advantage that if we run the loop again,
# we will not reprocess the files which were processed earlier
mv "$file" "$file".$(grep -c '^' "$file")
done
{ linecount[FILENAME] = FNR }
END {
linecount[FILENAME] = FNR
for (file in linecount) {
newname = gensub(/\.[^\.]*$/, "-"linecount[file]"&", 1, file)
q = "'"; qq = "'\"'\"'"; gsub(q, qq, newname)
print "mv -i -v '" gensub(q, qq, "g", file) "' '" newname "'"
}
close(c)
}
Save the above awk script in a file, say wcmv.awk, then run it like this (it uses gensub, so it needs GNU awk):
awk -f wcmv.awk *.txt
It will list the commands that need to be run to rename the files in the required way (except that it will ignore empty files). To actually execute them you can pipe the output to a shell for execution as follows.
awk -f wcmv.awk *.txt | sh
As with all irreversible batch operations, be careful and execute the commands only if they look okay.
awk '
BEGIN{ for ( i=1;i<ARGC;i++ ) Files[ARGV[i]]=0 }
{Files[FILENAME]++}
END{for (file in Files) {
# if( file !~ "_" Files[file] ".txt$") {
fileF=file;gsub( /\047/, "\047\"\047\"\047", fileF)
fileT=fileF;sub( /.txt$/, "_" Files[file] ".txt", fileT)
system( sprintf( "mv \047%s\047 \047%s\047", fileF, fileT))
# }
}
}' *.txt
This is another way with awk. Doing the renaming in a second loop in the END block gives more control over the names, for example skipping a file that already carries its count from a previous run (the commented-out test).
Following a good remark from @gniourf_gniourf:
file names containing spaces are now handled
the tiny code has become heavy for such a small task

Incrementing number in filenames in bash

I'm trying to take a list of files and rename them, incrementing a number in their filenames. The directory contains a bunch of files named like:
senreg1.csv senreg2.csv senreg10.csv
senreg1.csv.1 senreg2.csv.1 senreg10.csv.1
senreg1.csv.2 senreg2.csv.2 senreg10.csv.2
senreg1.csv.3 senreg2.csv.3 ... senreg10.csv.3
senreg1.csv.4 senreg2.csv.4 senreg10.csv.4
... ... ...
senreg1.csv.10 senreg2.csv.10 senreg10.csv.10
senreg1.csv.11 senreg2.csv.11 senreg10.csv.11
I want to increment all of the files that end in 3 or higher so I can insert a new file with suffix 3, so I made a text file called 'renames.txt' containing all the filenames that I want to rename. Then, I tried using a for loop to do the actual renaming.
for f in `cat renames.txt`
do
newfile=`echo $f | awk 'BEGIN { FS = "."}; { printf $1 "." $2 "." $3+1 }'`
mv "$f" "$newfile"
done
I want to end up with something like:
senreg1.csv senreg2.csv senreg10.csv
senreg1.csv.1 senreg2.csv.1 senreg10.csv.1
senreg1.csv.2 senreg2.csv.2 senreg10.csv.2
senreg1.csv.4 senreg2.csv.4 ... senreg10.csv.4
senreg1.csv.5 senreg2.csv.5 senreg10.csv.5
... ... ...
senreg1.csv.11 senreg2.csv.11 senreg10.csv.11
senreg1.csv.12 senreg2.csv.12 senreg10.csv.12
But instead I get:
senreg1.csv senreg2.csv senreg10.csv
senreg1.csv.1 senreg2.csv.1 senreg10.csv.1
senreg1.csv.2 senreg2.csv.2 ... senreg10.csv.2
senreg1.csv.12 senreg2.csv.12 senreg10.csv.12
The contents of senregX.csv.12 are the same as the original senregX.csv.3. Hope this explanation made sense. Anybody know what's going on here?
You need to rename the files in reverse.
11 -> 12
10 -> 11
9 -> 10
and so on.
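
Renaming from the highest suffix down can be sketched as follows. The function name shift_suffixes is hypothetical, and the sketch assumes the suffixes are contiguous from the starting number upwards.

```shell
# shift_suffixes BASE START: rename BASE.N to BASE.(N+1) for every N >= START,
# working from the highest suffix down so that no file is overwritten.
shift_suffixes() (
    base=$1
    start=$2
    # Find the highest existing suffix.
    max=$start
    while [ -e "$base.$((max + 1))" ]; do
        max=$((max + 1))
    done
    n=$max
    while [ "$n" -ge "$start" ]; do
        if [ -e "$base.$n" ]; then
            mv -- "$base.$n" "$base.$((n + 1))"
        fi
        n=$((n - 1))
    done
)
```

For the files in the question, a loop such as for i in 1 2 3 4 5 6 7 8 9 10; do shift_suffixes "senreg$i.csv" 3; done would free up suffix 3 in every series.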
This script does what you want, without temporary files, just to add to the diversity of solutions:
#!/bin/bash
for file in $(ls -1 *[0-9]) # list files ending with a number
do
# get file name and id
name=$(echo $file | sed 's/\(.*\)\.\([0-9]\+\)$/\1/g');
id=$(echo $file | sed 's/.*\.\([0-9]\+\)$/\1/g');
if [ $id -ge 3 ]
then
((id += 1))
# We need to backup the files because we may override some files
cp $file "_$name.$id"
fi
done
# remove old files
for file in $(ls -1 [!_]*[0-9])
do
id=$(echo $file | sed 's/.*\.\([0-9]\+\)$/\1/g');
if [ $id -ge 3 ]
then
rm $file;
fi
done
# finish
for file in $(ls -1 _*[0-9])
do
name=$(echo $file | tr -d '_');
mv "$file" "$name";
done

Bash script to limit a directory size by deleting files accessed last

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
1) check the size of a directory /PATH/TO/FILES
2) if size in 1) is greater than X size, get a list of the files by access date
3) delete files in order until size is less than X
The benefit here is for cache and backup directories, I will only delete what I need to to keep it within a limit, whereas the simplified method might go over size limit if one day is particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
echo "usage: $0 [directory] [maxsize in megabytes]" >&2
exit 1
fi
find "$DELETEDIR" -type f -printf "%T@::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
BEGIN { curSize=0; }
{
curSize += $3;
if (curSize > maxbytes) { print $2; }
}
' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy to read and understand method I came up with to do this:
DIRSIZE=$(du -sk /PATH/TO/FILES | awk '{print $1}')   # size in KB
if [ "$DIRSIZE" -gt "$LIMITSIZE" ]   # LIMITSIZE is in KB as well
then
    for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do
        FILESIZE=`stat -c "%s" "$f"`
        FILESIZE=$(($FILESIZE/1024))
        rm -- "$f"   # delete the least recently accessed file
        DIRSIZE=$(($DIRSIZE - $FILESIZE))
        if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
            break
        fi
    done
fi
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
| sed 's/ /\\ /g' \
| xargs stat -f "%a::%z::%N" \
| sort -r \
| awk -v maxSize="$X_SIZE" '
BEGIN{curSize=0; FS="::"}
{curSize += $2}
curSize > maxSize{print $3}
' \
| sed 's/ /\\ /g' \
| xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. Then the awk script adds up all the sizes as it goes through the list, and begins outputting file names once the running total gets over $X_SIZE. The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces and then to xargs, which runs rm on them.
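
The same pipeline can be sketched for GNU tools, where find's -printf takes the place of the BSD stat -f call. The function name prune_by_atime and its arguments are hypothetical; the sketch assumes GNU find and file names without tabs or newlines.

```shell
# prune_by_atime DIR MAX_BYTES: delete the least recently accessed *.tar files
# under DIR until the files that remain add up to at most MAX_BYTES.
prune_by_atime() {
    # One record per file: access time (epoch seconds), size in bytes, path.
    find "$1" -type f -name '*.tar' -printf '%A@\t%s\t%p\n' |
    sort -rn |                       # most recently accessed first
    awk -F'\t' -v max="$2" '
        { total += $2 }              # running size of the files seen so far
        total > max { print $3 }     # past the limit: this file has to go
    ' |
    while IFS= read -r path; do
        rm -- "$path"
    done
}
```

Replacing rm with echo in the last loop gives a dry run that only lists what would be deleted.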
