Find files with identical content [duplicate] - bash

Answer to my question, using Kubator's command line:
#Function that shows the files having the same content in the current directory
showDuplicates (){
    last_file=''
    while read -r f1_hash f1_name; do
        if [ "$last_file" != "$f1_hash" ]; then
            echo "The following files have the exact same content :"
            echo "$f1_name"
            while read -r f2_hash f2_name; do
                if [ "$f1_hash" == "$f2_hash" ] && [ "$f1_name" != "$f2_name" ]; then
                    echo "$f2_name"
                fi
            done < <(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
        fi
        last_file="$f1_hash"
    done < <(find ./ -maxdepth 1 -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D)
}
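A quick way to sanity-check the function in a scratch directory (file names made up; source the function first):
mkdir -p /tmp/dup_demo && cd /tmp/dup_demo || exit
printf 'same\n'  > a.txt
printf 'same\n'  > b.txt
printf 'other\n' > c.txt
showDuplicates
# expected: one group listing ./a.txt and ./b.txt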
Original question:
I've seen some discussions about what I'm about to ask, but I have trouble understanding the mechanics behind the proposed solutions and have not been able to solve my problem, which follows.
I want to make a function to compare files. For that, naively, I tried the following:
#somewhere I use that to get the files paths
files_to_compare=$(find $base_path -maxdepth 1 -type f)
files_to_compare=( $files_to_compare )
#then I pass files_to_compare as an argument to the following function
showDuplicates (){
    files_to_compare=${1}
    n_files=$(( ${#files_to_compare[@]} ))
    for (( i=0; i < $n_files ; i=i+1 )); do
        for (( j=i+1; j < $n_files ; j=j+1 )); do
            sameContent "${files_to_compare[i]}" "${files_to_compare[j]}"
            r=$?
            if [ $r -eq 1 ]; then
                echo "The following files have the same content :"
                echo ${files_to_compare[i]}
                echo ${files_to_compare[j]}
            fi
        done
    done
}
The function 'sameContent' takes the absolute paths of two files and uses different commands (du, wc, diff) to return 1 or 0 depending on whether the files have the same content.
The incorrectness of that code showed up with file names containing spaces, but I've since read that this is not the way to manipulate files in bash.
On https://unix.stackexchange.com/questions/392393/bash-moving-files-with-spaces and some other pages, I've read that the correct way to go is to use code that looks like this:
$ while IFS= read -r file; do echo "$file"; done < files
I can't seem to understand what lies behind that bit of code or how I could use it to solve my problem, particularly because I want/need to use nested loops.
I'm new to bash and this seems to be a common problem, but if someone were kind enough to give me some insight into how it works, that would be wonderful.
p.s.: please excuse the probable grammar mistakes

How about using md5sum to compare the content of the files in your folder instead? That's way safer and the standard approach. Then you would only need something like this:
find ./ -type f -print0 | xargs -0 md5sum | sort -k1,32 | uniq -w32 -D
What it does:
find finds all files (-type f) in the current folder (./) and separates the output with null bytes (-print0); that's needed for special characters like spaces in file names (as in the moving-files-with-spaces case you mention)
xargs takes the null-byte-separated output from find (-0) and runs md5sum on the files
sort sorts the output by positions 1-32 (which hold the md5 hash): -k1,32
uniq makes the output unique by the first 32 characters (the md5 hash, -w32) and keeps only the duplicated lines (-D)
Output example:
7a2e203cec88aeffc6be497af9f4891f ./file1.txt
7a2e203cec88aeffc6be497af9f4891f ./folder1/copy_of_file1.txt
e97130900329ccfb32516c0e176a32d5 ./test.log
e97130900329ccfb32516c0e176a32d5 ./test_copy.log
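If you'd rather keep the nested loops from your original attempt, the same NUL-delimited stream can be loaded into an array first; a minimal sketch (assumes bash 4.4+ for mapfile -d ''):
# Read NUL-delimited file names into an array; quoting "${files[i]}"
# keeps names containing spaces intact inside the loops.
mapfile -d '' files < <(find . -maxdepth 1 -type f -print0)
for (( i = 0; i < ${#files[@]}; i++ )); do
    for (( j = i + 1; j < ${#files[@]}; j++ )); do
        echo "comparing ${files[i]} and ${files[j]}"
    done
done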
If performance is crucial, this can be tuned to sort by file size first and compare md5sums only on collisions, or extended to call mv, rm, etc.
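For example, a sketch of that size-first idea (assumes GNU tools and file names without tabs or newlines):
# Hash only files whose size occurs more than once; a file with a
# unique size cannot have a duplicate, so it is never read.
list=$(find . -type f -printf '%s\t%p\n')
awk -F '\t' 'NR==FNR { cnt[$1]++; next } cnt[$1] > 1 { print $2 }' \
    <(printf '%s\n' "$list") <(printf '%s\n' "$list") \
    | xargs -d '\n' -r md5sum \
    | sort -k1,32 | uniq -w32 -D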

Related

For Loop: Identify Filename Pairs, Input to For Loop

I am attempting to adapt a previously answered question for use in a for loop.
I have a folder containing multiple paired file names that need to be provided sequentially as input to a for loop.
Example Input
WT1_0min-SRR9929263_1.fastq
WT1_0min-SRR9929263_2.fastq
WT1_20min-SRR9929265_1.fastq
WT1_20min-SRR9929265_2.fastq
WT3_20min-SRR12062597_1.fastq
WT3_20min-SRR12062597_2.fastq
Paired file names can be identified with the answer from the previous question:
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq
I now want to adapt this for use in a for loop so that each output file can be independently piped to subsequent commands and so that output file names can be appended.
Input files can be provided as a comma-separated list after the -1 and -2 flags respectively. So for this example, the bulk (and undesired) input would be:
-1 WT1_0min-SRR9929263_1.fastq,WT1_20min-SRR9929265_1.fastq,WT3_20min-SRR12062597_1.fastq
-2 WT1_0min-SRR9929263_2.fastq,WT1_20min-SRR9929265_2.fastq,WT3_20min-SRR12062597_2.fastq
However, I would like to run this as a for loop so that input files are provided sequentially:
Iteration #1
-1 WT1_0min-SRR9929263_1.fastq
-2 WT1_0min-SRR9929263_2.fastq
Iteration #2
-1 WT1_20min-SRR9929265_1.fastq
-2 WT1_20min-SRR9929265_2.fastq
Iteration #3
-1 WT3_20min-SRR12062597_1.fastq
-2 WT3_20min-SRR12062597_2.fastq
Below is an example of the for loop I would like to run, using the xargs code to pull file names. It currently does not work. I assume I need to somehow save the paired file names from the xargs code as a variable that can be referenced in the for loop?
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq
for file in *.fastq
do
    bowtie2 -p 8 -x /path/genome \
        -1 {}_1.fastq \
        -2 {}_2.fastq \
        "../path/${file%%.fastq}_UnMappedReads.fastq.gz" \
        2> "../path/${file%%.fastq}_Bowtie2_log.txt" | samtools view -@ 7 -b | samtools sort -@ 7 -m 5G -o "../path/${file%%.fastq}_Mapped.bam"
done
The expected outputs for the example would be:
WT1_0min-SRR9929263_UnMappedReads.fastq.gz
WT1_20min-SRR9929265_UnMappedReads.fastq.gz
WT3_20min-SRR12062597_UnMappedReads.fastq.gz
WT1_0min-SRR9929263_Bowtie2_log.txt
WT1_20min-SRR9929265_Bowtie2_log.txt
WT3_20min-SRR12062597_Bowtie2_log.txt
WT1_0min-SRR9929263_Mapped.bam
WT1_20min-SRR9929265_Mapped.bam
WT3_20min-SRR12062597_Mapped.bam
I don't know what "bowtie2" or "samtools" are but best I can tell all you need is:
#!/usr/bin/env bash
for file1 in *_1.fastq; do
    file2="${file1%_1.fastq}_2.fastq"
    echo "$file1" "$file2"
done
Replace echo with whatever you want to do with the pair of files.
If you HAD to use find for some reason then it'd be:
#!/usr/bin/env bash
while IFS= read -r file1; do
    file2="${file1%_1.fastq}_2.fastq"
    echo "$file1" "$file2"
done < <(find . -type f -name '*_1.fastq' -print)
or if your file names can contain newlines then:
#!/usr/bin/env bash
while IFS= read -r -d $'\0' file1; do
    file2="${file1%_1.fastq}_2.fastq"
    echo "$file1" "$file2"
done < <(find . -type f -name '*_1.fastq' -print0)
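Adapted to the bowtie2 call from the question, the glob version might look like the sketch below; the -1/-2 flags are taken from the question, while --un-conc-gz for the unmapped-read output is my assumption, so check it against the bowtie2 manual:
#!/usr/bin/env bash
for file1 in *_1.fastq; do
    file2="${file1%_1.fastq}_2.fastq"
    base="${file1%_1.fastq}"
    bowtie2 -p 8 -x /path/genome \
        -1 "$file1" -2 "$file2" \
        --un-conc-gz "../path/${base}_UnMappedReads.fastq.gz" \
        2> "../path/${base}_Bowtie2_log.txt" \
        | samtools view -@ 7 -b \
        | samtools sort -@ 7 -m 5G -o "../path/${base}_Mapped.bam"
done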

How to get list of certain strings in a list of files using bash?

The title is maybe not really descriptive, but I couldn't find a more concise way to describe the problem.
I have a directory containing different files which have a name that e.g. looks like this:
{some text}2019Q2{some text}.pdf
So the filenames have somewhere in the name a year followed by a capital Q and then another number. The other text can be anything, but it won't contain anything matching the format year-Q-number. There will also be no numbers directly before or after this format.
I can work something out to get this from one filename, but I actually need a 'list' so I can do a for-loop over this in bash.
So, if my directory contains the files:
costumerA_2019Q2_something.pdf
costumerB_2019Q2_something.pdf
costumerA_2019Q3_something.pdf
costumerB_2019Q3_something.pdf
costumerC_2019Q3_something.pdf
costumerA_2020Q1_something.pdf
costumerD2020Q2something.pdf
I want a for loop that goes over 2019Q2, 2019Q3, 2020Q1, and 2020Q2.
EDIT:
This is what I have so far. It is able to extract the substrings, but it still contains duplicates, and since I'm already inside the loop I don't see how I can remove them.
find original/*.pdf -type f -print0 | while IFS= read -r -d '' line; do
    echo "$line" | grep -oP '[0-9]{4}Q[0-9]'
done
# list all _filenames_ that end with .pdf from the folder original
find original -maxdepth 1 -name '*.pdf' -type f -printf '%f\n' |
# extract the pattern
sed -E 's/.*([0-9]{4}Q[0-9]).*/\1/' |
# iterate
while IFS= read -r file; do
    echo "$file"
done
I used -printf '%f\n' to print just the file name instead of the full path. GNU sed has a -z option that you can use with -print0 (or -printf '%f\0').
Given how you wanted to do this: if your file names have no newlines in them, there is no need to loop over the list in bash (as a rule of thumb, try to avoid while read line; it's very slow):
find original -maxdepth 1 -name '*.pdf' -type f | grep -oP '[0-9]{4}Q[0-9]'
or with a zero-separated stream:
find original -maxdepth 1 -name '*.pdf' -type f -print0 |
grep -zoP '[0-9]{4}Q[0-9]' | tr '\0' '\n'
If you want to remove duplicate elements from the list, pipe it to sort -u.
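Putting it together, the for loop asked for could be a sketch like this (word splitting is safe here because the extracted quarters contain no whitespace):
for quarter in $(find original -maxdepth 1 -name '*.pdf' -type f -printf '%f\n' \
        | grep -oP '[0-9]{4}Q[0-9]' | sort -u); do
    echo "processing $quarter"
done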
Try this, in bash:
~ > $ ls
costumerA_2019Q2_something.pdf costumerB_2019Q2_something.pdf
costumerA_2019Q3_something.pdf other.pdf
costumerA_2020Q1_something.pdf someother.file.txt
~ > $ for x in `(ls)`; do [[ ${x} =~ [0-9]Q[1-4] ]] && echo $x; done;
costumerA_2019Q2_something.pdf
costumerA_2019Q3_something.pdf
costumerA_2020Q1_something.pdf
costumerB_2019Q2_something.pdf
~ > $ (for x in *; do [[ ${x} =~ ([0-9]{4}Q[1-4]).+pdf ]] && echo ${BASH_REMATCH[1]}; done;) | sort -u
2019Q2
2019Q3
2020Q1

renumbering image files to be contiguous in bash

I have a directory with image files that follow a naming scheme and are not always contiguous, e.g.:
IMG_33.jpg
IMG_34.jpg
IMG_35.jpg
IMG_223.jpg
IMG_224.jpg
IMG_225.jpg
IMG_226.jpg
IMG_446.jpg
I would like to rename them so they go something like this, in the same order:
0001.jpg
0002.jpg
0003.jpg
0004.jpg
0005.jpg
0006.jpg
0007.jpg
0008.jpg
So far this is what I came up with, and while it does the four-digit padding, it doesn't sort by the number values in the file names.
#!/bin/bash
X=1
for i in *; do
    mv $i $(printf %04d.%s ${X%.*} ${i##*.})
    let X="$X+1"
done
result:
IMG_1009.JPG 0009.JPG
IMG_1010.JPG 0010.JPG
IMG_101.JPG 0011.JPG
IMG_102.JPG 0012.JPG
Update:
Try this. If the output looks okay, remove the echo.
X=1
find . -maxdepth 1 -type f -name "*.jpg" -print0 |
    sort -z -n -t _ -k2 |
    while read -d $'\0' -r line; do
        echo mv "$line" "$(printf "%04d%s" $X .jpg)"
        (( X++ ))
    done
Using the super helpful rename: first pad files with one-digit numbers to two digits, then files with two-digit numbers to three digits, etc.
rename IMG_ IMG_0 IMG_?.jpg
rename IMG_ IMG_0 IMG_??.jpg
rename IMG_ IMG_0 IMG_???.jpg
Then your for loop (or a similar one) that renames does the trick, as the files are now in both alphabetical and numerical order.
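Put together, the whole thing could look like this sketch (assumes the util-linux flavor of rename used above):
rename IMG_ IMG_0 IMG_?.jpg
rename IMG_ IMG_0 IMG_??.jpg
rename IMG_ IMG_0 IMG_???.jpg
X=1
for i in IMG_*.jpg; do
    mv "$i" "$(printf '%04d.jpg' "$X")"
    (( X++ ))
done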
How about this:
while read f1; do
    echo "$f1"
    mv "IMG_$f1" "$f1"
done < <(ls | cut -d '_' -f 2 | sort -n)

Counting the number of files in a directory in bash

I have a bash script where I'm trying to find the number of files in a directory and perform an addition operation on it as well.
But while doing so, I get the following error:
admin> ./fileCount.sh
1
./fileCount.sh: line 6: 22 + : syntax error: operand expected (error token is " ")
My script is as shown:
#!/usr/bin/bash
Var1=22
Var2= ls /stud_data/Input_Data/test3 | grep ".txt" | wc -l
Var3= $(($Var1 + $Var2))
echo $Var3
Can anyone point out where the error is?
A little aside
As @devnull has already answered the question "point out where the error is", here are just some more ideas:
General unix
For this kind of browsing there is a very powerful command, find, that lets you find, recursively, exactly what you're searching for:
Var2=`find /stud_data/Input_Data/test3 -name '*.txt' | wc -l`
If you don't want this to be recursive:
Var2=`find /stud_data/Input_Data/test3 -maxdepth 1 -name '*.txt' | wc -l`
If you want files only (meaning no symlinks or directories):
Var2=`find /stud_data/Input_Data/test3 -maxdepth 1 -type f -name '*.txt' | wc -l`
And so on... Please read the man page: man find.
Particular bash solutions
As your question is about bash, there are some bashisms you could use to make this a lot quicker:
#!/bin/bash
Var1=22
VarLs=(/stud_data/Input_Data/test3/*.txt)
[ -e "$VarLs" ] && Var2=${#VarLs[@]} || Var2=0
Var3=$(( Var1 + Var2 ))
echo $Var3
# Uncomment next line to see more about current environment
# set | grep ^Var
Bash expansion translates /path/*.txt into an array containing all file names matching the wildcard pattern.
If no file matches the pattern, VarLs will contain only the pattern itself.
The -e test corrects for this: if the first file of the returned list exists, assign the number of elements in the list (${#VarLs[@]}) to Var2; otherwise, assign 0 to Var2.
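An alternative sketch uses nullglob, which makes the -e test unnecessary because the array is simply empty when nothing matches:
#!/bin/bash
Var1=22
shopt -s nullglob                        # unmatched glob expands to nothing
VarLs=(/stud_data/Input_Data/test3/*.txt)
Var2=${#VarLs[@]}
Var3=$(( Var1 + Var2 ))
echo $Var3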
Can anyone point out where the error is?
You shouldn't have spaces around =.
You probably wanted to use command substitution to capture the result in Var2.
Try:
Var1=22
Var2=$(ls /stud_data/Input_Data/test3 | grep ".txt" | wc -l)
Var3=$(($Var1 + $Var2))
echo $Var3
Moreover, you could also say
Var3=$((Var1 + Var2))

Bash script to limit a directory size by deleting files accessed last

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
check the size of a directory /PATH/TO/FILES
if size in 1) is greater than X size, get a list of the files by access date
delete files in order until size is less than X
The benefit here, for cache and backup directories, is that I will only delete what I need to in order to keep the directory within a limit, whereas the simplified method might go over the size limit if one day is particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
    echo "usage: $0 [directory] [maxsize in megabytes]" >&2
    exit 1
fi
find "$DELETEDIR" -type f -printf "%T@::%p::%s\n" \
    | sort -rn \
    | awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
        BEGIN { curSize=0; }
        {
            curSize += $3;
            if (curSize > maxbytes) { print $2; }
        }
    ' \
    | tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy to read and understand method I came up with to do this:
DIRSIZE=$(du -s /PATH/TO/FILES | awk '{print $1}')
if [ "$DIRSIZE" -gt "$SOMELIMIT" ]
then
    for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do
        FILESIZE=`stat -c "%s" "$f"`
        FILESIZE=$(($FILESIZE/1024))
        rm -f "$f"   # delete the least recently accessed file
        DIRSIZE=$(($DIRSIZE - $FILESIZE))
        if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
            break
        fi
    done
fi
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
    | sed 's/ /\\ /g' \
    | xargs stat -f "%a::%z::%N" \
    | sort -r \
    | awk -v maxsize="$X_SIZE" '
        BEGIN { curSize=0; FS="::" }
        { curSize += $2 }
        curSize > maxsize { print $3 }
    ' \
    | sed 's/ /\\ /g' \
    | xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. The awk script then adds up the sizes as it goes through the list and begins outputting file names once the running total exceeds $X_SIZE (passed in via -v maxsize, since shell variables are not expanded inside a single-quoted awk program). The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces, and then to xargs, which runs rm on them.
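If you're on GNU coreutils rather than BSD (where stat -f lives), the same idea might look like this sketch; %X/%s/%n are the GNU stat equivalents, and it assumes file names without newlines or '::' in them:
find /PATH/TO/FILES -name '*.tar' -type f -exec stat -c '%X::%s::%n' {} + \
    | sort -rn \
    | awk -v maxsize="$X_SIZE" -F '::' '
        { curSize += $2 }
        curSize > maxsize { print $3 }
    ' \
    | xargs -d '\n' -r rm --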
