identifying .mp3 not by name with shell script - bash

I would like to write a script that finds duplicate mp3's by content and not by file name. I am wondering how one goes about inspecting a file's inner data for the sake of comparison.
Thank you.

cmp can be used to compare binary files.
cmp file1.mp3 file2.mp3
if [[ $? -eq 0 ]]; then echo "Matched"; fi
The cmp command returns 0 if the files are identical, 1 if they differ, and 2 if an error occurs.
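For a whole directory, a minimal sketch built on cmp could look like this (quadratic in the number of files, so only practical for small collections; the loop is my illustration, not part of the answer):
#!/bin/bash
# Compare every pair of .mp3 files in the current directory byte by byte.
# cmp -s is silent and only reports the result via its exit status.
shopt -s nullglob
files=(*.mp3)
for ((i = 0; i < ${#files[@]}; i++)); do
    for ((j = i + 1; j < ${#files[@]}; j++)); do
        if cmp -s "${files[i]}" "${files[j]}"; then
            echo "Same content: ${files[i]} <-> ${files[j]}"
        fi
    done
done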

This first command line lists all files having same size and same md5sum from the current directory
find . -type f -printf '%11s ' -exec md5sum '{}' ';' |
sort | uniq -w44 --all-repeated=separate
The second command line is:
faster, because it calculates the md5sum only for files having the same size
more robust, because it handles filenames containing special characters such as spaces or newlines
Therefore it is also more complex
find . -type f -printf '%11s %P\0' |
LC_ALL=C sort -z |
uniq -Dzw11 |
while IFS= read -r -d '' line
do
md5sum "${line:12}"
done |
uniq -w32 --all-repeated=separate |
tee duplicated.log
Some explanations
# Print file size/md5sum/name in one line (size aligned in 11 characters)
find . -printf '%11s ' -exec md5sum '{}' ';'
# Print duplicated lines considering the first 44 characters only
# 44 characters = size (11 characters) + one space + md5sum (32 characters)
uniq -w44 --all-repeated=separate
# Print size and path/filename terminated by a null character
find . -printf '%11s %P\0'
# Sort lines separated by a null character (-z) instead of a newline character
# based on native byte values (LC_ALL=C) instead of the locale
LC_ALL=C sort -z
# Read lines separated by null character
IFS= read -r -d '' line
# Skip the first 12 characters (size and space)
# in order to obtain the rest: path/filename
"${line:12}"

If the files are really byte-for-byte equivalent, you can start by searching for files of the same size. If their size is the same, you can investigate further (e.g. compare their md5sum). If the files just contain the same song, but use a different codec/compression/whatever, bash is probably not the right tool for the task.
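As a minimal sketch of that size-first check for just two files (assuming GNU stat and the hypothetical names a.mp3 and b.mp3):
# Only bother hashing when the sizes already agree.
if [[ $(stat -c %s a.mp3) -eq $(stat -c %s b.mp3) ]]; then
    if [[ $(md5sum < a.mp3) == $(md5sum < b.mp3) ]]; then
        echo "a.mp3 and b.mp3 have identical content"
    fi
fi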

I use this script for my photos, but it can be used for other files.
First I transfer pictures from my phone/camera to directory newfiles
Then I run this script from my pictures root directory
On detection of duplicated files, the script keeps one file and moves the other ones to directory ../garbage
The script preferentially moves files from newfiles.
Caution: This script does not compare file content, but it detects files having the same size & name (this is OK for camera files). My other answer is based on content comparison (md5sum).
#!/bin/bash
# If a file from directory 'newfile' has same size & name
# that another file from another directory
# then moves the file from 'newfile' to 'garbage'
find newfiles/ -type f -printf '%s %f\n' |
while read SIZE f
do
    find . -name "$f" -size ${SIZE}c |
      grep -v 'newfiles' &&
    find . -name "$f" -size ${SIZE}c -path '*newfiles*' -exec mv -v '{}' ../garbage ';' &&
    echo
done
# Detect all other duplicated files
# Keep the first occurrence and moves all other to 'garbage'
find . -type f -printf '%s %f\n' |
LC_ALL=C sort |   # LC_ALL=C disables the locale => sort is faster
uniq -dc |        # keep duplicates and count the number of occurrences
while read n SIZE f
do
    echo -e "\n_____ $n files\t$SIZE bytes\tname: $f"
    find . -name "$f" -size ${SIZE}c |
      tail -n +2 |
      xargs mv -v -t ../garbage
done
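As a usage sketch of that workflow (the script name dedup.sh, the pictures root and the camera mount path are assumptions, not from the answer):
# Run from the pictures root directory; newfiles/ holds the freshly
# transferred photos and ../garbage must exist to receive duplicates.
cd ~/Pictures
mkdir -p ../garbage newfiles
cp /media/camera/DCIM/*.JPG newfiles/
./dedup.sh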

Related

How to grep files in date order

I can list the Python files in a directory from most recently updated to least recently updated with
ls -lt *.py
But how can I grep those files in that order?
I understand one should never try to parse the output of ls as that is a very dangerous thing to do.
You may use this pipeline to achieve this with GNU utilities:
find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
sort -z -t : -rnk1 |
cut -z -d : -f2- |
xargs -0 grep 'pattern'
This will handle filenames with special characters such as space, newline, glob etc.
find finds all *.py files in current directory and prints modification time (epoch value) + : + filename + NUL byte
sort command performs reverse numeric sort on first column that is timestamp
cut command removes 1st column (timestamp) from output
xargs -0 grep command searches pattern in each file
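As a usage sketch, the same pipeline wrapped in a small function (the function name and the TODO pattern are illustrative assumptions, not part of the answer):
grep_newest_first() {
    # grep a pattern in ./*.py, newest files first; -H keeps the file name
    # in the output even when only one file matches.
    local pattern=$1
    find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
        sort -z -t : -rnk1 |
        cut -z -d : -f2- |
        xargs -0 grep -H -- "$pattern"
}

grep_newest_first 'TODO'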
There is a very simple way if you just want the list of files that contain the pattern, in chronological order:
grep -sil <searchpattern> <files-to-grep> | xargs ls -ltr
i.e. you grep for e.g. "hello world" in *.txt; with -sil you make the grep case-insensitive (-i), suppress error messages (-s) and just list the matching files (-l); this you then pass on to ls (| xargs), which sorts by modification time (-t), in reverse order (-r), as a long listing (-l).
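For the example in that sentence, the concrete invocation would be:
grep -sil "hello world" *.txt | xargs ls -ltr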

Handle files with space in filename and output file names

I need to write a Bash script that achieve the following goals:
1) move the newest n pdf files from folder 1 to folder 2;
2) correctly handle files that could have spaces in their file names;
3) output each file name in a specific position in a text file. (In my actual usage, I will use sed to put the file names in a specific position of an existing file.)
I tried to make an array of filenames and then move them and do text output in a loop. However, the following array cannot handle files with spaces in filename:
pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 ls -1 -t | head -n$NUM))
Suppose a file has name "Filename with Space". What I get from the above array will have "with" and "Space" in separate array entries.
I am not sure how to avoid these words in the same filename being treated separately.
Can someone help me out?
Thanks!
-------------Update------------
Sorry for being vague on the third point as I thought I might be able to figure that out after achieving the first and second goals.
Basically, it is a text file that has a line starting with "%comment" near the end, and I will need to insert the filenames before that line in the format "file=PATH".
The PATH is the folder 2 that I have my pdfs moved to.
You can achieve this using mapfile in conjunction with gnu versions of find | sort | cut | head that have options to operate on NUL terminated filenames:
mapfile -d '' -t pdfs < <(find "$DOWNLOADS" -name '*.pdf' -printf '%T@:%p\0' |
sort -z -t : -rnk1 | cut -z -d : -f2- | head -z -n "$NUM")
Commands used are:
mapfile -d '': To read array with NUL as delimiter
find: outputs each file's modification stamp in EPOCH + ":" + filename + NUL byte
sort: sorts reverse numerically on 1st field
cut: removes 1st field from output
head: outputs only first $NUM filenames
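A hedged follow-up sketch for the remaining goals, assuming the destination directory is stored in a variable DEST (not part of the original answer):
# Move each selected file and print a file=PATH line for it.
for f in "${pdfs[@]}"; do
    mv -- "$f" "$DEST/"
    printf 'file=%s\n' "$DEST/$(basename "$f")"
done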
find downloads -name "*.pdf" -printf "%T@ %p\0" |
sort -z -t' ' -k1 -n |
cut -z -d' ' -f2- |
tail -z -n 3
find all *.pdf files in downloads
for each file print its modification date %T with the format specifier @, which means seconds since epoch with a fractional part, then print a space, the filename, and a terminating \0
Sort the NUL-separated stream numerically on the first field (the timestamp), using space as the field separator
Remove the first field (the date) from the stream, leaving only the filenames.
Get the newest files, in this example the 3 newest, by using tail. We could also do a reverse sort and use head, it makes no difference.
Don't use ls in scripts. ls is meant for nicely formatted interactive output, not for parsing. You could do xargs -0 stat --printf "%Y %n\0" instead, which would basically move your script forward. The only catch is that I couldn't make stat output the fractional part of the modification date.
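A sketch of what that stat-based substitution might look like in the same pipeline (my arrangement, not the answer's; %Y only gives whole seconds):
find downloads -name '*.pdf' -print0 |
    xargs -0 stat --printf '%Y %n\0' |
    sort -z -k1 -n |
    cut -z -d' ' -f2- |
    tail -z -n 3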
As for the second part, we need to save the NUL-delimited list to a file:
find downloads ........ >"$tmp"
and then:
str='%comment'
{
    grep -B$((2**32)) -x "$str" "$out" | grep -v "$str"
    # I don't know what you expect to do with newlines in filenames, but I guess you don't have those
    cat "$tmp" | sed -z 's/^/file=/' | sed 's/\x0/\n/g'
    grep -A$((2**32)) -x "$str" "$out"
} | sponge "$out"
where the output file name is stored in the variable "$out". The steps are:
filter all lines before the %comment line and remove the %comment line itself from the file
output each filename with file= prepended; I also replaced the NUL bytes with newlines
then filter all lines after %comment, including the %comment line itself
write the result back to the output file (sponge buffers all of its input before writing, which is why the same file can be read and then overwritten here)
Don't use pdfs=($(...)) on NUL-separated input. You can use mapfile to store the result in an array, as other answers show.
Then, to move the files, do something like:
<"$tmp" xargs -0 -i mv {} "$outdir"
or faster, with a single move:
{ cat <"$tmp"; printf "%s\0" "$outdir"; } | xargs -0 mv
or alternatively:
<"$tmp" xargs -0 sh -c 'outdir="$1"; shift; mv "$@" "$outdir"' -- "$outdir"
Live example at tutorialspoint.
I suppose the following code will be close to what you want:
IFS=$'\n' pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 -I{} ls -lt "{}" | tail -n +1 | head -n$NUM))
Then you can access the output through ${pdfs[0]}, ${pdfs[1]}, ...
Explanations
IFS=$'\n' makes the following command split words only on "\n".
The -I{} option tells xargs to substitute {} with each filename, so that it can be quoted as "{}".
tail -n +1 is a trick to suppress an error message saying "xargs: 'ls' terminated by signal 13".
Hope this helps.
Bash v4 has an option globstar; after enabling it, we can use ** to match zero or more subdirectories.
mapfile is a built-in command which reads lines into an indexed array variable. The -t option removes the trailing newline from each line.
shopt -s globstar
mapfile -t pdffiles < <(ls -t1 **/*.pdf | head -n"$NUM")
typeset -p pdffiles
for f in "${pdffiles[@]}"; do
echo "==="
mv "${f}" /dest/path
sed "/^%comment/i${f}=/dest/path" a-text-file.txt
done

Find lines longer than x characters and truncate for display

I want to run grep on HTML files to find out lines longer than x characters and truncate the display using grep.
What I know
To figure out lines longer than 100 characters in html files.
find . -name '*.html' -print | xargs grep -on '.\{100\}'
To find lines matching title and limit display by 40 characters with
find . -name '*.html' -print | xargs grep -onE '.{0,40}title.{0,40}'
What I don't know
How can I find out lines that exceed 100 characters and then display those lines by limited to 40 characters?
MVCE
I have a bunch of html files, which look like
$ cat 1.html
abcdefghijklmnopqrstuv12345675689
12345675689abcdefghijklmnopqrstuv
abcd1234
Now, I'd like to find out lines longer than 20 characters, and then cut the display to 15 characters only.
Expected output with favoretti solution
$ find . -name '*.html' -print | xargs grep -on '.\{20\}' | cut -c -15
./1.html:1:abcd
./1.html:2:1234
./2.html:1:abcd
./2.html:2:1234
First of all, it's worth mentioning that unless you're very confident you can treat your "HTML" files as a series of line-separated records, you should probably be using an HTML-aware tool (either standalone, or included in a scripting language).
Since you mentioned Awk in an earlier comment:
find . -name '*.html' -exec awk '
length($0) > 20 { print FILENAME, substr($0, 1, 15) }' {} +
This matches lines with length greater than 20 and prints the first 15 characters of each. I put the file name at the start; you can remove that if you like.
It's not clear whether you need find for a recursive search or not; if not, then you might be fine with letting the shell generate the list of files:
awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' *.html
And with globstar enabled (shopt -s globstar), you can use **/*.html for recursive matching in Bash.
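For instance, a sketch of that globstar variant (assuming Bash 4+):
shopt -s globstar
awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' **/*.html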
If for some reason you want to just use grep
find . -name '*.html' -exec grep -oP '.{40}(?=.{60})' {} /dev/null \;
The first grep works OK I suppose, so if you want to print out just 40 chars, pipe it through cut:
find . -name '*.html' -print | xargs grep -on '.\{100\}' | cut -c 1-40

Move files based on a comparison with a file

I have 1000 files with following names:
something-345-something.txt
something-5468-something.txt
something-100-something.txt
something-6200-something.txt
and a lot more...
And I have one txt file, with only numbers in it. f.e:
1000
500
5468
6200
699
and so on...
Now I would like to move all files, which have a number in their filenames which is in my txt file.
So in my example above the following files should be moved only:
something-5468-something.txt
something-6200-something.txt
Is there an easy way to achieve this?
What about moving the files on the fly by doing this:
for i in `cat your-file.txt`; do
    find . -iname "*-$i-*" -exec mv '{}' /target/dir \;
done
For every line in your text file, the find command will try to find only the files matching the pattern *-$i-* (e.g. something-6200-something.txt) and move them to your target dir.
Naive implementation: for file in $(ls); do grep $(echo -n $file | sed -nr 's/[^-]*-([0-9]+).*/\1/p') my-one-txt.txt && mv $file /tmp/somewhere; done
In English: For every file in the output of ls, parse the number part of the filename with sed and grep for it in your text file. grep returns a non-zero exit code if nothing is found, so mv is not executed in that case.
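A slightly more defensive sketch of the same idea (globbing instead of parsing ls, quoting everything, and requiring an exact whole-line match of the extracted number; the numbers file name numbers.txt is an assumption):
for file in *-*-*.txt; do
    # Extract the number between the first two dashes.
    num=$(sed -nr 's/[^-]*-([0-9]+).*/\1/p' <<< "$file")
    if [[ -n $num ]] && grep -qx "$num" numbers.txt; then
        mv -- "$file" /tmp/somewhere/
    fi
done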
Script file named move (executable):
#!/bin/bash
TARGETDIR="$1"
FILES=`find . -type f`  # build list of files
while read n            # read numbers from standard input
do                      # n contains a number => filter list of files by that number:
    echo "$FILES" | grep "\-$n-" | while read f
    do                  # move file that passed the filter because its name matches n:
        mv "$f" "$TARGETDIR"
    done
done
Use it like this:
cd directory-with-files
./move target-directory < number-list.txt
Here's a crazy bit of bash hackery
shopt -s extglob nullglob
mv -t /target/dir *-@($(paste -sd "|" numbers.txt))-*
That uses paste to join all the lines in your numbers file with pipe characters, then uses bash extended pattern matching to find the files matching any one of the numbers.
I assume mv from GNU coreutils for the -t option.
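With the numbers file from the question, the pattern would expand to something like:
mv -t /target/dir *-@(1000|500|5468|6200|699)-*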

Recursively find and open files

I want to search through all subdirectories and files to find files with a specific extension. When I find a file with the extension, I need to open it, find a specific string from within the file and store it within a txt file.
This is what I have so far for finding all of the correct files:
find . -name ".ext" ! -path './ExcludeThis*'
This is what I have for opening the file and getting the part of the file I want and storing it:
LINE=$(head .ext | grep search_string)
SUBSTR=$(echo $LINE | cut -f2 -d '"')
echo $SUBSTR >> results.txt
I am struggling with how to combine the two. I have looked at 'for f in **/*' and then running an if statement inside to see if the name matches the .ext, removing the need for find altogether, but **/* seems to work on directories only and not on files.
A breakdown of any solutions would be very much appreciated too, as I am new to shell scripting. Thanks.
find -name "*.ext" \! -path './ExcludeThis*' -exec head -q '{}' \+ |
grep search_string | cut -f2 -d'"' >> results.txt
find explanation
find -name "*.ext" \! -path './ExcludeThis*' -exec head -q '{}' \+
For each file name matched, executes head (with \+, the command line is built by appending each selected file name at the end, so the total number of invocations of the command will be much less than the number of matched files).
Notice I replaced .ext with *.ext (the first form just matches a file named exactly .ext), and ! with \! (protecting it from interpretation by the shell).
The head option -q is necessary because that command prints headers when used with multiple files (due to \+ in this case).
In addition, if no path is given, find uses the current directory (.) by default, i.e. find . -name is equivalent to find -name.
pipeline explanation
<find ... -exec head> | grep search_string | cut -f2 -d'"' >> results.txt
While head writes the first lines (10 by default) of every file into the pipe, grep reads them.
If grep matches search_string in some of them, it writes those lines to the next pipe.
At the same time, cut takes the second field (delimited by ") of every matching line and appends it to results.txt.
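If you also want to keep the name of the file each value came from, a possible variant (a sketch that swaps the head | grep | cut stages for a single awk, scanning whole files rather than only their first 10 lines) is:
find . -name '*.ext' \! -path './ExcludeThis*' \
    -exec awk -F'"' '/search_string/ { print FILENAME, $2 }' '{}' \+ >> results.txt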

Resources