Removing Duplicate Files in Unix


I want to delete duplicate files and, at the same time, create a symbolic link in place of each removed duplicate. So far I can display the duplicate files; the problem is the removal and the linking, since I want to retain one copy of each file.
find "$@" -type f -print0 | xargs -0 -n1 md5sum | sort --key=1,32 | uniq -w 32 -d --all-repeated=separate
Output:
1463b527b1e7ed9ed8ef6aa953e9ee81 ./tope5final
1463b527b1e7ed9ed8ef6aa953e9ee81 ./Tests/tope5
2a6dfec6f96c20f2c2d47f6b07e4eb2f ./tope3final
2a6dfec6f96c20f2c2d47f6b07e4eb2f ./Tests/tope3
5baa4812f4a0838dbc283475feda542a ./tope1bfinal
5baa4812f4a0838dbc283475feda542a ./Tests/tope1b
69d7799197049b64f8675ed4500df76c ./tope3afinal
69d7799197049b64f8675ed4500df76c ./Tests/tope3a
945fe30c545fc0d7dc2d1cb279cf9c04 ./Tests/butter6
945fe30c545fc0d7dc2d1cb279cf9c04 ./Tests/tope6
98340fa2af27c79da7efb75ae7c01ac6 ./tope2cfinal
98340fa2af27c79da7efb75ae7c01ac6 ./Tests/tope2c
d15df73b8eaf1cd237ce96d58dc18041 ./tope1afinal
d15df73b8eaf1cd237ce96d58dc18041 ./Tests/tope1a
d5ce8f291a81c1e025d63885297d4b56 ./tope4final
d5ce8f291a81c1e025d63885297d4b56 ./Tests/tope4
ebde372904d6d2d3b73d2baf9ac16547 ./tope1cfinal
ebde372904d6d2d3b73d2baf9ac16547 ./Tests/tope1c
In this case, for example, I want to delete ./tope1cfinal and keep ./Tests/tope1c. After deleting, I also want to create a symbolic link named ./tope1cfinal pointing to ./Tests/tope1c.

One possibility: create an associative array whose keys are the md5sums and whose values are the first file found with each checksum (the one that won't be deleted). Each time an md5sum is already present in this array, the file is deleted and replaced by a link to the stored file (after checking that the file to delete isn't the original file itself). The script takes the directories to search as arguments; with no arguments, the search is performed inside the current directory.
#!/bin/bash
shopt -s globstar nullglob
(($#==0)) && set .
declare -A md5sum=() || exit 1   # associative arrays need bash 4 or later
while (($#)); do
    # skip empty arguments (shift first, otherwise the loop never advances)
    [[ $1 ]] || { shift; continue; }
    for file in "$1"/**/*; do
        [[ -f $file ]] || continue
        h=$(md5sum < "$file") || continue
        read -r h _ <<< "$h" # This line is optional: it removes the hyphen from the md5sum output
        if [[ ${md5sum[$h]} ]]; then
            # already seen this md5sum
            [[ "$file" -ef "${md5sum[$h]}" ]] && continue # prevent unwanted removal!
            rm -- "$file" || continue
            ln -rs -- "${md5sum[$h]}" "$file"
        else
            # first time seeing this file
            md5sum[$h]=$file
        fi
    done
    shift
done
(Untested; use at your own risk!)
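Before letting a script like this delete anything, it may help to do a dry run first. The following is only a sketch of the same associative-array idea that prints "duplicate -> original" pairs instead of removing files. The function name list_duplicates is made up for illustration, and the sketch assumes bash 4 or later plus GNU md5sum, find -print0 and sort -z.

```bash
#!/usr/bin/env bash
# Dry run: report duplicates without deleting or linking anything.
# list_duplicates is a hypothetical helper, not part of the script above.
list_duplicates() {
    local -A seen=()      # checksum -> first file seen with that checksum
    local file sum
    while IFS= read -r -d '' file; do
        sum=$(md5sum < "$file") || continue
        sum=${sum%% *}    # keep only the hex digest
        if [[ -n ${seen[$sum]:-} ]]; then
            printf '%s -> %s\n' "$file" "${seen[$sum]}"
        else
            seen[$sum]=$file
        fi
    done < <(find "${1:-.}" -type f -print0 | sort -z)
}
```

Calling list_duplicates ./Tests would list every file under ./Tests whose content matches an earlier file, in sorted order, so the file that sorts first is the one treated as the original.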

Related

How can I check whether a file matching a "template" exists in the directory?

Given a variable named template, for example: template=*.txt.
How can I check whether files with names matching this template exist in the current directory?
For example, with the value of template above, I want to know whether there are files with the suffix .txt in the current directory.

I would do it like this with just built-ins:
templcheck () {
for f in * .*; do
[[ -f $f ]] && [[ $f = $1 ]] && return 0
done
return 1
}
This takes the template as an argument (must be quoted to prevent premature expansion) and returns success if there was a match in the current directory. This should work for any filenames, including those with spaces and newlines.
Usage would look like this:
$ ls
file1.txt 'has space1.txt' script.bash
$ templcheck '*.txt' && echo yes
yes
$ templcheck '*.md' && echo yes || echo no
no
To use with the template contained in a variable, that expansion has to be quoted as well:
templcheck "$template"
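A shorter variant is possible with the compgen builtin, whose -G option expands a glob pattern and fails when nothing matches. This is only a sketch: template_exists is a made-up name, and unlike templcheck it does not restrict the match to regular files (a matching directory also counts) and it skips dot files unless dotglob is set.

```bash
#!/usr/bin/env bash
# template_exists is a hypothetical helper name.
# Succeeds if at least one path in the current directory matches the glob in $1.
template_exists() {
    compgen -G "$1" > /dev/null
}
```

Usage is the same as above: template_exists "$template" && echo yes.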

Use find:
: > found.txt # Ensure the file is empty
# Note: found.txt itself lives in the current directory, so a template such as *.txt will match it
find . -maxdepth 1 -name "$template" > found.txt
if [ -s found.txt ]; then
    echo "Matching files found"
else
    echo "No matching files"
fi
Strictly speaking, you can't assume that found.txt contains exactly one file name per line; a filename with an embedded newline will look the same as two separate files. But this does guarantee that an empty file means no matching files.
If you want an accurate list of matching file names, you need to disable field splitting while keeping pathname expansion.
[[ -v IFS ]] && OLD_IFS=$IFS
IFS=
shopt -s nullglob
files=( $template )
if [[ -v OLD_IFS ]]; then IFS=$OLD_IFS; else unset IFS; fi
printf "Found: %s\n" "${files[@]}"
This requires several bash extensions (the nullglob option, arrays, and the -v operator for convenience of restoring IFS). Each element of the array is exactly one match.
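The save-and-restore bookkeeping can be avoided by doing the expansion inside a function, where local IFS= is undone automatically on return. The following is a sketch under that assumption; glob_matches and glob_matches_ret are made-up names. Note the eval: with IFS empty, an unquoted $restore_nullglob would not be split into words.

```bash
#!/usr/bin/env bash
# glob_matches is a hypothetical helper; matches land in the array glob_matches_ret.
glob_matches() {
    local IFS=                              # no field splitting; restored on return
    local restore_nullglob
    restore_nullglob=$(shopt -p nullglob)   # command that restores the current setting
    shopt -s nullglob
    glob_matches_ret=( $1 )                 # unquoted on purpose: pathname expansion only
    eval "$restore_nullglob"
}
```

After glob_matches "$template", the test (( ${#glob_matches_ret[@]} > 0 )) tells whether anything matched.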

bash script not filtering

I'm hoping this is a simple question, since I've never done shell scripting before. I'm trying to filter certain files out of a list of results. While the script executes and prints out a list of files, it's not filtering out the ones I don't want. Thanks for any help you can provide!
#!/bin/bash
# Purpose: Identify all *md files in H2 repo where there is no audit date
#
#
#
# Example call: no_audits.sh
#
# If that call doesn't work, try ./no_audits.sh
#
# NOTE: Script assumes you are executing from within the scripts directory of
# your local H2 git repo.
#
# Process:
# 1) Go to H2 repo content directory (assumption is you are in the scripts dir)
# 2) Use for loop to go through all *md files in each content sub dir
# and list all file names and directories where audit date is null
#
#set counter
count=0
# Go to content directory and loop through all 'md' files in sub dirs
cd ../content
FILES=`find . -type f -name '*md' -print`
for f in $FILES
do
if [[ $f == "*all*" ]] || [[ $f == "*index*" ]] ;
then
# code to skip
echo " Skipping file: " $f
continue
else
# find audit_date in file metadata
adate=`grep audit_date $f`
# separate actual dates from rest of the grepped line
aadate=`echo $adate | awk -F\' '{print $2}'`
# if create date is null - proceed
if [[ -z "$aadate" ]] ;
then
# print a list of all files without audit dates
echo "Audit date: " $aadate " " $f;
count=$((count+1));
fi
fi
done
echo $count " files without audit dates "

First, to address the immediate issue:
[[ $f == "*all*" ]]
is only true if the exact contents of f is the string *all* -- with the wildcards as literal characters. If you want to check for a substring, then the asterisks shouldn't be quoted:
[[ $f = *all* ]]
...is a better-practice solution. (Note the use of = rather than == -- this isn't essential, but is a good habit to be in, as the POSIX test command is only specified to permit = as a string comparison operator; if one writes [ "$f" == foo ] by habit, one can get unexpected failures on platforms with a strictly compliant /bin/sh).
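To make the difference concrete, here is a small sketch; the path and the function name is_skipped are invented for illustration.

```bash
#!/usr/bin/env bash
# is_skipped is a hypothetical helper: true if the name contains "all" or "index".
is_skipped() {
    [[ $1 = *all* || $1 = *index* ]]
}

f='./guide/all-topics.md'   # made-up example path

if [[ $f = "*all*" ]]; then
    echo 'quoted pattern matched'      # not reached: the right side is compared literally
fi
if is_skipped "$f"; then
    echo 'unquoted pattern matched'    # reached: *all* is treated as a glob pattern
fi
```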
That said, a ground-up implementation of this script intended to follow best practices might look more like the following:
#!/usr/bin/env bash
count=0
while IFS= read -r -d '' filename; do
aadate=$(awk -F"'" '/audit_date/ { print $2; exit; }' <"$filename")
if [[ -z $aadate ]]; then
(( ++count ))
printf 'File %q has no audit date\n' "$filename"
else
printf 'File %q has audit date %s\n' "$filename" "$aadate"
fi
done < <(find . -not '(' -name '*all*' -o -name '*index*' ')' -type f -name '*md' -print0)
echo "Found $count files without audit dates" >&2
Note:
An arbitrary list of filenames cannot be stored in a single bash string (because all characters that might otherwise be used to determine where the first name ends and the next name begins could be present in the name itself). Instead, read one NUL-delimited filename at a time -- emitted with find -print0, read with IFS= read -r -d ''; this is discussed in BashFAQ #1.
Filtering out unwanted names can be done internal to find.
There's no need to preprocess input to awk using grep, as awk is capable of searching through input files itself.
< <(...) is used to avoid the behavior in BashFAQ #24, wherein content piped to a while loop causes variables set or modified within that loop to become unavailable after its exit.
printf '...%q...\n' "$name" is safer than echo "...$name..." when handling unknown filenames, as printf will emit printable content that accurately represents those names even if they contain unprintable characters or characters which, when emitted directly to a terminal, act to modify that terminal's configuration.
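If the names are needed after the loop, the same NUL-delimited read can fill an array instead. This is a sketch only; collect_md_files and md_files are made-up names, and it assumes a find that supports -print0.

```bash
#!/usr/bin/env bash
# collect_md_files is a hypothetical helper; results land in the array md_files.
collect_md_files() {
    md_files=()
    local name
    while IFS= read -r -d '' name; do
        md_files+=( "$name" )      # one array element per file, whatever the name contains
    done < <(find "${1:-.}" -type f -name '*.md' -print0)
}
```

Because each name is its own array element, a file whose name contains a newline still counts as one entry, and printf '%q\n' "${md_files[@]}" prints the names in an unambiguous form.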

Never mind, I found the answer here:
bash script to check file name begins with expected string
I tried various versions of the wildcard/filename and ended up with:
if [[ "$f" == *all.md ]] || [[ "$f" == *index.md ]] ;
The link above said not to put those in quotes, and removing the quotes did the trick!

bash string length in a loop

I am looping through a folder and, depending on the length of each file, I want to run certain commands, but I can't get it to work. I can evaluate and output the length of a string in the terminal:
echo $file|wc -c gives me the answer for all files in the terminal.
But incorporating this into a loop fails:
for file in `*.zip`; do
if [[ echo $file|wc -c ==9]]; then
some commands
where I want to operate on files that have a length of nine characters

Try this one:
for file in *.zip ; do
wcout=$(wc -c "$file")
if [[ ${wcout%% *} -eq 9 ]] ; then
# some commands
fi
done
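The %% operator in ${wcout%% *} removes the longest suffix that matches the pattern following it; here the pattern is a space followed by *, so everything from the first space onward is dropped and only the count remains. The pattern is a glob, not a regular expression.
Contrary to what programmers coming from other languages might expect, the == operator inside [[ ]] compares strings, not numbers; use -eq for a numeric comparison.
Note that wc -c "$file" counts the bytes in the file's contents, not the characters in its name.
Alternatively (following a comment), redirect the file into wc so that only the count is printed: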
for file in *.zip ; do
wcout=$(wc -c < "$file")
if [[ ${wcout} -eq 9 ]] ; then
# some commands
fi
done
An additional observation: if bash cannot expand *.zip because there are no ZIP files in the current directory, it passes the literal string "*.zip" into $file and runs a single iteration of the loop. That leads to an error reported by the wc command. So it is recommended to add:
if [[ -e ${file} ]] ; then ...
as a safeguard.
The comments lead to another form of this solution (with my safeguard added):
for file in *.zip ; do
    if [[ -e $file ]] && (( $(wc -c < "$file") == 9 )) ; then
        # some commands
    fi
done

Using a filter outside the loop:
ls -1 *.zip \
| grep -E '^.{9}$' \
| while read FileName
do
# Your action
done
Using a filter inside the loop:
ls -1 *.zip \
| while read FileName
do
if [ ${#FileName} -eq 9 ]
then
# Your action
fi
done
An alternative to ls -1, which is always a bit dangerous, is find . -name '*.zip' -print (but you need to add 2 characters to the length, or strip the leading ./ from each name, and perhaps limit the search to the current directory depth).
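The name-length filter can also be applied without ls or find, by letting the shell expand the glob itself. This is a sketch; zip_names_of_length is a made-up name, and ${#file} counts characters in the name (including the .zip suffix).

```bash
#!/usr/bin/env bash
# zip_names_of_length is a hypothetical helper.
# Prints every *.zip name in the current directory whose length equals $1.
zip_names_of_length() {
    local want=$1 file
    for file in *.zip; do
        [[ -e $file ]] || continue            # guards against an unmatched glob
        (( ${#file} == want )) && printf '%s\n' "$file"
    done
    return 0
}
```

For example, zip_names_of_length 9 prints abcde.zip but not ab.zip.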

Bash: Pass alias or function as argument to program

Quite often I need to work on the newest file in a directory.
Normally I do:
ls -rt
and then open the last file in vim or less.
Now I wanted to produce an alias or function, like
lastline() {ls -rt | tail -n1}
# or
alias lastline=$(ls -rt | tail -n1)
Calling lastline outputs the newest file in the directory, which is nice.
But calling
less lastline
wants to open the file "lastline" which doesn't exist.
How do I make bash execute the function or alias, if possible without a lot of typing $() or ``?
Or is there any other way to achieve the same result?
Thanks for your help.

You're parsing ls, and this is very bad. Moreover, if the last modified “file” is a directory, you'll be lessing/viming a directory.
So you need a robust way to determine the last modified file in the current directory. You may use a helper function like the following (that you'll put in your .bashrc):
last_modified_regfile() {
# Finds the last modified regular file in current directory
# Found file is in variable last_modified_regfile_ret
# Returns a failure return code if no reg files are found
local file
last_modified_regfile_ret=
for file in *; do
[[ -f $file ]] || continue
if [[ -z $last_modified_regfile_ret ]] || [[ $file -nt $last_modified_regfile_ret ]]; then
last_modified_regfile_ret=$file
fi
done
[[ $last_modified_regfile_ret ]]
}
Then you may define another function that will vim the last found file:
vimlastline() {
last_modified_regfile && vim -- "$last_modified_regfile_ret"
}
You may even have last_modified_regfile take optional arguments: the directories where it will find the last modified regular file:
last_modified_regfile() {
# Finds the last modified regular file in current directory
# or in directories given as arguments
# Found file is in variable last_modified_regfile_ret
# Returns a failure return code if no reg files are found
local file dir
local save_shopt_nullglob=$(shopt -p nullglob)
shopt -s nullglob
(( $# )) || set .
last_modified_regfile_ret=
for dir; do
dir=${dir%/}
[[ -d $dir/ ]] || continue
for file in "$dir"/*; do
[[ -f $file ]] || continue
if [[ -z $last_modified_regfile_ret ]] || [[ $file -nt $last_modified_regfile_ret ]]; then
last_modified_regfile_ret=$file
fi
done
done
$save_shopt_nullglob
[[ $last_modified_regfile_ret ]]
}
Then you can even alter vimlastline accordingly:
vimlastline() {
last_modified_regfile "$@" && vim -- "$last_modified_regfile_ret"
}

Use command substitution like this:
lastline() { ls -rt | tail -n1; }
less "$(lastline)"
Or pipe it to xargs:
lastline | xargs -I {} less '{}'
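The two answers can be combined: a function that prints the newest regular file without parsing ls, used through command substitution. This is a sketch; newest_file is a made-up name, and because command substitution strips trailing newlines, a file name ending in a newline would not survive it.

```bash
#!/usr/bin/env bash
# newest_file is a hypothetical helper.
# Prints the most recently modified regular file in $1 (default: current directory);
# returns failure if the directory contains no regular files.
newest_file() {
    local file newest=
    for file in "${1:-.}"/*; do
        [[ -f $file ]] || continue
        [[ -z $newest || $file -nt $newest ]] && newest=$file
    done
    [[ -n $newest ]] && printf '%s\n' "$newest"
}
```

It would be used as less "$(newest_file)" or vim -- "$(newest_file ~/logs)", where ~/logs is just an example directory.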

While loop does not execute

I currently have this code:
listing=$(find "$PWD")
fullnames=""
while read listing;
do
if [ -f "$listing" ]
then
path=`echo "$listing" | awk -F/ '{print $(NF)}'`
fullnames="$fullnames $path"
echo $fullnames
fi
done
For some reason, this script isn't working, and I think it has something to do with the way that I'm writing the while loop / declaring listing. Basically, the code is supposed to pull out the actual names of the files, i.e. blah.txt, from the find $PWD.

read listing does not read a value from the string listing; it sets the value of listing with a line read from standard input. Try this:
# Ignoring the possibility of file names that contain newlines
fullnames=()
while IFS= read -r; do
    [[ -f $REPLY ]] || continue
    path=${REPLY##*/}
    fullnames+=( "$path" )
    echo "${fullnames[@]}"
done < <( find "$PWD" )
With bash 4 or later, you can simplify this with
shopt -s globstar
paths=()
for f in **/*; do
    [[ -f $f ]] || continue
    paths+=( "$f" )
done
fullnames=( "${paths[@]##*/}" )
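If the listing is already stored in a variable, as in the question, a here-string is one way to feed it to read on standard input. This is a sketch with made-up paths standing in for the output of find; like the loop above, it ignores file names that contain newlines.

```bash
#!/usr/bin/env bash
# Hypothetical listing; in the question this would be listing=$(find "$PWD").
listing=$(printf '%s\n' /tmp/demo/a.txt /tmp/demo/sub/b.txt)

names=()
while IFS= read -r line; do
    names+=( "${line##*/}" )    # strip everything up to the last slash
done <<< "$listing"             # the variable's contents become the loop's stdin

printf '%s\n' "${names[@]}"
```

This prints a.txt and b.txt on separate lines. The loop variable is deliberately named line rather than listing, so that read does not overwrite the data it is iterating over.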
