Efficient way to find paths from a list of filenames - bash

From a list of file names stored in a file f, what's the best way to find the relative path of each file name under dir, outputting this new list to file p? I'm currently using the following:
while read name
do
    find dir -type f -name "$name" >> p
done < f
which is too slow for a large list, or a large directory tree.
EDIT: A few numbers:
Number of directories under dir: 1870
Number of files under dir: 80622
Number of filenames in f: 73487
All files listed in f do exist under dir.

The following piece of Python code does the trick. The key is to run find once and store the output in a hashmap, which gives an O(1) way to get from a file name to the list of paths for that name.
#!/usr/bin/env python3
import os
# Names to look up, one per line, with the trailing newline removed
file_names = open("f").read().splitlines()
# Run find once; every output line is one path
file_paths = os.popen("find . -type f").read().splitlines()
file_names_to_paths = {}
for file_path in file_paths:
    # os.path.basename avoids spawning a basename process per file
    file_name = os.path.basename(file_path)
    if file_name not in file_names_to_paths:
        file_names_to_paths[file_name] = [file_path]
    else:
        file_names_to_paths[file_name].append(file_path)  # duplicate file
with open("p", "w") as out_file:
    for file_name in file_names:
        if file_name in file_names_to_paths:
            for path in file_names_to_paths[file_name]:
                out_file.write(path + "\n")

Try this Perl one-liner:
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),<$p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
1- create a hashmap whose keys are the file names : %H=map{chomp;$_=>1}<>
2- define a recursive subroutine to traverse directories : sub R{}
2.1- recursive call for directories : map R($_), if -d$p
2.2- extract the file name from the path : ($b=$p)=~s|.*/||
2.3- print if the hashmap contains the file name : print"$p\n" if$H{$b}
3- call R with the current directory as the path : R"."
EDIT : to traverse hidden directories (.*) as well
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),grep !m|/\.\.?$|,<$p/.* $p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f

I think this should do the trick:
xargs locate -b < f | grep ^dir > p
Edit: I can't think of an easy way to prefix dir/*/ to the list of file names, otherwise you could just pass that directly to xargs locate.

Depending on what percentage of the directory tree is considered a match, it might be faster to find every file, then grep out the matching ones:
find "$dir" -type f | grep -f <( sed 's+\(.*\)+/\1$+' "$f" )
The sed command pre-processes your list of file names into regular expressions that will only match full names at the end of a path.
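The effect of that pre-processing can be illustrated with a short Python sketch. It is not part of the answer above: the helper names to_suffix_pattern and filter_paths are hypothetical, and unlike the sed version it escapes regex metacharacters, so a dot in a file name matches only a literal dot.

```python
import re

def to_suffix_pattern(name):
    # Hypothetical helper: "/" + the literal name, anchored at the end of the path
    return "/" + re.escape(name) + "$"

def filter_paths(paths, names):
    # One combined regex, analogous to handing grep a pattern list with -f.
    # Assumes names is non-empty; an empty pattern would match every path.
    combined = re.compile("|".join(to_suffix_pattern(n) for n in names))
    return [p for p in paths if combined.search(p)]

paths = ["dir/a/foo.txt", "dir/b/xfoo.txt", "dir/c/foo.txt.bak", "dir/d/fooxtxt"]
print(filter_paths(paths, ["foo.txt"]))
```

Only dir/a/foo.txt survives the filter: the leading slash rejects xfoo.txt, the end anchor rejects foo.txt.bak, and the escaped dot rejects fooxtxt.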

Here is an alternative using bash and grep
#!/bin/bash
flist(){
    for x in "$1"/*; do
        # Recurse into directories, print everything else
        if [ -d "$x" ]; then flist "$x"; else echo "$x"; fi
    done
}
dir=/etc #the directory you are searching
list=$(< myfiles) #the file with file names
#format the list for grep
list="/${list//
/\$\|/}\$"
flist "$dir" | grep "$list"
If you need full POSIX shell compliance (busybox ash, hush, etc.), replace the $list substring manipulation with a variant of chepner's sed, and replace $(< file) with $(cat file).

Related

delete all but the last match

I want to delete all but the last match of a set of files matching file* that are present in each folder within a directory.
For example:
Folder 1
file
file_1-1
file_1-2
file_2-1
stuff.txt
stuff
Folder 2
file_1-1
file_1-2
file_1-3
file_2-1
file_2-2
stuff.txt
Folder 3
...
and so on. Within every subfolder I want to keep only the last of the matched files, so for Folder 1 this would be file_2-1, in Folder 2 it would be file_2-2. The number of files is generally different within each subfolder.
Since I have a deeply nested folder structure I thought about using the find command somehow like this
find . -type f -name "file*" -delete_all_but_last_match
I know how to delete all matches but not how to exclude the last match.
I also found the following piece of code:
https://askubuntu.com/questions/1139051/how-to-delete-all-but-x-last-items-from-find
but when I apply a modified version to a test folder
find . -type f -name "file*" -print0 | head -zn-1 | xargs -0 rm -rf
it deletes all the matches in most cases, only in some the last file is spared. So it does not work for me, presumably because of the different number of files in each folder.
Edit:
The folders contain no further subfolders, but they are generally at the end of several subfolder levels. It would therefore be a benefit if the script can be executed some levels above as well.
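#!/bin/bash
shopt -s globstar nullglob
for dir in **/; do
    files=("$dir"file*)
    # Folders with fewer than two matches have nothing to delete
    (( ${#files[@]} > 1 )) || continue
    unset 'files[-1]'    # keep the last match (glob results are sorted)
    rm -- "${files[@]}"
done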
Try the following solution utilising awk and xargs:
find . -type f -name "file*" | awk -F/ '{ map1[$(NF-1)]++;map[$(NF-1)][map1[$(NF-1)]]=$0 }END { for ( i in map ) { for (j=1;j<=(map1[i]-1);j++) { print "\""map[i][j]"\"" } } }' | xargs rm
Explanation:
find . -type f -name "file*" | awk -F/ '{    # Set the field delimiter to "/" in awk
    map1[$(NF-1)]++                          # map1 counts the files seen so far in each sub-directory
    map[$(NF-1)][map1[$(NF-1)]]=$0           # map is a two-dimensional array: sub-directory first, file counter second; the value is the whole line
}
END {
    for ( i in map ) {
        for (j=1;j<=(map1[i]-1);j++) {
            print "\""map[i][j]"\""          # Print every path except the last one of each sub-directory, quoted for xargs
        }
    }
}' | xargs rm                                # Run the result through xargs rm
Remove the pipe to xargs first to verify that the expected files are listed, then add it back to actually remove them. Note that find does not sort its output, so insert sort between find and awk if "last" should mean last in name order.
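The same grouping idea can be sketched in Python. This is an illustration under stated assumptions, not the answer's method: the function name all_but_last is hypothetical, files are grouped by their full parent directory, and "last" is taken to mean last in plain string order (so file_10-1 would sort before file_2-1).

```python
import os
from collections import defaultdict

def all_but_last(root, prefix="file"):
    """Return the matching files under root, minus the last one of each folder."""
    by_dir = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(prefix):
                by_dir[dirpath].append(name)
    doomed = []
    for dirpath, names in by_dir.items():
        names.sort()  # assumption: "last" means last in string order
        doomed.extend(os.path.join(dirpath, n) for n in names[:-1])
    return sorted(doomed)
```

Printing the returned list first works as a dry run; passing each entry to os.remove afterwards performs the deletion.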

Script for printing out file names and their number of appearance starting from a given folder

I need to write a shell script which, given a folder name as an argument, prints the names of the folders and files in it and how many times each name appears in that folder.
Edit: I need to compare only the names, without taking the file extensions into consideration.
#!/bin/bash
folder="$1"
for f in "$folder"
do
echo "$f"
done
And I would expect to see something like this (if I have 3 files with the same name and different extensions, like x.html, x.css, x.sh and so on, in a directory called dir)
x
3 times
after executing the script with dir (the name of the directory) as a parameter.
The find command already does most of this for you.
find . -printf "%f\n" |
sort | uniq -c
This will not work correctly if you have files whose names contain a newline.
If your find doesn't support -printf, maybe try
find . -exec basename {} \; |
sort | uniq -c
To restrict to just file names or directory names, add -type f or -type d, respectively, before the action (-exec or -printf).
If you genuinely want to remove extensions, try
find .... whatever ... |
sed 's%\.[^./]*$%%' |
sort | uniq -c
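For comparison, the same count can be sketched in Python. The function name count_names is hypothetical, and the sketch differs from the pipeline in two ways: os.walk does not report the starting directory itself, and os.path.splitext leaves dot-files such as .bashrc untouched instead of reducing them to an empty name.

```python
import os
from collections import Counter

def count_names(root):
    """Count files and folders under root by name, ignoring the last extension."""
    counts = Counter()
    for _dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            stem, _ext = os.path.splitext(name)
            counts[stem] += 1
    return counts
```

For a directory that holds only x.html, x.css and x.sh, count_names returns a count of 3 for x.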
Can you try this:
#!/bin/bash
# Collect each entry's name with its last extension removed
filenamearray=()
for file in *; do
    filename=$(basename -- "$file")
    filenamearray+=("${filename%.*}")
done
# Print each distinct name once, followed by the number of exact matches
printf '%s\n' "${filenamearray[@]}" | sort -u | while IFS= read -r filename; do
    echo "$filename"
    printf '%s\n' "${filenamearray[@]}" | grep -cxF -- "$filename"
done
You can try with find and awk:
find . -type f -print0 |
awk '
BEGIN {
    FS = "/"
    RS = "\0"    # NUL-separated records; needs an awk that accepts this, such as gawk
}
{
    name = $NF
    sub(/\.[^.]*$/, "", name)    # strip the last extension, if any
    a[name]++
}
END {
    for ( i in a ) {
        j = a[i]>1 ? "s" : ""
        print i
        print a[i] " time" j
    }
}'

Find if null exists in csv file

I have a csv file. The file has some anomalies as it contains some unknown characters.
The characters appear at line 1535 in popular editors (images attached below). The sed command in the terminal for this line does not show anything.
$ sed '1535!d' sample.csv
"sample_id","sample_column_text_1","sample_"sample_id","sample_column_text_1","sample_column_text_2","sample_column_text_3"
However, below are snapshots of the file in various editors (Sublime Text, Nano, Vi).
The directory has various csv files that contain this character/chain of characters.
I need to write a bash script to determine the files that have such characters. How can I achieve this?
The following is from:
http://www.linuxquestions.org/questions/programming-9/how-to-check-for-null-characters-in-file-509377/
#!/usr/bin/perl -w
use strict;
my $null_found = 0;
foreach my $file (@ARGV) {
    if ( ! open(F, "<$file") ) {
        warn "couldn't open $file for reading: $!\n";
        next;
    }
    while(<F>) {
        if ( /\000/ ) {
            print "detected NULL at line $. in file $file\n";
            $null_found = 1;
            last;
        }
    }
    close(F);
}
exit $null_found;
If it works as desired, you can save it to a file, nullcheck.pl, and make it executable:
chmod +x nullcheck.pl
It takes a list of file names as arguments, but its exit status only tells you that at least one of them contains a NUL, so I'd pass in one file at a time. The command below runs the script:
for f in $(find . -type f -exec grep -Iq . {} \; -and -print) ; do perl ./nullcheck.pl $f || echo "$f has nulls"; done
The above find command is lifted from Linux command: How to 'find' only text files?
You can try grep and tr:
With GNU grep, grep -lP '\x00' *.csv lists the files that contain \000 (NUL) characters.
You can use this to strip the NUL characters and write a clean copy of the file:
tr < file-with-nulls -d '\000' > file-without-nulls
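The check itself can also be sketched in Python, which reads the file as raw bytes and so is unaffected by how a terminal or editor renders the character. The function name first_nul_line is hypothetical.

```python
def first_nul_line(path, chunk_size=1 << 16):
    """Return the 1-based line number of the first NUL byte in path, or None."""
    line = 1
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return None  # end of file, no NUL byte seen
            pos = chunk.find(b"\x00")
            if pos != -1:
                # Add the newlines that precede the NUL inside this chunk
                return line + chunk.count(b"\n", 0, pos)
            line += chunk.count(b"\n")
```

Calling it for every csv file in the directory and reporting the names where the result is not None gives the list of affected files.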

Recursively check length of directory name

I need to determine if there are any directory names > 31 characters in a given directory (i.e. look underneath that root).
I know I can use something like find /path/to/root/dir -type d >> dirnames.txt
This will give me a text file of complete paths.
What I need is to get the actual number of characters in each directory name. Not sure if parsing the above results w/sed or awk makes sense. Looking for ideas/thoughts/suggestions/tips on how to accomplish this. Thanks!
This short script does it all in one go, i.e. finds all directory names and then outputs any which are greater than 31 characters in length (along with their length in characters):
find /path/to/root/dir -type d -exec basename {} \; | while IFS= read -r d; do
    len=${#d}    # length of the name itself, without a trailing newline
    if [ "$len" -gt 31 ] ; then
        echo "$d = $len characters"
    fi
done
Using your dirnames.txt file created by your find cmd, you can then sort the data by length of pathname, i.e.
awk '{print length($0) "\t" $0}' dirnames.txt | sort -k1,1nr > dirNamesWithSize.txt
This will present the longest path names (based on the value of length) at the top of the file.
I hope this helps.
Try this
find . -type d -exec bash -c '[ $(wc -c <<<"${1##*/}") -gt 32 ] && echo "${1}"' -- {} \; 2>/dev/null
The one bug, which I consider minor, is that it will over-count directory name length by 1 every time.
If what you wanted was the whole path rather than the last path component, then use this:
find . -type d | sed -e '/.\{32,\}/!d'
This version also has a bug, but only when file names have embedded newlines.
The output of both commands is a list of file names which match the criteria. Counting the length of each one is trivial from there.
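A Python sketch of the per-component check avoids the off-by-one question entirely, because len counts the characters of the name directly. The function name long_dir_names is hypothetical, and the starting directory itself is not checked.

```python
import os

def long_dir_names(root, limit=31):
    """Yield (path, length) for directories whose own name exceeds limit characters."""
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in dirnames:
            if len(name) > limit:
                yield os.path.join(dirpath, name), len(name)
```

Iterating over long_dir_names("/path/to/root/dir") and printing each pair lists the offending directories along with their lengths.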

How can I copy all my disorganized files into a single directory? (on linux)

I have thousands of mp3s inside a complex folder structure which resides within a single folder. I would like to move all the mp3s into a single directory with no subfolders. I can think of a variety of ways of doing this using the find command but one problem will be duplicate file names. I don't want to replace files since I often have multiple versions of a same song. Auto-rename would be best. I don't really care how the files are renamed.
Does anyone know a simple and safe way of doing this?
You could turn an a/b/c.mp3 path into a - b - c.mp3 while copying. Here's a solution in Bash:
find srcdir -name '*.mp3' -printf '%P\n' |
while IFS= read -r i; do
    j="${i//\// - }"
    cp -v "srcdir/$i" "dstdir/$j"
done
And in a shell without ${//} substitution:
find srcdir -name '*.mp3' -printf '%P\n' |
sed -e 'p;s:/: - :g' |
while IFS= read -r i; do
    IFS= read -r j
    cp -v "srcdir/$i" "dstdir/$j"
done
For a different scheme, GNU's cp and mv can make numbered backups instead of overwriting -- see -b/--backup[=CONTROL] in the man pages.
find srcdir -name '*.mp3' -exec cp -v --backup=numbered {} dstdir/ \;
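A bash sketch:
find . -name '*.mp3' -print0 | while IFS= read -r -d '' i; do
    base=$(basename "$i")
    new_name=$base
    x=0
    # Append an increasing counter until the name is free in the target directory
    while [ -e "move_to_dir/$new_name" ]; do
        x=$((x + 1))
        new_name="${base%.mp3}_$x.mp3"
    done
    mv -- "$i" "move_to_dir/$new_name"
done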
#!/bin/bash
NEW_DIR=/tmp/new/
IFS="
"; for a in `find . -type f `
do
    echo "$a"
    new_name="`basename $a`"
    while test -e "$NEW_DIR/$new_name"
    do
        new_name="${new_name}_"
    done
    cp "$a" "$NEW_DIR/$new_name"
done
I'd tend to do this in a simple script rather than try to fit it in a single command line.
For instance, in python, it would be relatively trivial to do a walk() through the directory, copying each mp3 file found to a different directory with an automatically incremented number.
If you want to get fancier, you could have a dictionary of existing file names, and simply append a number to the duplicates. (the index of the dictionary being the file name, and the value being the number of files found so far, which would become your suffix)
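A minimal sketch of that idea, assuming the target directory lies outside the source tree. The function name flatten_copy and the _N suffix scheme are illustrative choices, not something the answer prescribes; the dictionary maps each file name to the last suffix handed out for it.

```python
import os
import shutil

def flatten_copy(src_dir, dst_dir, ext=".mp3"):
    """Copy every file ending in ext under src_dir into dst_dir, renaming duplicates."""
    os.makedirs(dst_dir, exist_ok=True)
    seen = {}  # file name -> last numeric suffix used for that name
    copied = []
    for dirpath, dirnames, filenames in os.walk(src_dir):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if not name.lower().endswith(ext):
                continue
            stem, suffix = os.path.splitext(name)
            new_name = name
            # Never overwrite: bump the counter until the name is free
            while os.path.exists(os.path.join(dst_dir, new_name)):
                seen[name] = seen.get(name, 0) + 1
                new_name = f"{stem}_{seen[name]}{suffix}"
            shutil.copy2(os.path.join(dirpath, name), os.path.join(dst_dir, new_name))
            copied.append(new_name)
    return copied
```

Checking the target for an existing file on every attempt keeps the copy safe even when a generated name such as song_1.mp3 already exists as a real file somewhere in the source tree.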
find /path/to/mp3s -name '*.mp3' -exec mv \{\} /path/to/target/dir \;
At the risk of many downvotes, a perl script could be written in short time to accomplish this.
Pseudocode:
while (-e filename)
change filename to filename . "1";
In Python (to actually move the files, set debug = False):
import os, re
from_dir = "/from/dir"
to_dir = "/target/dir"
re_ext = r"\.mp3"
debug = True  # dry run: only print what would be moved
for d, _subdirs, names in os.walk(from_dir):
    # Keep only names ending in the extension, ignoring case
    names = [fn for fn in names if re.match(".*(%s)$" % re_ext, fn, re.I)]
    for fn in names:
        from_fn = os.path.join(d, fn)
        target_fn = os.path.join(to_dir, fn)
        file_exists = os.path.exists(target_fn)
        if not debug:
            if not file_exists:
                os.rename(from_fn, target_fn)
            else:
                print("DO NOT MOVE - FILE EXISTS ", from_fn)
        else:
            print("MOVE ", from_fn, " TO ", target_fn)
Since you don't care how the duplicate files are named, utilize the 'backup' option on move:
find /path/to/mp3s -name '*.mp3' -exec mv --backup=numbered {} /path/to/target/dir \;
Will get you:
song.mp3
song.mp3.~1~
song.mp3.~2~

Resources