Bash: Sort within a directory, concatenate contents of file with smallest entry in new file - bash

I have a set of directories RUN1, RUN2, etc.
In each directory, there is a set of files. In each file, there are two numbers. For example (these are saved as .csv, even though there are clearly no commas here):
RUN1
mod_1.csv
2.32e-00
1.2e-01
mod_b.csv
4.53e-00
1.1e-01
RUN2
mod_a.csv
1.23e-01
0.5e-02
mod3.csv
1.67e-00
0.4e-01
etc.
I would like to do this:
For each directory:
For each file in a directory:
Sort files by the first entry
Save contents and path of file with smallest value of first entry.
For example, above, this would result in a new file containing:
2.32e-00 1.2e-01 ./RUN1/mod_1.csv
1.23e-01 0.5e-02 ./RUN2/mod_a.csv
I started out by trying this:
#!/bin/bash
resultfile="best_results.txt"
for d in $(find . -type d -name 'RUN*' | sort);
do
find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \; >> "$resultfile"
done
But it gives me both values from all files, like this:
2.32e-00 ./RUN1/mod_1.csv
1.2e-01 ./RUN1/mod_1.csv
4.53e-00 ./RUN1/mod_b.csv
1.1e-01 ./RUN1/mod_b.csv
1.23e-01 ./RUN2/mod_a.csv
0.5e-02 ./RUN2/mod_a.csv
1.67e-00 ./RUN2/mod_3.csv
0.4e-01 ./RUN2/mod_3.csv
Then I figured I needed to use head, but this modification:
find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \; | head -1 >> "$resultfile"
gave me:
find: `awk' terminated by signal 13
I think I need another sort, and probably head, but I can't quite put this together.
EDIT (for clarity):
I want to look through all of the files in a directory, find the file with the smallest first number, and write the values of that file and the file path to a new file. Then, proceed to the next directory and do the same. In my example:
Directory RUN1 contains files mod_1.csv and mod_b.csv. File mod_1.csv has the smallest first value. I want to write its contents and file path on one line:
2.32e-00 1.2e-01 ./RUN1/mod_1.csv
to a file.
Directory RUN2 contains files, mod_a.csv and mod3.csv. File mod_a.csv has the smallest first value. I want to write its contents and file path on one line:
1.23e-01 0.5e-02 ./RUN2/mod_a.csv
So that the new file looks like this:
2.32e-00 1.2e-01 ./RUN1/mod_1.csv
1.23e-01 0.5e-02 ./RUN2/mod_a.csv
I understand that this was not clear in my pre-edit question. Please ask any questions you have! I'm not sure how to make this more clear.

From the look of it, you probably want to join each file's lines into a single line while running find:
resultfile="best_results.txt"
for d in $(find . -type d -name 'RUN*');
do
find "$d" -type f -name 'mod*' -exec awk '{printf "%s ",$0} END {print "", FILENAME}' {} \;
done | sort >> "$resultfile"
The sorting would normally be done at the very end (once all the results have been written to stdout); however, it's unclear exactly how you expect the output to be sorted. You could probably get rid of the for loop if you really wanted, since something like the following should work similarly:
find RUN* -type f -name 'mod*' -exec awk '{printf "%s ",$0} END {print "", FILENAME}' {} \; | sort -k 2,2 >> "$resultfile"
Use the -k option with sort to specify which column(s) to sort by.
Result (using sort -k 2,2):
1.67e-00 0.4e-01 RUN2/mod3.csv
1.23e-01 0.5e-02 RUN2/mod_a.csv
4.53e-00 1.1e-01 RUN1/mod_b.csv
2.32e-00 1.2e-01 RUN1/mod_1.csv
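Neither command above picks the smallest first value per directory, which is what the edited question asks for. A minimal sketch of one way to do that follows. It assumes a sort that supports -g (general numeric sort, needed for exponent notation; available in GNU coreutils), that each file holds its numbers on separate lines, and that the RUN* directories sit directly under the current directory. The function name best_per_run is made up for illustration.

```shell
#!/bin/bash
# best_per_run: hypothetical helper. For each ./RUN* directory it prints
# one line: the contents and path of the file with the smallest first value.
best_per_run() {
    local d f
    for d in ./RUN*/; do
        d=${d%/}                          # drop the trailing slash from the glob
        for f in "$d"/mod*; do
            [ -f "$f" ] || continue       # skip an unmatched glob or a non-file
            # Join the lines of the file with spaces, then append its path.
            awk '{printf "%s ", $0} END {print FILENAME}' "$f"
        done | LC_ALL=C sort -g -k1,1 | head -n 1   # smallest first column wins
    done
}
```

Running best_per_run > best_results.txt from the parent directory should then leave one line per RUN directory in the result file. Because head only sees the output of sort, which has already finished reading its input, awk is not cut off mid-write the way it was in the head -1 attempt from the question.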


Bash find: exec in reverse order

I am iterating over files like so:
find $directory -type f -exec codesign {} \;
Now the problem here is that files on a higher hierarchy are signed first.
Is there a way to iterate over a directory tree and handle the deepest files first?
So that
/My/path/to/app/bin
is handled before
/My/path/mainbin
Yes, just use -depth:
-depth
The primary shall always evaluate as true; it shall cause descent of the directory hierarchy to be done so that all entries in a directory are acted on before the directory itself. If a -depth primary is not specified, all entries in a directory shall be acted on after the directory itself. If any -depth primary is specified, it shall apply to the entire expression even if the -depth primary would not normally be evaluated.
For example:
$ mkdir -p top/a/b/c/d/e/f/g/h
$ find top -print
top
top/a
top/a/b
top/a/b/c
top/a/b/c/d
top/a/b/c/d/e
top/a/b/c/d/e/f
top/a/b/c/d/e/f/g
top/a/b/c/d/e/f/g/h
$ find top -depth -print
top/a/b/c/d/e/f/g/h
top/a/b/c/d/e/f/g
top/a/b/c/d/e/f
top/a/b/c/d/e
top/a/b/c/d
top/a/b/c
top/a/b
top/a
top
Note that at a particular level, ordering is still arbitrary.
Using GNU utilities, and decorate-sort-undecorate pattern (aka Schwartzian transform):
find . -type f -printf '%d %p\0' |
sort -znr |
sed -z 's/[0-9]* //' |
xargs -0 -I@ echo codesign @
Drop the echo if the output looks ok.
Using find's -depth option as in my other answer, or a naive sort as in some others, only ensures that the sub-directories of a directory are processed before the directory itself, not that the deepest level overall is processed first.
For example:
$ mkdir -p top/a/b/d/f/h top/a/c/e/g
$ find top -depth -print
top/a/c/e/g
top/a/c/e
top/a/c
top/a/b/d/f/h
top/a/b/d/f
top/a/b/d
top/a/b
top/a
top
For overall deepest level to be processed first, the ordering should be something like:
top/a/b/d/f/h
top/a/c/e/g
top/a/b/d/f
top/a/c/e
top/a/b/d
top/a/c
top/a/b
top/a
top
To determine this ordering, the entire list must be known, and then the number of levels (ie. /) of each path counted to enable ranking.
A simple-ish Perl script (assigned to a shell function for this example) to do this ordering is:
$ dsort(){
perl -ne '
BEGIN { $/ = "\0" } # null-delimited i/o
$fname[$.] = $_;
$depth[$.] = tr|/||;
END {
print
map { $fname[$_] }
sort { $depth[$b] <=> $depth[$a] }
keys @fname
}
'
}
Then:
$ find top -print0 | dsort | xargs -0 -I@ echo @
top/a/b/d/f/h
top/a/c/e/g
top/a/b/d/f
top/a/c/e
top/a/b/d
top/a/c
top/a/b
top/a
top
How about sorting the output of find in descending order:
while IFS= read -d "" -r f; do
codesign "$f"
done < <(find "$directory" -type f -print0 | sort -zr)
<(command ..) is a process substitution which feeds the output
of the command to the read command in while loop via the redirect.
-print0, sort -z and read -d "" combo uses a null character
as a file delimiter. It is useful to protect filenames which include
special characters such as whitespace.
I don't know if there is a native way in find, but you may pipe the output of it into a loop and process it line by line as you wish this way:
find . | while IFS= read -r file; do echo filename: "$file"; done
In your case, if you are happy just reversing the output of find, you may go with something like:
find "$directory" -type f | tac | while IFS= read -r file; do codesign "$file"; done

delete all but the last match

I want to delete all but the last match of a set of files matching file* that are present in each folder within a directory.
For example:
Folder 1
file
file_1-1
file_1-2
file_2-1
stuff.txt
stuff
Folder 2
file_1-1
file_1-2
file_1-3
file_2-1
file_2-2
stuff.txt
Folder 3
...
and so on. Within every subfolder I want to keep only the last of the matched files, so for Folder 1 this would be file_2-1, in Folder 2 it would be file_2-2. The number of files is generally different within each subfolder.
Since I have a deeply nested folder structure, I thought about using the find command somehow like this:
find . -type f -name "file*" -delete_all_but_last_match
I know how to delete all matches but not how to exclude the last match.
I also found the following piece of code:
https://askubuntu.com/questions/1139051/how-to-delete-all-but-x-last-items-from-find
but when I apply a modified version to a test folder
find . -type f -name "file*" -print0 | head -zn-1 | xargs -0 rm -rf
it deletes all the matches in most cases, only in some the last file is spared. So it does not work for me, presumably because of the different number of files in each folder.
Edit:
The folders contain no further subfolders, but they are generally at the end of several subfolder levels. It would therefore be a benefit if the script can be executed some levels above as well.
#!/bin/bash
shopt -s globstar
for dir in **/; do
files=("$dir"file*)
unset 'files[-1]'
rm "${files[@]}"
done
Try the following solution utilising awk and xargs:
find . -type f -name "file*" | awk -F/ '{ map1[$(NF-1)]++;map[$(NF-1)][map1[$(NF-1)]]=$0 }END { for ( i in map ) { for (j=1;j<=(map1[i]-1);j++) { print "\""map[i][j]"\"" } } }' | xargs rm
Explanation:
find . -type f -name "file*" | awk -F/ '{ # Set the field delimiter to "/" in awk
map1[$(NF-1)]++; # Create an array map1 with the sub-directory as the index and an incrementing counter the value (number of files in each sub-directory)
map[$(NF-1)][map1[$(NF-1)]]=$0 # Create a two-dimensional array indexed first by sub-directory and then by file count, with the line as the value
}
END {
for ( i in map ) {
for (j=1;j<=(map1[i]-1);j++) {
print "\""map[i][j]"\"" # Loop through the map array utilising map1 to get the last but one file and printing the results
}
}
}' | xargs rm # Run the result through xargs rm
Remove the pipe to xargs to verify that the files are listing as expected before adding back in to actually remove the files.
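The nested map[i][j] arrays above need GNU awk. As a hedged alternative, the same idea can be sketched with plain awk by sorting the paths first and remembering only the most recent match per directory. The function name list_all_but_last is hypothetical, "last" here means last in byte-wise sort order (so file_10 sorts before file_9), and file names containing newlines are not handled.

```shell
#!/bin/bash
# list_all_but_last: hypothetical helper. Prints every file* path under the
# given root except the last one (in sort order) of each directory.
list_all_but_last() {
    find "${1:-.}" -type f -name 'file*' | LC_ALL=C sort |
    awk -F/ '{
        # Directory part: the whole path minus "/" and the final component.
        dir = substr($0, 1, length($0) - length($NF) - 1)
        # A later match in the same directory means the earlier one can go.
        if (dir in last) print last[dir]
        last[dir] = $0
    }'
}
```

The function only prints candidates. After checking the list, it could be fed to a loop such as list_all_but_last . | while IFS= read -r f; do rm -- "$f"; done.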

bash script to list duplicate hash files [duplicate]

I want to create a bash script that searches a given directory for pictures to copy. The pictures must have the name format IMG_\d\d\d\d.JPG. If pictures have a duplicate filename, copy them to /images/archives and append .JPG to the end of their name, so the duplicates end in .JPG.JPG. Some pictures are also duplicates by content, so I want to hash each picture and check whether it is a duplicate. If it is, do not copy the duplicate into /archives but store its file path in a file called output.txt.
I am struggling with trying to get the duplicate hashes to display the filenames as well. This is what I had so far:
if [ -d $1 ]
then echo using directory $1 as source
else echo Sorry, not a valid drive
exit
fi
if [ -d $2 ]
then echo $2 target location already exists
else mkdir -p $2
fi
cd $1
myList=`find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG"`
echo $myList
ImagesToCopy=`find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG" -exec md5sum {} \; | cut -f1 -d" " | sort | uniq`
echo $ImagesToCopy
This gives me a list of the files I need to copy and their hashes. In the command line if I type in the command:
# find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG" -exec md5sum {} \; | sort | cut -f1 -d" "| uniq -d
I receive the results:
266ab54fd8a6dbc7ba61a0ee526763e5
88761da2c2a0e57d8aab5327a1bb82a9
cc640e50f69020dd5d2d4600e20524ac
This is the list of duplicate files that I do not want to copy but I want to also display the file path and filenames alongside this, like this:
# find . -mindepth 1 -type f -name "*MG_[0-9][0-9][0-9][0-9].JPG" -exec md5sum {} \; | sort -k1 | uniq -u
043007387f39f19b3418fcba67b8efda ./IMG_1597.JPG
05f0c10c49983f8cde37d65ee5790a9f ./images/IMG_2012/IMG_2102.JPG
077c22bed5e0d0fba9e666064105dc72 ./DCIM/IMG_0042.JPG
1a2764a21238aaa1e28ea6325cbf00c2 ./images/IMG_2012/IMG_1403.JPG
1e343279cd05e8dbf371331314e3a2f6 ./images/IMG_1959.JPG
2226e652bf5e3ca3fbc63f3ac169c58b ./images/IMG_0058.JPG
266ab54fd8a6dbc7ba61a0ee526763e5 ./images/IMG_0079.JPG
266ab54fd8a6dbc7ba61a0ee526763e5 ./images/IMG_2012/IMG_0079.JPG
2816dbcff1caf70aecdbeb934897fd6e ./images/IMG_1233.JPG
451110cc2aff1531e64f441d253b7fec ./DCIM/103canon/IMG_0039.JPG
45a00293c0837f10e9ec2bfd96edde9f ./DCIM/103canon/IMG_0097.JPG
486f9dd9ee20ba201f0fd9a23c8e7289 ./images/IMG_2013/IMG_0060.JPG
4c2054c57a2ca71d65f92caf49721b4e ./DCIM/IMG_1810.JPG
53313e144725be3993b1d208c7064ef6 ./IMG_2288.JPG
5ac56dcddd7e0fd464f9b243213770f5 ./images/IMG_2012/favs/IMG_0039.JPG
65b15ebd20655fae29f0d2cf98588fc3 ./DCIM/IMG_2564.JPG
88761da2c2a0e57d8aab5327a1bb82a9 ./images/IMG_2012/favs/IMG_1729.JPG
88761da2c2a0e57d8aab5327a1bb82a9 ./images/IMG_2013/IMG_1729.JPG
8fc75b0dd2806d5b4b2545aa89618eb6 ./DCIM/103canon/IMG_2317.JPG
971f0a4a064bb1a2517af6c058dc3eb3 ./images/IMG_2012/favs/IMG_2317.JPG
aad617065e46f97d97bd79d72708ec10 ./images/IMG_2013/IMG_1311.JPG
c937509b5deaaee62db0bf137bc77366 ./DCIM/IMG_1152.JPG
cc640e50f69020dd5d2d4600e20524ac ./images/IMG_2012/favs/IMG_2013.JPG
cc640e50f69020dd5d2d4600e20524ac ./images/IMG_2013/IMG_2013.JPG
d8edfcc3f9f322ae5193e14b5f645368 ./images/IMG_2012/favs/IMG_1060.JPG
dcc1da7daeb8507f798e4017149356c5 ./DCIM/103canon/IMG_1600.JPG
ded2f32c88796f40f080907d7402eb44 ./IMG_0085.JPG
Thanks in advance.
Let's suppose that you have the results of md5sum. For example:
$ cat file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1b
cc640e50f69020dd5d2d4600e20524ac /path/to/file3
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2b
To remove duplicates from the list, use awk:
$ awk '!($1 in a){a[$1]; print}' file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
cc640e50f69020dd5d2d4600e20524ac /path/to/file3
This uses the array a to keep track of which md5 sums we have seen so far. For each line, if the md5 has not appeared before, !($1 in a), we mark that md5 as having been seen and print the line.
Alternative
A shorter version of the code is:
$ awk '!a[$1]++' file
266ab54fd8a6dbc7ba61a0ee526763e5 /path/to/file1a
88761da2c2a0e57d8aab5327a1bb82a9 /path/to/file2a
cc640e50f69020dd5d2d4600e20524ac /path/to/file3
This uses array a to count the number of times that md5sum $1 has appeared. If the count is initially zero, then the line is printed.
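The question also asks for the paths of the duplicates themselves, to be stored in output.txt. Dropping the negation from the shorter version prints only lines whose hash has already appeared on an earlier line. The sketch below assumes md5sum-style input (hash first, then path) on stdin; the function name print_repeats is made up for illustration.

```shell
#!/bin/bash
# print_repeats: hypothetical helper. Reads "hash path" lines on stdin and
# prints each line whose hash was already seen on an earlier line.
print_repeats() {
    # seen[$1]++ is 0 (false) on the first occurrence, non-zero afterwards.
    awk 'seen[$1]++'
}
```

With the sample file above, print_repeats < file > output.txt would record the file1b and file2b lines, while the first copy of each picture stays available for copying.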

bash script reading lines in every file copying specific values to newfile

I want to write a script helping me to do my work.
Problem: I have many files in one dir containing data and I need from every file specific values copied in a newfile.
The data files can look like this:
Name abc $desV0
Start MJD56669 opCMS v2
End MJD56670 opCMS v2
...
valueX 0.0456 RV_gB
...
valueY 12063.23434 RV_gA
...
The script should copy valueX and the value that follows it, and valueY and the value that follows it, into a new file on one line, and then add the name of the source data file to that line. Additionally, the value of valueY should contain only the part before the dot.
The result should look like this:
valueX 0.0456 valueY 12063 name_of_sourcefile
I am so far:
for file in $(find -maxdepth 0 -type f -name *.wt); do
for line in $(cat $file | grep -F vb); do
cp $line >> file_done
done
done
But that doesn't work at all. I also have no idea how to get the data in ONE line in the newfile.
Can anyone help me?
I think you can simplify your script a lot using awk:
awk '/valueX/{x=$2}/valueY/{print "valueX",x,"valueY",$2,FILENAME}' *.wt > file_done
This goes through every file in the current directory. When "valueX" is matched, the value is saved to the variable x. When "valueY" is matched, the line is printed.
This assumes that the line containing "valueX" always comes before the one containing "valueY". If that isn't a valid assumption, the script can easily be changed.
To print only the integer part of "valueY", you can use printf instead of print:
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,FILENAME}' *.wt > file_done
%d is the format specifier for an integer.
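If the ordering assumption mentioned earlier does not hold, one possible variation is to remember both values and print them once per file. This is only a sketch: it matches on the first field rather than anywhere in the line, prints 0 for valueY when a file has no such line, and the function name extract_values is hypothetical.

```shell
#!/bin/bash
# extract_values: hypothetical helper. Prints one summary line per input
# file, regardless of whether valueX or valueY comes first in the file.
extract_values() {
    awk '
        FNR == 1 && NR > 1 { emit() }                 # new file: flush the previous one
        FNR == 1           { fname = FILENAME; x = y = "" }
        $1 == "valueX"     { x = $2 }
        $1 == "valueY"     { y = $2 }
        END                { if (NR > 0) emit() }     # flush the last file
        function emit() { printf "valueX %s valueY %d %s\n", x, y, fname }
    ' "$@"
}
```

It would be called like the one-liners above, for example extract_values *.wt > file_done.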
If your requirements are more complex and you need to use find, you should use -exec rather than looping through the results, to avoid problems with awkward file names:
find -maxdepth 1 -iname "5*.par" ! -iname "*_*" -exec \
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,"{}"}' '{}' \; > file_done
Don't fight. I'm really thankful for your help and especially the fast answers.
This is my final solution I think:
#!/bin/bash
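for file in $(find * -maxdepth 1 -iname "5*.par" ! -iname "*_*"); do
# One output line per file; the redirect sits after "done" so earlier results are not overwritten
awk '/TASC/{x=$2}/START/{printf "TASC %s MJD %d %s\n",x,$2, FILENAME}' "$file"
done > mjd_vs_tasc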
Many thanks again to you guys.
Try something like below :
egrep "valueX|valueY" *.wt | awk -vRD="\n" -vORS=" " -F':| ' '{if (NR%2==0) {print $2, $3, $1} else {print $2, $3}}' > $file.new.txt

Recursively check length of directory name

I need to determine if there are any directory names > 31 characters in a given directory (i.e. look underneath that root).
I know I can use something like find /path/to/root/dir -type d >> dirnames.txt
This will give me a text file of complete paths.
What I need is to get the actual number of characters in each directory name. Not sure if parsing the above results w/sed or awk makes sense. Looking for ideas/thoughts/suggestions/tips on how to accomplish this. Thanks!
This short script does it all in one go, i.e. finds all directory names and then outputs any which are greater than 31 characters in length (along with their length in characters):
find /path/to/root/dir -type d -exec basename {} \; | while IFS= read -r d; do
# ${#d} is the length of the name; wc -c would also count the trailing newline
len=${#d}
if [ "$len" -gt 31 ] ; then
echo "$d = $len characters"
fi
done
Using your dirnames.txt file created by your find cmd, you can then sort the data by length of pathname, i.e.
awk '{print length($0) "\t" $0}' dirnames.txt | sort -k1,1nr > dirNamesWithSize.txt
This will present the longest path names (based on the value of length) at the top of the file.
I hope this helps.
Try this
find . -type d -exec bash -c '[ $(wc -c <<<"${1##*/}") -gt 32 ] && echo "${1}"' -- {} \; 2>/dev/null
The one bug, which I consider minor, is that it will over-count directory name length by 1 every time.
If what you wanted was the whole path rather than the last path component, then use this:
find . -type d | sed -e '/.\{32,\}/!d'
This version also has a bug, but only when file names have embedded newlines.
The output of both commands is a list of file names which match the criteria. Counting the length of each one is trivial from there.
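As a rough sketch of that last step, awk can split each path on / and measure only the final component, printing the length next to the path. The function name long_dir_names and its optional second argument (the limit, defaulting to 31) are made up for illustration. Like the sed version, it misreads names with embedded newlines, and whether length() counts characters or bytes depends on the awk implementation and locale.

```shell
#!/bin/bash
# long_dir_names: hypothetical helper. Lists directories under the given
# root whose own name (last path component) is longer than the limit.
long_dir_names() {
    find "${1:-.}" -type d |
    awk -F/ -v max="${2:-31}" 'length($NF) > max { print length($NF) "\t" $0 }'
}
```

For example, long_dir_names /path/to/root/dir prints one "length, tab, path" line per offending directory and nothing if every name fits.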
