Bash: find and path

As an extension of this question, I would now like to have not only the filename, but also the directories up to k positions back. Here's the problem:
I have directories named RUN1, RUN2, and RUN3
Each directory has some files. Directory RUN1 has files mod1_1.csv, mod1_2.csv, mod1_3.csv. Directory RUN2 has files mod2_1.csv, mod2_2.csv, mod2_3.csv, etc.
The contents of mod1_1.csv file look like this:
5.71 6.66 5.52 6.90
5.78 6.69 5.55 6.98
5.77 6.63 5.73 6.91
And mod1_2.csv looks like this:
5.73 6.43 5.76 6.57
5.79 6.20 5.10 7.01
5.71 6.21 5.34 6.81
In RUN2, mod2_1.csv looks like this:
5.72 6.29 5.39 5.59
5.71 6.10 5.10 7.34
5.70 6.23 5.23 6.45
And mod2_2.csv looks like this:
5.72 6.29 5.39 5.69
5.71 6.10 5.10 7.32
5.70 6.23 5.23 6.21
My goal is to obtain the line with the smallest value in column 4 for each RUN* directory, and write that line, together with the model that produced it and part of the path, to a new .csv file. Right now, I have this code:
#!/bin/bash
resultfile="best_results_mlp_onelayer.txt"
for d in $(find . -type d -name 'RUN*' | sort); do
    find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \; | sort -k4 -g | head -1 >> "$resultfile"
done
This gives me:
5.73 6.43 5.76 6.57 ./RUN_1/mod1_2.csv
5.72 6.29 5.39 5.59 ./RUN_2/mod2_1.csv
But I would like a .csv file with these contents:
5.73 6.43 5.76 6.57 ./DIR1/DIR2/DIR3/RUN_1/mod1_2.csv
5.72 6.29 5.39 5.59 ./DIR1/DIR2/DIR3/RUN_2/mod2_1.csv
where my pwd is /DIRk/DIRm/DIRl/DIR1/DIR2/DIR3
EDIT:
Based on a reply, what I mean by 'k positions back' is:
Right now, my code gives me ./RUN_1/mod1_2.csv as the last column value in the first row. To me, that is a pwd 'one position back', because it shows the directory where the file mod1_2.csv is located. I would like the path '4 positions back'. That is, I would like ./DIR1/DIR2/DIR3/RUN_1/mod1_2.csv. I said 'k' because that's a common placeholder, and I was hoping I could just substitute a number in there.

Following dgeorgiev's answer, I placed my results-gathering code in a directory further up in the hierarchy. So, continuing from my question, if my pwd is /DIRk/DIRm/DIRl/DIR1/DIR2/DIR3, I moved my .sh file to /DIRk/DIRm/DIRl. Then I ran this:
#!/bin/bash
resultfile="best_results_mlp.txt"
for d in $(find . -type d -name 'RUN*' | sort); do
    find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \; | sort -k4 -g | head -1 >> "$resultfile"
done
And the result was, as desired:
5.73 6.43 5.76 6.57 ./DIR1/DIR2/DIR3/RUN_1/mod1_2.csv
5.72 6.29 5.39 5.59 ./DIR1/DIR2/DIR3/RUN_2/mod2_1.csv

Further to my answer in your previous question:
find prints each file with the path under which it was found. So if you search in "/path/to/$d", you will get "/path/to/$d/filename.csv". Just make find search under the path you would like to see in the output.
So if your RUN* dirs are located in /path/to/, and you would like to have ./to/RUNx/filename.csv in your results, you can always do
cd /path/ && find ./to/RUNx/ # ...
If you need the absolute path, you can just run find on /path/to/RUNx
Just be careful when changing directories, and make sure to change back where necessary (e.g. you might have to provide the path to your output file).
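As a hedged sketch of how this could combine with the loop from the question without permanently changing directories, a subshell keeps the cd local (the path /DIRk/DIRm/DIRl and the result-file name are placeholders taken from the question):
#!/bin/bash
# write results via an absolute path so the cd inside the subshell cannot misplace them
resultfile="$PWD/best_results_mlp.txt"
(
    cd /DIRk/DIRm/DIRl || exit 1    # placeholder path from the question
    for d in $(find . -type d -name 'RUN*' | sort); do
        find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \; | sort -k4 -g | head -1 >> "$resultfile"
    done
)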

I don't see any commas in these CSVs. I assume you're just separating by whitespace. And since you're already using awk in your find line, I guess we can assume that you're open to awk-based options.
$ find . -type f
./RUN1/mod1_1
./RUN1/mod1_2
./RUN2/mod2_1
./RUN2/mod2_2
$ awk 'NR == 1 {n=$4; f=FILENAME} $4 > n {n=$4; f=FILENAME} END {print f,n}' RUN*/mod*
RUN2/mod2_1 7.34
This uses the awk built-in variable FILENAME which always contains the name of the current file.
I can't tell from your question what you mean by "k positions back", but you can strip or parse this output however you see fit.
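As a hypothetical example of such parsing, this keeps only the last two components of the path column; the field layout (four data columns plus the filename in $5) follows the output shown in the question:
$ awk '{ n = split($5, p, "/"); print $1, $2, $3, $4, p[n-1] "/" p[n] }' best_results_mlp_onelayer.txt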

How about something like this
find . -type d -name 'RUN*' | while read -r dir; do
awk '{print $0, FILENAME}' "$dir"/mod* \
| sort -k4 -g | head -1
done
(sorry about my original misinterpretation of your requirements, edited to correct the issue)
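If you want the same result file as in the question, the loop's output can be collected in one go; sorting the directory list also keeps the RUN* entries in a stable order (the result-file name is the one from the question):
find . -type d -name 'RUN*' | sort | while read -r dir; do
    awk '{print $0, FILENAME}' "$dir"/mod* \
    | sort -k4 -g | head -1
done > best_results_mlp_onelayer.txt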

Related

How do I get a list of files that have line count below 18

I need to search for files in a directory by month/year, pass them through wc -l to get line counts, test something like if [ $lines -le 18 ], and get a list of the files that match.
In the past I called this with 'file.sh 2020-06' and used something like this to process the files for that month:
find . -name "* $1-*" -exec grep '(1 |2 |3 )' {}
but I now need to test for a line count.
The above -exec worked but when I changed over to passing the file to another exec I get complaints of "too many parameters" because the file name has spaces. I just can't seem to get on track with solving this one.
Any pointers to get me going would be very much appreciated.
Rick
Here's one using find and awk. But first some test files (Notice: it creates files named 16, 17, 18 and 19):
$ for i in 16 17 18 19 ; do seq 1 $i > $i ; done
Then:
$ find . -name 1\[6789\] -exec awk 'NR==18{exit c=1}END{if(!c) print FILENAME}' {} \;
./16
./17
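Applied to the month-based selection from the question, a sketch might look like this; because -exec passes each path as a single argument, file names containing spaces are handled safely (the "* $1-*" pattern is copied verbatim from the question):
#!/bin/sh
# usage: file.sh 2020-06  -- list that month's files with fewer than 18 lines
find . -type f -name "* $1-*" \
    -exec awk 'NR==18{exit c=1}END{if(!c) print FILENAME}' {} \;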

Remove text files with less than three lines

I'm using an Awk script to split a big text document into independent files. I did it and now I'm working with 14k text files. The problem here is there are a lot of files with just three lines of text and it's not useful for me to keep them.
I know I can delete lines in a text file with awk 'NF>=3' file, but I don't want to delete lines inside files; rather, I want to delete files whose content is just two or three lines of text.
Thanks in advance.
Could you please try the following find command (tested with GNU awk).
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{if (!f) print FILENAME}' {} \;
The above will print to the console the names of files having fewer than 3 lines. Once you are happy with the results, try the following to delete them; I suggest running it in a test directory first, and only proceed once you are fully satisfied with the output of the command above. (Remove the echo from the command below to actually delete; I have kept it in to be on the safer side :) )
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit f?1:0}' {} \; -exec echo rm -f {} \;
If the files in the current directory are all text files, this should be efficient and portable:
for f in *; do
[ $(head -4 "$f" | wc -l) -lt 4 ] && echo "$f"
done # | xargs rm
Inspect the list, and if it looks OK, then remove the # on the last line to actually delete the unwanted files.
Why use head -4? Because wc doesn't know when to quit. Suppose half of the text files were each more than a terabyte long; if that were the case wc -l alone would be quite slow.
You may use wc to count lines and then decide whether to delete the file or not. You should write a shell script rather than just an awk command.
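A minimal sketch of such a script, assuming the files live under /your/path/ as in the other answers; the echo keeps it a dry run, as suggested above:
#!/bin/sh
find /your/path/ -type f | while IFS= read -r f; do
    lines=$(wc -l < "$f")                    # count the lines in this file
    [ "$lines" -le 3 ] && echo rm -f "$f"    # remove echo to actually delete
done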
You can try Perl. The solution below will be efficient, as the ARGV file handle is closed once the line count exceeds 3:
perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' *
If you want to feed it the output of some other command (say find), you can use it like:
$ find . -name "*" -type f -exec perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' {} \;
./bing.fasta
./chris_smith.txt
./dawn.txt
./drcatfish.txt
./foo.yaml
./ip.txt
./join_tab.pl
./manoj1.txt
./manoj2.txt
./moose.txt
./query_ip.txt
./scottc.txt
./seats.ksh
./tane.txt
./test_input_so.txt
./ya801.txt
$
The output of wc -l * in the same directory:
$ wc -l *
12 bing.fasta
16 chris_smith.txt
8 dawn.txt
9 drcatfish.txt
3 fileA
3 fileB
13 foo.yaml
3 hubbs.txt
8 ip.txt
19 join_tab.pl
6 manoj1.txt
6 manoj2.txt
5 moose.txt
17 query_ip.txt
3 rororo.txt
5 scottc.txt
22 seats.ksh
1 steveman.txt
4 tane.txt
13 test_input_so.txt
24 ya801.txt
200 total
$
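For the deletion step itself, a hedged companion one-liner that inverts the test so it targets the files with 3 lines or fewer; it only prints what it would remove:
perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print "would remove $_" if $kv{$_} <= 3 } } ' *
To actually delete, replace the print inside the END block with unlink($_) or warn qq(cannot remove $_: $!).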

How to get 20% of the total number of files in a folder?

I am using the shell to count the number of files in a folder. For example, folder A has 100 files, and I just want to show 20% of that, which must be an integer, meaning 20. This is my code, but it fails:
file_num= find . -type f | wc -l
prob_select=0.2
file_num=$(expr $file_num \* $prob_select)
file_num=$( printf "%.0f" $file_num)
For a somewhat simpler approach which shows every nth file instead of requiring you to know how many there are before deciding which ones to display:
find . -type f | awk -v factor=5 'NR%factor == 0'
You can't work with float numbers like that in bash, but try to convert 20% -> 0.2 -> 2/10 -> 1/5, so:
file_num=$(($(find . -type f | wc -l) / 5)); echo "${file_num}"
This gives you the number corresponding to 20% of the files found.
Next, just run find . -type f | head -n "${file_num}"
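If you prefer to keep the percentage as a variable, as in the original attempt, a hedged variant can do the floating-point step in awk instead of expr and printf (prob_select is the variable name from the question):
prob_select=0.2
file_num=$(find . -type f | wc -l)
file_num=$(awk -v n="$file_num" -v p="$prob_select" 'BEGIN { printf "%.0f", n * p }')
find . -type f | head -n "$file_num"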

awk, IFS, and file name truncations

Updated question based on new information…
Here is a gist of my code, with the general idea that I store items in DropBox at:
~/Dropbox/Public/drops/xx.xx.xx/whatever
Where the date is always 2 chars, 2 chars, and 2 chars, dot separated. Within that folder can be more folders and more files, which is why when I use find I do not set the depth and allow it to scan recursively.
https://gist.github.com/anonymous/ad51dc25290413239f6f
Below is a shortened version of the gist. I don't believe it will run as it stands, though the gist will run assuming you have DropBox installed and there are files at the path location that I set up.
General workflow:
SIZE="+250k" # For `find` this is the value in size I am looking for files to be larger than
# Location where I store the output to `find` to process that file further later on.
TEMP="/tmp/drops-output.txt"
Next I rm the tmp file and touch a new one.
I will then cd into
DEST=/Users/$USER/Dropbox/Public/drops
Perform a quick conditional check to make sure that I am working where I want to be; with all my values as variables, I could easily mess up and not be working where I thought I would be.
# Conditional check: is the current directory the one I want to be the working directory?
if [ "$(pwd)" = "${DEST}" ]; then
echo -e "Destination and current working directory are equal, this is good!:\n $(pwd)\n"
fi
The meat of step one is the `find` command:
# Use `find` to locate a subset of files that are larger than a certain size
# save that to a temp file and process it. I believe this could all be done in
# one find command with -exec or similar but I can't figure it out
find . -type f -size "${SIZE}" -exec ls -lh {} \; >> "$TEMP"
Inside $TEMP will be a data set that looks like this:
-rw-r--r--# 1 me staff 61K Dec 28 2009 /Users/me/Dropbox/Public/drops/12.28.09/wor-10e619e1-120407.png
-rw-r--r--# 1 me staff 230K Dec 30 2009 /Users/me/Dropbox/Public/drops/12.30.09/hijack-loop-d6250496-153355.pdf
-rw-r--r--# 1 me staff 49K Dec 31 2009 /Users/me/Dropbox/Public/drops/12.31.09/mt-5a819185-180538.png
The trouble is, not all file names are free of spaces, though I have done all I can to make sure variables are quoted and wrapped in parens or braces or quotes where applicable.
With the results in /tmp I run:
# Number of results located by the `find` command above
RESULTS=$(wc -l "$TEMP" | awk '{print $1}')
echo -e "Located: [$RESULTS] total files greater than or equal to $SIZE\n"
# With a result set found via `find`, now use awk to print out the sorted list of file
# sizes and paths.
echo -e "SIZE DATE FILE PATH"
#awk '{print "["$5"] ", $9, $10}' < "$TEMP" | sort -n
awk '{for(i=5;i<=NF;i++) {printf $i " "} ; printf "\n"}' "$TEMP" | sort -n
With the changes to awk from how I had it originally, my result now looks like this:
751K Oct 21 19:00 ./10.21.14/netflix-67-190039.png
760K Sep 14 19:07 ./01.02.15/logos/RCA_old_logo.jpg
797K Aug 21 03:25 ./08.21.14/girl-88-032514.zip
916K Sep 11 21:47 ./09.11.14/small-shot-4d-214727.png
I want it to look like this:
SIZE FILE PATH
========================================
751K ./10.21.14/netflix-67-190039.png
760K ./01.02.15/logos/RCA_old_logo.jpg
797K ./08.21.14/girl-88-032514.zip
916K ./09.11.14/small-shot-4d-214727.png
# All Done
if [ "$?" -ne "0" ]; then
echo "find of drop files larger than $SIZE completed without errors.\n"
exit 1
fi
Original Post to Stack prior to gaining some new information leading to new questions…
The original post is below. Given the new information, I tried some new tactics and ended up with the script and info above.
I have a simple script on Mac OS X; it performs a find on a dir and locates all files of type file with size greater than +SIZE.
These are then appended to a file via >>
From there, I have a file that essentially contains an ls -la listing, so I use awk to get the file size and the file name with this command:
# With a result set found via `find`, now use awk to print out the sorted list of file
# sizes and paths.
echo -e "SIZE FILE PATH"
awk '{print "["$5"] ", $9, $10}' < "$TEMP" | sort -n
All works as I want it to, but I get some filename truncation right at the above code. The entire file is around 30 lines, and I have pinned it down to this line. I think if I throw in a different internal field separator (IFS) that would fix it. I could use \t, as there can't be a \t in Mac OS X filenames.
I thought it was just quoting, but I can't seem to see where, if that is the case. Here is a sample of the data returned; usually I get about 50 results. The first one I stuffed in this file has filename truncation:
[1.0M] ./11.26.14/Bruna Legal
[1.4M] ./12.22.14/card-88-082636.jpg
[1.6M] ./12.22.14/thrasher-8c-082637.jpg
[11M] ./01.20.15/td-6e-225516.mp3
Bruna Legal is "Bruna Legal Name.pdf" on the filesystem.
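For what it's worth, the truncation comes from printing only $9 and $10: a name with more than one space spills into further fields. A hedged tweak of that awk line which keeps the size and then everything from field 9 onward (the field numbers assume the ls -lh layout shown above):
awk '{ printf "%s ", $5; for (i = 9; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n") }' "$TEMP" | sort -n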
You can avoid parsing the output of the ls command and do all the work with find, using its -printf action, like:
find /tmp -type f -maxdepth 1 -size +4k 2>/dev/null -printf "%kKB %f\n" |
sort -nrk1,1
In my example it outputs every file that is bigger than 4 kilobytes. The issue is that the find command cannot print formatted output with the size in MB. In addition the numeric ordering does not work for me with square brackets surrounding the number, so I omit them. In my test it yields:
140KB +~JF7115171557203024470.tmp
140KB +~JF3757415404286641313.tmp
120KB +~JF8126196619419441256.tmp
120KB +~JF7746650828107924225.tmp
120KB +~JF7068968012809375252.tmp
120KB +~JF6524754220513582381.tmp
120KB +~JF5532731202854554147.tmp
120KB +~JF4394954996081723171.tmp
24KB +~JF8516467789156825793.tmp
24KB +~JF3941252532304626610.tmp
24KB +~JF2329724875703278852.tmp
16KB 578829321_2015-01-23_1708257780.pdf
12KB 575998801_2015-01-16_1708257780-1.pdf
8KB adb.log
EDIT: I've noticed that %k is not accurate enough, so you can use %s to print the size in bytes and convert it to KB or MB using awk, like:
find /tmp -type f -maxdepth 1 -size +4k 2>/dev/null -printf "%sKB %f\n" |
sort -nrk1,1 |
awk '{ $1 = sprintf( "%.2fKB", $1 / 1024) } { print }'
It yields:
136.99KB +~JF7115171557203024470.tmp
136.99KB +~JF3757415404286641313.tmp
117.72KB +~JF8126196619419441256.tmp
117.72KB +~JF7068968012809375252.tmp
117.72KB +~JF6524754220513582381.tmp
117.68KB +~JF7746650828107924225.tmp
117.68KB +~JF5532731202854554147.tmp
117.68KB +~JF4394954996081723171.tmp
21.89KB +~JF8516467789156825793.tmp
21.89KB +~JF3941252532304626610.tmp
21.89KB +~JF2329724875703278852.tmp
14.14KB 578829321_2015-01-23_1708257780.pdf
10.13KB 575998801_2015-01-16_1708257780-1.pdf
4.01KB adb.log
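Since the question mentions Mac OS X, where BSD find has no -printf, a hedged equivalent can lean on stat instead (BSD stat takes -f; %z is the size in bytes and %N the file name):
find . -type f -maxdepth 1 -size +4k -exec stat -f '%z %N' {} + |
sort -nrk1,1 |
awk '{ $1 = sprintf( "%.2fKB", $1 / 1024) } { print }'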

Unix shell: group file sizes by extension

I want to group and sort file sizes by extension in the current folder and all subfolders.
for i in $(find . -type f -name '*.*' | sed 's/.*\.//' | sort | uniq)
do
    echo $i
done
I have code which gets all file extensions in the current folder and all subfolders. Now I need to sum all file sizes by those extensions and print them.
Any ideas how this could be done?
example output:
sh (sum of file sizes for the sh extension)
pl (sum of file sizes for the pl extension)
c (sum of file sizes for the c extension)
I would use a loop, so that you can provide a different extension every time and find just the files with that extension:
for extension in c php pl ...
do
find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc
done
The sum is based on the answer in total size of group of files selected with 'find'.
In case you want the very specific output you mention in the question, you can store the last line and then print it together with the extension name:
for extension in c php pl ...
do
sum=$(find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc | tail -1)
echo "$extension ($sum)"
done
If you don't want to name file extensions beforehand, the stat(1) program has a format option (-c) that can make tasks like this a bit easier, if you're on a system that includes it, and xargs(1) usually helps performance.
#!/bin/sh
find . -type f -name '*.*' -print0 |
xargs -0 stat -c '%s %n' |
sed 's/ .*\./ /' |
awk '
{
sums[$2] += $1
}
END {
for (key in sums) {
printf "%s %d\n", key, sums[key]
}
}'
