How to get 20% of the total number of files in a folder? - bash

I am using the shell to count the number of files in a folder. For example, if folder A has 100 files, I want to show only 20% of them, and the result must be an integer, i.e. 20. This is my code, but it fails:
file_num= find . -type f | wc -l
prob_select=0.2
file_num=$(expr $file_num \* $prob_select)
file_num=$( printf "%.0f" $file_num)

For a somewhat simpler approach, which shows every nth file instead of requiring you to know how many there are before deciding which ones to display:
find . -type f | awk -v factor=5 'NR%factor == 0'

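You can't do floating-point arithmetic like that in bash (and expr only handles integers too), but you can convert 20% to a fraction (20% -> 0.2 -> 2/10 -> 1/5). Note also that file_num= find ... does not capture the output of find; that needs command substitution, $(...). So:
file_num=$(($(find . -type f | wc -l) / 5)); echo "${file_num}"
This gives you 20% of the number of files found, rounded down to an integer.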
Next, just run find . -type f | head -n "${file_num}"
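The same integer arithmetic extends to any whole-number percentage. A minimal sketch, assuming a hypothetical helper named percent_of_files (not from the original answers) and assuming no file name contains a newline, since each embedded newline would be counted as an extra file:

```bash
# percent_of_files DIR PCT: print PCT percent of the number of regular
# files under DIR, rounded down (shell arithmetic is integer-only).
# The function name is illustrative, not part of the original post.
percent_of_files() {
    # Multiply before dividing so that no precision is lost early.
    echo $(( $(find "$1" -type f | wc -l) * $2 / 100 ))
}

# Example: list the first 20% of the files found below the current directory.
# n=$(percent_of_files . 20)
# find . -type f | head -n "$n"
```

Multiplying first matters: 7 files at 50% gives 7 * 50 / 100 = 3, whereas dividing first would give 0.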

Related

Bash to find missing file

I'm counting files in a photos folder:
% find . -type f | wc -l
22188
Then I'm counting files per extension:
% find . -type f | sed -n 's/..*\.//p' | sort | uniq -c
268 AVI
14983 JPG
61 MOV
1 MP4
131 MPG
1 VOB
21 avi
1 jpeg
6602 jpg
12 mov
20 mp4
74 mpg
12 png
The sum of that is 22187, not 22188. So I thought it could be a file without extension:
% find . -type f ! -name "*.*"
But the result was empty. Maybe a file starting with .:
% find . -type f ! -name "?*.*"
But also empty. How can I find out what that file is?
I'm on macOS 10.15.
This command should find the missing file:
comm -3 <(find . -type f | sort) <(find . -type f | sed -n '/..*\./p' | sort)
Perhaps a file with an embedded carriage return (or linefeed)?
Would be curious to see what this generates:
find . -type f | grep -Eiv '\.avi|\.jpg|\.mov|\.mp4|\.mpg|\.vob|\.avi|\.jpeg|\.png'
Would you please try:
find . -type f -name $'*\n*'
It will pick up file names that contain a newline character.
ANSI-C quoting ($'...') is supported by the bash 3.2.x that ships with macOS.
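If an embedded newline is the cause, the two totals differ because both pipelines count lines rather than files. A minimal sketch, assuming a hypothetical helper named count_files, that counts the NUL byte written after each path by -print0, so every file is counted exactly once whatever its name contains:

```bash
# count_files DIR: count regular files under DIR, one NUL terminator per file.
# The function name is illustrative only.
count_files() {
    # tr -cd '\0' deletes every byte except NUL; wc -c then counts them.
    # The final tr strips the padding that BSD/macOS wc adds.
    find "$1" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' '
}

# Example: compare with the line-based count.
# count_files .
# find . -type f | wc -l
```

If the two numbers differ, at least one file name contains a newline.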

How do I get a list of files that have line count below 18

I need to search for files in a directory by month/year and pass them through wc -l or lines and test if [ $lines -le 18 ], or something similar and give me a list of files that match.
In the past I called this with 'file.sh 2020-06' and used something like this to process the files for that month:
find . -name "* $1-*" -exec grep '(1 |2 |3 )' {}
but I now need to test for a line count.
The above -exec worked but when I changed over to passing the file to another exec I get complaints of "too many parameters" because the file name has spaces. I just can't seem to get on track with solving this one.
Any pointers to get me going would be very much appreciated.
Rick
Here's one using find and awk. But first some test files (Notice: it creates files named 16, 17, 18 and 19):
$ for i in 16 17 18 19 ; do seq 1 $i > $i ; done
Then:
$ find . -name 1\[6789\] -exec awk 'NR==18{exit c=1}END{if(!c) print FILENAME}' {} \;
./16
./17
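For the requirement as stated in the question (names containing spaces, selected by a month such as 2020-06, and a line count of at most 18), a minimal sketch that lets find hand the paths to a small inline shell, so the names are never word-split. The function name short_files and its three arguments are illustrative assumptions:

```bash
# short_files DIR PATTERN MAX: print files under DIR whose name matches
# PATTERN and that contain at most MAX lines (as counted by wc -l).
short_files() {
    find "$1" -type f -name "$2" -exec sh -c '
        max=$1; shift
        for f in "$@"; do
            # Arithmetic expansion strips the padding some wc versions print.
            lines=$(( $(wc -l < "$f") ))
            if [ "$lines" -le "$max" ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$3" {} +
}

# Example, mirroring the "file.sh 2020-06" call from the question:
# short_files . "* $1-*" 18
```

Because wc -l counts newline characters, a file whose last line lacks a trailing newline is reported as one line shorter.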

Cleanest way to get the highest suffix (or prefix) of a certain file type in a set of directories with bash?

I have a set of data files across a number of directories with format
ls lcp01/output/
> dst000.dat dst001.dat ... dst075.dat nn000.dat nn001.dat ... nn036.dat aa000.dat aa001.dat ... aa040.dat
That is to say, there are a set of directories lcp01 through lcp25 with a collection of different data files in their output folders. I want to know what the highest number dstXXX.dat file is in each directory (in the example shown the result would be 75).
I wrote a script which achieves this, but I'm not satisfied with the final step which feels a bit hacky:
#!/bin/bash
for i in `seq -f "%02g" 1 25`; #specify dir extensions 1 through 25
do
echo " "
echo $i
names=($(ls lcp$i/output | grep dst )) #dir containing dst files
NUMS=()
for j in "${names[@]}";
do
temp="$(echo $j | tr -dc '0-9' && printf " ")" # record suffixes for each dst file
NUMS+=("$((10#$temp))") #force base 10 interpretation of dst suffixes
done
numList="$(echo "${NUMS[*]}" | sort -nr | head -n1)"
echo ${numList:(-3)} #print out the last 3 characters of the sorted list - the largest file suffix
done
The final two steps organise the list of output indices, then I show the last 3 characters of that list which will be my largest file number (providing the file numbers are smaller than 100).
Is there a cleaner way of doing this? Ideally I would like more control over the output format, but mainly it's the step of reading the last 3 characters out. I would like to be able to just output the largest number, which should be the last element of the list but I cannot figure out how.
You could do something like the following:
for d in lcp[0-9][0-9]; do find "$d" -name 'dst*.dat' -print | sort -u | tail -n1; done
The above command only works if every number has the same count of digits (dst001.dat through dst999.dat), because the names are sorted as strings. If that's not the case:
for d in lcp[0-9][0-9]; do echo -n "$d:"; find "$d" -name 'dst*.dat' -print | grep -o '[0-9]*\.dat$' | sort -n | tail -n1; done
Using filename expansion:
for d in lcp*/output; do
files=( "$d"/dst*.dat )
file=${files[-1]} # negative index: not available in bash 3.2
[[ -e $file ]] || continue
file=${file##*/dst} # strip the directory part and the dst prefix
echo "${file%.dat}"
done
or with the extglob option, to restrict the pattern to digits:
shopt -s extglob
... lcp*([0-9])/output
... $d/dst*([0-9]).dat
...
file=${file##*/dst*(0)}
...
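A minimal sketch that combines the globbing idea with a numeric sort, so it also works when the numbers do not all have the same width. The function name highest_suffix and its two arguments are illustrative assumptions, and the .dat extension is taken from the question:

```bash
# highest_suffix DIR PREFIX: print the largest number N for which
# DIR/PREFIX<N>.dat exists; print nothing if there is no such file.
# The body runs in a subshell "( ... )" so its variables stay local.
highest_suffix() (
    dir=$1 prefix=$2
    for f in "$dir/$prefix"[0-9]*.dat; do
        [ -e "$f" ] || continue        # the glob matched nothing
        n=${f##*/"$prefix"}            # drop the directory and the prefix
        printf '%s\n' "${n%.dat}"      # drop the extension
    done | sort -n | tail -n 1
)

# Example, looping over the directories from the question:
# for d in lcp[0-9][0-9]/output; do
#     echo "$d: $(highest_suffix "$d" dst)"
# done
```

The result keeps any leading zeros (075 for the example in the question); in bash, $((10#075)) turns that into 75.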

Bash: find and path

As an extension of this question, I would now like to have not only the filename but also the directories up to k positions back. Here's the problem:
I have directories named RUN1, RUN2, and RUN3
Each directory has some files. Directory RUN1 has files mod1_1.csv, mod1_2.csv, mod1_3.csv. Directory RUN2 has files mod2_1.csv, mod2_2.csv, mod3_3.csv, etc.
The contents of mod1_1.csv file look like this:
5.71 6.66 5.52 6.90
5.78 6.69 5.55 6.98
5.77 6.63 5.73 6.91
And mod1_2.csv looks like this:
5.73 6.43 5.76 6.57
5.79 6.20 5.10 7.01
5.71 6.21 5.34 6.81
In RUN2, mod2_1.csv looks like this:
5.72 6.29 5.39 5.59
5.71 6.10 5.10 7.34
5.70 6.23 5.23 6.45
And mod2_2.csv looks like this:
5.72 6.29 5.39 5.69
5.71 6.10 5.10 7.32
5.70 6.23 5.23 6.21
My goal is to obtain the line with the smallest value of column 4 for each RUN* directory, and write that and the model which gave it and part of the path to a new .csv file. Right now, I have this code:
#!/bin/bash
resultfile="best_results_mlp_onelayer.txt"
for d in $(find . -type d -name 'RUN*' | sort);
do
find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \;|sort -k4 -g |head -1 >> "$resultfile"
done
This gives me:
5.73 6.43 5.76 6.57 ./RUN_1/mod1_2.csv
5.72 6.29 5.39 5.59 ./RUN_2/mod2_1.csv
But I would like a .csv file with these contents:
5.73 6.43 5.76 6.57 ./DIR1/DIR2/DIR3/RUN_1/mod1_2.csv
5.72 6.29 5.39 5.59 ./DIR1/DIR2/DIR3/RUN_2/mod2_1.csv
where my pwd is /DIRk/DIRm/DIRl/DIR1/DIR2/DIR3
EDIT:
Based on a reply, what I mean by 'k positions back' is:
Right now, my code gives me ./RUN_1/mod1_2.csv as the last column value in the first row. To me, that is a pwd 'one position back', because it shows the directory where the file mod1_2.csv is located. I would like the path '4 positions back'. That is, I would like ./DIR1/DIR2/DIR3/RUN_1/mod1_2.csv. I said 'k' because that's a common placeholder, and I was hoping I could just substitute a number in there.
Following dgeorgiev's answer, I placed my results gathering code in a directory further up in the hierarchy. So, continuing from my question, if my pwd is /DIRk/DIRm/DIRl/DIR1/DIR2/DIR3, I moved my .sh file to DIRk/DIRm/DIRl. Then I ran this:
#!/bin/bash
resultfile="best_results_mlp.txt"
for d in $(find . -type d -name 'RUN*' | sort);
do
find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \;|sort -k4 -g |head -1 >> "$resultfile"
done
And the result was, as desired:
5.73 6.43 5.76 6.57 ./DIR1/DIR2/DIR3/RUN_1/mod1_2.csv
5.72 6.29 5.39 5.59 ./DIR1/DIR2/DIR3/RUN_2/mod2_1.csv
Further to my answer in your previous question:
find passes the file with the path under which it was found. So if you search in "/path/to/$d", you will get "/path/to/$d/filename.csv". Just make find search in the path you would like to get.
So if your RUN* dirs are located in /path/to/, and you would like to have ./to/RUNx/filename.csv in your results, you can always do
cd /path/ && find ./to/RUNx/ # ...
If you need the absolute path, you can just run find on /path/to/RUNx
Just be careful when changing directories, and make sure to change back to where you have to if necessary. (eg. You might have to provide the path to your output file)
I don't see any commas in these CSVs. I assume you're just separating by whitespace. And since you're already using awk in your find line, I guess we can assume that you're open to awk-based options.
$ find . -type f
./RUN1/mod1_1
./RUN1/mod1_2
./RUN2/mod2_1
./RUN2/mod2_2
$ awk 'NR == 1 {n=$4; f=FILENAME} $4 > n {n=$4; f=FILENAME} END {print f,n}' RUN*/mod*
RUN2/mod2_1 7.34
This uses the awk built-in variable FILENAME which always contains the name of the current file.
I can't tell from your question what you mean by "k positions back", but you can strip or parse this output however you see fit.
How about something like this
find . -type d -name 'RUN*' | while read -r dir; do
awk '{print $0, FILENAME}' "$dir"/mod* \
| sort -k4 -g | head -1
done
(sorry about my original misinterpretation of your requirements, edited to correct the issue)
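A minimal sketch of the same per-directory idea done in a single awk pass instead of sort and head. The search root is passed in, so the printed path starts wherever you choose, which is how the "k positions back" prefix is controlled. The function name best_per_run is an illustrative assumption, and column 4 is assumed to hold plain numbers:

```bash
# best_per_run ROOT: for every RUN* directory below ROOT, print the line
# with the smallest value in column 4, followed by the file it came from.
best_per_run() {
    find "$1" -type d -name 'RUN*' | sort | while IFS= read -r dir; do
        set -- "$dir"/mod*
        [ -e "$1" ] || continue        # no mod* files in this directory
        awk 'NF >= 4 && (!seen++ || $4 + 0 < min) {
                 min = $4 + 0          # remember the smallest value so far
                 best = $0 " " FILENAME
             }
             END { if (seen) print best }' "$@"
    done
}

# Example: run from /DIRk/DIRm/DIRl to get paths such as ./DIR1/DIR2/DIR3/RUN_1/...
# best_per_run ./DIR1/DIR2/DIR3 > best_results_mlp.txt
```

As with the loop over find output in the question, this assumes directory names without embedded newlines.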

Unix shell group files extensions by size

I want to group and sort file sizes by extension in the current folder and all subfolders.
for i in `find . -type f -name '*.*' | sed 's/.*\.//' | sort | uniq `
do
echo $i
done
This code gets all the file extensions in the current folder and all subfolders.
Now I need to sum the file sizes for each of those extensions and print the sums.
Any ideas how this could be done?
example output:
sh (files sizes sum by sh extension)
pl (files sizes sum by pl extension)
c (files sizes sum by c extension)
I would use a loop, so that you can provide a different extension every time and find just the files with that extension:
for extension in c php pl ...
do
find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc
done
The sum is based on the answer to "total size of group of files selected with 'find'".
In case you want the very specific output you mention in the question, you can store the last line and then print it together with the extension name:
for extension in c php pl ...
do
sum=$(find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc | tail -1)
echo "$extension ($sum)"
done
If you don't want to name file extensions beforehand, the stat(1) program has a format option (-c) that can make tasks like this a bit easier, if you're on a system that includes it, and xargs(1) usually helps performance.
#!/bin/sh
find . -type f -name '*.*' -print0 |
xargs -0 stat -c '%s %n' |
sed 's/ .*\./ /' |
awk '
{
sums[$2] += $1
}
END {
for (key in sums) {
printf "%s %d\n", key, sums[key]
}
}'
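stat -c is the GNU coreutils form; BSD and macOS stat use a different option syntax. A slower but more portable sketch, assuming a hypothetical helper named size_by_ext, reads each size with wc -c instead and copes with spaces in file names:

```bash
# size_by_ext DIR: print "extension total_bytes" for every extension found
# under DIR, sorted by extension. The function name is illustrative only.
size_by_ext() {
    find "$1" -type f -name '*.*' -exec sh -c '
        for f in "$@"; do
            name=${f##*/}                      # strip the directory part
            # Emit "size extension"; $(( )) strips any padding from wc.
            printf "%s %s\n" "$(( $(wc -c < "$f") ))" "${name##*.}"
        done
    ' sh {} + |
    awk '{
        size = $1
        sub(/^[0-9]+ /, "")                    # the rest of the line is the extension
        sums[$0] += size
    }
    END { for (ext in sums) print ext, sums[ext] }' |
    sort
}

# Example:
# size_by_ext .
```

Like the original loop, this treats everything after the last dot as the extension, so a hidden file such as .profile is grouped under "profile".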
