How to change an awk script to work on Solaris - bash

The below script is a bin-packing first-fit algorithm. It runs normally on Ubuntu Linux, where I can call bin_packing.awk, but when I try to run it on Solaris I get errors.
bin_packing.awk:
function first_fit(v, file) {
    # find the first bin that can accommodate the volume
    for (i=1; i<=n; ++i) {
        if (b[i] > v) {
            b[i] -= v
            bc[i]++
            cmd = "mv " file " subdir_" i
            print cmd
            # system(cmd)
            return
        }
    }
    # no bin found, create a new bin
    if (i > n) {
        b[++n] = c - v
        bc[n]++
        cmd = "mkdir subdir_" n
        print cmd
        # system(cmd)
        cmd = "mv " file " subdir_" n
        print cmd
        # system(cmd)
    }
    return
}

BEGIN { if ((c+0) == 0) exit }

{ first_fit($1, $2) }

END {
    print "REPORT:"
    print "Created", n, "directories"
    for (i=1; i<=n; ++i) print "- subdir_" i, ":", c-b[i], "bytes", bc[i], "files"
}
and to call it:
$ find . -type f -iname '*pdf' -printf "%s %p\n" \
| awk -v c=100000 -f bin_packing.awk
This creates a list of files with the file size in bytes in front of each name. The value c is the maximum size a directory may hold, in bytes; c=100000 above is only an example. The output looks like:
...
mv file_47 subdir_6
mv file_48 subdir_6
mv file_49 subdir_5
mv file_50 subdir_6
REPORT:
Created 6 directories
- subdir_1 : 49 bytes 12 files
- subdir_2 : 49 bytes 9 files
- subdir_3 : 49 bytes 8 files
- subdir_4 : 49 bytes 8 files
- subdir_5 : 48 bytes 8 files
- subdir_6 : 37 bytes 5 files
It shows the errors below when I try to run it on Solaris. Based on feedback, -printf is a GNU feature, so it isn't available in non-GNU versions of find:
find: bad option -printf
find: [-H | -L] path-list predicate-list
awk: syntax error near line 1
awk: bailing out near line 1

On Solaris, use nawk (new awk) or /usr/xpg4/bin/awk (POSIX awk); plain awk there is the original legacy version. Perl can glean the same information as GNU find's -printf. Here is the solution:
$ find . -type f -name '*.pdf' -print | perl -lne '$,=" "; @s=stat $_; print $s[7],$_, $s[2]' | nawk -v c=5000000 -f bin_packing.awk

To work around the missing -printf feature of find, you can also try:
find . -type f -iname '*pdf' -exec stat --printf="%s %n\n" {} \; \
| awk -v c=100000 -f bin_packing.awk
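If GNU stat is not available on the Solaris box either, the same "size path" list can be produced with find and a small perl snippet, both of which ship with Solaris. This is only a sketch of the same idea as the nawk solution above; paths containing spaces will still confuse the awk script, which splits fields on whitespace:
find . -type f -name '*.pdf' -exec perl -le '
    for (@ARGV) { print((stat($_))[7], " ", $_) }   # field 7 of stat() is the size in bytes
' {} + | nawk -v c=100000 -f bin_packing.awk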

Related

How to generate a NUL-delimited stream of timestamped filenames with BSD `stat` command

Let's suppose that you need to generate a NUL-delimited stream of timestamped filenames.
On Linux & Solaris I can do it with:
stat --printf '%.9Y %n\0' -- *
On BSD, I can get the same info, but delimited by newlines, with:
stat -f '%.9Fm %N' -- *
The man page mentions a few escape sequences, but the NUL byte doesn't seem to be supported:
If the % is immediately followed by one of n, t, %, or #, then a newline character, a tab character, a percent character, or the current file number is printed.
Is there a way to work around that? edit: (accurately and efficiently?)
Update:
Sorry, the glob * is misleading. The arguments can contain any path.
I have a working solution that forks a stat call for each path. I want to improve it because of the massive number of files to process.
You may try this workaround when running the stat command on files:
stat -nf "%.9Fm %N/" * | tr / '\0'
Here:
-n: To suppress newlines in stat output
Added / as terminator for each entry from stat output
tr / '\0': To convert / into NUL byte
Another work-around is to use a control character in stat and use tr to replace it with \0 like this:
stat -nf "%.9Fm %N"$'\1' * | tr '\1' '\0'
This will work with directories also.
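A quick way to confirm that the records really are NUL-separated is to inspect them with od (just a verification sketch):
stat -nf "%.9Fm %N"$'\1' * | tr '\1' '\0' | od -c | head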
Unfortunately, stat out of the box does not offer this option, and so what you ask is not directly achievable.
However, you can easily implement the required functionality in a scripting language like Perl or Python.
#!/usr/bin/env python3
from pathlib import Path
from sys import argv

for arg in argv[1:]:
    print(
        Path(arg).stat().st_mtime,
        arg, end="\0")
Demo: https://ideone.com/vXiSPY
The demo exhibits a small discrepancy in the mtime which does not seem to be a rounding error, but the result could be different on MacOS (the demo platform is Debian Linux, apparently). If you want to force the result to a particular number of decimal places, Python has formatting facilities similar to those of stat and printf.
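A usage sketch from the shell, assuming the script above has been saved as nulstat.py (a hypothetical name):
python3 nulstat.py * | od -c | head          # inspect the NUL-separated records
python3 nulstat.py * | xargs -0 -n1 echo     # or hand them to a NUL-aware consumer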
With any command that can't produce NUL-terminated (or any other character/string-terminated) output, you can just wrap it in a function that calls the command and then printfs its output with a terminating NUL instead of a newline, for example:
nulstat() {
    local fmt=$1 file
    shift
    for file in "$@"; do
        printf '%s\0' "$(stat -f "$fmt" "$file")"
    done
}

nulstat '%.9Fm %N' *
For example:
$ > foo
$ > $'foo\nbar'
$ nulstat '%.9Fm %N' foo* | od -c
0000000 1 6 6 3 1 6 2 5 3 6 . 4 7 7 9 8
0000020 0 1 4 0 f o o \0 1 6 6 3 1 6 2
0000040 5 3 9 . 3 8 8 0 6 9 9 3 0 f o
0000060 o \n b a r \0
0000066
1. What you can do (accurate but slow):
Fork a stat command for each input path:
for p in "$@"
do
    stat -nf '%.9Fm' -- "$p" &&
        printf '\t%s\0' "$p"
done
2. What you can do (accurate but twisted):
In the input paths, replace each occurrence of (possibly overlapping) /././ with a single /./, make stat output /././\n at the end of each record, and use awk to substitute each /././\n by a NUL byte:
#!/bin/bash
shopt -s extglob

stat -nf '%.9Fm%t%N/././%n' -- "${@//\/.\/+(.\/)//./}" |
awk -F '/\\./\\./' '{
    if ( NF == 2 ) {
        printf "%s%c", record $1, 0
        record = ""
    } else
        record = record $1 "\n"
}'
N.B. If you wonder why I chose /././\n as record separator then take a look at Is it "safe" to replace each occurrence of (possibly overlapped) /./ with / in a path?
3. What you should do (accurate & fast):
You can use the following perl one‑liner on almost every UNIX/Linux:
LANG=C perl -MTime::HiRes=stat -e '
    foreach (@ARGV) {
        my @st = stat($_);
        if ( @st > 0 ) {
            printf "%.9f\t%s\0", $st[9], $_;
        } else {
            printf STDERR "stat: %s: %s\n", $_, $!;
        }
    }
' -- "$@"
note: for perl < 5.8.9, remove the -MTime::HiRes=stat from the command line.
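A sketch of how the resulting NUL-delimited "mtime<TAB>path" records can be consumed from bash, assuming the one-liner above has been saved as mtime0.sh (a hypothetical name):
./mtime0.sh * | while IFS=$'\t' read -r -d '' mtime path; do
    printf 'mtime=%s  path=%s\n' "$mtime" "$path"
done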
ASIDE: There's a bug in BSD's stat:
When %N is at the end of the format string and the filename ends with a newline character, then its trailing newline might get stripped:
For example:
stat -f '%N' -- $'file1\n' file2
file1
file2
For getting the output that one would expect from stat -f '%N' you can use the -n switch and add an explicit %n at the end of the format string:
stat -nf '%N%n' -- $'file1\n' file2
file1

file2
Is there a way to work around that?
If all you need is to replace all newlines with NULs, then the following tr should suffice:
stat -f '%.9Fm %N' * | tr '\n' '\000'
Explanation: 000 is NUL expressed as an octal value.

Remove text files with less than three lines

I'm using an Awk script to split a big text document into independent files. I did it and now I'm working with 14k text files. The problem here is there are a lot of files with just three lines of text and it's not useful for me to keep them.
I know I can delete lines within a file with awk 'NF>=3' file, but I don't want to delete lines inside files; rather, I want to delete files whose content is just two or three lines of text.
Thanks in advance.
Could you please try the following find command (tested with GNU awk).
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{if (!f) print FILENAME}' {} \;
The above will print on the console the names of files having 3 lines or fewer. Once you are happy with the results, try the following to delete them. I suggest you run it in a test directory first, and only proceed once you are fully satisfied with the output of the command above (remove the echo below; I have kept it to be on the safer side :) ).
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit !f}' {} \; -exec echo rm -f {} \;
If the files in the current directory are all text files, this should be efficient and portable:
for f in *; do
    [ $(head -4 "$f" | wc -l) -lt 4 ] && echo "$f"
done # | xargs rm
Inspect the list, and if it looks OK, then remove the # on the last line to actually delete the unwanted files.
Why use head -4? Because wc doesn't know when to quit. Suppose half of the text files were each more than a terabyte long; if that were the case wc -l alone would be quite slow.
You may use wc to count lines and then decide whether or not to delete each file; you should write a shell script rather than just an awk command.
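A minimal sketch of that idea, assuming plain text files in the current directory (the echo is kept so you can review the list before actually deleting):
#!/bin/sh
for f in *; do
    [ -f "$f" ] || continue
    if [ "$(wc -l < "$f")" -le 3 ]; then
        echo rm -- "$f"     # remove the echo to actually delete
    fi
done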
You can try Perl. The solution below will be efficient because the file handle ARGV is closed as soon as the line count exceeds 3.
perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' *
If you want to pipe the output of some other command (say find) you can use it like
$ find . -name "*" -type f -exec perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' {} \;
./bing.fasta
./chris_smith.txt
./dawn.txt
./drcatfish.txt
./foo.yaml
./ip.txt
./join_tab.pl
./manoj1.txt
./manoj2.txt
./moose.txt
./query_ip.txt
./scottc.txt
./seats.ksh
./tane.txt
./test_input_so.txt
./ya801.txt
$
the output of wc -l * on the same directory
$ wc -l *
12 bing.fasta
16 chris_smith.txt
8 dawn.txt
9 drcatfish.txt
3 fileA
3 fileB
13 foo.yaml
3 hubbs.txt
8 ip.txt
19 join_tab.pl
6 manoj1.txt
6 manoj2.txt
5 moose.txt
17 query_ip.txt
3 rororo.txt
5 scottc.txt
22 seats.ksh
1 steveman.txt
4 tane.txt
13 test_input_so.txt
24 ya801.txt
200 total
$

Listing files that are older than one day in reverse order of modification time

In order to write a cleanup script for a directory, I need to take a look at all files that are older than one day. Additionally, I need to delete them in reverse order of modification time (oldest first) until a specified size is reached.
I came up with the following approach to list the files:
find . -mtime +1 -exec ls -a1rt {} +
Am I right, that this does not work for a large number of files (since more than one 'ls' will be executed)? How can I achieve my goal in that case?
You can use the following command to find the 10 oldest files:
find . -mtime +1 -type f -printf '%T@ %p\n' | sort -n | head -10 | awk '{print $2}'
The steps used:
For each file returned by find, we print the modification timestamp along with the filename.
Then we numerically sort by the timestamp.
We take the first 10.
We print only the filename part.
Later if you want to remove them, you can do the following:
rm $(...)
where ... is the command described above.
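If you do want the full delete-oldest-until-a-size-is-reached behaviour from the question in shell, here is a rough sketch; it assumes GNU find and paths without embedded newlines, and /path/to/directory plus the 30 MB limit are placeholder values. Keep the echo until the output looks right:
dir=/path/to/directory
limit=$((30 * 1024 * 1024))
# current total size of regular files under $dir, in bytes
total=$(find "$dir" -type f -printf '%s\n' | awk '{ s += $1 } END { printf "%d\n", s }')
# candidates older than one day, oldest first: "epoch-mtime size path"
find "$dir" -mtime +1 -type f -printf '%T@ %s %p\n' | sort -n |
while read -r mtime size path; do
    [ "$total" -le "$limit" ] && break
    echo rm -- "$path"
    total=$((total - size))
done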
Here is a perl script that you can use to delete the oldest files first in a given directory, until the total size of the files in the directory gets down to a given size:
&CleanupDir("/path/to/directory/", 30*1024*1024); #delete oldest files first in /path/to/directory/ until total size of files in /path/to/directory/ gets down to 30MB
sub CleanupDir {
    my($dirname, $dirsize) = @_;
    my($cmd, $r, @lines, $line, @vals, $b, $dsize, $fname);
    $b=1;
    while($b) {
        $cmd="du -k " . $dirname . " | cut -f1";
        $r=`$cmd`;
        $dsize=$r * 1024;
        #print $dsize . "\n";
        if($dsize>$dirsize) {
            $cmd=" ls -lrt " . $dirname . " | head -n 100";
            $r=`$cmd`;
            @lines=split(/\n/, $r);
            foreach $line (@lines) {
                @vals=split(" ", $line);
                if($#vals>=8) {
                    if(length($vals[8])>0) {
                        $fname=$dirname . $vals[8];
                        #print $fname . "\n";
                        unlink $fname;
                    }
                }
            }
        } else {
            $b=0;
        }
    }
}

Unix shell group files extensions by size

I want to group and sort file sizes by extension in the current folder and all subfolders.
for i in `find . -type f -name '*.*' | sed 's/.*\.//' | sort | uniq`
do
    echo $i
done
I've got code which gets all file extensions in the current folder and all subfolders. Now I need to sum all file sizes by those extensions and print them.
Any ideas how this could be done?
example output:
sh (files sizes sum by sh extension)
pl (files sizes sum by pl extension)
c (files sizes sum by c extension)
I would use a loop, so that you can provide a different extension every time and find just the files with that extension:
for extension in c php pl ...
do
    find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc
done
The sum is based on the answer in total size of group of files selected with 'find'.
In case you want the very specific output you mention in the question, you can store the last line and then print it together with the extension name:
for extension in c php pl ...
do
    sum=$(find . -type f -name "*.$extension" -print0 | du --files0-from=- -hc | tail -1)
    echo "$extension ($sum)"
done
If you don't want to name file extensions beforehand, the stat(1) program has a format option (-c) that can make tasks like this a bit easier, if you're on a system that includes it, and xargs(1) usually helps performance.
#!/bin/sh
find . -type f -name '*.*' -print0 |
xargs -0 stat -c '%s %n' |
sed 's/ .*\./ /' |
awk '
{
    sums[$2] += $1
}
END {
    for (key in sums) {
        printf "%s %d\n", key, sums[key]
    }
}'
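If you would rather have the totals sorted by size and printed in human-readable units, the awk output can be post-processed; a sketch assuming GNU coreutils' numfmt is available:
find . -type f -name '*.*' -print0 |
xargs -0 stat -c '%s %n' |
sed 's/ .*\./ /' |
awk '{ sums[$2] += $1 } END { for (key in sums) printf "%s %d\n", key, sums[key] }' |
sort -k2 -n |
numfmt --field=2 --to=iec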

bash script to help identify files and path with a count sorted by day of the week recursively

I'm not really a scripter (yet), so apologies in advance.
What I need to do is search a path for files modified within the last 7 days, then count the number of files in each directory for each day (Monday to Sunday).
So for example:
From folder - Rootfiles

Directory 1:
    Number of files Monday
    Number of files ..n
    Number of files Sunday

Directory 2:
    Number of files Monday
    Number of files ..n
    Number of files Sunday
So far I have this from my basic command line knowledge and a bit of research.
#!/bin/bash
find . -type f -mtime -7 -exec ls -l {} \; | grep "^-" | awk '{
    key=$6$7
    freq[key]++
}
END {
    for (date in freq)
        printf "%s\t%d\n", date, freq[date]
}'
But there are a couple of problems: I need to print each directory, and then I need to figure out the Monday, Tuesday, Wednesday sorting.
Also, for some reason it works on my test folders with basic folders and names, but not on the production folders.
Even some pointers on where to start thinking would be helpful.
Thanks in advance all, you're all awesome!
Neil
I found some additional code that is helping:
#!/bin/bash
# pass in the directory to search on the command line, use $PWD if no arg received
rdir=${1:-$(pwd)}
# if $rdir is a file, get its directory
if [ -f $rdir ]; then
    rdir=$(dirname $rdir)
fi
# first, find our tree of directories
for dir in $( find $rdir -type d -print ); do
    # get a count of directories within $dir.
    sdirs=$( find $dir -maxdepth 1 -type d | wc -l );
    # only proceed if sdirs is less than 2 ( 1 = self ).
    if (( $sdirs < 2 )); then
        # get a count of all the files in $dir (but not in subdirs of $dir)
        files=$( find $dir -maxdepth 1 -type f | wc -l );
        echo "$dir : $files";
    fi
done
if I could somehow replace the line
sdirs=$( find $dir -maxdepth 1 -type d | wc -l );
with my original code block that would help.
props to
https://unix.stackexchange.com/questions/22803/counting-files-in-leaves-of-directory-tree
for that bit of code
Neat problem.
I think in your find command you will want to add --time-style=+%w to the ls -l call to get the day of the week.
find . -type f -mtime -7 -exec ls -l --time-style=+%w {} \;
I'm not sure why you are grepping for lines that start with a dash (since you're already only finding files.) This is not necessary, so I would remove it.
Then I would get the directory names from this output by stripping the filenames, or everything after the last slash from each line.
| sed -e 's:/[^/]*$:/:'
Then I would cut out all the tokens before the day of the week. Since you're using . as the starting point, you can expect each directory to start with ./.
| sed -e 's:.*\([0-6]\) \./:\1 ./:'
From here you can sort -k2 to sort by directory name and then day of the week.
Eventually you can pipe this into uniq -c to get the counts of days per week by directory, but I would convert it to human readable days first.
| awk '
/^0/ { $1 = "Monday " }
/^1/ { $1 = "Tuesday " }
/^2/ { $1 = "Wednesday" }
/^3/ { $1 = "Thursday " }
/^4/ { $1 = "Friday " }
/^5/ { $1 = "Saturday " }
/^6/ { $1 = "Sunday " }
{ print $0 }
'
Putting this all together:
find . -type f -mtime -7 -exec ls -l --time-style=+%w {} \; \
| sed -e 's:/[^/]*$:/:' \
| sed -e 's:.*\([0-6]\) \./:\1 ./:' \
| sort -k2 \
| awk '
/^0/ { $1 = "Monday " }
/^1/ { $1 = "Tuesday " }
/^2/ { $1 = "Wednesday" }
/^3/ { $1 = "Thursday " }
/^4/ { $1 = "Friday " }
/^5/ { $1 = "Saturday " }
/^6/ { $1 = "Sunday " }
{ print $0 }
' | uniq -c
On my totally random PWD it looks like this:
1 Monday ./
1 Tuesday ./
5 Saturday ./
2 Sunday ./
17 Monday ./M3Javadoc/
1 Thursday ./M3Javadoc/
1 Saturday ./M3Javadoc/
1 Sunday ./M3Javadoc/
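If your find is GNU find, a shorter route to the same per-directory, per-weekday counts is to let -printf emit the weekday name and the directory directly. A sketch; the field-based sort assumes directory names without whitespace, and the weekdays come out alphabetically rather than Monday to Sunday:
find . -type f -mtime -7 -printf '%TA %h/\n' | sort -k2 -k1 | uniq -c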
