I'm using an Awk script to split a big text document into independent files. I did it and now I'm working with 14k text files. The problem here is there are a lot of files with just three lines of text and it's not useful for me to keep them.
I know I can delete lines in a text with awk 'NF>=3' file, but I don't want to delete lines inside files; rather, I want to delete the files whose content is just two or three lines of text.
Thanks in advance.
Could you please try the following find command? (Tested with GNU awk.)
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{if (!f) print FILENAME}' {} \;
The above prints the names of the files that have 3 or fewer lines. Once you are happy with that output, try the following to delete them. I suggest running it in a test directory first and proceeding only once you are fully satisfied. (Remove the echo from the command below to actually delete; I have left it in to be on the safe side :) )
find /your/path/ -type f -exec awk -v lines=3 'NR>lines{f=1; exit} END{exit f}' {} \; -exec echo rm -f {} \;
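find only runs a second -exec when the first one exits with status 0, so the awk test has to exit 0 for exactly the files that should go. A minimal sketch of that convention follows; the helper names (is_short_file, list_short_files) are made up for illustration, and the default of 3 lines comes from the question.

```shell
#!/bin/sh
# is_short_file FILE [MAX]: exit status 0 if FILE has MAX or fewer lines
# (MAX defaults to 3). Helper names are illustrative, not from the thread.
is_short_file() {
    awk -v max="${2:-3}" 'NR > max { big = 1; exit } END { exit big }' "$1"
}

# list_short_files DIR [MAX]: dry run, only prints the names of short files.
# -print is reached only when awk exited with status 0.
list_short_files() {
    find "$1" -type f -exec awk -v max="${2:-3}" \
        'NR > max { big = 1; exit } END { exit big }' {} \; -print
}
```

Swapping -print for -exec rm -f {} \; would turn the dry run into the real deletion.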
If the files in the current directory are all text files, this should be efficient and portable:
for f in *; do
[ $(head -4 "$f" | wc -l) -lt 4 ] && echo "$f"
done # | xargs rm
Inspect the list, and if it looks OK, then remove the # on the last line to actually delete the unwanted files.
Why use head -4? Because wc doesn't know when to quit. Suppose half of the text files were each more than a terabyte long; if that were the case wc -l alone would be quite slow.
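The head-then-wc idea can be wrapped in a small function so that the threshold is not hard-coded. This is only a sketch: the name has_at_most_lines is made up, and a POSIX shell is assumed.

```shell
#!/bin/sh
# has_at_most_lines FILE N: succeed if FILE contains N or fewer lines.
# head stops after N+1 lines, so huge files are never read to the end.
has_at_most_lines() {
    n=$(head -n "$(( $2 + 1 ))" -- "$1" | wc -l)
    # $(( n )) drops the padding some wc implementations print
    [ "$(( n ))" -le "$2" ]
}

# Usage (prints candidates only, deletes nothing):
# for f in *; do
#     [ -f "$f" ] && has_at_most_lines "$f" 3 && printf '%s\n' "$f"
# done
```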
You can use wc to count the lines and then decide whether or not to delete each file. You should write a shell script instead of just an awk command.
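You can try Perl. The solution below is efficient because the file handle ARGV is closed as soon as the line count exceeds 3. Note that it prints the names of the files with more than three lines, i.e. the ones to keep.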
perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' *
If you want to drive it from some other command (say find), you can use it like this:
$ find . -name "*" -type f -exec perl -nle ' close(ARGV) if ($.>3) ; $kv{$ARGV}++; END { for(sort keys %kv) { print if $kv{$_}>3 } } ' {} \;
./bing.fasta
./chris_smith.txt
./dawn.txt
./drcatfish.txt
./foo.yaml
./ip.txt
./join_tab.pl
./manoj1.txt
./manoj2.txt
./moose.txt
./query_ip.txt
./scottc.txt
./seats.ksh
./tane.txt
./test_input_so.txt
./ya801.txt
$
the output of wc -l * on the same directory
$ wc -l *
12 bing.fasta
16 chris_smith.txt
8 dawn.txt
9 drcatfish.txt
3 fileA
3 fileB
13 foo.yaml
3 hubbs.txt
8 ip.txt
19 join_tab.pl
6 manoj1.txt
6 manoj2.txt
5 moose.txt
17 query_ip.txt
3 rororo.txt
5 scottc.txt
22 seats.ksh
1 steveman.txt
4 tane.txt
13 test_input_so.txt
24 ya801.txt
200 total
$
Let's suppose that you need to generate a NUL-delimited stream of timestamped filenames.
On Linux & Solaris I can do it with:
stat --printf '%.9Y %n\0' -- *
On BSD, I can get the same info, but delimited by newlines, with:
stat -f '%.9Fm %N' -- *
The man page mentions a few escape sequences, but the NUL byte doesn't seem to be supported:
If the % is immediately followed by one of n, t, %, or #, then a newline character, a tab character, a percent character, or the current file number is printed.
Is there a way to work around that? edit: (accurately and efficiently?)
Update:
Sorry, the glob * is misleading. The arguments can contain any path.
I have a working solution that forks a stat call for each path. I want to improve it because of the massive number of files to process.
You can try this workaround when running stat on the files in a directory:
stat -nf "%.9Fm %N/" * | tr / '\0'
Here:
-n: suppresses the newline that stat would otherwise print after each entry
The / added to the format acts as a terminator for each entry in the stat output
tr / '\0': converts each / into a NUL byte
Another work-around is to use a control character in stat and use tr to replace it with \0 like this:
stat -nf "%.9Fm %N"$'\1' * | tr '\1' '\0'
This will work with directories also.
Unfortunately, stat out of the box does not offer this option, and so what you ask is not directly achievable.
However, you can easily implement the required functionality in a scripting language like Perl or Python.
#!/usr/bin/env python3
from pathlib import Path
from sys import argv
for arg in argv[1:]:
    print(
        Path(arg).stat().st_mtime,
        arg, end="\0")
Demo: https://ideone.com/vXiSPY
The demo exhibits a small discrepancy in the mtime which does not seem to be a rounding error, but the result could be different on MacOS (the demo platform is Debian Linux, apparently). If you want to force the result to a particular number of decimal places, Python has formatting facilities similar to those of stat and printf.
With any command that can't produce NUL-terminated (or any other character/string terminated) output, you can just wrap it in a function that calls the command and then printfs its output with a terminating NUL instead of a newline, for example:
nulstat() {
    local fmt=$1 file
    shift
    for file in "$@"; do
        printf '%s\0' "$(stat -f "$fmt" "$file")"
    done
}
nulstat '%.9Fm %N' *
For example:
$ > foo
$ > $'foo\nbar'
$ nulstat '%.9Fm %N' foo* | od -c
0000000 1 6 6 3 1 6 2 5 3 6 . 4 7 7 9 8
0000020 0 1 4 0 f o o \0 1 6 6 3 1 6 2
0000040 5 3 9 . 3 8 8 0 6 9 9 3 0 f o
0000060 o \n b a r \0
0000066
1. What you can do (accurate but slow):
Fork a stat command for each input path:
for p in "$@"
do
    stat -nf '%.9Fm' -- "$p" &&
    printf '\t%s\0' "$p"
done
2. What you can do (accurate but twisted):
In the input paths, replace each occurrence of (possibly overlapping) /././ with a single /./, make stat output /././\n at the end of each record, and use awk to substitute each /././\n by a NUL byte:
#!/bin/bash
shopt -s extglob
stat -nf '%.9Fm%t%N/././%n' -- "${@//\/.\/+(.\/)//./}" |
awk -F '/\\./\\./' '{
if ( NF == 2 ) {
printf "%s%c", record $1, 0
record = ""
} else
record = record $1 "\n"
}'
N.B. If you wonder why I chose /././\n as record separator then take a look at Is it "safe" to replace each occurrence of (possibly overlapped) /./ with / in a path?
3. What you should do (accurate & fast):
You can use the following perl one‑liner on almost every UNIX/Linux:
LANG=C perl -MTime::HiRes=stat -e '
    foreach (@ARGV) {
        my @st = stat($_);
        if ( @st > 0 ) {
            printf "%.9f\t%s\0", $st[9], $_;
        } else {
            printf STDERR "stat: %s: %s\n", $_, $!;
        }
    }
' -- "$@"
note: for perl < 5.8.9, remove the -MTime::HiRes=stat from the command line.
ASIDE: There's a bug in BSD's stat:
When %N is at the end of the format string and the filename ends with a newline character, then its trailing newline might get stripped:
For example:
stat -f '%N' -- $'file1\n' file2
file1
file2
For getting the output that one would expect from stat -f '%N' you can use the -n switch and add an explicit %n at the end of the format string:
stat -nf '%N%n' -- $'file1\n' file2
file1

file2
Is there a way to work around that?
If all you need is to replace every newline with a NUL, then the following tr should suffice (as long as no filename itself contains a newline):
stat -f '%.9Fm %N' * | tr '\n' '\000'
Explanation: \000 is the NUL byte expressed as an octal escape.
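One caveat with translating newlines after the fact: a newline inside a filename becomes a NUL too, so one file turns into two records. The sketch below makes the miscount visible. The record stream is simulated with printf rather than produced by stat, and both function names are hypothetical.

```shell
#!/bin/sh
# count_nul_records: count the NUL bytes (record terminators) on stdin.
count_nul_records() {
    n=$(tr -cd '\000' | wc -c)
    echo "$(( n ))"
}

# Two simulated "mtime name" lines as a newline-delimited stat would print
# them; the second name contains an embedded newline.
simulate_stat_output() {
    printf '%s\n' '1663162536.477980140 foo'
    printf '%s\n' '1663162539.388069930 foo
bar'
}

# Usage: two files go in, but three records come out.
# simulate_stat_output | tr '\n' '\000' | count_nul_records
```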
I executed a command on Linux to list all the files & subfiles (with specific format) in a folder.
This command is:
ls -R | grep -e "\.txt$" -e "\.py$"
On the other hand, I have some filenames stored in a .txt file (one per line).
I want to show the result of my previous command, but I want to filter the result using the file called filters.txt.
If the result is in the file, I keep it
Else, I do not keep it.
How can I do it, in bash, in only one line?
I suppose this is something like:
ls -R | grep -e "\.txt$" -e "\.py$" | grep filters.txt
An example of the files:
# filters.txt
README.txt
__init__.py
EDIT 1
I am trying to use a file instead of a list of arguments because I get the error:
'/bin/grep: Argument list too long'
EDIT 2
# The result of the command ls -R
-rw-r--r-- 1 XXX 1 Oct 28 23:36 README.txt
-rw-r--r-- 1 XXX 1 Oct 28 23:36 __init__.py
-rw-r--r-- 1 XXX 1 Oct 28 23:36 iamaninja.txt
-rw-r--r-- 1 XXX 1 Oct 28 23:36 donttakeme.txt
-rw-r--r-- 1 XXX 1 Oct 28 23:36 donttakeme2.txt
What I want as a result:
-rw-r--r-- 1 XXX 1 Oct 28 23:36 README.txt
-rw-r--r-- 1 XXX 1 Oct 28 23:36 __init__.py
You can use comm (both inputs must be sorted):
comm -12 <(ls -R | grep -e "\.txt$" -e "\.py$" | sort) <(sort filters.txt)
This will give you the intersection of the two lists.
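comm walks both inputs in parallel and expects them to be sorted in the same collation order; with unsorted input it can silently miss matches. Here is a sketch that sorts both sides first. The function name is made up, and it uses temporary files (via mktemp, assumed to be available) instead of process substitution so that it also runs in plain sh.

```shell
#!/bin/sh
# common_lines FILE_A FILE_B: print the lines present in both files.
common_lines() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Sort and compare under the same locale so the orderings agree.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    LC_ALL=C comm -12 "$a" "$b"
    rm -f "$a" "$b"
}
```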
EDIT
It seems that ls is not great for this; find would probably be safer:
find . -type f | grep "$(sed ':a;N;$!ba;s/\n/\\|/g' filters.txt)"
That is, take your filters.txt, replace all newlines with \| using sed, and then grep the list of paths for any of the entries.
GNU grep uses \| between items when grepping for more than one item, so the sed command transforms filters.txt into such a list of alternatives for grep.
grep -f filters.txt -r .
..where . is your current folder.
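grep -f reads its patterns from a file, which sidesteps the "Argument list too long" error. To filter a list of names rather than file contents, the same option can be applied to the listing itself: -F treats each line of the filter file as a literal string and -x requires a whole-line match. A sketch, assuming one bare filename per line on stdin (keep_listed is a made-up name):

```shell
#!/bin/sh
# keep_listed FILTER_FILE: copy to stdout only those stdin lines that
# exactly equal some line of FILTER_FILE.
keep_listed() {
    grep -F -x -f "$1"
}

# Usage:
# ls -R | keep_listed filters.txt
```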
You can run this script in the target directory, giving the list file as a single argument.
#!/bin/bash -e
# exit early if awk fails (ie. can't read list)
shopt -s lastpipe
# group the -name tests so that -type f and -print0 apply to both
find . -mindepth 1 -type f \( -name '*.txt' -o -name '*.py' \) -print0 |
    awk -v exclude_list_file="${1:?no list file provided}" \
    'BEGIN {
        while ((getline line < exclude_list_file) > 0) {
            exclude_list[c++] = line
        }
        close(exclude_list_file)
        if (c==0) {
            exit 1
        }
        FS = "/"
        RS = "\000"
    }
    {
        for (i in exclude_list) {
            if (exclude_list[i] == $NF) {
                next
            }
        }
        print
    }'
It prints all paths, recursively, excluding any filename which exactly matches a line in the list file (so lines not ending .py or .txt wouldn’t do anything).
Only the filename is considered, the preceding path is ignored.
It fails immediately if no argument is given or it can't read a line from the list file.
The question is tagged bash, but if you change the shebang to sh, and remove shopt, then everything in the script except -print0 is POSIX. -print0 is common, it’s available on GNU (Linux), BSDs (including OpenBSD), and busybox.
The purpose of lastpipe is to exit immediately if the list file can't be read. Without it, find keeps running until completion (but nothing gets printed).
If you specifically want the ls -l output format, you could change awk to use a null output record separator (add ORS = "\000" to the end of BEGIN, directly below RS="\000"), and pipe awk in to xargs -0 ls -ld.
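The question asks to keep the listed names rather than drop them. A sketch of that inverse uses the common awk two-file idiom (NR == FNR) to load the list into an array. It assumes newline-delimited paths on stdin, so unlike the script above it does not cope with newlines in names; keep_by_basename is a hypothetical name.

```shell
#!/bin/sh
# keep_by_basename LIST_FILE: read paths on stdin and print those whose
# last path component equals a line of LIST_FILE.
keep_by_basename() {
    # First input (LIST_FILE): remember every line as an array key.
    # Second input (stdin, "-"): print when the basename is a known key.
    awk -F/ 'NR == FNR { keep[$0]; next } $NF in keep' "$1" -
}

# Usage:
# find . -type f \( -name '*.txt' -o -name '*.py' \) | keep_by_basename filters.txt
```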
I need to search for files in a directory by month/year and pass them through wc -l or lines and test if [ $lines -le 18 ], or something similar and give me a list of files that match.
In the past I called this with 'file.sh 2020-06' and used something like this to process the files for that month:
find . -name "* $1-*" -exec grep '(1 |2 |3 )' {}
but I now need to test for a line count.
The above -exec worked but when I changed over to passing the file to another exec I get complaints of "too many parameters" because the file name has spaces. I just can't seem to get on track with solving this one.
Any pointers to get me going would be very much appreciated.
Rick
Here's one using find and awk. But first some test files (Notice: it creates files named 16, 17, 18 and 19):
$ for i in 16 17 18 19 ; do seq 1 $i > $i ; done
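Then (as written this lists files with fewer than 18 lines; use NR==19 to also include files with exactly 18):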
$ find . -name 1\[6789\] -exec awk 'NR==18{exit c=1}END{if(!c) print FILENAME}' {} \;
./16
./17
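For the original case, where names contain spaces and the month is passed as an argument, find hands each path to awk as a single argument, so no quoting problem arises. A sketch with a hypothetical function name that lists files with at most a given number of lines:

```shell
#!/bin/sh
# short_files_for_month DIR YYYY-MM [MAX]: list files under DIR whose name
# contains " YYYY-MM-" and that have MAX (default 18) lines or fewer.
short_files_for_month() {
    # awk exits 0 only for short files, so -print is reached only for them.
    find "$1" -type f -name "* $2-*" -exec awk -v max="${3:-18}" \
        'NR > max { big = 1; exit } END { exit big }' {} \; -print
}

# Usage:
# short_files_for_month . 2020-06
```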
The script below implements the first-fit bin-packing algorithm. It runs normally on Ubuntu Linux, where I can call bin_packing.awk, but when I try to run it on Solaris I get errors.
bin_packing.awk:
function first_fit(v, file) {
# find first bin that can accommodate the volume
for (i=1; i<=n; ++i) {
if (b[i] > v) {
b[i] -= v
bc[i]++
cmd="mv "file" subdir_" i
print cmd
# system(cmd)
return
}
}
# no bin found, create new bin
if (i > n) {
b[++n] = c - v
bc[n]++
cmd="mkdir subdir_"n
print cmd
# system(cmd)
cmd="mv "file" subdir_"n
print cmd
# system(cmd)
}
return
}
BEGIN{ if( (c+0) == 0) exit }
{ first_fit($1,$2) }
END { print "REPORT:"
print "Created",n,"directories"
for(i=1;i<=n;++i) print "- subdir_"i,":", c-b[i],"bytes",bc[i],"files"
}
and to call it:
$ find . -type f -iname '*pdf' -printf "%s %p\n" \
| awk -v c=100000 -f bin_packing.awk
This creates a list of files, each preceded by its size in bytes, and passes it to the script; the value c is the maximum size in bytes that a directory can have. The value c=100000 above is only an example. It produces output like:
...
mv file_47 subdir_6
mv file_48 subdir_6
mv file_49 subdir_5
mv file_50 subdir_6
REPORT:
Created 6 directories
- subdir_1 : 49 bytes 12 files
- subdir_2 : 49 bytes 9 files
- subdir_3 : 49 bytes 8 files
- subdir_4 : 49 bytes 8 files
- subdir_5 : 48 bytes 8 files
- subdir_6 : 37 bytes 5 files
It shows the errors below when I try to run it on Solaris. Based on feedback, -printf is a GNU feature, so it isn't available in non-GNU versions of find:
find: bad option -printf
find: [-H | -L] path-list predicate-list
awk: syntax error near line 1
awk: bailing out near line 1
Use nawk (new awk) or /usr/xpg4/bin/awk (POSIX awk) on Solaris; plain awk there is the original legacy version. Perl can glean the same info as find's -printf:
Here is the solution:
$ find . -type f -name '*.pdf' -print | perl -lne '$,=" "; @s=stat $_; print $s[7],$_, $s[2]' | nawk -v c=5000000 -f bin_packing.awk
To work around the missing -printf feature of find, you can try the following (note that stat --printf is itself a GNU coreutils feature):
find . -type f -iname '*pdf' -exec stat --printf="%s %n\n" {} \; \
| awk -v c=100000 -f bin_packing.awk
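Another option that avoids both -printf and stat is to let a small sh loop ask wc -c for each size. This is a sketch, assuming a POSIX sh and a find that supports -exec ... {} +; the function name is made up, and wc -c may have to read each file in full on some systems.

```shell
#!/bin/sh
# list_sizes DIR: print "SIZE PATH" for every PDF under DIR.
list_sizes() {
    # The bracket pattern stands in for the non-POSIX -iname.
    # The arithmetic expansion strips the padding some wc versions print.
    find "$1" -type f -name '*.[Pp][Dd][Ff]' -exec sh -c '
        for f do
            size=$(wc -c < "$f")
            printf "%s %s\n" "$(( size ))" "$f"
        done' sh {} +
}

# Usage:
# list_sizes . | nawk -v c=100000 -f bin_packing.awk
```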
Updated question based on new information…
Here is a gist of my code, with the general idea that I store items in DropBox at:
~/Dropbox/Public/drops/xx.xx.xx/whatever
Where the date is always 2 chars, 2 chars, and 2 chars, dot separated. Within that folder can be more folders and more files, which is why when I use find I do not set the depth and allow it to scan recursively.
https://gist.github.com/anonymous/ad51dc25290413239f6f
Below is a shortened version of the gist, it won't run as it stands, I don't believe, though the gist will run assuming you have DropBox installed and there are files at the path location that I set up.
General workflow:
SIZE="+250k" # For `find` this is the value in size I am looking for files to be larger than
# Location where I store the output to `find` to process that file further later on.
TEMP="/tmp/drops-output.txt"
Next I rm the tmp file and touch a new one.
I will then cd into
DEST=/Users/$USER/Dropbox/Public/drops
Perform a quick conditional check to make sure that I am working where I want to be,
with all my values as variables, I could mess up easily and not be working where I
thought I would be.
# Conditional check: is the current directory the one I want to be the working directory?
if [ "$(pwd)" = "${DEST}" ]; then
echo -e "Destination and current working directory are equal, this is good!:\n $(pwd)\n"
fi
The meat of step one is the `find` command
# Use `find` to locate a subset of files that are larger than a certain size
# save that to a temp file and process it. I believe this could all be done in
# one find command with -exec or similar but I can't figure it out
find . -type f -size "${SIZE}" -exec ls -lh {} \; >> "$TEMP"
Inside $TEMP will be a data set that looks like this:
-rw-r--r--@ 1 me staff 61K Dec 28 2009 /Users/me/Dropbox/Public/drops/12.28.09/wor-10e619e1-120407.png
-rw-r--r--@ 1 me staff 230K Dec 30 2009 /Users/me/Dropbox/Public/drops/12.30.09/hijack-loop-d6250496-153355.pdf
-rw-r--r--@ 1 me staff 49K Dec 31 2009 /Users/me/Dropbox/Public/drops/12.31.09/mt-5a819185-180538.png
The trouble is that not all filenames are free of spaces, though I have done all I can to make sure variables are quoted
and wrapped in parens, braces, or quotes where applicable.
With the results in /tmp I run:
# Number of results located as a result of the find `command` above
RESULTS=$(wc -l "$TEMP" | awk '{print $1}')
echo -e "Located: [$RESULTS] total files greater than or equal to $SIZE\n"
# With a result set found via `find`, now use awk to print out the sorted list of file
# sizes and paths.
echo -e "SIZE DATE FILE PATH"
#awk '{print "["$5"] ", $9, $10}' < "$TEMP" | sort -n
awk '{for(i=5;i<=NF;i++) {printf $i " "} ; printf "\n"}' "$TEMP" | sort -n
With the changes to awk from how I had it originally, my result now looks like this:
751K Oct 21 19:00 ./10.21.14/netflix-67-190039.png
760K Sep 14 19:07 ./01.02.15/logos/RCA_old_logo.jpg
797K Aug 21 03:25 ./08.21.14/girl-88-032514.zip
916K Sep 11 21:47 ./09.11.14/small-shot-4d-214727.png
I want it to look like this:
SIZE FILE PATH
========================================
751K ./10.21.14/netflix-67-190039.png
760K ./01.02.15/logos/RCA_old_logo.jpg
797K ./08.21.14/girl-88-032514.zip
916K ./09.11.14/small-shot-4d-214727.png
# All Done
if [ "$?" -ne "0" ]; then
echo "find of drop files larger than $SIZE completed without errors.\n"
exit 1
fi
Original Post to Stack prior to gaining some new information leading to new questions…
Original Post is below, given new information, I tried some new tactics and have left myself with the above script and info.
I have a simple script, Mac OS X, it performs a find on a dir and locates all files of type file and of size greater than +SIZE
These are then appended to a file via >>
From there, I have a file that essentially contains a ls -la listing, so I use awk to get to the file size and the file name with this command:
# With a result set found via `find`, now use awk to print out the sorted list of file
# sizes and paths.
echo -e "SIZE FILE PATH"
awk '{print "["$5"] ", $9, $10}' < "$TEMP" | sort -n
All works as I want it to, but I get some filename truncation right at the above code. The entire file is around 30 lines, I have pinned it to this line. I think if I throw in a different Internal Field Sep that would fix it. I could use \t as there can't be a \t in Mac OS X filenames.
I thought it was just quoting, but I can't seem to see where if that is the case. Here is a sample of the data returned, usually I get about 50 results. The first one I stuffed in this file has filename truncation:
[1.0M] ./11.26.14/Bruna Legal
[1.4M] ./12.22.14/card-88-082636.jpg
[1.6M] ./12.22.14/thrasher-8c-082637.jpg
[11M] ./01.20.15/td-6e-225516.mp3
Bruna Legal is "Bruna Legal Name.pdf" on the filesystem.
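The truncation comes from printing only $9 and $10: awk splits on whitespace, so a name containing two spaces spans $9 through $11. The sketch below keeps the size column and re-joins everything from field 9 onward. It assumes the ls -l layout shown above, it collapses runs of spaces inside a name to a single space, and size_and_path is a made-up name.

```shell
#!/bin/sh
# size_and_path: read `ls -lh` lines on stdin, print "SIZE<TAB>PATH".
size_and_path() {
    awk '{
        path = $9
        # Append the remaining fields so names with spaces stay whole.
        for (i = 10; i <= NF; i++) path = path " " $i
        printf "%s\t%s\n", $5, path
    }'
}

# Usage:
# size_and_path < "$TEMP" | sort -n
```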
You can avoid parsing the output of ls command and do the whole work with find using the printf action, like:
find /tmp -type f -maxdepth 1 -size +4k 2>/dev/null -printf "%kKB %f\n" |
sort -nrk1,1
In my example it outputs every file that is bigger than 4 kilobytes. The issue is that the find command cannot print formatted output with the size in MB. In addition the numeric ordering does not work for me with square brackets surrounding the number, so I omit them. In my test it yields:
140KB +~JF7115171557203024470.tmp
140KB +~JF3757415404286641313.tmp
120KB +~JF8126196619419441256.tmp
120KB +~JF7746650828107924225.tmp
120KB +~JF7068968012809375252.tmp
120KB +~JF6524754220513582381.tmp
120KB +~JF5532731202854554147.tmp
120KB +~JF4394954996081723171.tmp
24KB +~JF8516467789156825793.tmp
24KB +~JF3941252532304626610.tmp
24KB +~JF2329724875703278852.tmp
16KB 578829321_2015-01-23_1708257780.pdf
12KB 575998801_2015-01-16_1708257780-1.pdf
8KB adb.log
EDIT: I've noticed that %k is not accurate enough, so you can use %s to print the size in bytes and convert it to KB or MB using awk, like:
find /tmp -type f -maxdepth 1 -size +4k 2>/dev/null -printf "%s %f\n" |
sort -nrk1,1 |
awk '{ $1 = sprintf( "%.2fKB", $1 / 1024) } { print }'
It yields:
136.99KB +~JF7115171557203024470.tmp
136.99KB +~JF3757415404286641313.tmp
117.72KB +~JF8126196619419441256.tmp
117.72KB +~JF7068968012809375252.tmp
117.72KB +~JF6524754220513582381.tmp
117.68KB +~JF7746650828107924225.tmp
117.68KB +~JF5532731202854554147.tmp
117.68KB +~JF4394954996081723171.tmp
21.89KB +~JF8516467789156825793.tmp
21.89KB +~JF3941252532304626610.tmp
21.89KB +~JF2329724875703278852.tmp
14.14KB 578829321_2015-01-23_1708257780.pdf
10.13KB 575998801_2015-01-16_1708257780-1.pdf
4.01KB adb.log
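If MB is wanted for the larger files, the awk step can pick the unit per line. A sketch, assuming input lines of the form "BYTES NAME" (what -printf "%s %f\n" emits) and binary units (1 KB = 1024 bytes); the function names are illustrative.

```shell
#!/bin/sh
# humanize_sizes: replace a leading byte count with a KB or MB figure.
# Assigning to $1 makes awk rebuild the line, so runs of spaces inside a
# name are collapsed to one.
humanize_sizes() {
    awk '
    function human(bytes) {
        if (bytes >= 1048576) return sprintf("%.2fMB", bytes / 1048576)
        return sprintf("%.2fKB", bytes / 1024)
    }
    { $1 = human($1); print }'
}

# Usage:
# find /tmp -maxdepth 1 -type f -size +4k -printf "%s %f\n" | sort -nrk1,1 | humanize_sizes
```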