remove files which contain more than 14 lines in a folder - bash

Unix command used
wc -l * | grep -v "14" | rm -rf
However this grouping doesn't seem to do the job. Can anyone point me towards the correct way?
Thanks

wc -l * 2>&1 | while read -r num file; do ((num > 14)) && echo rm "$file"; done
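Remove "echo" if you're happy with the results. Note that wc also prints a final "total" line when it is given more than one file, so expect a spurious "rm total" in the list.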

Here's one way to print out the names of all files with at least 15 lines (assuming you have GNU awk, for the nextfile command):
awk 'FNR==15{print FILENAME;nextfile}' *
That will produce an error for any subdirectory, so it's not ideal.
You don't actually want to print the filenames, though. You want to delete them. You can do that in awk with the system function:
# The following has been defanged in case someone decides to copy&paste
awk 'FNR==15{system("echo rm "FILENAME);nextfile}' *

for f in *; do if [ -f "$f" ] && [ "$(wc -l < "$f")" -gt 14 ]; then rm -f -- "$f"; fi; done

There are a few problems with your solution: rm doesn't take input from stdin, and your grep only filters out lines that contain the string 14 anywhere (so it would also drop a file with 114 lines and keep one with 3). Try this instead:
find . -maxdepth 1 -type f | while IFS= read -r f; do [ "$(wc -l < "$f")" -gt 14 ] && rm -- "$f"; done
Here's how it works:
find . -maxdepth 1 -type f #all files (not directories) in the current directory
[ #start comparison
wc -l < "$f" #get line count of file; reading from stdin makes wc print only the number
-gt 14 ] #test if that count was greater than 14
&& rm -- "$f" #if the comparison was true, delete the file
I tried to figure out a solution just using find with -exec, but I couldn't figure out a way to test the line count. Maybe somebody else can come up with a way to do it.
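One way to get the line-count test into find itself is to let -exec start a small sh that does the comparison. The following is a minimal sketch rather than a drop-in: the function name files_over_14 is made up for illustration, it assumes a find that supports -maxdepth (GNU and BSD do), and it only prints the matching names. Swap printf for rm once the list looks right.

```shell
# Hypothetical helper: list regular files directly inside directory $1
# (default: current directory) that have more than 14 lines.
files_over_14() {
    find "${1:-.}" -maxdepth 1 -type f -exec sh -c '
        for f in "$@"; do
            # "wc -l < file" prints only the count, without the file name
            if [ "$(wc -l < "$f")" -gt 14 ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh {} +
}
```

Because the file names reach the inner sh as positional parameters, names containing spaces are passed through intact.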

Related

Print top N files by word count in two columns

I would like to make a script that prints the filenames for the top n files from two directories (n being the number of files I give on the command line) in order of the number of words they have. My biggest problem, however, is the way they should be displayed.
Say my command line looks like this:
myscript.sh 5 dir1 dir2
The output should have 2 columns: on the left the top 5 files in descending order from dir1, and on the right the top 5 files in descending order from dir2.
This is what I have in terms of code, however I'm missing something. I think that pr -m -t should do what I want, but I couldn't make it work.
#!/bin/bash
dir=$1
dir2=$2
for files in "$dir"
do
find ./reuters-topics/$dir -type f -exec wc -l {} + | sort -rn |head -n 15
done
for files in "$dir2"
do
find ./reuters-topics/$dir2 -type f -exec wc -l {} + | sort -rn | head -n 15
done
This is a solution in fish:
for i in (find . -type f); wc -l $i; end | sort -rn | head -n15 | awk '{print $2 "\t" $1}'
As you can see, the re-ordering (filename first, number of words second) is done by awk. As a separator I use a tab character:
awk '{print $2 "\t" $1}'
The difference between my loop and your find call, btw, is that I do not get the "total" line in the output.
I did not test if this (including awk) also works well for files with spaces in the name.
#!/usr/bin/env bash
_top_files_by_words_usage() {
local usage=""
read -r -d '' usage <<-"EOF"
Usage:
top_files_by_words <show_count> <dir1> <dir2>
EOF
1>&2 printf "%s\n" "$usage"
}
top_files_by_words() {
if (( $# != 3 )) || [[ "$1" != +([0-9]) ]]; then
_top_files_by_words_usage
return 1
fi
local -i showCount=0
local dir1=""
local dir2=""
showCount="$1"
dir1="$2"
dir2="$3"
shopt -s extglob
if [[ ! -d "$dir1" ]]; then
1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir1"
return 1
fi
if [[ ! -d "$dir2" ]]; then
1>&2 printf "directory '%s' does not exist or is not a directory\n" "$dir2"
return 1
fi
local -a out1=()
local -a out2=()
IFS=$'\n' read -r -d '' -a out1 < <(find "$dir1" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
IFS=$'\n' read -r -d '' -a out2 < <(find "$dir2" -type f -exec wc -w {} \; | sort -k 1gr | head -n "$showCount")
local -i i=0
local -i maxLen=0
local -i len=0;
for ((i = 0; i < showCount; ++i)); do
len="${#out1[$i]}"
if (( len > maxLen )); then
maxLen=$len
fi
# len="${#out2[$i]}"
# if (( len > maxLen )); then
# maxLen=$len
# fi
done
for (( i = 0; i < showCount; ++i)); do
printf "%-*.*s %s\n" "$maxLen" "$maxLen" "${out1[$i]}" "${out2[$i]}"
done
return 0
}
top_files_by_words "$@"
$ ~/tmp/count_words.bash 15 tex tikz
2309328 tex/resume.log 9692402 tikz/tikz-Graphics in LaTeX with TikZ.mp4
2242997 tex/resume_cv.log 2208818 tikz/tikz-Tikz-Graphs and Automata.mp4
2242969 tex/cover_letters/resume_cv.log 852631 tikz/tikz-Drawing Automata with TikZ in LaTeX.mp4
73859 tex/pgfplots/plotdata/heightmap.dat 711004 tikz/tikz-tutorial.mp4
49152 tex/pgfplots/lena.dat 300038 tikz/.ipynb_checkpoints/TikZ 11 Design Principles-checkpoint.ipynb
43354 tex/nancy.mp4 300038 tikz/TikZ 11 Design Principles.ipynb
31226 tex/pgfplots/pgfplotstodo.tex 215583 tikz/texample/bridges-of-konigsberg.svg
26000 tex/pgfplots/plotdata/ou.dat 108040 tikz/Visual TikZ.pdf
20481 tex/pgfplots/pgfplotstable.tex 82540 tikz/worldflags.pdf
19571 tex/pgfplots/pgfplots.reference.3dplots.tex 37608 tikz/texample/india-map.tex
19561 tex/pgfplots/plotdata/risingdrop3d_coord.dat 35798 tikz/.ipynb_checkpoints/TikZ-checkpoint.ipynb
19561 tex/pgfplots/plotdata/risingdrop3d_vel.dat 35656 tikz/texample/periodic_table.svg
18207 tex/pgfplots/ChangeLog 35501 tikz/TikZ.ipynb
17710 tex/pgfplots/pgfplots.reference.markers-meta.tex 25677 tikz/tikz-Graphics in LaTeX with TikZ.info.json
13800 tex/pgfplots/pgfplots.reference.axisdescription.tex 14760 tikz/tikz-Tikz-Graphs and Automata.info.json
column can print files side-by-side in columns. You can use process substitution with <(command) to have those "files" be live commands instead of actual files.
#!/bin/bash
top-files() {
local n="$1"
local dir="$2"
find "$dir" -type f -exec wc -l {} + |
head -n -1 | sort -rn | head -n "$n"
}
n="$1"
dir1="$2"
dir2="$3"
column <(top-files "$n" reuters-topics/"$dir1") \
<(top-files "$n" reuters-topics/"$dir2")
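If column does not pair the two lists up as expected (it reads all of its input as one list and then fills columns), paste is an alternative worth trying: it joins line N of each input with a tab. The following is a minimal sketch under a few assumptions: the helper names top_by_words and side_by_side are invented, wc -w is used because the question asks for word counts, file names are assumed not to contain newlines, and temporary files stand in for <(...) so that the sketch also runs in a plain sh.

```shell
# Hypothetical helper: print "count path" for the $1 files under $2
# that contain the most words, largest first.
top_by_words() {
    find "$2" -type f -exec wc -w {} \; | sort -rn | head -n "$1"
}

# Hypothetical helper: show the top $1 files of $2 (left) and $3 (right),
# one pair per line, separated by a tab.
side_by_side() {
    local left right
    left=$(mktemp) || return 1
    right=$(mktemp) || return 1
    top_by_words "$1" "$2" > "$left"
    top_by_words "$1" "$3" > "$right"
    paste "$left" "$right"
    rm -f "$left" "$right"
}
```

Calling side_by_side 5 dir1 dir2 would then print up to five lines. Running wc once per file (\; instead of +) avoids the "total" line, at the cost of starting more processes.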

Need to remove the extra empty lines from the output of shell script

I'm trying to write a script that prints all files larger than min_size (let's say 10G) in a directory. The problem is that the code below outputs all files, irrespective of min_size. I will be adding other details such as mtime and owner later in the code, but this part by itself doesn't work. What's wrong here?
#!/bin/sh
if (( $# <3 )); then
echo "$0 dirname min_size count"
exit 1
else
dirname="$1";
min_size="$2";
count="$3";
#shift 3
fi
tmpfile=$(mktemp /lawdump/pulkit/files.XXXXXX)
exec 3> "$tmpfile"
find "${dirname}" -type f -print0 2>&1 | grep -v "Permission denied" | xargs -0 -I {} echo "{}" > "$tmpfile"
for i in `cat tmpfile`
do
x="`du -ah $i | awk '{print $1}' | grep G | sort -nr -k 1`"
size=$(echo $x | sed 's/[A-Za-z]*//g')
if [ size > $min_size ];then
echo $size
fi
done
Note: I know this can be done with find or du, but I need to write a shell script so that an email with all the details is sent out regularly.

Bash Script not working when trying to find large ASCII files

So what I'm trying to do is find large ASCII files and then print out the name of the file and then how many lines, but when I start my script it doesn't find anything.
find / -type f -size +2000c -exec file {} \; 2>/dev/null | awk -F':' '/: ASCII text/ {print $1}' | while read FILENAME; do LINES="$(wc -l)"; if [ $LINES > 10000 ]; then echo $FILENAME && echo $LINES; fi; done
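What did you get wrong?
In if [ $LINES > 10000 ], the > is not a numeric comparison: inside [ ] an unescaped > is parsed by the shell as an output redirection (and inside [[ ]] it would be a string comparison). For a numeric comparison, -gt must be used, as in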
if [ $LINES -gt 10000 ]
Please try this:
find / -type f -size +2000c -print0 | xargs -0 grep -Z -L -e '[^[:print:]]' 2>/dev/null | xargs -0 awk 'ENDFILE { if (FNR > 10000) { print FILENAME " " FNR } }'
The idea is to filter out binary files with grep, then feed the remaining file names to awk, which drops every file with a line count less than or equal to 10000. ENDFILE is a GNU awk extension.
It also handles file names containing white space gracefully.

Why is while not working?

AIM: To find files with a word count less than 1000 and move them to another folder. Loop until all files under 1k are moved.
STATUS: It will only move one file, then error with "Unable to move file as it doesn't exist". For some reason $INPUT_SMALL doesn't seem to update with the new file name.
What am I doing wrong?
Current Script:
# Check for input files already under 1k and move to Split folder
INPUT_SMALL=$( ls -S /folder1/ | grep -i reply | tail -1 )
INPUT_COUNT=$( cat /folder1/$INPUT_SMALL 2>/dev/null | wc -l )
function moveSmallInput() {
while [[ $INPUT_SMALL != "" ]] && [[ $INPUT_COUNT -le 1003 ]]
do
echo "Files smaller than 1k have been found in input folder, these will be moved to the split folder to be processed."
mv /folder1/$INPUT_SMALL /folder2/
done
}
I assume you are looking for files that have the word reply somewhere in the path. My solution is:
wc -w $(find /folder1 -type f -path '*reply*') | \
while read wordcount filename
do
if [[ $wordcount -lt 1003 ]]
then
printf "%4d %s\n" $wordcount $filename
#mv "$filename" /folder2
fi
done
Run the script once, if the output looks correct, then uncomment the mv command and run it for real this time.
Update
The above solution has trouble with files with embedded spaces. The problem occurs when the find command hands its output to the wc command. After a little bit of thinking, here is my revised solution:
find /folder1 -type f -path '*reply*' | \
while IFS= read -r filename
do
set -- $(wc -w "$filename") # $1 = word count, the rest = filename
wordcount=$1
if [[ $wordcount -lt 1003 ]]
then
printf "%4d %s\n" "$wordcount" "$filename"
#mv "$filename" /folder2
fi
done
A somewhat shorter version
#!/bin/bash
find ./folder1 -type f | while read f
do
(( $(wc -w "$f" | awk '{print $1}' ) < 1000 )) && cp "$f" folder2
done
I left cp instead of mv for safety reasons. Change it to mv after validating.
If you also want to filter on reply, use @Hai's version of the find command.
Your variables INPUT_SMALL and INPUT_COUNT are not functions, they're just values you assigned once. You either need to move them inside your while loop or turn them into functions and evaluate them each time (rather than just expanding the variable values, as you are now).
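A minimal sketch of the second option, with the two lookups turned into functions that are re-evaluated on every pass of the loop. The function names and the directory parameters are illustrative, and the limit of 1003 lines is carried over from the question. Like the original it parses the output of ls, so it assumes file names without newlines.

```shell
# Print the name of the smallest entry in directory $1 whose name
# contains "reply" (case-insensitive); print nothing if there is none.
smallest_reply() {
    ls -S "$1" | grep -i reply | tail -n 1
}

# Print the number of lines in file $1.
line_count() {
    wc -l < "$1"
}

# Move files from $1 to $2 for as long as the smallest remaining
# "reply" file has at most 1003 lines.
move_small_input() {
    while name=$(smallest_reply "$1") && [ -n "$name" ] &&
          [ "$(line_count "$1/$name")" -le 1003 ]; do
        mv -- "$1/$name" "$2/"
    done
}
```

With the question's paths this would be called as move_small_input /folder1 /folder2. Because the name and the count are looked up again on every pass, the loop ends once no "reply" file is left or the smallest one exceeds the limit.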

Bash script to limit a directory size by deleting files accessed last

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
1) check the size of a directory /PATH/TO/FILES
2) if the size in 1) is greater than X, get a list of the files by access date
3) delete files in order until the size is less than X
The benefit here is for cache and backup directories: I will only delete what I need to in order to stay within the limit, whereas the simpler method might go over the size limit if one day is particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
echo "usage: $0 [directory] [maxsize in megabytes]" >&2
exit 1
fi
find "$DELETEDIR" -type f -printf "%T@::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
BEGIN { curSize=0; }
{
curSize += $3;
if (curSize > maxbytes) { print $2; }
}
' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy to read and understand method I came up with to do this:
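# LIMITSIZE is the size limit in KB and must be set beforehand
DIRSIZE=$(du -sk /PATH/TO/FILES | awk '{print $1}') # directory size in KB
if [ "$DIRSIZE" -gt "$LIMITSIZE" ]
then
for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do # least recently accessed first
FILESIZE=`stat -c "%s" "$f"` # file size in bytes
FILESIZE=$(($FILESIZE/1024)) # convert to KB to match du
rm -f "$f"
DIRSIZE=$(($DIRSIZE - $FILESIZE))
if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
break
fi
done
fi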
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
| sed 's/ /\\ /g' \
| xargs stat -f "%a::%z::%N" \
| sort -r \
| awk -v maxSize="$X_SIZE" '
BEGIN{curSize=0; FS="::"}
{curSize += $2}
curSize > maxSize{print $3}
' \
| sed 's/ /\\ /g' \
| xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. Then the awk script adds up all the sizes as it goes through the list, and begins outputting them when it gets over $X_SIZE. The files that are not output this way will be the ones kept; the other file names go to sed again to escape any spaces and then to xargs, which runs rm on them.
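The accumulate-and-print step can be tried on its own with made-up input. In this sketch the helper name over_limit, the three sample lines (access time::size::name, newest first) and the limit of 250 are all invented for illustration. The limit is passed with awk -v because a shell variable is not expanded inside a single-quoted awk program.

```shell
# Hypothetical helper: read "atime::size::name" lines (newest first) on stdin
# and print the names seen once the running size total exceeds $1.
over_limit() {
    awk -v limit="$1" -F '::' '
        { total += $2 }              # add this file to the running total
        total > limit { print $3 }   # past the limit: candidate for deletion
    '
}

printf '%s\n' '300::100::new.tar' '200::100::mid file.tar' '100::100::old.tar' |
    over_limit 250
```

Only old.tar is printed here: the two most recently accessed files add up to 200, and the third pushes the running total to 300, which is over the limit.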

Resources