Bash: grabbing the second line and last line of output (ls -lrS) only - bash

I am looking to get the second line and the last line of what the ls -lrS command outputs. I've been using ls -lrS | (head -2 | tail -1) && (tail -n1), but it seems to get only the first line, and I have to press Ctrl-C to stop it.
Another problem I am having is with the awk command: I want to grab just the file size and file name. Assuming I could get the correct lines (second and last), this is what I tried:
files=$(ls -lrS | (head -2 | tail -1) && (tail -n1) awk '{ print "%s", $5; "%s", $8; }' )
I was hoping it would print:
1234 file.abc
12345 file2.abc

Using the format stable GNU stat command:
stat --format='%s %n' * | sort -n | sed -n '1p;$p'
If you're using BSD stat, adjust accordingly.
If you want a lot more control over what files go into this calculation, and arguably better portability, use find. In this example, I'm getting all non-dot files in the current directory:
find -maxdepth 1 -not -path '*/\.*' -printf '%s %p\n' | sort -n | sed -n '1p;$p'
And take care if your directory contains two or fewer entries, or if any of your entries have a new-line in their name.
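If neither GNU stat nor find -printf is available, the same sort-then-pick-first-and-last idea can be sketched with wc -c supplying the sizes. This swaps in a different technique from the answer above; the function name smallest_and_largest is made up for illustration, and it assumes no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helper: print "size name" for the smallest and the largest
# regular file in a directory (default: the current directory).
smallest_and_largest() {
    (
        cd "${1:-.}" || exit 1
        for f in *; do
            [ -f "$f" ] || continue          # skip directories and the like
            # $(( ... )) strips the padding some wc implementations add
            printf '%s %s\n' "$(( $(wc -c < "$f") ))" "$f"
        done
    ) | sort -n | sed -n '1p;$p'             # keep only the first and last line
}
```

Calling smallest_and_largest /some/dir prints two lines. With a single file the same line comes out twice, because it is both line 1 and the last line.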

Using awk:
ls -lrS | awk 'NR==2 { print; } END { print; }'
It prints when the line number NR is 2 and again on the final line.
Note: As pointed out in the comments, $0 may or may not be available in an END block depending on your awk version.

whatever | awk 'NR==2{x=$0;next} {y=$0} END{if (x!="") print x; if (y!="") print y}'
You need that complexity (and more to be REALLY robust) to handle input that's less than 3 lines.
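To get just the size and name that the question asks for, the same pattern can store selected fields instead of the whole line. A minimal sketch, assuming the common ls -l layout where the size is field 5 and the name is field 9 (field positions differ between systems, and a name containing spaces would be cut off); second_and_last_fields is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read "ls -l"-style lines on stdin and print
# "size name" for line 2 and for the last line.
second_and_last_fields() {
    awk 'NR == 2 { x = $5 " " $9; next }   # remember line 2
         NR > 2  { y = $5 " " $9 }         # keep overwriting; ends up holding the last line
         END     { if (x != "") print x; if (y != "") print y }'
}
```

It would be used as ls -lrS | second_and_last_fields.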

ls is not a reliable tool for this job: It can't represent all possible filenames (spaces are possible, but also newlines and other special characters -- all but NUL). One robust solution on a system with GNU tools is to use find:
{
# read the first size and name
IFS= read -r -d' ' first_size; IFS= read -r -d '' first_name;
# handle case where only one file exists
last_size=$first_size; last_name=$first_name
# continue reading "last" size and name, until one really is last
while IFS= read -r -d' ' curr_size && IFS= read -r -d '' curr_name; do
last_size=$curr_size; last_name=$curr_name
done
} < <(find . -mindepth 1 -maxdepth 1 -type f -printf '%s %P\0' | sort -n -z)
The above puts results into variables $first_size, $first_name, $last_size and $last_name, usable thusly:
printf 'Smallest file is %d bytes, named %q\n' "$first_size" "$first_name"
printf 'Largest file is %d bytes, named %q\n' "$last_size" "$last_name"
In terms of how it works:
find ... -printf '%s %P\0'
...emits a stream of the following form from find:
<size> <name><NUL>
Running that stream through sort -n -z does a numeric sort on its contents. IFS= read -r -d' ' first_size reads everything up to the first space; IFS= read -r -d '' first_name reads everything up to the first NUL; and then the loop continues to read and store additional size/name pairs until the last one is reached.

Related

remove duplicate lines with similar prefix

I need to remove the lines in a file that are only a prefix of another line, and keep the rest.
From this,
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/
123/456/789/
xyz/
to this
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
Appreciate any suggestions,
Answer in case reordering the output is allowed.
sort -r file | awk 'a!~"^"$0{a=$0;print}'
sort -r file : sorts the lines in reverse, so that longer lines with the same prefix are placed before the shorter lines with that prefix
awk 'a!~"^"$0{a=$0;print}' : parses the sorted output, where a holds the previously printed line and $0 holds the current line
a!~"^"$0 checks, for each line, that the current line is not a substring at the beginning of the previous line.
If $0 is not such a substring (i.e. not a shared prefix), we print it and save it in a (to be compared with the next line)
The first line is always printed, because no value has been assigned to a yet
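Because a!~"^"$0 treats the current line as a regular expression, lines containing characters such as . or + could match more than intended. A hedged variant of the same idea uses index() for a literal comparison and pins the sort order with LC_ALL=C. The name drop_prefixes is made up, and the sample lines are the ones from the question.

```shell
#!/bin/sh
# Hypothetical helper: read lines on stdin, drop every line that is a
# prefix of another line. Output comes out in reverse-sorted order.
drop_prefixes() {
    LC_ALL=C sort -r |
    # index(prev, $0) == 1 means the current line starts the previously
    # printed line, i.e. it is only a prefix and gets skipped
    awk 'index(prev, $0) != 1 { prev = $0; print }'
}

printf '%s\n' abc/def/ghi/ abc/def/ghi/jkl/one/ abc/def/ghi/jkl/two/ \
    123/456/ 123/456/789/ xyz/ | drop_prefixes
```

The byte-wise C locale matters here: it keeps a prefix directly after the longer lines that start with it, which is what lets a comparison of consecutive lines work.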
A quick and dirty way of doing it is the following:
$ while read elem; do echo -n "$elem " ; grep $elem file| wc -l; done <file | awk '$2==1{print $1}'
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
where you read the input file and print each element together with the number of times it appears in the file, then use awk to print only the lines that appear exactly once.
Step 1: This solution is based on the assumption that reordering the output is allowed. If so, it should be faster to reverse sort the input file before processing. After a reverse sort, we only need to compare 2 consecutive lines in each iteration; there is no need to search the whole file or all the "known prefixes". I understand that a line counts as a prefix, and should be removed, if it is a prefix of any other line. Here is an example that removes prefixes from a file when reordering is allowed:
#!/bin/bash
f=sample.txt # sample data
p='' # previous line = empty
sort -r "$f" | \
while IFS= read -r s || [[ -n "$s" ]]; do # reverse sort, then read string (line)
[[ "$s" = "${p:0:${#s}}" ]] || \
printf "%s\n" "$s" # if s is not prefix of p, then print it
p="$s"
done
Explanation: ${p:0:${#s}} takes the first ${#s} (length of s) characters of string p.
Test:
$ cat sample.txt
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/
123/456/789/
xyz/
$ ./remove-prefix.sh
xyz/
abc/def/ghi/jkl/two/two
abc/def/ghi/jkl/one/one
123/456/789/
Step 2: If you really need to keep the order, then this script is an example of removing all prefixes, reordering is not allowed:
#!/bin/bash
f=sample.txt
p=''
cat -n "$f" | \
sed 's:\t:|:' | \
sort -r -t'|' -k2 | \
while IFS='|' read -r i s || [[ -n "$s" ]]; do
[[ "$s" = "${p:0:${#s}}" ]] || printf "%s|%s\n" "$i" "$s"
p="$s"
done | \
sort -n -t'|' -k1 | \
sed 's:^.*|::'
Explanations:
cat -n: numbering all lines
sed 's:\t:|:': use '|' as the delimiter -- you need to change it to another one if needed
sort -r -t'|' -k2: reverse sort with delimiter='|' and use the key 2
while ... done: similar to solution of step 1
sort -n -t'|' -k1: sort back to original order (numbering sort)
sed 's:^.*|::': remove the numbering
Test:
$ ./remove-prefix.sh
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/789/
xyz/
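The same number, sort, filter, restore sequence can also be sketched with a tab as the delimiter and awk doing the numbering and the prefix test. This is an illustration under the assumption that no input line contains a tab; keep_order_drop_prefixes is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: drop lines that are a prefix of another line,
# keeping the surviving lines in their original order.
keep_order_drop_prefixes() {
    tab=$(printf '\t')
    awk '{ print NR "\t" $0 }' |             # decorate: "lineno<TAB>line"
    LC_ALL=C sort -r -t "$tab" -k2 |         # reverse sort on the line text
    awk -F '\t' 'index(prev, $2) != 1 { prev = $2; print }' |  # skip prefixes
    sort -n |                                # back to the original order
    cut -f2-                                 # undecorate: remove the number
}
```

Fed the eight-line sample.txt shown above, it should give the same four lines as the Step 2 script.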
Notes: In both solutions, the most costly operations are the calls to sort. The solution in step 1 calls sort once, and the solution in step 2 calls sort twice. All other operations (cat, sed, while, string compare,...) are not at the same level of cost.
In the solution of step 2, cat + sed + while + sed is "equivalent" to scanning the file 4 times (which, in theory, can happen in parallel because of the pipe).
The following awk does what is requested; it reads the file twice.
In the first pass, it builds up all possible prefixes per line.
In the second pass, it checks whether the line is one of those prefixes and prints it if not.
The code is:
awk -F'/' '(NR==FNR){s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]};next}
{if (! ($0 in a) ) {print $0}}' <file> <file>
You can also do it by reading the file a single time, but then the whole file is stored in memory:
awk -F'/' '{s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]}; b[NR]=$0; next}
END {for(i=1;i<=NR;i++){if (! (b[i] in a) ) {print b[i]}}}' <file>
Similar to the solution of Allan, but using grep -c :
while read line; do (( $(grep -c $line <file>) == 1 )) && echo $line; done < <file>
Take into account that this construct reads the file (N+1) times, where N is the number of lines.

In loop cat file - echo name of file - count

I am trying to write a one-line command that does the following:
The folder "data" contains 570 files, named 1.txt to 570.txt, and each file has some lines of text.
I want to cat each file, grep for a word, and count how many times that word occurs.
For the moment I am trying to do this with a for loop:
for FILES in $(find /home/my/data/ -type f -print -exec cat {} \;) ; do echo $FILES; cat $FILES |grep word ; done |wc -l
This counts correctly, but it does not show which file each count belongs to.
I would like the output to look like this:
----> 1.txt <----
210
----> 2.txt <----
15
etc.
How can I get that?
grep -o word * | uniq -c
is practically all you need.
grep -o word * gives a line for each hit, but only prints the match, in this case "word". Each line is prefixed with the filename it was found in.
uniq -c collapses those lines into one line per file and prefixes it with the count.
You can further format it to your needs with awk or whatever, though, for example like this:
grep -o word * | uniq -c | cut -f1 -d':' | awk '{print "File: " $2 " Count: " $1}'
You can try this:
for file in /path/to/folder/data/* ; do echo "----> $file <----" ; grep -c "word_to_count" "$file" ; done
The for loop iterates over the files inside the folder "data" ($file already holds the full path).
For each of these files, it prints the name and then the number of occurrences of "word_to_count" (grep -c directly outputs the count of matching lines).
Be careful: if your search word appears more than once in a line, this solution counts that line only once.
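If every occurrence should be counted rather than every matching line, one option is to combine the loop with grep -o and wc -l. A minimal sketch, with a made-up function name count_word_per_file and the header format taken from the question; it assumes a grep that supports -o (GNU and BSD grep do).

```shell
#!/bin/sh
# Hypothetical helper: for each regular file in a directory, print a
# header with the file name followed by the number of occurrences of a word.
count_word_per_file() {
    word=$1
    dir=$2
    for file in "$dir"/*; do
        [ -f "$file" ] || continue
        printf '%s\n' "----> ${file##*/} <----"
        # grep -o prints one line per match, so wc -l counts occurrences;
        # $(( ... )) strips the padding some wc implementations add
        printf '%s\n' "$(( $(grep -o -- "$word" "$file" | wc -l) ))"
    done
}
```

For example, count_word_per_file word /home/my/data would report 3 for a file containing "word word" on one line and "word" on another, where grep -c would report 2.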
Bit of awk should do it?
awk '{s+=$1} END {print s}' mydatafile
Note: some versions of awk have some odd behaviours if you are going to be adding anything exceeding 2^31 (2147483647). See comments for more background. One suggestion is to use printf rather than print:
awk '{s+=$1} END {printf "%.0f", s}' mydatafile
$ python -c "import sys; print(sum(int(l) for l in sys.stdin))"
If you only want the total number of lines, you could use
find /home/my/data/ -type f -exec cat {} + | wc -l

Append wc lines to filename

Title says it all. I've managed to get just the lines with this:
lines=$(wc file.txt | awk {'print $1'});
But I could use an assist appending this to the filename. Bonus points for showing me how to loop this over all the .txt files in the current directory.
find -name '*.txt' -execdir bash -c \
'mv -v "$0" "${0%.txt}_$(wc -l < "$0").txt"' {} \;
where
the bash command is executed for each (\;) matched file;
{} is replaced by the currently processed filename and passed as the first argument ($0) to the script;
${0%.txt} deletes shortest match of .txt from back of the string (see the official Bash-scripting guide);
wc -l < "$0" prints only the number of lines in the file (see answers to this question, for example)
Sample output:
'./file-a.txt' -> 'file-a_5.txt'
'./file with spaces.txt' -> 'file with spaces_8.txt'
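To see what the two expansions produce before renaming anything, the name can be built and printed on its own. A small sketch, where name_with_count is a made-up helper that only prints the new name and never touches the file:

```shell
#!/bin/sh
# Hypothetical helper: print the name a .txt file would get once its
# line count is inserted before the extension. Nothing is renamed.
name_with_count() {
    file=$1
    # wc -l < file prints only the number; $(( ... )) strips any padding
    count=$(( $(wc -l < "$file") ))
    # ${file%.txt} removes the trailing .txt, then it is added back
    printf '%s\n' "${file%.txt}_${count}.txt"
}
```

A preview loop could then be: for f in *.txt; do echo mv -- "$f" "$(name_with_count "$f")"; done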
You could use the rename command, which is actually a Perl script, as follows:
rename --dry-run 'my $fn=$_; open my $fh,"<$_"; while(<$fh>){}; $_=$fn; s/.txt$/-$..txt/' *txt
Sample Output
'tight_layout1.txt' would be renamed to 'tight_layout1-519.txt'
'tight_layout2.txt' would be renamed to 'tight_layout2-1122.txt'
'tight_layout3.txt' would be renamed to 'tight_layout3-921.txt'
'tight_layout4.txt' would be renamed to 'tight_layout4-1122.txt'
If you like what it says, remove the --dry-run and run again.
The script counts the lines in each file without using any external processes and then renames the files as you ask, also without using any external processes, so it is quite efficient.
Or, if you are happy to invoke an external process to count the lines, and avoid the Perl method above:
rename --dry-run 's/\.txt$/-`grep -ch "^" "$_"` . ".txt"/e' *txt
Use rename command
for file in *.txt; do
lines=$(wc ${file} | awk {'print $1'});
rename s/$/${lines}/ ${file}
done
#!/bin/bash
files=$(find . -maxdepth 1 -type f -name '*.txt' -printf '%f\n')
for file in $files; do
lines=$(wc $file | awk {'print $1'});
extension="${file##*.}"
filename="${file%.*}"
mv "$file" "${filename}${lines}.${extension}"
done
You can adjust maxdepth accordingly.
You can do it like this as well:
for file in "path_to_file"/'your_filename_pattern'
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
example:
for file in /oradata/SCRIPTS_EL/text*
do
lines=$(wc $file | awk {'print $1'})
mv $file $file'_'$lines
done
This would work, but there are definitely more elegant ways.
for i in *.txt; do
mv "$i" ${i/.txt/}_$(wc $i | awk {'print $1'})_.txt;
done
The result puts the line count right before the .txt.
Like:
file1_1_.txt
file2_25_.txt
You could use grep -c '^' to get the number of lines, instead of wc and awk:
for file in *.txt; do
[[ ! -f $file ]] && continue # skip over entries that are not regular files
#
# move file.txt to file.txt.N where N is the number of lines in file
#
# this naming convention has the advantage that if we run the loop again,
# we will not reprocess the files which were processed earlier
mv "$file" "$file".$(grep -c '^' "$file")
done
{ linecount[FILENAME] = FNR }
END {
linecount[FILENAME] = FNR
for (file in linecount) {
newname = gensub(/\.[^\.]*$/, "-"linecount[file]"&", 1, file)
q = "'"; qq = "'\"'\"'"; gsub(q, qq, newname)
print "mv -i -v '" gensub(q, qq, "g", file) "' '" newname "'"
}
close(c)
}
Save the above awk script in a file, say wcmv.awk, then run it like:
awk -f wcmv.awk *.txt
It will list the commands that need to be run to rename the files in the required way (except that it will ignore empty files). To actually execute them you can pipe the output to a shell for execution as follows.
awk -f wcmv.awk *.txt | sh
As with all irreversible batch operations, be careful and execute the commands only if they look okay.
awk '
BEGIN{ for ( i=1;i<ARGC;i++ ) Files[ARGV[i]]=0 }
{Files[FILENAME]++}
END{for (file in Files) {
# if( file !~ "_" Files[file] ".txt$") {
fileF=file;gsub( /\047/, "\047\"\047\"\047", fileF)
fileT=fileF;sub( /.txt$/, "_" Files[file] ".txt", fileT)
system( sprintf( "mv \047%s\047 \047%s\047", fileF, fileT))
# }
}
}' *.txt
This is another way with awk. It makes a second run easier to manage by allowing more control over the names (for example, skipping a file whose name already contains the count from a previous run).
Following a good remark by @gniourf_gniourf:
file names containing spaces are possible
the tiny code is now heavy for such a small task

One line command with variable, word count and zcat

I have many files on a server, each of which contains many lines:
201701010530.contentState.csv.gz
201701020530.contentState.csv.gz
201701030530.contentState.csv.gz
201701040530.contentState.csv.gz
I would like a one-line command that gives this result:
170033|20170101
169865|20170102
170010|20170103
170715|20170104
The goal is to have the number of lines of each file, just by keeping the date which is already in the filename of the file.
I tried this but the result is not in one line but two...
for f in $(ls -1 2017*gz);do zcat $f | wc -l;echo $f | awk '{print substr($0,1,8)}';done
Thanks in advance guys.
Just use zcat file | wc -l to get the number of lines.
For the name, I understand it is enough to extract the first 8 characters:
$ t="201701030530.contentState.csv.gz"
$ echo "${t:0:8}"
20170103
All together:
for file in 2017*gz;
do
lines=$(zcat "$file" | wc -l)
printf "%s|%s\n" "$lines" "${file:0:8}"
done > myresult.csv
Note the usage of for file in 2017*gz; to go through the files matching the 2017*gz pattern: this suffices, no need to parse ls!
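The loop can also be wrapped in a function that avoids bash-only substring expansion. A sketch under a few assumptions: count_gz_lines is a made-up name, gzip -dc stands in for zcat (they decompress to stdout the same way), and printf's %.8s does the job of ${file:0:8}.

```shell
#!/bin/sh
# Hypothetical helper: for each .gz file given, print "<lines>|<date>",
# where <date> is the first 8 characters of the file name.
count_gz_lines() {
    for file in "$@"; do
        lines=$(( $(gzip -dc -- "$file" | wc -l) ))
        name=${file##*/}                      # drop any directory part
        printf '%s|%.8s\n' "$lines" "$name"   # %.8s keeps the first 8 characters
    done
}
```

It would be called as count_gz_lines 2017*gz > myresult.csv.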
Use zgrep -c ^ file to count the lines, here encapsulated in awk:
$ awk 'FNR==1{ "zgrep -c ^ " FILENAME | getline s; print s "|" substr(FILENAME,1,8) }' *.gz
12|20170101
The whole "zgrep -c ^ " FILENAME command should probably be stored in a variable (say cmd) and run as cmd | getline s, so that it can be closed with close(cmd) afterwards.

Bash script to limit a directory size by deleting files accessed last

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
1) check the size of a directory /PATH/TO/FILES
2) if size in 1) is greater than X size, get a list of the files by access date
3) delete files in order until size is less than X
The benefit here is for cache and backup directories, I will only delete what I need to to keep it within a limit, whereas the simplified method might go over size limit if one day is particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
echo "usage: $0 [directory] [maxsize in megabytes]" >&2
exit 1
fi
find "$DELETEDIR" -type f -printf "%T@::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
BEGIN { curSize=0; }
{
curSize += $3;
if (curSize > maxbytes) { print $2; }
}
' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy to read and understand method I came up with to do this:
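DIRSIZE=$(du -s /PATH/TO/FILES | awk '{print $1}') # size in 1K blocks (GNU du default)
if [ "$DIRSIZE" -gt "$SOMELIMIT" ]
then
for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do # least recently accessed first
FILESIZE=`stat -c "%s" "$f"`
FILESIZE=$(($FILESIZE/1024)) # bytes to KB
rm -f "$f" # delete the file, then subtract its size from the running total
DIRSIZE=$(($DIRSIZE - $FILESIZE))
if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
break
fi
done
fi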
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
| sed 's/ /\\ /g' \
| xargs stat -f "%a::%z::%N" \
| sort -r \
| awk -v maxSize="$X_SIZE" '
BEGIN{curSize=0; FS="::"}
{curSize += $2}
curSize > maxSize {print $3}
' \
| sed 's/ /\\ /g' \
| xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. Then the awk script adds up all the sizes as it goes through the list, and begins outputting file names once the running total gets over $X_SIZE. The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces and then to xargs, which runs rm on them.
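The running-total step can be tried on its own with made-up data before pointing it at rm. In this sketch the function name files_over_budget and the sample records are invented; it reads time::size::name records on stdin, takes the byte budget as its argument, and assumes no file name contains "::" or a newline.

```shell
#!/bin/sh
# Hypothetical helper: given "atime::size::name" records on stdin and a
# byte budget as $1, print the names that no longer fit once files are
# added up from most recently accessed to least recently accessed.
files_over_budget() {
    sort -rn |                         # newest access time first
    awk -F '::' -v max="$1" '
        { total += $2 }                # running total of sizes so far
        total > max { print $3 }       # over budget: candidate for deletion
    '
}

printf '%s\n' '300::40::c' '100::50::a' '200::30::b' | files_over_budget 60
```

With a budget of 60, the newest file c (40 bytes) fits, and the older files b and a are listed because adding them pushes the total past the budget.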

Resources