Bash find files and filter by date and size

I have a directory with a lot of files in it.
Each day, new files are added automatically.
The filenames are formatted like this:
[GROUP_ID]_[RANDOM_NUMBER].txt
Example: 012_1234.txt
For every day, for every GROUP_ID (032, 024, 044, etc.), I want to keep only the biggest file of the day.
So for example, for March 27 and 28 I have:
March 27 - 012_1234.txt - 12 KB
March 27 - 012_0243.txt - 3000 KB
March 27 - 016_5647.txt - 25 KB
March 27 - 024_4354.txt - 20 KB
March 27 - 032_8745.txt - 40 KB
March 28 - 032_1254.txt - 16 KB
March 28 - 036_0456.txt - 30 KB
March 28 - 042_7645.txt - 500 KB
March 28 - 042_2310.txt - 25 KB
March 28 - 042_2125.txt - 34 KB
March 28 - 044_4510.txt - 35 KB
And I want to have :
March 27 - 012_0243.txt - 3000 KB
March 27 - 016_5647.txt - 25 KB
March 27 - 024_4354.txt - 20 KB
March 27 - 032_8745.txt - 40 KB
March 28 - 032_1254.txt - 16 KB
March 28 - 036_0456.txt - 30 KB
March 28 - 042_7645.txt - 500 KB
March 28 - 044_4510.txt - 35 KB
I can't find the right bash ls/find command to do this. Does anybody have an idea?
With this command, I can display the biggest file for each day.
ls -l *.txt --time-style=+%s |
awk '{$6 = int($6/86400); print}' |
sort -nk6,6 -nrk5,5 | sort -sunk6,6
But I want the biggest file of each GROUP_ID for each day.
So if there is only one 10 KB file for group_id "012", I want to display it, even if there are bigger files for other group_ids...

I found the solution myself:
ls -l | tail -n+2 |
awk '{ split($9, var, "_"); group_id = var[1]; print $0 " " group_id }' |
sort -k10,10 -k5,5nr |
awk '$10 != x { print } { x = $10 }'
This gives me the biggest file for each group_id, so now I just need to add handling for the day part.
For information:
tail -n+2: hides the "total" line of the ls output
First awk: gets the group_id part (012, 036...) from the filename ($9) and appends it to the original line ($0)
sort: sorts on the group_id, then on the size in descending order
Second awk: keeps the biggest file of each group_id (column 10, added by the first awk)
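For completeness, here is one way the day part could be folded in (a sketch, untested, assuming GNU ls with --time-style as in the question): format the modification date as YYYY-MM-DD, append a composite day_groupid key as an extra column, then sort and deduplicate on that key:
ls -l --time-style=+%F *.txt |
awk '{ split($7, var, "_"); print $0 " " $6 "_" var[1] }' |   # column 8 = YYYY-MM-DD_groupid
sort -k8,8 -k5,5nr |                                          # group by key, biggest size first
awk '$8 != x { print } { x = $8 }'                            # keep the first (biggest) line per key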

Related

How to calculate total elapsed time

How to calculate elapsed time based on:
start time =
[user001a@dev51 logs]# grep 'Recovery Manager' refresh_03Jun2019_0250.log | head -1 | awk -F'on ' '{print $NF}'
Jun 3 02:50:02 2019
[user001a@dev51 logs]#
end time =
[user001a@dev51 logs]# ls -l refresh_03Jun2019_0250.log
-rw-r--r--. 1 user001a grp001a 170050 Jun 3 05:06 refresh_03Jun2019_0250.log
[user001a@dev51 logs]#
Note: stat does not report the birth time here, so stat might not be a good option for calculating the file create and modify times:
[user001a@dev51 logs]# stat refresh_03Jun2019_0250.log
File: `refresh_03Jun2019_0250.log'
Size: 170050 Blocks: 344 IO Block: 4096 regular file
Device: 811h/2065d Inode: 1474545 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 219/ user001a) Gid: ( 219/grp001a)
Access: 2019-06-03 05:06:40.830829026 -0400
Modify: 2019-06-03 05:06:40.827828883 -0400
Change: 2019-06-03 05:06:40.827828883 -0400
[user001a@dev51 logs]#
Sample1 output:
StartTime=June 3, 2019 at 2:50:02 am
EndTime=June 3, 2019 at 5:06:40 am
ElapsedTime=2 hours, 16 minutes and 38 seconds
Sample2 output:
ElapsedTime=2 hours, 16 minutes and 38 seconds
Limitation of this solution: max 23 hours. For longer spans, days need to be added.
StartTime="June 3, 2019 at 2:50:02 am"
EndTime="June 3, 2019 at 5:06:40 am"
StartTimeInEpoch=`echo $StartTime | sed 's/at //g' | date -f- +"%s"`
EndTimeInEpoch=`echo $EndTime | sed 's/at //g' | date -f- +"%s"`
echo $EndTimeInEpoch-$StartTimeInEpoch | bc | sed 's/^/@/g' | date -u -f- "+%_H hours %_M minutes %_S seconds"
Output:
2 hours 16 minutes 38 seconds
Assuming you've got your dates in the variables StartTime and EndTime. It's necessary to remove "at" from them; sed does this. Then both dates are converted to epoch time; +"%s" does the trick. -f- tells date to take the date from stdin (the pipe). Then we can subtract the dates, add @ to the beginning, and format the result with date. -u means UTC time, so there is no time-zone shift.
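To lift the 23-hour limitation mentioned above, one option is to do the arithmetic in the shell instead of formatting the difference as a time of day. A minimal sketch, assuming GNU date and bash:
StartEpoch=$(date -d "${StartTime//at /}" +%s)   # strip "at " and convert to epoch seconds
EndEpoch=$(date -d "${EndTime//at /}" +%s)
diff=$((EndEpoch - StartEpoch))
printf '%d days %d hours %d minutes %d seconds\n' \
  $((diff / 86400)) $((diff % 86400 / 3600)) $((diff % 3600 / 60)) $((diff % 60))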

Number of logins on Linux using Shell script and AWK

How can I get the number of logins for each day from the beginning of the wtmp file using AWK?
I thought about using an associative array, but I don't know how to implement it in AWK.
myscript.sh
#!/bin/bash
awk 'BEGIN{numberoflogins=0}
#code goes here'
The output of the last command:
[fnorbert@localhost Documents]$ last
fnorbert tty2 /dev/tty2 Mon Apr 24 13:25 still logged in
reboot system boot 4.8.6-300.fc25.x Mon Apr 24 16:25 still running
reboot system boot 4.8.6-300.fc25.x Mon Apr 24 13:42 still running
fnorbert tty2 /dev/tty2 Fri Apr 21 16:14 - 21:56 (05:42)
reboot system boot 4.8.6-300.fc25.x Fri Apr 21 19:13 - 21:56 (02:43)
fnorbert tty2 /dev/tty2 Tue Apr 4 08:31 - 10:02 (01:30)
reboot system boot 4.8.6-300.fc25.x Tue Apr 4 10:30 - 10:02 (00:-27)
fnorbert tty2 /dev/tty2 Tue Apr 4 08:14 - 08:26 (00:11)
reboot system boot 4.8.6-300.fc25.x Tue Apr 4 10:13 - 08:26 (-1:-47)
wtmp begins Mon Mar 6 09:39:43 2017
The shell script's output should be:
Apr 4: 4
Apr 21: 2
Apr 24: 3
(using an associative array, if possible)
In awk, arrays can be indexed by strings or numbers, so you can use them like associative arrays.
However, what you're asking will be hard to do reliably with awk, because the delimiters are whitespace: empty fields will throw off the columns, and if you use FIELDWIDTHS you'll also be thrown off by columns longer than their assigned width.
If all you're looking for is just the number of logins per day you might want to use a combination of sed and awk (and sort):
last | \
sed -E 's/^.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ([ 0-9]{2}).*$/\2 \3/p;d' | \
awk '{arr[$0]++} END { for (a in arr) print a": " arr[a]}' | \
sort -M
The sed -E uses extended regular expressions, and the pattern prints just the date from each line emitted by last (it matches on the day of week, but prints only the month and day).
We could have used uniq -c to get the counts, but using awk we can do an associative array as you hinted.
Finally using sort -M we're sorting on the abbreviated date formats like Apr 24, Mar 16, etc.
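For reference, the uniq -c variant mentioned above would look something like this (a sketch; note that uniq -c prints the count before the date rather than after it):
last | \
sed -E 's/^.*(Mon|Tue|Wed|Thu|Fri|Sat|Sun) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) ([ 0-9]{2}).*$/\2 \3/p;d' | \
sort -M | uniq -c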
Try the following awk script (it assumes all the lines belong to the same month, i.e. the current month):
myscript.awk:
#!/bin/awk -f
# Note: the three-argument form of match() requires GNU awk (gawk).
{
    a[NR]=$0;   # save each line into an array indexed by line number
}
END {
    for (i=NR-1; i>=1; i--) {   # iterate the lines in reverse order, skipping the trailing "wtmp begins" line
        if (match(a[i], /[A-Z][a-z]{2} ([A-Z][a-z]{2}) *([0-9]{1,2}) [0-9]{2}:[0-9]{2}/, b)) {
            m=b[1];      # save the month name
            c[b[2]]++;   # count the logins on each day (only when the line matched)
        }
    }
    for (i in c) print m, i": "c[i]
}
Usage:
last | awk -f myscript.awk
The output:
Apr 4: 4
Apr 21: 2
Apr 24: 3

How to merge text files with common pair of strings in their lines

I have two text files with the following line format:
Value - Value - Number
I need to merge these files in a new one that contains only the lines with the common Value - Value pairs followed by the two Number values.
For example if I have these files:
File1.txt
Jack - Mark - 12
Alex - Ryan - 15
Jack - Ryan - 22
File2.txt
Paul - Bill - 11
Jack - Mark - 18
Jack - Ryan - 20
The merged file will contain:
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
How can I do this?
awk to the rescue!
awk -F' - ' '
BEGIN {OFS=FS}                           # keep " - " as the output separator
NR==FNR {a[$1,$2]=$3; next}              # file1: remember the Number for each Value pair
($1,$2) in a {print $1,$2,a[$1,$2],$3}   # file2: pair seen in file1? print it with both Numbers
' file1 file2
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20
Alternatively, with decorate/join/undecorate (squeeze the first " - " so the Value pair becomes a single join field, join on it, then restore the separator):
$ join <(sort file1 | sed 's/ - /-/') <(sort file2 | sed 's/ - /-/') |
sed 's/-/ - /'
Jack - Mark - 12 - 18
Jack - Ryan - 22 - 20

Pick up files based on dates in a ksh script

I have this list of files. Now I have to pick the latest file for each order based on some conditions:
3679 Jul 21 23:59 belk_rpo_error_**po9324892**_07212014.log
0 Jul 22 23:59 belk_rpo_error_**po9324892**_07222014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
0 Jul 20 05:50 belk_rpo_error_**po9999992**_07202014.log
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
742 Jul 21 07:30 belk_rpo_error_**po9999991**_07212014.log
0 Jul 23 2014 belk_rpo_error_**po9999991**_07232014.log
For a PARTICULAR Order_No (marked with ** **):
If the latest file is 0 KB, then we will discard it (and the rest of the files with the same Order_No as well).
If the latest file is non-zero, then I will take it (only the latest one).
Then append the contents to a txt file.
My expected output would be ::
411 Jul 21 06:30 belk_rpo_error_**po9999992**_07212014.log
3679 Jul 23 23:59 belk_rpo_error_**po9324892**_07232014.log
22 Jul 22 06:30 belk_rpo_error_**po9324267**_07012014.log
I am at my wits' end here. I can't seem to figure out how to compare dates in Unix. Any help is very appreciated.
You can try something like:
touch test.txt
for var in $(find . -type f ! -empty)
do
    cat "$var" >> test.txt
done
untested
Use stat to emit the date (epoch time), size and filename.
Use awk to filter out zero-length files and extract the order number.
Sort by order number and date.
Use awk to pick up the last filename for each order number.
stat -c $'%Y\t%s\t%n' *.log |
awk -F'\t' -v OFS='\t' '
$2 > 0 {
split($3, a, /_/)
print a[4], $1, $3
}' |
sort -t $'\t' -k1,1 -k2,2n |
awk -F'\t' '
NR > 1 && $1 != prev_order {print filename}
{filename = $3; prev_order = $1}
END {print filename}
'
The sort command is worth double-checking: to group by order number, the order number has to be the primary sort key, with the file time as the secondary key (which is what -k1,1 -k2,2n does above).
If I understand your question, the resulting files need to be concatenated and appended to a file. If the above pipeline is working OK, then pipe it into | xargs cat >> something.log
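One detail the pipeline above glosses over: the question asks to discard an order entirely when its latest file is empty, but filtering out zero-length files up front hides that case. A sketch of one way to honor that rule (untested, assuming GNU stat and the same filename layout):
stat -c $'%Y\t%s\t%n' *.log |
awk -F'\t' -v OFS='\t' '{ split($3, a, /_/); print a[4], $1, $2, $3 }' |
sort -t $'\t' -k1,1 -k2,2n |
awk -F'\t' '
{ size[$1] = $3; name[$1] = $4 }   # later lines overwrite earlier ones: keep the latest per order
END { for (o in name) if (size[o] > 0) print name[o] }
'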

Using awk to do an exact match in a file

I'm just wondering how we can use awk to do exact matches.
For example:
$ cal 09 09 2009
September 2009
Su Mo Tu We Th Fr Sa
1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30
$ cal 09 09 2009 | awk '{day="9"; col=index($0,day); print col }'
17
0
0
11
20
0
8
0
As you can see, the above command outputs the index number for every line that contains the string/number "9". Is there a way to make awk output the index number only for the 4th line of the cal output above? Maybe there's an even more elegant solution?
I'm using awk to get the day name using the cal command. Here's the whole line of code:
$ dayOfWeek=$(cal $day $month $year | awk '{day='$day'; split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array); column=index($0,day); dow=int((column+2)/3); print array[dow]}')
The problem with the above code is that if multiple matches are found, I get multiple results, whereas I want only one result.
Thanks!
Limit the call to index() to only those lines which have your "day" surrounded by spaces:
awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Proof of Concept
$ cal 02 1956
February 1956
Su Mo Tu We Th Fr Sa
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Saturday
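For readability, here is the same one-liner expanded with comments (the \< and \> word-boundary operators are a GNU awk extension; also note that the field-to-weekday mapping relies on the matched week row being full, which holds for any day after the first week of the month):
cal 02 1956 | awk -v day="$day" '
BEGIN { split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array) }
$0 ~ "\\<" day "\\>" {             # only look at lines containing day as a whole word
    for (i = 1; i <= NF; i++)      # scan the fields of the matching week row
        if ($i == day)
            print array[i]         # field position i maps to the i-th weekday name
}'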
Update
If all you are looking for is to get the day of the week from a certain date, you should really be using the date command. Note that GNU date parses "$day/$month/$year" as month/day/year, so the ISO year-month-day order is safer:
$ day=9; month=9; year=2009
$ dayOfWeek=$(date +%A -d "$year-$month-$day")
$ echo $dayOfWeek
Wednesday
You wrote:
cal 09 09 2009
I'm not aware of a version of cal that accepts the day of month as an input, only:
cal ${mon} (optional) ${year} (optional)
But that doesn't affect your main issue.
You wrote:
is there a way to make awk output the index number in only the 4th line of the cal output above?
NR (Number of Records) is your friend, and there are numerous ways to use it.
cal 09 09 2009 | awk 'NR==4{day="9"; col=index($0,day); print col }'
OR
cal 09 09 2009 | awk '{day="9"; if (NR==4) {col=index($0,day); print col } }'
ALSO
In awk, if you have variable assignments that should be used throughout your whole program, then it is better to use the BEGIN section so that the assignment is only performed once. Not a big deal in your example, but why set bad habits ;-)?
HENCE
cal 09 2009 | awk 'BEGIN{day="9"}; NR==4 {col=index($0,day); print col }'
FINALLY
It is not completely clear what problem you are trying to solve. Are you sure you always want to grab line 4? If not, then how do you propose to solve that?
Problems stated as " 1. I am trying to do X. 2. Here is my input. 3. Here is my output. 4. Here is the code that generated that output" are much easier to respond to.
It looks like you're trying to do date calculations. You can build much more robust and general solutions by using the GNU date command. I have seen numerous useful discussions of this tagged as bash, shell, and date.
I hope this helps.
This is so much easier to do in a language that has time functionality built-in. Tcl is great for that, but many other languages are too:
$ echo 'puts [clock format [clock scan 9/9/2009] -format %a]' | tclsh
Wed
If you want awk to only output for line 4, restrict the rule to line 4:
$ awk 'NR == 4 { ... }'
