Remove lines having older end time - bash

In my bash script I want to add code which removes all log entries older than x days.
To simplify this problem, I have divided it into 3 parts. 2 parts are
done; I am looking for an answer for the 3rd part.
a) To find the latest log date - Done
b) evaluate earliest epoch time. (All entries before this epoch
time should be deleted) - Done
No_OF_DAYS=2
One_Day=86400000 # one day in milliseconds
Latest_Time=$(find . -name '*.tps' -exec sed '/endTime/!d; s/{//; s/,.*//' {} + | sort -r | head -1 | cut -d: -f2) #latest epoch time
Days_in_Epoch=$((One_Day * No_OF_DAYS))
Earliest_Time=$((Latest_Time - Days_in_Epoch)) #earliest epoch time
c) delete all log entries older than evaluated earliest time.
PS:
There are multiple files, distributed across different subfolders.
All files have the extension ".tps".
Times are in epoch milliseconds; endTime is the field used for the calculations. ("endTime":1488902735220)
sample data
Code:
{"endTime":1488902734775,"startTime":1488902734775,"operationIdentity":"publishCacheStatistics","name":"murex.risk.control.excesses.cache.CacheStatisticsTracer","context":{"parentContext":{"id":-1,"parentContext":null},"data":[{"value":"excessCacheExcessKeysToContexts","key":"name"},{"value":"0","key":"hits"},{"value":"0","key":"misses"},{"value":"0","key":"count"},{"value":"0","key":"maxElements"},{"value":"0","key":"evictions"},{"value":"N/A","key":"policy"}],"id":0}}
{"endTime":1488902735220,"startTime":1488902735220,"operationIdentity":"publishCacheStatistics","name":"murex.risk.control.excesses.cache.CacheStatisticsTracer","context":{"parentContext":{"id":-1,"parentContext":null},"data":[{"value":"excessCacheExcessKeysToContexts","key":"name"},{"value":"0","key":"hits"},{"value":"0","key":"misses"},{"value":"0","key":"count"},{"value":"0","key":"maxElements"},{"value":"0","key":"evictions"},{"value":"N/A","key":"policy"}],"id":8}}
{"endTime":1488902735550,"startTime":1488902735550,"operationIdentity":"publishCacheStatistics","name":"murex.risk.control.excesses.cache.CacheStatisticsTracer","context":{"parentContext":{"id":-1,"parentContext":null},"data":[{"value":"excessCacheContextsToExcessIds","key":"name"},{"value":"0","key":"hits"},{"value":"0","key":"misses"},{"value":"0","key":"count"},{"value":"0","key":"maxElements"},{"value":"0","key":"evictions"},{"value":"N/A","key":"policy"}],"id":9}}
For Example:
a)
latest epoch time = 1488902735550
b)
earliest epoch time = 1488902735220
Problem: Now I am looking for a command which deletes all the entries that are older/less than the evaluated earliest epoch time. In the above example the 1st line should be deleted.
Any help/suggestions are appreciated. Thank you.

This will do the trick, buddy. Be careful to test it with backup files first, as it will overwrite your logs directly. Also change the TIME variable to whatever you want to compare against.
while read -r file
do
awk -v FS=':|,' -v TIME='1488902735220' '($2 >= TIME) && !($0 ~ /^ *$/) { print $0 }' "$file" > tmp.txt && cat tmp.txt > "$file"
done < <( find ./ -name '*.tps' 2>/dev/null )
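If any file paths contain spaces or newlines, a null-delimited variant of the same loop is safer. This is a sketch; the TIME value is the example cutoff from above, and the filter keeps records whose endTime (second [:,]-delimited field) is at or after the cutoff:

```shell
# In-place filter per file; -print0 / read -d '' keeps filenames with
# spaces or newlines intact.
TIME=1488902735220
while IFS= read -r -d '' file; do
    awk -F '[:,]' -v TIME="$TIME" '($2 >= TIME) && $0 !~ /^ *$/' "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
done < <(find . -name '*.tps' -print0)
```

Using mv onto a temporary file replaces the file atomically, at the cost of not preserving the original inode.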
Regards!

Based on your current solution, I'd use a simple loop to read the file line by line and only output those lines whose endTime is not less than your earliest time:
while IFS= read -r line; do
line_endTime=$(awk -F '[:,]' '{print $2}' <<< "$line")
if [ "$line_endTime" -ge "$Earliest_Time" ]; then echo "$line"; fi
done < input_file > filtered_output_file
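The same per-line filter can be collapsed into a single awk pass, which avoids spawning one awk process per input line. A sketch, assuming Earliest_Time holds the epoch cutoff computed in the question:

```shell
# One awk process for the whole file: keep records whose endTime
# (second [:,]-delimited field) is not below the cutoff.
awk -F '[:,]' -v t="$Earliest_Time" '$2 >= t' input_file > filtered_output_file
```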

Related

Linux - Finding the max modified date of each set of files in each directory

path/mydir contains a list of directories. The names of these directories tell me which database they relate to.
Inside each directory is a bunch of files, but the filenames tell me nothing of importance.
I'm trying to write a command in Linux bash that accomplishes the following:
For each directory in path/mydir, find the max timestamp of the last modified file within that directory
Print the last modified file's timestamp next to the parent directory's name
Exclude any timestamps less than 30 days old
Exclude specific directory names using regex
Order by oldest timestamp
Given this directory structure in path/mydir:
database_1
table_1.file (last modified 2021-11-01)
table_2.file (last modified 2021-11-01)
table_3.file (last modified 2021-11-05)
database_2
table_1.file (last modified 2021-05-01)
table_2.file (last modified 2021-05-01)
table_3.file (last modified 2021-08-01)
database_3
table_1.file (last modified 2020-01-01)
table_2.file (last modified 2020-01-01)
table_3.file (last modified 2020-06-01)
I would want to output:
database_3 2020-06-01
database_2 2021-08-01
This half works, but looks at the modified date of the parent directory instead of the max timestamp of files under the directory:
find . -maxdepth 1 -mtime +30 -type d -ls | grep -vE 'name1|name2'
I'm very much a novice with bash, so any help and guidance is appreciated!
Would you please try the following:
#!/bin/bash
cd "path/mydir/"
for d in */; do
dirname=${d%/}
mdate=$(find "$d" -maxdepth 1 -type f -mtime +30 -printf "%TY-%Tm-%Td\t%TT\t%p\n" | sort -rk1,2 | head -n 1 | cut -f1)
[[ -n $mdate ]] && echo -e "$mdate\t$dirname"
done | sort -k1,1 | sed -E $'s/^([^\t]+)\t(.+)/\\2 \\1/'
Output with the provided example:
database_3 2020-06-01
database_2 2021-08-01
for d in */; do loops over the subdirectories in path/mydir/.
dirname=${d%/} removes the trailing slash just for the printing purpose.
printf "%TY-%Tm-%Td\t%TT\t%p\n" prepends the modification date and time
to the filename delimited by a tab character. The result will look like:
2021-08-01 12:34:56 database_2/table_3.file
sort -rk1,2 sorts the output by the date and time fields in descending order.
head -n 1 picks the line with the latest timestamp.
cut -f1 extracts the first field with the modification date.
[[ -n $mdate ]] skips the empty mdate.
sort -k1,1 just after done performs the global sorting across the
outputs of the subdirectories.
sed -E ... swaps the timestamp and the dirname. It handles
the case where the dirname may contain a tab character. If not, you can
omit the sed command by swapping the order of timestamp and dirname
in the echo command and changing the sort command to sort -k2,2.
As for the mentioned Exclude specific directory names using regex, add
your own logic to the find command or whatever.
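One hedged option for that exclusion is to filter the glob inside the loop before doing any work. A sketch, where name1/name2 are the placeholder names from the question:

```shell
# Skip directories matching an exclusion regex before processing them.
exclude='^(name1|name2)$'
for d in */; do
    dirname=${d%/}
    [[ $dirname =~ $exclude ]] && continue
    echo "processing $dirname"
done
```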
[Edit]
In order to print the directory name if the last modified file in the subdirectories is older than the specified date, please try instead:
#!/bin/bash
cd "path/mydir/"
now=$(date +%s)
for d in */; do
dirname=${d%/}
read -r secs mdate < <(find "$d" -type f -printf "%T#\t%TY-%Tm-%Td\n" | sort -nrk1,1 | head -n 1)
secs=${secs%.*}
if (( secs < now - 3600 * 24 * 30 )); then
echo -e "$secs\t$dirname $mdate"
fi
done | sort -nk1,1 | cut -f2-
now=$(date +%s) assigns the variable now to the current time as
the seconds since the epoch.
for d in */; do loops over the subdirectories in path/mydir/.
dirname=${d%/} removes the trailing slash just for the printing purpose.
-printf "%T#\t%TY-%Tm-%Td\n" prints the modification time as seconds since
the epoch and the modification date delimited by a tab character.
The result will look like:
1627743600 2021-08-01
sort -nrk1,1 sorts the output by the modification time in descending order.
head -n 1 picks the line with the latest timestamp.
read -r secs mdate < <( stuff ) assigns secs and mdate to the
outputs of the command in order.
secs=${secs%.*} removes the fractional part.
The condition (( secs < now - 3600 * 24 * 30 )) meets if secs
is 30 days or more older than now.
echo -e "$secs\t$dirname $mdate" prints dirname and mdate
prepending the secs for the sorting purpose.
sort -nk1,1 just after done performs the global sorting across the
outputs of the subdirectories.
cut -f2- removes secs portion.

Awk extracting column but just from one row

I have a log file in the following form:
2017-12-11 10:20:16.993 ...
2017-12-12 10:19:16.993 ...
2017-12-13 10:17:16.993 ...
and I want to extract the first column via awk -F'.', compare it to the actual system time in seconds, and print the line if the difference is less than 300 seconds.
SYSTEM_TIME=$(date +%s)
awk -F. -v system_time="$SYSTEM_TIME" '{gsub(/[-:]/," ",$1); if(system_time-mktime($1) <= 300) {print $0}}' log.txt
This is my code, but I can't use mktime because it's not in the POSIX norm. Can it be done without it?
Thanks,
Ahmed
General remark: logfiles are often incomplete. A date-time format is given, but often the time zone is missing. When daylight saving comes into play, a missing time zone can mess up your complete karma.
Note: In all commands below, it is assumed that the date in the logfile is in UTC and that the system runs in UTC. If this is not the case, be aware that daylight saving time will create problems when running any of the commands below around the time daylight saving kicks in.
Combination of date and awk: (not POSIX)
If your date command has the -d flag (not POSIX), you can run the following:
awk -v r="$(date -d '300 seconds ago' '+%F %T.%3N')" '(r < $0)' logfile
GNU awk only:
If you want to make use of mktime, it is then easier to just do:
awk 'BEGIN{s=systime();FS=OFS="."}
{t=$1;gsub(/[-:]/," ",t); t=mktime(t)}
(s-t <= 300)' logfile
I will be under the assumption that the log-files are not created in the future, so all times are always smaller than system time.
POSIX:
If you cannot make use of mktime but want to use posix only, which also implies that date does not have the -d flag, you can create your own implementation of mktime. Be aware, that the version presented here does not do any timezone corrections as is done with mktime. mktime_posix assumes that the datestring is in UTC
awk -v s="$(date +%s)" '
# Algorithm from "Astronomical Algorithms" By J.Meeus
function mktime_posix(datestring, a,t) {
split(datestring,a," ")
if (a[1] < 1970) return -1
if (a[2] <= 2) { a[1]--; a[2]+=12 }
t=int(a[1]/100); t=2-t+int(t/4)
t=int(365.25*a[1]) + int(30.6001*(a[2]+1)) + a[3] + t - 719593
return t*86400 + a[4]*3600 + a[5]*60 + a[6]
}
BEGIN{FS=OFS="."}
{t=$1;gsub(/[-:]/," ",t); t=mktime_posix(t)}
(s-t <= 300)' logfile
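As a sanity check of mktime_posix (which assumes its input is UTC), its output can be compared against a known epoch value:

```shell
# "2017 12 11 10 20 16" UTC is epoch 1512987616
# (cross-check with GNU date: date -u -d '2017-12-11 10:20:16' +%s)
awk 'function mktime_posix(datestring, a,t) {
    split(datestring,a," ")
    if (a[1] < 1970) return -1
    if (a[2] <= 2) { a[1]--; a[2]+=12 }
    t=int(a[1]/100); t=2-t+int(t/4)
    t=int(365.25*a[1]) + int(30.6001*(a[2]+1)) + a[3] + t - 719593
    return t*86400 + a[4]*3600 + a[5]*60 + a[6]
}
BEGIN { print mktime_posix("2017 12 11 10 20 16") }'   # prints 1512987616
```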
Related: this answer, this answer
I can think of doing it this way, as it's shorter.
#!/bin/bash
SYSTEM_TIME=$(date +%s)
LOGTIME=$( date "+%s" -d "$( awk -F'.' '{print $1}' <( head -1 inputtime.txt ))" )
DIFFERENCEINSECONDS=$( echo "$SYSTEM_TIME $LOGTIME" | awk '{ print ($1 - $2)}' )
if [[ "$DIFFERENCEINSECONDS" -gt 300 ]]
then
echo "TRIGGERED!"
fi
Hope it's useful for you. Let me know.
Note: I assumed your input log could be called inputtime.txt. You need to change that to your actual filename, of course.

Is it really slow to handle a text file (more than 10K lines) with a shell script?

I have a file with more than 10K lines of records.
Within each line, there are two date+time info. Below is an example:
"aaa bbb ccc 170915 200801 12;ddd e f; g; hh; 171020 122030 10; ii jj kk;"
I want to filter out the lines where the number of days between these two dates is less than 30.
Below is my source code:
#!/bin/bash
filename="$1"
echo $filename
touch filterfile
totalline=`wc -l $filename | awk '{print $1}'`
i=0
j=0
echo $totalline lines
while read -r line
do
i=$[i+1]
if [ $i -gt $[j+9] ]; then
j=$i
echo $i
fi
shortline=`echo $line | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'`
date1=`echo $shortline | awk '{print $1}'`
date2=`echo $shortline | awk '{print $2}'`
if [ $date1 -gt 700000 ]
then
continue
fi
d1=`date -d $date1 +%s`
d2=`date -d $date2 +%s`
diffday=$[(d2-d1)/(24*3600)]
#diffdays=`date -d $date2 +%s` - `date -d $date1 +%s`)/(24*3600)
if [ $diffday -lt 30 ]
then
echo $line >> filterfile
fi
done < "$filename"
I am running it in Cygwin. It took about 10 seconds to handle 10 lines. I use echo $i to show the progress.
Is it because I am doing something wrong in my script?
This answer does not answer your question but gives an alternative method to your shell script. The answer to your question is given in Sundeep's comment:
Why is using a shell loop to process text considered bad practice?
Furthermore, you should be aware that every time you call sed, awk, echo, date, ... you are requesting the system to execute a binary which needs to be loaded into memory, etc. So if you do this in a loop, it is very inefficient.
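To illustrate the cost with toy data: the loop below spawns one awk process per input line, while the one-liner handles the whole file in a single process, producing identical output far faster:

```shell
# Generate toy input: 100 lines of "value N"
seq 100 | sed 's/^/value /' > toy.txt

# Slow pattern: one awk process per line (100 process spawns)
while read -r line; do
    echo "$line" | awk '{print $2}'
done < toy.txt > out_loop.txt

# Fast pattern: one awk process for the whole file
awk '{print $2}' toy.txt > out_awk.txt

cmp -s out_loop.txt out_awk.txt && echo "identical output"
```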
alternative solution
awk programs are commonly used to process log files containing timestamp information, indicating when a particular log record was written. gawk extended the awk standard with time-handling functions. The one you are interested in is :
mktime(datespec [, utc-flag ]) Turn datespec into a timestamp in the
same form as is returned by systime(). It is similar to the function
of the same name in ISO C. The argument, datespec, is a string of the
form "YYYY MM DD HH MM SS [DST]". The string consists of six or seven
numbers representing, respectively, the full year including century,
the month from 1 to 12, the day of the month from 1 to 31, the hour of
the day from 0 to 23, the minute from 0 to 59, the second from 0 to
60, and an optional daylight-savings flag.
The values of these numbers need not be within the ranges specified;
for example, an hour of -1 means 1 hour before midnight. The
origin-zero Gregorian calendar is assumed, with year 0 preceding year
1 and year -1 preceding year 0. If utc-flag is present and is either
nonzero or non-null, the time is assumed to be in the UTC time zone;
otherwise, the time is assumed to be in the local time zone. If the
DST daylight-savings flag is positive, the time is assumed to be
daylight savings time; if zero, the time is assumed to be standard
time; and if negative (the default), mktime() attempts to determine
whether daylight savings time is in effect for the specified time.
If datespec does not contain enough elements or if the resulting time
is out of range, mktime() returns -1.
As your date format is of the form yymmdd HHMMSS, we need to write a parser function convertTime for this. Be aware that in this function we will pass times of the form yymmddHHMMSS. Furthermore, using space-delimited fields, your times are located in fields $4$5 and $11$12. As mktime converts the time to seconds since 1970-01-01, all we need to do is check if the delta time is smaller than 30*24*3600 seconds.

awk 'function convertTime(t) {
    s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
    s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
    return mktime(s)
}
{ t1=convertTime($4$5); t2=convertTime($11$12) }
(t2-t1 < 30*3600*24) { print }' <file>
If you are not interested in the real delta time (your sed line removes the actual time of the day), then you can adapt it to:
awk 'function convertTime(t) {
s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
s= s "00 00 00"
return mktime(s)
}
{ t1=convertTime($4); t2=convertTime($11)}
(t2-t1 < 30*3600*24) { print }' <file>
If the dates are not in the fields, you can use match to find them :
awk 'function convertTime(t) {
s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
return mktime(s)
}
{ match($0,/[0-9]{6} [0-9]{6}/);
t1=convertTime(substr($0,RSTART,RLENGTH));
a=substr($0,RSTART+RLENGTH)
match(a,/[0-9]{6} [0-9]{6}/)
t2=convertTime(substr(a,RSTART,RLENGTH))}
(t2-t1 < 30*3600*24) { print }' <file>
With some modifications, mostly not made with speed in mind, I can reduce the processing time by 50% - which is a lot:
#!/bin/bash
filename="$1"
echo "$filename"
# touch filterfile
totalline=$(wc -l < "$filename")
i=0
j=0
echo "$totalline" lines
while read -r line
do
i=$((i+1))
if (( i > ((j+9)) )); then
j=$i
echo $i
fi
shortline=($(echo "$line" | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'))
date1=${shortline[0]}
date2=${shortline[1]}
if (( date1 > 700000 ))
then
continue
fi
d1=$(date -d "$date1" +%s)
d2=$(date -d "$date2" +%s)
diffday=$(((d2-d1)/(24*3600)))
# diffdays=$(date -d $date2 +%s) - $(date -d $date1 +%s))/(24*3600)
if (( diffday < 30 ))
then
echo "$line" >> filterfile
fi
done < "$filename"
Some remarks:
# touch filterfile
Well - the later echo ... >> filterfile appends to this file and creates it if it doesn't exist, so the touch is redundant.
totalline=$(wc -l < "$filename")
You don't need awk here. The filename output is suppressed because wc reads from stdin and doesn't see the filename.
Capturing the output in an array:
shortline=($(echo "$line" | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'))
date1=${shortline[0]}
date2=${shortline[1]}
allows us array access and saves another call to awk.
On my machine, your code took about 42s for 2880 lines (on your machine 2880 s?) and about 19s for the same file with my code.
So I suspect, if you aren't running it on an i486 machine, that Cygwin might be the slowdown. It's a Linux environment for Windows, isn't it? Well, I'm on a core Linux system. Maybe try the GNU utils for Windows - the last time I looked for them, they were advertised as gnu-utils x32 or something; maybe there is an x64 version available by now.
And the next thing I would have a look at, is the date calculation - that might be a slowdown too.
2880 lines isn't that much, so I don't suspect that my SSD plays a huge role in the game.

How to calculate this difference in unix

I have a file named Test1.dat and its content's are as follows
Abcxxxxxxxxxxx_123.dat#10:10:15
Bcdxxxxxxxxxxx_145.dat#10:15:23
Cssxxxxxxxxxxx_567.dat#10:26:56
Fgsxxxxxxxxxxx_823.dat#10:46:56
Kssxxxxxxxxxxx_999.dat#11:15:23
Please note that after the # symbol it is the HH:MM:SS format that follows. My question now is I want to calculate the time difference between the current time and the time present in the files and fetch only those filenames where time difference is more than 30 Mins. So if current time is 11:00:00, I want to fetch files that have arrived 30 minutes before so basically the first three files.
This awk should do:
awk -F# '$2>=from && $2<=to' from="$(date +%H:%M:%S -d -30min)" to="$(date +%H:%M:%S)" file
If you only need to get the last 30 min (in your case you need the first one since you do not like 11:30):
awk -F# '$2>=from' from="$(date +%H:%M:%S -d -30min)" file
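Note the question actually asks for files older than 30 minutes, which is the complement of the window above. A sketch, assuming GNU date and that no timestamps cross midnight:

```shell
# Keep entries whose HH:MM:SS (field 2 after '#') is at or before the
# cutoff, i.e. that arrived more than 30 minutes ago. Plain string
# comparison works because HH:MM:SS is fixed-width.
limit=$(date +%H:%M:%S -d -30min)   # GNU date
awk -F# -v limit="$limit" '$2 <= limit' Test1.dat
```

With the sample data and a current time of 11:00:00 (limit 10:30:00), this prints the first three files.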
You can also use bash script and get the result
#!/bin/bash
current=`date '+%s'`
needed_time=`echo "$current - 60 * 30" | bc`
while read line ; do
time=`echo $line |sed -r 's/[^#]*.(.*)/\1/g'`
user_time=`date -d $time '+%s'`
if [ $user_time -le $needed_time ] ; then
echo "$line"
fi
done < file_name
Could this be the answer?
awk -F# '{ts = systime(); thirty_mins = 30 * 60; thirty_mins_ago = ts - thirty_mins; if ($2 < strftime("%H:%M:%S", thirty_mins_ago)) print $2 }' <file.txt
strftime is a GAWK (GNU awk) extension.

Filtering Filenames with bash

I have a directory full of log files in the form
${name}.log.${year}${month}${day}
such that they look like this:
logs/
production.log.20100314
production.log.20100321
production.log.20100328
production.log.20100403
production.log.20100410
...
production.log.20100314
production.log.old
I'd like to use a bash script to filter out all the logs older than X months and dump them into *.log.old
X=6 # months
LIST=*.log.*
for file in $LIST; do
    is_older=$(file_is_older_than_months "$file" "$X")   # pseudocode
    if $is_older; then
        cat "$file" >> production.log.old
        rm "$file"
    fi
done
How can I get all the files older than X months? And how can I avoid the *.log.old file being included in LIST?
The following script expects GNU date to be installed. You can call it in the directory with your log files with the first parameter as the number of months.
#!/bin/sh
min_date=$(date -d "$1 months ago" "+%Y%m%d")
for log in *.log.*;do
[ "${log%.log.old}" != "$log" ] && continue
[ "${log%.*}.$min_date" \< "$log" ] && continue
cat "$log" >> "${log%.*}.old"
rm "$log"
done
Presumably as a log file, it won't have been modified since it was created?
Have you considered something like this...
find ./ -name "*.log.*" -mtime +60 -exec rm {} \;
to delete files that have not been modified for 60 days. If the files have been modified more recently then this is no good of course.
You'll have to compare the logfile date with the current date. Start with the year, multiply by 12 to get the difference in months. Do the same with months, and add them together. This gives you the age of the file in months (according to the file name).
For each filename, you can use an AWK filter to extract the year:
awk -F. '{ print substr($3,1,4) }'
You also need the current year:
date "+%Y"
To calculate the difference:
$(( current_year - file_year ))
Similarly for months.
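Putting those pieces together, a hedged sketch that computes a file's age in months from its name (the filename is an example taken from the listing above):

```shell
file="production.log.20100314"
stamp=${file##*.}                 # 20100314
file_year=${stamp:0:4}            # 2010
file_month=${stamp:4:2}           # 03
current_year=$(date +%Y)
current_month=$(date +%m)
# 10# forces base 10 so months like 08/09 aren't parsed as octal
age_months=$(( (current_year - 10#$file_year) * 12 + (10#$current_month - 10#$file_month) ))
echo "$file is $age_months months old"
```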
Assuming you have the possibility of modifying the logs and the filename timestamp is the more accurate one, here's a gawk script.
#!/bin/bash
awk 'BEGIN{
months=6
current=systime() #get current time in sec
sec=months*30*86400 #months in sec
output="old.production" #output file
}
{
m=split(FILENAME,fn,".")
yr=substr(fn[m],1,4)
mth=substr(fn[m],5,2)
day=substr(fn[m],7,2)
t=mktime(yr" "mth" "day" 00 00 00")
if ( (current-t) > sec){
print "file: "FILENAME" more than "months" month"
while( (getline line < FILENAME )>0 ){
print line > output
}
close(FILENAME)
cmd="rm \047"FILENAME"\047"
print cmd
#system(cmd) #uncomment to use
}
}' production*