How do you loop through a certain date range - shell

I have files in a directory that are date-based but not obviously date-stamped:
File_yyyymmdd_record.log
A few years' worth of these have accumulated in the directory.
If these were simply numbers, all I would need to do is take the difference and increment a counter to generate each value:
var=substring( File_yyyymmdd_record.log ) /* get the yyyymmdd part */
var2=substring( File2_yyyymmdd_record.log ) /* get the yyyymmdd part */
delta=var2-var1
set i=delta and loop through to get the values for all these recordIDs (the recordID is the yyyymmdd part).
The problem is when I have 2 different months, or even years, in the directory, say 20131210 and 20140110:
the plain numeric difference is not going to give me all the recordIDs in that directory, since once the range spills over into the next month a plain numeric calculation no longer applies - it should be a date-based calculation.
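To see the problem concretely (GNU date assumed for the second calculation):

```shell
# Plain subtraction of yyyymmdd values is meaningless across month boundaries:
echo $(( 20140110 - 20131210 ))   # 8900 - nothing like a day count

# A date-based calculation via epoch seconds gives the real gap:
days=$(( ( $(date -u -d 2014-01-10 +%s) - $(date -u -d 2013-12-10 +%s) ) / 86400 ))
echo "$days"                      # 31
```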
What I want to do is pass 2 input parameters to the shell script
shell.sh recordID1 recordID2
and based on these it will find all matching records, store them somewhere, and loop through each record as an input, like this:
find <dir> -iname recordID* ...<some awk and sed here> |
while read recordID ;
do <stuff >
done
Can this be achieved? There are two parts:
first the date calculation, and second storing these recordIDs so I can cycle through them - echoing them to a tmp file is what comes off the bat.
For the date calculation part I tried this, and it works, but I am not sure whether it will falter in some situation:
echo $((($(date -u -d 2010-04-29 +%s) - $(date -u -d 2010-03-28 +%s)) / 86400))
So given recordID1 as 20100328, I have 32 days' worth of recordIDs to look for in that directory.
I would have to advance the date 32 times from recordID1 and store the results somewhere.
How best can all this be done?
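For reference, both parts can be sketched like this (an assumption-laden sketch, not the only approach: GNU date, and the File_yyyymmdd_record.log naming from above):

```shell
#!/usr/bin/env bash
# Sketch: emit every yyyymmdd recordID between two inclusive bounds.
iterate_record_ids() {
    local d=$1 end=$2
    while [ "$d" -le "$end" ]; do
        printf '%s\n' "$d"                 # or: process File_${d}_record.log here
        d=$(date -d "$d + 1 day" +%Y%m%d)  # date-aware increment, handles month/year spill
    done
}

iterate_record_ids 20131230 20140102
```

Piping the output into a `while read recordID; do <stuff>; done` loop avoids the tmp file entirely.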

If I understand correctly, you need to find log files whose names fall between 20131210 and 20140110
(no need to convert to epoch time):
#!/usr/bin/bash
sep=20131210
eep=20140110
find /DIR -type f -name "*.log" | while read -r file
do
    b=${file##*/}           # strip the directory part
    d=${b#*_}; d=${d%%_*}   # pull out the yyyymmdd between the underscores
    if [ "$d" -ge "$sep" ] && [ "$d" -le "$eep" ]; then
        : # <stuff>
    fi
done

Something like this should do:
s=20130102 # start date
e=20130202 # end date
sep=$(date +"%s" -d"$s") # conv to epoch
eep=$(date +"%s" -d"$e")
for f in *.log; do
    d=$(date +"%s" -d "$(sed -n 's/^[^_]*_\([^_]*\)_[^_]*\.log$/\1/p' <<< "$f")")
    if [ "$d" -ge "$sep" ] && [ "$d" -le "$eep" ]; then
        echo "$f"
    fi
done


Is it really slow to handle a text file (more than 10K lines) with a shell script?

I have a file with more than 10K lines of records.
Within each line, there are two date+time fields. Below is an example:
"aaa bbb ccc 170915 200801 12;ddd e f; g; hh; 171020 122030 10; ii jj kk;"
I want to filter out the lines where the gap between these two dates is less than 30 days.
Below is my source code:
#!/bin/bash
filename="$1"
echo $filename
touch filterfile
totalline=`wc -l $filename | awk '{print $1}'`
i=0
j=0
echo $totalline lines
while read -r line
do
i=$[i+1]
if [ $i -gt $[j+9] ]; then
j=$i
echo $i
fi
shortline=`echo $line | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'`
date1=`echo $shortline | awk '{print $1}'`
date2=`echo $shortline | awk '{print $2}'`
if [ $date1 -gt 700000 ]
then
continue
fi
d1=`date -d $date1 +%s`
d2=`date -d $date2 +%s`
diffday=$[(d2-d1)/(24*3600)]
#diffdays=`date -d $date2 +%s` - `date -d $date1 +%s`)/(24*3600)
if [ $diffday -lt 30 ]
then
echo $line >> filterfile
fi
done < "$filename"
I am running it in Cygwin. It takes about 10 seconds to handle 10 lines (I use echo $i to show the progress).
Am I doing something wrong in my script?
This answer does not answer your question but gives an alternative method to your shell script. The answer to your question is given by Sundeep's comment:
Why is using a shell loop to process text considered bad practice?
Furthermore, you should be aware that every time you call sed, awk, echo, date, ... you are asking the system to execute a binary which needs to be loaded into memory, and so on. Doing this in a loop is very inefficient.
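A quick illustration of the point (toy input; both versions compute the same sum, but the loop pays a process-creation cost on every line):

```shell
# Same computation two ways: forking sed once per line vs a single awk process.
tmp=$(mktemp)
seq 1 200 > "$tmp"

slow=0
while read -r line; do
    n=$(echo "$line" | sed 's/[^0-9]//g')   # one pipeline (subshell + sed fork) per line
    slow=$(( slow + n ))
done < "$tmp"

fast=$(awk '{ s += $1 } END { print s }' "$tmp")  # one process for all 200 lines

echo "$slow $fast"   # identical results, very different process counts
rm -f "$tmp"
```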
alternative solution
awk programs are commonly used to process log files containing timestamp information, indicating when a particular log record was written. gawk extends the awk standard with time-handling functions. The one you are interested in is:
mktime(datespec [, utc-flag ]) Turn datespec into a timestamp in the
same form as is returned by systime(). It is similar to the function
of the same name in ISO C. The argument, datespec, is a string of the
form "YYYY MM DD HH MM SS [DST]". The string consists of six or seven
numbers representing, respectively, the full year including century,
the month from 1 to 12, the day of the month from 1 to 31, the hour of
the day from 0 to 23, the minute from 0 to 59, the second from 0 to
60, and an optional daylight-savings flag.
The values of these numbers need not be within the ranges specified;
for example, an hour of -1 means 1 hour before midnight. The
origin-zero Gregorian calendar is assumed, with year 0 preceding year
1 and year -1 preceding year 0. If utc-flag is present and is either
nonzero or non-null, the time is assumed to be in the UTC time zone;
otherwise, the time is assumed to be in the local time zone. If the
DST daylight-savings flag is positive, the time is assumed to be
daylight savings time; if zero, the time is assumed to be standard
time; and if negative (the default), mktime() attempts to determine
whether daylight savings time is in effect for the specified time.
If datespec does not contain enough elements or if the resulting time
is out of range, mktime() returns -1.
As your date format is of the form yymmdd HHMMSS, we need to write a parser function convertTime for this. Be aware that we will pass this function times of the form yymmddHHMMSS. Furthermore, using space-delimited fields, your times are located in fields $4$5 and $11$12. As mktime converts the time to seconds since 1970-01-01, all we need to do is check whether the delta time is smaller than 30*24*3600 seconds.

awk 'function convertTime(t) {
s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
return mktime(s)
}
{ t1=convertTime($4$5); t2=convertTime($11$12)}
(t2-t1 < 30*3600*24) { print }' <file>
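For example, running this on the sample line from the question (gawk assumed): the two stamps are 2017-09-15 20:08:01 and 2017-10-20 12:20:30, about 34.7 days apart, so the line is correctly not printed:

```shell
line='aaa bbb ccc 170915 200801 12;ddd e f; g; hh; 171020 122030 10; ii jj kk;'
matched=$(echo "$line" | gawk 'function convertTime(t) {
        s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
        s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
        return mktime(s)
    }
    { t1=convertTime($4$5); t2=convertTime($11$12) }
    (t2-t1 < 30*3600*24) { print }')
echo "matched: '$matched'"   # empty - the gap is >= 30 days
```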
If you are not interested in the real delta time (your sed line removes the actual time of day), then you can adapt it to:
awk 'function convertTime(t) {
s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
s= s "00 00 00"
return mktime(s)
}
{ t1=convertTime($4); t2=convertTime($11)}
(t2-t1 < 30*3600*24) { print }' <file>
If the dates are not in fixed fields, you can use match to find them:
awk 'function convertTime(t) {
s="20"substr(t,1,2)" "substr(t,3,2)" "substr(t,5,2)" "
s= s substr(t,7,2)" "substr(t,9,2)" "substr(t,11,2)
return mktime(s)
}
{ match($0,/[0-9]{6} [0-9]{6}/);
t1=convertTime(substr($0,RSTART,RLENGTH));
a=substr($0,RSTART+RLENGTH)
match(a,/[0-9]{6} [0-9]{6}/)
t2=convertTime(substr(a,RSTART,RLENGTH))}
(t2-t1 < 30*3600*24) { print }' <file>
With some modifications, mostly not even aimed at speed, I can reduce the processing time by 50% - which is a lot:
#!/bin/bash
filename="$1"
echo "$filename"
# touch filterfile
totalline=$(wc -l < "$filename")
i=0
j=0
echo "$totalline" lines
while read -r line
do
i=$((i+1))
if (( i > j+9 )); then
j=$i
echo $i
fi
shortline=($(echo "$line" | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'))
date1=${shortline[0]}
date2=${shortline[1]}
if (( date1 > 700000 ))
then
continue
fi
d1=$(date -d "$date1" +%s)
d2=$(date -d "$date2" +%s)
diffday=$(((d2-d1)/(24*3600)))
# diffdays=$(date -d $date2 +%s) - $(date -d $date1 +%s))/(24*3600)
if (( diffday < 30 ))
then
echo "$line" >> filterfile
fi
done < "$filename"
Some remarks:
# touch filterfile
The later >> filterfile redirection creates the file if it doesn't exist (and appends otherwise), so the touch is unnecessary.
totalline=$(wc -l < "$filename")
You don't need awk here: wc omits the filename from its output when it reads from stdin.
Capturing the output in an array:
shortline=($(echo "$line" | sed 's/.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*\([0-9]\{6\}\)[ ][0-9]\{6\}.*/\1 \2/'))
date1=${shortline[0]}
date2=${shortline[1]}
allows us array access and saves another call to awk.
On my machine, your code took about 42s for 2880 lines (on your machine 2880 s?) and about 19s for the same file with my code.
So unless you are running it on an i486 machine, I suspect Cygwin might be the slowdown; it's a Linux-like environment for Windows, while I'm on a native Linux system. Maybe try the GNU utils for Windows - the last time I looked, they were advertised as gnu-utils x32 or something; maybe there is an x64 version available by now.
The next thing I would look at is the date calculation - that might be a slowdown too.
2880 lines isn't that much, so I don't suspect that my SSD plays a huge role in the game.

Randomly loop over days in bash-script

At the moment, I have a while-loop that takes a starting date, runs a python script with the day as the input, then takes the day + 1 until a certain due date is reached.
day_start=2016-01-01
while [ "$day_start"!=2018-01-01 ] ;
do
day_end=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
python script.py --start="$day_start" --end="$day_end";
day_start=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
done
I would like to do the same thing, but now pick a random day between 2016-01-01 and 2018-01-01 and repeat until all days have been used once. I think it should be a for loop instead of this while loop, but I have trouble specifying the for loop over this date range in bash. Does anyone have an idea how to formulate this?
It can take quite a long time if you choose dates at random with replacement, because of the Coupon Collector's Problem: you'll hit most of the dates over and over again, while the last missing date can take a long time to appear (for n = 731 days, roughly n * ln n, about 4800 draws on average).
The best idea I can give you is this:
Create all dates as before in a while loop (only the day_start-line)
Output all dates into a temporary file
Use sort -R on this file ("shuffles" the contents and prints the result)
Loop over the output of sort -R and you'll get randomly picked dates until all have been used.
Here's an example script which incorporates my suggestions:
#!/bin/bash
day_start=2016-01-01
TMPFILE="$(mktemp)"
while [ "$day_start" != "2018-01-01" ]
do
echo "${day_start}"
day_start=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
done > "${TMPFILE}"
sort -R "${TMPFILE}" | while read -r day_start
do
day_end=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
python script.py --start="$day_start" --end="$day_end";
done
rm "${TMPFILE}"
By the way, without the spaces around != in while [ "$day_start" != "2018-01-01" ];, the test becomes a single non-empty string, which is always true, so bash will never stop your script.
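A minimal demonstration of what goes wrong without the spaces:

```shell
day=2018-01-01                 # the dates match, so the test "should" be false
[ "$day"!="2018-01-01" ]; a=$? # no spaces: one non-empty word, test is TRUE (exit 0)
[ "$day" != "2018-01-01" ]; b=$? # with spaces: a real comparison, FALSE (exit 1)
echo "$a $b"   # 0 1
```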
2016 is a leap year, so 2016-01-01 through 2017-12-31 is 366 + 365 = 731 days, which the offsets {0..730} cover exactly.
Magic number: 730 (the last offset).
The i % 100 is just to keep the output short.
for i in {0..730}; do nd=$(date -d "2016/01/01"+${i}days +%D); if (( i % 100 == 0 || i == 730 )); then echo $nd ; fi; done
01/01/16
04/10/16
07/19/16
10/27/16
02/04/17
05/15/17
08/23/17
12/01/17
12/31/17
With the format instruction (here +%D), you might transform the output to your needs, date --help helps.
In a better readable format, and with +%F:
for i in {0..730}
do
nd=$(date -d "2016/01/01"+${i}days +%F)
echo $nd
done
2016-01-01
2016-04-10
2016-07-19
...
For a random distribution, use shuf (here, for brevity, with 7 days):
for i in {0..6}; do nd=$(date -d "2016/01/01"+${i}days +%D); echo $nd ;done | shuf
01/04/16
01/07/16
01/05/16
01/01/16
01/03/16
01/06/16
01/02/16

BASH ERROR: syntax error: operand expected (error token is ")

I am new to bash scripting, and I'm having an issue with one of my scripts. I'm trying to compose a list of Drivers Under 25 after reading their birthdates in from a folder filled with XML files and calculating their ages. Once I have determined they are under 25, the filename of the driver's data is saved to a text file. The script is working up until a certain point and then it stops. The error I'm getting is:
gdate: extra operand ‘+%s’
Try 'gdate --help' for more information.
DriversUnder25.sh: line 24: ( 1471392000 - )/60/60/24 : syntax error: operand expected (error token is ")/60/60/24 ")
Here is my code:
#!/bin/bash
# define directory to search and current date
DIRECTORY="/*.xml"
CURRENT_DATE=$(date '+%Y%m%d')
# loop over files in a directory
for FILE in $DIRECTORY;
do
# grab user's birth date from XML file
BIRTH_DATE=$(sed -n '/Birthdate/{s/.*<Birthdate>//;s/<\/Birthdate.*//;p;}' $FILE)
# calculate the difference between the current date
# and the user's birth date (seconds)
DIFFERENCE=$(( ( $(gdate -ud $CURRENT_DATE +'%s') - $(gdate -ud $BIRTH_DATE +'%s') )/60/60/24 ))
# calculate the number of years between
# the current date and the user's birth date
YEARS=$(($DIFFERENCE / 365))
# if the user is under 25
if [ "$YEARS" -le 25 ]; then
# save file name only
FILENAME=`basename $FILE`
# output filename to text file
echo $FILENAME >> DriversUnder25.txt
fi
done
I'm not sure why it correctly outputs the first 10 filenames and then stops. Any ideas why this may be happening?
You need to quote the expansion of $BIRTH_DATE to prevent word splitting on the whitespace in the value. (It is good practice to quote all your parameter expansions, unless you have a good reason not to, for this very reason.)
DIFFERENCE=$(( ( $(gdate -ud "$CURRENT_DATE" +'%s') - $(gdate -ud "$BIRTH_DATE" +'%s') )/60/60/24 ))
(Based on your comment, this would probably at least allow gdate to give you a better error message.)
A best-practices implementation would look something like this:
directory=/ # patch as appropriate
current_date_unix=$(date +%s)
for file in "$directory"/*.xml; do
while IFS= read -r birth_date; do
birth_date_unix=$(gdate -ud "$birth_date" +'%s')
difference=$(( ( current_date_unix - birth_date_unix ) / 60 / 60 / 24 ))
years=$(( difference / 365 ))
if (( years < 25 )); then
echo "${file%.*}"
fi
done < <(xmlstarlet sel -t -m '//Birthdate' -v . -n <"$file")
done >DriversUnder25.txt
If this script needs to be usable by folks who don't have xmlstarlet installed, you can generate an XSLT template and then use xsltproc (which is available out-of-the-box on modern operating systems).
That is to say, if you run this once, and bundle its output with your script:
xmlstarlet sel -C -t -m '//Birthdate' -v . -n >get-birthdays.xslt
...then the script can be modified to replace xmlstarlet with:
xsltproc get-birthdays.xslt - <"$file"
Notes:
The XML input files are being read with an actual XML parser.
When expanding for file in "$directory"/*.xml, the expansion is quoted but the glob is not (thus allowing the script to operate on directories with spaces, glob characters, etc. in their names).
The output file is being opened once, for the loop, rather than once per line of output (reducing overhead unnecessarily opening and closing files).
Lower-case variable names are in use to comply with POSIX conventions (specifying that variables with meaning to the operating system and shell have all-upper-case names, and that the set of names with at least one lower-case character is reserved for application use; while the docs in question are with respect to environment variables, shell variables share a namespace, making the convention relevant).
The issue was that there were multiple drivers in some files, thus importing multiple birth dates into the same string. My solution is below:
#!/bin/bash
# define directory to search and current date
DIRECTORY="/*.xml"
CURRENT_DATE=$(date '+%Y%m%d')
# loop over files in a directory
for FILE in $DIRECTORY;
do
# set flag for output to false initially
FLAG=false
# grab user's birth date from XML file
BIRTH_DATE=$(sed -n '/Birthdate/{s/.*<Birthdate>//;s/<\/Birthdate.*//;p;}' $FILE)
# loop through birth dates in file (there can be multiple drivers)
for BIRTHDAY in $BIRTH_DATE;
do
# calculate the difference between the current date
# and the user's birth date (seconds)
DIFFERENCE=$(( ( $(gdate -ud $CURRENT_DATE +'%s') - $(gdate -ud $BIRTHDAY +'%s') )/60/60/24))
# calculate the number of years between
# the current date and the user's birth date
YEARS=$(($DIFFERENCE / 365))
# if the user is under 25
if [ "$YEARS" -le 25 ]; then
# save file name only
FILENAME=`basename $FILE`
# set flag to true (driver is under 25 years of age)
FLAG=true
fi
done
# if there is a driver under 25 in the file
if [ "$FLAG" = true ]; then
# output filename to text file
echo $FILENAME >> DriversUnder25.txt
fi
done

How to find files older than N days from a given timestamp

I want to find files older than N days from a given timestamp in format YYYYMMDDHH
I can find file older than 2 days with the below command, but this finds files with present time:
find /path/to/dir -mtime -2 -type f -ls
Let's say I give the input timestamp=2011093009; I want to find files older than 2 days from 2011093009.
Been doing my research, but can't seem to figure it out.
Combining one of the answers from here with $() as suggested here
(updated as per comment by sputnick; note that subtracting two YYYYMMDDHH numbers directly does not give a day count, so both dates are converted to epoch seconds first):
date=2011060109; find /home/kenjal/ -mtime +$(( ( $(date +%s) - $(date -d "${date:0:8}" +%s) ) / 86400 ))
Basically this is accomplished by finding files in a range of dates.
I used Perl to calculate the days from today to the given timestamp, since GNU date is not available on my system, so -d is not an option. The code below accepts a date in the format YYYYDDMM:
#!/usr/bin/perl
use Time::Local;
my($day, $month, $year) = (localtime)[3,4,5];
$month = sprintf '%02d', $month+1;
$day = sprintf '%02d', $day;
my($currentYear, $currentDM) = ($year+1900, "$day$month");
my $todaysDate = "$currentYear$currentDM";
#print $todaysDate;
sub to_epoch {
my ($t) = @_;
my ($y, $d, $m) = ($t =~ /(\d{4})(\d{2})(\d{2})/);
return timelocal(0, 0, 0, $d+0, $m-1, $y-1900);
}
sub diff_days {
my ($t1, $t2) = @_;
return (abs(to_epoch($t2) - to_epoch($t1))) / 86400;
}
print diff_days($todaysDate, $ARGV[0]);
Note: I'm no expert in Perl and this is the very first piece of Perl code I've modified/written. Having said that, I'm sure there are better ways to accomplish the above in Perl.
Then the Korn shell script below performs what I needed.
#!/bin/ksh
daysFromToday=$(dateCalc.pl 20110111)
let daysOld=$daysFromToday+31
echo $daysFromToday "\t" $daysOld
find /path/to/dir/ -mtime +$daysFromToday -mtime -$daysOld -type f -ls
I'm finding all files older than +$daysFromToday, then narrowing the search to days newer than -$daysOld
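The +N/-N window used above can be sanity-checked with synthetic timestamps (GNU touch and find assumed; the file names are arbitrary):

```shell
# Window: strictly older than 30 days AND newer than 60 days.
dir=$(mktemp -d)
touch -d "5 days ago"  "$dir/five.log"
touch -d "40 days ago" "$dir/forty.log"
touch -d "90 days ago" "$dir/ninety.log"

found=$(find "$dir" -type f -mtime +30 -mtime -60)
echo "$found"        # only .../forty.log falls inside the 30-60 day window
rm -rf "$dir"
```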
#!/usr/bin/env bash
# getFiles arrayName olderDate newerDate [ pathName ]
getFiles() {
local i
while IFS= read -rd '' "$1"'[(_=$(read -rd "" x; echo "${x:-0}")) < $2 && _ > $3 ? ++i : 0]'; do
:
done < <(find "${4:-.}" -type f -printf '%p\0%Ts\0')
}
# main date1 date2 [ pathName ]
main() {
local -a dates files
local x
for x in "${@:1:2}"; do
dates+=( "$(date -d "$x" +%s)" ) || return 1
done
_=$dates let 'dates[1] > dates && (dates=dates[1], dates[1]=_)'
getFiles files "${dates[@]}" "$3"
declare -p files
}
main "$@"
# vim: set fenc=utf-8 ff=unix ts=4 sts=4 sw=4 ft=sh nowrap et:
This Bash script takes two dates and a pathname for find. getFiles takes an array name and the files with mtimes between the two dates are assigned to that array. This example script simply prints the array.
Requires a recent Bash, and GNU date. If it really has to be "N days", or you don't have GNU date, then there is no possible solution. You'll need to use a different language. No shell can do that using standard utilities.
Technically, you can calculate an offset in days using printf '%(%s)T' ... and some arithmetic, but there is no possible way to get the base date from a timestamp without GNU date, so I'm afraid you're out of luck.
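For what it's worth, that printf trick looks like this (bash 4.2+ assumed):

```shell
# Epoch seconds without forking an external date, plus the "some arithmetic"
# for an N-day offset.
printf -v now '%(%s)T' -1       # -1 means "current time"
cutoff=$(( now - 2*86400 ))     # epoch seconds for 2 days before now
echo "$now $cutoff"
```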
Edit
I see this question has a ksh tag, in which case I lied, apparently ksh93's printf accepts a GNU date -d like string. I have no idea whether it's portable, and of course requires a system with ksh93 installed. You could do it in that case with some modification to the above script.

Filtering Filenames with bash

I have a directory full of log files in the form
${name}.log.${year}${month}${day}
such that they look like this:
logs/
production.log.20100314
production.log.20100321
production.log.20100328
production.log.20100403
production.log.20100410
...
production.log.20100314
production.log.old
I'd like to use a bash script to filter out all the logs older than X months and append them to *.log.old
X=6 # months
LIST=*.log.*
for file in $LIST; do
    is_older=$(file_is_older_than_months "${file}" "${X}")  # pseudocode
    if $is_older; then
        cat "${file}" >> production.log.old
        rm "${file}"
    fi
done
How can I get all the files older than X months? And how can I avoid having the *.log.old file included in LIST?
The following script expects GNU date to be installed. You can call it in the directory with your log files with the first parameter as the number of months.
#!/bin/sh
min_date=$(date -d "$1 months ago" "+%Y%m%d")
for log in *.log.*;do
[ "${log%.log.old}" "!=" "$log" ] && continue
[ "${log%.*}.$min_date" "<" "$log" ] && continue
cat "$log" >> "${log%.*}.old"
rm "$log"
done
Presumably as a log file, it won't have been modified since it was created?
Have you considered something like this...
find ./ -name "*.log.*" -mtime +60 -exec rm {} \;
to delete files that have not been modified for 60 days. If the files have been modified more recently then this is no good of course.
You'll have to compare the logfile date with the current date. Take the difference in years and multiply by 12 to get it in months; then add the difference of the month fields. This gives you the age of the file in months (according to the file name).
For each filename, you can use an AWK filter to extract the year:
awk -F. '{ print substr($3,1,4) }'
You also need the current year:
date "+%Y"
To calculate the difference:
$(( current_year - file_year ))
Similarly for months.
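Putting those pieces together (a sketch; the "current" date is hard-coded here for illustration - in practice it would come from date +%Y and date +%m):

```shell
file=production.log.20100314
file_year=$(echo "$file"  | awk -F. '{ print substr($3,1,4) }')   # 2010
file_month=$(echo "$file" | awk -F. '{ print substr($3,5,2) }')   # 03
current_year=2010    # in practice: $(date +%Y)
current_month=06     # in practice: $(date +%m)
# 10# forces base 10 so leading zeros aren't parsed as octal
age_months=$(( (current_year - file_year) * 12 + (10#$current_month - 10#$file_month) ))
echo "$age_months"   # 3
```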
Assuming you have the option of removing the logs and the filename timestamp is the more accurate one, here's a gawk script:
#!/bin/bash
awk 'BEGIN{
months=6
current=systime() #get current time in sec
sec=months*30*86400 #months in sec
output="old.production" #output file
}
{
m=split(FILENAME,fn,".")
yr=substr(fn[m],1,4)
mth=substr(fn[m],5,2)
day=substr(fn[m],7,2)
t=mktime(yr" "mth" "day" 00 00 00")
if ( (current-t) > sec){
print "file: "FILENAME" more than "months" month"
while( (getline line < FILENAME )>0 ){
print line > output
}
close(FILENAME)
cmd="rm \047"FILENAME"\047"
print cmd
#system(cmd) #uncomment to use
}
}' production*
