I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year and 2 years of data from every file and generate new files from them.
I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently which makes my life cumbersome. Is there a better way to achieve the outcome I'm after or possibly enhance the performance of this script please?
for k in *.csv; do
sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.2years.csv
sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.1year.csv
sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.6months.csv
sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.3months.csv
sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.1month.csv
done
You read each CSV five times. It would be better to read each CSV only once.
You extract the same data multiple times. All but one of the parts are subsets of the others.
The last 1 month is a subset of the last 3 months, 6 months, 1 year and 2 years.
The last 3 months is a subset of the last 6 months, 1 year and 2 years.
The last 6 months is a subset of the last 1 year and 2 years.
The last 1 year is a subset of the last 2 years.
This means every line in "1year.csv" is also in "2years.csv". So it is sufficient to extract "1year.csv" from "2years.csv". You can cascade the different searches with tee.
The following assumes that the contents of your files are ordered chronologically. (I simplified the quoting a bit.)
sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" "${k}" |
tee "temp_data_store/${k}.2years.csv" |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.1year.csv" |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.6months.csv" |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.3months.csv" |
sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" > "temp_data_store/${k}.1month.csv"
Current performance-related issues:
reading each input file 5 times => we want to limit this to a single read per input file
calling date 5 times (necessary) for each input file (unnecessary) => make the 5 date calls prior to the for k in *.csv loop [NOTE: the overhead of the repeated date calls will pale in comparison to the repeated reads of the input files]
Potential operational issue:
sed is not designed for doing comparisons of data (eg, look for a string that is >= a search pattern); consider an input file like such:
$ cat input.csv
2021-01-25
2021-03-01
If 'today' is 2021-03-14 then for the 1month dataset the current sed solution is:
sed -n '/2021-02/,$p'
But because there are no entries for 2021-02, the sed command returns 0 rows, even though we should see the row for 2021-03-01.
Granted, for this particular question we're looking for dates based on the month, and the application likely generated at least one row per month, so this probably won't bite here, but we need to be aware of it in general.
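The difference between a range match and a real comparison is easy to reproduce. This is a minimal sketch, assuming the date is the first comma-separated field and starts with YYYY-MM; the function name rows_since is made up for the illustration:

```shell
# rows_since CUTOFF FILE
# Hypothetical helper: print every row whose YYYY-MM (first 7 characters of
# field 1) sorts at or after CUTOFF. String comparison is enough because the
# YYYY-MM format sorts chronologically.
rows_since() {
    awk -F',' -v cutoff="$1" 'substr($1, 1, 7) >= cutoff' "$2"
}

demo=$(mktemp)
printf '2021-01-25\n2021-03-01\n' > "$demo"

sed -n '/2021-02/,$p' "$demo"   # range never starts: no line contains 2021-02
rows_since 2021-02 "$demo"      # comparison still finds the 2021-03-01 row
```

The sed range only starts once a line actually matches the pattern, while the awk test compares each row with the cutoff independently.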
Anyhoo, back to the question at hand ...
Assumptions:
input files are comma-delimited (otherwise need to adjust the proposed solution)
the date to be tested is of the format YYYY-MM-...
the data to be tested is the 1st field of the comma-delimited input file (otherwise need to adjust the proposed solution)
output filename prefix is the input filename sans the .csv
Sample input:
$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
We only need to do the date calculations once so we'll do this in bash and prior to OP's for k in *.csv loop:
# date as of writing this answer: 2021-10-12
$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')
$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"
One awk idea (replaces all of the sed calls in OP's current for k in *.csv loop):
# determine prefix to be used for output files ...
$ k=input.csv
$ prefix="${k%.csv}"
$ echo "${prefix}"
input
awk -v yr2="${yr2}" \
-v yr1="${yr1}" \
-v mon6="${mon6}" \
-v mon3="${mon3}" \
-v mon1="${mon1}" \
-v prefix="${prefix}" \
-F ',' ' # define input field delimiter as comma
{ split($1,arr,"-") # date to be compared is in field #1
testdate=arr[1] "-" arr[2]
if ( testdate >= yr2 ) print $0 > (prefix ".2years.csv")
if ( testdate >= yr1 ) print $0 > (prefix ".1year.csv")
if ( testdate >= mon6 ) print $0 > (prefix ".6months.csv")
if ( testdate >= mon3 ) print $0 > (prefix ".3months.csv")
if ( testdate >= mon1 ) print $0 > (prefix ".1month.csv")
}
' "${k}"
NOTE: awk can dynamically process the input filename to determine the filename prefix (see the FILENAME variable) but would still need to know the target directory name (assuming the output goes to a different directory from the one holding the input file)
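A rough sketch of that FILENAME idea, with one awk process handling several input files. The function name split_all and the outdir variable are assumptions made for this example; it also assumes yr2, yr1, mon6, mon3 and mon1 are already set in the calling shell and that the output directory exists:

```shell
# split_all OUTDIR FILE...
# Hypothetical wrapper: derives each output prefix from FILENAME and writes
# the five subsets into OUTDIR.
split_all() {
    outdir=$1
    shift
    awk -F',' -v outdir="$outdir" \
        -v yr2="$yr2" -v yr1="$yr1" -v mon6="$mon6" -v mon3="$mon3" -v mon1="$mon1" '
    FNR == 1 {
        # first line of a new input file: close the previous outputs so the
        # number of open files stays small
        if (prefix != "") {
            close(prefix ".2years.csv");  close(prefix ".1year.csv")
            close(prefix ".6months.csv"); close(prefix ".3months.csv")
            close(prefix ".1month.csv")
        }
        base = FILENAME
        sub(/.*\//, "", base)       # strip any leading directory
        sub(/\.csv$/, "", base)     # strip the .csv suffix
        prefix = outdir "/" base
    }
    {
        testdate = substr($1, 1, 7) # YYYY-MM
        if (testdate >= yr2)  print > (prefix ".2years.csv")
        if (testdate >= yr1)  print > (prefix ".1year.csv")
        if (testdate >= mon6) print > (prefix ".6months.csv")
        if (testdate >= mon3) print > (prefix ".3months.csv")
        if (testdate >= mon1) print > (prefix ".1month.csv")
    }' "$@"
}
```

With hundreds of thousands of files, a single `split_all temp_data_store *.csv` would likely exceed the command-line length limit, so the file list would have to be passed in batches.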
This generates the following files:
for f in "${prefix}".*.csv
do
echo "############# ${f}"
cat "${f}"
echo ""
done
############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
Additional performance improvements:
[especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5x different output files - would require a minor tweak to the awk solution
processing X number of input files in parallel, eg, awk code could be placed in a bash function and then called via <function_name> <input_file> &; this can be done via bash loop controls, parallel, xargs, etc
if running parallel operations, will need to limit the number of parallel operations based primarily on disk subsystem throughput, ie, how many concurrent reads/writes can disk subsystem handle before slowing down due to read/write contention
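One way the parallel idea could look with find and xargs. This is only a sketch: run_parallel is a made-up name, the -print0, -0, -r and -P options assume GNU findutils (or a compatible implementation), and a single cutoff stands in for the full five-way awk split to keep the example short:

```shell
# run_parallel JOBS INDIR OUTDIR CUTOFF
# Hypothetical driver: runs up to JOBS awk processes at once, one per CSV
# file found directly in INDIR, and writes <name>.1month.csv into OUTDIR.
run_parallel() {
    find "$2" -maxdepth 1 -name '*.csv' -print0 |
        xargs -0 -r -n 1 -P "$1" sh -c '
            # $1 = output directory, $2 = cutoff (YYYY-MM), $3 = input file
            base=$(basename "$3" .csv)
            awk -F"," -v cutoff="$2" "substr(\$1, 1, 7) >= cutoff" "$3" > "$1/$base.1month.csv"
        ' _ "$3" "$4"
}
```

Something like `run_parallel 4 . temp_data_store "$mon1"` would then process four files at a time; the right job count depends on how much concurrent I/O the disks can take.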
Something like this may work if you have GNU Parallel (version >= 20191022):
do_one() {
cat "$1" | parallel --pipe --tee sed -n '/$(date -d {}" ago" '+%Y-%m')/,\$p' "> temp_data_store/$1.{}.csv" \
::: "2 year" "1 year" "6 month" "3 month" "1 month"
}
export -f do_one
parallel -j1 do_one ::: *.csv
If your disks are fast: Remove -j1.
At the moment, I have a while-loop that takes a starting date, runs a python script with the day as the input, then takes the day + 1 until a certain due date is reached.
day_start=2016-01-01
while [ "$day_start"!=2018-01-01 ] ;
do
day_end=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
python script.py --start="$day_start" --end="$day_end";
day_start=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
done
I would like to do the same thing, but now to pick a random day between 2016-01-01 and 2018-01-01 and repeat until all days have been used once. I think it should be a for-loop instead of this while loop, but I have trouble to specify the for-loop over this date-range in bash. Does anyone have an idea how to formulate this?
It can take quite a long time if you randomly choose the dates, because of the coupon collector's problem. (You'll hit most of the dates over and over again, but the last unused date can take quite some time to come up.)
The best idea I can give you is this:
Create all dates as before in a while loop (only the day_start-line)
Output all dates into a temporary file
Use sort -R on this file ("shuffles" the contents and prints the result)
Loop over the output from sort -R and you'll have dates randomly picked until all were reached.
Here's an example script which incorporates my suggestions:
#!/bin/bash
day_start=2016-01-01
TMPFILE="$(mktemp)"
while [ "$day_start" != "2018-01-01" ] ;
do
echo "${day_start}"
day_start=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
done > "${TMPFILE}"
sort -R "${TMPFILE}" | while read -r day_start
do
day_end=$(date +"%Y-%m-%d" -d "$day_start + 1 day")
python script.py --start="$day_start" --end="$day_end";
done
rm "${TMPFILE}"
By the way, without the spaces around != in while [ "$day_start" != "2018-01-01" ];, the test is always true and bash will never stop your script.
The magic number below only works out because 2016 was a leap year: 2016-01-01 through 2017-12-31 spans 366 + 365 = 731 days, i.e. offsets 0 to 730.
Magic number: 2*365 = 730
The i % 100, just to have less output.
for i in {0..730}; do nd=$(date -d "2016/01/01"+${i}days +%D); if (( i % 100 == 0 || i == 730 )); then echo $nd ; fi; done
01/01/16
04/10/16
07/19/16
10/27/16
02/04/17
05/15/17
08/23/17
12/01/17
12/31/17
With the format instruction (here +%D), you might transform the output to your needs, date --help helps.
In a more readable format, and with +%F:
for i in {0..730}
do
nd=$(date -d "2016/01/01"+${i}days +%F)
if (( i % 100 == 0 || i == 730 )); then echo "$nd"; fi
done
2016-01-01
2016-04-10
2016-07-19
...
For a random distribution, use shuf (here, for brevity, with 7 days):
for i in {0..6}; do nd=$(date -d "2016/01/01"+${i}days +%D); echo $nd ;done | shuf
01/04/16
01/07/16
01/05/16
01/01/16
01/03/16
01/06/16
01/02/16
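Generating the dates and shuffling them can be wrapped into one small function. This is a sketch that assumes GNU date (for -f, which reads one date expression per line) and shuf; the name shuffled_days is invented here. A single date process handles all the lines, so it avoids one fork per day:

```shell
# shuffled_days START COUNT
# Hypothetical helper: print COUNT consecutive days beginning at START
# (YYYY-MM-DD), one per line, in random order.
shuffled_days() {
    seq 0 "$(( $2 - 1 ))" |
        sed "s/.*/$1 + & days/" |     # e.g. "2016-01-01 + 5 days"
        TZ=UTC date -f - +%F |        # UTC avoids DST surprises at midnight
        shuf
}
```

The loop from the question could then read from it, for example `shuffled_days 2016-01-01 731 | while read -r day_start; do ...; done`.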
I'm trying to convert strings, describing a time interval, to the corresponding number of seconds.
After some experimenting I figured out that I can use date like this:
soon=$(date -d '5 minutes 10 seconds' +%s); now=$(date +%s)
echo $(( $soon-$now ))
but I think there should be an easier way to convert strings like "5 minutes 10 seconds" to the corresponding number of seconds, in this example 310. Is there a way to do this in one command?
Note: although portability would be useful, it isn't my top priority.
You could start at epoch
date -d"1970-01-01 00:00:00 UTC 5 minutes 10 seconds" "+%s"
310
You could also easily sub in times
Time="1 day"
date -d"1970-01-01 00:00:00 UTC $Time" "+%s"
86400
There is a way to do it in pure bash, without the date command (for portability).
This assumes the input "5 minutes 10 seconds" is already available in HH:MM:SS form in a bash variable, with : as the delimiter, as below.
$ convertString="00:05:10"
$ IFS=: read -r hour minute second <<< "$convertString"
$ secondsValue=$(((hour * 60 + minute) * 60 + second))
$ printf "%s\n" "$secondsValue"
310
You can run the above commands directly on the command-line without the $ mark.
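One caveat with the arithmetic above: bash treats numbers with a leading zero as octal, so fields such as 08 or 09 make the expression fail. A possible variant (bash-specific, with a made-up function name, and assuming all three fields are present) forces base 10:

```shell
# hms_to_seconds HH:MM:SS
# Hypothetical helper built on the same read + arithmetic idea.
hms_to_seconds() {
    local hour minute second
    IFS=: read -r hour minute second <<< "$1"
    # 10# forces base 10, so 08 and 09 are not rejected as invalid octal
    echo $(( (10#$hour * 60 + 10#$minute) * 60 + 10#$second ))
}

hms_to_seconds 00:05:10
hms_to_seconds 00:08:09
```

The two calls print 310 and 489.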
This will do (add the epoch 19700101):
$ date -ud '19700101 5 minutes 10 seconds' +%s
310
It is important to add a -u to avoid local time (and DST) effects.
$ TZ=America/Los_Angeles date -d '19700101 5 minutes 10 seconds' +%s
29110
Note that date could do some math:
$ date -ud '19700101 +5 minutes 10 seconds -47 seconds -1 min' +%s
203
The previous suggestions didn't work properly on alpine linux, so here's a small helper function that is POSIX compliant, is easy to use and also supports calculations (just as a side effect of the implementation).
The function always returns an integer based on the provided parameters.
$ durationToSeconds '<value>' '<fallback>'
$ durationToSeconds "1h 30m"
5400
$ durationToSeconds "$someemptyvar" 1h
3600
$ durationToSeconds "$someemptyvar" "1h 30m"
5400
# Calculations also work
$ durationToSeconds "1h * 3"
10800
$ durationToSeconds "1h - 1h"
0
# And also supports long forms for year, day, hour, minute, second
$ durationToSeconds "3 days 1 hour"
262800
# It's also case insensitive
$ durationToSeconds "3 Days"
259200
function durationToSeconds () {
set -f
normalize () { echo $1 | tr '[:upper:]' '[:lower:]' | tr -d "\"\\\'" | sed 's/years\{0,1\}/y/g; s/months\{0,1\}/m/g; s/days\{0,1\}/d/g; s/hours\{0,1\}/h/g; s/minutes\{0,1\}/m/g; s/min/m/g; s/seconds\{0,1\}/s/g; s/sec/s/g; s/ //g;'; }
local value=$(normalize "$1")
local fallback=$(normalize "$2")
echo $value | grep -v '^[-+*/0-9ydhms]\{0,30\}$' > /dev/null 2>&1
if [ $? -eq 0 ]
then
>&2 echo Invalid duration pattern \"$value\"
else
if [ "$value" = "" ]; then
[ "$fallback" != "" ] && durationToSeconds "$fallback"
else
sedtmpl () { echo "s/\([0-9]\+\)$1/(0\1 * $2)/g;"; }
local template="$(sedtmpl '\( \|$\)' 1) $(sedtmpl y '365 * 86400') $(sedtmpl d 86400) $(sedtmpl h 3600) $(sedtmpl m 60) $(sedtmpl s 1) s/) *(/) + (/g;"
echo $value | sed "$template" | bc
fi
fi
set +f
}
Edit: Yes. After the OP's comment I developed the following and checked it on Mac OS X, CentOS and Ubuntu. It is a one-liner, built from POSIX utilities, that converts the "X minutes Y seconds" format to seconds. That was the question.
echo $(($(echo "5 minutes 10 seconds" | cut -c1-2)*60 + $(echo "5 minutes 10 seconds" | cut -c1-12 | awk '{print substr($0,11)}')))
The OP told me in a comment that the input is in "X minutes Y seconds" format, not HH:MM:SS. The date command with "+%s" throws an error on (my) Mac. The OP wants to grab the numerical values from the "X minutes Y seconds" format and convert them to seconds. First I extract the minutes as digits (call it equation A):
echo "5 minutes 10 seconds" | cut -c1-2
then I extract the seconds part (call it equation B):
echo "5 minutes 10 seconds" | cut -c1-12 | awk '{print substr($0,11)}'
Now multiply the minutes by 60 and add the seconds:
echo $(( (equation A)*60 + (equation B) ))
The OP should ask others to check this developmental (but working) version of the command before using it for automated, repeated runs, as we do with cron on a production server.
If we want to run this on a log file with values in "X minutes Y seconds" format, we have to change echo "5 minutes 10 seconds" to something like cat file | .... I kept a gist of it too, so that I or others can use it with cat on server log files in that format.
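Because cut works on fixed character positions, the one-liner depends on the digits falling in exactly those columns. A field-based variant avoids that; this is a sketch with an invented function name, assuming every input line has exactly the word order "X minutes Y seconds":

```shell
# minsec_to_seconds
# Hypothetical helper: reads lines like "5 minutes 10 seconds" on stdin and
# prints the total number of seconds for each line.
minsec_to_seconds() {
    awk '{ print $1 * 60 + $3 }'
}

echo "5 minutes 10 seconds" | minsec_to_seconds
```

It reads standard input, so a log file can be fed to it directly with `minsec_to_seconds < file`.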
Although off-topic (as I understood it, the question has little to do with the current time), note that the following does not work on every POSIX-compliant OS:
date -d "1970-01-01 00:00:00 UTC 5 minutes 10 seconds" "+%s"
It throws an error on Mac OS X but works on most GNU/Linux distros. The +%s part can also fail on POSIX-compliant systems in more complicated usage. These commands are more portable ways to get the current time in seconds on Unix-like systems:
awk 'BEGIN{srand(); print srand()}'
perl -le 'print time'
If needed, the OP can extend this by generating the current time in seconds and subtracting. I hope it helps.
---- OLD Answer before EDIT ----
You can get the current time without that date -- echo | awk '{print systime();}' or wget -qO- http://www.timeapi.org/utc/now?\\s. Other way to convert time to second is echo "00:20:40.25" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'.
The example with printf shown in another answer is near perfect.
What you want is something the basic utilities of GNU/Linux need all the time - gnu.org/../../../../../Setting-an-Alarm.html
The right approach really depends on how foolproof it needs to be.
I have a file named dates.txt which contains the following:
DATE_1
DATE_2
DATE_3
DATE_4
DATE_5
DATE_6
DATE_7
I just want to replace DATE_i with some dates that are stored in v array using sed command.
To do that I tried a for loop and a sed command (file test.sh):
#!/bin/bash
v[1]=`date -d "7 days ago" '+%d\/%m\/%y'`
v[2]=`date -d "6 days ago" '+%d\/%m\/%y'`
v[3]=`date -d "5 days ago" '+%d\/%m\/%y'`
v[4]=`date -d "4 days ago" '+%d\/%m\/%y'`
v[5]=`date -d "3 days ago" '+%d\/%m\/%y'`
v[6]=`date -d "2 days ago" '+%d\/%m\/%y'`
v[7]=`date -d "1 days ago" '+%d\/%m\/%y'`
cat dates.txt|for j in {1..7};do sed "s/DATE_$j/${v[$j]}/";done
The problem is that this command replaces only the first date. If you run test.sh:
$ ./test.sh
14/03/16
DATE_2
DATE_3
DATE_4
DATE_5
DATE_6
DATE_7
The output I am expecting is:
14/03/16
15/03/16
16/03/16
17/03/16
18/03/16
19/03/16
20/03/16
I cannot understand why this is not working.
Could anyone please explain why this is happening and propose a proper solution for this problem?
Thanks!!
Explanation: What's happening is that the first iteration of the for loop is consuming all of the lines you're piping to its standard input. First of all, let's modify test.sh to contain an echo statement in the last line so that we can see what's happening:
cat dates.txt|for j in {1..7};do echo $j; sed "s/DATE_$j/${v[$j]}/";done
You'll see the output from test.sh is the following:
1
13/03/16
DATE_2
DATE_3
DATE_4
DATE_5
DATE_6
DATE_7
2
3
4
5
6
7
Next, modify dates.txt to read:
DATE_1
DATE_2
DATE_1
DATE_4
DATE_1
DATE_6
DATE_1
where we've turned every other line into DATE_1 for demonstration purposes. Now the output reads:
1
13/03/16
DATE_2
13/03/16
DATE_4
13/03/16
DATE_6
13/03/16
2
3
4
5
6
7
So you see that the first iteration of the for loop (when $j == 1) is processing every line that cat is passing to the for loop. After that, the subsequent iterations of the for loop ($j == 2..7) still run, but they don't receive any input stream (so, in the above example, they just echo the current value of $j and don't pass any input to sed). That's why you were observing that it was changing only the first line.
Solution: Modify the last line to read:
for j in {1..7}; do head -$j dates.txt | tail -1 | sed "s/DATE_$j/${v[$j]}/"; done
cat dates.txt | sed 's/^DATE_//g'
This will strip DATE_ at the beginning of a line (^), leaving the number (or anything else on the line). No need for a loop at all!
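Another option is to build all the substitutions into one sed script and read the file once. This is a sketch with a made-up function name; it assumes each placeholder is alone on its line and that the replacement values contain no "|", "&" or backslash:

```shell
# replace_dates FILE VALUE...
# Hypothetical helper: replaces DATE_1 with the first VALUE, DATE_2 with the
# second, and so on, in a single sed pass over FILE.
replace_dates() {
    file=$1
    shift
    script=
    j=1
    for value in "$@"; do
        # "|" as the delimiter means slashes in dd/mm/yy need no escaping
        script="${script}s|^DATE_${j}\$|${value}|;"
        j=$(( j + 1 ))
    done
    sed "$script" "$file"
}
```

With the array from the question this would be called as `replace_dates dates.txt "${v[@]}"`, provided the dates are generated with a plain '+%d/%m/%y' format (no escaped slashes).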
I'm trying to pull the last 5 minutes of logs (with grep matches),
so I do something like tac syslog.log | sed / date -d "5 minutes ago".
Every line in the log has this format:
Jun 14 14:03:58
Jul 3 08:04:35
So I really want to get the chunk of data from
Jul 4 08:12
Jul 4 08:17
I tried this method and it KINDA works (though it still goes through every day in the log that has lines in the 08:12 through 08:17 window):
e=""
for (( i = 5; i >= 0; i-- ))
do
e='-e /'`date +\%R -d "-$i min"`':/p '$e;
done
tac /var/log/syslog.log | sed -n $e
e=""
for (( i = 5; i >= 0; i-- ))
do
if [[ -z $e ]]
then e=`date +\%R -d "-$i min"`
else e=$e'\|'`date +\%R -d "-$i min"`
fi
done
re=' \('$e'\):'
tac /var/log/syslog.log | sed -n -e "/$re/p" -e "/$re/!q"
This creates a single regular expression listing all the times from the last 5 minutes, connected with \|. It prints the lines that match it. Then it uses the ! modifier to quit on the first line that doesn't match the RE.
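The list of minute stamps can also be generated on its own, including the month and day so that the same clock time on earlier days does not match. This is a sketch assuming GNU date (for -d "@SECONDS"); the function name minute_stamps is invented, and the optional second argument exists only to make the output reproducible:

```shell
# minute_stamps MINUTES [EPOCH]
# Hypothetical helper: print one "Mon dd HH:MM" stamp per minute, from the
# reference time (EPOCH seconds, default: now) back to MINUTES minutes ago.
minute_stamps() {
    n=$1
    now=${2:-$(date +%s)}
    i=0
    while [ "$i" -le "$n" ]; do
        # %e pads single-digit days with a space, as traditional syslog does
        date -d "@$(( now - i * 60 ))" '+%b %e %H:%M'
        i=$(( i + 1 ))
    done
}
```

The stamps could then be used as fixed-string patterns, for example `minute_stamps 5 > stamps.txt; tac /var/log/syslog.log | grep -F -f stamps.txt`. Unlike the sed version this reads the whole file instead of quitting early, and it matches the stamps anywhere in a line.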
If you know the format of the dates then why not do:
tac syslog.log | awk '/Jul 4 08:17/,/Jul 4 08:12/ { print } /Jul 4 08:11/ {exit}'
/ .. /,/ .. / is a regex range. It prints everything in this range. As soon as /Jul 4 08:11/ shows up on a line, your 5-minute window has been captured, so awk exits and stops reading the file.
So the method above didn't really work, but I think I got it to work.
As you can see, I added a RANGE for the {exit}:
awk '/'"$dtnow"'/,/'"$dt6min"'/ { print } /'"$dt7min"'/,/'"$dt11min"'/ {exit}'
It seems to work; I'm testing it again.
OK, it finally looks like it really works this time (it exits after the hour). I used sed instead of awk and got it working after running through some tests.
tac /var/log/syslog.log | sed -e "$( date -d '-1 hour -6 minutes' '+/^%b %e %H:/q;'
date -d '-1 day -6 minutes' '+/^%b %e /q;'
date -d '-1 month -6 minutes' '+/^%b /q;'
for ((o=0;o<=5;o++)) do date -d "-$o minutes" '+/^%b %e %R:/p;'; done ; echo d)"
It works if log entries begin with a timestamp like "May 14 11:41". The variable LASTMINUTES sets the last n minutes of the log to show:
cat log | awk 'BEGIN{ LASTMINUTES=30; for (L=0;L<=LASTMINUTES;L++) TAB[strftime("%b %d %H:%M",systime()-L*60)] } { if (substr($0,1,12) in TAB) print $0 }'
To run the above script you need gawk which can be installed by:
apt-get install gawk
or
yum install gawk
In bash I can't think of a good way to do this, but I only want to see the past 30 days of entries in /var/log/messages*. The issue for me is how to do that with just the month and day. For example:
Sep 2 14:26:13 <SOME ENTRY>
Sep 4 14:26:13 <SOME ENTRY>
Sep 9 14:26:13 <SOME ENTRY>
Sep 14 14:26:13 <SOME ENTRY>
etc..
Any ideas ? HELP! ha ha
I think this is close. This will give you a sorted list of entries (most recent first) through the start of August. Depending on when you run it, it will give you as much as ~60 days instead of 30. On average, I suppose it would give you about 45. The other downside is that you need to adjust the grep statement at the end of the pipe as the date advances.
sort -k1Mr -k2nr <file> | grep -E "Aug|Sep"
a little late but...
grep -E "^$(date -d '2 days ago' '+%b %e')" /var/log/messages
-- This works --- but ugly --
-- Print only the searches that meet the date in each loop iteration (i.e last X num days)
for (( i=0; i<=${MAXSEARCHDAYS}; i++)) ;do
egrep $(date --date "now -${i} days" +%b) ${USBFOUND} | grep $(date --date "now -${i} days" +%e) >> ${TEMPFILE}
done
sort -k1,1M -k2,2n ${TEMPFILE} | uniq >> ${LOGFILE}