I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year & 2 years of data from every file & generate new files from them.
I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently, which makes the slowness a real problem. Is there a better way to achieve the outcome I'm after, or a way to enhance the performance of this script, please?
for k in *.csv; do
    sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.2years.csv"
    sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1year.csv"
    sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.6months.csv"
    sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.3months.csv"
    sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1month.csv"
done
You read each CSV five times. It would be better to read each CSV only once.
You extract the same data multiple times; all but one of the extracts are subsets of another:
The 1-month extract is a subset of the 3-month, 6-month, 1-year and 2-year extracts.
The 3-month extract is a subset of the 6-month, 1-year and 2-year extracts.
The 6-month extract is a subset of the 1-year and 2-year extracts.
The 1-year extract is a subset of the 2-year extract.
This means every line in "1month.csv" is also in "3months.csv", and so on. So it is sufficient to extract "1year.csv" from "2years.csv", "6months.csv" from "1year.csv", and so forth. You can cascade the different searches with tee.
The following assumes that the contents of your files are ordered chronologically (oldest first); it replaces the sed commands inside your for k in *.csv loop. (I simplified the quoting a bit.)
sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" "${k}" |
tee temp_data_store/${k}.1month.csv |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.3months.csv |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.6months.csv |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee temp_data_store/${k}.1year.csv |
sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" > temp_data_store/${k}.2years.csv
Current performance-related issues:
reading each input file 5 times => we want to limit this to a single read per input file
calling date 5 times (necessary) for each input file (unnecessary) => make the 5 date calls once, prior to the for k in *.csv loop [NOTE: the overhead of the repeated date calls will pale in comparison to the repeated reads of the input files]
Potential operational issue:
sed is not designed for doing comparisons of data (eg, look for a string that is >= a search pattern); consider an input file like such:
$ cat input.csv
2021-01-25
2021-03-01
If 'today' is 2021-03-14 then for the 1month dataset the current sed solution is:
sed -n '/2021-02/,$p'
But because there are no entries for 2021-02 the sed command returns 0 rows, even though we should see the row for 2021-03-01.
Granted, for this particular question we're looking for dates based on the month, and the application likely generates at least one row every month, so this likely won't be a problem here, but we need to be aware of this limitation in general.
Anyhoo, back to the question at hand ...
Assumptions:
input files are comma-delimited (otherwise need to adjust the proposed solution)
the date to be tested is of the format YYYY-MM-...
the date to be tested is the 1st field of the comma-delimited input file (otherwise need to adjust the proposed solution)
output filename prefix is the input filename sans the .csv
Sample input:
$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
We only need to do the date calculations once, so we'll do this in bash and prior to OP's for k in *.csv loop:
# date as of writing this answer: 2021-10-12
$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')
$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"
One awk idea (replaces all of the sed calls in OP's current for k in *.csv loop):
# determine prefix to be used for output files ...
$ k=input.csv
$ prefix="${k%.csv}"
$ echo "${prefix}"
input
awk -v yr2="${yr2}" \
-v yr1="${yr1}" \
-v mon6="${mon6}" \
-v mon3="${mon3}" \
-v mon1="${mon1}" \
-v prefix="${prefix}" \
-F ',' ' # define input field delimiter as comma
{ split($1,arr,"-") # date to be compared is in field #1
testdate=arr[1] "-" arr[2]
  if ( testdate >= yr2 )  print $0 > (prefix ".2years.csv")
  if ( testdate >= yr1 )  print $0 > (prefix ".1year.csv")
  if ( testdate >= mon6 ) print $0 > (prefix ".6months.csv")
  if ( testdate >= mon3 ) print $0 > (prefix ".3months.csv")
  if ( testdate >= mon1 ) print $0 > (prefix ".1month.csv")
}
' "${k}"
NOTE: awk can dynamically process the input filename to determine the filename prefix (see the FILENAME variable) but it would still need to know the target directory name (assuming we're writing to a different directory from where the input file resides)
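A minimal sketch of that idea (untested; assumes the date variables from above are set and that temp_data_store is the target directory): one awk invocation processes every CSV, rebuilding the prefix from FILENAME at the start of each input file and closing each file's outputs so we don't run out of file descriptors:

awk -v yr2="${yr2}" -v yr1="${yr1}" -v mon6="${mon6}" \
    -v mon3="${mon3}" -v mon1="${mon1}" -v outdir="temp_data_store" -F ',' '
function close_all(p) {                    # close the 5 outputs of one input file
    close(p ".2years.csv");  close(p ".1year.csv"); close(p ".6months.csv")
    close(p ".3months.csv"); close(p ".1month.csv")
}
FNR == 1 {                                 # starting a new input file
    if (prefix != "") close_all(prefix)
    prefix = FILENAME
    sub(/\.csv$/, "", prefix)              # strip the .csv suffix
    prefix = outdir "/" prefix
}
{   split($1, arr, "-")
    testdate = arr[1] "-" arr[2]
    if ( testdate >= yr2 )  print $0 > (prefix ".2years.csv")
    if ( testdate >= yr1 )  print $0 > (prefix ".1year.csv")
    if ( testdate >= mon6 ) print $0 > (prefix ".6months.csv")
    if ( testdate >= mon3 ) print $0 > (prefix ".3months.csv")
    if ( testdate >= mon1 ) print $0 > (prefix ".1month.csv")
}' *.csv

With hundreds of thousands of files the *.csv expansion may exceed ARG_MAX, in which case the file list would need to be fed in batches (eg, via xargs or find -exec).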
Back to the single-file command: running it against input.csv generates the following files:
for f in "${prefix}".*.csv
do
echo "############# ${f}"
cat "${f}"
echo ""
done
############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
Additional performance improvements:
[especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5x different output files - would require a minor tweak to the awk solution
processing X number of input files in parallel, eg, the awk code could be placed in a bash function and then called via <function_name> <input_file> &; this can be done via bash loop controls, parallel, xargs, etc (see the sketch after this list)
if running parallel operations, we will need to limit the number of parallel operations based primarily on disk subsystem throughput, ie, how many concurrent reads/writes the disk subsystem can handle before slowing down due to read/write contention
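A minimal sketch of that parallel idea (hedged: the function name do_file and the job count of 4 are illustrative; assumes bash's export -f and an xargs that supports -P):

export yr2 yr1 mon6 mon3 mon1        # computed once, before any parallel work

do_file() {
    local prefix="temp_data_store/${1%.csv}"
    awk -v yr2="$yr2" -v yr1="$yr1" -v mon6="$mon6" -v mon3="$mon3" \
        -v mon1="$mon1" -v prefix="$prefix" -F ',' '
    {   split($1, arr, "-")
        testdate = arr[1] "-" arr[2]
        if ( testdate >= yr2 )  print $0 > (prefix ".2years.csv")
        if ( testdate >= yr1 )  print $0 > (prefix ".1year.csv")
        if ( testdate >= mon6 ) print $0 > (prefix ".6months.csv")
        if ( testdate >= mon3 ) print $0 > (prefix ".3months.csv")
        if ( testdate >= mon1 ) print $0 > (prefix ".1month.csv")
    }' "$1"
}
export -f do_file

# 4 files at a time; tune -P to what the disk subsystem can sustain
printf '%s\0' *.csv | xargs -0 -n1 -P4 bash -c 'do_file "$1"' _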
Something like this may work if you have GNU Parallel (version >= 20191022):
do_one() {
    cat "$1" | parallel --pipe --tee \
        "sed -n \"/\$(date -d '{} ago' '+%Y-%m')/,\\\$p\" > 'temp_data_store/$1.{}.csv'" \
        ::: "2 year" "1 year" "6 month" "3 month" "1 month"
}
export -f do_one
parallel -j1 do_one ::: *.csv
If your disks are fast: Remove -j1.
I have a list of month, date, and timestamps in a file like this:
Jan19 03:05
Jan19 15:05
Jan20 03:05
Jan20 15:05
Jan21 03:05
Jan21 15:06
Jan22 03:05
Jan22 15:06
Dec25 15:05
Dec26 14:06
Dec27 15:06
Dec28 15:06
Dec29 14:05
Dec30 14:06
Dec31 15:06
I need to just get the most recent 30 entries. My code is:
cat file | sort -k1.1,1.3M -k1.4n -k2V
This sort is sorting the Dec entries as more recent than the Jan entries. I think it's because 12 is bigger than 1. Is there a way to get the Jan entries to come at the end of this file?
Assuming all dates in the list are within the last year from now, how about:
now=$(date "+%m%d %H:%M") # current date and time
declare -A m2n=([Jan]="01" [Feb]="02" [Mar]="03" [Apr]="04"
[May]="05" [Jun]="06" [Jul]="07" [Aug]="08"
[Sep]="09" [Oct]="10" [Nov]="11" [Dec]="12"
)
while IFS= read -r line; do
if [[ $line =~ (^[A-Z][a-z]{2})(.+) ]]; then
datetime="${m2n[${BASH_REMATCH[1]}]}${BASH_REMATCH[2]}"
if [[ $datetime > $now ]]; then
datetime="0$datetime" # previous year
else
datetime="1$datetime" # this year
fi
printf "%s\t%s\n" "$datetime" "$line"
else
echo "$line" # does not match the expected format
fi
done < file | sort | cut -f 2-
It compares each date string with the current date/time in dictionary order.
If the latter is larger, the date is assumed to be in this year and "1" is
put in the "year" field; otherwise "0" is put there (previous year).
Then the generated date string is prepended to the original line,
the lines are sorted, and finally the added portion is removed with cut.
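Since the output is sorted oldest-first, the original goal of the most recent 30 entries can then be had by appending a tail to the pipeline:

done < file | sort | cut -f 2- | tail -n 30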
Just use a regex replacement to prepare the data for sorting, with a value for Jan that will come after Dec:
# JanXY -> 13 XY JanXY ; DecXY -> 12 XY DecXY
sed 's/^Jan\([0-9][0-9]\)/13 \1 &/; s/^Dec\([0-9][0-9]\)/12 \1 &/' file |
sort -k1,1n -k2,2n | cut -d' ' -f3-
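For illustration, the decorated stream before the final cut looks like this, so the numeric sort orders Dec (12) before Jan (13):

12 25 Dec25 15:05
12 26 Dec26 14:06
13 19 Jan19 03:05
13 19 Jan19 15:05

As with the previous answer, appending tail -n 30 would keep just the most recent 30 entries.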
Hi, I want to get the week number of the current month in bash. Does someone have an idea of how to do this? I searched the date options; there is nothing for the week of the month, only the absolute week number (of the year).
WEEK => current week of month eg. today 22.05 WEEK=4
I found a few options, but none of them works for every month and every date: some give an error with a leading 0, some with certain months.
Here is a solution:
this_month=$(ncal -w | tail -1 | awk '{print NF}')         # number of weeks shown for the current month
total_this_month=$(ncal -w | tail -1 | awk '{print $NF}')  # week-of-year number of the month's last week
this_week=$(date +%-W)                                      # week-of-year of today; %-W (GNU date) avoids a leading zero, which bash arithmetic would reject as octal
this_monthweek=$((this_week - total_this_month + this_month))
echo -e "today $(date +%d.%m) WEEK=$this_monthweek"
Hope this is what you are looking for.
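As a worked example (hedged, since ncal output details vary by system): on 2017-05-22, ncal -w lists the week numbers 18 through 22 for May, so this_month=5 (five weeks) and total_this_month=22 (the month's last week); date +%-W prints 21, giving this_monthweek = 21 - 22 + 5 = 4, which matches "today 22.05 WEEK=4" from the question.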
Edit:
A solution without ncal
this_month=$(date -d "20170501" +%V)
end_this_month=$(date -d "20170531" +%V)
total_this_month=$(($end_this_month - $this_month + 1))
this_week=$(date +%W)
this_monthweek=$(($this_week - $end_this_month + $total_this_month))
echo -e "today $(date +%d.%m) WEEK=$this_monthweek"
I found this solution while working on AIX, but it works on any Unix:
#!/bin/bash
year=$(date +'%Y')
month=$(date +'%m')
day=$(date +'%e' | tr -d '[:blank:]')
WEEKNUM=$(cal ${month} ${year} | grep -v "[[:alpha:]]" | grep -nw ${day} | cut -f1 -d':')
echo "WEEKNUM ${WEEKNUM}"
On Linux you can switch the first day of the week to fit your needs.
On AIX, the first day of the week is inherited from the locale.
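For illustration (assuming cal starts weeks on Sunday, as in the default C locale), cal 05 2017 minus the two header lines removed by the grep -v is roughly:

    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31

grep -nw 22 then reports line 4, so WEEKNUM=4 for 2017-05-22.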
I figured something out, but I am not sure if it works for every month and week of the year, and I have no idea how to manipulate the date to test it:
current_week="$(date +%V)"
last_of_month="$(date --date="-$(date +%d) days +1 month" +%Y-%m-%d)"
last_of_month_week="$(date --date="${last_of_month}" +%V)"
week_number="$((${current_week} - ${last_of_month_week} + 5))"
I have the below output and I need to get the time difference in seconds.
------------------------------
Wed Nov 23 15:09:20 2016
------------------------------
Wed Nov 23 15:27:47 2016
------------------------------
Generally the month should be the same in both cases, so we can skip it, and the same goes for the year. I may get different values for the day of the week, and for the day for sure. The difference will certainly be in seconds and minutes, and might be in hours ...
I tried some awk and cut (on :) approaches but I'm still having an issue.
Thanks in advance !
Any help appreciated !
My first perl script ever :
# extract two dates and calculate difference in s
# http://stackoverflow.com/questions/40781429/get-the-time-difference-in-seconds/
#
# cat time_diff.txt | grep -e "20[0-2][0-9]" | perl time_difference.pl
use Date::Parse;
$date_str1 = <STDIN>;
$date_str2 = <STDIN>;
$date1 = str2time($date_str1);
$date2 = str2time($date_str2);
print $date2-$date1;
print "\n";
Too bad you cannot use date -d, I was proud of this one-liner :
cat time_diff.txt | grep -e "20[0-2][0-9]" | xargs -i date -d{} +%s | (read -d "\n" t1 t2; echo $t2-$t1 | bc)
Tested with bash and zsh on Linux Mint 17.3
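For the sample timestamps above (15:09:20 and 15:27:47 on the same day), both the Perl script and the one-liner print 1107, i.e. 18 minutes 27 seconds.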
Trying to pull the last 5 minutes of logs (with grep matches),
so I do a tac syslog.log | sed combined with date -d "5 minutes ago".
every line on the log shows this format
Jun 14 14:03:58
Jul 3 08:04:35
so I really want to check the data from
Jul 4 08:12
Jul 4 08:17
I tried this method and it KINDA works (though it still goes through every day on which 08:12: through 08:17: appears):
e=""
for (( i = 5; i >= 0; i-- ))
do
e='-e /'`date +\%R -d "-$i min"`':/p '$e;
done
tac /var/log/syslog.log | sed -n $e
e=""
for (( i = 5; i >= 0; i-- ))
do
if [[ -z $e ]]
then e=`date +\%R -d "-$i min"`
else e=$e'\|'`date +\%R -d "-$i min"`
fi
done
re=' \('$e'\):'
tac /var/log/syslog.log | sed -n -e "/$re/p" -e "/$re/!q"
This creates a single regular expression listing all the times from the last 5 minutes, joined with \|. It prints the lines that match them, then uses the ! modifier to quit on the first line that doesn't match the RE.
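For example, when run at 08:17 the generated RE is:

 \(08:12\|08:13\|08:14\|08:15\|08:16\|08:17\):

(the leading space anchors the match to the time field of the log line).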
If you know the format of the dates then why not do:
tac syslog.log | awk '/Jul 4 08:17/,/Jul 4 08:12/ { print } /Jul 4 08:11/ {exit}'
/ .. /,/ .. / is a regex range: it will print everything in this range. So as soon as /Jul 4 08:11/ is seen on a line, your 5-minute window has been captured and we exit instead of perusing the rest of the file.
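A hedged sketch of the same idea with the endpoints computed rather than hard-coded (GNU date assumed; note %e pads single-digit days with a space, which may need adjusting to match your log's exact format):

now=$(date '+%b %e %H:%M')
oldest=$(date -d '5 minutes ago' '+%b %e %H:%M')
stop=$(date -d '6 minutes ago' '+%b %e %H:%M')
# the range only starts if a line matching $now actually exists in the log
tac syslog.log | awk -v a="$now" -v b="$oldest" -v c="$stop" '$0 ~ a, $0 ~ b; $0 ~ c {exit}'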
So the awk range method above didn't really work for me as-is, but I think I got it to work.
I added a RANGE for the {exit}:
awk '/'"$dtnow"'/,/'"$dt6min"'/ { print } /'"$dt7min"'/,/'"$dt11min"'/ {exit}'
Seems to work; I'm testing it again.
OK, finally it looks like it really works this time (it exits after the hour), using sed instead of awk; I got it to work after running through some tests:
tac /var/log/syslog.log | sed -e "$(
    date -d '-1 hour -6 minutes' '+/^%b %e %H:/q;'
    date -d '-1 day -6 minutes' '+/^%b %e /q;'
    date -d '-1 month -6 minutes' '+/^%b /q;'
    for ((o=0;o<=5;o++)); do date -d "-$o minutes" '+/^%b %e %R:/p;'; done
    echo d)"
It works if log entries begin like "May 14 11:41". The variable LASTMINUTES is used to set the last n minutes in the log:
cat log | awk 'BEGIN{ LASTMINUTES=30; for (L=0;L<=LASTMINUTES;L++) TAB[strftime("%b %d %H:%M",systime()-L*60)] } { if (substr($0,1,12) in TAB) print $0 }'
To run the above script you need gawk which can be installed by:
apt-get install gawk
or
yum install gawk