Sed command inside a for loop with variables - bash

I have a file named dates.txt which contains the following:
DATE_1
DATE_2
DATE_3
DATE_4
DATE_5
DATE_6
DATE_7
I just want to replace DATE_i with some dates that are stored in v array using sed command.
To do that I tried a for loop and a sed command (file test.sh):
#!/bin/bash
v[1]=`date -d "7 days ago" '+%d\/%m\/%y'`
v[2]=`date -d "6 days ago" '+%d\/%m\/%y'`
v[3]=`date -d "5 days ago" '+%d\/%m\/%y'`
v[4]=`date -d "4 days ago" '+%d\/%m\/%y'`
v[5]=`date -d "3 days ago" '+%d\/%m\/%y'`
v[6]=`date -d "2 days ago" '+%d\/%m\/%y'`
v[7]=`date -d "1 days ago" '+%d\/%m\/%y'`
cat dates.txt|for j in {1..7};do sed "s/DATE_$j/${v[$j]}/";done
The problem is that this command replaces only the first date. If you run test.sh:
$ ./test.sh
14/03/16
DATE_2
DATE_3
DATE_4
DATE_5
DATE_6
DATE_7
The output I am expecting is:
14/03/16
15/03/16
16/03/16
17/03/16
18/03/16
19/03/16
20/03/16
I cannot understand why this is not working.
Could anyone please explain why this is happening and propose a proper solution for this problem?
Thanks!!

Explanation: What's happening is that the first iteration of the for loop is consuming all of the lines you're piping to its standard input. First of all, let's modify test.sh to contain an echo statement in the last line so that we can see what's happening:
cat dates.txt|for j in {1..7};do echo $j; sed "s/DATE_$j/${v[$j]}/";done
You'll see the output from test.sh is the following:
1
13/03/16
DATE_2
DATE_3
DATE_4
DATE_5
DATE_6
DATE_7
2
3
4
5
6
7
Next, modify dates.txt to read:
DATE_1
DATE_2
DATE_1
DATE_4
DATE_1
DATE_6
DATE_1
, where we've turned every other line into DATE_1 for demonstration purposes. Now, the output reads:
1
13/03/16
DATE_2
13/03/16
DATE_4
13/03/16
DATE_6
13/03/16
2
3
4
5
6
7
So you see that the first iteration of the for loop (when $j == 1) is processing every line that cat is passing to the for loop. After that, the subsequent iterations of the for loop ($j == 2..7) still run, but they don't receive any input stream (so, in the above example, they just echo the current value of $j and don't pass any input to sed). That's why you were observing that it was changing only the first line.
Solution: Give each sed call its own line of input. Modify the last line to read:
for j in {1..7}; do head -n "$j" dates.txt | tail -n 1 | sed "s/DATE_$j/${v[$j]}/"; done
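Another option, sketched here under the assumption that GNU date and bash arrays are available, is to collect all seven substitutions first and let a single sed process read the file once. The temporary file is only a stand-in for dates.txt, and the names args and result are illustrative:

```shell
#!/bin/bash
# Stand-in for dates.txt so the sketch is self-contained.
tmp=$(mktemp)
printf 'DATE_%s\n' 1 2 3 4 5 6 7 > "$tmp"

# Build one -e expression per placeholder.
args=()
for j in {1..7}; do
    d=$(date -d "$((8 - j)) days ago" '+%d/%m/%y')
    # Using | as the delimiter means the slashes in the date need no escaping.
    # Anchoring with ^ and $ keeps DATE_1 from matching inside DATE_10.
    args+=(-e "s|^DATE_$j\$|$d|")
done

# A single sed process applies every substitution to every line.
result=$(sed "${args[@]}" "$tmp")
echo "$result"
rm -f "$tmp"
```

Because sed runs only once, nothing competes for the input stream, which is the problem described in the explanation.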

cat dates.txt | sed 's/^DATE_//g'
This will strip DATE_ at the beginning of a line (^), leaving the number (or anything else on the line). No need for a loop at all!

Related

Improve performance of bash script

I'm working on looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year & 2 years of data from every file & generate new files from them.
I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently which makes my life cumbersome. Is there a better way to achieve the outcome I'm after or possibly enhance the performance of this script please?
for k in *.csv; do
sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.2years.csv
sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.1year.csv
sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.6months.csv
sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.3months.csv
sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' ${k} > temp_data_store/${k}.1month.csv
done
You read each CSV five times. It would be better to read each CSV only once.
You extract the same data multiple times. All but one of the parts are subsets of the others.
1 month ago is a subset of 3 months ago, 6 months ago, 1 year ago and 2 years ago.
3 months ago is a subset of 6 months ago, 1 year ago and 2 years ago.
6 months ago is a subset of 1 year ago and 2 years ago.
1 year ago is a subset of 2 years ago.
This means every line in "1year.csv" is also in "2years.csv". So it will be sufficient to extract "1year.csv" from "2years.csv". You can cascade the different searches with tee, starting with the largest range.
The following assumes that the contents of your files are ordered chronologically. (I simplified the quoting a bit)
sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" "${k}" |
tee "temp_data_store/${k}.2years.csv" |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.1year.csv" |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.6months.csv" |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.3months.csv" |
sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" > "temp_data_store/${k}.1month.csv"
Current performance-related issues:
reading each input file 5 times => we want to limit this to a single read per input file
calling date 5 times (necessary) for each input file (unnecessary) => make the 5 date calls once, prior to the for k in *.csv loop [NOTE: the overhead of the repeated date calls will pale in comparison to the repeated reads of the input files]
Potential operational issue:
sed is not designed for doing comparisons of data (eg, look for a string that is >= a search pattern); consider an input file like such:
$ cat input.csv
2021-01-25
2021-03-01
If 'today' is 2021-03-14 then for the 1month dataset the current sed solution is:
sed -n '/2021-02/,$p'
But because there are no entries for 2021-02 the sed command returns 0 rows, even though we should see the row for 2021-03-01.
Granted, for this particular question we're looking for dates based on the month, and the application likely generates at least one row per month, so this probably won't be a problem here, but we need to be aware of it in general.
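The difference can be reproduced with the two-line sample above. This is a minimal sketch; the variable names input, sed_out and awk_out are only illustrative:

```shell
#!/bin/bash
# The sample input from above: no row falls in 2021-02.
input=$(printf '%s\n' 2021-01-25 2021-03-01)

# Range match: the start pattern never matches, so nothing is printed.
sed_out=$(printf '%s\n' "$input" | sed -n '/2021-02/,$p')

# String comparison: YYYY-MM prefixes sort chronologically, so rows from
# later months are kept even when the cutoff month itself is missing.
awk_out=$(printf '%s\n' "$input" | awk -v cutoff='2021-02' 'substr($1, 1, 7) >= cutoff')

echo "sed: [$sed_out]"
echo "awk: [$awk_out]"
```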
Anyhoo, back to the question at hand ...
Assumptions:
input files are comma-delimited (otherwise need to adjust the proposed solution)
the date to be tested is of the format YYYY-MM-...
the date to be tested is the 1st field of the comma-delimited input file (otherwise need to adjust the proposed solution)
output filename prefix is the input filename sans the .csv
Sample input:
$ cat input.csv
2019-09-01,line 1
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
We only need to do the date calculations once so we'll do this in bash and prior to OP's for k in *.csv loop:
# date as of writing this answer: 2021-10-12
$ yr2=$(date -d "2 year ago" '+%Y-%m')
$ yr1=$(date -d "1 year ago" '+%Y-%m')
$ mon6=$(date -d "6 month ago" '+%Y-%m')
$ mon3=$(date -d "3 month ago" '+%Y-%m')
$ mon1=$(date -d "1 month ago" '+%Y-%m')
$ typeset -p yr2 yr1 mon6 mon3 mon1
declare -- yr2="2019-10"
declare -- yr1="2020-10"
declare -- mon6="2021-04"
declare -- mon3="2021-07"
declare -- mon1="2021-09"
One awk idea (replaces all of the sed calls in OP's current for k in *.csv loop):
# determine prefix to be used for output files ...
$ k=input.csv
$ prefix="${k//.csv/}"
$ echo "${prefix}"
input
awk -v yr2="${yr2}" \
-v yr1="${yr1}" \
-v mon6="${mon6}" \
-v mon3="${mon3}" \
-v mon1="${mon1}" \
-v prefix="${prefix}" \
-F ',' ' # define input field delimiter as comma
{ split($1,arr,"-") # date to be compared is in field #1
testdate=arr[1] "-" arr[2]
if ( testdate >= yr2 ) print $0 > prefix".2years.csv"
if ( testdate >= yr1 ) print $0 > prefix".1year.csv"
if ( testdate >= mon6 ) print $0 > prefix".6months.csv"
if ( testdate >= mon3 ) print $0 > prefix".3months.csv"
if ( testdate >= mon1 ) print $0 > prefix".1month.csv"
}
' "${k}"
NOTE: awk can dynamically process the input filename to determine the filename prefix (see the FILENAME variable) but would still need to know the target directory name (assuming the output is written to a different directory from the one holding the input file)
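Put together, the date calculations and the awk call could be wrapped roughly as follows. This is only a sketch: the function name split_by_age and its argument order are made up for illustration, GNU date is assumed, and the YYYY-MM part is taken with substr instead of split:

```shell
#!/bin/bash
# split_by_age OUTDIR FILE...  (hypothetical helper, not part of the question)
split_by_age() {
    local outdir=$1 k prefix
    shift
    # Compute the five cutoffs once, not once per input file.
    local yr2 yr1 mon6 mon3 mon1
    yr2=$(date -d '2 year ago' '+%Y-%m')
    yr1=$(date -d '1 year ago' '+%Y-%m')
    mon6=$(date -d '6 month ago' '+%Y-%m')
    mon3=$(date -d '3 month ago' '+%Y-%m')
    mon1=$(date -d '1 month ago' '+%Y-%m')
    mkdir -p "$outdir"
    for k in "$@"; do
        # Output prefix: target directory plus the input name without .csv
        prefix="$outdir/$(basename "$k" .csv)"
        awk -F',' -v yr2="$yr2" -v yr1="$yr1" -v mon6="$mon6" \
            -v mon3="$mon3" -v mon1="$mon1" -v prefix="$prefix" '
        {
            testdate = substr($1, 1, 7)      # YYYY-MM part of field 1
            if (testdate >= yr2)  print > (prefix ".2years.csv")
            if (testdate >= yr1)  print > (prefix ".1year.csv")
            if (testdate >= mon6) print > (prefix ".6months.csv")
            if (testdate >= mon3) print > (prefix ".3months.csv")
            if (testdate >= mon1) print > (prefix ".1month.csv")
        }' "$k"
    done
}
```

It would be called as split_by_age temp_data_store *.csv, so each input file is read exactly once.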
This generates the following files:
for f in "${prefix}".*.csv
do
echo "############# ${f}"
cat "${f}"
echo ""
done
############# input.2years.csv
2019-10-01,line 2
2019-10-03,line 3
2019-12-01,line 4
2020-05-01,line 5
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1year.csv
2020-10-01,line 6
2020-10-03,line 7
2020-12-01,line 8
2021-03-01,line 9
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.6months.csv
2021-04-01,line 10
2021-05-01,line 11
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.3months.csv
2021-07-01,line 12
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
############# input.1month.csv
2021-09-01,line 13
2021-09-01,line 14
2021-10-11,line 15
2021-10-12,line 16
Additional performance improvements:
[especially for largish files] read from one filesystem, write to a 2nd/different filesystem; even better would be a separate filesystem for each of the 5x different output files - would require a minor tweak to the awk solution
processing X number of input files in parallel, eg, awk code could be placed in a bash function and then called via <function_name> <input_file> &; this can be done via bash loop controls, parallel, xargs, etc
if running parallel operations, will need to limit the number of parallel operations based primarily on disk subsystem throughput, ie, how many concurrent reads/writes can disk subsystem handle before slowing down due to read/write contention
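One way to cap the number of concurrent workers is xargs with its -P option. The sketch below assumes bash (for export -f) and an xargs that supports -0 and -P; split_one is a hypothetical stand-in for the real per-file work such as the awk call:

```shell
#!/bin/bash
# Stand-in for the real per-file work; here it only records the line count
# of each file next to it. The name split_one is hypothetical.
split_one() {
    wc -l < "$1" > "$1.count"
}
export -f split_one   # make the function visible to the bash started by xargs

# run_parallel JOBS FILE...: process the files with at most JOBS workers.
run_parallel() {
    local jobs=$1
    shift
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$jobs" bash -c 'split_one "$1"' _
}
```

A call such as run_parallel 4 *.csv keeps four workers busy; the right number depends on how much concurrent I/O the disks can take.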
Something like this may work if you have GNU Parallel (version >= 20191022):
do_one() {
cat "$1" | parallel --pipe --tee sed -n '/$(date -d {}" ago" '+%Y-%m')/,\$p' "> temp_data_store/$1.{}.csv" \
::: "2 year" "1 year" "6 month" "3 month" "1 month"
}
export -f do_one
parallel -j1 do_one ::: *.csv
If your disks are fast: Remove -j1.

grep last 10 minutes of a log with bad data format

my log has this date format at the beginning of each line:
2018 Sep 21 17:16:27:796
I need to grep the last 10 minutes of this log... any help?
my current experiments:
tenminutesago=$(date --date='10 minutes ago' +"%Y %b %e %H:%M:%S"):999
My idea was to convert the log format to a progressive number and then check everything greater than that number.
I see that the command: date +"%Y %b %e %H:%M:%S" gives a date in the same format of the log. The command: date +"%Y%m%e%H%M%S" gives a date in a progressive number (201810041204019)
You could do
for i in {10..0}; do
d=$(date -d "$i minutes ago" +'%Y %b %e %H:%M')
grep "$d" logfile
done
This just divides the problem in the 11 sequential subtasks of getting all lines from 10 minutes ago, all lines from 9 minutes ago, etc. until the current minute.
Edit:
Here's an alternate solution that prints all lines following the first one where a date stamp from the last 10 minutes was found, not only those that carry a date stamp, and also avoids reading the file over from start several times:
# build a regex pattern that matches any date in the given format from the last 10 minutes
pattern=$(date +'%Y %b %e %H:%M')
for i in {10..1}; do
pattern+=\|$(date -d "$i minutes ago" +'%Y %b %e %H:%M')
done
# print all lines starting from the first one that matches one of the dates in the pattern
awk "/$pattern/,0" logfile
Under the assumption that your log lines look like
YYYY Bbb dd HH:MM:SS:sss Some random log message is here
You can do the following:
awk -v d="$(date -d "10 minutes ago" "+%Y %m %d %T")" '
{ mm = sprintf("%0.2d",(index("JanFebMarAprMayJunJulAugSepOctNovDec",$2)+2)/3)
s = $1 " " mm " " $3 " "$4 }
(s >= d){print}' logfile
The idea is to convert your date format into a sortable format (note that, compared as strings, "Jan" < "Mar" holds but so does "Feb" < "Jan", so month names cannot be compared directly). This is done by converting the month into a two-digit number and then comparing the result stringwise against the cutoff date.
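The same conversion can be wrapped in a small function that takes the cutoff as an argument, which makes it easy to try against fixed sample data. This is a sketch only: the name filter_since is made up, the day is zero-padded as an extra assumption so that single-digit days compare correctly, and the milliseconds are dropped before comparing:

```shell
#!/bin/bash
# filter_since CUTOFF FILE
# CUTOFF has the form "YYYY MM DD HH:MM:SS"; lines at or after it are printed.
filter_since() {
    awk -v d="$1" '
    {
        # Month name to two-digit number: Jan -> 01, Feb -> 02, ...
        mm = sprintf("%02d", (index("JanFebMarAprMayJunJulAugSepOctNovDec", $2) + 2) / 3)
        dd = sprintf("%02d", $3)                  # pad single-digit days
        s  = $1 " " mm " " dd " " substr($4, 1, 8)  # drop the :sss part
    }
    s >= d' "$2"
}
```

For the last ten minutes it would be called as filter_since "$(date -d '10 minutes ago' '+%Y %m %d %T')" logfile.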
Try your current approach without the seconds and milliseconds.
tenminutesago=$(date --date='10 minutes ago' +"%Y %b %e %H:%M")
This is not exactly the last ten minutes down to the second, but I think it is close enough for most cases. That will give you the first line in the log within the time window. Now you can take the total number of lines, subtract the line number found by that grep, and then tail the file. The script could look like this:
LOGFILE="filename.log"
tenminutesago=$(date --date='10 minutes ago' +"%Y %b %e %H:%M") # matching pattern
tlines=$(wc -l < "$LOGFILE") # total lines in file
let lines=$tlines-$(grep -n "$tenminutesago" "$LOGFILE" | grep -m 1 -oP "^[0-9]*" || echo $tlines) # lines after first matching occurrence
echo "$lines lines FOUND from the last X minutes"
tail -n $lines "$LOGFILE" # last lines in file
As suggested by @Gem Taylor, this could be reduced using the +N option of tail.
LOGFILE="filename.log"
tenminutesago=$(date --date='10 minutes ago' +"%Y %b %e %H:%M") # matching pattern
lines=$(grep -n "$tenminutesago" "$LOGFILE" | grep -m 1 -oP "^[0-9]*" || echo "0") # line number of first matching occurrence, 0 if none
echo "First line from the last X minutes: $lines"
[ "$lines" -ne 0 ] && tail -n +"$lines" "$LOGFILE" # print from that line to the end if lines is not 0
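The "find the first matching line, then tail from there" idea can be sketched as a reusable function. The name print_from_match is hypothetical, and the pattern is treated as a fixed string (grep -F), which is an assumption that suits literal timestamps:

```shell
#!/bin/bash
# print_from_match PATTERN FILE
# Prints FILE from the first line containing PATTERN through to the end.
# Returns 1 and prints nothing when PATTERN does not occur.
print_from_match() {
    local pattern=$1 file=$2 n
    # -n prefixes the line number, -m 1 stops after the first match.
    n=$(grep -n -m 1 -F -- "$pattern" "$file" | cut -d: -f1)
    [ -n "$n" ] || return 1
    tail -n +"$n" "$file"
}
```

With the variables from the script above it would be called as print_from_match "$tenminutesago" "$LOGFILE".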

echo to file loop in bash script

Say that I am trying to do an echo TZ=GMT-24 date +%Y%m%d >> echoed.
This is on Solaris.
Now, I would like a loop that takes a specific number of days and echoes with GMT-24, GMT-48, etc. until the number of days runs out. This is a loop of 5 iterations, basically Monday to Friday. I will put this script in crontab so that it runs once a day and writes that echo output to a file, so that another script I have already created can check those dates and work with them.
Thanks in advance.
This is ksh on Solaris 8:
$ date +%Y%m%d
20130919
$ for i in 1 2 3 4 5; do TZ=GMT-$(($i * 24)) date +%Y%m%d; done
20130920
20130921
20130922
20130923
20130924
$ for i in 1 2 3 4 5; do TZ=GMT+$(($i * 24)) date +%Y%m%d; done
20130918
20130917
20130916
20130915
20130914
To redirect to a file, add > filename after the done keyword
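A sketch of the loop with the redirection in place follows. The output path is a hypothetical stand-in for the file named echoed in the question. Note that POSIX only defines TZ offsets of up to 24 hours, so the larger offsets rely on platform behaviour such as the Solaris output shown above and may be ignored elsewhere:

```shell
#!/bin/bash
# Hypothetical output location; the question uses a file called "echoed".
outfile="${TMPDIR:-/tmp}/echoed.$$"

# One redirection after "done" captures the output of every iteration.
for i in 1 2 3 4 5; do
    TZ=GMT-$((i * 24)) date +%Y%m%d
done > "$outfile"

cat "$outfile"
```

Redirecting once after done truncates the file a single time; using >> inside the loop would instead append to whatever a previous cron run left behind.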

pull last 5 minutes of syslog data (750mb) with tac combo sed/awk/grep/?

Trying to pull the last 5 minutes of logs with (grep matches)
so i do a tac syslog.log | sed / date -d "5 minutes ago"
every line on the log shows this format
Jun 14 14:03:58
Jul 3 08:04:35
so I really want to get the chunk of data from
Jul 4 08:12
Jul 4 08:17
I tried this method, but it only kind of works (it still matches every day in the log on which 08:12 through 08:17 occurs):
e=""
for (( i = 5; i >= 0; i-- ))
do
e='-e /'`date +\%R -d "-$i min"`':/p '$e;
done
tac /var/log/syslog.log | sed -n $e
e=""
for (( i = 5; i >= 0; i-- ))
do
if [[ -z $e ]]
then e=`date +\%R -d "-$i min"`
else e=$e'\|'`date +\%R -d "-$i min"`
fi
done
re=' \('$e'\):'
tac /var/log/syslog.log | sed -n -e "/$re/p" -e "/$re/!q"
This creates a single regular expression listing all the times from the last 5 minutes, connected with \|. It prints the lines that match it. Then it uses the ! modifier to quit on the first line that doesn't match the RE.
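A variant of the same idea, sketched under the assumption of GNU date and GNU grep, puts the month and day into each pattern as well, so the same clock time on an earlier day no longer matches. The function name last_minutes is made up, and %e assumes the log pads single-digit days with a space as syslog usually does:

```shell
#!/bin/bash
# last_minutes N FILE
# Prints the lines of FILE whose timestamp falls in the current minute or
# one of the N minutes before it.
last_minutes() {
    local n=$1 file=$2 i
    for ((i = 0; i <= n; i++)); do
        # One anchored pattern per minute, e.g. "^Jul  4 08:17:"
        date -d "-$i min" '+^%b %e %H:%M:'
    done | grep -f - "$file"     # -f - reads the pattern list from stdin
}
```

Unlike the tac and sed combination this reads the whole file, so for a 750 MB log the early-exit approach is likely to be faster.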
If you know the format of the dates then why not do:
tac syslog.log | awk '/Jul 4 08:17/,/Jul 4 08:12/ { print } /Jul 4 08:11/ {exit}'
/ .. /,/ .. / is regex range. It will print everything in this range. So as soon as you see /Jul 4 08:11/ on your line that would mean your 5 minutes window has been captured, you exit perusing the file.
So the method above didn't really work for me, but I think I got it to work.
As far as I can see, what helped was adding a range for the {exit}:
awk '/'"$dtnow"'/,/'"$dt6min"'/ { print } /'"$dt7min"'/,/'"$dt11min"'/ {exit}'
It seems to work; I'm testing it again.
OK, it finally looks like it really works this time, using sed instead of awk (it quits as soon as it reaches an entry from the previous hour, day or month). I got it to work after running through some tests.
tac /var/log/syslog.log | sed -e "$( date -d '-1 hour -6 minutes' '+/^%b %e %H:/q;'
date -d '-1 day -6 minutes' '+/^%b %e /q;'
date -d '-1 month -6 minutes' '+/^%b /q;'
for ((o=0;o<=5;o++)) do date -d "-$o minutes" '+/^%b %e %R:/p;'; done ; echo d)"
This works if log entries begin with a timestamp like "May 14 11:41". The variable LASTMINUTES sets the last n minutes of the log to print:
awk 'BEGIN{ LASTMINUTES=30; for (L=0;L<=LASTMINUTES;L++) TAB[strftime("%b %d %H:%M",systime()-L*60)] } { if (substr($0,1,12) in TAB) print $0 }' log
To run the above script you need gawk which can be installed by:
apt-get install gawk
or
yum install gawk

bash shell date parsing, start with specific date and loop through each day in month

I need to create a bash shell script that starts with a given day and then loops through each subsequent day, formatting the output as %Y_%m_%d
I figure I can submit a start day and then another param for the number of days.
My issue/question is how to set a DATE (that is not now) and then add a day.
so my input would be 2010_04_01 6
my output would be
2010_04_01
2010_04_02
2010_04_03
2010_04_04
2010_04_05
2010_04_06
[radical@home ~]$ cat a.sh
#!/bin/bash
START=`echo $1 | tr -d _`;
for (( c=0; c<$2; c++ ))
do
echo -n "`date --date="$START +$c day" +%Y_%m_%d` ";
done
Now if you call this script with your params it will return what you wanted:
[radical@home ~]$ ./a.sh 2010_04_01 6
2010_04_01 2010_04_02 2010_04_03 2010_04_04 2010_04_05 2010_04_06
Very basic bash script should be able to do this:
#!/bin/bash
start_date=20100501
num_days=5
for i in `seq 1 $num_days`
do
date=`date +%Y/%m/%d -d "${start_date}-${i} days"`
echo $date # Use this however you want!
done
Output:
2010/04/30
2010/04/29
2010/04/28
2010/04/27
2010/04/26
Note: NONE of the solutions here will work with OS X. You would need, for example, something like this:
date -v-1d +%Y%m%d
That would print out yesterday for you. Or with underscores of course:
date -v-1d +%Y_%m_%d
So taking that into account, you should be able to adjust some of the loops in these examples with this command instead. -v option will easily allow you to add or subtract days, minutes, seconds, years, months, etc. -v+24d would add 24 days. and so on.
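A loop that works with either flavour of date could be sketched as below. The function name date_range is made up; the GNU branch mirrors the -d examples in this thread, while the BSD/macOS branch (-j, -f and -v) follows the BSD date manual and should be treated as an untested assumption:

```shell
#!/bin/bash
# date_range START COUNT
# START is in YYYY_MM_DD form; prints COUNT consecutive days, one per line.
date_range() {
    local start=$1 count=$2 i
    if date --version >/dev/null 2>&1; then
        # GNU date: accepts a relative expression via -d
        for ((i = 0; i < count; i++)); do
            date -d "${start//_/-} + $i day" '+%Y_%m_%d'
        done
    else
        # BSD/macOS date: -j (do not set the clock), -f (input format),
        # -v (adjust by the given number of days)
        for ((i = 0; i < count; i++)); do
            date -j -v+"${i}"d -f '%Y_%m_%d' "$start" '+%Y_%m_%d'
        done
    fi
}
```

Called as date_range 2010_04_01 6, it should list the six days from the question.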
#!/bin/bash
inputdate="${1//_/-}" # change underscores into dashes
for ((i=0; i<$2; i++))
do
date -d "$inputdate + $i day" "+%Y_%m_%d"
done
You can also use cal, for example
YYYY=2014; MM=02; for d in $(cal $MM $YYYY | grep "^ *[0-9]"); do DD=$(printf "%02d" $d); echo $YYYY$MM$DD; done
(originally posted here on my commandlinefu account)
You can pass a date via command line option -d to GNU date handling multiple input formats:
http://www.gnu.org/software/coreutils/manual/coreutils.html#Date-input-formats
Pass starting date as command line argument or use current date:
underscore_date=${1:-$(date +%Y_%m_%d)}
date=${underscore_date//_/-}
for days in $(seq 0 6);do
date -d "$date + $days days" +%Y_%m_%d;
done
you can use gawk
#!/bin/bash
DATE=$1
num=$2
awk -vd="$DATE" -vn="$num" 'BEGIN{
m=split(d,D,"_")
t=mktime(D[1]" "D[2]" "D[3]" 00 00 00")
print d
for(i=1;i<=n;i++){
t+=86400
print strftime("%Y_%m_%d",t)
}
}'
output
$ ./shell.sh 2010_04_01 6
2010_04_01
2010_04_02
2010_04_03
2010_04_04
2010_04_05
2010_04_06
2010_04_07

Resources