I have a data file in which the timestamps are alphanumeric and recorded every 10 minutes, e.g.
00 hour 00 minute (00:00H)
00 hour 10 minute (00:10H)
00 hour 20 minute (00:20H)
and so on
$ ifile.txt
00:00H01JUN2021 1.900
00:10H01JUN2021 2.400
00:20H01JUN2021 2.100
00:30H01JUN2021 2.300
00:40H01JUN2021 2.00
00:50H01JUN2021 2.300
01:00H01JUN2021 2.300
01:10H01JUN2021 0.000
01:20H01JUN2021 2.200
01:30H01JUN2021 0.100
To understand the data:
The 1st column is the timestamp; the 2nd column is the value at that time.
The first 6 characters, YY:XXH, encode the time: YY -> hour, XX -> minute (as explained at the beginning).
I would like to convert it into a CSV file with numeric dates. The desired output file is
$ ofile.txt
yyyy-mm-dd hh-mn-sc,val
2021-06-01 00:00:00,1.900
2021-06-01 00:10:00,2.400
2021-06-01 00:20:00,2.100
2021-06-01 00:30:00,2.300
2021-06-01 00:40:00,2.000
2021-06-01 00:50:00,2.300
2021-06-01 01:00:00,2.300
2021-06-01 01:10:00,0.000
2021-06-01 01:20:00,2.200
2021-06-01 01:30:00,0.100
My script is:
#!/bin/sh
gawk '
BEGIN {
month["Jan"] = "01"; month["Feb"] = "02"; month["Mar"] = "03";
month["Apr"] = "04"; month["May"] = "05"; month["Jun"] = "06";
month["Jul"] = "07"; month["Aug"] = "08"; month["Sep"] = "09";
month["Oct"] = "10"; month["Nov"] = "11"; month["Dec"] = "12";
}
function timestamp_to_numeric(s) {
# 00:00H01JUN2021 => 2021-06-01 00:00:00
return substr(s,12,4)"-"month[substr(s,9,3)]"-"substr(s,7,2) substr(s,1,2)":"substr(s,4,2)":""00"
}
NR==1 {next}
END {
printf "%s",timestamp_to_numeric($1),$2
printf "\n"
}
' ifile.txt
This script is not printing my desired output.
Using GNU awk (since you're already using it) for the 4th arg to split():
$ cat tst.awk
function timestamp_to_numeric(s, mthNr,t,m) {
# 00:00H01JUN2021 => 2021-06-01 00:00:00
split(s,t,/[[:alpha:]]+/,m)
mthNr = (index("JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC",m[2]) + 2) / 3
return sprintf("%04d-%02d-%02d %s:00", t[3], mthNr, t[2], t[1])
}
BEGIN {
OFS=","
print "yyyy-mm-dd hh-mn-sc","val"
}
{ print timestamp_to_numeric($1), $2 }
$ awk -f tst.awk ifile.txt
yyyy-mm-dd hh-mn-sc,val
2021-06-01 00:00:00,1.900
2021-06-01 00:10:00,2.400
2021-06-01 00:20:00,2.100
2021-06-01 00:30:00,2.300
2021-06-01 00:40:00,2.00
2021-06-01 00:50:00,2.300
2021-06-01 01:00:00,2.300
2021-06-01 01:10:00,0.000
2021-06-01 01:20:00,2.200
2021-06-01 01:30:00,0.100
Change
return substr(s,12,4)"-"month[substr(s,9,3)]"-"substr(s,7,2) substr(s,1,2)":"substr(s,4,2)":""00"
to
return substr(s,12,4)"-"month[substr(s,9,3)]"-"substr(s,7,2)" "substr(s,1,2)":"substr(s,4,2)":""00"
# .................................,........................^^^
so that you actually have a space between the date and the time.
Perhaps more readable would be:
return sprintf("%4d-%02d-%02d %02d:%02d:00", substr(s,12,4), month[substr(s,9,3)], substr(s,7,2), substr(s,1,2), substr(s,4,2))
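Putting the pieces together: the spacing fix alone may not be enough for the script as posted, because its month table is keyed by "Jan", "Feb", ... while the data contains "JUN", and its printf sits in an END block, so it runs only once, after the last record. Below is a minimal portable sketch. It assumes the input has no header line and that the value should be printed with three decimals, as in the desired output. The function name convert_timestamps is made up for illustration.

```shell
# Hypothetical helper: reads "HH:MMHddMONyyyy value" lines on stdin, writes CSV.
convert_timestamps() {
    awk '
    BEGIN {
        # Upper-case keys, because the data uses JUN rather than Jun.
        n = split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        for (i = 1; i <= n; i++) month[name[i]] = i
        print "yyyy-mm-dd hh-mn-sc,val"
    }
    # 00:00H01JUN2021 => 2021-06-01 00:00:00
    function timestamp_to_numeric(s) {
        return sprintf("%04d-%02d-%02d %02d:%02d:00",
                       substr(s, 12, 4), month[toupper(substr(s, 9, 3))],
                       substr(s, 7, 2), substr(s, 1, 2), substr(s, 4, 2))
    }
    # Runs for every record; an END block would only see the last one.
    NF >= 2 { printf "%s,%.3f\n", timestamp_to_numeric($1), $2 }
    '
}
```

It would be called as: convert_timestamps < ifile.txt > ofile.txt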
To put it together in awk without depending on gawk's ability to return the separators from split() (confirmed working on mawk, gawk and nawk):
echo "${...input_data...}" | awk '
function ____(__,_,___) {
return \
sprintf((_ = (_ = "%s%.*d")_)(_)_,
_ = ___, (_+=++_)^_++,
substr(__, ___ = length(__)-_--),
"-",_,
index("=ANEBARPRAYUNULUGEPCTOVEC",
substr(__, ___-= _,_) ) /_,
"-",_, substr(__, ___-= _+!!_),
" ",_, __, ___ = ":",_,
substr(__,_+_),___,_,!_)
}
BEGIN {
print "yyyy-mm-dd hh-mn-sc" (OFS = ",\t") "val"
}
($++NF = ____($!_))^_'
yyyy-mm-dd hh-mn-sc, val
00:00H01JUN2021, 1.900, 2021-06-01 00:00:00
00:10H01JUN2021, 2.400, 2021-06-01 00:10:00
00:20H01JUN2021, 2.100, 2021-06-01 00:20:00
00:30H01JUN2021, 2.300, 2021-06-01 00:30:00
00:40H01JUN2021, 2.00, 2021-06-01 00:40:00
00:50H01JUN2021, 2.300, 2021-06-01 00:50:00
01:00H01JUN2021, 2.300, 2021-06-01 01:00:00
01:10H01JUN2021, 0.000, 2021-06-01 01:10:00
01:20H01JUN2021, 2.200, 2021-06-01 01:20:00
01:30H01JUN2021, 0.100, 2021-06-01 01:30:00
===================
To map English month names (full or abbreviated, in any letter case) to month numbers, this extremely odd-looking lookup string suffices:
it first partitions the input by whether the 2nd letter is A|a (i.e. Jan / March / May),
then looks up the position of the 3rd letter in the reference string
function month_name_to_num(__,_) {
return \
index(substr("n_r_yb_r_nlgptvc",
((_+=++_)-+-++_)^(__!~"^.[Aa]")),
tolower(substr(__,_--,--_) ) )
}
OCT 10
AUGUST 8
March 3
May 5
October 10
November 11
February 2
JUNE 6
NOV 11
JUL 7
December 12
OCTOBER 10
FEBRUARY 2
JANUARY 1
MARCH 3
APRIL 4
June 6
April 4
September 9
NOVEMBER 11
January 1
FEB 2
MAY 5
DEC 12
MAY 5
JAN 1
JULY 7
SEP 9
August 8
SEPTEMBER 9
July 7
DECEMBER 12
MAR 3
APR 4
JUN 6
AUG 8
If you don't want to use a regex, this function variant avoids allocating extra temporary variable(s) by reusing the input parameter, something uniquely convenient in weakly-typed languages like awk:
function monthname2num(_) {
return \
index("=anebarprayunulugepctovec",
tolower(substr(_ "",_+=_^=_,_)))/_
}
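For comparison, here is a plain, non-obfuscated sketch of the same packed-string lookup. It assumes only the first three letters of the name matter. The name month_to_num is a hypothetical helper, and unknown or too-short names map to 0.

```shell
# Hypothetical helper: reads one month name per line, prints "name number".
month_to_num() {
    awk '
    {
        key = tolower(substr($1, 1, 3))
        # Position of the 3-letter prefix in the packed string: 1, 4, 7, ...
        pos = index("janfebmaraprmayjunjulaugsepoctnovdec", key)
        # Accept only real month starts, so that e.g. "anf" is rejected.
        ok = (length(key) == 3 && pos % 3 == 1)
        print $1, (ok ? (pos + 2) / 3 : 0)
    }'
}
```

Example: printf 'OCT\nAugust\nmay\n' | month_to_num prints "OCT 10", "August 8" and "may 5".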
Related
I'm trying to make a script that sorts column 2 for highest value, prints said value, and prints column 3 for every row matching this value. Here's an example of unsorted csv:
Argentina,4.6,2016,some data
Argentina,4.2,2018,some data
Argentina,4.6,1998,some data
Argentina,4.5,2001,some data
Desired output would be:
4.6
2016
1998
Here's what I've got so far, but I'm feeling unsure if I'm going about it correctly:
grep "$2*" "$1"> new.csv
sort -t, -k2,2nr new.csv > new2.csv
cut -f3 -d"," new2.csv
Here $2 is the name of the country in the first column and $1 is the filename. While it sorts the values in the 2nd column just fine, I'd like to show the years only for the rows with the max value in column 2. This route just prints the years for all of the rows, and I understand why that's happening, but I'm not sure of the best course to get the intended result from there. What are some ways to go about this? Thanks in advance.
You could do something like this:
declare maxvalue_found=no
declare maxvalue=''
while read -r line; do
IFS=',' read -r <<< "$line" country value year data
if [[ "${maxvalue_found}" == no ]]; then
echo "$value"
maxvalue="${value}"
maxvalue_found=yes
fi
if [[ "${value}" == "${maxvalue}" ]]; then
echo "$year"
fi
done < new2.csv
new2.csv is your sorted file: we simply read it line by line, then split each line on ',' (https://www.gnu.org/software/bash/manual/bash.html#Word-Splitting):
The first value is the highest, because the file is sorted.
Each following value must be tested, because you only want the rows that match it.
The years are printed in the same order as in new2.csv.
Assumptions:
comma only shows up as a field delimiter (ie, comma is not part of any data)
no sorting requirement has been defined for the final result
One awk idea requiring 2 passes through the unsorted file:
awk -F, ' # set input field delimiter as comma
FNR==NR { max=($2>max ? $2 : max); next} # 1st pass of file (all rows): keep track of max value from field #2
FNR==1 { print max } # 2nd pass of file (1st row ): print max
$2==max { print $3 } # 2nd pass of file (all rows): if field #2 matches "max" then print field #3
' unsorted.csv unsorted.csv
This generates:
4.6
2016
1998
Another GNU awk idea that requires a single pass through the unsorted file:
awk -F, ' # set input field delimiter as comma
{ arr[$2][$3] # save fields #2 and #3 as indices in array "arr[]"
max = ( $2 > max ? $2 : max) # keep track of max value from field #2
}
END { print max # after file has been processed ... print max and then ...
for (i in arr[max]) # loop through indices of 2nd dimension where 1st dimension == max
print i # print 2nd dimension index (ie, field #3)
}
' unsorted.csv
This generates:
4.6
1998
2016
NOTES:
GNU awk required for arrays of arrays (ie, multidimensional arrays)
while field #3 appears to be sorted, this is not guaranteed unless we modify the code to explicitly sort the 2nd dimension of the array
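If input order matters and GNU awk is not available, a single pass can also be sketched with a plain one-dimensional array. This is only an illustration under the same assumptions as above (comma is purely a field delimiter). The function name years_at_max is made up.

```shell
# Hypothetical helper: reads the unsorted CSV on stdin.
years_at_max() {
    awk -F, '
    # First row, or a strictly larger value: restart the list.
    NR == 1 || $2 + 0 > max { max = $2 + 0; shown = $2; n = 0 }
    # Remember field #3 for every row that ties the current maximum.
    $2 + 0 == max { year[++n] = $3 }
    END {
        if (NR == 0) exit
        print shown                       # the max exactly as written in the file
        for (i = 1; i <= n; i++) print year[i]
    }'
}
```

Because the array is indexed 1..n and walked with a counting loop, the years come out in input row order (2016 before 1998 for the sample data).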
How about a single pass in awk instead of multiple passes? I generated a synthetic version of the file, with some randomized data, to create a 6.24 million-row version of it:
INPUT
out9: 177MiB 0:00:01 [ 105MiB/s] [ 105MiB/s] [ <=> ]
rows = 6243584. | UTF8 chars = 186289540. | bytes = 186289540.
CODE
The default value is initialized to a gigantically negative value,
-2^512 (or, more elegantly, -4^4^4), to ensure it always takes on the value from row 1.
If you really want to play it safe, make it something very close to negative infinity,
e.g. -(3+4+1)^341, -16^255, -256^127, or -1024^102
=
{m,g}awk '
BEGIN {
1 _= -(_^= __= _+= _^= FS= OFS = ",")^_^_
1 ___= split("",____)
}
# Rule(s)
6243584 +_ <= +$__ { # 2992
2992 __= $(NF = ++__)
2992 if ((+_)< +$--NF) {
7 _= $NF
7 ___= split("",____)
}
2992 ____[++___]=__
2992 __=NF
}
END {
1 print _
2984 for (__^=_<_; __<=___; __++) {
2984 print ____[__]
}
}
OUTPUT (column 3 printed exactly in input row order)
.
53.6 1834 1999 1866 1938 1886 1973 1968 1921 1984 1957 1891 1864 1992
1998 1853 1950 1985 1962 2018 1897 1979 2020 1954 1995 1980 1900 1997
1856 1975 1851 1853 1988 1897 1973 1875 1917 1861 1912 1912 1954 1871
1952 1877 2003 1886 1863 1899 1897 1853 2013 1956 1965 1854 1873 1915
1983 1961 1965 1979 1919 1970 1946 1843 1856 1954 1965 1831 1926 1964
1994 1969 1831 1945 1942 1971 1988 1879 1998 1986 1844 1846 1994 1894
2008 1851 1877 1979 1970 1852 1942 1889 1986 2013 1905 1932 2021 1944
1866 1892 1940 1989 1907 1982 2016 1966 1975 1831 1851 2003 1980 1963
1869 1983 1972 2013 1972 1948 1843 1928 1959 1911 1844 1920 1943 1864
1985 1978 1855 1986 1975 1880 2001 1914 1877 1900 1964 1995 1992 1968
1868 1974 2012 1827 1849 1849 1992 1942 1884 1876 2021 1866 1977 1857
1866 1937 1920 1983 1915 1887 1890 1852 1871 1972 1903 1944 1943 1957
1844 1932 1854 1890 1891 1866 1923 1924 1941 1845 1907 2019
(further rows truncated for readability)
A single pass awk:
$ awk -F, '{
if($2>=m||m=="") {
b= ($2==m?b:$2) ORS $3 # b is the record buffer
m=$2 # m holds the maximum of $2 so far
}
}
END {
print b
}' file
Output:
4.6
2016
1998
We have green-zone logic where the job has to run only from the first Sunday to the following Saturday, i.e. the 7 days starting from the first Sunday of every month. I'm using the awk command below to get that, but it is breaking somewhere. I'm just trying it for the first 3 months, i.e. Jan to March.
seq 75 | awk ' BEGIN {ti=" 0 0 0"}
function dtf(fmt,dy) { return strftime(fmt,mktime("2020 1 " dy ti)) }
{ day=dtf("%A %F",$0);mm=dtf("%m",$0);if(day~/Sunday/ || a[mm]) a[mm]++ ; if(a[mm]<8) print day } '
My output is below, which is incorrect:
Wednesday 2020-01-01
Thursday 2020-01-02
Friday 2020-01-03
Saturday 2020-01-04
Sunday 2020-01-05
Monday 2020-01-06
Tuesday 2020-01-07
Wednesday 2020-01-08
Thursday 2020-01-09
Friday 2020-01-10
Saturday 2020-01-11
Saturday 2020-02-01
Sunday 2020-02-02
Monday 2020-02-03
Tuesday 2020-02-04
Wednesday 2020-02-05
Thursday 2020-02-06
Friday 2020-02-07
Saturday 2020-02-08
Sunday 2020-03-01
Monday 2020-03-02
Tuesday 2020-03-03
Wednesday 2020-03-04
Thursday 2020-03-05
Friday 2020-03-06
Saturday 2020-03-07
Expected output:
Sunday 2020-01-05
Monday 2020-01-06
Tuesday 2020-01-07
Wednesday 2020-01-08
Thursday 2020-01-09
Friday 2020-01-10
Saturday 2020-01-11
Sunday 2020-02-02
Monday 2020-02-03
Tuesday 2020-02-04
Wednesday 2020-02-05
Thursday 2020-02-06
Friday 2020-02-07
Saturday 2020-02-08
Sunday 2020-03-01
Monday 2020-03-02
Tuesday 2020-03-03
Wednesday 2020-03-04
Thursday 2020-03-05
Friday 2020-03-06
Saturday 2020-03-07
How can I adjust the awk command to get the expected output?
Any other solutions using other bash tools are also welcome.
I suggest the following alternative to awk:
#! /usr/bin/env bash
for month in {01..03}; do
for day in {01..13}; do
date -d "2020-$month-$day" '+%A %F'
done |
grep -A6 -m1 -F Sunday
done
The script is not very efficient, but it does the job. For each month, we simply print the first 13 days of that month. The green zone has to fall within that range (the first Sunday is on day 7 at the latest, so the zone ends on day 13 at the latest), so we do not need the remaining days of the month.
The date format is Weekday YYYY-MM-DD. We use grep to find and print the first Sunday, print the 6 days after that Sunday (-A6), and exit, because we limited the search to one match (-m1).
The procedure described above is done for each of the months 1 to 3.
Here's a simple way to get GNU awk to create a list of dates and day names for any given year:
$ cat tst.awk
BEGIN {
year = (year == "" ? 2020 : year)
beg = mktime(year " 1 1 12 0 0")
for (i=0; i<=400; i++) {
dateday = strftime("%F %A", beg+24*60*60*i)
split(dateday,d,/[ -]/)
if ( d[1] != year ) {
break
}
print d[1], d[2], d[3], d[4]
}
}
.
$ awk -f tst.awk | head -20
2020 01 01 Wednesday
2020 01 02 Thursday
2020 01 03 Friday
2020 01 04 Saturday
2020 01 05 Sunday
2020 01 06 Monday
2020 01 07 Tuesday
2020 01 08 Wednesday
2020 01 09 Thursday
2020 01 10 Friday
2020 01 11 Saturday
2020 01 12 Sunday
2020 01 13 Monday
2020 01 14 Tuesday
2020 01 15 Wednesday
2020 01 16 Thursday
2020 01 17 Friday
2020 01 18 Saturday
2020 01 19 Sunday
2020 01 20 Monday
I'm starting at noon and looping from 0 to 400 days and breaking when the year changes just so I don't have to try to accommodate DST or leap years or leap seconds in the determination of days in the year in a more accurate calculation.
Just add some code to test for the current month being different from the previous and the current day name being a Sunday and print 7 days starting there, e.g.:
$ cat tst.awk
BEGIN {
year = (year == "" ? 2020 : year)
beg = mktime(year " 1 1 12 0 0")
for (i=0; i<=400; i++) {
dateday = strftime("%F %A", beg+24*60*60*i)
split(dateday,d,/[ -]/)
if ( d[1] != year ) {
break
}
dayName[d[2]+0][d[3]+0] = d[4]
}
for (monthNr=1; monthNr<=3; monthNr++) {
for (dayNr=1; dayNr in dayName[monthNr]; dayNr++) {
if (dayName[monthNr][dayNr] == "Sunday") {
for (i=0; i<7; i++) {
printf "%s %04d-%02d-%02d\n", dayName[monthNr][dayNr+i], year, monthNr, dayNr+i
}
break
}
}
}
}
.
$ awk -f tst.awk
Sunday 2020-01-05
Monday 2020-01-06
Tuesday 2020-01-07
Wednesday 2020-01-08
Thursday 2020-01-09
Friday 2020-01-10
Saturday 2020-01-11
Sunday 2020-02-02
Monday 2020-02-03
Tuesday 2020-02-04
Wednesday 2020-02-05
Thursday 2020-02-06
Friday 2020-02-07
Saturday 2020-02-08
Sunday 2020-03-01
Monday 2020-03-02
Tuesday 2020-03-03
Wednesday 2020-03-04
Thursday 2020-03-05
Friday 2020-03-06
Saturday 2020-03-07
There are slightly more efficient ways to do it but the above is clear and simple and will run in the blink of an eye.
A (rather wordy - I don't have time to make it shorter:-) ) Perl solution:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Time::Piece;
use Time::Seconds;
my $year = shift || localtime->year;
first_week($year, $_) for 1 ..12;
sub first_week {
my ($yr, $mn) = @_;
$mn = sprintf '%02d', $mn;
# Use midday to avoid DST issues
my $start = Time::Piece->strptime(
"$yr-$mn-01 12:00:00",
'%Y-%m-%d %H:%M:%S'
);
$start += ONE_DAY while $start->day ne 'Sun';
for (1 .. 7) {
say $start->strftime('%A %Y-%m-%d');
$start += ONE_DAY;
}
}
Try this
for i in $(seq 12); do cal ${i} 2020 | awk -v month=${i} 'NF==7 && !/^Su/{ for (j=0;j<=6;j++){print "2020-"month"-"$1+j;}exit}'; done
EDIT : Updated code for printing day
for i in $(seq 2); do cal ${i} 2020 | awk -v month=${i} 'NF==7 && !/^Su/{for (j=0;j<=6;j++){print strftime("%A %F", mktime("2020 " month " " $1+j " 0 0 0"))}exit}'; done;
Demo for Jan and Feb
$ for i in $(seq 2); do cal ${i} 2020 | awk -v month=${i} 'NF==7 && !/^Su/{a[0]="Sunday";a[1]="Monday";a[2]="Tuesday";a[3]="Wednesday";a[4]="Thursday";a[5]="Friday";a[6]="Saturday";for (j=0;j<=6;j++){print a[j]" " "2020-"month"-"$1+j}exit}'; done;
Sunday 2020-1-5
Monday 2020-1-6
Tuesday 2020-1-7
Wednesday 2020-1-8
Thursday 2020-1-9
Friday 2020-1-10
Saturday 2020-1-11
Sunday 2020-2-2
Monday 2020-2-3
Tuesday 2020-2-4
Wednesday 2020-2-5
Thursday 2020-2-6
Friday 2020-2-7
Saturday 2020-2-8
$
With Perl, using DateTime
use warnings;
use strict;
use feature 'say';
use DateTime;
my $dt = DateTime->new(year => 2020, month => 1, day => 1);
my $first_sunday = 7 - $dt->day_of_week + 1; # day of month for first Sun
while (1) {
my $day = $dt->day;
if ($day >= $first_sunday and $day < $first_sunday + 7) {
say $dt->ymd, " (", $dt->day_abbr, ")";
}
}
continue {
$dt->add(days => 1);
if ($dt->day == 1) { # new month
last if $dt->month > 3;
$first_sunday = 7 - $dt->day_of_week + 1;
}
}
This keeps state (on the first of each month it finds out what day the first Sunday is), which is quite suitable if the program is meant to generate and go through all dates in the span of interest.
On the other hand, the program may need to check a given day; perhaps it runs daily and needs to check for that day. Then it is simpler to see whether the day is between the first and second Sunday in the month:
my $dt = DateTime->today;
while ( $dt->add(days => 1)->month <= 3) {
if ($dt->day_of_week == 7) { # it's a Sunday
if ($dt->weekday_of_month == 1) { # first Sunday in the month
say $dt->ymd, " (", $dt->day_abbr, ")";
}
}
else {
my $sdt = $dt->clone; # preserve $dt
$sdt->subtract( days => $dt->day_of_week ); # drop to previous Sunday
if ($sdt->weekday_of_month == 1) { # was first Sunday in the month
say $dt->ymd, " (", $dt->day_abbr, ")";
}
}
}
The while loop around the code is there to facilitate a check.
For days other than Sunday we drop to the past Sunday, to check whether that was the first Sunday in the month. If so, then our day is within the required interval. If the day is a Sunday we only need to check whether it is the first one in the month.
The code can be made a bit more efficient and concise if that matters
if ( (my $dow = $dt->day_of_week) == 7) {
if ($dt->weekday_of_month == 1) {
say $dt->ymd, " (", $dt->day_abbr, ")";
}
}
elsif ( $dt->clone->subtract(days => $dow)->weekday_of_month == 1 ) {
say $dt->ymd, " (", $dt->day_abbr, ")";
}
... at the expense of readability.
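The check described above (drop back to the most recent Sunday and see whether it is the first one in the month) reduces to plain arithmetic: that Sunday is the first one exactly when its day of month is between 1 and 7. A minimal shell sketch follows. The function name in_green_zone and its variable names are made up. The weekday is assumed to use the date +%w convention (0 = Sunday ... 6 = Saturday).

```shell
# Hypothetical helper: in_green_zone DAY_OF_MONTH WEEKDAY
# Returns success when the day lies in the week that starts on the first Sunday.
in_green_zone() {
    gz_day=${1#0}                      # strip a leading zero so 08 is not read as octal
    gz_sunday=$(( gz_day - $2 ))       # day of month of the most recent Sunday
    [ "$gz_sunday" -ge 1 ] && [ "$gz_sunday" -le 7 ]
}
```

A daily job could then start with something like: in_green_zone "$(date +%d)" "$(date +%w)" || exit 0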
$ printf "%s\n" 2020-{01..03}-01 \
| xargs -I{} date -d "{}" "+{} %u" \
| join -j3 - <(seq 0 6) \
| xargs -n3 sh -c 'date -d "$1 + 7 days - $2 days + $3 days" "+%A %F"' --
There is some nasty stuff in here, but I'll try to explain. The idea is to compute the day of the week of the first day of the month (call it u). If you know that, you know directly which day is the first Sunday (7-u days later). So from that point forward you only need to compute the next 6 days.
Use brace expansion to generate the months you are interested in
Use xargs to compute the day of the week and output it as YYYY-MM-DD u
Per day, we want to create a list of 7 strings YYYY-MM-DD u d where d runs from 0 to 6. For this we use a nasty join hack. By telling join to join two files on a non-existing field, we create an outer product.
Use xargs in combination with sh to create a command that accepts 3 arguments and does the computation.
This method is now easily expanded to other months and years:
$ printf "%s\n" 20{20..30}-{01..12}-01 | xargs ...
The above looks a bit messy, and you might be more interested in the loop version:
for yyyymm in {2020..2030}-{01..03}; do
u=$(date -d "$yyyymm-01" "+%u");
for ((dd=7-u;dd<14-u;++dd)); do
date -d "$yyyymm-01 + $dd days" "+%A %F"
done
done
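The same arithmetic (the first Sunday falls 7-u days after the 1st) can be sketched without GNU date. The block below swaps in Sakamoto's day-of-week formula, which is not part of the answer above, to find the weekday of the 1st. The function names green_zone and dow are made up for illustration.

```shell
# Hypothetical helper: green_zone YEAR FIRST_MONTH LAST_MONTH
green_zone() {
    awk -v y="$1" -v m1="$2" -v m2="$3" '
    # Sakamoto day-of-week formula: returns 0 for Sunday ... 6 for Saturday.
    function dow(y, m, d,    t) {
        split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")
        if (m < 3) y--
        return (y + int(y / 4) - int(y / 100) + int(y / 400) + t[m] + d) % 7
    }
    BEGIN {
        split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", name, " ")
        for (m = m1; m <= m2; m++) {
            first = 1 + (7 - dow(y, m, 1)) % 7     # day of month of the first Sunday
            for (i = 0; i < 7; i++)
                printf "%s %04d-%02d-%02d\n", name[i + 1], y, m, first + i
        }
    }'
}
```

For example, green_zone 2020 1 3 lists the 21 days from Sunday 2020-01-05 through Saturday 2020-03-07. The zone never crosses a month boundary, since the first Sunday is on day 7 at the latest.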
Previous solution:
This is for the first 3 months of 2020:
$ printf "%s\n" 2020-{01..03}-{01..13} \
| xargs -n1 -I{} date -d '{}' '+%A %F' \
| awk -F"[- ]" '/Sun/{a[$3]++} a[$3]==1'
This is for the years 2020 through 2030:
$ printf "%s\n" 20{20..30}-{01..12}-{01..13} \
| xargs -n1 -I{} date -d '{}' '+%A %F' \
| awk -F"[- ]" '/Sun/{a[$2,$3]++} a[$2,$3]==1'
This is understood in 3 steps:
Use brace expansion to create a list of the first 13 days of the months and years you are interested in. This works nicely because bash expands the braces left to right, which means that the day is the fastest-running index. We ask for the first 13 days because the first Sunday must fall within the first 7 days, so the zone ends on day 13 at the latest.
Convert the days to the expected format using xargs and date
Use awk to do the filtering.
By adding one more condition, a[mm]<8 && a[mm]>0, I was able to make it work:
seq 75 | awk '
BEGIN { ti=" 0 0 0" }
function dtf(fmt,dy) {
return strftime(fmt,mktime("2020 1 " dy ti))
}
{ day=dtf("%A %F",$0);
mm=dtf("%m",$0);
if(day~/Sunday/ || a[mm]) a[mm]++ ;
if(a[mm]<8 && a[mm]>0 ) print day
}'
Output:
Sunday 2020-01-05
Monday 2020-01-06
Tuesday 2020-01-07
Wednesday 2020-01-08
Thursday 2020-01-09
Friday 2020-01-10
Saturday 2020-01-11
Sunday 2020-02-02
Monday 2020-02-03
Tuesday 2020-02-04
Wednesday 2020-02-05
Thursday 2020-02-06
Friday 2020-02-07
Saturday 2020-02-08
Sunday 2020-03-01
Monday 2020-03-02
Tuesday 2020-03-03
Wednesday 2020-03-04
Thursday 2020-03-05
Friday 2020-03-06
Saturday 2020-03-07
As an additional note: though I hardcoded 1 for the month, when the day parameter is >31 mktime() just moves on to the next month. So, in a way, you can pass a Julian day to mktime() with the month set to 1.
echo -e "1\n31\n32\n60\n61\n366" | awk '
BEGIN { ti=" 0 0 0" }
function dtf(fmt,dy) {
return strftime(fmt,mktime("2020 1 " dy ti))
}
{
day=dtf("%A %F",$0);
j=dtf("%j",$0);
print j,day
}'
Output:
001 Wednesday 2020-01-01
031 Friday 2020-01-31
032 Saturday 2020-02-01
060 Saturday 2020-02-29
061 Sunday 2020-03-01
366 Thursday 2020-12-31
I'm trying to convert the string in REC_TIME column to a timestamp format in hive.
Ex: Sun Jul 31 09:28:20 UTC 2016 => 2016-07-31 09:28:20
SELECT xxx, UNIX_TIMESTAMP(REC_TIME, "E M dd HH:mm:ss z yyyy") FROM wlogs LIMIT 10;
When I execute the above SQL it returns a NULL value.
Try this :
select from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC 2016","EEE MMM dd HH:mm:ss zzz yyyy"));
This works fine if your Hive cluster runs in the UTC timezone. If your server is in, say, CST, then you need to do the following to get to UTC:
select to_utc_timestamp(from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC 2016","EEE MMM dd HH:mm:ss zzz yyyy")),'CST');
Hope this helps.
EDIT
Hive date functions use the Java SimpleDateFormat patterns. Refer to the SimpleDateFormat documentation for the pattern letters.
Be aware that my computer runs on PDT:
[cloudera@quickstart ~]$ date +%Z
PDT
So the UTC time is converted to 2:28:20 PDT. Anyway this is not the point.
Your HH is fine for 24-hour times (hh is the 12-hour form), but you need at least 3 M's for the month: a single M only matches a numeric month, not an abbreviation such as Jul.
0: jdbc:hive2://quickstart:10000/default> select from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC 2016", 'E MMM dd HH:mm:ss z yyyy')) as date;
+----------------------+--+
| date |
+----------------------+--+
| 2016-07-31 02:28:20 |
+----------------------+--+
I want to get the data between two times in a log file that spans different months and dates. If my start time is not present in the log file, I want to extract the data from the nearest later time in the log file. Likewise, if the entered end time is not present in the log file, the extraction has to end before the end time.
My log file data,
Apr 10 16 02:07:20 Data 1
Apr 11 16 02:07:20 Data 1
May 10 16 04:11:09 Data 2
May 12 16 04:11:09 Data 2
Jun 11 16 06:22:35 Data 3
Jun 12 16 06:22:35 Data 3
The solution I am using is,
awk -v start="$StartTime" -v stop="$EndTime" 'start <= $StartTime && $EndTime <= stop' $file
where I am storing my start time in $StartTime and my end time in $EndTime. But I am not getting the exact output. Please help.
Something like this maybe:
$ BashVarStart="16 05 10 00 00 00" # the same format that awk function will reformat to
$ BashVarStop="16 06 11 00 00 00"
$ awk -v start="$BashVarStart" -v stop="$BashVarStop" -F"[ :]" -v OFS=" " '
function reformatdate(m,d,y,h,mm,s) { # basically throw year to the beginning
monstr="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"; # numerize the months
split(monstr,monarr," "); # split monstr to an array to enumerate the months
# monarr[1]="Jan", monarr[2]="Feb" etc
for(i in monarr) { # iterate over all month numbers in monarr index
if(monarr[i]==m) # when month number matches
m=sprintf("%02d",i) # zeropad if month number below 10: 9 -> 09
};
return y" "m" "d" "h" "mm" "s # return in different order
}
start < reformatdate($1,$2,$3,$4,$5,$6) && stop > reformatdate($1,$2,$3,$4,$5,$6)
' test.in
May 10 16 04:11:09 Data 2
May 12 16 04:11:09 Data 2
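The approach works because a zero-padded "year month day hour minute second" key sorts as a string in the same order as the timestamps themselves, so a start or stop time that is missing from the log still acts as a boundary. Here is a compact sketch of the same idea, with assumptions: the bounds are inclusive here (the code above uses strict comparisons), two-digit years are only comparable within one century, and the name filter_range is made up.

```shell
# Hypothetical helper: filter_range "YY MM DD HH MM SS" "YY MM DD HH MM SS" < logfile
filter_range() {
    awk -v start="$1" -v stop="$2" -F'[ :]' '
    {
        # Month abbreviation to number via its position in the packed string.
        mon = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $1) + 2) / 3
        # Build a sortable key with the year first and every part zero-padded.
        key = sprintf("%02d %02d %02d %02d %02d %02d", $3, mon, $2, $4, $5, $6)
    }
    key >= start && key <= stop      # plain string comparison; prints matching lines
    '
}
```

Usage: filter_range "16 05 10 00 00 00" "16 06 11 00 00 00" < test.in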
I'm just wondering how we can use awk to do exact matches,
e.g.
$ cal 09 09 2009
September 2009
Su Mo Tu We Th Fr Sa
1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30
$ cal 09 09 2009 | awk '{day="9"; col=index($0,day); print col }'
17
0
0
11
20
0
8
0
As you can see, the above command outputs the index of "9" for every line that contains it. Is there a way to make awk output the index only for the 4th line of the cal output above? Maybe there is an even more elegant solution?
I'm using awk to get the day name using the cal command. here's the whole line of code:
$ dayOfWeek=$(cal $day $month $year | awk '{day='$day'; split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array); column=index($o,day); dow=int((column+2)/3); print array[dow]}')
The problem with the above code is that if multiple matches are found then i get multiple results, whereas i want it to output only one result.
Thanks!
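Restrict the rule to only those lines which contain your "day" as a whole word (\< and \> are GNU awk word-boundary operators), then look for the matching field. Note that the field number equals the weekday column only on full weeks; on a partial first week the fields are shifted to the left: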
awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Proof of Concept
$ cal 02 1956
February 1956
Su Mo Tu We Th Fr Sa
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
$ day=18; cal 02 1956 | awk -v day=$day 'BEGIN{split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", array)} $0 ~ "\\<"day"\\>"{for(i=1;i<=NF;i++)if($i == day){print array[i]}}'
Saturday
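A column-based variant avoids depending on the field number, which matters for a partial first week. This is only a sketch. It assumes the classic cal layout shown above (Sunday first, each day in a 2-character cell followed by a space, no highlighting escape codes in piped output). The function name day_name_from_cal is made up.

```shell
# Hypothetical helper: cal MONTH YEAR | day_name_from_cal DAY
day_name_from_cal() {
    awk -v day="$1" '
    BEGIN { split("Sunday Monday Tuesday Wednesday Thursday Friday Saturday", name, " ") }
    NR > 2 {                               # skip the title and the Su Mo Tu ... header
        for (k = 0; k < 7; k++)            # cell k occupies columns 3k+1 and 3k+2
            if (substr($0, 3 * k + 1, 2) + 0 == day + 0) {
                print name[k + 1]
                exit
            }
    }'
}
```

Empty cells evaluate to 0, so they can never match a real day of the month.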
Update
If all you are looking for is to get the day of the week from a certain date, you should really be using the date command like so:
$ day=9;month=9;year=2009;
$ dayOfWeek=$(date +%A -d "$month/$day/$year")
$ echo $dayOfWeek
Wednesday
you wrote
cal 09 09 2009
I'm not aware of a version of cal that accepts day of month as an input,
only
cal ${mon} (optional) ${year} (optional)
But, that doesn't affect your main issue.
you wrote
is there a way to make awk output index number in only the 4th line of cal output above.?
NR (Num Rec) is your friend
and there are numerous ways to use it.
cal 09 09 2009 | awk 'NR==4{day="9"; col=index($0,day); print col }'
OR
cal 09 09 2009 | awk '{day="9"; if (NR==4) {col=index($0,day); print col } }'
ALSO
In awk, if you have variable assignments that should be used throughout your whole program, then it is better to use the BEGIN section so that the assignment is only performed once. Not a big deal in your example, but why form bad habits ;-)?
HENCE
cal 09 2009 | awk 'BEGIN{day="9"}; NR==4 {col=index($0,day); print col }'
FINALLY
It is not completely clear what problem you are trying to solve. Are you sure you always want to grab line 4? If not, then how do you propose to solve that?
Problems stated as " 1. I am trying to do X. 2. Here is my input. 3. Here is my output. 4. Here is the code that generated that output" are much easier to respond to.
It looks like you're trying to do date calculations. You can build much more robust and general solutions by using the GNU date command. I have seen numerous useful discussions of this tagged as bash, shell, (date?).
I hope this helps.
This is so much easier to do in a language that has time functionality built-in. Tcl is great for that, but many other languages are too:
$ echo 'puts [clock format [clock scan 9/9/2009] -format %a]' | tclsh
Wed
If you want awk to only output for line 4, restrict the rule to line 4:
$ awk 'NR == 4 { ... }'