Filtering Month in CSV using Bash - bash

Hey guys, I have been working on a problem where I have a CSV file and need to filter records for a specific month based on input from the user.
Record Format
firstName,lastName,YYYYMMDD
The catch is that the input is a string (a month name) while the month in the file is numeric.
For example
> cat guest.csv
Micheal,Scofield,20000312
Lincon,Burrows,19981009
Sara,Tancredi,20040923
Walter,White,20051024
Barney,Stinson,20041230
Ted,Mosbey,20031126
Eric,Forman,20070430
Jake,Peralta,20030808
Amy,Santiago,19990405
Colt,Bennett,19990906
> ./list.sh March guest.csv
Micheal,Scofield,20000312

Oneliner:
MONTH=March; REGEX=$(date -d "1 ${MONTH} 2022" +%m..$); grep "$REGEX" guest.csv
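If the day field could ever collide with the month pattern, a slightly safer variant (a sketch, assuming GNU date for the name-to-number conversion) anchors the match to the whole YYYYMMDD field:

```shell
# Convert the month name to its two-digit number, then anchor the
# pattern to the full date field so the month digits can only match
# in the month position.
month=March
mm=$(date -d "1 ${month} 2022" +%m)    # "March" -> "03"
grep -E ",[0-9]{4}${mm}[0-9]{2}$" guest.csv
```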

Awk can easily translate month names to numbers and do the filtering.
awk -v month="March" -F , '
BEGIN { split("January February March April May June July August September October November December", mon, " ");
for(i=1; i<=12; i++) mm[i] = mon[i] }
mm[0 + substr($3, 5, 2)] == month' guest.csv
The BEGIN block sets up a pair of associative arrays which can be used in the main script to look up a month number by name. Change -v month="April" to search for a different month.
If you want to wrap this in a shell script, you can easily parse out the arguments into variables:
#!/bin/sh
monthname=$1
shift
awk -v month="$monthname" -F , '
BEGIN { split("January February March April May June July August September October November December", mon, " ");
for(i=1; i<=12; i++) mm[i] = mon[i] }
mm[0 + substr($3, 5, 2)] == month' "$@"
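A hypothetical session, assuming the wrapper above is saved as list.sh (with `"$@"` as the final argument so the remaining arguments are passed as filenames) alongside the guest.csv from the question:

```shell
# Hypothetical run of the wrapper against the sample file.
chmod +x list.sh
./list.sh March guest.csv
# Micheal,Scofield,20000312
```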

Related

Need a script to split a large file by month that can determine year based off order of the logs

I need to split a large syslog file that goes from October 2015 to February 2016 into separate files by month. Due to background log retention, the format of these logs is similar to:
Oct 21 08:00:00 - Log info
Nov 16 08:00:00 - Log Info
Dec 25 08:00:00 - Log Info
Jan 11 08:00:00 - Log Info
Feb 16 08:00:00 - Log Info
This large file is the result of an initial zgrep search across a large amount of log files split by day. Example being, user activity on a network across multiple services such as Windows/Firewall/Physical access logs.
For a previous request, I used the following:
gawk 'BEGIN{
m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",mth,"|")
}
{
for(i=1;i<=m;i++){ if ( mth[i]==$1){ month = i } }
tt="2015 "month" "$2" 00 00 00"
date= strftime("%Y%m",mktime(tt))
print $0 > FILENAME"."date".txt"
}
' logfile
output file examples (note: sometimes I add "%d" to get the day, but not this time):
Test.201503.txt
Test.201504.txt
Test.201505.txt
Test.201506.txt
This script however adds 2015 manually to the output log file name. What I attempted, and failed to do, was a script that creates variables out of each month at 1-12 and then sets 2015 as a variable (a) and 2016 as variable (b). Then the script would be able to compare when going in the order of 10, 11, 12, 1, 2 which would go in order and once it gets to 1 < 12 (the previous month) it would know to use 2016 instead of 2015. Odd request I know, but any ideas would at least help me get in the right mindset.
You could use date to parse the date and time. E.g.
#!/bin/bash
while IFS=- read -r time info; do
    mon=$(date --date "$time" +%m | sed 's/^0//')
    if (( mon < 10 )); then
        year=2016
    else
        year=2015
    fi
    echo "$time - $info" >> "Test.$year$(printf "%02d" "$mon").txt"
done < logfile
Here is a gawk solution based on your script and your observation in the question. The idea is to detect a new year when the number of the month suddenly gets smaller, eg from 12 to 1. (Of course that will not work if the log has Jan 2015 directly followed by Jan 2016.)
script.awk
BEGIN { START_YEAR= 2015
# configure months and a mapping month -> nr, e.g. "Feb" |-> "02"
split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",monthNames,"|")
for( nr in monthNames) { month2Nr[ monthNames[ nr ] ] = sprintf("%02d", nr ) }
yearCounter=0
}
{
currMonth = month2Nr[ $1 ]
# detect a jump to the next year by a reset in the month number
if( prevMonth > currMonth) { yearCounter++ }
newFilename = sprintf("%s.%d%s.txt", FILENAME, (START_YEAR + yearCounter), currMonth)
prevMonth = currMonth
print $0 > newFilename
}
Use it like this: awk -f script.awk logfile
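As a quick sanity check (a sketch using two sample lines from the question), the rollover logic can be exercised like this:

```shell
# Two log lines whose month number drops from 10 to 01; script.awk
# should split them into logfile.201510.txt and logfile.201601.txt.
printf '%s\n' \
  'Oct 21 08:00:00 - Log info' \
  'Jan 11 08:00:00 - Log Info' > logfile
awk -f script.awk logfile
ls logfile.*.txt
```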

Change date format - bash or php

I have been gathering data for the last 20 days using a bash script that runs every 5 minutes. I started the script with no idea how I was going to output the data. I have since found a rather cool js graph that reads from a CSV.
Only issue is my date is currently in the format of:
Fri Nov 6 07:52:02
and for the CSV I need it to be
2015-11-06 07:52:02
So I need to cat my results grep-ing for the date and convert it.
The cat/grep for the date is:
cat speeds.txt | grep 2015 | awk '{print $1" "$2" "$3" "$4}'
Any brainwaves on how I can switch this around either using bash or php?
Thanks
PS - Starting the checks again using date +%Y%m%d" "%H:%M:%S is sadly not an option :(
Assuming all of your lines contain dates:
$ cat file
Fri Nov 6 07:52:02
...
$ awk 'BEGIN {
months["Jan"] = 1;
months["Feb"] = 2;
months["Mar"] = 3;
months["Apr"] = 4;
months["May"] = 5;
months["Jun"] = 6;
months["Jul"] = 7;
months["Aug"] = 8;
months["Sep"] = 9;
months["Oct"] = 10;
months["Nov"] = 11;
months["Dec"] = 12;
}
{
month = months[$2];
printf("%s-%02d-%02d %s\n", 2015, month, $3, $4);
}' file > out
$ cat out
2015-11-06 07:52:02
...
If you only need to modify some of the lines, you can tweak the awk script a little bit, e.g. match every line containing 2015:
...
# Match every line containing 2015
/2015/ {
month = months[$2];
printf("%s-%02d-%02d %s\n", 2015, month, $3, $4);
# Use next to prevent the final print from also firing for these lines
# (like 'continue' in a while loop)
next;
};
# This '1' will print all other lines as well:
# Same as writing { print $0 }
1
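A self-contained sketch of that pattern (the BEGIN month table is trimmed to the one entry needed, and /Nov/ stands in for whatever pattern selects the date lines):

```shell
printf '%s\n' 'Fri Nov 6 07:52:02' 'some other line' |
awk 'BEGIN { months["Nov"] = 11 }   # trimmed month table, Nov only
     /Nov/ { printf("%s-%02d-%02d %s\n", 2015, months[$2], $3, $4); next }
     1'                             # "1" prints all other lines untouched
```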
You can convert the date to epoch time and back in a bash script:
date -d 'Fri Nov 6 07:52:02' +%s
1446776522
date -d @1446776522 +"%Y-%m-%d %T"
2015-11-06 07:52:02
Since you didn't provide the input, I'll assume you have a file called speeds.txt that contains:
Fri Oct 31 07:52:02 3
Fri Nov 1 08:12:04 4
Fri Nov 2 07:43:22 5
(the 3, 4, and 5 above are just to show that you could have other data in the row, but are not necessary).
Using this command:
cat speeds.txt | cut -d ' ' -f2,3,4 | while read line ; do date -d"$line" "+2015-%m-%d %H:%M:%S" ; done;
You get the output:
2015-10-31 07:52:02
2015-11-01 08:12:04
2015-11-02 07:43:22

How to decode Seagate's hard drive date code in a Bash script

Seagate hard drives display a code instead of the manufacturing date. The code is described here and an online decoder is available here.
In short, it's a 4 or 5 digit number of the form YYWWD or YYWD, where:
YY is the year, 00 is year 1999
W or WW is the week number beginning 1
D is day of week beginning 1
Week 1 begins on the first Saturday of July in the stated year
Examples
06212 means Sunday 20 November 2005
0051 means Saturday 31 July 1999
How can this be decoded in a bash script ?
This is what I did, it should work:
#!/bin/bash
DATE=$1
REGEX="^(..)(..?)(.)$"
[[ $DATE =~ $REGEX ]]
# 10# forces base 10 so codes like "08" aren't parsed as invalid octal
YEAR=$(( 10#${BASH_REMATCH[1]} + 1999 ))
WEEK=$(( 10#${BASH_REMATCH[2]} - 1 ))
DAYOFWEEK=$(( 10#${BASH_REMATCH[3]} - 1 ))
OFFSET=$(( 6 - $(date -d "$YEAR-07-01" +%u) ))
DATEOFFIRSTSATURDAY=$(date -d "$YEAR-7-01 $OFFSET days" +%d)
FINALDATE=`date -d "$YEAR-07-$DATEOFFIRSTSATURDAY $WEEK weeks $DAYOFWEEK days"`
echo $FINALDATE
It worked for the two dates given above...
If you want to customize the date output, add a format string at the end of the FINALDATE assignment.
Here is a short script, it takes two arguments: $1 is the code to convert and $2 is an optional format (see man date), otherwise defaulted (see code).
It uses the last Saturday in June instead of the first one in July because I found it easier to locate, and it allowed me to just add the relevant number of weeks and days to it. (Note that awk's $6 picks out Saturday only with a Monday-first cal layout; in the default Sunday-first layout, Saturday is $7.)
#!/bin/bash
date_format=${2:-%A %B %-d %Y}
code=$1
[[ ${#code} =~ ^[4-5]$ ]] || { echo "bad code"; exit 1; }
year=$(( 1999 + 10#${code:0:2} ))  # base 10, so "08"/"09" aren't read as octal
[[ ${#code} == 4 ]] && week=${code:2:1} || week=${code:2:2}
day=${code: -1}
june_last_saturday=$(cal 06 ${year} | awk '{ $6 && X=$6 } END { print X }')
date -d "${year}-06-${june_last_saturday} + ${week} weeks + $((${day}-1)) days" "+${date_format}"
Examples:
$ seadate 06212
Sunday November 20 2005
$ seadate 0051
Saturday July 31 1999
I created a Seagate Date Code Calculator that actually works with pretty good accuracy. I've posted it here on this forum for anyone to use: https://www.data-medics.com/forum/seagate-date-code-conversion-translation-tool-t1035.html#p3261
It's far more accurate than the other ones online which often point to the entirely wrong year. I know it's not a bash script, but will still get the job done for anyone else who's searching how to do this.
Enjoy!

Running shell commands within AWK

I'm trying to work on a logfile, and I need to be able to specify the range of dates. So far (before any processing), I'm converting a date/time string to timestamp using date --date "monday" +%s.
Now, I want to be able to iterate over each line in a file, but check if the date (in a human readable format) is within the allowed range. To do this, I'd like to do something like the following:
echo `awk '{if(`date --date "$3 $4 $5 $6 $7" +%s` > $START && `date --date "" +%s` <= $END){/*processing code here*/}}' myfile`
I don't even know if that's possible... I've tried a lot of variations, plus I couldn't find anything understandable/usable online.
Thanks
Update:
Example of myfile is as follows. Its logging IPs and access times:
123.80.114.20 Sun May 01 11:52:28 GMT 2011
144.124.67.139 Sun May 01 16:11:31 GMT 2011
178.221.138.12 Mon May 02 08:59:23 GMT 2011
Given what you have to do, it's really not that hard, and it is much more efficient to do your date processing by converting to strings and comparing.
Here's a partial solution that uses associative arrays to convert the month value to a number. Then you rely on the %02d format specifier to ensure 2 digits. You can reformat the dateTime value with '.', etc or leave the colons in the hr:min:sec if you really need the human readability.
The YYYYMMDD format is a big help in these sort of problems, as LT, GT, EQ all work without any further formatting.
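A quick illustration of that point: zero-padded YYYYMMDD keys compare chronologically whether they are treated as numbers or as strings.

```shell
a=20110501; b=20110502
awk -v a="$a" -v b="$b" 'BEGIN {
    if (a + 0 < b + 0) print "numeric: a is earlier"    # arithmetic compare
    if (a "" < b "")   print "string: a is earlier"     # lexicographic compare
}'
```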
echo "178.221.138.12 Mon May 02 08:59:23 GMT 2011" \
| awk -v StartTime=20110105235959 'BEGIN {
mons["Jan"]=1 ; mons["Feb"]=2; mons["Mar"]=3
mons["Apr"]=4 ; mons["May"]=5; mons["Jun"]=6
mons["Jul"]=7 ; mons["Aug"]=8; mons["Sep"]=9
mons["Oct"]=10 ; mons["Nov"]=11; mons["Dec"]=12
}
{
# 178.221.138.12 Mon May 02 08:59:23 GMT 2011
printf("dateTime=%04d%02d%02d%02d%02d%02d\n",
$NF, mons[$3], $4, substr($5,1,2), substr($5,4,2), substr($5,7,2) )
} '
The -v StartTime is illustrative of how to pass in your startTime value (in the matching YYYYMMDDHHMMSS format); note that -v options must come before the program text.
I hope this helps.
Here's an alternative approach using awk's built-in mktime() function. I've never bothered with the month parsing until now - thanks to shelter for that part (see accepted answer). It always feels like it's time to switch languages around that point.
#!/bin/bash
# input format:
#(1 2 3 4 5 6 7)
#123.80.114.20 Sun May 01 11:52:28 GMT 2011
awk -v startTime=1304252691 -v endTime=1306000000 '
BEGIN {
mons["Jan"]=1 ; mons["Feb"]=2; mons["Mar"]=3
mons["Apr"]=4 ; mons["May"]=5; mons["Jun"]=6
mons["Jul"]=7 ; mons["Aug"]=8; mons["Sep"]=9
mons["Oct"]=10 ; mons["Nov"]=11; mons["Dec"]=12;
}
{
hmsSpaced=$5; gsub(":"," ",hmsSpaced);
timeInSec=mktime($7" "mons[$3]" "$4" "hmsSpaced);
if (timeInSec > startTime && timeInSec <= endTime) print $0
}' myfile
(I've chosen example time thresholds to select only the last two log lines.)
Note that if the mktime() function were a bit smarter this whole thing would reduce to:
awk -v startTime=1304252691 -v endTime=1306000000 '{t=mktime($7" "$3" "$4" "$5); if (t > startTime && t <= endTime) print $0}' myfile
I'm not sure of the format of the data you're parsing, but I do know that you can't use the backticks within single quotes. You'll have to use double quotes. If there are too many quotes being nested, and it's confusing you, you can also just save the output of your date command to a variable beforehand.
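That last suggestion, as a sketch: compute the epoch bounds with date in the shell first (GNU date assumed), then hand them to awk with -v instead of nesting quotes.

```shell
# Precompute the range in the shell, then pass it into awk with -v.
start=$(date --date "2011-05-01 00:00:00 GMT" +%s)
end=$(date --date "2011-05-02 00:00:00 GMT" +%s)
awk -v s="$start" -v e="$end" 'BEGIN { print (e - s) " seconds in range" }'
```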

Humanized dates with awk?

I have this awk script that runs through a file and counts every occurrence of a given date. The date format in the original file is the standard date format, like this: Thu Mar 5 16:46:15 EST 2009 I use awk to throw away the weekday, time, and timezone, and then do my counting by pumping the dates into an associative array with the dates as indices.
In order to get the output to be sorted by date, I converted the dates to a different format that I could sort with bash sort.
Now, my output looks like this:
Date Count
03/05/2009 2
03/06/2009 1
05/13/2009 7
05/22/2009 14
05/23/2009 7
05/25/2009 7
05/29/2009 11
06/02/2009 12
06/03/2009 16
I'd really like the output to have more human readable dates, like this:
Mar 5, 2009
Mar 6, 2009
May 13, 2009
May 22, 2009
May 23, 2009
May 25, 2009
May 29, 2009
Jun 2, 2009
Jun 3, 2009
Any suggestions for a way I could do this? If I could do this on the fly when I output the count values that would be best.
UPDATE:
Here's my solution incorporating ghostdog74's example code:
grep -i "E[DS]T 2009" original.txt | awk '{printf "%s %2.d, %s\r\n",$2,$3,$6}' >dates.txt #outputs dates for counting
date -f dates.txt +'%Y %m %d' | awk ' #reformat dates as YYYYMMDD for future sort
{++total[$0]} #pump dates into associative array
END {
for (item in total) printf "%s\t%s\r\n", item, total[item] #output dates as yyyy mm dd with counts
}' | sort | awk ' #send to sort, then to cleanup
BEGIN {printf "%s\t%s\r\n","Date","Count"}
{t=$1" "$2" "$3" 0 0 0" #cleanup using example by ghostdog74
printf "%s\t%2.d\r\n",strftime("%b %d, %Y",mktime(t)),$4
}'
rm dates.txt
Sorry this looks so messy. I've tried to put clarifying comments in.
Use awk's sort and date's stdin to greatly simplify the script
Date will accept input from stdin so you can eliminate one pipe to awk and the temporary file. You can also eliminate a pipe to sort by using awk's array sort and as a result, eliminate another pipe to awk. Also, there's no need for a coprocess.
This script uses date for the monthname conversion which would presumably continue to work in other languages (ignoring the timezone and month/day order issues, though).
The end result looks like "grep|date|awk". I have broken it into separate lines for readability (it would be about half as big if the comments were eliminated):
grep -i "E[DS]T 2009" original.txt |
date -f - +'%Y %m %d' | #reformat dates as YYYYMMDD for future sort
awk '
BEGIN { printf "%s\t%s\r\n","Date","Count" }
{ ++total[$0] }  # pump dates into associative array
END {
idx=1
for (item in total) {
d[idx]=item;idx++ # copy the array indices into the contents of a new array
}
c=asort(d) # sort the contents of the copy
for (i=1;i<=c;i++) { # use the contents of the copy to index into the original
printf "%s\t%2.d\r\n",strftime("%b %e, %Y",mktime(d[i]" 0 0 0")),total[d[i]]
}
}'
I get testy when I see someone using grep and awk (and sed, cut, ...) in a pipeline. Awk can fully handle the work of many utilities.
Here's a way to clean up your updated code to run in a single instance of awk (well, gawk), and using sort as a co-process:
gawk '
BEGIN {
IGNORECASE = 1
}
function mon2num(mon) {
return(((index("JanFebMarAprMayJunJulAugSepOctNovDec", mon)-1)/3)+1)
}
/ E[DS]T [[:digit:]][[:digit:]][[:digit:]][[:digit:]]/ {
month=$2
day=$3
year=$6
date=sprintf("%4d%02d%02d", year, mon2num(month), day)
total[date]++
human[date] = sprintf("%3s %2d, %4d", month, day, year)
}
END {
sort_coprocess = "sort"
for (date in total) {
print date |& sort_coprocess
}
close(sort_coprocess, "to")
print "Date\tCount"
while ((sort_coprocess |& getline date) > 0) {
print human[date] "\t" total[date]
}
close(sort_coprocess)
}
' original.txt
if you are using gawk
awk 'BEGIN{
s="03/05/2009"
m=split(s,date,"/")
t=date[3]" "date[1]" "date[2]" 0 0 0"
print strftime("%b %d",mktime(t))
}'
the above is just an example; as you did not show your actual code, I cannot incorporate it into your script.
Why don't you prepend your awk-date to the original date? This yields a sortable key, but is human readable.
(Note: to sort right, you should make it yyyymmdd)
If needed, cut can remove the prepended column.
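That decorate-sort-undecorate idea, sketched end to end (the month table uses the index() trick from the accepted answer; the sample lines are hypothetical):

```shell
printf '%s\n' 'Mar 6, 2009 1' 'Mar 5, 2009 2' |
awk '{ m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $1) + 2) / 3
       printf "%04d%02d%02d %s\n", $3, m, $2, $0 }' |  # prepend yyyymmdd key
sort |                                                 # now plain sort works
cut -d' ' -f2-                                         # strip the key again
```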
Gawk has strftime(). You can also call the date command to format them (man). Linux Forums gives some examples.
