Formatting and converting dates and time - bash

I have a very large (13 GiB) CSV file (3856321 rows and 1698 columns) in which, as expected, some of the dates are formatted differently. The file looks like this:
2013/01/08 2:11:30 AM,abdc,good time ...
2015/12/28 8:19:30 PM,abdc,good time ...
2/15/2016 10:46:30 AM,kdafh,almost as good ...
12/13/2014 10:46:00 PM,asjhdk,not that good ...
02-Jan-2014,bad time,good time ...
1/1/2015,nomiss time,boy ...
10/15/2016 17:08:30,bad,boy ...
I want to convert everything to the same time format. The required output is:
1/8/2013 2:11:30,abdc,good time
12/28/2015 20:19:30,abdc,good time
2/15/2016 10:46:30,kdafh,almost as good
12/13/2014 22:46:00,asjhdk,not that good
1/2/2014 00:00:00,bad time,good time
1/1/2015 00:00:00,nomiss time,boy
10/15/2016 17:08:30,bad,boy
I managed to format the time using the following scripts
awk -F ',' 'BEGIN{FS=OFS=","}{split($1,a," ");
{ split(a[2],b,":");
# tmp2=system("date -d `tmp` +%m/%d/%Y");
# print tmp2
$1=tmp" "a[2]
}1' time_input.csv
I borrowed the idea of formatting dates with date from another question; that attempt is the commented-out system() call. However, this does not work in my case. I get an error
date: invalid date ‘+%m/%d/%Y’
Is there an easier and better way to do this? Thanks in advance

With Python, using the dateutil and csv modules:
import csv
import dateutil.parser as parser

with open('time_input.csv', newline='') as inputfile, open('time_output.csv', 'w', newline='') as outputfile:
    reader = csv.reader(inputfile, delimiter=',')
    writer = csv.writer(outputfile)
    for row in reader:
        # Parse the first column, whatever its format, and rewrite it uniformly
        row[0] = parser.parse(row[0]).strftime('%m/%d/%Y %H:%M:%S')
        writer.writerow(row)
The result is output to time_output.csv file.
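If installing dateutil is not an option, something along these lines might work with the standard library alone. This is only a sketch: the format list is an assumption based on the sample rows in the question, normalize and KNOWN_FORMATS are names made up here, and the month abbreviation and AM/PM parsing assume an English/C locale.

```python
from datetime import datetime

# Formats seen in the question's sample rows (an assumption; extend as needed).
KNOWN_FORMATS = [
    "%Y/%m/%d %I:%M:%S %p",   # 2013/01/08 2:11:30 AM
    "%m/%d/%Y %I:%M:%S %p",   # 2/15/2016 10:46:30 AM
    "%m/%d/%Y %H:%M:%S",      # 10/15/2016 17:08:30
    "%d-%b-%Y",               # 02-Jan-2014
    "%m/%d/%Y",               # 1/1/2015
]


def normalize(stamp, out_format="%m/%d/%Y %H:%M:%S"):
    """Return stamp rewritten in out_format, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(stamp.strip(), fmt).strftime(out_format)
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError("unrecognized date format: %r" % stamp)


for sample in ["2015/12/28 8:19:30 PM", "02-Jan-2014", "1/1/2015"]:
    print(normalize(sample))
```

In the csv loop, normalize(row[0]) would take the place of the dateutil call. A fixed format list is stricter than dateutil's guessing, so an unexpected format raises an error instead of being silently misread.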

Awk is certainly a great way of doing it, but it's really early in the morning here and I'd rather not think about all those ifs, so here is one in PHP, which has a really nice strtotime function:
$ cat program.php
<?php
$handle = fopen("file", "r");
if ($handle) {
    while (($line = fgets($handle)) !== false) {
        // process the line read.
        $arr = explode(",", $line, 2);
        echo date("m/d/Y H:i:s", strtotime($arr[0])), ",", $arr[1];
    }
    fclose($handle);
} else {
    // error opening the file.
}
Run it:
$ php -f program.php
01/08/2013 02:11:30,abdc,good time
12/28/2015 20:19:30,abdc,good time
02/15/2016 10:46:30,kdafh,almost as good
12/13/2014 22:46:00,asjhdk,not that good
01/02/2014 00:00:00,bad time,good time
01/01/2015 00:00:00,nomiss time,boy
10/15/2016 17:08:30,bad,boy
The line-by-line read loop comes from here: How to read a file line by line in php. I only added the lines with explode and strtotime.
explode splits the line in two at the first , and stores the pieces in the array $arr. strtotime is applied to the first element, $arr[0]; $arr[1] is then output as-is.

You can try below awk command -
vipin@kali:~$ cat kk.txt
2013/01/08 2:11:30 AM,abdc,good time
2015/12/28 8:19:30 PM,abdc,good time
2/15/2016 10:46:30 AM,kdafh,almost as good
12/13/2014 10:46:00 PM,asjhdk,not that good
02-Jan-2014,bad time,good time
1/1/2015,nomiss time,boy
10/15/2016 17:08:30,bad,boy
filtering -
vipin@kali:~$ awk -F"," '{split($1,a," "); printf ("%s,%s,%s",$2,$3,",");system("date -d \""a[1]" "a[2]"\" +\"%m/%d/%Y %H:%M:%S\"")}' kk.txt
abdc,good time,,01/08/2013 02:11:30
abdc,good time,,12/28/2015 08:19:30
kdafh,almost as good,,02/15/2016 10:46:30
asjhdk,not that good,,12/13/2014 10:46:00
bad time,good time,,01/02/2014 00:00:00
nomiss time,boy,,01/01/2015 00:00:00
bad,boy,,10/15/2016 17:08:30
Move the filtered output to file kk.txt2
vipin@kali:~$ awk -F"," '{split($1,a," "); printf ("%s,%s,%s",$2,$3,",");system("date -d \""a[1]" "a[2]"\" +\"%m/%d/%Y %H:%M:%S\"")}' kk.txt > kk.txt2
vipin@kali:~$ awk -F"," '{print $NF,$1,$2}' OFS="," kk.txt2
01/08/2013 02:11:30,abdc,good time
12/28/2015 08:19:30,abdc,good time
02/15/2016 10:46:30,kdafh,almost as good
12/13/2014 10:46:00,asjhdk,not that good
01/02/2014 00:00:00,bad time,good time
01/01/2015 00:00:00,nomiss time,boy
10/15/2016 17:08:30,bad,boy
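Explanation -
Use the split function on column 1 to put its parts into the array a, then use awk's system function to call date and format the date as needed. Note that only a[1] and a[2] are passed to date, so the AM/PM marker in a[3] is dropped; that is why the PM rows above come out as 08:19:30 and 10:46:00 rather than 20:19:30 and 22:46:00. Passing the whole of $1 to date should preserve it.
I could print the output in the required order, but then a leading zero was printed, so I print the formatted date in the last column instead and write the result to another file.
Finally, you can print the columns in the order you want.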


Sorting a columns value from a large csv(more than a million records) using awk or bash

I am new to shell scripting.
I have a huge csv file which contains more than 100k rows. I need to find a column and sort it and write it to another file and later I need to process this new file.
below is the sample data
As you can see, field 4 contains data that itself contains commas. I need output in which the values inside field 4 are sorted, as below:
To get this, I have written the script below, but it does not seem to be efficient: for 100k records it took 20 minutes, so I am looking for a more efficient solution.
#this command replaces the comma inside "" with | so that I can split the line based on ','(comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv
while read line
do
    #break the line on comma ',' and get the array of strings.
    IFS=',' read -ra data <<< "$line" #'data' is the array of the record of full line.
    #take the 8th column, which is the reportable jurisdiction.
    echo "REPORTABLE_JURISDICTION is : " ${data[4]}
    #break the data based on pipe '|' and sort the data
    IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
    #Sort this array
    IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
    #printf "[%s]\n" "${sorted[@]}"
    separator="|" # e.g. constructing regex, pray it does not contain %s
    regex="$( printf "${separator}%s" "${sorted[@]}" )"
    regex="${regex:${#separator}}" # remove leading separator
    echo "${regex}"
    echo "$data[68]"
    #here we are building the whole line which will be written to the output file.
    separator="," # e.g. constructing regex, pray it does not contain %s
    regex="$( printf "${separator}%s" "${data[@]}" )"
    regex="${regex:${#separator}}" # remove leading separator
    echo "${regex}" >> temp2.csv
    echo $count
done < temp.csv
#remove the '|' from the and put the comma back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the tailing , if any
sed 's/,$//' temp3.csv > $OUT_FILE
How to make it faster?
You're using the wrong tools for the task. CSV seems simple enough to process with shell tools, but your code will break for cells that contain newlines. Also, bash isn't very fast when processing lots of data.
Try a tool which understands CSV directly, or use a programming language like Python. That allows you to do the task without starting external processes, the syntax is much more readable, and the result will be much more maintainable. Note: I'm suggesting Python because of the low initial cost.
With Python and the csv module, the code above would look like this:
import csv

FEED_FILE = '...'
OUT_FILE = '...'

with open(OUT_FILE, 'w', newline='') as out:
    with open(FEED_FILE, newline='') as inp:  # 'in' is a reserved word in Python
        reader = csv.reader(inp, delimiter=',', quotechar='"')
        writer = csv.writer(out, delimiter=',', quotechar='"')
        for row in reader:
            # Sort the comma-separated values inside the 4th field
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)
That said, there is nothing obviously wrong with your code. There is not much that you can do to speed up awk and sed and the main bash loop doesn't spawn many external processes as far as I can see.
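To illustrate the point about quoted cells, here is a small self-contained sketch. The two sample rows are invented for the demonstration (the question's real data is not shown), and sort_field is a hypothetical helper name. It shows the csv module keeping a quoted field with embedded commas, and one with an embedded newline, intact while the values inside field 4 are sorted.

```python
import csv
import io


def sort_field(csv_text, index=3):
    """Sort the comma-separated values inside one field of every row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        row[index] = ",".join(sorted(row[index].split(",")))
        writer.writerow(row)
    return out.getvalue()


# Invented sample: row 1 has commas inside field 4,
# row 2 also has a newline inside field 2.
sample = 'a,b,c,"US,DE,FR",e\n' 'x,"multi\nline",z,"UK,AT",w\n'
print(sort_field(sample))
```

A line-oriented shell loop would see the second record as two separate lines, which is the failure mode described above.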
With a single awk (GNU awk, since asort is a gawk extension):
awk 'BEGIN{ FS=OFS="\042,\042"}{ n=split($4,a,","); asort(a); sf=a[1];
for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv
output.csv contents:
FS=OFS="\042,\042" - considering "," as field separator
split($4,a,",") - split the 4th field into array by separator ,
asort(a) - sort the array by values
Try pandas in Python 3. The only limitation: the data needs to fit into memory, and in memory it can be a bit larger than your actual data is. I sorted CSV files with 30.000.000 rows without any problem using this script, which I quickly wrote:
import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)
    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))
    s = datetime.datetime.now()
    # This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
    # Note: index=False leaves the 'ts' index column out of the output file.
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))

How to compare a field of a file with current timestamp and print the greater and lesser data?

How do I compare the current timestamp with a field of a file and print the matched and unmatched data? I have 2 columns in a file (see below)
oac.bat 09:09
klm.txt 9:00
I want to compare the timestamp (2nd column) with the current time, say 10:00, and print the output as follows.
At 10:00
xyz.txt 10:32
mnp.csv 23:54
oac.bat 09:09
klm.txt 9:00
Could anyone help me on this please ?
I used awk $0 > "10:00", which gives me only the 2nd column details, but I want the details of both columns. I am taking the timestamp directly from the system with a variable like
d=`date +%H:%M`
With GNU awk you can just use its built-in time functions:
awk 'BEGIN{now = strftime("%H:%M")} {
    split($2,t,":")
    cur = sprintf("%02d:%02d",t[1],t[2])
    print > ((cur > now ? "greater" : "lesser") ".txt")
}' file
With other awks just set now using -v and date up front, e.g.:
awk -v now="$(date +"%H:%M")" '{
    split($2,t,":")
    cur = sprintf("%02d:%02d",t[1],t[2])
    print > ((cur > now ? "greater" : "lesser") ".txt")
}' file
The above is untested since you didn't provide input/output we could test against.
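The sprintf("%02d:%02d", ...) step in the awk code matters because cur and now are compared as strings, so an unpadded hour such as 9:00 would sort after 10:00. A minimal Python sketch of the same pitfall (pad is just a made-up helper name for the illustration):

```python
def pad(hhmm):
    """Zero-pad H:MM to HH:MM, like sprintf("%02d:%02d") in the awk answer."""
    hours, minutes = hhmm.split(":")
    return "%02d:%02d" % (int(hours), int(minutes))


now = "10:00"
print("9:00" > now)        # True: compared character by character, "9" sorts after "1"
print(pad("9:00") > now)   # False: "09:00" sorts before "10:00"
```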
Pure Bash
The script can be implemented in pure Bash with the help of date command:
# Current Unix timestamp
let cmp_seconds=$(date +%s)
# Read file line by line
while IFS= read -r line; do
let line_seconds=$(date -d "${line##* }" +%s) || continue
(( line_seconds <= cmp_seconds )) && \
outfile=lesser || outfile=greater
# Append the line to the file chosen above
printf "%s\n" "$line" >> "${outfile}.txt"
done < file
In this script, ${line##* } removes the longest match of the '* ' pattern (any characters followed by a space) from the front of $line, thus fetching the last column (the time). The time column is supposed to be in one of the following formats: HH:MM or H:MM. Actually, date's -d option argument
can be in almost any common format. It can contain month names, time zones, ‘am’ and ‘pm’, ‘yesterday’, etc.
We use the flexibility of this option to convert the time (HH:MM, or H:MM) to Unix timestamp.
The let builtin allows arithmetic to be performed on shell variables. If the last let expression fails, or evaluates to zero, let returns 1 (error code), otherwise 0 (success). Thus, if for some reason the time column is in invalid format, the iteration for such line will be skipped with the help of continue.
Here is a Perl version I have written just for fun. You may use it instead of the Bash version, if you like.
# For current date
#cmp_seconds=$(date +%s)
# For specific hours and minutes
cmp_seconds=$(date -d '10:05' +%s)
perl -e '
my @t = localtime('$cmp_seconds');
my $minutes = $t[2] * 60 + $t[1];
while (<>) {
/ (\d?\d):(\d\d)$/ or next;
my $fh = ($1 * 60 + $2) > $minutes ? STDOUT : STDERR;
printf $fh "%s", $_;
}' < file >greater.txt 2>lesser.txt
The script computes the number of minutes in the following way:
HH:MM = HH * 60 + MM minutes
If the number of minutes from the file is greater than the number of minutes for the current time, it prints the line to the standard output, otherwise to standard error. Finally, the standard output is redirected to greater.txt, and the standard error is redirected to lesser.txt.
I have written this script for demonstration of another approach (algorithm), which can be implemented in different languages, including Bash.
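For instance, the same minutes-based comparison could be sketched in Python roughly like this. The function name split_by_time is made up, the rows are the four lines from the question, and the reference time is passed in rather than read from the clock so the result is repeatable.

```python
import re

# A trailing " H:MM" or " HH:MM", the same pattern the Perl version matches.
TIME_AT_END = re.compile(r" (\d?\d):(\d\d)$")


def split_by_time(lines, now_hhmm):
    """Partition lines by whether their trailing time is later than now_hhmm."""
    now_h, now_m = now_hhmm.split(":")
    now_minutes = int(now_h) * 60 + int(now_m)
    greater, lesser = [], []
    for line in lines:
        match = TIME_AT_END.search(line)
        if not match:
            continue  # skip lines without a trailing time
        minutes = int(match.group(1)) * 60 + int(match.group(2))
        (greater if minutes > now_minutes else lesser).append(line)
    return greater, lesser


rows = ["xyz.txt 10:32", "mnp.csv 23:54", "oac.bat 09:09", "klm.txt 9:00"]
greater, lesser = split_by_time(rows, "10:00")
print(greater)  # ['xyz.txt 10:32', 'mnp.csv 23:54']
print(lesser)   # ['oac.bat 09:09', 'klm.txt 9:00']
```

Comparing integers rather than strings means 9:00 and 09:00 are treated the same without any padding step.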

Carving data from log file

I have a log file containing the data below:
time=1460196536.247325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=13ms requests=517 option1=0 option2=0 errors=0 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
I am trying to write a Bash script that carves the following values from each line of the log file and writes them to a second file:
time (converted to local time, GMT+2), latency99, requests and errors
Desired output in second file:
time latency99 requests errors
12:08:56 13 517 0
Is the easiest way to use regex for this?
Here's a Bash solution for version 4 and above, using an associative array:
# Assoc array to hold data.
declare -A data
# Log file ( the input file ).
logfile=$1
# Output file.
output_file=$2
# Print column names for required values.
printf '%-20s %-10s %-10s %-10s\n' time latency99 requests errors > "$output_file"
# Iterate over each line in $logfile
while read -ra arr; do
    # Insert keys and values into 'data' array.
    for i in "${arr[@]}"; do
        data["${i%%=*}"]="${i#*=}"
    done
    # Convert time to GMT+2
    gmt2_time=$(TZ=GMT+2 date -d "@${data[time]}" '+%T')
    # Append results to the output file.
    printf '%-20s %-10s %-10s %-10s\n' "$gmt2_time" "${data[latency99]%ms}" "${data[requests]}" "${data[errors]}" >> "$output_file"
done < "$logfile"
As you can see, the script accepts two arguments. The first one is the file name of the logfile, and the second is the output file to which parsed data will be inserted line by line for each row in the logfile.
Please notice that I used GMT+2 as the value of the TZ variable. In a POSIX-style TZ string the sign is inverted, so GMT+2 actually means two hours behind UTC.
Use the exact area as the value instead, for example TZ="Europe/Berlin".
You might want to use the tool tzselect to know the correct string value of your area.
In order to test it, I created the following logfile, containing 3 different rows of input:
time=1260196536.242325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=10ms requests=100 option1=0 option2=0 errors=1 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
time=1460246536.244325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=20ms requests=200 option1=0 option2=0 errors=2 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
time=1260236536.147325 latency=3:6:7:9:16:(8)ms latency95=11ms latency99=30ms requests=300 option1=0 option2=0 errors=3 throughput=480rps ql=1 rr=0.00% cr=0.00% accRequests=101468 accOption1=0 accOption2=0 accLatency=2:6:7:8:3998:(31)ms accLatency95=11ms accLatency99=649ms accOpenQueuing=1664 accErrors=278
Let's run the test ( script name is sof ):
$ ./sof logfile parsed_logfile
$ cat parsed_logfile
time latency99 requests errors
12:35:36 10 100 1
22:02:16 20 200 2
23:42:16 30 300 3
According to OP request as can be seen in the comments, and as discussed further in chat, I edited the script to include the following features:
Remove ms suffix from latency99's value.
Read input from a logfile, line by line, parse and output results to a selected file.
Include column names only in the first row of output.
Convert the time value to GMT+2.
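For comparison, the same key=value carving could be sketched in Python as below. This is an illustration only: parse_line is a made-up name, the sample line is a shortened copy of the one in the question, and the fixed +2 hour offset is an assumption (a named zone would be needed to follow daylight saving changes).

```python
from datetime import datetime, timedelta, timezone


def parse_line(line, utc_offset_hours=2):
    """Return (time, latency99, requests, errors) carved from one log line."""
    # Every whitespace-separated token has the form key=value.
    fields = dict(token.split("=", 1) for token in line.split())
    tz = timezone(timedelta(hours=utc_offset_hours))
    stamp = datetime.fromtimestamp(float(fields["time"]), tz)
    latency99 = fields["latency99"]
    if latency99.endswith("ms"):
        latency99 = latency99[:-2]  # strip the unit suffix
    return stamp.strftime("%H:%M:%S"), latency99, fields["requests"], fields["errors"]


line = ("time=1460196536.247325 latency=3:6:7:9:16:(8)ms latency95=11ms "
        "latency99=13ms requests=517 option1=0 option2=0 errors=0")
print("%-20s %-10s %-10s %-10s" % ("time", "latency99", "requests", "errors"))
print("%-20s %-10s %-10s %-10s" % parse_line(line))
```

For the sample line this yields 12:08:56, 13, 517 and 0, matching the desired output in the question.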
Here is an awk script for you. Say the logfile is mc.log and the script is saved as mc.awk; you would run it like this: awk -f mc.awk mc.log with GNU awk.
BEGIN {
    # some "" to align header and values in output
    print "time", "", "latency99", "requests", "errors"
}
function getVal( str) {
    # strip leading "key=" and trailing "ms" from str
    gsub(/^.*=/, "", str)
    gsub(/ms$/, "", str)
    return str
}
function fmtTime( timeStamp ){
    val=getVal( timeStamp )
    return strftime( "%H:%M:%S", val)
}
{
    # some "" to align header and values in output
    print fmtTime($1), getVal($4), "", getVal($5), "", getVal($8)
}
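Here's an awk version (not GNU). Converting the date would require a call to an external program:
#!/usr/bin/awk -f
BEGIN {
    # Assumed separator (the original FS line is missing): each "key=" label,
    # plus an optional "ms" unit and blanks before it, so that field 1 is empty.
    FS = "(ms)?[ \t]*[A-Za-z0-9]+="
    print "time", "latency99", "requests", "errors"
}
{
    print $2, $5, $6, $9
}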

Why does awk skip the second field in first entry?

I have a manually created log file of the format
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
2:05 TOTAL time spent
There are many entries in the log. To avoid manually recomputing total time every time an entry is added, I wrote the following script:
file=`ls | grep log`
head -n -1 $file | egrep -o [0-9]:[0-9]{2}[^ap] \
| awk '{ FS = ":" ; SUM += 60*$1 ; SUM += $2 } END { print SUM }'
First, the script assumes there is exactly one file with log in its name, and that's the file I'm after. Second, it takes all lines other than the line with the current total, greps the time information from the line, and feeds it to awk, which converts it to minutes.
This is where I run into problems. The final sum would always be slightly off. Through trial and error, I discovered that awk will never count the second field of the very first record, e.g. the 45 minutes in this case. It will count the hour; it won't count the minutes. It has no such problem with the other records, but it's always off by the minutes in the first record.
What could be causing this behavior? How do I debug it?
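You set FS inside the main action, which runs only after the first record has already been split into fields, so the new separator takes effect from the second line on and it's already too late for the first line.
The right way to do it is to set FS in a BEGIN block: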
echo -e "1:45\n0:20" | awk 'BEGIN { FS=":" } { SUM += 60*$1 + $2 } END { print SUM }'
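To see why only the first record is affected, here is a rough Python model of awk's record loop. It is a sketch of the behaviour, not awk's actual implementation; awk_num and total_minutes are names made up for the illustration.

```python
import re


def awk_num(s):
    """Mimic awk's string-to-number rule: leading numeric prefix, else 0."""
    m = re.match(r"\s*[-+]?\d+(\.\d+)?", s)
    return float(m.group()) if m else 0.0


def total_minutes(lines, set_fs_in_begin):
    """Sum 60*$1 + $2 the way the awk one-liner does."""
    fs = ":" if set_fs_in_begin else None  # None = split on whitespace (awk's default)
    total = 0.0
    for line in lines:
        fields = line.split(fs)  # the record is split BEFORE the action runs
        fs = ":"                 # the action's FS = ":" only affects later records
        f1 = fields[0] if len(fields) > 0 else ""
        f2 = fields[1] if len(fields) > 1 else ""
        total += 60 * awk_num(f1) + awk_num(f2)
    return int(total)


lines = ["1:45", "0:20"]
print(total_minutes(lines, set_fs_in_begin=False))  # 80: the 45 minutes are lost
print(total_minutes(lines, set_fs_in_begin=True))   # 125
```

With the default separator the first record has a single field, "1:45", which awk converts to the number 1, and an empty $2, so only the hour is counted.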
You did not show us what output you expect.
Is it something like this?
$ cat log
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
2:05 TOTAL time spent
Awk Code
awk '$3~/([[:digit:]]):([[:digit:]])/ && !/TOTAL/{
    split($3,A,":")
    sum+=A[1]*60 + A[2]
}
END{
    print "Total",sum,"Minutes"
}' log
Total 125 Minutes

Humanized dates with awk?

I have this awk script that runs through a file and counts every occurrence of a given date. The date format in the original file is the standard date format, like this: Thu Mar 5 16:46:15 EST 2009. I use awk to throw away the weekday, time, and timezone, and then do my counting by pumping the dates into an associative array with the dates as indices.
In order to get the output to be sorted by date, I converted the dates to a different format that I could sort with bash sort.
Now, my output looks like this:
Date Count
03/05/2009 2
03/06/2009 1
05/13/2009 7
05/22/2009 14
05/23/2009 7
05/25/2009 7
05/29/2009 11
06/02/2009 12
06/03/2009 16
I'd really like the output to have more human readable dates, like this:
Mar 5, 2009
Mar 6, 2009
May 13, 2009
May 22, 2009
May 23, 2009
May 25, 2009
May 29, 2009
Jun 2, 2009
Jun 3, 2009
Any suggestions for a way I could do this? If I could do this on the fly when I output the count values that would be best.
Here's my solution incorporating ghostdog74's example code:
grep -i "E[DS]T 2009" original.txt | awk '{printf "%s %2.d, %s\r\n",$2,$3,$6}' >dates.txt #outputs dates for counting
date -f dates.txt +'%Y %m %d' | awk ' #reformat dates as YYYYMMDD for future sort
{++total[$0]} #pump dates into associative array
END {
for (item in total) printf "%s\t%s\r\n", item, total[item] #output dates as yyyy mm dd with counts
}' | sort -t \t | awk ' #send to sort, then to cleanup
BEGIN {printf "%s\t%s\r\n","Date","Count"}
{t=$1" "$2" "$3" 0 0 0" #cleanup using example by ghostdog74
printf "%s\t%2.d\r\n",strftime("%b %d, %Y",mktime(t)),$4
}'
rm dates.txt
Sorry this looks so messy. I've tried to put clarifying comments in.
Use awk's sort and date's stdin to greatly simplify the script
Date will accept input from stdin so you can eliminate one pipe to awk and the temporary file. You can also eliminate a pipe to sort by using awk's array sort and as a result, eliminate another pipe to awk. Also, there's no need for a coprocess.
This script uses date for the monthname conversion which would presumably continue to work in other languages (ignoring the timezone and month/day order issues, though).
The end result looks like "grep|date|awk". I have broken it into separate lines for readability (it would be about half as big if the comments were eliminated):
grep -i "E[DS]T 2009" original.txt |
date -f - +'%Y %m %d' | #reformat dates as YYYYMMDD for future sort
awk '
BEGIN { printf "%s\t%s\r\n","Date","Count" }
{ ++total[$0] } #pump dates into associative array
END {
    idx=1
    for (item in total) {
        d[idx]=item;idx++ # copy the array indices into the contents of a new array
    }
    c=asort(d) # sort the contents of the copy
    for (i=1;i<=c;i++) { # use the contents of the copy to index into the original
        printf "%s\t%2.d\r\n",strftime("%b %e, %Y",mktime(d[i]" 0 0 0")),total[d[i]]
    }
}'
I get testy when I see someone using grep and awk (and sed, cut, ...) in a pipeline. Awk can fully handle the work of many utilities.
Here's a way to clean up your updated code to run in a single instance of awk (well, gawk), and using sort as a co-process:
gawk '
function mon2num(mon) {
    return(((index("JanFebMarAprMayJunJulAugSepOctNovDec", mon)-1)/3)+1)
}
/ E[DS]T [[:digit:]][[:digit:]][[:digit:]][[:digit:]]/ {
    month=$2; day=$3; year=$6
    date=sprintf("%4d%02d%02d", year, mon2num(month), day)
    total[date]++
    human[date] = sprintf("%3s %2d, %4d", month, day, year)
}
END {
    sort_coprocess = "sort"
    for (date in total) {
        print date |& sort_coprocess
    }
    close(sort_coprocess, "to")
    print "Date\tCount"
    while ((sort_coprocess |& getline date) > 0) {
        print human[date] "\t" total[date]
    }
    close(sort_coprocess)
}
' original.txt
If you are using gawk:
awk 'BEGIN{
    s="05/03/2009"   # assumed sample in DD/MM/YYYY order, as the index order below implies
    split(s,date,"/")
    t=date[3]" "date[2]" "date[1]" 0 0 0"
    print strftime("%b %d",mktime(t))
}'
The above is just an example; you did not show your actual code, so I cannot incorporate it into yours.
Why don't you prepend your awk-date to the original date? This yields a sortable key, but is human readable.
(Note: to sort right, you should make it yyyymmdd)
If needed, cut can remove the prepended column.
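The sortable-key idea can be sketched in Python as follows, assuming the input lines look like the sample date in the question. The function name count_by_day is made up, and parsing the month abbreviation assumes an English/C locale.

```python
from collections import Counter
from datetime import datetime


def count_by_day(lines):
    """Count lines per calendar day; return (yyyymmdd key, human date, count) sorted by key."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # not in the "Thu Mar 5 16:46:15 EST 2009" shape
        _weekday, month, day, _time, _zone, year = parts
        counts[datetime.strptime(f"{year} {month} {day}", "%Y %b %d").date()] += 1
    return [(d.strftime("%Y%m%d"), f"{d:%b} {d.day}, {d.year}", n)
            for d, n in sorted(counts.items())]


sample = [
    "Thu Mar 5 16:46:15 EST 2009",
    "Wed May 13 09:00:00 EDT 2009",
    "Thu Mar 5 17:00:00 EST 2009",
    "Fri Mar 6 08:00:00 EST 2009",
]
for key, human, n in count_by_day(sample):
    print(f"{key}\t{human}\t{n}")
```

The first column is the yyyymmdd key that makes the rows sortable; dropping it when printing leaves only the human-readable date and the count.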
Gawk has strftime(). You can also call the date command to format them (man). Linux Forums gives some examples.
