Python pandas: index by time

I have a CSV file that looks like this:
"06/09/2013 14:08:34.930","7.2680542849633447","1.6151231744362988","0","0","21","1546964992","15.772567829158248","1577332736","8360","21.400382061280961","0","15","0","685","0","0","0","0","0","0","0","4637","0"
The CSV contains one month of daily values (24 hours per day).
I need to load it into pandas and then get some stats on the data (min, max), but only over records that fall within working hours (between 8:00 and 18:00) on each day.
I am very new to the pandas library.

Load your data:
import pandas as pd
from datetime import datetime
df = pd.read_csv('data.csv', header=None, index_col=0)
Filter your data for working hours from 8:00 to 18:00:
work_hours = lambda d: datetime.strptime(d, '%d/%m/%Y %H:%M:%S.%f').hour in range(8, 18)
df = df[[work_hours(d) for d in df.index]]  # list of booleans; a bare map object is not a valid indexer in Python 3
Get the min and max of the first data column:
col_min, col_max = df[1].min(), df[1].max()  # avoid shadowing the built-in min/max
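A more idiomatic alternative (a sketch, assuming your pandas version has DataFrame.between_time, which it has had for a long time): parse the first column as a DatetimeIndex while reading, then select rows by time of day. Note that between_time includes both endpoints by default, so 18:00:00 exactly would be kept, unlike the range(8, 18) check above.
import pandas as pd
# dayfirst=True matches the DD/MM/YYYY layout of the sample row
df = pd.read_csv('data.csv', header=None, index_col=0,
                 parse_dates=[0], dayfirst=True)
# with a DatetimeIndex, between_time() selects rows by time of day
work = df.between_time('08:00', '18:00')
col_min, col_max = work[1].min(), work[1].max()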

From the yfinance library, how can I read the ex-dividend date?

This code should return the ex-dividend date:
import yfinance as yf
yf.Ticker('AGNC').info['exDividendDate']
but I get this as an output:
1661817600
Is there a way to get the date from that number?
That number is a Unix timestamp: seconds elapsed since the epoch (1 January 1970, UTC). To get the calendar date, you can use pd.to_datetime to convert the seconds:
import pandas as pd
pd.to_datetime(1661817600, unit='s')
Out[6]: Timestamp('2022-08-30 00:00:00')
Or you can use Python's built-in datetime module:
from datetime import datetime
print(datetime.fromtimestamp(1661817600))
2022-08-30 08:00:00
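Note why the two results differ: pd.to_datetime with unit='s' interprets the number as UTC, while datetime.fromtimestamp converts it to the machine's local timezone (apparently UTC+8 wherever the output above was produced). If you want an unambiguous result, pass the timezone explicitly:
from datetime import datetime, timezone
# interpret the Unix timestamp explicitly as UTC
print(datetime.fromtimestamp(1661817600, tz=timezone.utc))
2022-08-30 00:00:00+00:00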

Days from dates in qlikview expression

I am trying to get days from given data like this:
In this data, suppose ID B has a start date of 4/10/2019 and an end date of 10/25/2019. That spans 7 months, April through October. For the first month the start date is 4/10/2019 and the end date is 4/30/2019, so only 10 days are availed from this month and the remaining days are 21. The same applies at the other end: the end date is 10/25/2019, and looking at the calendar the month ends on 10/31/2019, so only 6 days are availed. I want to get the data mentioned in the image above. I am trying this formula in QlikView:
=sum(
If(
MonthName(CalendarMonthEnd) = MonthName([End Date]),
([End Date]-CalendarMonthStart+1),
(RangeMin([End Date],CalendarMonthEnd)-RangeMax([Start Date],CalendarMonthStart))
)
)
Through this formula I get the remaining days, whereas I want the days that are availed.
Here is a link to a folder, please download and check:
https://www.dropbox.com/s/v48373io1bv9qqj/file_qlik.rar?dl=0
In this folder, the first table in the Excel file "output.. " is the output I need.
Just add another If:
=Sum(
  If(CalendarMonthStart >= [Start Date] and CalendarMonthEnd <= [End Date],
     CalendarMonthEnd - CalendarMonthStart,
     If([Start Date] > CalendarMonthStart,
        [Start Date] - CalendarMonthStart + 1,
        CalendarMonthEnd - [End Date])
  )
)
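If you want to sanity-check the arithmetic outside QlikView, here is a minimal Python sketch of the per-month overlap logic such an expression implements (the function name and the inclusive day-count convention are my assumptions, not part of the QlikView model):
from datetime import date
import calendar

def availed_days(start, end, year, month):
    # days of (year, month) that fall inside [start, end], both ends inclusive
    month_start = date(year, month, 1)
    month_end = date(year, month, calendar.monthrange(year, month)[1])
    lo, hi = max(start, month_start), min(end, month_end)
    return (hi - lo).days + 1 if hi >= lo else 0

start, end = date(2019, 4, 10), date(2019, 10, 25)
print(availed_days(start, end, 2019, 4))   # 21 (Apr 10-30)
print(availed_days(start, end, 2019, 10))  # 25 (Oct 1-25)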

PySpark round off timestamps to full hours?

I am interested in rounding off timestamps to full hours. What I got so far is to round to the nearest hour. For example with this:
df.withColumn("Full Hour", hour((round(unix_timestamp("Timestamp")/3600)*3600).cast("timestamp")))
But this round function uses HALF_UP rounding, which means 23:56 becomes 00:00, whereas I would prefer 23:00. Is this possible? I didn't find an option for setting the rounding behaviour of the function.
I think you're overcomplicating things. The hour function already returns the hour component of a timestamp, which is exactly the round-down behaviour you want.
from pyspark.sql.functions import to_timestamp, unix_timestamp, hour
from pyspark.sql import Row

df = (sc
      .parallelize([Row(Timestamp='2016_08_21 11_59_08')])
      .toDF()
      # HH (0-23) rather than hh (1-12), since there is no AM/PM marker
      .withColumn("parsed", to_timestamp("Timestamp", "yyyy_MM_dd HH_mm_ss")))
df2 = df.withColumn("Full Hour", hour(unix_timestamp("parsed").cast("timestamp")))
df2.show()
Output:
+-------------------+-------------------+---------+
| Timestamp| parsed|Full Hour|
+-------------------+-------------------+---------+
|2016_08_21 11_59_08|2016-08-21 11:59:08| 11|
+-------------------+-------------------+---------+
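If what you actually need is the whole timestamp floored to the hour, rather than just the hour number, date_trunc (available in pyspark.sql.functions since Spark 2.3) does that directly, for example:
from pyspark.sql.functions import date_trunc
# truncate the parsed timestamp down to the start of its hour
df3 = df.withColumn("Hour Floor", date_trunc("hour", "parsed"))
# 2016-08-21 11:59:08 -> 2016-08-21 11:00:00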

Speed up Pandas DateTime variable

I have a number of quite large CSV files (1,000,000 rows each) which contain a DateTime column. I am using pandas pivot tables to summarise them. Part of what this involves is splitting this DateTime variable into hours and minutes. I am using the following code, which works fine, but it takes quite a lot of time (around 4-5 minutes).
My question is: is this just because the files are so large / my laptop is too slow, or is there more efficient code that lets me split hours and minutes out of a DateTime variable?
Thanks
df['hours'], df['minutes'] = pd.DatetimeIndex(df['DateTime']).hour, pd.DatetimeIndex(df['DateTime']).minute
If the dtype of the DateTime column is not datetime64, first convert it with to_datetime. Then use dt.hour and dt.minute:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['hours'], df['minutes'] = df['DateTime'].dt.hour, df['DateTime'].dt.minute
Sample:
import pandas as pd
df = pd.DataFrame({'DateTime': ['2014-06-17 11:09:20', '2014-06-18 10:02:10']})
print(df)
              DateTime
0  2014-06-17 11:09:20
1  2014-06-18 10:02:10
print(df.dtypes)
DateTime    object
dtype: object
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['hours'], df['minutes'] = df['DateTime'].dt.hour, df['DateTime'].dt.minute
print(df)
             DateTime  hours  minutes
0 2014-06-17 11:09:20     11        9
1 2014-06-18 10:02:10     10        2
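If it is still slow, the parsing itself is usually the bottleneck. Passing an explicit format to to_datetime lets pandas skip per-row format inference, and computing the conversion once (instead of building a DatetimeIndex twice, as in the original code) halves the work. A sketch, assuming your timestamps all share one layout:
dt = pd.to_datetime(df['DateTime'], format='%Y-%m-%d %H:%M:%S')
df['hours'], df['minutes'] = dt.dt.hour, dt.dt.minute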

Awk and calculating start time from end time and duration

I have a file with date, end time and duration in decimal format and I need to calculate the start time. The file looks like:
20140101;1212;1.5
20140102;1515;1.58
20140103;1759;.69
20140104;1100;12.5
...
The duration 1.5 for the time 12:12 means one and a half hours, and the start time would be 12:12 - 1:30 = 10:42 AM; or 11:00 - 12.5 = 11:00 - 12:30 = 22:30 on the previous day. Is there an easy way to calculate such time differences in Awk, or is it the good ol' split-multiply-subtract-and-handle-the-day-break-yourself all over again?
Since the values are in hours and minutes, only the minutes matter and the seconds can be discarded; for example, duration 1.58 means 1:34 and the leftover 0.8 minutes can be discarded.
I'm on GNU Awk 4.1.3
As you are using gawk, take advantage of its native time functions:
gawk -F\; '{
    # build a datespec "YYYY MM DD HH MM SS" from fields 1 and 2
    tmst = sprintf("%s %s %s %s %s 00",
                   substr($1,1,4),
                   substr($1,5,2),
                   substr($1,7,2),
                   substr($2,1,2),
                   substr($2,3,2))
    t1 = mktime(tmst)                  # end time as epoch seconds
    seconds = sprintf("%f", $3) + 0    # duration, hours as a float
    seconds *= 60 * 60                 # hours -> seconds
    difference = strftime("%H%M", t1 - seconds)
    print $0 FS difference
}' file
Results:
20140101;1212;1.5;1042
20140102;1515;1.58;1340
20140103;1759;.69;1717
20140104;1100;12.5;2230
Check: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
Explanation:
tmst = sprintf(...): builds a date string from the file that conforms to the datespec that mktime expects: YYYY MM DD HH MM SS [DST].
t1 = mktime(tmst): turns the datespec into a timestamp that gawk can handle (the number of seconds elapsed since 1 January 1970).
seconds = sprintf("%f", $3) + 0: converts the third field to a float.
seconds *= 60 * 60: converts hours (as a float) to seconds.
difference = strftime("%H%M", t1 - seconds): formats the difference in a human-readable manner, as hours and minutes.
I highly recommend using a programming language that supports datetime calculations, because the details can be tricky, for instance around daylight saving time shifts. You can use Python, for example:
start_times.py:
import csv
from datetime import datetime, timedelta

with open('input.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for row in reader:
        end_day = row[0]
        end_time = row[1]
        # Create a datetime object for the end of the interval
        end = datetime.strptime(end_day + end_time, "%Y%m%d%H%M")
        # Translate the duration (hours as a float) into minutes
        duration = float(row[2]) * 60
        # Calculate the start time
        start = end - timedelta(minutes=duration)
        # Column 3 is the start day (can differ from the end day!)
        row.append(start.strftime("%Y%m%d"))
        # Column 4 is the start time
        row.append(start.strftime("%H%M"))
        print(';'.join(row))
Run:
python start_times.py
Output:
20140101;1212;1.5;20140101;1042
20140102;1515;1.58;20140102;1340
20140103;1759;.69;20140103;1717
20140104;1100;12.5;20140103;2230 <-- you see, the day matters!
The above example uses the system's timezone. If the input data refers to a different timezone, Python's datetime module allows you to specify it.
I would do something like this:
awk 'BEGIN{FS=OFS=";"}
{
    h = substr($2,1,2); m = substr($2,3,2)   # substr() is 1-indexed in awk
    mins = h*60 + m                          # end time in minutes
    diff = mins - $3*60                      # minus the duration in minutes
    print $0, int(diff/60) ":" int(diff%60)
}' file
That is, convert everything to minutes and then back to hours/minutes. Note that, unlike the answers above, this neither zero-pads the minutes nor handles a start time that falls on the previous day (diff goes negative for the 12.5-hour line).
Test
$ awk 'BEGIN{FS=OFS=";"}{h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60; print $0, int(diff/60) ":" int(diff%60)}' a
20140101;1212;1.5;10:42
20140102;1515;1.58;13:40
20140103;1759;.69;17:17
