Polars string column to pl.Datetime: conversion issue

I am working with a CSV file with the following schema:
'Ticket ID': polars.datatypes.Int64,
..
'Created time': polars.datatypes.Utf8,
'Due by Time': polars.datatypes.Utf8,
..
Converting to Datetime:
df = (
    df.lazy()
    .select(list_cols)
    .with_columns([
        pl.col(convert_to_date).str.strptime(pl.Date, fmt='%d-%m-%Y %H:%M', strict=False).alias("Create_date")  # .cast(pl.Datetime)
    ])
)
Here is the output. 'Created time' is the original str and 'Create_date' is the conversion:
Created time        Create_date
str                 date
04-01-2021 10:26    2021-01-04
04-01-2021 10:26    2021-01-04
04-01-2021 10:26    2021-01-04
04-01-2021 11:48    2021-01-05
...                 ...
22-09-2022 22:44    null
22-09-2022 22:44    null
22-09-2022 22:44    null
22-09-2022 22:47    null
I am getting a bunch of nulls, and some of the date conversions seem to be incorrect (see the 4th row in the output above). Also, how can I keep the time values?
I am sure I am doing something wrong, and any help would be greatly appreciated.
import polars as pl
from datetime import datetime
from datetime import date, timedelta
import pyarrow as pa
import pandas as pd

convert_to_date = ['Created time', 'Due by Time', 'Resolved time', 'Closed time', 'Last update time', 'Initial response time']
url = 'https://raw.githubusercontent.com/DOakville/PolarsDate/main/3000265945_tickets-Dates.csv'

df = pl.read_csv(url, parse_dates=True)

df = df.with_column(
    pl.col(convert_to_date).str.strptime(pl.Date, fmt='%d-%m-%Y %H:%M', strict=False).alias("Create_date")  # .cast(pl.Datetime)
)

Ahhh... I think I see what is happening: your with_columns expression is successfully converting all of the columns given in the "convert_to_date" list, but it assigns the result of each conversion to the same name: "Create_date".
So, the values you finally get are coming from the last column to be converted ("Initial response time"), which does have nulls where you see them.
If you want each column to be associated with a separate date-converted entry, you can use the suffix expression to ensure that each conversion is mapped to its own distinct output column (based on the original name).
For example:
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype=pl.Date,
        fmt='%d-%m-%Y %H:%M',
    ).suffix(" date")  # << adds " date" to the existing column name
)
Or, if you prefer to overwrite the existing columns with the converted ones, you could keep the existing column names:
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype=pl.Date,
        fmt='%d-%m-%Y %H:%M'
    ).keep_name()  # << keeps original name (effectively overwriting it)
)
Finally, if you actually want datetimes (not dates), just change the value of the datatype param in the strptime expression to pl.Datetime.
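As a minimal sketch (using a couple of made-up rows rather than the linked CSV, and assuming the same Polars API as above with fmt= and .suffix()), parsing as pl.Datetime keeps the time-of-day and suffix gives each column its own output:
import polars as pl

# Hypothetical sample rows standing in for the CSV columns.
df = pl.DataFrame({
    "Created time": ["04-01-2021 10:26", "22-09-2022 22:44"],
    "Due by Time": ["09-01-2021 17:00", "27-09-2022 17:00"],
})
convert_to_date = ["Created time", "Due by Time"]

out = df.with_columns(
    pl.col(convert_to_date)
    .str.strptime(pl.Datetime, fmt="%d-%m-%Y %H:%M", strict=False)
    .suffix(" date")  # e.g. "Created time date", "Due by Time date"
)
print(out)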

Related

Create new column based on for loop

In the code below, I'm taking an xlsx file and determining whether a surgery overlapped based on 4 different date/time columns. Everything works fine except for the end, where the last two (commented-out) lines are my attempt at the steps below. The new column is based on the results of the for loop, keeping all the columns in the original dataframe that are listed in dfResults.
Create a new column called "Overlap Status"
If conflict == True then value in new column is "Overlapped"
If conflict == False then value in new column is "Did not Overlap"
import pandas as pd

df1 = pd.read_excel(r'Directory\File.xlsx')
dfResults = df1.loc[(df1['conflict'] == True),
                    ['LOG ID', 'Patient MRN', 'Providers Name', 'Surgery Date',
                     'Incision Start', 'Incision Close', 'Sedation Start', 'Case Finish']]
print(dfResults)
#df1.loc[:,'Overlap Status'] = df1.loc[(df1['conflict'] == True), "Overlapped"]
#df1.loc[:,'Overlap Status'] = df1.loc[(df1['conflict'] == False), "Did not Overlap"]
Expected Output:

Log ID | Patient MRN | Providers Name | Surgery Date | Incision Start    | Incision Close   | Sedation Start   | Case Finish | Overlap Status
123    | ABC         | T, GEORGE      | 9/2/2021     | 9/2/2021 11:43 AM | 9/2/2021 1:27 PM | 9/2/2021 2:14 PM |             | Overlapped
456    | DEF         | T, GEORGE      | 9/2/2021     | 9/2/2021 1:46 PM  | 9/2/2021 3:20 PM | 9/2/2021 3:41 PM |             | Overlapped
789    | GEF         | S, STEVEN      | 9/1/2021     | 9/1/2021 9 AM     | 9/1/2021 10 AM   |                  |             | Did not overlap
I figured it out; I just had to do it with numpy:
import numpy as np

df1['Overlap Status'] = np.where(df1['conflict'] == True, 'Overlapped', 'Did not overlap')
df1.drop(['Start', 'End', 'SedStart', 'conflict'], axis=1, inplace=True)
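For reference, here is a minimal self-contained sketch with made-up data (the column names are just illustrative): np.where evaluates the boolean conflict column for every row at once and maps it to the two labels, which is what replaces the for loop.
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real spreadsheet.
df1 = pd.DataFrame({
    "LOG ID": [123, 456, 789],
    "conflict": [True, True, False],
})

# Boolean mask -> label, evaluated row-wise in one vectorised step.
df1["Overlap Status"] = np.where(df1["conflict"], "Overlapped", "Did not overlap")
print(df1)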

How to solve for pyodbc.ProgrammingError: The second parameter to executemany must not be empty

Hi, I'm having an issue with the transfer of data from one database to another. I created a list using a field in a table on an MSSQL db, used that list to query an Oracle db table (using the initial list in the WHERE statement to filter results), and I then load the query results back into the MSSQL db.
The program runs for the first few iterations but then errors out with the following error:
Traceback (most recent call last):
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 67, in <module>
    insertIntoMSDatabase(idString)
  File "C:/Users/1/PycharmProjects/DataExtracts/BuyerGroup.py", line 48, in insertIntoMSDatabase
    mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
I can't seem to find any guidance online to troubleshoot this error message. I feel it may be a simple solution, but I just can't get there...
# import libraries
import cx_Oracle
import pyodbc
import logging
import time
import re
import math
import numpy as np

logging.basicConfig(level=logging.DEBUG)

conn = pyodbc.connect('''Driver={SQL Server Native Client 11.0};
                         Server='servername';
                         Database='dbname';
                         Trusted_connection=yes;''')
b = conn.cursor()

dsn_tns = cx_Oracle.makedsn('Hostname', 'port', service_name='name')
conn1 = cx_Oracle.connect(user=r'uid', password='pwd', dsn=dsn_tns)
c = conn1.cursor()

beginTime = time.time()

bind = (b.execute('''select distinct field1
                     from [server].[db].[dbo].[table]'''))
print('MSQL table(s) queried, List Generated')

# formats ids for sql string
def surroundWithQuotes(id):
    return "'" + re.sub(r",|\s$", "", str(id)) + "'"

def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()
    print('Oracle table(s) queried, Data Pulled')
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename]
             (
              [fields1]
             ,[field2]
             )
             VALUES (?,?)'''
    val = claimsdata
    mycursor.executemany(sql, val)
    conn.commit()

ids = []
formattedIdStrings = []

# adds all the ids found in bind to an iterable array
for row in bind:
    ids.append(row[0])

# splits the ids[] array into multiple arrays < 1000 in length
batchedIds = np.array_split(ids, math.ceil(len(ids) / 1000))

# formats the value inside each batchedId to be a string
for batchedId in batchedIds:
    formattedIdStrings.append(",".join(map(surroundWithQuotes, batchedId)))

# runs insert into MS database for each batch of IDs
for idString in formattedIdStrings:
    insertIntoMSDatabase(idString)

print("MSQL table loaded, Data inserted into destination")
endTime = time.time()
print("Program Time Elapsed: ", endTime - beginTime)
conn.close()
conn1.close()
mycursor.executemany(sql, val)
pyodbc.ProgrammingError: The second parameter to executemany must not be empty.
Before calling .executemany() you need to verify that val is not an empty list (as would be the case if .fetchall() is called on a SELECT statement that returns no rows), e.g.,
if val:
    mycursor.executemany(sql, val)
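As a rough sketch of where that guard could sit in the poster's insertIntoMSDatabase (reusing the c and conn objects from the question; the skip message is just illustrative), an empty Oracle result set for a batch then skips the insert instead of raising:
def insertIntoMSDatabase(idString):
    osql = '''SELECT distinct field1, field2
              FROM Database.Table
              WHERE field2 is not null and field3 IN ({})'''.format(idString)
    c.execute(osql)
    claimsdata = c.fetchall()          # may be [] for some batches
    if not claimsdata:                 # guard: executemany rejects an empty list
        print('No rows returned for this batch; skipping insert.')
        return
    mycursor = conn.cursor()
    sql = '''INSERT INTO [dbo].[tablename] ([fields1], [field2])
             VALUES (?, ?)'''
    mycursor.executemany(sql, claimsdata)
    conn.commit()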

SQLRPGLE & JSON_OBJECT CTE Statements -101 Error

This program compiles correctly (we are on V7R3), but when it runs it receives an SQLCODE of -101 and an SQLSTATE of 54011, which states: "Too many columns were specified for a table, view, or table function." This is a very small JSON document that is being created, so I do not think that is the issue.
The RPGLE code:
dcl-s OutFile sqltype(dbclob_file);
xfil_tofile = '/ServiceID-REFCODJ.json';
Clear OutFile;
OutFile_Name = %TrimR(XFil_ToFile);
OutFile_NL = %Len(%TrimR(OutFile_Name));
OutFile_FO = IFSFileCreate;
OutFile_FO = IFSFileOverWrite;
exec sql
With elm (erpRef) as (select json_object
('ServiceID' VALUE trim(s.ServiceID),
'ERPReferenceID' VALUE trim(i.RefCod) )
FROM PADIMH I
INNER JOIN PADGUIDS G ON G.REFCOD = I.REFCOD
INNER JOIN PADSERV S ON S.GUID = G.GUID
WHERE G.XMLTYPE = 'Service')
, arr (arrDta) as (values json_array (
select erpRef from elm format json))
, erpReferences (refs) as ( select json_object ('erpReferences' :
arrDta Format json) from arr)
, headerData (hdrData) as (select json_object(
'InstanceName' : trim(Cntry) )
from padxmlhdr
where cntry = 'US')
VALUES (
select json_object('header' : hdrData format json,
'erpReferenceData' value refs format json)
from headerData, erpReferences )
INTO :OutFile;
Any help with this would be very much appreciated; this is our first attempt at creating JSON for sending, and we have not experienced this issue before.
Thanks,
John
I am sorry for the delay in getting back to this issue. It has been corrected; the issue was with the "values" statement.
This is the correct code needed to make it work:
Select json_object('header' : hdrData format json,
                   'erpReferenceData' value refs format json)
  INTO :OutFile
  From headerData, erpReferences

Setting textinputformat.record.delimiter in sparksql

In Spark 2.0.1, Hadoop 2.6.0, I have many files delimited with '!#!\r' and not with the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data according to Setting textinputformat.record.delimiter in spark, using set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but still cannot extract the data.
In spark-sql, I do this:
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5, but when I try set textinputformat.record.delimiter='!#!\r'; and then select count(*) from ceshi; the result is 1, so the delimiter does not work.
I also checked the source of Hadoop 2.6.0 (the RecordReader method in TextInputFormat.java) and noticed that the default textinputformat.record.delimiter is null; LineReader.java then uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').
You should use sparkContext's hadoopConfiguration api to set the textinputformat.record.delimiter as
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then if you read the text file using sparkContext as
sc.textFile("the input file path")
You should be fine.
Updated
I have noticed that a text file with delimiter \r, when saved, is changed to the \n delimiter.
So, the following format should work for you, as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")
val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major :String)
which should give dataframe as
+----------+-----+-----+-----+
|id |name |year |major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|810 |
|2001810087|hauaa| 2001|810 |
|2001820081|hello| 2001|820 |
|2001820082|jaccy| 2001|820 |
|2002810081|cindy| 2002|810 |
+----------+-----+-----+-----+
Now you can hit the count function as
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which would give output as
+--------+
|count(1)|
+--------+
|5 |
+--------+
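For completeness, a rough PySpark equivalent of the same idea (this is not from the original answer; it assumes the stock Hadoop TextInputFormat classes and passes the delimiter through newAPIHadoopFile's conf dict, keeping only the text value of each record):
# Assumed path and delimiter; adjust to your data.
conf = {"textinputformat.record.delimiter": "!#!\n"}

rdd = sc.newAPIHadoopFile(
    "the input file path",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
).map(lambda kv: kv[1])    # drop the byte-offset key, keep the record text

rows = rdd.map(lambda line: line.split("\t"))
print(rows.count())        # should report 5 records for the sample data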

Spark/Hive Hours Between Two Datetimes

I would like to know how to precisely get the number of hours between 2 datetimes in spark.
There is a function called datediff which I could use to get the number of days and then convert to hours; however, this is less precise than I'd like.
Example of what I want, modeled after datediff:
>>> df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
>>> df.select(hourdiff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=22)]
Try using a UDF. Here is sample code; you can modify the UDF to return whatever granularity you want.
from pyspark.sql.functions import udf, col
from datetime import datetime, timedelta
from pyspark.sql.types import LongType

def timediff_x():
    def _timediff_x(date1, date2):
        date11 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        date22 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        return (date11 - date22).days
    return udf(_timediff_x, LongType())

df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-25 19:15:00')], ['d1', 'd2'])
df.select(timediff_x()(col("d2"), col("d1"))).show()
+----------------------------+
|PythonUDF#_timediff_x(d2,d1)|
+----------------------------+
| 6|
+----------------------------+
If your columns are of type TimestampType(), you can use the answer at the following question:
Spark Scala: DateDiff of two columns by hour or minute
However, if your columns are of type StringType(), you have an option that is easier than defining a UDF, using the built-in functions:
from pyspark.sql.functions import *
diffCol = unix_timestamp(col('d1'), 'yyyy-MM-dd HH:mm:ss') - unix_timestamp(col('d2'), 'yyyy-MM-dd HH:mm:ss')
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
df2 = df.withColumn('diff_secs', diffCol)
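Since unix_timestamp differences are in seconds, dividing by 3600 gives the hour count the question asked for. A small sketch along those lines (operands swapped to d2 - d1 so this sample yields a positive value of roughly 21.9 hours):
from pyspark.sql.functions import col, unix_timestamp

df = sqlContext.createDataFrame([('2016-04-18 21:18:18', '2016-04-19 19:15:00')], ['d1', 'd2'])
diff_secs = unix_timestamp(col('d2'), 'yyyy-MM-dd HH:mm:ss') - unix_timestamp(col('d1'), 'yyyy-MM-dd HH:mm:ss')

# 3600 seconds in an hour.
df2 = df.withColumn('diff_hours', diff_secs / 3600)
df2.show(truncate=False)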
