Create new column based on for loop

In the code below, I'm taking an xlsx file and determining whether a surgery overlapped, based on 4 different date/time columns. Everything works fine except for the last two (commented-out) lines, where I'm trying to do the following: create a new column based on the results of the for loop, keeping all the columns of the original dataframe that are listed in dfResults.
Create a new column called "Overlap Status"
If conflict == True then value in new column is "Overlapped"
If conflict == False then value in new column is "Did not Overlap"
import pandas as pd

df1 = pd.read_excel(r'Directory\File.xlsx')
dfResults = df1.loc[(df1['conflict'] == True),
                    ['LOG ID', 'Patient MRN', 'Providers Name', 'Surgery Date',
                     'Incision Start', 'Incision Close', 'Sedation Start', 'Case Finish']]
print(dfResults)

#df1.loc[:,'Overlap Status'] = df1.loc[(df1['conflict'] == True), "Overlapped"]
#df1.loc[:,'Overlap Status'] = df1.loc[(df1['conflict'] == False), "Did not Overlap"]
Expected Output:

Log ID | Patient MRN | Providers Name | Surgery Date | Incision Start    | Incision Close   | Sedation Start   | Case Finish | Overlap Status
123    | ABC         | T, GEORGE      | 9/2/2021     | 9/2/2021 11:43 AM | 9/2/2021 1:27 PM | 9/2/2021 2:14 PM |             | Overlapped
456    | DEF         | T, GEORGE      | 9/2/2021     | 9/2/2021 1:46 PM  | 9/2/2021 3:20 PM | 9/2/2021 3:41 PM |             | Overlapped
789    | GEF         | S, STEVEN      | 9/1/2021     | 9/1/2021 9 AM     | 9/1/2021 10 AM   |                  |             | Did not overlap

I figured it out: I just had to do it with numpy.
import numpy as np

df1['Overlap Status'] = np.where(df1['conflict'] == True, 'Overlapped', 'Did not overlap')
df1.drop(['Start', 'End', 'SedStart', 'conflict'], axis=1, inplace=True)
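For anyone who prefers to stay in pandas, the same column can be built without numpy. A minimal sketch, assuming the same df1 with its boolean 'conflict' column:
# pandas-only equivalent of the np.where line above
df1['Overlap Status'] = df1['conflict'].map({True: 'Overlapped', False: 'Did not overlap'})

# or the two-step .loc assignment the commented-out lines were reaching for:
df1.loc[df1['conflict'] == True, 'Overlap Status'] = 'Overlapped'
df1.loc[df1['conflict'] == False, 'Overlap Status'] = 'Did not Overlap'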

Related

Polars string column to pl.Datetime: conversion issue

I am working with a csv file with the following schema:
'Ticket ID': polars.datatypes.Int64,
..
'Created time': polars.datatypes.Utf8,
'Due by Time': polars.datatypes.Utf8,
..
Converting to Datetime:
df = (
    df.lazy()
    .select(list_cols)
    .with_columns([
        pl.col(convert_to_date).str.strptime(pl.Date, fmt='%d-%m-%Y %H:%M', strict=False).alias("Create_date")  #.cast(pl.Datetime)
    ])
)
Here is the output. 'Created time' is the original str and 'Create_date' is the conversion:
Created time     | Create_date
str              | date
04-01-2021 10:26 | 2021-01-04
04-01-2021 10:26 | 2021-01-04
04-01-2021 10:26 | 2021-01-04
04-01-2021 11:48 | 2021-01-05
...              | ...
22-09-2022 22:44 | null
22-09-2022 22:44 | null
22-09-2022 22:44 | null
22-09-2022 22:47 | null
I'm getting a bunch of nulls, and some of the date conversions seem to be incorrect (see the 4th row in the output above). Also, how can I keep the time values?
I'm sure I am doing something wrong, and any help would be greatly appreciated.
import polars as pl
from datetime import datetime
from datetime import date, timedelta
import pyarrow as pa
import pandas as pd

convert_to_date = ['Created time', 'Due by Time', 'Resolved time', 'Closed time',
                   'Last update time', 'Initial response time']
url = 'https://raw.githubusercontent.com/DOakville/PolarsDate/main/3000265945_tickets-Dates.csv'

df = (
    pl.read_csv(url, parse_dates=True)
)
df = df.with_column(
    pl.col(convert_to_date).str.strptime(pl.Date, fmt='%d-%m-%Y %H:%M', strict=False).alias("Create_date")  #.cast(pl.Datetime)
)
Ahhh... I think I see what is happening: your with_columns expression is successfully converting all of the columns given in the "convert_to_date" list, but assigning the result of each conversion to the same name: "Create_date".
So, the values you finally get are coming from the last column to be converted ("Initial response time"), which does have nulls where you see them.
If you want each column to be associated with a separate date-converted entry, you can use the suffix expression to ensure that each conversion is mapped to its own distinct output column (based on the original name).
For example:
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype = pl.Date,
        fmt = '%d-%m-%Y %H:%M',
    ).suffix(" date")  # << adds " date" to the existing column name
)
Or, if you prefer to overwrite the existing columns with the converted ones, you could keep the existing column names:
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype = pl.Date,
        fmt = '%d-%m-%Y %H:%M'
    ).keep_name()  # << keeps the original name (effectively overwriting it)
)
Finally, if you actually want datetimes (not dates), just change the value of the datatype param in the strptime expression to pl.Datetime.
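For instance, a minimal sketch of that datetime variant, using the same old-style Polars API as the examples above (strict=False, as in the question, turns unparseable strings into nulls instead of raising):
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype = pl.Datetime,  # keeps the time component
        fmt = '%d-%m-%Y %H:%M',
        strict = False,
    ).suffix(" datetime")
)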

Problems with the leaderboard (discord.py)

The leaderboard shows the same username for different users when they have the same value. I don't know how to solve it; also, when I print the dictionary in the code, it gives me only 3 elements and not 4, even though 4 should come out.
code:
@client.command(aliases = ["lb"])
async def leaderboard(ctx, x = 10):
    leader_board = {}
    total = []
    for user in economy_system:
        name = int(user)
        total_amount = economy_system[user]["wallet"] + economy_system[user]["bank"]
        leader_board[total_amount] = name
        total.append(total_amount)
    print(leader_board)
    total = sorted(total, reverse=True)
    embed = discord.Embed(
        title = f"Top {x} Richest People",
        description = "This is decided on the basis of raw money in the bank and wallet",
        color = 0x003399
    )
    index = 1
    for amt in total:
        id_ = leader_board[amt]
        member = client.get_user(id_)
        name = member.name
        print(name)
        embed.add_field(
            name = f"{index}. {name}",
            value = f"{amt}",
            inline = False
        )
        if index == x:
            break
        else:
            index += 1
    await ctx.send(embed=embed)
print returns this:
{100: 523967502665908227, 350: 554617490806800387, 1100: 350886488235311126}
Padre Mapper
Flore (Orsolinismo)
Aetna
Aetna
In theory there should also be 100: 488826524791734275 (i.e. my user id) but it doesn't find it.
Your problem comes from this line:
leader_board[total_amount] = name
If total_amount is already a key (e.g. two users have the same amount of money), the previous value (which was a user ID) gets overwritten with the other user ID. So if multiple users have the same amount of money, only one of them is saved in leader_board.
Then, you have this line:
total.append(total_amount)
In this case, if two users have the same amount of money, you would just have two identical values, which is normal but, considering the problem above, this will create a shift.
Let's say you have ten users with two of them who have the same amount of money. leader_board will only contain 9 items whereas total will contain 10 values. That's the reason why you have two of the same name in your message.
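A minimal illustration of the overwrite, using the amount and IDs from your own output (whichever assignment runs last wins):
leader_board = {}
leader_board[100] = 488826524791734275   # your ID, total 100
leader_board[100] = 523967502665908227   # another user with the same total overwrites it
print(leader_board)   # {100: 523967502665908227} -- only one entry left for two users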
To solve the problem:
@client.command(aliases = ["lb"])
async def leaderboard(ctx, x=10):
    d = {user_id: info["wallet"] + info["bank"] for user_id, info in economy_system.items()}
    leaderboard = {user_id: amount for user_id, amount in sorted(d.items(), key=lambda item: item[1], reverse=True)}
    embed = discord.Embed(
        title = f"Top {x} Richest People",
        description = "This is decided on the basis of raw money in the bank and wallet",
        color = 0x003399
    )
    for index, (user_id, amount) in enumerate(leaderboard.items(), start=1):
        if index > x:  # only show the top x entries
            break
        member = client.get_user(int(user_id))  # int() in case the IDs are stored as strings
        embed.add_field(
            name = f"{index}. {member.display_name}",
            value = f"{amount}",
            inline = False
        )
    await ctx.send(embed=embed)
If I guessed right and your dictionary is organized like this, it should work:
economy_system = {
    user_id: {"bank": x, "wallet": y}
}
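A quick sketch of the sorting step in isolation, with made-up totals, to show why ties are no longer a problem:
economy_system = {
    111: {"bank": 50, "wallet": 50},    # total 100
    222: {"bank": 50, "wallet": 50},    # total 100 -- a tie
    333: {"bank": 300, "wallet": 50},   # total 350
}
d = {user_id: info["wallet"] + info["bank"] for user_id, info in economy_system.items()}
print(dict(sorted(d.items(), key=lambda item: item[1], reverse=True)))
# {333: 350, 111: 100, 222: 100} -- keyed by user ID, so equal totals can coexist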

'No Rows Selected' when running query on sqlplus using CMD

Hi, I am trying to run a file named query.sql using sqlplus on cmd, but I'm getting 'no rows selected' inside the csv, while the same query gives results when run in Oracle SQL Developer.
I have run the following command on cmd
sqlplus <username>/<password>@sid @query.sql > output.csv
The query inside query.sql is
SELECT
SR.SOID,
EL.SOID,
EO.EPTNUMBER,
EO.SLELABEL,
EL.LOGTEXT,
utl_raw.cast_to_varchar2(dbms_lob.substr(SR.SODATA, 3000, 1)) AS NAC_DATA
FROM
SORECORD SR,
SOAPPKEY SK,
EPTORDER EO,
EPTLOG EL
WHERE
SK.APPKEYNAME = 'MSISDN'
AND SK.appkeyvalue = '<ctn here>'
AND SK.SOID = SR.SOID
AND SR.SOTYPE = 'NAC'
AND SR.Receipttimestamp LIKE '07-JAN-20'
AND SR.SOID = EO.SOID
AND EO.EPTNUMBER = EL.EPTNUMBER
AND EL.SOID LIKE TO_CHAR((
Select
SUBSTR(TO_CHAR(SR.SOID), 1, LENGTH(SR.SOID) - 1)
FROM
SORECORD SR, SOAPPKEY SK
WHERE
SK.APPKEYNAME = 'MSISDN'
AND SK.appkeyvalue = '<ctn here>'
AND SK.SOID = SR.SOID
AND SR.SOTYPE = 'NAC'
AND SR.Receipttimestamp LIKE '07-JAN-20')) || '%'
AND EL.SOID > TO_NUMBER((
Select
SUBSTR(TO_CHAR(SR.SOID), 1, LENGTH(SR.SOID) - 1)
FROM
SORECORD SR, SOAPPKEY SK
WHERE
SK.APPKEYNAME = 'MSISDN'
AND SK.appkeyvalue = '<ctn here>'
AND SK.SOID = SR.SOID
AND SR.SOTYPE = 'NAC'
AND SR.Receipttimestamp LIKE '07-JAN-20') || '0');
I tried other queries to generate csv and they were working fine. I have no clue why this one is giving 'No Rows Selected' for sqlplus cmd when this same query fetches results in Oracle SQL Developer.
Can anyone help me out in pointing out the issue?
If the SR.Receipttimestamp column's datatype is DATE, why are you comparing it to a string? '07-JAN-20' is a string, not a date. You're relying on Oracle's ability to implicitly convert that string into a valid date value, but that doesn't always work. I presume that's what's breaking your query.
I'd suggest you rewrite it; a simple option - just to see whether it helps - is
where trunc(SR.Receipttimestamp) = date '2020-01-07'
i.e.:
- with trunc, "remove" the time component (it sets that column's time of day to 00:00)
- compare it to a date literal, which is always in date 'yyyy-mm-dd' format
@Littlefoot, I made the mentioned change and it worked, but I'm facing a few issues in the csv file now.
I have used the below Spool Code:
set pagesize 0
set feed off
set term off
spool '<my path here>'
SELECT
SR.SOID,
EL.SOID,
EO.EPTNUMBER,
EO.SLELABEL,
EL.LOGTEXT,
utl_raw.cast_to_varchar2(dbms_lob.substr(SR.SODATA, 3000, 1)) AS NAC_DATA
FROM
SORECORD SR,
SOAPPKEY SK,
EPTORDER EO,
EPTLOG EL
WHERE
SK.APPKEYNAME = 'MSISDN'
AND SK.appkeyvalue = '07996703863'
AND SK.SOID = SR.SOID
AND SR.SOTYPE = 'NAC'
AND trunc(SR.Receipttimestamp) = date '2020-01-07'
AND SR.SOID = EO.SOID
AND EO.EPTNUMBER = EL.EPTNUMBER
AND EL.SOID LIKE TO_CHAR((
Select
SUBSTR(TO_CHAR(SR.SOID), 1, LENGTH(SR.SOID) - 1)
FROM
SORECORD SR, SOAPPKEY SK
WHERE
SK.APPKEYNAME = 'MSISDN'
AND SK.appkeyvalue = '07996703863'
AND SK.SOID = SR.SOID
AND SR.SOTYPE = 'NAC'
AND trunc(SR.Receipttimestamp) = date '2020-01-07')) || '%'
AND EL.SOID > TO_NUMBER((
Select
SUBSTR(TO_CHAR(SR.SOID), 1, LENGTH(SR.SOID) - 1)
FROM
SORECORD SR, SOAPPKEY SK
WHERE
SK.APPKEYNAME = 'MSISDN'
AND SK.appkeyvalue = '07996703863'
AND SK.SOID = SR.SOID
AND SR.SOTYPE = 'NAC'
AND trunc(SR.Receipttimestamp) = date '2020-01-07') || '0');
spool off
and the CMD used:
sqlplus <username>/<password>@sid @<path of spool txt>
Now I am getting csv data as below:
5764377,5764371,1,EEVMS,
Tue Jan 07 08:29:49:887 2020SOAP MESSAGE SENT:
<S:Envel
5764377,5764375,1,EEVMS,
Tue Jan 07 08:30:49:900 2020SOAP MESSAGE SENT:
<S:Envel
5764377,5764376,1,EEVMS,
Tue Jan 07 08:31:50:003 2020SOAP MESSAGE SENT:
<S:Envel
Now I am facing 2 issues with this:
1. I am getting partial data from EL.LOGTEXT, whose data type is CLOB.
2. The utl_raw.cast_to_varchar2(dbms_lob.substr(SR.SODATA, 3000, 1)) AS NAC_DATA column is altogether missing from the csv.
Here is the snapshot of the query data in SQL Developer; I actually want the data in this tabular format in my CSV.

How to speed up the addition of a new column in pandas, based on comparisons on an existing one

I am working on a large-ish dataframe collection with some machine data in several tables. The goal is to add a column to every table which expresses the row's "class", considering its vicinity to a certain time stamp.
seconds = 1800
for i in range(len(tables)):  # looping over 20 equally structured tables containing machine data
    table = tables[i]
    table['Class'] = 'no event'
    for event in events[i].values:  # looping over 20 equally structured tables containing events
        event_time = event[1]  # get integer time stamp
        start_time = event_time - seconds
        table.loc[(table.Time <= event_time) & (table.Time >= start_time), 'Class'] = 'event soon'
The event_times and the entries in table.Time are integers. The point is to assign the class "event soon" to all rows within a specific time frame before an event (the given number of seconds).
The code takes quite a long time to run, and I am not sure what is to blame and what can be fixed. The number of seconds does not have much impact on the runtime, so the part where the table is actually changed is probably working fine, and it may have to do with the nested loops instead. However, I don't see how to get rid of them. Hopefully, there is a faster, more pandas-like way to go about adding this class column.
I am working with Python 3.6 and Pandas 0.19.2
You can use numpy broadcasting to do this vectorised instead of looping.
Dummy data generation
import numpy as np
import pandas as pd

num_tables = 5
seconds = 1800

def gen_table(count):
    for i in range(count):
        times = [(100 + j)**2 for j in range(i, 50 + i)]
        df = pd.DataFrame(data={'Time': times})
        yield df

def gen_events(count, num_tables):
    for i in range(num_tables):
        times = [1E4 + 100 * (i + j)**2 for j in range(count)]
        yield pd.DataFrame(data={'events': times})

tables = list(gen_table(num_tables))  # a list of 5 DataFrames of length 50
events = list(gen_events(5, num_tables))  # a list of 5 DataFrames of length 5
Comparison
For debugging, I added a dict of verification DataFrames. They are not needed, I just used them for debugging
verification = {}
for i, (table, event_df) in enumerate(zip(tables, events)):
    event_list = event_df['events']
    time_diff = event_list.values - table['Time'].values[:, np.newaxis]  # This is where the magic happens
    events_close = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    table['Class'] = np.where(events_close, 'event soon', 'no event')

    # The stuff after this line can be deleted since it's only used for the verification
    df = pd.DataFrame(data=time_diff, index=table['Time'], columns=event_list)
    df['event'] = np.any((0 < time_diff) & (time_diff < seconds), axis=1)
    verification[i] = df
newaxis
A good explanation on broadcasting is in Jakevdp's book
table['Time'].values[:,np.newaxis]
gives a (50,1) 2-d array
array([[10000],
[10201],
[10404],
....
[21609],
[21904],
[22201]], dtype=int64)
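To make the broadcast concrete, here is a toy sketch with made-up values (much smaller than the answer's 50x5 case):
times = np.array([10000, 10201, 10404])   # shape (3,)
events = np.array([10000.0, 10100.0])     # shape (2,)
diff = events - times[:, np.newaxis]      # (2,) minus (3,1) broadcasts to (3,2)
# each row holds one Time value's difference against every event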
Verification
For the first step the verification df looks like this:
events 10000.0 10100.0 10400.0 10900.0 11600.0 event
Time
10000 0.0 100.0 400.0 900.0 1600.0 True
10201 -201.0 -101.0 199.0 699.0 1399.0 True
10404 -404.0 -304.0 -4.0 496.0 1196.0 True
10609 -609.0 -509.0 -209.0 291.0 991.0 True
10816 -816.0 -716.0 -416.0 84.0 784.0 True
11025 -1025.0 -925.0 -625.0 -125.0 575.0 True
11236 -1236.0 -1136.0 -836.0 -336.0 364.0 True
11449 -1449.0 -1349.0 -1049.0 -549.0 151.0 True
11664 -1664.0 -1564.0 -1264.0 -764.0 -64.0 False
11881 -1881.0 -1781.0 -1481.0 -981.0 -281.0 False
12100 -2100.0 -2000.0 -1700.0 -1200.0 -500.0 False
12321 -2321.0 -2221.0 -1921.0 -1421.0 -721.0 False
12544 -2544.0 -2444.0 -2144.0 -1644.0 -944.0 False
....
20449 -10449.0 -10349.0 -10049.0 -9549.0 -8849.0 False
20736 -10736.0 -10636.0 -10336.0 -9836.0 -9136.0 False
21025 -11025.0 -10925.0 -10625.0 -10125.0 -9425.0 False
21316 -11316.0 -11216.0 -10916.0 -10416.0 -9716.0 False
21609 -11609.0 -11509.0 -11209.0 -10709.0 -10009.0 False
21904 -11904.0 -11804.0 -11504.0 -11004.0 -10304.0 False
22201 -12201.0 -12101.0 -11801.0 -11301.0 -10601.0 False
Small optimizations of the original approach
You can shave a few lines and some assignments off the original algorithm:
for table, event_df in zip(tables, events):
    table['Class'] = 'no event'
    for event_time in event_df['events']:  # looping over the events for this table
        start_time = event_time - seconds
        table.loc[table['Time'].between(start_time, event_time), 'Class'] = 'event soon'
You might shave off some more if, instead of the strings 'no event' and 'event soon', you just used booleans, as sketched below.
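A minimal sketch of that boolean variant, reusing the events_close array from the vectorised version above:
# True plays the role of 'event soon', False of 'no event'
table['Class'] = events_close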

How to extract a string from a large file only if a specific string appears previously, using Ruby?

I am trying to extract information from a large file and cannot figure out how to extract strings from file lines only when a previous line in the same record within the file has been matched by regex. An example of one record in the file is as follows:
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent: coordinate IM with MENTAL COMPETENCY (IM)
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization, by a patient or research subject, etc,...
This file contains over 20,000 records like this example. I want to identify a small percent of those records using the "MH" field. In this example, I want to find "Informed Consent", and then use regex to extract the information in the FX, AN, and MS fields only within that record. So far, I have opened the file, accessed the hash that the MH terms are stored in, and been able to extract those terms from the records in the file. I also have a functioning regex that identifies the content in the "FX" field.
File.open('mesh_descriptor.bin').each do |file_line|
  file_line = file_line.chomp
  # read each key of candidate_descriptor_keys
  candidate_descriptor_keys.each do |cand_term|
    if file_line =~ /^MH\s=\s(#{cand_term})$/
      mesh_header = $1
      puts "MH from Mesh Descriptor file is: #{mesh_header}"
      if file_line =~ /^FX\s=\s(.*)$/
        see_also = $1
        puts "  See_Also from Descriptor file is: #{see_also}"
      end
    end
  end
end
The array candidate_descriptor_keys contains the following MH terms (keys):
candidate_descriptor_keys = ["Body Weight", "Obesity", "Thinness", "Fetal Weight", "Overweight"]
I had success extracting "FX" when I put the statement outside of the "if" statement to extract "MH", but all of the "FX" from the whole file were retrieved - not what I need. I thought putting the "if" statement for "FX" within the previous "if" statement would restrict the results to only those found when the first statement is true, but I am getting no results (also no errors) with this strategy. What I would like as a result is:
> Informed Consent
> Disclosure
> Mental Competency
> Therapeutic Misconception
> Treatment Refusal
as well as the strings within the "AN" and "MS" fields for only those records matching "MH". Any suggestions would be helpful!
I think this may be what you are looking for, but if not, let me know and I will change it. Look especially at the very end to see if that is the sort of output you want (for input having two records, both with a "MH" field). I will also add an "explanation" section at the end once I am sure I have understood your question correctly.
I have assumed that each record begins
*NEW RECORD
and you wish to identify all lines beginning "MH" whose field is one of the elements of:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
and for each match, you would like to print the contents of the lines for the same record that begin with "FX", "AN" and "MS".
Code
NEW_RECORD_MARKER = "*NEW RECORD"

def getem(fname, candidate_descriptor_keys)
  line = 0
  found_mh = false
  File.open(fname).each do |file_line|
    file_line = file_line.strip
    case
    when file_line == NEW_RECORD_MARKER
      puts # space between records
      found_mh = false
    when found_mh == false
      candidate_descriptor_keys.each do |cand_term|
        if file_line =~ /^MH\s=\s(#{cand_term})$/
          found_mh = true
          puts "MH from line #{line} of file is: #{cand_term}"
          break
        end
      end
    when found_mh
      ["FX", "AN", "MS"].each do |des|
        if file_line =~ /^#{des}\s=\s(.*)$/
          see_also = $1
          puts "  Line #{line} of file is: #{des}: #{see_also}"
        end
      end
    end
    line += 1
  end
end
Example
Let's begin by creating a file, starting with a "here document" that contains two records:
records =<<_
*NEW RECORD
RECTYPE = D
MH = Informed Consent
AQ = ES HI LJ PX SN ST
ENTRY = Consent, Informed
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = Disclosure
FX = Mental Competency
FX = Therapeutic Misconception
FX = Treatment Refusal
ST = T058
ST = T078
AN = competency to consent
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Voluntary authorization
*NEW RECORD
MH = Obesity
AQ = ES HI LJ PX SN ST
ENTRY = Obesity
MN = N03.706.437.650.312
MN = N03.706.535.489
FX = 1st FX
FX = 2nd FX
AN = Only AN
PI = Jurisprudence (1966-1970)
PI = Physician-Patient Relations (1966-1970)
MS = Only MS
_
If you puts records you will see it is just a string. (You'll see that I shortened two of them.) Now write it to a file:
File.write('mesh_descriptor', records)
If you wish to confirm the file contents, you could do this:
puts File.read('mesh_descriptor')
We also need to define the array candidate_descriptor_keys:
candidate_descriptor_keys =
["Body Weight", "Obesity", "Thinness", "Informed Consent"]
We can now execute the method getem:
getem('mesh_descriptor', candidate_descriptor_keys)
MH from line 2 of file is: Informed Consent
  Line 7 of file is: FX: Disclosure
  Line 8 of file is: FX: Mental Competency
  Line 9 of file is: FX: Therapeutic Misconception
  Line 10 of file is: FX: Treatment Refusal
  Line 13 of file is: AN: competency to consent
  Line 16 of file is: MS: Voluntary authorization

MH from line 18 of file is: Obesity
  Line 23 of file is: FX: 1st FX
  Line 24 of file is: FX: 2nd FX
  Line 25 of file is: AN: Only AN
  Line 28 of file is: MS: Only MS
