Sort a pandas dataframe based on DateTime field

I am trying to sort a dataframe on a DateTime field, which has dtype datetime64[ns].
My dataframe looks like this:
Name DateTime1
P38 NaT
P62 2016-07-13 16:03:32.771
P59 2016-06-23 14:23:42.461
P07 NaT
P16 2016-06-23 14:02:06.237
P06 2016-07-13 16:03:52.570
P106 2016-07-13 19:56:22.676
When I sort it using the DateTime field,
df.sort_values(by='DateTime1',ascending=True)
I do not get the desired result.
Output:
Name DateTime1
P16 2016-06-23 14:02:06.237
P59 2016-06-23 14:23:42.461
P62 2016-07-13 16:03:32.771
P06 2016-07-13 16:03:52.570
P106 2016-07-13 19:56:22.676
P38 NaT
P07 NaT

Try assigning the result back to df, or use inplace=True, but don't do both. See pandas.DataFrame.sort_values
df = df.sort_values(by='DateTime1', ascending=True)
Otherwise, try pandas.DataFrame.set_index and then pandas.DataFrame.sort_index
df.set_index('DateTime1', drop=True, append=False, inplace=True, verify_integrity=False)
df = df.sort_index()
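For reference, here is a minimal reproducible sketch of that point, rebuilding the sample frame from the question (names and timestamps copied from the post); sort_values returns a new frame, so the original df stays unsorted unless you assign the result back or pass inplace=True:
import pandas as pd

df = pd.DataFrame({
    'Name': ['P38', 'P62', 'P59', 'P07', 'P16', 'P06', 'P106'],
    'DateTime1': pd.to_datetime([None, '2016-07-13 16:03:32.771',
                                 '2016-06-23 14:23:42.461', None,
                                 '2016-06-23 14:02:06.237',
                                 '2016-07-13 16:03:52.570',
                                 '2016-07-13 19:56:22.676']),
})

# sort_values returns a sorted copy; df itself keeps its original order.
sorted_df = df.sort_values(by='DateTime1', ascending=True)
print(df.head(2))         # still unsorted
print(sorted_df.head(2))  # sorted, with NaT rows pushed to the end by default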

I tried both snippets below, but it is still not working:
df = df.sort_values(by='DateTime1', ascending=True)
or
df.set_index('DateTime1', drop=True, append=False, inplace=True, verify_integrity=False)
df = df.sort_index()
What I found works is to convert the datetime column into the index and then sort by the index. So in your case:
df=df.set_index('DateTime1')
df=df.sort_index(ascending=False)

For others who might find this useful: I got it working by using the inplace argument, like so:
df.sort_values(by='DateTime1', ascending=True, inplace=True)

I know this is an old question, but the OP seems to have wanted the NaT values at the beginning, since the output they posted is already sorted. In that case, sort_values has a parameter that controls where missing values go: na_position
df = df.sort_values(by='DateTime1', ascending=True, na_position='first')
Output:
Name DateTime1
0 P38 NaT
3 P07 NaT
4 P16 2016-06-23 14:02:06.237
2 P59 2016-06-23 14:23:42.461
1 P62 2016-07-13 16:03:32.771
5 P06 2016-07-13 16:03:52.570
6 P106 2016-07-13 19:56:22.676
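For completeness, the default is na_position='last', which is why the NaT rows ended up at the bottom of the question's original output; keeping the default is equivalent to writing:
df = df.sort_values(by='DateTime1', ascending=True, na_position='last')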

Related

time data doesn't match format specified

I am trying to convert strings to datetime in Python. My data appear to match the format, but I still get:
'ValueError: time data 11 11 doesn't match format specified'
I am not sure where the "11 11" in the error comes from.
My code is
train_df['date_captured1'] = pd.to_datetime(train_df['date_captured'], format="%Y-%m-%d %H:%M:%S")
Head of data is
print (train_df.date_captured.head())
0 2011-05-13 23:43:18
1 2012-03-17 03:48:44
2 2014-05-11 11:56:46
3 2013-10-06 02:00:00
4 2011-07-12 13:11:16
Name: date_captured, dtype: object
I tried the following by selecting just the first string and running the code with the same datetime format. Both work without a problem.
dt=train_df['date_captured']
dt1=dt[0]
date = datetime.datetime.strptime(dt1, "%Y-%m-%d %H:%M:%S")
print(date)
2011-05-13 23:43:18
and
dt1=pd.to_datetime(dt1, format='%Y-%m-%d %H:%M:%S')
print (dt1)
2011-05-13 23:43:18
But why, when I use the same format in pd.to_datetime to convert all the data in the column, does it come up with the error above?
Thank you.
I solved it.
train_df['date_time'] = pd.to_datetime(train_df['date_captured'], errors='coerce')
print (train_df[train_df.date_time.isnull()])
I found that in row 100372 the date_captured value is '11 11':
category_id date_captured ... height date_time
100372 10 11 11 ... 747 NaT
So with errors='coerce', the invalid value is replaced with NaT instead of raising an error.
Thank you.
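For anyone hitting the same error, here is a minimal sketch of the locate-and-clean step on a two-row stand-in for train_df (the dropna call is just one possible way to handle the bad rows):
import pandas as pd

# Tiny stand-in for train_df, including one bad value like the '11 11' row.
train_df = pd.DataFrame({'date_captured': ['2011-05-13 23:43:18', '11 11']})

# errors='coerce' turns unparseable strings into NaT instead of raising.
train_df['date_time'] = pd.to_datetime(train_df['date_captured'], errors='coerce')

# Locate the rows that failed to parse...
print(train_df[train_df['date_time'].isnull()])

# ...and, if appropriate, drop them before further processing.
train_df = train_df.dropna(subset=['date_time'])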

How to extract all rows for which a particular criterion is met? Details in description

I am trying to load a set of policy numbers into my target based on the criteria below, using Informatica PowerCenter.
I want to select all the rows for every policy that has a row with Rider = 0.
This is my source:
Policy Rider Plan
1234 0 1000
1234 1 1010
1234 2 3000
9090 0 2000
9090 2 2545
4321 3 2000
4321 1 2000
The target should look like this:
Policy Rider Plan
1234 0 1000
1234 1 1010
1234 2 3000
9090 0 2000
9090 2 2545
The policy number 4321 would not be loaded.
If I use a filter of Rider = 0, then I miss out on the rows below:
1234 1 1010
1234 2 3000
9090 0 2000
9090 2 2545
What would be the ideal way to load this kind of data using PowerCenter Designer?
Take the same source through one more Source Qualifier in the same mapping, use a filter of Rider=0 to get the list of unique policy numbers that have Rider=0, then use a Joiner with your regular source on policy number. This should work.
Another method: sort your data by Policy and Rider, and use variable ports with a condition similar to the one below.
v_validflag=IIF(v_policy_prev!=policy, IIF(Rider=0, 'valid','invalid'), v_validflag)
v_policy_prev=policy
Then filter valid records.
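Outside of PowerCenter, the carry-forward logic behind those variable ports can be sketched in pandas purely for illustration (column names taken from the question; this is not part of the mapping itself):
import pandas as pd

# Stand-in for the source rows from the question.
df = pd.DataFrame({
    'Policy': [1234, 1234, 1234, 9090, 9090, 4321, 4321],
    'Rider':  [0, 1, 2, 0, 2, 3, 1],
    'Plan':   [1000, 1010, 3000, 2000, 2545, 2000, 2000],
})

# Mimic the variable-port evaluation order: sort by Policy and Rider, then
# carry a validity flag forward; it is recomputed only on the first row of
# each policy, exactly like v_validflag above.
df = df.sort_values(['Policy', 'Rider'])
flag, prev_policy, flags = 'invalid', None, []
for _, row in df.iterrows():
    if row['Policy'] != prev_policy:      # first row of a new policy
        flag = 'valid' if row['Rider'] == 0 else 'invalid'
    prev_policy = row['Policy']
    flags.append(flag)
df['v_validflag'] = flags

print(df[df['v_validflag'] == 'valid'])   # only policies 1234 and 9090 remain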
There are many options. Here are two...
First option:
The mapping will look roughly like this (the filtered branch runs through an AGGREGATOR before the JOINER):
SOURCE >> SOURCE QUALIFIER >> SORTER >> JOINER >> TARGET
(second pipeline from the SORTER: FILTER >> AGGREGATOR >> JOINER)
Connect all ports from the Source Qualifier (SQ) to a SORTER transformation (or sort in the SQ itself) and define the sort key on 'Policy' and 'Rider'. After that, split the stream into two pipelines:
1. First stream:
- Connect 'Policy' and 'Rider' to a FILTER transformation and filter records by 'Rider' = 0.
- Then link 'Policy' (only) to an AGGREGATOR and set Group By to 'YES' for 'Policy'.
- Add a new port with a FIRST or MAX function on the 'Policy' port; this removes duplicate policies.
- Indicate 'Sorted Input' in the AGGREGATOR properties.
- Then link 'Policy' from the AGGREGATOR to the JOINER as Master on the Ports tab.
2. Second stream:
- From the SORTER, link directly to the above JOINER (with the aggregated 'Policy') as Detail.
- Indicate 'Sorted Input' in the JOINER properties.
- Set the Join Type to 'Normal Join' and the Join Condition to POLICY(master) = POLICY(detail) in the JOINER properties.
... and then on to the Target.
Second option:
Just override the SQL in the Source Qualifier...
WITH PLC as (
select POLICY
from SRC_TBL
where RIDER=0)
select s.POLICY, s.RIDER, s.PLAN
from PLC p left JOIN SRC_TBL s on s.POLICY = p.POLICY;
This may vary depending on your source table construction...

How can I extract parts of one column and append them to other columns?

I have a large .csv file that I need to extract information from and add this information to another column. My csv looks something like this:
file_name,#,Date,Time,Temp (°C) ,Intensity
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,
I want to create two new columns that contain data extracted from the "file_name" column: the one or two digits after the text "trap", and the c or u that follows them. The data should look something like this after processing:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
I suspect the way to do this is with awk and a regular expression, but I'm not sure how to implement the regular expression. How can I extract parts of one column and append them to other columns?
Using sed you can do this:
sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
gawk approach:
awk -F, 'NR==1{ print $0,"can_und,trap_no" }
NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file
The output:
file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
NR==1{ print $0,"can_und,trap_no" } - print the header line
match($1,/^trap([0-9]+)([a-z])/,a) - matches the number following the word trap and the suffix letter after it
With sed, this would look like:
sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file
Use sed -i ... to edit the file in place.
Using python pandas reader because python is awesome for numerical analysis:
First: I had to modify the data header row so that the columns were consistent by appending 3 commas:
file_name,#,Date,Time,Temp (°C) ,Intensity,,,
There is probably a way to tell pandas to ignore the column differences - but I am still a noob.
Python code to read your data into columns and create 2 new columns named 'cu_int' and 'cu_char' which contain the parsed elements of the filenames:
import pandas

def main():
    df = pandas.read_csv("file.csv")
    df['cu_int'] = 0      # Add the new columns to the data frame.
    df['cu_char'] = ' '
    for index, df_row in df.iterrows():
        file_name = df['file_name'][index].strip()
        trap_string = file_name.split("_")[0]          # Get the file_name string prior to the underscore.
        numeric_offset_beg = len("trap")               # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1      # Leave off the 'c' or 'u' char.
        numeric_value = trap_string[numeric_offset_beg:numeric_offset_end]
        cu_value = trap_string[len(trap_string) - 1]
        df.loc[index, 'cu_int'] = int(numeric_value)   # Assign per row, not to the whole column.
        df.loc[index, 'cu_char'] = cu_value
    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print(df)

if __name__ == "__main__":
    main()
The printed output (note there are inconsistencies in the data set posted - see row 1 as an example):
$ python read_csv.py
file_name # Date Time Temp (°C) Intensity Unnamed: 6 Unnamed: 7 Unnamed: 8 cu_int cu_char
0 trap12u_10733862_150809.txt 1 05/28/15 06:00:00.0 20.424 215.3 NaN NaN NaN 12 u
1 trap12u_10733862_150809.txt 2 05/28/15 07:00:00.0 21.091 1.0 130.2 NaN NaN 12 u
2 trap12u_10733862_150809.txt 3 05/28/15 08:00:00.0 26.195 3.0 100.0 NaN NaN 12 u
3 trap11u_10733862_150809.txt 4 05/28/15 09:00:00.0 25.222 3.0 444.5 NaN NaN 11 u
4 trap11u_10733862_150809.txt 5 05/28/15 10:00:00.0 26.195 3.0 100.0 NaN NaN 11 u
5 trap11u_10733862_150809.txt 6 05/28/15 11:00:00.0 25.902 2.0 927.8 NaN NaN 11 u
6 trap11u_10733862_150809.txt 7 05/28/15 12:00:00.0 25.708 2.0 325.0 NaN NaN 11 u
7 trap12c_10733862_150809.txt 8 05/28/15 13:00:00.0 26.292 3.0 100.0 NaN NaN 12 c
8 trap12c_10733862_150809.txt 9 05/28/15 14:00:00.0 26.390 2.0 66.7 NaN NaN 12 c
9 trap12c_10733862_150809.txt 10 05/28/15 15:00:00.0 26.097 1.0 463.9 NaN NaN 12 c
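If the goal is only the two new columns, a shorter pandas sketch (assuming the same file.csv with the header fix described above) can use a vectorized regular expression instead of iterating row by row:
import pandas as pd

df = pd.read_csv("file.csv")

# Pull the digits after 'trap' and the letter that follows them in one pass.
extracted = df['file_name'].str.extract(r'^trap(\d+)([a-z])')
df['trap_no'] = extracted[0].astype(int)
df['can_und'] = extracted[1]

print(df[['file_name', 'trap_no', 'can_und']].head())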

Get all oids by ifindex ID and wildcard

Is there a way to get all OIDs by a given ifIndex ID using a wildcard? Say I have:
IF-MIB::ifIndex.513 = INTEGER: 513
Is there a way using snmpget or snmpbulkwalk to get only:
IF-MIB::ifIndex.513 = INTEGER: 513
IF-MIB::ifDescr.513 = STRING: Gi0/1
IF-MIB::ifType.513 = INTEGER: propVirtual(53)
IF-MIB::ifMtu.513 = INTEGER: 1420
IF-MIB::ifSpeed.513 = Gauge32: 0
The best way I can figure this out at present is to snmpwalk the device and use "| grep 513", which becomes highly inefficient the more index IDs I need to do this for.
You can send a single SNMP GET request with multiple variable bindings to get the information you need:
snmpget -c public -v 2c <router_ip_address> ifIndex.513 ifDescr.513 ifType.513 ifMtu.513 ifSpeed.513

Pig Error on SUM function

I have data like this:
store trn_date dept_id sale_amt
1 2014-12-14 101 10007655
1 2014-12-14 101 10007654
1 2014-12-14 101 10007544
6 2014-12-14 104 100086544
8 2014-12-14 101 1000000
9 2014-12-14 106 1000000
I want to get the sum of sale_amt. For this, first I load the data using:
table = LOAD 'table' USING org.apache.hcatalog.pig.HCatLoader();
Then I group the data on store, tran_date, dept_id:
grp_table = GROUP table BY (store, tran_date, dept_id);
Finally, I try to get the SUM of sale_amt using:
grp_gen = FOREACH grp_table GENERATE
FLATTEN(group) AS (store, tran_date, dept_id),
SUM(table.sale_amt) AS tota_sale_amt;
and I get the error below:
================================================================================
Pig Stack Trace
---------------
ERROR 2103: Problem doing work on Longs
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: grouped_all: Local Rearrange[tuple]{tuple}(false) - scope-1317 Operator Key: scope-1317): org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:289)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:263)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:183)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:161)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1645)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1611)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1462)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:700)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:770)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2103: Problem doing work on Longs
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:84)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:108)
at org.apache.pig.builtin.AlgebraicLongMathBase$Intermediate.exec(AlgebraicLongMathBase.java:102)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:369)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:333)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:281)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Number
at org.apache.pig.builtin.AlgebraicLongMathBase.doTupleWork(AlgebraicLongMathBase.java:77)
================================================================================
Since I'm reading the table using the HCatalog loader and the data type in the Hive table is string, I have tried casting in the script as well, but I still get the same error.
I don't have HCatalog installed on my system, so I tried with a simple file, but the approach and code below will work for you.
1. SUM works only with numeric data types (int, long, float, double, bigdecimal, biginteger, or bytearray cast as double). It looks like your sale_amt column is a string, so you need to cast this column to long or double before using the SUM function.
2. You should not use store as a name, because it is a reserved keyword in Pig; rename it to something else or you will get an error. I renamed it to "stores".
Example:
table:
1 2014-12-14 101 10007655
1 2014-12-14 101 10007654
1 2014-12-14 101 10007544
6 2014-12-14 104 100086544
8 2014-12-14 101 1000000
9 2014-12-14 106 1000000
PigScript:
A = LOAD 'table' USING PigStorage() AS (store:chararray,trn_date:chararray,dept_id:chararray,sale_amt:chararray);
B = FOREACH A GENERATE $0 AS stores,trn_date,dept_id,(long)sale_amt; --Renamed the variable store to stores and typecasted the sale_amt to long.
C = GROUP B BY (stores,trn_date,dept_id);
D = FOREACH C GENERATE FLATTEN(group),SUM(B.sale_amt);
DUMP D;
Output:
(1,2014-12-14,101,30022853)
(6,2014-12-14,104,100086544)
(8,2014-12-14,101,1000000)
(9,2014-12-14,106,1000000)
