Below is the line_profiler record of a function:
Wrote profile results to FM_CORE.py.lprof
Timer unit: 2.79365e-07 s
File: F:\FM_CORE.py
Function: _rpt_join at line 1068
Total time: 1.87766 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1068                                           @profile
  1069                                           def _rpt_join(dfa, dfb, join_type='inner'):
  1070                                               ''' join two dataframe together by ('STK_ID','RPT_Date') multilevel index.
  1071                                                   'join_type' can be 'inner' or 'outer'
  1072                                               '''
  1073
  1074         2           56     28.0      0.0      try: # ('STK_ID','RPT_Date') are normal column
  1075         2      2936668 1468334.0     43.7         rst = pd.merge(dfa, dfb, how=join_type, on=['STK_ID','RPT_Date'], left_index=True, right_index=True)
  1076                                               except: # ('STK_ID','RPT_Date') are index
  1077                                                   rst = pd.merge(dfa, dfb, how=join_type, left_index=True, right_index=True)
  1078
  1079
  1080         2           81     40.5      0.0      try: # handle 'STK_Name'
  1081         2       426472  213236.0      6.3         name_combine = pd.concat([dfa.STK_Name, dfb.STK_Name])
  1082
  1083
  1084         2       900584  450292.0     13.4         nameseries = name_combine[-Series(name_combine.index.values, name_combine.index).duplicated()]
  1085
  1086         2      1138140  569070.0     16.9         rst.STK_Name_x = nameseries
  1087         2       596768  298384.0      8.9         rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293  361146.5     10.7         rst = rst.drop(['STK_Name_y'], axis=1)
  1089                                               except:
  1090                                                   pass
  1091
  1092         2           94     47.0      0.0      return rst
What surprises me are these two lines:
  1087         2       596768  298384.0      8.9         rst = rst.rename(columns={'STK_Name_x': 'STK_Name'})
  1088         2       722293  361146.5     10.7         rst = rst.drop(['STK_Name_y'], axis=1)
Why do simple DataFrame column "rename" and "drop" operations cost that much time (8.9% + 10.7%)? After all, the "merge" operation only costs 43.7%, and "rename"/"drop" do not look like computation-intensive operations. How can I improve this?
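A plausible explanation, worth verifying with line_profiler on your own data: rename and drop are not metadata-only operations in pandas. Each returns a new DataFrame and copies the underlying data, so their cost scales with the size of rst rather than with the apparent simplicity of the call. A minimal sketch of one way to avoid the two full copies, assuming the same rst and nameseries as in the profiled code:

# Hypothetical rework of the assign/rename/drop sequence above: mutate rst
# in place instead of producing two fresh copies of the whole frame.
rst['STK_Name_x'] = nameseries
del rst['STK_Name_y']                  # removes one column without copying the rest
rst.columns = ['STK_Name' if c == 'STK_Name_x' else c
               for c in rst.columns]   # relabels only; no data is copied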
I have a dataset like this in Power BI, with connections between the "Participant ID" and "Knows Participant" columns:
Participant ID    Knows Participant
111               353
111               777
111               112
111               249
112               143
112               144
113               111
113               244
114               NaN
115               113
...               ...
777               111
777               398
777               114
778               NaN
779               112
3499              NaN
I've built a Network chart. However, there are a lot of 1-1 connections that are not very useful for visualization, so I want to exclude them (see image):
Is it possible to count the number of connections in each network using DAX and then use this value to filter out all nodes with only 1 connection (red circled)? Or maybe filter out 1-connection nodes using another approach?
I've tried to make a calculated column using DAX:
Connection Column = COUNTROWS(
FILTER(Table,
EARLIER(Table[Knows Participant])=Table[Knows Participant])
)
However, it only shows duplicate values in the "Knows Participant" column, not the number of connections in each network.
Example of desired output:
Participant ID    Knows Participant    Number of Connections in the Network
111               353                  4
353               444                  4
444               551                  4
551               987                  4
112               143                  1
220               190                  1
333               337                  2
337               410                  2
765                                    0
You need the PATH functions as you're essentially trying to flatten a hierarchy and then exclude certain parts of it. The following help page gives a good rundown of the approach to take.
https://learn.microsoft.com/en-us/dax/understanding-functions-for-parent-child-hierarchies-in-dax
You can add a calculated column to the table like this:
VAR pIdLinksCount = CALCULATE(COUNTROWS(tbl), ALL('tbl'[Knows Participant]))
VAR neighbourLinksCount =
    IF(
        pIdLinksCount = 1
        , -- if pIdLinksCount = 1 then count the neighbour's links
        VAR neighbourId =
            CALCULATETABLE(
                VALUES('tbl'[Knows Participant])
            )
        RETURN
            CALCULATE(
                COUNTROWS(tbl)
                ,ALL() -- removes all filters from the data model
                ,'tbl'[Participant ID] = neighbourId -- applies a filter to the [Participant ID] column
                --,'tbl'[Participant ID] IN neighbourId -- alternatively try this; I believe it is not necessary
            )
        ,2 -- returns 2 if pIdLinksCount > 1, so result below is at least 4
           -- and the "result > 2" test returns TRUE()
    )
VAR result = pIdLinksCount + neighbourLinksCount
RETURN
    IF(
        result > 2
        ,1
        ,0
    )
The idea is to check the neighbour too: when a participant has only one link, the neighbour's link count decides whether the pair is an isolated 1-1 connection (result = 2, filtered out) or part of a larger network (result > 2, kept).
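As a cross-check of the logic rather than a Power BI solution: the desired column is just the edge count of each node's connected component. A plain Python sketch over the example rows from the desired output above:

from collections import Counter

# Edges and the isolated participant from the desired-output example above.
edges = [(111, 353), (353, 444), (444, 551), (551, 987),
         (112, 143), (220, 190), (333, 337), (337, 410)]
isolated = [765]

# Union-find groups participants into networks (connected components).
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
        x = parent[x]
    return x

for a, b in edges:
    parent[find(a)] = find(b)

# Number of connections in a network = number of edges in its component.
edge_count = Counter(find(a) for a, b in edges)

for a, b in edges:
    print(a, b, edge_count[find(a)])   # e.g. 111 353 4
for n in isolated:
    print(n, None, 0)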
I have a data file like this:
1943 49 1
1975 91 L
1903 56 3
1909 52 3
1953 96 3
1912 82
1976 66 3
1913 35
1990 45 1
1927 92 A
1912 2
1924 22
1971 2
1959 94 E
Now, using a Pig script, I want to remove the bad data, i.e. the rows that have characters or empty fields.
I tried this:
records = load '/user/a106524609/test.txt' using PigStorage(' ') as
(year:chararray, temperature:int, quality:int);
rec1 = filter records by temperature != 'null' and (quality != 'null ')
Load it as lines:
A = load 'data.txt' using PigStorage('\n') as (line:chararray);
Split on whitespace:
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\s+')) as (year:int,temp:int,quality:chararray);
Filter by valid strings:
C = FILTER B BY quality IN ('0','1','2','3','4','5','6','7','8','9');
(Optionally) Cast to an int:
D = FOREACH C GENERATE year,temp,(int)quality;
In Spark, I would start with a regex match of the expected format.
val cleanRows = sc.textFile("data.txt")
  .filter(line => line.matches("(?:\\d+\\s+){2}\\d+"))
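A PySpark sketch of the same filter, assuming Python 3 and an existing SparkContext named sc, as in the Scala line above:

import re

# Keep only lines that consist of exactly three whitespace-separated numeric fields.
pattern = re.compile(r"(?:\d+\s+){2}\d+")
cleanRows = sc.textFile("data.txt").filter(
    lambda line: pattern.fullmatch(line) is not None)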
I am facing an issue in my application, which throws this error:
Microsoft VBScript runtime error '800a0005'
Invalid procedure call or argument: 'mid'
strLine = Trim(Mid(strLine, 1, InStr(UCase(strLine), "SINGLE POINT DATA") - 1))
The main function is below:
strLine = tstextstreamRead.ReadAll
If cint(intTestId) = 23 Then
    strLine2 = Trim(Mid(strLine, InStr(UCase(strLine), "SINGLE POINT DATA") + 1, Len(strLine)))
    strLine2 = Trim(Mid(strLine2, InStr(strLine2, "Model Cross-Arrhenius"), Len(strLine2)))
End If
response.write strLine
strLine = Trim(Mid(strLine, 1, InStr(UCase(strLine), "SINGLE POINT DATA") - 1))
When I verified via response.write strLine, I got the data below:
Sample ID Region :PACIFIC
Request No :tesy
Sample No :12
Company :213
Family :213
Grade :213
Standard :ZOLLER METHOD
Color No :UNKNOWN
Lot No :UNKNOWN
Remark :213
Date :7/22/2016
Generic :123
Operator :213
Test Lab :WTC
Test Name :Pressure-Volume-Temperature (PVT)
Test Method :ZOLLER METHOD
Dataset 0
POINTS 26 30.2 1109.998 40.6 1126.961 50.8 1124.916 61.1 1121.716 71.3 1117.909 81.5 1113.674 91.4 1108.76 101.4 1103.019 111.8 1095.211 121.8 1087.404 131.8 1079.681 142.2 1072.674 152 1065.485 162.3 1057.771 172.1 1050.805 182.3 1043.491 193.4 1036.327 203 1029.239 214.1 1022.532 223.6 1015.535 234.6 1008.179 245.1 1000.297 255 993.677 265.6 986.178 275.4 979.918 285.8 973.305
Dataset 50
POINTS 26 30.2 1141.188 40.6 1145.201 50.8 1142.895 61.1 1140.183 71.3 1137.529 81.5 1134.817 91.4 1132.018 101.4 1129.33 111.8 1125.39 121.8 1120.212 131.8 1113.709 142.2 1107.681 152 1101.57 162.3 1095.298 172.1 1089.361 182.3 1083.419 193.4 1077.757 203 1072.34 214.1 1066.657 223.6 1061.414 234.6 1055.779 245.1 1049.333 255 1043.651 265.6 1038.333 275.4 1033.283 285.8 1028.57
Dataset 100
POINTS 26 30.2 1160.699 40.6 1159.695 50.8 1157.533 61.1 1155.299 71.3 1153.075 81.5 1151.028 91.4 1149.093 101.4 1147.7 111.8 1145.902 121.8 1142.99 131.8 1138.733 142.2 1133.957 152 1128.643 162.3 1123.009 172.1 1117.834 182.3 1112.526 193.4 1107.26 203 1102.65 214.1 1097.619 223.6 1093.004 234.6 1086.949 245.1 1082.376 255 1077.455 265.6 1072.812 275.4 1068.752 285.8 1064.607
Dataset 150
POINTS 26 30.2 1174.42 40.6 1172.732 50.8 1170.673 61.1 1168.593 71.3 1166.813 81.5 1165.074 91.4 1163.543 101.4 1162.673 111.8 1161.681 121.8 1160.11 131.8 1157.741 142.2 1154.905 152 1150.866 162.3 1146.22 172.1 1141.382 182.3 1136.381 193.4 1131.774 203 1127.484 214.1 1123.175 223.6 1118.773 234.6 1113.149 245.1 1108.975 255 1104.514 265.6 1100.457 275.4 1096.706 285.8 1093.423
Dataset 200
POINTS 26 30.2 1185.837 40.6 1183.95 50.8 1182.17 61.1 1180.377 71.3 1178.593 81.5 1177.103 91.4 1175.963 101.4 1175.373 111.8 1174.678 121.8 1173.552 131.8 1172.182 142.2 1170.538 152 1168.029 162.3 1164.802 172.1 1160.855 182.3 1156.361 193.4 1152.075 203 1147.987 214.1 1143.934 223.6 1139.92 234.6 1134.884 245.1 1130.755 255 1126.75 265.6 1123.348 275.4 1119.874 285.8 1116.635
Kindly let me know what the exact problem is.
Your data does not contain "SINGLE POINT DATA", so InStr() returns 0 (not found) and your Mid() call boils down to:
>> s = Mid("s", 1, -1)
>>
Error Number: 5
Error Description: Invalid procedure call or argument
>>
You should check the InStr() result:
nPos = InStr(UCase(strLine), "SINGLE POINT DATA")
If 0 = nPos Then
    WScript.Echo "Bingo"
Else
    strLine = Trim(Mid(strLine, 1, nPos - 1))
End If
I started my project with the NCBI standalone BLAST and used the -outfmt 17 option. For my purpose that formatting is extremely helpful. However, I had to switch to Biopython, and I'm now using qblast to align my sequences against the NCBI NT database. Can I save/convert the qblast XML to a format comparable to the NCBI BLAST standalone -outfmt 17 format?
Thank you very much for your help!
Cheers,
Philipp
I'm going to assume you meant -outfmt 7 and you need an output with columns.
from Bio.Blast import NCBIWWW, NCBIXML
# This is the BLASTN query, which returns an XML handle in a StringIO
r = NCBIWWW.qblast(
    "blastn",
    "nr",
    "ACGGGGTCTCGAAAAAAGGAGAATGGGATGAGAAGGATATATGGGTAGTGTCATTTTTTAACTTGCAGAT" +
    "TTCATCCTAGTCTTCCAGTTATCGTTTCCTAGCACTCCATGTTCCCAAGATAGTGTCACCACCCCAAGGA" +
    "CTCTCTCTCATTTTCTTTGCCTGGGCCCTCTTTCTACTGAGGAGTCGTGGCCTTCCATCAGTAGAAGCCG",
    expect=1E-5)

# Now we parse that XML, extracting the info. The ten columns loosely mirror
# tabular BLAST output: identity fraction (computed here from positives),
# alignment length, mismatches, gaps, query start/end, subject start/end,
# e-value, and score. (Under Python 2 the first column is integer division,
# which is where the 1/0 values shown below come from.)
for record in NCBIXML.parse(r):
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            cols = "{}\t" * 10
            print(cols.format(hsp.positives / hsp.align_length,
                              hsp.align_length,
                              hsp.align_length - hsp.positives,
                              hsp.gaps,
                              hsp.query_start,
                              hsp.query_end,
                              hsp.sbjct_start,
                              hsp.sbjct_end,
                              hsp.expect,
                              hsp.score))
Outputs something like:
1 210 0 0 1 210 89250 89459 8.73028e-102 420.0
0 206 19 2 5 210 46259 46462 5.16461e-73 314.0
1 210 0 0 1 210 68822 69031 8.73028e-102 420.0
0 206 19 2 5 210 25825 26028 5.16461e-73 314.0
1 210 0 0 1 210 65887 66096 8.73028e-102 420.0
...
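If a closer match to BLAST's tabular percent-identity column is wanted, a small tweak inside the hsp loop above should get there (a sketch, using the HSP's identities attribute rather than positives, with float division so Python 2 doesn't truncate):

# Sketch: percent identity in the style of BLAST tabular output.
pct_identity = 100.0 * hsp.identities / hsp.align_length  # e.g. 100.0 for a perfect hit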
I have been stuck on this problem for over twelve hours now. I have a Pig script that is running on Amazon Web Services. Currently, I am just running my script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information so it has to be joined with another table that does.
State Table:
719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780
719994 99999 SEDCO 710 CN CA CWQJ +46500 -048500 +00000
720000 99999 BOGUS AMERICAN US US -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720002 99999 HALLOCK(AWS) US US MN K03Y +48783 -096950 +02500
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800
720005 99999 GASTONIA US US NC K0A6 +35200 -081150 +02440
Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00G 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00G 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00G 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00G 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05G 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02G 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00G 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05G 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00G 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09G 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but filter it to rows where the country is 'US'.
The bags have the following schema:
CLIMATE_REMOVE_EMPTY: {station: int,wban: int,year: int,month: int,day: int,temp: double}
STATES_FILTER_US: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station,wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 HASH_JOIN,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN HASH_JOIN hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,
Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"
Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful. That leads me to believe the join condition is never satisfied. I know the input files have some data that should satisfy it, but it returns absolutely nothing.
The only thing that looks suspicious is a warning that states:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or anything in debug.
I'm not sure if these mean anything, but here are other things that stand out:
When I try to illustrate STATE_CLIMATE_JOIN, I get a NullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
Here is my full code:
--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
CLIMATE=
FOREACH
RAW_CLIMATE
GENERATE
FLATTEN ((tuple(int,int,int,int,int,double))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
)
AS (
station: int,
wban: int,
year: int,
month: int,
day: int,
temp: double
)
;
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;
CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);
Thanks in advance. I am at a loss here.
--EDIT--
I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
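For reference, a quick Python check makes the failure mode visible: the STATES pattern above assumes a single-token station name, so rows like "PEASON RIDGE/RANGE" never match, EXTRACT yields nulls, and the join input ends up empty (which also explains the ACCESSING_NON_EXISTENT_FIELD warnings). A minimal sketch:

import re

# The STATES pattern from the script, which assumes a one-token station name.
pattern = re.compile(r'^(\d{6})\s+(\d{5})\s+(\S+)\s+(\w{2})\s+(\w{2})\s+(\w{2})')

rows = [
    '719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780',            # matches
    '720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410', # no match: name contains a space
]
for row in rows:
    print(bool(pattern.match(row)), row)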