How to identify customers who didn't make or receive calls and didn't use the internet during the churn phase? - sklearn-pandas

I'm trying to solve a problem with the data set below:
Cust_Id period Total_Incoming_Call Total_outgoing_call Net_uses
123 09/01/2018 0 0 2
234 09/02/2018 0 0 0
345 09/03/2018 1 40 1
abc1 09/04/2018 0 0 0
I'd like to get the output below:
Cust_Id Period Total_Incoming_call Total_outgoing_call Net_uses
234 09/02/2018 0 0 0
abc1 09/04/2018 0 0 0
I know how to filter on one column of a pandas DataFrame, but I'm not sure how to filter on multiple columns at once so I can tag those customers as churned.
cust = pd.read_csv('....../.csv')
cust = cust[cust.Net_uses == 0]
cust = cust[cust.Total_Incoming_Call == 0]
Should I use the line below, or is there a better way to do it?
cust = cust[(cust.Total_Incoming_Call == 0) & (cust.Net_uses == 0)]

cust = cust[(cust.Total_Incoming_Call == 0) & (cust.Net_uses == 0)] works just fine.
You can also use .loc for the same purpose:
cust = cust.loc[(cust.Total_Incoming_Call == 0) & (cust.Net_uses == 0), :]
If you'd rather keep the frame's shape and just mask the rows where the condition is False (they become NaN), use .where:
cust = cust.where((cust.Total_Incoming_Call == 0) & (cust.Net_uses == 0))
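For reference, a minimal self-contained sketch that reproduces the desired output from the sample rows above; the column names are assumed to be exactly as in the question's header, and a Total_outgoing_call == 0 condition is added so that only the rows for 234 and abc1 are returned:
import pandas as pd

# Toy data copied from the question (column names assumed as shown above)
cust = pd.DataFrame({
    "Cust_Id": ["123", "234", "345", "abc1"],
    "Period": ["09/01/2018", "09/02/2018", "09/03/2018", "09/04/2018"],
    "Total_Incoming_Call": [0, 0, 1, 0],
    "Total_outgoing_call": [0, 0, 40, 0],
    "Net_uses": [2, 0, 1, 0],
})

# Customers with no incoming calls, no outgoing calls and no internet use
churn = cust[
    (cust["Total_Incoming_Call"] == 0)
    & (cust["Total_outgoing_call"] == 0)
    & (cust["Net_uses"] == 0)
]
print(churn)  # rows for Cust_Id 234 and abc1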

Related

How to use multiple conditions to generate new variables in Stata?

I want to generate 3 NEW variables using these variables in my data set:
Ucod
19 variables in a series with this naming: Record_2, Record_3, ..., Record_20
Both contain values in alphanumeric format, basically ICD codes, e.g. I150
I want to generate 3 new variables, one for each of these three conditions:
People dying primarily of COVID (Var1=1 if Ucod= U07.1)
People dying of a non-COVID condition WITH covid (Var2=1 IF Ucod != U07.1 & Record_2/20= U07.1)
People dying of a non-COVID condition WITHOUT covid (Var3=1 if Ucod != U07.1 & Record_2/20 != U07.1)
Can anyone suggest code that can help me generate these 3 variables using these 3 conditions?
This may help. Note how I needed to define a toy dataset to give flavour to the problem.
* Example generated by -dataex-.
clear
input str5(Ucod Record_2) str4(Record_3 Record_4)
"U07.1" "U000" "U111" "U222"
"U999" "U07.1" "U444" "U333"
"U888" "U777" "U666" "U555"
end
gen wanted1 = Ucod == "U07.1"
gen count = 0
quietly foreach v of var Record_* {
    replace count = count + (`v' == "U07.1")
}
gen wanted2 = Ucod != "U07.1" & count > 0
gen wanted3 = Ucod != "U07.1" & count == 0
list
+------------------------------------------------------------------------------+
| Ucod Record_2 Record_3 Record_4 wanted1 count wanted2 wanted3 |
|------------------------------------------------------------------------------|
1. | U07.1 U000 U111 U222 1 0 0 0 |
2. | U999 U07.1 U444 U333 0 1 1 0 |
3. | U888 U777 U666 U555 0 0 0 1 |
+------------------------------------------------------------------------------+

applyInPandas() aggregation runs slowly on big delta table

I'm trying to create a gold table notebook in Databricks; however, it would take 9 days to fully reprocess the historical data (43 GB, 35k parquet files). I tried scaling up the cluster, but throughput doesn't go above 5,000 records/second. The bottleneck seems to be the applyInPandas() function. I'm wondering whether I could replace pandas with something else to make the gold notebook run faster.
The silver table has 60 columns (read_id, reader_id, tracker_timestamp, event_type, ebook_id, page_id, agent_ip, agent_device_type, ...). Each row represents a read event of an ebook, e.g. 'page turn', 'click on image', 'click on link', ... All of the events that occurred in a single session share the same read_id. In the gold table I'm trying to group those events into sessions and calculate the number of times each event occurred in a single session. So instead of 100+ rows of data for a read session in the silver table, I would end up with just a single aggregated row in the gold table.
Input is the silver delta table:
import pyspark.sql.functions as F
import pyspark.sql.types as T
import pandas as pd
from pyspark.sql.functions import pandas_udf
input = (spark
    .readStream
    .format("delta")
    .option("withEventTimeOrder", "true")
    .option("maxFilesPerTrigger", 100)
    .load(f"path_to_silver_bucket")
)
I use the withWatermark and session_window functions to make sure I end up grouping all of the events from a single read session (a read session automatically ends 30 minutes after the last reader activity):
group = input.withWatermark("tracker_timestamp", "10 minutes").groupBy("read_id", F.session_window(input.tracker_timestamp, "30 minutes"))
In the next step I use the applyInPandas function like so:
sessions = group.applyInPandas(processing_function, schema=processing_function_output_schema)
Definition of the processing_function used in applyInPandas:
def processing_function(df):
    surf_time_ms = df.query('event_type == "surf"')['duration'].sum()
    immerse_time_ms = df.query('event_type == "immersion"')['duration'].sum()
    min_timestamp = df['tracker_timestamp'].min()
    max_timestamp = df['tracker_timestamp'].max()
    shares = len(df.query('event_type == "share"'))
    leads = len(df.query('event_type == "lead_store"'))
    is_read = len(df.query('event_type == "surf"')) > 0
    distinct_pages = df['page_id'].nunique()
    data = {
        "read_id": df['read_id'].values[0],
        "surf_time_ms": surf_time_ms,
        "immerse_time_ms": immerse_time_ms,
        "min_timestamp": min_timestamp,
        "max_timestamp": max_timestamp,
        "shares": shares,
        "leads": leads,
        "is_read": is_read,
        "number_of_events": len(df),
        "distinct_pages": distinct_pages
    }
    for field in not_calculated_string_fields:
        data[field] = df[field].values[0]
    new_df = pd.DataFrame(data=data, index=['read_id'])
    for x in all_events:
        new_df[f"count_{x}"] = df.query(f"type == '{x}'").count()
    for x in duration_events:
        duration = df.query(f"event_type == '{x}'")['duration']
        duration_sum = duration.sum()
        new_df[f"duration_{x}_ms"] = duration_sum
        if duration_sum > 0:
            new_df[f"mean_duration_{x}_ms"] = duration.mean()
        else:
            new_df[f"mean_duration_{x}_ms"] = 0
    return new_df
And finally, I'm writing the calculated row to the gold table like so:
for_partitioning = (sessions
    .withColumn("tenant", F.col("story_tenant"))
    .withColumn("year", F.year(F.col("min_timestamp")))
    .withColumn("month", F.month(F.col("min_timestamp"))))
checkpoint_path = "checkpoint-path"
gold_path = f"gold-bucket"
(for_partitioning
    .writeStream
    .format('delta')
    .partitionBy('year', 'month', 'tenant')
    .option("mergeSchema", "true")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .start(gold_path))
Can anybody think of a more efficient way to do a UDF in PySpark than applyInPandas for the above example? I simply cannot afford to wait 9 days to reprocess 43GB of data...
I've tried playing around with different input and output options (e.g. .option("maxFilesPerTrigger", 100)) but the real problem seems to be applyInPandas.
You could rewrite your processing_function into native Spark if you really wanted.
"read_id": df['read_id'].values[0]
F.first('read_id').alias('read_id')
"surf_time_ms": df.query('event_type == "surf"')['duration'].sum()
F.sum(F.when(F.col('event_type') == 'surf', F.col('duration'))).alias('surf_time_ms')
"immerse_time_ms": df.query('event_type == "immersion"')['duration'].sum()
F.sum(F.when(F.col('event_type') == 'immersion', F.col('duration'))).alias('immerse_time_ms')
"min_timestamp": df['tracker_timestamp'].min()
F.min('tracker_timestamp').alias('min_timestamp')
"max_timestamp": df['tracker_timestamp'].max()
F.max('tracker_timestamp').alias('max_timestamp')
"shares": len(df.query('event_type == "share"'))
F.count(F.when(F.col('event_type') == 'share', F.lit(1))).alias('shares')
"leads": len(df.query('event_type == "lead_store"'))
F.count(F.when(F.col('event_type') == 'lead_store', F.lit(1))).alias('leads')
"is_read": len(df.query('event_type == "surf"')) > 0
(F.count(F.when(F.col('event_type') == 'surf', F.lit(1))) > 0).alias('is_read')
"number_of_events": len(df)
F.count(F.lit(1)).alias('number_of_events')
"distinct_pages": df['page_id'].nunique()
F.countDistinct('page_id').alias('distinct_pages')
for field in not_calculated_string_fields:
    data[field] = df[field].values[0]
*[F.first(field).alias(field) for field in not_calculated_string_fields]
for x in all_events:
    new_df[f"count_{x}"] = df.query(f"type == '{x}'").count()
The above can probably be skipped? As far as my tests go, new columns get NaN values, because .count() returns a Series object instead of one simple value.
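If those per-event counts are actually needed, a hedged equivalent in native Spark could be added to the aggregation below (assuming the intent was to filter on event_type, since type is not among the listed columns):
# Assumption: the per-event counts were meant to filter on event_type, not type
*[F.count(F.when(F.col('event_type') == x, F.lit(1))).alias(f"count_{x}") for x in all_events],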
for x in duration_events:
    duration = df.query(f"event_type == '{x}'")['duration']
    duration_sum = duration.sum()
    new_df[f"duration_{x}_ms"] = duration_sum
    if duration_sum > 0:
        new_df[f"mean_duration_{x}_ms"] = duration.mean()
    else:
        new_df[f"mean_duration_{x}_ms"] = 0
*[F.sum(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"duration_{x}_ms") for x in duration_events]
*[F.mean(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"mean_duration_{x}_ms") for x in duration_events]
So, instead of
def processing_function(df):
...
...
sessions = group.applyInPandas(processing_function, schema=processing_function_output_schema)
you could use efficient native Spark:
sessions = group.agg(
    F.first('read_id').alias('read_id'),
    F.sum(F.when(F.col('event_type') == 'surf', F.col('duration'))).alias('surf_time_ms'),
    F.sum(F.when(F.col('event_type') == 'immersion', F.col('duration'))).alias('immerse_time_ms'),
    F.min('tracker_timestamp').alias('min_timestamp'),
    F.max('tracker_timestamp').alias('max_timestamp'),
    F.count(F.when(F.col('event_type') == 'share', F.lit(1))).alias('shares'),
    F.count(F.when(F.col('event_type') == 'lead_store', F.lit(1))).alias('leads'),
    (F.count(F.when(F.col('event_type') == 'surf', F.lit(1))) > 0).alias('is_read'),
    F.count(F.lit(1)).alias('number_of_events'),
    F.countDistinct('page_id').alias('distinct_pages'),
    *[F.first(field).alias(field) for field in not_calculated_string_fields],
    # skipped count_{x}
    *[F.sum(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"duration_{x}_ms") for x in duration_events],
    *[F.mean(F.when(F.col('event_type') == x, F.col('duration'))).alias(f"mean_duration_{x}_ms") for x in duration_events],
)
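Because these are all built-in Spark expressions, the aggregation runs inside the JVM instead of shipping each session's rows to a Python worker as a pandas DataFrame, which is typically where applyInPandas spends most of its time. As a rough, hedged sanity check, you could first run a similar aggregation on a small static sample of the silver table before wiring it into the streaming query (the path and column names below are assumed from the question):
# Hedged sketch: batch-read a small sample of the silver table and aggregate it
# with native expressions; grouping only by read_id is enough for a quick check.
sample = spark.read.format("delta").load("path_to_silver_bucket").limit(100000)
check = (sample
    .groupBy("read_id")
    .agg(F.count(F.lit(1)).alias("number_of_events"),
         F.countDistinct("page_id").alias("distinct_pages")))
check.show(5, truncate=False)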

Auto number an attribute based on multiple attributes

I have a transaction, and a web panel that uses Work With Plus to insert data into it.
I want to auto-number the attribute TmpltId based on SalOutCd7Plc and BseCd, like this:
Example:
SalOutCd7Plc = 1 and BseCd = 1 -> TmpltId = 1, and the next record with SalOutCd7Plc = 1 and BseCd = 1 -> TmpltId = 2
But if SalOutCd7Plc = 1 and BseCd = 2 -> TmpltId = 1, and so on
If SalOutCd7Plc = 2 and BseCd = 1 -> TmpltId = 1, and so on
How can I achieve this? Thank you.
To auto-number the attribute TmpltId, you can create a procedure with the following:
Rules:
parm(in:&SENSY0470M_SalOutCd7Plc,in:&SENSY0470M_BseCd,out:&SENSY0470M_TmpltId);
Source:
For each SENSY0470M order SENSY0470M_SalOutCd7Plc SENSY0470M_BseCd (SENSY0470M_TmpltId)
    where SENSY0470M_SalOutCd7Plc = &SENSY0470M_SalOutCd7Plc
    where SENSY0470M_BseCd = &SENSY0470M_BseCd
    &SENSY0470M_TmpltId = SENSY0470M_TmpltId + 1
    exit
when none
    &SENSY0470M_TmpltId = 1
EndFor
Then, in your web panel, before inserting, call the procedure to get the new SENSY0470M_TmpltId:
&NEW_SENSY0470M_TmpltId = Procedure.Udp(&SENSY0470M_SalOutCd7Plc, &SENSY0470M_BseCd)

Count the 1 and 0 by group in Pig

How can I count how many 1s and 0s there are for each type of event? I'm doing all of this in Pig, and the second field only contains 1s and 0s.
The data looks like this:
(pageLoad,1)
(pageLoad,0)
(pageLoad,1)
(appLaunch,1)
(appLaunch,0)
(otherEvent,1)
(otherEvent,0)
(event,1)
(event,1)
(event,0)
(somethingelse,0)
The output should be something like this:
pageLoad 1:234 0:2359
appLaunch 1:54 0:111
event 1:345 0:0
or
type 1 0
pageLoad 21 345
appLaunch 0 123
event 234 12
Thanks everyone.
Input:
pageLoad,1
pageLoad,0
pageLoad,1
appLaunch,1
appLaunch,0
otherEvent,1
otherEvent,0
event,1
event,1
event,0
somethingelse,0
Pig Script:
A = LOAD 'input.csv' USING PigStorage(',') AS (event_type:chararray,status:int);
B = GROUP A BY event_type;
req = FOREACH B {
    event_type_1 = FILTER A BY status == 1;
    event_type_0 = FILTER A BY status == 0;
    GENERATE group AS event_type, COUNT(event_type_1) AS event_type_1_count, COUNT(event_type_0) AS event_type_0_count;
};
DUMP req;
Output:
(event,2,1)
(pageLoad,2,1)
(appLaunch,1,1)
(otherEvent,1,1)
(somethingelse,0,1)

Better way to map multiple values (inject)

Please suggest a better (more Ruby-ish) way to summarise multiple entries in one go, given the code below:
def summary_totals
  pay_paid, pay_unpaid = 0, 0
  rec_paid, rec_unpaid = 0, 0
  net_paid, net_unpaid = 0, 0
  summary_entries.each do |proj_summary|
    pay_paid += proj_summary.payable.paid || 0
    pay_unpaid += proj_summary.payable.unpaid || 0
    rec_paid += proj_summary.receivable.paid || 0
    rec_unpaid += proj_summary.receivable.unpaid || 0
    net_paid += proj_summary.net.paid || 0
    net_unpaid += proj_summary.net.unpaid || 0
  end
  pay = PaidUnpaidEntry.new(pay_paid, pay_unpaid)
  rec = PaidUnpaidEntry.new(rec_paid, rec_unpaid)
  net = PaidUnpaidEntry.new(net_paid, net_unpaid)
  ProjectPaymentsSummary.new(pay, rec, net)
end
Update: All you need to do is to rewrite the each loop (which sums up 6 variables) in a better Ruby style.
"Better" could be subjective, but I guess you want to use inject to do the summing. The symbol argument to inject can be used to make it nice and concise. If you pass the result directly to your constructors, there's no need for the local variables, eg:
pay = PaidUnpaidEntry.new(
  summary_entries.map { |e| e.payable.paid }.inject(:+),
  summary_entries.map { |e| e.payable.unpaid }.inject(:+)
)
# etc
