Delete duplicates based on varchar in another row. Eliminate terminated staff from active roster - drop-duplicates

I want to create a roster that shows only active staff. All staff are listed in a table, active staff with no termination have one row in the table. Terminated staff have two rows of data. Employee status is wages or term. How can I get one row of only active staff? Each staff member has a unique ID.

What programming language are you using?
I'm assuming you're using Python with pandas to process the data.
Whenever possible, share sample data and edit your question to add a little more context, so that others can help you more efficiently.
I've developed the code below, trying to re-create your problem based on what I could infer about the data from your question.
# == Necessary Imports ===============================================
import pandas as pd

# == Generate sample dataframe =======================================
# For our sample, employees 3, 7, and 9 will represent the terminated
# employees.
df = pd.DataFrame(
    {
        "memberID": [1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9, 9],
        "status": [
            "wages",
            "wages",
            "wages",  # <-- Employee 3 terminated
            "term",   # <-- Employee 3 terminated
            "wages",
            "wages",
            "wages",
            "wages",  # <-- Employee 7 terminated
            "term",   # <-- Employee 7 terminated
            "wages",
            "wages",  # <-- Employee 9 terminated
            "term",   # <-- Employee 9 terminated
        ],
    }
)

# == Solution ==============================================================
df.loc[
    df["memberID"].isin(
        df["memberID"]
        .value_counts()
        .to_frame("count")
        .loc[lambda xdf: xdf["count"] < 2, :]
        .index
    ),
    :,
]
# Returns:
#
#    memberID status
# 0         1  wages
# 1         2  wages
# 4         4  wages
# 5         5  wages
# 6         6  wages
# 9         8  wages
Notes
Basically, my approach was to use pandas.Series.value_counts to find the number of times each "memberID" appears in the dataframe.
Then I chained a .loc call to keep the employees with fewer than 2 appearances, and used those "memberID" values to filter the original df.
In summary, the solution uses three methods from pandas:
pandas.Series.value_counts: counts the number of times each value in a pandas Series is repeated.
pandas.Series.to_frame: transforms a pandas Series into a DataFrame (useful for chaining multiple operations together).
pandas.DataFrame.loc: used first to keep the employees that appear only once in the data, then a second time to filter the original df down to those employees who weren't terminated.
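The intermediate steps described in these notes can also be written out one at a time. The sketch below is only an illustration on a smaller, made-up sample (the five rows are an assumption, not the asker's data). It also shows a variant that keys on the "term" status instead of the row count, which assumes every terminated employee has a row whose status is exactly "term".

```python
import pandas as pd

# Hypothetical sample: employee 3 is terminated (two rows), the rest are active.
df = pd.DataFrame(
    {
        "memberID": [1, 2, 3, 3, 4],
        "status": ["wages", "wages", "wages", "term", "wages"],
    }
)

# Step 1: count how many rows each memberID has.
counts = df["memberID"].value_counts()

# Step 2: keep the IDs that appear fewer than 2 times.
single_ids = counts[counts < 2].index

# Step 3: filter the original frame down to those IDs.
active_by_count = df[df["memberID"].isin(single_ids)]

# Variant (assumption: terminated staff always have a "term" row):
# drop every memberID that has at least one "term" row.
terminated_ids = df.loc[df["status"] == "term", "memberID"]
active_by_status = df[~df["memberID"].isin(terminated_ids)]
```

Both frames contain employees 1, 2, and 4 here. The status-based variant may be the safer choice if an active employee could ever have more than one row.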

Related

How to Combine the Sort, Filter and IFS Formula for Assigning Unique Id

Please help me write the correct formula in Google Sheets to achieve the following goal:
I want to sort data found in specific rows within a column, and then, depending on the order they are sorted in, assign a Unique ID to that row plus the 3 rows that follow. This results in a group of 4 rows that are sorted together. The specific rows to sort are 2, 6, 10, 14, 18, 22 (every 4th row). Formula in question will be in cell B2. Please note, we are NOT sorting by the Unique IDs as they are written. Rather, we are assigning a Unique Id depending on the sorted Data.
I have attempted formulas from these tutorials, with no success:
https://infoinspired.com/google-docs/spreadsheet/query-with-importrange-in-google-sheets/
https://www.spreadsheetclass.com/google-sheets-sort-filter-functions/
https://www.youtube.com/watch?v=Qo9FbK_rnhE
https://www.ablebits.com/office-addins-blog/excel-not-equal-to-greater-than-less-than/comment-page-3/
Overall Structure:
Col A = Sort Key (this probably does not need to be addressed here)
Col B = Unique IDs
Col C = Labels
Col D = Data
Example:
Column B
ID 1
ID 1
ID 1
ID 1
ID 2
ID 2
ID 2
ID 2
ID 3
ID 3
ID 3
ID 3
Column C
(the following letters, such as "1A" refer to labels, not cells. Specifically - "Label 1A" is actually "Time 1" and "Label 2B" is actually "Event 2", but I am trying to reduce these to the minimum viable example requirements):
Label 1A
Label 1B
Label 1C
Label 1D
Label 2A
Label 2B
Label 2C
Label 2D
Label 3A
Label 3B
Label 3C
Label 3D
Column D
5 (numerical value)
string (of text)
string
string
4
string
string
string
1
string
string
string
If properly sorted, this should be the result (NOTE: I've added more specific examples): Sample Sheet
Another edit:
Please note - the Unique Identifiers are NOT unsorted. They simply cannot be assigned at all until the data is sorted, so Column B is all formulas.
based on your insufficient example...
=INDEX("ID "&SORT(ROUNDUP(SEQUENCE(COUNTA(C4:C))/4), 1, ))
update 1:
=ARRAYFORMULA({"ID "&ROUNDUP(SEQUENCE(COUNTA(C4:C))/4),
QUERY(SORT(C4:D, IF(D4:D="",, VLOOKUP(ROW(D4:D),
IF(ISNUMBER(D4:D), {ROW(D4:D), D4:D}), 2, 1)), 1),
"where Col1 is not null", )})
demo sheet
update 2:
={"id"; ArrayFormula(ARRAY_CONSTRAIN(VLOOKUP(ROW(D2:D25),
SORT({SORT(FILTER({ROW(D2:D25), D2:D25}, ISNUMBER(D2:D25)), 2, 1),
"ID "&SEQUENCE(ROWS(FILTER(D2:D25, ISNUMBER(D2:D25))))}), 3, 1),
ROWS(FILTER(D2:D25, ISNUMBER(D2:D25)))*4, 1))}

SUMIF with date range for specific column

I've been trying to find an answer for this, but haven't succeeded - I need to sum a column for a specified date range, as long as my rowname matches the reference sheet's column name.
i.e
Reference_Sheet
Date John Matt
07/01/19 1 2
07/02/19 1 2
07/03/19 2 1
07/04/19 1 1
07/05/19 3 3
07/06/19 1 2
07/07/19 1 1
07/08/19 5 9
07/09/19 9 2
Sheet1
A B
1 07/01
2 07/07
3 Week1
4 John 10
5 Matt 12
I have to work in Google Sheets. I tried SUMPRODUCT, which told me I can't multiply texts, and SUMIFS, which told me I can't have different array arguments. My failed attempts were similar to the ones below:
=SUMIFS('Reference_Sheet'!B2:AO1000,'Reference_Sheet'!A1:AO1,"=A4",'Reference_Sheet'!A2:A1000,">=B1",'Reference_Sheet'!A2:A1000,"<=B2")
=SUMPRODUCT(('Reference_Sheet'!$A$2:$AO$1000)*('Reference_Sheet'!$A$2:$A$1000>=B$1)*('Reference_Sheet'!$A$2:$A$1000<=B$2)*('Reference_Sheet'!$A$1:$AO$1=$A4))
This might work:
=sumifs(indirect("Reference_Sheet!"&address(2,match(A4,Reference_Sheet!A$1:AO$1,0))&":"&address(100,match(A4,Reference_Sheet!A$1:AO$1,0))),Reference_Sheet!A$2:A$100,">="&B$1,Reference_Sheet!A$2:A$100,"<="&B$2)
But you'll need to specify how many rows down it should go. My formula looks down to row 100.
To change the number of rows, you need to change the number in three places:
&address(100
Reference_Sheet!A$2:A$100," ... in two places
To briefly explain what is going on:
look for the person's name in row 1 using match
Use address and indirect to build the address of cells to add
and then sumIfs() based on dates.
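For readers who find the nested formula hard to follow, the same three steps can be sketched outside the spreadsheet. This is a Python illustration only, not something that runs in Google Sheets. The function name sum_for_person is hypothetical, and the dates are assumed to be month/day/year in 2019.

```python
from datetime import date

# Reference_Sheet from the question: a header row, then one row per date.
header = ["Date", "John", "Matt"]
rows = [
    (date(2019, 7, 1), 1, 2),
    (date(2019, 7, 2), 1, 2),
    (date(2019, 7, 3), 2, 1),
    (date(2019, 7, 4), 1, 1),
    (date(2019, 7, 5), 3, 3),
    (date(2019, 7, 6), 1, 2),
    (date(2019, 7, 7), 1, 1),
    (date(2019, 7, 8), 5, 9),
    (date(2019, 7, 9), 9, 2),
]


def sum_for_person(name, start, end):
    """Sum one person's column for dates between start and end, inclusive."""
    # Step 1: find the person's column from the header row (the MATCH step).
    col = header.index(name)
    # Steps 2 and 3: take that column and add up the rows whose date is in range
    # (the ADDRESS/INDIRECT range plus the two SUMIFS date criteria).
    return sum(row[col] for row in rows if start <= row[0] <= end)


week1_john = sum_for_person("John", date(2019, 7, 1), date(2019, 7, 7))
week1_matt = sum_for_person("Matt", date(2019, 7, 1), date(2019, 7, 7))
```

With the sample data this gives 10 for John and 12 for Matt, which matches the Week1 values shown in Sheet1.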
alternative:
=SUMPRODUCT(QUERY(TRANSPOSE(QUERY($A:$D,
"where A >= date '"&TEXT(F$1, "yyyy-mm-dd")&"'
and A <= date '"&TEXT(F$2, "yyyy-mm-dd")&"'", 1)),
"where Col1 = '"&$E4&"'", 0))

Making DAX code more efficient - counting unique Start dates in overlapping date ranges

I have a table of every product purchased by every client over 25 years. The table contains client#, product, start date, and end date.
The products can be owned by the client for any amount of time (1 day to 100 years). While the client owns products with us, the client is active. If a client ends all products, they cease to be a client. I want to count new client starts each year. The problem is that some clients end all products and then start purchasing products again years later (clients always retain the same client#). If a client leaves and then rejoins years later, I want to count that client as a new client.
I have created DAX code to do this which works perfectly on a small file, but the code uses up too many resources and so I cannot use it on my data (about 200,000 records). I know my code is HIGHLY INEFFICIENT and could probably be cleaned up...but I am not sure how. Alternately, if I could figure out how to make these columns in PowerQuery, perhaps that would work
Here is how I do it.
1) Add four calculated columns to my table:
VeryFirstStart = Calculate(
Min('Products'[StartDate]),
ALLEXCEPT(Products,Products[ClientNumber]))=Products[StartDate]
this flags records that contain the first ever start date of any client
MaxEndDateofEarlierDates = Calculate(
Max('Products'[EndDate]),
Filter(
Filter(ALLEXCEPT(Products, Products[ClientNumber]), Products[EndDate]),
Products[StartDate] < EARLIER(Products[StartDate])))
This step blows up my PowerBI - this shows the date of any NEW product purchases where the new start date occurs AFTER an ending date
Second+Start = And(
Products[MaxEndDateofEarlierDates]<>BLANK(),
Products[MaxEndDateofEarlierDates]<Products[StartDate])
this flags records where we want to count the new start date as a new client
NewStart = OR(Products[Second+Start],Products[VeryFirstStart])
this flags ANY new client start date, regardless of whether it was the first or a subsequent one
Finally I added this measure:
!MemberNewStarts = CALCULATE(
DISTINCTCOUNT(Products[ClientNumber]),
FILTER(
'Products',
('Products'[StartDate] <= LASTDATE('DIMDate'[Date]) &&
'Products'[StartDate]>= FIRSTDATE('DIMDate'[Date]) &&
Products[NewStart]=TRUE())))
Does anyone have any suggestions about how to achieve this with less resources?
Thanks
Here is some data to try
MemberNumber Product StartDate EndDate Note (not in real data)
1 A 02/02/2003 02/02/2004
1 C 02/02/2009 02/02/2010
2 A 02/02/2001 02/02/2002
2 C 02/02/2001 02/02/2002
2 B 02/02/2005 02/02/2010
3 C 02/02/2002 02/02/2005
3 B 02/02/2002 02/02/2005
3 A 02/02/2003 02/02/2008
4 B 02/02/2002 02/02/2003
4 C 02/02/2003 02/02/2006
5 B 02/02/2003 02/02/2007
5 C 02/02/2005 02/02/2010
5 A 02/02/2005 02/02/2007
6 A 02/02/2001 02/02/2006
6 C 02/02/2003 02/02/2007
7 B 02/02/2001 02/02/2004
7 A 02/02/2001 02/02/2005
7 C 02/02/2005 02/02/2006
8 B 02/02/2002 02/02/2006
8 A 02/02/2004 02/02/2009
note member 1 starts as a new client in 2009 since all previous products ended in 2004 and member 2 starts as a new client in 2005 since all previous products ended in 2002
The desired outcome is:
Start Year 2001 2002 2003 2004 2005 2006 2007 2008
New Clients 3 3 2 0 1 0 0 0
Here's one way of trying to solve it. Let me know if this is any more efficient than yours:
1st New Column:
PreviousHighestFinish:=
Calculate(
Max(Products[EndDate]),
ALLEXCEPT(Products,Products[ClientNumber]),
Products[StartDate] < Earlier(Products[StartDate])
)
This will give you the latest end date where the Client Number matches and the start date is before the current start date. If there is no earlier start date, it returns a blank.
2nd New Column:
NewClientProduct:=
if(Products[StartDate]>=Products[PreviousHighestFinish],1,0)
This will give you a 1 for every row where the client has either not been seen before (and the previous column showed blank) or the client has been seen before, but has no current products.
The problem with this measure is that if you have a client starting more than one product on the same date, they will show as multiple new clients.
The fix for this is to count up the instances of each client-date combination
3rd New Column:
ClientDateCount:=
CALCULATE(
COUNTROWS(Products),
ALLEXCEPT(Products,Products[ClientNumber],Products[StartDate])
)
This essentially gives the number of times that the client on this row in the table has started a product on this date.
Now divide the 2nd new column by this one
4th New Column:
NewClients:=
DIVIDE(Products[NewClientProduct],Products[ClientDateCount])
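And voilà: summing NewClients over the rows whose StartDate falls in a given year gives the number of new clients for that year.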
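To check the logic against the sample data, the rule can be sketched in plain Python (not DAX). Treat it as an illustration under two assumptions. First, the comparison is strict, so a product that starts on the same day the previous one ends does not count as a new start; this follows the question's expected table, where member 4's 2003 purchase is not counted. Second, same-day starts by one client are collapsed with a set rather than by dividing by a count. The function name new_client_starts is hypothetical.

```python
from collections import Counter
from datetime import date

# (client, start, end) rows from the question's sample; the product letter is omitted.
rows = [
    (1, date(2003, 2, 2), date(2004, 2, 2)), (1, date(2009, 2, 2), date(2010, 2, 2)),
    (2, date(2001, 2, 2), date(2002, 2, 2)), (2, date(2001, 2, 2), date(2002, 2, 2)),
    (2, date(2005, 2, 2), date(2010, 2, 2)),
    (3, date(2002, 2, 2), date(2005, 2, 2)), (3, date(2002, 2, 2), date(2005, 2, 2)),
    (3, date(2003, 2, 2), date(2008, 2, 2)),
    (4, date(2002, 2, 2), date(2003, 2, 2)), (4, date(2003, 2, 2), date(2006, 2, 2)),
    (5, date(2003, 2, 2), date(2007, 2, 2)), (5, date(2005, 2, 2), date(2010, 2, 2)),
    (5, date(2005, 2, 2), date(2007, 2, 2)),
    (6, date(2001, 2, 2), date(2006, 2, 2)), (6, date(2003, 2, 2), date(2007, 2, 2)),
    (7, date(2001, 2, 2), date(2004, 2, 2)), (7, date(2001, 2, 2), date(2005, 2, 2)),
    (7, date(2005, 2, 2), date(2006, 2, 2)),
    (8, date(2002, 2, 2), date(2006, 2, 2)), (8, date(2004, 2, 2), date(2009, 2, 2)),
]


def new_client_starts(rows):
    """Count new client starts per year (first-ever start, or a start after a gap)."""
    starts = set()
    for client, start, _ in rows:
        # Latest end date among this client's products that started earlier.
        earlier_ends = [e for c, s, e in rows if c == client and s < start]
        # New start if nothing came before, or everything earlier had already ended.
        if not earlier_ends or max(earlier_ends) < start:
            starts.add((client, start))  # the set collapses same-day starts
    return Counter(start.year for _, start in starts)


per_year = new_client_starts(rows)
```

For the sample this yields 3 new clients in 2001, 3 in 2002, 2 in 2003, and 1 in 2005, in line with the desired outcome, plus member 1's restart in 2009, which falls outside the years shown in that table. The nested scan is quadratic per client and is meant for checking small samples, not for 200,000 records.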

Sorting a pandas DataFrame by the order of a list

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
Class Number
0 Gammaproteobacteria 3
1 Bacteroidetes 5
2 Negativicutes 6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
Number
Bacteroidetes 5
Negativicutes 6
Gammaproteobacteria 3
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list. For example, if your input data at some point does not contain "Negativicutes", the script will fail. One way around this is to filter the dataframe once per class, append each filtered piece to a list, and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class']==i])
ordered_df = pd.concat(df_list)
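A different technique that also tolerates classes missing from the data, and is not taken from either answer above, is to turn the column into an ordered categorical and sort on it. The sketch below is a minimal example on a made-up three-row frame. Note that any value in the column that is not in the list becomes NaN when converted.

```python
import pandas as pd

# The desired order; "Negativicutes" is deliberately absent from the data below.
ordered_classes = ["Bacteroidetes", "Negativicutes", "Gammaproteobacteria"]

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "Class": ["Gammaproteobacteria", "Bacteroidetes", "Gammaproteobacteria"],
        "Number": [3, 5, 6],
    }
)

# An ordered categorical sorts by position in ordered_classes, not alphabetically.
df["Class"] = pd.Categorical(df["Class"], categories=ordered_classes, ordered=True)

# A stable sort keeps the original row order within each class.
ordered_df = df.sort_values("Class", kind="stable")
```

Here ordered_df lists the Bacteroidetes row first, followed by the two Gammaproteobacteria rows in their original order.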

Qlikview, how to create an expression which does not change with listbox selection?

Example, I have table below..
Week, Quantity
1, 10
1, 15
1, 10
2, 20
2, 30
3, 10
3, 50
I also have a list box for 'Week' which is current selected on week 2.
Now, I want to create text object which shows the value of sum of quantity of week 1 (ie. 35), which will always show that result even when the list box is selected on week 2. How can I achieve this?
Currently I managed to do an expression which sums week 1 but as soon as I select week 2 it shows 0 ....
Enter the following to your textfield:
= 'Sum week 1 : ' & sum({$<Week={'1'}>}Quantity)
Use the '&' to concatenate values, and use set analysis (page 799 of the QlikView Reference Manual) to select the required values.
sum({$<Week={'1'}>}Quantity)
Read this like: Sum the values of 'Quantity' Where 'Week' is 1.
Replace the '$' with '1' and the expression will ignore current selection
e.g.
sum({1<Week={'1'}>}Quantity)
do this:
sum({<Week={'1'}>}Quantity)
That is basically telling Qlikview regardless of what is selected in Qlikview, that expression will always calculate as if Week 1 is selected.
