I have seen many different questions that are similar but nothing that I can find that will work.
I am trying to calculate a "running" total for the amount of support tickets that I had on any given day prior to today. I have a current (today) total queue size, and know for each day whether I added to or removed from that queue.
For example:
Date
Created < Known
Completed < Known
Growth < Known
Total Size < Unknown
10-Jan
100
09-Jan
79
77
+2
102
08-Jan
97
92
+5
107
07-Jan
64
67
-3
104
06-Jan
70
66
-4
100
05-Jan
78
80
+2
102
04-Jan
90
82
-8
94
03-Jan
74
68
+6
100
02-Jan
83
87
-4
106
01-Jan
80
70
+10
116
10-Jan is the only known Total value. The remainder total values are being calculated.
In Excel, this would be a simple formula D3 = D2 + C3.
(Calculated column on 'Table' table)
RecursionWithoutIFAndNoFilter_AlsoThisIsWhatIcouldUnderstandFromYourPost_Sorry =
--RunningGrowth
VAR CurrentDate = 'Table'[Date]
VAR RunningGrowth = CALCULATE(SUM('Table'[Growth < Known]), REMOVEFILTERS('Table'), 'Table'[Date]>=CurrentDate)
--MAXDateInTable (I suppose this means TODAY)
--A change in level (because of SELECTEDVALUE) would mean there are more than one row with 01/10
VAR MaxDate = CALCULATE(MAX('Table'[Date]),REMOVEFILTERS('Table'))
VAR TotalSizeInMaxDate = CALCULATE(SELECTEDVALUE('Table'[Total Size < Unknown]),REMOVEFILTERS('Table'),'Table'[Date] = MaxDate)
--Result
VAR Result = TotalSizeInMaxDate + RunningGrowth
RETURN Result
I am unable to get this DAX to work. This is the correct formula, however, I am missing a key step or something cause the DAX doesn't produce the result. Where am I going wrong in my code?
Measure =
var Q3 = PERCENTILEX.INC(ALL(Analytics[Users]),Analytics[Users],.75)
var Q1 = PERCENTILEX.INC(ALL(Analytics[Users]),Analytics[Users],.25)
var iqr = Q3-Q1
var lower = Q1-iqr*1.5
var upper = Q3+iqr*1.5
return
SWITCH(TRUE(),
SELECTEDVALUE(Analytics[Users])>=upper,"high",
SELECTEDVALUE(Analytics[Users])<=lower ,"low",
BLANK()
Here is some sample data:
Day Index Users
1/1/2020 335
1/2/2020 1131
1/3/2020 1094
1/4/2020 393
1/5/2020 22
1/6/2020 1380
1/7/2020 1607
1/8/2020 1578
1/9/2020 1640
1/10/2020 1368
1/11/2020 477
1/12/2020 634
1/13/2020 1812
1/14/2020 1840
1/15/2020 1802
1/16/2020 1708
1/17/2020 1386
1/18/2020 420
1/19/2020 544
1/20/2020 1527
1/21/2020 1799
1/22/2020 1938
1/23/2020 3000
1/24/2020 1570
1/25/2020 546
1/26/2020 660
I'm spinning up on high level language for mixed integer linear programs (MILPs). The language is A Modeling Language for A Mathematical Programming Language (AMPL).
Chapter 4, page 65, Figure 4-7 shows the following syntax:
set PROD := bands coils plate ;
However, Chapter 5, page 74, shows the following syntax:
set PROD = {"bands", "coils", "plate"};
Can anyone please explain this difference in syntax?
I put the latter into a *.dat file, and AMPL complains expected ; ( : or symbol where the { is. Wondering if it is just a mistake in the manual.
Thanks.
The syntax in Chapter 4 --
set PROD := bands coils plate;
-- is used in data files, while the syntax in Chapter 5 --
set PROD = {"bands", "coils", "plate"};
-- is used in model files. It's a little weird (IMO) that the syntax for sets is different in model and data files, but it is. For another example of this difference, see this question and answer.
Complete working example code modified from AMPL manual
Added by the original poster of the question.
dietu.mod:
# dietu.mod
#----------
# set MINREQ; # nutrients with minimum requirements
# set MAXREQ; # nutrients with maximum requirements
set MINREQ = {"A", "B1", "B2", "C", "CAL"};
set MAXREQ = {"A", "NA", "CAL"};
set NUTR = MINREQ union MAXREQ; # nutrients
set FOOD; # foods
param cost {FOOD} > 0;
param f_min {FOOD} >= 0;
param f_max {j in FOOD} >= f_min[j];
param n_min {MINREQ} >= 0;
param n_max {MAXREQ} >= 0;
param amt {NUTR,FOOD} >= 0;
var Buy {j in FOOD} >= f_min[j], <= f_max[j];
minimize Total_Cost: sum {j in FOOD} cost[j] * Buy[j];
subject to Diet_Min {i in MINREQ}:
sum {j in FOOD} amt[i,j] * Buy[j] >= n_min[i];
subject to Diet_Max {i in MAXREQ}:
sum {j in FOOD} amt[i,j] * Buy[j] <= n_max[i];
The explicit definitions of setes MINREQ and MAXREQ and their members is taken from the *.dat file below (where their definitions have been commented out). Matlab users, observe above & beware that you need commas between members in a set.
dietu.dat:
# dietu.dat
#----------
data;
# set MINREQ := A B1 B2 C CAL ;
# set MAXREQ := A NA CAL ;
set FOOD := BEEF CHK FISH HAM MCH MTL SPG TUR ;
param: cost f_min f_max :=
BEEF 3.19 2 10
CHK 2.59 2 10
FISH 2.29 2 10
HAM 2.89 2 10
MCH 1.89 2 10
MTL 1.99 2 10
SPG 1.99 2 10
TUR 2.49 2 10 ;
param: n_min n_max :=
A 700 20000
C 700 .
B1 0 .
B2 0 .
NA . 50000
CAL 16000 24000 ;
param amt (tr): A C B1 B2 NA CAL :=
BEEF 60 20 10 15 938 295
CHK 8 0 20 20 2180 770
FISH 8 10 15 10 945 440
HAM 40 40 35 10 278 430
MCH 15 35 15 15 1182 315
MTL 70 30 15 15 896 400
SPG 25 50 25 15 1329 370
TUR 60 20 15 10 1397 450 ;
Solve the model using the following at the AMPL prompt:
reset data;
reset;
model dietu.mod;
data dietu.dat;
solve;
Here is my input data with four columns with space as the delimiter. I want to add the second and third column and print the result
sachin 200 10 2
sachin 900 20 2
sachin 500 30 3
Raju 400 40 4
Mike 100 50 5
Raju 50 60 6
My code is in the mid way
from pyspark import SparkContext
sc = SparkContext()
def getLineInfo(lines):
spLine = lines.split(' ')
name = str(spLine[0])
cash = int(spLine[1])
cash2 = int(spLine[2])
cash3 = int(spLine[3])
return (name,cash,cash2)
myFile = sc.textFile("D:\PYSK\cash.txt")
rdd = myFile.map(getLineInfo)
print rdd.collect()
From here I got the result as
[('sachin', 200, 10), ('sachin', 900, 20), ('sachin', 500, 30), ('Raju', 400, 40
), ('Mike', 100, 50), ('Raju', 50, 60)]
Now the final result I need is as below, adding the 2nd and 3rd column and display the remaining fields
sachin 210 2
sachin 920 2
sachin 530 3
Raju 440 4
Mike 150 5
Raju 110 6
Use this:
def getLineInfo(lines):
spLine = lines.split(' ')
name = str(spLine[0])
cash = int(spLine[1])
cash2 = int(spLine[2])
cash3 = int(spLine[3])
return (name, cash + cash2, cash3)
I am trying to shift the Pandas dataframe column data by group of first index. Here is the demo code:
In [8]: df = mul_df(5,4,3)
In [9]: df
Out[9]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 -0.5505 0.7445 -0.3645
B001 0.9129 -1.0473 -0.5478
B002 0.8016 0.0292 0.9002
B003 2.0744 -0.2942 -0.7117
A0001 B000 0.7064 0.9636 0.2805
B001 0.4763 0.2741 -1.2437
B002 1.1563 0.0525 -0.7603
B003 -0.4334 0.2510 -0.0105
A0002 B000 -0.6443 0.1723 0.2657
B001 1.0719 0.0538 -0.0641
B002 0.6787 -0.3386 0.6757
B003 -0.3940 -1.2927 0.3892
A0003 B000 -0.5862 -0.6320 0.6196
B001 -0.1129 -0.9774 0.7112
B002 0.6303 -1.2849 -0.4777
B003 0.5046 -0.4717 -0.2133
A0004 B000 1.6420 -0.9441 1.7167
B001 0.1487 0.1239 0.6848
B002 0.6139 -1.9085 -1.9508
B003 0.3408 -1.3891 0.6739
In [10]: grp = df.groupby(level=df.index.names[0])
In [11]: grp.shift(1)
Out[11]:
COL000 COL001 COL002
STK_ID RPT_Date
A0000 B000 NaN NaN NaN
B001 -0.5505 0.7445 -0.3645
B002 0.9129 -1.0473 -0.5478
B003 0.8016 0.0292 0.9002
A0001 B000 NaN NaN NaN
B001 0.7064 0.9636 0.2805
B002 0.4763 0.2741 -1.2437
B003 1.1563 0.0525 -0.7603
A0002 B000 NaN NaN NaN
B001 -0.6443 0.1723 0.2657
B002 1.0719 0.0538 -0.0641
B003 0.6787 -0.3386 0.6757
A0003 B000 NaN NaN NaN
B001 -0.5862 -0.6320 0.6196
B002 -0.1129 -0.9774 0.7112
B003 0.6303 -1.2849 -0.4777
A0004 B000 NaN NaN NaN
B001 1.6420 -0.9441 1.7167
B002 0.1487 0.1239 0.6848
B003 0.6139 -1.9085 -1.9508
The mul_df() code is attached here : How to speed up Pandas multilevel dataframe sum?
Now I want to grp.shift(1) for a big dataframe.
In [1]: df = mul_df(5000,30,400)
In [2]: grp = df.groupby(level=df.index.names[0])
In [3]: timeit grp.shift(1)
1 loops, best of 3: 5.23 s per loop
5.23s is too slow. How to speed it up ?
(My computer configuration is: Pentium Dual-Core T4200#2.00GHZ, 3.00GB RAM, WindowXP, Python 2.7.4, Numpy 1.7.1, Pandas 0.11.0, numexpr 2.0.1 , Anaconda 1.5.0 (32-bit))
How about shift the total DataFrame object and then set the first row of every group to NaN?
dfs = df.shift(1)
dfs.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
the problem is that the shift operation is not cython optimized, so it involves callback to python. Compare this with:
In [84]: %timeit grp.shift(1)
1 loops, best of 3: 1.77 s per loop
In [85]: %timeit grp.sum()
1 loops, best of 3: 202 ms per loop
added an issue for this: https://github.com/pydata/pandas/issues/4095
similar question and added answer with that works for shift in either direction and magnitude: pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
Code (including test setup) is:
#
# the function to use in apply
#
def replace_shift_overlap(grp,col,N,value):
if (N > 0):
grp[col][:N] = value
else:
grp[col][N:] = value
return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
#
# the apply
#
df = df.groupby(level=0).apply(replace_shift_overlap,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)
EDIT: Note that the initial sort really eats into the effectiveness of this. So in some cases the original answer is more effective.
try this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 15, 30, 45,43,67,22,12,14,54],
'B': [13, 23, 18, 33, 48, 1,7, 56,66,45,32],
'C': [17, 27, 22, 37, 52,77,34,21,22,90,8],
'D': ['a','a','a','a','b','b','b','c','c','c','c']
})
df
#> A B C D
#> 0 10 13 17 a
#> 1 20 23 27 a
#> 2 15 18 22 a
#> 3 30 33 37 a
#> 4 45 48 52 b
#> 5 43 1 77 b
#> 6 67 7 34 b
#> 7 22 56 21 c
#> 8 12 66 22 c
#> 9 14 45 90 c
#> 10 54 32 8 c
def groupby_shift(df, col, groupcol, shift_n, fill_na = np.nan):
'''df: dataframe
col: column need to be shifted
groupcol: group variable
shift_n: how much need to shift
fill_na: how to fill nan value, default is np.nan
'''
rowno = list(df.groupby(groupcol).size().cumsum())
lagged_col = df[col].shift(shift_n)
na_rows = [i for i in range(shift_n)]
for i in rowno:
if i == rowno[len(rowno)-1]:
continue
else:
new = [i + j for j in range(shift_n)]
na_rows.extend(new)
na_rows = list(set(na_rows))
na_rows = [i for i in na_rows if i <= len(lagged_col) - 1]
lagged_col.iloc[na_rows] = fill_na
return lagged_col
df['A_lag_1'] = groupby_shift(df, 'A', 'D', 1)
df
#> A B C D A_lag_1
#> 0 10 13 17 a NaN
#> 1 20 23 27 a 10.0
#> 2 15 18 22 a 20.0
#> 3 30 33 37 a 15.0
#> 4 45 48 52 b NaN
#> 5 43 1 77 b 45.0
#> 6 67 7 34 b 43.0
#> 7 22 56 21 c NaN
#> 8 12 66 22 c 22.0
#> 9 14 45 90 c 12.0
#> 10 54 32 8 c 14.0