QuickSight add subset of fields - amazon-quicksight

Total AWS QuickSight newbie here. I'm trying to import some cost data in CSV form into QuickSight and add some calculated fields.
The data I have is of the form:
Type | Units Consumed
A    | 2
B    | 3
A    | 1
B    | 5
... and so on
Unit Cost ($) is not part of the dataset and is something like:
Unit Cost     | Amount ($)
Unit Cost (A) | 1
Unit Cost (B) | 2
I would like to compute (either as part of the dataset or as part of an analysis visual, maybe) the total costs for A and B as separate line items. Something like
Total Cost (A) = Sum(Units Consumed where Type = A) * Unit Cost (A)
Total Cost (B) = Sum(Units Consumed where Type = B) * Unit Cost (B)
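For reference, here is the intended arithmetic as a quick pandas sketch (the DataFrame and names are illustrative, not QuickSight syntax):
import pandas as pd

# Illustrative data in the shape described above
usage = pd.DataFrame({"Type": ["A", "B", "A", "B"],
                      "Units Consumed": [2, 3, 1, 5]})
unit_cost = {"A": 1, "B": 2}  # the separate Unit Cost lookup

# Total cost per type: sum of units consumed, scaled by the unit cost
units_per_type = usage.groupby("Type")["Units Consumed"].sum()
total_cost = {t: units_per_type[t] * c for t, c in unit_cost.items()}
print(total_cost)  # {'A': 3, 'B': 16}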
Here are the things I've tried which don't work:
sumOver({Units Consumed}, Type='A')
sumIf({Units Consumed}, Type='A')
To break it down and test smaller parts, I added a calculated field which simply does
sum({Units Consumed})
But it just adds a column to the dataset with every value shown as "Undefined".
How can I achieve what I'm trying to do?

I tried to replicate the code
sumIf({Units Consumed}, Type='A')
and it worked. Could you check whether Units Consumed is an integer column type?
How to change column type

Related

In Visual FoxPro, how does one incorporate a SUM REST command into a SCAN loop?

I am trying to complete a mortality table using loops in Visual FoxPro. I have run into one difficulty: one of the math operations involves summing all the data in a column for the remaining rows, and this needs to be incorporated into a loop. The strategy I thought would work, nesting a SUM REST command inside a SCAN REST loop, was not successful, and I haven't found a good alternative approach.
In FoxPro, I can successfully use the SCAN function as follows, say:
GO 1
REPLACE survivors WITH 1000000
SCATTER NAME oprev
SKIP
SCAN REST
    REPLACE survivors WITH (1 - oprev.prob) * oprev.survivors
    SCATTER NAME oprev
ENDSCAN
(to take the mortality rates in a table and use it to compute number of survivors at each age)
Or, say:
REPLACE Yearslived WITH 0
SCATTER NAME oprev1
SKIP
SCAN REST
    REPLACE Yearslived WITH (oprev1.survivors + survivors) * 0.5
    SCATTER NAME oprev1
ENDSCAN
In order to complete a mortality table I want to use the Yearslived and survivors data (which were produced using the SCANs above) to get life expectancy data as follows. Say we have the simplified table:
SURVIVORS   YEARSLIVED   LIFEEXP
100         0            ?
80          90           ?
60          70           ?
40          50           ?
20          30           ?
0           10           ?
Then each LIFEEXP record should be the sum of the remaining YEARSLIVED records divided by the corresponding SURVIVORS record, i.e.:
LIFEEXP (1) = (90+70+50+30+10)/100
LIFEEXP (2) = (70+50+30+10)/80
...and so on.
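To pin down the arithmetic, here is the same computation as a small Python sketch (illustrative only; the question itself is about doing this in FoxPro):
survivors  = [100, 80, 60, 40, 20, 0]
yearslived = [0, 90, 70, 50, 30, 10]

# LIFEEXP(i) = sum of YEARSLIVED for all rows after row i, divided by SURVIVORS(i)
lifeexp = []
for i, s in enumerate(survivors):
    remaining = sum(yearslived[i + 1:])
    lifeexp.append(remaining / s if s else 0)
print(lifeexp)  # [2.5, 2.0, 1.5, 1.0, 0.5, 0]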
I attempted to do this with a similar SCAN approach - see below:
Go 1
SCATTER NAME Oprev2
SCAN rest
replace lifeexp WITH ((SUM yearslived Rest) - oprev2.yearslived) / oprev2.survivors
SCATTER NAME oprev2
ENDSCAN
But here I get the error message "Function name is missing)." Help tells me this is probably because the function contains too many arguments.
So I then also tried to break things down and first use SCAN just to get all of my SUM REST data, as follows:
SCAN rest
SUM yearslived REST
ENDSCAN
... in the hope that I could get this data, define it as a variable, and create a simpler SCAN function above. However, I seem to be doing something wrong here as well, as instead of getting all necessary sums (first the sum of rows 2 to end, then 3 to end, etc.), I only get one sum, of all the yearslived data. In other words, using the sample data, I am given just 250, instead of the list 250, 160, 90, 40, 10.
What am I doing wrong? And more generally, how can I create a loop in Foxpro that includes a function where you Sum up all remaining data in a specific column over and over again (first 2nd through last record, then 3rd through last record, and so on)?
Any help will be much appreciated!
TM
Well, you are really hiding the important details: your table's structure, sample data, and desired output. Without those, this is mostly guesswork, though guesswork with a good chance of being right.
You seem to be trying to do something like this:
Create Cursor Mortality (Survivors i, YearsLived i, LifeExp b)
Local ix, Survivors
For ix=100 To 0 Step -20
    Insert Into Mortality (Survivors, YearsLived) Values (m.ix, 0)
Endfor
Locate
Survivors = Mortality.Survivors
Skip
Scan Rest
    Replace YearsLived With (m.Survivors + Mortality.Survivors) * 0.5
    Survivors = Mortality.Survivors
Endscan
*** Here is the part that deals with your sum problem
Local nRecNo, nSum
Scan
    * Save the current record number
    nRecNo = Recno()
    Skip
    * Sum REST after skipping to the next row
    Sum YearsLived Rest To nSum
    * Position back to the row where we started
    Go m.nRecNo
    * Do the replacement
    Replace LifeExp With Iif(Survivors=0, 0, m.nSum/Survivors)
    * ENDSCAN implicitly moves to the next record
Endscan
* We are done. Go to the first record and browse
Locate
Browse
While there are N ways to do this in VFP, this is one xBase approach, and it is relatively simple to understand IMHO.
Where did you go wrong?
Well, you tried to use SUM as if it were a function, but it is a command. There is a SUM() aggregate function in SQL, but here you are using the xBase SUM command.
EDIT: And BTW in this code:
SCAN rest
SUM yearslived REST
ENDSCAN
What you are doing is starting a SCAN with a scope of REST; inside the loop you are using another scoped command:
SUM yearslived REST
This effectively does the summing over the REST of the records and leaves the record pointer at the bottom. ENDSCAN then advances it to EOF(), so the loop body only runs for the first record.

How do I add noise/variability to a dataset in Python, given the CV?

Given a dataset of blood results, say cholesterol level, and knowing that the instrument that produced those results is subject to a known degree of variability, how would I add that variability back into the dataset? i.e. I want to assume the result in the original dataset is the true/mean value, and then produce new results that are subject to the known variability of the instrument.
In Excel you use =NORM.INV(RAND(), mean, std_dev), where RAND() provides a random value between 0 and 1, "mean" is the original value, and since I have the CV I can calculate the SD. NORM.INV then provides the inverse of the cumulative normal distribution function.
I've done the following to create a new column with my new values, but I would like to know if it is valid (i.e., will each row have a different random number between 0 and 1 as the probability, and is this formula equivalent to NORM.INV?):
df8000['HDL_1'] = norm.ppf(random(), loc = df8000['HDL_0'], scale = TAE_df.loc[0,'HDL'])
Thanks in advance!
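For what it's worth, here is a minimal numpy/scipy sketch of one way to do this, assuming the df8000 DataFrame and HDL column names from the question and an illustrative CV. Note that scipy's norm.ppf is the equivalent of Excel's NORM.INV, but a single random() call yields one number, so each row must get its own random probability:
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng()
cv = 0.05                   # illustrative CV (5%); use the instrument's known CV
mean = df8000['HDL_0']      # treat the original values as the true means
sd = cv * mean              # SD derived from the CV (or use a fixed SD as in the question)

# Equivalent of =NORM.INV(RAND(), mean, sd), with a distinct probability per row
p = rng.random(len(df8000))
df8000['HDL_1'] = norm.ppf(p, loc=mean, scale=sd)

# Simpler and statistically equivalent: sample the normal distribution directly
df8000['HDL_1'] = rng.normal(loc=mean, scale=sd)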

Tableau - Filter Dimension on Max Value of Measure

I'm trying to calculate the MAX of measure E for column D, resetting for every value of column A. Then I just need to show the row with the MAX for each A.
Raw data looks like this: [screenshot omitted]
The desired view would be like this: [screenshot omitted]
Thanks!
This is an excellent example of a question which can/should be answered using LOD Calculations.
To start, find the maximum [Percentage] (Column E) per [Name] (Column B):
{ FIXED [Name] : MAX([Percentage]) }
Then insert that calculation, call it "Max Per Name", into another calculation which will only show the [Letter] (Column D) if the [Percentage] and the [Max Per Name] match:
IF [Percentage] = [Max Per Name] THEN [Letter] END
The result will be a dimension which shows the [Letter] which matches the largest [Percentage] and 'Null' when not matched.
From here, just filter out 'Null' - the result will be exactly as you desire. You will even be able to take [Percentage] off the view and still see an accurate [Letter].
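If it helps to see the same logic outside Tableau, here is a rough pandas equivalent (column names are taken from the answer and the data values are illustrative):
import pandas as pd

df = pd.DataFrame({"Name": ["x", "x", "y", "y"],
                   "Letter": ["a", "b", "c", "d"],
                   "Percentage": [0.2, 0.8, 0.9, 0.4]})

# {FIXED [Name] : MAX([Percentage])} -> the max percentage per name
df["Max Per Name"] = df.groupby("Name")["Percentage"].transform("max")

# IF [Percentage] = [Max Per Name] THEN [Letter] END, then filter out the nulls
result = df[df["Percentage"] == df["Max Per Name"]][["Name", "Letter", "Percentage"]]
print(result)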
Please note that while LOD Calculations are extremely powerful, they are a little more complex than normal calculations and take a little more work to maintain and operate. Please take some time to read through this LOD Overview document to further familiarize yourself with their uses and limitations.

Poor h2o GBM Classification Performance in a balanced binomial response

In a fairly balanced binomial classification problem, I am observing an unusual level of error in h2o.gbm classification when determining class 0, on the training set itself. It is from a competition which is now over, so the interest is only in understanding what is going wrong.
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
          0        1        Error      Rate
0         147857   234035   0.612830   =234035/381892
1         44782    271661   0.141517   =44782/316443
Totals    192639   505696   0.399260   =278817/698335
Any expert suggestions on how to treat the data and reduce the error are welcome.
The following approaches were tried; the error did not decrease.
Approach 1: Selecting the top 5 important variables via h2o.varimp(gbm)
Approach 2: Converting negative normalized variable values to 0 and positive values to 1.
#Data Definition
# Variable Definition
#Independent Variables
# ID Unique ID for each observation
# Timestamp Unique value representing one day
# Stock_ID Unique ID representing one stock
# Volume Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range Normalized values of true range for given stock ID
# Average_True_Range Normalized values of average true range for given stock ID
# Positive_Directional_Movement Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement Normalized values of negative directional movement for given stock ID
#Dependent Response Variable
# Outcome Binary outcome variable representing whether price for one particular stock at the tomorrow’s market close is higher(1) or lower(0) compared to the price at today’s market close
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)
summary(train)
#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create Empty Response SalePrice
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)
rm(train,test)
summary(combi.imp)
#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)
summary(combi.imp)
#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]
library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
h2o.shutdown(prompt=F)
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")
train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])
#Models
gbmF_model_1 <- h2o.gbm(x = features,
                        y = y,
                        training_frame = train.hex,
                        seed = 1234)
h2o.performance(gbmF_model_1)
You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search on GBM using the h2o.grid() function. Here is an H2O R code example you can follow.
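The question's code is in R, but for the record, a random grid search over GBM hyperparameters looks roughly like this in the H2O Python API (the hyperparameter ranges are illustrative, and features/train_hex are assumed to be the feature list and training frame from the question):
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

# Illustrative hyperparameter ranges to sample from
hyper_params = {"max_depth": [3, 5, 7, 9],
                "learn_rate": [0.01, 0.05, 0.1],
                "ntrees": [100, 300, 500],
                "sample_rate": [0.7, 0.8, 1.0],
                "col_sample_rate": [0.7, 0.8, 1.0]}
search_criteria = {"strategy": "RandomDiscrete", "max_models": 30, "seed": 1234}

grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                     hyper_params=hyper_params,
                     search_criteria=search_criteria)
grid.train(x=features, y="Outcome", training_frame=train_hex)

# Pick the best model by AUC
best_gbm = grid.get_grid(sort_by="auc", decreasing=True).models[0]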

calculate standard deviation of daily data within a year

I have a question,
In Matlab, I have a vector of 20 years of daily data (X) and a vector of the relevant dates (DATES). In order to find the mean value of the daily data per year, I use the following script:
A = fints(DATES,X); %convert to financial time series
B = toannual(A,'CalcMethod', 'SimpAvg'); %calculate average value per year
C = fts2mat(B); %Convert fts object to vector
C is a vector of 20 values, showing the average value of the daily data for each of the 20 years. So far, so good. Now I am trying to do the same thing, but instead of calculating mean values annually I need to calculate the std annually, and it seems there is no such option with the function "toannual".
Any ideas on how to do this?
THANK YOU IN ADVANCE
I'm assuming that X is the financial information and it is an even distribution across each year. You'll have to modify this if that isn't the case. Just to clarify, by even distribution, I mean that if there are 20 years and X has 200 values, each year has 10 values to it.
You should be able to do something like this:
num_years = length(C);
span_size = length(X)/num_years;
for n = 0:num_years-1
    std_dev(n+1,1) = std(X(1+(n*span_size):(n+1)*span_size));
end
The idea is that you simply pass the data for the given year (the day-to-day values) into Matlab's standard deviation function. That will return the std dev for that year. std_dev should be a column vector that correlates 1:1 with your C vector of yearly averages.
Alternatively, if the values are not evenly distributed across years, you can group by the dates themselves:
unique_Dates = unique(DATES); %This should return a vector of 20 elements since you have 20 years.
std_dev = zeros(size(unique_Dates)); %Just pre-allocating the standard deviation vector.
for n = 1:length(unique_Dates)
    std_dev(n) = std(X(DATES==unique_Dates(n)));
end
Now this is assuming that your DATES matrix is passable to the unique function and that it will return the expected list of dates. If you have the dates in a numeric form I know this will work, I'm just concerned about the dates being in a string form.
In the event they are in a string form you can look at using regexp to parse the information and replace matching dates with a numeric identifier and use the above code. Or you can take the basic theory behind this and adapt it to what works best for you!
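For comparison, the same per-year statistics are a couple of lines in pandas (illustrative; assumes the dates parse to a DatetimeIndex):
import pandas as pd
import numpy as np

# Illustrative 20 years of daily values indexed by date
dates = pd.date_range("2000-01-01", "2019-12-31", freq="D")
x = pd.Series(np.random.randn(len(dates)), index=dates)

annual_mean = x.groupby(x.index.year).mean()  # equivalent of toannual(..., 'SimpAvg')
annual_std = x.groupby(x.index.year).std()    # the annual std that toannual lacks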
