The column of the csv file in google automl tables is recognised as text or categorical instead of numeric as i would like - google-cloud-automl

I tried to train a model using google automl tables but i have the following problem
The csv file is correctly imported, it has 2 columns and about 1870 rows, all numeric.
The system recognises only 1 column as numeric but not the other.
The column, where the problem is, has 5 digits in each row separated with space.
Is there anything i should do in order for the system to properly recognise the data as numeric?
Thanks in advance for your help

The issue is with the Data type Numeric definition, the number needs to be comparable (greater than, smaller than, equal).
Two different list of numbers are not comparable, for example 2 4 7 is not comparable to 1 5 7. To solve this, without using strings and therefore losing the "information" of those numbers, you have several options.
For example:
Create an array of numbers, by inserting [ ] in the limits of the second entrance. Take into consideration the Array Data type relative weighted approach in AutoMl tables as it may affect the "information" extracted from the sequence.
Create additional columns for every entry of the second column so each one is a single number and hence truly numeric.
I would personally go for the second option.
If you are afraid of losing "information" by splitting the numbers take into consideration that after training, the model should deduce by itself the importance of the position and other "information" those number sequences might contain (mean, norm/modulus,relative increase,...) provided the training data is representative.

Related

Create a Dynamic Array formula (Excel) to combine multiple results columns into one column that is filtered & sorted using multiple criteria?

The sample data in the image below is collected from a round robin tournament.
There is a Round column,Home team & Away team columns listing who is playing who. A team could be either Home or Away.
For each match in a round (including any "Bye" match) the number of games won for the Home and Away team are recorded in separate columns respectively.
"Ff" = forfeit and has a value of 0. "Bye" result is left blank (at this stage).
Output columns are "Won, Lost, Round".
Required output (shown in the image) is, for any selected team, the top n most-games-won matches (from both Home & Away) sorted in descending order and then the corresponding games lost but sorted in ascending order where the games won are equal. Finally show the rounds where those scores occurred.
These are the challenges I've faced in going from data to output in one step using dynamic array formula:
Collating/Combining the the Win results into 1 column. Likewise the Losses.
Getting the array to ignore blanks or convert "Ff" to 0 without getting #NUM or #VALUE errors.
Ensuring that if I used separate single column arrays the corresponding Loss and Round matched the Win result
Although "Round, Won, Lost" would be acceptable. But I wasn't able to get the Dynamic Array capability to give the required output with this order.
SUMPRODUCT, INDEX(MATCH), SORT(FILTER) functions all hint at a possible one step formula solution.
The solutions are numerous for sorting & filtering where the existing values are already in one column. There was one solution that dealt with 2 columns of values which was somewhat useful How to get the highest values from 2 columns in excel - Stackoverflow 2013
Many other responses are around the use of concatenation, combining/merging array sets, aggregation etc.
My work around solution is to use a Helper Sheet to combine the Wins from the separate results columns and convert blanks & "Ff" to -1. Likewise for Losses. Using the formula for each line
=IF($C5=L$2,IF($F5="",-1,IF($F5="Ff",0,$F5)),IF($D5=L$2,IF($G5="",-1,IF($G5="Ff",0,$G5)),-1))
Example Helper Sheet
To get the final output the Dynamic Array formula was used on the Helper Sheet data
=SORT(FILTER(L$26:N$40,L$26:L$40>=LARGE(L$26:L$40,$J$3),""),{1,2},{-1,1},FALSE)
I'm trying to avoid using pivottable, VBA solutions. Powerquery possible but not preferred.
Apologies for the screenshots but I couldn't work out how to attach the sample spreadsheet file. (Unfortunately Stackoverflow Help didn't help me to/not to do this.)
Based on the comments I changed my answer with a different approach:
=LET(data,A5:F19,
round,INDEX(data,,1),
ha,CHOOSECOLS(data,3,4),
HAwonR,CHOOSECOLS(data,5,6,1),
w,BYROW(ha,LAMBDA(h,IFERROR(XMATCH(L2,h),0))),
clm,CHOOSE(w,{1,2},{2,1}),
srtwon,DROP(REDUCE(0,SEQUENCE(ROWS(data)),LAMBDA(y,z,VSTACK(y,INDEX(HAwonR,z,HSTACK(INDEX(clm,z,),3))))),1),
res,FILTER(srtwon,w),
TAKE(SORT(res,{1,2},{-1,1}),J3))
Old answer:
=LET(data,A5:F19,
round,INDEX(data,,1),
home,INDEX(data,,3),
away,INDEX(data,,4),
HAwonR,CHOOSECOLS(data,5,6,1),
w,MAP(home,away,LAMBDA(h,a,OR(h=L2,a=L2))),
won,FILTER(HAwonR,w),
TAKE(SORT(won,{1,2},{-1,1}),J3))
In your example you selected round 3 for the third result, but that wasn't won, so I guess that was by mistake.
As you can see making use of LET avoids helpers. Let allows you to create names (helpers) that are stored and because you can name them, you can make complex formulas be more readable.
Basically what it does is filter the columns Home, Away and Round (in that order) for either Home or Away equal the team in cell L2. That's sorted column 1 descending and column 2 ascending. Than the number of rows mentioned in cell J3 are displayed from that sorted array.
Here is my solution based on the excellent contribution by #P.b. Thank you much appreciated.
The wins (likewise losses) required mapping the presence, of the team in question, as hT (home team) to the games it won (hG) and adding to that a 2nd mapping of the games it won (aG) when it was the away team (aT). Essentially what was being done on the Helper Sheet. Result was a 1 column array for game wins and a 1 column array for game losses.
In the process I was able to convert the "Ff" text to 0. I attempted without the conversion and it threw an error.
Instead of CHOOSECOLS used HSTACK to create the new array (wins, losses & round) for the FILTER, SORT, TAKE to work on.
If it could be made conciser(?) that is the next challenge. Overall (not just my solution), this exercise has provided greater flexibility and solved the problems stated. I'm happy!
=LET(data,A5:G19,
round,INDEX(data,,1),
hT,INDEX(data,,3),
aT,INDEX(data,,4),
hG,INDEX(data,,6),
aG,INDEX(data,,7),
wins,MAP(hG,
MAP(hT,LAMBDA(h,h=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))) +
MAP(aG,
MAP(aT,LAMBDA(a,a=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))),
losses,MAP(aG,
MAP(hT,LAMBDA(h,h=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))) +
MAP(hG,
MAP(aT,LAMBDA(a,a=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))),
HAwonR,HSTACK(wins,losses,round),
w,MAP(home,away,LAMBDA(h,a,OR(h=L2,a=L2))),
won,FILTER(HAwonR,w),
TAKE(SORT(won,{1,2},{-1,1}),J3))

SPSS: generate 'fake' survey data using rv.uniform without losing value labels

I have a pretty straightforward survey dataset. Each row is a respondent, and each column is a question. Responses have a value that is a whole number, and each number has a label.
Now, I need to replace all of those values with fake data to use in a training. I need something that looks and feels like the original dataset, but isn't actually client data.
I started by replacing my variables with random number values:
COMPUTE Q1=RV.UNIFORM(1,2).
EXECUTE.
COMPUTE Q2=RV.UNIFORM(1,36).
EXECUTE.
COMPUTE Q3=RV.NORMAL(50, 13).
EXECUTE.
(rv.normal/rv.uniform depending on what kind of data I'm trying to fake - age versus multiple-choice question, for example).
This works, but then when I try and generate crosstabs, export the dataset w value labels, etc., the labels aren't applied to the columns with fake data. As far as I can tell, my fake numbers are in the exact same format they were in before - numeric, no decimals, width of 2, nominal. The labels still appear in the variable view, but they aren't actually being applied.
I'd really prefer not to have to manually re-label every one of these columns, because there's quite a few of them. Any ideas for how to get around this issue? Or is there a smarter way to generate fake data?
Your problem is the RV.UNIFORM and the RV.NORMAL functions do not generate integers - they generate decimal numbers. You may have your display hide the decimal numbers by having 0 decimals in the variable view, but they are still there (you can check this by adding decimals in the variable view).
So you neen another step of turning your decimals into integers. For example, the following are two ways to get a random 1 or 2 (integers):
COMPUTE Q1=rnd(RV.UNIFORM(1,2)).
or
COMPUTE Q1=trunc(RV.UNIFORM(1,3)).
Once the numbers generated are integers corresponding to the value labels definition, you should be able to see the labels in the output.

Tableau - Calculated fields / grouping / Custom Dim

Tableau:
This may seem simple, but I ran out of the usual tricks I've used in other systems.
I want a variance column. Essentially adding a member 'Variance' to the Act/Plan dimension which only contains the members 'Actual' and 'Plan'
I've come in where the data structure and reporting is set up like so:
Actual | Plan
Profit measure
measure 2
measure 3
etc
The goal is to have a Variance column (calculated and not part of the Actual/Plan dimension)
Actual | Plan | Variance
Profit measure
measure 2
measure 3
etc
There are solutions where it works for one measure only, and I've looked into that.
ie, create calculated field as such
Profit_Actual | Profit_Plan | Variance
You put this on the columns, and you get a grid that I want... except a grid with only 1 measure.
This does not work if I want to run several measures on rows. Essentially the solution above will only display the Profit measure, not Measure 1_Actual , Measure 2_Plan etc.
So I tried a trick where I grouped a the 3 calculated measures, ie Profit_Actual | Profit_Plan | Profit_Variance as 'Profit_Measure'
Created a parameter list - 'Actual', 'Plan', 'Variance'
Now I can half achieve my goal, by having the parameter on columns and the 'Profit Measure' on Rows (so I can have Measure 123_group etc down on rows too). Trouble is, I found that parameters are single select only. Only if it can display all options in the custom paramater at once, I would've solved my problem.
Any ideas on how I can achieve the Variance column I want?
Virtually adding a member to a dimension/Calculated fieds/tricks/workaround
Thank you
Any leads is appreciated
Gemmo
Okay. First thing, I had a really hard time trying to understand how your data is organized, try to be more clear (say how each entry in your database looks like, and not how a specific view in Tableau looks like).
But I think I got it. I guess you have a collection of entries, and each entry has a number of measure fields (profits and etc.) and an Act/Plan field, to identify whether that entry is an actual value or a planned value. Is that correct?
Well, if that's the case, I'm sorry to say you have to calculate a variance field for each dimension. Think about it, how your original dataset is structured. Do you think you can add a single field "Variance" to represent the variance of each measure? Well, you can, store the values in a string, and then collect it back using some string functions, but it's not very practical. The problem is that each entry have many measures, if it had only 1 measure, than 1 single variance field would suffice.
So, if you can re-organize your data, what would be an easier to work set (but with many more entries) is something with the fields: Measure, Value, Actual/Plan. The measure field would have a string to identify what you're measuring in that entry. Value would be a number to represent the actual measure. And the Actual/Plan is the same. For instance:
Measure Value Actual/Plan
Profit 100 Actual
So, each line in your current model would become n entries, where n is the number of measures you have right now. So a larger dataset in a way, but easier to work with. Think about, now you can have a calculated field, and use some table calculations to calculate the variance only for that measure and/or Actual/Plan. Just use WINDOW_VAR, and put Measure and/or Actual/Plan in the partition.
Table calculations are awesome, take a look at this to understand it better. http://onlinehelp.tableausoftware.com/current/pro/online/en-us/help.htm#calculations_tablecalculations_understanding_addressing.html
I generally like to have my data staged such that Actual is its own column and Plan is its own column in the data being fed to Tableau. It makes calculations so much easier.
If your data is such that there is a column called "Actual/Plan" and every row is populated with either "Actual" or "Plan" and there is another column called "Value" or "Measure" that is populated with the values, you can force Tableau to make them columns assuming you can't or won't rearrange your data.
Create a calculated field called "Actual" with the following calc:
IF [Actual/Plan] = 'Actual' THEN [Value] END
Similarly, create a calculated field called "Plan" with the following calc:
IF [Actual/Plan] = 'Plan' THEN [Value] END
Now, you can finally create your "Variance" and "Variance %" calculations (respectively):
SUM([Actual]) - SUM([Plan])
[Variance] / SUM([Plan])

Sorting and merging in Stata on categorical variables

I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. HOWEVER, several of the categorical variables have more categories present in one data set over the other. I have been careful enough to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, Green, Blue, and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here, as this is purely about data management in Stata, so I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it, so I will do that now.
What you describe to generate identifiers is not how to think of merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set, generate an identifier that is based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
I think I may have solved my problem - I figured I would post an answer specifically relating to the problem in case anybody has the same issue.
~~
I have two data sets: One containing information about the amount of time IT help spent at a customer and another data set with how much product a customer purchased. Both data sets contain unique ID numbers for each company and the fiscal quarter and year that link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total sum of time spent at a given company regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES not OBSERVATIONS - that is, I wish to horizontally elongate not vertically elongate my final data set.
Stata 12 has many options for merging - one to one, many to one, and one to many. Supposing that I treat my IT time data set as my master and my purchases data set as my merging set, I would perform a "m:1" or many to one merge. This is because I have MANY purchases corresponding to one observation per quarter per company.

BIRT: Adding multiple Category (X) Series

I have a single dataset containing 4 columns, each showing the number of rejections for a quarter-year. A 5th column shows the Team to which those values belong.
Is it possible to add 4 fixed points on the x-Axis, each belonging to one of these columns? Then I could add the Team as the Y-Series. I'd like to see the evolution of each team in time.
Take a look at this example:
http://www.birt-exchange.org/org/devshare/designing-birt-reports/1553-use-column-names-as-chart-xaxis/
I solved this by writing a (really short, really simple) loop which takes the values out of the four mentioned columns, one at a time, and creates four rows instead.
So the rows basically contain the Team from the original row (duplicated 4 times) and the Number of Rejections. So now instead of a Row with one team and four numbers, I have four rows, each with one team and one number.
I did this all in the report scripts (under "fetch"). Try it, it's really easy.

Resources