I am new to the world of data science and am trying to understand the concepts behind the outcomes of ML. I have started with the scikit-learn clustering examples. Using the scikit-learn library is well documented everywhere, but all the examples assume the data is already numerical.
Now, how does a data scientist convert business data into machine learning data? Just to give an example, here is some customer and sales data I have prepared.
The first picture shows the customer data, with parameters holding integer, string, and boolean values.
The second picture shows the historical sales data for those customers.
Now, how does such real business data get translated into something a machine learning algorithm can be fed? How do I convert each field into a common representation the algorithm can understand?
Thanks
K
Technically, there are many ways, such as one-hot encoding, standardization, and moving to log space for skewed attributes.
But the problem is not just of a technical nature.
Finding a way is not enough; you need to find one that works really well for your problem, and that usually differs a lot from one problem to another. There is no "turn-key solution".
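As a rough sketch of those technical steps with pandas and scikit-learn (the columns below are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical customer frame with a skewed numeric, a categorical, and a boolean column
df = pd.DataFrame({
    "Revenue": [1200.0, 50.0, 980000.0],
    "Region":  ["EU", "US", "EU"],
    "Active":  [True, False, True],
})

df["Revenue"] = np.log1p(df["Revenue"])      # log space for the skewed attribute
df["Active"] = df["Active"].astype(int)      # boolean -> 0/1
df = pd.get_dummies(df, columns=["Region"])  # one-hot encoding

X = StandardScaler().fit_transform(df)       # standardized matrix, ready for clustering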
To add to Anony-Mousse's answer: you can convert the Won/Lost column to the values 1 and 0 (e.g. 1 for Won, 0 for Lost). For the Y column, suppose you have 3 unique values; you can convert A to [1, 0, 0], B to [0, 1, 0], and C to [0, 0, 1] (this is called one-hot encoding). Similarly for the Z column, you can convert TRUE to 1 and FALSE to 0.
To merge the two tables (or Excel files) together, you can use an additional library called pandas, which allows you to merge two DataFrames, e.g. df1.merge(df2, on='CustID', how='left'). Now you can feed your feature set to scikit-learn properly.
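Putting both steps together in a small pandas sketch (the column names follow the description above, but the frames themselves are hypothetical):

import pandas as pd

# stand-ins for the two spreadsheets
customers = pd.DataFrame({"CustID": [1, 2], "Y": ["A", "C"], "Z": [True, False]})
sales = pd.DataFrame({"CustID": [1, 2], "Outcome": ["Won", "Lost"]})

sales["Outcome"] = sales["Outcome"].map({"Won": 1, "Lost": 0})  # Won/Lost -> 1/0
customers["Z"] = customers["Z"].astype(int)                     # TRUE/FALSE -> 1/0
customers = pd.get_dummies(customers, columns=["Y"])            # one-hot encode Y

merged = customers.merge(sales, on="CustID", how="left")        # join the two tables
X = merged.drop(columns=["CustID"]).to_numpy()                  # feature matrix for scikit-learn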
I have some JSON data that I want to match to a particular array of IDs. So for example, the JSON temperature: 80, weather: tornado can map to an array of IDs [15, 1, 82]. This array of IDs is completely arbitrary and something I will define myself for that particular input, it's simply meant to give recommendations based on conditions.
So while a temperature >= 80 in tornado conditions always maps to [15, 1, 82], the same temperature in cloudy conditions might be [1, 16, 28], and so on.
The issue is that there are a LOT of potential "branches". My program has 7 times of day, each time-of-day node has 7 potential temperature ranges, and each temperature-range node has 15 possible weather events. So manually writing if statements for 735 combinations (if I did the math correctly) would be very unwieldy.
I have drawn a "decision tree" representing one path for demonstration purposes, above.
What are some recommended ways to represent this in code besides massively nested conditionals/case statements?
Thanks.
No need for massive branching. It's easy enough to create a lookup table with the 735 possible entries. You said that you'll add the values yourself.
Create enums for each of your times of day, temperature ranges, and weather events. So your times of day are mapped from 0 to 6, your temperature ranges are mapped from 0 to 6, and your weather events are mapped from 0 to 14. You basically have a 3-dimensional array. And each entry in the array is a list of ID lists.
In C# it would look something like this:
List<List<int>>[,,] lookupTable = new List<List<int>>[7, 7, 15];
To populate the lookup table, write a program that generates JSON that you can include in your program. In pseudocode:
for (i = 0 to 6) {          // loop for time of day
    for (j = 0 to 6) {      // loop for temperature ranges
        for (k = 0 to 14) { // loop for weather events
            // here, output JSON for the record
            // You'll probably want a comment with each record
            // to say which combination it's for.
            // The JSON here is basically just the list of
            // ID lists that you want to assign.
        }
    }
}
Perhaps you want to use that program to generate the JSON skeleton (i.e. one record for each [time-of-day, temperature, weather-event] combination), and then manually add the list of ID lists.
It's a little bit of preparation, but in the end your lookup is dead simple: convert the time-of-day, temperature, and weather event to their corresponding integer values, and look it up in the array. Just a few lines of code.
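The same idea in a Python sketch (the enum member names and the lookup.json file are placeholders for whatever you generate):

import json
from enum import IntEnum

class TimeOfDay(IntEnum):
    EARLY_MORNING = 0
    # ... the remaining times of day, 1 through 6

class TempRange(IntEnum):
    LUKEWARM = 0
    # ... the remaining ranges, 1 through 6

class Weather(IntEnum):
    SQUALL = 0
    # ... the remaining events, 1 through 14

with open("lookup.json") as f:   # the 7 x 7 x 15 table produced by the generator program
    table = json.load(f)         # each cell is the list of ID lists for that combination

def recommend(tod: TimeOfDay, temp: TempRange, weather: Weather) -> list:
    return table[tod][temp][weather]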
You could do something similar with a map or dictionary. You'd generate the JSON as above, but rather than load it into a three-dimensional array, load it into your dictionary with the key being the concatenation of the three dimensions. For example, a key would be:
"early morning,lukewarm,squall"
There are probably other lookup table solutions, as well. Those are the first two that I came up with. The point is that you have a whole lot of static data that's very amenable to indexed lookup. Take advantage of it.
When / in what context should you use StringIndexer vs StringIndexer+OneHotEncoder?
Looking at the docs for Spark ML's StringIndexer (https://spark.apache.org/docs/latest/ml-features#stringindexer) and OneHotEncoder (https://spark.apache.org/docs/latest/ml-features#onehotencoder), it's not obvious to me when to use just StringIndexer vs StringIndexer+OneHotEncoder (I've been using just a StringIndexer on a benchmarking dataset and getting pretty good results as is, but I suppose that does not mean that doing this is necessarily "correct"). The OneHotEncoder docs refer to a StringIndexer > OneHotEncoder > VectorAssembler staging pipeline, but the way it is worded makes that seem optional (vs just doing StringIndexer > VectorAssembler).
Can anyone clarify this for me?
First, it is necessary to use StringIndexer before OneHotEncoder, because OneHotEncoder needs a column of category indices as input.
To answer your question: StringIndexer alone may bias some machine learning models. For instance, if you pass a data frame with a categorical column indexed to three values (0, 1, and 2) to a linear regression model, the model may infer that category 2 is "twice" category 1, when in reality they are just different classes with different indices. A vector of zeros with a one at a specific position conveys the class difference without implying any order. So in the end it depends on the model used during training: tree-based models can work with the raw indices and may actually do worse with one-hot encoded vectors.
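For reference, the full staging pipeline mentioned in the docs looks roughly like this in PySpark (assuming Spark 3.x; the column names are hypothetical):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
encoder = OneHotEncoder(inputCols=["categoryIndex"], outputCols=["categoryVec"])
assembler = VectorAssembler(inputCols=["categoryVec", "amount"], outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
# model = pipeline.fit(train_df)        # train_df: your DataFrame
# prepared = model.transform(train_df)  # adds the assembled "features" column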
You may consider reading Create a Pipeline - Learning Spark for more details behind one hot encoding.
I've been working with the h2o.ai automl function on a few problems with quite a bit of success, but have come across a bit of a roadblock.
I've got a problem that uses 500-odd predictors (all float) to map onto 6 responses (again all float.)
Under "Required Data Parameters", the 3.16 docs say:

y: This argument is the name (or index) of the response column.
It seems that the automl library only handles a single response. Am I missing something? Perhaps in the terminology even?
In the case that I'm not, my plan is to build 6 separate leaderboards, one for each response, and use the results to kick-start a manual network search.
In theory I guess I could actually run the 6 automl models individually to get the vector response, but that feels like an odd approach.
Any insight would be appreciated,
Cheers.
Not just AutoML, but H2O generally, will only let you predict a single thing.
Without more information about what those 6 outputs represent, and their relationship to each other, I can think of 3 approaches.
Approach 1: 6 different models, as you suggest (see the sketch after this comparison).
Approach 2: Train an autoencoder to compress the 6 dimensions down to 1, train your model to predict that single value, then expand it back out, e.g. via a lookup table over the training data: if your model predicts 1.123, and [1, 2, 3, 4, 5, 6] was represented by 1.122 while [3.14, 0, 0, 3.14, 0, 0] was represented by 1.125, you could choose [1, 2, 3, 4, 5, 6], or a weighted average of the two closest matches. (Other dimension-reduction approaches, such as PCA, follow the same idea.)
Approach 3: If the possible combinations of your 6 floats form a (relatively small) finite set, you could map them to N categories with an explicit lookup table.
I assume each output is a continuous variable (which is why they are floats), so I expect approach 3 to be inferior to approach 2. If there is very little correlation/relationship between the 6 outputs, approach 1 is going to be best.
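A minimal sketch of Approach 1 with the H2O Python API (the file and column names are placeholders):

import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")               # hypothetical training file

responses = ["y1", "y2", "y3", "y4", "y5", "y6"]   # the 6 response columns (placeholders)
predictors = [c for c in train.columns if c not in responses]

leaders = {}
for y in responses:
    aml = H2OAutoML(max_models=20, seed=1)
    aml.train(x=predictors, y=y, training_frame=train)
    leaders[y] = aml.leader                        # best model for this response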
I am trying to build decision tree and random forest classifier on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (having string values) in the data set.
In the Spark ML documentation, it's mentioned that categorical variables can be converted to numeric by indexing, using either StringIndexer or VectorIndexer. I chose StringIndexer (VectorIndexer requires a vector feature column, and the VectorAssembler that builds such a column accepts only numeric types). With this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of the feature).
My question is how a Random Forest or Decision Tree algorithm will understand that these new features (derived from categorical features) are different from continuous variables. Will the indexed features be treated as continuous by the algorithm? Is this the right approach, or should I go ahead with one-hot encoding for the categorical features?
I read some of the answers on this forum but I didn't get clarity on the last part.
One-hot encoding should be done for categorical variables with more than 2 categories.
To understand why, you should know the difference between the two sub-categories of categorical data: ordinal data and nominal data.
Ordinal data: the values have some sort of ordering between them. Example:
Customer feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case StringIndexer alone is sufficient for modelling purposes.
Nominal data: the values have no defined ordering between them.
Example: colours (black, blue, white, ...). In this case StringIndexer alone is NOT sufficient, and one-hot encoding is required after string indexing.
After string indexing, let's assume the output is:
id | colour | categoryIndex
----|----------|---------------
0 | black | 0.0
1 | white | 1.0
2 | yellow | 2.0
3 | red | 3.0
Then, without one-hot encoding, the machine learning algorithm will assume red > yellow > white > black, which we know is not true.
OneHotEncoder() helps us avoid this situation.
So to answer your question,
Will the indexed features be treated as continuous by the algorithm?
They will be treated as continuous variables.
Is this the right approach, or should I go ahead with one-hot encoding for the categorical features?
It depends on your understanding of the data. Although Random Forest and some boosting methods don't require one-hot encoding, most ML algorithms need it.
Refer: https://spark.apache.org/docs/latest/ml-features.html#onehotencoder
In short, Spark's RandomForest does NOT require OneHotEncoder for categorical features created by StringIndexer or VectorIndexer.
Longer explanation: in general, decision trees can handle both ordinal and nominal types of data. However, depending on the implementation, OneHotEncoder may still be required (as it is in Python's scikit-learn).
Luckily, Spark's implementation of RandomForest honors categorical features if properly handled and OneHotEncoder is NOT required!
Proper handling means that categorical features contain the corresponding metadata so that RF knows what it is working on. Features that have been created by StringIndexer or VectorIndexer contain metadata in the DataFrame about being generated by the Indexer and being categorical.
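A sketch of such a pipeline in PySpark (the bank-marketing column names here are illustrative, not exhaustive):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

job_idx = StringIndexer(inputCol="job", outputCol="job_idx")  # categorical feature
label_idx = StringIndexer(inputCol="y", outputCol="label")    # target column
assembler = VectorAssembler(inputCols=["job_idx", "age", "balance"], outputCol="features")

# maxBins must be at least the number of categories of any categorical feature
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=32)

pipeline = Pipeline(stages=[job_idx, label_idx, assembler, rf])
# model = pipeline.fit(train_df)  # RF reads the categorical metadata; no OneHotEncoder needed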
According to vdep's answer, StringIndexer is enough for ordinal data. However, StringIndexer orders labels by frequency, so an intended ordering like "excellent > good > neutral > bad > very bad" may end up as "good, excellent, neutral, ...". So for ordinal data, StringIndexer on its own does not preserve the order.
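One workaround is to define the ordinal mapping explicitly instead of relying on StringIndexer; a sketch, assuming a DataFrame df with a hypothetical feedback column:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# explicit lookup that preserves the intended order, unlike frequency-based indexing
scale = spark.createDataFrame(
    [("very bad", 0.0), ("bad", 1.0), ("neutral", 2.0), ("good", 3.0), ("excellent", 4.0)],
    ["feedback", "feedback_idx"],
)
# df: your DataFrame with a "feedback" column (hypothetical)
df = df.join(scale, on="feedback", how="left")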
Secondly, for nominal data, the documentation tells us that:
for a binary classification problem with one categorical feature with three categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical features are ordered as A, C, B. The two split candidates are A | C, B and A , C | B where | denotes the split.
The "corresponding proportions of label 1" is same as the label frequency? So I am confused of the feasibility with the StringInder to DecisionTree in Spark.
StackOverflow crowd. I have a very open-ended software design question.
I've been looking for an elegant solution to this for a while and I was wondering if anyone here has some brilliant insight into the problem. Consider this to be like a data structures puzzle.
What I am trying to do is to create a unit converter that is capable of converting from any unit to any unit. Assume that the lexing and parsing is already done. A few simple examples:
Convert("days","hours") // Yields 24
Convert("revolutions", "degrees") // Yields 360
To make things a little more complicated, it must smoothly handle ambiguities between inputs:
Convert("minutes","hours") // Yields (1/60)
Convert("minutes","revolutions") // Yields (1/21600)
To make things even more fun, it must handle complex units without needing to enumerate all possibilities:
Convert("meters/second","kilometers/hour")
Convert("miles/hour","knots")
Convert("Newton meters","foot pounds")
Convert("Acre feet","meters^3")
There's no right or wrong answer, I'm looking for ideas on how to accomplish this. There's always a brute force solution, but I want something elegant that is simple and scalable.
I would start with a hashtable (or persisted lookup table - your choice how you implement) that carries unit conversions between as many pairs as you care to put in. If you put in every possible pair, then this is your brute force approach.
If you have only partial pairs, you can then do a search across the pairs you do have to find a combination. For example, let's say I have these two entries in my hashtable:
Feet|Inches|12
Inches|Centimeters|2.54
Now if I want to convert feet to centimeters, I have a simple graph search: the vertices are Feet, Inches, and Centimeters, and the edges are the 12 and 2.54 conversion factors. The solution in this case is the two edges 12 and 2.54 (combined via multiplication, of course: 12 x 2.54 = 30.48 cm per foot). You can get fancier with the graph parameters if you want to.
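A minimal sketch of that graph search in Python (only the two example edges are loaded; the table would hold whatever pairs you care to add):

from collections import deque

# conversion graph: factor from the key unit to each neighbour
graph = {
    "feet":        {"inches": 12.0},
    "inches":      {"feet": 1 / 12.0, "centimeters": 2.54},
    "centimeters": {"inches": 1 / 2.54},
}

def convert_factor(src, dst):
    # breadth-first search, multiplying the edge factors along the path
    queue, seen = deque([(src, 1.0)]), {src}
    while queue:
        unit, factor = queue.popleft()
        if unit == dst:
            return factor
        for nxt, f in graph.get(unit, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, factor * f))
    raise ValueError(f"no conversion path from {src} to {dst}")

print(convert_factor("feet", "centimeters"))  # 30.48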
Another approach might be applying abductive reasoning - look into AI texts about algebraic problem solvers for this...
Edit: Addressing Compound Units
Simplified problem: convert "Acres" to "Meters^2"
In this case, the key is understanding that we are talking about units of length, so why don't we insert a new column into the table for unit type, which can be "length" or "area"? This will help performance even in the earlier cases, as it gives you an easy column to pare down your search space.
Now the trick is to understand that length^2 = area. Why not add another lookup that stores this metadata:
Area|Length|Length|*
We couple this with the primary units table:
Meters|Feet|3.28|Length
Acres|Feet^2|43560|Area
So the algorithm goes:
Solution is m^2, which is m * m, which is a length * length.
Input is acres, which is an area.
Search the meta table for m, and find the length * length mapping. Note that in more complex examples there may be more than one valid mapping.
Append to the solution a conversion Acres->Feet^2.
Perform the original graph search for Feet->M.
Note that:
The algorithm won't know whether to use area or length as the basic domain in which to work. You can provide it hints, or let it search both spaces.
The meta table gets a little brute-force-ish.
The meta table will need to get smarter if you start mixing types (e.g. Resistance = Voltage / Current) or doing something really ugly and mixing unit systems (e.g. a FooArea = Meters * Feet).
Whatever structure you choose (and your choice may well be directed by your preferred implementation: OO? functional? DBMS table?), I think you need to identify the structure of units themselves.
For example a measurement of 1000km/hr has several components:
a scalar magnitude, 1000;
a prefix, in this case kilo; and
a dimension, in this case L.T^(-1), that is, length divided by time.
Your modelling of measurements with units needs to capture at least this complexity.
As has already been suggested, you should establish what the base set of units you are going to use are, and the SI base units immediately suggest themselves. Your data structure(s) for modelling units would then be defined in terms of those base units. You might therefore define a table (thinking RDBMS here, but easily translatable into your preferred implementation) with entries such as:
unit name     | dimension               | conversion to base
--------------|-------------------------|-------------------
foot          | Length                  | 0.3048
gallon(UK)    | Length^3                | 4.546092 x 10^(-3)
kilowatt-hour | Mass.Length^2.Time^(-2) | 3.6 x 10^6
and so forth. You'll also need a table to translate prefixes (kilo-, nano-, mega-, mebi-, etc.) into multiplying factors, and a table of base units for each of the dimensions (i.e. meter is the base unit for Length, second for Time, etc.). You'll also have to cope with units such as feet which are simply synonyms for other units.
The purpose of dimension is, of course, to ensure that your conversions and other operations (such as adding 2 feet to 3.5 metres) are commensurate.
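As a sketch of how the dimension check might look in Python, with each dimension stored as a tuple of exponents over (Length, Mass, Time) and factors taken from the table above:

UNITS = {
    # unit: (factor to SI base, (length, mass, time) exponents)
    "meter":         (1.0,         (1, 0, 0)),
    "foot":          (0.3048,      (1, 0, 0)),
    "gallon(UK)":    (4.546092e-3, (3, 0, 0)),
    "kilowatt-hour": (3.6e6,       (2, 1, -2)),
}

def convert(value, src, dst):
    f_src, dim_src = UNITS[src]
    f_dst, dim_dst = UNITS[dst]
    if dim_src != dim_dst:
        raise ValueError(f"{src} and {dst} are not commensurate")
    return value * f_src / f_dst

print(convert(10, "foot", "meter"))  # 3.048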
And, for further reading, I suggest this book by Cardarelli.
EDIT in response to comments ...
I'm trying to veer away from suggesting (implementation-specific) solutions, so I'll waffle a bit more. Compound units, such as kilowatt-hours, do pose a problem. One approach would be to tag measurements with multiple unit expressions, such as kilowatt and hour, and a rule for combining them, in this case multiplication. I could see this getting quite hairy quite quickly. It might be better to restrict the valid set of units to the most common ones in the domain of the application.
As to dealing with measurements in mixed units, well the purpose of defining the Dimension of a unit is to provide some means to ensure that only sensible operations can be applied to measurements-with-units. So, it's sensible to add two lengths (L+L) together, but not a length (L) and a volume (L^3). On the other hand it is sensible to divide a volume by a length (to get an area (L^2)). And it's kind of up to the application to determine if strange units such as kilowatt-hours per square metre are valid.
Finally, the book I link to does enumerate all the possibilities, I guess most sensible applications with units will implement only a selection.
I would start by choosing a standard unit for every quantity (e.g. meters for length, newtons for force, etc.) and then storing all the conversion factors to that unit in a table.
Then, to go from days to hours, for example, you find the conversion factors for seconds per day and seconds per hour and divide them to get the answer.
For ambiguities, each unit could be associated with all the types of quantities it measures; to determine which conversion to do, you would take the intersection of those two sets of types (and if you're left with zero or more than one, you would report an error).
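A minimal sketch of this in Python, using the question's own examples (the quantity names and the choice of standard units are mine):

# factor to the standard unit of each quantity a unit can measure
# (standard units here: seconds for time, degrees for angle)
FACTORS = {
    "seconds":     {"time": 1.0},
    "hours":       {"time": 3600.0},
    "days":        {"time": 86400.0},
    "degrees":     {"angle": 1.0},
    "revolutions": {"angle": 360.0},
    "minutes":     {"time": 60.0, "angle": 1 / 60.0},  # ambiguous unit
}

def convert(src, dst):
    common = FACTORS[src].keys() & FACTORS[dst].keys()
    if len(common) != 1:
        raise ValueError("ambiguous or incompatible conversion")
    quantity = common.pop()
    return FACTORS[src][quantity] / FACTORS[dst][quantity]

print(convert("days", "hours"))           # 24.0
print(convert("minutes", "revolutions"))  # 1/21600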
I assume that you want to hold the data about conversion in some kind of triples (fstUnit, sndUnit, multiplier).
For single unit conversions:
Use a hash function to map the unit structure to a number in O(1), and then put all the multipliers in a matrix (you only have to store the upper-right half, because the lower-left half holds the same factors, just inverted).
For complex cases:
Example 1. m/s to km/h. You look up (m, km) in the matrix, then (s, h), then combine the two factors (for a quotient unit, divide the numerator's factor by the denominator's).
Example 2. m^3 to km^3. You look up (m, km) and raise it to the third power.
Of course you should report an error when the dimensions don't match, e.g. trying to convert an area to a volume.
You can make a class for units that takes the conversion factor and the exponents of all the basic units (I'd suggest using metric units for this; it makes your life easier). E.g. in pseudo-Java:
public class Unit {
    private final double factor;
    private final int[] baseExponents;  // meter, second, kilogram, ... exponents

    public Unit(double factor, int meterExp, int secondExp, int kilogramExp /*, other base units */) {
        this.factor = factor;
        this.baseExponents = new int[] { meterExp, secondExp, kilogramExp };
    }
}

// you need the speed in km/h (1 m/s is 3.6 km/h):
Unit kmPerH = new Unit(1 / 3.6, 1, -1, 0);
I would have a table with these fields:
conversionID
fromUnit
toUnit
multiplier
and however many rows you need to store all the conversions you want to support
If you want to support a multi-step process (degrees F to C), you'd need a one-to-many relationship with the units table, say called conversionStep, with fields like
conversionID
sequence
operator
value
If you want to store one set of conversions but support multi-step conversions, like storing
Feet|Inches|12
Inches|Centimeters|2.54
and supporting converting from Feet to Centimeters, I would store a conversion plan in another table, like
conversionPlanID
startUnits
endUnits
via
your row would look like
1 | feet | centimeters | inches