When to use StringIndexer vs StringIndexer+OneHotEncoder? - apache-spark-mllib

When / in what context should you use StringIndexer vs StringIndexer+OneHotEncoder?
Looking at the docs for Spark ML's StringIndexer (https://spark.apache.org/docs/latest/ml-features#stringindexer) and OneHotEncoder (https://spark.apache.org/docs/latest/ml-features#onehotencoder), it's not obvious to me when to use just StringIndexer vs StringIndexer+OneHotEncoder. (I've been using just a StringIndexer on a benchmarking dataset and getting pretty good results as is, but I suppose that does not mean that doing this is necessarily "correct".) The OneHotEncoder docs refer to a StringIndexer > OneHotEncoder > VectorAssembler staging pipeline, but the way it is worded makes that seem optional (vs just doing StringIndexer > VectorAssembler).
Can anyone clarify this for me?

First, if you want to use OneHotEncoder you need to run StringIndexer before it, because OneHotEncoder needs a column of category indices as input.
To answer your question: using StringIndexer alone may bias some machine learning models. For instance, suppose you pass a data frame with a categorical column that has three classes (indexed 0, 1, and 2) to a linear regression model. The model may conclude that class 2 counts twice as much as class 1, when in fact the indices are just labels for different classes. A one-hot vector, with zeros everywhere except a single one at the position of the class, conveys only that the classes differ. So in the end it depends on the model used for training: tree-based models are sensitive to one-hot encoding and can perform worse with one-hot encoded vectors.
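To make the difference concrete, here is a minimal plain-Python sketch of the two encodings. It does not use or reproduce the Spark API: the helper names `fit_index` and `one_hot` are hypothetical, the frequency-first ordering with alphabetical tie-breaking is an assumption made for illustration, and Spark's OneHotEncoder by default produces sparse vectors and drops the last category, which this sketch ignores.

```python
from collections import Counter


def fit_index(values):
    """Map each category to an integer index, most frequent category first.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


colors = ["red", "green", "blue", "red", "green", "red"]
mapping = fit_index(colors)

# Index encoding: one number per row. A linear model would read "blue" (2)
# as twice "green" (1), although the numbers are only labels.
indexed = [mapping[c] for c in colors]

# One-hot encoding: one column per category, so no ordering is implied.
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(indexed)
print(encoded[:3])
```

With index encoding a linear model gets one coefficient for the whole column; with one-hot encoding it gets one coefficient per category, which is why the choice matters more for linear models than for trees.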
You may consider reading Create a Pipeline - Learning Spark for more details behind one hot encoding.

Related

Cross Entropy Loss Gets Negative Values In Training Transformer Model

I built a Transformer model for recovering text. In detail, the source text may contain redundant, missing, or wrong words, and my model has to correct as many of these words as possible. Moreover, I just want my model to learn the embedding of the correct sentence, so the sources and the targets are sequences of embeddings. Therefore, my loss function, Cross Entropy, takes two embedding sequences as input and target. In addition, this model is part of a larger model whose main criterion is Negative Log-Likelihood.
Unfortunately, the value of the Cross Entropy loss drops below 0.0 after a few epochs, and then the sum of Cross Entropy and Negative Log-Likelihood is below 0.0 too. This prevents the whole model from converging.
I need help resolving this issue. Thanks in advance.

Concatenated Doc2Vec - calculate similarities

I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two of them and calculate the similarity of a given input word, choosing the returned vectors from the concatenated model. I read a lot of comments regarding the fact that this method may not be particularly suited for performance improvement and that it might be necessary to change the source code to the KeyedVector class in gensim to enable it. Up to now I attempted to do that using the Translation Matrix but it returns 5 features from the second model and I am not sure about whether it is performing the translations correctly or not.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to reproduce this:
vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1
for i in range(len(vocab1.vectors)):
    v1 = vocab1.vectors[i]
    v2 = vocab2.vectors[i]
    vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
In order to re-calculate the most_similar() top-n features for a passed argument, how should I re-instantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.

Use of validation_frame in H2O AutoML

Just started with H2O AutoML so apologies in advance if I have missed something basic.
I have a binary classification problem where data are observations from K years. I want to train on the K-1 years and tune the models and select the best one explicitly based on the remaining K year.
If I switch off cross-validation (with nfolds=0) to avoid randomly blending of years into the N folds and define data of year K as the validation_frame then I don't have the ensemble created (as expected according to the documentation) which in fact I need.
If I train with cross-validation (default nfolds) and defining a validation frame to be the K-year data
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(x=x,y=y, training_frame=k-1_years, validation_frame=k_year)
then according to
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
the validation_frame is ignored
"...By default and when nfolds > 1, cross-validation metrics will be used for early stopping and thus validation_frame will be ignored."
Is there a way to get the tuning of the models and the selection of the best one(ensemble or not) based on the K-year data only, and while the ensemble of models is also available in the output?
Thanks a lot!
You don't want to use cross-validation (CV) if you are dealing with time-series (non-IID) data, since you don't want folds from the future to predict the past.
I would explicitly add nfolds=0 so that CV is disabled in AutoML:
aml = H2OAutoML(max_runtime_secs=3600, seed=1, nfolds=0)
aml.train(x=x,y=y, training_frame=k-1_years, validation_frame=k_year)
To have an ensemble, add a blending_frame which also applies to time-series. See more info here.
Additionally, since you are dealing with time-series data, I would recommend adding time-series transformations (e.g. lags), so that your model gets info from previous years and their aggregates (e.g. weighted moving average).

I'm looking for an algorithm or function that can take a text string and convert it to a number

I'm looking for an algorithm, function, or technique that can take a string and convert it to a number. I would like the algorithm or function to have the following properties:
Identical string yields the same calculated value
Similar strings would yield similar values (similar can be defined as similar in meaning or similar in composition)
Capable of handling strings of variable length
I read an article several years ago that gives me hope that this can be achieved. Unfortunately, I have been unable to recall the source of the article.
Similar in composition is pretty easy, I'll let somebody else tackle that.
Similar in meaning is a lot harder, but fun :), I remember reading an article about how a neural network was trained to construct a 2D "semantic meaning graph" of a whole bunch of english words, where the distance between two words represented how "similar" they are in meaning, just by training it on wikipedia articles.
You could do the same thing, but make it one-dimensional, that will give you a single continuous number, where similar words will be close to each other.
Non-serious answer: Map everything to 0
Property 1: check. Property 2: check. Property 3: check.
But I figure you want dissimilar strings to get different values, too. The question then is, what is similar and what is not.
Essentially, you are looking for a hash function.
There are a lot of hash functions designed with different objectives. Cryptographic hashes, for example, are pretty expensive to compute, because you want to make it really hard to go backwards or even predict how a change to the input affects the output. So they try really hard to violate your condition 2. There are also simpler hash functions that mostly try to spread the data. They mostly try to ensure that input values that are close together are not close to each other afterwards (but it is okay if the result is predictable).
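As a small illustration of why a cryptographic hash satisfies property 1 but works against property 2, here is a sketch using SHA-256 from Python's standard library. The function name `string_to_number` is hypothetical, and UTF-8 is an assumed encoding.

```python
import hashlib


def string_to_number(text):
    """Turn a string of any length into a 256-bit integer via SHA-256."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


a = string_to_number("hello world")
b = string_to_number("hello world")   # identical input
c = string_to_number("hello worle")   # one character changed

# Count how many of the 256 output bits differ between the near-identical inputs.
differing_bits = bin(a ^ c).count("1")

print(a == b)
print(differing_bits)
```

Identical strings always give the same number, while a one-character change flips a large share of the output bits, so closeness of inputs is deliberately not preserved.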
You may want to read up on Wikipedia:
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_substrings
(Yes, it has a section on "Finding similar substrings" via Hashing)
Wikipedia also has a list of hash functions:
https://en.wikipedia.org/wiki/List_of_hash_functions
There are a couple of related techniques for you. For example, minhash could be used. Here is a minhash-inspired approach for you: Define a few random lists of all letters in your alphabet. Say I have the letters "abcde" only for this example. I'll only use two lists for this example. Then my lists are:
p1 = "abcde"
p2 = "edcba"
Let f1(str) be the index in p1 of the first letter of p1 that occurs in my test word (that is, the smallest p1 index over the word's letters), and f2(str) the same for p2. So the word "bababa" would map to 0,3. The word "ababab" also. The word "dada" would map to 0,1, while "ce" maps to 2,0. Note that this map is invariant to permutations of the letters within a word (because it treats words as sets of letters) and for long texts it will converge to "0,0". Yet with some fine tuning it can give you a pretty fast way of finding candidates for closer inspection.
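The mapping above can be sketched in a few lines of Python. This reads f1 and f2 as the smallest position, in each letter list, of any letter occurring in the word, which is the reading that reproduces the values quoted in the text. The function name `signature` is hypothetical.

```python
def signature(word, orderings):
    """Return one number per ordering: the smallest position in that
    ordering of any letter that occurs in `word`."""
    letters = set(word)
    return tuple(min(order.index(ch) for ch in letters) for order in orderings)


orderings = ["abcde", "edcba"]  # p1 and p2 from the text

for word in ["bababa", "ababab", "dada", "ce"]:
    print(word, signature(word, orderings))
```

Words built from the same set of letters get the same signature, so equal signatures only nominate candidates; a closer comparison is still needed afterwards.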
Fuzzy hashing (context triggered piecewise hashing) may be what you are looking for.
Implementation: ssdeep
Explanation of the algorithm: Identifying almost identical files using context triggered piecewise hashing
I think you're probably after a hash function, as numerous posters have said. However, similar in meaning is also possible, after a fashion: use something like Latent Dirichlet Allocation or Latent Semantic Analysis to map your word into multidimensional space, relative to a model trained on a large collection of text (these pre-trained models can be downloaded if you don't have access to a representative sample of the kind of text you're interested in). If you need a scalar value rather than multi-dimensional vector (it's hard to tell, you don't say what you want it for) you could try a number of things like the probability of the most probable topic, the mean across the dimensions, the index of the most probable topic, etc. etc.
num = 0
for byte in text.encode("utf-8"):  # text is the input string; UTF-8 is an assumed encoding
    num += byte  # each byte is an unsigned value in 0..255
This would meet all 3 properties (for #2, this works on the string's binary composition).

Good Data Structure for Unit Conversion? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
StackOverflow crowd. I have a very open-ended software design question.
I've been looking for an elegant solution to this for a while and I was wondering if anyone here had some brilliant insight into the problem. Consider this to be like a data structures puzzle.
What I am trying to do is to create a unit converter that is capable of converting from any unit to any unit. Assume that the lexing and parsing is already done. A few simple examples:
Convert("days","hours") // Yields 24
Convert("revolutions", "degrees") // Yields 360
To make things a little more complicated, it must smoothly handle ambiguities between inputs:
Convert("minutes","hours") // Yields (1/60)
Convert("minutes","revolutions") // Yields (1/21600)
To make things even more fun, it must handle complex units without needing to enumerate all possibilities:
Convert("meters/second","kilometers/hour")
Convert("miles/hour","knots")
Convert("Newton meters","foot pounds")
Convert("Acre feet","meters^3")
There's no right or wrong answer, I'm looking for ideas on how to accomplish this. There's always a brute force solution, but I want something elegant that is simple and scalable.
I would start with a hashtable (or persisted lookup table - your choice how you implement) that carries unit conversions between as many pairs as you care to put in. If you put in every possible pair, then this is your brute force approach.
If you have only partial pairs, you can then do a search across the pairs you do have to find a combination. For example, let's say I have these two entries in my hashtable:
Feet|Inches|12
Inches|Centimeters|2.54
Now if I want to convert feet to centimeters, I have a simple graph search: vertices are Feet, Inches, and Centimeters, and edges are the 12 and 2.54 conversion factors. The solution in this case is the two edges 12, 2.54 (combined via multiplication, of course, giving 30.48). You can get fancier with the graph parameters if you want to.
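The graph search could look roughly like the following breadth-first sketch. It assumes each table row means "1 source unit = multiplier target units" (so the feet-to-inches entry is 12 here) and adds the inverse edge for every stored pair. The names `build_graph` and `conversion_factor` are hypothetical.

```python
from collections import deque


def build_graph(pairs):
    """Build an adjacency map from (source, target, multiplier) rows,
    adding the reverse edge with the inverse multiplier."""
    graph = {}
    for src, dst, factor in pairs:
        graph.setdefault(src, []).append((dst, factor))
        graph.setdefault(dst, []).append((src, 1.0 / factor))
    return graph


def conversion_factor(graph, src, dst):
    """Breadth-first search from src to dst, multiplying edge factors."""
    if src not in graph or dst not in graph:
        raise KeyError("unknown unit")
    queue = deque([(src, 1.0)])
    seen = {src}
    while queue:
        unit, factor = queue.popleft()
        if unit == dst:
            return factor
        for neighbor, step in graph[unit]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, factor * step))
    raise ValueError("no conversion path between units")


graph = build_graph([
    ("feet", "inches", 12.0),
    ("inches", "centimeters", 2.54),
])
print(conversion_factor(graph, "feet", "centimeters"))
```

Breadth-first search returns a path with the fewest hops, which also keeps the number of floating-point multiplications small.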
Another approach might be applying abductive reasoning - look into AI texts about algebraic problem solvers for this...
Edit: Addressing Compound Units
Simplified problem: convert "Acres" to "Meters^2"
In this case, the key is understanding what kind of quantity each unit measures, so why don't we insert a new column into the table for unit type, which can be "length" or "area". This will help performance even in the earlier cases as it gives you an easy column to pare down your search space.
Now the trick is to understand that length^2 = area. Why not add another lookup that stores this metadata:
Area|Length|Length|*
We couple this with the primary units table:
Meters|Feet|3.28|Length
Acres|Feet^2|43560|Area
So the algorithm goes:
Solution is m^2, which is m * m, which is a length * length.
Input is acres, which is an area.
Search the meta table for m, and find the length * length mapping. Note that in more complex examples there may be more than one valid mapping.
Append to the solution a conversion Acres->Feet^2.
Perform the original graph search for Feet->M.
Note that:
The algorithm won't know whether to use area or length as the basic domain in which to work. You can provide it hints, or let it search both spaces.
The meta table gets a little brute-force-ish.
The meta table will need to get smarter if you start mixing types (e.g. Resistance = Voltage / Current) or doing something really ugly and mixing unit systems (e.g. a FooArea = Meters * Feet).
Whatever structure you choose, and your choice may well be directed by your preferred implementation (OO ? functional ? DBMS table ?) I think you need to identify the structure of units themselves.
For example a measurement of 1000km/hr has several components:
a scalar magnitude, 1000;
a prefix, in this case kilo; and
a dimension, in this case L.T^(-1), that is, length divided by time.
Your modelling of measurements with units needs to capture at least this complexity.
As has already been suggested, you should establish what the base set of units you are going to use are, and the SI base units immediately suggest themselves. Your data structure(s) for modelling units would then be defined in terms of those base units. You might therefore define a table (thinking RDBMS here, but easily translatable into your preferred implementation) with entries such as:
unit name        dimension                  conversion to base
foot             Length                     0.3048
gallon(UK)       Length^3                   4.546092 x 10^(-3)
kilowatt-hour    Mass.Length^2.Time^(-2)    3.6 x 10^6
and so forth. You'll also need a table to translate prefixes (kilo-, nano-, mega-, mebi- etc.) into multiplying factors, and a table of base units for each of the dimensions (i.e. meter is the base unit for Length, second for Time, etc.). You'll also have to cope with units such as feet which are simply synonyms for other units.
The purpose of dimension is, of course, to ensure that your conversions and other operations (such as adding 2 feet to 3.5 metres) are commensurate.
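A minimal sketch of such a table and the dimension check, using the three rows above plus a few extra units (meter, liter, joule) added here only so there is something to convert to. The tuple layout (mass, length, time exponents) and the names `UNITS` and `convert` are assumptions for illustration.

```python
# unit name -> ((mass, length, time) exponents, factor to SI base units)
UNITS = {
    "meter":         ((0, 1, 0), 1.0),
    "foot":          ((0, 1, 0), 0.3048),
    "liter":         ((0, 3, 0), 1.0e-3),
    "gallon(UK)":    ((0, 3, 0), 4.546092e-3),
    "joule":         ((1, 2, -2), 1.0),
    "kilowatt-hour": ((1, 2, -2), 3.6e6),
}


def convert(value, src, dst):
    """Convert via the base units, refusing incommensurate dimensions."""
    src_dim, src_factor = UNITS[src]
    dst_dim, dst_factor = UNITS[dst]
    if src_dim != dst_dim:
        raise ValueError(f"cannot convert {src} to {dst}: dimensions differ")
    return value * src_factor / dst_factor


print(convert(10, "foot", "meter"))
print(convert(1, "kilowatt-hour", "joule"))
print(convert(1, "gallon(UK)", "liter"))
```

Because every unit stores only its factor to the base unit, adding a new unit is one row rather than one row per pair.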
And, for further reading, I suggest this book by Cardarelli.
EDIT in response to comments ...
I'm trying to veer away from suggesting (implementation-specific) solutions so I'll waffle a bit more. Compound units, such as kilowatt-hours, do pose a problem. One approach would be to tag measurements with multiple unit-expressions, such as kilowatt and hour, and a rule for combining them, in this case multiplication. I could see this getting quite hairy quite quickly. It might be better to restrict the valid set of units to the most common ones in the domain of the application.
As to dealing with measurements in mixed units, well the purpose of defining the Dimension of a unit is to provide some means to ensure that only sensible operations can be applied to measurements-with-units. So, it's sensible to add two lengths (L+L) together, but not a length (L) and a volume (L^3). On the other hand it is sensible to divide a volume by a length (to get an area (L^2)). And it's kind of up to the application to determine if strange units such as kilowatt-hours per square metre are valid.
Finally, the book I link to does enumerate all the possibilities, but I guess most sensible applications with units will implement only a selection.
I would start by choosing a standard unit for every quantity (e.g. meters for length, newtons for force, etc.) and then storing all the conversion factors to that unit in a table.
Then to go from days to hours, for example, you find the conversion factors for seconds per day and seconds per hour and divide them to find the answer.
For ambiguities, each unit could be associated with all the types of quantities it measures, and to determine which conversion to do, you would take the intersection of those two sets of types (and if you're left with zero or more than one, you would spit out an error).
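Here is one way this idea might look in Python, using the question's own "minutes" ambiguity (time vs. angle). The table layout and the names `TO_STANDARD` and `convert` are assumptions; seconds and degrees are picked as the standard units purely for illustration.

```python
# quantity type -> {unit: factor to that quantity's standard unit}
TO_STANDARD = {
    "time":  {"seconds": 1.0, "minutes": 60.0, "hours": 3600.0, "days": 86400.0},
    "angle": {"degrees": 1.0, "minutes": 1.0 / 60.0, "revolutions": 360.0},
}


def convert(src, dst):
    """Return how many `dst` units make up one `src` unit.

    The quantity type is chosen as the single type that both units share;
    zero or several shared types is reported as an error.
    """
    shared = [q for q, units in TO_STANDARD.items() if src in units and dst in units]
    if len(shared) != 1:
        raise ValueError(f"cannot resolve a unique quantity type for {src} -> {dst}")
    units = TO_STANDARD[shared[0]]
    return units[src] / units[dst]


print(convert("days", "hours"))
print(convert("minutes", "hours"))
print(convert("minutes", "revolutions"))
```

"minutes" resolves to time when paired with "hours" and to angle when paired with "revolutions", without the caller naming the quantity type.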
I assume that you want to hold the data about conversion in some kind of triples (fstUnit, sndUnit, multiplier).
For single unit conversions:
Use some hash function in O(1) to map the unit structure to a number, and then put all multipliers in a matrix (you only have to store the upper-right part, because the reflected entries are the same values, but inverted).
For complex cases:
Example 1. m/s to km/h. You check (m,km) in the matrix, then (s,h), then combine the results: multiply by the first and divide by the second, because seconds are in the denominator.
Example 2. m^3 to km^3. You check (m,km) and take it to the third power.
Of course, some conversions are errors, namely when the types don't match, like area and volume.
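A rough sketch of the two examples, with a dictionary standing in for the upper-right half of the matrix. Compound units are written as lists of (unit, exponent) pairs, a representation assumed here for illustration, and a unit in the denominator has exponent -1, so its factor ends up dividing the result. The names `MULTIPLIERS`, `factor`, and `compound_factor` are hypothetical.

```python
# Only one direction of each pair is stored; the reverse is the inverse.
MULTIPLIERS = {
    ("m", "km"): 1.0e-3,
    ("s", "h"): 1.0 / 3600.0,
}


def factor(src, dst):
    """Look up a single-unit multiplier, inverting the stored entry if needed."""
    if src == dst:
        return 1.0
    if (src, dst) in MULTIPLIERS:
        return MULTIPLIERS[(src, dst)]
    return 1.0 / MULTIPLIERS[(dst, src)]


def compound_factor(src_units, dst_units):
    """Combine per-unit factors for compound units given as lists of
    (unit, exponent) pairs in matching order."""
    if len(src_units) != len(dst_units):
        raise ValueError("unit structures do not match")
    result = 1.0
    for (src, src_exp), (dst, dst_exp) in zip(src_units, dst_units):
        if src_exp != dst_exp:
            raise ValueError("exponents do not match")
        result *= factor(src, dst) ** src_exp
    return result


# Example 1: m/s to km/h
print(compound_factor([("m", 1), ("s", -1)], [("km", 1), ("h", -1)]))
# Example 2: m^3 to km^3
print(compound_factor([("m", 3)], [("km", 3)]))
```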
You can make a class for Units that takes the conversion factor and the exponents of all basic units (I'd suggest to use metric units for this, that makes your life easier). E.g. in Pseudo-Java:
public class Unit {
    public Unit(double factor, int meterExp, int secondExp, int kilogrammExp ... [other base units]) {
        ...
    }
}
//you need the speed in km/h (1 m/s is 3.6 km/h):
Unit kmPerH = new Unit(1 / 3.6, 1, -1, 0, ...)
I would have a table with these fields:
conversionID
fromUnit
toUnit
multiplier
and however many rows you need to store all the conversions you want to support
If you want to support a multi-step process (degrees F to C), you'd need a one-to-many relationship with the units table, say called conversionStep, with fields like
conversionID
sequence
operator
value
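To illustrate how such conversionStep rows might be applied, here is a small sketch for degrees F to C. The rows are held in a list rather than a database table, and the conversion ID, operator symbols, and function name `apply_conversion` are assumptions made for this example.

```python
import operator

OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

# conversionStep rows: (conversionID, sequence, operator, value)
# Conversion 1 is degrees F -> degrees C: subtract 32, multiply by 5, divide by 9.
CONVERSION_STEPS = [
    (1, 3, "/", 9.0),
    (1, 1, "-", 32.0),
    (1, 2, "*", 5.0),
]


def apply_conversion(conversion_id, amount):
    """Apply the steps of one conversion in `sequence` order."""
    steps = sorted(
        (row for row in CONVERSION_STEPS if row[0] == conversion_id),
        key=lambda row: row[1],
    )
    for _, _, op, value in steps:
        amount = OPERATORS[op](amount, value)
    return amount


print(apply_conversion(1, 212.0))
print(apply_conversion(1, 32.0))
```

Sorting by the sequence field means the rows can be stored in any order, as they would be in a table without a guaranteed row order.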
If you want to store one set of conversions but support multi-step conversions, like storing
Feet|Inches|12
Inches|Centimeters|2.54
and supporting converting from Feet to Centimeters, I would store a conversion plan in another table, like
conversionPlanID
startUnits
endUnits
via
your row would look like
1 | feet | centimeters | inches
