String indexes converted to onehot vector are blank (no index set to 1) for some rows? - apache-spark-mllib

I have a pyspark dataframe with a categorical column that is being converted into a onehot encoded vector via...
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

si = StringIndexer(inputCol="LABEL", outputCol="LABEL_IDX").fit(df)
df = si.transform(df)
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"]).fit(df)
df = oh.transform(df)
When looking at the dataframe afterwards, I see some of the onehot encoded vectors looking like...
(1,[],[])
I would expect the sparse vectors to either look like (1,[0],[1.0]) or (1,[1],[1.0]), but here the vectors are just zeros.
Any idea what could be happening here?

This has to do with how the values are encoded in mllib.
The 1hot is not encoding the binary value like...
[1, 0] or [0, 1]
in a [this, that] fashion but rather
[1] or [0]
In the sparse vector format, the [0] case looks like (1,[],[]): length = 1, no positions hold a nonzero value, and thus there are no nonzero values to list (you can see more about how mllib represents sparse vectors here). So, just as a binary category only needs a single bit to represent both choices, the 1hot encoding uses a single index in the vector. From another article on encoding...
One Hot Encoding is very popular. We can represent all categories by N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included [... But note that] for classification the recommendation is to use all N columns without dropping one, as most of the tree-based algorithms build their trees on all of the available columns.
If you don't want the onehot encoder to drop the last category to simplify the representation, the mllib class has a dropLast param you can set; see https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator
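For example, a quick sketch reusing the same df and column names from the question (dropLast is the parameter mentioned above; it defaults to True):
oh = OneHotEncoderEstimator(inputCols=["LABEL_IDX"], outputCols=["LABEL_OH"],
                            dropLast=False).fit(df)
df = oh.transform(df)
# With two categories the vectors now come out as (2,[0],[1.0]) or (2,[1],[1.0])
# instead of (1,[0],[1.0]) and (1,[],[]).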

Related

The data structure that is the result of stack-based flattening of nested homogeneous forward iterators (vectors/arrays/lists)

Let's say we have an array of vectors A: Vec[M] such that len(A[0]) = 5, len(A[1]) = 2, and len(A[2]) = 4.
One can traverse the stack that corresponds to A from left to right. One can also see that, in order to adjust the data structure for higher dimensions, it is sufficient to have more integers at the beginning. What would you call such a data structure that is generic over the dimension and the type of stored elements?
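For what it's worth, here is a guess in code at the kind of layout the question seems to describe; the list name and the lengths-up-front flattening are illustrative assumptions, not taken from the question:
# A: an "array of vectors" with inner lengths 5, 2 and 4, as in the question.
A = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10, 11]]

# Flattened form: the count and the inner lengths stored up front ("more
# integers in the beginning"), followed by all elements laid out contiguously.
flat = [len(A)] + [len(v) for v in A] + [x for v in A for x in v]
print(flat)  # [3, 5, 2, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]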

Converting decimal values to binary

I am working on error control in WMSNs. I want to transmit a video through a binary symmetric channel with error probability p. So I have frames (images) in each GOP, which are represented by a matrix.
Each matrix element has a decimal value, which might be positive or negative. As explained here, I need to convert this whole matrix to a binary stream. I used
reshape(dec2bin(typecast(b,'uint8'),8).',1,[])
for converting elements to binary streams but I cannot get back the exact number using
typecast(uint8(bin2dec(reshape(m,8,[]).')),'double').
Also, I think that to get the right bit error rate I have to convert the whole matrix to a single bit stream, which I'm not sure how to do, and then convert it back to a matrix of measured image values again.
I think you need
m = reshape(dec2bin(typecast(b(:),'uint8'),8).',1,[]);
Note that this reads the matrix in Matlab's standard, column-major order (down, then across).
Then you can convert back with
b_recovered = reshape(typecast(uint8(bin2dec(reshape(m,8,[]).')),'double'),size(b));
Since typecast converts data types without changing the underlying data, this process entails no loss of accuracy. For example,
>> b = randn(2,3)
b =
-0.241247174335006 0.540703471823211 0.526269662140438
0.908207564087271 -0.507829312416083 -1.067884765919437
>> m = reshape(dec2bin(typecast(b(:),'uint8'),8).',1,[])
m =
101100101011100000000010111110100010111111100001110011101011111101001010100000100011011101001111000010010001000011101101001111110010100100001111000010100101111001110001010011011110000100111111001011101101111100011000010000100010001101000000111000001011111100010010101001010111100001111001001100111101011111100000001111110101110001010100000110000101011000001110000101101111000110111111
>> b_recovered = reshape(typecast(uint8(bin2dec(reshape(m,8,[]).')),'double'),size(b))
b_recovered =
-0.241247174335006 0.540703471823211 0.526269662140438
0.908207564087271 -0.507829312416083 -1.067884765919437
>> b==b_recovered
ans =
2×3 logical array
1 1 1
1 1 1
>>

Changing the Rank of a Matrix

I have a rank one array given by
COMPLEX(KIND = DBL),DIMENSION(DIMJ)::INITIALSTATE
How can I convert its rank to
DIMENSION(DIMJ,1)
so I can perform matrix operations on it -- transpose etc.
Note that the change is trivial. In both cases, we have a column vector. But Fortran won't transpose the array in the first form. Assume that DIMJ is an initialized integer.
Also, as obvious, I'd like the complex numbers to remain intact in the right positions after the manipulation.
Is it possible to perform such an operation in Fortran?
If you just want a temporary read-only version of the array, you can always use RESHAPE:
TRANSPOSEDSTATE = transpose(reshape(INITIALSTATE, (/ DIMJ, 1 /)))
But this doesn't change the shape of the INITIALSTATE variable. Also, that is the same as
TRANSPOSEDSTATE = reshape(INITIALSTATE, (/ 1, DIMJ /))
no transpose needed.

Create Ancestor Matrix from given Binary Tree

The question is: given an ancestor matrix, as a bitmap of 1s and 0s, construct the corresponding binary tree. Can anyone give me an idea of how to do it? I found a solution on Stack Overflow, but the line a[root->data][temp[i]]=1 seems wrong; there is no guarantee that the nodes will contain data 1 to n. A node may contain, say, 2000, in which case there will be no a[2000][some_column], since there are only 7 nodes, hence 7 rows and columns in the matrix.
Two ways:
Normalize your node values such that they are all from 1 to n. If you have nodes 1, 2, 5000 for example, make them 1, 2, 3. You can do this by sorting or hashing your labels and keeping something like normalized[i] = normalized value of node i. normalized can be a map / hash table if you have very large labels or even text labels.
You might be able to use a sparse matrix for this, implementable with a hash table or a set: keep a hash table of hash tables. H[x] stores another hash table that stores your y values. So if in a naive matrix solution you had a[2000][5000] = 1, you would use H.get(2000) => returns a hash table H' of values stored on the 2000th row => H'.get(5000) => returns the value you want.
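A rough sketch of both suggestions in Python (the node labels and the tiny tree are made-up examples, not from the question):
# Example tree with arbitrary labels: 2000 is the root, 17 and 5000 are its children.
parent = {17: 2000, 5000: 2000}              # child -> parent
nodes = sorted({2000, 17, 5000})

# Approach 1: map every label to a small index 0..n-1.
normalized = {label: i for i, label in enumerate(nodes)}

# Approach 2: keep the ancestor matrix as a dict of dicts, storing only the 1s.
from collections import defaultdict
ancestor = defaultdict(dict)
for child in parent:
    p = child
    while p in parent:                       # walk up to the root, marking each ancestor
        p = parent[p]
        ancestor[p][child] = 1

print(normalized)        # {17: 0, 2000: 1, 5000: 2}
print(ancestor[2000])    # {17: 1, 5000: 1}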

String similarity score/hash

Is there a method to calculate something like a general "similarity score" for a string? In a way that I am not comparing two strings to each other, but rather I get some number (hash) for each string that can later tell me whether two strings are or are not similar. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
Hello world 1000
Hello world! 1010
Hello earth 1125
Foo bar 3250
FooBarbar 3750
Foo Bar! 3300
Foo world! 2350
You can see that Hello world! and Hello world are similar and their scores are close to each other.
This way, finding the most similar strings to a given string would be done by subtracting the given string's score from the other scores and then sorting by the absolute value of the difference.
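For illustration, a sketch of the lookup described above, assuming such scores existed (the dictionary below just reuses the example values from this question):
# Hypothetical per-string scores, taken from the example above.
scores = {"Hello world": 1000, "Hello world!": 1010, "Hello earth": 1125,
          "Foo bar": 3250, "FooBarbar": 3750, "Foo Bar!": 3300, "Foo world!": 2350}

def most_similar(query, k=3):
    q = scores[query]
    return sorted(scores, key=lambda s: abs(scores[s] - q))[:k]

print(most_similar("Hello world"))  # ['Hello world', 'Hello world!', 'Hello earth']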
I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.
As others have mentioned, there are inherent issues with forcing a multi-dimensional space down into fewer dimensions. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. The best you can do is find an LSH that is optimized for whatever feature you're using to determine whether strings are "alike".
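For illustration, here is a rough sketch of one such scheme: a SimHash-style signature over character trigrams (the trigram choice and 64-bit size are arbitrary assumptions, not a tuned LSH). Small edits usually flip only some of the bits, so similar strings tend to end up at a small Hamming distance:
import hashlib

def simhash(s, bits=64):
    grams = [s[i:i + 3] for i in range(max(1, len(s) - 2))]
    votes = [0] * bits
    for g in grams:
        h = int(hashlib.md5(g.encode()).hexdigest(), 16)
        for b in range(bits):
            votes[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if votes[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

# Near-duplicates tend to differ in fewer bits than unrelated strings.
print(hamming(simhash("Hello world"), simhash("Hello world!")))
print(hamming(simhash("Hello world"), simhash("Foo bar")))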
Levenshtein distance, or one of its derivatives, is the algorithm you want.
Match the given string against each of the strings in the dictionary. (Here, if you only need a fixed number of most similar strings, you may want to use a min-heap.)
If running Levenshtein distance against all the strings in the dictionary is too expensive, first use some rough algorithm that excludes words that are too distant from the list of candidates. After that, run Levenshtein distance on the remaining candidates.
One way to remove distant words is to index n-grams.
Preprocess the dictionary by splitting each word into a list of n-grams.
For example, with n=3:
(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]
Next, create an index of n-grams:
" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]
When you need to find the most similar strings for a given string, split it into n-grams and select only those words from the dictionary that have at least one matching n-gram. This reduces the number of candidates to a reasonable amount, and you can then proceed with Levenshtein-matching the given string against each of the remaining candidates.
If your strings are long enough, you can reduce the index size by using the min-hashing technique: calculate an ordinary hash for each n-gram and keep only the K smallest hashes, throwing the others away.
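A minimal sketch of the index and candidate filtering described above, with a plain dynamic-programming Levenshtein distance for the final ranking (the dictionary is just the example strings from this answer):
from collections import defaultdict

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

dictionary = ["Hello world", "FooBarbar", "Foo world!"]

# Inverted index: n-gram -> ids of the dictionary words containing it.
index = defaultdict(set)
for i, word in enumerate(dictionary):
    for g in ngrams(word):
        index[g].add(i)

def most_similar(query, k=2):
    # Candidates: dictionary entries sharing at least one n-gram with the query.
    candidates = set().union(*(index.get(g, set()) for g in ngrams(query)))
    return sorted(candidates, key=lambda i: levenshtein(query, dictionary[i]))[:k]

print([dictionary[i] for i in most_similar("Hello, world!")])  # ['Hello world', 'Foo world!']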
P.S. this presentation seems like a good introduction to your problem.
This isn't possible, in general, because strings under an edit-distance measure form a metric space, but not one with a fixed dimension. That means you can't provide a mapping between strings and integers that preserves the distance measure between them.
For example, you cannot assign numbers to these three phrases:
one two
one six
two six
such that the numbers reflect the differences between all three phrases.
While the idea seems extremely sweet... I've never heard of this.
I've read many, many techniques, theses, and scientific papers on the subject of spell correction / typo correction, and the fastest proposals revolve around an index and the Levenshtein distance.
There are fairly elaborate techniques; the one I am currently working on combines:
A Bursted Trie, with level compactness
A Levenshtein Automaton
Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent research on string comparison if such a "scoring" method had proved efficient.
If you ever find such a method, I am extremely interested :)
Would Levenshtein distance work for you?
In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.
Imagine similarity at the character level
stops
spots
hello world
world hello
In both examples the messages are different, but the characters in the message are identical, so the measure would need to hold a position value, as well as a character value. (char 0 == 'h', char 1 == 'e' ...)
Then compare the following similar messages
hello world
ello world
Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.
In the case of
spots
stops
The words only differ by position of the characters, so some form of position is important.
If the following strings are similar
yesssssssssssssss
yessssssssssssss
Then you have a form of paradox. If you add two more s characters to the second string, it should stay at the same distance from the first string as before, yet be distinct from it. This can be repeated with progressively longer strings, all of which need to be close to the strings just shorter and just longer than them. I can't see how to achieve this.
In general this is treated as a multi-dimensional problem - breaking the string into a vector
[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]
But the values of the vector cannot be represented by a fixed-size number, nor give a good-quality difference measure.
If the number of words or the length of the strings were bounded, then a coding solution may be possible.
Bounded values
Using something like arithmetic compression, a sequence of words can be converted into a floating-point number which represents the sequence. However, this would treat items earlier in the sequence as more significant than the last item in the sequence.
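A toy sketch of that idea: with a fixed vocabulary and equal word "probabilities" (both assumptions for illustration), each word narrows an interval, and any number inside the final interval encodes the sequence. Notice how the first word dominates the value:
VOCAB = ["hello", "world", "foo", "bar"]

def encode(words):
    lo, hi = 0.0, 1.0
    for w in words:
        i = VOCAB.index(w)
        width = (hi - lo) / len(VOCAB)
        lo, hi = lo + i * width, lo + (i + 1) * width
    return (lo + hi) / 2      # any number inside the final interval will do

print(encode(["hello", "world"]))  # 0.09375
print(encode(["world", "hello"]))  # 0.28125 - same words, very different number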
data mining solution
If you accept that the problem is high-dimensional, then you can store your strings in a metric tree (Wikipedia: metric tree). This would limit your search space, whilst not giving you a "single number" solution.
I have code for this on github: clustering
Items which are close together should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.
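For illustration, a minimal sketch of one simple kind of metric tree, a BK-tree keyed on edit distance; the triangle inequality is what lets it prune whole subtrees during a search:
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})               # (word, {distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, radius):
        found, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = edit_distance(word, w)
            if d <= radius:
                found.append((d, w))
            # Only children whose edge distance lies within radius of d can match.
            stack.extend(c for dist, c in children.items()
                         if d - radius <= dist <= d + radius)
        return sorted(found)

tree = BKTree(["hello world", "hello earth", "foo bar", "spots", "stops"])
print(tree.search("hello word", radius=2))       # [(1, 'hello world')]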
Edit Distance or Levenshtein distance
This is used in a sqlite extension to perform similarity searching, but there is no single-number solution: it works out how many edits change one string into another. This then results in a score which shows similarity.
I think of something like this:
remove all non-word characters
apply soundex
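A rough sketch of those two steps, using a simplified Soundex (the exact code and the per-word key format are illustrative assumptions):
import re

def simple_soundex(word):
    # Standard Soundex letter groups; vowels, h, w and y are dropped.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4", "m": "5", "n": "5", "r": "6"}
    word = word.lower()
    if not word:
        return ""
    digits = [codes.get(c, "") for c in word]
    out, prev = word[0].upper(), digits[0]
    for d in digits[1:]:
        if d and d != prev:       # skip uncoded letters and collapse adjacent duplicates
            out += d
        prev = d
    return (out + "000")[:4]

def phrase_key(s):
    s = re.sub(r"\W+", " ", s)    # 1. remove all non-word characters
    return " ".join(simple_soundex(w) for w in s.split())  # 2. apply soundex

print(phrase_key("Hello world"))    # 'H400 W643'
print(phrase_key("Hello, world!"))  # 'H400 W643' - punctuation no longer matters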
Your idea sounds like an ontology, but applied to whole phrases. The more similar two phrases are, the closer in the graph they are (assuming you're using weighted edges). And vice versa: dissimilar phrases are very far from each other.
Another approach is to use the Fourier transform to get a sort of 'index' for a given string (it won't be a single number, though). You may find a bit more in this paper.
And another idea, based on the Levenshtein distance: you may compare n-grams, which will give you some similarity index for two given phrases - the more similar they are, the closer the value is to 1. This may be used to calculate distance in the graph. I wrote a paper on this a few years ago; if you'd like, I can share it.
Anyway: although I don't know the exact solution, I'm also interested in what you come up with.
Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.
Just an idea.
ready-to-run PCA in C#
It is unlikely that one can get a reasonably small number for each phrase such that comparing the numbers provides a relevant indication of the similarity of the original phrases.
A reason is that the number gives an indication in one dimension, while phrases evolve in two dimensions: length and intensity.
The number could evolve in length as well as in intensity, but I'm not sure it would help a lot.
In two dimensions you are better off looking at a matrix, where some properties, like the determinant, could give a rough idea of the phrase's trend.
In Natural Language Processing we have a thing called Minimum Edit Distance (also known as Levenshtein Distance).
It's basically defined as the smallest number of operations needed to transform string1 into string2.
The operations are Insertion, Deletion, and Substitution; each operation is given a cost which is added to the distance.
The idea for solving your problem is to calculate the MED from your chosen string to all the other strings, sort that collection, and pick out the n strings with the smallest distances.
For example:
{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"
Med(base-string, "Hello World!") = 1
Med(base-string, "Hello Earth") = 8
1st closest string is "Hello World!"
This somewhat gives a score to each string in your string collection.
C# Implementation (Add-1, Deletion-1, Substitution-2)
public static int Distance(string s1, string s2)
{
    int[,] matrix = new int[s1.Length + 1, s2.Length + 1];

    // Base cases: transforming to/from an empty prefix costs its length.
    for (int i = 0; i <= s1.Length; i++)
        matrix[i, 0] = i;
    for (int i = 0; i <= s2.Length; i++)
        matrix[0, i] = i;

    for (int i = 1; i <= s1.Length; i++)
    {
        for (int j = 1; j <= s2.Length; j++)
        {
            int value1 = matrix[i - 1, j] + 1;                                      // deletion
            int value2 = matrix[i, j - 1] + 1;                                      // insertion
            int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2); // substitution (cost 2)
            matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
        }
    }
    return matrix[s1.Length, s2.Length];
}
Complexity is O(n × m), where n and m are the lengths of the strings.
More info on Minimum Edit Distance can be found here
Well, you could add up the ASCII value of each character and then compare the scores, with a maximum value by which they are allowed to differ. This does not guarantee, however, that the strings are similar, for the same reason two different strings can have the same hash value.
You could of course make a more complex function, starting by checking the size of the strings, and then comparing each character one by one, again with a maximum difference set up.
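A toy sketch of that idea; the threshold is an arbitrary illustration, not a recommended value:
def ascii_score(s):
    return sum(ord(c) for c in s)

def maybe_similar(a, b, max_diff=40):
    return abs(ascii_score(a) - ascii_score(b)) <= max_diff

print(ascii_score("Hello world"), ascii_score("Hello world!"))  # 1084 1117
print(maybe_similar("Hello world", "Hello world!"))             # True
print(maybe_similar("Hello world", "Foo bar"))                  # False
print(maybe_similar("stops", "spots"))  # True - anagrams collide, the caveat mentioned above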
