Explanation of Spark ML CountVectorizer output - apache-spark-mllib

Please help me understand the output of the Spark ML CountVectorizer and suggest which documentation explains it.
val cv = new CountVectorizer()
  .setInputCol("Tokens")
  .setOutputCol("Frequencies")
  .setVocabSize(5000)
  .setMinTF(1)
  .setMinDF(2)
val fittedCV = cv.fit(tokenDF.select("Tokens"))
fittedCV.transform(tokenDF.select("Tokens")).show(false)
2374 should be the number of terms (words) in the dictionary.
What is the "[2,6,328,548,1234]"?
Are they the indices of the words "[airline, bag, vintage, world, champion]" in the dictionary? If so, why does the same word "airline" have a different index "0" in the second row?
+------------------------------------------+----------------------------------------------------------------+
|Tokens |Frequencies |
+------------------------------------------+----------------------------------------------------------------+
...
|[airline, bag, vintage, world, champion] |(2374,[2,6,328,548,1234],[1.0,1.0,1.0,1.0,1.0]) |
|[airline, bag, vintage, jet, set, brown] |(2374,[0,2,6,328,405,620],[1.0,1.0,1.0,1.0,1.0,1.0]) |
+------------------------------------------+----------------------------------------------------------------+
[1]: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer

There is some documentation explaining the basics. However, it is pretty bare.
Yes, the numbers represent the indices of the words in the vocabulary. However, the order in the frequencies vector does not correspond to the order in the tokens vector.
airline, bag and vintage appear in both rows, hence they correspond to the indices [2,6,328]. But you can't rely on the same order.
The row data type is a SparseVector. The first array shows the indices and the second the values.
e.g.
vector[328]
=> 1.0
A mapping could be as follows:
vocabulary
airline 328
bag     6
vintage 2
Frequencies
2374, [2, 6, 328], [99, 5, 7]
# counts
vintage x 99
bag     x 5
airline x 7
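For illustration only (this snippet is mine, not part of the original answer), here is a minimal PySpark sketch, assuming the equivalent PySpark model is named fittedCV as in the question: it reads the indices and values off one sparse vector locally on the driver and maps them back to words via the fitted model's vocabulary.
row = fittedCV.transform(tokenDF.select("Tokens")).first()
vec = row["Frequencies"]                        # a pyspark.ml.linalg.SparseVector
for i, count in zip(vec.indices, vec.values):
    # vocabulary[index] gives the word for that index
    print(fittedCV.vocabulary[int(i)], count)   # e.g. "airline 1.0"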
In order to get the words back, you can do a lookup in the vocabulary. This needs to be broadcast to the different workers. You also most probably want to explode the counts per doc into separate rows.
Here is a Python code snippet to extract the top 25 most frequent words per doc with a UDF into separate rows and compute the mean for each word:
import pyspark.sql.types as T
import pyspark.sql.functions as F
from pyspark.sql import Row

# broadcast the vocabulary so every worker can map index -> word
vocabulary = sc.broadcast(fittedCV.vocabulary)

def _top_scores(v):
    # create count tuples for each index (i) in a vector (v)
    # `.item()` is used because in Python the count value is a numpy datatype;
    # in Scala it would be just a double
    counts = [Row(i=i.item(), count=v[i.item()].item()) for i in v.indices]
    # => [Row(i=2, count=30), Row(i=362, count=40)]
    # return the 25 top count rows
    counts = sorted(counts, reverse=True, key=lambda x: x.count)
    return counts[:25]

top_scores = F.udf(_top_scores, T.ArrayType(T.StructType().add('i', T.IntegerType()).add('count', T.DoubleType())))

def _vecToWord(i):
    # look up the word for a given vocabulary index in the broadcast variable
    return vocabulary.value[i]

vec_to_word = F.udf(_vecToWord, T.StringType())

res = df.withColumn('word_count', F.explode(top_scores('Frequencies')))
=>
+-----+-----+----------+
doc_id, ..., word_count
(i, count)
+-----+-----+----------+
4711, ..., (2, 30.0)
4711, ..., (362, 40.0)
+-----+-----+----------+
res = res \
    .groupBy('word_count.i') \
    .agg(F.avg('word_count.count').alias('mean')) \
    .orderBy('mean', ascending=False)
res = res.withColumn('token', vec_to_word('i'))
=>
+---+---------+----------+
i, token, mean
+---+---------+----------+
2, vintage, 15
328, airline, 30
+---+---------+----------+

Related

Optimum redistribution algorithm

Let's assume I have a bunch of identical items (n in number) distributed across identical buckets (m in number). The given distribution may or may not be fair/uniform. The goal is to write an algorithm that can uniformly redistribute these items by transferring some of them. Each transfer has a cost associated with it, so the number of transfers must be minimal.
For example, I have a total of 7 items in 3 buckets. The input is a vector of size 'm' giving the number of items in each bucket: 4, 2, 1.
The solution would involve 1 transfer from bucket 1 to bucket 3, and the resulting distribution will look like: 3, 2, 2.
Since n (7) is not perfectly divisible by m (3), this is the closest achievable uniform distribution.
Another sample case -
input: {1, 4, 5, 11}
output: {5, 5, 5, 6}
Number of transfers to get to the output: 5
I'm looking for some existing algorithms that can solve this problem statement. Thanks!
While your use case doesn't necessitate putting much thought into the implementation, since the numbers are apparently small, it may sometimes be desirable to use an algorithm whose time complexity does not essentially depend on the number of items, e.g.:
import numpy as np

def dist(inp):
    inp = np.array(inp)
    m = len(inp)        # number of buckets
    n = sum(inp)        # number of items
    l = n // m          # low number in an output bucket
    h = (n+m-1) // m    # high number in an output bucket
    lo = inp < l        # buckets which need transfers to them
    hi = inp > h        # buckets which need transfers from them
    out = np.empty(m, int)
    out[lo] = l         # fill underfull buckets with low
    out[hi] = h         # fill overfull buckets with high
    out[~lo & ~hi] = inp[~lo & ~hi]  # keep other buckets as is
    o = sum(out)        # check for missing or surplus items
    if o < n: out[np.where(out == l)[0][:n-o]] = h  # adjust
    if o > n: out[np.where(out == h)[0][:o-n]] = l  # adjust
    return out
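As a quick check against the second sample case from the question (this usage snippet is mine, and assumes, as seems intended, that each transfer moves a single item, so the minimum number of transfers is simply the total surplus above the target distribution):
# assumes the numpy import and dist() function from above
inp = np.array([1, 4, 5, 11])
out = dist(inp)
print(out)                             # [5 5 5 6]
print(np.maximum(inp - out, 0).sum())  # 5 transfers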

Sort Thousands of Chuck E. Cheese Tickets

I need to sort an n-thousand size array of random unique positive integers into groups of consecutive integers, each of group size k or larger, and then further group them by the result of dividing by some arbitrary positive integer j.
In other words, let's say I work at Chuck E. Cheese and we sometimes give away free tickets. I have a couple hundred thousand tickets on the floor and want to find out which employee handed out what, but only for ticket groupings of consecutive integers that are larger than 500. Each employee has a random number from 0 to 100 assigned to them. That number corresponds to which "batch" of tickets were handed out, i.e. tickets from #000000 to #001499 were handed out by employee 1, tickets from #001500 to #002999 were handed out by employee 2, and so on. A large number of tickets are lost or missing. I only care about groups of consecutive ticket numbers larger than 500.
What is the fastest way for me to sort through this pile?
Edit:
As requested by #trincot, here is a worked out example:
I have 150,000 unique tickets on the floor ranging from ticket #000000 to #200000 (i.e. missing 50,001 random tickets from the pile)
Step 1: sort each ticket from smallest to largest using an introsort algorithm.
Step 2: go through the list of tickets one by one and gather only tickets with "consecutiveness" greater than 500, i.e. I keep a tally of how many consecutive values I have found and only keep those with tallies of 500 or higher. If I have tickets #409 thru #909 but not #408 or #1000, then I would keep that group; but if that group had been missing a ticket anywhere from #409 to #909, I would have thrown out the group and moved on.
Step 3: combine all my newly sorted groups together, each of which is of size 500 or larger.
Step 4: figure out what tickets belong to who by going through the final numbers one by one again, dividing each by 1500, rounding down to nearest whole number, and putting them in their respective pile where each pile represents an employee.
The end result is a set of piles telling me which employees gave out more than 500 tickets at a time, how many times they did so, and what tickets they did so with.
Sample with numbers:
where k == 3 and j == 1500; k is the minimum consecutive-integer grouping size, and j is the final ticket interval grouping size, i.e. 5, 6, and 7 fall into the 0th group of intervals of size 1500, and 5996, 5997, 5998, 5999 fall into the 3rd group of intervals of size 1500.
Input: [5 , 5996 , 8111 , 1000 , 1001, 5999 , 8110 , 7 , 5998 , 2500 , 1250 , 6 , 8109 , 5997]
Output:[ 0:[5, 6, 7] , 3:[5996, 5997, 5998, 5999] , 5:[8109, 8110, 8111] ]
Here is how you could do it in Python:
from collections import defaultdict

def partition(data, k, j):
    data = sorted(data)
    start = data[0]  # assuming data is not an empty list
    count = 0
    output = defaultdict(list)  # to automatically create a partition when referenced
    for value in data:
        bucket = value // j  # integer division
        if value % j == start % j + count:  # in same partition & consecutive?
            count += 1
            if count == k:
                # Add the k entries that we skipped so far:
                output[bucket].extend(list(range(start, start + count)))
            elif count > k:
                output[bucket].append(value)
        else:
            start = value
            count = 1
    return dict(output)

# The example given in the question:
data = [5, 5996, 8111, 1000, 1001, 5999, 8110, 7, 5998, 2500, 1250, 6, 8109, 5997]
print(partition(data, k=3, j=1500))
# outputs {0: [5, 6, 7], 3: [5996, 5997, 5998, 5999], 5: [8109, 8110, 8111]}
Here is untested Python for the fastest approach that I can think of. It will return just pairs of first/last ticket for each range of interest found.
def grouped_tickets(tickets, min_group_size, partition_size):
    tickets = sorted(tickets)
    answer = {}
    min_ticket = -1
    max_ticket = -1
    next_partition = 0
    for ticket in tickets:
        if next_partition <= ticket or max_ticket + 1 < ticket:
            if min_group_size <= max_ticket - min_ticket + 1:
                partition = min_ticket // partition_size
                if partition in answer:
                    answer[partition].append((min_ticket, max_ticket))
                else:
                    answer[partition] = [(min_ticket, max_ticket)]
            # Find where the next partition is.
            next_partition = (ticket // partition_size) * partition_size + partition_size
            min_ticket = ticket
            max_ticket = ticket
        else:
            max_ticket = ticket
    # And don't lose the last group!
    if min_group_size <= max_ticket - min_ticket + 1:
        partition = min_ticket // partition_size
        if partition in answer:
            answer[partition].append((min_ticket, max_ticket))
        else:
            answer[partition] = [(min_ticket, max_ticket)]
    return answer
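For illustration (this trace is mine, not part of the original answer), running it on the sample data from the question should yield one list of (first, last) ticket pairs per partition:
data = [5, 5996, 8111, 1000, 1001, 5999, 8110, 7, 5998, 2500, 1250, 6, 8109, 5997]
print(grouped_tickets(data, min_group_size=3, partition_size=1500))
# {0: [(5, 7)], 3: [(5996, 5999)], 5: [(8109, 8111)]}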

How to do fuzzy string matching of bigger than memory dictionary in an ordered key-value store?

I am looking for an algorithm and storage schema to do string matching over a bigger than memory dictionary.
My initial attempt, inspired by https://swtch.com/~rsc/regexp/regexp4.html, was to store trigrams of every word of the dictionary; for instance, the word apple is split into $ap, app, ppl, ple and le$ at index time. All of those trigrams are associated with the word they came from.
Then at query time, I do the same for the input string that must be matched. I look up each of those trigrams in the database and store the candidate words in a mapping associated with the number of matching trigrams in them. Then I proceed to compute the Levenshtein distance for every candidate and apply the following formula:
score(query, candidate) = common_trigram_number(query, candidate) - abs(levenshtein(query, candidate))
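For concreteness, here is a minimal sketch of this trigram indexing and scoring scheme (my own illustration; the helper names trigrams, levenshtein and search are hypothetical, and an in-memory dict stands in for the on-disk key-value store):
from collections import defaultdict

def trigrams(word):
    # pad with sentinels so prefixes/suffixes produce their own trigrams
    padded = "$" + word + "$"
    return {padded[i:i+3] for i in range(len(padded) - 2)}

def levenshtein(a, b):
    # plain dynamic-programming edit distance
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                   # deletion
                               current[j - 1] + 1,                # insertion
                               previous[j - 1] + (ca != cb)))     # substitution
        previous = current
    return previous[-1]

# index time: trigram -> set of words (in-memory stand-in for the store)
index = defaultdict(set)
dictionary = ["apple", "apples", "ample", "maple"]
for word in dictionary:
    for gram in trigrams(word):
        index[gram].add(word)

# query time: count shared trigrams, then score the candidates
def search(query):
    common = defaultdict(int)
    for gram in trigrams(query):
        for word in index.get(gram, ()):
            common[word] += 1
    return sorted(((common[w] - abs(levenshtein(query, w)), w) for w in common),
                  reverse=True)

print(search("aple"))   # apple and maple should rank near the top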
There are two problems with this approach: first, the candidate selection is too broad; second, the Levenshtein distance is too slow to compute.
Fixing the first could make the second unnecessary to optimize.
I thought about another approach: at index time, instead of storing trigrams, I would store words (possibly associated with a frequency). At query time, I could look up successive prefixes of the query string and score using Levenshtein distance and frequency.
In particular, I am not looking for an algorithm that gives me strings at a distance of 1, 2 etc... I would just like to have a paginated list of more-or-less relevant words from the dictionary. The actual selection is made by the user.
Also it must be possible to represent it in terms of ordered key-value store like rocksdb or wiredtiger.
simhash captures similarity between (small) strings, but it does not really solve the problem of querying the most similar string in a bigger-than-RAM dataset. I think the original paper recommends indexing some permutations; that requires a lot of memory and it does not take advantage of the ordered nature of an OKVS.
I think I found a hash that captures similarity in its prefix:
In [1]: import fuzz
In [2]: hello = fuzz.bbkh("hello")
In [3]: helo = fuzz.bbkh("helo")
In [4]: hellooo = fuzz.bbkh("hellooo")
In [5]: salut = fuzz.bbkh("salut")
In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix
Out[6]: 213
In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))
Out[7]: 12
In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))
Out[8]: 0
After a small test over Wikidata labels, it seems to give good results:
$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france 0
** farrance -2
** freande -2
** defrance -2
real 0m0.054s
$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace -1
** france -2
** fernacre -2
real 0m0.060s
$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium -2
real 0m0.047s
$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium 0
** ajbelgium -2
real 0m0.059s
$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium -1
** beijum -2
real 0m0.047s
Here is an implementation:
from string import ascii_lowercase
from itertools import product

HASH_SIZE = 2**10
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

# one-hot positions for every possible character bigram (lowercase + sentinel)
chars = ascii_lowercase + "$"
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])

def ngram(string, n):
    return [string[i:i+n] for i in range(len(string)-n+1)]

def integer2booleans(integer):
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def merkletree(booleans):
    # build a binary tree of ORs over the one-hot bits, leaves last
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    return out

def bbkh(string):
    # one-hot encode the bigrams of the padded string, then serialize the tree
    integer = 0
    string = "$" + string + "$"
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash

def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
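To illustrate how the prefix property can be exploited by an ordered key-value store, here is a toy lookup of my own (a sorted in-memory list queried via bisect stands in for rocksdb or wiredtiger): keys are bbkh(word) + word, and a query is answered by scanning the neighborhood of bbkh(query) in key order.
import bisect

words = ["hello", "helo", "hellooo", "salut", "belgium", "france"]
# key layout: fixed-size hash bytes followed by the word itself, kept sorted
index = sorted(bbkh(w) + w.encode() for w in words)

def neighbors(query, width=3):
    key = bbkh(query)
    pos = bisect.bisect_left(index, key)
    lo = max(0, pos - width)
    hi = min(len(index), pos + width)
    # strip the fixed-size hash prefix to recover the stored words
    return [entry[BBKH_LENGTH:].decode() for entry in index[lo:hi]]

print(neighbors("helo"))   # candidate words stored near bbkh("helo")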
Evaluation: using Wikipedia's list of common misspelled words, there are around 8k words. Considering the top 10 nearest words yields 77% success, with each query taking around 20 ms. Considering the top 100 yields 94% success, with each query taking less than 200 ms. Most mistakes come from joined words (e.g. "abouta" instead of "about a") or words separated with a dash.
Checkout the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py
Note: computing simhash of the input string instead only works with a bag of lemmas or stems; indeed, it finds similar documents rather than similar strings.
Using a bytes encoding is an optimization, so it is still possible to figure out what a binary representation such as 0b001 means.

Take top n results from table in power query, where n is dynamic based on an if function

I want to use Power Query to extract by field (the field is [Project]), then get the top 3 scoring rows from the master table for each project; but if there are more than 3 rows with a score of over 15, they should all be included. 3 rows must be extracted every time as a minimum.
Essentially I'm trying to combine the Keep Rows function with my formula of "=if(score>=15,1,0)".
Setting the query to records with a score greater than 15 doesn't work for projects where the highest scores are, for example, 1, 7 and 15. This would only return 1 row, but we need 3 as a minimum.
Setting it to the top 3 scores only would omit rows in a table where the highest scores are 18, 19, 20.
Is there a way to combine the two functions to say "Choose the top 3 rows, but choose the top n rows if there are n rows with score >= 15"?
As far as I understand, you are trying to do the following (Alexis Olson proposed the very same):
let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    group = Table.Group(Source, {"Project"}, {"temp", each Table.SelectRows(Table.AddIndexColumn(Table.Sort(_, {"Score", 1}), "i", 1, 1), each [i] <= 3 or [Score] >= 15)}),
    expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
    expand
Or:
let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    group = Table.Group(Source, {"Project"}, {"temp", each [a = Table.Sort(_, {"Score", 1}), b = Table.FirstN(a, 3) & Table.SelectRows(Table.Skip(a, 3), each [Score] >= 15)][b]}),
    expand = Table.ExpandTableColumn(group, "temp", {"Score"})
in
    expand
Or:
let
    Source = Excel.CurrentWorkbook(){[Name="Table"]}[Content],
    group = Table.Group(Source, {"Project"}, {"Score", each [a = List.Sort([Score], 1), b = List.FirstN(a, 3) & List.Select(List.Skip(a, 3), each _ >= 15)][b]}),
    expand = Table.ExpandListColumn(group, "Score")
in
    expand
Note: if there are more columns in the table that you want to keep, for the first and second variants you may just add these columns to the last step. For the last variant you don't have that option and the code would need to be modified.
Sort by the Score column in descending order and then add an Index column (go to Add Column > Index Column > From 1).
Then filter on the Index column choosing to keep values less than or equal to 3. This should produce a step with this M code:
= Table.SelectRows(#"Added Index", each [Index] <= 3)
Now you just need to make a small adjustment to also include any score 15 or greater:
= Table.SelectRows(#"Added Index", each [Index] <= 3 or [Score] >= 15)

Python: break up dataframe (one row per entry in column, instead of multiple entries in column)

I have a solution to a problem that, to my despair, is somewhat slow, and I am seeking advice on how to speed it up (by adding vectorization or other clever methods). I have a dataframe that looks like this:
import pandas as pd

toy = pd.DataFrame([[1, 'cv', 'c,d,e'], [2, 'search', 'a,b,c,d,e'], [3, 'cv', 'd']],
                   columns=['id', 'ch', 'kw'])
Output is:
   id      ch         kw
0   1      cv      c,d,e
1   2  search  a,b,c,d,e
2   3      cv          d
The task is to break up the kw column into one (replicated) row per comma-separated entry in each string. Thus, what I wish to achieve is:
   id      ch kw
0   1      cv  c
1   1      cv  d
2   1      cv  e
3   2  search  a
4   2  search  b
5   2  search  c
6   2  search  d
7   2  search  e
8   3      cv  d
My initial solution is the following:
data = pd.DataFrame()
for x in toy.itertuples():
    id = x.id; ch = x.ch; keys = x.kw.split(",")
    data = data.append([[id, ch, x] for x in keys], ignore_index=True)
data.columns = ['id', 'ch', 'kw']
Problem is: it is slow for larger dataframes. My hope is that someone has encountered a similar problem before, and knows how to optimize my solution. I'm using python 3.4.x and pandas 0.19+ if that is of importance.
Thank you!
You can use str.split to get lists, then str.len to get their lengths.
Last, create a new DataFrame with the constructor, using numpy.repeat and numpy.concatenate:
import numpy as np

cols = toy.columns
splitted = toy['kw'].str.split(',')
l = splitted.str.len()

toy = pd.DataFrame({'id': np.repeat(toy['id'], l),
                    'ch': np.repeat(toy['ch'], l),
                    'kw': np.concatenate(splitted)})
# reindex_axis was the pre-0.21 API; in newer pandas use toy.reindex(columns=cols)
toy = toy.reindex_axis(cols, axis=1)
print(toy)
   id      ch kw
0   1      cv  c
0   1      cv  d
0   1      cv  e
1   2  search  a
1   2  search  b
1   2  search  c
1   2  search  d
1   2  search  e
2   3      cv  d
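As a side note (mine, not part of the original answer): in newer pandas versions (0.25 and later), Series.str.split combined with DataFrame.explode achieves the same result directly, at the cost of the repeated index shown above:
# requires pandas >= 0.25; `toy` here is the original 3-row frame
res = toy.assign(kw=toy['kw'].str.split(',')).explode('kw')
print(res)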
