What is a "term-vector algorithm"?

Google states that a "term-vector algorithm" can be used to determine popular keywords. I have studied http://en.wikipedia.org/wiki/Vector_space_model, but I can't understand the term "term-vector algorithm".
Please explain it in a brief summary, in very simple language, as if the reader were a child.
I believe "vector" refers to the mathematical definition: a quantity having direction as well as magnitude. How is it that keywords have a quantity moving in a direction?
http://en.wikipedia.org/wiki/Vector_space_model states "Each dimension corresponds to a separate term." I thought dimension relates to cardinality; is that correct?
From the book Hadoop In Practice, by Alex Holmes, page 12.

It means that each word forms a separate dimension:
Example: (shamelessly taken from here)
For a model containing only three words you would get:
dict = { dog, cat, lion }
Document 1
“cat cat” → (0,2,0)
Document 2
“cat cat cat” → (0,3,0)
Document 3
“lion cat” → (0,1,1)
Document 4
“cat lion” → (0,1,1)
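In code, that toy example might look like this (a minimal Python sketch; the fixed dictionary order determines which dimension each word gets):

dictionary = ["dog", "cat", "lion"]   # the order defines the dimensions

def term_vector(document):
    words = document.split()
    return tuple(words.count(term) for term in dictionary)

print(term_vector("cat cat"))    # (0, 2, 0)
print(term_vector("lion cat"))   # (0, 1, 1)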

The most popular example for MapReduce is calculating word frequency: a map step outputs each word as a key with 1 as the value, and a reduce step sums the numbers for each word. So if a web page has a list of (possibly duplicate) words, each word in that list maps to 1. The reduce step then counts how many times each word occurs on that page. You can do this across pages, websites, or whatever other criteria. The resulting data is a dictionary mapping each word to its frequency, which is effectively a term-frequency vector.
Example document: "a be see be a"
Resulting data: { 'a':2, 'be':2, 'see':1 }
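As a toy illustration, here is the same map/reduce word count in plain, in-process Python (not actual Hadoop code, just the idea):

from collections import Counter

def map_step(document):
    # emit a (word, 1) pair for every word occurrence
    return [(word, 1) for word in document.split()]

def reduce_step(pairs):
    # sum the 1s per word
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

print(reduce_step(map_step("a be see be a")))   # {'a': 2, 'be': 2, 'see': 1}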

Term vector sounds like it just means that each term has a weight or numeric value attached, probably corresponding to the number of times the term is mentioned.
You are thinking of the geometric meaning of the word "vector", but there is another mathematical meaning that just means multiple dimensions: instead of saying x, y, z you say the vector x (in bold), which has multiple dimensions x1, x2, x3, ..., xn, each with some value. So for a term vector, the vector is over terms and it takes the form term1, term2, up to term n. Each can then have a value, just as x, y, or z has a value.
As an example, term 1 could be dog, term 2 cat, term 3 lion, and the weights 2, 3, 1 would mean the word dog appears twice, cat three times, and lion once.

Related

Term for a slice of a multi-dimensional array or tensor

I'm in the market for a mathematical (or otherwise) term to describe a slice of a hypercube.
Tensor is out of the running as that's the name of the object I am slicing.
The second I could use a hand with is a term to describe an index (or access point) that spans more than a single point in each dimension.
Thought about using Ranged Index as a collection of Ranged Dimensions, but I'm really hoping there's a more concise and explicit alternative.
For example:
A regular index [1, 2, 1] would access index 1 in dimension 1, index 2 in dimension 2, and index 1 in dimension 3.
A spanning index (or whatever it should be called) [3->4, 1, 4->9] would access all elements between indices 3 and 4 of dimension 1, index 1 of dimension 2, etc...
The first is a hyperplane.
The second would be a rectangle in 3 dimensions, so a hyperrectangle in more than three. (Or possibly, n-orthotope as in the wiki article, depending on how geeky you want to be.)
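If it helps to see the two kinds of access concretely, here is how they look with NumPy-style indexing (NumPy is my own choice of illustration, not part of the question; note that Python slices are end-exclusive, so 3:5 covers indices 3 and 4):

import numpy as np

t = np.arange(5 * 4 * 10).reshape(5, 4, 10)   # a small 3-D tensor

point = t[1, 2, 1]          # regular index: a single element
block = t[3:5, 1, 4:10]     # "spanning" index: indices 3-4 in dim 1, index 1 in dim 2, 4-9 in dim 3
print(point, block.shape)   # the block is a 2x6 hyperrectangle (here just a rectangle)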

How can I merge similar words into one?

The similar_text gem can calculate words' pairwise similarity. I want to merge words whose similarity is greater than 50% into one, and keep the longest one.
Original
[
"iphone 6",
"iphone 5c",
"iphone 6",
"macbook air",
"macbook",
]
Expected
[
"iphone 5c",
"macbook air",
]
But I don't know how to implement the algorithm to filter the expected results efficiently.
This is not a trivial problem, and what follows is also not 100% what you're looking for.
In particular, you have to decide how to handle transitive similarities: if a is similar to b and b is similar to c, are a and c in the same group (even if they aren't similar to each other)?
Here is a piece of code where you can find all similar pairs in an array:
def find_pairs(ar)
  ar.product(ar).reject { |l, r| l == r }.map(&:sort).uniq
    .map { |l, r| [[l, r], l.similar(r)] }
    .reject { |pair, similarity| similarity < 50.0 }
    .map { |pair, _| pair }
end
For an answer on how to find the groups in the matches see:
Finding All Connected Components of an Undirected Graph
First of all: there is no efficient way to do this, as you must calculate all pairs, which can take a long time on long lists.
Having said that...
I am not familiar with this specific gem, but I'm assuming it will give you either some sort of distance between two words (the smaller the better) or the probability that the words are the same (the higher the better). Let's go with distance, as adapting the algorithm to probability is trivial.
This is just an algorithm description that you may find useful. It helped me in a similar case.
What I suggest is to put all the words in a 2-dimensional array as the headers of the rows and columns. If you have N words you need an NxN matrix.
In each cell put the calculated distance between the words (the row and column headers).
You will get a matrix of all the possible distances. Remember that in this case we look for minimum distance between words.
So, for each row, look for the minimum cell (not the one with a zero value, which is the distance of the word to itself).
If this minimum is bigger than some threshold, then this word has no similar words. If not, look for all the words in this row with a distance up to the threshold (actually you can skip the previous stage and just do this search).
All of the words you found belong to the same group. Look for the longest and use it in the new list you are building.
Also note the column indexes where you found the minimum distances, and skip the rows with those indexes (so you will not add the same words to different groups).
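Here is a minimal Python sketch of that matrix-and-threshold grouping. It uses difflib's SequenceMatcher as a stand-in for the gem's similarity score (I don't have the similar_text gem here), and it groups greedily, so it does not resolve the transitive-similarity question raised above:

from difflib import SequenceMatcher

def similarity(a, b):
    # stand-in for the gem's pairwise similarity, as a percentage
    return SequenceMatcher(None, a, b).ratio() * 100

def merge_similar(words, threshold=50.0):
    result = []
    used = set()
    for i, w in enumerate(words):
        if i in used:
            continue
        # greedily collect everything still unused that is similar enough to w
        group = [i] + [j for j, v in enumerate(words)
                       if j != i and j not in used and similarity(w, v) > threshold]
        used.update(group)
        result.append(max((words[j] for j in group), key=len))   # keep the longest
    return result

print(merge_similar(["iphone 6", "iphone 5c", "iphone 6", "macbook air", "macbook"]))
# => ["iphone 5c", "macbook air"]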

scrabble solving with maximum score

I was asked a question
You are given a list of characters, a score associated with each character, and a dictionary of valid words (say, a normal English dictionary). You have to form a word out of the character list such that the score is maximum and the word is valid.
I could think of a solution involving a trie built from the dictionary and backtracking over the available characters, but could not formulate it properly. Does anyone know the correct approach, or can you come up with one?
First iterate over your letters and count how many times you have each character of the English alphabet. Store this in a static array, say a char array of size 26, where the first cell corresponds to a, the second to b, and so on. Name this original array cnt. Now iterate over all words, and for each word form a similar array of size 26. For each cell in this array, check whether you have at least as many occurrences in cnt. If that is the case, you can form the word; otherwise you can't. If you can form the word, compute its score and keep the maximum score in a helper variable.
This approach has linear complexity, and this is also the best asymptotic complexity you can possibly have (after all, the input you're given is of linear size).
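A short Python sketch of that counting idea (the letter scores and word list below are made up for illustration):

from collections import Counter

def best_word(letters, dictionary, scores):
    cnt = Counter(letters)                       # how many of each character we hold
    best, best_score = None, -1
    for word in dictionary:
        need = Counter(word)
        # the word is formable only if no letter is needed more often than we have it
        if all(cnt[ch] >= k for ch, k in need.items()):
            score = sum(scores[ch] for ch in word)
            if score > best_score:
                best, best_score = word, score
    return best, best_score

scores = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
print(best_word("tacoed", ["cat", "act", "coat", "taco", "dog"], scores))   # ('coat', 39)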
Inspired by Programmer Person's answer (initially I thought that approach was O(n!) so I discarded it). It needs O(number of words) setup and then O(2^(chars in query)) per question. This is exponential, but in Scrabble you only have 7 letter tiles at a time, so you need to check only 128 possibilities!
The first observation is that the order of characters in the query or word doesn't matter, so you want to process your list into a set of bags of chars. A way to do that is to 'sort' each word, so "bac" and "cab" both become "abc".
Now you take your query and iterate over all possible answers: every keep/discard variant for each letter. It's easier to see in binary form: 1111 to keep all letters, 1110 to discard the last one, and so on.
Then check whether each possibility exists in your dictionary (a hash map for simplicity), and return the one with the maximum score.
import nltk
from string import ascii_lowercase
from itertools import product

scores = {c: s for s, c in enumerate(ascii_lowercase)}
sanitize = lambda w: "".join(c for c in w.lower() if c in scores)
anagram = lambda w: "".join(sorted(w))
anagrams = {anagram(sanitize(w)): w for w in nltk.corpus.words.words()}

while True:
    query = input("What do you have?")
    if not query: break
    # make it look like our preprocessed word list
    query = anagram(sanitize(query))
    results = {}
    # all variants for our query
    for mask in product((True, False), repeat=len(query)):
        # get the variant given the mask
        masked = "".join(c for i, c in enumerate(query) if mask[i])
        # check if it's valid
        if masked in anagrams:
            # score it, also getting the word back would be nice
            results[sum(scores[c] for c in masked)] = anagrams[masked]
    print(*max(results.items()))
Build a lookup trie of just the sorted-anagram of each word of the dictionary. This is a one time cost.
By sorted anagram I mean: if the word is eat, you represent it as aet; if the word is tea, you also represent it as aet; bubble is represented as bbbelu, etc.
Since this is Scrabble, assuming you have 8 tiles (say you want to use one from the board), you will need to check at most 2^8 possibilities.
For any subset of the tiles from the set of 8, you sort the tiles, and lookup in the anagram trie.
There are at most 2^8 such subsets, and this could potentially be optimized (in case of repeating tiles) by doing a more clever subset generation.
If this is a more general problem, where 2^{number of tiles} could be much higher than the total number of anagram-words in the dictionary, it might be better to use frequency counts as in Ivaylo's answer, and the lookups potentially can be optimized using multi-dimensional range queries. (In this case 26 dimensions!)
Sorry, this might not help you as-is (I presume you are trying to do some exercise and have constraints), but I hope this will help the future readers who don't have those constraints.
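For future readers, a small sketch of the one-time sorted-anagram trie and the subset lookup (plain nested dicts as trie nodes; the word list is illustrative):

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in sorted(w):
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(w)   # the original word(s) end here
    return root

def lookup(trie, tiles):
    node = trie
    for ch in sorted(tiles):
        if ch not in node:
            return []
        node = node[ch]
    return node.get("$", [])

trie = build_trie(["eat", "tea", "ate", "bubble"])
print(lookup(trie, "aet"))     # ['eat', 'tea', 'ate']

To answer a query you would enumerate the at most 2^8 subsets of your tiles, run a lookup like this on each, and keep the highest-scoring hit.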
If the number of dictionary entries is relatively small (up to a few million) you can use brute force: for each word, create a 32-bit mask. Preprocess the data: set one bit if the letter a/b/c/.../z is used. For the six most common English characters, etaoin, set another bit if the letter is used twice.
Create a similar bitmap for the letters that you have. Then scan the dictionary for words where all the bits that are needed for the word are set in the bitmap of the available letters. You have reduced the problem to words where you have all the needed characters once, and the six most common characters twice if they are needed twice. You'll still have to check whether a word can actually be formed: if you have a word like "bubble", the first test only tells you that you have the letters b, u, l, e, but not necessarily three b's.
By also sorting the list of words by point values before doing the check, the first hit is the best one. This has another advantage: You can count the points that you have, and don't bother checking words with more points. For example, bubble has 12 points. If you have only 11 points, then there is no need to check this word at all (have a small table with the indexes of the first word with any given number of points).
To improve anagrams: In the table, only store different bitmasks with equal number of points (so we would have entries for bubble and blue because they have different point values, but not for team and mate). Then store all the possible words, possibly more than one, for each bit mask and check them all. This should reduce the number of bit masks to check.
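A rough sketch of that bitmask pre-filter plus the exact follow-up check (the bit layout and helper names are my own choices for illustration):

from collections import Counter

COMMON = "etaoin"   # the six most common letters get a second "used twice" bit

def mask(letters):
    m = 0
    for ch, k in Counter(letters).items():
        m |= 1 << (ord(ch) - ord("a"))
        if k >= 2 and ch in COMMON:
            m |= 1 << (26 + COMMON.index(ch))
    return m

def can_form(word, rack):
    # cheap bitmask test first, then the exact multiset check
    if mask(word) & ~mask(rack):
        return False
    need, have = Counter(word), Counter(rack)
    return all(have[ch] >= k for ch, k in need.items())

print(can_form("bubble", "bbbeluxx"))   # True
print(can_form("bubble", "bbeluxxy"))   # False: the bitmask test passes, but there are only two b's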
Here is a brute-force approach in Python, using an English dictionary containing 58,109 words. This approach is actually quite fast, timing at about 0.3 seconds on each run.
# note: Python 2 syntax (print statement, shuffle over a range list)
from random import shuffle
from string import ascii_lowercase
import time

def getValue(word):
    # sum the per-letter values for a word
    return sum(map(lambda x: key[x], word))

if __name__ == '__main__':
    v = range(26)
    shuffle(v)
    key = dict(zip(list(ascii_lowercase), v))
    with open("/Users/james_gaddis/PycharmProjects/Unpack Sentance/hard/words.txt", 'r') as f:
        wordDict = f.read().splitlines()
    valued = map(lambda x: (getValue(x), x), wordDict)
    print max(valued)
Here is the dictionary I used, with one hyphenated entry removed for convenience.
Can we assume that the dictionary is fixed, the scores are fixed, and only the available letters change (as in Scrabble)? Otherwise, I think there is nothing better than looking up each word of the dictionary, as previously suggested.
So let's assume that we are in this setting. Pick an order < that respects the costs of the letters, for instance Q > Z > J > X > K > ... > A > E > I ... > U.
Replace your dictionary D with a dictionary D' made of the anagrams of the words of D, with letters ordered by the previous order (so the word buzz is mapped to zzbu, for instance), also removing duplicates and words of length > 8 if you have at most 8 letters in your game.
Then construct a trie with the words of D', where the children of a node are ordered by the value of their letters (so the first child of the root would be Q, the second Z, ..., the last one U). On each node of the trie, also store the maximal value of any word going through that node.
Given a set of available characters, you can explore the trie in a depth-first manner, going from left to right, keeping in memory the current best value found. Only explore branches whose node value is larger than your current best value. This way, you will explore only a few branches after the first ones (for instance, if you have a Z in your game, any branch that starts with a one-point letter such as A is discarded, because it will score at most 8x1, which is less than the value of Z). I bet that you will explore only very few branches each time.
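A compact sketch of that idea: value-ordered anagrams, a dict-based trie that stores the maximal reachable value per node, and a pruned depth-first search. The score table and word list are placeholders:

from collections import Counter

SCORES = {c: 1 for c in "abcdefghijklmnopqrstuvwxyz"}
SCORES.update({"z": 10, "q": 10, "j": 8, "x": 8, "k": 5})   # placeholder values

def word_value(word):
    return sum(SCORES[c] for c in word)

def ordered_anagram(word):
    # letters of the word, most valuable first (ties broken alphabetically)
    return "".join(sorted(word, key=lambda c: (-SCORES[c], c)))

def build_trie(words, max_len=8):
    root = {"max": 0, "kids": {}, "word": None}
    for w in words:
        if len(w) > max_len:
            continue
        value, node = word_value(w), root
        node["max"] = max(node["max"], value)
        for ch in ordered_anagram(w):
            node = node["kids"].setdefault(ch, {"max": 0, "kids": {}, "word": None})
            node["max"] = max(node["max"], value)
        node["word"] = w   # anagrams share this node and score the same, so any one will do
    return root

def best(node, rack, best_so_far=(0, None)):
    if node["word"] and word_value(node["word"]) > best_so_far[0]:
        best_so_far = (word_value(node["word"]), node["word"])
    # visit children in decreasing order of their stored maximum, pruning the rest
    for ch, kid in sorted(node["kids"].items(), key=lambda kv: -kv[1]["max"]):
        if kid["max"] <= best_so_far[0]:
            break                      # nothing better can be found below here
        if rack[ch] > 0:
            rack[ch] -= 1
            best_so_far = best(kid, rack, best_so_far)
            rack[ch] += 1
    return best_so_far

trie = build_trie(["buzz", "zeta", "tea", "eat", "axe"])
print(best(trie, Counter("aetzbxu")))   # (13, 'zeta')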

Correct order output in K Means and document clustering

I am doing a single document clustering with K Means, I am now working on preparing the data to be clustered and represent N sentences in their vector representations.
However, if I understand correctly, the KMeans algorithm is set to create k clusters based on the Euclidean distance to k center points, regardless of the sentence order.
My problem is that I want to keep the order of the sentences and consider them in the clustering task.
Let's say S = {1...n} is a set of n vectors representing sentences, with S_1 = sentence 1, S_2 = sentence 2, etc.
I want the clusters to be K_1 = S[1..i], K_2 = S[i..j], etc.
I thought maybe I could transform this into 1D and add the index of each sentence to the transformed value, but I'm not sure it will help, and maybe there's a smarter way.
A quick and dirty way to do this would be to tag each lexical item with the number of the sentence it's in. First do sentence segmentation; then, for this document:
This document's really great. It's got all kinds of words in it. All the words are here.
You would get something like:
{"0_this": 1, "0_document": 1, "0_be": 1, "0_really": 1,...}
Whatever k-means you're using, this should be readily accepted.
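A tiny sketch of that preprocessing step (the crude splitting below stands in for real sentence segmentation and lemmatization, which is why the keys don't exactly match the example above):

doc = ("This document's really great. "
       "It's got all kinds of words in it. "
       "All the words are here.")

features = {}
for i, sentence in enumerate(doc.split(". ")):
    for tok in sentence.lower().rstrip(".").split():
        key = "%d_%s" % (i, tok)            # tag each token with its sentence number
        features[key] = features.get(key, 0) + 1

print(features)   # {'0_this': 1, "0_document's": 1, '0_really': 1, ...}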
I'd warn against doing this at all in general, though. You're introducing a lot of data sparsity, and your results will be more harmed by the curse of dimensionality. You should only do it if the genre you're looking at is (1) very predictable in lexical choice and (2) very predictable in structure. I can't think of a good linguistic reason that sentences should align precisely across texts.

relevant text search in a large text file

I have a large text document and I have a search query(e.g. : rock climbing). I want to return 5 most relevant sentences from the text. What are the approaches that can be followed? I am a complete newbie at this text retrieval domain, so any help is appreciated.
One approach I can think of is :
Scan the file sentence by sentence, look for the whole search query in the sentence, and if it matches, return the sentence.
The above approach works only if some of the sentences contain the whole search query. What should I do if no sentence contains the whole query and some sentences contain just one of the words? Or what if they contain none of the words?
Any help?
Another question I have is: can we preprocess the text document to make building an index easier? Is a trie a good data structure for preprocessing?
In general, relevance is something you define using some sort of scoring function. I will give you an example of a naive scoring algorithm, as well as one of the common search engine ranking algorithms (used for documents, but I modified it for sentences for educational purposes).
Naive ranking
Here's an example of a naive ranking algorithm (a small sketch follows the list). The ranking could be as simple as:
Sentences are ranked based on the proximity of the query terms (e.g. the largest number of words separating any pair of query terms, smaller being better), meaning that the sentence "Rock climbing is awesome" is ranked higher than "I am not a fan of climbing because I am lazy like a rock."
More word matches are ranked higher, e.g. "Climbing is fun" is ranked higher than "Jogging is fun."
Pick alphabetical or random favorites in case of a tie, e.g. "Climbing is life" is ranked higher than "I am a rock."
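A toy version of that ranking, mostly to make the criteria concrete (the proximity measure here, the largest gap between matched terms, is just one possible choice):

def naive_rank(sentences, query):
    terms = query.lower().split()

    def key(sentence):
        words = sentence.lower().rstrip(".").split()
        positions = [i for i, w in enumerate(words) if w in terms]
        matches = len(set(words) & set(terms))
        gap = max(positions) - min(positions) if len(positions) > 1 else 0
        # more matches first, then smaller gap, then alphabetical tie-break
        return (-matches, gap, sentence.lower())

    return sorted(sentences, key=key)

docs = ["Rock climbing is awesome.",
        "I am not a fan of climbing because I am lazy like a rock.",
        "Jogging is fun."]
print(naive_rank(docs, "rock climbing"))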
Some common search engine ranking
BM25
BM25 is a good, robust algorithm for scoring documents in relation to a query. For reference purposes, here's a Wikipedia article about the BM25 ranking algorithm. You would want to modify it a little because you are dealing with sentences, but you can take a similar approach by treating each sentence as a 'document'.
Here it goes. Assuming your query consists of keywords q1, q2, ... , qm, the score of a sentence S with respect to the query Q is calculated as follows:
SCORE(S, Q) = SUM(i=1..m) IDF(qi) * f(qi, S) * (k1 + 1) / (f(qi, S) + k1 * (1 - b + b * |S| / AVG_SENT_LENGTH))
k1 and b are free parameters (commonly chosen as k1 in [1.2, 2.0] and b = 0.75; you can find good values empirically), f(qi, S) is the term frequency of qi in sentence S (you could treat it as just the number of times the term occurs), |S| is the length of your sentence (in words), and AVG_SENT_LENGTH is the average sentence length in your document. Finally, IDF(qi) is the inverse document frequency (or, in this case, inverse sentence frequency) of qi, which is usually computed as:
IDF(qi) = log ((N - n(qi) + 0.5) / (n(qi) + 0.5))
Where N is the total number of sentences, and n(qi) is the number of sentences containing qi.
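Here is a direct, naive transcription of those formulas into Python (a sketch: whitespace tokenization, sentences treated as documents, and the suggested default parameters):

import math

def bm25_scores(sentences, query, k1=1.5, b=0.75):
    docs = [s.lower().rstrip(".").split() for s in sentences]
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / float(N)
    terms = query.lower().split()

    def idf(term):
        n = sum(1 for d in docs if term in d)      # sentences containing the term
        return math.log((N - n + 0.5) / (n + 0.5))

    def score(doc):
        total = 0.0
        for t in terms:
            f = doc.count(t)                        # term frequency in this sentence
            denom = f + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf(t) * f * (k1 + 1) / denom
        return total

    return [(score(d), s) for d, s in zip(docs, sentences)]

sents = ["Rock climbing is awesome.",
         "I like rock music.",
         "Jogging is fun.",
         "The gym has a climbing wall.",
         "Reading is relaxing."]
print(sorted(bm25_scores(sents, "rock climbing"), reverse=True)[:2])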
Speed
Assume you don't store inverted index or any additional data structure for fast access.
These are the terms that can be pre-computed: N and AVG_SENT_LENGTH.
First, notice that the more terms are matched, the higher a sentence will be scored (because of the sum). So if you restrict yourself to the top k sentences by number of matched query terms, you only need to compute the values f(qi, S), |S|, and n(qi) for those, which takes O(AVG_SENT_LENGTH * m * k) time, or O(DOC_LENGTH * m) in the worst case if you are ranking all the sentences, where k is the number of sentences with the highest number of matched terms and m is the number of query terms. Assuming each sentence is about AVG_SENT_LENGTH long, you go over the m terms for each of the k sentences.
Inverted index
Now let's look at an inverted index to allow fast text searches. We will treat your sentences as documents for educational purposes. The idea is to build a data structure for your BM25 computations. We will need to store term frequencies using inverted lists:
wordi: (sent_id1, tf1), (sent_id2, tf2), ... ,(sent_idk, tfk)
Basically, you have a hashmap where each key is a word and its value is a list of pairs (sent_idj, tfj), giving the ids of the sentences the word occurs in and its frequency in each. For example, it could be:
rock: (1, 1), (5, 2)
This tells us that the word rock occurs in the first sentence 1 time and in the fifth sentence 2 times.
This pre-processing step will allow you to get O(1) access to the term frequency of any particular word, so it will be as fast as you want.
Also, you would want to have another hashmap to store sentence length, which should be a fairly easy task.
How do you build the inverted index? I am skipping stemming and lemmatization in your case, but you are welcome to read more about them. In short, you traverse the document, continuously creating pairs/increasing frequencies in the hashmap containing the words. Here are some slides on building the index.
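A minimal sketch of that index build (plain dicts, no stemming or lemmatization):

from collections import defaultdict

sentences = ["The rock was heavy.", "Rock climbing is fun.", "I like climbing."]

inverted = defaultdict(dict)   # word -> {sentence id: term frequency}
sent_length = {}               # sentence id -> length in words

for sent_id, sentence in enumerate(sentences, start=1):
    words = sentence.lower().rstrip(".").split()
    sent_length[sent_id] = len(words)
    for w in words:
        inverted[w][sent_id] = inverted[w].get(sent_id, 0) + 1

print(inverted["rock"])   # {1: 1, 2: 1} -> "rock" occurs once in sentences 1 and 2
print(sent_length)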

Resources