How can I vectorize a list of words? - text-classification

I am working on SMS data where one column of my dataframe contains a list of words.
I want to train a classifier to predict its type and subtype.
How would I convert the words into a numerical format, given that they are stored as a list?

The idea is to use as vocabulary all the words found in this column across instances, except that the least frequent words should be removed (to avoid overfitting). Then for every instance the column is represented as a vector of boolean features, where the nth value represents the nth word in the vocabulary: 1 if it is in the list for this instance, 0 if not.
In Python you can use CountVectorizer, considering every list in the column as a sentence.
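A rough sketch of that idea with scikit-learn; the example messages, the join step, and the min_df threshold are illustrative assumptions, not part of the original question:

from sklearn.feature_extraction.text import CountVectorizer

# each instance is a list of words; join them so that CountVectorizer
# can treat every list as one "sentence"
docs = [["win", "free", "prize"], ["call", "me", "later"], ["free", "call", "now"]]
texts = [" ".join(words) for words in docs]

# binary=True gives 0/1 features instead of raw counts; min_df drops the
# least frequent words from the vocabulary (here: words seen in < 2 instances)
vectorizer = CountVectorizer(binary=True, min_df=2)
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # ['call' 'free']
print(X.toarray())                         # one boolean row per instance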

Related

Look for a data structure to match words by letters

Given a list of lowercase random words, each word of the same length, and many patterns in which the letters at some positions are specified while the other letters are unknown, find all words that match each pattern.
For example, words list is:
["ixlwnb","ivknmt","vvqnbl","qvhntl"]
And patterns are:
i-----
-v---l
-v-n-l
With a naive algorithm, one can do an O(NL) scan for each pattern, where N is the number of words and L is the word length.
But since many patterns may be run against the same word list, is there a good data structure to preprocess and store the word list, and then allow efficient matching for all patterns?
One simple idea is to use an inverted index. First, number your words -- you'll refer to them using these indices rather than the words themselves for speed and space efficiency. Probably the index fits in a 32-bit int.
Now your inverted index: for each letter in each position, construct a sorted list of IDs for words that have that letter in that location.
To do your search, you take the lists of IDs for each of the letters in the positions you're given, and take the intersection of the lists, using an algorithm like the "merge" in merge-sort. All IDs in the intersection match the input.
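A small Python sketch of this inverted index, using the word list and one of the patterns from the question; the merge-style intersection is written out explicitly to mirror the description:

from collections import defaultdict

words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]

# index[(position, letter)] -> sorted list of IDs of words with that letter there
index = defaultdict(list)
for wid, w in enumerate(words):
    for pos, ch in enumerate(w):
        index[(pos, ch)].append(wid)

def intersect(a, b):
    # merge-style intersection of two sorted ID lists
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def match(pattern):
    lists = [index[(pos, ch)] for pos, ch in enumerate(pattern) if ch != "-"]
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return [words[i] for i in result]

print(match("-v-n-l"))  # ['vvqnbl', 'qvhntl']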
Alternatively, if your words are short enough (12 characters or fewer), you could compress them into 64-bit words (using 5 bits per letter, with letters 1-26). Construct a bit-mask with binary 11111 in places where you have a letter, and 00000 in places where you have a blank. Also construct a bit-test value from your input, with the 5-bit code for each letter in each place, using 00000 where you have blanks. For example, if your input is a-c then your bit-mask will be binary 111110000011111 and your bit-test binary 000010000000011. Go through your word list, take the bitwise AND of each word with the bit-mask, and check whether the result equals the bit-test value. This is cache friendly and the inner loop is tight, so it may be competitive with algorithms that look like they should be faster on paper.
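And a rough sketch of that bit-packing variant in Python; the packing order (first letter in the highest bits) and the 5-bit codes with 'a' = 1 are assumptions consistent with the description, and Python integers stand in for the 64-bit words:

words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]

def pack(s):
    # pack letters into 5-bit groups, first letter in the highest bits
    v = 0
    for ch in s:
        v = (v << 5) | (ord(ch) - ord('a') + 1)
    return v

packed = [pack(w) for w in words]

def match(pattern):
    mask = test = 0
    for ch in pattern:
        mask <<= 5
        test <<= 5
        if ch != '-':
            mask |= 0b11111                 # letter given: compare these 5 bits
            test |= ord(ch) - ord('a') + 1  # against this letter's code
    return [w for w, p in zip(words, packed) if p & mask == test]

print(match("-v-n-l"))  # ['vvqnbl', 'qvhntl']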
I'll preface this by saying it's more of a comment and less of an answer (I don't have enough reputation to comment, though). I can't think of any data structure that will satisfy the requirements out of the box. It was interesting to think about, and I figured I'd share one potential solution that popped into my head.
I keyed in on the "same length" part, and figured I could come up with something based on that.
In theory we could have N (N being the word length) maps of char -> set.
When a string is added, we go through each character and add the string to the corresponding set. Pseudocode:
firstCharMap[s[0]].insert(s);
secondCharMap[s[1]].insert(s);
thirdCharMap[s[2]].insert(s);
fourthCharMap[s[3]].insert(s);
fifthCharMap[s[4]].insert(s);
sixthCharMap[s[5]].insert(s);
Then to determine which strings match the pattern, we just take the intersection of the sets. For example, "-v-n-l" would be:
intersection of sets: secondCharMap[v], fourthCharMap[n], sixthCharMap[l]
One edge case that jumps out is if I wanted to just get all of the strings, so if that's a requirement--we may also need an additional set of all of the strings.
This solution feels clunky, but I think it could work. Depending on the language, number of strings, etc--I wouldn't be surprised if it performed worse than just iterating over all strings and checking a predicate.
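A short Python sketch of this per-position map idea, using sets so the intersection is built in; the final line also covers the all-blanks edge case mentioned above:

from collections import defaultdict

words = ["ixlwnb", "ivknmt", "vvqnbl", "qvhntl"]

# one map per position: char_maps[pos][letter] -> set of words with that letter there
char_maps = defaultdict(lambda: defaultdict(set))
for w in words:
    for pos, ch in enumerate(w):
        char_maps[pos][ch].add(w)

def match(pattern):
    sets = [char_maps[pos][ch] for pos, ch in enumerate(pattern) if ch != "-"]
    # no letters specified: return every string (the edge case above)
    return set.intersection(*sets) if sets else set(words)

print(match("-v-n-l"))  # {'vvqnbl', 'qvhntl'}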

Pseudocode of sorting a list of strings without using loops

I was trying to think of an algorithm that would sort a list of strings according to their first 4 chars (say each line from a file), without using conventional looping constructs such as while or for. An example of input would be:
1231COME1900123
1233COME1902030
2031COME1923919
1231GO 1231203
1233GO 1932911
2031GO 1239391
The thing is, we do not know the number of records beforehand. Each 4-digit ID number can have multiple COME and GO records, but they arrive already grouped as above. I want to sort the file by the 4-digit ID number and achieve this:
1231COME1900123
1231GO 1231203
1233COME1902030
1233GO 1932911
2031COME1923919
2031GO 1239391
The only logical idea I have is that we should read through the records recursively, but the sorting part is a bit tricky for me. GOTO could be used as well. Any ideas?
Assuming that the first 4 characters of each entry are always digits, you can do something as follows:
1. Create a list of length 10000, where each element can hold a pair of values.
2. Index into that list based upon the first 4 digits of an entry.
3. The shape of the individual elements will be as follows -> [COME_ELEMENT, GO_ELEMENT].
4. Each COME_ELEMENT and GO_ELEMENT is a list in itself, of length equal to the maximum value + 1 that can appear after the words COME & GO.
5. Now, as each string arrives, break it at the first 4 digits and go to that element of the list.
6. After that, check whether it's a GO or a COME.
7. If it's a GO (say), determine the number after the word GO.
8. Insert the string at the index determined in step 7 in the inner list.
9. When you're done inserting values, just traverse the non-empty elements.
The result so obtained will contain the sorted order that you require without the use of looping.
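A minimal Python sketch in the same spirit, with recursion instead of loops; the fixed 10000-slot list and the inner COME/GO lists are simplified here into a single dict keyed by the 4-digit ID, relying on the fact that the input already has COME before GO within each ID:

from collections import defaultdict

records = [
    "1231COME1900123",
    "1233COME1902030",
    "2031COME1923919",
    "1231GO 1231203",
    "1233GO 1932911",
    "2031GO 1239391",
]

def fill(recs, buckets, i=0):
    # recursively drop each record into the bucket for its 4-digit ID
    if i == len(recs):
        return buckets
    buckets[recs[i][:4]].append(recs[i])
    return fill(recs, buckets, i + 1)

def collect(ids, buckets, i=0, out=()):
    # recursively walk the IDs in order and concatenate their buckets
    if i == len(ids):
        return list(out)
    return collect(ids, buckets, i + 1, out + tuple(buckets[ids[i]]))

buckets = fill(records, defaultdict(list))
print("\n".join(collect(sorted(buckets), buckets)))  # records grouped by ID, COME then GO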

Sort array of strings based on matching words with input string

I was not able to find any solution to the below problem from a coding contest.
Problem:
We have an input string of "good words" separated by underscores and a list of user reviews (basically an array of strings, where each element has some words separated by underscores).
We have to sort the list of user reviews so that elements containing more good words come first.
Example:
input:
good words: "pool_clean_food".
user review array:["food_bedroom_environment","view_sea_desert","clean_pool_table"].
output: [2,0,1]
Explanation:
Array[2]="clean_pool_table" has 2 good words, i.e. pool and clean
Array[0]="food_bedroom_environment" has 1 good word, i.e. food
Array[1]="view_sea_desert" has 0 good words
How can I approach the problem, and which data structure should I use so that my code can handle large inputs?
Split the input good words on underscores and store them in a hashset.
Now for each review, assign a score of 0 initially. Split its words on underscores as well and check one by one whether each word is present in the hashset. If a word is present, add 1 to the score of that review.
Now consider every review as a <review, score> pair and sort those pairs by score in descending order (highest score first). You can use any standard O(n log n) sorting algorithm for this.
Instead of a hashset, you can use a trie, which will speed up the lookups in case the words are very long.
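A compact Python sketch of this scoring approach, using the example from the question; Python's sort is stable, so ties keep their original order, which matches the expected output:

good = set("pool_clean_food".split("_"))
reviews = ["food_bedroom_environment", "view_sea_desert", "clean_pool_table"]

# score each review by how many of its words appear in the good-word set
scores = [sum(word in good for word in review.split("_")) for review in reviews]

# sort review indices by score, highest first
order = sorted(range(len(reviews)), key=lambda i: -scores[i])
print(order)  # [2, 0, 1]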

Best data structure to count letter frequencies?

Task:
What is the most common first letter found in all the words in this document?
-unweighted (count a word once regardless of how many times it shows up)
-weighted (count a word separately for each time it shows up)
What is the most common word of a given length in this document?
I'm thinking of using a hashmap to count the most common first letter. But should I use a hashmap for both the unweighted and weighted?
And for the most common word of a given length (e.g. 5), could I use something simpler like an array list?
For the unweighted, you need a hash table to keep track of the words you've already seen, as well as a hash map to count the occurrences of the first letter. That is, you need to write:
if words_seen does not contain word
    add word to words_seen
    update hash map with first letter of word
end-if
For the weighted, you don't need that hash table, because you don't care whether the word has already occurred; every occurrence counts. So you can just write:
update hash map with first letter of word
For the most common words, you need a hash map to keep track of all the unique words you see, and the number of times you see the word. After you've scanned the entire document, make a pass through that hash map to determine the most frequent one with the desired length.
You probably don't want to use an array list for the last task, because you want to count occurrences. If you used an array list then after scanning the entire document you'd have to sort that list and count frequencies. That would take more memory and more time than just using the hash map.
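A short Python sketch of all three counts; splitting on whitespace and lowercasing is an assumption about how the document is tokenised, and the sample text is made up:

from collections import Counter

text = "The quick brown fox jumps over the lazy dog the fox"
words = text.lower().split()

# weighted: every occurrence of a word contributes its first letter
weighted_first = Counter(w[0] for w in words)

# unweighted: each distinct word contributes its first letter once
unweighted_first = Counter(w[0] for w in set(words))

# most common word of a given length
length = 3
by_length = Counter(w for w in words if len(w) == length)

print(weighted_first.most_common(1))   # [('t', 3)]
print(unweighted_first.most_common(1))
print(by_length.most_common(1))        # [('the', 3)]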

Efficient algorithm to find most common phrases in a large volume of text

I am thinking about writing a program to collect for me the most common phrases in a large volume of text. Had the problem been reduced to just finding words, then it would be as simple as storing each new word in a hashmap and increasing the count on each occurrence. But with phrases, storing each permutation of a sentence as a key seems infeasible.
Basically the problem is narrowed down to figuring out how to extract every possible phrase from a large enough text. Counting the phrases and then sorting by the number of occurrences becomes trivial.
I assume that you are searching for common patterns of consecutive words appearing in the same order (e.g. "top of the world" would not be counted as the same phrase as "top of a world" or "the world of top").
If so then I would recommend the following linear-time approach:
Split your text into words and remove things you don't consider significant (e.g. capitalisation, punctuation, word breaks, etc.)
Convert your text into an array of integers (one integer per unique word, e.g. every instance of "cat" becomes 1, every "dog" becomes 2). This can be done in linear time by using a hash-based dictionary to store the conversions from words to numbers. If the word is not in the dictionary, assign it a new id.
Construct a suffix array for the array of integers (this is a sorted list of all the suffixes of your array and can be constructed in linear time, e.g. with the DC3/skew algorithm).
Construct the longest common prefix (LCP) array for your suffix array (this can also be done in linear time, for example with Kasai's algorithm). The LCP array gives the number of words shared at the start of each pair of consecutive suffixes in the suffix array.
You are now in a position to collect your common phrases.
It is not quite clear how you wish to determine the end of a phrase. One possibility is to simply collect all sequences of 4 words that repeat.
This can be done in linear time by working through your suffix array looking at places where the longest common prefix array is >= 4. Each run of indices x in the range [start+1...start+len] where the LCP[x] >= 4 (for all except the last value of x) corresponds to a phrase that is repeated len times. The phrase itself is given by the first 4 words of, for example, suffix start+1.
Note that this approach will potentially spot phrases that cross sentence ends. You may prefer to convert some punctuation such as full stops into unique integers to prevent this.
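A small Python sketch of the whole pipeline. To stay short it builds the suffix array and LCP values naively (roughly O(n^2 log n) in the worst case) rather than with the linear-time algorithms mentioned above, fixes the phrase length at 4 words, and uses a made-up sample sentence:

from collections import defaultdict

def repeated_phrases(text, k=4):
    # steps 1-2: normalise and map each unique word to an integer id
    words = text.lower().split()
    ids, arr = {}, []
    for w in words:
        arr.append(ids.setdefault(w, len(ids)))
    n = len(arr)

    # step 3: suffix array, built here by naively sorting the suffixes
    sa = sorted(range(n), key=lambda i: arr[i:])

    # step 4: LCP between two suffixes, computed by direct comparison
    def lcp(i, j):
        m = 0
        while i + m < n and j + m < n and arr[i + m] == arr[j + m]:
            m += 1
        return m

    # step 5: wherever consecutive suffixes share at least k words,
    # the first k words of that suffix form a repeated phrase
    counts = defaultdict(int)
    for x in range(1, n):
        if lcp(sa[x - 1], sa[x]) >= k:
            phrase = " ".join(words[sa[x]:sa[x] + k])
            counts[phrase] += 1

    # a phrase that appears in r adjacent pairs occurs r + 1 times
    return {p: c + 1 for p, c in counts.items()}

print(repeated_phrases("the top of the world is not the top of a world but the top of the world"))
# {'the top of the': 2, 'top of the world': 2}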

Resources