Where to find the frequency of 3-grams? - n-gram

Where can I get a dataset of the frequencies of 3-letter combinations (letter 3-grams)? When I try to look for 3-grams and other n-grams online, it usually shows me combinations of 3 words, not letters.
(I could generate my own, of course, but I'm wondering if there is a ready-to-use dataset of them.)
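If generating them yourself turns out to be the only option, a minimal sketch in Python (assuming a plain-text corpus file, here hypothetically named corpus.txt) could look like this:

from collections import Counter
import re

def letter_trigrams(path):
    # count 3-letter combinations (letter 3-grams) in a plain-text corpus
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # lowercase and split on non-letters so trigrams never span word boundaries
            for word in re.split(r"[^a-z]+", line.lower()):
                counts.update(word[i:i + 3] for i in range(len(word) - 2))
    return counts

# letter_trigrams("corpus.txt").most_common(20)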

Related

Percentage of google N-grams containing a particular word

I'm trying to use the Google n-grams API to understand what percentage of, say, 2-, 3-, or 4-grams contain a particular word, say 'happy'.
Just using the query for 'happy' - which gives the percentage of 1-grams that the word 'happy' accounts for - should be a very reasonable estimate for this, but I want to be more precise.
For a particular year:
https://books.google.com/ngrams/graph?content=happy&year_start=2018&year_end=2019&corpus=26&smoothing=3&case_insensitive=true
I see that you can download the raw frequency scores for all of the 1- to 5-grams, so if all else fails I guess I can get the answer that way, but I thought this was a relatively natural question for the standard API.
I thought it might be something like 'happy *', but this returns the top 10 2-grams starting with happy.
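If it does come down to the raw-file fallback, here is a rough sketch of the tally, assuming the older tab-separated export format of one line per ngram/year pair (ngram, year, match_count, volume_count); newer releases pack the years differently, so check the format notes on the download page:

def fraction_containing(word, lines, year):
    # lines: iterator over raw 2-gram file lines, "ngram\tyear\tmatch_count\tvolume_count"
    word = word.lower()
    total = containing = 0
    for line in lines:
        ngram, y, match_count, _ = line.rstrip("\n").split("\t")
        if int(y) != year:
            continue
        count = int(match_count)
        total += count
        if word in ngram.lower().split():
            containing += count
    return containing / total if total else 0.0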

How to get top K words distributed in N computers efficiently?

Suppose we have many words distributed in N computers and we want the top 10 frequent words.
There are three approaches I can think of:
1. Count the words on each of the N computers separately to get the top 20 (this number is up for discussion) words per computer, then merge these results together.
The drawback of this approach is that some words might be missed: a word may be spread evenly across the computers so that it never makes the top 20 on any single one, even though its total frequency would put it in the top 10.
2. Almost the same as the first one, except that all of the counting results from every computer are merged, and the top 10 is then calculated from the merged counts.
The drawback is that the merge and transmission time is relatively large.
3. Use a good hash function to redistribute the words so that no two computers hold the same word, then take the top 10 on each computer and merge them.
The drawback is that every word must be hashed and transmitted to another computer, which takes a lot of transmission time.
Do you have any better approach for this? Or which one of my approaches is the best?
Your idea in #1 was good but needed better execution. If F is the frequency of the Kth most common word on a single computer, then all words with frequency less than F/N on all N computers can be ignored. If you divide the machines into G groups, then the threshold F'/G applies, where F' is the frequency of the Kth most common word on the computers within a single group.
In two rounds, the computers can determine the best value for F and then aggregate a small Bloom filter that hits on all frequent words and gives false positives on some others; this filter can be used to reduce the amount of data to merge in approaches #2 and #3.
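A small sketch of that pruning rule, assuming each machine already holds its local word counts in a dict:

def local_candidates(local_counts, k, n_machines):
    # F is the frequency of the k-th most common word on this machine; a word whose
    # frequency is below F / n_machines on every machine cannot reach the global top k,
    # so each machine only nominates words at or above that threshold. The union of all
    # candidate sets is then re-counted exactly in a second round.
    freqs = sorted(local_counts.values(), reverse=True)
    f = freqs[k - 1] if len(freqs) >= k else 0
    threshold = f / n_machines
    return {w for w, c in local_counts.items() if c >= threshold}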
The drawback of the first approach makes it a non-solution.
The second approach sends all results to a single machine for it to do all the work.
The third approach sends all results as well, but to multiple machines, which then share the workload (this is the important part) - the final results that get sent to a single machine to be merged should be small in comparison to sending all word frequencies.
Clearly the third approach makes the most sense.
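For illustration, a minimal single-process sketch of the hash-partitioning idea behind approach #3, simulating the N computers as in-memory word lists:

from collections import Counter
import heapq

def top_k_distributed(shards, k=10, n_machines=4):
    # shards: one list of words per original machine
    # 1. hash-partition: a given word always lands on the same "reducer" machine
    partitions = [Counter() for _ in range(n_machines)]
    for shard in shards:
        for word in shard:
            partitions[hash(word) % n_machines][word] += 1
    # 2. each reducer owns every occurrence of its words, so its local top-k is exact
    local_tops = [part.most_common(k) for part in partitions]
    # 3. a single machine merges the small per-reducer lists
    merged = [pair for top in local_tops for pair in top]
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])

# top_k_distributed([["a", "b", "a"], ["a", "c", "b"]], k=2)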

OCR: Choose the best string based on last N results (an adaptive filter for OCR)

I've seen some questions on deciding the best OCR result given output from different engines, and the answer is typically "choose the best engine".
I want, however, to capture several frames of text images, with possible temporary occlusions or temporary failures.
I'm using tesseract-ocr with python-tesseract.
Considering the OCR outputs of the last N frames, I want to decide what is the best result (line by line, for simplicity).
For example, for N=3, we could use a median filtering:
ABXD
XBCX
AXCD
When 2 out of 3 characters agree, the majority wins, so the result would be ABCD.
However, that's not so easy with different string sizes. If I expect a given size M (when scanning a price table, the rows are typically XX.XX), I can always penalize strings longer than M.
If we were talking about numbers, median filtering would work quite well (as in simple background subtraction in computer vision), or some least-mean-squares adaptive filtering.
There's also the problem of similar characters: l and 1 can be very similar, depending on the font.
I was also thinking of using string distances between each pair of strings. For example, choose the string with the smallest sum of distances to the others.
Has anyone addressed this kind of problem before? Is there any known algorithm for this kind of problem that I should know?
This problem is called multiple sequence alignment and you can read about it here
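A minimal sketch of the "smallest sum of distances" idea from the question, with a plain Levenshtein implementation so it needs no external library:

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def best_of_last_n(frames):
    # pick the OCR line with the smallest total edit distance to the other frames
    return min(frames, key=lambda s: sum(levenshtein(s, t) for t in frames))

# best_of_last_n(["ABXD", "XBCX", "AXCD"]) returns the most "central" of the three readings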

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join via a unique ID (contract number). The second, different source exists because the 2 sources are updated manually and independently. So what I have is an ID and a company name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any good ways to do this? (I don't really like the idea of using regexes to separate the words in each string and then finding matches for every word in the other string via Levenshtein distance, so I am searching for other ideas.)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
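A rough Python sketch of those steps (the 4-character threshold is meant to catch the legal-form tokens like JSC/LTD mentioned in the question):

import re

def normalize(name):
    # steps 1-3: punctuation to whitespace, split into words,
    # move short words to the end in alphabetical order
    words = re.sub(r"[^\w\s]", " ", name).split()
    long_words = [w for w in words if len(w) > 4]
    short_words = sorted(w for w in words if len(w) <= 4)
    return " ".join(long_words + short_words).lower()

# step 4: feed the normalized names to any Levenshtein implementation, e.g.
# Levenshtein.distance(normalize(a), normalize(b)) from the python-Levenshtein package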
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices, each a Levenshtein distance divided by the sum of the lengths of both strings (a relative distance), computed on the following:
Just the 2 strings
The strings after splitting them into words, eliminating the non-word characters, sorting the words in ascending order and joining them with a space separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
Each of these in turn is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
where X1..X4 are the indices, and E1..E4 are user-provided weights expressing how significant each index is. To keep the result inside the reasonable range of 1..1000, the vector (E1..E4) is normalized.
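A small sketch of how that combination might look, assuming "normalized" means the exponents are scaled to sum to 1 (which turns the product into a weighted geometric mean and keeps the result in the 1..1000 range):

def combined_index(x, e):
    # x: the four per-index values (each in 1..1000); e: user-provided significance weights
    total = sum(e)
    score = 1.0
    for xi, ei in zip(x, e):
        score *= xi ** (ei / total)   # exponents normalized to sum to 1
    return score

# combined_index([12, 450, 3, 900], [2, 1, 1, 1])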
The results are impressive. The whole thing works much faster than I expected (I built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) over non-null values in the whole database is 765. Right up until about 300 there are virtually no matching company names. Around 200 there are companies that have somewhat similar names, and some are the same names written in very different ways, with abbreviations, additional words, etc. At 100 and below, practically all the records contain names that are the same but written with slight differences, and by 30, only the word order or the punctuation may differ.
It totally works; the result is better than I expected.
I wrote a post on my blog to share this library in case someone else needs it.

Algorithm for generating a 'top list' using word frequency

I have a big collection of human generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this?
Don't reinvent the wheel. Use a full text search engine such as Lucene.
The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go.
At the end of the process sort the key/value pairs by count.
The basic idea is simple -- in executable pseudocode:
from collections import defaultdict

def process(words):
    # count how many times each word occurs in the iterable
    d = defaultdict(int)
    for w in words:
        d[w] += 1
    return d
Of course, the devil is in the details -- how do you turn the big collection into an iterator yielding words? Is it big enough that you can't process it on a single machine but rather need a mapreduce approach e.g. via hadoop? Etc, etc. NLTK can help with the linguistic aspects (isolating words in languages that don't separate them cleanly).
On a single-machine execution (net of mapreduce), one issue that can arise is that the simple idea gives you far too many singletons or thereabouts (words occurring once or just a few times), which fill memory. A probabilistic retort to that is to do two passes: one with random sampling (get only one word in ten, or one in a hundred) to make a set of words that are candidates for the top ranks, then a second pass skipping words that are not in the candidate set. Depending on how many words you're sampling and how many you want in the result, it's possible to compute an upper bound on the probability that you're going to miss an important word this way (and for reasonable numbers, and any natural language, I assure you that you'll be just fine).
Once you have your dictionary mapping words to numbers of occurrences, you just need to pick the top N words by occurrences -- a heap queue will help there if the dictionary is too large to sort by occurrences in its entirety (in my favorite executable pseudocode, heapq.nlargest will do it, for example).
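For instance, continuing with the dictionary returned by the process function above, heapq.nlargest can pick the top N directly:

import heapq

def top_n(counts, n=10):
    # counts: mapping word -> number of occurrences
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

# top_n(process("the quick brown fox jumps over the lazy dog the".split()), n=3)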
Look into the Apriori algorithm. It can be used to find frequent items and/or frequent sets of items.
Like the wikipedia article states, there are more efficient algorithms that do the same thing, but this could be a good start to see if this will apply to your situation.
Maybe you can try using a PATRICIA trie (Practical Algorithm To Retrieve Information Coded In Alphanumeric)?
Why not a simple map with the word as the key and a counter as the value?
Taking the highest-valued counters gives the top used words.
Building the map is just an O(N) operation.
