word distribution problem - algorithm

I have a big file of words ~100 Gb and have limited memory 4Gb. I need to calculate word distribution from this file. Now one option is to divide it into chunks and sort each chunk and then merge to calculate word distribution. Is there any other way it can be done faster? One idea is to sample but not sure how to implement it to return close to correct solution.
Thanks

You can build a Trie structure where each leaf (and some nodes) will contain the current count. As words will intersect with each other 4GB should be enough to process 100 GB of data.

Naively I would just build up a hash table until it hits a certain limit in memory, then sort it in memory and write this out. Finally, you can do n-way merging of each chunk. At most you will have 100/4 chunks or so, but probably many fewer provided some words are more common than others (and how they cluster).
Another option is to use a trie which was built for this kind of thing. Each character in the string becomes a branch in a 256-way tree and at the leaf you have the counter. Look up the data structure on the web.

If you can pardon the pun, "trie" this:
public class Trie : Dictionary<char, Trie>
{
public int Frequency { get; set; }
public void Add(string word)
{
this.Add(word.ToCharArray());
}
private void Add(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
this.Add(first, new Trie());
}
if (chars.Length == 1)
{
this[first].Frequency += 1;
}
else
{
this[first].Add(chars.Skip(1).ToArray());
}
}
public int GetFrequency(string word)
{
return this.GetFrequency(word.ToCharArray());
}
private int GetFrequency(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
return 0;
}
if (chars.Length == 1)
{
return this[first].Frequency;
}
else
{
return this[first].GetFrequency(chars.Skip(1).ToArray());
}
}
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
If you find that this too still blows your memory limit then might I suggest that you "divide and conquer". Maybe scan the source data for all the first characters and then run the trie separately against each and then concatenate the results after all of the runs.

do you know how many different words you have? if not a lot (i.e. hundred thousand) then you can stream the input, determine words and use a hash table to keep the counts. after input is done just traverse the result.

Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.

Why not use any relational DB? The procedure would be as simple as:
Create a table with the word and count.
Create index on word. Some databases have word index (f.e. Progress).
Do SELECT on this table with the word.
If word exists then increase counter.
Otherwise - add it to the table.

If you are using python, you can check the built-in iter function. It will read line by line from your file and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values.
def __iter__(self):
for line in open(self.temp_file_name):
yield self.dictionary.doc2bow(line.lower().split())

Related

optimizing the data structure implementation

There is a stream of random characters coming like 'a''b''c''a'... and so on. At any given point in time when I query I need to get the first non repeating character. For example, for the input "abca", 'b' should be returned since a is repeated and the first non repeating character is 'b'.
There needs to be two methods, one for inserting and one for querying.
My solution is to have a linkedList to store the incoming stream characters. While I get the next character, I just compare with all the current characters and if present I will not insert into the end of linkedlist, else I will insert at the end. By this approach, the query will take O(1) since I will get the first element on the linkedlist and insert will take O(n) since I need to compare from the first element till the last element in the worst case.
Is there any better performing way?
Either you haven't explained your algorithm well or it won't return the correct result. In the example a b a, would your algorithm return a (because it is the first element in the linked list)?
Anyway, here is a modification that improves performance. The idea is to use a hash map from characters to (doubly) linked list nodes. This map can be used to determine if a character has already been inserted and to get to the required node quickly. We should allow a null value for the map target (instead of the list node) to express a character that has ocurred more than once already.
The insertion method works as follows:
Check if the map contains the current character (O(1)). If not, add it to the end of the list and add a reference to the map (O(1)).
If the character is already in the map: Check if the pointed to node is null (O(1)). If so, just ignore it. If it is not, remove the pointed to node from the list and update the reference to a null value (O(1)).
Overall, a O(1) operation.
The query works as in your previous solution.
Here is a C# implementation. It's basically a 1:1 translation of the above explanation:
class StreamAnalyzer
{
LinkedList<char> characterList = new LinkedList<char>();
Dictionary<char, LinkedListNode<char>> characterMap
= new Dictionary<char, LinkedListNode<char>>();
public void AddCharacter(char c)
{
LinkedListNode<char> referencedNode;
if (characterMap.TryGetValue(c, out referencedNode))
{
if(referencedNode != null)
{
characterList.Remove(referencedNode);
characterMap[c] = null;
}
}
else
{
var node = new LinkedListNode<char>(c);
characterList.AddLast(node);
characterMap.Add(c, node);
}
}
public char? GetFirstNonRepeatingCharacter()
{
if (characterList.First == null)
return null;
else
return characterList.First.Value;
}
}

Any Java functions for blocking non English words?

Please suggest me the best Java api for removing non English words and blocking incorrect words using
I use an English words list file to parse the given string. The code is responding very slowly. `
String englishword;
while ((englishword = br.readLine()) != null) {
//System.out.println("#"+englishword);
for (String word : wordsArray) {
//System.out.println("#"+word);
if(englishword.trim().toUpperCase().equals(word.trim().toUpperCase()))
{
linetmp = linetmp.replaceAll(word, " ").trim();
break;
}
}
}
if(linetmp!=null)
for(String nonEnglish:linetmp.split("\\s+"))
{
line = line.replaceAll(nonEnglish, "");
}
line = line.replaceAll(" +", " ");
return line;
Please suggest me if there is any faster way to do this
Note: i am using Linux OS's dictionary listy
Make trim() and touppercase() of the checked word only once, out of the for (String word : wordsArray) cycle.
If you'll do excessive heavy operations in the inner cycle, no API will help you.
You can use a Java API function for searching
import org.apache.commons.lang.ArrayUtils;
ArrayUtils.indexOf(array, string);
You can make your code a lot faster1 by changing the wordsArray to a HashSet, and using the contains(String) method to do the checks. (Make sure you convert words to upper case when you build the set.)
However, I would point out that this approach doesn't scale. It is not practical to enumerate all possible "non-English or incorrect" words. You would be better off building a set containing all of the words that you are prepared to accept, and then eliminating the words not in the set.
1 - Currently, your inner loop takes time that is proportional to the number of words (N) in wordArray; i.e. O(N). If you use a HashSet, the operation takes O(1) time; i.e. roughly constant time.
There is a faster way.
Create a HashSet<String> containing all your elements in wordsArray (as lower cases/upper cases).
For each new word englishword check if set.contains(englishword.toLowerCase()).
This solution runs in O(n|S|) pre-processing (creating the HashSet), and checking each word is O(|S|) where |S| is the length of the string and n is number of words in the array, while your solution is basically O(n|S|) per word.
Code snap:
public static class EnglishChecker {
private final Set<String> set;
public EnglishChecker(String[] englishWords) {
set = new HashSet<>();
for (String s : englishWords) {
set.add(s.toLowerCase());
}
}
public boolean isWord(String s) {
return set.contains(s.toLowerCase());
}
}
public static void main(String[] args) {
String[] words = { "Cat", "dog", "mousE" };
EnglishChecker checker = new EnglishChecker(words);
System.out.println(checker.isWord("cat"));
System.out.println(checker.isWord("cccccccat"));
System.out.println(checker.isWord("MOUSE"));
}

Google search suggestion implementation

In a recent amazon interview I was asked to implement Google "suggestion" feature. When a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston". I tried to solve it by using hashing but could not cover the corner cases. Please let me know your thought on this.
There are 4 most common types of erros -
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, we can swap not adjacent letters but any letters but this is not common typo.
Initial state - typed word. Run BFS/DFS from initial vertex. Depth of search is your own choice. Remember that increasing depth of search leads to dramatically increasing number of "probable corrections". I think depth ~ 4-5 is a good start.
After generating "probable corrections" search each generated word-candidate in a dictionary - binary search in sorted dictionary or search in a trie which populated with your dictionary.
Trie is faster but binary search allows searching in Random Access File without loading dictionary to RAM. You have to load only precomputed integer array[]. Array[i] gives you number of bytes to skip for accesing i-th word. Words in Random Acces File should be written in a sorted order. If you have enough RAM to store dictionary use trie.
Before suggesting corrections check typed word - if it is in a dictionary, provide nothing.
UPDATE
Generate corrections should be done by BFS - when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't works, since it make a lot of changes which can be done in one step - for example, Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
static class Pair
{
private String word;
private byte dist;
// dist is byte because dist<=128.
// Moreover, dist<=6 in real application
public Pair(String word,byte dist)
{
this.word = word;
this.dist = dist;
}
public String getWord()
{
return word;
}
public int getDist()
{
return dist;
}
}
public static void main(String[] args) throws Exception
{
HashSet<String> usedWords;
HashSet<String> dict;
ArrayList<String> corrections;
ArrayDeque<Pair> states;
usedWords = new HashSet<String>();
corrections = new ArrayList<String>();
dict = new HashSet<String>();
states = new ArrayDeque<Pair>();
// populate dictionary. In real usage should be populated from prepared file.
dict.add("Jeniffer");
dict.add("Jeniffert"); //depth 2 test
usedWords.add("Jniffer");
states.add(new Pair("Jniffer", (byte)0));
while(!states.isEmpty())
{
Pair head = states.pollFirst();
//System.out.println(head.getWord()+" "+head.getDist());
if(head.getDist()<=2)
{
// checking reached depth.
//4 is the first depth where we don't generate anything
// swap adjacent letters
for(int i=0;i<head.getWord().length()-1;i++)
{
// swap i-th and i+1-th letters
String newWord = head.getWord().substring(0,i)+head.getWord().charAt(i+1)+head.getWord().charAt(i)+head.getWord().substring(i+2);
// even if i==curWord.length()-2 and then i+2==curWord.length
//substring(i+2) doesn't throw exception and returns empty string
// the same for substring(0,i) when i==0
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
// insert letters
for(int i=0;i<=head.getWord().length();i++)
for(char ch='a';ch<='z';ch++)
{
String newWord = head.getWord().substring(0,i)+ch+head.getWord().substring(i);
if(!usedWords.contains(newWord))
{
usedWords.add(newWord);
if(dict.contains(newWord))
{
corrections.add(newWord);
}
states.addLast(new Pair(newWord, (byte)(head.getDist()+1)));
}
}
}
}
for(String correction:corrections)
{
System.out.println("Did you mean "+correction+"?");
}
usedWords.clear();
corrections.clear();
// helper data structures must be cleared after each generateCorrections call - must be empty for the future usage.
}
Words in a dictionary - Jeniffer,Jeniffert. Jeniffert is just for testing)
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I choose depth of generating = 2. In real application depth should be 4-6, but as number of combinations grows exponentially, I don't go so deep. There are some optomizations devoted to reduce number of branches in a searching tree but I don't think much about them. I wrote only main idea.
Also, I used HashSet for storing dictionary and for labeling used words. It seems HashSet's constant is too large when it containt million objects. May be you should use trie both for word in a dictionary checking and for is word labeled checking.
I didn't implement erase letters and change letters operations because I want to show only main idea.

How to eliminate duplicate filename in hadoop mapreduce?

I want to eliminate duplicate filenames in my output of the hadoop mapreduce inverted index program. For example, the output is like - things : doc1,doc1,doc1,doc2 but I want it to be like
things : doc1,doc2
Well you want to remove duplicates which were mapped, i.e. you want to reduce the intermediate value list to an output list with no duplicates. My best bet would be to simply convert the Iterator<Text> in the reduce() method to a java Set and iterate over it changing:
while (values.hasNext()) {
if (!first)
toReturn.append(", ") ;
first = false;
toReturn.append(values.next().toString());
}
To something like:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
valueSet.add(values.next());
}
for(Text value : valueSet) {
if(!first) {
toReturn.append(", ");
}
first = false;
toReturn.append(value.toString());
}
Unfortunately I do not know of any better (more concise) way of converting an Iterator to a Set.
This should have a smaller time complexity than orange's solution but a higher memory consumption.
#Edit: a bit shorter:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
Text next = values.next();
if(!valueSet.contains(next)) {
if(!first) {
toReturn.append(", ");
}
first = false;
toReturn.append(value.toString());
valueSet.add(next);
}
}
Contains should be (just like add) constant time so it should be O(n) now.
To do this with the minimal amount of code change, just add an if-statement that checks to see if the thing you are about to append is already in toReturn:
if (!first)
toReturn.append(", ") ;
first = false;
toReturn.append(values.next().toString());
gets changed to
String v = values.next().toString()
if (toReturn.indexOf(v) == -1) { // indexOf returns -1 if it is not there
if (!first) {
toReturn.append(", ") ;
}
toReturn.append(v);
first = false
}
The above solution is a bit slow because it has to traverse the entire string every time to see if that string is there. Likely the best way to do this is to use a HashSet to collect the items, then combining the values in the HashSet into a final output string.

Calculate each Word Occurrence in large document

I was wondering how can I solve this problem by using which data structure.. Can anyone explain this in detail...!! I was thinking to use tree.
There is a large document. Which contains millions of words. so how you will calculate a each word occurrence count in an optimal way?
This question was asked in Microsoft... Any suggestions will be appreciated..!!
I'd just use a hash map (or Dictionary, since this is Microsoft ;) ) of strings to integers. For each word of the input, either add it to the dictionary if it's new, or increment its count otherwise. O(n) over the length of the input, assuming the hash map implementation is decent.
Using a dictionary or hash set will result in o(n) on average.
To solve it in o(n) worst case, a trie with a small change should be used:
add a counter to each word representation in the trie; Each time a word that is inserted already exists, increment its counter.
If you want to print all the amounts at the end, you can keep the counters on a different list, and reference it from the trie instead storing the counter in the trie.
class IntValue
{
public IntValue(int value)
{
Value = value;
}
public int Value;
}
static void Main(string[] args)
{
//assuming document is a enumerator for the word in the document:
Dictionary<string, IntValue> dict = new Dictionary<string, IntValue>();
foreach (string word in document)
{
IntValue intValue;
if(!dict.TryGetValue(word, out intValue))
{
intValue = new IntValue(0);
dict.Add(word, intValue);
}
++intValue.Value;
}
//now dict contains the counts
}
Tree would not work here.
Hashtable ht = new Hashtable();
// Read each word in the text in its order, for each of them:
if (ht.contains(oneWord))
{
Integer I = (Integer) ht.get(oneWord));
ht.put(oneWord, new Integer(I.intValue()+1));
}
else
{
ht.put(oneWord, new Integer(1));
}

Resources