Any Java functions for blocking non-English words? - algorithm

Please suggest the best Java API for removing non-English words and blocking incorrect words.
I use an English word-list file to parse the given string, but the code runs very slowly.
String englishword;
while ((englishword = br.readLine()) != null) {
    //System.out.println("#" + englishword);
    for (String word : wordsArray) {
        //System.out.println("#" + word);
        if (englishword.trim().toUpperCase().equals(word.trim().toUpperCase())) {
            linetmp = linetmp.replaceAll(word, " ").trim();
            break;
        }
    }
}
if (linetmp != null) {
    for (String nonEnglish : linetmp.split("\\s+")) {
        line = line.replaceAll(nonEnglish, "");
    }
}
line = line.replaceAll(" +", " ");
return line;
Please suggest a faster way to do this.
Note: I am using the Linux OS's dictionary word list.

Call trim() and toUpperCase() on the checked word only once, outside the for (String word : wordsArray) loop.
If you do heavy operations in the inner loop, no API will help you.
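For example, the inner loop from the question could be rewritten like this (a sketch using the question's own variables; ideally wordsArray itself would also be normalized once, up front):
String needle = englishword.trim().toUpperCase(); // computed once per line, not once per dictionary word
for (String word : wordsArray) {
    if (needle.equals(word.trim().toUpperCase())) { // word.trim().toUpperCase() should also be precomputed
        linetmp = linetmp.replaceAll(word, " ").trim();
        break;
    }
}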
You can use a library function for the search itself:
import org.apache.commons.lang.ArrayUtils;
ArrayUtils.indexOf(array, string);

You can make your code a lot faster [1] by changing wordsArray to a HashSet, and using the contains(String) method to do the checks. (Make sure you convert words to upper case when you build the set.)
However, I would point out that this approach doesn't scale. It is not practical to enumerate all possible "non-English or incorrect" words. You would be better off building a set containing all of the words that you are prepared to accept, and then eliminating the words not in that set.
[1] Currently, your inner loop takes time proportional to the number of words N in wordsArray, i.e. O(N). If you use a HashSet instead, the lookup takes O(1), i.e. roughly constant time.

There is a faster way.
Create a HashSet<String> containing all the elements of wordsArray (lower-cased or upper-cased, consistently).
For each new word englishword, check set.contains(englishword.toLowerCase()).
This solution needs O(n|S|) pre-processing to create the HashSet, and checking each word is then O(|S|), where |S| is the length of the string and n is the number of words in the array, while your solution is basically O(n|S|) per word.
Code snippet:
public static class EnglishChecker {
    private final Set<String> set;

    public EnglishChecker(String[] englishWords) {
        set = new HashSet<>();
        for (String s : englishWords) {
            set.add(s.toLowerCase());
        }
    }

    public boolean isWord(String s) {
        return set.contains(s.toLowerCase());
    }
}

public static void main(String[] args) {
    String[] words = { "Cat", "dog", "mousE" };
    EnglishChecker checker = new EnglishChecker(words);
    System.out.println(checker.isWord("cat"));       // true
    System.out.println(checker.isWord("cccccccat")); // false
    System.out.println(checker.isWord("MOUSE"));     // true
}

Related

Google search suggestion implementation

In a recent Amazon interview I was asked to implement Google's "suggestion" feature: when a user enters "Aeniffer Aninston", Google suggests "Did you mean Jeniffer Aninston?". I tried to solve it using hashing but could not cover the corner cases. Please let me know your thoughts on this.
There are four common types of errors:
Omitted letter: "stck" instead of "stack"
One letter typo: "styck" instead of "stack"
Extra letter: "starck" instead of "stack"
Adjacent letters swapped: "satck" instead of "stack"
BTW, we could also swap non-adjacent letters, but that is not a common typo.
Initial state: the typed word. Run BFS/DFS from the initial vertex. The depth of the search is your own choice; remember that increasing the depth leads to a dramatically increasing number of "probable corrections". I think a depth of about 4-5 is a good start.
After generating the "probable corrections", look each generated candidate word up in a dictionary: binary search in a sorted dictionary, or a lookup in a trie populated with your dictionary.
A trie is faster, but binary search allows searching a RandomAccessFile without loading the dictionary into RAM. You only have to load a precomputed integer array[], where array[i] gives the number of bytes to skip to access the i-th word. The words in the random-access file must be written in sorted order. If you have enough RAM to store the dictionary, use a trie.
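A minimal sketch of that binary search (assumptions: the words are stored one per line in sorted order in a single-byte encoding, and offsets[i] holds the byte position of the i-th word):
static boolean inDictionary(RandomAccessFile raf, long[] offsets, String target) throws IOException {
    int lo = 0, hi = offsets.length - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        raf.seek(offsets[mid]);        // jump straight to the mid-th word
        String word = raf.readLine();  // reads up to the next line terminator
        int cmp = word.compareTo(target);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}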
Before suggesting corrections, check the typed word itself: if it is already in the dictionary, suggest nothing.
UPDATE
Generating corrections should be done by BFS: when I tried DFS, entries like "Jeniffer" showed "edit distance = 3". DFS doesn't work, since it makes many changes that could be done in one step, for example Jniffer->nJiffer->enJiffer->eJniffer->Jeniffer instead of Jniffer->Jeniffer.
Sample code for generating corrections by BFS
static class Pair {
    private String word;
    private byte dist;
    // dist is stored in a byte because it stays small
    // (dist <= 6 in a real application)

    public Pair(String word, byte dist) {
        this.word = word;
        this.dist = dist;
    }

    public String getWord() {
        return word;
    }

    public int getDist() {
        return dist;
    }
}

public static void main(String[] args) throws Exception {
    HashSet<String> usedWords = new HashSet<String>();
    HashSet<String> dict = new HashSet<String>();
    ArrayList<String> corrections = new ArrayList<String>();
    ArrayDeque<Pair> states = new ArrayDeque<Pair>();

    // Populate the dictionary. In real usage it should be populated from a prepared file.
    dict.add("Jeniffer");
    dict.add("Jeniffert"); // depth-2 test

    usedWords.add("Jniffer");
    states.add(new Pair("Jniffer", (byte) 0));

    while (!states.isEmpty()) {
        Pair head = states.pollFirst();
        //System.out.println(head.getWord() + " " + head.getDist());
        if (head.getDist() <= 2) {
            // Checking the reached depth; past this bound we don't generate anything.

            // Swap adjacent letters.
            for (int i = 0; i < head.getWord().length() - 1; i++) {
                // Swap the i-th and (i+1)-th letters.
                // Even if i == curWord.length()-2 and thus i+2 == curWord.length(),
                // substring(i+2) doesn't throw an exception and returns the empty string;
                // the same holds for substring(0, i) when i == 0.
                String newWord = head.getWord().substring(0, i)
                        + head.getWord().charAt(i + 1)
                        + head.getWord().charAt(i)
                        + head.getWord().substring(i + 2);
                if (!usedWords.contains(newWord)) {
                    usedWords.add(newWord);
                    if (dict.contains(newWord)) {
                        corrections.add(newWord);
                    }
                    states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
                }
            }

            // Insert letters.
            for (int i = 0; i <= head.getWord().length(); i++) {
                for (char ch = 'a'; ch <= 'z'; ch++) {
                    String newWord = head.getWord().substring(0, i) + ch + head.getWord().substring(i);
                    if (!usedWords.contains(newWord)) {
                        usedWords.add(newWord);
                        if (dict.contains(newWord)) {
                            corrections.add(newWord);
                        }
                        states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
                    }
                }
            }
        }
    }

    for (String correction : corrections) {
        System.out.println("Did you mean " + correction + "?");
    }

    // The helper data structures must be cleared after each generateCorrections call,
    // so they are empty for future use.
    usedWords.clear();
    corrections.clear();
}
Words in the dictionary: Jeniffer, Jeniffert (Jeniffert is just for testing).
Output:
Did you mean Jeniffer?
Did you mean Jeniffert?
Important!
I chose a generation depth of 2. In a real application the depth should be 4-6, but since the number of combinations grows exponentially, I don't go that deep here. There are some optimizations aimed at reducing the number of branches in the search tree, but I haven't thought much about them; I wrote only the main idea.
Also, I used a HashSet for storing the dictionary and for labeling used words. The HashSet's constant factor seems too large when it contains millions of objects. Maybe you should use a trie both for checking whether a word is in the dictionary and for checking whether a word is already labeled.
I didn't implement the erase-letter and change-letter operations because I only want to show the main idea.
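For completeness, the two omitted operations could be generated in the same style as the loops above (a sketch; the visit helper is a hypothetical refactoring of the duplicated loop body):
static void visit(String newWord, Pair head, HashSet<String> usedWords, HashSet<String> dict,
                  ArrayList<String> corrections, ArrayDeque<Pair> states) {
    if (!usedWords.contains(newWord)) {
        usedWords.add(newWord);
        if (dict.contains(newWord)) {
            corrections.add(newWord);
        }
        states.addLast(new Pair(newWord, (byte) (head.getDist() + 1)));
    }
}

// Erase the i-th letter ("starck" -> "stack"):
for (int i = 0; i < head.getWord().length(); i++) {
    visit(head.getWord().substring(0, i) + head.getWord().substring(i + 1),
          head, usedWords, dict, corrections, states);
}

// Change the i-th letter ("styck" -> "stack"):
for (int i = 0; i < head.getWord().length(); i++) {
    for (char ch = 'a'; ch <= 'z'; ch++) {
        visit(head.getWord().substring(0, i) + ch + head.getWord().substring(i + 1),
              head, usedWords, dict, corrections, states);
    }
}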

How to eliminate duplicate filename in hadoop mapreduce?

I want to eliminate duplicate filenames in the output of my Hadoop MapReduce inverted-index program. For example, the output is like - things : doc1,doc1,doc1,doc2 - but I want it to be like
things : doc1,doc2
Well, you want to remove duplicates that were mapped, i.e. you want to reduce the intermediate value list to an output list with no duplicates. My best bet would be to simply collect the Iterator<Text> in the reduce() method into a Java Set and iterate over that, changing:
while (values.hasNext()) {
    if (!first)
        toReturn.append(", ");
    first = false;
    toReturn.append(values.next().toString());
}
To something like:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
    // Copy the value: Hadoop reuses the same Text instance across next() calls,
    // so adding it directly would corrupt the set.
    valueSet.add(new Text(values.next()));
}
for (Text value : valueSet) {
    if (!first) {
        toReturn.append(", ");
    }
    first = false;
    toReturn.append(value.toString());
}
Unfortunately I do not know of any better (more concise) way of converting an Iterator to a Set.
This should have a smaller time complexity than orange's solution but a higher memory consumption.
#Edit: a bit shorter:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
    Text next = values.next();
    if (!valueSet.contains(next)) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(next.toString());
        valueSet.add(new Text(next)); // copy, since Hadoop reuses the Text instance
    }
}
contains should be constant time (just like add), so the whole loop should be O(n) now.
To do this with the minimal amount of code change, just add an if-statement that checks to see if the thing you are about to append is already in toReturn:
if (!first)
    toReturn.append(", ");
first = false;
toReturn.append(values.next().toString());
gets changed to
String v = values.next().toString();
if (toReturn.indexOf(v) == -1) { // indexOf returns -1 if the string is not there
    if (!first) {
        toReturn.append(", ");
    }
    toReturn.append(v);
    first = false;
}
The above solution is a bit slow because it has to traverse the entire string every time to check whether the value is already there. Likely the best way to do this is to use a HashSet to collect the items, then combine the values in the HashSet into a final output string.
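Putting that together, a full old-API reducer might look like this (a sketch only; the surrounding job setup is omitted, and LinkedHashSet is chosen so the first-seen document order is kept):
public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    Set<String> docs = new LinkedHashSet<String>(); // drops duplicates, keeps first-seen order
    while (values.hasNext()) {
        docs.add(values.next().toString());         // the String copy avoids the reused-Text pitfall
    }
    StringBuilder toReturn = new StringBuilder();
    for (String doc : docs) {
        if (toReturn.length() > 0) {
            toReturn.append(", ");
        }
        toReturn.append(doc);
    }
    output.collect(key, new Text(toReturn.toString()));
}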

Calculate each Word Occurrence in large document

I was wondering how I can solve this problem and with which data structure. Can anyone explain this in detail? I was thinking of using a tree.
There is a large document, which contains millions of words. How would you calculate each word's occurrence count in an optimal way?
This question was asked at Microsoft... Any suggestions will be appreciated!
I'd just use a hash map (or Dictionary, since this is Microsoft ;)) from strings to integers. For each word of the input, either add it to the dictionary if it's new, or increment its count otherwise. That's O(n) over the length of the input, assuming the hash map implementation is decent.
Using a dictionary or hash set will result in O(n) on average.
To solve it in O(n) worst case, a trie with a small change should be used: add a counter to each word representation in the trie, and each time an inserted word already exists, increment its counter.
If you want to print all the counts at the end, you can keep the counters in a separate list and reference it from the trie, instead of storing the counters in the trie itself.
class IntValue
{
    public IntValue(int value)
    {
        Value = value;
    }
    public int Value;
}

static void Main(string[] args)
{
    // Assuming document is an enumerable of the words in the document:
    Dictionary<string, IntValue> dict = new Dictionary<string, IntValue>();
    foreach (string word in document)
    {
        IntValue intValue;
        if (!dict.TryGetValue(word, out intValue))
        {
            intValue = new IntValue(0);
            dict.Add(word, intValue);
        }
        ++intValue.Value;
    }
    // Now dict contains the counts.
}
Tree would not work here.
Hashtable ht = new Hashtable();
// Read each word of the text in order; for each of them:
if (ht.containsKey(oneWord)) // containsKey, not contains: Hashtable.contains checks values
{
    Integer count = (Integer) ht.get(oneWord);
    ht.put(oneWord, new Integer(count.intValue() + 1));
}
else
{
    ht.put(oneWord, new Integer(1));
}
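On a modern JVM the same idea is shorter with HashMap.merge (a sketch, assuming the text has already been tokenized into a words collection):
Map<String, Integer> counts = new HashMap<>();
for (String word : words) {
    counts.merge(word, 1, Integer::sum); // insert 1 the first time, increment afterwards
}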

LINQ and CASE Sensitivity

I have this LINQ Query:
TempRecordList = new ArrayList(TempRecordList.Cast<string>().OrderBy(s => s.Substring(9, 30)).ToArray());
It works great and performs sorting in a way that's accurate, but a little different from what I want. Among the results of the query I see something like this:
Palm-Bouter, Peter
Palmer-Johnson, Sean
Whereas what I really need is to have the names sorted like this:
Palmer-Johnson, Sean
Palm-Bouter, Peter
Basically I want the '-' character to be treated as sorting after letters, so that names containing it show up later in an ascending sort.
Here is another example. I get:
Dias, Reginald
DiBlackley, Anton
Instead of:
DiBlackley, Anton
Dias, Reginald
As you can see, again, the order is switched because of how the uppercase letter 'B' is treated.
So my question is: what do I need to change in my LINQ query to make it return results in the order I specified? Any feedback would be greatly appreciated.
By the way, I tried using s.Substring(9, 30).ToLower(), but that didn't help.
Thank you!
To customize the sort order you will need to create a comparer class that implements the IComparer<string> interface. The OrderBy() method takes the comparer as a second parameter.
internal sealed class NameComparer : IComparer<string> {
    private static readonly NameComparer DefaultInstance = new NameComparer();
    static NameComparer() { }
    private NameComparer() { }

    public static NameComparer Default {
        get { return DefaultInstance; }
    }

    public int Compare(string x, string y) {
        int length = Math.Min(x.Length, y.Length);
        for (int i = 0; i < length; ++i) {
            if (x[i] == y[i]) continue;
            if (x[i] == '-') return 1;  // '-' sorts after every other character
            if (y[i] == '-') return -1;
            return x[i].CompareTo(y[i]);
        }
        return x.Length - y.Length;
    }
}
This works at least with the following test cases:
var names = new[] {
    "Palmer-Johnson, Sean",
    "Palm-Bouter, Peter",
    "Dias, Reginald",
    "DiBlackley, Anton",
};
var sorted = names.OrderBy(name => name, NameComparer.Default).ToList();
// sorted:
// [0]: "DiBlackley, Anton"
// [1]: "Dias, Reginald"
// [2]: "Palmer-Johnson, Sean"
// [3]: "Palm-Bouter, Peter"
As already mentioned, the OrderBy() method takes a comparer as a second parameter.
For strings, you don't necessarily have to implement an IComparer<string>. You might be fine with System.StringComparer.CurrentCulture (or one of the others in System.StringComparer).
In your exact case, however, there is no built-in comparer that also handles the hyphen-after-letters sort order.
OrderBy() returns results in ascending order.
'e' comes before 'h', hence the first result (remember you're comparing on a substring that starts at the character in position 9, not at the beginning of the string), and 'i' comes before 'y', hence the second. Case sensitivity has nothing to do with it.
If you want results in descending order, you should use OrderByDescending():
TempRecordList.Cast<string>()
    .OrderByDescending(s => s.Substring(9, 30)).ToArray();
You might want to just implement a custom IComparer object that will give a custom priority to special, upper-case and lower-case characters.
http://msdn.microsoft.com/en-us/library/system.collections.icomparer.aspx

word distribution problem

I have a big file of words, ~100 GB, and limited memory, 4 GB. I need to calculate the word distribution from this file. Right now, one option is to divide it into chunks, sort each chunk, and then merge them to calculate the word distribution. Is there any other way to do it faster? One idea is to sample, but I'm not sure how to implement it so that it returns a close-to-correct solution.
Thanks
You can build a Trie structure where each leaf (and some inner nodes) contains the current count. Since words share prefixes, 4 GB should be enough to process 100 GB of data.
Naively I would just build up a hash table until it hits a certain memory limit, then sort it in memory and write it out as a chunk. Finally, you can do an n-way merge of the chunks (see the sketch below). At most you will have around 100/4 = 25 chunks, but probably many fewer, provided some words are more common than others (and depending on how they cluster).
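The n-way merge step might look like this (a sketch, assuming each chunk was written as lines of the form "word<TAB>count", sorted by word; MergeCounts and its helpers are names made up for illustration):
import java.io.*;
import java.util.*;

public class MergeCounts {
    static final class Entry {
        final String word;
        final long count;
        final BufferedReader src; // the chunk this entry came from

        Entry(String word, long count, BufferedReader src) {
            this.word = word;
            this.count = count;
            this.src = src;
        }
    }

    // Reads the next "word\tcount" line of a chunk, or returns null when exhausted.
    static Entry read(BufferedReader r) throws IOException {
        String line = r.readLine();
        if (line == null) return null;
        int tab = line.indexOf('\t');
        return new Entry(line.substring(0, tab), Long.parseLong(line.substring(tab + 1)), r);
    }

    static void merge(List<BufferedReader> chunks, PrintWriter out) throws IOException {
        PriorityQueue<Entry> heap = new PriorityQueue<>(Comparator.comparing((Entry e) -> e.word));
        for (BufferedReader r : chunks) { // seed the heap with the head of every chunk
            Entry e = read(r);
            if (e != null) heap.add(e);
        }
        String word = null;
        long total = 0;
        while (!heap.isEmpty()) {
            Entry e = heap.poll();
            if (!e.word.equals(word)) { // new word: flush the finished total
                if (word != null) out.println(word + "\t" + total);
                word = e.word;
                total = 0;
            }
            total += e.count;
            Entry next = read(e.src); // refill from the same chunk
            if (next != null) heap.add(next);
        }
        if (word != null) out.println(word + "\t" + total);
    }
}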
Another option is to use a trie, which was built for this kind of thing: each character in the string becomes a branch in a 256-way tree, and at the leaf you have the counter. Look up the data structure on the web.
If you can pardon the pun, "trie" this:
using System.Collections.Generic;
using System.Linq;

public class Trie : Dictionary<char, Trie>
{
    public int Frequency { get; set; }

    public void Add(string word)
    {
        this.Add(word.ToCharArray());
    }

    private void Add(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }
        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            this.Add(first, new Trie());
        }
        if (chars.Length == 1)
        {
            this[first].Frequency += 1;
        }
        else
        {
            this[first].Add(chars.Skip(1).ToArray());
        }
    }

    public int GetFrequency(string word)
    {
        return this.GetFrequency(word.ToCharArray());
    }

    private int GetFrequency(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }
        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            return 0;
        }
        if (chars.Length == 1)
        {
            return this[first].Frequency;
        }
        else
        {
            return this[first].GetFrequency(chars.Skip(1).ToArray());
        }
    }
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
If you find that this still blows your memory limit, then might I suggest that you "divide and conquer": maybe scan the source data for all the first characters, then run the trie separately for each, and concatenate the results after all of the runs.
Do you know how many distinct words you have? If not too many (i.e. a few hundred thousand), then you can stream the input, determine the words, and use a hash table to keep the counts. After the input is done, just traverse the result.
Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.
Why not use any relational DB? The procedure would be as simple as (see the sketch below):
Create a table with the word and its count.
Create an index on the word. Some databases have a word index (e.g. Progress).
Do a SELECT on this table with the word.
If the word exists, then increase its counter.
Otherwise, add it to the table.
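A rough JDBC sketch of those steps (the word_counts table, the cnt column, and the conn and words variables are assumptions for illustration; the DDL will vary by database):
// Assumes a table created once, e.g.:
// CREATE TABLE word_counts (word VARCHAR(255) PRIMARY KEY, cnt BIGINT)
PreparedStatement sel = conn.prepareStatement("SELECT cnt FROM word_counts WHERE word = ?");
PreparedStatement upd = conn.prepareStatement("UPDATE word_counts SET cnt = cnt + 1 WHERE word = ?");
PreparedStatement ins = conn.prepareStatement("INSERT INTO word_counts (word, cnt) VALUES (?, 1)");
for (String word : words) {
    sel.setString(1, word);
    try (ResultSet rs = sel.executeQuery()) {
        if (rs.next()) {
            upd.setString(1, word);
            upd.executeUpdate(); // word exists: increase its counter
        } else {
            ins.setString(1, word);
            ins.executeUpdate(); // new word: add it to the table
        }
    }
}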
If you are using Python, you can rely on the iterator protocol and write a generator. It will read the file line by line and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values:
def __iter__(self):
    for line in open(self.temp_file_name):
        yield self.dictionary.doc2bow(line.lower().split())
