How to eliminate duplicate filename in hadoop mapreduce? - hadoop

I want to eliminate duplicate filenames in my output of the hadoop mapreduce inverted index program. For example, the output is like - things : doc1,doc1,doc1,doc2 but I want it to be like
things : doc1,doc2

Well you want to remove duplicates which were mapped, i.e. you want to reduce the intermediate value list to an output list with no duplicates. My best bet would be to simply convert the Iterator<Text> in the reduce() method to a java Set and iterate over it changing:
while (values.hasNext()) {
if (!first)
toReturn.append(", ") ;
first = false;
toReturn.append(values.next().toString());
}
To something like:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
valueSet.add(values.next());
}
for(Text value : valueSet) {
if(!first) {
toReturn.append(", ");
}
first = false;
toReturn.append(value.toString());
}
Unfortunately I do not know of any better (more concise) way of converting an Iterator to a Set.
This should have a smaller time complexity than orange's solution but a higher memory consumption.
#Edit: a bit shorter:
Set<Text> valueSet = new HashSet<Text>();
while (values.hasNext()) {
Text next = values.next();
if(!valueSet.contains(next)) {
if(!first) {
toReturn.append(", ");
}
first = false;
toReturn.append(value.toString());
valueSet.add(next);
}
}
Contains should be (just like add) constant time so it should be O(n) now.

To do this with the minimal amount of code change, just add an if-statement that checks to see if the thing you are about to append is already in toReturn:
if (!first)
toReturn.append(", ") ;
first = false;
toReturn.append(values.next().toString());
gets changed to
String v = values.next().toString()
if (toReturn.indexOf(v) == -1) { // indexOf returns -1 if it is not there
if (!first) {
toReturn.append(", ") ;
}
toReturn.append(v);
first = false
}
The above solution is a bit slow because it has to traverse the entire string every time to see if that string is there. Likely the best way to do this is to use a HashSet to collect the items, then combining the values in the HashSet into a final output string.

Related

What's the most efficient way of filtering a string with numbers at the end (e.g. foo12)?

Here's a self-thought up quiz very similar to a real life problem that I'm facing.
Say I have a list of strings (say it's called stringlist), and among them some have two digit numbers attached at the end. For example, "foo", "foo01", "foo24".
I want to group those with the same letters (but with different two digit numbers at the end).
So, "foo", "foo01", and "foo24" would be under the group "foo".
However, I can't just check for any string that begins with "foo", because we can also have "food", "food08", "food42".
There are no duplicates.
It is possible to have numbers in the middle. Ex) "foo543food43" is under group "foo543food"
Or even multiple numbers at then end. Ex) "foo1234" is under group "foo12"
Most obvious solution I can think of is having a list of numbers.
numbers = ["0", "1", "2", ... "9"]
Then, I would do
grouplist = [[]] //Of the form: [[group_name1, word_index1, word_index2, ...], [group_name2, ...]]
for(word_index=0; word_index < len(stringlist); word_index++) //loop through stringlist
for(char_index=0; char_index < len(stringlist[word_index]); char_index++) //loop through the word
if(char_index == len(stringlist[word_index])-1) //Reached the end
for(number1 in numbers)
if(char_index == number1) //Found a number at the end
for(number2 in numbers)
if(char_index-1 == number2) //Found another number one before the end
group_name = stringlist[word_index].substring(0,char_index-1)
for(group_element in grouplist)
if(group_element[0] == group_name) //Does that group name exist already? If so, add the index to the end. If not, add the group name and the index.
group_element.append(word_index)
else
group_element.append([stringlist[word_index].substring(0,char_index-1), word_index])
break //If you found the first number, stop looping through numbers
break //If you found the second number, stop looping through numbers
Now this looks messy as hell. Any cleaner way you guys can think of?
Any of the data structures including the final result's can be what you want it to be.
I would create a map that maps the group-name to a list of all String of the corresponding group.
Here my approach in java:
public Map<String, List<String>> createGroupMap(Lust<String> listOfAllStrings){
Map<String, List<String>> result= new Hashmap<>();
for(String s: listOfAllStrings){
addToMap(result, s)
}
}
private addToMap(Map<String, List<String>> map, String s){
String group=getGroupName(s);
if(!map.containsKey(group))
map.put(group,new ArrayList<String>();
map.get(group).add(s);
}
private String getGroupName(String s){
return s.replaceFirst("\\d+$", "");
}
Maybe you can gain some speed by avoiding the RegExp in getGroupName(..) but you need to profile it to be sure that an implementation without RegExp would be faster.
You can divide the string into 2 parts like this.
pair<string, int> divide(string s) {
int r = 0;
if(isdigit(s.back())) {
r = s.back() - '0';
s.pop_back();
if(isdigit(s.back())) {
r += 10 * (s.back() - '0');
s.pop_back();
}
}
return {s, r}
}

Difference between one pass (scan) and two pass(scan)

I had an Interview, a day before.
The Interviewer told me to , " Write a program to add a node at the end of a linked list ".
I had given him a solution. but he told me to implement it in one pass (one scan).
Can Anybody explain me, whats the meaning of one pass, and how to find the program written is in one pass or two pass?
Here is my code
public void atLast(int new_data)
{
Node new_node=new Node(new_data);
if(head==null)
{
head=new Node(new_data);
return;
}
new_node.next=null;
Node last=head;
while(last.next!=null)
{
last=last.next;
}
last.next=new_node;
return;
}
If that is the code you gave the interviewer must have misread it because it is a single pass.
In your case a "pass" would be your while loop. It could also be done with recursion, for, or any other type of loop that goes through the elements in the array (or other form of a list of items).
In your code you run through the list of Node and insert the element at the end. This is done in one loop making it a single pass.
Now to look at a case with two passes. Say for example you were asked to remove the element with the largest value and wrote something similar to this:
int index = 0;
int count = 0;
int max = 0;
while(temp_node != null)
{
if(temp_node.data > max)
{
index = count;
max = temp_node.data;
}
count++;
temp_node = temp_node.next;
}
for(int i = 0; i < count; i++)
{
if(i == index)
{
//Functionality to remove node.
}
}
The first pass (while) detects the Node which has the maximum value. The second pass (for) removes this Node by looping through all the elements again until the correct one is found.
I'd imagine "two passes" here means that you iterated through the whole list twice in your code. You shouldn't need to do that to add a new node.

Any Java functions for blocking non English words?

Please suggest me the best Java api for removing non English words and blocking incorrect words using
I use an English words list file to parse the given string. The code is responding very slowly. `
String englishword;
while ((englishword = br.readLine()) != null) {
//System.out.println("#"+englishword);
for (String word : wordsArray) {
//System.out.println("#"+word);
if(englishword.trim().toUpperCase().equals(word.trim().toUpperCase()))
{
linetmp = linetmp.replaceAll(word, " ").trim();
break;
}
}
}
if(linetmp!=null)
for(String nonEnglish:linetmp.split("\\s+"))
{
line = line.replaceAll(nonEnglish, "");
}
line = line.replaceAll(" +", " ");
return line;
Please suggest me if there is any faster way to do this
Note: i am using Linux OS's dictionary listy
Make trim() and touppercase() of the checked word only once, out of the for (String word : wordsArray) cycle.
If you'll do excessive heavy operations in the inner cycle, no API will help you.
You can use a Java API function for searching
import org.apache.commons.lang.ArrayUtils;
ArrayUtils.indexOf(array, string);
You can make your code a lot faster1 by changing the wordsArray to a HashSet, and using the contains(String) method to do the checks. (Make sure you convert words to upper case when you build the set.)
However, I would point out that this approach doesn't scale. It is not practical to enumerate all possible "non-English or incorrect" words. You would be better off building a set containing all of the words that you are prepared to accept, and then eliminating the words not in the set.
1 - Currently, your inner loop takes time that is proportional to the number of words (N) in wordArray; i.e. O(N). If you use a HashSet, the operation takes O(1) time; i.e. roughly constant time.
There is a faster way.
Create a HashSet<String> containing all your elements in wordsArray (as lower cases/upper cases).
For each new word englishword check if set.contains(englishword.toLowerCase()).
This solution runs in O(n|S|) pre-processing (creating the HashSet), and checking each word is O(|S|) where |S| is the length of the string and n is number of words in the array, while your solution is basically O(n|S|) per word.
Code snap:
public static class EnglishChecker {
private final Set<String> set;
public EnglishChecker(String[] englishWords) {
set = new HashSet<>();
for (String s : englishWords) {
set.add(s.toLowerCase());
}
}
public boolean isWord(String s) {
return set.contains(s.toLowerCase());
}
}
public static void main(String[] args) {
String[] words = { "Cat", "dog", "mousE" };
EnglishChecker checker = new EnglishChecker(words);
System.out.println(checker.isWord("cat"));
System.out.println(checker.isWord("cccccccat"));
System.out.println(checker.isWord("MOUSE"));
}

Calculate each Word Occurrence in large document

I was wondering how can I solve this problem by using which data structure.. Can anyone explain this in detail...!! I was thinking to use tree.
There is a large document. Which contains millions of words. so how you will calculate a each word occurrence count in an optimal way?
This question was asked in Microsoft... Any suggestions will be appreciated..!!
I'd just use a hash map (or Dictionary, since this is Microsoft ;) ) of strings to integers. For each word of the input, either add it to the dictionary if it's new, or increment its count otherwise. O(n) over the length of the input, assuming the hash map implementation is decent.
Using a dictionary or hash set will result in o(n) on average.
To solve it in o(n) worst case, a trie with a small change should be used:
add a counter to each word representation in the trie; Each time a word that is inserted already exists, increment its counter.
If you want to print all the amounts at the end, you can keep the counters on a different list, and reference it from the trie instead storing the counter in the trie.
class IntValue
{
public IntValue(int value)
{
Value = value;
}
public int Value;
}
static void Main(string[] args)
{
//assuming document is a enumerator for the word in the document:
Dictionary<string, IntValue> dict = new Dictionary<string, IntValue>();
foreach (string word in document)
{
IntValue intValue;
if(!dict.TryGetValue(word, out intValue))
{
intValue = new IntValue(0);
dict.Add(word, intValue);
}
++intValue.Value;
}
//now dict contains the counts
}
Tree would not work here.
Hashtable ht = new Hashtable();
// Read each word in the text in its order, for each of them:
if (ht.contains(oneWord))
{
Integer I = (Integer) ht.get(oneWord));
ht.put(oneWord, new Integer(I.intValue()+1));
}
else
{
ht.put(oneWord, new Integer(1));
}

word distribution problem

I have a big file of words ~100 Gb and have limited memory 4Gb. I need to calculate word distribution from this file. Now one option is to divide it into chunks and sort each chunk and then merge to calculate word distribution. Is there any other way it can be done faster? One idea is to sample but not sure how to implement it to return close to correct solution.
Thanks
You can build a Trie structure where each leaf (and some nodes) will contain the current count. As words will intersect with each other 4GB should be enough to process 100 GB of data.
Naively I would just build up a hash table until it hits a certain limit in memory, then sort it in memory and write this out. Finally, you can do n-way merging of each chunk. At most you will have 100/4 chunks or so, but probably many fewer provided some words are more common than others (and how they cluster).
Another option is to use a trie which was built for this kind of thing. Each character in the string becomes a branch in a 256-way tree and at the leaf you have the counter. Look up the data structure on the web.
If you can pardon the pun, "trie" this:
public class Trie : Dictionary<char, Trie>
{
public int Frequency { get; set; }
public void Add(string word)
{
this.Add(word.ToCharArray());
}
private void Add(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
this.Add(first, new Trie());
}
if (chars.Length == 1)
{
this[first].Frequency += 1;
}
else
{
this[first].Add(chars.Skip(1).ToArray());
}
}
public int GetFrequency(string word)
{
return this.GetFrequency(word.ToCharArray());
}
private int GetFrequency(char[] chars)
{
if (chars == null || chars.Length == 0)
{
throw new System.ArgumentException();
}
var first = chars[0];
if (!this.ContainsKey(first))
{
return 0;
}
if (chars.Length == 1)
{
return this[first].Frequency;
}
else
{
return this[first].GetFrequency(chars.Skip(1).ToArray());
}
}
}
Then you can call code like this:
var t = new Trie();
t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");
var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1
You should be able to add code to traverse the trie and return a flat list of words and their frequencies.
If you find that this too still blows your memory limit then might I suggest that you "divide and conquer". Maybe scan the source data for all the first characters and then run the trie separately against each and then concatenate the results after all of the runs.
do you know how many different words you have? if not a lot (i.e. hundred thousand) then you can stream the input, determine words and use a hash table to keep the counts. after input is done just traverse the result.
Just use a DBM file. It’s a hash on disk. If you use the more recent versions, you can use a B+Tree to get in-order traversal.
Why not use any relational DB? The procedure would be as simple as:
Create a table with the word and count.
Create index on word. Some databases have word index (f.e. Progress).
Do SELECT on this table with the word.
If word exists then increase counter.
Otherwise - add it to the table.
If you are using python, you can check the built-in iter function. It will read line by line from your file and will not cause memory problems. You should not "return" the value but "yield" it.
Here is a sample that I used to read a file and get the vector values.
def __iter__(self):
for line in open(self.temp_file_name):
yield self.dictionary.doc2bow(line.lower().split())

Resources