is it possible to modify porter stemmer result? - porter-stemmer

I've used porter stemmer in my project(using python). but I see some errors in output. for example the term "intrductory" changed to "introductori" instead of "introduct".
is it possible to improve this result?

Why do you think it is an error? Step 2 in the Porter Stemmer algorithm states:
Step2() turns terminal 'y' to 'i' when there is another vowel in the stem.
So introductory should indeed be converted to introductori
That said, if you do want to break it down to a base word you can do so in Step4()
case 'i': if (ends("iciti")) { r("ic"); break; }
if (ends("tori")) { r("t"); break; }
break;

Related

How can i do that counting limits take too much time for big integers?

Im Vladimir Grygov and I have very serious problem.
In our work we now work on really hard algorithm, which using limits to cout the specific result.
Alghoritm is veary heavy and after two months of work we found really serious problem. Our team of analytics told me to solve this problem.
For the first I tell you the problem, which must be solve by limits:
We have veary much datas in the database. Ec INT_MAX.
For each this data we must sort them by the alghoritm to two groups and one must have red color interpretation and second must be blue.
The algorithm counts with ID field, which is some AUTO_INCREMENT value. For this value we check, if this value is eequal to 1. If yeas, this is red color data. If it is zero, this is blue data. If it is more. Then one, you must substract number 2 and check again.
We choose after big brainstorming method by for loop, but this was really slow for bigger number. So we wanted to remove cycle, and my colegue told me use recursion.
I did so. But... after implementation I had got unknown error for big integers and for example long long int and after him was wrote that: "Stack Overflow Exception"
From this I decided to write here, because IDE told me name of this page, so I think that here may be Answer.
Thank You so much. All of you.
After your comment I think I can solve it:
public bool isRed(long long val) {
if (val==1)
{return true; }
else if (val==0)
{ return false; }
else { return isRed(val - 2); }
}
Any halfway decent value for val will easily break this. There is just no way this could have worked with recursion. No CPU will support a stacktrace close to half long.MaxInt!
However there are some general issues with your code:
Right now this is the most needlesly complex "is the number even" check ever. Most people use Modulo to figure that out. if(val%2 == 0) return false; else return true;
the type long long seems off. Did you repeat the type? Did you mean to use BigInteger?
If the value you substract by is not static and it is not solveable via modulo, then there is no reason not to use a loop here.
public bool isRed (long long val){
for(;val >= 0; val = val -2){
if(value == 0)
return false;
}
return true;
}

performance issue variable assignment

I am wondering, what scenario would be best ?
(Please bare with my examples, these are just small examples of the situation in question. I know you could have the exact same function without a result variable.)
A)
public String doSomthing(){
String result;
if(condition){ result = "Option A";}
else{ result = "Option B";}
return result;
}
B)
public String doSomthing(){
String result = "Option B";
if(condition){ result = " Option A";}
return result;
}
Cause in scenario B: if the condition is met, Then you would be assigning result a value twice.
Yet in code, i keep seeing scenario A.
Actually, the overhead here is minimal, if any, considering the compiler optimisations. You would not care about it in a professional coding environment, unless you are writing a compiler yourself.
What is more important, considering (modern) programming paradigms, is the code style and readability.
Example A is far more readable, as it has a well-presented reason-outcome hierarchy. This is important especially for big methods, as it saves the programmer lots of analysis time.

If/ Else, test true first or false first

I have a rather specific question.
Say I am at the end of a function, and am determining whether to return true or false.
I would like to do this using an if/else statement, and have two options: (examples are in pseudocode)
1) Check if worked first:
if(resultVar != error){
return true;
}else{
return false;
}
2) Check if it failed first:
if(resultVar == error){
return false;
}else{
return true;
}
My question is simple: Which case is better (faster? cleaner?)?
I am really looking at the if/else itself, disregarding that the example is returning (but thanks for the answers)
The function is more likely to want to return true than false.
I realize that these two cases do the exact same thing, but are just 'reversed' in the order of which they do things. I would like to know if either one has any advantage over the other, whether one is ever so slightly faster, or follows a convention more closely, etc.
I also realize that this is extremely nitpicky, I just didn't know if there is any difference, and which would be best (if it matters).
Clarifications:
A comparison needs to be done to return a boolean. The fact that of what the examples are returning is less relevant than how the comparison happens.
This is by far the cleanest:
return resultvar != error;
The only difference in both examples may be the implementation of the operator. A != operator inverses the operation result. So it adds an overhead, but very small one. The == is a straight comparison.
But depending on what you plan to do on the If/else, if there is simply assigning a value to a variable, then the conditional ternary operator (?) is faster. For complex multi value decisions, a switch/case is more flexible, but slower.
This will be faster in your case:
return (resultVar == error) ? false : true;
This will depend entirely on the language and the compiler. There is no specific answer. In C for instance, both of these would be encoded rather like:
return (resultVar!=error);
by any decent compiler.
Put true first, because it potentially eleiminates a JUMP command in assembly. However, the difference is negligible, since it's an else, rather than an else if. There may /technically be a difference/, but you will see no performance difference in this case.

String matching alternate approach [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
Trying to write my own fast pattern matching algo. Dont want to use language specific solution. I am focussing on writing the algo. This is because I was reading about different techniques to do string matching. Some are complicate yet very interesting like Rabin karp, etc.
I came up with this method which is fast and linear. It works well with the different inputs I have tried with. So I was thinking is there any reason I shouldnt be using this approach over the very well know approaches. Basically I am taking a char of text and comparing with the corresponding character of the pattern - one at a time.
Also, if someone could point out my mistake in this one - it will be great. Thank you for your replies and comments in advance :)
public static boolean patternMatch(String pattern, String text)
{
if(pattern == null)
return true;
if(text == null)
return false;
char[] patternArray = pattern.toCharArray();
char[] textArray = text.toCharArray();
int length = pattern.length();
int j = 0;
for(char t : textArray)
{
if(t == patternArray[j])
{
j++;
if(j == length)
return true;
}
else {
j = 0;
if(t == patternArray[j]) j++;
}
}
return false;
}
Two reasons for using a standard approach:
It's easy to write a method that simply does the wrong thing. Your method is like that, because it will fail to match, for instance, the pattern "ab" against the string "aab". (It matches the first "a"s of the pattern and the string, then fails to match "b" to the second "a" of the string, then goes on to see if it can find a match starting at the third character of the string.)
Standard approaches are fast. Your algorithm is linear, which is pretty good (if only it were also correct!). However, many string matching algorithms will work in sublinear time. That is, the time it takes to match a string grows more slowly than linearly in the size of the input problem. Perhaps hard to believe, but true. (Read the literature for substantiations of this claim.)

Breaking a string apart into a sequence of words

I recently came across the following interview question:
Given an input string and a dictionary of words, implement a method that breaks up the input string into a space-separated string of dictionary words that a search engine might use for "Did you mean?" For example, an input of "applepie" should yield an output of "apple pie".
I can't seem to get an optimal solution as far as complexity is concerned. Does anyone have any suggestions on how to do this efficiently?
Looks like the question is exactly my interview problem, down to the example I used in the post at The Noisy Channel. Glad you liked the solution. Am quite sure you can't beat the O(n^2) dynamic programming / memoization solution I describe for worst-case performance.
You can do better in practice if your dictionary and input aren't pathological. For example, if you can identify in linear time the substrings of the input string are in the dictionary (e.g., with a trie) and if the number of such substrings is constant, then the overall time will be linear. Of course, that's a lot of assumptions, but real data is often much nicer than a pathological worst case.
There are also fun variations of the problem to make it harder, such as enumerating all valid segmentations, outputting a best segmentation based on some definition of best, handling a dictionary too large to fit in memory, and handling inexact segmentations (e.g., correcting spelling mistakes). Feel free to comment on my blog or otherwise contact me to follow up.
This link describes this problem as a perfect interview question and provides several methods to solve it. Essentially it involves recursive backtracking. At this level it would produce an O(2^n) complexity. An efficient solution using memoization could bring this problem down to O(n^2).
Using python, we can write two function, the first one segment returns the first segmentation of a piece of contiguous text into words given a dictionary or None if no such segmentation is found. Another function segment_all returns a list of all segmentations found. Worst case complexity is O(n**2) where n is the input string length in chars.
The solution presented here can be extended to include spelling corrections and bigram analysis to determine the most likely segmentation.
def memo(func):
'''
Applies simple memoization to a function
'''
cache = {}
def closure(*args):
if args in cache:
v = cache[args]
else:
v = func(*args)
cache[args] = v
return v
return closure
def segment(text, words):
'''
Return the first match that is the segmentation of 'text' into words
'''
#memo
def _segment(text):
if text in words: return text
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix = _segment(suffix)
if prefix in words and segmented_suffix:
return '%s %s' % (prefix, segmented_suffix)
return None
return _segment(text)
def segment_all(text, words):
'''
Return a full list of matches that are the segmentation of 'text' into words
'''
#memo
def _segment(text):
matches = []
if text in words:
matches.append(text)
for i in xrange(1, len(text)):
prefix, suffix = text[:i], text[i:]
segmented_suffix_matches = _segment(suffix)
if prefix in words and len(segmented_suffix_matches):
for match in segmented_suffix_matches:
matches.append('%s %s' % (prefix, match))
return matches
return _segment(text)
if __name__ == "__main__":
string = 'cargocultscience'
words = set('car cargo go cult science'.split())
print segment(string, words)
# >>> car go cult science
print segment_all(string, words)
# >>> ['car go cult science', 'cargo cult science']
One option would be to store all valid English words in a trie. Once you've done this, you could start walking the trie from the root downward, following the letters in the string. Whenever you find a node that's marked as a word, you have two options:
Break the input at this point, or
Continue extending the word.
You can claim that you've found a match once you have broken the input up into a set of words that are all legal and have no remaining characters left. Since at each letter you either have one forced option (either you are building a word that isn't valid and should stop -or- you can keep extending the word) or two options (split or keep going), you could implement this function using exhaustive recursion:
PartitionWords(lettersLeft, wordSoFar, wordBreaks, trieNode):
// If you walked off the trie, this path fails.
if trieNode is null, return.
// If this trie node is a word, consider what happens if you split
// the word here.
if trieNode.isWord:
// If there is no input left, you're done and have a partition.
if lettersLeft is empty, output wordBreaks + wordSoFar and return
// Otherwise, try splitting here.
PartitinWords(lettersLeft, "", wordBreaks + wordSoFar, trie root)
// Otherwise, consume the next letter and continue:
PartitionWords(lettersLeft.substring(1), wordSoFar + lettersLeft[0],
wordBreaks, trieNode.child[lettersLeft[0])
In the pathologically worst case this will list all partitions of the string, which can t exponentially long. However, this only occurs if you can partition the string in a huge number of ways that all start with valid English words, and is unlikely to occur in practice. If the string has many partitions, we might spend a lot of time finding them, though. For example, consider the string "dotheredo." We can split this many ways:
do the redo
do the red o
doth ere do
dot here do
dot he red o
dot he redo
To avoid this, you might want to institute a limit of the number of answers you report, perhaps two or three.
Since we cut off the recursion when we walk off the trie, if we ever try a split that doesn't leave the remainder of the string valid, we will detect this pretty quickly.
Hope this helps!
import java.util.*;
class Position {
int indexTest,no;
Position(int indexTest,int no)
{
this.indexTest=indexTest;
this.no=no;
} } class RandomWordCombo {
static boolean isCombo(String[] dict,String test)
{
HashMap<String,ArrayList<String>> dic=new HashMap<String,ArrayList<String>>();
Stack<Position> pos=new Stack<Position>();
for(String each:dict)
{
if(dic.containsKey(""+each.charAt(0)))
{
//System.out.println("=========it is here");
ArrayList<String> temp=dic.get(""+each.charAt(0));
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
else
{
ArrayList<String> temp=new ArrayList<String>();
temp.add(each);
dic.put(""+each.charAt(0),temp);
}
}
Iterator it = dic.entrySet().iterator();
while (it.hasNext()) {
Map.Entry pair = (Map.Entry)it.next();
System.out.println("key: "+pair.getKey());
for(String str:(ArrayList<String>)pair.getValue())
{
System.out.print(str);
}
}
pos.push(new Position(0,0));
while(!pos.isEmpty())
{
Position position=pos.pop();
System.out.println("position index: "+position.indexTest+" no: "+position.no);
if(dic.containsKey(""+test.charAt(position.indexTest)))
{
ArrayList<String> strings=dic.get(""+test.charAt(position.indexTest));
if(strings.size()>1&&position.no<strings.size()-1)
pos.push(new Position(position.indexTest,position.no+1));
String str=strings.get(position.no);
if(position.indexTest+str.length()==test.length())
return true;
pos.push(new Position(position.indexTest+str.length(),0));
}
}
return false;
}
public static void main(String[] st)
{
String[] dic={"world","hello","super","hell"};
System.out.println("is 'hellworld' a combo: "+isCombo(dic,"superman"));
} }
I have done similar problem. This solution gives true or false if given string is combination of dictionary words. It can be easily converted to get space-separated string. Its average complexity is O(n), where n: no of dictionary words in given string.

Resources