String matching alternate approach [closed] - algorithm

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am trying to write my own fast pattern-matching algorithm. I don't want to use a language-specific solution; I am focusing on writing the algorithm itself. This is because I was reading about different string-matching techniques. Some are complicated yet very interesting, like Rabin-Karp.
I came up with this method, which is fast and linear. It works well with the different inputs I have tried. So I was wondering: is there any reason I shouldn't use this approach over the well-known approaches? Basically, I take one character of the text at a time and compare it with the corresponding character of the pattern.
Also, if someone could point out my mistake in this one, that would be great. Thank you in advance for your replies and comments :)
public static boolean patternMatch(String pattern, String text)
{
    if(pattern == null)
        return true;
    if(text == null)
        return false;
    char[] patternArray = pattern.toCharArray();
    char[] textArray = text.toCharArray();
    int length = pattern.length();
    int j = 0;
    for(char t : textArray)
    {
        if(t == patternArray[j])
        {
            j++;
            if(j == length)
                return true;
        }
        else {
            j = 0;
            if(t == patternArray[j]) j++;
        }
    }
    return false;
}

Two reasons for using a standard approach:
It's easy to write a method that simply does the wrong thing. Your method is like that, because it will fail to match, for instance, the pattern "aab" against the string "aaab". (It matches the first two "a"s of the pattern and the string, fails to match "b" against the third "a" of the string, and then restarts with only that third "a" counted as matched. It has forgotten that the second and third "a"s of the string already match the first two characters of the pattern, so the final "b" is compared against the pattern's second "a" and the match is lost.)
Standard approaches are fast. Your algorithm is linear, which is pretty good (if only it were also correct!). However, some string matching algorithms run in sublinear time on typical inputs. That is, the time it takes to match a string grows more slowly than linearly in the size of the input, because the algorithm can skip over parts of the text without examining them. Perhaps hard to believe, but true. (Boyer-Moore is the classic example; read the literature for substantiation of this claim.)
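As a hedged illustration of what a correct linear-time matcher can look like, here is a minimal sketch of the Knuth-Morris-Pratt idea: on a mismatch it falls back to the longest partial match that is still possible instead of restarting from zero. The class and method names are made up for this example and are not from the question.

```java
final class KmpSketch {
    // fail[i] is the length of the longest proper prefix of pattern[0..i]
    // that is also a suffix of pattern[0..i].
    static int[] buildFailureTable(String pattern) {
        int[] fail = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = fail[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            fail[i] = k;
        }
        return fail;
    }

    // Returns true if pattern occurs somewhere in text.
    static boolean contains(String text, String pattern) {
        if (pattern.isEmpty()) {
            return true;
        }
        int[] fail = buildFailureTable(pattern);
        int j = 0; // number of pattern characters currently matched
        for (int i = 0; i < text.length(); i++) {
            // On a mismatch, shrink the partial match instead of discarding it.
            while (j > 0 && text.charAt(i) != pattern.charAt(j)) {
                j = fail[j - 1];
            }
            if (text.charAt(i) == pattern.charAt(j)) {
                j++;
            }
            if (j == pattern.length()) {
                return true;
            }
        }
        return false;
    }
}
```

Each text character is still examined once, so the scan stays linear, but overlapping partial matches such as the repeated "a"s in "aaab" are no longer lost.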

Related

How do I make this code for detecting palindrome in a Linked List cover all cases?

So, I was solving this problem of detecting a palindrome in a linked list. I came up with the following solution for it:
class Solution {
    public boolean isPalindrome(ListNode head) {
        ListNode temp=head;
        boolean [] arr = new boolean[10];
        int count=0;
        if(head==null) return false;
        while(temp!=null)
        {
            if(arr[temp.val]==false)
                arr[temp.val]=true;
            else
                arr[temp.val]=false;
            temp=temp.next;
        }
        for(int i=0;i<10;i++)
        {
            if(arr[i]==true)
                count++;
        }
        if(count<2)return true;
        return false;
    }
}
Now, the logic behind this solution is correct as far as I can see, but it fails for cases like [1,0,0], [4,0,0,0,0] etc. How do I overcome this? (Please don't reply with a shorter method; I want to know the reason why this fails for certain cases.)
First of all, welcome to Stack Overflow!
Because of how simple this problem's solution can be, I feel obligated to tell you that a solution with an auxiliary stack is not only easy to implement but also easy to understand. But since you asked why your code fails for certain cases, I'll answer that question first. Your code in particular is counting the number of digits that occur an odd number of times.
Although this seems to be what you are supposed to do to detect a palindrome, notice that a linked list that looks like 1 -> 1 -> 0 -> 0 is also considered a palindrome under your code, because the count is 0, which is less than 2. Your failing case [1,0,0] is the same story: only the digit 1 occurs an odd number of times, so the count is 1 and the method returns true. The order of the nodes is never looked at.
Your solution works for telling us whether it is possible to create a palindrome from a given set of digits. It would answer a question like "given a linked list, tell me if you can rearrange it to create a palindrome", but it does not work for "is this linked list a palindrome".
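For comparison, the auxiliary-stack idea mentioned above can be sketched roughly as follows. This is only an illustration: the ListNode type here is a hypothetical stand-in for the one in the question, and the helper of(...) exists purely to build test lists.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PalindromeSketch {
    // Hypothetical minimal node type standing in for the question's ListNode.
    static final class ListNode {
        final int val;
        final ListNode next;

        ListNode(int val, ListNode next) {
            this.val = val;
            this.next = next;
        }
    }

    // Pushes every value, then walks the list again while popping:
    // the pops yield the values in reverse order, so order is actually checked.
    static boolean isPalindrome(ListNode head) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (ListNode n = head; n != null; n = n.next) {
            stack.push(n.val);
        }
        for (ListNode n = head; n != null; n = n.next) {
            if (stack.pop() != n.val) {
                return false;
            }
        }
        return true;
    }

    // Convenience builder for examples, e.g. of(1, 0, 0).
    static ListNode of(int... values) {
        ListNode head = null;
        for (int i = values.length - 1; i >= 0; i--) {
            head = new ListNode(values[i], head);
        }
        return head;
    }
}
```

Unlike the counting approach, this rejects [1,0,0] and 1 -> 1 -> 0 -> 0. Note that this sketch treats an empty list as a palindrome, whereas the code in the question returns false for it.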

How can I keep counting limits from taking too much time for big integers?

I'm Vladimir Grygov and I have a very serious problem.
At work we are developing a really hard algorithm, which uses limits to compute a specific result.
The algorithm is very heavy, and after two months of work we found a really serious problem. Our team of analysts told me to solve it.
First I will describe the problem, which must be solved with limits:
We have a very large amount of data in the database, e.g. INT_MAX records.
The algorithm must sort each record into one of two groups: one group is interpreted as red and the other as blue.
The algorithm works with the ID field, which is an AUTO_INCREMENT value. We check whether this value is equal to 1. If yes, the record is red. If it is zero, the record is blue. If it is more than one, you must subtract 2 and check again.
After a big brainstorming session we chose a for loop, but this was really slow for bigger numbers. So we wanted to remove the loop, and my colleague told me to use recursion.
I did so. But after implementing it I got an unknown error for big integers, for example long long int, and after it was written: "Stack Overflow Exception".
That is why I decided to write here: the IDE told me the name of this page, so I think the answer may be here.
Thank you so much, all of you.
After your comment I think I can solve it:
public bool isRed(long long val) {
    if (val==1)
    { return true; }
    else if (val==0)
    { return false; }
    else { return isRed(val - 2); }
}
Any halfway decent value for val will easily break this. There is just no way this could have worked with recursion. No runtime will give you a call stack anywhere near long.MaxValue / 2 frames deep!
However, there are some general issues with your code:
Right now this is the most needlessly complex "is the number even" check ever. Most people use modulo to figure that out: if(val%2 == 0) return false; else return true;
The type long long seems off. Did you repeat the type? Did you mean to use BigInteger?
If the value you subtract is not constant and the problem is not solvable via modulo, then there is no reason not to use a loop here.
public bool isRed (long long val){
    for(;val >= 0; val = val -2){
        if(val == 0)
            return false;
    }
    return true;
}
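To make the modulo suggestion concrete, here is a small Java sketch of the constant-time check. It assumes, as the question describes, that "red" means the repeated subtract-2 walk ends at 1 (the ID is odd); the class name is invented for the example.

```java
final class ParitySketch {
    // Red means the "subtract 2 until you reach 0 or 1" walk would end at 1,
    // which is the same as asking whether the lowest bit of val is set.
    static boolean isRed(long val) {
        return (val & 1L) == 1L;
    }
}
```

No loop and no recursion are involved, so the cost is the same for 3 as for Long.MAX_VALUE.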

Should I expect to see the counter in `for` loop changed inside its body? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I'm reading someone else's code and they separately increment their for loop counter inside the loop, as well as including the usual afterthought. For example:
for( int y = 4; y < 12; y++ ) {
    // blah
    if( var < othervar ) {
        y++;
    }
    // blah
}
Based on the majority of code others have written and read, should I be expecting to see this?
The practice of manipulating the loop counter within a for loop is not exactly widespread. It would surprise many of the people reading that code. And surprising your readers is rarely a good idea.
The additional manipulation of your loop counter adds a ton of complexity to your code because you have to keep in mind what it means and how it affects the overall behavior of the loop. As Arkady mentioned, it makes your code much harder to maintain.
To put it simply, avoid this pattern. When you follow "clean code" principles, especially the single layer of abstraction (SLA) principle, there is no such thing as
for(something)
    if (somethingElse)
        y++
Following the principle requires you to move that if block into its own method, making it awkward to manipulate some outer counter within that method.
But beyond that, there might be situations where something like your example makes sense; but for those cases, why not use a while loop?
In other words: the thing that makes your example complicated and confusing is the fact that two different parts of the code change your loop counter. So another approach could look like:
while (y < whatever) {
    ...
    y = determineY(y, who, knows);
}
That new method could then be the central place to figure how to update the loop variable.
I beg to differ with the acclaimed answer above. There is nothing wrong with manipulating loop control variable inside the loop body. For example, here is the classical example of cleaning up the map:
for (auto it = map.begin(), e = map.end(); it != e; ) {
    if (it->second == 10)
        it = map.erase(it);
    else
        ++it;
}
Since it has rightly been pointed out to me that iterators are not the same as a numeric control variable, let's consider an example of parsing a string. Let's assume the string consists of a series of characters, where characters prefixed with '\' are considered special and need to be skipped:
for (size_t i = 0; i < s_len; ++i) {
    if (s[i] == '\\') {
        ++i;
        continue;
    }
    process_symbol(s[i]);
}
Use a while loop instead.
While you can do this with a for loop, you should not. Remember that a program is like any other piece of communication, and must be done with your audience in mind. For a program, the audience includes the compiler and the next person to do maintenance on the code (likely you in about 6 months).
To the compiler, the code is taken very literally -- set up an index variable, run the loop body, execute the increment, then check the condition to see if you are looping again. The compiler doesn't care if you monkey with the loop index.
To a person, however, a for loop has a specific implied meaning: run this loop a fixed number of times. If you monkey with the loop index, then this violates the implication. It's dishonest in a sense, and it matters because the next person to read the code will either have to spend extra effort to understand the loop, or will fail to do so and will therefore misunderstand the code.
If you want to monkey with the loop index, use a while loop. Especially in C/C++/related languages, a for loop is exactly as powerful as a while loop, so you never lose any power or expressiveness. Any for loop can be converted to a while loop and vice versa. However, the next person who reads it won't depend on the implication that you don't monkey with the loop index. Making it a while loop instead of a for loop is a warning that this kind of loop may be more complicated, and in your case, it is in fact more complicated.
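As a rough sketch of that conversion, here is a character-skipping scan written as a while loop in Java, so that every place where the index advances is visible in the body. The class and method names, and the rule that a backslash hides the character after it, are assumptions chosen for illustration.

```java
final class EscapeSkipSketch {
    // Returns the input with every backslash and the character after it removed.
    static String stripEscapes(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '\\') {
                i += 2; // skip the backslash and the escaped character
            } else {
                out.append(s.charAt(i));
                i += 1; // ordinary character: advance by one
            }
        }
        return out.toString();
    }
}
```

Because the loop header carries only the condition, a reader is not led to assume a fixed step of one per iteration.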
If you increment inside the loop, make sure to comment it. A canonical example (based on a Scott Meyers Effective C++ item) is given in the Q&A How to remove from a map while iterating it? (verbatim code copy)
for (auto it = m.cbegin(); it != m.cend() /* not hoisted */; /* no increment */)
{
    if (must_delete)
    {
        m.erase(it++); // or "it = m.erase(it)" since C++11
    }
    else
    {
        ++it;
    }
}
Here, both the non-constant nature of the end() iterator and the increment inside the loop are surprising, so they need to be documented. Note: the loop hoisting here is after all possible so probably should be done for code clarity.
For what it's worth, here is what the C++ Core Guidelines has to say on the subject:
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Res-loop-counter
ES.86: Avoid modifying loop control variables inside the body of raw for-loops
Reason: The loop control up front should enable correct reasoning about what is happening inside the loop. Modifying loop counters in both the iteration-expression and inside the body of the loop is a perennial source of surprises and bugs.
Also note that in the other answers here that discuss the case with std::map, the increment of the control variable is still only done once per iteration, where in your example, it can be done more than once per iteration.
So after some confusion (close, reopen, question body update, title update), I think the question is finally clear, and also no longer opinion-based.
As I understand it the question is:
When I look at code written by others, should I be expecting to see "loop condition variable" being changed in the loop body ?
The answer to this is a clear:
yes
When you work with others' code - regardless of whether you do a review, fix a bug, or add a new feature - you should expect the worst.
Everything that is valid within the language is to be expected.
Don't make any assumptions about the code being in accordance with any good practice.
It's really better to write as a while loop
y = 4;
while(y < 12)
{
    /* body */
    if(condition)
        y++;
    y++;
}
You can sometimes separate out the loop logic from the body
while(y < 12)
{
    /* body */
    y += condition ? 2 : 1;
}
I would allow the for() method if and only if you rarely "skip" an item, like escapes in a quoted string.

sort big amount of records by date [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have 10,700 records and I need to sort them as fast as possible.
I've been reading about types of sorting algorithms but got lost; I don't know which is best to choose: http://en.wikipedia.org/wiki/Sorting_algorithm
EDIT 1:
I need to write code that measures the execution time of the algorithm.
EDIT 1-2: Is there any language that has built-in functionality for sorting and for measuring the time the sort takes?
One more question: does the language used to implement the algorithm affect the speed?
(e.g. if I used C++, would it be faster than Java or a .NET language?)
Note: this is not homework.
Unless this is a homework problem, don't implement your own sorting algorithm.
Use the one already provided by your development environment - it'll be robust, debugged, and almost certainly faster than anything you'll write yourself.
FWIW, the Sort() method on List<T> in .NET uses a QuickSort (newer versions of the framework use an introsort, a quicksort-based hybrid).
The actual environment (C++ vs .NET vs Java) will have negligible impact, unless you're doing this in an absurdly small amount of memory. Use whatever you have experience with.
This chunk of code in Java shows how you could determine at least some of the figures you're after:
import java.util.Arrays;
import java.util.Date;

public class Main {
    // Sorts the array in place and returns the elapsed time in milliseconds.
    private static long test (double[] tosort) {
        Date begin = new Date();
        Arrays.sort(tosort);
        Date end = new Date();
        return end.getTime() - begin.getTime();
    }
    public static void main(String[] args) {
        double[] tosort = new double[10700];
        for (int jj=0;jj<10;jj++) {
            for (int ii=0;ii<tosort.length;ii++) {
                tosort[ii] = Math.random();
            }
            System.out.println("Random data " + test(tosort));
        }
        for (int jj=0;jj<10;jj++) {
            for (int ii=0;ii<tosort.length;ii++) {
                tosort[ii] = ii;
            }
            System.out.println("Presorted data " + test(tosort));
        }
        for (int jj=0;jj<10;jj++) {
            for (int ii=0;ii<tosort.length;ii++) {
                tosort[ii] = tosort.length - ii;
            }
            System.out.println("Inverted data " + test(tosort));
        }
    }
}
FYI, on my computer each run of that code stayed below 1 millisecond spent in the sorting routine; I had to increase the data size 100-fold to get meaningful figures.
This piece of code completely ignores things like the time the comparator code needs (the elements are primitive doubles; comparing other objects will probably take a whole lot more time).
Once the just-in-time compiler has figured out the code, it should become a bit faster as well.
You could easily add test runs with alternative sorting algorithms and see how those behave.
These figures will vary depending on hardware, input data type, load on your computer, etc., but you can at least get a feeling for what to expect.
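Since a single sort of this size finishes in well under a millisecond, a finer-grained clock and a few warm-up runs give more usable numbers. The following is only a sketch of that idea using System.nanoTime(); the class name, the fixed seed, and the run counts are arbitrary choices for illustration, and this is not a substitute for a proper benchmarking harness.

```java
import java.util.Arrays;
import java.util.Random;

final class SortTimingSketch {
    // Sorts a copy of the data (the input stays untouched, so it can be
    // reused for the next run) and returns the elapsed time in nanoseconds.
    static long timeSortNanos(double[] data) {
        double[] copy = Arrays.copyOf(data, data.length);
        long start = System.nanoTime();
        Arrays.sort(copy);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        double[] data = new Random(42).doubles(10_700).toArray();
        // Warm-up runs so the just-in-time compiler has seen the code.
        for (int i = 0; i < 20; i++) {
            timeSortNanos(data);
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 10; i++) {
            best = Math.min(best, timeSortNanos(data));
        }
        System.out.println("Best of 10 runs: " + (best / 1_000) + " microseconds");
    }
}
```

Timing a copy keeps every run working on the same unsorted input; otherwise all runs after the first would measure already-sorted data.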
You don't need to implement any algorithm (unless this is homework). Every language has its sorting functions, and they are pretty efficient. For example, in C++ you'd use std::sort, which in many implementations uses quicksort (and insertion sort if the number of elements is small).

How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)
For example: "stringintowords" -> "String Into Words"
Could you please advise what algorithm should be used here ?
Update: For those who think this question is just for curiosity: this algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and it is currently used by aboutus dot org to do this conversion dynamically.
Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.
Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the prefix w[1..i] can be split into words. Then set S[1] = isWord(w[1]) and for i = 2 to length(w) calculate
S[i] = isWord(w[1..i]) or (for some j in {2..i}: S[j-1] and isWord(w[j..i])).
This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.
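The recurrence can be sketched in Java roughly as follows. This is an illustration under a few assumptions: the dictionary is passed in as a Set<String> standing in for isWord, indices are 0-based (so ok[i] plays the role of S[i] for the prefix of length i), and the names are invented for the example. It returns one splitting, not all of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class WordBreakSketch {
    // Returns one split of w into dictionary words, or null if none exists.
    static List<String> split(String w, Set<String> dictionary) {
        int n = w.length();
        boolean[] ok = new boolean[n + 1]; // ok[i]: prefix of length i can be split
        int[] prev = new int[n + 1];       // prev[i]: start of the last word in that split
        Arrays.fill(prev, -1);
        ok[0] = true;                      // the empty prefix is trivially splittable
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                if (ok[j] && dictionary.contains(w.substring(j, i))) {
                    ok[i] = true;
                    prev[i] = j;           // remember the winning split point
                    break;
                }
            }
        }
        if (!ok[n]) {
            return null;
        }
        // Walk the stored split points backwards to recover the words.
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = prev[i]) {
            words.add(w.substring(prev[i], i));
        }
        Collections.reverse(words);
        return words;
    }
}
```

Note that substring and hashing are not constant-time in Java, so the real cost is somewhat above the O(length(w)^2) bound quoted for constant-time dictionary queries.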
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:
(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use it properly (that is, by incrementally testing for words).
(b) searching for "segmentation dynamic programming" yields a score of more detailed answers, including university-level lectures with pseudo-code algorithms, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach to deal with words that are not contained in any dictionary).
There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.
In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.
If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.
For example: windowsteamblog (of http://windowsteamblog.com/ fame)
windows team blog
window steam blog
Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.
You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n-1).
Granted, a lot of this computation rests on common subproblems, so dynamic programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j_k = i_(k+1).
If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records, for each position of the preprocessed string, how many spaces precede it in the original string, like so:
Needle: isashort
Haystack: This is a short phrase
Preprocessed: thisisashortphrase
NumSpaces : 000011233333444444
And your answer would come from:
location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)
Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
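A hedged Java sketch of this lookup follows. It is a variant rather than a transcription: instead of a NumSpaces array it stores, for each character of the preprocessed text, its index in the original (which equals position + NumSpaces[position]). It only strips plain space characters and lowercases, with no HTML handling, and the class and method names are invented for the example.

```java
final class SpacedMatchSketch {
    // Finds needle in haystack while ignoring spaces and case, and returns
    // the matching slice of the original haystack, or null if there is none.
    static String findOriginal(String haystack, String needle) {
        if (needle.isEmpty()) {
            return null;
        }
        StringBuilder preprocessed = new StringBuilder();
        // originalIndex[k] = index in haystack of the k-th kept character.
        int[] originalIndex = new int[haystack.length()];
        for (int i = 0; i < haystack.length(); i++) {
            char c = haystack.charAt(i);
            if (c != ' ') {
                originalIndex[preprocessed.length()] = i;
                preprocessed.append(Character.toLowerCase(c));
            }
        }
        int location = preprocessed.indexOf(needle);
        if (location < 0) {
            return null;
        }
        int start = originalIndex[location];
        int end = originalIndex[location + needle.length() - 1];
        return haystack.substring(start, end + 1);
    }
}
```

For the example above, findOriginal("This is a short phrase", "isashort") returns "is a short".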
This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).
This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.
With a fair-sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.
Create a list of possible words, sort it from long words to short words.
Check each entry in the list against the first part of the string. If it matches, remove it from the string and append it to your sentence with a space. Repeat this.
A simple Java solution that, thanks to memoization, performs O(n^2) dictionary lookups.
import java.util.HashMap;
import java.util.HashSet;

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private final HashSet<String> dictionary;

    public Solution(HashSet<String> dictionary) {
        this.dictionary = dictionary;
    }

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    // map caches the result for every suffix already tried (null means "cannot be split")
    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:
string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);
/** this way, one does not check the dictionary to check for word validity
* on every substring; It would only be queried once and for all,
* eliminating multiple travels to the data storage
*/
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();
validwords = validwords.sort(length, desc);
array segments = [];
while(mainword != ""){
    matched = false;
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            // drop the matched word from the front of the string
            mainword = mainword.substring(validwords[x].length);
            matched = true;
            break; // rescan from the longest word again
        }
    }
    /**
    * remove the first character if none of the valid words match, then start again
    * you may need to add the first character to the result if you want to
    */
    if(!matched){
        mainword = mainword.substring(1);
    }
}
string result = segments.join(" ");

Resources