Algorithm that spans two pages in LaTeX

I have a long algorithm that I need to put in a report, and I am using LaTeX for this report. Because of its length the algorithm takes more than one page, but I cannot get it to continue onto the next page. I am new to LaTeX. Can someone tell me how to do this?

You should manually split the algorithm into two parts. You can just chop it in half, as redtuna suggested, or even better, you can factor out an interesting chunk into a new function and put that on a separate page. This will likely make the algorithm more readable too.

Split it into two. If you're using one of the packages that lets you number lines, tell the second half to start with a line number that's one plus the last line number.
You'll probably be able to get better quality answers if you tell us which package you're using to format your algorithm (or ask for suggestions; I've had good results with "listings").
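If you end up using the algorithm float together with the algpseudocode (algorithmicx) package, its \algstore and \algrestore commands do exactly that: they save the current line number and indentation at the break and restore them in the second float. A minimal sketch, assuming those packages and placeholder step names:

    \documentclass{article}
    \usepackage{algorithm}
    \usepackage{algpseudocode}

    \begin{document}

    \begin{algorithm}
      \caption{My long algorithm (part 1)}
      \begin{algorithmic}[1]
        \State first half of the steps \dots
        \algstore{myalg}   % save the line number and indentation here
      \end{algorithmic}
    \end{algorithm}

    \begin{algorithm}
      \caption{My long algorithm (part 2)}
      \begin{algorithmic}[1]
        \algrestore{myalg} % continue numbering where part 1 stopped
        \State second half of the steps \dots
      \end{algorithmic}
    \end{algorithm}

    \end{document}

If you go with the listings package instead, a non-floating lstlisting environment will simply break across pages on its own.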

Related

My Algorithm only fails for large values - How do I debug this?

I'm working on transcribing as3delaunay to Objective-C. For the most part, the entire algorithm works and creates graphs exactly as they should be. However, for large values (thousands of points), the algorithm mostly works, but creates some incorrect graphs.
I've been going back through and checking the most obvious places for error, and I haven't been able to actually find anything. For smaller inputs I ran the original algorithm and saved its output to JSON files. I then read that output into my own tests (tests with only 3 or 4 points) and debugged until the output matched; I compared the output of the two algorithms line by line and found the discrepancies. But I can't feasibly do that for 1000 points.
Answers don't need to be specific to my situation (although suggesting tools I can use would be excellent).
How can I debug algorithms that only fail for large values?
If you are transcribing an existing algorithm to Objective-C, do you have a working original in some other language? In that case, I would be inclined to put in print statements in both versions and debug the first discrepancy (the first, because later discrepancies could be knock-on errors).
I think it is very likely that the program also makes mistakes for smaller graphs, but more rarely. My first step would in fact be to use the working original (or some other means) to run a large number of automatically checked test runs on small graphs, hoping to find the bug on some more manageable input size.
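A rough sketch of that kind of automated comparison, assuming you can drive both implementations from a test harness (generate_points is defined here; reference_triangulate and ported_triangulate are placeholders for however you invoke the original as3delaunay code and your Objective-C port):

    import random

    def generate_points(n, seed):
        # Deterministic random points so a failing case can be reproduced from its seed.
        rng = random.Random(seed)
        return [(rng.uniform(0, 1000), rng.uniform(0, 1000)) for _ in range(n)]

    def first_discrepancy(reference_triangulate, ported_triangulate,
                          max_points=50, trials_per_size=200):
        # Compare the two implementations on many small random inputs and
        # return the first input on which their outputs differ.
        for n in range(3, max_points + 1):
            for seed in range(trials_per_size):
                points = generate_points(n, seed)
                if reference_triangulate(points) != ported_triangulate(points):
                    print(f"outputs differ for n={n}, seed={seed}")
                    return points
        return None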
Find the threshold
If it works for 3 or 4 items, but not for 1000, then there's probably some threshold in between. Use a binary search to find that threshold.
The threshold itself may be a clue. For example, maybe it corresponds to a magic value in the algorithm or to some other value you wouldn't expect to be correlated. For example, perhaps it's a problem when the number of items exceeds the number of pixels in the x direction of the chart you're trying to draw. The clue might be enough to help you solve the problem. If not, it may give you a clue as to how to force the problem to happen with a smaller value (e.g., debug it with a very narrow chart area).
The threshold may be smaller than you think, and may be directly debuggable.
If the threshold is a big value, like 1000, perhaps you can set a conditional breakpoint to skip right to iteration 999 and then single-step from there.
There may not be a definite threshold, which suggests that it's not the magnitude of the input size, but some other property you should be looking at (e.g., powers of 10 don't work, but everything else does).
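A minimal sketch of that binary search over input sizes, assuming a helper works_at_size(n) that runs the algorithm on an input of size n and reports whether the output is correct:

    def find_threshold(works_at_size, low=4, high=1000):
        # Invariant: the algorithm works at size `low` and fails at size `high`.
        # Assumes failures are (roughly) monotonic in the input size.
        while low + 1 < high:
            mid = (low + high) // 2
            if works_at_size(mid):
                low = mid    # still correct at this size
            else:
                high = mid   # already failing at this size
        return high          # smallest size known to fail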
Decompose the problem and write unit tests
This can be tedious but is often extremely valuable--not just for the current issue, but for the future. Convince yourself that each individual piece works in isolation.
Re-visit recent changes
If it used to work and now it doesn't, look at the most recent changes first. Source control tools are very useful in helping you remember what has changed recently.
Remove code and add it back piece by piece
Comment out as much code as you can and still get some kind of reasonable output (even if that output doesn't meet all the requirements). For example, instead of using a complicated rounding function, just truncate values. Comment out code that adds decorative touches. Put assert(false) in any special case handlers you don't think should be activated for the test data.
Now verify that output, and slowly add back the functionality you removed, one baby step at a time. Test thoroughly at each step.
Profile the code
Profiling is usually for optimization, but it can sometimes give you insight into code, especially when the data size is too large for single-stepping through the debugger. I like to use line or statement counts. Is the loop body executing the number of times you expect? Or twice as often? Or not at all? How about the then and else clauses of those if statements? Logic bugs often become very obvious with this type of profiling.
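A low-tech way to get those counts, assuming you can edit the code under test: bump a Counter at each interesting point and print it at the end of the run (the clamp function here is just a toy example).

    from collections import Counter

    hits = Counter()

    def clamp(values, lo, hi):
        # Toy example: count how often the loop body and each branch execute.
        result = []
        for v in values:
            hits["loop body"] += 1
            if v < lo:
                hits["below lo"] += 1
                result.append(lo)
            elif v > hi:
                hits["above hi"] += 1
                result.append(hi)
            else:
                hits["in range"] += 1
                result.append(v)
        return result

    clamp(range(10), 2, 7)
    print(hits)  # do the counts match what you expect for this input?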

Re-arrange the picture

This question was asked in a recent interview. Please suggest something:
A picture of 16x16 is divided into pieces with sizes of 4x4 (16 pieces) and shuffled. Suggest an algorithm to rearrange it back.
If it's a software engineering type of problem and you divide it yourself you can cheat and store each location with each piece. ;)
They're probably looking for some pattern-matching solution, though. Perhaps compare the outermost row of pixels on each side (top/bottom/left/right) of a piece with the corresponding edges of the other pieces (with a certain tolerance). Each side gets a score against the others, and you progressively match the best-scoring edges until all pieces are placed.
Without going into the pixel matching algorithms, I think I would take a bottom-up dynamic programming approach here: first find 8 pairs of pieces that are most likely adjacent, then try to build the whole picture from those smaller subsets.
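A small sketch of the edge-scoring part, assuming each piece is given as a 4x4 grid of grayscale values; the score is the sum of absolute differences between the touching edge pixels, so lower means a better match:

    def right_edge(piece):
        return [row[-1] for row in piece]

    def left_edge(piece):
        return [row[0] for row in piece]

    def horizontal_match_score(left_piece, right_piece):
        # How well right_piece fits immediately to the right of left_piece:
        # sum of absolute differences between the touching columns (0 = perfect).
        return sum(abs(a - b) for a, b in zip(right_edge(left_piece), left_edge(right_piece)))

    def best_right_neighbour(piece, candidates):
        # Pick the candidate whose left edge best matches this piece's right edge.
        return min(candidates, key=lambda c: horizontal_match_score(piece, c))

Vertical matches work the same way with top and bottom rows; the bottom-up pairing described above then combines the best-scoring pairs into larger and larger blocks.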
I hope each of these pieces has an identification (like a number that tells you how to order/rearrange them). I can think of this problem as an analogy to the reception of UDP packets (UDP packets may be received out of order, and we then need to reorder them).
So any sorting algorithm should work.
Please correct me if I have misunderstood the question.
Assuming nothing is available except the pixels of the pieces, this is a great approach to solving it probabilistically:
http://people.csail.mit.edu/taegsang/JigsawPuzzle.html

OCR error correction: How to combine three erroneous results to reduce errors

The problem
I am trying to improve the result of an OCR process by combining the output from three different OCR systems (tesseract, cuneiform, ocrad).
I already do image preprocessing (deskewing, despeckling, thresholding and some more). I don't think that this part can be improved much more.
Usually the text to recognize is between one and six words long. The language of the text is unknown, and quite often it contains invented (fantasy) words.
I am on Linux. Preferred language would be Python.
What I have so far
Often every result has one or two errors, but they occur at different characters/positions. An error can be that a wrong character is recognized or that a character which does not exist in the text is inserted. Less often, a character is dropped entirely.
An example might look in the following way:
Xorem_ipsum
lorXYm_ipsum
lorem_ipuX
An X is a wrongly recognized character and a Y is a character which does not exist in the text. Spaces are replaced by "_" for better readability.
In cases like this I try to combine the different results.
By repeatedly applying the "longest common substring" algorithm to the three pairs, I am able to get the following structure for the given example:
or m_ipsum
lor m_ip u
orem_ip u
But this is where I am stuck: I am not able to combine those pieces into a result.
The questions
Do you have an idea how to combine the different longest common substrings? Or do you have a better idea how to solve this problem?
It all depends on the OCR engines you are using as to the quality of the results you can expect to get. You may find that a higher quality OCR engine that gives you confidence levels and bounding boxes would produce much better raw results in the first place, plus extra information that could be used to determine the correct result.
Using Linux will restrict the possible OCR engines available to you. Personally I would rate Tesseract as 6.5/10 compared to commercial OCR engines available under Windows.
http://www.abbyy.com/ocr_sdk_linux/overview/ - The SDK may not be cheap though.
http://irislinktest.iriscorporate.com/c2-1637-189/iDRS-14-------Recognition--Image-preprocessing--Document-formatting-and-more.aspx - Available for Linux
http://www.rerecognition.com/ - Is available as a Linux version. This engine is used by many other companies.
All of the engines above should give you confidence levels, bounding boxes and better results than Tesseract OCR.
https://launchpad.net/cuneiform-linux - Cuneiform, now open sourced and running under Linux. This is likely one of the three engines you are using. If not, you should probably look at adding it.
Also you may want to look at http://tev.fbk.eu/OCR/Products.html for more options.
Can you post a sample or two of typical images and the OCR results from the engines? There are other ways to improve OCR recognition but it would depend on the images.
Maybe repeat the "longest common substring" until all results are the same.
For your example, you would get the following in the next step:
or m_ip u
or m_ip u
or m_ip u
Or run the "longest common substring" algorithm on the first and second string, and then again on that result and the third string. That gives you the same result, "or m_ip u", more easily.
You can assume those letters are correct. Now look at the gaps: before "or" there is "l" twice and "X" once, so choose "l". Between "or" and "m_ip" there is "e" twice and "XY" once, so choose "e". And so on.
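For completeness, a standard dynamic programming implementation of the pairwise "longest common substring" step used above; applied repeatedly (to the whole strings and then to the leftover pieces on each side of a match) it produces the kind of skeleton shown in the example:

    def longest_common_substring(a, b):
        # Longest substring occurring in both a and b, O(len(a) * len(b)) time.
        best_len, best_end = 0, 0
        lengths = [0] * (len(b) + 1)   # lengths[j]: common suffix length of a[:i] and b[:j]
        for i in range(1, len(a) + 1):
            new_lengths = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    new_lengths[j] = lengths[j - 1] + 1
                    if new_lengths[j] > best_len:
                        best_len, best_end = new_lengths[j], i
            lengths = new_lengths
        return a[best_end - best_len:best_end]

    print(longest_common_substring("Xorem_ipsum", "lorXYm_ipsum"))  # -> "m_ipsum"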
I'm new to OCR, but so far I have found that these systems are built to work from a dictionary of words rather than letter by letter. So if your images don't contain real words, you may have to look more closely at the letter recognition and training parts of the systems you are using.
I ran into a very similar problem.
I hope that this can help: http://dl.tufts.edu/catalog/tufts:PB.001.011.00001
See also software developed by Bruce Robertson: https://github.com/brobertson/rigaudon

String search algorithms

I have a project on benchmarking string matching algorithms and I would like to know if there is a standard for every algorithm, so that I can get fair results from my experimentation. I am planning to use Java's System.nanoTime to get the running time of every algorithm. Any comments or reactions regarding my problem are very much appreciated. Thanks!
I am not entirely sure what you're asking. However, I am guessing you are asking how to get the most realistic results. You need to run your algorithm hundreds or even thousands of iterations to get an average. It is also very important to turn off any caching that your language may do, and not to reuse objects, unless that is part of your algorithm.
I am not entirely sure what you're asking. However, another interpretation of what you are asking can be answered by trying to work out how a given algorithm performs as you increase the size of the problem. Using raw time to compare algorithms at a given string size does not necessarily allow for accurate comparison. Instead, you could try each algorithm with different string sizes and see how the algorithm behaves as string size varies.
And Mark's advice is good too. So you are running repeated trials for many different string lengths to get a picture of how one algorithm works, then repeating that for the next algorithm.
Again, it's not clear what you're asking, but here's another thought in addition to what Tony and Mark said:
Be very careful about testing only "real" input or only "random" input. Some algorithms are tuned to do well on typical input (searching for a word in English text), while others are tuned for working well on pathologically hard cases. You'll need a huge mix of possible inputs of all different types and sizes to do a truly good benchmark.
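A rough sketch of such a harness in Python (the question mentions Java's System.nanoTime; time.perf_counter_ns plays the same role here, and the naive search is only a stand-in for whatever algorithms you are actually comparing):

    import random
    import time

    def benchmark(search, text, pattern, iterations=1000):
        # Average wall-clock time of search(text, pattern) over many iterations.
        start = time.perf_counter_ns()
        for _ in range(iterations):
            search(text, pattern)
        return (time.perf_counter_ns() - start) / iterations

    def naive_search(text, pattern):
        # Baseline algorithm so the sketch runs end to end.
        return [i for i in range(len(text) - len(pattern) + 1)
                if text[i:i + len(pattern)] == pattern]

    rng = random.Random(0)
    for size in (1_000, 10_000, 100_000):                       # vary the problem size
        text = "".join(rng.choice("ab") for _ in range(size))   # small alphabet: a harder case
        print(size, benchmark(naive_search, text, "ab" * 5))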

Extract small relevant bits of text (as Google does) from full text search results

I have implemented full text search in a discussion forum database and I want to display the search results the way Google does: even for a very long HTML page, only two or three lines of text are shown in the search result list, usually the lines which contain the search terms.
What would be a good algorithm for extracting a few lines of text based on the text itself and the search terms? I could think of something as simple as using one line of text before the search term occurrence and one line after, but that seems too simple to work.
Would like to get a few directions, ideas and insights.
Thank you.
If you are looking for something fancier than the 'line before/after' approach, a summarizer might do the trick.
Here's a Naive Bayes based system: http://classifier4j.sourceforge.net/
Bayes is the statistical system used by many spam filters - I researched Bayes summarizers a few years back, and found that they do a pretty good job of summarizing text, as long as there is a decent amount of text to process. I haven't actually tried the above library, though, so your mileage may vary.
Have you tried the "line before/after the search term occurrence" approach in code, to see whether for that small coding investment the results are already good enough for what you want?
Otherwise, you could go for pieces of sentences: instead of splitting on lines, split on newlines, full stops, commas, spaced-out hyphens, etc. Then show the pieces that contain the search terms. You could separate each matching sentence piece with "..." or something similar.
If you get a lot of these pieces, you could try to prioritize the pieces, sort on descending priority and only show the first n of them. And/or cut down the pieces to just the search term and a couple of words around the search term.
Just a couple of informal ideas that might get you started?
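A small sketch of the sentence-pieces idea, assuming plain text and case-insensitive matching; the split characters and the "..." separator are just the ones suggested above:

    import re

    def snippet(text, terms, max_pieces=3):
        # Split into rough sentence pieces and keep the ones containing a search term.
        pieces = re.split(r"[\n.,]|\s-\s", text)   # newlines, full stops, commas, spaced hyphens
        terms = [t.lower() for t in terms]
        hits = [p.strip() for p in pieces
                if p.strip() and any(t in p.lower() for t in terms)]
        return " ... ".join(hits[:max_pieces])

    print(snippet("The forum post talks about indexing. Full text search is fast, "
                  "but ranking matters - snippets help users decide.",
                  ["search", "snippets"]))
    # -> "Full text search is fast ... snippets help users decide"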
Concentrate on the beginning of the content. Think of where you would look when you visit a blog: the opening paragraph tells you whether the article is going in the right direction, so it makes sense for your algorithm to reflect this.
Check for occurrences of the search term in headings (H1, H2, etc.) and give them more priority.
This should get you started.
