There is a linear time algorithm (or quadratic time algorithm by Knuth & Plass) for breaking text evenly into lines of maximum width. It uses SMAWK and "evenly" means:
http://en.wikipedia.org/wiki/Word_wrap#Minimum_raggedness
Is there an algorithm or a concave cost function for algorithm above which would take into account the number of lines I would like the text break into, instead of the maximum line width?
In other words, I'm looking for a line breaking (or paragraph formation, or word wrapping) algorithm where the input is the desired number of lines, not the desired line width.
Just to describe a practically unusable approach: There are N words and N-1 spaces in-between each word pair, M is the desired number of lines (M <= N). After each space there might be at most one (possibly zero) line-break. Now, the algorithm would try to place the breaks in each possible combination, calculating the "raggedness" and return the best one. How to do it much faster?
You could simply reduce the problem of achieving a given number of lines to the problem of breaking lines after a maximum length by calculating the maximum length as the total length of the string divided by the number of lines you want. As the actual length of a line is going to be less than the maximum length in many cases, you would probably need to subtract 1 from the number of lines you want.
Related
I am currently doing an assignment and I'm stuck with the approach.
I have a crossword problem which consists of an empty grid (no solid square as a conventional crossword would), with a varied width and height between 4 and 400 (inclusive).
Rules:
Words are part of the input - a list of 10 - 1000 (inclusive) English words of varying lengths.
A horizontal word can only intersect a vertical word.
A vertical word can only intersect a horizontal word.
A word can only intersect 1 or 2 other words.
Each letter is worth one point.
Words must have a 1 grid space gap surrounding it unless it is a part of an intersecting word.
Example:
X X X X X X
X B O S S X
X X X X X X
Goal:
Get the maximum possible score within a 5 minute time limit.
So far:
After some research I am aware that this is an NP-Hard problem. Thus the most optimal solution cannot be calculated because every combination cannot be examined.
The easiest solution would appear to be to sort the words according to length and inserting the highest scoring words for maximum score (greedy algorithm).
I’ve also been told a recursive tree with the nodes consisting of alternative equally scoring word insertions and the knapsack algorithm apply to this problem (not sure what the implementation would look like).
Questions:
What allows me to check the maximum number of combinations within a 5 minute time span that scales accordingly to the maximum possible word list and grid size?
What heuristics might I apply when inserting words?
Btw the goal here is to get the best possible solution in 5 minutes.
To clarify each letter of a valid word is worth 1 point, thus a 5 letter word is worth 5 points.
Thanks in advance I have been reading a lot of mathematical notation on crossword research papers all day which has seem to have lead me in a circle.
I'd start with a word with following characteristics:
It should have max possible intersections.
Its length should be such that number of words of that length are minimum in the list.
ie, word length should be least frequent and most number of intersections.
Reason for this kind of selection is that it would minimize further possibility of words that can be selected. eg. A word of size 9 with 2 further intersections is selected. These intersecting words are of length 6 and 5 (say). Now, you have removed possibility of all those words of length 6 and 5 whose 3rd char is 'a' and 2nd char is 's' (say, 'a' and 's' are the intersecting letters).
If there are many places with same configuration, run this selection procedure one or two steps deeper to get a better selection of which part (word) of the grid to fill first.
Now, try filling in all words in this 1st selected position (since this had min frequency, it should be good to use) and then going deeper in the crossword to fill it. Whichever word results in most points till a deadend is reached, should be your solution. When you reach a dead-end, you can start over with a new word.
This seems like a really interesting problem in discrete optimization. You're certainly right; with the number of words and number of possible placements there is no way you could ever explore a fraction of the space.
Also given the 5 minute time limit (quite short), I think you're going to have a really hard time with any solid heuristic. I think your best bet might be some sort of random permutation / simulated annealing algorithm.
If I was doing this, I would first calculate clusters of words, completely ignoring the crossword structure itself. Take one word, find a second word that intersects it. Then find another word that can fit onto this structure (obeying the max of 2 intersections per word), and so on. You should end up with many of these clusters, which you can rank by density (points / area used). I think you should be able to do this relatively quickly.
Then for the random permutation / simulated annealing part, for my moves I would place either a cluster or unused word onto the crossword itself, or move an existing cluster / word. Just save the current highest-scoring configuration as you go, and return this after the 5 minutes.
If the 5 min is too short to find anything meaningful using random permutations, another approach might be to use a constraint propagation idea working with those clusters.
At a given instance can we find out unique words in a text stream.
One naive solution i can think of is using a hashmap to keep words count.
But this would require keeping words which have word count more than 1 in hashmap. In case of long text stream, it will be lot of words to maintain. Is there a way to crunch on space complexity for this.
You cannot get the number of distinct words exactly without paying the space complexity. However, you can get a reasonably good estimate by using the Flajolet-Martin approach, described on slide 20 of this slide deck.
Assuming the data stream consists of a universe of elements chosen from a set of size N, you can do the following steps, copied from the slides linked above.
Pick a hash function h that maps each of the N elements to at least log_2 (N) bits.
For each stream element a, let r(a) be the number of trailing 0's in h(a).
Record R = the maximum r(a) seen.
Estimated number of distinct elements = 2^R.
I have a number of texts, for example 100.
I would keep the 10 most unique among them. I made a 100x100 matrix where I compared each text among them with the Levenshtein algorithm.
Is there an algorithm to select the 10 most unique?
EDIT :
What i want is the N most unique text that maximize the distance between this N text regardless of the 1st element of my set.
I want the most unique because i will publish these text to the web and i want avoid near duplicate.
A long comment rather than an answer ...
I don't think you've specified your requirement(s) clearly enough. How do you select the 1st element of your set of 10 strings ? Is it the string with the largest distance from any other string (in which case you are looking for the largest element in your array) or the one with the largest distance from all the other strings (in which case you are looking for the largest row- or column-sum in the array).
Moving on to the N (or 10 as you suggest) most distant strings, you have a number of choices.
You could select the N largest distances in the array. I suspect, not having seen your data, that it is likely that the string which is furthest from any other string may also be furthest away from several other strings too -- I mean you may find that several of the N largest entries in your array occur in the same row or column.
You could simply select the N strings with the largest row sums.
Or perhaps you are looking for a cluster of N strings which maximises the distance between all the strings in that cluster and all the strings in the remaining 100-N strings. This might lead you towards looking at, rather obviously, clustering algorithms.
I suggest you clarify your requirements and edit your question.
Since this looks like an eigenvalue problem, I would try to execute the Power iteration on the matrix, and reject the 90 highest values from the resulting vector. The power iteration normally converges very fast, within ~ten iterations. BTW: this solution assumes a similarity matrix. If the entries of your matrix are a measure of *dis*similarity ("distance"), you might need to use their inverses instead.
I two sets of intervals that correspond to the same 1-dimensional (linear) space. Here is a rough visual--in reality, there are many more intervals and they are much more spread out, but this gives the basic idea.
Each of these intervals contains information, and I am writing a program to compare the information in one set of intervals (the red) to the information contained in the other set (the blue).
Here is my problem. I would like to partition the space into n chunks such that there is roughly an equal amount of comparison work to be done in each chunk (the amount of work depends on the number of intervals in that portion of the space). Also, the partition should not split any red or blue interval across two chunks.
So the input is two sets of intervals, and the desired output is a partition of the space such that
the intervals are (roughly) equally distributed across each element of the partition
no interval overlaps with multiple partition elements
Can anyone suggest an approach or an algorithm for doing this?
Define a "word" to be a maximal interval in which every point belongs either to a red interval or a blue interval. No chunk can end in the middle of a word, and every union of consecutive words is a potential chunk. Now apply a minimum raggedness word-wrap algorithm to the words, where the length of a word is defined to be the number of intervals it contains (line = chunk).
Suppose there is given two String:
String s1= "MARTHA"
String s2= "MARHTA"
here we exchange positions of T and H. I am interested to write code which counts how many changes are necessary to transform from one String to another String.
There are several edit distance algorithms, the given Wikipeida link has links to a few.
Assuming that the distance counts only swaps, here is an idea based on permutations, that runs in linear time.
The first step of the algorithm is ensuring that the two strings are really equivalent in their character contents. This can be done in linear time using a hash table (or a fixed array that covers all the alphabet). If they are not, then s2 can't be considered a permutation of s1, and the "swap count" is irrelevant.
The second step counts the minimum number of swaps required to transform s2 to s1. This can be done by inspecting the permutation p that corresponds to the transformation from s1 to s2. For example, if s1="abcde" and s2="badce", then p=(2,1,4,3,5), meaning that position 1 contains element #2, position 2 contains element #1, etc. This permutation can be broke up into permutation cycles in linear time. The cycles in the example are (2,1) (4,3) and (5). The minimum swap count is the total count of the swaps required per cycle. A cycle of length k requires k-1 swaps in order to "fix it". Therefore, The number of swaps is N-C, where N is the string length and C is the number of cycles. In our example, the result is 2 (swap 1,2 and then 3,4).
Now, there are two problems here, and I think I'm too tired to solve them right now :)
1) My solution assumes that no character is repeated, which is not always the case. Some adjustment is needed to calculate the swap count correctly.
2) My formula #MinSwaps=N-C needs a proof... I didn't find it in the web.
Your problem is not so easy, since before counting the swaps you need to ensure that every swap reduces the "distance" (in equality) between these two strings. Then actually you look for the count but you should look for the smallest count (or at least I suppose), otherwise there exists infinite ways to swap a string to obtain another one.
You should first check which charaters are already in place, then for every character that is not look if there is a couple that can be swapped so that the next distance between strings is reduced. Then iterate over until you finish the process.
If you don't want to effectively do it but just count the number of swaps use a bit array in which you have 1 for every well-placed character and 0 otherwise. You will finish when every bit is 1.