Ellipsizing a set of names - algorithm

OK, I'm sure somebody, somewhere must have come up with an algorithm for this already, so I figured I'd ask before I go off to (re)invent it myself.
I have a list of arbitrary (user-entered) non-empty text strings. Each string can be any length (except 0), and they're all unique. I want to display them to the user, but I want to trim them to some fixed length that I decide, and replace part of them with an ellipsis (...). The catch is that I want all of the output strings to be unique.
For example, if I have the strings:
Microsoft Internet Explorer 6
Microsoft Internet Explorer 7
Microsoft Internet Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
Google Chrome 14
then I wouldn't want to trim the ends of the strings, because that's the unique part (don't want to display "Microsoft Internet ..." 3 times), but it's OK to cut out the middle part:
Microsoft...rer 6
Microsoft...rer 7
Microsoft...rer 8
Mozilla Firefox 3
Mozilla Firefox 4
Google Chrome 14
Other times, the middle part might be unique, and I'd want to trim the end:
Minutes of Company Meeting, 5/25/2010 -- Internal use only
Minutes of Company Meeting, 6/24/2010 -- Internal use only
Minutes of Company Meeting, 7/23/2010 -- Internal use only
could become:
Minutes of Company Meeting, 5/25/2010...
Minutes of Company Meeting, 6/24/2010...
Minutes of Company Meeting, 7/23/2010...
I guess it should probably never ellipsize the very beginning of the strings, even if that would otherwise be allowed, since that would look weird. And I guess it could ellipsize more than one place in the string, but within reason -- maybe 2 times would be OK, but 3 or more seems excessive. Or maybe the number of times isn't as important as the size of the chunks that remain: less than about 5 characters between ellipses would be rather pointless.
The inputs (both number and size) won't be terribly large, so performance is not a major concern (well, as long as the algorithm doesn't try something silly like enumerating all possible strings until it finds a set that works!).
I guess these requirements seem pretty specific, but I'm actually fairly lenient -- I'm just trying to describe what I have in mind.
Has something like this been done before? Is there some existing algorithm or library that does this? I've googled some but found nothing quite like this so far (but maybe I'm just bad at googling). I have to believe somebody somewhere has wanted to solve this problem already!

It sounds like an application of the longest common substring problem.
Replace the longest substring common to all strings with an ellipsis. If the string is still too long and you are allowed to have another ellipsis, repeat.
You have to realize that you might not be able to "ellipsize" a given set of strings enough to meet length requirements.
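A single pass of that idea might look like the following minimal Python sketch. The function names and the keep_prefix parameter are made up for illustration; keep_prefix protects the start of each string, since the question asks not to ellipsize the very beginning. The search is brute force, which should be acceptable for the small inputs described, and the input list is assumed to be non-empty.

```python
ELLIPSIS = "..."

def longest_common_substring(strings):
    """Brute force: try every substring of the shortest string, longest first."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

def ellipsize_common(strings, keep_prefix=0):
    """Replace the longest substring shared by all tails with an ellipsis.

    keep_prefix (hypothetical parameter) is the number of leading
    characters that are never touched.
    """
    tails = [s[keep_prefix:] for s in strings]
    common = longest_common_substring(tails)
    # Replacing three or fewer characters with "..." saves nothing.
    if len(common) <= len(ELLIPSIS):
        return list(strings)
    return [s[:keep_prefix] + tail.replace(common, ELLIPSIS, 1)
            for s, tail in zip(strings, tails)]

names = ["Microsoft Internet Explorer 6",
         "Microsoft Internet Explorer 7",
         "Microsoft Internet Explorer 8"]
print(ellipsize_common(names, keep_prefix=9))
# ['Microsoft...6', 'Microsoft...7', 'Microsoft...8']
```

This only makes sense for a group of strings that actually share a long run of text; run over the whole browser list from the question, the common substring is too short to be worth replacing. The sketch also does not repeat the pass or check that the results fit the target length.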

Sort the strings. Keep the first X characters of each string. If this prefix is not unique to the string before and after, then advance until unique characters (compared to the string before and after) are found. (If no unique characters are found, the string has no unique part, see bottom of post) Add ellipses before and after those unique characters.
Note that this still might look funny:
Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook
I don't know what language you're looking to do this in, but here's a Python implementation.
def unique_index(before, current, after, size):
    '''Returns the index of the first part of _current_ of length _size_ that is
    unique to it, _before_, and _after_. If _current_ has no part unique to it,
    _before_, and _after_, it returns the _size_ letters at the end of _current_'''
    before_unique = False
    after_unique = False
    for i in range(len(current)-size):
        #this will be incorrect in the case mentioned below
        if i > len(before)-1 or before[i] != current[i]:
            before_unique = True
        if i > len(after)-1 or after[i] != current[i]:
            after_unique = True
        if before_unique and after_unique:
            return i
    return len(current)-size

def ellipsize(entries, prefix_size, max_string_length):
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.
    #If you want to preserve order then make a copy and make a mapping from the copy to the original
    entries.sort()
    ellipsized = []
    # you could probably remove all this indexing with something out of itertools
    for i in range(len(entries)):
        current = entries[i]
        #entry is already short enough, don't need to truncate
        if len(current) <= max_string_length:
            ellipsized.append(current)
            continue
        #grab empty strings if there's no string before/after
        if i == 0:
            before = ''
        else:
            before = entries[i-1]
        if i == len(entries)-1:
            after = ''
        else:
            after = entries[i+1]
        #Is the prefix unique? If so, we're done.
        current_prefix = entries[i][:prefix_size]
        if not before.startswith(current_prefix) and not after.startswith(current_prefix):
            ellipsized.append(current[:max_string_length] + '...') #again, possibly -3
        #Otherwise find the unique part after the prefix if it exists.
        else:
            index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
            if index == prefix_size:
                header = ''
            else:
                header = '...'
            if index + non_prefix_size == len(current):
                trailer = ''
            else:
                trailer = '...'
            ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
    return ellipsized
Also, you mention the strings themselves are unique, but do they all have unique parts? For example, "Microsoft" and "Microsoft Internet Explorer 7" are two different strings, but the first has no part that distinguishes it from the second. If this is the case, then you'll have to add something to your spec as to what to do to make this case unambiguous. (If you add "Xicrosoft", "MXcrosoft", "MiXrosoft", etc. to the mix with these two strings, there is no unique string shorter than the original string to represent "Microsoft".) (Another way to think about it: if you have all possible X-letter strings, you can't compress them all to strings of X-1 or fewer letters. This is essentially a compression method, and no compression method can compress all inputs.)
Results from original post:
>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
    print(entry)
Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
    print(entry)
Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...

create a URL shortener with Base 62?

I understood the process to shorten the URL with base 62 at How do I create a URL shortener?.
Steps given are
Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).
For this example, I will use 125₁₀ (125 in base 10).
Now you have to convert 125₁₀ to X₆₂ (base 62).
My question is: why not just create a unique numerical key and return it? What is the advantage of converting the numerical key to base 62 and then returning an alphanumeric string?
Is it because the final alphanumeric string will be much shorter than the unique numerical key?
Yes. The idea is to make it short and usable in a URL. A number in base 62 will use fewer characters than the same number in base 10. Notice also that URL shorteners use short hosts, such as g.co.
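To make the saving concrete, here is a minimal Python sketch of the conversion. The function names are made up for illustration, and the digit order (a-z, then A-Z, then 0-9) is an assumption taken from the way the alphabet is written in the question; any fixed ordering of the 62 symbols would work as long as encoding and decoding agree.

```python
import string

# Assumed digit order: a-z, then A-Z, then 0-9 (62 symbols in total).
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits
BASE = len(ALPHABET)

def encode_base62(number):
    """Convert a non-negative integer key to its base-62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number > 0:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    # Remainders come out least significant first, so reverse them.
    return "".join(reversed(digits))

def decode_base62(text):
    """Convert a base-62 string back to the integer key."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number

print(encode_base62(125))                           # cb
print(decode_base62("cb"))                          # 125
print(len(str(10**9)), len(encode_base62(10**9)))   # 10 6
```

With this ordering, the key 125 becomes the two-character string "cb", and a ten-digit decimal key such as one billion fits in six characters.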
I can see you understand that, yes, a number written in base 62 takes fewer characters than a number in base 10, just like a number in base 10 takes fewer characters than a number in base 2 (e.g. 0101 is 3 characters longer than just '5').
So, I'll answer specifically "Why".
Sometimes a link is shortened to be more visually pleasing. A company worried about their public perception likely doesn't want their links to look like an error code due to how long they are so they resort to shortening. That's why some url shortening services allow you to add your own "vanity url" which customizes the domain name, so that a link can be shortened and branded.
Other times a link is shortened to minimize character count when working with constraints, like Twitter. For example, at my company we shortened the links in our automated Twilio messages because SMS messages that contain more than 160 characters are technically 2 concatenated messages so it is more expensive to send.
And finally, if the link is being shared through a medium that cannot be directly clicked on (e.g. verbally, on paper), making it shorter makes it much easier to type into an address bar manually. (Imagine trying to type the url to this SO question when someone is reading it to you.) I assume this is also at least partially why the base used for these links usually stops at around 62. If you start including other arbitrary characters to raise the base and consequently make the link marginally shorter, it'll become harder to communicate, read and type ("domain.name/5omeC0d3" vs "domain.name/🈲}♠").

totalEstimatedMatches behavior with Microsoft (Bing) Cognitive search API (v5)

Recently converted some Bing Search API v2 code to v5 and it works but I am curious about the behavior of "totalEstimatedMatches". Here's an example to illustrate my question:
A user on our site searches for a particular word. The API query returns 10 results (our page size setting) and totalEstimatedMatches set to 21. We therefore indicate 3 pages of results and let the user page through.
When they get to page 3, totalEstimatedMatches returns 22 rather than 21. Seems odd that with such a small result set it shouldn't already know it's 22, but okay I can live with that. All results are displayed correctly.
Now if the user pages back again from page 3 to page 2, the value of totalEstimatedMatches is 21 again. This strikes me as a little surprising because once the result set has been paged through, the API probably ought to know that there are 22 and not 21 results.
I've been a professional software developer since the 80s, so I get that this is one of those devil-in-the-details issues related to the API design. Apparently it is not caching the exact number of results, or whatever. I just don't remember that kind of behavior in the V2 search API (which I realize was 3rd party code). It was pretty reliable on number of results.
Does this strike anyone besides me as a little bit unexpected?
Turns out this is the reason why the response JSON field totalEstimatedMatches includes the word ...Estimated... and isn't just called totalMatches:
"...search engine index does not support an accurate estimation of total match."
Taken from: News Search API V5 paging results with offset and count
As one might expect, the fewer results you get back, the larger % error you're likely to see in the totalEstimatedMatches value. Similarly, the more complex your query is (for example running a compound query such as ../search?q=(foo OR bar OR foobar)&...which is actually 3 searches packed into 1) the more variation this value seems to exhibit.
That said, I've managed to (at least preliminarily) compensate for this by setting the offset == totalEstimatedMatches and creating a simple equivalency-checking function.
Here's a trivial example in python:
while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        original_totalEstimatedMatches = new_totalEstimatedMatches
        #set_new_offset_and_call_api() is a func that does what it says.
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        break
After revisiting the API, I've come up with a way to paginate efficiently without having to use the "totalEstimatedMatches" return value:
class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update((hash(i) for i in resp_urls)) #<==abuse of set operations.
        after_adding = len(self.result_hashes)
        if after_adding == before_adding: #<==then we either got a bunch of duplicates or we're getting very few results back.
            self.finished = True
        else:
            self.offset += len(resp_urls)

    def page_through_results(self, *args, **kwargs):
        while not self.finished:
            new_resp_urls = ...<call_logic>...
            self.calc_next_offset(new_resp_urls)
            ...<save logic>...
        print(f'All unique results for q={self.q} have been obtained.')
This will stop paginating as soon as a full response of duplicates has been obtained.
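The same stop-on-duplicates idea can be tried without network access. The sketch below is a self-contained illustration, not the real API: fake_fetch and FAKE_INDEX are hypothetical stand-ins, and the assumption that a request past the end simply repeats the last page is made up for the demo rather than documented Bing behavior.

```python
def fetch_all_unique(fetch_page, page_size=10):
    """Keep paging until a response contributes no unseen URLs."""
    seen = set()
    ordered = []
    offset = 0
    while True:
        urls = fetch_page(offset, page_size)
        new_urls = [u for u in urls if u not in seen]
        if not new_urls:
            # A whole page of duplicates (or an empty page): we are done.
            break
        seen.update(new_urls)
        ordered.extend(new_urls)
        offset += len(urls)
    return ordered

# Hypothetical stand-in for the search index: 22 results in total.
FAKE_INDEX = [f"https://example.com/{i}" for i in range(22)]

def fake_fetch(offset, count):
    """Assumed behavior: offsets past the end clamp to the last full page."""
    start = min(offset, len(FAKE_INDEX) - count)
    return FAKE_INDEX[start:start + count]

results = fetch_all_unique(fake_fetch)
print(len(results))  # 22
```

The loop never looks at an estimated total; it only relies on the set of URLs already seen, which is why a drifting totalEstimatedMatches value does not affect it.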

renumbering ordered session variables when deleting one

I'm updating a classic ASP application, written in jScript, for a local pita restaurant. I've created a new mobile-specific version of their desktop site, which allows ordering for delivery and lots of customization of the final pita (imagine a website for Subway, which would allow you to add pickles, lettuce, etc.). Each pita is stored as a string of numbers in a session variable. The total number of pitas is also stored. The session might look like this:
PitaCount = 3
MyPita1 = "35,23,16,231,12"
MyPita2 = "24,23,111,52,12,23,93"
MyPita3 = "115,24"
I know there may be better ways to store the data, but for now, since the whole thing is written, working, and live (and the client is happy), I'd like to just solve the problem I have. Here's the problem...
I've got buttons on the order recap page which allow the customer to delete pitas from the cart. When I do this, I want to renumber the session variables. If the customer deletes MyPita1, I need to renumber MyPita2 to MyPita1, renumber MyPita3 to MyPita2, and then decrement the PitaCount.
The AJAX button sends an integer to an ASP file with the number of the pita to be deleted (DeleteID). My function looks at PitaCount and DeleteID. If they're both 1, it just abandons the session. If they're both the same, but greater than one, we're deleting the most recently added pita, so no renumbering is needed. However, if PitaCount is greater than DeleteID, we need to renumber the pitas. Here's the code I'm using to do that:
for (y=DeleteID;y<PitaCount;y++) {
    Session("MyPita" + y) = String(Session.Contents("MyPita" + (y+1)));
};
Session.Contents.Remove("MyPita" + PitaCount);
PitaCount--;
Session.Contents("PitaCount") = PitaCount;
This works for every pita EXCEPT the one which replaces the deleted one, which returns 'undefined'. For example, if I have 6 pitas in my cart, and I delete MyPita2, I end up with 5 pitas in the cart. Number 1, 3, 4, and 5 are exactly what you'd expect, but MyPita2 returns undefined.
I also tried a WHILE loop instead:
while (DeleteID < PitaCount) {
    Session("MyPita" + DeleteID) = String(Session.Contents("MyPita" + (DeleteID+1)));
    DeleteID++;
};
Session.Contents.Remove("MyPita" + PitaCount);
PitaCount--;
Session.Contents("PitaCount") = PitaCount;
This also returns 'undefined', just like the one above.
Until I can get this working I'm simply writing the most recent pita into the spot vacated by the deleted pita, but this reorders the cart, and I consider that a usability problem because people expect the items they added to the cart to remain in the same order. (Yes, I could add some kind of timestamp to the sessions and order using that, but it would be quicker to fix the problem I'm having, I think).
I'm baffled. Why (using the 6 pita example above) would it work perfectly on the second, third, and fourth iteration through the loop, but not on the first?
I can't be sure, but I think your issue may be that the value of DeleteID is a string. This could happen if you assign its value by doing something like:
var DeleteID = Session("DeleteID");
Assuming this is true, then in the first iteration of your loop (which writes to the deleted spot), y is a string, and the expression y+1 is interpreted as a string concatenation instead of a numeric addition. If, for example, you delete ID 1, you're actually copying the value from id 11 ("1" + 1) into the deleted spot, which probably doesn't exist in your tests. This can be tested by adding at least 11 items to your cart and then deleting the first one. On the next iteration, the increment operator ++ forces y to be a number, so the script works as expected from that point on.
The solution is to convert DeleteID to a number when initializing your loop:
for (y = +DeleteID; y < PitaCount; y++) {
There may be better ways to convert a string to a number, but the + is what I remember.

Ruby regular expression for asterisks/underscore to strong/em?

As part of a chat app I'm writing, I need to use regular expressions to match asterisks and underscores in chat messages and turn them into <strong> and <em> tags. Since I'm terrible with regex, I'm really stuck here. Ideally, we would have it set up such that:
One to three words, but not more, can be marked for strong/em.
Patterns such as "un*believ*able" would be matched.
Only one or the other (strong OR em) work within one line.
The above parameters are in order of importance, with only #1 being utterly necessary - the others are just prettiness. The closest I came to anything that worked was:
text = text.sub(/\*([(0-9a-zA-Z).*])\*/,'<b>\1<\/b>')
text = text.sub(/_([(0-9a-zA-Z).*])_/,'<i>\1<\/i>')
But it obviously doesn't work with any of our params.
It's odd that there's not an example of something similar already out there, given the popularity of using asterisks for bold and whatnot. If there is, I couldn't find it outside of plugins/gems (which won't work for this instance, as I really only need it in one place in my model). Any help would be appreciated.
This should help you finish what you are doing:
sub(/\*(.*)\*/,'<b>\1</b>')
sub(/_(.*)_/,'<i>\1</i>')
Firstly, your criteria are a little strange, but, okay...
It seems that a possible algorithm for this would be to find the matches in a message, count them to see if there are fewer than 4, and then try to perform one set of substitutions.
# Constants, so that they are visible inside the method body.
STRONG_REGEXP = /\*([^\*]*)\*/
EM_REGEXP = /_([^_]*)_/
def process(input)
  # scan returns every match, so its size is the number of marked spans
  if input =~ STRONG_REGEXP && input.scan(STRONG_REGEXP).size < 4
    input.sub(STRONG_REGEXP, '<b>\1</b>')
  elsif input =~ EM_REGEXP && input.scan(EM_REGEXP).size < 4
    input.sub(EM_REGEXP, '<i>\1</i>')
  else
    input # nothing to convert, return the message unchanged
  end
end
Your specifications aren't entirely clear, but if you understand this, you can tweak it yourself.

Parsing text files in Ruby when the content isn't well formed

I'm trying to read files and create a hashmap of the contents, but I'm having trouble at the parsing step. An example of the text file is
put 3
returns 3
between
3
paragraphs 1
4
3
#foo 18
****** 2
The word becomes the key and the number is the value. Notice that the spacing is fairly erratic. The word isn't always a word (which doesn't get picked up by /\w+/) and the number associated with that word isn't always on the same line. This is why I'm calling it not well-formed. If there were one word and one number on one line, I could just split it, but unfortunately, this isn't the case. I'm trying to create a hashmap like this.
{"put"=>3, "#foo"=>18, "returns"=>3, "paragraphs"=>1, "******"=>2, "4"=>3, "between"=>3}
Coming from Java, it's fairly easy. Using Scanner I could just use scanner.next() for the next key and scanner.nextInt() for the number associated with it. I'm not quite sure how to do this in Ruby when it seems I have to use regular expressions for everything.
I'd recommend just using split, as in:
h = Hash[*s.split]
where s is your text (e.g. s = open('filename').read). Believe it or not, this will give you precisely what you're after.
EDIT: I realized you wanted the values as integers. You can add that as follows:
h.each{|k,v| h[k] = v.to_i}
