How is Illegal char's URL working? - algorithm

There are many sites (such as Stack Overflow) that have the title of the page in the URL.
I am looking for the algorithm they use to avoid illegal URL characters. (I don't want URL encoding; I want a replace/remove algorithm.)
For example, 'How is Illegal char's URL working?' should become 'How-is-Illegal-chars-URL-working'.
Thanks!

The algorithm to do this is generally called 'slugify', because it turns a string into a 'slug' to be used in a URL. Searching for that should give you plenty of useful implementations.

No idea how SO does it, but I would just strip every non-alphanumeric character and replace spaces with underscores.
In Python:
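def cleanTitle(title):
    # Keep lowercase ASCII letters and digits, turn spaces into underscores,
    # and drop every other character.
    temp = ''
    for character in title.lower():
        if character == ' ':
            temp += '_'
        elif character in 'abcdefghijklmnopqrstuvwxyz0123456789':
            temp += character
    return temp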
I see you are working in C#. I don't know C#, so you'll have to translate this code. I doubt it's hard to do, though.
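For comparison, here is a minimal slugify sketch that reproduces the hyphenated form asked for in the question. It is not how Stack Overflow does it; the function name slugify and the choice to drop apostrophes before hyphenating are assumptions made for this example.

```python
import re
import unicodedata

def slugify(title):
    # Fold accented letters to their closest ASCII form; drop what cannot be folded
    ascii_title = (unicodedata.normalize('NFKD', title)
                   .encode('ascii', 'ignore')
                   .decode('ascii'))
    # Remove apostrophes first so "char's" becomes "chars" rather than "char-s"
    ascii_title = ascii_title.replace("'", "")
    # Collapse every run of non-alphanumeric characters into a single hyphen
    slug = re.sub(r'[^A-Za-z0-9]+', '-', ascii_title)
    # Trim hyphens left over at either end
    return slug.strip('-')

print(slugify("How is Illegal char's URL working?"))
```

Doing the apostrophe removal as a separate step keeps possessives together; everything else is handled by the single regex substitution.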

Related

breakable slashes everywhere but URLs

I generate PDF (via LaTeX) from reStructuredText using Python Sphinx (1.4.6).
I use narrow table column headers with texts like "stuff/misc/other". I need the slashes to be breakable, so the table headers don't overflow into the next column.
The LaTeX solution is to use \BreakableSlash or \slash where necessary. I can use python code to replace all slashes:
from sphinx.util.texescape import tex_replacements
# \BreakableSlash needs package hyphenat to be loaded
tex_replacements.append((u'/', ur'\BreakableSlash ') )
# tex_replacements.append((u'/', ur'\slash ') )
But that will break any URL like http://www.example.com/ into something like
http:\unhbox\voidb@x\penalty\@M\hskip\z@skip/\discretionary{-}{}{}\penalty\@M\hskip\z@skip\unhbox\voidb@x\penalty\@M\hskip\z@skip/\discretionary{-}{}{}\penalty\@M\hskip\z@skipwww.example.com
or
http:/\penalty\exhyphenpenalty/\penalty\exhyphenpenaltywww.example.com
I'd like to use a general solution that works in both cases, where the editor of the documentation can still use normal ReST and doesn't have to worry about latex.
Any idea how to get classic slashes in URLs and breakable slashes everywhere else?
You have not given any data or source code and only asked for an idea, so I will take the liberty of sketching a solution in pseudocode:
Split the document into a list of strings at each space using .split(' ')
For each string, check whether it is a URL by comparing its left side to http:// (and maybe also ftp://, https:// or similar schemes)
Do the replacements, but only in strings that are not URLs
Recombine all strings, including the spaces, using a command such as " ".join(my_list)
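The steps above can be sketched in a few lines of Python. This is only an illustration on a plain string, not something wired into Sphinx; the function name break_slashes and the list of URL prefixes are assumptions for the example.

```python
# Schemes treated as URLs; extend as needed (assumed list)
URL_PREFIXES = ('http://', 'https://', 'ftp://')

def break_slashes(text, replacement=r'\BreakableSlash '):
    # Split on single spaces so the original spacing can be restored exactly
    tokens = text.split(' ')
    result = []
    for token in tokens:
        if token.startswith(URL_PREFIXES):
            # Leave URLs untouched
            result.append(token)
        else:
            # Make every slash breakable in ordinary text
            result.append(token.replace('/', replacement))
    return ' '.join(result)

print(break_slashes('see stuff/misc/other at http://www.example.com/ now'))
```

Note that a URL wrapped in punctuation, such as "(http://...)", would not be recognised by the prefix check, so a real implementation would need a more careful test.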
One way to do it might be to write a Transform subclass and then register it with app.add_transform in setup(app), so that it is applied on every read.
I could use DefaultSubstitutions from transforms.py as a template for my own class.

InDesign Grep: Changing sentence beginnings to Uppercase

I am relatively new to scripting, and within an InDesign script I am trying to change the first letter of every sentence to uppercase (many of them are lowercase, since I randomly generated the sentences from different text sources).
I am so far able to find the text parts with this Grep expression:
\.(\s)+\l
I also found this script by Peter Kahrel, that he shares on InDesign Secrets:
app.findGrepPreferences.findWhat = "^.";
found = app.activeDocument.findGrep();
for (i = 0; i < found.length; i++)
    found[i].characters[0].changecase (ChangecaseMode.lowercase);
However, when I now replace the ^. with my own expression, and change lowercase to uppercase, the script does not work, which makes sense, since I do not want to change the first character of my findGrep results, but the last one. But how can I find the last character? The breaks between the sentences have different lengths, so I cannot simply type 2 instead of 0.
Any help would be very appreciated! Thank you!
Edit: I'm working on CS6.
Your GREP returns matches that start with a period, then have any number of spaces (including hard returns, probably), and always end with one lowercase character. So far, so good. You can access the last character (and in fact any last item in any InDesign object collection) in this way:
found[i].characters[-1].changecase (ChangecaseMode.lowercase);
which 'indexes' from the end, rather than from the start.
However! The only character in your matches, other than the period and spaces, is always going to be a lowercase letter. So you can skip the entire "how to find the correct index" thing, and probably slightly speed up the script as well, by simply applying lowercase (or, as you are using it, uppercase) to the entire match:
found[i].changecase (ChangecaseMode.lowercase);
because nothing will happen to not-lowercaseable characters (a word I declare to signify "not having the Unicode-defined property of being uppercase with a lowercase equivalent"). (Or the other way around, if I understand your purpose correctly.)
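The same point can be shown outside InDesign. The sketch below is plain Python rather than InDesign scripting, and the function name capitalize_sentence_starts is made up for the example: it changes the case of the whole match, relying on the fact that the period and the whitespace have no uppercase form.

```python
import re

def capitalize_sentence_starts(text):
    # A period, one or more whitespace characters, then one lowercase letter
    pattern = re.compile(r'\.\s+[a-z]')
    # Uppercasing the entire match only affects the letter;
    # the period and whitespace come back unchanged
    return pattern.sub(lambda m: m.group(0).upper(), text)

print(capitalize_sentence_starts('first one.  second one.\nthird one.'))
```

As with the GREP expression in the question, the very first sentence of the text is not touched, because nothing precedes it.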

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in a URL AFTER the domain name and before the first ?, the last slash, or the end of the URL if neither of the two exists.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are an incredibly powerful tool, but when you're dealing with things defined by hundred-page convoluted standards, and there is already a library that does the job faster, easier, and more correctly, why reinvent the wheel?
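The library approach looks much the same in other languages. As a rough Python equivalent (the helper name path_without_slashes is an assumption for this example), the standard urllib.parse module splits the URL, and stripping slashes from the path gives the form requested in the question:

```python
from urllib.parse import urlsplit

def path_without_slashes(url):
    # urlsplit separates scheme, host, path, query and fragment;
    # the query string is therefore already excluded from .path
    path = urlsplit(url).path
    # Drop the leading slash and any trailing slash
    return path.strip('/')

base = 'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price'
for url in (base + '/', base + '?id=2', base):
    print(path_without_slashes(url))
```

All three URLs from the question yield the same path with this approach.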
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
You can probably adjust the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.
Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they are there for you when you need them.
Also, there is an xkcd (https://xkcd.com/208/) for everything.
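If a regex is still wanted, here is a sketch that does not depend on the path starting with a year. It is not the pattern from this answer; the names PATH_RE and extract_path are assumptions, and it only handles http and https URLs of the shape shown in the question.

```python
import re

# scheme and host, then a lazily matched path, an optional trailing slash,
# and finally either the start of the query string or the end of the URL
PATH_RE = re.compile(r'^https?://[^/]+/([^?]*?)/?(?:\?|$)')

def extract_path(url):
    match = PATH_RE.match(url)
    # Return None when the URL has no path component at all
    return match.group(1) if match else None

base = 'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price'
for url in (base + '/', base + '?id=2', base):
    print(extract_path(url))
```

The lazy quantifier lets the optional trailing slash fall outside the captured group, which is what removes it from the result.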

Verify string in MVC validator using regularexpressions

I am trying to grasp the concept of Regular Expressions but seem to be missing something.
I want to ensure that someone enters a string that ends with .wav in a field. Should be a pretty simple Regular Expression.
I've tried this...
[RegularExpression(@"$.wav")]
but it seems to be incorrect. Any help is appreciated. Thanks!
$ is the anchor for the end of the string, so $.wav doesn't make any sense. You can't have any characters after the end of the string. Also, . has a special meaning for regex (it just means 'any character') so you need to escape it.
Try writing
\.wav$
If that doesn't work, try
.*\.wav$
(It depends on if the RegularExpression attribute wants to match the whole string, or just a part of it. .* means 'any character, 0 or more times')
Another thing you should consider is what to do with extra whitespace in the field. Users have a terrible habit of adding extra whitespace to inputs, which is why the various .Trim() functions are so important. Here, RegularExpressionAttribute might be evaluated before you can trim the input, so you might want to write this:
.*\.wav[\s]*$
The [\s]* section means 'any whitespace character (tabs, space, linebreak, etc) 0 or more times'.
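To see how these patterns behave, here is a quick check in Python rather than C#. The regex syntax used is the same in both for these patterns, but the helper name is_wav_name is made up for the example, and fullmatch is used on the assumption that the validator must match the whole value.

```python
import re

# Anything, then a literal ".wav", then optional trailing whitespace
ENDS_WITH_WAV = re.compile(r'.*\.wav\s*')

def is_wav_name(value):
    # fullmatch requires the pattern to cover the entire string
    return ENDS_WITH_WAV.fullmatch(value) is not None

for candidate in ('song.wav', 'song.wav  ', 'song.wave', 'songwav'):
    print(repr(candidate), is_wav_name(candidate))

# The original pattern asks for characters after the end of the string,
# so it cannot match
print(re.search(r'$.wav', 'song.wav'))
```

The last line illustrates why putting $ at the front of the pattern fails.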
You should read a tutorial on regex. It's not so hard to understand for simple problems like this. When I was learning I found this site pretty handy: http://www.regular-expressions.info/

Ruby regex: extract a list of urls from a string

I have a string of images' URLs and I need to convert it into an array.
http://rubular.com/r/E2a5v2hYnJ
How do I do this?
URI.extract(your_string)
That's all you need if you already have it in a string. I can't remember, but you may have to put require 'uri' in there first. Gotta love that standard library!
See the docs for URI.extract.
Scan returns an array
myarray = mystring.scan(/regex/)
See here on regular-expressions.info
The best answer will depend very much on exactly what input string you expect.
If your test string is accurate then I would not use a regex, do this instead (as suggested by Marnen Laibow-Koser):
mystring.split('?v=3')
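If you really don't have constant fluff between your useful strings, then a regex might be better. Your regex is greedy. This will get you part of the way (the extension group is non-capturing so that scan returns the whole URLs rather than just the extensions):
mystring.scan(/https?:\/\/[\w.-\/]*?\.(?:jpe?g|gif|png)/)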
Note the '?' after the '*' in the part capturing the server and path pieces of the URL, this makes the regex non-greedy.
The problem with this is that if your server name or path contains any of .jpg, .jpeg, .gif or .png then the result will be wrong in that instance.
Figuring out what is best needs more information about your input string. You might for example find it better to pattern match the fluff between your desired URLs.
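As an illustration of the non-greedy approach, here is the same idea in Python instead of Ruby. The sample input is invented, since the real string sits behind the rubular link, and the name extract_image_urls is an assumption.

```python
import re

# Lazily match host and path characters up to the first image extension;
# the extension group is non-capturing so findall returns whole URLs
IMAGE_URL_RE = re.compile(r'https?://[\w./-]*?\.(?:jpe?g|gif|png)')

def extract_image_urls(text):
    return IMAGE_URL_RE.findall(text)

# Hypothetical input: image URLs glued together with a constant "?v=3" suffix
sample = 'http://a.com/x.jpg?v=3http://b.org/y/z.png?v=3'
print(extract_image_urls(sample))
```

The caveat mentioned above still applies: a host or directory name containing .jpg, .gif or .png would cut the match short.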
Use String#split (see the docs for details).
Part of the problem is that in rubular you are using https instead of http. This gets you closer to what you want if the other answers don't work for you:
http://rubular.com/r/cIjmjxIfz5

Resources