Does any programming language assign meaning to trailing spaces? - whitespace

There are a number of programming languages such as Python and F# that assign semantic meaning to leading spaces, so that if a line of code has leading spaces removed, it can break the whole program.
But what about trailing spaces?
Are there any programming languages that assign semantic meaning to spaces that come at the end of a line, just before the line feed or carriage return character?
(Let's assume I'm excluding languages that are nothing but space, e.g. Whitespace.)

T-SQL is sensitive to trailing whitespace, both in identifiers and in multiline strings, and it is still used in many enterprise applications:
-- Warning: horrifying, don't ever do this
CREATE TABLE [Table
Name](
[My
field] INT
);
SELECT * FROM [Table
Name]; --Works;
SELECT [My
field] FROM [Table
Name]; --Works
SELECT [My
field] FROM [Table
Name]; --Fails, lacks trailing space in field name
SELECT [My
field] FROM [Table
Name]; --Fails, too few trailing spaces in table name

Markdown treats some trailing whitespace specially. When a line ends with two or more spaces, a line break <br> is inserted to break up a paragraph.
This rule is not well-known, so it may be triggered unintentionally and some linters disallow it by default (MD009 no-trailing-spaces):
This rule is triggered on any lines that end with unexpected whitespace. To fix this, remove the trailing space from the end of the line.
Note: Trailing space is allowed in indented and fenced code blocks because some languages require it.
The br_spaces parameter allows an exception to this rule for a specific number of trailing spaces, typically used to insert an explicit line break. The default value allows 2 spaces to indicate a hard break (<br> element).
Note: You must set br_spaces to a value >= 2 for this parameter to take effect. Setting br_spaces to 1 behaves the same as 0, disallowing any trailing spaces.
As an example, these markdown paragraphs (middle dot represents space):
Three···
trailing spaces
Two··
trailing spaces
One·
trailing space
No
trailing spaces
A··
mix
of··
spacing··
are rendered as such:
Three
trailing spaces
Two
trailing spaces
One
trailing space
No
trailing spaces
A
mix
of
spacing
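The hard-break rule can be sketched in a few lines of Ruby. This is only an illustration of the "two or more trailing spaces" rule described above, not how any real Markdown renderer is implemented; render_paragraph is a hypothetical helper, and it assumes the paragraph has already been split into lines and that a break is never emitted after the last line.

```ruby
# Render one Markdown paragraph, given as an array of source lines.
# A line that ends in two or more spaces, and is not the last line,
# gets a hard break; trailing spaces are dropped from the output either way.
def render_paragraph(lines)
  rendered = lines.each_with_index.map do |line, index|
    text = line.sub(/ +\z/, "")                       # drop trailing spaces
    last_line = index == lines.length - 1
    hard_break = !last_line && line.match?(/ {2,}\z/) # two or more spaces
    hard_break ? "#{text}<br>" : text
  end
  rendered.join("\n")
end

puts render_paragraph(["Two  ", "trailing spaces"])
puts render_paragraph(["One ", "trailing space"])
```

With this sketch, only the first paragraph gets a <br> after its first line; a single trailing space is silently discarded.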

Related

Regular expression for matching blocks delimited by separators of variable type

I need to cleanse some email messages from automatically appended chunks of text. Each one of those chunks is enclosed by a pair of separators (single or multiple lines). I need a regular expression that will match anything between such separators so that I can remove it.
Here is some text that illustrates the problem and shows all the weird cases that need to be accounted for:
This is some text that should not be matched. As you can see, it is not enclosed
by separator lines.
===========================================================
This part should be matched as it is between two separator lines. Note that the
opening and closing separators are composed of the exact same number of the same
character.
===========================================================
This block should not be matched as it is not enclosed by its own separators,
but rather the closing separator of the previous block and the opening
separator of the next block.
===========================================================
It is tricky to distinguish between an enclosed and non-enclosed blocks, because
sometimes a matching pair of separators appears to be legal, while it is really
the closing separator of the previous block and the opening separator of the
next one (e.g. the block obove this one).
===========================================================
==================================
=====
This block is enclosed by multiline separators.
==================================
=====
Some more text that should not be matched by the regex.
***************************************
A separator can be a different character, for example the asterisk.
***************************************
***************************************
*******************
Another example of a multiline separated block.
***************************************
*******************
>Even more text not to be matchedby the regex. This time, preceeded by a
>variable number of '>'.
>>__________________________________________
>>And another type of separator. The block is now also a part of a reply section
>>of the email.
>>__________________________________________
Note that there is no recursion to be handled here - a block is never inside of another block.
I've been trying to work this out for a while now, but I am not experienced enough when it comes to regex. I do not know how to make the expression "remember" what the opening separator was.
Right now my solution will produce incorrect matches for a block like this:
=========================
text text
text
*************************
I would really appreciate some help on this. I am working in Ruby, but will work through different types of syntax if required.
Try Regex: ((.)(?:\2)+)(?:\n(\2+))?\n.+?\n\1(?:(?:\n\3))?
Please note that I have added 2 restrictions around the multiline separators:
only 2 lines in the separator
separator on the second line is same as the first line
Let me know if these restrictions are not needed.
Looks like a backreference should do:
input.gsub(/(?<sep>\W{40,}).*?(\k<sep>)/m, "\n")
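To make the "remember the opening separator" idea concrete, here is a minimal Ruby sketch. It makes assumptions beyond what the answers state: separators are single lines of at least five identical characters drawn from =, * and _, multiline separators and the leading > of reply sections are not handled, and ENCLOSED_BLOCK and strip_enclosed_blocks are names made up for this example.

```ruby
# (?<ch>...) captures the separator character, \k<ch>{4,} repeats it, and
# \k<sep> later demands exactly the same separator line again.
# The m modifier lets . match newlines, so the body may span several lines.
ENCLOSED_BLOCK = /^(?<sep>(?<ch>[=*_])\k<ch>{4,})\n.*?\n\k<sep>$/m

# Remove every block that sits between two identical separator lines.
def strip_enclosed_blocks(text)
  text.gsub(ENCLOSED_BLOCK, "")
end

sample = "keep\n=====\ndrop\n=====\nkeep too\n*****\ntext\n=====\n"
puts strip_enclosed_blocks(sample)
```

Because the closing separator must equal the captured opening one, the block that opens with asterisks and ends with equals signs is left alone.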

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exclude one character from it?
Examples:
match all punctuation apart from the question mark
match all whitespace characters apart from the newline
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
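The same subtraction covers the first two cases from the question. A small sketch (the constant names are invented for this example):

```ruby
# Punctuation, minus the question mark.
PUNCT_EXCEPT_QUESTION = /[[:punct:]&&[^?]]/
# Whitespace, minus the newline.
SPACE_EXCEPT_NEWLINE = /[[:space:]&&[^\n]]/

p ".?!".scan(PUNCT_EXCEPT_QUESTION)            # => [".", "!"]
p "a \tb\nc".gsub(SPACE_EXCEPT_NEWLINE, "_")   # => "a__b\nc"
```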
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match one or more punctuation symbols, except !.
The code puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) will output ./?
Use character class subtraction whenever you need to exclude single characters; it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the \b marked with ^^ is very important, since it makes the regex engine check for the whole word go. If you remove it, the lookahead rejects every word that starts with go, not just the word go itself.
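A quick way to see the effect of that inner \b is to run both variants over the same input. This is an illustrative sketch; the constant names and the sample string are made up:

```ruby
# With the inner \b, only the exact word "go" is rejected.
WHOLE_WORD_EXCLUDED = /\b(?!go\b)\w+\b/
# Without it, every word that starts with "go" is rejected.
PREFIX_EXCLUDED = /\b(?!go)\w+\b/

words = "go gone ago going"
p words.scan(WHOLE_WORD_EXCLUDED)  # => ["gone", "ago", "going"]
p words.scan(PREFIX_EXCLUDED)      # => ["ago"]
```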
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuation apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

preg_match search pattern, stop at character combination

I am trying to pull a whole MySQL statement from a database SQL file:
INSERT INTO `helppages`
(`HelpPageID`, `ShowHelpItem`, `HelpRank`, `HelpCategory`, `HelpTitle`, `HelpDescription`, `HelpLink`, `HelpText`, `CMSHelpBar`, `CMSHelpBarAdditional`)
VALUES (... characters (Too many to post here, but the expression below grabs all) ...
);
The expression I am currently using, after going through many variations, is:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*?)\)\;\r\n#s", $uploadedfile, $matches);
which gets all the information, but I can't get it to stop at the end ");\r\n".
Also, $SearchingTableName = helppages.
Edit
Sorry, the current expression uses a lookahead:
preg_match("#INSERT INTO `$SearchingTableName` ([!%&'-/:<=>#^`\;\s\d\w\"\#\$\(\)\*\+\,\.\?\[\]\{\}\(\)\\\|©]*)(?!\)\;\r\n)#s", $uploadedfile, $matches);
Also, I checked with MS Word using );^p and there is only one instance, at the end of the INSERT.
You can't match this kind of string by only playing with character classes. You need to describe the structure of the string.
For this simple particular case you can use this pattern:
$pattern = <<<EOD
~
# definitions
(?(DEFINE)
(?<elt> [^"',)]+ | '(?>[^\\']+|\\.)*' | "(?>[^\\"]+|\\.)*" )
(?<list> \( \g<elt>? (?: \s* , \s* \g<elt> )* \) )
)
# main pattern
INSERT \s+ (?:INTO \s+)? `$SearchingTableName` \s* \g<list>? \s* VALUES \s*
\g<list> \s* (?: , \s* \g<list> \s* )* ;
~xs
EOD;
if (preg_match_all($pattern, $uploadedfile, $m))
print_r($m[0]);
But keep in mind that parsing a programming language is not an easy task and is full of traps (depending on the syntax), even with the capabilities of the PHP regex engine. (It is possible, however.)
regex features used here:
delimiters and modifiers:
The pattern delimiter used here is ~ instead of the classical /. There is no literal ~ in the pattern thus it's ok.
The pattern uses two modifiers: s and x:
by default the . can't match the newline character \n. The s modifier (s for singleline mode) changes this behavior. When it is used, the . can match all characters including the newline character. (Note that you can get the default behavior back with \N, which doesn't match the newline character whatever the mode.)
x switches on the extended mode. In this mode, whitespace inside the pattern is ignored. This mode also allows inline comments that begin with a hash character #. It is very useful for making long patterns readable with spaces, indentation and comments.
using named captures
When you have a long pattern and need to reuse the same subpatterns several times, you can reuse subpatterns that are written inside capture groups.
A quick example:
You want to match several items separated by commas and composed with 4 digits and 4 letters like this: 1234abcd,5678efgh,9012ijkl,3456mnop.
The pattern to do that is obviously ^\d{4}[a-z]{4}(?:,\d{4}[a-z]{4})+$
But if I don't want to write \d{4}[a-z]{4} two times, I can put it in a capture group and use an alias for the subpattern in the capture group, like this: ^(\d{4}[a-z]{4})(?:,(?1))+$.
Here the (?1) is an alias for the subpattern inside the capture group 1 (not the content matched by the subpattern as a backreference \1 does, but the subpattern itself) that is \d{4}[a-z]{4}.
PCRE, the regex engine used by PHP, also supports the syntax \g<1> as an alternative to (?1).
But if you have a lot of capture groups in the pattern, it is not always easy to remember the number of the capture group you need. This is why you can name capturing groups. Example: ^(?<diglet>\d{4}[a-z]{4})(?:,\g<diglet>)+$
The other advantage of named patterns, besides making the whole pattern more readable, is that they add a semantic dimension to the pattern, in the same way as adding an id attribute to an HTML tag does.
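For comparison, here is the same idea as a Ruby sketch. Ruby's regex engine also accepts the \g<name> call syntax and an x modifier, so the pattern carries over almost unchanged; DIGLET_LIST and SAME_ITEM_LIST are names invented for this example. The second pattern uses a plain backreference to show the difference: \1 repeats the text that was matched, while \g<diglet> re-runs the subpattern.

```ruby
# Subpattern call: every item must look like 4 digits + 4 letters.
DIGLET_LIST = /
  \A
  (?<diglet> \d{4} [a-z]{4} )   # define the subpattern and match the first item
  (?: , \g<diglet> )+           # re-run the same subpattern after each comma
  \z
/x

# Backreference: every item must be identical to the first one.
SAME_ITEM_LIST = /\A(\d{4}[a-z]{4})(?:,\1)+\z/

p DIGLET_LIST.match?("1234abcd,5678efgh,9012ijkl,3456mnop")  # => true
p SAME_ITEM_LIST.match?("1234abcd,5678efgh")                 # => false
p SAME_ITEM_LIST.match?("1234abcd,1234abcd")                 # => true
```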
definition section
Instead of defining the named subpattern directly in the main pattern like in the previous example, you can use a definition section to hold all the subpatterns that will be used in the main pattern. Note that everything inside this section is there only for definition purposes and doesn't match anything. It's like a zero-width assertion.
The syntax of this section is: (?(DEFINE)(?<diglet>\d{4}[a-z]{4})) (you can put several named subpatterns inside). The preceding pattern becomes: (?(DEFINE)(?<diglet>\d{4}[a-z]{4}))^\g<diglet>(?:,\g<diglet>)+$
the pattern itself:
The first part of the pattern, enclosed between (?(DEFINE) and ), consists of subpattern definitions that will be used later in the main pattern.
The elt subpattern describes an item (a column name or a value):
[^"',)]+ # anything that is not a quote, a comma or a closing parenthesis:
# in the present context this will match numbers and column names
| # OR
'(?>[^\\']+|\\.)*' # string between single quotes (designed to deal with escaped quotes)
|
"(?>[^\\"]+|\\.)*" # same for double quotes
The list subpattern describes the full list of elements separated by commas between parentheses. Note that this subpattern uses a reference to the elt subpattern.
The main pattern needs only to reuse the subpattern list.

string has trailing whitespaces that aren't white spaces? (i.e. strip doesn't get rid of it)

I have the following string I got from parsing some html:
"this is my string  "
If I use .strip or .rstrip the string remains the same.
However if I literally type the string "this is my string " and type .strip then the trailing spaces get stripped.
This leads me to believe the string I obtained from parsing html does not contain trailing whitespace. So the questions I have are: 1) what is trailing the string if it isn't whitespace? and 2) how do I get rid of it?
The Unicode table contains several whitespace characters, and it is possible that not all of these characters are handled by the strip methods. If you want to use a regular expression with the sub method, you can try this simple pattern: /\p{Space}+\z/ or /[[:space:]]+\z/ to trim all the blank characters on the right. (Obviously, the replacement string must be empty.)
Note: \s is equivalent to [ \t\r\n\f\v] in Ruby and doesn't cover all the whitespace characters in the Unicode table.
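A minimal sketch of both steps, inspecting the trailing characters and then trimming them. It assumes the trailing characters are non-breaking spaces (U+00A0), which is common in text extracted from HTML but is not confirmed by the question; rstrip_unicode is a hypothetical helper name.

```ruby
# Trim any trailing Unicode whitespace, not only the ASCII kind.
def rstrip_unicode(str)
  str.sub(/[[:space:]]+\z/, "")
end

# Assumption: the string ends with two non-breaking spaces.
from_html = "this is my string\u00A0\u00A0"

p from_html.codepoints.last(2)      # shows which characters are really there
p from_html.rstrip == from_html     # rstrip leaves U+00A0 alone
p rstrip_unicode(from_html)
```

Looking at the codepoints first answers the "what is trailing the string" part; the sub call then removes it.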

Regular expression to turn hyphens into double hyphens

My implementation of markdown turns double hyphens into endashes. E.g., a -- b becomes a – b
But sometimes users write a - b when they mean a -- b. I'd like a regular expression to fix this.
Obviously body.gsub(/ - /, " -- ") comes to mind, but this messes up markdown's unordered lists: if a line starts with - list item, it will become -- list item. So the solution must only swap out hyphens when there is a word character somewhere to their left.
You can match a word character to the hyphen's left and use a backreference in the replacement string to put it back:
body.gsub(/(\w) - /, '\1 -- ')
Perhaps, if you want to be a little more accepting ...
gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')
\b[ \t] forces a non-whitespace character before the whitespace through a word boundary condition. I don't use \s, to avoid matching across line breaks. I also use only one capture, to preserve the preceding whitespace (does Ruby 1.8.x have a ?<= ?).
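Both suggestions can be checked against a list item and a run of wider spacing. This is just a sketch; the method names are invented for the example:

```ruby
# First answer: require a word character directly before " - ".
def fix_hyphens_strict(body)
  body.gsub(/(\w) - /, '\1 -- ')
end

# Second answer: accept any run of spaces or tabs around the hyphen.
def fix_hyphens_lenient(body)
  body.gsub(/\b([ \t]+)-(?=[ \t]+)/, '\1--')
end

text = "- list item\na - b"
puts fix_hyphens_strict(text)
puts fix_hyphens_lenient("a  -  b")
```

In both variants the list marker at the start of a line is left untouched, and text that already contains -- is not changed again.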
