What characters are never used in XPath?

I'm trying to build a DSL which will contain a number of XPaths as parameters. I'm new to XPath, and I need a character which is never used in the XPath syntax so I can delimit n number of XPaths on a single line of a script. My question: what characters are NOT part of the XPath syntax?

The null character.
Seriously. Because an XPath is supposed to support any XML document, it must be capable of matching text nodes that contain any allowed Unicode character. However, XML disallows one character: the null character.
OK, that is not entirely true, but it is the simplest answer. XML 1.1 allows control characters, with the sole exception of the Unicode null character. Under the XML 1.0 production of Char, however, there are a few other characters you can choose from: surrogate code points (as characters, not as correctly encoded octets representing a non-BMP character), and anything below 0x20 except line feed, carriage return and tab.
Another good guess is any Private Use character, since your input documents are unlikely to use it. That is not guaranteed, however, and you asked for "never".
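A minimal Ruby sketch of this idea, assuming the DSL lines are handled as UTF-8 strings and that U+001F (unit separator, one of the control characters excluded by the XML 1.0 Char production) is picked as the delimiter. The names DELIM, join_xpaths and split_xpaths are hypothetical, and the guard clause is there because nothing stops a hand-written expression from containing the delimiter anyway.

```ruby
# Hypothetical delimiter: U+001F, a control character that XML 1.0 text cannot contain.
DELIM = "\u001F"

# Join several XPath expressions into one line of the DSL.
def join_xpaths(xpaths)
  xpaths.each do |xp|
    # Defensive check: refuse input that already contains the delimiter.
    raise ArgumentError, "XPath contains the delimiter" if xp.include?(DELIM)
  end
  xpaths.join(DELIM)
end

# Split a DSL line back into its XPath expressions.
def split_xpaths(line)
  line.split(DELIM)
end

line = join_xpaths(["/a/b/c/d[e < 3]", "/w/x/y/z"])
paths = split_xpaths(line)
```

A control character is awkward to type and invisible in most editors, which is the practical cost of this choice.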

I'm trying to build a DSL which will contain a number of XPaths as parameters.
Well, many people use XML for DSLs, and this is how you would do it in XML:
<paths>
<path>/a/b/c/d</path>
<path>/w/x/y/z</path>
</paths>
So how do we reconcile this with the fact that "<" can appear in an XPath expression? Answer: if it does appear, we escape it:
<paths>
<path>/a/b/c/d[e &lt; 3]</path>
<path>/w/x/y/z[v &lt; 2]</path>
</paths>
So: don't try to find a character that can't appear in an XPath expression. Use a character that can appear, and escape it if it does.
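The same escape-on-output idea also works for a single-line DSL. The following is only a sketch under assumptions that are not in the original: "|" is the delimiter, a backslash is the escape character, and escape_xpath, join_xpaths and split_xpaths are hypothetical helper names.

```ruby
DELIM = "|"   # assumed delimiter; it can legitimately appear in XPath (union operator)
ESC   = "\\"  # assumed escape character

# Prefix every delimiter or escape character with the escape character.
def escape_xpath(xpath)
  xpath.gsub(/[\\|]/) { |ch| ESC + ch }
end

def join_xpaths(xpaths)
  xpaths.map { |xp| escape_xpath(xp) }.join(DELIM)
end

# Walk the line once, honouring escapes, and cut at unescaped delimiters.
def split_xpaths(line)
  parts = [String.new]
  escaped = false
  line.each_char do |ch|
    if escaped
      parts.last << ch
      escaped = false
    elsif ch == ESC
      escaped = true
    elsif ch == DELIM
      parts << String.new
    else
      parts.last << ch
    end
  end
  parts
end

original = ["//a | //b", "/w/x/y/z[v < 2]"]
line = join_xpaths(original)
restored = split_xpaths(line)
```

The reader has to scan character by character rather than call a plain split, which is the price of allowing the delimiter inside an expression.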

Related

How to split by Replacement Character "" in Ruby?

I have this character: U+F0B7. It's the "replacement character" in Ruby.
I'm using an external API that does parsing, and unfortunately it returns this character instead of - for unordered list points.
I would like to split what is returned by this character, but I've been unsuccessful with the code below.
text.split(//)
How can I split by this character?
These will match any non ASCII character:
[^\x00-\x7F] or [^[:ascii:]].
As noted by @engineersmnky, this may not be the ideal solution if the data you are parsing could contain other unrecognized characters.
Use this regex if you want to split only on that specific character (U+F0B7):
[\uF0B7]
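A short sketch of the narrower split, assuming the API output is a UTF-8 string in which U+F0B7 marks each bullet. The sample text is made up for illustration.

```ruby
# Hypothetical API output: U+F0B7 stands where a list bullet should be.
text = "\uF0B7 first item \uF0B7 second item"

# Split on that one code point only, then tidy up whitespace and empty pieces.
items = text.split(/\uF0B7/).map(&:strip).reject(&:empty?)
```

The reject step is there because a string that starts with the delimiter yields an empty first element.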

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it does not match headings that contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I match UTF-8 characters as well?
You are right: when you define the ASCII range A-Z, the match is made literally for those characters only. This has to do with the history of characters on computers: more and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need, by listing them.
But there is a shortcut. Someone else has already added necessary data to the Unicode data so that you can write shorter matches for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further. For instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
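To see the difference, here is a small sketch that runs both patterns over a few sample lines. The sample headings are invented, and the strings are assumed to be UTF-8 (the default for Ruby source files).

```ruby
# ASCII-only pattern versus the POSIX bracket class, which is Unicode-aware
# for UTF-8 strings in Ruby.
ASCII_HEADING   = /^[A-Z\s]+$/
UNICODE_HEADING = /^[[:upper:]\s]+$/

lines = ["HEADING", "ČRNA ŠOLA", "I AM A HEADING", "Some text"]

ascii_hits   = lines.select { |l| l.match?(ASCII_HEADING) }
unicode_hits = lines.select { |l| l.match?(UNICODE_HEADING) }
```

Only the second pattern picks up the heading with Č and Š, and neither matches the mixed-case line.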
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})\b

How to use the `\p{Assigned}` regexp selector?

I just read about the \p{Assigned} selector in the Ruby Regex documentation, which sounds like an easy way to define an array of characters that it is supposed to match. The only thing the documentation says, though, is 'An assigned character'.
How do I assign one/multiple characters to this selector?
Unicode assigned characters are the whole of graphic, format, control, and private-use characters, i.e. any character that is not reserved for future assignment. You can't assign arbitrary characters to the class \p{Assigned}.
See https://bugs.ruby-lang.org/issues/3838
You need to specify that the encoding is UTF-8 by adding 'u' after the expression
/\p{Assigned}/u
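A quick way to see what the class covers, sketched for Ruby 1.9 or later with UTF-8 strings (where the u flag is not needed). The helper name assigned? is hypothetical, and U+0378 is used on the assumption that it is still an unassigned code point in the Unicode version your Ruby ships with.

```ruby
# True when the first character of the string is an assigned Unicode code point.
def assigned?(char)
  char.match?(/\p{Assigned}/)
end

letter      = assigned?("a")       # ordinary letter
private_use = assigned?("\uE000")  # private-use characters count as assigned
reserved    = assigned?("\u0378")  # assumed unassigned (reserved) code point
```

The set is fixed by the Unicode data built into the regex engine, so there is nothing to configure on your side.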

Allowed characters in map key identifier in YAML?

Which characters are and are not allowed in a key (i.e. example in example: "Value") in YAML?
The YAML 1.2 specification simply advises using printable characters, with control characters explicitly excluded.
In constructing key names, characters the YAML spec. uses to denote syntax or special meaning need to be avoided (e.g. # denotes comment, > denotes folding, - denotes list, etc.).
Essentially, you are left to the relative coding conventions (restrictions) by whatever code (parser/tool implementation) that needs to consume your YAML document. The more you stick with alphanumerics the better; it has simply been our experience that the underscore has worked with most tooling we have encountered.
It has been a shared practice with others we work with to convert the period character . to an underscore character _ when mapping namespace syntax that uses periods to YAML. Some people have similarly used hyphens successfully, but we have seen it misconstrued in some implementations.
Any character is allowed, provided the key is properly quoted with either single quotes ('example') or double quotes ("example"). Also be aware that the key does not have to be a scalar ('example'). It can be a list or a map.
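A small check of both points with Ruby's bundled YAML library (Psych). This is a sketch with made-up keys, and other parsers may be stricter or looser than this one.

```ruby
require "yaml"

# Quoted keys may contain characters that otherwise carry YAML syntax,
# and the explicit "?" form allows a non-scalar key (here a list).
doc = <<~YAML
  plain_key: 1
  "key: with # specials": 2
  'dotted.name-space': 3
  ? [a, b]
  : 4
YAML

data = YAML.safe_load(doc)
keys = data.keys
```

Whether a downstream tool accepts such keys is a separate question, which is why plain alphanumerics and underscores remain the safest choice.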

How do I remove &#x2002; &#x2014; &#x2013; special characters from my XML files

this is a sample of the xml file
<row tnote="0">
<entry namest="col2" nameend="col4" us="none" emph="bld"><blst>
<li><text>Single, head of household, or qualifying widow(er)&#x2014;$55,000</text></li>
<li><text>Married filing jointly&#x2014;$115,000</text></li>
</blst></entry>
<entry colname="col6" ldr="1" valign="middle">&#x2002;</entry>
<entry colname="col7" valign="middle"> 5.</entry>
</row>
The &#x2014; etc. represent HTML 4.0 entities. I want to store each line's text as an element of an array, but not if the line is just &#x2002;
if e.text.strip =~ /^&#x20[0-9][0-9];$/ then
next
else
subLines << e.text
end
but it doesn't seem to be working...is my regEx incorrect?
&#x...; isn't an entity reference, it's a character reference. To an XML parser, &#x2014; is absolutely identical to the raw character —, so when you look at the DOM produced by an XML parser through a property such as element.text, you won't see anything with an ampersand in it, just a plain — character.
So in principle, you'd match it with a regex something like /[—– ]/. However, if you are using Ruby 1.8, you've got the problem that the language itself doesn't have support for Unicode, so the character group in /[—– ]/ won't quite work properly: it'll try to remove every byte in the UTF-8 representation of –, — and the en space (U+2002), which will likely mangle any other characters.
A simple string replace for each target character would work correctly, as that doesn't require special character handling. (Naturally if you included characters like — directly in the source code you'd also have to get the file encoding of that script right, so probably easier to use a string literal escape like "\xe2\x80\x94".)
Because your regex is of the form /^...$/, it will only match against the entire string. You will only skip text that consists entirely of one HTML entity.
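On Ruby 1.9 or later, where strings carry their encoding, the filtering can be sketched as follows. The texts array is a hypothetical stand-in for the values an XML parser would return from element.text, so it already contains the raw characters rather than &#x...; references.

```ruby
# En space, en dash and em dash, written as escapes to avoid source-encoding issues.
SPECIAL = /[\u2002\u2013\u2014]/

# Hypothetical parsed text values (character references already resolved).
texts = [
  "Married filing jointly\u2014$115,000",
  "\u2002",
  "\u20025."
]

# Skip entries made up solely of the special characters and ASCII whitespace.
sub_lines = texts.reject { |t| t.match?(/\A[\u2002\u2013\u2014\s]*\z/) }

# Optionally replace the special characters that remain inside real text.
cleaned = sub_lines.map { |t| t.gsub(SPECIAL, " ").strip }
```

String#strip alone is not enough here, because it only removes ASCII whitespace and leaves U+2002 in place.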

Resources