XML Schema, pattern restriction - what is \w? - validation

An XSD contains this line to restrict an attribute:
<xsd:pattern value="buyer_specific|customer_specific|duns|iln|gln|party_specific|supplier_specific|\w{1,250}"/>
If a custom attribute containing an underscore is used, validation with xmllint fails:
Schemas validity error :[...] The value 'checkout_profile' is not accepted by the pattern 'buyer_specific|customer_specific|duns|iln|gln|party_specific|supplier_specific|\w{1,250}'.
So what exactly is meant by \w? The XSD specification is surprisingly silent about this topic.
https://www.w3.org/TR/xmlschema11-1/#key-typeRestriction mentions pattern, but I could not find any definition for its meaning. In my eyes patterns are meant to be PCREs, and Perl states in pcrepattern(3):
\w any "word" character
[...]
A "word" character is an underscore or any character that is a letter or digit.
So as far as I can tell, checkout_profile should be fine. Who is to blame? Is this a bug in xmllint(1)? Or am I missing an important point somewhere?
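One point worth checking here: XSD patterns are not PCREs. The regular-expression appendix of XML Schema Part 2 (Datatypes) defines \w as every character except those in the Unicode general categories P (punctuation), Z (separators) and C (other). The underscore is in category Pc (connector punctuation), so it falls outside \w. Below is a minimal Python sketch of that definition; the helper names xsd_word_char and matches_xsd_w_1_250 are made up for illustration and only approximate what a schema validator does.

```python
import unicodedata

def xsd_word_char(ch: str) -> bool:
    # Approximation of XSD's \w: any character whose Unicode general
    # category does not start with P (punctuation), Z (separator) or C (other).
    return unicodedata.category(ch)[0] not in "PZC"

def matches_xsd_w_1_250(value: str) -> bool:
    # Mirrors the \w{1,250} branch of the pattern: 1 to 250 word characters.
    return 1 <= len(value) <= 250 and all(xsd_word_char(c) for c in value)

if __name__ == "__main__":
    print(unicodedata.category("_"))
    print(matches_xsd_w_1_250("checkout_profile"))
    print(matches_xsd_w_1_250("checkoutprofile"))
```

Under that reading xmllint is behaving as specified, and a branch such as [\w_]{1,250} would be one way to admit underscores.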

Related

What are valid identifiers in R7RS-small?

R7RS-small says that all identifiers must be terminated by a delimiter, but at the same time it defines pretty elaborate rules for what can be in an identifier. So, which one is it?
Is an identifier supposed to start with an initial character and then continue until a delimiter, or does it start with an initial character and continue following the syntax defined in 7.1.1?
Here are a couple of obvious cases. Are these valid identifiers?
a#a
b,b
c'c
d[d]
If they are not supposed to be valid, what is the purpose of saying that an identifier must be terminated by a delimiter?
In R7RS, vertical bars (|..ident..|) delimit a symbol so that it can contain any character that cannot appear in an old-style symbol.
In R6RS, however, the "official" grammar was incorrect: it did not allow symbols such as 1+, which led implementations to define their own rules to work around this flaw.
Unless you need to read the source code of a given implementation to see how it defines symbols, you should not care too much about these rules and should simply use classical symbols.
In section 7.1.1 you will find the Backus-Naur form that defines the lexical structure of R7RS identifiers, but I doubt that implementations follow it.
I quote from here
As with identifiers, different implementations of Scheme use slightly
different rules, but it is always the case that a sequence of
characters that contains no special characters and begins with a
character that cannot begin a number is taken to be a symbol
In other words, an implementation will use a function like read-atom to read characters up to the next delimiter and then classify the atom by backtracking: it tries read-number first, and if number? fails, the atom is taken to be a symbol.
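The "try a number first, otherwise it is a symbol" idea can be sketched in a few lines of Python. This is only an illustration: classify_atom is a hypothetical name, and the number grammar here is a deliberately simplified decimal syntax, not the full numeric syntax of any Scheme report or implementation.

```python
import re

# Simplified decimal number syntax (an assumption, far smaller than Scheme's):
# optional sign, then digits with an optional fraction, or a leading-dot fraction.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")

def classify_atom(token: str) -> str:
    # The token is assumed to have been read already, up to the next delimiter.
    if _NUMBER.fullmatch(token):
        return "number"
    return "symbol"

if __name__ == "__main__":
    for tok in ["42", "-3.5", "1+", "+", "abc"]:
        print(tok, classify_atom(tok))
```

With this approach 1+ is classified as a symbol simply because it fails to parse as a number, regardless of what the formal identifier grammar says.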

Semantic meaning of '36_864_7_345ms' as a time literal

Reading the spec for Verilog, it appears that
36_864_7_345ms
is a valid time literal: http://www.ece.uah.edu/~gaede/cpe526/SystemVerilog_3.1a.pdf (see section 2)
Note: decimal_digit is defined as [0-9] in the full IEEE spec.
What is the semantic meaning (if any) of this time literal? Or am I misreading the spec?
Edit:
Looking elsewhere in the spec (section 3.7.9), it appears that the underscore characters are silently discarded. Does the underscore act as an arbitrary separating character, in the same way that numbers in English (e.g. 43,251) use commas to visually separate groups of digits? Or is there another meaning altogether?
The spec you quoted from is long since obsolete. Please get the latest from the IEEE where it says in section 5.7.1 Integer literal constants:
The underscore character (_) shall be legal anywhere in a number
except as the first character. The underscore character is ignored.
This feature can be used to break up long numbers for readability
purposes.
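To make the quoted rule concrete, here is a small Python sketch of how the digits of such a literal could be interpreted. The function name parse_decimal_digits is hypothetical, and the sketch covers only the digit part, not the ms time unit.

```python
def parse_decimal_digits(text: str) -> int:
    # Per the quoted rule: an underscore is legal anywhere except as the
    # first character, and is otherwise ignored.
    if text.startswith("_"):
        raise ValueError("underscore is not allowed as the first character")
    return int(text.replace("_", ""))

if __name__ == "__main__":
    print(parse_decimal_digits("36_864_7_345"))
```

The uneven grouping in 36_864_7_345 therefore carries no meaning: the value is the same as if the digits had been written without any underscores.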

What characters are never used in xpath?

I'm trying to build a DSL which will contain a number of XPaths as parameters. I'm new to XPath, and I need a character which is never used in the XPath syntax so I can delimit any number of XPaths on a single line of a script. My question: what characters are NOT part of the XPath syntax?
The null character.
Seriously. Because an XPath is supposed to support any XML document, it must be capable of matching text nodes that contain any allowed Unicode character. However, XML disallows one character: the null character.
OK, that is not entirely true, but it is the simplest answer. XML 1.1 allows control characters, except Unicode NULL. Under the XML 1.0 production of Char, however, there are a few other characters you can choose from: surrogate code points (as characters, not as correctly encoded octets representing a non-BMP character), and anything below 0x20 except line feed, carriage return and tab.
Another good guess is any Private Use character, as it is unlikely to be used by your input documents. This is not guaranteed, however, and you asked for "never".
I'm trying to build a DSL which will contain a number of XPaths as parameters.
Well, many people use XML for DSLs, and this is how you would do it in XML:
<paths>
<path>/a/b/c/d</path>
<path>/w/x/y/z</path>
</paths>
So how do we reconcile this with the fact that "<" can appear in an XPath expression? Answer: if it does appear, we escape it:
<paths>
<path>/a/b/c/d[e &lt; 3]</path>
<path>/w/x/y/z[v &lt; 2]</path>
</paths>
So: don't try to find a character that can't appear in an XPath expression. Use a character that can appear, and escape it if it does.
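As a rough illustration of "use a character that can appear, and escape it", the sketch below lets Python's standard xml.etree.ElementTree do the escaping and unescaping. The function names paths_to_xml and xml_to_paths are invented for this example, and the paths/path element names simply follow the snippet above.

```python
import xml.etree.ElementTree as ET

def paths_to_xml(xpaths):
    # Build <paths><path>...</path>...</paths>; the serializer escapes
    # characters such as "<" in text content automatically.
    root = ET.Element("paths")
    for xp in xpaths:
        ET.SubElement(root, "path").text = xp
    return ET.tostring(root, encoding="unicode")

def xml_to_paths(document):
    # Parsing reverses the escaping, so the original XPath strings come back.
    return [el.text for el in ET.fromstring(document).findall("path")]

if __name__ == "__main__":
    doc = paths_to_xml(["/a/b/c/d[e < 3]", "/w/x/y/z[v < 2]"])
    print(doc)
    print(xml_to_paths(doc))
```

The point of the design is that the DSL never has to reserve a character: whatever the delimiter syntax is, a matching escape rule makes every XPath expression representable.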

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings that do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How can I match UTF-8 characters as well?
You are right that when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers: more and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need by listing them.
But there is a shortcut. Someone else has already added necessary data to the Unicode data so that you can write shorter matches for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further; for instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
You can use unicode upper case letter:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b
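For comparison outside Ruby, the same "uppercase letter in the Unicode sense" test can be sketched in Python. Python's built-in re module does not support \p{Lu} or [[:upper:]], so this sketch swaps in a per-character check of the Unicode general category instead of a regex; is_heading is a hypothetical helper name.

```python
import unicodedata

def is_heading(line: str) -> bool:
    # A heading is taken to be one or more whitespace-separated words in
    # which every character has Unicode general category Lu (uppercase letter).
    words = line.split()
    return bool(words) and all(
        unicodedata.category(ch) == "Lu" for word in words for ch in word
    )

if __name__ == "__main__":
    for line in ["HEADING", "ŠOLA ŽIVLJENJA", "Some text", "I AM A HEADING"]:
        print(line, is_heading(line))
```

Unlike the two-letter-minimum patterns above, this version accepts one-letter words, so it behaves more like the simplified "uppercase letters and spaces" pattern.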

Allowed characters in map key identifier in YAML?

Which characters are and are not allowed in a key (i.e. example in example: "Value") in YAML?
The YAML 1.2 specification simply advises using printable characters, with explicit control characters being excluded.
In constructing key names, characters the YAML spec. uses to denote syntax or special meaning need to be avoided (e.g. # denotes comment, > denotes folding, - denotes list, etc.).
Essentially, you are left to the coding conventions (restrictions) of whatever code (parser or tool implementation) needs to consume your YAML document. The more you stick with alphanumerics, the better; in our experience the underscore has worked with most of the tooling we have encountered.
It has been a shared practice with others we work with to convert the period character . to an underscore character _ when mapping namespace syntax that uses periods to YAML. Some people have similarly used hyphens successfully, but we have seen it misconstrued in some implementations.
Any character (if properly quoted by either single quotes 'example' or double quotes "example"). Please be aware that the key does not have to be a scalar ('example'). It can be a list or a map.

Resources