In Prolog, there are traditionally two ways of representing a sequence of characters:
As a list of chars, which are atoms of length 1.
As a list of codes, which are just integers. The integers are to be interpreted as codepoints, but the convention to be applied is left unspecified. As a (eminently sane) example, in SWI-Prolog, the space of codepoints is Unicode (thus, roughly, the codepoint-integers range from 0 and 0x10FFFF).
DCGs, a notational way of writing left-to-right list processing code, are designed to perfom parsing on "lists of exploded text". Depending on preference, the lists to-be-handled can be lists of chars or lists of codes. However, the notation for char/code processing differs when writing down the constants. Does one generally write the DCG in "char style" or "code style"? Or maybe even in char/code style for portability in case of modules exporting DCG nonterminals?
Some Research
The following notations can be used to express constants in DCGs
'a': A char (as usual: single quotes indicate an atom, and they can be left out if the token starts with a lowercase letter.)
0'a: the code of a .
['a','b']: A list of char.
[ 0'a, 0'b ]: A list of codes, namely the codes for a and b (so you can avoid typing in the actual codepoint values).
"a" a list of codes. Traditionally, a double-quoted string is exploded into a list of codes, and this notation also works SWI-Prolog in DCG contexts, even though SWI-Prolog maps a "double-quoted string" to the special string datatype otherwise.
`0123`. Traditonally, text within back-quotes is mapped to an atom (I think, the 95 ISO Standard just avoids being specific regarding the meaning of a back-quoted string. "It would be a valid extension of this part of ISO/IEC 13211 to define a back quoted string as denoting a character string constant."). In SWI-Prolog, text within back-quotes is exploded into a list of codes unless the flag back_quotes has been set to demand a different behaviour.
Examples
Char style
Trying to recognize "any digit" in "char style" and make its "char representation" available in C:
zero(C) --> [C],{C = '0'}.
nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
Code style
Trying to recognize "any digit" in "code style":
zero(C) --> [C],{C = 0'0}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
Char/Code transparent style
DCGs can be written as "char/code transparent style" by duplicating the rules involving constants. In the above example:
zero(C) --> [C],{C = '0'}.
zero(C) --> [C],{C = 0'0}.
nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
The above also accepts a sequence of alternating codes and chars (as lists of stuff cannot be typed). This is probably not a problem). When generating, one will get arbitrary char/code mixes which are unwanted, and then cuts need to be added.
Char/Code transparent style taking an additional Mode indicator
Another approach would be to explicitly indicate the mode. Looks clean:
zero(C,chars) --> [C],{C = '0'}.
zero(C,codes) --> [C],{C = 0'0}.
nonzero(C,chars) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C,codes) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C,Mode) --> zero(C,Mode).
any_digit(C,Mode) --> nonzero(C,Mode).
Char/Code transparent style using dialect features
Alternatively, features of the Prolog dialect can be used to achieve char/code transparency. In SWI-Prolog, there is code_type/2, which actually works on codes and chars (there is a corresponding char_type/2 but IMHO there should be only chary_type/2 working for chars and codes in any case) and for "digit-class" codes and chars yield the compound digit(X):
?- code_type(0'9,digit(X)).
X = 9.
?- code_type('9',digit(X)).
X = 9.
?- findall(W,code_type('9',W),B).
B = [alnum,csym,prolog_identifier_continue,ascii,
digit,graph,to_lower(57),to_upper(57),
digit(9),xdigit(9)].
And so one can write this for clean char/code transparency:
zero(C) --> [C],{code_type(C,digit(0)}.
nonzero(C) --> [C],{code_type(C,digit(X),X>0}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
In SWI-Prolog in particular
SWI-Prolog by default prefers codes. Try this:
The flags
back_quotes
double_quotes
influence interpretation of "string" and `string` in "standard code". By default"string" is interpreted as an atomic "string" whereas `string` is interpreted as a "list of codes".
Outside of DCGs, the following holds in SWI-Prolog, with all flags at their default:
?- string("foo"),\+atom("foo"),\+is_list("foo").
true.
?- L=`foo`.
L = [102,111,111].
However, in DCGs, both "string" and `string` are interpreted as "codes" by default.
Without any settings changed, consider this DCG:
representation(double_quotes) --> "bar". % SWI-Prolog decomposes this into CODES
representation(back_quotes) --> `bar`. % SWI-Prolog decomposes this into CODES
representation(explicit_codes_1) --> [98,97,114]. % explicit CODES (as obtained via atom_codes(bar,Codes))
representation(explicit_codes_2) --> [0'b,0'a,0'r]. % explicit CODES
representation(explicit_chars) --> ['b','a','r']. % explicit CHARS
Which of the above matches codes?
?-
findall(X,
(atom_codes(bar,Codes),
phrase(representation(X),Codes,[])),
Reps).
Reps = [double_quotes,back_quotes,explicit_codes_1,explicit_codes_2].
Which of the above matches chars?
?- findall(X,
(atom_chars(bar,Chars),phrase(representation(X),Chars,[])),
Reps).
Reps = [explicit_chars].
When starting swipl with swipl --traditional the backquoted representation is rejected with Syntax error: Operator expected , but otherwise nothing changes.
The Prolog Standard (6.3.7) says:
A double quoted list is either an atom (6.3.1.3) or a list (6.3.5).
Consequently, the following should succeed:
Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.
For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).
?- Foo = "foo", (atom(Foo) ; Foo = [F, O, O]).
false.
So SWI-Prolog is not a Prolog by default. That's OK, but if you want to know about SWI-Prolog's non-Prolog behavior, please adjust the tags on the question.
From the definition it also follows that double quoted lists are completely useless by default even in a conforming Prolog: They might denote atoms, so regardless of the chars/codes distinction you can't even know that the double quoted list is actually a list. Even DCGs that only care about structural properties of the "text" (whether it's a palindrome, for example) are useless if the "list" is in fact an atom.
Hence a Prolog program that wants to process text with DCGs must at startup force the double_quotes flag to the value it wants. You have the choice between codes and chars. Codes have no advantages over chars, but they do have disadvantages in readability and typeability. Thus:
Answer: Use chars. Set the double_quotes flag explicity.
I should start to by noting that the answer to the "Should text-processing DCGs be written to handle codes or chars? Or both?" question can be neither. DCGs work by using an implicit difference list to thread state. But the elements of that difference list can be other than chars or codes. It depends on the output of text tokenization and what exactly text processing entails. E.g. I have worked on and come across Prolog NLP applications where codes/chars were only used for the basic tokenization and the reasoning was performed (still with DCGS) using either atoms or compound terms that reified the token data (e.g. v(Verb) or n(Noun)). One of those applications (a personal assistant like it's common nowadays in phones) used atoms produced by a voice-recognition component.
But let's go back to chars vs codes. Legacy practices and failed standardization left Prolog with problematic text representation. ASCII gives us a singe quote, a back quote, and a double-quote. With single quotes being used for atoms, a choice could have been made to use e.g. back quotes for representing a list of codes and double-quotes for representing a list of chars. Or the other way around. Instead, and this is where standardization failed, we got the problematic double_quotes flag. There's no shortage of Prolog code in the wild that makes an assumption about the meaning of double-quoted terms and thus works or breaks depending on the implicit value of the double_quotes flag (if you think this is mainly an issue with legacy code, think again). Guess what happens when we try to combine code that require different values for the flag? Note that, in almost all systems (including those that support modules), the flag value is global ... As Isabelle wrote in her answer, setting the flag explicitly is good general advice. But not, as I explained, without problems.
Some systems provide additional values for the flag. E.g. SWI-Prolog allows the flag to also be set to string. GNU Prolog supports additional atom_no_escape, chars_no_escape and codes_no_escape. Some systems only support codes. Some systems also provide a back_quotes flag. This Babel tower means that portable and resilient code is often forced to use atoms to represent text. But this is may not be ideal from a performance perspective.
Back to the original question. As Isabelle mentioned, chars is usually a more readable (read, easier to debug) choice. But, depending on the Prolog system, codes may provide better performance. If application performance is critical, benchmark both solutions. Some recent Prolog systems (e.g. Scryer-Prolog or Trealla Prolog) have efficient support for chars. Older systems may trail behind.
Note that your question is very much related to I/O. Prior to ISO, many systems in the DEC-10 succession supported a single kind of I/O via get0/1 and put/1 (and versions for tty) which served both characters and bytes at the same time. What can go wrong with that? Today, that is obvious. But multi-octet character set handling (MOCSH as it was called) was for many a much more exotic feature as it is today, a quarter century after the standard's publication. After all, the today mostly accepted UTF-8 encoding was invented 1992-09 and first published in 1993. And like so many projects like TRON it could have failed as well. Some other programming languages got burnt by betting on UCS-2/UTF-16 encoding.
What the standard did was to split I/O into character and byte I/O (and their corresponding types text and binary). So there is now get_char/1, get_byte/1 ... That the _byte versions all use integers in the range of 0..255 was non-controversial (plus -1 for EOF). But what about the _char versions? The only way to resolve this was to provide both _char and _code versions and consequently chars and codes versions of double-quoted strings and related built-ins. The default for flag double_quotes is implementation defined (7.11.2.5).
In this manner systems with a lot of DEC-10 legacy could continue to use codes explicitly. For them, an integer thus meant either an integer or a byte or a character. But users of such system still could use the better encoding. New systems that do not have to deal with such legacies going back to 1977 opt as default for chars like Tau, Scryer, and Trealla. As much as tradition is concerned, note that Prolog I, often called Marseille Prolog, did encode double quoted strings as lists of atoms of length one. And in the preliminary version of Prolog of 1972, often called Prolog 0, strings were encoded as nil-s-t-r-i-n-g qua boum facilitating stemming. In any case, not a single character code was present at all.
The advantages of chars should be obvious. It is much easier to read and debug, in particular if you have partially instantiated strings, say [a,X,c] vs. [97,X,99], which occur often when generalizing queries as with library(diadem). It is also a bit shorter to write. And, double quoted strings can be used for printing answers.
If you really want to write programs that both support codes and chars at the same time, use rather goals like [Ch] = "a" where Ch is now the atom a or the integer 97 or 129 or whatever processor character set you are using. It all depends on the Prolog flag double_quotes. And more succinctly you can write
nonzero(C) --> [C],{member(C,"123456789")}.
What is even more important is that phrase("abc", "abc") still holds.
However, changing that flag within the same application is certainly not a good idea (nor to switch to the value atom or some non-conforming value).
((When using chars note that single quotes as in C = 'a' are a bit misleading since the single quotes do not serve any purpose. Instead, round brackets are preferable if you want to ensure that the code will be valid even in the presence of an operator declaration for a. When a occurs as a functor's argument or a list's element, no round brackets are needed, but they are often used redundantly in operator declarations.))
You are making incorrect assumptions. These are not "chars":
foo_or_bar(foo) --> "foo".
The "foo" is a string, in SWI-Prolog, but this works perfectly within a DCG rule definition. The place to read about this is here, in particular:
A DCG literal
Although represented as a list of codes is the correct representation for handling in DCGs, the DCG translator can recognise the literal and convert it to the proper representation. Such code need not be modified.
All your other suggestions are just unnecessary, you should be either enumerating explicitly all possible "nonzeros", digits and so on, or using the library.
PS: if your main goal is to write code that runs on any Prolog, you might as well use something like Logtalk instead.