Using phrase_from_file to read a file's lines - prolog

I've been trying to parse a file containing lines of integers using phrase_from_file with the grammar rules
line --> I,line,{integer(I)}.
line --> ['\n'].
thusly: phrase_from_file(line,'input.txt').
It fails, and I got lost very quickly trying to trace it.
I've even tried to print I, but it doesn't even get there.
EDIT:
As none of the solutions below really fit my needs (using read/1 assumes you're reading terms, and sometimes writing that DCG might just take too long), I cannibalized this code I googled, the main changes being the addition of:
read_rest(-1,[]):-!.
read_word(C,[],C) :- ( C=32 ;
C=(-1)
) , !.

If you are using phrase_from_file/2 there is a very simple way to test your programs prior to reading actual files. Simply call the very same non-terminal with phrase/2. Thus, a goal
phrase(line,"1\n2").
is the same as calling
phrase_from_file(line,fichier)
when fichier is a file containing the above 3 characters. So you can test and experiment in a very compact manner with phrase/2.
There are further issues @Jan Burse already mentioned. SWI reads in character codes. So you have to write
newline --> "\n".
for a newline. And then you still have to parse integers yourself. But all of that is much easier to test with phrase/2. The nice thing is that you can then switch to reading files without changing the actual DCG code.
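As a rough sketch of that workflow in SWI-Prolog: the grammar below is one possible way to parse lines of integers, not the asker's code. It assumes library(dcg/basics) for integer//1 and library(pure_input) for phrase_from_file/2; the non-terminal name lines//1 is made up for illustration.

```prolog
:- use_module(library(dcg/basics)).   % integer//1 (works on code lists)
:- use_module(library(pure_input)).   % phrase_from_file/2

% lines(-Ints): zero or more integers, each terminated by a newline.
% Hypothetical name, not part of the original question.
lines([I|Is]) --> integer(I), "\n", lines(Is).
lines([])     --> [].

% Test on an in-memory code list first:
%   ?- atom_codes('1\n2\n', Cs), phrase(lines(Is), Cs).
% Then run the very same grammar on a file:
%   ?- phrase_from_file(lines(Is), 'input.txt').
```

The point is that the grammar itself does not change between the two calls; only the goal that drives it does.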

I guess there is a conceptual problem here. Although I don't know the details of phrase_from_file/2, i.e. which Prolog system you are using, I nevertheless assume that it will produce character codes. So for an integer 123 in the file you will get the character codes 0'1, 0'2 and 0'3. This is probably not what you want.
If you would like to process the characters, you would need to use a terminal list [I] instead of a bare variable I to fetch them. And instead of the integer test, you would need a character test, and you can do the test earlier:
line --> [I], {0'0=<I, I=<0'9}, line.
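Building on that rule, here is a minimal sketch of how the digit codes could be collected and turned into an integer by hand. The names digit//1, digits//1, digits_rest//1 and int//1 are invented for this example, and it only handles unsigned decimal integers.

```prolog
% digit(-D): a single decimal digit code.
digit(D) --> [D], { 0'0 =< D, D =< 0'9 }.

% digits(-Ds): one or more digit codes, trying the longest match first.
digits([D|Ds]) --> digit(D), digits_rest(Ds).

digits_rest([D|Ds]) --> digit(D), digits_rest(Ds).
digits_rest([])     --> [].

% int(-I): an unsigned integer built from its digit codes.
int(I) --> digits(Ds), { number_codes(I, Ds) }.

% line(-I): an integer followed by a newline code.
line(I) --> int(I), [0'\n].

% ?- atom_codes('123\n', Cs), phrase(line(I), Cs).
```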
Best Regards
P.S.: Instead of going the DCG way, you could also use term read operations. See also:
read numbers from file in prolog and sorting


Does antlr4 memoize tokens?

Let's say I have the following expression alternation:
expr
: expr BitwiseAnd expr
| expr BitwiseXor expr
// ...
;
Just for argument's sake, let's say that the expr on the left-hand side turns out to be 1MB. Will ANTLR be able to 'save' that expression so it doesn't have to start from zero on each alternation, or how far does it have to backtrack when it fails to match on an alternation?
ANTLR will recognize the 1st expr and then if it doesn't find a BitwiseAnd, it will look for a BitwiseXor to try to match the second alternative. It won't backtrack all the way to trying to recognize the 1st expr again. It's not exactly memoization, but you get the same benefit (arguably even better).
You may find it useful to have ANTLR generate the ATN for your grammar. Use the -atn option when running the antlr4 command; this will generate *.dot files for each of your rules (both Lexer and Parser). You can then use Graphviz to render them to svg, pdf, etc. They may look a bit intimidating at first glance, but just take a moment with them and you'll get a LOT of insight into how ANTLR goes about parsing your input.
The second place to look is the generated parser code. It too is much more understandable than you might expect (especially if reading it with the ATN graph handy).

Should text-processing DCGs be written to handle codes or chars? Or both?

In Prolog, there are traditionally two ways of representing a sequence of characters:
As a list of chars, which are atoms of length 1.
As a list of codes, which are just integers. The integers are to be interpreted as codepoints, but the convention to be applied is left unspecified. As an (eminently sane) example, in SWI-Prolog, the space of codepoints is Unicode (thus, roughly, the codepoint-integers range from 0 to 0x10FFFF).
DCGs, a notational way of writing left-to-right list processing code, are designed to perform parsing on "lists of exploded text". Depending on preference, the lists to be handled can be lists of chars or lists of codes. However, the notation for char/code processing differs when writing down the constants. Does one generally write the DCG in "char style" or "code style"? Or maybe even in char/code style for portability in case of modules exporting DCG nonterminals?
Some Research
The following notations can be used to express constants in DCGs
'a': A char (as usual: single quotes indicate an atom, and they can be left out if the token starts with a lowercase letter.)
0'a: The code of a.
['a','b']: A list of chars.
[ 0'a, 0'b ]: A list of codes, namely the codes for a and b (so you can avoid typing in the actual codepoint values).
"a": A list of codes. Traditionally, a double-quoted string is exploded into a list of codes, and this notation also works in SWI-Prolog in DCG contexts, even though SWI-Prolog maps a "double-quoted string" to the special string datatype otherwise.
`0123`: Traditionally, text within back-quotes is mapped to an atom (I think; the 1995 ISO Standard just avoids being specific regarding the meaning of a back-quoted string: "It would be a valid extension of this part of ISO/IEC 13211 to define a back quoted string as denoting a character string constant."). In SWI-Prolog, text within back-quotes is exploded into a list of codes unless the flag back_quotes has been set to demand a different behaviour.
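For a concrete feel of the two representations, here is a small sketch using only the standard built-ins atom_chars/2 and atom_codes/2. The predicate name notation_demo/3 is made up, and the code values shown in the comments assume an ASCII-compatible (e.g. Unicode) system such as SWI-Prolog.

```prolog
% notation_demo(-Chars, -Codes, -Code): the same text as chars and as codes.
notation_demo(Chars, Codes, Code) :-
    atom_chars(ab, Chars),    % Chars = [a, b]     (atoms of length 1)
    atom_codes(ab, Codes),    % Codes = [97, 98]   (integers)
    Code = 0'a.               % Code  = 97         (0'c notation for one code)

% ?- notation_demo(Chars, Codes, Code).
```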
Examples
Char style
Trying to recognize "any digit" in "char style" and make its "char representation" available in C:
zero(C) --> [C],{C = '0'}.
nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
Code style
Trying to recognize "any digit" in "code style":
zero(C) --> [C],{C = 0'0}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
Char/Code transparent style
DCGs can be written as "char/code transparent style" by duplicating the rules involving constants. In the above example:
zero(C) --> [C],{C = '0'}.
zero(C) --> [C],{C = 0'0}.
nonzero(C) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
The above also accepts a sequence of alternating codes and chars (as lists of stuff cannot be typed). This is probably not a problem. When generating, however, one will get arbitrary char/code mixes, which are unwanted, and then cuts need to be added.
Char/Code transparent style taking an additional Mode indicator
Another approach would be to explicitly indicate the mode. Looks clean:
zero(C,chars) --> [C],{C = '0'}.
zero(C,codes) --> [C],{C = 0'0}.
nonzero(C,chars) --> [C],{member(C,['1','2','3','4','5','6','7','8','9'])}.
nonzero(C,codes) --> [C],{member(C,[0'1,0'2,0'3,0'4,0'5,0'6,0'7,0'8,0'9])}.
any_digit(C,Mode) --> zero(C,Mode).
any_digit(C,Mode) --> nonzero(C,Mode).
Char/Code transparent style using dialect features
Alternatively, features of the Prolog dialect can be used to achieve char/code transparency. In SWI-Prolog, there is code_type/2, which actually works on codes and chars (there is a corresponding char_type/2 but IMHO there should be only chary_type/2 working for chars and codes in any case) and for "digit-class" codes and chars yields the compound digit(X):
?- code_type(0'9,digit(X)).
X = 9.
?- code_type('9',digit(X)).
X = 9.
?- findall(W,code_type('9',W),B).
B = [alnum,csym,prolog_identifier_continue,ascii,
digit,graph,to_lower(57),to_upper(57),
digit(9),xdigit(9)].
And so one can write this for clean char/code transparency:
zero(C) --> [C],{code_type(C,digit(0))}.
nonzero(C) --> [C],{code_type(C,digit(X)),X>0}.
any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).
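A compact variation on the same idea, as a sketch only: instead of returning the matched element, the rule below returns the digit's numeric weight, so the caller does not need to care whether the input was a list of codes or a list of chars. It relies on SWI-Prolog's code_type/2 accepting both, as shown in the queries above; the name digit_value//1 is hypothetical.

```prolog
% digit_value(-V): consume one decimal digit (code or char) and yield its weight.
% SWI-Prolog specific: code_type/2 accepts both codes and chars.
digit_value(V) --> [C], { code_type(C, digit(V)) }.

% ?- phrase(digit_value(V), [0'7]).   % input given as codes
% ?- phrase(digit_value(V), ['7']).   % input given as chars
```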
In SWI-Prolog in particular
SWI-Prolog by default prefers codes. Try this:
The flags
back_quotes
double_quotes
influence interpretation of "string" and `string` in "standard code". By default, "string" is interpreted as an atomic "string" whereas `string` is interpreted as a "list of codes".
Outside of DCGs, the following holds in SWI-Prolog, with all flags at their default:
?- string("foo"),\+atom("foo"),\+is_list("foo").
true.
?- L=`foo`.
L = [102,111,111].
However, in DCGs, both "string" and `string` are interpreted as "codes" by default.
Without any settings changed, consider this DCG:
representation(double_quotes) --> "bar". % SWI-Prolog decomposes this into CODES
representation(back_quotes) --> `bar`. % SWI-Prolog decomposes this into CODES
representation(explicit_codes_1) --> [98,97,114]. % explicit CODES (as obtained via atom_codes(bar,Codes))
representation(explicit_codes_2) --> [0'b,0'a,0'r]. % explicit CODES
representation(explicit_chars) --> ['b','a','r']. % explicit CHARS
Which of the above matches codes?
?-
findall(X,
(atom_codes(bar,Codes),
phrase(representation(X),Codes,[])),
Reps).
Reps = [double_quotes,back_quotes,explicit_codes_1,explicit_codes_2].
Which of the above matches chars?
?- findall(X,
(atom_chars(bar,Chars),phrase(representation(X),Chars,[])),
Reps).
Reps = [explicit_chars].
When starting swipl with swipl --traditional the backquoted representation is rejected with Syntax error: Operator expected, but otherwise nothing changes.
The Prolog Standard (6.3.7) says:
A double quoted list is either an atom (6.3.1.3) or a list (6.3.5).
Consequently, the following should succeed:
Welcome to SWI-Prolog (threaded, 64 bits, version 7.6.4)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.
For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).
?- Foo = "foo", (atom(Foo) ; Foo = [F, O, O]).
false.
So SWI-Prolog is not a Prolog by default. That's OK, but if you want to know about SWI-Prolog's non-Prolog behavior, please adjust the tags on the question.
From the definition it also follows that double quoted lists are completely useless by default even in a conforming Prolog: They might denote atoms, so regardless of the chars/codes distinction you can't even know that the double quoted list is actually a list. Even DCGs that only care about structural properties of the "text" (whether it's a palindrome, for example) are useless if the "list" is in fact an atom.
Hence a Prolog program that wants to process text with DCGs must at startup force the double_quotes flag to the value it wants. You have the choice between codes and chars. Codes have no advantages over chars, but they do have disadvantages in readability and typeability. Thus:
Answer: Use chars. Set the double_quotes flag explicitly.
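In SWI-Prolog, that advice might look roughly like the following sketch. The non-terminals greeting//0 and who//0 are invented for the example; the directive is the standard set_prolog_flag/2, which in SWI-Prolog affects how the rest of the file is read.

```prolog
% Force double-quoted text to be read as lists of chars in this file.
:- set_prolog_flag(double_quotes, chars).

% With the flag set, "hello" below is read as [h,e,l,l,o].
greeting --> "hello", " ", who.
who      --> "world".

% ?- atom_chars('hello world', Cs), phrase(greeting, Cs).
```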
I should start by noting that the answer to the "Should text-processing DCGs be written to handle codes or chars? Or both?" question can be neither. DCGs work by using an implicit difference list to thread state. But the elements of that difference list can be other than chars or codes. It depends on the output of text tokenization and what exactly text processing entails. E.g. I have worked on and come across Prolog NLP applications where codes/chars were only used for the basic tokenization and the reasoning was performed (still with DCGs) using either atoms or compound terms that reified the token data (e.g. v(Verb) or n(Noun)). One of those applications (a personal assistant of the kind that is common nowadays in phones) used atoms produced by a voice-recognition component.
But let's go back to chars vs codes. Legacy practices and failed standardization left Prolog with problematic text representation. ASCII gives us a single quote, a back quote, and a double quote. With single quotes being used for atoms, a choice could have been made to use e.g. back quotes for representing a list of codes and double quotes for representing a list of chars. Or the other way around. Instead, and this is where standardization failed, we got the problematic double_quotes flag. There's no shortage of Prolog code in the wild that makes an assumption about the meaning of double-quoted terms and thus works or breaks depending on the implicit value of the double_quotes flag (if you think this is mainly an issue with legacy code, think again). Guess what happens when we try to combine code that requires different values for the flag? Note that, in almost all systems (including those that support modules), the flag value is global ... As Isabelle wrote in her answer, setting the flag explicitly is good general advice. But not, as I explained, without problems.
Some systems provide additional values for the flag. E.g. SWI-Prolog allows the flag to also be set to string. GNU Prolog supports additional atom_no_escape, chars_no_escape and codes_no_escape. Some systems only support codes. Some systems also provide a back_quotes flag. This Tower of Babel means that portable and resilient code is often forced to use atoms to represent text. But this may not be ideal from a performance perspective.
Back to the original question. As Isabelle mentioned, chars is usually a more readable (read, easier to debug) choice. But, depending on the Prolog system, codes may provide better performance. If application performance is critical, benchmark both solutions. Some recent Prolog systems (e.g. Scryer-Prolog or Trealla Prolog) have efficient support for chars. Older systems may trail behind.
Note that your question is very much related to I/O. Prior to ISO, many systems in the DEC-10 succession supported a single kind of I/O via get0/1 and put/1 (and versions for tty) which served both characters and bytes at the same time. What can go wrong with that? Today, that is obvious. But multi-octet character set handling (MOCSH as it was called) was for many a much more exotic feature than it is today, a quarter century after the standard's publication. After all, the UTF-8 encoding that is mostly accepted today was invented in 1992-09 and first published in 1993. And, like so many projects such as TRON, it could have failed as well. Some other programming languages got burnt by betting on the UCS-2/UTF-16 encoding.
What the standard did was to split I/O into character and byte I/O (and their corresponding types text and binary). So there is now get_char/1, get_byte/1 ... That the _byte versions all use integers in the range of 0..255 was non-controversial (plus -1 for EOF). But what about the _char versions? The only way to resolve this was to provide both _char and _code versions and consequently chars and codes versions of double-quoted strings and related built-ins. The default for flag double_quotes is implementation defined (7.11.2.5).
In this manner systems with a lot of DEC-10 legacy could continue to use codes explicitly. For them, an integer thus meant either an integer or a byte or a character. But users of such systems could still use the better encoding. New systems that do not have to deal with such legacies going back to 1977, like Tau, Scryer, and Trealla, opt for chars as the default. As far as tradition is concerned, note that Prolog I, often called Marseille Prolog, did encode double quoted strings as lists of atoms of length one. And in the preliminary version of Prolog of 1972, often called Prolog 0, strings were encoded as nil-s-t-r-i-n-g qua boum facilitating stemming. In any case, not a single character code was present at all.
The advantages of chars should be obvious. It is much easier to read and debug, in particular if you have partially instantiated strings, say [a,X,c] vs. [97,X,99], which occur often when generalizing queries as with library(diadem). It is also a bit shorter to write. And, double quoted strings can be used for printing answers.
If you really want to write programs that both support codes and chars at the same time, use rather goals like [Ch] = "a" where Ch is now the atom a or the integer 97 or 129 or whatever processor character set you are using. It all depends on the Prolog flag double_quotes. And more succinctly you can write
nonzero(C) --> [C],{member(C,"123456789")}.
What is even more important is that phrase("abc", "abc") still holds.
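To make that concrete, here is a sketch for SWI-Prolog in which the grammar text stays the same and only the flag directive decides whether the rules work on codes or on chars. The directive value is an arbitrary choice for the example; note that with SWI-Prolog's default value (string) the member/2 goal would simply fail, since a string object is not a list.

```prolog
% Pick one representation for this file; replacing codes with chars
% switches the grammar without touching the rules below.
:- set_prolog_flag(double_quotes, codes).

nonzero(C) --> [C], { member(C, "123456789") }.
zero(C)    --> [C], { [C] = "0" }.

any_digit(C) --> zero(C).
any_digit(C) --> nonzero(C).

% With the flag set to codes:
% ?- atom_codes('7', Cs), phrase(any_digit(C), Cs).
```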
However, changing that flag within the same application is certainly not a good idea (nor to switch to the value atom or some non-conforming value).
((When using chars note that single quotes as in C = 'a' are a bit misleading since the single quotes do not serve any purpose. Instead, round brackets are preferable if you want to ensure that the code will be valid even in the presence of an operator declaration for a. When a occurs as a functor's argument or a list's element, no round brackets are needed, but they are often used redundantly in operator declarations.))
You are making incorrect assumptions. These are not "chars":
foo_or_bar(foo) --> "foo".
The "foo" is a string, in SWI-Prolog, but this works perfectly within a DCG rule definition. The place to read about this is here, in particular:
A DCG literal
Although represented as a list of codes is the correct representation for handling in DCGs, the DCG translator can recognise the literal and convert it to the proper representation. Such code need not be modified.
All your other suggestions are just unnecessary, you should be either enumerating explicitly all possible "nonzeros", digits and so on, or using the library.
PS: if your main goal is to write code that runs on any Prolog, you might as well use something like Logtalk instead.

What does the "%" mean in tcl?

In a situation like this for example:
[% $create_port %]
or [list [% $RTL_LIST %]]
I realized it had to do with the brackets, but what confuses me is that sometimes it is used with the brackets followed by a variable, and sometimes you have brackets with variables inside without the %.
So I'm not sure what it is used for.
Any help is appreciated.
% is not a metacharacter in the Tcl language core, but it still has a few meanings in Tcl. In particular, it's the modulus operator in expr and a substitution field specifier in format, scan, clock format and clock scan. (It's also the default prompt character, and I have a trivial pass-through % command in my ~/.tclshrc to make cut-n-pasting code easier, but nobody else in the world needs to follow my lead there!)
But the code you have written does not appear to be any of those (because it would be a syntax error in all of the commands I've mentioned). It looks like it is some sort of directive processing scheme (with the special sequences being [% and %], with the brackets) though not one I recognise such as doctools or rivet. Because a program that embeds a Tcl interpreter could do an arbitrary transformation to scripts before executing them, it's extremely difficult to guess what it might really be.

What is the difference between "hello".length and "hello" .length?

I am surprised when I run the following examples in ruby console. They both produce the same output.
"hello".length
and
"hello" .length
How does the ruby console remove the space and provide the right output?
You can put spaces wherever you want, the interpreter looks for the end of the line. For example:
Valid
"hello".
length
Invalid
"hello"
.length
The interpreter sees the dot at the end of the line and knows something has to follow it up. While in the second case it thinks the line is finished. The same goes for the amount of spaces in one line. Does it matter how the interpreter removes the spaces? What matters is that you know the behavior.
If you want you can even
"hello" . length
and it will still work.
I know this is not an answer to your question, but does the "how" matter?
EDIT: I was corrected in the comments below. The examples with multiple lines given above are both valid when run in a script instead of IRB. I mixed them up with the operators, where the following also applies when running a script:
Valid
result = true || false
Valid
result = true ||
false
Invalid
result = true
|| false
This doesn't have as much to do with the console as it has to do with how the language itself is parsed by the compiler.
Most languages are parsed in such a way that items to be parsed are first grouped into TOKENS. Then the compiler is defined to expect a certain SEQUENCE of tokens in order to interpret each programming statement.
Because the compiler is only looking for a TOKEN SEQUENCE, it doesn't matter if there is space in between or not.
In this case the compiler is looking for:
STRING DOT METHOD_NAME
So it won't matter if you write "hello".length, or even "hello" . length. The same sequence of tokens is present in both, and that is all that matters to the compiler.
If you are curious how these token sequences are defined in the Ruby source code, you can look at parse.y starting around line 1042:
https://github.com/ruby/ruby/blob/trunk/parse.y#L1042
This is a file that is written using the YACC language, which is a language used to define parsers with.
Even without knowing anything about YACC, you should already be able to get some clues on how it works by just looking around the file a bit.

prolog read from file throws error on special characters

I am working in Prolog, and am trying to read in from a file. The first line is a password. With the password, I want to be able to use special characters.
Here is the read file code:
readfile(Filename):-
open(Filename, read, Str),
read(Str, Thepassword),
read(Str, Thefirewall),
close(Str),
nb_setval(password, Thepassword),
nb_setval(firewall, Thefirewall).
This works fine until I change the password from brittany to britta!y, then I get ERROR: computer1.txt:1: Syntax error: Operator expected.
Anyone know what I should do?
read/2 reads Prolog terms. What you probably want is to read the whole line regardless of whether it is in Prolog syntax or not.
In SWI-Prolog you can use the predicate read_line_to_codes/2 instead. (See the SWI manual entry). You must include the library with use_module(library(readutil)) first.
SICStus has a similar predicate called read_line/1/2.
If you need an atom instead of a list of codes, you can convert it with atom_codes/2.
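Applied to the readfile/1 from the question, this could look something like the sketch below (SWI-Prolog). It assumes the file contains at least two lines and that each line is the raw value: a trailing period, as needed for read/2, would now become part of the password. At end of file read_line_to_codes/2 returns the atom end_of_file instead of a code list, which this sketch does not handle.

```prolog
:- use_module(library(readutil)).   % read_line_to_codes/2

% readfile(+Filename): store line 1 as the password and line 2 as the firewall.
readfile(Filename) :-
    setup_call_cleanup(
        open(Filename, read, Str),
        ( read_line_to_codes(Str, PasswordCodes),
          read_line_to_codes(Str, FirewallCodes) ),
        close(Str)),                          % stream is closed even on error
    atom_codes(Thepassword, PasswordCodes),   % code list -> atom
    atom_codes(Thefirewall, FirewallCodes),
    nb_setval(password, Thepassword),
    nb_setval(firewall, Thefirewall).
```

Using setup_call_cleanup/3 here is a design choice of the sketch, not something the original code did; plain open/close as in the question would work too.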
