Atom escaping rules in Prolog - prolog

I need to export to a file a Prolog program expressed using an arbitrary term representation in Java. The idea is that a Prolog interpreter should be able to consult the generated file afterwards.
My question is about the correct way to write in the file Java Strings representing atom terms.
For example, if the string has a space in the middle, it should be surrounded by single quotes in the file:
hello world becomes 'hello world'
And the exporter should take into consideration characters that should be escaped:
' becomes '\''
Could someone point me to the place where these rules are specified? And can I assume that these rules are respected by major Prolog implementations? (I mean, would a Prolog program generated following these rules be correctly parsed by most Prolog interpreters?)

The precise place for this is the standard, ISO/IEC 13211-1:1995, quoted_token (* 6.4.2 *). See this answer for how to get it for USD 30.
The precise syntax is quite complex due to a lot of extras like continuation lines and the like. If you are only writing atoms that should be read by Prolog, things are a bit easier. Also, in that situation you could always quote, which makes writing a bit simpler again.
Some things to be aware of:
Only simple spaces may occur as layout in a quoted atom. All other layout characters need to be escaped, like \t and \n (the control escapes abrftnv). Many systems also accept other layout characters, but they differ from each other in very tiny details.
Backslash and quote must be escaped.
Characters outside the printable ASCII range depend on the PCS (processor character set) supported by a system. In a conforming system, the accompanying documentation should define how the additional characters (extended characters) are classified. Documentation quality varies widely.
In any case, also test your interface with GNU Prolog from 1.4.1 upwards. To date, no differences are known between GNU Prolog 1.4.1+ and the standard as far as syntax is concerned.
Here are some 240+ syntax-related test cases. Please report any oversights!
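To make the always-quote route concrete, here is a minimal sketch (in Python rather than Java, but the mapping is direct). It assumes the atom text contains only printable ASCII plus the usual control characters, and it quotes unconditionally instead of checking whether quoting is actually needed:
def quote_atom(text):
    # Always produce a quoted atom: escape backslash and single quote,
    # and use control escapes for layout characters other than a plain space.
    escapes = {
        "\\": "\\\\", "'": "\\'",
        "\a": "\\a", "\b": "\\b", "\f": "\\f",
        "\n": "\\n", "\r": "\\r", "\t": "\\t", "\v": "\\v",
    }
    return "'" + "".join(escapes.get(ch, ch) for ch in text) + "'"

print(quote_atom("hello world"))   # 'hello world'
print(quote_atom("it's"))          # 'it\'s'
print(quote_atom("a\tb"))          # 'a\tb'
Anything outside ASCII would still need the extended-character handling mentioned above.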

A practical hint: if you call writeq in your Prolog system on the data you care about, it will put quotes around atoms when required.

Related

Escape sequences \033[01;36m\] vs. \033[1;36m\] in PS1 in .bashrc: why the zero?

I've just compared the $PS1 prompts in .bashrc on two of my Debian machines:
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;36m\]\u\[\033[0;90m\]#\[\033[0;32m\]\h\[\033[0;90m\]:\[\033[01;34m\]\w\[\033[0;90m\]\$\[\033[0m\] '
PS1='${debian_chroot:+($debian_chroot)}\[\033[1;36m\]\u\[\033[0;37m\]#\[\033[0;32m\]\h\[\033[0;37m\]:\[\033[01;34m\]\w\[\033[0;37m\]\$\[\033[0m\] '
As you see, the first sequence says \033[01;, whereas the second has \033[1; in the same position. Do both mean the same (I guess, bold) or do they mean something different? Any idea why the zero has appeared or disappeared? I have no recollection of having introduced or removed this zero myself. A Web search returns numerous occurrences both with and without the zero.
"ANSI" numeric parameters are all decimal integers (see ECMA-48, section 5.4.1 Parameter Representation). In section 5.4.2, it explains
A parameter string consists of one or more parameter sub-strings, each of which represents a number in decimal notation.
A leading zero makes no difference. Someone noticed the unnecessary character and trimmed it.
The ESC[#;#m escape is for the console font color. I've seen many subtle variations in escape implementations, so I'm not surprised. Regardless, I think both should be interpreted the same way.
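If you want to see this for yourself, here is a quick sketch (assuming a terminal that interprets ANSI SGR sequences; the colors are the ones from the prompt above):
# Both select bold + cyan; the leading zero in "01" is insignificant.
print("\033[01;36mwith a leading zero\033[0m")
print("\033[1;36mwithout a leading zero\033[0m")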

What are Unicode codepoint types for?

I recently read the UTF-8 Everywhere manifesto, a document arguing for handling text with UTF-8 by default. The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
However, some modern languages that use the UTF-8 default have built-in codepoint types, such as rune in Go and char in Rust.
What are these types actually useful for? Are they legacy from times before the meaninglessness of codepoints was broadly understood? Or is that an incomplete perspective?
Texts have many different meanings and usages, so the question is difficult to answer.
First, about the term code point. We use the term code point because it is convenient: it implies a number (a code) and is not easily confused with other terms. Unicode tells us that it does not use the terms code point and character in a consistent way, but also that this is not a problem: the context is clear, and they are often interchangeable (except for the few code points which are not characters, like surrogates, and a few reserved code points). Note: Unicode is mostly about characters, and ISO 10646 was mostly about code points. So the original ISO standard was about a table with numbers (code points) and names, and Unicode about the properties of characters. We may therefore use code point where Unicode character would be better, but character is easily confused with C's char and with font glyphs/graphemes.
Code points are a basic unit, and so are useful for most programs, e.g. for storing text in databases, exchanging it with other programs, saving it to files, sorting, etc. For exactly these reasons, programming languages use the code point as a type. UTF-8 code units could be an alternative, but they are more difficult to navigate (think of UTF-8 as a tape, which you must read sequentially, and code point text as a hard disk, where you can jump into the middle of the text; the analogy is not 100% appropriate, because you may need some context bytes). If you are receiving user text, your program probably does not need to split it into graphemes, handle ligatures, etc. if it will just store the data in a database. The code point is really low level, and therefore fast for most operations.
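As a rough illustration of that tape-versus-disk difference (Python's str is indexed by code point, while its UTF-8 encoding is a plain byte sequence; the example string is arbitrary):
s = "naïve"
print(len(s), s[2])            # 5 ï   – direct access by code point index
b = s.encode("utf-8")
print(len(b))                  # 6     – 'ï' occupies two UTF-8 code units
print(b[2:4].decode("utf-8"))  # ï     – byte access needs knowledge of character boundaries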
The other part of text is displaying it (or speech). This part is very complex, because we have many different scripts with very different rules, and then different languages with their own special cases. So we need a series of libraries, e.g. a text layout library (word separation and the like, such as Pango), a shaping engine (to find which glyph to use, combine characters, decide where to put the next character, e.g. HarfBuzz), and a font library which renders the glyphs (Cairo plus FreeType). It is complex, but most programmers do not need special handling: they just read text from a database and send it to the screen, so they use the relevant library (which depends on the operating system) and move on. It is too complex for a language specification (and also a moving target; maybe in 30 years things will be more standardized). Because it is complex and involves many operations, we may use more elaborate structures (an array of arrays of code points, i.e. an array of graphemes) without much of a slowdown. Note: fonts have code point tables used for various operations before the glyph index is found, and various APIs take Unicode strings (as code point arrays, UTF-16, UTF-8, etc.).
Naturally things get more complex still, and require a lot of knowledge of different parts of Unicode, if you are trying to program an editor (WYSIWYG, but also terminal-based): you mix both worlds and need much more information (e.g. for text selection). In that case you must create your own structures.
And really, things are complex: do you want to just show the first x characters on your blog, or split at word boundaries (some languages are not so linear, so the interpretation may be very wrong)? For now only humans can do a good job of this for all languages, so there is not yet a need for a supporting type in programming languages.
The manifesto argues that Unicode codepoints aren't a generally useful concept and shouldn't be directly interacted with outside of programs/libraries specializing in text processing.
Where? It merely outlines advantages and disadvantages of code points. Two examples are:
Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character Ω, and must be treated identically.
Moreover, for some abstract characters, there exist representations using multiple code points, in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.
In other words: code points just index which graphemes Unicode supports.
Sometimes they're meant as single characters: one prominent example would be € (EURO SIGN), having only the code point U+20AC.
Sometimes the same character has multiple code points depending on context: the dollar sign exists as:
﹩ = U+FE69 (SMALL DOLLAR SIGN)
＄ = U+FF04 (FULLWIDTH DOLLAR SIGN)
💲 = U+1F4B2 (HEAVY DOLLAR SIGN)
Storage-wise, when searching for one variant you might want to match all three variants instead of relying on the exact code point only.
Sometimes multiple code points can be combined to form a single character:
á = U+00E1 (LATIN SMALL LETTER A WITH ACUTE), also termed "precomposed"
á = combination of U+0061 (LATIN SMALL LETTER A) and U+0301 (COMBINING ACUTE ACCENT) - in a text editor trying to delete á (from the right side) will mostly result in actually deleting the acute accent first. Searching for either variant should find both variants.
Storage-wise, you can avoid having to search for both variants by performing Unicode normalization, i.e. NFC, to always favor precomposed code points over sequences of combining code points that form one character.
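A small sketch of that normalization step, using Python's standard unicodedata module:
import unicodedata

precomposed = "\u00E1"   # á as a single code point
combining = "a\u0301"    # a + combining acute accent

print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True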
As for homoglyphs, code points clearly distinguish the contextual meaning:
A = U+0041 (LATIN CAPITAL LETTER A)
Α = U+0391 (GREEK CAPITAL LETTER ALPHA)
А = U+0410 (CYRILLIC CAPITAL LETTER A)
Copy the greek or cyrillic character, then search this website for that letter - it will never find the other letters, no matter how similar they look. Likewise the latin letter A won't find the greek or cyrillic one.
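The same point can be seen at the code point level (a small sketch):
for ch in ("A", "\u0391", "\u0410"):     # Latin A, Greek Alpha, Cyrillic A
    print(ch, "U+%04X" % ord(ch))        # U+0041, U+0391, U+0410
print("A" == "\u0391", "A" == "\u0410")  # False False – distinct code points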
Writing-system-wise, code points can be shared by multiple alphabets: the CJK portion is an attempt to use as few code points as possible while supporting as many languages as possible, covering Chinese (simplified, traditional, Hong Kong), Japanese, Korean, Vietnamese:
今 = U+4ECA
入 = U+5165
才 = U+624D
Dealing with code points as a programmer has valid uses. Programming languages which support them may (or may not) handle the encodings correctly (UTF-8 vs. UTF-16 vs. ISO-8859-1) and may (or may not) correctly produce surrogate pairs for UTF-16. Text-wise, users should not have to be concerned about code points, although knowing about them would help them distinguish homoglyphs.

Better algorithm for shortening English words

I have some unique codes that are generated from strings (ex: website host names) in various independent components of my application.
These codes are meant to be used by machines only, so I would like to keep them as short as possible.
The below algorithm would be applied to every word in the string. The output words would be concatenated with a dash to generate the unique code.
The current algorithm I have used:
- Skip word if length is less than 6
- Leave first character as is
- Remove every vowel in the word from the second character onwards
architectural digest eu => archtctrl-dgst-eu
arizona foothills magazine => arzn-fthlls-mgzn
Is there a better way to shorten an English word leaving it as recognisable as possible to a human reader?
The output should be deterministic and produce the same shortened version whenever it is run on the same input.
A good algorithm should also minimise the number of clashes for similarly spelt words.
I have some unique codes that are generated from strings
I am afraid that is not true. There are many English words that will reduce to the same 'code word' when stripped of their vowels; for example, 'leaving' and 'living' both become 'lvng'. Granted, this is fairly rare, but it could still cause issues.
How important is it that these 'code words' remain human-readable if, as you say, they are meant to be used by machines only? If it's not that important, I'd suggest looking into some simpler compression algorithms like Huffman coding or LZW compression. Then, if the user needs to see the translation of the code word, just decompress it.
If you must keep it human-readable, I'm not sure that there is much more you can do to shorten it. You could take a look at specific latin + greek roots, and determine if you can shorten those any more by hand, and then just substitute those out automatically.
Alternatively, you could turn to a phonetic approach. Automatically look up the pronunciation of the word, and then see if that is any shorter (or can itself be compressed, taking 'cee' to 'C', or 'kay' to 'K'). This would be much more time- and CPU-intensive, but it's still an option if you really, really need short yet readable codes.
What you're generating sounds like what's called a "slug". There are many libraries to handle this for blogs or site generators that should suit your purposes. Here's a usage example from a Python library called slugify:
from slugify import slugify

txt = "___This is a test ---"
r = slugify(txt)
assert r == "this-is-a-test"
Slug libraries generally work like this:
replace non-ASCII linguistic characters via a mapping (ex: 影師嗎 -> ying-shi-ma)
replace accented latin letters with ascii equivalents via a mapping (ex: C'est déjà l'été. -> c-est-deja-l-ete)
remove beginning and trailing spaces/punctuation
convert remaining spaces and punctuation to dashes, collapsing multiple dashes in a row to a single dash
If you want to make slugs shorter you could remove vowels or, more simply, use a maximum length.
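A rough, self-contained sketch combining the question's vowel-stripping rule with a slug-style cleanup and a maximum length (the function names and the 30-character cap are illustrative choices, not taken from any particular library):
import re

def shorten_word(word):
    # Per the question: keep short words as-is, otherwise keep the first
    # character and drop vowels from the rest.
    if len(word) < 6:
        return word
    return word[0] + re.sub(r"[aeiou]", "", word[1:])

def short_code(text, max_len=30):
    # Slug-style cleanup: lowercase, split on non-alphanumerics, join with dashes.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(shorten_word(w) for w in words)[:max_len]

print(short_code("architectural digest eu"))     # archtctrl-dgst-eu
print(short_code("arizona foothills magazine"))  # arzn-fthlls-mgzn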

Are parsing expression grammars suited to parsing the shell command language?

The POSIX shell command language is not easy to parse, largely because of tight coupling between lexing and parsing.
However, parsing expression grammars (PEGs) are often scannerless. By combining lexing and parsing, it seems that I could avoid these problems. The language that I am using (Rust) has a well-maintained PEG library. However, I know of three difficulties that could make it impractical to use this library:
Shells must be able to parse line by line, not reading characters past the end of the line.
Aliases are purely lexical, and can cause a token to be replaced by any sequence of other tokens in certain situations
Shell reserved words are only recognized in certain situations
Is a PEG suited to parsing the shell command language given these requirements, or is a hand-written recursive-descent parser more suitable?
Yes, a PEG can be used, and none of the issues you note should be a problem.
In particular:
1) Parsing line by line: most PEG tools do not have any built-in whitespace skipping. All whitespace, including newlines, must be handled explicitly by you, which means you can handle newlines any way you like.
2) You should not use the parse tree from the PEG as your AST. Instead, you should walk the parse tree and build an AST. For aliases, then, after the parse has completed and while you are building your AST, you can detect the alias and insert the appropriate expansion in its place.
3) Reserved words are not reserved unless you reserve them. That is, if you have a context where either a reserved word or another alphanumeric symbol can occur, you must check for the reserved words explicitly first and then for the arbitrary alphanumeric symbol, because once the PEG decides it has a match, it will not backtrack. Anywhere a reserved word is not permitted, simply don't check for it, and your generalised alphanumeric symbol rule will succeed instead.
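Point 3 can be sketched with a tiny PEG-style combinator in Python (the rule and helper names are invented for illustration, not taken from any particular PEG library): ordered choice tries the reserved-word alternative first, and in contexts where reserved words are not special you simply omit that alternative.
import re

def reserved(words):
    # Match one of the reserved words, but only as a whole word.
    pattern = re.compile(r"(?:%s)\b" % "|".join(words))
    def parse(text, pos):
        m = pattern.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

def word(text, pos):
    # Generalised alphanumeric symbol rule.
    m = re.compile(r"[A-Za-z0-9_]+").match(text, pos)
    return (m.group(), m.end()) if m else None

def ordered_choice(*alternatives):
    # The PEG '/' operator: the first alternative that matches wins.
    def parse(text, pos):
        for alt in alternatives:
            result = alt(text, pos)
            if result is not None:
                return result
        return None
    return parse

# Where a reserved word may occur, check it before the general word rule:
command_word = ordered_choice(reserved(["if", "then", "fi", "while", "do", "done"]), word)

print(command_word("if true; then", 0))  # ('if', 2)       – recognised as reserved
print(command_word("ifconfig -a", 0))    # ('ifconfig', 8) – falls through to the word rule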

Hidden Whitespace Best Practice

I think most people agree that trailing whitespace is not good practice. A lot of editors will display it for you or automatically strip it out.
Consider this Python function as a simple example:
The extra whitespace on lines 11 and 13 is wrong. What I'm wondering about is line 10: should a blank line inside a control block that doesn't change indentation have leading whitespace?
Most editors I've used will keep the cursor at the indentation level from the preceding line, so making a blank line without leading whitespace takes some extra formatting. What's the best practice? Should line 10 have leading whitespace or not?
When it comes to code execution it makes absolutely zero difference; the practice I have seen the most in Python is the one with whitespace, but I don't think anyone can really reasonably say one is objectively better than the other.
I'll try to answer your question sticking to your Python example, quoting their style guide.
From PEP-8:
Method definitions inside a class are separated by a single blank line.
From Wikipedia (blank line):
A blank line usually refers to a line containing zero characters (not counting any end-of-line characters); though it may also refer to any line that does not contain any visible characters (consisting only of whitespace).
If you believe the Wikipedia definition, you might ask why zero characters is preferred.
For one, it's simpler: even if autoindent is turned on in your editor, keeping that whitespace just fills your file with extra bytes for no good reason.
Second, a regex for a zero-character blank line is simpler as well: usually '^$' vs something like '^\s*$'.
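A quick sketch of the difference:
import re

for line in ["", "    ", "x = 1"]:
    print(repr(line), bool(re.match(r"^$", line)), bool(re.match(r"^\s*$", line)))
# ''      True  True
# '    '  False True   – a whitespace-only "blank" line needs the longer pattern
# 'x = 1' False False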
As other answers have pointed out, it makes no difference execution wise to put in whitespace. With no good reason to do so, I would say the best practice is to leave it out and keep it simple. Can you imagine a situation where a zero character line would be treated differently than a line with some whitespace? I would hate to program in that language. Putting in whitespace seems baroque to me.
As @PinkElephantsOnParade writes, it makes no difference for the execution. Thus, it's solely a matter of personal aesthetic preference.
I myself set my editor to display trailing whitespace, since I think trailing whitespace is a bad idea. Thus, your line 10 would be highlighted and stare me in the face all the time.
Nobody wants that, so I'd argue line 10 should not contain whitespace. (Which, coincidentally, is how my editor, Emacs, handles this automatically.)
To each his own, but one way to look at it is: code style is about human readability. Therefore, trailing white-space is only an issue if it extends the length of the line past some preexisting (self-imposed) limit (ex. 80 char limit).
On the other hand, if you consistently display white-space in your editor, and this matters to you, I personally would keep it there, as it would be (for some, at least) more efficient to have the white-space present; if you decide to add code at that line at some point, you won't have to add additional white-space.
