Pattern that matches UTF-8 encoded characters [duplicate] - utf-8

When answering this question, I wrote this code to iterate over the UTF-8 byte sequence in a string:
local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\x7F\xC2-\xF4][\x80-\xBF]*") do
print(c)
end
It works in Lua 5.2, but in Lua 5.1, it reports an error:
malformed pattern (missing ']')
I recall in Lua 5.1, the string literal \xhh is not supported, so I modified it to:
local str = "KORYTNAČKA"
for c in str:gmatch("[\0-\127\194-\244][\128-\191]*") do
print(c)
end
But the error stays the same, how to fix it?

See the Lua 5.1 manual on patterns.
A pattern cannot contain embedded zeros. Use %z instead.
In Lua 5.2, this was changed so that you could use \0 instead, but not so for 5.1. Simply add %z to the first set and change the first range to \1-\127.

I highly suspect, this happens because of \0 in the pattern. Basically, string that holds your pattern null-terminates before it should and, in fact, what lua regex engine is parsing is: [\0. That's clearly wrong pattern and should trigger the error you're currently getting.
To prove this concept I made little change to pattern:
local str = "KORYTNAČKA"
for c in str:gmatch("[\x0-\x7F\xC2-\xF4][\x80-\xBF]*") do
print(c)
end
That compiled and ran as expected on lua 5.1.4. Demonstration
Note: I have not actually looked what pattern was doing. Just removed \0 by adding x. So output of modified code might not be what you expect.
Edit: As a workaround you might consider replacing \0 with \\0 (to escape null-termination) in your second code example:
local str = "KORYTNAČKA"
for c in str:gmatch("[\\0-\127\194-\244][\128-\191]*") do
print(c)
end
Demo

Related

JISON: How do I avoid "dog" being parsed as "do"?

I have the following JISON file (lite version of my actual file, but reproduces my problem):
%lex
%%
"do" return 'DO';
[a-zA-Z_][a-zA-Z0-9_]* return 'ID';
"::" return 'DOUBLECOLON'
<<EOF>> return 'ENDOFFILE';
/lex
%%
start
: ID DOUBLECOLON ID ENDOFFILE
{$$ = {type: "enumval", enum: $1, val: $3}}
;
It is for parsing something like "AnimalTypes::cat". It works fine for things like "AnimalTypes::cat", but the when it sees dog instead of cat, it asumes it's a DO instead of an id. I can see why it does that, but how do I get around it? I've been looking at other JISON documents, but can't seem to spot the difference that (I assume) makes those work.
This is the error I get:
JisonParserError: Parse error on line 1:
PetTypes::dog
----------^
Expecting "ID", "enumstr", "id", got unexpected "DO"
Repro steps:
Install jison-gho globally from npm (or modify code to use local version). I use Node v14.6.0.
Save the JISON above as minimal-repro.jison
Run: jison -m es -o ./minimal.mjs ./minimal-repro.jison to create parser
Create a file named test.mjs with code like:
import Parser from "./minimal.mjs";
Parser.parser.parse("PetTypes::dog")
Run node test.mjs
Edit: Updated with a reproducible example.
Edit2: Simpler JISON
Unlike (f)lex, the jison lexer accepts the first matching pattern, even if it is not the longest matching pattern. You can get the (f)lex behaviour by using
%option flex
However, that significantly slows down the scanner.
The original jison automatically added \b to the end of patterns which ended with a literal string matching an alphabetic character, to make it easier to match keywords without incurring this overhead. In jison-gho, this feature was turned off unless you specify
%option easy_keyword_rules
See https://github.com/zaach/jison/wiki/Deviations-From-Flex-Bison#user-content-literal-tokens.
So either of those options will achieve the behaviour you expect.

What does this variable assignment do?

I'm having to code a subversion hook script, and I found a few examples online, mostly python and perl. I found one or two shell scripts (bash) as well. I am confused by a line and am sorry this is so basic a question.
FILTER=".(sh|SH|exe|EXE|bat|BAT)$"
The script later uses this to perform a test, such as (assume EXT=ex):
if [[ "$FILTER" == *"$EXT"* ]]; then blah
My problem is the above test is true. However, I'm not asking you to assist in writing the script, just explaining the initial assignment of FILTER. I don't understand that line.
Editing in a closer example FILTER line. Of course the script, as written does not work, because 'ex' returns true, and not just 'exe'. My problem here is only, however, that I don't understant the layout of the variable assignment itself.
Why is there a period at the beginning? ".(sh..."
Why is there a dollar sign at the end? "...BAT)$"
Why are there pipes between each pattern? "sh|SH|exe"
You probably looking for something as next:
FILTER="\.(sh|SH|exe|EXE|bat|BAT)$"
for EXT
do
if [[ "$EXT" =~ $FILTER ]];
then
echo $EXT extension disallowed
else
echo $EXT is allowed
fi
done
save it to myscript.sh and run it as
myscript.sh bash ba.sh
and will get
bash is allowed
ba.sh extension disallowed
If you don't escape the "dot", e.g. with the FILTER=".(sh|SH|exe|EXE|bat|BAT)$" you will get
bash extension disallowed
ba.sh extension disallowed
What is (of course) wrong.
For the questions:
Why is there a period at the beginning? ".(sh..."
Because you want match .sh (as extension) and not for example bash (without the dot). And therefore the . must be escaped, like \. because the . in regex mean "any character.
Why is there a dollar sign at the end? "...BAT)$"
The $ mean = end of string. You want match file.sh and not file.sh.jpg. The .sh should be at the end of string.
Why are there pipes between each pattern? "sh|SH|exe"
In the rexex, the (...|...|...) construction delimites the "alternatives". As you sure quessed.
You really need read some "regex tutorial" - it is more complicated - and can't be explained in one answer.
Ps: NEVER use UPPERCASE variable names, they can collide with environment variables.
This just assigns a string to FILTER; the contents of that string have no special meaning. When you try to match it against the pattern *ex*, the result is true assuming that the value of $FILTER consists the string ex surrounded by anything on either side. This is true; ex is a substring of exe.
FILTER=".(sh|SH|exe|EXE|bat|BAT)$"
^^
|
+---- here is the "ex" from the pattern.
As I can this is similar to regular expression pattern:
In regular expressions the string start with can be show with ^, similarly in this case . represent seems doing that.
In the bracket you have exact string, which represents what the exact file extensions would be matched, they are 'Or' by using the '|'.
And at the end the expression should only pick the string will '$' or end point and not more than.
I would say that way original author might have looked at it and implemented it.

Single quote string interpolation to access a file in linux

How do I make the parameter file of the method sound become the file name of the .fifo >extension using single quotes? I've searched up and down, and tried many different >approaches, but I think I need a new set of eyes on this one.
def sound(file)
#cli.stream_audio('audio\file.fifo')
end
Alright so I finally got it working, might not be the correct way but this seemed to do the trick. First thing, there may have been some white space interfering with my file parameter. Then I used the File.join option that I saw posted here by a few different people.
I used a bit of each of the answers really, and this is how it came out:
def sound(file)
file = file.strip
file = File.join('audio/',"#{file}.fifo")
#cli.stream_audio(file) if File.exist? file
end
Works like a charm! :D
Ruby interpolation requires that you use double quotes.
Is there a reason you need to use single quotes?
def sound(FILE)
#cli.stream_audio("audio/#{FILE}.fifo")
end
As Charles Caldwell stated in his comment, the best way to get cross-platform file paths to work correctly would be to use File.join. Using that, your method would look like this:
def sound(FILE)
#cli.stream_audio(File.join("audio", "#{FILE}.fifo"))
end
Your problem is with your usage of file path separators. You are using a \. Whereas this may not seem like a big deal, it actually is when used in Ruby strings.
When you use \ in a single quoted string, nothing happens. It is evaluated as-is:
puts 'Hello\tWorld' #=> Hello\tWorld
Notice what happens when we use double quotes:
puts "Hello\tWorld" #=> "Hello World"
The \t got interpreted as a tab. That's because, much like how Ruby will interpolate #{} code in a double quote, it will also interpret \n or \t into a new line or tab. So when it sees "audio\file.fifo" it is actually seeing "audio" with a \f and "ile.fifo". It then determines that \f means 'form feed' and adds it to your string. Here is a list of escape sequences. It is for C++ but it works across most languages.
As #sawa pointed out, if your escape sequence does not exist (for instance \y) then it will just remove the \ and leave the 'y'.
"audio\yourfile.fifo" #=> audioyourfile.fifo
There are three possible solutions:
Use a forward slash:
"audio/#{file}.fifo"
The forward slash will be interpreted as a file path separator when passed to the system. I do most my work on Windows which uses \ but using / in my code is perfectly fine.
Use \\:
"audio\\#{file}.fifo"
Using a double \\ escapes the \ and causes it to be read as you intended it.
Use File.join:
File.join("audio", "#{file}.fifo")
This will output the parameters with whatever file separator is setup as in the File::SEPARATOR constant.

Regular expression "empty range in char class error"

I got a regex in my code, which is to match pattern of url and threw error:
/^(http|https):\/\/([\w-]+\.)+[\w-]+([\w- .\/?%&=]*)?$/
The error was "empty range in char class error". I found the cause of that is in ([\w- .\/?%&=]*)? part. Ruby seems to recognize - in \w- . as an operator for range instead of a literal -. After adding escape to the dash, the problem was solved.
But the original regular expression ran well on my co-workers' machines. We use the same version of osx, rails and ruby: Ruby version is ruby 1.9.3p194, rails is 3.1.6 and osx is 10.7.5. And after we deployed code to our Heroku server, everything worked fine too. Why did only my environment have error regarding this regex? What is the mechanism of Ruby regex interpreting?
I can replicate this error on Ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux], installed on Ubuntu 12.04.1 LTS using rvm 1.13.4. However, this should not be a version-specific error. In fact, I'm surprised it worked on the other machines at all.
A a simpler demonstration that fails just as well:
"abcd" =~ /[\w- ]/
This is because [\w- ] is interpreted as "a range beginning with any word character up to space (or blank)", rather than a character class containing a word, a hyphen, or a space, which is what you had intended.
Per Ruby's regular expression documentation:
Within a character class the hyphen (-) is a metacharacter denoting an inclusive range of characters. [abcd] is equivalent to [a-d]. A range can be followed by another range, so [abcdwxyz] is equivalent to [a-dw-z]. The order in which ranges or individual characters appear inside a character class is irrelevant.
As you saw, prepending a backslash escaped the hyphen, thus changing the nature of the regexp from a range to a character class, removing the error. However, escaping the hyphen in the middle of character class is not recommended, since it's easy to confuse the intended meaning of the hyphen in such cases. As m.buettner pointed out, always place hyphens either at the beginning or the end of a character class:
"abcd" =~ /[-\w ]/

Regular expression to match only the first file in a RAR file set

To see what file to invoke the unrar command on, one needs to determine which file is the first in the file set.
Here are some sample file names, of which - naturally - only the first group should be matched:
yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
One (limited) way to do it with PCRE compatible regexps is this:
.*(?:(?<!part\d\d\d|part\d\d|\d)\.rar|\.part0*1\.rar)
This did not work in Ruby when I tested it at Rejax however.
How would you write one Ruby compatible regular expression to match only the first file in a set of RAR files?
Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.
RAR's headers will tell you which file is the first on in the volume, assuming they were created in a somewhat-recent version of RAR.
HEAD_FLAGS Bit flags:
2 bytes
0x0100 - First volume (set only by RAR 3.0 and later)
So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt. I have done my own tests with spanning RAR archives and their headers are correct according to the link above.
This is a much, much safer way of determining which file is first in a set like this.
The short answer is that it's not possible to construct a single regex to satisfy your problem. Ruby 1.8 does not have lookaround assertions (the (?<! stuff in your example regex) which is why your regex doesn't work. This leaves you with two options.
1) Use more than one regex to do it.
def is_first_rar(filename)
if ((filename =~ /part(\d+)\.rar$/) == nil)
return (filename =~ /\.rar$/) != nil
else
return $1.to_i == 1
end
end
2) Use the regex engine for ruby 1.9, Oniguruma. It supports lookaround assertions, and you can install it as a gem for ruby 1.8. After that, you can do something like this:
def is_first_rar(filename)
reg = Oniguruma::ORegexp.new('.*(?:(?<!part\d\d\d|part\d\d|\d)\.rar|\.part0*1\.rar)')
match = reg.match(filename)
return match != nil
end
Personally I wouldn't use (extended) regular expressions in this case (or at least not just one to do it all). What's wrong with coding this in, for example, a few ifs?
I am no regex expert but here is my attempt
^(yes|no)\.(rar|part0*1\.rar)$
Replace "yes|no" with the actual file name. I matched it against your examples to see if it would only match the first set hence the "yes|no" in the regex.
UPDATE: fixed as per the comment. Not sure why the user would not know the filename so i did not fix that part...

Resources