Parslet: How to Buffer/parse incrementally data - ruby

I'm writing an HTTP/1 response parser with parslet. It works, but only when I send the full payload.
I have smth like this:
rule(:response) {
response_line >> crlf >>
header.repeat.as(:headers) >> crlf >>
data.as(:data)
}
root :response
But if I pass an incomplete payload, I get:
parser.parse("HTTP/1.1 200 OK\r\n")
#=> Parslet::ParseFailed: Failed to match sequence (RESPONSE_LINE CRLF headers:(HEADER{0, }) CRLF data:DATA) at line 1 char 16.
I'd like to be able to feed bytes to the parser without failing, at least if they don't break the expectations. Is there a way to somehow "buffer" until some rule is broken, or all expectations are met?

Parslet matches a grammer against a whole document.
If you want to allow it to parse a partial document you need to define your grammar such that the missing parts are optional.
One approach may be to define a grammar that would match any one elements from the header, and define a 'the_rest' capture group that matches 'any.repeat'
Then you could recursively call the parser each time you get more document... with "the rest" plus anything more you have read in.
Each time you call it you would get one part of the header returned.

Related

Response assertion failing even though the document does contain the expected text

I have a response assertion validating the 'Document (text)' of a response.
This is the assertion
This is the failure
This is the response that the assertion is checking. As you can see I have directly copied 'Statutory Currency USD Total Tax (SC) 58.80' from the failing assertion and it is found in the response?!
Other assertions are passing for this request.
Notice you can (maybe should) switch to using Substring instead of Contains which will fix your issue, because it doesn't use regex which have reserved characters as (
Contains - true if the text contains the regular expression pattern
Substring - true if the text contains the pattern string (case-sensitive)
Turns out the '(' and ')' needed to be escaped using a '\'.
You’re using document instead of Text response in Fields to test.
Document is usually reserved to asserting on pdf, excel, word document. it has an important cost in terms of performance and is not needed when asserting on text (html, json...)
Also you should favor substring instead of contains.

how to test my Grammar antlr4 successfully? [duplicate]

I have been starting to use ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly if I change TITLE to be TITLE: 'x' ; it still fails this time giving an error message saying "mismatched input 'x' expecting 'x'" which is highly confusing. Even more oddly if I replace the usage of TITLE in test with FILEPATH the whole thing works (although FILEPATH will match more than I am looking to match so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language Processing in ANTLR:
The Language Processing is done in two strictly separated phases:
Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens
Since lexing must preceed parsing there is a consequence: The lexer is independent of the parser, the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as following:
all rules with uppercase first character are lexer rules
the lexer starts at the beginning and tries to find a rule that matches best to the current input
a best match is a match that has maximum length, i.e. the token that results from appending the next input character to the maximum length match is not matched by any lexer rule
tokens are generated from matches:
if one rule matches the maximum length match the corresponding token is pushed into the token stream
if multiple rules match the maximum length match the first defined token in the grammar is pushed to the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Each match, that is matched by TITLE will also be matched by FILEPATH. And FILEPATH is defined before TITLE: So each token that you expect to be a title would be a FILEPATH.
There are two hints for that:
keep your lexer rules disjunct (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser driven lexer you have to change to another parser generator: PEG-Parsers or GLR-Parsers will do that (but of course this can produce other problems).
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same Mismatched Input 'x' expecting 'x' vague error message when I introduced a new keyword. The reason for me was that I had placed the new key word after my VARNAME lexer rule, which assigned it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.

Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

I am looking for a way to check if a PDF is missing an end of file character. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check if the last character was an EOF. I need to process lots of potentially large PDF's and I want to load as little memory as possible.
Note: all the files I want to detect will be lacking the EOF marker, so I feel like this is a little more specific scenario then detecting general PDF "corruption". What is the best, fast way to do this?
TL;DR
Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.
Check Last Kilobyte for File Trailer
This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
Therefore, the following will work by looking for the file trailer instruction within that range:
def valid_file_trailer? filename
File.open filename { |f| f.seek -1024, :END; f.read.include? '%%EOF' }
end
A Stricter Check of the File Trailer via Regex
However, the ISO standard is both more complex and a lot more strict. It says, in part:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).
Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:
def valid_file_trailer? filename
pattern = /^startxref\n\d+\n%%EOF\n\z/m
File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end

Preserving whitespace / line breaks with REXML

I'm using Ruby 1.9.3 and REXML to parse an XML document, make a few changes (additions/subtractions), then re-output the file. Within this file is a block that looks like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2
some.namespace.something3=somevalue3
</someElement>
The problem is that after re-writing the file, this block always ends up looking like this:
<someElement>
some.namespace.something1=somevalue1
some.namespace.something2=somevalue2 some.namespace.something3=somevalue3
</someElement>
The newline after the second value (but never the first!) has been lost and turned into a space. Later, some other code which I have no control or influence over will be reading this file and depending on those newlines to properly parse the content. Generally in this situation i'd use a CDATA to preserve the whitespace, but this isn't an option as the code that parses this data later is not expecting one - it's essential that the inner text of this element is preserved exactly as-is.
My read/write code looks like this:
xmlFile = File.open(myFile)
contents = xmlFile.read
xmlDoc = REXML::Document.new(contents, { :respect_whitespace => :all })
xmlFile.close
{perform some tasks}
out = ""
xmlDoc.write(out, 2)
File.open(filePath, "w"){|file| file.puts(out)}
I'm looking for a way to preserve the whitespace of text between elements when reading/writing a file in this manner using REXML. I've read a number of other questions here on stackoverflow on this subject, but none that quite replicate this scenario. Any ideas or suggestions are welcome.
I get correct behavior by removing the indent (second) parameter to Document.write():
#xmlDoc.write(out, 2)
xmlDoc.write(out)
That seems like a bug in Document.write() according to my reading of the docs, but if you don't really need to set the indentation, then leaving that off should solve yor problem.

Parse /var/email/username file in Ruby

For some reason I need to fetch emails from /var/mail/username file. It seems like an append only file.
My question is, is it safe to parse the content of the /var/email/username file depending on the first line From username#host Mon Jun 20 16:50:15 2011? What if the similar pattern found inside the email body?
Furthermore, is there any opensource ruby script available for reference?
Yes, that seems like more or less the right way to parse the mbox format - from a quick scan of the RFC specification:
The structure of the separator lines
vary across implementations, but
usually contain the exact character
sequence of "From", followed by a
single Space character (0x20), an
email address of some kind, another
Space character, a timestamp sequence
of some kind, and an end-of- line
marker.
And...
Many implementations are also known
to escape message body lines that
begin with the character sequence of
"From ", so as to prevent confusion
with overly-liberal parsers that do
not search for full separator
lines. In the common case, a leading
Greater-Than symbol (0x3E) is used
for this purpose (with "From "
becoming ">From "). However, other
implementations are known not to
escape such lines unless they are
immediately preceded by a blank line
or if they also appear to contain
an email address and a timestamp.
Other implementations are also
known to perform secondary escapes
against these lines if they are
already escaped or quoted, while
others ignore these mechanisms
altogether.
Update:
There's also this: https://github.com/meh/ruby-mbox

Resources