Retrieving variable length text from data files - antlr3

We need to extract variable-length text blocks from our data files and have been trying to write a grammar file for this. There are two main challenges. First, the text is of variable length. Second, there are no anchors that can help us locate the text blocks. Basically, we are trying to parse a mainframe-generated, text-based letter to extract information from specific sections of the file, and then use the key-value pairs from ANTLR to populate Adobe forms. In ANTLR 4.0 we found a way to merge all the words of a token, space-separated, into a single string. Since the parsers and lexers were rewritten in 4.0, we don't have the same capability in 3.5. We are seeking your input on how to solve this. The following are a few things we have tried.
We tried writing a semantic predicate. We are able to get the words, but we are unable to insert whitespace between them.
@init{ int N = 0; }
: ( { N <= 3 }? ) =>( WORD { N++; } )+
This is the third approach, which includes whitespace. Instead of skipping WS we can handle it explicitly in each rule, but that may use a lot of memory and could cause out-of-memory errors.
name: WORD WS WORD WS WORD;
WS : ( ' '|'\t'|'\r'|'\n' )+ ;

Related

How to use hcl write to set expressions with ${}?

I am trying to use hclwrite to generate .tf files.
According to the example in hclwrite Example, I can generate variables like foo = env.PATH, but I don't know how to generate more forms of expressions. For example, the following.
stage = "prod"
foo = "hello${var.stage}"
When I set foo with
SetAttributeValue("foo", cty.StringVal("hello${bar}"))
I get
foo = "hello$${bar}"
The hclwrite tool currently has no facility to automatically generate arbitrary expressions. Its helper functions are limited only to generating plain references and literal values. SetAttributeValue is the one for literal values, and so that's why the library correctly escaped the ${ sequence in your string, to ensure that it will be interpreted literally.
If you want to construct a more elaborate expression then you'll need to do so manually by assembling the tokens that form the expression and then calling SetAttributeRaw instead.
In the case of your example, it looks like you'd need to generate the following eight tokens:
TokenOQuote with the bytes "
TokenQuotedLit with the bytes hello
TokenTemplateInterp with the bytes ${
TokenIdent with the bytes var
TokenDot with the bytes .
TokenIdent with the bytes stage
TokenTemplateSeqEnd with the bytes }
TokenCQuote with the bytes "
The SetAttributeValue function is automating the simpler case of generating three tokens: TokenOQuote, TokenQuotedLit, TokenCQuote.
You can potentially automate the creation of tokens for the var.stage portion of this expression by using TokensForTraversal, which is what SetAttributeTraversal does internally. However, unless you already have a parsed hcl.Traversal representing var.stage, or unless you need things to be more dynamic than you've shown in practice, I expect that it would take more code to construct that input traversal than to just write out the three tokens literally as I showed above.
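As a rough illustration of the eight tokens listed above, here is a minimal sketch in plain Python. It is not the hclwrite API (which is a Go library); it only models each token as a (type name, bytes) pair and joins the bytes. The names `TOKENS`, `render` and `literal_string_tokens` are hypothetical, and the escaping covers only the `${` case shown in the question.

```python
# Plain-Python model of the token sequence; this is NOT the hclwrite API.
# Each pair is (token type name, bytes), copied from the list above.
TOKENS = [
    ("TokenOQuote", '"'),
    ("TokenQuotedLit", "hello"),
    ("TokenTemplateInterp", "${"),
    ("TokenIdent", "var"),
    ("TokenDot", "."),
    ("TokenIdent", "stage"),
    ("TokenTemplateSeqEnd", "}"),
    ("TokenCQuote", '"'),
]


def render(tokens):
    """Concatenate the bytes of each token in order."""
    return "".join(text for _type, text in tokens)


def literal_string_tokens(value):
    """Simplified model of the three-token literal case.

    Only the "${" escape from the question is handled here; other
    escapes (quotes, backslashes, ...) are deliberately left out.
    """
    escaped = value.replace("${", "$${")
    return [
        ("TokenOQuote", '"'),
        ("TokenQuotedLit", escaped),
        ("TokenCQuote", '"'),
    ]


print(render(TOKENS))
print(render(literal_string_tokens("hello${bar}")))
```

The first call produces the interpolated form, because the `${` is its own token. The second shows why a literal value comes out with `$${`: the whole string is a single quoted-literal token, so the sequence has to be escaped.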

Extract Tweet ID from text

I have a large, 4.5M+ row CSV (commas are the separators) containing tweets. The CSV comes from some time ago, and has all manner of line breaks inside column data, characters, etc. It is likely malformed in some ways but it is difficult for me to discern exactly where and how with a file of this size.
I want to move through this CSV file as a large body of text, pull out all the Tweet IDs, and put each pulled ID into a line in a new file.
Doing this via bash, Perl, or Python would work fine. Can anyone help here? I can't seem to find any information on the format of a Tweet ID, though the ones in this corpus all seem to be 17 digits long.
Since the only evidence for a Tweet ID in your question is that it's an integer of length 17, that is the only rule I am going to use.
Plus, I am going to use it as a hard-and-fast rule. Anything that is an integer of length 17 is a Tweet ID, nothing else.
After that it's a normal regular expression search.
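import re
# Sample data standing in for the contents of the CSV file
string = '''
12345678912345678, abcd, efgh
45645645645645645, ijkl, mnop
78944556677889900, qrst, uvwx
0, y, z
'''
# Find every run of 17 consecutive digits
m = re.findall(r'[0-9]{17}', string)
print(m)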
re.findall searches the string (second argument) for every match of the regular expression (first argument) and returns the matches as a list.
(a):- [0-9] means any single digit from 0 to 9
(b):- {m} means the expression that precedes it must repeat exactly m times
(a)+(b):- [0-9]{17} matches a run of digits 0 to 9 repeated 17 times, i.e. a number of length 17
You can find out more in the documentation for Python's re module.
This is as much as I can help without knowing more about the input file and the tweet format.
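For a 4.5M-row file it may be preferable not to load everything into one string. The following is a minimal sketch that streams the file line by line and writes one ID per line to a new file. It keeps the assumption that a Tweet ID is exactly 17 digits, and adds lookarounds so that a 17-digit slice of a longer number is not matched. The function names and the UTF-8 encoding are assumptions for illustration.

```python
import re

# Assumption: a Tweet ID is a run of exactly 17 digits, as in the answer above.
# The lookarounds reject 17-digit slices taken from longer digit runs.
TWEET_ID = re.compile(r"(?<![0-9])[0-9]{17}(?![0-9])")


def iter_tweet_ids(lines):
    """Yield every 17-digit ID found in an iterable of text lines."""
    for line in lines:
        yield from TWEET_ID.findall(line)


def extract_tweet_ids(src_path, dst_path):
    """Write one ID per line to dst_path and return how many were written."""
    count = 0
    # errors="replace" keeps going if the old CSV contains invalid bytes.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for tweet_id in iter_tweet_ids(src):
            dst.write(tweet_id + "\n")
            count += 1
    return count
```

Because the file is treated as plain text rather than parsed as CSV, stray line breaks or malformed rows do not matter, as long as no ID is itself split across two lines.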

Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

I am looking for a way to check if a PDF is missing its end-of-file marker. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or of course I could simply open the whole file and check whether the last characters are an EOF marker. I need to process lots of potentially large PDFs and I want to load as little into memory as possible.
Note: all the files I want to detect will be lacking the EOF marker, so I feel like this is a more specific scenario than detecting general PDF "corruption". What is the best, fastest way to do this?
TL;DR
Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.
Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.
Check Last Kilobyte for File Trailer
This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
Therefore, the following will work by looking for the file trailer instruction within that range:
def valid_file_trailer?(filename)
  File.open(filename, 'rb') do |f|
    f.seek(-[f.size, 1024].min, IO::SEEK_END) # clamp so small files don't raise Errno::EINVAL
    f.read.include?('%%EOF')
  end
end
A Stricter Check of the File Trailer via Regex
However, the ISO standard is both more complex and a lot more strict. It says, in part:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).
Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:
def valid_file_trailer? filename
pattern = /^startxref\n\d+\n%%EOF\n\z/m
File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
end
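The regex version above reads the whole file. If the goal is only to confirm that the last line is %%EOF, the check can be limited to the final few bytes. Below is a minimal sketch of that idea, written in Python rather than Ruby purely for illustration. The function names are hypothetical, and it assumes that one optional trailing line break (\n, \r\n or \r) after the marker is acceptable, which is more lenient than the regex above.

```python
import os

EOF_MARKER = b"%%EOF"


def tail_ends_with_eof(tail):
    """Return True if the byte string ends with a line holding only %%EOF."""
    # Tolerate one optional end-of-line sequence after the marker.
    for eol in (b"\r\n", b"\n", b"\r"):
        if tail.endswith(eol):
            tail = tail[: -len(eol)]
            break
    if not tail.endswith(EOF_MARKER):
        return False
    before = tail[: -len(EOF_MARKER)]
    # The marker must start a line (or be the very start of the data).
    return before == b"" or before[-1:] in (b"\n", b"\r")


def pdf_ends_with_eof(path):
    """Check only the last 8 bytes of the file at path."""
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)
        # 8 bytes = preceding line break (1) + %%EOF (5) + up to \r\n (2).
        f.seek(max(size - 8, 0))
        return tail_ends_with_eof(f.read())
```

This reads at most 8 bytes per file regardless of its size, so memory use stays constant. It does not validate the startxref lines or the trailer dictionary.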

Suggested variable system and procedures to incorporate this LEET table into my Free Pascal program

I code using Free Pascal and Lazarus.
I want to incorporate the LEET Table seen here (http://en.wikipedia.org/wiki/Leet#Orthography) into a new program, but I'm unsure of the best way to do so. Should I use array structures (one for each letter of the alphabet) or 'Set Types' for each letter or records for each letter? Any suggestions of how to implement an idea would be appreciated.
The aim of the program is to open and read a text file line by line (I've got this done already) using an OpenDialog and it will then say "For each word, if it finds the letters 'E', 'O' or 'I', replace them with values from the table for the letter found"
e.g. if strLineFromFile contains letter 'E', replace it with 3, £, + &....and so on
repeat
...
Readln(SourceFile, strLineFromFile);
{ Look for letters E, I and O in strLineFromFile }
{ Look up LEET table - switch chars }
until EOF(SourceFile);
I'm open to suggestions on the best way to optimise this process. I'm not expecting pure code, but pointers as to which functions or procedures would be best and which variable system to use for optimum performance.
Note : I'm still learning so nothing too complex please!
Ted
Sets are not ordered, so they don't make sense here.
Use an array['a'..'z'] of array of string. The first array level is indexed by the letters in the input; the second level holds the various translations of the same input letter.
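The two-level structure described above can be sketched as follows. This is written in Python only to show the shape of the data; in Free Pascal the same idea maps onto an array['a'..'z'] of array of string. The 'E' entries come from the question, while the 'I' and 'O' entries are placeholder assumptions and not a copy of the Wikipedia table. The name `leetify` and its `variant` parameter are hypothetical.

```python
# First level: the input letter. Second level: its possible translations.
# The 'E' entries come from the question; the 'I' and 'O' entries are
# placeholder assumptions, not a copy of the Wikipedia table.
LEET = {
    "E": ["3", "£", "+", "&"],
    "I": ["1", "!", "|"],
    "O": ["0", "()"],
}


def leetify(line, variant=0):
    """Replace each letter that has a table entry with one of its translations.

    variant selects which translation to use; it wraps around when a
    letter has fewer translations than the requested index.
    """
    out = []
    for ch in line:
        options = LEET.get(ch.upper())
        if options:
            out.append(options[variant % len(options)])
        else:
            out.append(ch)  # letters without an entry are kept as-is
    return "".join(out)


print(leetify("Hello World"))
```

Looking up a letter is a single indexed access, so each line is processed in one pass over its characters.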

Bad numeric format

So basically I have a record that looks like this
modulis = record
kodas : string[4];
pavadinimas : string[30];
skaicius : integer;
kiti : array[1..50] of string;
end;
And I'm trying to read it from the text file like this :
ReadLn(f1,N);
for i := 1 to N do
begin
Read(f1,moduliai[i].kodas);
Read(f1,moduliai[i].pavadinimas);
Read(f1,moduliai[i].skaicius);
for j := 1 to moduliai[i].skaicius do
Read(f1,moduliai[i].kiti[j]);
ReadLn(f1);
end;
And the file looks like this :
9
IF01 Programavimo ivadas 0
IF02 Diskrecioji matematika 1 IF01
IF03 Duomenu strukturos 2 IF01 IF02
IF04 Skaitmenine logika 0
IF05 Matematine logika 1 IF04
IF06 Operaciju optimizavimas 1 IF05
IF07 Algoritmu analize 2 IF03 IF06
IF08 Asemblerio kalba 1 IF03
IF09 Operacines sistemos 2 IF07 IF08
And I'm getting 106 bad numeric format. Can't figure out how to fix this, I'm not sure, but I think it has something to do with the text file, however I copied the text file from the internet so it has to be good :|
Reading string data is different from reading numeric data in Pascal.
With numbers the Read instruction consumes data until it hits white space or the end of file. Now white space in this case can be the space character, the tab character, the EOL 'character'. So if there are 2 numbers on one line of text, you could read them one by one using two consecutive Reads.
I believe you already know that.
And I believe you thought it would work the same way with strings. But it won't: you cannot read two string values from one line of text simply by using two consecutive Read instructions. Read would consume all the text up to EOL or EOF. After reading, the string variable is assigned as many characters as it can hold, and the rest of the data is discarded. It is essentially equivalent to ReadLn in this respect.
Solution? Put each piece of data in the input file on a separate line, and preferably use ReadLn instead of all the Read calls. (But I think the latter might be unnecessary, and rearranging the input data might be enough.)
Alternatively you would need to read the whole line of text into a temporary string variable, then split it manually and assign the parts to the corresponding record fields, not forgetting also to convert the numeric values from string to integer.
You choose what suits you better.
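The second option (read the whole line, then split it manually) can be sketched like this. It is written in Python only to illustrate the splitting logic; the same steps apply in Pascal using ReadLn into a temporary string. The function name `parse_module` is hypothetical, and the sketch assumes that the module name never contains a purely numeric word, so the first all-digit token after the code is the count.

```python
def parse_module(line):
    """Split one input line into the fields of the modulis record.

    Assumption: the name (pavadinimas) contains no purely numeric word,
    so the first all-digit token after the code is the count (skaicius).
    """
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError("line too short: %r" % line)
    kodas = tokens[0]
    count_pos = next(
        (i for i in range(1, len(tokens)) if tokens[i].isdigit()), None
    )
    if count_pos is None:
        raise ValueError("no numeric field in line: %r" % line)
    pavadinimas = " ".join(tokens[1:count_pos])
    skaicius = int(tokens[count_pos])  # convert the numeric text to an integer
    kiti = tokens[count_pos + 1:count_pos + 1 + skaicius]
    if len(kiti) != skaicius:
        raise ValueError("expected %d codes in line: %r" % (skaicius, line))
    return {
        "kodas": kodas,
        "pavadinimas": pavadinimas,
        "skaicius": skaicius,
        "kiti": kiti,
    }


print(parse_module("IF03 Duomenu strukturos 2 IF01 IF02"))
```

Splitting on whitespace works here because the numeric field separates the free-text name from the list of codes that follows it.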
Because you have declared pavadinimas as string[30], it reads 30 characters regardless of the actual length of the name. For example, in the following line pavadinimas will be
" Skaitmenine logika 0" instead of just "Skaitmenine logika"
IF04 Skaitmenine logika 0
I'm not a Pascal programmer, but it looks like the fields within your text file are not fixed length. How would you expect your program to delimit each field during read back?
