I need Ruby to parse and edit parts of files with a custom syntax - ruby

I have a set of .txt files written in a custom language, and I want to modify them systematically using a Ruby script. The syntax of that language is as follows:
(I will use [some text] as meta variables for expressions, like [atom 1] to indicate an arbitrary atom, and [atom 2] to indicate an arbitrary atom different from the former)
Atoms: alphanumerical strings, possibly surrounded by double quotes. Examples:
same_realm
"Ok"
Statements: either
[atom_1] = [atom_2]
or
[atom_1] = { [atom or statement 1] ... [atom or statement n] }
comments: in any line of the text, any character after a # is ignored. Example:
[atom_1] = [atom_2] #This is a comment and will be ignored
If a statement is of the form [atom 1] = {[atom or statement 1] ... [atom or statement n]}, we call [atom 1] the head of the statement and [atom or statement 1] ... [atom or statement n] the body of the statement.
Before and after =, { and } there can be an arbitrary number (possibly 0) of space characters.
Between two consecutive atoms there must be at least one space character, but there can be any number more than that.
So, the two expressions any_realm_lord = {...} and any_province_lord = {...} in the example below are valid, the only syntactical difference between them being the use of any_realm_lord/any_province_lord as the head of each statement.
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
} any_province_lord={any_character = { #some comment
limit = {#some other comment
same_realm = ROOT} set_character_flag =
my_flag
}
}#more text
Once that's explained, this is what I want to do with ruby (I will use the Example file above to illustrate it)
1) Open a file and locate the statements which are not within the body of other statements
(in the example, I'd want it to locate the any_realm_lord = {...} and the any_province_lord = {...} statements)
2) Iterate over the statements located in 1) and select the ones whose head matches a certain string. If the match is in a line where there are also other atoms or statements, separate them. From now on, I will refer to the statement whose head matches the string as "the target statement".
(Say the string to match is "any_province_lord". After this step the file will look like this:
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
any_province_lord={any_character = { #some comment
limit = {#some other comment
same_realm = ROOT} set_character_flag =
my_flag
}
}#more text
)
3) Create a blank line above the line where the head of the target statement is, and cut and paste there any comments in the lines that the target statement encompasses
(
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
#some comment#some other comment
any_province_lord={any_character = {
limit = {
same_realm = ROOT} set_character_flag =
my_flag
}
}#more text
)
4) If the closing bracket of the target statement is on the same line as other atom(s) or statement(s), add a \n after the closing bracket
(
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
#some comment#some other comment
any_province_lord={any_character = {
limit = {
same_realm = ROOT} set_character_flag =
my_flag
}
}
#more text
)
5) Erase the body of the target statement (but not the brackets), and add new content that I have already defined between the brackets, with nice spacing.
(
#Example file
#previous text
any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
}
#some comment#some other comment
any_province_lord={
#my predefined content will be here
}
#more text
)
What would then be the best way to do this in terms of efficiency? I need my program to do this to over a thousand files (each with an average size of 500 KB). I'm fairly new to Ruby, so I am still figuring out whether it is best to use read, readlines, or readline for this in terms of efficiency.
What do you think?
I hope I've been clear in the explanation of what I need and not too unnecessarily verbose.
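Step 1 and step 5 above can be sketched with a small tokenizer that tracks brace depth. This is only a rough sketch under stated assumptions, not a full solution: the helper names (tokenize, top_level_blocks, replace_body) are made up for illustration, atoms are assumed to match \w+ with optional double quotes, the input is assumed to be well formed, and the line splitting and comment hoisting of steps 2 to 4 are not covered.

```ruby
require 'strscan'

# One token is a quoted or bare atom, or one of the punctuation marks = { }.
TOKEN = /"\w*"|\w+|[={}]/

# Splits +text+ into [token, offset] pairs, skipping whitespace and
# everything from a # to the end of its line.
def tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/) || scanner.skip(/#[^\n]*/)
    offset = scanner.charpos
    token = scanner.scan(TOKEN)
    raise ArgumentError, "unexpected character at offset #{offset}" unless token
    tokens << [token, offset]
  end
  tokens
end

# Returns one hash per statement that is not nested inside another one:
# :head is the head atom, :open / :close are the offsets of its braces.
# Assumes balanced braces and that every "{" is preceded by "head =".
def top_level_blocks(text)
  depth = 0
  blocks = []
  tokens = tokenize(text)
  tokens.each_with_index do |(token, offset), i|
    if token == '{'
      blocks << { head: tokens[i - 2][0], open: offset } if depth.zero?
      depth += 1
    elsif token == '}'
      depth -= 1
      blocks.last[:close] = offset if depth.zero?
    end
  end
  blocks
end

# Replaces the body of every top-level statement whose head is +head+.
def replace_body(text, head, new_body)
  result = text.dup
  # Work backwards so earlier offsets stay valid after each replacement.
  top_level_blocks(text).reverse_each do |block|
    next unless block[:head] == head
    result[(block[:open] + 1)...block[:close]] = "\n#{new_body}\n"
  end
  result
end

example = <<~'TEXT'
  any_realm_lord={any_character={limit={same_realm=ROOT}set_character_flag=my_flag}
  } any_province_lord={any_character = { #some comment
  limit = {#some other comment
  same_realm = ROOT} set_character_flag =
  my_flag
  }
  }#more text
TEXT

heads = top_level_blocks(example).map { |block| block[:head] }
p heads
puts replace_body(example, 'any_province_lord', '  #my predefined content will be here')
```

Because statements span several lines, this sketch works on the whole file content as one string (for example from File.read). At roughly 500 KB per file that is likely to be acceptable, though it is worth measuring on real data.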

Related

Parsing comment in Rascal

I have a very basic question about parsing a fragment that contains comment.
First we import my favorite language, Pico:
import lang::pico::\syntax::Main;
Then we execute the following:
parse(#Id,"a");
gives, as expected:
Id: (Id) `a`
However,
parse(#Id,"a\n%% some comment\n");
gives a parse error.
What do I do wrong here?
There are multiple problems.
Id is a lexical, meaning layout (comments) are never there
Layout is only inserted between elements in a production and the Id lexical has only a character class, so no place to insert layout.
Even if Id was a syntax non terminal with multiple elements, it would parse comments between them not before or after.
For more on the difference between syntax, lexical, and layout see: Rascal Syntax Definitions.
If you want to parse comments around a non terminal, we have the start modifier for the non terminal. Normally, layout is only inserted between elements in the production; with start it is also inserted before and after it.
For example, take this grammar:
layout L = [\t\ ]* !>> [\t\ ];
lexical AB = "A" "B"+;
syntax CD = "C" "D"+;
start syntax EF = "E" "F"+;
this will be transformed into this grammar:
AB = "A" "B"+;
CD' = "C" L "D"+;
EF' = L "E" L "F"+ L;
"B"+ = "B"+ "B" | "B";
"D"+ = "D"+ L "D" | "D";
"F"+ = "F"+ L "F" | "F";
So, in particular if you'd want to parse a string with layout around it, you could write this:
lexical Id = [a-z]+;
start syntax P = Id i;
layout L = [\ \n\t]*;
parse(#start[P], "\naap\n").top // parses and returns the P node
parse(#start[P], "\naap\n").top.i // parses and returns the Id node
parse(#P, "\naap"); // parse error at 0 because start wrapper is not around P

Parse file, find a string and store next values

I need to parse a file according to different rules.
The file contains several lines.
I go through the file line by line. When I find a specific string, I have to store the data present in the next lines until a specific character is found.
Example of file:
start {
/* add comment */
first_step {
sub_first_step {
};
sub_second_step {
code = 50,
post = xxx (aaaaaa,
bbbbbb,
cccccc,
eeeeee),
number = yyyy (fffffff,
gggggg,
jjjjjjj,
ppppppp),
};
So, in this case:
File.open(file_to_convert, "r").each_line do |line|
In "line" I have my current line. I need to:
1) find when the line contains the string "xxx"
if line.include?("xxx") then
Correct?
2) store the next values (e.g.: aaaa, bbbb, ccccc,eeee) in an array until I find the character ")". This highlights that the section is finished.
I think when I reach the line with the string "xxx" I have to iterate over the next lines inside the "if" block.
Try this:
file_contents = File.read(file_to_convert)
lines = file_contents[/xxx \(([^)]+)\)/, 1].split
# => ["aaaaaa,", "bbbbbb,", "cccccc,", "eeeeee"]
The regex (xxx \(([^)]+)\)) takes all the text after xxx ( until the next ), and split splits it into its items.
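As the output above shows, a plain split leaves the trailing commas attached. A small variation, sketched here on an inline sample string rather than a real file, splits on the commas instead and falls back to an empty array when "xxx (" is not found:

```ruby
sample = <<~TEXT
  code = 50,
  post = xxx (aaaaaa,
  bbbbbb,
  cccccc,
  eeeeee),
  number = yyyy (fffffff,
TEXT

# Capture everything between "xxx (" and the next ")".
captured = sample[/xxx \(([^)]+)\)/, 1]
# Split on commas plus any surrounding whitespace or newlines.
items = captured ? captured.split(/\s*,\s*/) : []
p items # prints ["aaaaaa", "bbbbbb", "cccccc", "eeeeee"]
```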
I think this is what you are looking for:
looking = true
results = []
File.open(file_to_convert, "r").each_line do |line|
  if looking
    if line.include?("xxx")
      looking = false
      # The first item sits on the same line as "xxx (", before the comma
      results << line[/\(([^,]*)/, 1]
    end
  elsif line.include?(")")
    # Last item: drop the closing parenthesis and the comma, then stop
    results << line.strip.delete('),')
    break
  else
    results << line.strip.delete(',')
  end
end
puts results

What is a regular expression for finding lines with uncommented Java code?

I'm working on a simple Ruby program that should count the lines of text in a Java file that contain actual Java code. The line gets counted even if it has comments in it, so basically only lines that are just comments won't get counted.
I was thinking of using a regular expression to approach this problem. My program will just iterate line by line and compare it to a "regexp", like:
while line = file.gets
if line =~ regex
count+=1
end
end
I'm not sure what regexp format to use for that, though. Any ideas?
Getting the count for "Lines of code" can be a little subjective. Should auto-generated stuff like imports and package name really count? A person usually didn't write it. Does a line with just a closing curly brace count? There's not really any executing logic on that line.
I typically use this regex for counting Java lines of code:
^(?![ \s]*\r?\n|import|package|[ \s]*}\r?\n|[ \s]*//|[ \s]*/\*|[ \s]*\*).*\r?\n
This will omit:
Blank lines
Imports
Lines with the package name
Lines with just a }
Lines with single line comments //
Opening multi-line comments ((whitespace)/* whatever)
Continuation of multi-line comments ((whitespace)* whatever)
It will also match against either \n or \r\n newlines (since your source code could contain either depending on your OS).
While not perfect, it seems to come pretty close to matching against all, what I would consider, "legitimate" lines of code.
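Since the question is about Ruby, here is a rough sketch of how that regex might be applied line by line. The constant and method names are made up for illustration, and the slashes and the closing brace are escaped so the pattern fits in a Ruby regex literal. Note that the pattern expects every line to end in a newline, so a final line without one would not be counted.

```ruby
# The regex from above, escaped for a Ruby /.../ literal.
CODE_LINE = /^(?![ \s]*\r?\n|import|package|[ \s]*\}\r?\n|[ \s]*\/\/|[ \s]*\/\*|[ \s]*\*).*\r?\n/

# Counts the lines of +source+ that the regex treats as code.
def count_code_lines(source)
  source.each_line.count { |line| CODE_LINE.match?(line) }
end

java_source = <<~'JAVA'
  package demo;
  import java.util.List;

  // a comment
  /* block
   * comment */
  class Demo {
      int x = 1; // trailing comment
  }
JAVA

puts count_code_lines(java_source) # prints 2 (the class line and the int line)
```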
count = 0
file.each_line do |ln|
# Manage multiline and single line comments.
# Exclude single line if and only if there isn't code on that line
next if ln =~ %r{^\s*(//|/\*[^*]*\*/$|$)} or (ln =~ %r{/\*} .. ln =~ %r{\*/})
count += 1
end
The only problem is with lines that contain part of a multiline comment as well as code, for example:
someCall(); /* Start comment
this a comment
even this
*/ thisShouldBeCounted();
However:
imCounted(); // Comment
meToo(); /* comment */
/* comment */ yesImCounted();
// i'm not
/* Nor
we
are
*/
EDIT
The following version is a bit more cumbersome but correctly counts all cases.
count = 0
comment_start = false
file.each_line do |ln|
# Manage multiline and single line comments.
# Exclude single line if and only if there isn't code on that line
next if ln =~ %r{^\s*(//|/\*[^*]*\*/$|$)} or (ln =~ %r{^\s*/\*} .. ln =~ %r{\*/}) or (comment_start and not ln.include? '*/')
count += 1 unless comment_start and ln =~ %r{\*/\s*$}
comment_start = ln.include? '/*'
end

How to detect the difference between ' as used in an abbreviation and as quotation markers

I'm attempting to parse blocks of text and need a way to detect the difference between apostrophes in different contexts. Possession and abbreviation in one group, quotations in the other.
e.g.
"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]
but
"He said 'hello there' " -> ["He","said"," 'hello there' "]
Detecting whitespace on either side won't help as things like " 'ello " and " cars' " would parse as one end of a quotation, same with matching pairs of apostrophes. I'm getting the feeling that there's no way of doing it other than an outrageously complicated NLP solution and I'm just going to have to ignore any apostrophes not occurring mid-word, which would be unfortunate.
EDIT:
Since writing I have realised this is impossible. Any regex-ish based parser would have to parse:
'ello there my mates' dogs
in 2 different ways, and could only do that with understanding of the rest of the sentence. Guess I'm for the inelegant solution of ignoring the least likely case and hoping it's rare enough to only cause infrequent anomalies.
Hm, I'm afraid this won't be easy. Here's a regex that kinda works, alas only for stuff like "I'm" and "I've":
>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"
If you play around with it a bit more, you may be able to eliminate some other common contractions, which might still be better than nothing.
Some rules to think about:
Quotes will start with an apostrophe with a whitespace character or nothing before it.
Quotes will end with an apostrophe with punctuation or a whitespace character after it.
Some words may look like the end of quotes, e.g., peoples'.
Quote delimiting apostrophes will never have letters directly before and after them.
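These rules can be approximated with a single Ruby regex. This is only a sketch of the heuristic, with made-up names (QUOTE, quoted), and it also treats the end of the string as a valid end of a quote, which the rules above do not state explicitly:

```ruby
# Opening apostrophe: at the start of the text or after whitespace.
# Closing apostrophe: followed by whitespace, punctuation, or the end of the text.
QUOTE = /(?<!\S)'(.+?)'(?=\s|[[:punct:]]|\z)/

# Returns the text of every span the heuristic treats as a quotation.
def quoted(text)
  text.scan(QUOTE).flatten
end

p quoted("He said 'hello there' ")   # prints ["hello there"]
p quoted("I'm the cars' owner")      # prints []
```

As the question's edit points out, a heuristic like this still misreads input such as 'ello there my mates' dogs, where it reports "ello there my mates" as a quotation.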
Use a very simple two-phase process.
In pass 1 of 2, start with this regular expression to break the text down into alternating segments of word and non-word characters.
/(\w+)|(\W+)/gi
Store the matches in a list like this (I'm using AS3-style pseudo-code, since I don't work with ruby):
class MatchedWord
{
var text:String;
var charIndex:int;
var isWord:Boolean;
var isContraction:Boolean = false;
function MatchedWord( text:String, charIndex:int, isWord:Boolean )
{
this.text = text; this.charIndex = charIndex; this.isWord = isWord;
}
}
var match:Object;
var matched_word:MatchedWord;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi
words_regex.lastIndex = 0; //this is where to start looking for matches, and is updated to the end of the last match each time exec is called
while ((match = words_regex.exec( original_text )) != null)
matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) ); //match[0] is the entire match and match[1] is the first parenthetical group (if it's null, then it's not a word and match[2] would be non-null)
In pass 2 of 2, iterate over the list of matches to find contractions by checking to see if each (trimmed, non-word) match ENDS with an apostrophe. If it does, then check the next adjacent (word) match to see if it matches one of only 8 common contraction endings. Despite all the two-part contractions I could think of, there are only 8 common endings.
d
l
ll
m
re
s
t
ve
Once you've identified such a pair of matches (non-word)="'" and (word)="d", then you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.
Understanding the process just described, one modification you must make is expand that list of contraction endings to include contractions that start with apostrophe, such as "'twas" and "'tis". For those, you simply don't concatenate the preceding adjacent (word) match, and you look at the apostrophe match a little more closely to see if it included other non-word character before it (that's why it's important it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, then merge it with the next match, and if it only ENDS with an apostrophe, then strip off the apostrophe and merge it with the following match. Likewise, conditions that will include the prior match should first check to ensure the (trimmed non-word) match ending with an apostrophe EQUALS an apostrophe, so there are no extra non-word characters included accidentally.
Another modification you may need to make is expand that list of 8 endings to include endings that are whole words such as "g'day" and "g'night". Again, it's a simple modification involving a conditional check of the preceding (word) match. If it's "g", then you include it.
That process should capture the majority of contractions, and is flexible enough to include new ones you can think of.
The data structure would look like this.
Condition(Ending, PreCondition)
where PreCondition is
"*", "!", or "<exact string>"
The final list of conditions would look like this:
new Condition("d","*") //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
If you just process those conditions as I explained, that should cover all of these 86 contractions (and more):
'tis 'twas ain't aren't can't could've couldn't didn't doesn't don't
everybody's g'day g'night hadn't hasn't haven't he'd he'll he's how'd
how'll how's I'd I'll I'm I've isn't it'd it'll it's let's li'l
might've mightn't mustn't needn't nobody's nothing's shan't she'd
she'll she's should've shouldn't that'd that'll that's there's they'd
they'll they're they've wasn't we'd we'll we're we've weren't what'll
what're what'd what's what've when'd when'll when's where'd where'll
where's who's who'll who're who'd who'll who's who've why'd why'll
why's won't would've wouldn't you'd you'll you're you've
On a side note, don't forget about slang contractions that don't use apostrophes such as "gotta" > "got to" and "gonna" > "going to".
Here is the final AS3 code. Overall, you're looking at less than 50 lines of code to parse the text into alternating word and non-word groups, and identify and merge contractions. Simple. You could even add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below when a contraction is identified.
//Automatically merge known contractions
var conditions:Array = [
["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
["l","*"],
["ll","*"],
["m","*"],
["re","*"],
["s","*"],
["t","*"],
["ve","*"],
["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
["tis","!"],
["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
["night","g"]
];
for (i = 0; i < matched_words.length - 1; i++) //not a typo, intentionally stopping at next to last index to avoid a condition check in the loop
{
var m:MatchedWord = matched_words[i];
var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
{
var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
for each (var condition:Array in conditions)
{
if (StringUtils.trim( m_next.text ) == condition[0])
{
var pre_condition:String = condition[1];
switch (pre_condition)
{
case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
{
m_prev.text += m.text + m_next.text;
m_prev.isContraction = true;
matched_words.splice( i, 2 );
}
break;
case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
if (apostrophe_text == "'")
{
m.text += m_next.text;
m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
m.isContraction = true;
matched_words.splice( i + 1, 1 );
}
else
{ //strip apostrophe off end and merge with next item, nothing needs deleted
//preserve spaces and match start indexes by manipulating untrimmed strings
var apostrophe_end:int = m.text.lastIndexOf( "'" );
var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
m_next.text = apostrophe_ending + m_next.text;
m_next.charIndex = m.charIndex + apostrophe_end;
m_next.isContraction = true;
}
break;
default: //conditional success, check prior match meets condition
if (m_prev != null && m_prev.text == pre_condition)
{
m_prev.text += m.text + m_next.text;
m_prev.isContraction = true;
matched_words.splice( i, 2 );
}
break;
}
}
}
}
}
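Since the question was about Ruby, the same two-pass idea might be sketched as follows. This is a simplified port rather than a line-for-line translation: it works on plain strings instead of MatchedWord objects, does not track character indexes, and only merges with the preceding word when the non-word token is exactly an apostrophe. The names CONDITIONS and merge_contractions are made up for illustration.

```ruby
# ending => precondition: "*" merges with the preceding word, "!" merges only
# the apostrophe and the ending, any other string must equal the preceding word.
CONDITIONS = {
  'd' => '*', 'l' => '*', 'll' => '*', 'm' => '*', 're' => '*',
  's' => '*', 't' => '*', 've' => '*',
  'twas' => '!', 'tis' => '!',
  'day' => 'g', 'night' => 'g'
}.freeze

def merge_contractions(text)
  tokens = text.scan(/\w+|\W+/) # pass 1: alternating word and non-word runs
  merged = []
  i = 0
  while i < tokens.length # pass 2: merge tokens around apostrophes
    tok = tokens[i]
    nxt = tokens[i + 1]
    rule = (tok.end_with?("'") && nxt) ? CONDITIONS[nxt.downcase] : nil
    prev_is_word = merged.last&.match?(/\w\z/)

    if rule == '!'
      # Keep any non-word characters before the apostrophe as their own token.
      merged << tok[0...-1] unless tok == "'"
      merged << "'#{nxt}"
      i += 2
    elsif rule && tok == "'" && prev_is_word &&
          (rule == '*' || merged.last.downcase == rule)
      merged[-1] += "'#{nxt}"
      i += 2
    else
      merged << tok
      i += 1
    end
  end
  merged
end

words = merge_contractions("I'm sure 'twas g'day").grep(/\w/)
p words # prints ["I'm", "sure", "'twas", "g'day"]
```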

Parse a particular number of lines

I'm trying to read through a file, find a certain pattern, and then grab a set number of lines of text after the line that contains that pattern. I'm not really sure how to approach this.
If you want the n number of lines after the line matching pattern in the file filename:
lines = File.open(filename) do |file|
line = file.readline until line =~ /pattern/ || file.eof;
file.eof ? nil : (1..n).map { file.eof ? nil : file.readline }.compact
end
This should handle all cases, like the pattern not being present in the file (returns nil) or there being fewer than n lines after the matching line (the resulting array then contains the last lines of the file).
First parse the file into lines: open, read, and split on the line break.
lines = File.open(file_name).read.split("\n")
Then get the index of the first matching line:
index = lines.index { |x| x.match(/regex_pattern/) }
Where regex_pattern is the pattern that you are looking for. Use the index as a starting point and then the second argument is the number of lines (in this case 5)
lines[index, 5]
It will return an array of 'lines'
You could combine it a bit more to reduce the number of lines, but I was attempting to keep it readable.
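One possible way to combine the steps is a lazy enumerator chain, sketched below. The method name lines_after is made up for illustration. Unlike lines[index, 5], this version skips the matching line itself and returns an empty array when the pattern is not found.

```ruby
require 'tempfile'

# Returns up to +count+ lines that follow the first line matching +pattern+.
def lines_after(path, pattern, count)
  File.foreach(path)
      .lazy
      .drop_while { |line| line !~ pattern } # skip everything before the match
      .drop(1)                               # skip the matching line itself
      .first(count)
end

# Small demonstration on a temporary file.
demo_result = Tempfile.create('demo') do |f|
  f.write("a\ntrivet\nb\nc\nd\n")
  f.flush
  lines_after(f.path, /trivet/, 2)
end
p demo_result # prints ["b\n", "c\n"]
```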
If you're not tied to Ruby, grep -A 12 trivet will show the 12 lines after any line with trivet in it. Any regex will work in place of "trivet"
matched = false
num = 0
res = ""
File.foreach(filename) do |line|
  if matched
    # Each line already ends with its own newline
    res += line
    num += 1
    break if num == num_lines_desired
  elsif line.match(/regex/)
    matched = true
  end
end
This has the advantage of not needing to read the whole file in the event of a match.
When done, res will hold the desired lines.
In Rails (the only difference is how I generate the file object):
file = File.open(File.join(Rails.root, 'lib', 'file.json'))
# Convert the file into an array of strings, one per line
line_ary = file.readlines
line_count = line_ary.count
i = 0
# or however far up the document you want to be... you can get very fancy with this or just do it manually
hsh = {}
# Each iteration consumes two lines (child, then parent), so loop half as many times
(line_count / 2).times do
  child_id = JSON.parse(line_ary[i])
  i += 1
  parent_ary = JSON.parse(line_ary[i])
  i += 1
  hsh[child_id] = parent_ary
end
haha I've said too much that should definitely get you started
