"one or more" with LL parser - algorithm

Let's say my grammar is:
file = line, {line}
line = ..., "\n"
If I want to build an LL parser for that grammar, how should I implement the "one or more lines" part?
I was thinking about changing the grammar to this:
file = line
line = ..., "\n", nl
nl = line
| <end of file>
My lines would then be nested. Is this the most elegant and efficient way to solve the problem?

Close. Typically just like this:
file = line, morelines
morelines = e | line, morelines
line = ..., "\n"
Here e is epsilon, the empty symbol.
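As a rough illustration, a hand-written recursive-descent parser usually turns the right-recursive morelines rule into a loop. The sketch below assumes the input is already a list of tokens in which "\n" ends each line, and it treats the elided "..." part of line as "any tokens that are not a newline"; the name parse_file is hypothetical.

```python
from typing import List


def parse_file(tokens: List[str]) -> List[List[str]]:
    """Parse `file = line, morelines` from a flat token list."""
    pos = 0
    lines: List[List[str]] = []

    def parse_line() -> List[str]:
        # line = ..., "\n": collect tokens up to the terminating newline
        nonlocal pos
        content = []
        while pos < len(tokens) and tokens[pos] != "\n":
            content.append(tokens[pos])
            pos += 1
        if pos == len(tokens):
            raise SyntaxError("line is missing its terminating newline")
        pos += 1  # consume the "\n" token
        return content

    # file = line, morelines: the first line is mandatory
    lines.append(parse_line())
    # morelines = e | line, morelines: one token of lookahead decides;
    # end of input selects the empty alternative, anything else a line
    while pos < len(tokens):
        lines.append(parse_line())
    return lines
```

Because the recursion in morelines is a tail call, the loop recognises the same language without nesting the lines inside each other.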


Using sed to add line above a set of lines

EDIT BELOW
I'm new to bash scripting, sorry if this has been answered elsewhere, couldn't find it in any searches I've done.
I'm using sed -i to add a line above an argument, for example.
for EFP in *.inp; do
sed -i "/^O */i FRAGNAME=H2ODFT" $EFP
done
and it works as expected. But I would like it to add the line only when the pattern holds across multiple lines, like so:
O
C
O
C
FRAGNAME=H2ODFT
O
H
H
FRAGNAME=H2ODFT
O
H
H
Notice there's no added line above the two O's that are followed by C's.
I tried the following:
for FILE in *.inp; do
sed -i "/^O*\nH*\nH */i FRAGNAME=H2ODFT" $EFP
done
I was expecting the line to show up above each group of three lines that went O - H - H, but nothing happened. sed passed through the file as if that pattern were nowhere in the document.
I've looked elsewhere and thought of using awk, but I can't wrap my head around it.
Any help would be greatly appreciated!
EDIT
Thanks for the help. And sorry for being a bit unclear. I've tried a ton of things, too many to put in this post. I've tried awk, perl and sed solutions, but they're not working.
My input has a series of O's C's and H's, with cartesian coordinates assigned to them:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
I'm trying to input a new line above a specific set of three lines, the OHH lines.
The awk solution posted didn't work, because it would add extra lines where there shouldn't be when the stage gets reset. I'm looking for the following output:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
The ^tsed was a typo and should've been an indent instead of ^t
Here is a ruby to do that:
ruby -e 'lines=$<.read.split(/\R/)
lines.each_with_index{|line,i|
three_line_tag=lines[i..i+2].map{|sl| sl.split[0] }.join
puts "FRAGNAME=H2ODFT" if three_line_tag == "OHH"
puts line
}
' file
Or any awk, same kind of method:
awk '{lines[NR]=$0}
END{
for(i=1;i<=NR;i++) {
tag=""
for(j=0;j<=2;j++) {
split(lines[i+j],arr)
tag=tag arr[1]
}
if (tag=="OHH")
print "FRAGNAME=H2ODFT"
print lines[i]
}
}
' file
Or Perl:
perl -0777 -pe 's/(^\h*O\h.*\R^\h*H\h.*\R^\h*H\h.*\R?)/FRAGNAME=H2ODFT\n$1/gm' file
Any of these prints:
C 36.116 34.950 34.657
C 35.638 34.681 35.883
C 36.134 33.569 36.703
C 34.379 34.567 37.522
N 34.579 35.375 36.476
N 35.234 33.518 37.706
O 37.045 32.745 36.559
H 36.892 34.226 34.415
O 35.234 38.803 30.513
H 34.303 39.079 30.567
C 33.490 35.015 38.608
H 34.002 35.390 39.503
H 32.894 34.170 38.974
H 32.832 35.813 38.245
C 35.342 32.708 38.920
H 35.920 33.237 39.688
H 35.942 31.802 38.772
H 34.356 32.475 39.340
FRAGNAME=H2ODFT
O 30.226 35.908 36.744
H 30.557 36.408 37.490
H 30.642 36.311 35.982
FRAGNAME=H2ODFT
O 37.356 40.420 29.232
H 36.473 40.786 29.286
H 37.220 39.474 29.189
FRAGNAME=H2ODFT
O 40.889 37.054 35.401
H 40.304 36.361 35.706
H 41.620 36.587 34.995
===
Edit in place:
Read THIS about awk and that is generally applicable.
Any of these scripts as written write to stdout.
You can redirect the output to a new file:
someutility input_file >new_file
Some utilities (like perl, ruby, GNU awk, GNU sed) can replace the file in place. If yours cannot, you cannot do:
someutil 'prints to STDOUT' file >file
since file will be destroyed before fully read.
Instead you would do:
someutil 'prints to STDOUT' file > tmp && mv tmp file
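The same write-to-a-temporary-file-then-rename pattern can be sketched in Python. This is only an illustration, not one of the tools mentioned above; rewrite_in_place and its transform parameter are hypothetical names.

```python
import os
import tempfile
from typing import Callable, Iterable


def rewrite_in_place(path: str,
                     transform: Callable[[Iterable[str]], Iterable[str]]) -> None:
    """Replace `path` with transform(lines), going through a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on the same filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            # The original is fully read before it is ever replaced
            tmp.writelines(transform(src))
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temporary file and keep the original
        os.unlink(tmp_path)
        raise
```

Note that the new file gets the temporary file's permissions, so copy the original mode over if that matters for your files.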
This might work for you (GNU sed):
sed -Ei -e ':a;N;s/\n/&/2;Ta;/^O(\n.)\1$/i FRAGNAME=H2ODFT' -e 'P;D' file1 file2
Open a 3 line window throughout the file and if the required pattern matches, insert the line of the desired text.
N.B. The \1 back reference matches the same text as the line before. The script is also split into two pieces because the i command must be terminated by a newline, which the separate -e option provides.
An alternative version of the same solution:
cat <<\! | sed -Ef - -i file{1..100}
:a
N
s/\n/&/2
Ta
/^O(\n.)\1$/i FRAGNAME=H2ODFT
P
D
!
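For readers who find the sed window hard to follow, the same three-line look-ahead can be sketched in Python. Like the Ruby and awk answers, this version keys on the first whitespace-separated field of each line, so it also works when coordinates follow the element symbol; tag_water is a hypothetical name.

```python
from typing import Iterable, List


def tag_water(lines: Iterable[str], tag: str = "FRAGNAME=H2ODFT") -> List[str]:
    """Insert `tag` above every O line that is followed by two H lines."""
    rows = list(lines)
    # Element symbol of each row: its first field, or "" for a blank row
    symbols = [row.split()[0] if row.split() else "" for row in rows]
    out: List[str] = []
    for i, row in enumerate(rows):
        # Look at the current row and the two rows after it
        if symbols[i:i + 3] == ["O", "H", "H"]:
            out.append(tag)
        out.append(row)
    return out
```

An O followed by a single H, or by a C, never completes the window, so no tag is emitted above it.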
If the input files aren't large enough to cause memory issues, you can slurp the entire file and then perform the substitution. For example:
perl -0777 -pe 's/^O\nH\nH\n/FRAGNAME=H2ODFT\n$&/gm' ip.txt
If this works for you, you can add the -i option for in-place editing. It isn't clear what the regex ^O*\nH*\nH * shown in the question is meant to match. ^O\nH\nH\n matches three consecutive lines containing exactly O, H and H. Adjust as needed.
I know you requested a sed solution, but I have a solution based on awk.
We initialize the awk program with a stage variable which, over time, tracks progress toward "OHH".
If we receive the next expected letter, we grow the stage until we reach OHH; then we print your required string and reset the stage.
If the sequence breaks, we print whatever we accumulated in stage and reset it.
awk '
BEGIN { stage="" }
/^O$/ { if (stage=="") { stage="O\n"; next } }
/^H$/ { if (stage=="O\n") { stage="O\nH\n"; next } }
/^H$/ { if (stage=="O\nH\n") { print "FRAGNAME=H2ODFT\nO\nH\nH"; stage=""; next } }
{ print stage $1; stage="" }
' < sample.txt
Where sample.txt contains:
O
C
O
C
O
H
H
O
H
H

Sort two text files with its indented text aligned to it

I would like to compare two of my log files, generated before and after an implementation, to see whether it has impacted anything. However, the order of the log entries is not always the same. The log file also has multiple indented lines, and when I tried to sort it, every line was sorted independently. I would like to keep each child together with its parent. The indentation is made of spaces, not tabs.
Any help would be greatly appreciated. I am fine with either a Windows or a Linux solution.
Eg of the file:
#This is a sample code
Parent1 to be verified
 Child1 to be verified
 Child2 to be verified
  Child21 to be verified
  Child23 to be verified
  Child22 to be verified
   Child221 to be verified
 Child4 to be verified
 Child5 to be verified
  Child53 to be verified
  Child52 to be verified
   Child522 to be verified
   Child521 to be verified
 Child3 to be verified
I am posting another answer here to sort it hierarchically, using python.
The idea is to attach the parents to the children to make sure that the children under the same parent are sorted together.
See the python script below:
"""Attach parent to children in an indentation-structured text"""
from typing import Tuple, List
import sys
# A unique separator to separate the parent and child in each line
SEPARATOR = '#'
# The indentation
INDENT = ' '
def parse_line(line: str) -> Tuple[int, str]:
    """Parse a line into indentation level and its content
    with indentation stripped
    Args:
        line (str): One of the lines from the input file, with newline ending
    Returns:
        Tuple[int, str]: The indentation level and the content with
            indentation stripped.
    Raises:
        ValueError: If the line is incorrectly indented.
    """
    # strip the leading white spaces
    lstripped_line = line.lstrip()
    # get the indentation
    indent = line[:-len(lstripped_line)]
    # Let's check if the indentation is correct
    # meaning it should be N * INDENT
    n = len(indent) // len(INDENT)
    if INDENT * n != indent:
        raise ValueError(f"Wrong indentation of line: {line}")
    return n, lstripped_line.rstrip('\r\n')
def format_text(txtfile: str) -> List[str]:
    """Format the text file by attaching the parent to its children
    Args:
        txtfile (str): The text file
    Returns:
        List[str]: A list of formatted lines
    """
    formatted = []
    par_indent = par_line = None
    with open(txtfile) as ftxt:
        for line in ftxt:
            # get the indentation level and line without indentation
            indent, line_noindent = parse_line(line)
            # level 1 parents
            if indent == 0:
                par_indent = indent
                par_line = line_noindent
                formatted.append(line_noindent)
            # children
            elif indent > par_indent:
                formatted.append(par_line +
                                 SEPARATOR * (indent - par_indent) +
                                 line_noindent)
                par_indent = indent
                par_line = par_line + SEPARATOR + line_noindent
            # siblings or dedentation
            else:
                # We just need first `indent` parts of parent line as our prefix
                prefix = SEPARATOR.join(par_line.split(SEPARATOR)[:indent])
                formatted.append(prefix + SEPARATOR + line_noindent)
                par_indent = indent
                par_line = prefix + SEPARATOR + line_noindent
    return formatted
def sort_and_revert(lines: List[str]):
    """Sort the formatted lines and revert the leading parents
    into indentations
    Args:
        lines (List[str]): list of formatted lines
    Prints:
        The sorted and reverted lines
    """
    sorted_lines = sorted(lines)
    for line in sorted_lines:
        if SEPARATOR not in line:
            print(line)
        else:
            leading, _, orig_line = line.rpartition(SEPARATOR)
            print(INDENT * (leading.count(SEPARATOR) + 1) + orig_line)
def main():
    """Main entry"""
    if len(sys.argv) < 2:
        print(f"Usage: {sys.argv[0]} <file>")
        sys.exit(1)
    formatted = format_text(sys.argv[1])
    sort_and_revert(formatted)
if __name__ == "__main__":
    main()
Let's save it as format.py, and we have a test file, say test.txt:
parent2
 child2-1
  child2-1-1
 child2-2
parent1
 child1-2
  child1-2-2
  child1-2-1
 child1-1
Let's test it:
$ python format.py test.txt
parent1
 child1-1
 child1-2
  child1-2-1
  child1-2-2
parent2
 child2-1
  child2-1-1
 child2-2
If you wonder how the format_text function formats the text, here are the intermediate results, which also explain why the file ends up sorted the way we want:
parent2
parent2#child2-1
parent2#child2-1#child2-1-1
parent2#child2-2
parent1
parent1#child1-2
parent1#child1-2#child1-2-2
parent1#child1-2#child1-2-1
parent1#child1-1
You can see that each child has its parents attached, all the way up to the root, so the children under the same parent are sorted together.
Short answer (Linux solution):
sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
Test it out:
test.txt
parent2
 child2-1
  child2-1-1
 child2-2
parent1
 child1-1
 child1-2
  child1-2-1
$ sed ':a;N;$!ba;s/\n /#/g' test.txt | sort | sed ':a;N;$!ba;s/#/\n /g'
parent1
 child1-1
 child1-2
  child1-2-1
parent2
 child2-1
  child2-1-1
 child2-2
Explanation:
The idea is to replace each newline followed by an indentation space with a non-newline character that is unique in your file, because it has to be turned back into the newline and space later. Here I used # as an example; if # already occurs in your file, use another character or even a string.
About sed command:
:a create a label 'a'
N append the next line to the pattern space
$! if not the last line, ba branch (go to) label 'a'
s substitute, /\n / regex for newline followed by a space
/#/ a unique character to replace the newline and space
if it is not unique in your file, use other characters or even a string
/g global match (as many times as it can)
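The join, sort, split idea above can also be sketched in Python without needing a unique separator character: group each unindented line with the indented lines that follow it, then sort the groups. This is an illustrative sketch rather than a translation of the sed command, and sort_blocks is a hypothetical name. Like the one-liner, it only reorders top-level blocks and leaves the children in their original order.

```python
from typing import List


def sort_blocks(text: str) -> str:
    """Sort top-level entries while keeping indented lines with their parent."""
    blocks: List[List[str]] = []
    for line in text.splitlines():
        if line.startswith(" ") and blocks:
            blocks[-1].append(line)  # indented: belongs to the current parent
        else:
            blocks.append([line])    # unindented: starts a new block
    # Lists compare element by element, so the parent line decides first
    blocks.sort()
    return "\n".join("\n".join(block) for block in blocks)
```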

Parsing CSV file with \n in double quoted fields

I'm parsing a CSV file that has line breaks inside double-quoted fields. I'm reading the file line by line with a Groovy script, but I get an ArrayIndexOutOfBoundsException when I try to access the missing tokens.
I was trying to pre-process the file to remove those characters, either with a bash script or with Groovy itself.
Could you please suggest an approach I can use to resolve the problem?
This is how the CSV looks like:
header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"
This is the groovy script I'm using
def csv = new File(args[0]).text
def retString = ""
def parsedFile = new File("Parsed_" + args[0]);
csv.eachLine { line, lineNumber ->
def splittedLine = line.split(',');
retString += new Date(splittedLine[0]) + ",${splittedLine[1]},${splittedLine[2]},${splittedLine[3]}\n";
if(lineNumber % 1000 == 0){
parsedFile.append(retString);
retString = "";
}
}
parsedFile.append(retString);
UPDATE:
In the end I did this, and it works (I needed to format the first column from a timestamp to a human-readable date):
gawk -F',' '{print strftime("%Y-%m-%d %H:%M:%S", substr( $1, 0, length($1)-3 ) )","($2)","($3)","($4)}' TobeParsed.csv > Parsed.csv
Thank you @karakfa
If you use a proper CSV parser rather than trying to do it with split (which, as you can see, doesn't work with any form of quoting), then it works fine:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"'''
def data = parseCsv(csv)
data.eachWithIndex { line, index ->
println """Line $index:
| 1:$line.header1
| 2:$line.header2
| 3:$line.header3
| 4:$line.header4""".stripMargin()
}
Which prints:
Line 0:
1:timestamp
2:abcdefghi
3:abcdefghi
4:sdsd
Line 1:
1:timestamp
2:zxcvb
fffffgfg
3:asdasdasadsd
4:sdsdsd
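The same point holds outside Groovy. As a hedged illustration, Python's standard csv module also keeps a quoted field together when it spans two physical lines. The sample data is the CSV from the question; read_rows is a hypothetical helper, and skipinitialspace=True is needed here because some fields have a space between the comma and the opening quote.

```python
import csv
import io

SAMPLE = (
    'header1,header2,header3,header4\n'
    'timestamp, "abcdefghi", "abcdefghi","sdsd"\n'
    'timestamp, "zxcvb\nfffffgfg","asdasdasadsd","sdsdsd"\n'
)


def read_rows(text: str):
    """Parse CSV text into rows, keeping newlines inside quoted fields."""
    # newline="" leaves line endings untouched so the csv module sees them
    stream = io.StringIO(text, newline="")
    return list(csv.reader(stream, skipinitialspace=True))


for index, row in enumerate(read_rows(SAMPLE)):
    print(index, row)
```

The third physical line of the file becomes part of the second data row instead of a broken row of its own.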
awk to the rescue!
This will merge the fields that were split across a newline; your process can take it from there.
$ awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}1' splitted.csv
header1,header2,header3
xxxxxx, "abcdefghi", "abcdefghi"
yyyyyy, "zxcvb fffffgfg","asdasdasadsd"
It assumes that an odd number of quotes means a split field, and it replaces the newline with OFS. If you simply want to delete the newline (so the split parts are joined directly), remove OFS.
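The quote-counting heuristic can be sketched in Python as well. Unlike the awk one-liner, which reads exactly one continuation line, this hypothetical merge_split_records keeps appending lines until the quotes balance. It shares the same assumption that quotes inside fields are never escaped.

```python
from typing import Iterable, Iterator


def merge_split_records(lines: Iterable[str], joiner: str = " ") -> Iterator[str]:
    """Join physical lines until each record has an even number of quotes."""
    pending = ""
    for line in lines:
        line = line.rstrip("\r\n")
        # Start a new record, or glue this line onto the unfinished one
        pending = line if not pending else pending + joiner + line
        if pending.count('"') % 2 == 0:
            yield pending  # quotes are balanced: the record is complete
            pending = ""
    if pending:
        yield pending  # unbalanced tail: emit it unchanged
```

Passing joiner="" corresponds to removing OFS in the awk version.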

printf not printing line correctly bash unix

I am using printf command to log some values in a file as follows:
printf "Parameter = $parameter v9_value = $v9_val v9_line = $V9_Line_Count v16_val = $v16_val v16_line = $V16_Line_Count"
But the output I am getting as follows:
v16_line = 8elayServerPort v9_value = 41 v9_line = 8 v16_val = 4571
It seems the line wraps around, and the last values are printed over the start of the line.
Expected Output:
Parameter = RelayServerPort v9_value = 41 v9_line = 8 v16_val = 4571 v16_line = 8
But v16_line = 8 is written over Parameter = R at the start of the line.
printf doesn't add a newline at the end. You need to add \n to the end of your printf format string.
Without seeing the rest of your program, or where your variable values come from, it's hard to say what else could be the issue.
One thing you can do is redirect your output to a file and look at that file either in a good programmer's editor or with cat -v, which displays control characters visibly.
Check whether ^M appears in the output. If it does, one of your variables probably contains a carriage return (CR).
Also remove $v16_val from your printf (temporarily) and see whether the output looks better. If so, $v16_val might have a CR (^M) in it.
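To see why a stray CR would produce exactly this symptom, here is a small Python sketch. It assumes, as the answer suspects but the question does not confirm, that the value was read from a file with Windows line endings; caret_notation and strip_line_ending are hypothetical helpers, the first imitating what cat -v shows.

```python
def caret_notation(text: str) -> str:
    """Show control characters visibly, roughly the way `cat -v` does."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 32 and ch not in "\n\t":
            out.append("^" + chr(code + 64))  # CR (13) becomes ^M
        elif code == 127:
            out.append("^?")
        else:
            out.append(ch)
    return "".join(out)


def strip_line_ending(value: str) -> str:
    """Drop a trailing CR and/or LF left over from a CRLF file."""
    return value.rstrip("\r\n")


v16_val = "4571\r"  # assumed: value carrying a trailing carriage return
print(caret_notation(f"v16_val = {v16_val} v16_line = 8"))
print(f"v16_val = {strip_line_ending(v16_val)} v16_line = 8")
```

On a terminal the CR moves the cursor back to column 0, so whatever follows it is drawn over the beginning of the line.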

Is there a SnakeYaml DumperOptions setting to avoid double-spacing output?

I seem to see double-spaced output when parsing/dumping a simple YAML file with a pipe-text field.
The test is:
public void yamlTest()
{
DumperOptions printOptions = new DumperOptions();
printOptions.setLineBreak(DumperOptions.LineBreak.UNIX);
Yaml y = new Yaml(printOptions);
String input = "foo: |\n" +
" line 1\n" +
" line 2\n";
Object parsedObject = y.load(new StringReader(input));
String output = y.dump(parsedObject);
System.out.println(output);
}
and the output is:
{foo: 'line 1
line 2
'}
Note the extra space between line 1 and line 2, and after line 2 before the end of the string.
This test was run on Mac OS X 10.6, java version "1.6.0_29".
Thanks!
Mark
In the original string you use literal style, indicated by the '|' character. When you dump the text, it is written in single-quoted style, in which a single line break is folded into a space, so each '\n' in the value has to be written as an empty line. That is why the output looks double-spaced.
Try setting a different scalar style on your DumperOptions instance:
// other styles include FOLDED and DOUBLE_QUOTED
printOptions.setDefaultScalarStyle(DumperOptions.ScalarStyle.LITERAL);
