Ruby variable scoping is killing me

I have a parser that reads files. Inside a file, you can declare a filename and the parser will go and read that one, then when it is done, pick up right where it left off and continue. This can happen as many levels deep as you want.
Sounds pretty easy so far. All I want to do is print out the file names and line numbers.
I have a class called FileReader that looks like this:
class FileReader
  attr_accessor :filename, :lineNumber
  def initialize(filename)
    @filename = filename
    @lineNumber = 0
  end
  def readFile()
    # pseudocode to make this easy
    open @filename
    while (lines)
      @lineNumber = @lineNumber + 1
      if (line startsWith 'File:')
        FileReader.new(line).readFile()
      end
      puts 'read ' + @filename + ' at ' + @lineNumber.to_s()
    end
    puts 'EOF'
  end
end
Simple enough. So let's say I have a file that refers to other files like this: File1 -> File2 -> File3. This is what the output looks like:
read File1 at 1
read File1 at 2
read File1 at 3
read File2 at 1
read File2 at 2
read File2 at 3
read File2 at 4
read File3 at 1
read File3 at 2
read File3 at 3
read File3 at 4
read File3 at 5
EOF
read File3 at 5
read File3 at 6
read File3 at 7
read File3 at 8
EOF
read File2 at 4
read File2 at 5
read File2 at 6
read File2 at 7
read File2 at 8
read File2 at 9
read File2 at 10
read File2 at 11
And that doesn't make any sense to me.
File 1 has 11 lines
File 2 has 8 lines
File 3 has 4 lines
I would assume creating a new object would have its own scope that doesn't affect a parent object.

class FileReader
  def initialize(filename)
    @filename = filename
  end
  def read_file
    File.readlines(@filename).map.with_index(1) { |l, i|
      next "read #{@filename} at #{i}" unless l.start_with?('File:')
      FileReader.new(l.gsub('File:', '').chomp).read_file
    }.join("\n") << "\nEOF"
  end
end
puts FileReader.new('File1').read_file
or just
def File.read_recursively(filename)
  readlines(filename).map.with_index(1) { |l, i|
    next "read #{filename} at #{i}" unless l.start_with?('File:')
    read_recursively(l.gsub('File:', '').chomp)
  }.join("\n") << "\nEOF"
end
puts File.read_recursively('File1')

I agree that something in how you rewrote the code for this post has obscured the problem. Yes, those instance variables should be local to the instance.
Watch out for places where a block of code or a conditional may be returning a value that gets assigned to the instance variable... for example, if your open statement uses the next block and somehow returns the filename: @filename = open(line) {}
I say this because the filename obviously didn't change back after the EOF.
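A minimal sketch of the distinction (hypothetical names, not your actual code): each instance keeps its own @name, so the parent can only end up stuck with the child's value if something inside the loop writes to the parent's variable again:
class Demo
  def initialize(name)
    @name = name
  end
  def run(child = nil)
    child.run if child    # the child works with its own @name
    # @name = some_value_from_a_block   # only a reassignment like this could clobber ours
    puts @name            # still prints this instance's original name
  end
end
Demo.new('parent').run(Demo.new('child'))   # prints "child", then "parent"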

This is what I came up with. It's not pretty but I tried to stay as close to your code as possible while Ruby-fying it too.
file_reader.rb
#!/usr/bin/env ruby
class FileReader
  attr_accessor :filename, :lineNumber
  def initialize(filename)
    @filename = filename
    @lineNumber = 0
  end
  def read_file
    File.open(@filename, 'r') do |file|
      while (line = file.gets)
        line.strip!
        @lineNumber += 1
        if line.match(/^File/)
          FileReader.new(line).read_file()
        end
        puts "read #{@filename} at #{@lineNumber} : Line = #{line}"
      end
    end
    puts 'EOF'
  end
end
fr = FileReader.new("File1")
fr.read_file
With File1 looking like this (File2 and File3 are built the same way, with File2 referring to File3 on its line 5):
Line 1
Line 2
Line 3
File2
Line 5
Line 6
Line 7
Line 8
Line 9
Line 10
Line 11
Output:
read File1 at 1 : Line = Line 1
read File1 at 2 : Line = Line 2
read File1 at 3 : Line = Line 3
read File2 at 1 : Line = Line 1
read File2 at 2 : Line = Line 2
read File2 at 3 : Line = Line 3
read File2 at 4 : Line = Line 4
read File3 at 1 : Line = Line 1
read File3 at 2 : Line = Line 2
read File3 at 3 : Line = Line 3
read File3 at 4 : Line = Line 4
EOF
read File2 at 5 : Line = File3
read File2 at 6 : Line = Line 6
read File2 at 7 : Line = Line 7
read File2 at 8 : Line = Line 8
EOF
read File1 at 4 : Line = File2
read File1 at 5 : Line = Line 5
read File1 at 6 : Line = Line 6
read File1 at 7 : Line = Line 7
read File1 at 8 : Line = Line 8
read File1 at 9 : Line = Line 9
read File1 at 10 : Line = Line 10
read File1 at 11 : Line = Line 11
EOF
To reiterate, we really have to see your actual code to know where the problem is.
I understand why you think it has something to do with variable scoping, so the question makes sense to me.
For the other people: please be a little kinder to the novices trying to learn. This is supposed to be a place for helping. Thank you. </soapbox>

Related

How to match a list of numbers to another file with a list of numbers?

I have a file (incoming.txt) with groups of numbers:
Line 1
6728
2882
18181
Line 2
282828
4778
2876
9393
Line 3
73920
2489
Line 4
53689
8292
93838
which I want to match against a list of line numbers (Lines.txt):
4
7
35
52
98
148
406
I have tried to read both files and compare them, but it fails with this error:
no such file or directory - 4
inc = 'incoming.txt'
grid = 'Lines.txt'
File.readlines(inc).each do |a|
  File.readlines(grid).each do |line|
    if grid == a
      puts line
    end
  end
end
Expected result:
Line 4
53689
8292
93838
Line 7
6272
4441
98754
Line 35
156
4785
9867
14286
986
With the help of a regular expression, that is possible:
lines = File.read('Lines.txt').split
input = File.read('incoming.txt')
lines.map do |l|
  # select everything starting with "Line NN"
  # up to the empty line _or_ EOF
  input[/Line #{l}.*?(?=\n\s*\n|\z)/m]
end.compact.join($/)
Also, there is a flip-flop solution :)
File.readlines('incoming.txt').select do |l|
  true if lines.include?(l[/(?<=Line )\d+/])..(l =~ /^\s*$/)
end.join($/)
More about flip-flop.
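In case the flip-flop operator is unfamiliar, here is a minimal illustration on made-up data (not the files above): the range condition switches on when the first test matches and stays on until the second test matches, so the block selects whole runs of lines:
items = ['a', 'Start', 'b', 'c', 'End', 'd']
picked = items.select { |l| true if (l =~ /Start/)..(l =~ /End/) }
p picked   # => ["Start", "b", "c", "End"]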

How do I delete all lines in a concatenated text file that match the header WITHOUT deleting the header? [bash] [duplicate]

This question already has answers here:
Is there way to delete duplicate header in a file in Unix?
(2 answers)
How to delete the first column ( which is in fact row names) from a data file in linux?
(5 answers)
Closed 4 years ago.
My apologies if this question already exists out there. I have a concatenated text file that looks like this:
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 1 764484 783034 1:764484:783034:clu_2500_NA 0.66666024153854 -0.194766358934969
2 1 764484 787307 1:764484:787307:clu_2500_NA -0.602342191830433 0.24773430748199
3 1 880180 880422 1:880180:880422:clu_2501_NA -0.211378452591182 2.02508282380949
4 1 880180 880437 1:880180:880437:clu_2501_NA 0.231916912049866 -2.20305649485074
5 1 889462 891303 1:889462:891303:clu_2502_NA -2.3215482460681 0.849095194607155
6 1 889903 891303 1:889903:891303:clu_2502_NA 2.13353943689806 -0.920181808417383
7 1 899547 899729 1:899547:899729:clu_2503_NA 0.990822909478346 0.758143648905368
8 1 899560 899729 1:899560:899729:clu_2503_NA -0.938514081703866 -0.543217522714283
9 1 986217 986412 1:986217:986412:clu_2504_NA -0.851041440248378 0.682551011244202
The first line, #Chr start end ID GTEX-Q2AG GTEX-NPJ8, is the header, and because I concatenated several similar files, it occurs multiple times throughout the file. I would like to delete every instance of the header occurring in the text without deleting the first header.
BONUS: I actually need help with this too and would like to avoid posting another Stack Overflow question. The first column of my data was generated by R and represents row numbers. I want them all gone without deleting #Chr. There are too many columns and it's a problem.
This problem is different from the ones recommended to me because of the additional issue above, and also because you don't necessarily have to use regex to solve it.
The following AWK script removes all lines that are exactly the same as the first one.
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile > outputfile
It will print the first line because the initial value of header is an empty string. Then it will store the first line in header, because header is still empty at that point.
After this it will print only lines that are not equal to the first one already stored in header. The second if will always be false once the header has been saved.
Note: If the file starts with empty lines these empty lines will be removed.
To remove the first number column you can use
sed 's/^[0-9][0-9]*[ \t]*//' inputfile > outputfile
You can combine both commands into a pipe:
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile | sed 's/^[0-9][0-9]*[ \t]*//' > outputfile
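Both steps could also be done in a single awk pass. This is only a sketch along the same lines, assuming the repeated header is always identical to the first line and the row-number column is purely numeric:
awk 'NR == 1 { header = $0; print; next }    # remember and keep the first header
     $0 == header { next }                   # drop repeated headers
     { sub(/^[0-9]+[ \t]+/, ""); print }     # strip the leading row-number column
' inputfile > outputfile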
Maybe this is helpful:
delete all headers
delete the first column
add the first header back
cat foo.txt
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 1 764484 783034 1:764484:783034:clu
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
2 1 764484 783034 1:764484:783034:clu
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
3 1 764484 783034 1:764484:783034:clu
sed '/#Chr start end ID GTEX-Q2AG GTEX-NPJ8/d' foo.txt | awk '{$1 = ""; print $0 }' | sed '1i #Chr start end ID GTEX-Q2AG GTEX-NPJ8'
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 764484 783034 1:764484:783034:clu
1 764484 783034 1:764484:783034:clu
1 764484 783034 1:764484:783034:clu
Using sed
sed '2,${/HEADER/d}' input.txt > output.txt
Command explained:
From line 2 to the end of the file: 2,$
search for any line matching 'HEADER': /HEADER/
and delete it: d
I would do
awk 'NR == 1 {header = $0; print} $0 != header' file
which prints the first line, remembers it as the header, and then prints every later line that differs from it.

sed print between two line patterns only if both patterns are found

Suppose I have a file with:
Line 1
Line 2
Start Line 3
Line 4
Line 5
Line 6
End Line 7
Line 8
Line 9
Start Line 10
Line 11
End Line 12
Line 13
Start line 14
Line 15
I want to use sed to print between the patterns only if both /Start/ and /End/ are found.
sed -n '/Start/,/End/p' works as expected if you know both markers are there and in the order expected, but it just prints from Start to the end of the file if End is either out of order or not present. (i.e., prints line 14 and line 15 in the example)
I have tried:
sed -n '/Start/,/End/{H;}; /End/{x; p;}' file
Prints:
# blank line here...
Start Line 3
Line 4
Line 5
Line 6
End Line 7
End Line 7
Start Line 10
Line 11
End Line 12
which is close, but there are two issues:
Unwanted leading blank line
End Line 7 printed twice
I am hoping for a result similar to
$ awk '/Start/{x=1} x{buf=buf$0"\n"} /End/{print buf; buf=""; x=0}' file
Start Line 3
Line 4
Line 5
Line 6
End Line 7
Start Line 10
Line 11
End Line 12
(blank lines between the blocks not necessary...)
With GNU sed and sed from Solaris 11:
sed -n '/Start/{h;b;};H;/End/{g;p;}' file
Output:
Start Line 3
Line 4
Line 5
Line 6
End Line 7
Start Line 10
Line 11
End Line 12
If Start is found, copy the current pattern space to the hold space (h) and branch to the end of the script (b). For every other line, append the current pattern space to the hold space (H). If End is found, copy the hold space back to the pattern space (g) and then print the pattern space (p).
GNU sed: after encountering Start, keep appending lines as long as we don't see End; once we do, print the pattern space and start over:
$ sed -n '/Start/{:a;N;/End/!ba;p}' infile
Start Line 3
Line 4
Line 5
Line 6
End Line 7
Start Line 10
Line 11
End Line 12
Getting the newline between blocks is tricky. This would add one after each block, but results in an extra blank at the end:
$ sed -n '/Start/{:a;N;/End/!ba;s/$/\n/p}' infile
Start Line 3
Line 4
Line 5
Line 6
End Line 7
Start Line 10
Line 11
End Line 12
[blank]
You can use this awk:
awk 'x{buf=buf ORS $0} /Start/{x=1; buf=$0} /End/{print buf; buf=""; x=0}' file
Start Line 3
Line 4
Line 5
Line 6
End Line 7
Start Line 10
Line 11
End Line 12
Here is a sed version to do the same on OSX (BSD) sed (Based on Benjamin's sed command):
sed -n -e '/Start/{:a;' -e 'N;/End/!ba;' -e 'p;}' file
Personally, I prefer your awk solution, but:
sed -n -e '/Start/,/End/H' -e '/End/{s/.*//; x; p}' input

How to reduce live log data?

A program produces a log file which I am watching. Unfortunately, the log file sometimes contains the same line 50 times in a row.
Is there a way to get, instead of
program.sh
Line 1
Line 1
Line 1
Line 1
...
Line 1
Line 1
Line 2
just something like:
program.sh
Line 1
\= repeated 43 times
Line 2
You can use this awk:
awk 'function prnt() { print p; if (c > 1) print " \\= repeated " c " times" }
     p && p != $0 { prnt(); c = 0 }
     { p = $0; c++ }
     END { prnt() }' file
Line 1
\= repeated 43 times
Line 2
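Since the log is being watched live, the same filter can be attached to the stream. This is only a sketch (it assumes your awk has fflush(), which GNU awk and mawk provide), and note that a repeated line can only be reported once a different line arrives:
program.sh | awk '
  function prnt() { print p; if (c > 1) print " \\= repeated " c " times"; fflush() }
  p && p != $0 { prnt(); c = 0 }
  { p = $0; c++ }
  END { prnt() }'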

Print lines indexed by a second file

I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
  sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands and tens of thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as well as I could, but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. The lines are expected to appear in the order defined by the indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR { # while processing the first file
selected[$1] = 1 # remember if an index was seen
next # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR {              # processing the index:
  ++counter
  idx[$0] = counter      # remember that, and at which position, you saw
  next                   # the index
}
FNR in idx {             # when processing the data file:
  lines[idx[FNR]] = $0   # remember selected lines by the position of
}                        # the index
END {                    # and at the end: print them in that order.
  for (i = 1; i <= counter; ++i) {
    print lines[i]
  }
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
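For reference, the inlined one-liner form of foo.awk would look roughly like this (a sketch of the same script shown above):
awk 'NR == FNR { ++counter; idx[$0] = counter; next }
     FNR in idx { lines[idx[FNR]] = $0 }
     END { for (i = 1; i <= counter; ++i) print lines[i] }' indexfile datafile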
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
  ++counter
  idx[$0] = idx[$0] " " counter   # remember a list of positions here
  next
}
FNR in idx {
  split(idx[FNR], pos)            # split that list
  for (p in pos) {
    lines[pos[p]] = $0            # and remember the line for
  }                               # all positions in it
}
END {
  for (i = 1; i <= counter; ++i) {
    print lines[i]
  }
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python3
lines = set()   # a set makes the membership test fast for large index files
with open("$file2") as f:
    for line in f:
        lines.add(int(line))
i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            print(line, end="")
EOF
The only advantage here is that Python is much easier to understand than awk :).
