How to save chunk of information between two words to a file? - ruby

I have the following file:
old_file
new_file
Some string.
end
Text in the middle that is not supposed to go to any of files.
new_file
Another text.
end
How can I use a regex to create two files with the following content?
file1
new_file
Some string.
end
file2
new_file
Another text.
end
How can I get information which is between keywords 'new_file' and 'end' to write it to the file?

If your files are not that large, you can read them in as a string (use File.read(file_name)) and then run the following regex, handling each match in turn:
file_contents.scan(/^new_file$.*?^end$/m).each { |block| WRITE_TO_FILE_CODE_HERE }
The ^new_file$.*?^end$ regex matches new_file as a whole line, then zero or more characters, as few as possible (including newlines, since the /m modifier is used), and then end as a whole line.
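A minimal sketch of the scan approach, assuming the input is small enough to hold in memory. The method names extract_blocks and write_blocks, the out_dir parameter, and the output names file1, file2, ... are illustrative choices, not part of the answer above.

```ruby
# Hypothetical helper: returns every "new_file ... end" block found in text.
def extract_blocks(text)
  # In Ruby, ^ and $ match at line boundaries; /m lets . match newlines.
  text.scan(/^new_file$.*?^end$/m)
end

# Hypothetical helper: writes each block to out_dir/file1, out_dir/file2, ...
# and returns the list of paths that were written.
def write_blocks(text, out_dir)
  extract_blocks(text).each_with_index.map do |block, i|
    path = File.join(out_dir, "file#{i + 1}")
    File.write(path, block + "\n")
    path
  end
end
```

It could be called as write_blocks(File.read("old_file"), "."), where both arguments are placeholders for your own paths.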
Otherwise, you can read the file line by line and track whether you are inside a block:
printing = false
File.foreach(my_file) do |line|
  printing = true if line =~ /^new_file$/
  puts line if printing
  printing = false if line =~ /^end$/
end
Open the output file when the starting line is found, write to it where puts line appears in the example above, and close it when printing is set back to false.
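The open/write/close steps just described might look like the sketch below. The method name split_blocks, the out_dir parameter, and the file1, file2, ... naming are assumptions made for illustration.

```ruby
# Hypothetical helper: copies each "new_file ... end" block from `input`
# (anything that responds to each_line) into its own numbered file.
def split_blocks(input, out_dir)
  out = nil
  paths = []
  input.each_line do |line|
    # Open a new output file when a block starts.
    if out.nil? && line =~ /^new_file$/
      paths << File.join(out_dir, "file#{paths.size + 1}")
      out = File.open(paths.last, "w")
    end
    # Write the line only while a block is open.
    out&.write(line)
    # Close the output file when the block ends.
    if out && line =~ /^end$/
      out.close
      out = nil
    end
  end
  paths
ensure
  out&.close # only matters if the last block has no closing "end"
end
```

For a real file, something like File.open("old_file") { |f| split_blocks(f, ".") } should work, since File objects respond to each_line.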

You can also read the file chunk by chunk by changing what constitutes a "line" in Ruby:
File.open("file1.txt", "w") do |file1|
  File.open("file2.txt", "w") do |file2|
    enum = IO.foreach("old_file.txt", "\n\n")
    file1.puts enum.next.strip
    enum.next # discard
    file2.puts enum.next.strip
  end # automatically closes file2
end # automatically closes file1
By designating the separator as "\n\n", Ruby will read all the characters up to and including two consecutive newlines and return that as a "line". Note that this only works if the chunks in old_file.txt are separated by blank lines.
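To see what the "\n\n" separator does without touching the disk, here is a small sketch using String#each_line, which accepts the same separator argument. The method name chunks is made up, and the sketch assumes the blocks are separated by blank lines.

```ruby
# Hypothetical helper: splits text into chunks separated by a blank line.
def chunks(text)
  # Each "line" now runs up to and including two consecutive newlines;
  # strip removes that trailing separator from each chunk.
  text.each_line("\n\n").map(&:strip)
end
```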

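If the format is fixed and each block has exactly one line between new_file and end, a simpler pattern such as (new_file\n.*\nend) will do. Without the /m modifier, . does not match a newline, so this pattern will not match blocks with several lines of content.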

Related

Removing lines between tags in a text file

I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
while read line; do
if line contains a; do
while read line; do
'if line does not contain b'
'append the line to output.txt'; fi
done
done
fi
done
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
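import re
with open('in.txt') as f, open('out.txt', 'w') as output:
    # re.DOTALL lets . match newlines, so a passage can span several lines
    passages = re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), re.DOTALL)
    output.write('\n'.join(passages))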
This reads the input, searches for all matches that begin and end with the necessary markers, captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file.
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then make it executable (chmod +x yank.awk) and run it like so:
./yank.awk in.txt > output.txt
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are in or out of blocks. Then just discard the ones that are not blocks:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby

in_block = False

def is_in_block(line):
    global in_block
    if line.startswith("END OF TEXT"):
        in_block = False
    ret = in_block
    if line.startswith("START OF TEXT"):
        in_block = True
    return ret

for lines_are_text, lines in groupby(sample_lines, key=is_in_block):
    if lines_are_text:
        print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
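For readers working in Ruby, the same idea can be sketched with an explicit flag instead of the flip-flop operator. The method name passage_lines is hypothetical, and the sketch assumes the markers always start at the beginning of a line.

```ruby
# Hypothetical helper: returns the lines strictly between START/END markers.
def passage_lines(lines)
  inside = false
  lines.each_with_object([]) do |line, kept|
    if line.start_with?("START OF TEXT")
      inside = true   # marker itself is dropped
    elsif line.start_with?("END OF TEXT")
      inside = false  # marker itself is dropped
    elsif inside
      kept << line
    end
  end
end
```

It takes any enumerable of lines, so File.foreach(path) or an array of strings would both work as input.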
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed the length limit of your shell.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing as a default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ { # For lines between these two matches
/^\(START\|END\) OF TEXT/!p # If the line does NOT match, print it
}
This works with GNU sed and might require some tweaking to run with other seds.

How do I add a line after another line in a file, in Ruby?

Say I have a file and it has these lines in it.
one
two
three
five
How do I add a line that says "four" after the line that says "three" so my file now looks like this?
one
two
three
four
five
Assuming you want to do this with the FileEdit class.
Chef::Util::FileEdit.new('/path/to/file').insert_line_after_match(/three/, 'four')
Here is an example ruby_block that inserts two new lines after a match:
ruby_block "insert_lines" do
  block do
    file = Chef::Util::FileEdit.new("/path/of/file")
    file.insert_line_after_match("three", "four")
    file.insert_line_after_match("four", "five")
    file.write_file
  end
end
insert_line_after_match searches for the regex or string and inserts the given value after each match.
The following Ruby script should do what you want quite nicely:
# insert_line.rb
# run with command "ruby insert_line.rb myinputfile.txt", where you
# replace "myinputfile.txt" with the actual name of your input file
$-i = ".orig"
ARGF.each do |line|
  puts line
  puts "four" if line =~ /^three$/
end
The $-i = ".orig" line makes the script appear to edit the named input file in-place and make a backup copy with ".orig" appended to the name. In reality it reads from the specified file and writes output to a temp file, and on success renames both the original input file (to have the specified suffix) and the temp file (to have the original name).
This particular implementation writes "four" after finding the "three" line, but it would be trivial to alter the pattern being matched, make it count-based, or have it write before some identified line rather than after.
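As a rough illustration of the variations mentioned above, the insertion logic can be pulled into a small function that works on an array of lines. The name insert_line and its before: keyword are invented for this sketch.

```ruby
# Hypothetical helper: returns a copy of `lines` with `new_line` inserted
# after (or, with before: true, before) every line matching `pattern`.
def insert_line(lines, pattern, new_line, before: false)
  lines.flat_map do |line|
    next [line] unless line =~ pattern
    before ? [new_line, line] : [line, new_line]
  end
end
```

For example, insert_line(%w[one two three five], /^three$/, "four") inserts after the match, while passing /^five$/ with before: true inserts ahead of it.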
This is an in memory solution. It looks for complete lines rather than doing a string regex search...
def add_after_line_in_memory path, findline, newline
  lines = File.readlines(path)
  if i = lines.index(findline.to_s+$/)
    lines.insert(i+1, newline.to_s+$/)
    File.open(path, 'wb') { |file| file.write(lines.join) }
  end
end
add_after_line_in_memory 'onetwothreefive.txt', 'three', 'four'
An AWK Solution
While you could do this in Ruby, it's actually trivial to do this in AWK. For example:
# Use the line number to choose the insertion point.
$ awk 'NR == 4 {print "four"}; {print}' lines
one
two
three
four
five
# Use a regex to prepend your string to the matched line.
$ awk '/five/ {print "four"}; {print}' lines
one
two
three
four
five

How do I join two lines of a file by matching pattern, in Ruby or Bash?

I'm using a Ruby script to do a lot of manipulation and cleaning to get this, and a bunch of other files, ready for import.
I have a really large file with some data that I'm trying to import into a database. There are some data issues with newline characters being in the data where they should not be, messing with the import.
I was able to solve this problem with sed using this:
sed -i '.original' -e ':a' -e 'N' -e '$!ba' -e 's/Oversight Bd\n/Oversight Bd/g' -e 's/Sciences\n/Sciences/g' combined_old_individual.txt
However, I can't call that command from inside a Ruby script, because Ruby messes up interpreting the newline characters and won't run that command. sed needs the non-escaped newline character but when calling a system command from Ruby it needs a string, where the newline character needs to be escaped.
I also tried doing this using Ruby's file method, but it's not working either:
File.open("combined_old_individual.txt", "r") do |f|
  File.open("combined_old_individual_new.txt","w") do |new_file|
    to_combine = nil
    f.each_line do |line|
      if(/Oversight Bd$/ =~ line || /Sciences$/ =~ line)
        to_combine = line
      else
        if to_combine.nil?
          new_file.puts line
        else
          combined_line = to_combine + line
          new_file.puts combined_line
          to_combine = nil
        end
      end
    end
  end
end
Any ideas how I can join lines where the first line ends with "Bd" or "Sciences", from within a Ruby script, would be very helpful.
Here's an example of what might go in a testfile.txt:
random line
Oversight Bd
should be on the same line as the above, but isn't
last line
and the result should be
random line
Oversight Bdshould be on the same line as the above, but isn't
last line
With ruby (My first attempt at a ruby answer):
File.open("combined_old_individual.txt", "r") do |f|
  File.open("combined_old_individual_new.txt","w") do |new_file|
    f.each_line do |line|
      if(/(Oversight Bd|Sciences)$/ =~ line)
        new_file.print line.strip
      else
        new_file.puts line
      end
    end
  end
end
You have to realize that sed normally works line by line, so you cannot match for \n in your initial pattern. You can however match for the pattern on the first line and then pull in the next line with the N command and then run the substitute command on the buffer to remove the newline like so:
sed -i -e '/Oversight Bd/ {;N;s/\n//;}' /your/file
Run from Ruby (without -i so that the output goes to stdout):
> cat test_text
aaa
bbb
ccc
aaa
bbb
ccc
> cat test.rb
cmd="sed -e '/aaa/ {;N;s/\\n//;}' test_text"
system(cmd)
> ruby test.rb
aaabbb
ccc
aaabbb
ccc
Since you are asking in bash, here is a pure-bash solution:
$ r="(Oversight Bd|Sciences)$"
$ while read -r; do printf "%s" "$REPLY"; [[ $REPLY =~ $r ]] || echo; done < combined_old_individual.txt
random line
Oversight Bdshould be on the same line as the above, but isn't
last line
$
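Staying inside Ruby avoids the shell escaping problem altogether. The two methods below are only a sketch: their names are made up, the first assumes the file fits in memory, and the second streams line by line for files that do not.

```ruby
# Hypothetical helper: removes the newline after any line ending in
# "Oversight Bd" or "Sciences", joining it with the following line.
def join_broken_lines(text)
  text.gsub(/(Oversight Bd|Sciences)\r?\n/, '\1')
end

# Hypothetical streaming variant: reads from `input` and writes to `output`
# (both IO-like objects) without loading the whole file.
def join_broken_lines_stream(input, output)
  input.each_line do |line|
    if line =~ /(Oversight Bd|Sciences)\r?\n\z/
      output.print line.chomp # drop the line ending so the next line follows
    else
      output.print line
    end
  end
end
```

With real files, the streaming variant could be wrapped in two nested File.open calls, one for reading and one for writing, as in the question.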

How do I create line breaks in Ruby?

How would I put line breaks in between lines like this:
print "Hi"
print "Hi"
Because it would just output this:
HiHi
Use puts since it will automatically add a newline for you:
puts "Hi"
puts "Hi"
If you want to write an explicit newline character, you'll need to know which kind of system(s) your program will run on:
print "Hi\n" # For UNIX-like systems including Mac OS X.
print "Hi\r\n" # For Windows.
Use line break character:
print "Hi\n"
print "Hi"
puts "\n" also works on Windows (tested with Ruby 2.4.2p198),
and you can even write "\n"*4 to repeat the newline four times.
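The difference between print and puts in the answers above can be checked without a terminal by writing to a StringIO. The captured helper and the constant names are invented for this sketch.

```ruby
require "stringio"

# Hypothetical helper: returns everything the block writes to the given IO.
def captured
  io = StringIO.new
  yield io
  io.string
end

# print writes exactly what it is given, so nothing separates the two words.
PRINTED = captured { |io| io.print "Hi"; io.print "Hi" }

# puts appends a newline after each argument that lacks one.
PUT = captured { |io| io.puts "Hi"; io.puts "Hi" }

# puts does not add a second newline if the string already ends with one.
PUT_WITH_NEWLINE = captured { |io| io.puts "Hi\n" }
```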
You can create a space by adding a string with only a space in it between the 2 other strings. For example:
print "Hi" + " " + "Hi"
You could avoid the two print statements and instead only use one line.
print "Hi\r\nHi"
Or if you want to use two lines then
print "Hi\r\n"
print "Hi"

Reading from stdin and printing to stdout in Ruby

This question is kinda simple (don't be so harsh with me), but I can't get a code-beautiful solution. I have the following code:
ARGF.each_line do |line|
arguments = line.split(',')
arguments.each do |task|
puts "#{task} result"
end
end
It simply reads numbers from standard input. I use it this way:
echo "1,2,3" | ruby prog.rb
The output desired is
1 result
2 result
3 result
But the actual output is
1 result
2 result
3
result
It seems like a newline character is being introduced. Am I missing something?
Each line ends in a newline character, so splitting on commas in your example means that the last token is 3\n. Printing this prints 3 and then a newline.
Try using
arguments = line.chomp.split(',')
This removes the trailing newline before splitting.
Your stdin input includes a trailing newline character. Try calling line.chomp! as the first instruction in your each_line block.
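A tiny sketch of the effect, using a string in place of real standard input. The method name tasks is made up for the example.

```ruby
# Hypothetical helper: splits one input line into its comma-separated tasks.
def tasks(line)
  # chomp drops the trailing newline so the last token stays clean.
  line.chomp.split(",")
end
```

Inside the original loop this would read tasks(line).each { |task| puts "#{task} result" }.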
