How do I join two lines of a file by matching pattern, in Ruby or Bash?

I have a really large file with some data that I'm trying to import into a database. Some of the data contains newline characters where there should be none, which breaks the import.
I'm using a Ruby script to do a lot of manipulation and cleaning to get this file, and a bunch of others, ready for import.
I was able to solve this problem with sed using this:
sed -i '.original' -e ':a' -e 'N' -e '$!ba' -e 's/Oversight Bd\n/Oversight Bd/g' -e 's/Sciences\n/Sciences/g' combined_old_individual.txt
However, I can't call that command from inside a Ruby script, because Ruby interprets the newline escapes before sed sees them and the command fails. sed needs the unescaped \n sequence, but a system command called from Ruby is passed as a string, where that sequence has to be escaped.
I also tried doing this with Ruby's File class, but that isn't working either:
File.open("combined_old_individual.txt", "r") do |f|
  File.open("combined_old_individual_new.txt","w") do |new_file|
    to_combine = nil
    f.each_line do |line|
      if(/Oversight Bd$/ =~ line || /Sciences$/ =~ line)
        to_combine = line
      else
        if to_combine.nil?
          new_file.puts line
        else
          combined_line = to_combine + line
          new_file.puts combined_line
          to_combine = nil
        end
      end
    end
  end
end
Any ideas how I can join lines where the first line ends with "Bd" or "Sciences", from within a Ruby script, would be very helpful.
Here's an example of what might go in a testfile.txt:
random line
Oversight Bd
should be on the same line as the above, but isn't
last line
and the result should be
random line
Oversight Bdshould be on the same line as the above, but isn't
last line

With Ruby (my first attempt at a Ruby answer):
File.open("combined_old_individual.txt", "r") do |f|
  File.open("combined_old_individual_new.txt","w") do |new_file|
    f.each_line do |line|
      if(/(Oversight Bd|Sciences)$/ =~ line)
        # print without the trailing newline so the next line is appended;
        # chomp (rather than strip) leaves leading whitespace in the data alone
        new_file.print line.chomp
      else
        new_file.puts line
      end
    end
  end
end
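The same idea can be packaged as a small method that works on any IO-like object, which makes it easy to try on the sample from the question. This is only a sketch: the method name join_broken_lines and the SUFFIXES constant are made up for illustration, and it uses chomp, which removes only the trailing newline.

```ruby
require "stringio"

# Hypothetical names, not from the original posts.
# Lines ending in one of these suffixes get joined with the line that follows.
SUFFIXES = /(Oversight Bd|Sciences)$/

def join_broken_lines(input, output)
  input.each_line do |line|
    if SUFFIXES =~ line
      output.print line.chomp   # hold the newline back so the next line is appended
    else
      output.print line         # line already carries its own newline
    end
  end
  output
end

sample = "random line\nOversight Bd\nshould be on the same line as the above, but isn't\nlast line\n"
puts join_broken_lines(StringIO.new(sample), StringIO.new).string
```

For real files, the two StringIO objects would be replaced by the File handles opened in the answer above.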

You have to realize that sed normally works line by line, so you cannot match for \n in your initial pattern. You can however match for the pattern on the first line and then pull in the next line with the N command and then run the substitute command on the buffer to remove the newline like so:
sed -i -e '/Oversight Bd/ {;N;s/\n//;}' /your/file
Run from Ruby (without -i so that the output goes to stdout):
> cat test_text
aaa
bbb
ccc
aaa
bbb
ccc
> cat test.rb
cmd="sed -e '/aaa/ {;N;s/\\n//;}' test_text"
system(cmd)
> ruby test.rb
aaabbb
ccc
aaabbb
ccc
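If calling sed is not essential, the join can also be done without leaving Ruby, which avoids the shell-quoting question altogether. A minimal sketch, assuming the file is small enough to read into memory; the method name strip_embedded_newlines is hypothetical:

```ruby
# Hypothetical helper: removes the newline that directly follows
# "Oversight Bd" or "Sciences", like the s/...\n/.../g commands in the question.
def strip_embedded_newlines(text)
  text.gsub(/(Oversight Bd|Sciences)\n/) { Regexp.last_match(1) }
end

text = "random line\nOversight Bd\nshould be on the same line\nlast line\n"
puts strip_embedded_newlines(text)
```

Applied to a file, this would look something like File.write(path, strip_embedded_newlines(File.read(path))), with path standing in for the real file name.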

Since you are asking in bash, here is a pure-bash solution:
$ r="(Oversight Bd|Sciences)$"
$ while read -r; do printf "%s" "$REPLY"; [[ $REPLY =~ $r ]] || echo; done < combined_old_individual.txt
random line
Oversight Bdshould be on the same line as the above, but isn't
last line
$

Related

Bash regex to match multiple blocks of indented content and print all of them

I'm trying to do some regex matching in bash.
I'd like to match multiple blocks of indented (space or tab) content, with each block starting with a keyword.
Some other content could be present in the file.
Using this sample content :
keyword aaa match1
Some other content
keyword ccc match2
    indentend content
    matching
Some other content
    with indendation
keyword ddd match2
    indented content still matching
I managed to come up with this regex, which matches everything as expected: (^keyword.*(?:\n^\h+.*)*)
https://regex101.com/r/kvMlKK/1
The expected output is every match, printed:
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
Unfortunately I did not find a way to print all matches in bash. I can use grep/sed/awk/perl without any problem (edit: I meant that I have access to all these commands in the environment I am working in).
Edit:
grep -E --include \*.md '(^keyword.*(?:\n^\h+.*)*)' $(dirname "$0")/../_inbox/draft.md
Using grep it does not return the full match, only first line because of the lack of multi-line matching support I guess.
I am not familiar with awk/sed, I did not get any meaningful results (even if it seems to be better to use them for multi-line matching).
Edit: if that could work on multiple files that would be awesome
Thanks for your help!
You can do it in pure bash by looping over the lines, because bash regex matching doesn't support multi-line matching.
#!/bin/bash
# Flag to track whether we are inside an indented block
indented=0
# [[:blank:]] matches a space or a tab; "\t" inside a bash string is not a tab
keyword_re='^[[:blank:]]*keyword'
indent_re='^[[:blank:]]+'
# Read input line by line
while IFS= read -r line; do
    if [[ $line =~ $keyword_re ]]; then
        # Line starts with the keyword: print it and enter the block
        printf "%s\n" "$line"
        indented=1
    elif [[ $line =~ $indent_re && $indented -eq 1 ]]; then
        # Indented line inside a block: print it
        printf "%s\n" "$line"
    else
        # Anything else ends the block
        indented=0
    fi
done < "input"
You can do it in awk too:
awk '/^[ \t]*keyword/{print;while(getline line) if(line~/^[ \t]+.*/) print line;else break}' input
Or use sed
sed -n '/^[ \t]*keyword/{:start;p;n;/^[ \t]/{p;n;b start;}}' input
Using awk:
$ awk '!/^[\t ]/{p=0} /^keyword/{p=1} p' file
keyword aaa match1
keyword ccc match2
    indentend content
    matching
keyword ddd match2
    indented content still matching
$
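The flag logic of that awk one-liner carries over to other languages almost unchanged. As a rough illustration, here is a Ruby sketch of it; the method name keyword_blocks is hypothetical and the sample is the one from the question:

```ruby
# Hypothetical helper mirroring the awk flag logic: a non-indented line
# switches printing off, a line starting with "keyword" switches it on.
def keyword_blocks(lines)
  printing = false
  lines.select do |line|
    printing = false unless line.match?(/\A[ \t]/)
    printing = true if line.start_with?("keyword")
    printing
  end
end

sample = [
  "keyword aaa match1",
  "Some other content",
  "keyword ccc match2",
  "    indentend content",
  "    matching",
  "Some other content",
  "    with indendation",
  "keyword ddd match2",
  "    indented content still matching",
]
puts keyword_blocks(sample)
```

To cover several files, the method could be called once per file, for example with keyword_blocks(File.foreach(name)), so that the flag starts out cleared for each one.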

How to read commented lines in a file and copy them as-is to another file in a shell script

I have a file (named test.func) with comments, as below:
#--------------------
# DOG $ CAT NAMES
#--------------------
Brownie
Blacky
Vicky
Pammy
#--------------
# MOBILE & LAPTOP NAMES
#--------------
Lenovo
Oppo
Realme
The code I have written is as below:
TestFile=$(cat /usr/test.func)
for line in $TestFile
do
    echo "line is $line"
    if [[ "$line" == *"#"* ]]; then
        echo "$line is commented"
        echo "$line" >>test_copy.func
        echo " "
    fi
    if ...
        #Some other logic here
    fi
done
The output I get is as below (in test_copy.func):
line is #----------
#-------- is commented
line is #
# is commented
line is DOG
line is &
line is CAT
line is NAMES
*Some logic is performed*
line is #----------
#-------- is commented
line is #
# is commented
line is MOBILE
line is &
line is LAPTOP
line is NAMES
*Some logic is performed*
Expected output in test_copy.func should be as below
#--------------------
# DOG $ CAT NAMES
#--------------------
*Output as per the logic*
#--------------
# MOBILE & LAPTOP NAMES
#--------------
*Output as per the logic*
The commented lines are split up in the actual output, but the expected result should match the source file.
Can anyone help me resolve this issue?
The code:
TestFile=$(cat /usr/test.func)
for line in $TestFile
does not loop over the lines of the file, but over the "words" (contiguous strings of non-whitespace characters). The variable TestFile contains the contents of the file, but the for loop is subject to field splitting. In other words, if the file contains "foo bar baz", the loop is equivalent to for line in foo bar baz; do .... This is a very fragile construction, as it is also subject to glob expansion, etc. For example, if the file contains wildcards (eg foo * bar), those wildcards will be expanded (and foo * bar expands to a string that contains all the names in the current directory).
The standard way to iterate over the lines of a file is
while IFS= read -r line; do ... done < /usr/test.func
But this is terribly slow and should generally be avoided. Tools like sed and awk are far more appropriate. It's normally a bad idea to read through a file in multiple passes, but while read is so slow that you could read the file 50 times with other tools before you would likely begin to notice. You probably also don't want to copy every line that merely contains a # (which is what the *"#"* expression matches), only lines that begin with #, but that's a different question. I would recommend either:
sed -n -e '/^[[:space:]]*#/p' /usr/test.func > test_copy.func
while IFS= read -r line; do some_other_logic "$line"; done < /usr/test.func
or:
awk '/^[[:space:]]*#/{print > "test_copy.func"}
{ some other logic here }' /usr/test.func
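The difference between iterating over words and iterating over lines can also be seen outside the shell. The following Ruby sketch is only an illustration: the method name comment_lines and the shortened sample text are assumptions, not part of the question.

```ruby
# Hypothetical helper: returns the lines that begin with "#" (after optional
# leading whitespace), unchanged, similar to the /^\s*#/ filters above.
def comment_lines(text)
  text.each_line.select { |line| line.match?(/\A\s*#/) }
end

contents = "#-----\n# DOG & CAT NAMES\n#-----\nBrownie\nBlacky\n"

# Splitting on whitespace is roughly what `for line in $TestFile` does:
# the comment "# DOG & CAT NAMES" falls apart into separate words.
p contents.split

# Iterating by line keeps each comment intact.
puts comment_lines(contents)
```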

Use `sed` to replace text in code block with output of command at the top of the code block

I have a markdown file that has snippets of code resembling the following example:
```
$ cat docs/code_sample.sh

#!/usr/bin/env bash
echo "Hello, world"
```
This means there's a file at the location docs/code_sample.sh, whose contents are:
#!/usr/bin/env bash
echo "Hello, world"
I'd like to parse the markdown file with sed (awk or perl works too) and replace the bottom section of the code snippet with whatever the above bash command evaluates to, for example whatever cat docs/code_sample.sh evaluates to.
Perl to the rescue!
perl -0777 -pe 's/(?<=```\n)^(\$ (.*)\n\n)(?^s:.*?)(?=```)/"$1".qx($2)/meg' < input > output
-0777 slurps the whole file into memory
-p prints the input after processing
s/PATTERN/REPLACEMENT/ works similarly to a substitution in sed
/g replaces globally, i.e. as many times as it can
/m makes ^ match start of each line instead of start of the whole input string
/e evaluates the replacement as code
(?<=```\n) means "preceded by three backquotes and a newline"
(?^s:.*?) changes the behaviour of . to match newlines as well, so it matches (frugally because of the *?) the rest of the preformatted block
(?=```) means "followed by three backquotes"
qx runs the parameter in a shell and returns its output
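For comparison, the same substitution can be sketched in Ruby. This is a simplistic illustration, not a drop-in tool: the names FENCE, CODE_BLOCK, and refresh_blocks are hypothetical, the pattern assumes well-formed blocks with a blank line after the command, and by default it executes whatever command the file contains, so it should only be run on trusted input.

```ruby
FENCE = "`" * 3  # built this way to keep literal fences out of the example

# Opening fence, "$ command", a blank line, the old body, then the closing fence.
CODE_BLOCK = /^(#{FENCE}\n\$ ([^\n]*)\n\n).*?^#{FENCE}$/m

# runner is injectable so the substitution can be tried without running anything;
# the default uses %x, Ruby's counterpart to Perl's qx.
def refresh_blocks(markdown, runner = ->(cmd) { %x(#{cmd}) })
  markdown.gsub(CODE_BLOCK) do
    header  = Regexp.last_match(1)
    command = Regexp.last_match(2)
    output  = runner.call(command)
    output += "\n" unless output.end_with?("\n")  # keep the closing fence on its own line
    "#{header}#{output}#{FENCE}"
  end
end

doc = "#{FENCE}\n$ echo hello\n\nstale text\n#{FENCE}\n"
puts refresh_blocks(doc)
```

Passing a lambda as the second argument, for example one that only echoes the command back, is a cheap way to check which blocks would be rewritten before letting anything run.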
A sed-only solution is easier if you have the GNU version with an e command.
That said, here's a quick, simplistic, and kinda clumsy version I knocked out that doesn't bother to check the values of previous or following lines - it just assumes your format is good, and bulls through without any looping or anything else. Still, for my example code, it worked.
I started by making three files: a and b, which are the scripts, and x, which is the markup file.
$: cat a
#! /bin/bash
echo "Hello, World!"
$: cat b
#! /bin/bash
echo "SCREW YOU!!!!"
$: cat x
```
$ cat a
foo
bar
" b a z ! "
```
```
$ cat b
foo
bar
" b a z ! "
```
Then I wrote s which is the sed script.
$: cat s
#! /bin/env bash
sed -En '
/^```$/,/^```$/ {
# for the lines starting with the $ prompt
/^[$] / {
# save the command to the hold space
x
# write the ``` header to the pattern space
s/.*/```/
# print the fabricated header
p
# swap the command back in
x
# the next line should be blank - add it to the current pattern space
N
# first print the line of code as-is with the (assumed) following blank line
p
# scrub the $ (prompt) off the command
s/^[$] //
# execute the command - store the output into the pattern space
e
# print the output
p
# put the markdown footer back
s/.*/```/
# and print that
p
}
# for the (to be discarded) existing lines of "content"
/^[^`$]/d
}
' "$@"
It does the job and might get you started.
$: s x
```
$ cat a
#! /bin/bash
echo "Hello, World!"
```
```
$ cat b
#! /bin/bash
echo "SCREW YOU!!!!"
```
Lots of caveats - better to actually check that the $ follows a line of backticks and is followed by a blank line, maybe make sure nothing bogus could be in the file to get executed... but this does what you asked, with (GNU) sed.
Good luck.
A rare case when use of getline would be appropriate:
$ cat tst.awk
state == "importing" {
while ( (getline line < $NF) > 0 ) {
print line
}
close($NF)
state = "imported"
}
$0 == "```" { state = (state ? "" : "importing") }
state != "imported" { print }
$ awk -f tst.awk file
See http://awk.freeshell.org/AllAboutGetline for getline uses and caveats.

How, in Ruby, to print column 3 if column 2 is less than 4?

I'm trying to use this code, but it doesn't work:
ruby -a -F';' -ne if $F[2]<4 'puts $F[3]' ppp.txt
this is my file
mmm;2;nsfnjd
sadjjasjnsd;6;gdhjsd
gsduhdssdj;3;gsdhjhjsd
What am I doing wrong? Please help me.
First of all, instead of treating Ruby like some kind of fancy Perl and writing scripts like that, let's expand it into the Ruby code equivalent for clarity:
$; = ';'
while gets
  $F = $_.split
  if $F[2]<4
    puts $F[3]
  end
end
Your original code doesn't work, it can't possibly work because it's not valid Ruby code, and further, you're not properly quoting it to pass through the -e evaluation term. Trying to run it I get:
-bash: 4: No such file or directory
You're also presuming the array is 1-indexed, but it's not. It's 0-indexed. Additionally Ruby treats integer values as completely different from strings, never equivalent, not auto-converted. As such you need to call .to_i to convert.
Here's a re-written program that does the job:
File.open(ARGV[0]) do |fi|
  fi.readlines.each do |line|
    parts = line.chomp.split(';')
    if parts[1].to_i < 4
      puts parts[2]
    end
  end
end
I solved it with this:
ruby -a -F';' -ne ' if $F[1] < "4" ;puts $F[2] end ' ppp.txt
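One caveat with comparing against "4" as a string: string comparison is lexicographic, so a value such as "10" also counts as less than "4". The sketch below does the same filtering with an integer comparison. The method name third_field_where_second_below is hypothetical, and the last sample row is an addition of mine to show the difference.

```ruby
# Hypothetical helper: for each "a;b;c" line, return the third field when
# the second field, read as an integer, is below the limit.
def third_field_where_second_below(lines, limit = 4)
  lines.map { |line| line.chomp.split(";") }
       .select { |fields| fields[1].to_i < limit }
       .map { |fields| fields[2] }
end

rows = [
  "mmm;2;nsfnjd",
  "sadjjasjnsd;6;gdhjsd",
  "gsduhdssdj;3;gsdhjhjsd",
  "extra;10;added_row",   # not in the original file; added for illustration
]
puts third_field_where_second_below(rows)
p "10" < "4"   # string comparison, for contrast with the integer test above
```

On the command line, the equivalent would be along the lines of ruby -a -F';' -ne 'puts $F[2] if $F[1].to_i < 4' ppp.txt.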

Grep search strings with line breaks

How to use grep to output occurrences of the string 'export to excel' in the input files given below? Specifically, how to handle the line breaks that happen in between the search strings? Is there a switch in grep that can do this or some other command probably?
Input files:
File a.txt:
blah blah ... export to
excel ...
blah blah..
File b.txt:
blah blah ... export to excel ...
blah blah..
Do you just want to find files that contain the pattern, ignoring linebreaks, or do you want to actually see the matching lines?
If the former, you can use tr to convert newlines to spaces:
tr '\n' ' ' < a.txt | grep 'export to excel'
If the latter you can do the same thing, but you may want to use the -o flag to only print the actual match. You'll then want to adjust your regex to include any extra context you want.
I don't know how to do this in grep. I checked the man page for egrep(1) and it can't match with a newline in the middle either.
I like the solution @Laurence Gonsalves suggested, of using tr(1) to wipe out the newlines. But as he noted, it will be a pain to print the matching lines if you do it that way.
If you want to match despite a newline and then print the matching line(s), I can't think of a way to do it with grep, but it would be not too hard in any of Python, AWK, Perl, or Ruby.
Here's a Python script that solves the problem. I decided that, for lines that only match when joined to the previous line, I would print a --> arrow before the second line of the match. Lines that match outright are always printed without the arrow.
This is written assuming that /usr/bin/python is Python 2.x. You can trivially change the script to work under Python 3.x if desired.
#!/usr/bin/python
import re
import sys
s_pat = "export\s+to\s+excel"
pat = re.compile(s_pat)
def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)
    prev_line = ""
    i_last = -10
    for i, line in enumerate(f):
        # is ete within current line?
        if pat.search(line):
            print "%s:%d: %s" % (fname, i+1, line.strip())
            i_last = i
        else:
            # construct extended line that included previous
            # note newline is stripped
            s = prev_line.strip("\n") + " " + line
            # is ete within extended line?
            if pat.search(s):
                # matched ete in extended so want both lines printed
                # did we print prev line?
                if not i_last == (i - 1):
                    # no so print it now
                    print "%s:%d: %s" % (fname, i, prev_line.strip())
                # print cur line with special marker
                print "--> %s:%d: %s" % (fname, i+1, line.strip())
                i_last = i
        # make sure we don't match ete twice
        prev_line = re.sub(pat, "", line)
try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
        "export to excel")
    sys.exit(1)
print_ete(sys.argv[1])
EDIT: added comments.
I went to some trouble to make it print the correct line number on each line, using a format similar to what you would get with grep -Hn.
It could be much shorter and simpler if you don't need line numbers, and you don't mind reading in the whole file at once into memory:
#!/usr/bin/python
import re
import sys
# This pattern not compiled with re.MULTILINE on purpose.
# We *want* the \s pattern to match a newline here so it can
# match across multiple lines.
# Note the match group that gathers text around ete pattern uses a character
# class that matches anything but "\n", to grab text around ete.
s_pat = "([^\n]*export\s+to\s+excel[^\n]*)"
pat = re.compile(s_pat)
def print_ete(fname):
    try:
        text = open(fname, "rt").read()
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)
    for s_match in re.findall(pat, text):
        print s_match
try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
        "export to excel")
    sys.exit(1)
print_ete(sys.argv[1])
grep -A1 "export to" filename | grep -B1 "excel"
I have tested this a little and it seems to work:
sed -n '$b; /export to excel/{p; b}; N; /export to\nexcel/{p; b}; D' filename
You can allow for some extra white space at the end and beginning of the lines like this:
sed -n '$b; /export to excel/{p; b}; N; /export to\s*\n\s*excel/{p; b}; D' filename
Use gawk. Set the record separator to excel, then check for "export to":
gawk -vRS="excel" '/export.*to/{print "found export to excel at record: "NR}' file
or
gawk '/export.*to.*excel/{print}
/export to/&&!/excel/{
s=$0
getline line
if (line~/excel/){
printf "%s\n%s\n",s,line
}
}' file
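Since Ruby was listed above as one of the languages where this is not too hard, here is a minimal sketch of the whole-file approach in Ruby. The names PHRASE and find_phrase are hypothetical, and the approach assumes the file fits in memory. It reports the line number where each match starts, allowing any whitespace, including a newline, between the words.

```ruby
# Hypothetical names. \s+ matches spaces, tabs, and newlines, so the phrase
# is found even when it is broken across two lines.
PHRASE = /export\s+to\s+excel/

# Returns [line_number, matched_text] pairs for every occurrence in text.
def find_phrase(text)
  hits = []
  pos = 0
  while (m = PHRASE.match(text, pos))
    line_no = text[0...m.begin(0)].count("\n") + 1  # newlines before the match
    hits << [line_no, m[0]]
    pos = m.end(0)                                  # continue after this match
  end
  hits
end

a_txt = "blah blah ... export to\nexcel ...\nblah blah..\n"
find_phrase(a_txt).each { |line_no, hit| puts "#{line_no}: #{hit.inspect}" }
```

For a real file, the string would come from File.read with the file name in question.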
