Matching across a line vs matching words regex - ruby

Why is it that when I match across newlines I can no longer pick out individual words? For example:
content = "COAL_STORIES
AUSTRALIA - blah blah blah
BOTSWANA – blah blah blah
URANIUM_STORIES
AUSTRALIA – blah
INDIA - blah
COPPER_STORIES
AUSTRALIA - blah blah blah
AUSTRALIA - blah blah blah
CHINA - blah blah blah
ALUMINIUM_STORIES"
sections = content.scan(/\w.*_.*\b/)
Gives an array:
[
[0] "COAL_STORIES",
[1] "URANIUM_STORIES",
[2] "COPPER_STORIES",
[3] "ALUMINIUM_STORIES"
]
But if I try that using the 'm' flag everything gets matched:
sections = content.scan(/\w.*_.*\b/m)
Gives an array:
[
[0] "COAL_STORIES\nAUSTRALIA - blah blah blah\nBOTSWANA – blah blah blah \n\nURANIUM_STORIES \nAUSTRALIA – blah\nINDIA - blah\n\nCOPPER_STORIES\nAUSTRALIA - blah blah blah\nAUSTRALIA - blah blah blah\nCHINA - blah blah blah\n\nALUMINIUM_STORIES"
]
As far as I can tell I'm still looking for the same word boundaries?

To elaborate on Casimir's comment:
.* is greedy: it will match the longest possible string it can, including newlines if you let it (which you did by enabling multiline matching with the /m flag).
In your first example .* will not match newlines, so \b is forced to match a word boundary on the same line as where \w matched.
In your second example .* will match across lines, so when \w matches your first character, \b is free to match any word boundary, even many lines away, as long as there's an _ somewhere between the two. Specifically, for you, it looks like:
\w matched the first character in your input: "C" from "COAL_STORIES"
.* matched everything up to and including "ALUMINIUM" on the last line
_ matched "_"
.* matched "STORIES"
\b matched the end of "STORIES"
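The difference can be reproduced with a shortened version of the input. This is only a minimal sketch: the sample string is abbreviated, and the line-anchored pattern /^\w+_\w+$/ is one possible alternative that does not appear in the question.

```ruby
# Abbreviated sample input (assumption: shortened from the question's data).
content = "COAL_STORIES\nAUSTRALIA - blah\nURANIUM_STORIES\nINDIA - blah\nCOPPER_STORIES"

# Without /m, "." stops at each newline, so every match stays on one line.
per_line = content.scan(/\w.*_.*\b/)

# With /m, "." also matches newlines, so the greedy .* runs to the last "_".
greedy = content.scan(/\w.*_.*\b/m)

# One alternative: anchor to whole lines (^ and $ match at line breaks in Ruby).
anchored = content.scan(/^\w+_\w+$/)

p per_line             # ["COAL_STORIES", "URANIUM_STORIES", "COPPER_STORIES"]
p greedy == [content]  # true: a single match covering the whole string
p anchored             # ["COAL_STORIES", "URANIUM_STORIES", "COPPER_STORIES"]
```

Anchoring to lines avoids relying on how far .* is allowed to run, which is why it behaves the same with or without /m.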

Related

How do I write a regex that captures the first non-numeric part of string that also doesn't include 3 or more spaces?

I'm using Ruby 2.4. I want to extract from a string the first run of consecutive non-numeric characters that does not contain three or more consecutive spaces. For example, in this string
str = "123 aa bb cc 33 dd"
The first such occurrence is " aa bb ". I thought the below expression would help me
data.split(/[[:space:]][[:space:]][[:space:]]+/).first[/\p{L}\D+\p{L}\p{L}/i]
but if the string is "123 456 aaa", it fails to return " aaa", which I would want it to.
r = /
(?: # begin non-capture group
[ ]{,2} # match 0, 1 or 2 spaces
[^[ ]\d]+ # match 1+ characters that are neither spaces nor digits
)+ # end non-capture group and perform 1+ times
[ ]{,2} # match 0, 1 or 2 spaces
/x # free-spacing regex definition mode
str = "123 aa bb cc 33 dd"
str[r] #=> " aa bb "
Note that [ ] could be replaced by a space if free-spacing regex definition mode is not used:
r = /(?: {,2}[^ \d]+)+ {,2}/
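To see how the pattern treats runs of spaces, here is a small sketch with the spacing made explicit. The input strings are assumptions chosen for illustration: the first has exactly three spaces between "bb" and "cc".

```ruby
# Runs of non-space, non-digit characters joined by at most two spaces.
pattern = /(?: {,2}[^ \d]+)+ {,2}/

# Assumed inputs; the gap is built explicitly so the space count is visible.
with_gap = "123 aa bb" + " " * 3 + "cc 33 dd"
single   = "123 456 aaa"

p with_gap[pattern]  # " aa bb  " (two of the three spaces are kept at the end)
p single[pattern]    # " aaa"
```

The trailing {,2} means up to two spaces of a longer gap end up in the match; strip them afterwards if that is not wanted.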
Remove all digits + spaces from the start of a string. Then split with 3 or more whitespaces and grab the first item.
def parse_it(s)
s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1].split(/[[:space:]]{3,}/).first
end
puts parse_it("123 aa bb cc 33 dd")
# => aa bb
puts parse_it("123 456 aaa")
# => aaa
See the Ruby demo
The first regex \A(?:[\d[:space:]]*\d)?(\D+) matches:
\A - start of a string
(?:[\d[:space:]]*\d)? - an optional sequence of:
[\d[:space:]]* - 0+ digits or whitespaces
\d - a digit
(\D+) - Group 1, capturing 1 or more non-digits
The splitting regex is [[:space:]]{3,}, it matches 3 or more whitespaces.
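The two steps can also be inspected separately. The sketch below uses a hypothetical helper, parse_steps, that returns the intermediate values and guards against input with no non-digit characters (a case the original one-liner does not handle). The sample strings are assumptions with explicit spacing.

```ruby
# Hypothetical helper showing the intermediate value of each step.
def parse_steps(s)
  # Step 1: skip leading digits/whitespace, capture the following non-digits.
  head = s[/\A(?:[\d[:space:]]*\d)?(\D+)/, 1]
  return nil if head.nil? # no non-digit characters at all

  # Step 2: split on runs of 3+ whitespace characters and keep the first part.
  parts = head.split(/[[:space:]]{3,}/)
  { head: head, parts: parts, result: parts.first }
end

gap = " " * 3
steps = parse_steps("123 aa bb#{gap}cc 33 dd")
p steps[:head]    # " aa bb   cc "
p steps[:parts]   # [" aa bb", "cc "]
p steps[:result]  # " aa bb"
p parse_steps("123 456 aaa")[:result]  # " aaa"
p parse_steps("12345")                 # nil
```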
It looks like this'd do it:
regex = /(?: {1,2}[[:alpha:]]{2,})+/
"123 aa bb cc 33 dd"[regex] # => " aa bb"
"123 456 aaa"[regex] # => " aaa"
(?: ... ) is a non-capturing group.
{1,2} means "find at least one, and at most two".
[[:alpha:]] is a POSIX definition for alphabet characters. It's more comprehensive than [a-z].
You should be able to figure out the rest, which is all documented in the Regexp documentation and String's [] documentation.
Will this work?
str.match(/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/)[0]
or apparently
str[/(?: ?)?(?:[^ 0-9]+(?: ?)?)+/]
or using Cary's nice space match,
str[/ {,2}(?:[^ 0-9]+ {,2})+/]

Special characters in Salt state

I am using Salt and I have to append some text to a file. After some research I found that you can achieve that by using the file.append module.
I am getting an error about adding something like [text] in the file:
failed: could not found expected ':';at line x
The state is:
file.append:
  - text: |
    blah blah blah
    [SSL] <====================== Here is where it complains
    blah blah
Should I escape the [ character with a preceding \, or is there some other way to do this?
The problem is your indentation. A mapping value must be indented more than its key:
file.append:
  - text: |
      blah blah blah
      [SSL]
      blah blah
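The indentation rule can be checked outside Salt with any YAML parser. The sketch below uses Ruby's bundled YAML library (Psych) rather than Salt's own parser, on the assumption that both apply the same YAML rule; try_parse is a hypothetical helper, and the exact indentation of the failing layout is also an assumption.

```ruby
require 'yaml'

# Block scalar lines at the same indentation as the "text" key (assumed failing layout).
bad = <<~DOC
  file.append:
    - text: |
      blah blah blah
      [SSL]
      blah blah
DOC

# Block scalar lines indented more than the "text" key.
good = <<~DOC
  file.append:
    - text: |
        blah blah blah
        [SSL]
        blah blah
DOC

# Hypothetical helper: returns the parsed data, or a short message on a syntax error.
def try_parse(doc)
  YAML.safe_load(doc)
rescue Psych::SyntaxError => e
  "syntax error: #{e.problem}"
end

p try_parse(bad)   # a "syntax error: ..." string
p try_parse(good)  # {"file.append"=>[{"text"=>"blah blah blah\n[SSL]\nblah blah\n"}]}
```

No escaping of [ is needed once the lines are part of the block scalar; inside a | block the parser takes the text literally.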

Multi-paragraph attribute in AsciiDoc

I have a two-paragraph text that gets repeated fairly often. How could I avoid the repetition?
For now I have:
:something-1: Blah blah blah +
blah blah blah +
blah blah blah
:something-2: Blah blah blah +
blah blah blah +
blah blah blah
And then:
--
{something-1}
{something-2}
--
Is there a way I could put both paragraphs into one attribute? It would be even better if I could put the block into the attribute too.
This doesn't work:
:something: Blah blah blah +
blah blah blah +
blah blah blah +
+
Blah blah blah +
blah blah blah +
blah blah blah
The plus on the empty line and the second paragraph are not parsed as part of the attribute definition.
Another option is putting the two paragraphs in a separate file and using the include:: directive. But creating a separate file every time I face this problem would create some clutter. It also makes it harder than necessary to go from 1-paragraph definitions to 2-paragraph definitions. I'd rather have a single "glossary" section (or document) which contains all these repeated term definitions.
I don't know if multi-paragraph attributes are possible, but selective imports definitely are! I now have a glossary.asciidoc file:
tag::something[]
--
Blah blah blah
Blah blah blah
--
end::something[]
And I can import this section by saying:
include::glossary.asciidoc[tag=something]
A major advantage of this approach is that text formatting inside the snippet works.

regex too longwinded and multiline issues

I'm trying to change text to go from this:
\v 1 something \f + \xo footnote one \f* whatever \x + \xo footnote two \x* more text \f + \xo footnote three \f* blah blah blah \x + \xo footnote four \x*
\v 2 something \x + \xo footnote one \x*
to this:
\v 1 something \f * \xo footnote one \f* whatever \x ** \xo footnote two \x* more text \f *** \xo footnote three \f* blah blah blah \x $ \xo footnote four \x* \v 2 something \x * \xo footnote one \x*
So in each footnote, the '+' is replaced by the next symbol in the sequence (* ** *** $ $$ $$$ £), but the sequence has to reset at each new verse (\v). There can be up to 7 footnotes per verse.
I'm new to Ruby, so I know there's a better way to do this; what I've done is very long-winded:
file = File.open('input.txt', 'r+')
contents = file.read
reassign = contents.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**\5***\7$\9$$')
.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**\5***\7$')
.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**\5***')
.gsub(/(\\v.*?(\\x|\\f) )\+(.*?(\\x|\\f) )\+/m, '\1*\3**')
.gsub(/(\\v.*?(\\x|\\f) )\+/m, '\1*')
new_file = File.open("output.txt", "w+")
new_file.write(reassign)
new_file.close
If I don't add the m flag after the regex, it skips a lot of footnotes that are longer than one line, but if I add it, it skips over the verses altogether and doesn't reset the sequence.
Thanks
I suggest you first split('\v') the string, giving you an array of strings, map each of the strings in the resulting array to a string with the footnote symbols replaced with the appropriate strings, then join('\v') the strings back together.
Code
def map_footnote_symbols(str)
str.split('\v').map do |s|
t = %w{ * ** *** $ $$ $$$ £ }
s.gsub(/./) { |c| (c == '+') ? t.shift : c }
end.join('\v')
end
Example
str = '\v 1 something \f + \xo footnote one \f* whatever ' +
'\x + \xo footnote two \x* more text \f + \xo footnote ' +
'three \f* blah blah blah \x + \xo footnote four \x* \v 2 ' +
'something \x + \xo footnote one \x*'
(I've broken the string into pieces so that it can be viewed without having to scroll horizontally. Single quotes are used so that \v, \f and \x stay as literal backslash sequences.)
puts map_footnote_symbols str
#=> \v 1 something \f * \xo footnote one \f* whatever \x ** \xo |
# footnote two \x* more text \f *** \xo footnote three \f* |
# blah blah blah \x $ \xo footnote four \x* \v 2 something |
# \x * \xo footnote one \x*
(I've broken the output string into pieces so that it can be viewed without having to scroll horizontally. The character | indicates where I've broken each line.)
Explanation
a = str.split('\v')
#=> ["",
# " 1 something \\f + \\xo footnote one \\f* whatever \\x + \\xo |
# footnote two \\x* more text \\f + \\xo footnote three \\f* |
# blah blah blah \\x + \\xo footnote four \\x* ",
# " 2 something \\x + \\xo footnote one \\x*"]
(Again, I've broken the second string in the array into pieces so that it can be viewed without having to scroll horizontally.)
map passes each element of a into its block, assigning it to the block variable s. The first is:
s = ""
We then have:
t = %w{ * ** *** $ $$ $$$ £ }
#=> ["*", "**", "***", "$", "$$", "$$$", "£"]
b = "".gsub(/./) { |c| (c == '+') ? t.shift : c }
#=> ""
So "" is (obviously) mapped to "". The next element (string) map passes into the block is:
s = " 1 something \\f + \\xo footnote one \\f* whatever \\x + \\xo " +
"footnote two \\x* more text \\f + \\xo footnote three \\f* " +
"blah blah blah \\x + \\xo footnote four \\x* "
The regex /./ causes gsub to pass each character of s to its block to determine the substituted value. (c == '+') is false for every character up to the first +, so these characters are all left unchanged (i.e., replaced by c). The first + is replaced by t.shift:
t = ["*", "**", "***", "$", "$$", "$$$", "£"]
t.shift #=> "*"
leaving
t #=> ["**", "***", "$", "$$", "$$$", "£"]
The characters up to the next + are left unchanged and that + is replaced by:
t.shift #=> "**"
leaving
t #=> ["***", "$", "$$", "$$$", "£"]
and so on. As a result:
c = a.map do |s|
t = %w{ * ** *** $ $$ $$$ £ }
s.gsub(/./) { |c| (c == '+') ? t.shift : c }
end
#=> ["",
# " 1 something \\f * \\xo footnote one \\f* whatever \\x ** \\xo |
# footnote two \\x* more text \\f *** \\xo footnote three \\f* |
# blah blah blah \\x $ \\xo footnote four \\x* ",
# " 2 something \\x * \\xo footnote one \\x*"]
All that remains is to reassemble the string:
c.join('\v')
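A more compact variant of the same split/map/join idea is sketched below. It is not the code from the answer above: renumber_footnotes is a hypothetical helper that replaces only a + directly following "\x " or "\f " (so other plus signs in the text are left alone), and it falls back to + if a verse has more footnotes than symbols. Because the pattern does not use ".", footnotes that span several lines need no m flag.

```ruby
SYMBOLS = %w[* ** *** $ $$ $$$ £].freeze

# Hypothetical helper: renumbers footnote markers, restarting at each \v.
def renumber_footnotes(text)
  text.split('\v', -1).map do |verse|
    markers = SYMBOLS.dup # fresh sequence for every verse
    # Replace a "+" only when it directly follows "\x " or "\f ".
    verse.gsub(/(\\[xf] )\+/) { "#{$1}#{markers.shift || '+'}" }
  end.join('\v')
end

# Assumed sample input; the first footnote spans two lines.
input = '\v 1 a \f + \xo one' + "\n" + 'continued \f* b \x + \xo two \x*' + "\n" +
        '\v 2 c \x + \xo one \x*'

puts renumber_footnotes(input)
# \v 1 a \f * \xo one
# continued \f* b \x ** \xo two \x*
# \v 2 c \x * \xo one \x*
```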

Append line only when 2 search criteria are met

I want to accomplish the following with sed
1. Find the first occurrence of [sometext] (exact match).
2. Then search from there for stuID = 10 (exact match).
3. Then append the line checkID = 4 after the first occurrence of stuID under [sometext].
Note: the value of checkID will change according to [sometext]; that's why I need to append the line for the first occurrence only.
My attempts
sed '/[sometext]/{ s/stuID = 10/a\checkID = 4/1 }' file.txt
sed 's/[sometext]/{ s/stuID = 10/a\checkID = 4/1 }' file.txt
sed '/[sometext]/{ s/stuID = 10/a\checkID = 4/g }' file.txt
(The third attempt was just to see whether the command works if I don't specify the number of times to add the new line.)
I also added \[ and \] to escape the brackets.
Results
1. The command gets executed, but checkID=4 is not added anywhere in file.txt.
2. Error: sed: -e expression #1, char 18: multiple `g' options to `s' command
This implies that the syntax itself is wrong.
3. The command gets executed, but checkID=4 is not added anywhere in file.txt.
By "executed" I mean there is no error message.
File.txt
[sometext]
blah blah blah
blah blah blah
stuID = 10
blah blah blah
blah blah blah
[Anothertext]
blah blah blah
blah blah blah
stuID = 5
blah blah blah
blah blah blah
I want it to be
File.txt
[sometext]
blah blah blah
blah blah blah
stuID = 10
checkID=4
blah blah blah
blah blah blah
[Anothertext]
blah blah blah
blah blah blah
stuID = 5
checkID=6
blah blah blah
blah blah blah
I am completely tired and clueless at this point. Hope someone can help me out.
Regards
This might work for you (GNU sed):
sed -e '/\[sometext\]/,/stuID = 10/{/stuID = 10/{a\checkID = 4' -e ':a;n;ba}}' file
This finds the range between the two search strings and appends the desired text after the second one. The :a;n;ba loop then prints the rest of the file unchanged, so nothing further is appended.
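For comparison, the same idea can be sketched in Ruby. This is not a translation of the sed command: append_in_section is a hypothetical helper that tracks the current [section] header explicitly and appends after the first matching line inside the requested section only.

```ruby
# Hypothetical helper: appends `addition` after the first line equal to `match`
# inside the section whose header line equals `section`.
def append_in_section(lines, section:, match:, addition:)
  in_section = false
  done = false
  lines.flat_map do |line|
    # Any "[...]" line starts a new section.
    in_section = (line == section) if line.match?(/\A\[.*\]\z/)
    if in_section && !done && line == match
      done = true
      [line, addition]
    else
      [line]
    end
  end
end

# Assumed sample data, shortened from the file in the question.
lines = ["[sometext]", "blah", "stuID = 10", "blah", "[Anothertext]", "stuID = 10"]
result = append_in_section(lines, section: "[sometext]",
                           match: "stuID = 10", addition: "checkID = 4")
puts result
```

To run it against a real file, the lines could be read with File.readlines("file.txt", chomp: true) and the result written back with File.write.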

Resources