How to use gsub to simplify regular expressions - ruby

I would like to escape the # with \ when they appear in \href commands.
Normally I would write a regex such as s/(\\href\{.*?)#(.*?)\}/\1\\#\2/g, but I imagine gsub would be a good choice here to first extract the \href content and then replace # with \#.
Here is some text with a \href{./file.pdf#section.1.5}{link} to section 1.5.
There can be multiple links in one line.
Question
Can gsub simplify these sorts of problems?

Except if one or several of the urls contained inside \href{..}s has a password part enclosed between quotes like http://username:"sdkfj#lkn#"#domainname.org/path/file.ext, the only possible place for the character # in a url is at the end and delimits the fragment part: ./path/path/file.rb?val=toto#thefragmentpart.
In other words, if I am not mistaken, there is at most one # to escape per \href{...}. In that case you can simply do this:
text.gsub(/\\href{[^#}]*\K#/, "\\#")
The character class [^#}] forbids the character } and ensures that you are always between curly brackets.
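As a rough check of the \K approach, the sketch below runs it on a made-up line with two links and one # outside any \href. The sample text and the variable names are assumptions for illustration, and the block form of gsub is used only to sidestep backslash handling in replacement strings.

```ruby
# Sample input is made up for this sketch; `text` and `escaped` are illustrative names.
text = 'See \href{./a.pdf#section.1.5}{one} and \href{./b.pdf#page.2}{two}, issue #3.'

# \K drops everything matched so far from the match, so only the "#" is replaced.
# [^#}]* cannot cross a "}", which keeps the match inside the first {...} group.
escaped = text.gsub(/\\href\{[^#}]*\K#/) { '\#' }

puts escaped
```

The # in "issue #3" is left alone because it is not preceded by \href{ on the way to it.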

You could use two gsubs: one with an argument and a block (for href{...}), and one with two arguments (to replace # with \#):
text = %q(Here is some text with a \href{./file.pdf#section.1.5}{link} to section 1.5.)
puts text.gsub(/href{[^}]+}/){ |href| href.gsub('#', '\#') }
#=> Here is some text with a \href{./file.pdf\#section.1.5}{link} to section 1.5.
If you want to run it from a terminal with ruby -e on a test.txt file, you can use the following. Note the gsub! on $_: with -p, Ruby prints $_ after each line, so it has to be modified in place.
ruby -pe '$_.gsub!(/href{[^}]+}/){ |href| href.gsub(%q|#|, %q|\#|) }' test.txt
# Here is some text with a \href{./file.pdf\#section.1.5}{link} to section 1.5.
# Here is some text with a \href{./file.pdf\#section.1.6}{link} to section 1.6.
# Here is some text with a \href{./file.pdf\#section.1.7}{link} to section 1.7.
or
ruby -e 'puts ARGF.read.gsub(/href{[^}]+}/){ |href| href.gsub(%q|#|, %q|\#|) }' test.txt
# Here is some text with a \href{./file.pdf\#section.1.5}{link} to section 1.5.
# Here is some text with a \href{./file.pdf\#section.1.6}{link} to section 1.6.
# Here is some text with a \href{./file.pdf\#section.1.7}{link} to section 1.7.
Do not mix ruby -pe and ARGF.read, it would only read the first line of your file!
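For anything longer than a one-liner, the same nested gsub can live in a small script. This is only a sketch: the method name escape_href_hashes and the temporary file (standing in for test.txt) are assumptions made for the example.

```ruby
require 'tempfile'

# Escape every "#" inside each \href{...} target, using the nested gsub idea.
# The method name is hypothetical, chosen for this sketch.
def escape_href_hashes(line)
  line.gsub(/\\href\{[^}]+\}/) { |href| href.gsub('#') { '\#' } }
end

# A throwaway file stands in for test.txt; its contents are made up.
Tempfile.create('href_demo') do |file|
  file.write("A \\href{./file.pdf#section.1.5}{link}.\nPlain # stays as it is.\n")
  file.flush

  # File.foreach yields one line at a time, much like the implicit loop of ruby -p.
  File.foreach(file.path) { |line| puts escape_href_hashes(line) }
end
```

Reading line by line keeps memory use flat, at the cost of not matching a \href{...} that spans two lines.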

Related

How to convert Github-style Wiki page link to Markdown-style link in Bash script

first question for me on Stack Overflow.
I am trying to write a Bash script to convert the kind of Github Wiki links generated for other internal Github Wiki pages into conventional Markdown-style links.
The Github Wiki link strings look like this:
[[An example of another page]]
I want to convert it to look like this:
[An example of another page](An-example-of-another-page.htm)
Documents have an unknown number of these links and I don't know the content.
Currently I have been playing around with one-line sed solutions given to other problems, like this one:
https://askubuntu.com/questions/1283471/inserting-text-to-existing-text-within-brackets
... with absolutely no success. I'm not even sure where to start with it.
Thanks.
You can try this sed
$ sed -E 's/\[(.[^]]*)\]/\1/g;s/\[(.[^]]*)]/&(\1)/g;:jump s/(\([^ \)]*)[ ]/\1-/;tjump' input_file
[An example of another page](An-example-of-another-page)
s/\[(.[^]]*)\]/\1/g - Remove the outer pair of brackets, turning [[text]] into [text]
s/\[(.[^]]*)]/&(\1)/g - Duplicate the content inside brackets [], return the match &, then manipulate the match and add parenthesis (\1)
:jump s/(\([^ \)]*)[ ]/\1-/;tjump - Create a label jump, match the empty spaces within the match if it is within parenthesis and replace with -
You can use bash's internal regular expression support to find and replace instances of wiki linked [[text]] with [text](text.htm). The pattern you want to use is \[\[([^\]]*)\]\]
\[ and \] - escapes the left and right square brackets so that they aren't interpreted as meta-characters that let you match character classes
([^\]]*) captures all text inside the double brackets until the first right square bracket
From there you can evaluate this regex and use the $BASH_REMATCH array to extract and manipulate the text. You'll need to run this multiple times in order to match all instances in the string and then replace the string inline using the / and // operators.
Here's a sample script:
#!/usr/bin/env bash
wiki_string="Now, this is [[a story]] all about how
My life [[got flipped-turned upside down]]
And I'd [[like to take a minute]]
Just [[sit]] right there
I'll [[tell you]] how I [[became the prince]] of a town called Bel-Air"
printf 'Original: %s\n' "$wiki_string"
# find each instance of [[text]] and capture the text inside
# the square brackets
# if successful, BASH_REMATCH will contain the matched text and the
# captured value inside the parentheses
while [[ "$wiki_string" =~ \[\[([^\]]*)\]\] ]]; do
    # escape the [ and ] characters so we can replace [[text]]
    # with our modified value
    replace_text="${BASH_REMATCH[0]}"
    replace_text="${replace_text/\[\[/\\[\\[}"
    replace_text="${replace_text/\]\]/\\]\\]}"
    # Get the matched value inside the brackets
    link_text="${BASH_REMATCH[1]}"
    # store another copy of the text with the spaces replaced
    # with dashes and appending .htm
    link_target="${link_text// /-}.htm"
    # Finally, replace the matched [[text]] with [text](text.htm)
    wiki_string="${wiki_string//$replace_text/[$link_text]($link_target)}"
done
printf '\nUpdated: %s\n' "$wiki_string"
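If Ruby is available, the same capture-and-rewrite idea fits in a single gsub with a block. This is a sketch rather than a drop-in replacement for the script above: the method name wiki_to_markdown is made up, and the .htm suffix simply follows the question.

```ruby
# Convert [[Page title]] to [Page title](Page-title.htm).
# The method name is hypothetical; the .htm suffix follows the question.
def wiki_to_markdown(text)
  text.gsub(/\[\[([^\]]+)\]\]/) do
    title = Regexp.last_match(1)
    # Spaces become hyphens in the link target only, not in the link text.
    "[#{title}](#{title.tr(' ', '-')}.htm)"
  end
end

puts wiki_to_markdown('See [[An example of another page]] and [an existing link](x.htm).')
```

Because the pattern requires two opening brackets, ordinary Markdown links that are already in the text are not touched.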
Thanks to HatLess for the answer which I adapted. The snippet below converts Github-style links into Markdown-style links, without the two issues that HatLess's solution had. Specifically this doesn't break pre-existing Markdown-style links and it doesn't replace spaces with hyphens within brackets unless part of a link.
sed -E 's/\[\[(.[^]]*)]]/&(support\-\1\.htm)/g;:jump s/(]\([^ \)]*)[ ]/\1-/;tjump;s/\[\[/\[/g;s/]]\(/]\(/g' | pandoc -t html

How to add bash variable to text file after matched keywords?

I have a YAML file containing the key "name:" and I would like to insert the value of a bash variable ($var) after it. I am able to find the keyword and add the variable after it:
sed -i "s/name:/& $var/" "yaml file"
However, each run appends the variable again, so the file ends up with name: abc def ghi (I would like a single name only).
How can I fix this? Also, how can I add some text after $var, something like "$var-role"?
Thanks.
You need to replace the whole line, not just name:. So add .* to the regexp to match everything after it on the line.
sed -i "s/name:.*/name: $var/" "yaml file"
You can't use & in this version because that would include the rest of the line as well.
If you want to add more text, just put it after the variable.
sed -i "s/name:.*/name: ${var}-role/" "yaml file"
Put {} around the variable name to separate it from the following text (not really needed when the text begins with -, but would be needed if it began with a character that can be part of a variable name).
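The difference between appending after name: and replacing the whole line can be checked with a few lines of Ruby. This is only an illustration of the idea, not the sed command itself: the method name set_name and the sample YAML are assumptions, and the pattern is anchored to the start of the line, which the sed version above does not do.

```ruby
# Replace whatever follows "name:" on its line; `var` stands in for the shell variable.
# The method name and the sample YAML are hypothetical.
def set_name(yaml_text, var)
  yaml_text.sub(/^([ \t]*name:).*$/) { "#{Regexp.last_match(1)} #{var}-role" }
end

yaml  = "kind: Role\nname: abc def ghi\n"
once  = set_name(yaml, 'abc')
twice = set_name(once, 'abc')

puts once
puts twice == once   # a second run leaves the text unchanged
```

Because the rest of the line is part of the match, running the substitution repeatedly no longer accumulates values.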

How to use line breaks in a slim variable?

I have this variable in slim:
- foo = 'my \n desired multiline <br/> string'
#{foo}
When I parse the output using the slimrb command line command the contents of the variable are encoded:
my \n desired multiline <br/> string
How can I have slimrb output the raw contents in order to generate multi-line strings?
Note that neither .html_safe nor .raw are available.
There are two issues here. First in Ruby strings using single quotes – ' – don’t convert \n to newlines, they remain as literal \ and n. You need to use double quotes. This applies to Slim too.
Second, Slim HTML escapes the result of interpolation by default. To avoid this use double braces around the code. Slim also HTML escapes Ruby output by default (using =). To avoid escaping in that case use double equals (==).
Combining these two, your code will look something like:
- foo = "my \n desired multiline <br/> string"
td #{{foo}}
This produces:
<td>my
desired multiline <br/> string</td>
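The quoting point can be verified in plain Ruby, without Slim. The variable names below are illustrative only.

```ruby
single = 'my \n desired'   # backslash and "n" stay as two separate characters
double = "my \n desired"   # \n becomes a single newline character

puts single.include?("\n")   # does the single-quoted string contain a newline?
puts double.include?("\n")   # does the double-quoted string contain a newline?
puts double.lines.length     # number of lines in the double-quoted string
```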
An easier way is to use the verbatim text line indicator ( | ), which is described in the Slim documentation. For example:
p
  | This line is on the left margin.
     This line will have one space in front of it.

Remove line break if line does not start with KEYWORD

I have a flat file with lines that look like
KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....
How do I go about removing the linebreak so that
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
turns into
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
This is in a HP-UNIX environment and I can move the file to another system (windows box with powershell and ruby installed).
I don't know which tools you are using, but you can use this regex to match every \n (or maybe \r) that isn't followed by KEYWORD and replace it with a space.
Regex: \r(?!KEYWORD) (With global modifier)
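In Ruby, the negative lookahead could be applied with gsub roughly as follows. The sample data is made up, and the pattern is a variation on the regex above: it accepts \r\n or \n, and it also leaves the final line break in place, which is an addition of this sketch.

```ruby
# Made-up sample with Windows-style line endings.
text = "KEYWORD|A|1\r\nKEYWORD|B|REALLY\r\nLONGSTRING\r\nKEYWORD|C\r\n"

# Replace a line break (\r\n or \n) with a space unless KEYWORD
# or the end of the text follows it.
joined = text.gsub(/\r?\n(?!KEYWORD|\z)/, ' ')

puts joined
```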
Ruby's Array has a nice method called slice_before that it inherits from Enumerable, which comes to the rescue here:
require 'pp'
text = 'KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING
KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING
KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY
LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE
KEYWORD|.....'
pp text.split("\n").slice_before(/^KEYWORD/).map{ |a| a.join(' ') }
=> ["KEYWORD|DATA STRING HERE|32|50135|ANOTHER DATA STRING",
"KEYWORD|STRING OF DATA|1333|552555666|ANOTHER STRING",
"KEYWORD|STRING OF MORE DATA|4522452|5345245245|REALLY REALLY REALLY REALLY LONGSTRING THAT INSERTED A LINE BREAK WHEN I WAS EXTRACTING FROM SQLPLUS/ORACLE",
"KEYWORD|....."]
This code just splits your text on line breaks, then uses slice_before to break the resulting array into sub-arrays, one for each block of text starting with /^KEYWORD/. Then it walks through the resulting sub-arrays, joining them with a single space. Any line that wasn't pre-split will be left alone. Ones that were broken are rejoined.
For real use you'd probably want to replace pp with a regular puts.
As for moving the code to Windows with Ruby, why? Install Ruby on HP-Unix and run it there. It's a more natural fit.
This short awk one-liner should do the job:
awk '/^KEYWORD/{print ""}{printf "%s", $0}' file
This might work for you (GNU sed):
sed ':a;$!{N;/\n.*|/!{s/\n/ /;ba}};P;D' file
Keep two lines in the pattern space and, if the second line doesn't contain a |, replace the newline with a space and repeat until it does or the end of the file is reached.
This assumes the last field is the one that overflows; otherwise match on KEYWORD instead:
sed ':a;$!{N;/\nKEYWORD/!{s/\n/ /;ba}};P;D' file
Powershell way:
[System.IO.File]::ReadAllText( "c:\myfile.txt" ) -replace "`r`n(?!KEYWORD)", ' '
You can use sed or awk (preferred) for this:
sed -n 's|\r||g;$!{1{x;d};H};${H;x;s|\n\(KEYWORD\)|\r\1|g;s|\n||g;s|\r|\n|g;p}' file.txt
awk 'BEGIN{ORS="";}NR==1{print;next;}/^KEYWORD/{print"\n";print;next;}{print;}' file.txt
Note: Write each command (sed, awk) on one line.
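For large files, a line-by-line Ruby version avoids loading everything into memory, in the same spirit as the sed and awk commands. This is a sketch under assumptions: the method name each_record is made up, and a StringIO with sample data stands in for the real file.

```ruby
require 'stringio'

# Join continuation lines onto the preceding KEYWORD record, one input line at a time.
# `io` is any object with each_line (a File in real use); the method name is hypothetical.
def each_record(io, keyword = 'KEYWORD')
  current = nil
  io.each_line do |line|
    line = line.chomp                      # also strips a trailing \r\n
    if current.nil? || line.start_with?(keyword)
      yield current if current             # the previous record is complete
      current = line
    else
      current << ' ' << line               # continuation of a broken line
    end
  end
  yield current if current                 # flush the last record
end

input = StringIO.new("KEYWORD|A|1\nKEYWORD|B|REALLY\nLONGSTRING\nKEYWORD|C\n")
each_record(input) { |record| puts record }
```

In real use, File.open(path) { |f| each_record(f) { |r| puts r } } would take the place of the StringIO.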

Rubular/Ruby discrepancy in captured text

I've carefully cut and pasted from this Rubular window http://rubular.com/r/YH8Qj2EY9j to my code, yet I get different results. The Rubular match capture is what I want. Yet
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
  puts description = $1
end
only gets me the first line, i.e.
<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
I don't think it's my test data, but that's possible. What am I missing?
(ruby 1.9 on Ubuntu 10.10)
Paste your test data into an editor that is able to display control characters and verify your line break characters. Normally it should be only \n on a Linux system as in your regex. (I had unusual linebreaks a few weeks ago and don't know why.)
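If no such editor is at hand, Ruby itself can make the line breaks visible. The sample string below is made up; in practice desc would hold the real test data.

```ruby
# Made-up sample; substitute the real test data.
desc = "<DD>first line<br />\r\nsecond line\r\n<DT>rest"

# String#inspect renders control characters as visible escapes such as \r and \n.
puts desc.inspect

# Count each kind of line break; \r\n is listed first in the pattern
# so it is not counted as a separate \r plus \n.
counts = Hash.new(0)
desc.scan(/\r\n|\r|\n/) { |ending| counts[ending.inspect] += 1 }
p counts
```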
The other check you can do is to change your brackets and print your capturing groups, so that you can see which part of your regex matches what.
/^<DD>(.*)\n?(.*)\n/
Another idea to get this to work is to change the .*. Instead of matching any character, match anything except \n.
^<DD>([^\n]*\n?[^\n]*)\n
I believe you need the multiline modifier in your code:
/m Multiline mode: dot matches newlines. (In Ruby, ^ and $ already match at line starts and ends without it.)
The following:
#!/usr/bin/env ruby
desc= '<DD>#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
<DT>la la this should not be matched oh good'
desc_pattern = /^<DD>(.*\n?.*)\n/
if desc =~ desc_pattern
  puts description = $1
end
prints
#mathpunk Griefing (i.e. trolling) as Play: http://t.co/LwOH1Vb<br />
– Johnny Badhair (8spiders) http://twitter.com/8spiders/status/92876473853157377
on my system (Linux, Ruby 1.8.7).
Perhaps your line breaks are really \r\n (Windows style)? What if you try:
desc_pattern = /^<DD>(.*\r?\n?.*)\r?\n/
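An alternative to adding \r? to the pattern is to normalize the line endings first and keep the original regex. This is a sketch with made-up sample data, assuming the input may contain \r\n or lone \r line breaks.

```ruby
# Made-up sample with Windows-style line endings.
desc = "<DD>first line<br />\r\nsecond line\r\n<DT>not matched"

# Convert \r\n and lone \r to \n so the original pattern can be used unchanged.
normalized = desc.gsub(/\r\n?/, "\n")

# String#[] with a regex and a group index returns that capture (or nil).
description = normalized[/^<DD>(.*\n?.*)\n/, 1]
puts description
```

Normalizing up front also keeps stray \r characters out of the captured text.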

Resources