Ruby scan Regular Expression - ruby

I'm trying to split the string:
"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
into the following array:
[
["test","blah"]
["foo","bar bar bar"]
["test","abc","123","456 789"]
]
I tried the following, but it isn't quite right:
"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
.scan(/\[(.*?)\s*\|\s*(.*?)\]/)
# =>
# [
# ["test", "blah"]
# ["foo", "bar bar bar"]
# ["test", "abc |123 | 456 789"]
# ]
I need to split at every pipe instead of the first pipe. What would be the correct regular expression to achieve this?

s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
arr = s.scan(/\[(.*?)\]/).map {|m| m[0].split(/ *\| */)}

Two alternatives:
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
s.split(/\s*\n\s*/).map{ |p| p.scan(/[^|\[\]]+/).map(&:strip) }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
irb> s.split(/\s*\n\s*/).map do |line|
line.sub(/^\s*\[\s*/,'').sub(/\s*\]\s*$/,'').split(/\s*\|\s*/)
end
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
Both of them start by splitting on newlines (throwing away surrounding whitespace).
The first one then splits each chunk by looking for anything that is not a [, |, or ] and then throws away extra whitespace (calling strip on each).
The second one then throws away leading [ and trailing ] (with whitespace) and then splits on | (with whitespace).
You cannot get the final result you want with a single scan. About the closest you can get is this:
s.scan /\[(?:([^|\]]+)\|)*([^|\]]+)\]/
#=> [["test", " blah"], ["foo ", "bar bar bar"], ["123 ", " 456 789"]]
…which drops information, or this:
s.scan /\[((?:[^|\]]+\|)*[^|\]]+)\]/
#=> [["test| blah"], ["foo |bar bar bar"], ["test| abc |123 | 456 789"]]
…which captures the contents of each "array" as a single capture, or this:
s.scan /\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/
#=> [["test", nil, nil, " blah"], ["foo ", nil, nil, "bar bar bar"], ["test", " abc ", "123 ", " 456 789"]]
…which is hardcoded to a maximum of four items, and inserts nil entries that you would need to .compact away.
There is no way to use Ruby's scan to take a regex like /(?:(aaa)b)+/ and get multiple captures for each time the repetition is matched.
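The limitation is easy to see in a small sketch. The helper name `parse_groups` below is hypothetical, introduced only for illustration; it sidesteps the problem by scanning for whole bracketed groups and splitting each one afterwards.

```ruby
# A repeated capture group only remembers its last iteration.
repeated = "aaabaaab".scan(/(?:(aaa)b)+/)
p repeated # one match carrying a single capture, not two

# Hypothetical helper: scan for bracketed groups, then split each on pipes.
def parse_groups(text)
  text.scan(/\[([^\]]*)\]/).flatten.map do |inner|
    inner.split("|").map(&:strip)
  end
end

s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
p parse_groups(s)
```

Doing the work in two passes keeps each regex simple and puts no fixed limit on the number of items per group.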

Why the hard path (single regex)? Why not a simple combo of splits? Here are the steps, to visualize the process.
str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
arr = str.split("\n").map(&:strip) # => ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]
arr = arr.map{|s| s[1..-2] } # => ["test| blah", "foo |bar bar bar", "test| abc |123 | 456 789"]
arr = arr.map{|s| s.split('|').map(&:strip)} # => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
This is likely far less efficient than scan, but at least it's simple :)
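Whether the split chain really is slower is something to measure rather than assume. The following is a minimal sketch using Ruby's standard `Benchmark` module; the method names `with_scan` and `with_splits` and the iteration count are assumptions made for this example, and the timings will vary by machine and Ruby version.

```ruby
require "benchmark"

# Hypothetical wrappers around the two approaches shown above.
def with_scan(s)
  s.scan(/\[(.*?)\]/).map { |m| m[0].split(/ *\| */) }
end

def with_splits(s)
  s.split("\n").map(&:strip).map { |l| l[1..-2].split("|").map(&:strip) }
end

sample = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
n = 10_000 # arbitrary iteration count

Benchmark.bm(12) do |x|
  x.report("scan+split")  { n.times { with_scan(sample) } }
  x.report("split chain") { n.times { with_splits(sample) } }
end
```

For strings this small the difference is unlikely to matter in practice, so readability is a reasonable tie-breaker.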

A "Scan, Split, Strip, and Delete" Train-Wreck
The whole premise seems flawed, since it assumes that you will always find alternation in your sub-arrays and that expressions won't contain character classes. Still, if that's the problem you really want to solve for, then this should do it.
First, str.scan( /\[.*?\]/ ) will net you three array elements, each containing pseudo-arrays. Then you map the sub-arrays, splitting on the alternation character. Each element of the sub-array is then stripped of whitespace, and the square brackets deleted. For example:
str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
str.scan( /\[.*?\]/ ).map { |arr| arr.split('|').map { |m| m.strip.delete '[]' }}
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
Verbosely, Step-by-Step
Mapping nested arrays is not always intuitive, so I've unwound the train-wreck above into more procedural code for comparison. The results are identical, but the following may be easier to reason about.
string = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
array_of_strings = string.scan( /\[.*?\]/ )
#=> ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]
sub_arrays = array_of_strings.map { |sub_array| sub_array.split('|') }
#=> [["[test", " blah]"],
# ["[foo ", "bar bar bar]"],
# ["[test", " abc ", "123 ", " 456 789]"]]
stripped_sub_arrays = sub_arrays.map { |sub_array| sub_array.map(&:strip) }
#=> [["[test", "blah]"],
# ["[foo", "bar bar bar]"],
# ["[test", "abc", "123", "456 789]"]]
sub_arrays_without_brackets =
stripped_sub_arrays.map { |sub_array| sub_array.map {|elem| elem.delete '[]'} }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

Related

BASH - Shuffle characters in strings from several rows

I have a file (filename.txt) with the following structure:
>line1
ABC
DEF
GHI
>line2
JKL
MNO
PQR
>line3
STU
VWX
YZ
I would like to shuffle the characters in the strings that do not start with >. The output would (for example) look like the following:
>line1
DGC
FEI
HBA
>line2
JRP
OKN
QML
>line3
SZV
YXT
UW
This is what I tried to shuffle the characters for each >line[number]: ruby -lpe '$_ = $_.chars.shuffle * "" if !/^>/' filename.txt. The command works (see my post BASH - Shuffle characters in strings from file), but it shuffles line by line. I was wondering how I could modify the command to shuffle characters across all the strings under each >line[number]. Using ruby is not a requirement.
First, we need to solve the problem: how to shuffle all characters in multiple lines:
echo -e 'ABC\nDEF\nGHI' |grep -o . |shuf |tr -d '\n'
GDAFHEIBC
We also need an array that records the length of each line in the original strings, so the shuffled characters can be cut back into lines of the same lengths:
s=GDAFHEIBC
lens=(3 3 3)
start=0
for len in "${lens[@]}"; do
echo ${s:${start}:${len}}
((start+=len))
done
GDA
FHE
IBC
So the original lines:
ABC
DEF
GHI
have been shuffled to:
GDA
FHE
IBC
Now we can put the two pieces together:
lens=()
string=""
function shuffle_lines {
local start=0
local shuffled_string=$(grep -o . <<< ${string} |shuf |tr -d '\n')
for len in "${lens[@]}"; do
echo ${shuffled_string:${start}:${len}}
((start+=len))
done
lens=()
string=""
}
while read -r line; do
if [[ "${line}" =~ ^\> ]]; then
shuffle_lines
echo "${line}"
else
string+="${line}"
lens+=(${#line})
fi
done <filename.txt
shuffle_lines
Examples:
$ cat filename.txt
>line1
ABC
DEF
GHI
>line2
JKL
MNO
PQR
>line3
STU
VWX
YZ
>line4
0123
456
78
9
$ ./solution.sh
>line1
HFG
BED
AIC
>line2
JOP
KMQ
RLN
>line3
UVW
TYZ
XS
>line4
1963
245
08
7
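The same length-tracking idea can also be sketched in Ruby, since the question started from a Ruby one-liner. The method name `shuffle_blocks` and the optional `rng` parameter are assumptions introduced for this example; it takes the file contents as a string and, like the shell version, treats any line starting with `>` as a header.

```ruby
# Hypothetical helper: shuffle characters across all lines of each block,
# keeping header lines and the original line lengths intact.
def shuffle_blocks(text, rng: Random.new)
  out = []
  pending = [] # data lines collected since the last header

  flush = lambda do
    chars = pending.join.chars.shuffle(random: rng)
    # Cut the shuffled characters back into the original line lengths.
    pending.each { |line| out << chars.shift(line.length).join }
    pending.clear
  end

  text.each_line(chomp: true) do |line|
    if line.start_with?(">")
      flush.call
      out << line
    else
      pending << line
    end
  end
  flush.call # the last block has no header after it
  out.join("\n")
end

puts shuffle_blocks(">line1\nABC\nDEF\nGHI\n>line2\nJKL\nMNO\nPQR\n")
```

Passing a seeded `Random` as `rng` makes the shuffle repeatable, which is convenient when testing.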
#!/bin/bash
# echo > output.txt # uncomment to write in a file output.txt
mix(){
[[ -n $title ]] || return 0 # first header: nothing collected yet, so skip printing an empty block
{
echo "$title"
line="$( fold -w1 <<< "$line" | shuf )"
echo "${line//$'\n'}" | fold -w3
} # >> output.txt # uncomment to write in a file output.txt
unset line
}
while read -r; do
if [[ $REPLY =~ ^\> ]]; then
mix
title="$REPLY"
else
line+="$REPLY"
fi
done < filename.txt
mix # final mix after the loop exits; otherwise the last block (line3) would not be mixed
exit
edited with comment of gniourf-gniourf
First create a test file.
str =<<FINI
>line1
ABC
DEF
GHI
>line2
JKL
MNO
PQR
>line3
STU
VWX
YZ
FINI
File.write('test', str)
#=> 56
Now read the file and perform the desired operations.
result = File.read('test').split(/(>line\d+)/).map do |s|
if s.match?(/\A(?:|>line\d+)\z/)
s
else
a = s.scan(/\p{Lu}/).shuffle
s.gsub(/\p{Lu}/) { a.shift }
end
end.join
# ">line1\nECF\nHIA\nGBD\n>line2\nJNP\nKLR\nOQM\n>line3\nTXY\nUZV\nSW\n"
puts result
>line1
ECF
HIA
GBD
>line2
JNP
KLR
OQM
>line3
TXY
UZV
SW
To do this from the command line, convert the code to a string with the statements separated by semicolons.
ruby -e "puts (File.read('test').split(/(>line\d+)/).map do |s|; if s.match?(/\A(?:|>line\d+)\z/); s; else; a = s.scan(/\p{Lu}/).shuffle; s.gsub(/\p{Lu}/) { a.shift }; end; end).join"
The steps are as follows.
a = File.read('test')
#=> ">line1\nABC\nDEF\nGHI\n>line2\nJKL\nMNO\nPQR\n>line3\nSTU\nVWX\nYZ\n"
b = a.split(/(>line\d+)/)
#=> ["", ">line1", "\nABC\nDEF\nGHI\n", ">line2", "\nJKL\nMNO\nPQR\n",
# ">line3", "\nSTU\nVWX\nYZ\n"]
Notice that the regular expression that is split's argument places >line\d+ within a capture group. Without that, ">line1", ">line2" and ">line3" would not be included in b.
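The effect of the capture group on `String#split` can be checked in isolation. This is a small sketch on a shortened sample string chosen for the example, not the test file itself.

```ruby
text = ">line1\nABC\n>line2\nDEF\n"

# Without a capture group the delimiters are discarded.
without_group = text.split(/>line\d+/)
p without_group #=> ["", "\nABC\n", "\nDEF\n"]

# With a capture group the matched delimiters are kept in the result.
with_group = text.split(/(>line\d+)/)
p with_group #=> ["", ">line1", "\nABC\n", ">line2", "\nDEF\n"]
```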
c = b.map do |s|
if s.match?(/\A(?:|>line\d+)\z/)
s
else
a = s.scan(/\p{Lu}/).shuffle
s.gsub(/\p{Lu}/) { a.shift }
end
end
#=> ["", ">line1", "\nEAC\nIHB\nDGF\n", ">line2", "\nKQJ\nROL\nMPN\n",
# ">line3", "\nSUY\nXTV\nZW\n"]
c.join
#=> ">line1\nEAC\nIHB\nDGF\n>line2\nKQJ\nROL\nMPN\n>line3\nSUY\nXTV\nZW\n"
Now consider more closely the calculation of c. The first element of b is passed to the block and the block variable s is set to its value:
s = ""
We then compute
s.match?(/\A(?:|>line\d+)\z/)
#=> true
so "" is returned from the block. The regular expression can be expressed as follows.
/
\A # match the beginning of the string
(?: # begin a non-capture group
# match an empty string
| # or
>line\d+ # match '>line' followed by one or more digits
) # end non-capture group
\z # match the end of the string
/x # free-spacing regex definition mode.
Within the non-capture group the empty string was matched.
The next element of b is then passed to the block.
s = ">line1"
Again
s.match?(/\A(?:|>line\d+)\z/)
#=> true
so s is returned from the block.
Now the third element of b is passed to the block. (Finally, something interesting.)
s = "\nABC\nDEF\nGHI\n"
d = s.scan(/\p{Lu}/)
#=> ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
a = d.shuffle
#=> ["D", "C", "G", "H", "B", "F", "I", "E", "A"]
s.gsub(/\p{Lu}/) { a.shift }
#=> "\nDCG\nHBF\nIEA\n"
The remaining calculations are similar.

Display Unique Shell Columns

Given we have two formatted strings that are unrelated to each other.
#test.rb
string_1 = "Title\nfoo bar\nbaz\nfoo bar baz boo"
string_2 = "Unrelated Title\ndog cat farm\nspace moon"
How can I use ruby or call shell commands to have each of these strings displayed as a column in the terminal? The key is that the data in the two strings do not form correlated rows, i.e. this is not a table, but rather 2 lists side by side.
Title               Unrelated Title
foo bar             dog cat farm
baz                 space moon
foo bar baz boo
You can try using the paste and column commands together. Note that these are shell commands, so there must be no spaces around the assignment operator.
$ string_1="Title\nfoo bar\nbaz\nfoo bar baz boo"
$ string_2="Unrelated Title\ndog cat farm\nspace moon"
$ paste -d '|' <(echo -e "$string_1") <(echo -e "$string_2") | column -s'|' -t
Title            Unrelated Title
foo bar          dog cat farm
baz              space moon
foo bar baz boo
We paste the lines with | as delimiter and tell column command to use | as a reference to form columns.
In Ruby, you could do it this way:
#!/usr/bin/env ruby
string_1 = "Title\nfoo bar\nbaz\nfoo bar baz boo"
string_2 = "Unrelated Title\ndog cat farm\nspace moon"
a1 = string_1.split("\n")
a2 = string_2.split("\n")
a1.zip(a2).each { |pair| puts "%-20s%s" % [pair.first, pair.last] }
# or
# a1.zip(a2).each { |left, right| puts "%-20s%s" % [left, right] }
This produces:
Title               Unrelated Title
foo bar             dog cat farm
baz                 space moon
foo bar baz boo
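One caveat with the zip approach: `Array#zip` stops at the length of the receiver, so lines are dropped if the second list is the longer one, and the width of 20 is fixed. The sketch below computes the column width from the data and handles either list being longer; the method name `side_by_side` and the `gap` parameter are assumptions made for this example.

```ruby
# Hypothetical helper: lay two newline-separated lists out side by side.
def side_by_side(left, right, gap: 2)
  l = left.split("\n")
  r = right.split("\n")
  width = l.map(&:length).max.to_i + gap # left column width from the longest entry
  rows = [l.length, r.length].max        # cover whichever list is longer
  (0...rows).map { |i| (l[i].to_s.ljust(width) + r[i].to_s).rstrip }.join("\n")
end

string_1 = "Title\nfoo bar\nbaz\nfoo bar baz boo"
string_2 = "Unrelated Title\ndog cat farm\nspace moon"
puts side_by_side(string_1, string_2)
```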
If you use temp files:
string_1="Title\nfoo bar\nbaz\nfoo bar baz boo"
string_2="Unrelated Title\ndog cat farm\nspace moon"
echo -e "$string_1" >a.txt
echo -e "$string_2" >b.txt
paste a.txt b.txt
I hope it will help.

Why does gsub's '\1' capture group produce this string?

I am confused as to why I am capturing this pattern via '\1' grouping. I am capturing two digits at a time, but why does it skip here:
"123 456 789".gsub(/(\d)(\d)/, '\1')
=> "13 46 79"
I can understand that '\0' gives me the original string:
"123 456 789".gsub(/(\d)(\d)/, '\0')
=> "123 456 789"
This also confuses me, but I can understand '\2' once I learn what '\1' is doing:
"123 456 789".gsub(/(\d)(\d)/, '\2')
=> "23 56 89"
The regex matches "12", "45", "78", and gsub replaces them with "1", "4", "7", respectively, giving "13 46 79".
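To see what each backreference refers to, it can help to list the matches and their groups explicitly. This is only an illustrative sketch; the angle-bracket markers are added here to make the replaced spans visible.

```ruby
s = "123 456 789"

# Each match consumes two digits; the third digit of every number is never matched.
groups = s.scan(/(\d)(\d)/)
p groups #=> [["1", "2"], ["4", "5"], ["7", "8"]]

# Mark each replaced span to show what \1 and \2 hold for every match.
marked = s.gsub(/(\d)(\d)/) { "<#{$1}|#{$2}>" }
p marked #=> "<1|2>3 <4|5>6 <7|8>9"
```

The unmatched trailing digit of each number ("3", "6", "9") is copied through unchanged, which is why it shows up after the replacement.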
To obtain 12 45 78, you need to use
(\d)\d\b
And replace with \1.
Here, we match a digit and capture it ((\d)), then we match another digit (with \d) that is right before a word boundary \b.
IDEONE demo:
puts "123 456 789".gsub(/(\d)\d\b/, '\1')

Printing string fields in Ruby

I have this string in the variable var:
cheese dogs cats alligators
I know I could get the second field in this string, "dogs", using awk if I were on a Linux command line.
> cat var | awk '{print $2}'
dogs
But how would I do this in Ruby?
Ruby has a String#split method that splits on whitespace by default, returning an array whose second element can then be accessed:
irb(main):001:0> 'cheese dogs cats alligators'.split
=> ["cheese", "dogs", "cats", "alligators"]
irb(main):002:0> 'cheese dogs cats alligators'.split[1]
=> "dogs"
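For something closer to awk's 1-based `$2`, the lookup can be wrapped in a small method. The name `field` and its parameters are hypothetical, introduced for this sketch, and it assumes the field number is 1 or greater.

```ruby
# Hypothetical helper: return the n-th (1-based) field of a line, or nil if absent.
# A nil separator makes String#split fall back to splitting on whitespace.
def field(line, n, sep = nil)
  line.split(sep)[n - 1]
end

p field("cheese dogs cats alligators", 2) #=> "dogs"
p field("a:b:c", 3, ":")                  #=> "c"
p field("cheese", 2)                      #=> nil
```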
echo cheese dogs cats alligators | ruby -ne 'puts $_.split[1] '

Succinct way in Ruby to manipulate this string

Sometimes I like learning how to do things the "Ruby" way. I was wondering - what is the most succinct, yet readable way to take a string such as:
foo-bar
and manipulate it to read:
Foo Bar
"foo-bar".split("-").map(&:capitalize).join(" ")
"foo-bar".gsub(/\b(\w)/){|m| m.capitalize}.sub '-', ' '
>> p "foo-bar".scan(/\w+/).map(&:capitalize).join(" ")
"Foo Bar"
=> "Foo Bar"
>> p "foo---bar".scan(/\w+/).map(&:capitalize).join(" ")
"Foo Bar"
=> "Foo Bar"
>> p "foo 123 bar".scan(/\w+/).map(&:capitalize).join(" ")
"Foo 123 Bar"
=> "Foo 123 Bar"
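The split and scan versions agree on "foo-bar" but differ once hyphens repeat, which may matter when choosing between them. A short comparison sketch follows; the method names are made up for this example.

```ruby
# Hypothetical wrappers around the two one-liners above.
def titleize_split(str)
  str.split("-").map(&:capitalize).join(" ")
end

def titleize_scan(str)
  str.scan(/\w+/).map(&:capitalize).join(" ")
end

p titleize_split("foo-bar")   #=> "Foo Bar"
p titleize_scan("foo-bar")    #=> "Foo Bar"

# Repeated hyphens produce empty pieces with split, so extra spaces appear.
p titleize_split("foo---bar") #=> "Foo   Bar"
p titleize_scan("foo---bar")  #=> "Foo Bar"
```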
string = "foo-bar"
string.split("-").map(&:capitalize).join(" ") # => "Foo Bar"
