Wrapping delimited lines, retaining first column, with minimum final length - ruby

Looking to split up lines of content, retaining a headword.
I do a ton of text processing, and I like to use unix one-liners because they are easy for me to organize over time (vs. tons of scripts), I can easily chain them together, and I like (re)learning how to use classic unix functions. Often I will use a short awk, perl, or ruby one-liner, depending on which is the most elegant.
Here I have lines with X number of comma-delimited items. I want to divide these up, retaining the headword.
INPUT:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
OUTPUT:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
Algorithm details:
input lines consist of a headword, then equals-sign, then a comma delimited list of at least 1 item.
In this example, most words are singles, but words could contain spaces (e.g. "horseshoe crab" at the end)
The split is every 9 items, UNLESS fewer than 3 items would be left for the final line, in which case they are appended to the previous line (so the final line could hold up to 12)
There are multiple lines. e.g. the next line could be planets.
I had an idea to escape spaces, then use unix fold, and then awk to pull down the first column. This works exactly like the above:
echo "animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab" \
| \tr ' ,' '_ ' \
| fold -s \
| perl -pe 's/=/\t/; s/^_/\t_/g;' \
| awk 'BEGIN{FS=OFS="\t"} $1==""{$1=p} {p=$1} 1' \
| tr '\t _' '=, '
But it only considers character length (not item count), and fails to consider my special case that I don't want <3 items hanging on the final line.
I think this is an elegant little puzzle, got ideas?

With Perl, one way
perl -wnE'
($head, @items) = split /\s*[,=]\s*/;
while (@items) {
@elems = splice @items, 0, 9;
if (@elems < 3) { $lines[-1] .= ", " . join ", ", @elems }
else { push @lines, join ", ", @elems }
}
say "$head = $_" for @lines; @lines = ()
' file
or
perl -wnE'
($head, @items) = split /\s*[,=]\s*/;
push @lines, join ", ", splice @items, 0, 9 while @items;
$lines[-2] .= ", " . pop @lines if 2 > $lines[-1] =~ tr/,//;
say "$head = $_" for @lines; @lines = ()
' file
Shown over multiple lines for readability, and can be copy-pasted into a bash terminal as such, but they can also be entered on one line. Tested with an added line of 11 (9+2) items.
Notes
split-ing by either , or = extracts the head-word first, and then the items on a line
splice removes and returns (the first 9) elements, which joined by , generate a line to print, until all elements are processed. The last group is added to the previous line-to-print instead if it has fewer than 3 elements
In the second version all elements are processed and then the last line-to-print checked for whether it need be added to the previous one instead, by counting commas in it

You may consider this awk:
awk 'BEGIN {FS=OFS=" = "} {
s = $2
while (match(s, /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/)) {
v = substr(s, RSTART, RLENGTH)
sub(/, $/, "", v)
print $1, v
s = substr(s, RLENGTH+1)
}
}' file
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
Pay special attention to regex used here /([^,]+, ){1,9}(([^,]+, ){2}[^,]+$)?/
That matches 1 to 9 items, each followed by the ", " delimiter. The regex also has an optional part that matches exactly 3 more items at the end of the line, so the final line can hold up to 12 items.

With your shown samples only, please try the following awk program. It was written and tested in GNU awk and should work in any awk.
I have created an awk variable named numberOfFields, which holds the number of fields to print on each output line (lines are separated by a newline, as in the shown samples).
awk -v numberOfFields="9" '
BEGIN{
FS=", ";OFS=", "
}
{
line=$0
sub(/ = .*/,"",line)
sub(/^[^ ]* =[^ ]* /,"")
for(i=1;i<=NF;i++){
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\
(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{
print ""
}
' Input_file
OR: the code above splits the printf statement across 2 lines for readability. If you want it on a single line, try the following:
awk -v numberOfFields="9" '
BEGIN{
FS=", ";OFS=", "
}
{
line=$0
sub(/ = .*/,"",line)
sub(/^[^ ]* =[^ ]* /,"")
for(i=1;i<=NF;i++){
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{
print ""
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -v numberOfFields="9" ' ##Starting awk program from here, creating variable named numberOfFields and setting its value to 9 here.
BEGIN{ ##Starting BEGIN section of awk here.
FS=", ";OFS=", " ##Setting FS and OFS to comma space here.
}
{
line=$0 ##Setting value of $0 to line here.
sub(/ = .*/,"",line) ##Substituting space = space everything till last of value in line with NULL.
sub(/^[^ ]* =[^ ]* /,"") ##Substituting from starting till first occurrence of space followed by = followed by again first occurrence of space with NULL in current line.
for(i=1;i<=NF;i++){ ##Running for loop here for all fields.
printf("%s",(i%numberOfFields==0?OFS $i ORS line" = ":\ ##Using printf and its conditions are explained below of code.
(i==1?line " = " $i:(i%numberOfFields>1?OFS $i:$i))))
}
}
END{ ##Starting END block of this program from here.
print "" ##Printing newline here.
}
' Input_file ##Mentioning Input_file name here.
Explanation of printf condition above:
(
i%numberOfFields==0 ##checking if the modulus value of i%numberOfFields is 0 here, if this is TRUE:
?OFS $i ORS line" = " ##Then printing OFS $i ORS line" = "(comma space field value new line line variable and space = space)
:(i==1 ##If very first condition is FALSE then checking again if i==1
?line " = " $i ##Then print line variable followed by space = space followed by $i
:(i%numberOfFields>1?OFS $i:$i) ##Else if the modulus value of i%numberOfFields is greater than 1 then print OFS $i else print $i.
)
)

One awk idea:
awk -F'[=,]' -v min=3 -v max=9 '
{ for (i=2; i<=NF; i++) {
if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
if ( i > max ) print newline
newline=$1 "="
pfx=""
}
newline=newline pfx $i
pfx=","
}
print newline
}
' raw.dat
Sample data:
$ cat raw.dat
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
With -v min=3 -v max=9 we get:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13
Addressing OP's comment about using one-liners ...
While this awk script can certainly be jammed into a one-liner, I'm guessing OP will find it a) hard to edit/maintain and b) too easy to screw up if it has to be (re)typed over and over again.
One (obvious?) idea is to wrap the awk code in a function, eg:
splitme() {
awk -F'[=,]' -v min=$1 -v max=$2 '
{ for (i=2; i<=NF; i++) {
if ( (i-1) % max == 1 && (NF-i+1 > min) ) {
if ( i > max ) print newline
newline=$1 "="
pfx=""
}
newline=newline pfx $i
pfx=","
}
print newline
}' "${3:--}"
}
NOTES:
parameterized the min and max values so as to pull from the command line
parameterized the file reference to pull from either the command line ($3) or stdin (-)
OP can add more logic to verify/validate input parameters as needed
Whether calling as a standalone against a file:
$ splitme 3 9 raw.dat
Or calling in a pipeline:
$ cat raw.dat | splitme 3 9
Both generate:
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
numbers2 = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers2 = 10, 11, 12, 13

awk -F"[=,]" -v max="9" '{
for(i=2; i<=NF; i+=max){
row = ""
for(j=i; j<=i+max-1; j++){
row=row $(j) ","
}
gsub(/,+$/, "", row)
printf "%s=%s \n", $1, row
}
}' input_file
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto
planets = vulcan, arrakis, hoth, naboo
numbers = 1, 2, 3, 4, 5, 6, 7, 8, 9
numbers = 10, 11, 12, 13, 14, 15, 16
cars = mercedes benz, bmw, audi, vw, porsche, seat, skoda, opel, renault
cars = mazda, toyota, honda

Here are two Ruby solutions to process one line. The variable str holds one line (the line beginning 'animals = ...' in the example).
#1 Use a regular expression
RGX = /\A\w+| *= *|(?:[^,]+, *){0,10}[^,]+\z|(?:[^,]+, *){9}/
def break_line(str)
headword, _, *lines = str.scan(RGX)
lines.each { |line| puts "#{headword} = #{line.sub(/, *\z/, '')}" }
end
break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
The regular expression can be written in free-spacing mode to make it self-documenting.
RGX =
/
\A # match beginning of string
\w+ # match one or more word chars (e.g., "animals")
| # or
[ ]*=[ ]* # "=" preceded and followed by zero or more spaces
| # or
(?: # begin a non-capture group
[^,]+ # match one or more chars other than a comma
,[ ]* # match a comma and zero or more spaces
){0,10} # end non-capture group and execute 0-10 times
[^,]+ # match one or more chars other than a comma
\z # match end of string
| # or
(?: # begin a non-capture group
[^,]+ # match one or more chars other than a comma
,[ ]* # match a comma and zero or more spaces
){9} # end non-capture group and execute exactly 9 times
/x # invoke free-spacing regex definition mode
Demo
When executed for the example str we would find the following.
headword
#=> "animals"
_
#=> " = "
lines
#=> ["lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, ",
"goose, horse, mouse, pig, dog, frog, bug, fish, duck, ",
"camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, ",
"deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, ",
"elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, ",
"shark, salmon, shrimp, mosquito, horseshoe crab"]
Ruby has a convention of using the variable _ in situations when its value is not subsequently used in calculations. This is mainly to so-inform the reader.
#2 Extract and group words
def break_line(str)
headword, *words = str.split(/ *[,=] */)
groups = words.each_slice(9).to_a
if groups.size > 1 && groups[-1].size < 3
groups[-2] += groups[-1]
groups.pop
end
groups.each { |group| puts "#{headword} = #{group.join(', ')}" }
end
break_line(str)
animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
animals = shark, salmon, shrimp, mosquito, horseshoe crab
By way of a partial explanation, we would obtain the following for the example:
headword
#=> "animals"
words
#=> ["lizard", "bird", ..., "horseshoe crab"]
groups
#=> [["lizard", "bird", "bee", "snake", "whale", "eagle",
"beetle", "mule", "hare"],
["goose", "horse", "mouse", "pig", "dog", "frog",
"bug", "fish", "duck"],
["camel", "squirrel", "owl", "chicken", "pigeon", "lion",
"sheep", "bear", "spider"],
["deer", "tiger", "lobster", "dinosaur", "cat", "goat",
"rat", "cricket", "rabbit"],
["elephant", "crow", "fox", "donkey", "monkey", "butterfly",
"crab", "leopard", "moth"],
["shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]]
As the last element of groups contains more than two elements (it contains five), groups is not subsequently modified. Had the last line been permitted to have at most 14 (rather than 11) elements, the last element of groups would have been changed to
["elephant", "crow", "fox", "donkey", "monkey", "butterfly", "crab",
"leopard", "moth", "shark", "salmon", "shrimp", "mosquito", "horseshoe crab"]
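Approach #2 can be extended to several input lines and to configurable sizes. The following is only a sketch under stated assumptions: the method name wrap_line and the keyword arguments per_line and min_tail are made up for this illustration, and items are assumed never to contain "," or "=".

```ruby
# Hypothetical helper: the name and the defaults are illustrative assumptions.
def wrap_line(str, per_line: 9, min_tail: 3)
  headword, *words = str.strip.split(/ *[,=] */)
  groups = words.each_slice(per_line).to_a
  # Fold a short final group into the previous one, if a previous one exists.
  if groups.size > 1 && groups[-1].size < min_tail
    tail = groups.pop
    groups[-1] += tail
  end
  groups.map { |group| "#{headword} = #{group.join(', ')}" }
end

input = <<~DATA
  planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto, vulcan, arrakis
  numbers = 1, 2
DATA

# Each input line is wrapped independently and keeps its own headword.
input.each_line { |line| puts wrap_line(line) }
```

Here the 11 planets stay on one line because the leftover group of 2 is smaller than min_tail, and the short numbers line is printed unchanged.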

Took a while to modify my solution to make it work across both gawk and mawk by performing the equivalent of $1 = $1 towards the end of the regex chain;
$(NF!=NF=NF) expands to NF != (NF=NF) internally, which is always false, so the whole thing just means $0, but embedding $1=$1 within it :
input ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare, goose, horse, mouse, pig, dog, frog, bug, fish, duck, camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider, deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit, elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth, shark, salmon, shrimp, mosquito, horseshoe crab
2 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
command ::
[mg]awk '
BEGIN {
FS = (OFS = " = ") "*"
_=__ = (___="[^,]+")"[,]"
gsub(".",_,__)
__ = (__)_ "(("_")?("_")?"___"$)?"
_ = ORS } gsub(__,"&"_ $1 OFS)+gsub("[,]"_,_)+sub((_)"+([^,]*)$","", $(NF!=NF=NF))'
output (mawk 1.3.4) ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
2 animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
3 animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
4 animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
5 animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
6 animals = shark, salmon, shrimp, mosquito, horseshoe crab
7 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX
output (gawk 5.1.1) ::
1 animals = lizard, bird, bee, snake, whale, eagle, beetle, mule, hare
2 animals = goose, horse, mouse, pig, dog, frog, bug, fish, duck
3 animals = camel, squirrel, owl, chicken, pigeon, lion, sheep, bear, spider
4 animals = deer, tiger, lobster, dinosaur, cat, goat, rat, cricket, rabbit
5 animals = elephant, crow, fox, donkey, monkey, butterfly, crab, leopard, moth
6 animals = shark, salmon, shrimp, mosquito, horseshoe crab
7 planets = mercury, venus, earth, mars, jupiter, saturn, uranus, neptune, pluto-cuz-it-shoudlve-been, planetX

Related

Ignore Lorem Ipsum text in a file Ruby

I have a .txt file that has last name, first name on one line and on every other line I have Lorem Ipsum text. I need to detect the Lorem Ipsum in every other line and skip it.
example txt.file
Spade, Kate
Voluptatem ipsam et at.
Vuitton, Louis
Facere et necessitatibus animi.
Bucks, Star
Eveniet temporibus ducimus amet eaque.
Cage, Nicholas
Unde voluptas sit fugit.
Brown, James
Maiores ab officia sed.
expected output:
#Spade, Kate
#Vuitton, Louis
#Bucks, Star
#Cage, Nicholas
#Brown, James
Reading 2 lines and ignoring the second:
File.open("test.txt", "r") do |f|
f.each_slice(2) do |odd, _even|
puts odd
end
end
If you just want to skip every second line you can do something like this:
File.open("text.txt", "r") do |f|
f.each_line.with_index do |line, i|
next unless i.even?
puts line
end
end
#Spade, Kate
#Vuitton, Louis
#Bucks, Star
#Cage, Nicholas
#Brown, James
Now I'm not really good with regexp, but you could also do something like this to process only the lines that are two words, both starting with a capital letter separated by a comma and space (basically first name and last name):
File.open("text.txt", "r") do |f|
f.each_line do |line|
next unless line =~ /[A-Z][a-z]+, [A-Z][a-z]+/
puts line
end
end
#Spade, Kate
#Vuitton, Louis
#Bucks, Star
#Cage, Nicholas
#Brown, James
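The pattern above is unanchored, so it would also accept a longer line that merely contains a "Last, First" pair somewhere inside it. A minimal sketch with anchors might look like this; the constant NAME_LINE and the method name_lines are hypothetical, and it assumes each name is a single capitalized ASCII word.

```ruby
# Hypothetical names; assumes "Last, First" made of single ASCII words.
NAME_LINE = /\A[A-Z][a-z]+, [A-Z][a-z]+\z/

# Returns only the lines that consist of exactly one "Last, First" pair.
def name_lines(lines)
  lines.map(&:chomp).select { |line| line.match?(NAME_LINE) }
end

sample = [
  "Spade, Kate\n",
  "Voluptatem ipsam et at.\n",
  "Vuitton, Louis\n",
  "Facere et necessitatibus animi.\n"
]
puts name_lines(sample)
```

With a real file, the same method could be fed File.foreach("text.txt"), which returns an enumerator of lines when called without a block.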
You could also load the full Lorem Ipsum text from a file like this:
lorem = File.readlines("lorem.txt", chomp: true).join(" ")
And then check each line if it's contained in the Lorem Ipsum text:
File.open("text.txt", "r") do |f|
f.each_line do |line|
next if lorem.include?(line[0...-1]) #removing the last character because you seem to have a dot at the end even though in the lorem text there's no dot on these positions.
puts line
end
end
#Spade, Kate
#Vuitton, Louis
#Bucks, Star
#Cage, Nicholas
#Brown, James
Now depending on what you want to do with the data you can replace the puts line line with something else.
Your description is unclear. If you just want to skip every other line, you can do something like this:
File.foreach("test.txt").with_index(1) do |l, i|
next if i.even?
puts l
end
Let's first create a file.
FName = 'temp.txt'
IO.write(FName,
<<~END
Spade, Kate
Voluptatem ipsam et at.
Vuitton, Louis
Facere et necessitatibus animi.
Bucks, Star
Eveniet temporibus ducimus amet eaque.
Cage, Nicholas
Unde voluptas sit fugit.
Brown, James
Maiores ab officia sed.
END
)
#=> 211
Here's one way to return every other line.
IO.foreach(FName).each_slice(2).map(&:first)
#=> ["Spade, Kate\n", "Vuitton, Louis\n", "Bucks, Star\n",
# "Cage, Nicholas\n", "Brown, James\n"]
See IO::write, IO::foreach, Enumerable#each_slice and Array#map.
Note that foreach, each_slice and map all return enumerators when they are not given a block. We therefore obtain the following:
enum0 = IO.foreach(FName)
#=> #<Enumerator: IO:foreach("temp.txt")>
enum1 = enum0.each_slice(2)
#=> #<Enumerator: #<Enumerator: IO:foreach("temp.txt")>:each_slice(2)>
enum2 = enum1.map
#=> #<Enumerator: #<Enumerator: #<Enumerator: IO:foreach("temp.txt")>
# :each_slice(2)>:map>
enum2.each(&:first)
#=> ["Spade, Kate\n", "Vuitton, Louis\n", "Bucks, Star\n",
# "Cage, Nicholas\n", "Brown, James\n"]
Examine the return values for the calculation of enum1 and enum2. It may be helpful to think of these as compound enumerators.
Two other ways:
enum = [true, false].cycle
#=> #<Enumerator: [true, false]:cycle>
IO.foreach(FName).select { enum.next }
#=> <as above>
keep = false
IO.foreach(FName).select { keep = !keep }
#=> <as above>

How to extract values from string with its formatted mask in Ruby

We can do it in Ruby: "I have %{amount} %{food}" % {amount: 5, food: 'apples'} to get "I have 5 apples". Is there common way for the inverse transformation: using "I have 5 apples" and "I have %{amount} %{food}" to get {amount: 5, food: 'apples'}?
def doit(s1, s2)
a1 = s1.split
a2 = s2.split
a2.each_index.with_object({}) do |i,h|
word = a2[i][/(?<=%\{).+(?=\})/]
h[word.to_sym] = a1[i] unless word.nil?
end.transform_values { |s| s.match?(/\A\-?\d+\z/) ? s.to_i : s }
end
s1 = "I have 5 apples"
s2 = "I have %{amount} %{food}"
doit(s1, s2)
#=> {:amount=>5, :food=>"apples"}
s1 = "223 parcels were delivered last month"
s2 = "%{number} parcels were %{action} last %{period}"
doit(s1, s2)
#=> {:number=>223, :action=>"delivered", :period=>"month"}
The regular expression reads, "match one or more characters (.+), immediately preceded by "%{" ((?<=%\{) being a positive lookbehind) and immediately followed by "}" ((?=\}) being a positive lookahead)".
If the substrings are separated with spaces, you could find the corresponding regex with named captures:
text = "I have 5 apples"
# "I have %{amount} %{food}"
format = /\AI have (?<amount>\S+) (?<food>\S+)\z/
p text.match(format).named_captures
# {"amount"=>"5", "food"=>"apples"}
You didn't show any code, so I'll leave it as an exercise to transform the "I have %{amount} %{food}" string into the /\AI have (?<amount>\S+) (?<food>\S+)\z/ regex.
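One possible way to do that exercise is sketched below. The method names mask_to_regex and extract are made up for this illustration, and the sketch assumes that placeholders look like %{name} (word characters only, starting with a letter or underscore) and that each extracted value contains no whitespace.

```ruby
# Hypothetical helper: turns "I have %{amount} %{food}" into
# /\AI\ have\ (?<amount>\S+)\ (?<food>\S+)\z/
def mask_to_regex(mask)
  # Splitting on a capture group keeps the placeholders in the result.
  parts = mask.split(/(%\{\w+\})/)
  source = parts.map do |part|
    if (m = part.match(/\A%\{(\w+)\}\z/))
      "(?<#{m[1]}>\\S+)"    # placeholder becomes a named capture
    else
      Regexp.escape(part)   # literal text is matched verbatim
    end
  end.join
  Regexp.new("\\A#{source}\\z")
end

# Returns a hash of symbol keys to string values, or nil when the text
# does not fit the mask.
def extract(text, mask)
  m = mask_to_regex(mask).match(text)
  m && m.named_captures.transform_keys(&:to_sym)
end

p extract("I have 5 apples", "I have %{amount} %{food}")
```

The values come back as strings; converting "5" to an integer would be a separate step, as in the first answer.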

How to split string in ruby [duplicate]

I have a string:
"1 chocolate bar at 25"
and I want to split this string into:
[1, "chocolate bar", 25]
I don't know how to write a regex for this split. And I wanted to know whether there are any other functions to accomplish it.
You could use scan with a regex:
"1 chocolate bar at 25".scan(/^(\d+) ([\w ]+) at (\d+)$/).first
The above method doesn't work if item_name has special characters.
If you want a more robust solution, you can use split:
number1, *words, at, number2 = "1 chocolate bar at 25".split
p [number1, words.join(' '), number2]
# ["1", "chocolate bar", "25"]
number1 is the first part, number2 is the last one, at the second to last, and *words is an array with everything in-between. number2 is guaranteed to be the last word.
This method has the advantage of working even if there are numbers in the middle, " at " somewhere in the string or if prices are given as floats.
It is not necessary to use a regular expression.
str = "1 chocolate bar, 3 donuts and a 7up at 25"
i1 = str.index(' ')
#=> 1
i2 = str.rindex(' at ')
#=> 35
[str[0,i1].to_i, str[i1+1..i2-1], str[i2+3..-1].to_i]
#=> [1, "chocolate bar, 3 donuts and a 7up", 25]
I would do:
> s="1 chocolate bar at 25"
> s.scan(/[\d ]+|[[:alpha:] ]+/)
=> ["1 ", "chocolate bar at ", "25"]
Then to get the integers and the stripped string:
> s.scan(/[\d ]+|[[:alpha:] ]+/).map {|s| Integer(s) rescue s.strip}
=> [1, "chocolate bar at", 25]
And to remove the " at":
> s.scan(/[\d ]+|[[:alpha:] ]+/).map {|s| Integer(s) rescue s[/.*(?=\s+at\s*)/]}
=> [1, "chocolate bar", 25]
You could call captures on the MatchData returned by match with the regex (\d+) +(\D+) +at +(\d+):
string.match(/(\d+) +(\D+) +at +(\d+)/).captures
Validating input string
If you didn't validate your input string to be within desired format already, then there may be a better approach in validating and capturing data. This solution also brings the idea of accepting any type of character in item_name field and decimal prices at the end:
string.match(/^(\d+) +(.*) +at +(\d+(?:\.\d+)?)$/).captures
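As a rough sketch of how the validating regex could be wrapped up, the helper below returns nil for lines that do not fit the format and converts the numeric parts. The names ITEM and parse_item are hypothetical, and \A and \z are used in place of ^ and $ so that only whole strings match.

```ruby
# Hypothetical helper built around the validating regex.
ITEM = /\A(\d+) +(.*) +at +(\d+(?:\.\d+)?)\z/

# Returns [quantity, name, price], or nil when the line does not fit.
def parse_item(line)
  m = ITEM.match(line)
  return nil unless m

  quantity, name, price = m.captures
  [quantity.to_i, name, price.include?(".") ? price.to_f : price.to_i]
end

p parse_item("1 chocolate bar at 25")    # [1, "chocolate bar", 25]
p parse_item("2 bags of m&m's at 3.50")  # [2, "bags of m&m's", 3.5]
p parse_item("no quantity here")         # nil
```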
You can also do something like this:
"1 chocolate bar at 25"
.split()
.reject {|string| string == "at" }
.map {|string| string.scan(/^\D+$/).empty? ? string.to_i : string }
Code Example: http://ideone.com/s8OvlC
I live in a country where prices might be floats, hence the more sophisticated matcher for the price.
"1 chocolate bar at 25".
match(/\A(\d+)\s+(.*?)\s+at\s+(\d[.\d]*)\z/).
captures
#⇒ ["1", "chocolate bar", "25"]

need help printing contents of a ruby hash into a table

I have a file that contains this:
PQRParrot, Quagga, Raccoon
DEFDo statements, Else statements, For statements
GHIGeese, Hippos, If statements
YZ Yak, Zebra
JKLJelly Fish, Kudu, Lynx
MNOManatee, Nautilus, Octopus
ABCApples, Boas, Cats
VWXVulture, While statements, Xmen
STUSea Horse, Tapir, Unicorn
I need to display it in a table like this:
Key Data
ABC Apples, Boas, Cats
DEF Do statements, Else statements, For statements
GHI Geese, Hippos, If statements
JKL Jelly Fish, Kudu, Lynx
MNO Manatee, Nautilus, Octopus
PQR Parrot, Quagga, Raccoon
STU Sea Horse, Tapir, Unicorn
VWX Vulture, While statements, Xmen
YZ Yak, Zebra
Here is the code that I have so far:
lines = File.open("file.txt").read.split
fHash = {}
lines.each do |line|
next if line == ""
fHash[line[0..2]] = line[3..-1]
end
f = File.open("file.txt")
fHash = {}
loop do
x = f.gets
break unless x
fHash[x[0..2]] = x[3..-1]
end
fHash = fHash.to_a.sort.to_h
puts fHash
f.close
And this is what the code outputs:
{ "ABC" => "Apples, Boas, Cats\n",
"DEF" => "Do statements, Else statements, For statements\n",
"GHI" => "Geese, Hippos, If statements\n",
"JKL" => "Jelly Fish, Kudu, Lynx\n",
"MNO" => "Manatee, Nautilus, Octopus\n",
"PQR" => "Parrot, Quagga, Raccoon\n",
"STU" => "Sea Horse, Tapir, Unicorn\n",
"VWX" => "Vulture, While statements, Xmen\n",
"YZ " => "Yak, Zebra\n"
}
So what I'm trying to do is read the contents of the file, take the first three characters of each line as the key and the rest as the data, sort the hash by key, then display the data as a table.
I have looked around, found a few things similar to my issue but nothing worked out for me.
I think you're overthinking this. If you have a file with those contents, to print the table all you need to do is insert a space after the third character of each line and then sort them (or the other way around). That's pretty simple:
lines = File.foreach("homework02.txt")
.map {|line| line.insert(3, " ") }
puts "Key Data"
puts lines.sort
If instead you want to build a Hash from the lines of the file, all you have to do is this:
hsh = File.foreach("homework02.txt")
.map {|line| [ line.slice!(0,3), line ] }
.sort.to_h
This builds an array of two-element arrays whose first element is the first three characters of each line and whose second is the rest of the line, then sorts it and turns it into a hash.
Then, to print your table:
puts "Key Data"
puts hsh.map {|key, val| "#{key} #{val}" }
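If the columns should line up as in the desired table, a fixed-width format string is one option. This is a minimal sketch, assuming keys are at most three characters; the method name table_rows and the in-memory sample array (standing in for the file's lines) are made up for illustration.

```ruby
# Hypothetical sample data standing in for the lines of the file.
lines = [
  "PQRParrot, Quagga, Raccoon\n",
  "YZ Yak, Zebra\n",
  "ABCApples, Boas, Cats\n"
]

# Splits each line into a 3-character key and the rest, sorts by key,
# and pads the key column to a fixed width of 5.
def table_rows(lines)
  rows = lines.map { |line| [line[0, 3].strip, line[3..-1].to_s.chomp] }.sort
  [format("%-5s%s", "Key", "Data")] + rows.map { |key, val| format("%-5s%s", key, val) }
end

puts table_rows(lines)
```

Stripping the key means "YZ " is stored as "YZ", and the padding in the format string restores the alignment when printing.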
I got it to work by changing the way it sorts. Updated code below.
lines = File.open("homework02.txt").read.split
fHash = {}
lines.each do |line|
next if line == ""
fHash[line[0..2]] = line[3..-1]
end
f = File.open("homework02.txt")
fHash = {}
loop do
x = f.gets
break unless x
fHash[x[0..2]] = x[3..-1]
end
fHash = Hash[fHash.sort_by { |k, v| k }]
print "Key ", " Data \n"
fHash.each do |key, val|
print key, " ", val
end
f.close
I have assumed that every line begins with one or more capital letters, followed by an optional space, followed by a capital letter, followed by a lowercase letter.
Code
R = /
\A[A-Z]+ # Match start of string followed by one or more capital letters
\K # Forget everything matched so far
(?=[A-Z][a-z]) # Match a capital letter followed by a lowercase letter
# in a positive lookahead
/x # Extended/free-spacing regex definition mode
Read the file, line by line, format each line, partition each line on the first space and sort:
def make_string(fname)
File.foreach(fname).map { |s| s.gsub(R, ' ').chomp.partition(' ') }.
sort.
map(&:join)
end
If you instead wish to create the specified hash, you could write:
def make_hash(fname)
File.foreach(fname).map { |s| s.gsub(R, ' ').chomp.partition(' ') }.
sort.
map { |f,_,l| [f,l] }.
to_h
end
In the regex the first part of the string cannot be matched in a positive lookbehind because the match is variable-length. That's why I used \K, which does not have that limitation.
Examples
First, let's create the file:
str = <<_
PQRParrot, Quagga, Raccoon
DEFDo statements, Else statements, For statements
GHIGeese, Hippos, If statements
YZ Yak, Zebra
JKLJelly Fish, Kudu, Lynx
MNOManatee, Nautilus, Octopus
ABCApples, Boas, Cats
VWXVulture, While statements, Xmen
STUSea Horse, Tapir, Unicorn
_
FName = 'temp'
File.write(FName, str)
#=> 265
Then
puts make_string(FName)
ABC Apples, Boas, Cats
DEF Do statements, Else statements, For statements
GHI Geese, Hippos, If statements
JKL Jelly Fish, Kudu, Lynx
MNO Manatee, Nautilus, Octopus
PQR Parrot, Quagga, Raccoon
STU Sea Horse, Tapir, Unicorn
VWX Vulture, While statements, Xmen
YZ Yak, Zebra
make_hash(FName)
#=> {"ABC"=>"Apples, Boas, Cats",
# "DEF"=>"Do statements, Else statements, For statements",
# "GHI"=>"Geese, Hippos, If statements",
# "JKL"=>"Jelly Fish, Kudu, Lynx",
# "MNO"=>"Manatee, Nautilus, Octopus",
# "PQR"=>"Parrot, Quagga, Raccoon",
# "STU"=>"Sea Horse, Tapir, Unicorn",
# "VWX"=>"Vulture, While statements, Xmen",
# "YZ"=>"Yak, Zebra"}
As a second example, suppose:
str = <<_
PQRSTUVParrot, Quagga, Raccoon
DEFDo statements, Else statements, For statements
Y Yak, Zebra
_
FName = 'temp'
File.write(FName, str)
#=> 94
Then
puts make_string(FName)
DEF Do statements, Else statements, For statements
PQRSTUV Parrot, Quagga, Raccoon
Y Yak, Zebra
make_hash(FName)
# => {"DEF"=>"Do statements, Else statements, For statements",
# "PQRSTUV"=>"Parrot, Quagga, Raccoon", "Y"=>"Yak, Zebra"}

How to append a hash to an array?

lines = ["title= flippers dippers track= 9", "title= beaner bounce house track= 3", "title= fruit jams live track= 12"]
songs_formatted = []
songs = {}
lines.each do |line|
line =~ /title=\s?(.*)\s+t/
title = "#$1".strip
songs[:title] = title
line =~ /track=\s?(.*)/
track = "#$1".strip
songs[:track] = track
songs_formatted << songs
end
p songs_formatted
#=> [{:title=>"flippers dippers", :track=>"9"}]
#=> [{:title=>"beaner bounce house", :track=>"3"}, {:title=>"beaner bounce house", :track=>"3"}]
#=> [{:title=>"fruit jams live", :track=>"12"}, {:title=>"fruit jams live", :track=>"12"}, {:title=>"fruit jams live", :track=>"12"}]
Each successive line is overwriting the line before it. Why isn't this just appending in order? Desired result is:
songs_formatted = [{:title=>"flippers dippers", :track=>"9"}, {:title=>"beaner bounce house", :track=>"3"}, {:title=>"fruit jams live", :track=>"12"}]
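The songs hash needs to be created inside the each loop. Otherwise every iteration mutates and appends the same hash object, so the array ends up holding three references to one hash. Working code: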
lines = ["title= flippers dippers track= 9", "title= beaner bounce house track= 3", "title= fruit jams live track= 12"]
songs_formatted = []
lines.each do |line|
songs = {}
line =~ /title=\s?(.*)\s+t/
title = "#$1".strip
songs[:title] = title
line =~ /track=\s?(.*)/
track = "#$1".strip
songs[:track] = track
songs_formatted << songs
end
p songs_formatted
Proper output:
#=> [{:title=>"flippers dippers", :track=>"9"}, {:title=>"beaner bounce house", :track=>"3"}, {:title=>"fruit jams live", :track=>"12"}]
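The difference comes down to object identity: << stores a reference to the hash, not a copy of it. A small sketch (variable names are illustrative only) that makes this visible:

```ruby
# One hash created outside the loop: every element is the same object.
shared = {}
aliased = [1, 2, 3].map { |n| shared[:track] = n; shared }

# A new hash created on each iteration: three distinct objects.
fresh = [1, 2, 3].map { |n| { track: n } }

p aliased                               # all three show the last value written
p aliased.map(&:object_id).uniq.size    # 1
p fresh.map(&:object_id).uniq.size      # 3
```

If the hash must be created outside the loop for some reason, appending songs.dup instead of songs would also store independent copies.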
Since you want one output per line, you can use map. Also, you can extract both with one regex.
lines.map do |line|
title, track = line.match(/title=\s?(.*?)\s*track=\s?(\d+)/)[1,2]
{title: title, track: track}
end
This gives you the output you want.
