Simple one-liner to merge lines with common first field - ruby

In my work building an English language database, I often deal with text content from different sources and need to merge lines that share the same first field. I often hack this in a text editor with a regex that captures the first field and searches across "\n", but my text files are often >10GB, so a streaming command-line solution is preferable to an in-memory one.
Sample input:
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
Desired output:
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
The logic is to concatenate (joined by "|") all lines with the same first field.
The only delimiter is "|", and the delimiter only appears once per input line. i.e. it's effectively a 2-column text file. The file sorting does not matter, the only concern is consecutive lines with the identical first field.
I have lots of solutions and one-liners (often in awk or ruby) to process same-line content, but I run into knots when dealing with multiple lines, and would appreciate help. For some reason, multiline processing always bogs me down.
I'm sure this can be done succinctly with awk.

Assumptions/understandings:
overall file may not be sorted (by 1st field)
all lines with the same string in the 1st field will be listed consecutively; this should eliminate the need to maintain a large volume of data in memory with the tradeoff that we'll need a bit more typing
2nd field may contain trailing white space (per sample input); this will need to be removed
output does not need to be sorted (by 1st field)
One awk idea:
awk '
function print_line() {
    if (prev != "")
        print prev, data
}
BEGIN { FS=OFS="|" }
{ if ($1 != prev) {
      print_line()
      prev = $1
      data = ""
  }
  gsub(/[[:space:]]+$/,"",$2)           # strip trailing white space
  data = data (data=="" ? "" : OFS) $2  # concatenate 2nd fields with OFS="|"
}
END { print_line() }                    # flush last set of data to stdout
' pipe.dat
This generates:
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise

Using any awk in any shell on every Unix box and assuming your input is grouped by the first field as shown in your sample input and you don't really have trailing blanks at the end of some lines:
$ cat tst.awk
BEGIN { FS=OFS="|" }
$1 != prev {
    if ( NR>1 ) {
        print out
    }
    out = prev = $1
}
{ out = out OFS $2 }
END { print out }
$ awk -f tst.awk file
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
If it's not grouped then do sort file | awk -f tst.awk and if there are trailing blanks then add { sub(/ +$/,"") } as the first line of the script.
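Putting both of those tweaks together might look like this (a sketch, untested; a BEGIN block runs first no matter where it appears in the script):
$ cat tst.awk
{ sub(/ +$/,"") }        # strip trailing blanks, if any exist
BEGIN { FS=OFS="|" }
$1 != prev {
    if ( NR>1 ) {
        print out
    }
    out = prev = $1
}
{ out = out OFS $2 }
END { print out }
$ sort file | awk -f tst.awk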

Here is a Ruby solution that reads the file line-by-line. At the end I show how much simpler the solution could be if the file could be gulped into a string.
Let's first create an input file to work with.
str =<<~_
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
_
file_name_in = 'file_in'
File.write(file_name_in, str)
#=> 112
Solution when file is read line-by-line
We can produce the desired output file with the following method.
def doit(file_name_in, file_name_out)
  fin = File.new(file_name_in, "r")
  fout = File.new(file_name_out, "w")
  str = ''
  until fin.eof?
    s = fin.gets.strip
    k, v = s.split(/(?=\|)/)
    if str.empty?
      str = s
      key = k
    elsif k == key
      str << v
    else
      fout.puts(str)
      str = s
      key = k
    end
  end
  fout.puts(str)
  fin.close
  fout.close
end
Let's try it.
file_name_out = 'file_out'
doit(file_name_in, file_name_out)
puts File.read(file_name_out)
prints the following.
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
Note that
"apple|pear".split(/(?=\|)/)
#=> ["apple", "|pear"]
The regular expression contains the positive lookahead (?=\|) which matches the zero-width location between 'e' and '|'.
Solution when file is gulped into a string
The OP does not want to gulp the file into a string (hence my solution above) but I would like to show how much simpler the problem is if one could do so. Here is one of many ways of doing that.
def gulp_it(file_name_in, file_name_out)
  File.write(file_name_out,
    File.read(file_name_in).gsub(/^(.+)\|.*[^ ]\K *\r?\n\1/, ''))
end
gulp_it(file_name_in, file_name_out)
#=> 98
puts File.read(file_name_out)
prints
apple|pear|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson|ruddy
cherry|cerise
Thinking about what the regex engine will be doing, this may be acceptably fast, depending on file size, of course.
Regex demo
While the link uses the PCRE engine the result would be the same using Ruby's regex engine (Onigmo). We can make the regular expression self-documenting by writing it in free-spacing mode.
/
^          # match the beginning of a line
(.+)       # match one or more characters
\|.*[^ ]   # match '|', then zero or more chars, then a non-space
\K         # resets the starting point of the match and discards
           # any previously-matched characters
[ ]*       # match zero or more spaces
\r?\n      # match the line terminator(s)
\1         # match the content of capture group 1
/x         # invoke free-spacing mode
(.+) matches 'apple', 'banana' and 'cherry' because those words are at the beginnings of lines. One could alternatively write ([^|]*).

Assuming you have the following sample.txt
apple|pear
apple|quince
apple cider|juice
banana|plantain
cherry|cheerful, crimson
cherry|ruddy
cherry|cerise
I am not sure why you want the solution as a "one liner", but the following will do what you want.
cat sample.txt | ruby -e 'puts STDIN.readlines.map {_1.strip}.group_by {_1.split("|").first}.map{|k,v| v.reduce("#{k}") {"#{_1}|#{_2.split("|").last}"}}'
A more readable version with comments describing what's going on:
stripped_lines = STDIN.readlines.map { |l| l.strip } # remove leading and trailing whitespace
# create a hash where the keys are the value to the left of the |
# and the values are lines beginning with that key ie
# {
#   "apple"=>["apple|pear", "apple|quince"],
#   "apple cider"=>["apple cider|juice"],
#   "banana"=>["banana|plantain"],
#   "cherry"=>["cherry|cheerful, crimson", "cherry|ruddy", "cherry|cerise"]
# }
grouped_by_first_element = stripped_lines.group_by { |sl| sl.split('|').first }
# map to the desired result by starting with the key
# and then concatenating the part to the right of the | for each element
# ie start with apple then append |pear to get apple|pear then append |quince to that to get
# apple|pear|quince
result = grouped_by_first_element.map do |key, values|
  values.reduce("#{key}") do |memo, next_element|
    "#{memo}|#{next_element.split('|').last}"
  end
end
puts result

Assume s is a string containing all of the lines in the file.
s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }
Will yield:
{"apple"=>["pear", "quince"], "apple cider"=>["juice"], "banana"=>["plantain"], "cherry"=>["cheerful, crimson", "ruddy", "cerise"]}
Then:
s.split("\n").inject({}) { |h, x| k, v = x.split('|'); h[k] ||= []; h[k] << v.strip; h }.map { |k, v| "#{k}|#{v.join('|')}" }
Yields:
["apple|pear|quince", "apple cider|juice", "banana|plantain", "cherry|cheerful, crimson|ruddy|cerise"]

A pure bash solution could look like this:
unset out        # make sure we start fresh (important if this is in a loop)
declare -A out   # declare associative array
d='|'            # delimiter
# append all values to the key
while IFS=${d} read -r key val; do
    out[${key}]="${out[${key}]}${d}${val}"
done <file
# print desired output
for key in "${!out[@]}"; do
    printf '%s%s\n' "${key}" "${out[$key]}"
done | sort -t"${d}" -k1
### actual output
apple cider|juice
apple|pear|quince
banana|plantain
cherry|cheerful, crimson|ruddy|cerise
Or you could do this with awk. As mentioned in a comment, pure bash is not a great option, mostly due to performance and portability.
awk -F'|' '{
    sub(/[[:space:]]*$/,"")   # only necessary if you wish to trim trailing whitespace, which existed in your example data
    a[$1]=a[$1] "|" $2        # append value to string
} END {
    for(i in a) print i a[i]  # print all recreated lines
}' <file
### actual output
apple|pear|quince
banana|plantain
apple cider|juice
cherry|cheerful, crimson|ruddy|cerise

Related

Deleting lines with more than 30% lowercase letters

I'm trying to process some data but I'm unable to find a working solution for my problem. I have a file which looks like:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many more lines....
I want to filter out all sequences and their corresponding headers (a header starts with >) where the sequence string (the lines not starting with >) contains 30 percent or more lowercase letters. The sequence strings can span multiple lines.
So after command xy the output should look like:
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried some mix of a while loop for reading the input file and then working with awk, grep, sed but there was no good outcome.
Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.
Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).
To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.
Finally we check the ratio and print or not (adding back the initial ">" when we do print).
BEGIN { RS = ">" }
NR > 1 {
    total_cnt = 0
    lower_cnt = 0
    for (i=2; i<=NF; ++i) {
        total_cnt += length($i)
        s = $i
        lower_cnt += gsub(/[a-z]/, "", s)
    }
    ratio = lower_cnt / total_cnt
    if (ratio < 0.3) print ">"$0
}
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
Or:
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars
awk '/^>/{ b=B; gsub( /[A-Z]/,"",b)
           if( length( b) < length( B) * 0.3) print H "\n" B
           H=$0; B=""; next }
     { B=( (B != "") ? B "\n" : "" ) $0 }
     END{ b=B; gsub( /[A-Z]/,"",b)
          if( length( b) < length( B) * 0.3) print H "\n" B
     }' YourFile
Quick and dirty; a function would suit the repeated printing better.
Nowadays I would not use sed or awk anymore for anything longer than 2 lines.
#! /usr/bin/perl
use strict;                          # Force variable declaration.
use warnings;                        # Warn about dangerous language use.

sub filter                           # Declare a subroutine called `filter`.
{
    my ($header, $body) = @_;        # Give the first two function arguments the names header and body.
    my $lower = $body =~ tr/a-z//;   # Count the characters a-z (tr returns the count).
    print $header, $body, "\n"       # Print header, body and newline,
        unless $lower / length ($body) > 0.3;  # unless lowercase characters exceed 30%.
}

my ($header, $body);                 # Declare two variables for header and body.
while (<>) {                         # Loop over all lines from stdin or a file given on the command line.
    if (/^>/) {                      # If the line starts with >,
        filter ($header, $body)      # call filter with header and body,
            if defined $header;      # if header is defined, which is not the case at the beginning of the file.
        ($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
    } else {
        chomp;                       # Remove the newline at the end of the line.
        $body .= $_;                 # Append the line to body.
    }
}
filter ($header, $body);             # Filter the last record.

Using awk to format text

I'm having a hard time understanding how to achieve what I want using awk, and after searching for quite some time I couldn't find the solution I'm looking for.
I have an input text that looks like this:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line
I want to properly format the weird lines between ' (' and ')'. The expected output is as follow:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Looking up on stack overflow I found this :
How to select lines between two marker patterns which may occur multiple times with awk/sed
So what I'm using now is echo $text | awk '/ \(/{flag=1;next}/\)/{flag=0}flag'
Which almost works except it filters out the non-matching lines, here's the output produced by this very last command:
(Element 4)
(Element 1, span 1 to Element 5, span 4)
Does anyone know how to do this? I'm open to any suggestion, including not using awk if you know better.
Bonus points if you teach me how to remove syntax coloring on my question's code blocks :)
Thanks a billion times
Edit: OK, so I accepted @EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using @aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific use case.
I strongly suggest reading EdMorton's explanation; the last paragraph made my day. If anyone passing by has good resources regarding awk/sed they can share, feel free to do so in the comments.
Here's how I would do it with GNU sed :
s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}
Which, for those who don't speak gibberish, means :
remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now start with an opening bracket. If that's the case, do the following :
mark this spot as the label l, which denotes the start of a loop
add a line from the input to the pattern space
test if you now have a closing bracket in your pattern space
if so, jump to the label e
(if not) jump to the label l
mark this spot as the label e, which denotes the end of the code
remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)
This can probably be refined, but it does the trick :
$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}
sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.
The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:
$ cat tst.awk
BEGIN {
    RS = "[(][^)]+[)]"   # a regexp for the string you want to isolate in RT
    ORS = ""             # disable appending of newlines so we print as-is
}
{
    gsub(/\n[[:blank:]]+$/,"\n")   # remove any blanks before RT at the start of each line
    sub(/\(\s+/,"(",RT)            # remove spaces after ( in RT
    sub(/\s+\)/,")",RT)            # remove spaces before ) in RT
    gsub(/\s+/," ",RT)             # compress each chain of spaces to one blank char in RT
    print $0 RT                    # print the result
}
$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...
It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:
awk '/^ *\( *$/,/^ *\) *$/ {
         sub(/^ */, "");
         sub(/ *$/, "");
         if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
         if ($0 ~ /\)/) {
             sub(/\( /, "(", hold)
             sub(/ \)/, ")", hold)
             print hold
             hold = ""
         }
         next
     }
     { print }' data
The variable hold is initially empty.
The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.
The file data is copy'n'paste from the data in the question.
Output:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
The 'Another Line' (with capital L) has a trailing blank because the data in the question does.
With awk
$ cat fmt.awk
function rem_wsp(s) {   # remove white spaces
    gsub(/[\t ]/, "", s)
    return s
}
function beg() { return rem_wsp($0) == "(" }
function end() { return rem_wsp($0) == ")" }
function dump_block() {
    print "(" block ")"
}
beg() {
    in_block = 1
    next
}
end() {
    dump_block()
    in_block = block = sep = ""   # reset sep too, so the next block does not start with a blank
    next
}
in_block {
    if (length(block) > 0) sep = " "
    block = block sep $0
    next
}
{
    print
}
END {
    if (in_block) dump_block()
}
Usage:
$ awk -f fmt.awk file.dat

Sort Markdown file by heading

Is it possible to sort a markdown file by level 1 heading? Looking for sed or similar command line solution
#B
a content of B
#A
b content of A
to...
#A
b content of A
#B
a content of B
A perl one-liner, split for readability
perl -0777 -ne '
    (undef, @paragraphs) = split /^#(?=[^#])/m;
    print map {"#$_"} sort @paragraphs;
' file.md
You'll want to end the file with a blank line, so there's a blank line before #B. Or you could change
map {"#$_"} to map {"#$_\n"}
to forcibly insert one.
You can use GNU Awk with PROCINFO["sorted_in"] = "@ind_str_asc":
gawk 'BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc"; RS = ""; ORS = "\n\n" }
{ a[$1] = $0 } END { for (i in a) print a[i] }' file
Output:
#A
b content of A
#B
a content of B
Reference:
PROCINFO["sorted_in"]
If this element exists in PROCINFO,
then its value controls the order in
which array elements are traversed in
for loops. Supported values are
"@ind_str_asc", "@ind_num_asc",
"@val_type_asc", "@val_str_asc",
"@val_num_asc", "@ind_str_desc",
"@ind_num_desc", "@val_type_desc",
"@val_str_desc", "@val_num_desc", and
"@unsorted". The value can also be the
name of any comparison function defined
as follows:
You can also use this script to sort on 3 levels instead of just one. It also won't strip out the content before the first occurrence of the first heading.
#!/usr/bin/env perl
local $/;
my $text = <>;
my ($start, @chapters) = split /^#(?=[^#])/m, $text;
print $start;
for (sort @chapters) {
    my ($level1, @subchapters) = split /^##(?=[^#])/m;
    print "#$level1";
    for (sort @subchapters) {
        my ($level2, @subsubchapters) = split /^###(?=[^#])/m;
        print "##$level2";
        print map {"###$_"} sort @subsubchapters;
    }
}

awk script: removing line previous to pattern match and after, until a blank line

I began learning awk yesterday in an attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool to access/manipulate lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on its own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick

things
things
things
Desired output:
It is tasty!
stuff
something
stuff
things
things
things
My attempt:
{
    valid=1;               #boolean variable to keep track if x is valid and should be printed
    if ($x ~ /foo/){       #x is valid unless it contains foo
        valid=0;           #invalidate x so that it doesn't get printed at the end
        next;
    }
    if ($0 ~ /bar/){       #if the current line contains bar
        valid = 0;         #x is invalid (don't print the previous line)
        while (NF == 0){   #don't print until we reach an empty line
            next;
        }
    }
    if (valid == 1){       #x was a valid line
        print x;
    }
    x=$0;                  #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem but I'm interested in learning how this would be done):
Ability to remove n lines before a pattern match
Option to include/exclude the blank line in output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
    if (!skip && !empty) {
        print last
    }
    last = $0
    empty = 0
}

function set_skip_empty(n) {
    skip = n
    last = $0
    empty = NR <= 0
}

BEGIN { set_skip_empty(0) }
END { show_last() }

/foo/   { next }
/bar/   { set_skip_empty(1); next }
/^ *$/  { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/ { if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either
ignored or output, depending on other events, such as the occurrence of foo and bar.
The empty variable keeps track of whether or not the last variable is really a blank line, or simply empty from inception (e.g., in BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N number of lines as desired.
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but lines with blanks or tabs *do* have a length).
function show_last() {
    if (!skip && length(last) > 0) {
        print last
    }
    last = $0
}
will result in no blank lines of output.
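And here is a minimal sketch of the array idea mentioned above for the first bonus point (the names buf, nbuf and N are hypothetical, not part of the script above):
function remember(line,    i) {
    if (nbuf == N) {                 # buffer full: discard the oldest line
        for (i = 1; i < N; i++)
            buf[i] = buf[i+1]
        nbuf--
    }
    buf[++nbuf] = line               # append the newest line
}
On a bar match the whole buffer would then be cleared (delete buf; nbuf = 0) instead of skipping a single last line.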
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff
things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
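For instance, a hypothetical variant that removes the two lines before bar would change only the bar branch of the RE (untested sketch):
awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n([^\n]*\n){2}[^\n]*bar.*/,"") }1' file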
This awk should work without storing the full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff
things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[NR-1]; idx-=1; flag = 1; next }
{ line[++idx] = $0 }
END {
    for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff
things
things
things
If a line contains foo, skip it.
If the flag is enabled and the line is not blank, skip it.
If the flag is enabled and the line is blank, disable the flag.
If a line contains bar, delete the previous line, decrement the counter, enable the flag and skip the line.
Store every line that makes it through in an array indexed by an incrementing number.
In the END block print the lines.
Side Notes:
To remove n lines before a pattern match, you can create a loop: starting at the current index, a reverse for loop removes lines from your temporary cache (array), after which you subtract n from your self-defined counter variable (see the sketch after these notes).
To include or exclude blank lines you can use the NF variable. For a typical line, the NF variable is set to the number of fields based on your field separator. For blank lines this variable is 0. For example, if you modify the line above the END block to NF { line[++idx] = $0 } in the answer above, you will see we have bypassed all blank lines in the output.
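A hedged sketch of that reverse deletion, reusing the line, idx and flag variables from the answer above (n is assumed to be supplied, e.g. with awk -v n=2):
/bar/ {
    for (i = 0; i < n && idx > 0; i++)   # walk backwards through the cache,
        delete line[idx--]               # dropping the n most recent lines
    flag = 1
    next
}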

Convert leading spaces to tabs in ruby

Given the following indented text:
  two spaces
    four
      six
non-leading spaces
I'd like to convert every 2 leading spaces to a tab, essentially converting from soft tabs to hard tabs. I'm looking for the following result (using an 'x' instead of "\t"):
xtwo spaces
xxfour
xxxsix
non-leading spaces
What is the most efficient or eloquent way to do this in ruby?
What I have so far seems to be working, but it doesn't feel right.
input.gsub!(/^ {2}/,"x")
res = []
input.split(/\n/).each do |line|
  while line =~ /^x+ {2}/
    line.gsub!(/^(x+) {2}/,"\\1x")
  end
  res << line
end
puts res.join("\n")
I noticed an answer using perl and \G:
perl -pe '1 while s/\G {2}/\t/gc' input.txt >output.txt
But I can't figure out how to mimic the pattern in Ruby. This is as far as I got:
rep = 1
while input =~ /^x* {2}/ && rep < 10
  input.gsub!(/\G {2}/,"x")
  rep += 1
end
puts input
What's wrong with using (?:^ {2})|\G {2} in multi-line mode?
The first match will always be at the beginning of the line,
then \G will match successively right next to that, or the match
will fail. The next match will again be at the beginning of a line, and so on.
In Perl it's $str =~ s/(?:^ {2})|\G {2}/x/mg; or $str =~ s/(?:^ {2})|\G {2}/\t/mg;
Ruby http://ideone.com/oZ4Os
input.gsub!(/(?:^ {2})|\G {2}/m,"x")
Edit: Of course the anchors can be factored out and put into an alternation
http://ideone.com/1oDOJ
input.gsub!(/(?:^|\G) {2}/m,"x")
You can just use a single gsub for that:
str.gsub(/^( {2})+/) { |spaces| "\t" * (spaces.length / 2) }
