I'm trying to parse text from a file into a Hash. Regex isn't my strong suit, and I'd like a better understanding of how to do something like this.
-some-
key1=value1
key2=value2
key3=value3
; comment
-something-
key4=value4
key5=value5
The result should be something like this
some.key1 = value1
some.key2 = value2
some.key3 = value3
something.key4 = value4
something.key5 = value5
I'm working in Ruby and I am able to capture what's between the dashes, but the double line breaks and the comments are throwing me off. Currently I have something like this, which finds what's between the dashes and puts it in group 1, then puts the next line in group 2, but it stops there.
data = Hash[text.scan(/^\-(.*?)\-\s(.*?)$/m)]
Any help in understanding how this would work is greatly appreciated.
Update: I figured out a way to do it with two different regexes and a loop.
But there has to be a more efficient way, right?
data = Hash.new
Hash[text.scan(/\-(.*?)\-\s([^\r\n]*(?:[\r\n]+(?!\-).*)*)/)].each do |key, value|
data[key] = Hash[value.scan(/(?<=\s|\A)([^\s=]+)\s*=\s*(.*?)(?=(?:\s[^\s=]+=|$))/m)]
end
puts data.inspect
You can do it like this:
text = "-some-
key1=value1
key2=value2
key3=value3
; comment
-something-
key4=value4
key5=value5"
data = Hash[text.scan(/(\w+)=(\w+)/)]
puts data
Output:
{"key1"=>"value1", "key2"=>"value2", "key3"=>"value3", "key4"=>"value4", "key5"=>"value5"}
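(\w+) captures one or more word characters (letters, digits and _) on either side of =. No regex options are needed here; if your keys or values can contain other characters, \w has to be replaced with a broader character class. Note that this returns a flat Hash and ignores the -section- names.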
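Since the question asks for keys grouped under their -section- names, a line-by-line pass is one alternative to a single regex. The following is a minimal sketch, not the answerer's method; the variable names and the nested-Hash layout are assumptions made for illustration:

```ruby
text = <<~INI
  -some-
  key1=value1
  key2=value2
  key3=value3

  ; comment
  -something-
  key4=value4
  key5=value5
INI

# Each new section name gets its own inner Hash on first use.
data = Hash.new { |hash, name| hash[name] = {} }
section = nil

text.each_line do |line|
  line = line.strip
  # Skip blank lines and ";" comments.
  next if line.empty? || line.start_with?(";")

  if (m = line.match(/\A-(.+)-\z/))
    # A "-name-" line starts a new section.
    section = m[1]
  elsif section && (m = line.match(/\A([^=\s]+)\s*=\s*(.*)\z/))
    # A "key=value" line belongs to the current section.
    data[section][m[1]] = m[2]
  end
end

# Flatten to the "section.key = value" form shown in the question.
flat = data.flat_map { |name, pairs| pairs.map { |k, v| "#{name}.#{k} = #{v}" } }
puts flat
```

Reading line by line keeps each pattern small, and blank lines or comments never have to be expressed inside a regex.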
I have a set of data :
coords=ARRAY(0x940044c)
Label<=>Bikini beach
coords=ARRAY(0x95452ec)
City=Y
Label=Naifaru%*
How do I remove the unwanted character to make it like this?
coords=ARRAY(0x940044c)
Label=Bikini beach
coords=ARRAY(0x95452ec)
City=Y
Label=Naifaru
I tried this:
hashChar = {"!"=>nil, "#"=>nil, "$"=>nil, "%"=>nil, "*"=>nil, "<=>"=>nil, "<"=>nil, ">"=>nil}
readFile.each do |char|
unwantedChar = char.chomp
puts unwantedChar.gsub(/\W/, hashChar)
end
But the output I will get is this:
coordsARRAY0x940044c
LabelBikinibeach
coordsARRAY0x95452ec
CityY
LabelNaifaru
Please help.
If the input is not extremely long and you are fine with loading it into memory, String#gsub will do. It's generally safer to whitelist the characters you want than to blacklist the ones you don't.
readFile.gsub(/[^\w\s=\(\)]+/, '')
# coords=ARRAY(0x940044c)
# Label=Bikini beach
# coords=ARRAY(0x95452ec)
# City=Y
# Label=Naifaru
I assume from the code you posted, that readFile is a String holding the set of data you are referring to.
puts readFile.delete('!#$%<>*')
should do the job.
Using a hash map with gsub
regex = Regexp.union(hashChar.keys)
puts your_string.gsub(regex, hashChar)
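One caveat with the hash form: String#gsub replaces each match with the hash value converted to a string, so a "<=>" key that maps to nil removes the whole separator and yields "LabelBikini beach". A minimal sketch of one way around that, assuming a replacements hash of my own (not the asker's hashChar) that maps "<=>" to "=":

```ruby
# Assumed sample input, built from the lines shown in the question.
input = "coords=ARRAY(0x940044c)\nLabel<=>Bikini beach\nCity=Y\nLabel=Naifaru%*\n"

# Map "<=>" to "=" so the separator survives; everything else maps to "".
replacements = {
  "<=>" => "=",
  "!" => "", "#" => "", "$" => "", "%" => "", "*" => "",
  "<" => "", ">" => ""
}

# Regexp.union escapes each string. "<=>" is listed before "<", so the
# longer alternative is tried first at that position.
pattern = Regexp.union(replacements.keys)
cleaned = input.gsub(pattern, replacements)
puts cleaned
```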
I'm trying to reformat German dates (e.g. 13.03.2011 to 2011-03-13).
This is my code:
str = "13.03.2011\n14:30\n\nHannover Scorpions\n\nDEG Metro Stars\n60\n2 - 3\n\n\n\n13.03.2011\n14:30\n\nThomas Sabo Ice Tigers\n\nKrefeld Pinguine\n60\n2 - 3\n\n\n\n"
str = str.gsub("/(\d{2}).(\d{2}).(\d{4})/", "/$3-$2-$1/")
I get the same output as the input. I also tried my code with and without the leading and trailing slashes, but I don't see a difference. Any hints?
I also tried storing my regexes in variables, like find = /(\d{2}).(\d{2}).(\d{4})/ and replace = /$3-$2-$1/, so my code looked like this:
str = "13.03.2011\n14:30\n\nHannover Scorpions\n\nDEG Metro Stars\n60\n2 - 3\n\n\n\n13.03.2011\n14:30\n\nThomas Sabo Ice Tigers\n\nKrefeld Pinguine\n60\n2 - 3\n\n\n\n"
find = /(\d{2}).(\d{2}).(\d{4})/
replace = /$3-$2-$1/
str = str.gsub(find, replace)
TypeError: no implicit conversion of Regexp into String
from (irb):4:in `gsub'
Any suggestions for this problem?
The first mistake is the regex delimiter. You do not need to write the regex as a string; put it between regex delimiters like //.
The second mistake is referring to the captured groups as $1. In the replacement string, write them as \\1 instead.
str = str.gsub(/(\d{2})\.(\d{2})\.(\d{4})/, "\\3-\\2-\\1")
Also, notice that I have escaped the . character as \., because in a regex an unescaped . matches any character except \n.
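To make the two fixes concrete, here is a small sketch on a shortened version of the sample string. The \b word boundaries and the named groups are my own additions, not part of the answer above:

```ruby
str = "13.03.2011\n14:30\n\nHannover Scorpions\n"

# Replacement-string form: in single quotes, '\3' is a literal backslash
# followed by 3, which gsub reads as "third capture group".
iso = str.gsub(/\b(\d{2})\.(\d{2})\.(\d{4})\b/, '\3-\2-\1')

# Block form with named groups: the block runs once per match and its
# return value becomes the replacement.
iso_block = str.gsub(/\b(?<day>\d{2})\.(?<month>\d{2})\.(?<year>\d{4})\b/) do
  m = Regexp.last_match
  "#{m[:year]}-#{m[:month]}-#{m[:day]}"
end

puts iso
```

The block form is easier to read when the replacement needs more than reordering, for example validating the date before rewriting it.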
I'm new to Ruby and I'm trying to solve a problem.
I'm parsing several text fields and want to remove the header, which can have different values. It works fine when the header is always the same:
variable = variable.gsub(/(^Header_1:$)/, '')
But when I put in several arguments it doesn't work:
variable = variable.gsub(/(^Header_1$)/ || /(^Header_2$)/ || /(^Header_3$)/ || /(^Header_4$)/ || /^:$/, '')
You can use Regexp.union:
regex = Regexp.union(
/^Header_1/,
/^Header_2/,
/^Header_3/,
/^Header_4/,
/^:$/
)
variable.gsub(regex, '')
Please note that ^something$ only matches when something is the entire line; it will not match a line that contains anything more than something :)
That's because ^ matches at the beginning of a line and $ at the end of a line (in Ruby, \A and \z are the anchors for the whole string).
So I intentionally removed $.
Also, you do not need parentheses when you only want to remove the matched string.
You can also use it like this:
headers = %w[Header_1 Header_2 Header_3]
regex = Regexp.union(*headers.map{|s| /^#{s}/}, /^\:$/, /etc/)
variable.gsub(regex, '')
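For context on why the original attempt failed: `||` returns its first truthy operand, and every Regexp object is truthy, so the chained expression collapses to the first pattern only. A short sketch with made-up sample text (the text variable and its contents are assumptions):

```ruby
# Hypothetical input with two headers and a lone ":" line.
text = "Header_1\nbody one\nHeader_2\nbody two\n:\n"

# `||` never looks past the first regex, so this is just /^Header_1$/.
only_first = /^Header_1$/ || /^Header_2$/

# Regexp.union builds a single pattern that matches any of the alternatives.
headers = Regexp.union(/^Header_1$/, /^Header_2$/, /^:$/)
cleaned = text.gsub(headers, "")

p only_first
p cleaned
```

The header text is removed but the line breaks stay, so each removed header leaves an empty line behind.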
And of course you can remove headers without explicitly defining them.
Most likely there is white space after the headers?
If so, you can do it as simply as:
variable = "Header_1 something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
variable = "Header_BLAH something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
Just use a proper regexp:
variable.gsub(/^(Header_1|Header_2|Header_3|Header_4|:)$/, '')
If the header is always the same format of Header_n, where n is some integer value, then you can simplify your regex greatly:
/Header_\d+/
will find every one of these:
%w[Header_1 Header_2 Header_3].grep(/Header_\d+/)
[
[0] "Header_1",
[1] "Header_2",
[2] "Header_3"
]
Tweaking it to handle finding words, not substrings:
/^Header_\d+$/
or:
/\bHeader_\d+\b/
As mentioned, using Regexp.union is a good start, but, used blindly, can result in very slow or inefficient patterns, so think ahead and help out the engine by giving it useful sub-patterns to work with:
values = %w[foo bar]
/Header_(?:\d+|#{ values.join('|') })/
=> /Header_(?:\d+|foo|bar)/
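If the values can contain regex metacharacters, joining them with '|' would let those characters act as operators. One possible variation, sketched here with a hypothetical "a.b" value, is to let Regexp.union do the escaping and interpolate its source:

```ruby
# "a.b" is a made-up value that contains a metacharacter.
values = %w[foo bar a.b]

# Regexp.union escapes each string; .source gives the bare alternation
# so it can be embedded in the hand-written sub-pattern.
pattern = /Header_(?:\d+|#{Regexp.union(values).source})/

p pattern
p "Header_12".match?(pattern)
p "Header_a.b".match?(pattern)
p "Header_axb".match?(pattern)
```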
Unfortunately, Ruby doesn't have the equivalent to Perl's Regexp::Assemble module, which can build highly optimized patterns from big lists of words. Search here on Stack Overflow for examples of what it can do. For instance:
use Regexp::Assemble;
my @values = ('Header_1', 'Header_2', 'foo', 'bar', 'Header_3');
my $ra = Regexp::Assemble->new;
foreach (@values) {
$ra->add($_);
}
print $ra->re, "\n";
=> (?-xism:(?:Header_[123]|bar|foo))
I have a file with one or more key:value lines, and I want to pull a key:value out if key=foo. How can I do this?
I can get as far as this:
if File.exist?('/file_name')
content = open('/file_name').grep(/foo:??/)
I am unsure about the grep portion, and also once I get the content, how do I extract the value?
People like to slurp the files into memory, which, if the file will always be small, is a reasonable solution. However, slurping isn't scalable, and the practice can lead to excessive CPU and I/O waits as content is read.
Instead, because you could have multiple hits in a file, and you're comparing the content line-by-line, read it line-by-line. Line I/O is very fast and avoids the scalability problems. Ruby's File.foreach is the way to go:
File.foreach('path/to/file') do |li|
puts $1 if li[/foo:\s*(\w+)/]
end
Because there are no samples of actual key/value pairs, we're shooting in the dark for valid regex patterns, but this is the basis for how I'd solve the problem.
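A self-contained version of the same idea, using a temporary file so it can be run as-is. The sample lines and the "key: value" layout are assumptions, since the question does not show the real file:

```ruby
require "tempfile"

# Hypothetical sample data standing in for the real file.
file = Tempfile.new("key_values")
file.write("bar: 1\nfoo: 42\nbaz: 3\nfoo: 99\n")
file.close

values = []
File.foreach(file.path) do |line|
  # \A anchors at the start of the line string, so "xfoo:" is not a hit.
  # String#[] with a capture index returns the group, or nil on no match.
  value = line[/\Afoo:\s*(.*)/, 1]
  values << value if value
end
file.unlink

p values
```

Collecting into an array handles a file with several foo lines; use values.first if only one is expected.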
Try this:
IO.readlines('key_values.txt').find_all{|line| line.match('key1')}
I would recommend reading the file into an array and selecting only the lines you need:
regex = /\A\s?key\s?:/
results = File.readlines('file').inject([]) do |f,l|
l =~ regex ? f << "key = %s" % l.sub(regex, '') : f
end
This detects lines starting with key: and adds them to results in the form key = value,
where value is the portion that follows key:
so if you have a file like this:
key:1
foo
key:2
bar
key:3
you'll get results like this:
key = 1
key = 2
key = 3
Make sense?
value = File.open('/file_name').read.match("key:(.*)").captures[0] rescue nil
File.read('file_name')[/foo: (.*)/, 1]
#=> XXXX
I have the following string, which contains several blank lines:
str = "aaa\n\n\nbbb\n\nccc\ddd\n"
I want to return this all in one line. The output should be like this on a single line:
aaabbbcccddd
I tried various trim functions, but I still can't get this output.
What method do I have to use here?
The Ruby (and slightly less Perl-ish) way:
new_str = str.delete "\n"
...or if you want to do it in-place:
str.delete! "\n"
str.gsub(/\n/,'')
> str = "aaa\n\n\nbbb\n\nccc\ddd\n"
=> "aaa\n\n\nbbb\n\ncccddd\n"
> str.gsub("\n", "")
=> "aaabbbcccddd"
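A short sketch comparing the approaches above. The sample string here uses "ccc\nddd", which I assume is what the question intended:

```ruby
str = "aaa\n\n\nbbb\n\nccc\nddd\n"

# String#delete returns a new string without the listed characters.
one_line = str.delete("\n")

# gsub produces the same result; delete is simpler for fixed characters.
same = str.gsub("\n", "")

# The argument is a set of characters, so this strips both CR and LF.
windows = "aaa\r\nbbb\r\n".delete("\r\n")

# delete! modifies the receiver and returns nil when nothing changed.
copy = str.dup
first_call = copy.delete!("\n")
second_call = copy.delete!("\n")

p one_line, windows, second_call
```

Because delete! can return nil, avoid chaining further calls onto its result.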