Speed up my lexing algorithm - ruby

I'm splitting a potentially large string (let's say 20MB, though this is entirely arbitrary) into tokens defined by a list of regular expressions.
My current algorithm takes the following approach:
All regexes are optimized to have the zero-width assertion ^ at the start of them
For each regex in the list, I attempt to #slice! the input string
If we #slice! anything, we got a match AND the input string has been advanced ready to find the next token (since #slice! modifies the string)
Unfortunately this is slow because of the repeated #slice! on the long string; modifying large strings in Ruby appears to be expensive.
So I wonder if there's a way to match my regexes against the remainder of the string without modifying it?
Current algorithm in (tested, runnable) pseudo-code:
rules = {
:foo => /^foo/,
:bar => /^bar/,
:int => /^[0-9]+/
}
input = "1foofoo23456bar1foo"
# or if you want your computer to cry
# input = "1foofoo23456bar1foo" * 1_000_000
tokens = []
until input.length == 0
matched = rules.detect do |(name, re)|
if match = input.slice!(re)
tokens << { :rule => name, :value => match }
end
end
raise "Unconsumed input: #{input}" unless matched
end
pp tokens
# =>
[{:rule=>:int, :value=>"1"},
{:rule=>:foo, :value=>"foo"},
{:rule=>:foo, :value=>"foo"},
{:rule=>:int, :value=>"23456"},
{:rule=>:bar, :value=>"bar"},
{:rule=>:int, :value=>"1"},
{:rule=>:foo, :value=>"foo"}]
Note that simply matching the regexes against the string the same number of times, without slicing, is not fast either, but it takes a few seconds rather than many minutes.

The String#match method has a two-argument form that starts matching at a given character position in the string. Use the end offset of the previous match as the starting position for the next one. Note that ^ only anchors at the start of a line, so the rules need \G (which anchors at the position where the search starts) instead.
A sketch of the idea, with the rules re-anchored using \G:
rules = { :foo => /\Gfoo/, :bar => /\Gbar/, :int => /\G[0-9]+/ }
input = "1foofoo23456bar1foo"
tokens = []
input_pos = 0
until input_pos == input.length
  matched = rules.detect do |(name, re)|
    if match = input.match(re, input_pos)
      tokens << { :rule => name, :value => match[0] }
      input_pos = match.end(0) # offset one past the last matched character
    end
  end
  raise "Unconsumed input at position #{input_pos}" unless matched
end
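Another way to match against the remainder of the string without modifying it is StringScanner from Ruby's standard library, which keeps an internal position and anchors each scan at it. This is a minimal sketch rather than part of the original answer; the method name tokenize and the unanchored RULES hash are assumptions made for illustration.

```ruby
require 'strscan'

# Hypothetical rule table; no ^ or \G is needed because
# StringScanner#scan only matches at the current position.
RULES = { foo: /foo/, bar: /bar/, int: /[0-9]+/ }

def tokenize(input, rules = RULES)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # The first rule that matches at the current position wins
    # and advances the scanner past the matched text.
    rule = rules.find { |_name, re| scanner.scan(re) }
    raise "Unconsumed input at position #{scanner.pos}" unless rule
    tokens << { rule: rule.first, value: scanner.matched }
  end
  tokens
end

p tokenize("1foofoo23456bar1foo")
```

The input string is never copied or mutated; only the scanner's position changes.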

Maybe I'm oversimplifying but String#scan will most likely outperform anything else:
tokens = input.scan(/foo|bar|\d+/).map{|m| {:value => m, :rule => rules.find{|name,re| m =~ re}[0]}}
or more generally:
rules = {
:foo => /foo/,
:bar => /bar/,
:int => /[0-9]+/
}
tokens = input.scan(Regexp.union(rules.values)).map{|m| {:value => m, :rule => rules.find{|name,re| m =~ re}[0]}}
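The version above matches every token a second time to find its rule name. A possible variation, sketched here under the assumption that no rule contains capture groups of its own, wraps each rule in one group so the position of the non-nil group identifies the rule. The method name tokenize_with_scan is hypothetical, and rebuilding from Regexp#source drops any per-rule flags.

```ruby
# Sketch: one capture group per rule, in the same order as the rule names.
def tokenize_with_scan(input, rules)
  names = rules.keys
  union = Regexp.union(rules.values.map { |re| /(#{re.source})/ })
  input.scan(union).map do |groups|
    # Exactly one group is non-nil for each match.
    i = groups.index { |g| !g.nil? }
    { rule: names[i], value: groups[i] }
  end
end

rules = { foo: /foo/, bar: /bar/, int: /[0-9]+/ }
p tokenize_with_scan("1foofoo23456bar1foo", rules)
```

Unlike the original loop, String#scan silently skips text that matches no rule, so unconsumed input has to be detected separately if that matters.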


How to mask all but last four characters in a string

I've been attempting a coding exercise to mask all but the last four digits or characters of any input.
I think my solution works but it seems a bit clumsy. Does anyone have ideas about how to refactor it?
Here's my code:
def mask(string)
z = string.to_s.length
if z <= 4
return string
elsif z > 4
array = []
string1 = string.to_s.chars
string1[0..((z-1)-4)].each do |s|
array << "#"
end
array << string1[(z-4)..(z-1)]
puts array.join(", ").delete(", ").inspect
end
end
positive lookahead
A positive lookahead makes it pretty easy. If any character is followed by at least 4 characters, it gets replaced :
"654321".gsub(/.(?=.{4})/,'#')
# "##4321"
Here's a description of the regex :
r = /
. # Just one character
(?= # which must be followed by
.{4} # 4 characters
) #
/x # free-spacing mode, allows comments inside regex
Note that the regex only matches one character at a time, even though it needs to check up to 5 characters for each match :
"654321".scan(r)
# => ["6", "5"]
/(.)..../ wouldn't work, because it would consume 5 characters for each iteration :
"654321".scan(/(.)..../)
# => [["6"]]
"abcdefghij".scan(/(.)..../)
# => [["a"], ["f"]]
If you want to parametrize the length of the unmasked string, you can use variable interpolation :
all_but = 4
/.(?=.{#{all_but}})/
# => /.(?=.{4})/
Code
Packing it into a method, it becomes :
def mask(string, all_but = 4, char = '#')
string.gsub(/.(?=.{#{all_but}})/, char)
end
p mask('testabcdef')
# '######cdef'
p mask('1234')
# '1234'
p mask('123')
# '123'
p mask('x')
# 'x'
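One caveat with this pattern is that . does not match a newline by default, so strings containing line breaks are masked per line. The sketch below adds the /m flag to treat newlines like any other character; mask_multiline is a hypothetical name used only for this illustration.

```ruby
# The method from the answer above, repeated so the comparison is self-contained.
def mask(string, all_but = 4, char = '#')
  string.gsub(/.(?=.{#{all_but}})/, char)
end

# Hypothetical variant: with /m, `.` also matches "\n".
def mask_multiline(string, all_but = 4, char = '#')
  string.gsub(/.(?=.{#{all_but}})/m, char)
end

p mask("ab\ncdefg")           # newline is neither masked nor counted
p mask_multiline("ab\ncdefg") # newline is masked like any other character
```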
You could also adapt it for sentences :
def mask(string, all_but = 4, char = '#')
string.gsub(/\w(?=\w{#{all_but}})/, char)
end
p mask('It even works for multiple words')
# "It even #orks for ####iple #ords"
Some notes about your code
string.to_s
Naming things is very important in programming, especially in dynamic languages.
string.to_s
If string is indeed a string, there shouldn't be any reason to call to_s.
If string isn't a string, you should indeed call to_s before gsub but should also rename string to a better description :
object.to_s
array.to_s
whatever.to_s
join
puts array.join(", ").delete(", ").inspect
What do you want to do exactly? You could probably just use join :
[1,2,[3,4]].join(", ").delete(", ")
# "1234"
[1,2,[3,4]].join
# "1234"
delete
Note that .delete(", ") deletes every comma and every space, wherever they appear. It doesn't only delete ", " substrings :
",a b,,, cc".delete(', ')
# "abcc"
["1,2", "3,4"].join(', ').delete(', ')
# "1234"
Ruby makes this sort of thing pretty trivial:
class String
def asteriskify(tail = 4, char = '#')
if (length <= tail)
self
else
char * (length - tail) + self[-tail, tail]
end
end
end
Then you can apply it like this:
"moo".asteriskify
# => "moo"
"testing".asteriskify
# => "###ting"
"password".asteriskify(5, '*')
# => "***sword"
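Try this one (it modifies the string in place; the guard keeps inputs of four characters or fewer from raising an error):
def mask(string)
  return string if string.length <= 4
  string[0..-5] = '#' * (string.length - 4)
  string
end
mask("12345678")
# => "####5678"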
I will add my solution to this topic too :)
def mask(str)
str.match(/(.*)(.{4})/)
'#' * ($1 || '').size + ($2 || str)
end
mask('abcdef') # => "##cdef"
mask('x') # => "x"
I offer this solution mainly to remind readers that String#gsub without a block returns an enumerator.
def mask(str, nbr_unmasked, mask_char)
str.gsub(/./).with_index { |s,i| i < str.size-nbr_unmasked ? mask_char : s }
end
mask("abcdef", 4, '#')
#=> "##cdef"
mask("abcdef", 99, '#')
#=> "abcdef"
Try using tap
def mask_string(str)
str.tap { |p| p[0...-4] = '#' * (p[0...-4].length) } if str.length > 4
str
end
mask_string('ABCDEF') # => "##CDEF"
mask_string('AA') # => "AA"
mask_string('S') # => "S"

in Parslet, how to reconstruct substrings from parse subtrees?

I'm writing a parser for strings with interpolated name-value arguments, e.g.: 'This sentence #{x: 2, y: (2 + 5) + 3} has stuff in it.' The argument values are code, which has its own set of parse rules.
Here's a version of my parser, simplified to only allow basic arithmetic as code:
require 'parslet'
require 'ap'
class TestParser < Parslet::Parser
rule :integer do match('[0-9]').repeat(1).as :integer end
rule :space do match('[\s\\n]').repeat(1) end
rule :parens do str('(') >> code >> str(')') end
rule :operand do integer | parens end
rule :addition do (operand.as(:left) >> space >> str('+') >> space >> operand.as(:right)).as :addition end
rule :code do addition | operand end
rule :name do match('[a-z]').repeat 1 end
rule :argument do name.as(:name) >> str(':') >> space >> code.as(:value) end
rule :arguments do argument >> (str(',') >> space >> argument).repeat end
rule :interpolation do str('#{') >> arguments.as(:arguments) >> str('}') end
rule :text do (interpolation.absent? >> any).repeat(1).as(:text) end
rule :segments do (interpolation | text).repeat end
root :segments
end
string = 'This sentence #{x: 2, y: (2 + 5) + 3} has stuff in it.'
ap TestParser.new.parse(string), index: false
Since the code has its own parse rules (to ensure valid syntax), the argument values are parsed into a subtree (with parentheses etc. replaced by nesting within the subtree):
[
    {
        :text => "This sentence "@0
    },
    {
        :arguments => [
            {
                :name => "x"@16,
                :value => {
                    :integer => "2"@19
                }
            },
            {
                :name => "y"@22,
                :value => {
                    :addition => {
                        :left => {
                            :addition => {
                                :left => {
                                    :integer => "2"@26
                                },
                                :right => {
                                    :integer => "5"@30
                                }
                            }
                        },
                        :right => {
                            :integer => "3"@35
                        }
                    }
                }
            }
        ]
    },
    {
        :text => " has stuff in it."@37
    }
]
However, I want to store the argument values as strings, so this would be the ideal result:
[
    {
        :text => "This sentence "@0
    },
    {
        :arguments => [
            {
                :name => "x"@16,
                :value => "2"
            },
            {
                :name => "y"@22,
                :value => "(2 + 5) + 3"
            }
        ]
    },
    {
        :text => " has stuff in it."@37
    }
]
How can I use the Parslet subtrees to reconstruct the argument-value substrings? I could write a code generator, but that seems overkill -- Parslet clearly has access to the substring position information at some point (although it might discard it).
Is it possible to leverage or hack Parslet to return the substring?
The tree produced is based on the use of as in your parser.
You can try removing them from anything in an expression so you get a single string match for the expression. This seems to be what you are after.
If you want the parsed tree for these expressions too, then you need to either:
Transform the expression trees back to the matched text.
Re-Parse the matched text back into an expression tree.
Neither of these is ideal, but if speed is not vital, I would go with the re-parse option, i.e. remove the as atoms, and later reparse the expressions into trees as needed.
If you want to reuse the same rules but need as captures throughout them, you could derive a parser from your existing parser and reimplement rules with the same names in terms of the parent, along the lines of rule :x { super.x.as(:x)}
OR
You could have a general rule for expression that matches the whole expression without knowing what is in it.
e.g. str('#{') >> (str('\\}') | (str('}').absent? >> any)).repeat(0) >> str('}')
Then later you can parse each expression into a tree as needed. That way you are not repeating your rules. It assumes you can tell when your expression is complete without parsing the whole expression subtree.
Failing that, it leaves us with hacking parslet.
I don't have a solution here, just some hints.
Parslet has a module called "CanFlatten" that implements flatten and is used by as to convert the captured tree back to a single string. You are going to want to do something like this.
Alternatively you need to change the succ method in Atoms::Base to return "[success/fail, result, consumed_upto_position]" so each match knows where it consumed up to. Then you can read from the source between the start position and end position to get the raw text back. The current position of the source at the point the parser matches should be the value you want.
Good Luck.
Note: My example expression parser doesn't handle escaping of the escape character (left as an exercise for the reader).
Here's the hack I ended up with. There are better ways to accomplish this, but they'd require more extensive changes. Parser#parse now returns a Result. Result#tree gives the normal parse result, and Result#strings is a hash that maps subtree structures to source strings.
module Parslet
class Parser
class Result < Struct.new(:tree, :strings); end
def parse(source, *args)
source = Source.new(source) unless source.is_a? Source
value = super source, *args
Result.new value, source.value_strings
end
end
class Source
prepend Module.new{
attr_reader :value_strings
def initialize(*args)
super *args
@value_strings = {}
end
}
end
class Atoms::Base
prepend Module.new{
def apply(source, *args)
old_pos = source.bytepos
super.tap do |success, value|
next unless success
string = source.instance_variable_get(:@str).string.slice(old_pos ... source.bytepos)
source.value_strings[flatten(value)] = string
end
end
}
end
end

Convert named matches in MatchData to Hash

I have a rather simple regexp, but I wanted to use named regular expressions to make it cleaner and then iterate over results.
Testing string:
testing_string = "111x222b333"
My regexp:
regexp = %r{
(?<width> [0-9]{3} ) {0}
(?<height> [0-9]{3} ) {0}
(?<depth> [0-9]+ ) {0}
\g<width>x\g<height>b\g<depth>
}x
dimensions = regexp.match(testing_string)
This works like a charm, but here's where the problem comes:
dimensions.each { |k, v| dimensions[k] = my_operation(v) }
# ERROR !
undefined method `each' for #<MatchData "111x222b333" width:"111" height:"222" depth:"333">.
There is no .each method in MatchData object, and I really don't want to monkey patch it.
How can I fix this problem ?
I wasn't as clear as I thought: the point is to keep names and hash-like structure.
If you need a full Hash:
captures = Hash[ dimensions.names.zip( dimensions.captures ) ]
p captures
#=> {"width"=>"111", "height"=>"222", "depth"=>"333"}
If you just want to iterate over the name/value pairs:
dimensions.names.each do |name|
value = dimensions[name]
puts "%6s -> %s" % [ name, value ]
end
#=> width -> 111
#=> height -> 222
#=> depth -> 333
Alternatives:
dimensions.names.zip( dimensions.captures ).each do |name,value|
# ...
end
[ dimensions.names, dimensions.captures ].transpose.each do |name,value|
# ...
end
dimensions.names.each.with_index do |name,i|
value = dimensions.captures[i]
# ...
end
So today a new Ruby version (2.4.0) was released which includes many new features, amongst them feature #11999, aka MatchData#named_captures. This means you can now do this:
h = '12'.match(/(?<a>.)(?<b>.)(?<c>.)?/).named_captures
#=> {"a"=>"1", "b"=>"2", "c"=>nil}
h.class
#=> Hash
So in your code change
dimensions = regexp.match(testing_string)
to
dimensions = regexp.match(testing_string).named_captures
And you can use the each method on your regex match result just like on any other Hash, too.
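As a sketch of how this combines with the per-value operation from the question, Hash#transform_values (also added in Ruby 2.4) can apply it while keeping the names. The simplified DIMENSIONS regexp, the parse_dimensions method and the integer conversion standing in for my_operation are all assumptions made for illustration.

```ruby
# Simplified stand-in for the question's regexp.
DIMENSIONS = /(?<width>\d{3})x(?<height>\d{3})b(?<depth>\d+)/

def parse_dimensions(text)
  match = DIMENSIONS.match(text)
  return {} unless match # match is nil when the text does not fit the pattern
  # Integer(v, 10) stands in for the question's my_operation.
  match.named_captures.transform_values { |v| Integer(v, 10) }
end

p parse_dimensions("111x222b333")
```

The nil check matters because calling named_captures directly on a failed match raises NoMethodError.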
I'd attack the whole problem of creating the hash a bit differently:
irb(main):052:0> testing_string = "111x222b333"
"111x222b333"
irb(main):053:0> hash = Hash[%w[width height depth].zip(testing_string.scan(/\d+/))]
{
"width" => "111",
"height" => "222",
"depth" => "333"
}
While regex are powerful, their siren-call can be too alluring, and we get sucked into trying to use them when there are more simple, or straightforward, ways of accomplishing something. It's just something to think about.
To keep track of the number of elements scanned, per the OPs comment:
hash = Hash[%w[width height depth].zip(scan_result = testing_string.scan(/\d+/))]
=> {"width"=>"111", "height"=>"222", "depth"=>"333"}
scan_result.size
=> 3
Also hash.size will return that, as would the size of the array containing the keys, etc.
@Phrogz's answer is correct if all of your captures have unique names, but you're allowed to give multiple captures the same name. Here's an example from the Regexp documentation.
This code supports captures with duplicate names:
captures = Hash[
dimensions.regexp.named_captures.map do |name, indexes|
[
name,
indexes.map { |i| dimensions.captures[i - 1] }
]
end
]
# Iterate over the captures
captures.each do |name, values|
# name is a String
# values is an Array of Strings
end
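To make the duplicate-name case concrete, here is a small sketch. The regexp and the input "radio" are made up for illustration, and captures_by_name is a hypothetical helper wrapping the same idea as the code above.

```ruby
# Collect every capture for each name, including names used more than once.
def captures_by_name(match)
  match.regexp.named_captures.map do |name, indexes|
    # indexes are 1-based group numbers, so match[i] reads the group directly.
    [name, indexes.map { |i| match[i] }]
  end.to_h
end

m = /(?<vowel>[aeiou]).(?<vowel>[aeiou])/.match("radio")
p m.regexp.named_captures
p captures_by_name(m)
```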
If you want to keep the names, you can do
new_dimensions = {}
dimensions.names.each { |k| new_dimensions[k] = my_operation(dimensions[k]) }

How to read characters from a text file, then store them into a hash in Ruby

I am working on an assignment, and can't figure it out. We have to first parse a text file, and then feed the results into a hash. I have done this:
code = File.open(WORKING_DIR + '/code.txt','r')
char_count = {'a' => 0,'b' => 0,'c' => 0,'d' => 0,'e' => 0,'f' => 0,'g' => 0,'h' => 0,'i' => 0,
'j' => 0,'k' => 0,'l' => 0,'m' => 0,'n' => 0,'o' => 0,'p' => 0,'q' => 0,'r' => 0,
's' => 0,'t' => 0,'u' => 0,'v' => 0,'w' => 0,'x' => 0,'y' => 0,'z' => 0
}
# Step through each line in the file.
code.readlines.each do |line|
# Print each character of this particular line.
line.split('').each do
|ch|
char_count.has_key?('ch')
char_count['ch'] +=1
end
My line of thinking: open the file to a variable named code
read the individual lines
break the lines into each character.
I know this works, I can puts out the characters to screen.
Now I need to feed the characters into the hash, and it isn't working. I am struggling with the syntax (at least) and basic concepts (at most). I only want the alphabet characters, not the punctuation, etc. from the file.
Any help would be greatly appreciated.
Thanks.
I would directly do :
File.open(WORKING_DIR + '/code.txt','r') do |f|
char_count = Hash.new(0) # create a hash where 0 is the default value
f.each_char do |c| # iterate on each character
... # some filter on the character you want to reject.
char_count[c] +=1
end
end
PS : you wrote 'ch' the string instead of ch the variable name
EDIT : the filter could be
f.each_char do |c| # iterate on each character
next if c =~ /\W/ # skip non-word characters using a regexp
....
Try this, using Enumerable class methods:
open("file").each_char.grep(/\w/).group_by { |char|
char
}.each { |char,num|
p [char, num.count]
}
(The grep filter uses the regex \w, which matches any letter, digit or underscore; you can change it to [A-Za-z] to keep only letters.)
I think the problem is here:
char_count.has_key?('ch')
char_count['ch'] +=1
end
You're not using the variable but the string 'ch'; change it to ch in both places.
Also the hash could be created using range, for example:
char_count = {}
('a'..'z').each{|l| char_count[l] = 0}
or:
char_count = ('a'..'z').inject({}){|hash,l| hash[l] = 0 ; hash}
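Putting the suggestions from these answers together, a minimal sketch might look like the following. It takes a string rather than a file so that it is self-contained; letter_counts is a hypothetical name, and folding upper case into lower case is an assumption about what the assignment wants.

```ruby
# Count alphabetic characters only, ignoring punctuation, digits and whitespace.
def letter_counts(text)
  counts = Hash.new(0) # missing keys start at 0
  text.each_char do |ch|
    ch = ch.downcase
    counts[ch] += 1 if ch =~ /[a-z]/
  end
  counts
end

p letter_counts("Hello, World!")
```

For a file, the same method could be fed File.read(path), or the loop could use each_char on the open file as in the first answer.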

How to create an ordered list of matches from multiple Regexps in a string?

How can one get a list of matches in a string from multiple different Regexps, and have these matches ordered relatively by their position in the string?
The string can contain multiple matches from the same Regexp.
Based on sepp2k's answer, here's the solution I implemented (simplified example):
test_data = "
a_word
another_word
23445
12432423
third_word
"
regexps = /(?<word>[a-zA-Z_]+)/, /(?<number>[\d]+)/
words = regexps.flat_map { |re| re.names }
matches = []
test_data.scan(Regexp.union(regexps)) do
words.each do |word|
m = Regexp.last_match
matches << {word => m.to_s} if m[word]
end
end
p matches
This outputs:
[{"word"=>"a_word"}, {"word"=>"another_word"}, {"number"=>"23445"}, {"number"=>"12432423"}, {"word"=>"third_word"}]
You can use Regexp.union to turn all the regexps into one regexp and then use String#scan to find all matches. The array returned by scan will be ordered by the position of the match.
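A minimal sketch of that approach, assuming each regexp wraps its pattern in exactly one named group as in the example above. The method name ordered_matches and the extra :offset field are additions for illustration only.

```ruby
# Scan once with the combined regexp; matches come back in string order.
def ordered_matches(text, regexps)
  union = Regexp.union(regexps)
  results = []
  text.scan(union) do
    m = Regexp.last_match
    # The named group that actually participated tells us which regexp matched.
    name = m.names.find { |n| m[n] }
    results << { name: name, value: m[0], offset: m.begin(0) }
  end
  results
end

regexps = [/(?<word>[a-zA-Z_]+)/, /(?<number>\d+)/]
p ordered_matches("a_word 23445 third_word", regexps)
```

Recording m.begin(0) makes the ordering explicit, which can help if results from several scans need to be merged later.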
That seems awfully complex when inject and a case statement will do IMHO:
> %w{a_word another_word 23445 12432423 third_word}.inject([]) {|s,v| s << case v when /^[a-zA-Z_]+$/ then {'word' => v} when /^\d+$/ then {'number' => v} end }
=> [{"word"=>"a_word"}, {"word"=>"another_word"}, {"number"=>"23445"}, {"number"=>"12432423"}, {"word"=>"third_word"}]
For readability you could have the following:
data = <<EOD
a_word
another_word
23445
12432423
third_word
EOD
data.split.inject([]) do |s,v|
s << case v
when /^[a-zA-Z_]+$/
{'word' => v}
when /^\d+$/
{'number' => v}
end
end
