Ruby regex match specific string with special conditions - ruby

I'm currently trying to parse a document into tokens with the help of regex.
Currently I'm trying to match the keywords in the document. For example I have the following document:
Func test()
Return blablaFuncblabla
EndFunc
The keywords that need to be matched are Func, Return and EndFunc.
I've come up with the following regex to match the Func keyword: (\s|^)(Func)(\s|$). It doesn't work quite how I want, though, because the whitespace is matched as well.
How can I match it without capturing the whitespace?

(?:\s|^)(Func)(?:\s|$)
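The ?: prefix makes a group non-capturing, so only Func ends up in a capture group. The surrounding whitespace is still consumed by the overall match.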
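As a rough sketch of the difference (assuming the whole match, not only the capture group, should exclude whitespace), lookarounds are one alternative to non-capturing groups. The doc string is the sample from the question; the variable names and the lookaround pattern are illustrative and not part of the original answer.

```ruby
doc = "Func test()\nReturn blablaFuncblabla\nEndFunc\n"

# Non-capturing groups keep whitespace out of the capture,
# but the overall match still consumes it.
m = doc.match(/(?:\s|^)(Func)(?:\s|$)/)
whole_match = m[0]   # "Func " (the trailing space is part of the match)
captured    = m[1]   # "Func"

# Lookarounds only assert that no non-whitespace character is adjacent;
# they consume nothing, so the match is the keyword alone.
keyword  = /(?<!\S)(?:Func|Return|EndFunc)(?!\S)/
keywords = doc.scan(keyword)   # ["Func", "Return", "EndFunc"]
```

The Func inside blablaFuncblabla is skipped because it has non-whitespace characters on both sides.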

Related

how to match regex field with given string in elasticsearch (text to regex match)

my json example
obj=[{'id':1,'name':'dhaka','pattern':'dha*ka*'},
{'id':2,'name':'cumilla','pattern':'(c|k)(u|o)+m(i|e)l+a+'},...]
query string: homna kumilla
I want to fetch the docs whose pattern matches the query string.
Expected result
{'id':2,'name':'cumilla','pattern':'(c|k)(u|o)+m(i|e)l+a+'}
What should the mappings and the query look like?

Regex group match if present

My input string is:
/234243/source_path/a/b/c.test
or something like:
/234243/source_path/a/b/c.test/check_w123
I want a regex to match substrings starting with source and check with the result like:
source: source_path/a/b/c.test/
check: check_w123
using a regex like /(?<source>source.*)(?<check>check.*)/ without ? in the last group.
My regex is:
/(?<source>source.*)(?<check>check.*)?/
My Result is:
source: source_path/a/b/c.test/check_w123
check: nil
Just turn the .* inside the first (source) group into its non-greedy form, and don't forget to add the end-of-line anchor.
(?<source>source.*?)(?<check>check.*)?$
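A minimal Ruby sketch of this pattern applied to the two inputs from the question. The variable names are illustrative only.

```ruby
pattern = /(?<source>source.*?)(?<check>check.*)?$/

inputs = [
  "/234243/source_path/a/b/c.test",
  "/234243/source_path/a/b/c.test/check_w123"
]

# The lazy .*? gives the optional check group a chance to match,
# and the $ anchor forces the lazy part to expand when check is absent.
results = inputs.map do |input|
  m = pattern.match(input)
  { source: m[:source], check: m[:check] }
end

# results[0] => source: "source_path/a/b/c.test",  check: nil
# results[1] => source: "source_path/a/b/c.test/", check: "check_w123"
```

Without the $ anchor the lazy .*? would match nothing at all, and source would be just "source".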
(?<source>source.*?(?!.*\/))(?<check>check.*)?
Try this. See the demo:
https://regex101.com/r/rU8yP6/11

Logstash filter regular expression difference in JRuby and Ruby

The Logstash filter regular expression to parse our syslog stream is getting more and more complicated, which led me to write tests. I simply copied the structure of a Grok test in the main Logstash repository, modified it a bit, and ran it with bin/logstash rspec as explained here. After a few hours of fighting with the regular expression syntax, I found out that there is a difference in how modifier characters have to be escaped. Here is a simple test for a filter involving square brackets in the log message, which you have to escape in the filter regular expression:
require "test_utils"
require "logstash/filters/grok"
describe LogStash::Filters::Grok do
extend LogStash::RSpec
describe "Grok pattern difference" do
config <<-CONFIG
filter {
grok {
match => [ "message", '%{PROG:theprocess}(?<forgetthis>(: )?(\\[[\\d:|\\s\\w/]*\\])?:?)%{GREEDYDATA:message}' ]
add_field => { "process" => "%{theprocess}" "forget_this" => "%{forgetthis}" }
}
}
CONFIG
sample "uwsgi: [pid: 12345|app: 0|req: 21/93281] BLAHBLAH" do
insist { subject["tags"] }.nil?
insist { subject["process"] } == "uwsgi"
insist { subject["forget_this"] } == ": [pid: 12345|app: 0|req: 21/93281]"
insist { subject["message"] } == "BLAHBLAH"
end
end
end
Save this as e.g. grok_demo.rb and test it with bin/logstash rspec grok_demo.rb, and it will work. If you remove the double escapes in the regexp, though, it won't.
I wanted to try the same thing in straight Ruby, using the same regular expression library that Logstash uses, and followed the directions given here. The following test worked as expected, without the need for double escape:
require 'rubygems'
require 'grok-pure'
grok = Grok.new
grok.add_patterns_from_file("/Users/ulas/temp/grok_patterns.txt")
pattern = '%{PROG:theprocess}(?<forgetthis>(: )?(\[[\d:|\s\w/]*\])?:?)%{GREEDYDATA:message}'
grok.compile(pattern)
text1 = 'uwsgi: [pid: 12345|app: 0|req: 21/93281] BLAHBLAH'
puts grok.match(text1).captures()
I'm not a Ruby programmer, and am a bit lost as to what causes this difference. Is it possible that the heredoc config specification necessitates double escapes? Or does it have to do with the way the regular expression gets passed to the regexp library within Logstash?
I've never written tests for Logstash, but my guess is that the double escape is needed because you have strings embedded in strings.
The section:
<<-CONFIG
# stuff here
CONFIG
is a heredoc in Ruby (a fancy way to build a string). So filter, grok, match, add_field and all the brackets and braces are actually part of that string. Inside it, you are escaping the backslash so that the resulting string contains a single literal backslash. I'm guessing that this string gets eval'd somewhere so that the filter configuration is set up as needed, and that's where the single escape sequence is used.
When using "straight Ruby" you aren't doing this double interpretation. You're just passing a string directly into the method that compiles it.
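The Ruby side of this can be checked in isolation. The sketch below uses only the bracket portion of the pattern from the question and shows that an unquoted heredoc processes backslash escapes like a double-quoted string, while a single-quoted string (or a single-quoted heredoc) leaves them alone. What Logstash does with the string afterwards is not shown here; the variable names are illustrative.

```ruby
# Unquoted heredoc: escapes are processed, so \\ becomes a single backslash.
double_escaped = <<-CONFIG
(\\[[\\d:|\\s\\w/]*\\])
CONFIG

# Same heredoc with single backslashes: Ruby consumes them
# (\[ becomes [, \d becomes d, \s becomes a space, \w becomes w).
single_escaped = <<-CONFIG
(\[[\d:|\s\w/]*\])
CONFIG

# Single-quoted heredoc: no escape processing at all.
raw = <<-'CONFIG'
(\[[\d:|\s\w/]*\])
CONFIG

# Single-quoted string, as in the "straight Ruby" example.
plain = '(\[[\d:|\s\w/]*\])'

puts double_escaped.strip == plain   # true
puts raw.strip == plain              # true
puts single_escaped.strip            # ([[d:| w/]*])
```

So the double escapes are needed because of the unquoted heredoc. Using <<-'CONFIG' would be one way to write the pattern with single backslashes, assuming nothing else in the config relies on interpolation.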

How to query a Mongoid Regexp Field?

I want to create a regex field in my Mongoid document so that I can have a behavior something like this:
MagicalDoc.create(myregex: /abc\d+xyz/)
MagicalDoc.where(myregex: 'abc123xyz')
I'm not sure if this is possible or what kind of effect it would have. How can I achieve this sort of functionality?
Update: I've learned from the documentation that Mongoid supports Regexp fields but it does not provide an example of how to query for them.
class MagicalDoc
include Mongoid::Document
field :myregex, type: Regexp
end
I would also accept a pure MongoDB answer. I can find a way to convert it to Mongoid syntax.
Update: Thanks to SuperAce99 for helping find this solution. Pass a string to Mongoid's where method and it will be turned into a JavaScript function:
search_string = 'abc123xyz'
MagicalDoc.where(%Q{ return this.myregex.test("#{search_string}") })
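%Q is Ruby's percent-literal syntax for a double-quoted string: it supports interpolation and lets you use double quotes inside the string without escaping them.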
BSON does have a regular expression type, but you'll still have to figure out how Mongoid represents the field in order to devise a proper query.
Query String using Regex
If you want to send MongoDB a regular expression and get back the documents where a string field matches it, MongoDB provides the $regex query operator.
Query Regex using String
If you want to send Mongo a string and return all documents that have a regular expression that matches the provided string, you'll probably need the $where operator. This allows you to run a JavaScript expression on each document:
db.myCollection.find( { $where: function() { return (this.credits == this.debits) } } )
You can define a function which returns true when the provided string matches the regex stored in the document. Obviously this can't use an index, because it has to execute code for every document in the collection. These queries will be very slow.
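To see why such a query cannot use an index, here is a plain-Ruby sketch of the work that $where effectively performs: every document's stored regex is tested against the search string. The in-memory docs array and its contents are hypothetical stand-ins for a collection; no MongoDB or Mongoid call is involved.

```ruby
# Hypothetical documents, each carrying its own regex.
docs = [
  { id: 1, myregex: /abc\d+xyz/ },
  { id: 2, myregex: /^hello/ },
  { id: 3, myregex: /xyz\z/ }
]

search_string = 'abc123xyz'

# Full scan: each stored regex has to be run against the string.
matching_ids = docs
  .select { |doc| doc[:myregex].match?(search_string) }
  .map { |doc| doc[:id] }
# matching_ids => [1, 3]
```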

Use Xpath to find the appropriate element based on the element value

I have the following xml snippet
<ZMARA01 SEGMENT="1">
<CHARACTERISTICS_01>X,001,COLOR_ATTRIBUTE_FR,BRUN ÉCORCE,TMBR,French C</CHARACTERISTICS_01>
<CHARACTERISTICS_02>X,001,COLOR_ATTRIBUTE,Timber Brown,TMBR,Color Attr</CHARACTERISTICS_02>
</ZMARA01>
I am looking for an xpath expression that will match based on COLOR_ATTRIBUTE. It will not always be in CHARACTERISTIC_02. It could be CHARACTERISTIC_XX. Also I don't want to match COLOR_ATTRIBUTE_FR. I have been using this:
Transaction.Input_XML{/ZMAT/IDOC/E1MARAM/ZMARA01/*[starts-with(local-name(.), 'CHARACTERISTIC_')][contains(.,'COLOR_ATTRIBUTE')]}
This gets me mostly there but it matches both COLOR_ATTRIBUTE and COLOR_ATTRIBUTE_FR
Use:
contains(concat(',', ., ','), ',COLOR_ATTRIBUTE,')
This first surrounds the string value of the context node with commas, then simply tests whether the resulting string contains ',COLOR_ATTRIBUTE,'.
Thus we treat all cases (pattern at the start of the string, pattern at the end of the string and pattern neither at the start or at the end) in the same single way.
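The same comma-wrapping idea, sketched in Ruby rather than XPath so it can be run on plain strings. The first two values come from the XML in the question; the last two are hypothetical values added to cover the start-of-string and end-of-string cases, and the field_present lambda is an illustrative name.

```ruby
values = [
  "X,001,COLOR_ATTRIBUTE_FR,BRUN ÉCORCE,TMBR,French C",
  "X,001,COLOR_ATTRIBUTE,Timber Brown,TMBR,Color Attr",
  "COLOR_ATTRIBUTE,first field",   # hypothetical: token at the start
  "last field,COLOR_ATTRIBUTE"     # hypothetical: token at the end
]

# Wrap the value in commas so every field, including the first and last,
# is delimited on both sides; then look for the delimited token.
field_present = ->(value, token) { ",#{value},".include?(",#{token},") }

matching = values.select { |v| field_present.call(v, "COLOR_ATTRIBUTE") }
# matching contains every value except the COLOR_ATTRIBUTE_FR one
```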
If COLOR_ATTRIBUTE is guaranteed not to be in the first or last position, you could use [contains(.,',COLOR_ATTRIBUTE,')]; otherwise you could use something like [contains(.,'COLOR_ATTRIBUTE') and not(contains(.,'COLOR_ATTRIBUTE_FR'))].
