I'm opening a file and finding the line I need, but then I have trouble creating a variable from the found string
70c 08:04:04.014 rexx TRACE 2203 8=4.4|9=892|35=J|49=ICE_SM_S|56=SM|34=280|70=0241608914160889|71=0|626=2|793=16|72=|466=1164266784|857=0|73=1|11=|37=1156426784|526=1156426674|38=1|198=1310883PTM|54=1|6=117.2100000000|336=R|625=P|55=B|461=FXXXXX|200=20120901|207=IFEU|53=1|30=ICE|453=2|448=SLM|447=C|452=7|448=FFC|447=C|452=12|75=20120210|60=20120310-09:04:04|77=O|58=CYU795|232=14|233=GL_TRADEJOBOUT|234=N|233=GL_ORDERJOBOUT|234=N|233=GL_TAKEN|234=0|233=GL_TRADETYPE|234=E|
This is the string and I want to assign it to a variable of tag198, so it would be
tag198 = '1310883PTM'
Anything after | is not needed.
tag198 = line.match(/198=(.*)/)[1]
puts tag198
but that keeps all after 198; I need just the string prior to the |.
Change your regular expression to:
/198=(.+?)\|/
That makes it non-greedy and stop at the vertical bar. You have to escape the vertical bar because it normally would mean "OR" in a regular expression.
Your regular expression's * is greedy, and will consume all characters it can without stopping the rest of the expression from matching. There is nothing in the expression that tells ruby when to stop collecting characters.
Look at regular-expressions.info. A partial fix for your problem would be to put a '|' after your capture:
tag198 = line.match(/198=(.*)\|/)[1]
puts tag198
The '|' is escaped as it has special meaning in regexes otherwise. This doesn't yet work though, because the * can still consume '|' characters, so long as it leaves one behind to match the '|' in our expression. To fix completely, prevent the * from capturing any pipes:
tag198 = line.match(/198=([^|]*)\|/)[1]
puts tag198
If it is only letters and numbers you could use
/198=([A-Za-z0-9]*)/
Also, in case you didn't know, you can test regular expressions on rubular.com. It also provides some information about special characters in regular expressions, and it is a great site for all your regular expression needs, even if it isn't for Ruby.
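To tie the answers above together, here is a minimal sketch of the negated-character-class approach wrapped in a helper. The name fix_tag and the shortened sample line are assumptions made for illustration and do not come from the question. The leading (?:\A|\|) is an extra guard, also an assumption, so that 198= is not found inside a longer tag number.

```ruby
# Hypothetical helper: returns the value of a tag from a pipe-delimited
# line, or nil when the tag is absent.
def fix_tag(line, tag)
  # [^|]* cannot run past the next pipe, so no trailing \| is required.
  # (?:\A|\|) makes sure the tag number starts a field.
  match = line.match(/(?:\A|\|)#{Regexp.escape(tag.to_s)}=([^|]*)/)
  match && match[1]
end

# Shortened version of the line from the question (illustrative only).
line = '8=4.4|9=892|35=J|38=1|198=1310883PTM|54=1|6=117.2100000000|'

tag198 = fix_tag(line, 198)
puts tag198            # prints 1310883PTM
p fix_tag(line, 999)   # prints nil
```

Returning nil for a missing tag avoids the NoMethodError that line.match(...)[1] raises when nothing matches.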
Related
I'm trying to censor letters in a word with word.gsub(/[^#{guesses}]/i, '-'), where word and guesses are strings.
When guesses is "", I get this error RegexpError: empty char-class: /[^]/i. I could sort such cases with an if/else statement, but can I add something to the regex to make it work in one line?
Since you are only matching (or not matching) letters, you can add a non-letter character to your regex, e.g. # or %:
word.gsub(/[^%#{guesses}]/i, '-')
See IDEONE demo
If #{guesses} is empty, the regex will still be valid, and since % does not appear in a word, there is no risk of censoring some guessed percentage sign.
You have two options. One is to skip the substitution when guesses is empty, that is:
unless guesses.empty?
  word.gsub(/[^#{Regexp.escape(guesses)}]/i, '-')
end
Although that's not your intention, it's really the safest plan here and is the most clear in terms of code.
Or you could use the tr function instead, though only for non-empty strings, so this could be substituted inside the unless block:
word.tr('^' + guesses.downcase + guesses.upcase, '-')
Generally tr performs better than gsub if used frequently. It also doesn't require any special escaping.
Edit: Added a note about tr not working on empty strings.
Since tr treats ^ as a special case on empty strings, you can use an embedded ternary, but that ends up confusing what's going on considerably:
word.tr(guesses.empty? ? '' : ('^' + guesses.downcase + guesses.upcase), '-')
This may look somewhat similar to tadman's answer.
Probably you should keep the string that represents what you want to hide, instead of what you want to show. Let's say this is remains. Then it would be as easy as:
word.tr(remains.upcase + remains.downcase, "-")
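As a rough sketch of the guard-clause option discussed above, the empty case can be handled explicitly inside a small method. The method name censor is made up for illustration, and the sketch assumes that an empty guesses string should hide every character.

```ruby
# Hypothetical helper: replaces every character of word that has not
# been guessed with '-'. Matching is case-insensitive.
def censor(word, guesses)
  # An empty character class is a RegexpError, so handle it up front.
  return '-' * word.length if guesses.empty?

  # Regexp.escape keeps characters such as ']' or '-' from changing
  # the meaning of the character class.
  word.gsub(/[^#{Regexp.escape(guesses)}]/i, '-')
end

puts censor('hangman', 'an')  # prints -an--an
puts censor('hangman', '')    # prints -------
puts censor('Hangman', 'h')   # prints H------
```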
I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right.. but it appears as though the outside parens causes a match to return an array. Is this the function of the parens?
EDIT: I'm using Rubular to test out my regex expressions
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
In addition to @minitech's answer, you can also make a third variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than look-behind concept. . .
/(?=: ?(.+))/
This will place a grouping around your existing regex where it will catch it within a group.
And yes, the outside parenthesis in your code will make a match. Compare that to the latter example I gave where the entire look-ahead is 'grouped' rather than needlessly using a /( ... )/ without the /(?= ... )/, since the first result in most regular expression engines return the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
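One caveat with the split approach is a value that itself contains ": ". A small sketch, assuming you want everything after the first separator, is to pass a limit of 2 to split. The method name value_after_colon is made up for this example.

```ruby
# Hypothetical helper: returns the text after the first ": " in a line.
# If the line has no ": ", the whole line comes back unchanged.
def value_after_colon(line)
  # The limit of 2 splits at the first separator only.
  line.split(': ', 2).last
end

puts value_after_colon('Title: This is a sample title')  # prints This is a sample title
puts value_after_colon('Note: see: appendix')             # prints see: appendix
puts 'Note: see: appendix'.split(': ')[-1]                # prints appendix
```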
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
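Since the goal was to dump the values into an array, here is a short sketch that feeds the \K pattern to String#scan, next to the capture-group variant. It also shows what the asker noticed about parentheses: with a group in the pattern, scan returns one inner array per match. The constant SAMPLE and both method names are assumptions made for this example.

```ruby
# Sample data modeled on the question (illustrative only).
SAMPLE = <<~TEXT
  Title: This is a sample title
  Content: This is sample content
  Date: 12/21/2012
TEXT

# \K drops the colon and blanks from the reported match, so scan
# returns a flat array of strings.
def values_with_keep(text)
  text.scan(/:[[:blank:]]*\K.+/)
end

# With a capture group, scan returns an array of arrays (one inner
# array per match), so flatten it.
def values_with_group(text)
  text.scan(/: (.+)/).flatten
end

p values_with_keep(SAMPLE)
p SAMPLE.scan(/: (.+)/)
p values_with_group(SAMPLE)
```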
What would this mean in an expression?
(?m:.*?)
or this
(?m:\s*)
I mean, it appears to be something to do with whitespace but I'm unsure.
ADDITIONAL DETAILS:
The full expression I'm looking at is:
\A((?m:\s*)((\/\*(?m:.*?)\*\/)|(\#\#\# (?m:.*?)\#\#\#)|(\/\/ .* \n?)+|(\# .* \n?)+))+
(?...) is a way of applying modifiers to the regular expression inside the parentheses.
(?:...) allows you to treat the part between the parentheses as a group, without affecting the set of strings captured by the matching engine. But you can add option letters between the ? and the :, in which case the part of the regular expression between the parentheses behaves as if you had included those option letters when creating the regular expression. That is, /(?m:...)/ behaves the same as /.../m.
The m, in turn, enables "multiline" mode.
CORRECTED:
Here's where I got confused in the original answer, because this option has different meanings in different environments.
This question is tagged Ruby, in which "multiline mode" causes the dot character (.) to match newlines, whereas normally that's the one character it doesn't match:
irb(main):001:0> "a\nb" =~ /a.b/
=> nil
irb(main):002:0> "a\nb" =~ /a.b/m
=> 0
irb(main):003:0> "a\nb" =~ /(?m:a.b)/
=> 0
So your first regular expression, (?m:.*?) will match any number (including zero) of any characters (including newlines). Basically, it will match anything at all, including nothing.
In the second regular expression, (?m:\s*), the modifier has no effect at all because there are no dots in the contained expression to modify.
Back to the first expression. As Ωmega says, the ? after the * means that it is a non-greedy match. If that were the whole expression, or if there were no captures, it wouldn't matter. But when something follows that section and there are captures, you get different results. Without the ?, the longest possible match wins:
irb(main):001:0> /<(.*)>/.match("<a><b>")[1]
=> "a><b"
With the ?, you get the shortest one instead:
irb(main):002:0> /<(.*?)>/.match("<a><b>")[1]
=> "a"
Finally, about the above-mentioned /m confusion (though if you want to avoid becoming confused yourself, this might be a good place to stop reading):
In Perl 5 (which is the source of most regular expression extensions beyond the basic syntax), the behavior triggered by /m in Ruby is instead triggered by the /s option (which Ruby doesn't have, though if you put one on your regex it will silently ignore it). In Perl, /m, despite still being called "multiline mode", has a completely different effect: it causes the ^ and $ anchors to match at newlines within the string as well as at the beginning and end of the whole string respectively. But in Ruby, that behavior is the default, and there's not even an option to change it.
Pattern .*? will match any string, but as short string as possible, as there is a lazy operator ?.
Pattern \s* will match white-space characters (zero or more).
(?m) enables "multi-line mode". In this mode, the caret and dollar match before and after newlines in the subject string. To apply this mode to some sub-pattern only, the syntax (?m:...) is used, where ... is a matching pattern.
For more information read http://www.regular-expressions.info/modifiers.html
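Because the answers above disagree about what m does, a quick check in Ruby itself may help. This is only a sketch of Ruby's behavior; other engines differ, as explained above. The constant TWO_LINES is a name chosen for this example.

```ruby
TWO_LINES = "a\nb"

# Without m, the dot refuses to match "\n".
p(/a.b/ =~ TWO_LINES)          # prints nil

# (?m:...) switches the option on for that group only.
p(/a(?m:.)b/ =~ TWO_LINES)     # prints 0

# The dot outside the group is unaffected, so the second "\n" fails.
p(/(?m:a.b).c/ =~ "a\nb\nc")   # prints nil

# In Ruby, ^ and $ match at line boundaries with or without m.
p(/^b$/ =~ TWO_LINES)          # prints 2

# \A and \z anchor to the whole string instead.
p(/\Ab\z/ =~ TWO_LINES)        # prints nil
```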
Using Ruby, I am writing a regular expression, and I need to be able to remove any colon that appears between parentheses. I understand that I can use
"This is a (string :)".sub!(/\([^\)]*:/, '')
to do this, but the problem is that this function will also remove the context along with it. Is there any way to specify that I only want it to remove the colon and not the entire matching expression?
Some regular expression engines support what are called look-ahead and look-behind matches, which match but do not consume characters. Ruby 1.8 supports look-ahead, but not look-behind (which is more difficult to do in a performant way); Ruby 1.9 and later also support fixed-width look-behind. That means you could quite easily stick with sub and remove a colon that precedes a closing parenthesis, but without ensuring that it comes after an opening parenthesis:
string = 'This is a (string :)'
string.sub(/:(?=\))/, '')
# => 'This is a (string )'
The alternative would be to use subpattern capturing (which happens automatically when you use grouping in your regular expression) to rebuild the string without the undesirable portion, in this case the colon:
string.sub(/(\([^:]+):\)/, '\1)')
The \1 is a back-reference to what is matched in the first group, which is delimited by the parentheses that are not escaped. You can see here I didn't bother capturing the closing parenthesis in a second group, opting instead simply to include it in the substitution. This works well in this case because it will not change, but if you don't know that the colon will appear at the end of the parentheses-enclosed content, you would need a second group:
string.sub(/(\([^:]+):([^)]+\))/, '\1\2')
The prior answer will mostly work for deleting single colons within paren groups, but has trouble with multiples like '(thing:foo:bar)'. It would be nice to use lookbehind and lookahead to make the within-parens assertion, but Ruby (like most regexp engines) doesn't support non-deterministic length patterns in lookbehind.
irb> s = 'x (a:b:c) : (1:2:3) y'
=> "x (a:b:c) : (1:2:3) y"
irb> s.gsub /(?<=\([^\(]*):(?=[^\)]*\))/, ''
SyntaxError: (irb):10: invalid pattern in look-behind: /(?<=\([^\(]*):(?=[^\)]*\))/
from /Users/dbenhur/.rbenv/versions/1.9.2-wp/bin/irb:12:in `<main>'
You could instead use the block form of gsub to capture paren enclosed groups, then remove colons from each match:
irb> s.gsub(/\([^\)]*\)/) {|m| m.delete ':'}
=> "x (abc) : (123) y"
In regex in general, you can use '(\()(:)(\))' as the pattern and \1\3 as the replacement.
I'm not familiar with Ruby. Basically what you do is you have 3 groups, and from this three groups ( : and ) you get rid of the second one, the :.
I tested it in Notepad++ and it works.
I think this is called: regex backreference
Cheers.
If you can assume all parentheses will come in balanced pairs like they do in your example, this should be all you need:
"This is a (string :)".gsub!(/:(?=[^()]*\))/, '')
If the lookahead succeeds in finding a closing paren without seeing an opening paren first, the colon must be inside a (...) sequence. Notice how I excluded the opening paren as well as the closing paren; that's essential.
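To compare the two approaches that handle several colons per group, here is a small sketch running both on the same input. The method names are made up for illustration. The last call shows where they differ: the lookahead version relies on the balanced-parentheses assumption, while the block version only touches text inside a complete (...) pair.

```ruby
# Block form: find each parenthesized group, then delete its colons.
def strip_colons_block(str)
  str.gsub(/\([^()]*\)/) { |group| group.delete(':') }
end

# Lookahead form: delete a colon when a ')' follows before any '('.
def strip_colons_lookahead(str)
  str.gsub(/:(?=[^()]*\))/, '')
end

s = 'x (a:b:c) : (1:2:3) y'
puts strip_colons_block(s)       # prints x (abc) : (123) y
puts strip_colons_lookahead(s)   # prints x (abc) : (123) y

# Unbalanced input: only the lookahead version removes this colon.
unbalanced = 'a : b) c'
p strip_colons_block(unbalanced)
p strip_colons_lookahead(unbalanced)
```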
I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
  real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby 1.8's regexen have no way to specify zero-width lookbehinds (Ruby 1.9 and later support fixed-width ones, as an answer below notes). The most generic solution is to group the part of the regexp before the real matching part and the part after it, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
  $1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/(!\[.*?\])\((.*?)\)/, '\1(/folder1/\2)')
p s # "This is a ![foto](/folder1/foto.jpeg)"
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
  image.gsub(/(?<=\()(.*)(?=\))/) do |link|
    "/a/new/path/" + link
  end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regular expression subject to the following condition. Expressions in lookbehinds have to be fixed-width due to limitations of the regex implementation, which means that they can't include expressions with an unknown number of repetitions or alternations with different-width choices. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads, though.)
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
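Putting the documentation quote to work, here is a sketch of the three-group solution as a self-contained method. The real_path below is a hypothetical stand-in for the asker's function, written only to reproduce the folders from the example, and rewrite_image_paths is an illustrative name. The match data is saved first because $1, $2 and $3 would be overwritten if real_path ran another regexp match.

```ruby
# Hypothetical stand-in for the asker's real_path function.
def real_path(path)
  folder = path.end_with?('.png') ? '/folder2' : '/folder1'
  "#{folder}/#{path}"
end

# Rewrites the path inside every Markdown image, keeping the rest.
def rewrite_image_paths(content)
  content.gsub(/(!\[.*?\]\()(.*?)(\))/) do
    # Save the match before calling anything that might match again.
    m = Regexp.last_match
    m[1] + real_path(m[2]) + m[3]
  end
end

content = 'This is a ![foto](foto.jpeg), here is another ![foto](foto.png)'
puts rewrite_image_paths(content)
# prints This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
```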
As a side note, some people think '\1' is unsuitable when the text around the match has no fixed length. For example, if you want to match and modify something in the middle, how can you protect the characters on both sides?
It's easy: put capture groups around the parts you want to keep.
For example, I want to turn a-ruby-porgramming-book-531070.png into a-ruby-porgramming-book.png, that is, remove everything between the last "-" and the last ".".
I can use /.*(-.*?)\./ to match -531070. Now how should I replace it? Notice that
everything else has no definite format.
The answer is to put a capture group around the rest, then restore it with a back-reference:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want to add something before the matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"
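For comparison, a different technique from the capture groups above is a lookahead, which matches only the part to delete and leaves the surroundings untouched. This is just a sketch under the assumption that the segment to drop contains no "-" or "." and sits right before the final extension. The method name is made up for this example.

```ruby
# Hypothetical helper: removes the last "-segment" that sits directly
# before the final ".extension" of a file name.
def drop_last_dash_segment(name)
  # (?=\.[^.]*\z) checks for the final extension without consuming it,
  # so no back-references are needed in the replacement.
  name.sub(/-[^-.]*(?=\.[^.]*\z)/, '')
end

puts drop_last_dash_segment('a-ruby-porgramming-book-531070.png')
# prints a-ruby-porgramming-book.png
puts drop_last_dash_segment('plain.png')
# prints plain.png
```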