Matching repeated character on index (0,1) and also on index (1,2) - ruby

I've got a string
698636235|2004-02-19||UN|
713220614|2009-10-07|||
This is part of a pipe-separated file (I know....) that I'm trying to load into MySQL.
I'm trying to use regex to fill empty field values with \N so that MySQL will insert null. However this is a problem when there are multiple fields that are null values.
My current regex is /\|\|/ which matches one instance of double pipe. This regex will match once at index (0,1).
Is it possible for regex to match ||| twice? Once at index (0,1) and once at index (1,2)?
If no, I'll just write a proper looping function.

I'd suggest using look-arounds:
(?<=^|\||\n)(?=\||$|\n)
It finds the zero-width position between two vertical bars, or between a vertical bar and the start/end of the line/text (that is, an empty field).
The first part, the positive look-behind (?<=, checks that the position we are interested in is preceded by start of text ^, a vertical bar | or a new line \n.
The second part, the positive look-ahead (?=, ensures it's followed by vertical bar, new line or end of text $.
See it here at regex101.
Edit
As per the comment, I added support for an empty field at the start of a line. (I had to check, but from what I can see Ruby supports look-behinds. The original, in case someone needs it in JS: \|(?=\||$|\n))
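A minimal Ruby sketch of this substitution, assuming the two sample lines from the question. The constant name EMPTY_FIELD and the variable names are only illustrative. The block form of gsub is used so that \N is inserted literally, without gsub's backslash processing of replacement strings.

```ruby
# Zero-width position preceded by start-of-line, "|" or "\n"
# and followed by "|", end-of-line or "\n": an empty field.
EMPTY_FIELD = /(?<=^|\||\n)(?=\||$|\n)/

lines = [
  '698636235|2004-02-19||UN|',
  '713220614|2009-10-07|||'
]

# The block's return value is inserted as-is at every empty field.
filled = lines.map { |line| line.gsub(EMPTY_FIELD) { '\N' } }

puts filled
# 698636235|2004-02-19|\N|UN|\N
# 713220614|2009-10-07|\N|\N|\N
```

Note that the position after the trailing pipe is also treated as an empty field. If the trailing pipe is only a line terminator in your file, the $ alternative in the look-ahead would need to go.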

Related

Trimming chr(49824) in the middle of a field in oracle

Unable to trim the non breakable space in the middle of a field in Oracle
'766195491 572'
I tried the method below, but it works only when non breakable space is present on the sides.
select length(trim(replace('766195491 572',chr(49824),''))) from dual;
it works only when non breakable space is present on the sides
That’s what the trim() function is supposed to do:
TRIM enables you to trim leading or trailing characters (or both) from a character string
“leading or trailing” means “at the sides”. It is not supposed to have any effect on appearances of the characters anywhere else in the source string.
You need to use the replace() or translate() functions instead; or for more complicated scenarios, regular expression functions.
If the input value is in a column named input_str, then:
translate(input_str, chr(49824), chr(32))
will replace every non-breakable space in the input string with a regular (breakable) space.
If you simply want to remove all non-breakable spaces and don't want to replace them with anything, then
replace(input_str, chr(49824))
(if you omit the third argument, the result is simply removing all occurrences of the second argument).
Perhaps the requirement is more complicated, though: find all occurrences of one or more consecutive non-breaking spaces and replace each such occurrence with exactly one standard space. That is more easily achieved with a regular expression function:
regexp_replace(input_str, chr(49824) || '+', chr(32))
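For comparison only, here is a rough Ruby analogue of the three Oracle calls above. It is a sketch, not Oracle code: it assumes UTF-8 strings, and the names NBSP, removed, spaced and collapsed are made up for the illustration. It also shows why 49824 denotes the non-breaking space: the number is the two UTF-8 bytes 0xC2 0xA0 read as a single integer.

```ruby
NBSP = "\u00A0" # non-breaking space, UTF-8 bytes 0xC2 0xA0

# Reading the two bytes as one number gives the argument used with chr().
code = NBSP.bytes.inject(0) { |acc, byte| acc * 256 + byte }

input_str = "766195491#{NBSP}572"

# Roughly replace(input_str, chr(49824)): remove every occurrence.
removed = input_str.delete(NBSP)

# Roughly translate(input_str, chr(49824), chr(32)): swap for a plain space.
spaced = input_str.tr(NBSP, ' ')

# Roughly regexp_replace(input_str, chr(49824) || '+', chr(32)):
# collapse each run of non-breaking spaces into one plain space.
collapsed = "766195491#{NBSP}#{NBSP}572".gsub(/\u00A0+/, ' ')

puts code           # 49824
puts removed.length # 12
```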
Try CHR(32) instead of CHR(49824)
select length(replace('766195491 572',chr(32),'')) from dual;
If it does not work, use something like this.
select length(regexp_replace('766195491 572','[^-a-zA-Z0-9]','') ) from dual;
DEMO

Removing trailing newlines with regex in Ruby's 'String#scan'

I have a string, which contains a bunch of HTML documents, tagged with #name:
string = "#one\n\n<html>\n</html>\n\n#two\n<html>\n</html>\n\n\n"
I want to get an array of two-element arrays, each with a tag as the first element and the HTML document as the second:
[ ["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"] ]
In order to solve the problem, I crafted the following regular expression:
regex = /(#.+)\n+([^#]+)\n+/
and applied it in string.scan regex.
However, instead of the desired output, I get the following:
[ ["#one", "<html>\n</html>\n"], ["#two", "<html>\n</html>\n\n"] ]
There are trailing newline characters at the end of each document. It appears that only one newline character was removed from each document, while the others stayed in place.
How can the aforementioned regular expression be changed in order to remove all the trailing newline characters from the resulting documents?
Only the last \n was thrown away because the two relevant capturing parts of your regex, .+ and [^#]+, are greedy: they capture as much as they can, and [^#]+ gives back only the final \n so that the rest of the pattern can match at all. It does not matter that they are followed by \n+. Remember that a regex is matched from left to right. If some substring (a sequence of \n in this case) can fit in either the preceding part or the following part of a regex, it goes to the preceding part.
More generally, I would suggest doing this:
string.split(/\s+(?=#)/).map{|s| s.strip.split(/\s+/, 2)}
# => [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
You can remove duplicated newlines first:
string.gsub(/\n+/, "\n").scan(regex)
# => [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
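The greedy behaviour described in the first answer can be checked directly. The sketch below runs the original pattern and then a variant that is not from either answer: it makes the second group lazy and adds a look-ahead, so the trailing \n+ has to run up to the next tag or the end of the string. The names greedy and lazy are only illustrative.

```ruby
string = "#one\n\n<html>\n</html>\n\n#two\n<html>\n</html>\n\n\n"

# Original pattern: the greedy [^#]+ keeps every newline except the one
# it must give back so that the final \n+ can match.
greedy = /(#.+)\n+([^#]+)\n+/
greedy_result = string.scan(greedy)

# Variant (an assumption of this sketch): the lazy [^#]+? stops as early
# as it can, and the look-ahead makes \n+ consume all trailing newlines.
lazy = /(#.+)\n+([^#]+?)\n+(?=#|\z)/
lazy_result = string.scan(lazy)

p greedy_result
# [["#one", "<html>\n</html>\n"], ["#two", "<html>\n</html>\n\n"]]
p lazy_result
# [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
```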

Is there a way to replace a range of characters in a string using gsub?

I would like to replace every character except the last 4 with a "#"...like you would see on a credit card statement. I have accomplished this using the Array#each method to iterate through indexes [0..-4] and then another for [-4..-1] and shoveling results from both into a new string. I'm thinking that maybe this could be better done with regex? But I am new to regex, and google hasn't turned up anything I can use in regards to replacing an entire range without losing the length of the string. I have tried
str.gsub(str[0..-5],'#')
(and a few other things) but it replaces the entire range with a single character. How can I accomplish my goal using regex?
Yep, this is possible with regex.
> "12345678".gsub(/.(?=.{4})/, "#")
=> "####5678"
> "12345678901234".gsub(/.(?=.{4})/, "#")
=> "##########1234"
Explanation:
.(?=.{4}) matches a character only if it is followed by at least four more characters. It therefore matches every character except the last four: the fourth character from the end is followed by only three characters, so the look-ahead fails there, and likewise for the third, second and last characters.
OR
> "12345678901234".gsub(/(?!.{1,4}$)./, "#")
=> "##########1234"
DEMO
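If the number of visible characters needs to vary, the first pattern can be wrapped in a small helper. This is only a sketch: the method name mask_all_but_last and the keep parameter are hypothetical, and it assumes single-line input, since . does not match a newline by default.

```ruby
# Hypothetical helper: mask every character except the last `keep` ones.
def mask_all_but_last(str, keep = 4)
  # A character is replaced only if at least `keep` characters follow it.
  str.gsub(/.(?=.{#{keep}})/, '#')
end

puts mask_all_but_last('12345678')    # ####5678
puts mask_all_but_last('12345678', 2) # ######78
puts mask_all_but_last('123')         # 123
```

Strings that are no longer than keep are returned unchanged, because no character in them is followed by enough characters for the look-ahead to succeed.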

Syntax Highlighting in Notepad++: how to highlight timestamps in log files

I am using Notepad++ to check logs. I want to define custom syntax highlighting for timestamps and log levels. Highlighting log levels works fine (defined as keywords). However, I am still struggling with highlighting timestamps of the form
06 Mar 2014 08:40:30,193
Any idea how to do that?
If you just want simple highlighting, you can use Notepad++'s regex search mode. Open the Find dialog, switch to the Mark tab, and make sure Regular Expression is set as the search mode. Assuming the timestamp is at the start of the line, this Regex should work for you:
^\d{2}\s[A-Za-z]+\s\d{4}\s\d{2}:\d{2}:\d{2},[\d]+
Breaking it down bit by bit:
^ means the following Regex should be anchored to the start of the line. If your timestamp appears anywhere but the start of a line, delete this.
\d means match any digit (0-9). {n} is a quantifier that means to match the preceding bit of Regex exactly n times, so \d{2} means match exactly two digits.
\s means match any whitespace character.
[A-Za-z] means match any character in the set A-Z or the set a-z, and the + is a quantifier that means match the preceding bit of Regex 1 or more times. So we're looking for a sequence of one or more alphabetic characters.
\s means match any whitespace character.
\d{4} is just like \d{2} earlier, only now we're matching exactly 4 digits.
\s means match any whitespace character.
\d{2} means match exactly two digits.
: matches a colon.
\d{2} matches exactly two digits.
: matches another colon.
\d{2} matches another two digits.
, matches a comma.
[\d]+ works similarly to the alphabetic search sequence we set up earlier, only this one's for digits. This finds one or more digits.
When you run this Regex on your document, the Mark feature will highlight anything that matches it. Unlike the temporary highlighting the "Find All in Document" search type can give you, Mark highlighting lasts even after you click somewhere else in the document.
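Before pasting the pattern into the Mark tab, it can be sanity-checked outside Notepad++. The sketch below does this in Ruby, which is an assumption of convenience: Notepad++ has its own regex engine, but the pattern uses only constructs that behave the same way here. The log lines are invented for the test.

```ruby
TIMESTAMP = /^\d{2}\s[A-Za-z]+\s\d{4}\s\d{2}:\d{2}:\d{2},\d+/

# Hypothetical log lines, modelled on the timestamp in the question.
at_start  = '06 Mar 2014 08:40:30,193 INFO Server started'
in_middle = 'INFO 06 Mar 2014 08:40:30,193 Server started'

# String#[] with a regex returns the matched text, or nil if there is none.
found   = at_start[TIMESTAMP]
missing = in_middle[TIMESTAMP]

p found   # "06 Mar 2014 08:40:30,193"
p missing # nil, because ^ anchors the match to the start of the line
```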

Ruby regex for text within parentheses

I am looking for a regex to replace all terms in parentheses unless the parentheses are within square brackets.
e.g.
(matches) #match
[(do not match)] #should not match
[[does (not match)]] #should not match
I currently have:
[^\]]\([^()]*\) #Not a square bracket, an opening bracket, any non-bracket character and a closing bracket.
However this is still matching words within the square brackets.
I have also created a rubular page of my progress so far: http://rubular.com/r/gG22pFk2Ld
A regex is not going to cut it for you if you can nest the square brackets (see this related question).
I think you can only do this with a regex if (a) you only allow one level of square brackets and (b) you assume all square brackets are properly matched. In that case
\([^()]*\)(?![^\[]*])
is sufficient - it matches any parenthesised expression not followed by an unpaired ]. You need (b) because of the limitations of negative lookbehind (only fixed length strings in 1.9, and not allowed at all in 1.8), which mean you are stuck matching (match)] even if you don't want to.
So basically if you need to nest, or to allow unmatched brackets, you should ditch the regex and look at the answer to the question I linked to above.
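A short Ruby check of this pattern against the three examples from the question, plus one input that breaks assumption (a). The constant name OUTSIDE_BRACKETS and the last sample are made up for the illustration, and the closing ] is escaped only to avoid a Ruby warning.

```ruby
# Parenthesised text that is not followed by an unpaired "]".
OUTSIDE_BRACKETS = /\([^()]*\)(?![^\[]*\])/

samples = [
  '(matches)',
  '[(do not match)]',
  '[[does (not match)]]'
]
results = samples.map { |s| s.gsub(OUTSIDE_BRACKETS, '') }
p results
# ["", "[(do not match)]", "[[does (not match)]]"]

# Limitation: with a second bracket pair nested after the parentheses,
# the look-ahead stops at the inner "[" and the match wrongly succeeds.
nested = '[(a) [b]]'.gsub(OUTSIDE_BRACKETS, '')
p nested
# "[ [b]]"
```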
This is a type of expression you cannot parse using a pure-regex approach, because you need to keep track of the current nesting, that is, the state of being inside square brackets or not (so you no longer have a type 3, i.e. regular, language).
However, depending on the exact circumstances, you can parse it with multiple regexes or simple parsers. Example approaches:
Split into sub-strings, delimited by [/[[ or ]/]], change the state when such a square bracket is encountered, replace () in a sub-string if in "not_in_square_bracket" state
Parse for square brackets (including content), remove & remember them (these are "comments"), now replace all the content in normal brackets and re-add the square brackets stuff (you can remember stuff by using unique temp strings)
The complexity of your solution also depends on whether escaping ] is allowed.
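The state-tracking idea can be sketched in Ruby as below. This is one possible reading of the first approach, not a definitive implementation: the names TOKEN and replace_parens_outside_brackets are hypothetical, ] is assumed not to be escapable, an unmatched ] is ignored, and parentheses that themselves contain brackets or nested parentheses are left alone.

```ruby
# Tokens: "[", "]", a parenthesised run without nested parens/brackets,
# or any other single character (/m lets "." match newlines too).
TOKEN = /\[|\]|\([^()\[\]]*\)|./m

def replace_parens_outside_brackets(text, replacement = '')
  depth = 0 # current square-bracket nesting level
  text.scan(TOKEN).map do |token|
    case token
    when '['
      depth += 1
      token
    when ']'
      depth -= 1 if depth.positive? # ignore an unmatched "]"
      token
    else
      parenthesised = token.start_with?('(') && token.end_with?(')')
      # Replace only complete (...) tokens seen outside all brackets.
      depth.zero? && parenthesised ? replacement : token
    end
  end.join
end

puts replace_parens_outside_brackets('(matches) [(do not match)] [[does (not match)]]')
#  [(do not match)] [[does (not match)]]
puts replace_parens_outside_brackets('[(a) [b]]')
# [(a) [b]]
```

Because the depth is a counter rather than a flag, nested square brackets are handled as well, which a single regex cannot do.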

Resources