I am attempting to separate blocks of Japanese text into individual sentences using regex. Right now I'm mostly experimenting on Rubular, but here is what I have so far.
regex: /(.*?(。|?|!))/
sample text
強面のため周囲の人から敬遠されている主人公が、クラスメイトと共通の話題を持とうとVRMMORPG「アナザーワールド」のベータテストに申し込んだ。ところが当選したのは彼一人。しかたなくひとりでゲーム内の仮想世界「イストピア」に「ケイオス」と名乗って乗り込んだが、そこはゲームでありながら五感すべてを体感でき、現実と間違えるほどのリアルな世界だった。サポートAIのテミスの協力を得つつ、クエストをこなしていったが、実はそこは本物の異世界「イストピア」であり、ケイオスのこなしたクエストによって、多くの人が影響を受けて……というお話。その戯言、聞き飽きたわ!あれ、ここにあった筆入れはどこにやったの?
The results I'm getting are correct; however, it is also separately matching the punctuation characters.
How can I improve my regular expression so that the punctuation mark isn't separately matched?
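Using (.*?[。?!]) seems to do the trick; check it on Rubular. The character class [。?!] replaces the inner group (。|?|!), so there is no second capture group left to report the punctuation on its own.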
Match 1
1. 強面のため周囲の人から敬遠されている主人公が、クラスメイトと共通の話題を持とうとVRMMORPG「アナザーワールド」のベータテストに申し込んだ。
Match 2
1. ところが当選したのは彼一人。
Match 3
1. しかたなくひとりでゲーム内の仮想世界「イストピア」に「ケイオス」と名乗って乗り込んだが、そこはゲームでありながら五感すべてを体感でき、現実と間違えるほどのリアルな世界だった。
Match 4
1. サポートAIのテミスの協力を得つつ、クエストをこなしていったが、実はそこは本物の異世界「イストピア」であり、ケイオスのこなしたクエストによって、多くの人が影響を受けて……というお話。
Match 5
1. その戯言、聞き飽きたわ!
Match 6
1. あれ、ここにあった筆入れはどこにやったの?
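A minimal Ruby sketch of the same idea, assuming a shortened version of the sample text (the variable names are illustrative, not from the question). String#scan returns whole matches when the pattern has no capture group, and arrays of groups when it has some, which is why the nested group shows the punctuation separately. The ASCII ? is escaped inside the group here, since a bare ? would be read as a quantifier.

```ruby
# Shortened sample; the punctuation marks are the ones used in the question.
text = "ところが当選したのは彼一人。その戯言、聞き飽きたわ!あれ、ここにあった筆入れはどこにやったの?"

# No capture groups: scan yields one string per sentence.
sentences = text.scan(/.*?[。?!]/)

# Nested groups, as in the question: scan yields [sentence, punctuation] pairs.
pairs = text.scan(/(.*?(。|\?|!))/)

sentences.each { |s| puts s }
puts pairs.first.inspect
```

If the outer group is still wanted, a non-capturing inner group, (?:。|\?|!), would also avoid the extra capture.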
What about this?
str.scan /[\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}[[:punct:]]]+/
=> ["強面のため周囲の人から敬遠されている主人公が、クラスメイトと共通の話題を持とうと",
"「アナザ",
"ワ",
"ルド」のベ",
"タテストに申し込んだ。ところが当選したのは彼一人。しかたなくひとりでゲ",
"ム内の仮想世界「イストピア」に「ケイオス」と名乗って乗り込んだが、そこはゲ",
"ムでありながら五感すべてを体感でき、現実と間違えるほどのリアルな世界だった。サポ",
"ト",
"のテミスの協力を得つつ、クエストをこなしていったが、実はそこは本物の異世界「イストピア」であり、ケイオス のこなしたクエストによって、多くの人が影響を受けて……というお話。その戯言、聞き飽きたわ!あれ、ここにあった筆入れはどこにやったの?"]
http://rubular.com/r/8CtYuV8AAl
Why does this regex not match 3a?
(\/\d{1,4}?|\d{1,4}?|\d{1,4}[A-z]{1})
Using \d{1,4}\D{1}, the result is the same.
Street numbers:
/1
78
3a
89/
-1 (special case)
1
https://regex101.com/r/cYCafR/3
The digits+letter combination is not matched due to the order of alternatives in your pattern. The \d{1,4}? matches the digit before the letter, and \d{1,4}[A-z]{1} does not even have a chance to step in. See the Remember That The Regex Engine Is Eager article.
The \/\d{1,4}? will match a / and a single digit after the slash, and \d{1,4}? will always match a single digit, as {min,max}? is a lazy range/interval/limiting quantifier and as such only matches as few chars as possible. See Laziness Instead of Greediness.
Besides, [A-z] is a typo; it should be [A-Za-z]. The range A-z also covers the six ASCII characters that sit between Z and a: [, \, ], ^, _ and `.
It seems you want
\d{1,4}[A-Za-z]|\/?\d{1,4}
See the regex demo. If it should be at the start of a line, use
^(?:\d{1,4}[A-Za-z]|\/?\d{1,4})
See this regex demo.
Details
^ - start of a line
(?: - start of a non-capturing group
\d{1,4}[A-Za-z] - 1 to 4 digits and an ASCII letter
| - or
\/? - an optional /
\d{1,4} - 1 to 4 digits
) - end of the group.
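A small Ruby sketch comparing the original pattern with the reordered one on the street numbers from the question. The constant names are illustrative, and Ruby is only assumed here as the host language; the same ordering issue applies in any backtracking engine.

```ruby
# Original pattern: the lazy \d{1,4}? alternative wins before the
# digits+letter alternative is ever tried.
ORIGINAL = /(\/\d{1,4}?|\d{1,4}?|\d{1,4}[A-z]{1})/

# Reordered pattern: the longest, most specific alternative comes first.
FIXED = /^(?:\d{1,4}[A-Za-z]|\/?\d{1,4})/

samples = ["/1", "78", "3a", "89/", "-1", "1"]

samples.each do |s|
  # String#[] with a regex returns the matched text, or nil if nothing matches.
  puts "#{s.inspect}: original=#{s[ORIGINAL].inspect} fixed=#{s[FIXED].inspect}"
end
```

With the anchored pattern, "-1" is not matched at all, so the special case would still need its own handling.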
Your regex uses lazy quantifiers like {1,4}?. These will match one character and stop, because nothing follows them in the pattern, so the match already succeeds at that point. See here for how greedy vs lazy quantifiers work.
Another reason is that you put the \d{1,4}[A-z]{1} case last. This case will only be tried if the first two cases don't match. With 3a, the 3 already matches the second case, so the last case won't be considered.
You seem to just want:
^(\d{1,4}[A-Za-z]|\/?\d{1,4})
Note how the \/\d{1,4} case and the \d{1,4} case in your original regex are combined into one case \/?\d{1,4}.
^(?=(.*\d){4,})(?=(.*[A-Z]){3})(?!\s)(?=.*\W{2,})(?=(.*[a-z]){2,}).{12,14}$
The RegExp above is trying to:
match at least 4 digits - (?=(.*\d){4,})
match exactly 3 upper case letters - (?=(.*[A-Z]){3})
don't match spaces - (?!\s)
match at least 2 non-word characters - (?=.*\W{2,})
match at least 2 lower - (?=(.*[a-z]){2,})
string must be between 12 and 14 in length - .{12,14}
But I am having a challenge getting this to avoid matching spaces. It seems like because \W also includes spaces, my preceding negative look-ahead on spaces is being ignored.
For example:
b4A#Ac33*8Pd -- should match
b4A#Ac3 3*8Pd -- should not match
rubular link
Edited to provide further clarification:
Basically, I am trying to avoid having to spell out all the characters in the POSIX [:punct:] class, i.e. !"#$%&'()*+,./:;<=>?#\^_\{|}~-` , which is why I needed to use \W. But I would also want to exclude spaces.
I could use a second pair of eyes and more experienced suggestions here.
Edited again, to correct mix-ups in counts specified in sub-patterns, as pointed out in the accepted answer below.
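Instead of the dot ., use the non-space class \S. The (?!\s) look-ahead only tests the first character of the string, so it cannot reject a space further in:
^(?=(.*\d){3,})(?=(.*[A-Z]){2})(?=.*\W{1,})(?=(.*[a-z]){1,})\S{12,14}$
The change is at the very end: \S{12,14} in place of .{12,14}.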
Also, is this a typo: match at least 4 digits - (?=(.*\d){3,})?
It should be either:
match at least 3 digits - (?=(.*\d){3,})
or
match at least 4 digits - (?=(.*\d){4,})
Same for other counts.
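To see the effect of swapping . for \S, here is a short Ruby sketch using the two test strings from the question. The constant names are illustrative; WITH_DOT is the same pattern with .{12,14} put back, shown only for comparison.

```ruby
# Pattern from the answer: the consuming part only accepts non-space characters.
WITH_NON_SPACE = /^(?=(.*\d){3,})(?=(.*[A-Z]){2})(?=.*\W{1,})(?=(.*[a-z]){1,})\S{12,14}$/

# Same look-aheads, but the consuming part accepts any character, spaces included.
WITH_DOT = /^(?=(.*\d){3,})(?=(.*[A-Z]){2})(?=.*\W{1,})(?=(.*[a-z]){1,}).{12,14}$/

no_space   = "b4A#Ac33*8Pd"   # should match
with_space = "b4A#Ac3 3*8Pd"  # should not match

[no_space, with_space].each do |candidate|
  puts "#{candidate.inspect}: \\S => #{WITH_NON_SPACE.match?(candidate)}, . => #{WITH_DOT.match?(candidate)}"
end
```

Note that a look-ahead such as (?=(.*[A-Z]){2}) sets a minimum, not an exact count: a string with three upper-case letters still passes it.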
I have the following string that I want to extract from:
/Monovolume/Honda+HR+V+1+6-11399031.htm
What I want to extract is the 8 digit number at the end which I tried with the following regex:
Monovolume\/.+(\d{7,})
It says 7 or more because there are cases where there are only 7 digits. The match, however, is only 7 digits and not 8 as in the above string. When I run the part in parentheses only I get the right result. What is causing this behaviour and how can I fix it?
P.S. I can't put the "-" in the regex, because its appearance is coincidental.
You're very close. Your problem is that .+ is "greedy" by default: it consumes as many characters as it can, including the first of the 8 digits, and leaves only the required minimum of 7 for the group.
I'm not sure about your requirements, but you could do a lazy match:
Monovolume\/.+?(\d{7,})
The ? after .+ makes the quantifier lazy: it repeats as few times as possible, stopping as soon as 7 or more digits follow.
See it live
More info here: Regex Lazy Quantification
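A quick Ruby check of the greedy and lazy variants against the string from the question (Ruby is an assumption here; the question does not name a language). String#[] with a regex and a group index returns that group's text.

```ruby
path = "/Monovolume/Honda+HR+V+1+6-11399031.htm"

# Greedy .+ grabs as much as it can, then backs off only far enough
# to leave the minimum of 7 digits for the group.
greedy = path[/Monovolume\/.+(\d{7,})/, 1]

# Lazy .+? grows one character at a time and stops at the first position
# where 7 or more digits follow, so the group gets all 8.
lazy = path[/Monovolume\/.+?(\d{7,})/, 1]

puts "greedy: #{greedy}"
puts "lazy:   #{lazy}"
```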
I want to scrape data from some text and dump it into an array. Consider the following text as example data:
| Example Data
| Title: This is a sample title
| Content: This is sample content
| Date: 12/21/2012
I am currently using the following regex to scrape the data that is specified after the 'colon' character:
/((?=:).+)/
Unfortunately this regex also grabs the colon and the space after the colon. How do I only grab the data?
Also, I'm not sure if I'm doing this right, but it appears as though the outer parentheses cause a match to return an array. Is this the function of the parentheses?
EDIT: I'm using Rubular to test my regular expressions.
You could change it to:
/: (.+)/
and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:
/(?<=: ).+/
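Both variants can be tried in Ruby against the example data from the question. This is a minimal sketch; the variable names are illustrative.

```ruby
text = <<~DATA
  | Example Data
  | Title: This is a sample title
  | Content: This is sample content
  | Date: 12/21/2012
DATA

# Capture group: scan returns one-element arrays, so flatten them.
via_group = text.scan(/: (.+)/).flatten

# Look-behind: no group, so scan returns the matched strings directly.
via_lookbehind = text.scan(/(?<=: ).+/)

puts via_group.inspect
puts via_lookbehind.inspect
```

This also shows what the parentheses do with scan: a pattern with groups yields arrays of group contents, while a pattern without groups yields plain strings.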
In addition to @minitech's answer, you can also make a 3rd variation:
/(?<=: ?)(.+)/
The difference here being, you create/grab the group using a look-behind.
If you still prefer the look-ahead rather than the look-behind concept:
/(?=: ?(.+))/
This places a capture group around your existing regex, inside the look-ahead, so the data ends up in group 1.
And yes, the outer parentheses in your code create a capture group. Compare that with the last example I gave, where the group sits inside the look-ahead: wrapping the whole pattern in /( ... )/ without the /(?= ... )/ is needless, since the first result in most regular expression engines is already the entire matched string.
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.
I'm also using Ruby and I decided to do it with:
line_as_string.split(": ")[-1]
This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient. But not for this purpose.
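A short sketch of the split approach, with one caveat worth knowing. The second input string is hypothetical (it is not in the question) and only illustrates what happens when the value itself contains ": ".

```ruby
line = "Date: 12/21/2012"
value = line.split(": ")[-1]

# Hypothetical input whose value contains the separator again.
tricky = "Title: Regex: a primer"
last_piece = tricky.split(": ")[-1]     # keeps only the final piece
whole_value = tricky.split(": ", 2)[1]  # limit of 2 splits at the first separator only

puts value
puts last_piece
puts whole_value
```

Passing a limit of 2 to split keeps everything after the first separator together, which may be closer to what the regex versions return.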
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:
\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.
So, you may use
/:[[:blank:]]*\K.+/ # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/ # To match any whitespace with `\s`
See the Rubular demo #1 and the Rubular demo #2.
Details
: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator discarding the text matched so far from the overall match memory buffer
.+ - matches and consumes any 1 or more chars other than line break chars (use /m modifier to match any chars including line break chars).
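A minimal Ruby sketch of \K in use. The input line, with a tab between the colon and the value, is made up for illustration.

```ruby
# The colon is followed by a space, a tab and another space.
line = "Date: \t 12/21/2012"

# Everything matched before \K is dropped from the reported match.
value = line[/:[[:blank:]]*\K.+/]

# The h\Kd example: only the final "d" is reported, starting at index 3.
m = "adhd".match(/h\Kd/)

puts value
puts "#{m[0]} at index #{m.begin(0)}"
```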
I have the following string:
23434 5465434
58495 / 46949345
58495 - 46949345
58495 / 55643
d 44444 ssdfsdf
64784
45643 dfgh
58495/55643
48593/48309596
675643235
34565435 34545
I only want to extract the bold ones. Each is a five-digit number (German).
It should not match telephone numbers such as 43564 366334 or 45433 / 45663, as in my example above.
I tried something like ^\b\d{5}, but that's not a good beginning.
Any hints to get this working?
Thanks for all hints.
You could add a negative look-ahead assertion to avoid the matches with phone numbers.
\b[0124678][0-9]{4}\b(?!\s?[ \/-]\s?[0-9]+)
If you're using Ruby 1.9, you can add a negative look-behind assertion as well.
You haven't specified what distinguishes the number you're trying to search for.
Based on the example string you gave, it looks like you just want:
^(\d{5})\n
Which matches lines that start with 5 digits and contain nothing else.
You might want to permit some spaces after the first 5 digits (but nothing else):
^(\d{5})\s*\n
I'm not completely sure about the specified rules. But if you want lines that start with 5 digits and do not contain additional digits, this may work:
^(\d{5})[^\d]*$
If leading white space is okay, then:
^\s*(\d{5})[^\d]*$
Here is the Rubular link that shows the result.
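A small Ruby sketch that applies the last pattern line by line to the sample from the question. Matching each line separately is a choice made here, not part of the answer; it keeps [^\d] and \s from reaching across line breaks.

```ruby
lines = [
  "23434 5465434",
  "58495 / 46949345",
  "58495 - 46949345",
  "58495 / 55643",
  "d 44444 ssdfsdf",
  "64784",
  "45643 dfgh",
  "58495/55643",
  "48593/48309596",
  "675643235",
  "34565435 34545",
]

pattern = /^\s*(\d{5})[^\d]*$/

# String#[] with a group index returns group 1, or nil when the line does not match.
codes = lines.map { |line| line[pattern, 1] }.compact

puts codes.inspect
```

Under this rule the line "d 44444 ssdfsdf" is skipped, because it does not start with the digits.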
^\D*(\d{5})(\s(\D)*$|()$)
This should (it's untested) match:
a line starting with five digits (or some non-digits and then five digits), then a space, and ending with some non-numbers
a line starting and ending with five digits (or some non-digits and then five digits)
\1 would be the five digits
\2 would be the whole second half, if any
\3 would be the word after the digits, if any
edited to fit the asker's edited question
edit again: I came up with a much more elegant solution:
^\D*(\d{5})\D*$
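The final pattern can be checked the same way. This is a hedged sketch on a subset of the sample lines, matched one line at a time so that \D cannot run across a line break.

```ruby
lines = [
  "23434 5465434",
  "58495 / 55643",
  "d 44444 ssdfsdf",
  "64784",
  "45643 dfgh",
  "48593/48309596",
  "675643235",
]

pattern = /^\D*(\d{5})\D*$/

# Keep group 1 for every line that holds exactly one run of five digits
# and no other digits.
codes = lines.map { |line| line[pattern, 1] }.compact

puts codes.inspect
```

Because leading non-digits are allowed, "d 44444 ssdfsdf" now yields 44444 as well; whether that is wanted depends on the asker's rules.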