Here is a sample string:
Year,Quarter,Month,text1,text2,Department,BU,text3,text4,Job,Grade,Pay,Location,text5,text6,
I need to remove all occurrences of the random texts. Rules are:
The random texts always come in a pair, one after another.
The first of the two texts always begins with the word “namesa”, but its length is unpredictable; the second one follows no such pattern. For example, text1 could be “namesa-their duty 505”, text3 could be “namesa-silver lane near me”, text5 could be “namesa-regexp 101 challenge”. Text2, text4, text6 are completely unpredictable.
None of these texts contain a comma. They only end in a comma.
The number of times this whole pattern repeats is unpredictable.
select ('Year,Quarter,Month,namesa-their duty 505,text2,Department,BU,namesa-silver lane near me,text4,Job,Grade,Pay,Location,namesa-regexp 101 challenge,text6,') from dual;
For the above input, my output should be:
Year,Quarter,Month,Department,BU,Job,Grade,Pay,Location,
Basically, we need to locate the word “namesa” - start from there, go through two commas, remove everything from namesa through the second comma, then repeat the same thing again for the rest of the string. I am at a loss how to do this in regular expressions.
So you have a comma-delimited string of tokens, and you want to remove every token that starts with "namesa" along with the token that follows it? Try this:
regexp_replace(col1, 'namesa[^,]*,[^,]*,', '')
So that breaks down to:
The string "namesa"
Zero or more characters that are not a comma
A comma (ending the first token)
Zero or more characters that are not a comma
A comma (ending the second token)
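For anyone who wants to try the pattern outside Oracle, here is a minimal sketch of the same breakdown in Ruby. The method name strip_namesa_pairs is made up for illustration; the regex itself is the one from the answer, unchanged.

```ruby
# Same pattern as in the regexp_replace call: "namesa", the rest of that token,
# its comma, then the whole following token and its comma.
PAIR = /namesa[^,]*,[^,]*,/

# Hypothetical helper name, used only for this sketch.
def strip_namesa_pairs(line)
  line.gsub(PAIR, '')
end

input = 'Year,Quarter,Month,namesa-their duty 505,text2,Department,BU,' \
        'namesa-silver lane near me,text4,Job,Grade,Pay,Location,' \
        'namesa-regexp 101 challenge,text6,'

puts strip_namesa_pairs(input)
# => Year,Quarter,Month,Department,BU,Job,Grade,Pay,Location,
```

Note that the pattern does not check that "namesa" sits at the start of a token, which is fine as long as no other column can contain that word.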
I have results for FOOD, FOOD20 and FOOD30, but when I filter using REGEX I also get other results that contain FOOD, such as DOGFOOD and CATFOOD.
I am trying to apply an EXACT filter using:
FOOD|FOOD20|FOOD30
to extract just these results instead of using REGEX. Unfortunately this is returning 0 results.
Is there a workaround for this?
An exact filter is a literal string match, so you're looking for a value that is literally the whole string "FOOD|FOOD20|FOOD30".
If you want to ensure that the value is exactly FOOD, FOOD20 or FOOD30, use REGEX matching, but precede each value with a caret (^), which marks the beginning of the line, and follow each value with the dollar sign ($), which marks the end of the line.
So, your REGEX expression would be:
^FOOD$|^FOOD20$|^FOOD30$
If your idea is to track anything that starts with "FOOD", is optionally followed by a number, and then ends, you can simplify your expression to the following:
^FOOD[0-9]*$
(The [0-9]* part means match the numbers 0 to 9 zero or more times, so it matches when there are no numbers after FOOD, or when there are some.)
This will match FOOD, FOOD20, FOOD30, FOOD99 and FOOD100, but not CATFOOD, DOGFOOD10, etc.
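To check the two expressions against the values mentioned, here is a small sketch in Ruby (an assumption; the question does not say which tool the filter lives in). One Ruby-specific detail: ^ and $ match at line boundaries there, so the sketch uses \A and \z to anchor to the whole string, which is the behaviour the answer is after.

```ruby
# \A and \z anchor to the start and end of the whole string in Ruby.
EXACT_VALUES = /\A(?:FOOD|FOOD20|FOOD30)\z/
FOOD_WITH_OPTIONAL_NUMBER = /\AFOOD[0-9]*\z/

values = %w[FOOD FOOD20 FOOD30 FOOD99 FOOD100 CATFOOD DOGFOOD10]

p values.grep(EXACT_VALUES)
# => ["FOOD", "FOOD20", "FOOD30"]
p values.grep(FOOD_WITH_OPTIONAL_NUMBER)
# => ["FOOD", "FOOD20", "FOOD30", "FOOD99", "FOOD100"]
```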
I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like a case structure: the first alternative that matches consumes the corresponding string, and no further matching is done on it. Thus, ###.*?### will consume a whole marker (like ###hl###), and nothing else will be matched inside it. We also match any sequence of word characters. If either of those is captured, we can just return it as-is ($1). If not, then we have matched some other character (i.e. one that is not inside a marker and not a word character), and we replace it with "X".
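To see which branch of the alternation each match takes, one option is to run the same regex through String#scan and look at the capture group. This is only a sketch; the helper name branches and the short test string are made up for illustration.

```ruby
TOKEN = /(###.*?###|\w+)|./

# Hypothetical helper: returns the captured text for matches that are kept,
# and :replaced for matches that fall through to the final "." branch.
def branches(text)
  text.scan(TOKEN).map { |groups| groups.first || :replaced }
end

p branches('a ###hl###b#c')
# => ["a", :replaced, "###hl###", "b", :replaced, "c"]
```

The lone # between b and c cannot start a ###...### marker, so it falls through to the last branch, which is why the # in the email address gets replaced.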
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say your system never uses "\r"; then we can use that character as a temporary replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
I would like to replace every character except the last 4 with a "#", like you would see on a credit card statement. I have accomplished this using the Array#each method to iterate through indexes [0..-4] and then another for [-4..-1] and shoveling results from both into a new string. I'm thinking that maybe this could be better done with regex? But I am new to regex, and Google hasn't turned up anything I can use for replacing an entire range without losing the length of the string. I have tried
str.gsub(str[0..-5],'#')
(and a few other things) but it replaces the entire range with a single character. How can I accomplish my goal using regex?
Yep, this is possible with regex.
> "12345678".gsub(/.(?=.{4})/, "#")
=> "####5678"
> "12345678901234".gsub(/.(?=.{4})/, "#")
=> "##########1234"
Explanation:
.(?=.{4}) matches a character only if it is followed by at least four more characters. That is true for every character except the last four: the fourth character from the end is followed by only three characters, so the lookahead fails there, and likewise for the third, second and last characters.
Alternatively, with a negative lookahead:
> "12345678901234".gsub(/(?!.{1,4}$)./, "#")
=> "##########1234"
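If the number of visible characters needs to vary, the lookahead version can be wrapped in a small method. This is a sketch only: the method name mask and its parameters are made up, and the /m flag is added so that newlines are masked too.

```ruby
# Hypothetical helper: masks every character that still has at least
# `visible` characters after it, leaving the last `visible` ones readable.
def mask(text, visible = 4, mask_char = '#')
  text.gsub(/.(?=.{#{Integer(visible)}})/m, mask_char)
end

puts mask('12345678')       # => ####5678
puts mask('12345678901234') # => ##########1234
puts mask('123')            # => 123
puts mask('123456', 2, '*') # => ****56
```

Strings shorter than the visible length are returned unchanged, because no character in them is followed by enough characters for the lookahead to succeed.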
I have got a strange sscanf problem with a capital letter 'N' (maybe I do not understand something; correct me please):
Example 1:
char cBuff[128];
sscanf("GUIDNameNENE","%*[GUIDName]%127s" ,cBuff);
returns cBuff:ENE
Example 2:
char cBuff[128];
sscanf("GUIDNamenENE","%*[GUIDName]%127s" ,cBuff);
returns cBuff:nENE
Example 3:
char cBuff[128];
sscanf("GUIDNaMENE","%*[GUIDNa]%127s" ,cBuff);
returns cBuff:ENE
I have tried many other variants, but it still always skips the capital N.
Where is the problem?
Thank you in advance!
%[GUIDName] is not a weird way of quoting and matching an exact string. It defines a set of characters that will match. They will match in any order, and they will match repeatedly.
The longest match for the set %[GUIDName] in your input is GUIDNameN.
You could of course say %*[G]%*[U]%*[I]%*[D]%*[N]%*[a]%*[m]%*[e]. That would match the letters in that order and stop before the second N, but each set still matches a run of its character, so it would still eat multiple es.
I would guess the reason it skips the capital N is that N is part of the set of characters you tell it to ignore. The key point is that what you specify between the brackets is a set of characters to match, not a string in a fixed order: sscanf matches the longest run consisting only of the characters listed between the '[' and the closing ']', if I recall correctly.
You could try specifying the size for the set of characters to be skipped like this:
sscanf("GUIDNameNENE","%*8[GUIDName]%127s" ,cBuff);
But that will of course only work if the prefix is always eight characters long. If it is, you could just skip the first eight characters like this:
sscanf("GUIDNameNENE","%*8s%127s" ,cBuff);
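The difference between "a set of characters" and "a literal prefix" may be easier to see with a regex character class, which behaves much like the scanset here. The sketch below is in Ruby rather than C and is only an analogy, not a description of how sscanf is implemented.

```ruby
input = 'GUIDNameNENE'

# A character class is a set: it matches any run of G, U, I, D, N, a, m, e,
# so the N right after "GUIDName" is swallowed too.
as_set = input.sub(/\A[GUIDName]+/, '')

# Limiting the run to 8 characters mirrors the %*8[GUIDName] suggestion.
as_limited_set = input.sub(/\A[GUIDName]{1,8}/, '')

# Removing a literal prefix matches the exact string and nothing more.
as_literal = input.delete_prefix('GUIDName')

puts as_set         # => ENE
puts as_limited_set # => NENE
puts as_literal     # => NENE
```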
This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: the pattern inside it must match at the current position, but the text it matches does not become part of the overall match (and so is not replaced).
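As a quick check, here is the lookahead regex run over the sample inputs from the question. The method name strip_thousands is made up for this sketch.

```ruby
THOUSANDS_COMMA = /,(?=\d{3}\b)/

# Hypothetical helper wrapping the gsub call from the answer.
def strip_thousands(text)
  text.gsub(THOUSANDS_COMMA, '')
end

samples = %w[123 123.45 123,45 1,234 1,234.56 12,345.67 12,345,67]
samples.each { |s| puts "#{s} -> #{strip_thousands(s)}" }
# 123 -> 123
# 123.45 -> 123.45
# 123,45 -> 123,45
# 1,234 -> 1234
# 1,234.56 -> 1234.56
# 12,345.67 -> 12345.67
# 12,345,67 -> 12345,67
```

Only the comma is matched and removed; the three digits in the lookahead stay in the string because they were never part of the match.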
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
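For the inputs in the question (always below 1 million), the capture-group version and a lookahead version such as /,(?=\d{3}\b)/ give the same results. A minimal sketch comparing the two follows; the method names are made up, and the last input is deliberately outside the stated range to show where the two approaches differ.

```ruby
# Capture-group version: rebuilds the match from the two captured parts.
def strip_with_captures(text)
  text.gsub(/(\d+),(\d{3})/, '\1\2')
end

# Lookahead version: matches only the comma itself.
def strip_with_lookahead(text)
  text.gsub(/,(?=\d{3}\b)/, '')
end

puts strip_with_captures('12,345.67')  # => 12345.67
puts strip_with_lookahead('12,345.67') # => 12345.67

# With two thousands commas, the capture version consumes "1,234" in one match,
# so the second comma has no digits left in front of it to match \d+.
puts strip_with_captures('1,234,567')  # => 1234,567
puts strip_with_lookahead('1,234,567') # => 1234567
```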
In your examples, the number of digits after the last separator (either , or .) tells you that it is a decimal point, since only 2 digits follow it. In most cases, if the last group of digits does not have exactly 3 digits, you can assume that the separator in front of it is the decimal point. Another clue is that a separator appearing more than once in a large number must be a thousands separator, not a decimal point.
However, if I give you the string 123,456 or 123.456 without any sort of context, it is impossible to tell whether it means "123 thousand 456" or "123 point 456".
You need to scan the document to look for clues as to whether , is used as the thousands separator or the decimal point, and vice versa for the period. Once the context tells you which is which, you can safely apply the same method to remove the thousands separators.
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing about a format and deciding not to support it is better than assuming things will work.
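The clues described in this answer can be sketched as a small classifier. Everything here is an assumption made for illustration: the method name last_separator_role, the returned symbols, and the decision to report :ambiguous rather than guess.

```ruby
# Hypothetical helper: guesses the role of the LAST separator in a numeric string.
def last_separator_role(text)
  last = text.match(/([.,])(\d+)\z/)
  return :none unless last

  separators = text.scan(/[.,]/)
  return :decimal if last[2].length != 3        # e.g. "123,45": not a group of 3
  return :decimal if separators.uniq.length > 1 # e.g. "1,234.567": both symbols used
  return :thousands if separators.length > 1    # e.g. "1,234,567": same symbol repeated

  :ambiguous                                    # e.g. "123,456": needs outside context
end

p last_separator_role('123,45')    # => :decimal
p last_separator_role('12,345.67') # => :decimal
p last_separator_role('1,234,567') # => :thousands
p last_separator_role('123,456')   # => :ambiguous
p last_separator_role('123')       # => :none
```

The :ambiguous case is exactly the one that cannot be resolved from the string alone, so the caller would have to decide it from the surrounding document.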