Regular expression to remove a portion of text from each entry in a comma-separated list - Oracle

I have a string of comma-separated values that I want to trim down for display purposes.
The string is a comma-separated list with a varying number of entries, each of varying length.
Each entry in the list is formatted as a five-character pattern in the format "##-NX" followed by some text.
e.g., "01-NX sometext, 02-NX morertext, 09-NX othertext, 12-NX etc..."
Is there a regular expression function I can use to remove the text after the five-character prefix of each entry in the list, returning "01-NX, 02-NX, 09-NX, 12-NX,..."?
I am a novice with regular expressions and I haven't been able to figure out how to code the pattern.

I think what you need is
regexp_replace(regexp_replace(mystring, '(\d{2}-NX)(.*?)(,)', '\1\3'), '(\d{2}.*NX).*', '\1')
The inner REGEXP_REPLACE looks for a pattern like nn-NX (two digits followed by "-NX") followed by any number of characters up to the next comma, then replaces the match with the first and third capture groups, dropping the "any number of characters" part.
The outer REGEXP_REPLACE looks for two digits followed by any number of characters up to the last NX, keeps that part of the string, and discards whatever follows it (the text after the final prefix, which has no trailing comma for the inner call to match).
Here is the Oracle code I used for testing:
with a as (
select '01-NX sometext, 02-NX morertext, 09-NX othertext, 12-NX etc.' as myString
from dual
)
select mystring
, regexp_replace(regexp_replace(mystring, '(\d{2}-NX)(.*?)(,)', '\1\3'), '(\d{2}.*NX).*', '\1') as output
from a
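For readers without an Oracle instance at hand, the two passes can be traced in Ruby, whose gsub replaces every match much as REGEXP_REPLACE does by default. This is only a sketch: the method name strip_in_two_passes is made up for illustration, and Ruby's regex engine is not identical to Oracle's.

```ruby
# Hypothetical helper mirroring the nested REGEXP_REPLACE calls above.
def strip_in_two_passes(list)
  # Pass 1: drop the text between each "nn-NX" prefix and the next comma.
  pass1 = list.gsub(/(\d{2}-NX)(.*?)(,)/, '\1\3')
  # Pass 2: keep everything up to the last "NX" and discard the trailing text.
  pass1.gsub(/(\d{2}.*NX).*/, '\1')
end

puts strip_in_two_passes('01-NX sometext, 02-NX morertext, 09-NX othertext, 12-NX etc.')
# prints: 01-NX, 02-NX, 09-NX, 12-NX
```

The second pass is only needed because the last entry has no comma after it, so the first pattern never matches it.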

This alternative calls REGEXP_REPLACE() once.
Match two digits, a dash and 'NX', followed by zero or more characters (non-greedy), up to a comma or the end of the string. Replace the match with the first group and the third group, which is either the comma or the (empty) end of the string.
EDIT: Took dougp's advice and eliminated the RTRIM by adding the 3rd capture group. Thanks for that!
WITH tbl(str) AS (
SELECT '01-NX sometext, 02-NX morertext, 09-NX othertext, 12-NX etc.' FROM dual
)
SELECT
REGEXP_REPLACE(str, '(\d{2}-NX)(.*?)(,|$)', '\1\3') str
from tbl;
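The same single-pass idea can be checked quickly in Ruby. The sketch below is an approximation under assumptions: strip_in_one_pass is a hypothetical name, and \z is used for the end of the string because Ruby's $ matches at the end of every line.

```ruby
# Hypothetical helper mirroring the single REGEXP_REPLACE call above.
def strip_in_one_pass(list)
  # Group 1: the "nn-NX" prefix. Group 2: the text to drop (lazy).
  # Group 3: the comma, or the empty match at the end of the string.
  list.gsub(/(\d{2}-NX)(.*?)(,|\z)/, '\1\3')
end

puts strip_in_one_pass('01-NX sometext, 02-NX morertext, 09-NX othertext, 12-NX etc.')
# prints: 01-NX, 02-NX, 09-NX, 12-NX
```

Because group 3 is empty for the last entry, writing it back in the replacement costs nothing and removes the need for a second pass or a trailing trim.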

Related

Hive split string to get all the items except last one

Why doesn't the solution below work when the string is separated by periods (.)?
select regexp_extract('test,data,fd,dfd','^(.*?)(?:,)(.*)$', 2) from tablename;
input: 193.54.23.456
output: 193.54.23
If you need to extract all elements except the last one from a comma- or dot-separated string, use the regex '^(.*?)(?:[.,])([^,.]*)$' and extract group number 1.
Regex meaning:
^ - beginning-of-string anchor
(.*?) - group number 1: any character, any number of times, non-greedy; this group captures what you need to extract
(?:[.,]) - a dot or a comma; this group is not indexed for extraction because it is a non-capturing group (?: means non-capturing)
([^,.]*) - capturing group number 2: any character except dot and comma, any number of times; it captures the last element, which can be extracted with index 2 if necessary
$ - end-of-string anchor
Demo:
--separated by dot
regexp_extract('test.data.fd.dfd','^(.*?)(?:[.,])([^,.]*)$', 1) --returns test.data.fd
regexp_extract('test.data.fd.dfd','^(.*?)(?:[.,])([^,.]*)$', 2) --returns dfd
--the same regex with comma separated
regexp_extract('test,data,fd,dfd','^(.*?)(?:[.,])([^,.]*)$', 1) --returns test,data,fd
regexp_extract('test,data,fd,dfd','^(.*?)(?:[.,])([^,.]*)$', 2) --returns dfd
Inside a character class [] you do not need to escape the dot.
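To see how the lazy first group and the anchored last group divide the string, here is a small Ruby sketch of the same pattern. It is an illustration only: split_off_last is a hypothetical name, Ruby stands in for Hive's Java regex engine, and \A and \z replace ^ and $ because the latter are line anchors in Ruby.

```ruby
# Hypothetical helper: returns [everything before the last separator, last element],
# or nil when the string contains no dot or comma.
def split_off_last(str)
  m = str.match(/\A(.*?)[.,]([^,.]*)\z/)
  m ? [m[1], m[2]] : nil
end

p split_off_last('193.54.23.456')     # ["193.54.23", "456"]
p split_off_last('test,data,fd,dfd')  # ["test,data,fd", "dfd"]
```

Group 1 is lazy, but the anchored second group cannot contain a separator, so the engine keeps extending group 1 until only the last element remains.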

Characters at the end do not match

I need to match all the letters and digits in a string str.
This is my code:
str.match(/^(AB)(\d+)([A-Za-z][0-9])?/)
When str = AB57933A, it matches only AB57933, not the character appended after the digits.
If I try str = AB57933AbC, it again matches only AB57933: the match stops at the last digit and does not include the characters after it.
The way you have written it:
/^(AB)(\d+)([A-Za-z][0-9])?/
the optional last group must be exactly one letter followed by one digit between 0 and 9, so trailing letters with no digit after them are left out of the match. If you do not expect digits after the last letter, you can replace it with
/^(AB)(\d+)([A-Za-z]+)/
or with
/^(AB)(\d+)([A-Za-z0-9]+)/
if inputs such as AB57933AbC12 are also valid.
Last but not least, if you do not use back references, you can omit the parentheses, as you do not need capturing groups.
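The difference between the three patterns is easiest to see from their captures. The Ruby sketch below reuses the patterns as written (Ruby accepts the same syntax here); the constant and method names are hypothetical and exist only for this illustration.

```ruby
# Patterns from the question and the answer; names are illustrative only.
ORIGINAL_PATTERN = /^(AB)(\d+)([A-Za-z][0-9])?/  # optional: one letter, then one digit
LETTERS_PATTERN  = /^(AB)(\d+)([A-Za-z]+)/       # one or more trailing letters
ALNUM_PATTERN    = /^(AB)(\d+)([A-Za-z0-9]+)/    # trailing letters and digits

# Returns the array of capture groups, or nil when the pattern does not match.
def captures_for(pattern, str)
  m = pattern.match(str)
  m && m.captures
end

p captures_for(ORIGINAL_PATTERN, 'AB57933AbC')  # ["AB", "57933", nil]
p captures_for(LETTERS_PATTERN, 'AB57933AbC')   # ["AB", "57933", "AbC"]
p captures_for(ALNUM_PATTERN, 'AB57933AbC12')   # ["AB", "57933", "AbC12"]
```

With the original pattern the third group is simply skipped, which is why the match ends at the last digit.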

Format string in Oracle

I'm building a string in Oracle, where I get a number from a column and make it a 12-digit number with the LPad function, so its length is now 12.
Example: LPad(nProjectNr,12,'0') and I get 000123856812 (for example).
Now I want to split this string into parts of 3 digits, each with a "\" as prefix, so that the result looks like this: \000\123\856\812.
How can I achieve this in a select statement? What function can accomplish this?
Assuming strings of 12 digits, regexp_replace could be a way:
select regexp_replace('000123856812', '(.{3})', '\\\1') from dual
The regexp matches each sequence of 3 characters and prefixes it with a \. In the replacement string, \\ stands for a literal backslash and \1 for the captured group.
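The pad-then-group logic is independent of Oracle and can be sketched in Ruby for a quick check. The method name backslash_groups is hypothetical, and rjust plays the role of LPAD here.

```ruby
# Hypothetical helper: zero-pad to 12 digits, then prefix each 3-character group with "\".
def backslash_groups(number)
  padded = number.to_s.rjust(12, '0')           # like LPAD(nProjectNr, 12, '0')
  padded.gsub(/.{3}/) { |group| "\\#{group}" }  # block form avoids replacement-string escaping
end

puts backslash_groups(123856812)
# prints: \000\123\856\812
```

The block form of gsub is used so that the backslash does not have to be escaped a second time inside a replacement string.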
It is much easier to do this using TO_CHAR(number) with the proper format model, with \ as the thousands separator. (Alas, a format model cannot start with a thousands separator in TO_CHAR, so we still need to concatenate a \ on the left.)
See also the edit below.
select 123856812 as n,
'\' || to_char(123856812, 'FM000G000G000G000', 'nls_numeric_characters=.\') as str
from dual
;
N STR
--------- ----------------
123856812 \000\123\856\812
Without the FM format model modifier, TO_CHAR will add a leading space (placeholder for the sign, plus or minus). FM means "shortest possible string representation consistent with the model provided" - that is, in this case, no leading space.
Edit - it just crossed my mind that we can exploit TO_CHAR() even further and not need to concatenate the first \. The thousands separator, G, may not be the first character of the string, but the currency symbol, placeholder L, can!
select 123856812 as n,
to_char(123856812, 'FML000G000G000G000',
'nls_numeric_characters=.\, nls_currency=\') as str
from dual
;
SUBSTR returns a substring of a string passed as the first argument. You can specify where the substring starts and how many characters it should be.
Try
SELECT '\'||SUBSTR('000123856812', 1,3)||'\'||SUBSTR('000123856812', 4,3)||'\'||SUBSTR('000123856812', 7,3)||'\'||SUBSTR('000123856812', 10,3) FROM dual;

regular expression extract string between two strings

I am trying to extract strings using regexp. For example in the following string:
select DESCENDANTS([Customer].[Yearly Income],,LEAVES) on axis(0),
DESCENDANTS([Sales Territory].[Sales Territory],,LEAVES) on axis(1),
DESCENDANTS([Customer].[Total Children],,LEAVES) on axis(2)
from [Adventure Works]
where [Measures].[Internet Sales Amount]
I want to extract the substring between every pair of "DESCENDANTS(" and ",,".
So the result in this case would be: [Customer].[Yearly Income], [Sales Territory].[Sales Territory], [Customer].[Total Children]
Any help is appreciated. Thanks in advance.
If you have your text in a string called query you can do:
query.scan(/DESCENDANTS\((.+),,/).flatten
=> ["[Customer].[Yearly Income]", "[Sales Territory].[Sales Territory]",
"[Customer].[Total Children]"]
Some notes:
\( matches the literal opening parenthesis
(.+) remembers the characters between the opening parenthesis and the two commas as a capture
If the regexp contains captures (), then scan returns an array of arrays holding the captured parts of each match. Here there is only one capture per match, so flatten can be used to return a single array of all the matches we are interested in.
/DESCENDANTS\(([^,]+),,/
See it on rubular
Here's an uglier variation that uses split: split on "DESCENDANTS(" and ",,", and take every other substring:
s.split(/DESCENDANTS\(|,,/).each_with_index.inject([]) {|m,(e,i)| m << e if i.odd?; m}
.+? is safer: it still works correctly if the whole query is on a single line, where the greedy .+ would run on to the last ",,".
query.scan(/DESCENDANTS\((.+?),,/).flatten
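A short comparison makes the greedy versus lazy difference concrete. The sketch below uses a shortened, single-line version of the query from the question; the constant and method names are hypothetical.

```ruby
# Two DESCENDANTS(...) calls on one line, so "." can run across both of them.
ONE_LINE_QUERY = 'select DESCENDANTS([Customer].[Yearly Income],,LEAVES) on axis(0), ' \
                 'DESCENDANTS([Sales Territory].[Sales Territory],,LEAVES) on axis(1)'

# Greedy: (.+) extends to the LAST ",," on the line.
def greedy_captures(query)
  query.scan(/DESCENDANTS\((.+),,/).flatten
end

# Lazy: (.+?) stops at the FIRST ",," after each "DESCENDANTS(".
def lazy_captures(query)
  query.scan(/DESCENDANTS\((.+?),,/).flatten
end

p greedy_captures(ONE_LINE_QUERY).size  # 1
p lazy_captures(ONE_LINE_QUERY)
# ["[Customer].[Yearly Income]", "[Sales Territory].[Sales Territory]"]
```

On the multi-line original both versions agree, because "." does not match a newline in Ruby by default and each line holds only one ",,".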

ruby parametrized regular expression

I have a string like "{some|words|are|here}" or "{another|set|of|words}".
So in general the string consists of an opening curly bracket, words delimited by pipes, and a closing curly bracket.
What is the most efficient way to get a selected word from that string?
I would like to do something like this:
#my_string = "{this|is|a|test|case}"
#my_string.get_column(0) # => "this"
#my_string.get_column(2) # => "a"
#my_string.get_column(4) # => "case"
What should the method get_column contain?
So this is the solution I like right now:
class String
  def get_column(n)
    self =~ /\A\{(?:\w*\|){#{n}}(\w*)(?:\|\w*)*\}\Z/ && $1
  end
end
We use a regular expression to make sure that the string is of the correct format, while simultaneously grabbing the correct column.
Explanation of regex:
\A is the beginning of the string and \Z is the end (it also matches just before a trailing newline), so this regex matches the entire string.
Since curly braces have a special meaning, we escape them as \{ and \} to match the curly braces at the beginning and end of the string.
Next, we want to skip the first n columns - we don't care about them.
A skipped column is some number of letters followed by a vertical bar, so we use the standard \w to match a word-like character (includes digits and underscore, but why not) and * to match any number of them. The vertical bar has a special meaning, so we have to escape it as \|. Since we want to group this, we enclose it all inside non-capturing parens (?:\w*\|) (the ?: makes it non-capturing).
There are n of these skipped columns, so we tell the regex to match the column pattern n times using the count quantifier - just put a number in curly braces after a pattern. We use standard string interpolation, so we just put in {#{n}} to mean "match the previous pattern exactly n times".
The first non-skipped column after that is the one we care about, so we put that in capturing parens: (\w*)
Then we skip the rest of the columns, if any exist: (?:\|\w*)*.
Capturing the column puts it into $1, so we return that value if the regex matched. If not, we return nil, since this String has no nth column.
In general, if you wanted to have more than just words in your columns (like "{a phrase or two|don't forget about punctuation!|maybe some longer strings that have\na newline or two?}"), then just replace all the \w in the regex with [^|{}] so you can have each column contain anything except a curly-brace or a vertical bar.
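As a rough sketch of that substitution, the version below swaps \w for [^|{}] and is written as a plain method rather than a String extension. The name column_at, the negative-index guard, and the use of \z instead of \Z are choices made for this illustration, not part of the solution above.

```ruby
# Hypothetical variant: each column may contain anything except "|", "{" and "}".
def column_at(str, n)
  index = Integer(n)
  return nil if index.negative?
  cell = '[^|{}]*'  # regex source for one column's content
  # Skip `index` columns, capture the next one, then allow any remaining columns.
  str[/\A\{(?:#{cell}\|){#{index}}(#{cell})(?:\|#{cell})*\}\z/, 1]
end

p column_at('{this|is|a|test|case}', 2)                          # "a"
p column_at("{a phrase or two|don't forget punctuation!|x}", 1)  # "don't forget punctuation!"
p column_at('{this|is|a|test|case}', 5)                          # nil
```

String#[] with a regex and a group number returns the capture directly, or nil when the string is not in the expected format or has no such column.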
Here's my previous solution
class String
  def get_column(n)
    raise "not a column string" unless self =~ /\A\{\w*(?:\|\w*)*\}\Z/
    self[1 .. -2].split('|')[n]
  end
end
We use a similar regex to make sure the String contains a set of columns, and raise an error if it does not. Then we strip the curly braces from the front and back (using self[1 .. -2] to take the substring that starts at the second character, index 1, and ends at the next-to-last character), split the columns on the pipe character (using .split('|') to create an array of columns), and then find the nth column (using standard Array lookup with [n]).
I just figured as long as I was using the regex to verify the string, I might as well use it to capture the column.
