splitting Strings into a String Array - quotation-marks

I have a spreadsheet saved as a .txt file, and I need to split each row into an array for further calculations. The values in each row are separated by commas.
My problem is that some values are enclosed in quotation marks and contain commas themselves. How can I separate the values without splitting inside the quotation marks?
Example row:
1000,117090058,117970084,"170,9 + 58","179,7 + 84","Flensburg Weiche, W 203 - Flensburg Grenze",Flensburg-Weiche - Flensb. Gr

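This one-liner should work. It splits on a comma only when that comma is followed by an even number of quotation marks up to the end of the string, which means the comma sits outside any quoted value: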
yourString.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")
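For readers who want to try the pattern outside Java, here is a minimal sketch of the same lookahead in Ruby (a different language from the one-liner above). The method name split_outside_quotes is made up for illustration, and the sketch assumes the quotation marks are balanced and never escaped inside a value.

```ruby
# Split a row on commas that are followed by an even number of quotation
# marks up to the end of the line, i.e. commas outside any quoted value.
# Assumption: double quotes are balanced and never escaped.
# (split_outside_quotes is a hypothetical helper name.)
def split_outside_quotes(row)
  row.split(/,(?=(?:[^"]*"[^"]*")*[^"]*$)/)
end

row = '1000,117090058,117970084,"170,9 + 58","179,7 + 84",' \
      '"Flensburg Weiche, W 203 - Flensburg Grenze",Flensburg-Weiche - Flensb. Gr'

fields = split_outside_quotes(row)
puts fields.length
puts fields[3]
```

The quotation marks stay part of each field, so strip them afterwards if they are not wanted.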

Related

How to replace _A_&_B_ using gsub in R

I am trying to join two columns containing company names from two distinct data tables in R. In one column I have the pattern _A_&_B_, where A and B can be any letters. I would like to get rid of those two letters, i.e. each single letter surrounded by underscores.
So if I have John_K_&_E_Scott, I would like to get John__&__Scott, since I can remove the punctuation afterwards. I have tried the following:
names[, JOINING_ID := gsub("[A-Za-z]_&_[A-Za-z]\\w", "", JOINING_ID)]
But this transforms John_A_&_ BOYS_ into John__&_ OYS_, which is not what I want.
Use the following regex pattern:
_[[:alpha:]]_&_[[:alpha:]]_
and replace with __&__. It won't match strings like John_A_&_BOYS_, because the second letter must be followed directly by an underscore, so the issue you are having won't occur.
Note that [[:alpha:]] matches any letter.
R usage:
gsub("_[[:alpha:]]_&_[[:alpha:]]_", "__&__", JOINING_ID)
Or, if you only expect 1 match per string, use sub:
sub("_[[:alpha:]]_&_[[:alpha:]]_", "__&__", JOINING_ID)

Complex requirements for string split around select commas

TL;DR
I need some help making a regex that will match any commas in a string that are side by side with unlimited white space around them and between them. The commas and their surrounding white space cannot be within matching single quotes or double quotes. I then need to capture the non-whitespace values from around those commas and count how many of those commas there are.
The values captured from around the commas will become their own values in the final array, while the commas that were counted will become nil values that are added to the final array.
Explanation of the problem:
This is a pretty complex problem so any help is greatly appreciated. I'm adding functionality to a library I've been using for a while now. I have this string that contains an array
"['d,og,f:asdf,:hello,",,\",,alsee',,,'ho,la', "-123,4,5.3", true, :good, false,,, "gr\'\'\'true,\',\'ee\"n", ":::testme", true]"
I would like to split this string only around select commas so that I have an array containing the following values
'd,og,f:asdf,:hello,",,\",,alsee'
nil
nil
'ho,la'
"-123,4,5.3"
true
:good
false
nil
nil
"gr\'\'\'true,\',\'ee\"n"
":::testme"
true
The nil values come from the side-by-side commas that are not contained in any string. I wrote the following regex to split the string above (I had already removed the start and end brackets):
/(?<=(?:['\"]|false|true|^|,)),(?=(?:\s*(?:(?::[\w]+)|(?:(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*'))|(?:false|true)))\s*(?:,|$)))/
This splits the string so I get these values:
(0) "'d,og,f:asdf,:hello,",,\",,alsee',,"
(1) "'ho,la'"
(2) " "-123,4,5.3""
(3) " true"
(4) " :good, false,,"
(5) " "gr\'\'\'true,\',\'ee\"n""
(6) " ":::testme""
(7) " true"
All the values are strings as can be seen by their surrounding double quotes. They will not all end up that way though. A true or false will be converted to a boolean. The values surrounded by internal quotes will end up as strings. Then a value preceded with a : will end up as a symbol.
There are problems with the values at index 0 and 4. Index 0 should be this:
(0.0) "'d,og,f:asdf,:hello,",,\",,alsee'"
(0.1) nil
(0.2) nil
As you can see, the two commas at the end are gone. They have become the two nil values you see above. Then the string starts at the first single quote and ends at the last single quote, signifying that this value in the array is a string.
Then index 4 (" :good, false,,") should be this:
(4.0) " :good"
(4.1) " false"
(4.2) nil
(4.3) nil
The two commas at the end have become nil. Then " false" is its own value, which will later be converted to a boolean, while " :good" is also its own value and will later be converted to a symbol.
To fix the problem with index 4 I have all the values run through a second regex. Here it is:
/^(\s*:(?:(?:[\w]+|\"[\s\S]+\"|'[\s\S]+')\s*)),([\s\S]*)$/
Instead of splitting this one I get the capture groups. It ends up returning this array for the value at index 4:
(4.0) " :good"
(4.1) " false,,"
That's what I wanted except for one problem. The value at index 4.1 (" false,,") has the two trailing commas which should be nil values in the array.
"['d,og,f:asdf,:hello,"
,,\
",,alsee',,,'ho,la', "
-123,4,5.3
", true, :good, false,,, "
gr\
'\'
I count 4 strings: 3 in double quotes and the last one in single quotes?
You say your regex breaks this down into smaller strings. But what about the characters outside the 4 strings?
Sorry, it looks a bit of a mess.
Try putting it all in a heredoc string and then breaking it down with a regex.
I finally figured it out myself. You can see how it fits in with the rest if you look at the description of the question above.
/^(([\s]*,)*)[\s]*((?::[\w]+)|(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*')|false|true))?(([\s]*,)*)$/
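The answer does not show how the captures are consumed, so the following is only one possible reading: run the regex against each piece produced by the earlier steps, take the value from the third capture group, and turn every leading or trailing comma into a nil. The helper name expand_piece is hypothetical, and the returned value is still the raw string (conversion to boolean or symbol is left out).

```ruby
# Hypothetical helper: expand one piece into leading nils, the value,
# and trailing nils, using the regex from the answer above unchanged.
# A piece the regex does not match is returned as-is.
def expand_piece(piece)
  piece_re = /^(([\s]*,)*)[\s]*((?::[\w]+)|(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*')|false|true))?(([\s]*,)*)$/
  m = piece_re.match(piece)
  return [piece] unless m

  leading  = Array.new(m[1].count(","))  # one nil per leading comma
  trailing = Array.new(m[4].count(","))  # one nil per trailing comma
  value    = m[3] ? [m[3]] : []          # the value group is optional
  leading + value + trailing
end

p expand_piece(" false,,")
p expand_piece(",, true")
```

A piece such as " :good, false,," still contains two values and is not matched by this regex, which is why the question first splits it with the second regex.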

I want to tokenize string using the following delimiters in pig: dash, comma, hash, space and colon

How can I do this using STRSPLIT, TOKENIZER or any other method?
You can use STRSPLIT with a regex to solve this problem. I am not sure whether your input has a single delimiter or a combination of delimiters (dash, comma, hash, space and colon), but the solution below works for both.
Input:
a#b c-d,e
f e,g#h:i
1,2,3,4,5
l#y#z#h#n
A B C D E
PigScript:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'[-,:\\s#]',5));
DUMP B;
Output:
(a,b,c,d,e)
(f,e,g,h,i)
(1,2,3,4,5)
(l,y,z,h,n)
(A,B,C,D,E)
If your input has only a single delimiter, say '#' or any other delimiter that you mentioned, then try the script below (the '5' in the third argument is the total number of columns in your input):
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'#',5));
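To support an additional delimiter, say '$', just add it inside the character class of the regex.
Note that '$' is a special character in regex, so escape it with a double backslash. Also keep the dash as the first character of the class: in '[\\$-,:\\s#]' the sequence $-, would be read as a range from '$' to ',' and the dash itself would no longer be a delimiter. Use '[-\\$,:\\s#]' instead.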
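The character class can also be tried outside Pig. The following is a rough sketch in Ruby, not Pig: the method names tokenize and tokenize_with_dollar are invented for illustration, and Ruby's String#split is used only because it accepts a regex and a limit much like STRSPLIT does. The hyphen is kept first in the class so that it is read as a literal dash rather than as a range operator.

```ruby
# Same character class as in the Pig script, written as a Ruby regex.
# The leading hyphen is a literal dash, not a range operator.
# (tokenize is a hypothetical helper name.)
def tokenize(line, limit = 5)
  line.split(/[-,:\s#]/, limit)
end

# Variant with '$' added as a delimiter; the hyphen stays first.
def tokenize_with_dollar(line)
  line.split(/[-\$,:\s#]/)
end

p tokenize("a#b c-d,e")
p tokenize("f e,g#h:i")
p tokenize_with_dollar('a$b-c,d')
```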

How do I match repeated characters?

How do I find repeated characters using a regular expression?
If I have aaabbab, I would like to match only characters which have three repetitions:
aaa
Try string.scan(/((.)\2{2,})/).map(&:first), where string is your string of characters.
The way this works is that it looks for any character and captures it (the dot), then matches repeats of that character (the \2 backreference) 2 or more times (the {2,} range means "anywhere between 2 and infinity times"). Scan will return an array of arrays, so we map the first matches out of it to get the desired results.
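As a small runnable sketch of the explanation above (the method name repeated_runs is made up for illustration):

```ruby
# Return every run of three or more identical characters.
# Group 1 is the whole run, group 2 the single repeated character.
# (repeated_runs is a hypothetical helper name.)
def repeated_runs(string)
  string.scan(/((.)\2{2,})/).map(&:first)
end

p repeated_runs("aaabbab")
p repeated_runs("aaaabbbcc")
```

Because {2,} is greedy, a run of four characters is returned as one match. If only the first three characters of a longer run are wanted, {2} could be used instead.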

ruby parametrized regular expression

I have a string like "{some|words|are|here}" or "{another|set|of|words}"
So in general the string consists of an opening curly bracket, words delimited by pipes, and a closing curly bracket.
What is the most efficient way to get a selected word from that string?
I would like to do something like this:
#my_string = "{this|is|a|test|case}"
#my_string.get_column(0) # => "this"
#my_string.get_column(1) # => "is"
#my_string.get_column(4) # => "case"
What should the method get_column contain?
So this is the solution I like right now:
class String
  def get_column(n)
    self =~ /\A\{(?:\w*\|){#{n}}(\w*)(?:\|\w*)*\}\Z/ && $1
  end
end
We use a regular expression to make sure that the string is of the correct format, while simultaneously grabbing the correct column.
Explanation of regex:
\A is the beginning of the string and \Z is the end, so this regex matches the entire string.
Since curly braces have a special meaning we escape them as \{ and \} to match the curly braces at the beginning and end of the string.
Next, we want to skip the first n columns - we don't care about them.
A previous column is some number of letters followed by a vertical bar, so we use the standard \w to match a word-like character (includes numbers and underscore, but why not) and * to match any number of them. Vertical bar has a special meaning, so we have to escape it as \|. Since we want to group this, we enclose it all inside non-capturing parens (?:\w*\|) (the ?: makes it non-capturing).
Now we have n of the previous columns, so we tell the regex to match the column pattern n times using a counted quantifier - just put a number in curly braces after a pattern. We use standard string interpolation, so we just put in {#{n}} to mean "match the previous pattern exactly n times".
The first non-skipped column after that is the one we care about, so we put that in capturing parens: (\w*)
Then we skip the rest of the columns, if any exist: (?:\|\w*)*.
Capturing the column puts it into $1, so we return that value if the regex matched. If not, we return nil, since this String has no nth column.
In general, if you wanted to have more than just words in your columns (like "{a phrase or two|don't forget about punctuation!|maybe some longer strings that have\na newline or two?}"), then just replace all the \w in the regex with [^|{}] so you can have each column contain anything except a curly-brace or a vertical bar.
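A minimal sketch of that substitution, written as a standalone method rather than a String extension. The name column_at is hypothetical, and Integer(n) is added here only so that nothing but a number is interpolated into the pattern.

```ruby
# Variant of get_column with every \w replaced by [^|{}], so a column may
# contain anything except a pipe or a curly brace.
# (column_at is a hypothetical name; Integer(n) guards the interpolation.)
def column_at(str, n)
  m = /\A\{(?:[^|{}]*\|){#{Integer(n)}}([^|{}]*)(?:\|[^|{}]*)*\}\Z/.match(str)
  m && m[1]
end

phrases = "{a phrase or two|don't forget about punctuation!|" \
          "maybe some longer strings that have\na newline or two?}"

p column_at(phrases, 1)
p column_at("{this|is|a|test|case}", 4)
p column_at("{this|is|a|test|case}", 5)
```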
Here's my previous solution
class String
  def get_column(n)
    raise "not a column string" unless self =~ /\A\{\w*(?:\|\w*)*\}\Z/
    self[1 .. -2].split('|')[n]
  end
end
We use a similar regex to make sure the String contains a set of columns, and raise an error otherwise. Then we strip the curly braces from the front and back (self[1 .. -2] is the substring starting at the second character and ending at the next-to-last one), split the columns on the pipe character (.split('|') creates an array of columns), and look up the n'th column (standard Array indexing with [n]).
I just figured as long as I was using the regex to verify the string, I might as well use it to capture the column.
