How to strip out \r\n in between a quoted string in between tabs when rows are also delimited by \r\n? - ruby

In Ruby 2.1.3, I have a string representing a title such as in a tab delimited csv file format:
string = "helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL \r\nRisk Effectiveness \r\nand Device Effectiveness In \r\nEbola Candidates \"\tData Collection only\t\t20\t"
I want to strip out the "\r\n" only in the tab-delimited portion that starts with PROTOCOL, so I can read the complete title as "PROTOCOL Risk Effectiveness and Device Effectiveness In Ebola Candidates". I want the end result to be:
"helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL Risk Effectiveness and Device Effectiveness In Ebola Candidates \"\tData Collection only\t\t20\t"
If I don't do this, trying to read it in via CSV truncates the title so I only end up reading "PROTOCOL" and not the rest of the title.
Keep in mind there may be an indeterminate number of \r\n characters I want to remove within a title (I'll be parsing through different titles). How do I accomplish this? I was thinking a regular expression might be the way...

Since a newline outside of quotes is treated as a row delimiter, you could use the regex below to isolate quoted fields, then replace any \r?\n only within those fields.
You would then pass the cleaned string to the CSV module.
Three groups together constitute the entire match:
1. Delimiter
2. Double-quoted field
3. Non-quoted field
This needs a replace-with-callback implementation.
Within the callback, if group 2 is not empty, do a separate replace that removes all CRLFs from it.
Concatenate group 1 + replaced(group 2) + group 3, then return the result.
# ((?:^|\t|\r?\n)[^\S\r\n]*)(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"(?:[^\S\r\n]*(?=$|\t|\r?\n)))|([^\t\r\n]*(?:[^\S\r\n]*(?=$|\t|\r?\n))))
( # (1 start), Delimiter tab or newline
(?: ^ | \t | \r? \n )
[^\S\r\n]* # leading optional whitespaces
) # (1 end)
(?:
( # (2 start), Quoted string field
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
(?:
[^\S\r\n]* # trailing optional whitespaces
(?= $ | \t | \r? \n ) # Delimiter ahead, tab or newline
)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^\t\r\n]*
(?:
[^\S\r\n]* # trailing optional whitespaces
(?= $ | \t | \r? \n ) # Delimiter ahead, tab or newline
)
) # (3 end)
)
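In Ruby, the replace-with-callback step described above could be sketched with the block form of String#gsub. This is only an illustration, not the answerer's own code: the method name strip_newlines_in_quoted_fields is hypothetical, and the sample string assumes every line break inside the title is a full \r\n.

```ruby
# Hypothetical helper: removes CR/LF only inside double-quoted fields.
def strip_newlines_in_quoted_fields(input)
  # Compact pattern from the answer:
  # group 1 = delimiter, group 2 = quoted field, group 3 = non-quoted field.
  field = /((?:^|\t|\r?\n)[^\S\r\n]*)(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"(?:[^\S\r\n]*(?=$|\t|\r?\n)))|([^\t\r\n]*(?:[^\S\r\n]*(?=$|\t|\r?\n))))/

  input.gsub(field) do
    m = Regexp.last_match
    # Only the quoted field (group 2) gets its line breaks removed.
    quoted = m[2] ? m[2].gsub(/\r?\n/, "") : ""
    "#{m[1]}#{quoted}#{m[3]}"
  end
end

# Sample row from the question (assumed to use \r\n throughout).
sample = "helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL \r\nRisk Effectiveness \r\nand Device Effectiveness In \r\nEbola Candidates \"\tData Collection only\t\t20\t"
cleaned = strip_newlines_in_quoted_fields(sample)
```

The row delimiter after "helloworld" is matched by group 1 and written back unchanged, so only the line breaks inside the quoted title disappear.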

Unfortunately I don't know ruby, and the solution I'm going to offer is not very nice, but here goes:
Since ruby's implementation of regex doesn't support dynamic width lookbehinds, I couldn't come up with a pattern that matches only the \r\n you want to remove. But you can replace all matches of this regex pattern
(\t"?PROTOCOL[^\t]*)[\r\n]+
with \1 (the text that has been matched by group 1), repeating the substitution until the pattern no longer matches. A single substitution won't remove all occurrences of \r\n.
I hope you'll find a nicer solution.
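A minimal Ruby sketch of the "replace until it no longer matches" idea, relying on the fact that String#gsub! returns nil when nothing was substituted. The variable names are illustrative, and the sample again assumes full \r\n line breaks inside the title.

```ruby
sample = "helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL \r\nRisk Effectiveness \r\nand Device Effectiveness In \r\nEbola Candidates \"\tData Collection only\t\t20\t"

# Pattern from the answer: group 1 is the field text before a run of CR/LF.
pattern = /(\t"?PROTOCOL[^\t]*)[\r\n]+/

result = sample.dup
# Each pass removes one run of line-break characters from the field;
# gsub! returns nil once the pattern stops matching, which ends the loop.
nil while result.gsub!(pattern, '\1')
```

Because [^\t]* cannot cross a tab, the loop only touches the PROTOCOL field. If that field were the last one in its row, the row's own \r\n would be removed as well.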

Related

Where did the character go?

I matched a string against a regex:
s = "`` `foo`"
r = /(?<backticks>`+)(?<inline>.+)\g<backticks>/
And I got:
s =~ r
$& # => "`` `foo`"
$~[:backticks] # => "`"
$~[:inline] # => " `foo"
Why is $~[:inline] not "` `foo"? Since $& is s, I expect:
$~[:backticks] + $~[:inline] + $~[:backticks]
to be s, but it is not, one backtick is gone. Where did the backtick go?
It is actually expected. Look:
(?<backticks>`+) - matches 1+ backticks and stores them in the named capture group "backticks" (there are two backticks). Then...
(?<inline>.+) - 1+ characters other than a newline are matched into the "inline" named capture group. It grabs all the string and backtracks to yield characters to the recursed subpattern that is actually the "backticks" capture group. So,...
\g<backticks> - finds 1 backtick at the end of the string, which satisfies the condition to match 1+ backticks. The named capture "backticks" buffer is overwritten here.
The matching works like this:
"`` `foo`"
 ||1
   | 2 |
        |3
And then 1 becomes 3, and since 1 and 3 are the same group, you see one backtick.
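The overwrite can be seen by comparing the subexpression call \g<backticks> with a backreference \k<backticks>, which must match the same text the group captured earlier. The backreference variant is a suggestion for illustration, not part of the original answer, and it changes what the pattern accepts.

```ruby
s = "`` `foo`"

# Subexpression call: \g<backticks> re-runs the sub-pattern `+ and
# overwrites the capture with whatever it matched last.
call = s.match(/(?<backticks>`+)(?<inline>.+)\g<backticks>/)
call_parts = [call[:backticks], call[:inline]]

# Backreference: \k<backticks> must repeat the text captured earlier,
# so the opening and closing delimiters are guaranteed to be identical.
backref = s.match(/(?<backticks>`+)(?<inline>.+)\k<backticks>/)
rebuilt = backref[:backticks] + backref[:inline] + backref[:backticks]
```

With the backreference, the pieces add back up to the whole match, at the cost of requiring the closing run of backticks to be the same length as the opening one.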

How do I prevent whitespace from appearing in these bash variables?

I'm reading in values from an .ini file, and sometimes may get trailing or leading whitespace.
How do I amend this first line to prevent that?
db=$(sed -n 's/.*DB_USERNAME *= *\([^ ]*.*\)/\1/p' < config.ini);
echo -"$db"-
Result;
-myinivar -
I need;
-myinivar-
Use parameter expansion.
echo "=${db% }="
You don't need the .* inside the capturing group (or the semicolon at the end of line):
db="$(sed -n 's/.*DB_USERNAME *= *\([^ ]*\).*/\1/p' < config.ini)"
To elaborate:
.* matches anything at all
DB_USERNAME matches that literal string
* (a single space followed by an asterisk) matches any number of spaces
= matches that literal string
* (a single space followed by an asterisk) matches any number of spaces
\( starts the capturing group that is used for \1 later
[^ ] matches anything which is not a space character
* repeats that zero or more times
\) ends the capturing group
.* matches anything at all
Therefore, the result will be all the characters after DB_USERNAME = and any number of spaces, up to the next space or end of line, whichever comes first.
You can use echo to trim whitespace:
db='myinivar '
echo -"$(echo $db)"-
-myinivar-
Use crudini which handles these ini file edge cases transparently
db=$(crudini --get config.ini '' DB_USERNAME)
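To get rid of more than one trailing space, use %%, which removes the longest matching pattern from the end of the string. Note that the pattern " *" matches from the first space onward, so this assumes the value has no leading space and contains no spaces itself.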
echo "=${db%% *}="

Ruby regex to split text

I am using the regex below to split a text at certain ending punctuation; however, it doesn't work with quotes.
text = "\"Hello my name is Kevin.\" How are you?"
text.scan(/\S.*?[...!!??]/)
=> ["\"Hello my name is Kevin.", "\" How are you?"]
My goal is to produce the following result, but I am not very good with regular expressions. Any help would be greatly appreciated.
=> ["\"Hello my name is Kevin.\"", "How are you?"]
text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[...!!??]/)
The idea is to check for quoted parts first. The subpattern is a bit more elaborate than a simple "[^"]*" so that it handles escaped quotes (* see the end for a more efficient pattern).
pattern details:
" # literal: a double quote
(?> # open an atomic group: all that can be between quotes
[^"\\]+ # all that is not a quote or a backslash
| # OR
\\{2} # 2 backslashes (the idea is to skip even numbers of backslashes)
| # OR
\\. # an escaped character (in particular a double quote)
)* # repeat zero or more times the atomic group
" # literal double quote
| # OR
\S.*?[...!!??]
To deal with single quotes too, you can add '(?>[^'\\]+|\\{2}|\\.)*'| to the pattern (the most efficient option), but if you want to make it shorter you can write this:
text.scan(/(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[...!!??]/)
where \1 is a backreference to the first capturing group (the found quote) and (?!\1) means not followed by the found quote.
(*) Instead of writing "(?>[^"\\]+|\\{2}|\\.)*", you can use "[^"\\]*+(?:\\.[^"\\]*)*+", which is more efficient.
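One caveat when trying the shorter pattern: once a regex contains a capturing group, String#scan returns the captured groups instead of the whole matches. The sketch below shows one possible workaround, collecting the full match from Regexp.last_match inside a block. The character class is written as [.!?], which matches the same characters as the one in the question.

```ruby
text = "\"Hello my name is Kevin.\" How are you?"
pattern = /(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.!?]/

# With a capturing group, scan yields one array of groups per match,
# so only the opening quote (or nil) comes back.
groups_only = text.scan(pattern)

# Collect the whole match (group 0) for each hit instead.
sentences = []
text.scan(pattern) { sentences << Regexp.last_match(0) }
```

Making the group non-capturing is not an option here, because \1 needs it. The two-alternative version without the backreference has no capturing group and works with plain scan.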
Add optional quote (["']?) to the pattern:
text.scan(/\S.*?[...!!??]["']?/)
# => ["\"Hello my name is Kevin.\"", "How are you?"]

Difference of answers while using split function in Ruby

Given the following inputs:
line1 = "Hey | Hello | Good | Morning"
line2 = "Hey , Hello , Good , Morning"
file1=length1=name1=title1=nil
Using ',' to split the string as follows:
file1, length1, name1, title1 = line2.split(/,\s*/)
I get the following output:
puts file1,length1,name1,title1
>Hey
>Hello
>Good
>Morning
However, using '|' to split the string I receive a different output:
file1, length1, name1, title1 = line1.split(/|\s*/)
puts file1,length1,name1,title1
>H
>e
>y
Both the strings are same except the separating symbol (a comma in first case and a pipe in second case). The format of the split function I am using is also the same except, of course, for the delimiting character. What causes this variation?
The problem is that | means OR (alternation) in a regex. If you want the literal character, you need to escape it as \|. So the correct regex is /\|\s*/
As written, the regex /|\s*/ means "the empty string, or a run of whitespace characters". Since the empty string comes first in the alternation, the regex engine breaks the string up at every character (you can imagine that there is an empty string between each pair of characters). If you swap it to /\s*|/, whitespace is preferred over the empty string where possible, and there will be no whitespace in the list of tokens after splitting.
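A short comparison of the unescaped and escaped patterns on the pipe-delimited line. The third variant, which also consumes whitespace before the pipe, is an addition for illustration and not part of the original answer.

```ruby
line1 = "Hey | Hello | Good | Morning"

# Unescaped pipe: "empty string OR whitespace", so the split happens
# between every character.
unescaped = line1.split(/|\s*/)

# Escaped pipe: splits on a literal "|" plus any whitespace after it.
escaped = line1.split(/\|\s*/)

# Also consuming whitespace before the pipe trims both sides of each field.
trimmed = line1.split(/\s*\|\s*/)
```

Note that the escaped pattern from the answer leaves a trailing space on each field except the last, because only the whitespace after the pipe is part of the delimiter.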

What is the Ruby regex to match a string with at least one period and no spaces?

What is the regex to match a string with at least one period and no spaces?
You can use this :
/^\S*\.\S*$/
It works like this :
^ <-- Starts with
\S <-- Any character but white spaces (notice the upper case) (same as [^ \t\r\n])
* <-- Repeated but not mandatory
\. <-- A period
\S <-- Any character but white spaces
* <-- Repeated but not mandatory
$ <-- Ends here
You can replace \S by [^ ] to work strictly with spaces (not with tabs etc.)
Something like
^[^ ]*\.[^ ]*$
(match any non-spaces, then a period, then some more non-spaces)
No need for a regular expression. Keep it simple:
>> s="test.txt"
=> "test.txt"
>> s["."] and s.count(" ")<1
=> true
>> s="test with spaces.txt"
=> "test with spaces.txt"
>> s["."] and s.count(" ")<1
=> false
Try this:
/^\S*\.\S*$/
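One thing to keep in mind with these patterns: in Ruby, ^ and $ anchor at line boundaries rather than at the ends of the string, so a multi-line string can pass if any single line qualifies. A sketch using \A and \z instead, assuming the whole string should be checked; the method name is hypothetical and String#match? needs Ruby 2.4 or later.

```ruby
# Hypothetical helper: true when the entire string has at least one
# period and no whitespace anywhere.
def dotted_without_whitespace?(str)
  # \A and \z anchor to the start and end of the whole string.
  str.match?(/\A\S*\.\S*\z/)
end

multi_line = "a b\nc.d"
line_anchored = !!(multi_line =~ /^\S*\.\S*$/)              # checks line by line
string_anchored = dotted_without_whitespace?(multi_line)    # checks the whole string
```

For single-line input the two forms behave the same, so the difference only matters when the string may contain newlines.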
