XPath normalize-space, do not replace multiple inner whitespaces - xpath

How do I translate the text:
" Some Text in
my
"
to
"Some Text in
my"
using XPath?
I.e. only trim leading and trailing whitespace (in contrast to normalize-space, which also collapses runs of internal whitespace into a single space).
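To make the difference concrete, here is a small sketch in Ruby (not XPath, purely as an illustration) of what normalize-space does versus the trim-only behavior being asked for. The helper names normalize_space and trim_only are made up for this example, and whitespace is assumed to mean the four XML whitespace characters.

```ruby
# XML whitespace: space, tab, carriage return, line feed.
XML_WS = /[ \t\r\n]+/
XML_WS_ENDS = /\A[ \t\r\n]+|[ \t\r\n]+\z/

# Roughly what XPath normalize-space() does: trim both ends,
# then collapse every internal run of whitespace to one space.
def normalize_space(s)
  s.gsub(XML_WS_ENDS, '').gsub(XML_WS, ' ')
end

# What the question asks for: trim both ends, leave the inside alone.
def trim_only(s)
  s.gsub(XML_WS_ENDS, '')
end

text = " Some Text in\nmy\n"
normalize_space(text) # => "Some Text in my"
trim_only(text)       # => "Some Text in\nmy"
```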

Related

How to replace all characters in a string with '*' in perl

How can I use a regular expression to replace all characters in a Perl string with *? The string also contains some UTF-8 or ISO-8859-1 characters. I tried "s/\w/*/g", but it did not replace the UTF-8 or ISO-8859-1 characters.
my $value="hellö";
print "$value\n";
$value =~ s/\w/*/g;
print "after replacing $value\n"; # It prints ****ö.
I expect every character to be replaced with *, i.e. hellö should become *****.
Please note that a few special characters such as -, _, \ and / should be skipped.
If you want to skip just a few characters, you can always do something along the lines of
s/[^, \/\\\-]/*/g;
To replace all the characters in a string, note that \w only matches word characters. A plain dot matches any character except a newline (add the /s modifier to include newlines): s/./*/g
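The negated character class idea carries over to other regex engines. As an illustration only (in Ruby rather than Perl, with a made-up helper name mask), this sketch replaces every character except the four the question wants to keep. It assumes the string is already a properly decoded UTF-8 string, so that ö counts as one character.

```ruby
# Replace every character except - _ \ and / with an asterisk.
def mask(value)
  value.gsub(/[^\-_\\\/]/, '*')
end

mask("hellö")      # => "*****"
mask("a-b_c/d\\e") # => "*-*_*/*\\*"
```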

Regex: match something except within arbitrary delimiters

My string:
a = "Please match spaces here <but not here>. Again match here <while ignoring these>"
Using Ruby's regex flavor, I would like to do something like:
a.gsub /regex_pattern/, '_'
And obtain:
"Please_match_spaces_here_<but not here>._Again_match_here_<while ignoring these>"
This should do it:
result = subject.gsub(/\s+(?![^<>]*>)/, '_')
This regex assumes there's nothing tricky like escaped angle brackets. Also be aware that \s matches newlines, tabs and other whitespace characters as well as spaces. That's probably what you want, but you have the option of matching only spaces:
/ +(?![^<>]*>)/
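Applied to the string from the question, the first pattern can be checked like this. The variable names are only illustrative.

```ruby
a = "Please match spaces here <but not here>. Again match here <while ignoring these>"

# Replace a run of whitespace unless what follows it is a stretch of
# non-bracket characters ending in ">", which means we are inside <...>.
result = a.gsub(/\s+(?![^<>]*>)/, '_')

puts result
# Please_match_spaces_here_<but not here>._Again_match_here_<while ignoring these>
```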
I think this works:
a = "Please match spaces here <but not here>. Again match here <while ignoring these>"
pattern = /<(?:(?!<).)*>/
a.gsub(pattern, '')
# => "Please match spaces here . Again match here "

Keep spaces with YAML

I have this in my YAML file:
test: I want spaces before this text
In my case I would like to have a space before the text in my array or json when converted. Is that possible? How?
With JSON as output it's parsed like this:
{
"test": "I want spaces before this text"
}
No spaces.
You can test it here
You would have to quote your scalar with either single or double quotes instead of using a plain scalar (i.e. one without quotes). Which of the two is easier to use depends on whether there are special characters in your text.
If you use single quotes:
test: ' I want spaces before this text'
this would require doubling any single quotes already existing in your text (something like ' abc''def ').
If you use double quotes:
test: " I want spaces before this text"
this would require backslash escaping any double quotes already existing in your text (something like " abc\"def ").
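A quick way to check this is to load the three variants with Ruby's bundled YAML library and compare the results. This is only a sketch; the keys plain, single and double and the sample values are made up for the example.

```ruby
require 'yaml'
require 'json'

# Single-quoted heredoc so Ruby leaves the backslashes for YAML to interpret.
doc = <<~'YAML'
  plain:    I want spaces before this text
  single: '   it''s quoted'
  double: "   say \"hi\""
YAML

data = YAML.load(doc)

data['plain']   # leading spaces are dropped from the plain scalar
data['single']  # leading spaces kept, '' is read as one single quote
data['double']  # leading spaces kept, \" is read as one double quote

puts JSON.generate(data)
```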
With \t this works.
Example:
var options = {
\t hostname: 'localhost',
\t port: 4433
};

How to strip out \r\n in between a quoted string in between tabs when rows are also delimited by \r\n?

In Ruby 2.1.3, I have a string representing a title such as in a tab delimited csv file format:
string = "helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL \r\nRisk Effectiveness \r\nand Device Effectiveness In \r\nEbola Candidates \"\tData Collection only\t\t20\t"
I want to strip out the "\r\n" only in the tab-delimited portion that starts with PROTOCOL, so I can read the complete title as "PROTOCOL Risk Effectiveness and Device Effectiveness In Ebola Candidates". I want the end result to be:
"helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL Risk Effectiveness and Device Effectiveness In Ebola Candidates \"\tData Collection only\t\t20\t"
If I don't do this, trying to read it in via CSV truncates the title so I only end up reading "PROTOCOL" and not the rest of the title.
Keep in mind there may be an indeterminate number of \r\n characters I want to remove within a title (I'll be parsing through different titles). How do I accomplish this? I was thinking a regular expression might be the way...
Since a newline (outside of quotes) is treated as a delimiter,
you could use this regex to isolate quoted fields then replace any \r?\n just
within that field.
You would then pass the string into the CSV module.
There are 3 groups that together constitute the entire match.
1. Delimiter
2. Double quoted field
3. Non-quoted field
This needs a replace-with-callback implementation.
Within the callback, if group 2 is not empty, do a separate replace of all CRLFs.
Concatenate group 1 + replaced(group 2) + group 3, then return the result.
# ((?:^|\t|\r?\n)[^\S\r\n]*)(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"(?:[^\S\r\n]*(?=$|\t|\r?\n)))|([^\t\r\n]*(?:[^\S\r\n]*(?=$|\t|\r?\n))))
( # (1 start), Delimiter tab or newline
(?: ^ | \t | \r? \n )
[^\S\r\n]* # leading optional whitespaces
) # (1 end)
(?:
( # (2 start), Quoted string field
"
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
(?:
[^\S\r\n]* # trailing optional whitespaces
(?= $ | \t | \r? \n ) # Delimiter ahead, tab or newline
)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^\t\r\n]*
(?:
[^\S\r\n]* # trailing optional whitespaces
(?= $ | \t | \r? \n ) # Delimiter ahead, tab or newline
)
) # (3 end)
)
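The callback approach described above could look like the following in Ruby, where gsub with a block plays the role of replace-with-callback. This is a sketch, not the answerer's own code: the constant name FIELD and the sample row (with every line break in the title written as \r\n) are assumptions, and the pattern is the one above rewritten in extended mode.

```ruby
FIELD = /
  ( (?: ^ | \t | \r?\n ) [^\S\r\n]* )               # 1: delimiter plus leading blanks
  (?:
    ( " [^"\\]* (?: \\ [\S\s] [^"\\]* )* "          # 2: double-quoted field
      [^\S\r\n]* (?= $ | \t | \r?\n ) )
  |
    ( [^\t\r\n]* [^\S\r\n]* (?= $ | \t | \r?\n ) )  # 3: unquoted field
  )
/x

row = "helloworld\r\n14522\tAB-12-00420\t\"PROTOCOL \r\nRisk Effectiveness \r\nand Device Effectiveness In \r\nEbola Candidates \"\tData Collection only\t\t20\t"

cleaned = row.gsub(FIELD) do
  m = Regexp.last_match                   # save it: the inner gsub resets the match data
  quoted = m[2].to_s.gsub(/\r?\n/, '')    # strip line breaks only inside a quoted field
  "#{m[1]}#{quoted}#{m[3]}"               # unmatched groups are nil and interpolate as ""
end

p cleaned
```

The cleaned string keeps the row delimiter after helloworld and can then be handed to a CSV parser with a tab column separator.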
Unfortunately I don't know Ruby, and the solution I'm going to offer is not very nice, but here goes:
Since Ruby's regex implementation doesn't support variable-width lookbehinds, I couldn't come up with a pattern that matches only the \r\n you want to remove. But you can replace all matches of this regex pattern
(\t"?PROTOCOL[^\t]*)[\r\n]+
with \1 (the text that has been matched by group 1), until the pattern no longer matches. A single substitution won't remove all occurrences of \r\n. See demo.
I hope you'll find a nicer solution.
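For reference, the repeat-until-no-match idea can be sketched in Ruby as below. It relies on gsub! returning nil when nothing was replaced; the variable names and the shortened sample row are illustrative assumptions.

```ruby
row = "14522\tAB-12-00420\t\"PROTOCOL \r\nRisk Effectiveness \r\nand Device Effectiveness \"\tData Collection only\t20\t"

pattern = /(\t"?PROTOCOL[^\t]*)[\r\n]+/

cleaned = row.dup
# One pass does not remove every line break in the field,
# so repeat until gsub! returns nil (nothing left to replace).
loop { break unless cleaned.gsub!(pattern, '\1') }

p cleaned
```

Note that [^\t]* also runs across row boundaries, so this assumes the title is never the last field on its row.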

Ruby regex to split text

I am using the regex below to split a text at certain ending punctuation; however, it doesn't work with quotes.
text = "\"Hello my name is Kevin.\" How are you?"
text.scan(/\S.*?[...!!??]/)
=> ["\"Hello my name is Kevin.", "\" How are you?"]
My goal is to produce the following result, but I am not very good with regular expressions. Any help would be greatly appreciated.
=> ["\"Hello my name is Kevin.\"", "How are you?"]
text.scan(/"(?>[^"\\]+|\\{2}|\\.)*"|\S.*?[...!!??]/)
The idea is to check for quoted parts first. The subpattern is a bit more elaborate than a simple "[^"]*" so that it can deal with escaped quotes (* see the end for a more efficient pattern).
pattern details:
" # literal: a double quote
(?> # open an atomic group: all that can be between quotes
[^"\\]+ # all that is not a quote or a backslash
| # OR
\\{2} # 2 backslashes (the idea is to skip even numbers of backslashes)
| # OR
\\. # an escaped character (in particular a double quote)
)* # repeat zero or more times the atomic group
" # literal double quote
| # OR
\S.*?[...!!??]
To deal with single quotes too, you can add '(?>[^'\\]+|\\{2}|\\.)*'| to the pattern (the most efficient option), but if you want to make it shorter you can write this:
text.scan(/(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[...!!??]/)
where \1 is a backreference to the first capturing group (the found quote) and (?!\1) means not followed by the found quote.
(*) Instead of writing "(?>[^"\\]+|\\{2}|\\.)*", you can use "[^"\\]*+(?:\\.[^"\\]*)*+", which is more efficient.
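One thing to watch when trying the backreference version in Ruby: once a pattern contains a capturing group, String#scan returns the captured groups instead of the whole matches. A possible workaround, sketched below, wraps the entire pattern in an outer group and keeps only that group. The renumbering (\2 for the quote) and the simplified [.!?] class are changes made for this example, not part of the original answer.

```ruby
text = "\"Hello my name is Kevin.\" How are you?"

# With a single capturing group, scan yields only that group per match.
groups_only = text.scan(/(['"])(?>[^'"\\]+|\\{2}|\\.|(?!\1)["'])*\1|\S.*?[.!?]/)
# => [["\""], [nil]]

# Group 1 wraps the whole alternation; group 2 remembers the opening quote.
pattern = /((['"])(?>[^'"\\]+|\\{2}|\\.|(?!\2)["'])*\2|\S.*?[.!?])/

sentences = text.scan(pattern).map(&:first)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
```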
Add an optional quote (["']?) to the pattern:
text.scan(/\S.*?[...!!??]["']?/)
# => ["\"Hello my name is Kevin.\"", "How are you?"]
