This is related to cleaning files before parsing them elsewhere, namely, malformed/ugly CSV. I see plenty of examples for removing/matching all characters between certain strings/characters/delimiters, but I cannot find any for specific strings. Example portion of line would look something like:
","Should now be allowed by rule above "Server - Access" added by Rich"\r
To be clear, this is not the entire line, but the entire line is enclosed in quotes, separated by ",", and ends in ^M (Windows newline/carriage return). The 'columns' preceding this would be enclosed on each side by ",". I would probably use the same approach to remove cruft that appears earlier in the line.
What I am trying to do is remove all double quotes between "," and "\r (the ones around "Server - Access") without removing the delimiters. Alternatively, I may just find them and replace them with \" to escape them for the Ruby CSV library. So far I have this:
(?<=",").*?(?="\\r)
Which basically matches everything between the delimiters. If I replace .*? with anything, be that a letter, double quotes etc, I get zero matches. What am I doing wrong?
Note: This should be Ruby compatible please.
If I understand you correctly, you can use negative lookahead and lookbehind:
text = '","Should now be allowed by rule above "Server - Access" added by Rich"\r'
puts text.gsub(/(?<!,)"(?![,\\r])/, '\"')
# ","Should now be allowed by rule above \"Server - Access\" added by Rich"\r
Of course, this won't work if the values themselves can contain commas or newlines...
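One thing to keep in mind: by default, Ruby's CSV library expects an embedded quote to be doubled ("") rather than backslash-escaped. A minimal sketch of that variant follows. The sample line, the made-up id1 first column, and the switch to positive lookarounds are assumptions for illustration, and the same limitation about commas and newlines inside values applies.

```ruby
require 'csv'

# Hypothetical sample: a made-up first column and a real CR/LF line ending.
line = %q("id1","Should now be allowed by rule above "Server - Access" added by Rich") + "\r\n"

# Double every quote that has a non-comma character before it and a
# character other than comma, CR, or LF after it. Quotes at the very start
# or end of the line have no such neighbour, so they are left alone.
fixed = line.gsub(/(?<=[^,])"(?=[^,\r\n])/, '""')

# With the inner quotes doubled, the line parses with default CSV settings.
row = CSV.parse_line(fixed.chomp)
puts row.inspect
```

Positive lookarounds are used here only so that the start and end of the line need no special handling; the negative-lookaround form above could be adapted the same way.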
I have a log file, and I want to get rid of the third column, the one that starts with "external". This column is not always in the third position, so I need to find the word "external" and then delete it together with the string that follows the colon.
I was thinking of using -replace for that, but does -replace accept a regex that deletes the rest of the string (after the colon), which is always changing?
Or maybe there is a better way to do this?
02/02/2020 name:VAL_NATURE external:af2045b2-5992-432e-b790-c1ad4743038 status:good
cat mylog.log | %{$_ -replace "external???",""}
With any delimited file, the first thought I have is to break it at the delimiters (in your case, the white space) and treat it like an object. Deleting a column is trivial if you do that, and it lets you have easy access to the data for other purposes.
If, however, your only task is to remove that column with 'external' + colon + all text up to the next bit of white space, that is an easy thing to do with a regex replace.
$line = '02/02/2020 name:VAL_NATURE external:af2045b2-5992-432e-b790-c1ad4743038 status:good'
$line -replace 'external:.*\s',''
EDIT: Tested the code above, and got this output:
02/02/2020 name:VAL_NATURE status:good
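The . is any character, and .* says "any character zero or more times". Because .* is greedy, it runs to the end of the line and then backs up to the last whitespace, which is represented by the \s. So this regex matches the word 'external' followed by a ':' followed by everything up to and including the last whitespace on the line (space/tab/etc). That works here because only one column follows; if more columns can follow, use 'external:\S*\s' so the match stops at the first whitespace.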
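Greedy matching behaves the same way in most regex engines, so the difference can be sketched outside PowerShell. The Ruby snippet below is only an illustration: the shortened id and the extra owner:bob column are hypothetical, added to show what happens when more than one column follows. It also shows the split-at-the-delimiter approach mentioned at the start of this answer.

```ruby
# Hypothetical log line with one extra column (owner:bob) after status.
line = '02/02/2020 name:VAL_NATURE external:af2045b2 status:good owner:bob'

# Greedy: .* runs up to the LAST whitespace, so status:good is lost too.
greedy = line.sub(/external:.*\s/, '')

# Bounded: \S* stops at the first whitespace after the value.
bounded = line.sub(/external:\S*\s?/, '')

# Delimiter approach: split on whitespace, drop the unwanted field, rejoin.
kept = line.split.reject { |field| field.start_with?('external:') }.join(' ')

puts greedy
puts bounded
puts kept
```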
I have a task where I need to check if a value is properly quoted CSV column:
cases:
no quotation - OK
"with quotation" - OK
"opening quote - Not Good
improper"quote" - Not Good
closing quote" - Not Good
CSV flags an error like below:
Illegal quoting in line 5. (CSV::MalformedCSVError)
Question: How would I get to have this working using a single regex? I need to flag error for cases 3-5.
And if you have any idea what else should be checked to decide whether a CSV value is valid, please say so.
EDIT: I have added 2 scenarios/cases below:
"quote "inside quotes" - Not Good
"quotes ""inside quotes" - Not Good
EDIT: added 1 more case:
"" - OK
Without considering escaped quotes :
/^("[^"]*"|[^"]+)$/m
It means :
beginning of line
1 quote + anything except quote + 1 quote, or
anything except quote (at least one character)
end of the line
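A quick way to check the pattern against the cases in the question is sketched below. It assumes each value is tested on its own, so \A and \z are swapped in for ^ and $ with the m flag; the VALID_FIELD and valid_field? names are made up for this example.

```ruby
# Same alternation as above, anchored to the whole string.
VALID_FIELD = /\A(?:"[^"]*"|[^"]+)\z/

def valid_field?(value)
  VALID_FIELD.match?(value)
end

# The cases from the question, mapped to the expected verdict.
cases = {
  'no quotation'             => true,
  '"with quotation"'         => true,
  '"opening quote'           => false,
  'improper"quote"'          => false,
  'closing quote"'           => false,
  '"quote "inside quotes"'   => false,
  '"quotes ""inside quotes"' => false,
  '""'                       => true
}

cases.each do |value, expected|
  status = valid_field?(value) == expected ? 'as expected' : 'MISMATCH'
  puts "#{value.ljust(26)} #{status}"
end
```

Note that an empty, unquoted value is rejected by this pattern, because the second branch requires at least one character.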
^"{1}.+"{1}$|^[^"]*$
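This matches all lines either starting and ending with one quotation mark, or lines not including quotation marks at all. Note that .+ also matches quotation marks and needs at least one character, so "quote "inside quotes" is accepted and "" is rejected; using [^"]* in place of .+ handles both.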
This is my very simple code, which isn't working, for some reason I can't figure out.
#!/usr/bin/perl
use File::Copy;
$old = "car_lexusisf_gray_30inclination_000azimuth.png";
$new = "C:\Users\Lenovo\Documents\mycomp\simulation\cars\zzzorganizedbyviews\00inclination_000azimuth\lexuscopy.png";
copy ($old, $new) or die "File cannot be copied.";
I get the error that the file can't be copied.
I know there's nothing wrong with the copy command because if I set the value of $new to something simple without a path, it works. But what is wrong with the representation of the path as I've written it above? If I copy and paste it into the address bar of Windows Explorer, it reaches that folder fine.
Tip: print out the paths before you perform the copy. You'll see this:
C:SERSenovodocumentsmycompsimulationrszzzorganizedbyviewsinclination_000azimuthexuscopy.png
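Not what we wanted. The backslash is an escape character in Perl double-quoted strings, so a literal backslash needs to be escaped itself. If a backslash sequence does not form a valid escape, the backslash is silently dropped (unless warnings are enabled, in which case Perl reports an unrecognized escape). With escaped backslashes, your string would look like: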
"C:\\Users\\Lenovo\\Documents\\mycomp\\simulation\\cars\\zzzorganizedbyviews\\00inclination_000azimuth\\lexuscopy.png";
or just use forward slashes instead – in most cases, Unix-style paths work fine on Windows too.
Here is a list of escapes you accidentally used:
\U uppercases the rest
\L lowercases the rest
\ca is a control character (ASCII 1, the start of heading)
\00 is an octal character, here the NUL byte
\l lowercases the next character.
If no interpolation is intended, use single quotes instead of double quotes.
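The same single-versus-double-quote distinction exists in other languages, although the exact set of escapes differs. As a rough analogue in Ruby (not Perl, and with a hypothetical path chosen so that the escapes are easy to spot):

```ruby
# Hypothetical path; \t and \n are escapes inside double quotes.
double  = "C:\temp\new"   # becomes C: + TAB + emp + NEWLINE + ew
single  = 'C:\temp\new'   # backslashes are kept literally
forward = 'C:/temp/new'   # forward slashes avoid the problem entirely

puts double.inspect
puts single.inspect
puts forward
```

Printing the value with inspect plays the same role as the tip above: it makes the mangled characters visible before the path is used.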
In a malformed .csv file, there is a row of data with extra double quotes, e.g. the last line:
Name,Comment
"Peter","Nice singer"
"Paul","Love "folk" songs"
How can I remove the double quotes around folk and replace the string as:
Name,Comment
"Peter","Nice singer"
"Paul","Love _folk_ songs"
In Ruby 1.9, the following works:
result = subject.gsub(/(?<!^|,)"(?!,|$)/, '_')
Previous versions don't have lookbehind assertions.
Explanation:
(?<!^|,) # Assert that we're not at the start of the line or right after a comma
" # Match a quote
(?!,|$) # Assert that we're not at the end of the line or right before a comma
Of course this assumes that we won't run into pathological cases like
"Mary",""Oh," she said"
If you're not on Ruby 1.9, or just get tired of regexes sometimes, split the string on ,, strip the first/last quotes, replace remaining "s with _s, re-quote, and join with ,.
(We don't always have to worry about efficiency!)
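The split-and-rejoin steps above can be sketched roughly as follows. It assumes that no field contains a comma, since a comma inside a value would be split in the wrong place; the fix_inner_quotes name and the choice to leave unquoted fields untouched are assumptions for this example.

```ruby
# Replace quotes inside quoted fields, keeping the enclosing quotes.
def fix_inner_quotes(line, replacement = '_')
  fields = line.chomp.split(',', -1)   # -1 keeps trailing empty fields
  cleaned = fields.map do |field|
    quoted = field.length >= 2 && field.start_with?('"') && field.end_with?('"')
    next field unless quoted           # leave unquoted fields (e.g. the header) alone

    inner = field[1..-2]               # strip the first and last quote
    %("#{inner.gsub('"', replacement)}")
  end
  cleaned.join(',')
end

puts fix_inner_quotes('Name,Comment')
puts fix_inner_quotes('"Peter","Nice singer"')
puts fix_inner_quotes('"Paul","Love "folk" songs"')
```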
$str = '"folk"';
$new = str_replace('"', '', $str);
/* now $new is only folk, without " */
Meta-strategy:
It's likely that the data was entered manually and inconsistently. CSVs get messy when people type either field terminators (double quotes) or separators (commas) into the field itself. If you can have the file regenerated, ask for an extremely unlikely field begin/end marker, like five tildes (~~~~~); then you can split on "~~~~~,~~~~~" and get the correct number of fields every time.
Unless you have no other choice, get the file regenerated with correct escaping. Any other approach is asking for trouble, because the insertion of unescaped quotes is lossy, and thus cannot be reliably reversed.
If you can't get the file fixed from the source, then Tim Pietzcker's regex is better than nothing, but I strongly recommend that you have your script print all "fixed" lines and check them for errors manually.
Hey, I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash.
for example the following string:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"'),
which is obviously not good enough.
When I count the quotes in both of these examples, the result should be 1.
I have been searching here for similar questions and I've tried using lookbehinds, but cannot get them to work in Ruby.
I have tried the following regexes on Rubular from this previous question:
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them give me the results I'm after.
Maybe a regex is not the way to do that. Maybe a programmatic approach is the solution.
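How about string.count('"') - string.scan('\\"').size? (String#count takes a set of characters, not a substring, so it cannot count the two-character sequence \" directly.)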
result = subject.scan(
  /(?:        # match either
     ^        # start-of-string\/line
   |          # or
     \G       # the position where the previous match ended
   |          # or
     [^\\]    # one non-backslash character
   )          # then
   (?:\\\\)*  # match an even number of backslashes (0 is even, too)
   "          # match a quote
  /x)
gives you an array with one entry per unescaped quote (each possibly with a preceding non-backslash character and an even run of backslashes); escaped quotes are left out, so result.size is the count you are after.
The \G anchor is needed to match successive quotes, and the (\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account).
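For comparison, the programmatic approach the question mentions can be sketched as a single pass that tracks whether the previous character was an unconsumed backslash. The count_unescaped_quotes name is made up for this example, and the sketch assumes a backslash escapes exactly the one character after it.

```ruby
# Count double quotes that are not escaped by a preceding backslash.
def count_unescaped_quotes(str)
  count = 0
  escaped = false
  str.each_char do |ch|
    if escaped
      escaped = false   # this character is escaped, whatever it is
    elsif ch == '\\'
      escaped = true    # the next character is escaped
    elsif ch == '"'
      count += 1
    end
  end
  count
end

puts count_unescaped_quotes('"\"Some text')
puts count_unescaped_quotes('"\"Some \"text')
```

Because an escaped backslash clears the flag, a quote after an even number of backslashes is counted, which is the same rule the (\\\\)* part of the regex encodes.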