Tokenise lines with quoted elements in Ruby - ruby

I need to tokenise strings in Ruby - string.split is almost perfect, except some of the strings may be enclosed in double-quotes, and within them, whitespace should be preserved. In the absence of lex for Ruby (correct?), writing a character-by-character tokenizer seems silly. What are my options?
I want a loop that's essentially:
while !file.eof:
line = file.readline
tokens = line.tokenize() # like split() but handles "some thing" as one token
end
I.e an an array of white-space delimited fields, but with correct handling of quoted sequences. Note there is no escape sequence for the quotes I need to handle.
The best I can imagine so far, is repeatedly match()ing a reg-exa which matches either the quotes sequence or everything until the next whitespace character, but even then I'm not sure how to formulate than neatly.

Like Andrew said the most straightforward way is parse input with stock CSV library and set appropriate :col_sep and :quote_char options.
If you insist to parse manually you may use the following pattern in a more ruby way:
file.each do |line|
tokens = line.scan(/\s*("[^"]+")|(\w+)/).flatten.compact
# do whatever with array of tokens
end

split accepts a regex so you could just write the regexp you want and call split on the line you just read.
line.split(/\w+/)

Try using Ruby's CSV library, and use a space (" ") as the :col_sep
:col_sep
The String placed between each field. This String will be transcoded
into the data’s Encoding before parsing.

Related

How to escape both " and ' when importing each row

I import a text file and save each row as a new record:
CSV.foreach(csv_file_path) do |row|
# saving each row to a new record
end
Strangely enough, the following escapes double quotes, but I have no clue how to escape different characters:
CSV.foreach(csv_file_path, {quote_char: "\""}) do |row|
How do I escape both the characters " and '?
Note that you have additional options available to configure the CSV handler. The useful options for specifying character delimiter handling are these:
:col_sep - defines the column separator character
:row_sep - defines the row separator character
:quote_char - defines the quote separator character
Now, for traditional CSV (comma-separated) files, these values default to { col_sep: ",", row_sep: "\n", quote_char: "\"" }. These will satisfy many needs, but not necessarily all. You can specify the right set to suit your well-formed CSV needs.
However, for non-standard CSV input, consider using a two-pass approach to reading your CSV files. I've done a lot of work with CSV files from Real Estate MLS systems, and they're basically all broken in some fundamental way. I've used various pre- and post-processing approaches to fixing the issues, and had quite a lot of success with files that were failing to process with default options.
In the case of handling single quotes as a delimiter, you could possibly strip off leading and trailing single quotes after you've parsed the file using the standard double quotes. Iterating on the values and using a gsub replacement may work just fine if the single quotes were used in the same way as double quotes.
There's also an "automatic" converter that the CSV parser will use when trying to retrieve values for individual columns. You can specify the : converters option, like so: { converters: [:my_converter] }
To write a converter is pretty simple, it's just a small function that checks to see if the column value matches the right format, and then returns the re-formatted value. Here's one that should strip leading and trailing single quotes:
CSV::Converters[:strip_surrounding_single_quotes] = lambda do |field|
return nil if field.nil?
match = field ~= /^'([^']*)'$/
return match.nil? ? field : match[1]
end
CSV.parse(input, { converters: [:strip_surrounding_single_quotes] }
You can use as many converters as you like, and they're evaluated in the order that you specify. For instance, to use the pre-defined :all along with the custom converter, you can write it like so:
CSV.parse(input, { converters: [:all, :strip_surrounding_single_quotes] }
If there's an example of the input data to test against, we can probably find a complete solution.
In general, you can't, because that will create a CSV-like record that is not standard CSV (Wikipedia has the rules in a bit easier to read format). In CSV, only double quotes are escaped - by doubling, not by using a backslash.
What you are trying to write is not a CSV; you should not use a CSV library to do it.

Search and Replace ENTIRE WORDS ONLY in VB

A word processor program features a search and replace function. However, partial words (character combinations found within words) are also replaced. To fix this, I plan to remove extra spaces and use the split function to change the string into an array of words by using " " as a delimiter.
However, once I search through the array, replace the appropriate words, and put the array back into a string separated by spaces, the original formatting of the user will be lost. For example, if the original string was "This is a sentence." and the user wanted "a" to be replaced with "the", the output will be "This is the sentence.", with no additional spaces.
So, my question is whether there is any way to search and replace entire words only while still preserving the formatting (extra spaces) of the user in Visual Basic.
What about using a regex?
In a regex the code \b is a word boundary so for example the regex \ba\b will match a only when a is a whole word.
So for example your code would be:
Dim strPattern As String: strPattern = "\ba\b"
Dim regex As New RegExp
regex.Global = True
regex.Pattern = strPattern
result = regex.Replace("This is a sentence.", "the")
If you use the Split function without removing your extra spaces first your array will have empty items in it so you would not lose the extra spaces and can reconstruct your document with the original formatting in tact.
Why is your formatting lost? If you split the text by space, just attach a space after each element when composing it back from an array. But you will also have to take into account words that end not with a space but punctuation.
in "This is a simple sentence, eh?", "eh" will be stored as "eh?" because u split by space. So you will have to program a complex punctuation-friendly formula or simply use regex. Be prepared - regex is... tricky.

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
#hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.
The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes
This should work:
#hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
#hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
How about this:
#hashtags ||=string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
Takes cares of #alphabets or #2323+2323 #2323-2323 #2323+65656-67676
Also removes # at beginning
Or if you want it in array form:
#hashtags ||=string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect{|x| x[1..-1]}
Wow, this took so long but I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but not scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) in my computer. The difference being the () on the 2nd scan statement. This has no effect as it should be when I use with match method.

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen have no way to specify zero-width lookbehinds. The most generic solution is to group what the part of regexp before and the one after the real matching part, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
$1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/!(\[.*?\])\((.*?)\)/, '\1(/folder1/\2)' )
p s # This is a [foto](/folder1/foto.jpeg)
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
image.gsub(/(?<=\()(.*)(?=\))/) do |link|
"/a/new/path/" + link
end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regex expression subject to the following condition. Regexp expressions in lookbehinds they have to be fixed width due to limitations on the regex implementation, which means that they can't include expressions with an unknown number of repetitions or alternations with different-width choices. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads though).
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
As a side note, some people think '\1' inappropriate for situations where an unconfirmed number of characters are matched. For example, if you want to match and modify the middle content, how can you protect the characters on both sides?
It's easy. Put a bracket around something else.
For example, I hope replace a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png. Remove context between last "-" and last ".".
I can use /.*(-.*?)\./ match -531070. Now how should I replace it? Notice
everything else does not have a definite format.
The answer is to put brackets around something else, then protect them:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want add something before matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"

String#split in Ruby not behaving as expected

File.open(path, 'r').each do |line|
row = line.chomp.split('\t')
puts "#{row[0]}"
end
path is the path of file having content like name, age, profession, hobby
I'm expecting output to be name only but I am getting the whole line.
Why is it so?
The question already has an accepted answer, but it's worth noting what the cause of the original problem was:
This is the problem part:
split('\t')
Ruby has several forms for quoted string, which have differences, usually useful ones.
Quoting from Ruby Programming at wikibooks.org:
...double quotes are designed to
interpret escaped characters such as
new lines and tabs so that they appear
as actual new lines and tabs when the
string is rendered for the user.
Single quotes, however, display the
actual escape sequence, for example
displaying \n instead of a new line.
Read further in the linked article to see the use of %q and %Q strings. Or Google for "ruby string delimiters", or see this SO question.
So '\t' is interpreted as "backslash+t", whereas "\t" is a tab character.
String#split will also take a Regexp, which in this case might remove the ambiguity:
split(/\t/)
Your question was not very clear
split("\n") - if you want to split by lines
split - if you want to split by spaces
and as I can understand, you do not need chomp, because it removes all the "\n"

Resources