I need to read some test data from an HTML document. The problem is that some non-English characters appear there as HTML codes (e.g. &#216; - Ø). How can I change those into a single character? Later I'll need to compare these characters to what the user enters in a web form.
I'm trying to do this in Ruby 1.9.2.
Thanks in advance
This question has come up on SO many times, but I can't find the earlier answers. From memory:
require 'cgi'
some_string = '&#216;&amp;&gt;'
p CGI.unescapeHTML(some_string).gsub(/&#(\d+);/){[$1.to_i].pack 'U'}
=> "\u00D8&>"
\u00D8 is your symbol. &amp;&gt; are there just to show what CGI::unescapeHTML does.
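A minimal sketch of the comparison step the question mentions, assuming Ruby 2.2 or later and UTF-8 strings. `decode_entities` and `same_text?` are hypothetical helper names, not part of the answer above. Recent versions of `CGI.unescapeHTML` decode decimal and hexadecimal references on their own, but only a handful of named entities such as `&amp;`, `&lt;` and `&gt;`, so a name like `&Oslash;` would need a library such as htmlentities.

```ruby
require 'cgi'

# Hypothetical helper: turn HTML character references into real characters.
def decode_entities(text)
  CGI.unescapeHTML(text)
end

# Hypothetical helper: compare decoded page text with form input.
# NFC normalization makes "e" + combining accent equal to the precomposed "é".
def same_text?(html_text, user_input)
  decode_entities(html_text).unicode_normalize(:nfc) ==
    user_input.unicode_normalize(:nfc)
end

puts decode_entities('&#216;&amp;&gt;')   # prints Ø&>
puts same_text?('&#233;', "e\u0301")      # prints true
```

Normalizing both sides is optional, but form input can arrive in decomposed form depending on the browser and keyboard, so it is a cheap safeguard before comparing.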
So when you view the source of an email, there are several characters in there that should be converted back to UTF-8 by the email client.
For example, in Outlook, a source email may contain =C2=A9 which converts to the copyright symbol.
In Ruby, is there a way to find these kinds of characters/patterns and convert them to HTML so that they display correctly in an HTML form? For example, taking something like =C2=A9 and converting it into its associated HTML entity, &copy;?
There are two steps. First, the original format using = is called "quoted-printable"; decode it with unpack("M") and force the result to UTF-8 encoding. Then use the htmlentities gem to convert the characters to HTML entities. Here is an example:
require 'htmlentities'
coder = HTMLEntities.new
string = '=C2=A9'.unpack("M").first.force_encoding('UTF-8')
coder.encode(string) # => "©"
coder.encode(string, :named) # => "&copy;"
I hope you find that helpful.
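For cases where adding a gem is not an option, here is a rough standard-library sketch of the same two steps. `qp_to_entities` is a hypothetical helper name. It assumes the decoded bytes are valid UTF-8, emits decimal references only (never named ones like the gem's `:named` mode), and does not escape `&`, `<` or `>`.

```ruby
# Decode a quoted-printable fragment and re-encode non-ASCII characters
# as decimal character references, using only the standard library.
def qp_to_entities(quoted_printable)
  # unpack("M") returns a binary string, so tag it as UTF-8 afterwards.
  text = quoted_printable.unpack('M').first.force_encoding('UTF-8')
  text.gsub(/[^\x00-\x7F]/) { |char| "&##{char.ord};" }
end

puts qp_to_entities('=C2=A9 2011 Example')  # prints &#169; 2011 Example
```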
I am trying to write documentation with asciidoctor-pdf and I need to use characters like ă, â, î, ş, ţ. The PDF output is generated, but these characters are rendered as blanks. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc@example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
The result was the word Introducţie rendered as "Introduc ie", followed by this warning:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Could this be a system encoding configuration problem?
Do I need to set a different encoding configuration in Ruby?
Thank you.
I think that if you want to be sure, you can always use the decimal entity reference form. For the Latin small letter T with cedilla it is: &#355;
Check this table for the complete list:
List of Unicode characters
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
One possible way to write such special characters in titles is to declare them as attributes in the preamble of your AsciiDoc document, for example,
:t-cedil: &#355;
and then to reference the attribute in the main text
== pass:normal[Test-{t-cedil}]
So your title will look like
Test-ţ
I am trying to parse some data and am having trouble removing a symbol. I know that it is just a "space", but I really can't manage to strip it from the string.
My code:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('my_page.hmtl')
price = page.search('#product_buy .price').text.to_s.gsub(/\s+/, "").gsub(" ","").gsub(" ", "")
puts price
As a result I always get "4 162", with the space still in it. I don't know what to do.
Has anyone run into this issue before? Thank you.
HTML escape codes don't mean anything to Ruby's regex engine. Looking for "&thinsp;" will look for those literal characters, not a thin space. Instead, Ruby >= 1.9 supports Unicode escapes in strings, meaning that you can use the Unicode code point corresponding to a thin space to make your substitution. The Unicode code point for a thin space is 0x2009, meaning that you can reference it in a double-quoted Ruby string as \u2009.
Additionally, instead of calling some_string.gsub('some_string', ''), you can just call some_string.delete('some_string').
Note that this isn't appropriate for all situations, because delete removes all instances of all characters appearing in the intersection of its arguments, while gsub will remove only segments matching the pattern provided. For example, 'hellohi'.gsub('hello', '') == "hi", while 'hellohi'.delete('hello') == 'i'.
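In your specific case, I'd use something like the following (double quotes are required, because \u2009 is not interpreted as an escape inside single quotes):
price = page.search('#product_buy .price').text.delete("\u2009 ")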
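A small sketch of the points above, with `price_text` as a hypothetical stand-in for the text returned by `page.search(...).text`. It shows the double-quoted `\u2009` escape, a regex alternative that catches any Unicode whitespace, and the difference between `gsub` and `delete`.

```ruby
# Thin space between the digits, plain space at the end.
price_text = "4\u2009162 "

# delete removes every character listed in its argument.
puts price_text.delete("\u2009 ")         # prints 4162

# [[:space:]] is Unicode-aware, unlike \s, so it also matches the thin space.
puts price_text.gsub(/[[:space:]]/, '')   # prints 4162

# gsub removes whole matches; delete removes each listed character everywhere.
puts 'hellohi'.gsub('hello', '')          # prints hi
puts 'hellohi'.delete('hello')            # prints i
```

The `[[:space:]]` form is the more forgiving choice when the exact kind of space on the page is unknown, for example a non-breaking space instead of a thin one.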
My database results contain accented characters (áéíóúàâêô...), and when I display any of these characters I get codes like:
&#225;
My controller is like this:
ViewBag.EstadosDeAlma = (from e in db.EstadosDeAlma select e.Title).ToList();
My cshtml page is like this:
var data = '@foreach (dynamic item in ViewBag.EstadosDeAlma){ @(item + " ") }';
In addition, if I use any rich text editor as Tiny MCE all non-latin characters are like this too.
What should I do to avoid this problem?
What output encoding are you using on your web pages? I would suggest using UTF-8 since you want a lot of non-ascii characters to work.
I think you should HTML encode/decode the values before comparing them.
Since you are using jQuery, you can let it do the decoding for you. For example:
$('<div/>').html('&#225;gil').html()
gives you "ágil".
This other question has more information about this.
HTML-encoding lost when attribute read from input field
I am using Nokogiri to parse an HTML page, but I am having odd problems with non-breaking spaces. I tried different encodings, replacing the whitespace, and a few other headache inducing attempts.
Here is the HTML snippet in question:
<td>Amount 15,300 at dollars</td>
Note the change for the representation after I use Nokogiri:
<td>Amount 15,300 at dollars</td>
And outputting the inner_text:
Amount 15,300 at dollars
This is my base Nokogiri grab, I did try a few alternatives to solve but failed miserably:
doc = Nokogiri::HTML(open(url))
And then I do a doc.search for the item in question.
Note that if I look at the doc, the line shows up with the &nbsp; on that line.
Clarification: I do not think I clearly stated the difficulty I am having. I can't get the inner_text to show up without the strange Â symbol.
Unless you really, really want to keep the &nbsp; notation, there shouldn't be a problem here.
A0 is the hex character code for a non-breaking space. As such, &#xA0; prints a non-breaking space, and is exactly equivalent to &nbsp;. &#160; does the same thing, too.
What Nokogiri is doing here is reading the text node, recognizing the entities, and converting them to their actual string representation internally. Then, when converting it back to an HTML-friendly version of the text node, it represents the non-breaking space by its hex code, rather than taking the performance overhead of looking it up in an entity table, since it's equivalent, anyway.
Assuming that Â was what you were seeing and wasn't just an issue pasting into StackOverflow, this is a text encoding issue: the output software (browser?) isn't in UTF-8 mode, so doesn't know how to handle character code A0, so does the best it can. If this is a browser, adding <meta charset="utf-8"> to the head will solve this issue, and will make the rest of the output more Unicode-friendly.
If you really, really want &nbsp;, use gsub to replace them in your final output. Otherwise, don't worry about it.
I know this is old, but it took me an hour to find out how to solve this problem, and it is really easy once you know. Just pass your string to this function and it will be "de-nbsp-fied".
def strip_html(str)
  # Let Nokogiri decode the entity to get the actual non-breaking space character
  nbsp = Nokogiri::HTML("&nbsp;").text
  str.gsub(nbsp, '')
end
You could also replace it with a space if you wished. May many of you find this answer!
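The same clean-up can be sketched without parsing a throwaway document, because the non-breaking space is simply the character U+00A0. `NBSP` and `normalize_nbsp` are hypothetical names, and `inner_text` stands in for whatever `inner_text` returns from the parsed node. This variant replaces the character with an ordinary space rather than removing it.

```ruby
# The character that &nbsp;, &#160; and &#xA0; all decode to.
NBSP = "\u00A0"

# Hypothetical helper: replace non-breaking spaces with ordinary spaces.
def normalize_nbsp(text)
  text.tr(NBSP, ' ')
end

inner_text = "Amount 15,300#{NBSP}at dollars"  # stand-in for node.inner_text
puts normalize_nbsp(inner_text)                # prints Amount 15,300 at dollars
```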
As @sawa says, the main problem is what you see when writing to the console. It's not correctly displaying the non-breaking space after Nokogiri converts it to the appropriate binary value.
The usual way to fix the problem is to preprocess the content:
require 'nokogiri'
html = '<td>Amount 15,300 at dollars</td>'
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(/&(?:#xa0|#160|nbsp);/i, ' '))
puts doc.to_html
Which outputs:
<td>Amount 15,300 at dollars</td>