Web scraping: How do I handle euro signs? - xpath

So I'm trying to scrape prices of a product on a website, and their HTML looks like this:
<div class="pricing_price">€12.99</div>
I've written an XPath query that gets the price, and it returns a string like this:
€ 12.99.
If possible, I would like to just get the 12.99. What are my options? Should I use regular expressions? Or are there better/easier solutions?
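A regular expression is one reasonable option once the string is out of the page. The following is a minimal sketch in Ruby, assuming the price always uses a dot as the decimal separator; `extract_price` is a hypothetical helper name, not part of any library. Depending on the scraper, doing the work inside XPath 1.0 with something like normalize-space(substring-after(., "€")) may also be enough.

```ruby
require 'bigdecimal'

# Hypothetical helper: pull the first decimal number out of a scraped
# price string such as "€ 12.99". Returns nil when no number is found.
def extract_price(text)
  number = text[/\d+(?:\.\d+)?/]   # digits, optionally followed by .digits
  number && BigDecimal(number)     # BigDecimal avoids float rounding on money
end

price = extract_price("€ 12.99")
puts price.to_s('F')
```

Using BigDecimal rather than Float is a design choice for currency values; if only the text "12.99" is needed, the regex match on its own is sufficient.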

Related

Liquid - parse YAML front matter in string

I'm running a Jekyll site that uses JSON as data in my _data folder. I'm looping through the file like normal doing things like {% for item in site.data.resources.items %} just fine. However, I'd like to parse YAML front matter that is within a string.
Example:
\n---\nblog: http://google.com\nbackground-img: http://www.ew.com/sites/default/files/i/2013/07/23/Dumb-and-Dumber.jpg\nbuttonText: Download\n---\n
How can I have Liquid parse this within my Jekyll site so I can use it like so:
<img src="{{background-img}}">Image
or something similar?
EDIT: To clarify, that string is in front matter format in a text file that I'm retrieving through an Ajax call. So that string is the response I get back, and the format won't be changing. My hope was that Liquid could somehow parse this string and look for a front-matter-style format. If not, I will go back to my JavaScript methods.
This is impossible.
Liquid and YAML are parsed while the site is being generated, and your JSON string becomes available long after the site has been generated: it exists only once the JSON request for the string has succeeded.
However, you can use JavaScript, as you already mentioned. Simply split the string on \n to get the key-value pairs, then split each pair on the first : to separate key from value (the values here are URLs, which contain colons themselves). Then use jQuery (or plain JavaScript) to write the results to the DOM.
Good luck!
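The splitting steps described in this answer are language-independent. Here they are sketched in Ruby, to match the other examples on this page; in the browser the same logic would be written in JavaScript. `parse_front_matter` is a hypothetical helper name, and the sketch assumes flat key: value lines with no nesting or quoting.

```ruby
# Response text as given in the question (front matter delimited by ---).
response = "\n---\nblog: http://google.com\n" \
           "background-img: http://www.ew.com/sites/default/files/i/2013/07/23/Dumb-and-Dumber.jpg\n" \
           "buttonText: Download\n---\n"

# Hypothetical helper: turn flat "key: value" lines into a Hash.
def parse_front_matter(text)
  text.each_line.with_object({}) do |line, data|
    line = line.strip
    next if line.empty? || line == '---'   # skip blanks and delimiters
    key, value = line.split(':', 2)        # limit 2: split on the first colon only
    next if value.nil?                     # ignore lines without a colon
    data[key.strip] = value.strip
  end
end

data = parse_front_matter(response)
puts data['background-img']
```

Limiting the split to two parts is what keeps the http:// URLs intact.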

How do I use the XPath tokenizer function in Nokogiri?

I am attempting to extract information from the following HTML using Nokogiri and XPath.
<p>Friday, February 1<br><strong>Apple <br> Orange</strong></p>
e.xpath('./text()[following-sibling::br]')
Gives me the date just fine. I want to then grab the text inside the strong node and split on br. There may be many fruits separated by br or there may just be one with no br. I would ideally like to accomplish this in xpath instead of code since I'm essentially defining a bunch of parsers via JSON.
Right now I'm thinking that I should use the tokenizer function and pass the text in the strong tag. I thought that should look like this:
e.xpath('./strong[fn::tokenize(.,"<br>")]')
and have also tried
e.xpath('fn::tokenize(./strong,"<br>")')
but I am getting:
.../gems/nokogiri-1.5.6/lib/nokogiri/xml/node.rb:159:in `evaluate': Invalid expression: ./strong/text()[fn::tokenize(.,"br")] (Nokogiri::XML::XPath::SyntaxError)
I'm modeling my usage after the documentation for the method that the error occurs in (line 139):
node.xpath('.//title[regex(., "\w+")]',...

xpath to validate text before and after <br/>

The following is my HTML table structure, and I want to validate the complete text inside the td using XPath:
<tr><td>Sagar Nair<br/><b>Owner</b> - Verified</td></tr>
Can anyone help with this?
When the tr element in your example is the current element, then the XPath expression string(.) will have as its value the string you say you would like to validate. For the actual validation of the string you are going to need some language other than XPath; since you don't mention a programming language, however, I assume that once you get the string you know what to do with it.

Ruby Regex: Return just the match

When I do
puts /<title>(.*?)<\/title>/.match(html)
I get
<h2>foobar</h2>
But I want just
foobar
What's the most elegant method for doing so?
The most elegant way would be to parse HTML with an HTML parser:
require 'nokogiri'
html = '<title><h2>Pancakes</h2></title>'
doc = Nokogiri::HTML(html)
title = doc.at('title').text
# title is now 'Pancakes'
If you try to do this with a regular expression, you will probably fail. For example, if you have an <h2> in your <title> what's to prevent you from having something like this:
<title><strong>Where</strong> is <span>pancakes</span> <em>house?</em></title>
Trying to handle something like that with a single regex is going to be ugly but doc.at('title').text handles that as easily as it handles <title>Pancakes</title> or <title><h2>Pancakes</h2></title>.
Regular expressions are great tools but they shouldn't be the only tool in your toolbox.
Something of this style will return just the contents of the match.
html[/<title>(.*?)<\/title>/,1]
Maybe you need to tell us more, like what html might contain, but right now, you are capturing the contents of the title block, irrespective of the internal tags. I think that is the way you should do it, rather than assuming that there is an internal tag you want to handle, especially because what would happen if you had two internal tags? This is why everyone is telling you to use an html parser, which you really should do.
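To make the difference between the whole match and the capture group concrete, here is a small sketch using the markup from the question. The tag stripping at the end is a crude illustration under the assumption of simple, well-formed inner tags; it is not a substitute for the parser recommended above.

```ruby
html = '<title><h2>foobar</h2></title>'
pattern = /<title>(.*?)<\/title>/

match = pattern.match(html)
whole = match[0]   # entire matched text, <title> tags included
inner = match[1]   # first capture group only

# String#[] with a regex and a group index does the same in one step,
# and returns nil instead of raising when nothing matches.
shortcut = html[pattern, 1]

# Crude removal of inner tags, for illustration only.
text = inner.gsub(/<[^>]+>/, '')

puts whole, inner, shortcut, text
```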

Convert string with white space into URL

I'm using Ruby and Google's reverse-geocoding YQL table to automate some search queries. The problem I've hit is turning the query into a legal URL: the encoding I'm using returns illegal URLs. The query I'm running is as follows:
query="select * from google.geocoding where q='40.714224,-73.961452'"
pQuery= CGI::escape(query)
The eventual output for the processed query looks like this:
http://query.yahooapis.com/v1/public/yql?q=select+%2A+from+google.geocoding+where+q%3D%2740.3714224%2C--73.961452%27+format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=
Alas, the URL is illegal. When I check what the query should look like in the YQL console, I get the following:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20google.geocoding%20where%20q%3D%2240.714224%2C-73.961452%22&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=
As you can hopefully see :), the encoding is all wrong. Does anyone know how I can generate correct URLs?
If you want to escape a URI, you should use URI::escape (available through Ruby 2.7; it was removed in Ruby 3.0):
require 'uri'
URI.escape("select * from google.geocoding where q='40.714224,-73.961452'")
# => "select%20*%20from%20google.geocoding%20where%20q='40.714224,-73.961452'"
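On Rubies where URI.escape is not available, one possible alternative is URI.encode_www_form from the standard library. The sketch below assumes the parameter names and values shown in the question's URLs. It also joins the parameters with &, which the failing URL in the question appears to be missing before format=json.

```ruby
require 'uri'

query = "select * from google.geocoding where q='40.714224,-73.961452'"

# encode_www_form escapes each value and joins the pairs with "&".
params = URI.encode_www_form(
  {
    'q'           => query,
    'format'      => 'json',
    'diagnostics' => 'true',
    'env'         => 'store://datatables.org/alltableswithkeys'
  }
)

# Form encoding writes spaces as "+"; swap them for %20 to mirror the
# YQL console output. A literal "+" in a value is already %2B by now,
# so this replacement only touches encoded spaces.
params = params.gsub('+', '%20')

url = "http://query.yahooapis.com/v1/public/yql?#{params}"
puts url
```

Both + and %20 are commonly accepted for spaces in a query string, so the gsub step may be unnecessary for many servers; it is included only to match the console's format.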
