How to get meta tag content value in Watir? - ruby

I am not able to get the content value of a meta tag from a site in Ruby using the watir-webdriver gem.
e.g.
<meta property="og:title" content="【楽天市場】ダヴ メンプラスケア クリーンコンフォート泡洗顔 つめかえ用(110mL)【unili3e102】【ダヴ(Dove)】[ダヴ 洗顔]:爽快ドラッグ">

The problem with browser.meta(:property, 'og:title').content is that "property" is not a valid attribute for meta tags. As a result, Watir does not allow it as a locator method.
To locate elements via unsupported attributes, you will need to use a CSS-selector:
browser.meta(css: 'meta[property="og:title"]').content
Or use XPath:
browser.meta(xpath: '//meta[@property="og:title"]').content

Related

Scrapy selector outside html tag

I have a special case where a script tag is placed outside the html tag :
<html>
....
</html>
<script>data</script>
Both CSS and XPath selectors fail to find this script tag. The only way I found is to use response.text, but that returns one giant string, so I cannot run regex operations on it with the selector's re() function.
Is there a way to select tags outside the html tag with CSS or XPath?
I tried with
response.css('script')
But it only considers script tags inside the html tag.
Thanks
Correction:
The CSS selector does not consider tags outside the html tag, but XPath does.
I used a condition to filter for the tag:
response.xpath('//script[contains(., "function SelectItem()")]')

Parsing Meta Tag using xPath

How can I parse a Meta Tag such as
<meta itemprop="email" content="email@example.com" class="">
..and extract the email out of it.
When I copy the xPath of this tag, I get the following, which doesn't work
//*[@id="businessDetailsPrimary"]/div[2]/div/meta
Please advise.
Many thanks
The likelihood is that the itemprop="email" attribute will be unique across the webpage. In this case, you can select the email by accessing the content attribute via its XPath as follows:
//meta[@itemprop="email"]/@content
In case itemprop="email" is not unique, you can make your XPath more specific by selecting the element with id equal to businessDetailsPrimary first:
//*[@id="businessDetailsPrimary"]//meta[@itemprop="email"]/@content
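The two XPath expressions above can be tried out in Ruby. This is only a minimal sketch: it uses REXML from the Ruby distribution rather than any particular scraping library, it selects the meta element and then reads its content attribute instead of ending the path in /@content, and the wrapper div in the sample markup is an assumption modeled on the question.

```ruby
require 'rexml/document'

# Hypothetical markup modeled on the question; the wrapper div is assumed.
SAMPLE_PAGE = <<~XML
  <div id="businessDetailsPrimary">
    <meta itemprop="email" content="email@example.com" class=""/>
  </div>
XML

# Returns the content attribute of the first element matched by xpath,
# or nil when nothing matches.
def meta_content(markup, xpath)
  doc = REXML::Document.new(markup)
  element = REXML::XPath.first(doc, xpath)
  element && element.attributes['content']
end

# Unscoped lookup, relying on itemprop="email" being unique.
puts meta_content(SAMPLE_PAGE, '//meta[@itemprop="email"]')

# Scoped lookup, restricted to descendants of the element with the given id.
puts meta_content(SAMPLE_PAGE, '//*[@id="businessDetailsPrimary"]//meta[@itemprop="email"]')
```

REXML needs well-formed XML, so the meta tag is self-closed here; for real-world HTML a tolerant parser is the better choice.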

With AMP HTML, is it legitimate to set the link canonical href attribute to pound (#)?

Is it legitimate to set the canonical link to the pound symbol as shown below, or am I required to enter a physical page name?
<link rel="canonical" href="#">
When testing this, the pound setting does not generate a validation error (ala #development=1). In my scenario, the page using this layout file will not have an alternate "regular HTML" version. The only version will be the AMP HTML version.
For additional context, I'm experimenting with an MVC site that will use AMP HTML. To keep my layout file simple, I'd prefer to use the pound symbol rather than extracting the child page name and applying that to the href attribute. I know how to apply the URL to the partial view via code like so:
<link rel="canonical" href="@HttpContext.Current.Request.Url.AbsoluteUri">
I'm just curious if it's legitimate AMP HTML to use the pound symbol instead. Thank you.
From the documentation:
Required markup
AMP HTML documents MUST:
contain a <link rel="canonical" href="$SOME_URL" /> tag inside their head that points to the regular HTML version of the AMP HTML document or to itself if no such HTML version exists.
So instead of using href="#", you should have it point to itself in order to stay consistent with the AMP specifications.
Validation is evolving, and the validator doesn't catch every issue today. The issue with using "#" or any relative URL is that when this document is served elsewhere, such as cdn.ampproject.org, that relative URL will no longer point to your intended canonical. You should instead use an absolute URL: <link rel=canonical href="URL">.
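As a rough illustration of the "absolute URL" advice, here is a small Ruby sketch that checks whether a candidate href is an absolute http(s) URL before it is written into the canonical link. The helper name absolute_canonical? and the example URL are hypothetical; this is not part of AMP or of any validator.

```ruby
require 'uri'

# Returns true only when href parses as an absolute http or https URL.
# "#" and other relative references return false.
def absolute_canonical?(href)
  uri = URI.parse(href)
  uri.absolute? && %w[http https].include?(uri.scheme)
rescue URI::InvalidURIError
  false
end

puts absolute_canonical?('#')
puts absolute_canonical?('https://www.example.com/article.amp.html')
```

A check like this could run in a template helper or a test so that a fragment-only or relative canonical never reaches the rendered page.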

Parsing multiple elements with Mechanize

I have this in html:
<meta name='DC.creator' scheme='inventor' content='Chen Yonghong' />
<meta name='DC.creator' scheme='inventor' content='Chen Yuan' />
If I want to get first creator I can do it like this:
:author => page.at('meta[@name="DC.creator"]')[:content]
The question is, how do I get second one with mechanize selectors?
You can use:
page.search('meta[@name="DC.creator"]')[1][:content]
at is equivalent to search(...).first so using the same selector with search and grabbing the second element found will work as long as there are truly two tags that match. If not, you'll get an exception because you can't take the index of a nil value.
As an FYI, Mechanize uses Nokogiri internally to handle its HTML parsing and manipulation. Nokogiri supports both CSS and XPath selectors, so you can use whichever makes it easier to find the tag or element you want. I lean toward CSS for readability, but use both. See the Nokogiri tutorials for more information about searching.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<meta name='DC.creator' scheme='inventor' content='Chen Yonghong' />
<meta name='DC.creator' scheme='inventor' content='Chen Yuan' />
EOT
doc.search('meta[@name="DC.creator"]')[1][:content]
# => "Chen Yuan"
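One way to sidestep the nil-index exception mentioned above is to collect every matching content value first and index into the resulting array afterwards. The sketch below assumes REXML from the Ruby distribution instead of Mechanize or Nokogiri, so it runs without extra gems; the head wrapper element and the creators helper are made up for the example.

```ruby
require 'rexml/document'

# The two tags from the question, wrapped in a root element so the
# snippet is well-formed XML (the wrapper is an assumption).
CREATORS_XML = <<~XML
  <head>
    <meta name='DC.creator' scheme='inventor' content='Chen Yonghong' />
    <meta name='DC.creator' scheme='inventor' content='Chen Yuan' />
  </head>
XML

# Returns the content attribute of every DC.creator meta tag, in document order.
def creators(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//meta[@name="DC.creator"]')
              .map { |meta| meta.attributes['content'] }
end

names = creators(CREATORS_XML)
puts names[1]          # the second creator
puts names[5].inspect  # an out-of-range index gives nil instead of raising
```

Because the array holds plain strings, a missing entry shows up as nil rather than as a NoMethodError on nil.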

How to get a mail address from HTML code with Nokogiri

How can I get the mail address from HTML code with Nokogiri? I'm thinking in regex but I don't know if it's the best solution.
Example code:
<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:example@example.com">Mail to me</a>
</body>
</html>
Does a method exist in Nokogiri to get the mail address if it is not between some tags?
You can extract the email addresses using xpath.
The selector //a will select any a tags on the page, and you can specify the href attribute using the @ syntax, so //a/@href will give you the hrefs of all a tags on the page.
If there is a mix of a tags on the page with different URL types (e.g. http:// URLs), you can use XPath functions to further narrow down the selected nodes. The selector
//a[starts-with(@href, "mailto:")]/@href
will give you the href nodes of all a tags that have an href attribute that starts with "mailto:".
Putting this all together, and adding a little extra code to strip out the "mailto:" from the start of the attribute value:
require 'nokogiri'
selector = "//a[starts-with(@href, \"mailto:\")]/@href"
doc = Nokogiri::HTML.parse File.read 'my_file.html'
nodes = doc.xpath selector
addresses = nodes.collect {|n| n.value[7..-1]}
puts addresses
With a test file that looks like this:
<html>
<title>Example</title>
<body>
This is an example text.
<a href="mailto:example@example.com">Mail to me</a>
A Web link
<a>An empty anchor.</a>
</body>
</html>
this code outputs the desired example@example.com. addresses is an array of all the email addresses in mailto links in the document.
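For comparison, here is a gem-free sketch of the same idea. It uses REXML from the Ruby distribution, selects every a element, and does the "mailto:" filtering in Ruby rather than with the XPath starts-with function. The href values in the sample markup are assumptions, and mailto_addresses is a hypothetical helper name.

```ruby
require 'rexml/document'

# Sample page; the href values are assumed for illustration.
MAIL_PAGE = <<~XML
  <html>
    <body>
      <a href="mailto:example@example.com">Mail to me</a>
      <a href="http://example.com/">A Web link</a>
      <a>An empty anchor.</a>
    </body>
  </html>
XML

# Returns the addresses found in mailto: links, without the scheme
# and without any ?subject=... style query string.
def mailto_addresses(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//a')
              .map { |a| a.attributes['href'] }
              .compact                                   # drop anchors with no href
              .select { |href| href.start_with?('mailto:') }
              .map { |href| href.delete_prefix('mailto:').split('?').first }
              .compact                                   # drop a bare "mailto:"
end

puts mailto_addresses(MAIL_PAGE)
```

Using delete_prefix instead of slicing from index 7 keeps the intent visible and avoids a magic number.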
I'll preface this by saying that I know nothing about Nokogiri. But I just went to their website and looked at the documentation and it looks pretty cool.
If you add an email_field class (or whatever you want to call it) to your email link, you can modify their example code to do what you are looking for.
require 'nokogiri'
require 'open-uri'
# Get a Nokogiri::HTML::Document for the page we’re interested in...
doc = Nokogiri::HTML(URI.open('http://www.yoursite.com/your_page.html'))
# Do funky things with it using Nokogiri::XML::Node methods...
####
# Search for nodes by css
doc.css('.email_field').each do |email|
  # assuming you have more than one, do something with all your email fields here
end
If I were you, I would just look at their documentation and experiment with some of their examples.
Here's the site: http://nokogiri.org/
CSS selectors can now (finally) match text at the beginning of an attribute value:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
blah
blah
EOT
doc.at('a[href^="mailto:"]')
.to_html # => "blah"
Nokogiri tries to track the jQuery extensions. I used to have a link to a change-notice or message from one of the maintainers talking about it but my mileage has varied.
See "CSS Attribute Selectors" for more information.
Try to get the whole html page and use regular expressions.
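A minimal sketch of the regular-expression route, assuming the page source is already available as a string. The pattern is deliberately simple and is an assumption of this example, not a complete matcher for every valid address; scan_emails is a hypothetical helper name.

```ruby
# Simplified address pattern: local part, "@", then a dotted domain.
# It will miss unusual but valid addresses.
EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

# Returns each distinct address-like string found in text, in order of appearance.
def scan_emails(text)
  text.scan(EMAIL_PATTERN).uniq
end

html = 'Contact: <a href="mailto:example@example.com">Mail to me</a> or sales@example.org.'
puts scan_emails(html)
```

Unlike the selector-based approaches, this also picks up addresses that appear as plain text outside any tag, at the cost of possible false positives.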
