Gsub and regular expression - ruby

I have a web page. The HTML source contains this text:
<meta property="og:title" content="John"/>
John is an example, the name may vary.
I am sure that og:title will appear only once in the text.
This is my code:
$browser.goto( url )
x = $browser.html.gsub( /^.*<meta property="og:title" content="(.+?)".>/m, '\1' )
I expected to find the name John in my variable x.
The '\1' should give me the first group I put in parentheses, i.e. (.+?), i.e. John, right?
Also, I used a dot . to match a slash /; is there a better way?

Using Watir API:
x = browser.meta.attribute_value "content"
I was not able to access the meta element using either CSS or XPath.

If you only want the value of content:
html = '<meta property="og:title" content="John"/>'
=> "<meta property=\"og:title\" content=\"John\"/>"
html[/property="og:title" content="([^"]+)"/, 1]
=> "John"
If you're not familiar with regex, "([^"]+)" might throw you. It means: after the opening ", capture one or more characters that are not ", stopping before the closing ". In effect it means "grab everything inside the double quotes".

That code will return all of the HTML, with the matching code (which is everything between the start of the string up to and including the />) replaced by 'John'. So that comes down to "John", followed by the HTML that was after the /> of that meta property.
If you only want to extract the name, and that tag occurs only once, you can use something like:
$browser.html =~ /<meta property="og:title" content="(.+?)"/
x = $1
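The difference between replacing and extracting can be seen on a small inline document. This is a minimal sketch: the html string is a made-up stand-in for $browser.html, not the real page.

```ruby
# Hypothetical stand-in for $browser.html (assumption: the real page is larger).
html = "<html><head>\n" \
       "<meta property=\"og:title\" content=\"John\"/>\n" \
       "</head><body>Hi</body></html>"

# gsub replaces only the matched prefix and returns the rest of the string unchanged.
replaced = html.gsub(/^.*<meta property="og:title" content="(.+?)".>/m, '\1')

# String#[] with a capture index returns just the first group (nil if no match).
extracted = html[/<meta property="og:title" content="(.+?)"/, 1]

puts replaced.inspect
puts extracted.inspect
```

As for the slash, it can be matched literally by escaping it as \/ inside /.../, or by writing the pattern with %r{...} so that no escaping is needed.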

Related

Using Scrapy to Scrape Content after a particular keyword/string

I'm trying to scrape content after the occurrence of a particular keyword/string.
Suppose the HTML is as follows:
<meta property="og:url" content="https://www.example.com/tshirt/pcid111-31">
<meta property="og:url" content="https://www.example.com/tshirt/pcid3131-33">
<meta property="og:url" content="https://www.example.com/tshirt/pcid545424524-84">
1) How can I extract all the data inside the content attribute of the elements whose property="og:url"?
2) I also want to extract anything that comes after pcid. Can someone suggest a way to do this?
Not sure if this would work:
item["example"] = sel.xpath("//meta[@property='og:url']/text()").extract()[0].replace("*pcid", "")
Does replace accept wildcard characters?
This will extract the content attributes of elements whose property="og:url":
og_urls = response.xpath("//meta[@property='og:url']/@content").extract()
For extracting parts of the URL it's usually best to use a regex. In your case it would be:
import re

for url in og_urls:
    # "pcid(.+)" captures every character after 'pcid' (greedy)
    ids = re.findall("pcid(.+)", url)
    # re.findall() returns a list; you probably want only the first match, and there will most likely be only one anyway
    pcid = ids[0] if ids else ''
    print(pcid)
Or you can split the URL at 'pcid' and take the last part, e.g.
for url in og_urls:
    pcid = url.split('pcid')[-1]
    print(pcid)
Try this:
x = len(hxs.select("//meta/@content").extract())
for i in range(x):
    print hxs.select("//meta/@content").extract()[i].split('pcid')[1]
Output:
111-31
3131-33
545424524-84

Replacing <a> tags that have two pairs of double quotes

I have asked a similar question before, but this one is slightly different.
I have content containing this sort of link:
Professor Steve Jackson
[UPDATE]
And this is how I read it:
content = doc.xpath("/wcm:root/wcm:element[@name='Body']").inner_text
The link has two pairs of double quotes after the href=.
I am trying to strip out the tag and retrieve only the text, like so:
Professor Steve Jackson
To do this I'm using the same method which works for this sort of link which has only a single pair of double quotes:
World
This returns World:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^="ssLINK"]')
.each{|a| a.replace("<>#{a.content}</>")}
=>World
When I try to do the same for the link that has two pairs of double quotes, it complains:
content = Nokogiri::XML.fragment(content_with_link)
content.css('a[href^=""ssLINK""]')
.each{|a| a.replace("<>#{a.content}</>")}
Error:
/var/lib/gems/1.9.1/gems/nokogiri-1.6.0/lib/nokogiri/css/parser_extras.rb:87:in
`on_error': unexpected 'ssLINK' after '[:prefix_match, "\"\""]' (Nokogiri::CSS::SyntaxError)
Anyone know how I can overcome this issue?
I can suggest two ways to do it, depending on whether every <a> tag has an href enclosed in two pairs of double quotes or only the ones with ssLINK do.
Assume:
output = []
input_text = 'Professor Steve Jackson'
1) If only the a tags with ssLINK have an href with "", then just do:
Nokogiri::HTML(input_text).css('a[href=""]').each do |nokogiri_obj|
  output << nokogiri_obj.text
end
# => output = ["Professor Steve Jackson"]
2) If all the a tags have an href with "", then you can try this:
nokogiri_a_tag_obj = Nokogiri::HTML(input_text).css('a[href=""]')
nokogiri_a_tag_obj.each do |nokogiri_obj|
  output << nokogiri_obj.text if nokogiri_obj.has_attribute?('sslink')
end
# => output = ["Professor Steve Jackson"]
With this second approach, if
input_text = 'Professor Steve Jackson Some other TextSecond link'
then the output will still be ["Professor Steve Jackson"].
Your content is not XML, so any attempt to solve the problem using XML tools such as XSLT and XPath is doomed to failure. Use a regex approach, e.g. awk or Perl. However, it's not immediately obvious to me how to match
<a href="" sometext"">
without also matching
<a href="" sometext="">
so we need to know a bit more about this syntax that you are trying to parse.

Searching for tags while parsing Wordpress XML with Nokogiri

I have an XML file of a Wordpress blog that consists of quotes:
<item>
<title>Brothers Karamazov</title>
<content:encoded><![CDATA["I think that if the Devil doesn't exist and, consequently, man has created him, he has created him in his own image and likeness."]]></content:encoded>
<category domain="post_tag" nicename="dostoyevsky"><![CDATA[Dostoyevsky]]></category>
<category domain="post_tag" nicename="humanity"><![CDATA[humanity]]></category>
<category domain="category" nicename="quotes"><![CDATA[quotes]]></category>
<category domain="post_tag" nicename="the-devil"><![CDATA[the Devil]]></category>
</item>
The things I'm trying to extract are title, author, content and tags. Here's my code so far:
require "rubygems"
require "nokogiri"
doc = Nokogiri::XML(File.open("/Users/charliekim/Downloads/quotesfromtheunderground.wordpress.2013-04-14.xml"))
doc.css("item").each do |item|
  title = item.at_css("title").text
  tag = item.at_xpath("category").text
  content = item.at_xpath("content:encoded").text
  # each post will later be pushed to an array, but I'm not worried about that yet, so for now....
  puts "#{title} #{tag}"
end
I'm struggling to get all the tags from each item. I'm getting output like Brothers Karamazov Dostoyevsky. I'm not worried about how it's formatted, as it's only a test to see that it's picking things up correctly. Does anyone know how I can go about this?
I also want to make tags that are capitalized = Author, so if you know how to do that it would help, too, although I haven't even tried it yet.
EDIT: I changed the code to this:
doc.css("item").each do |item|
  title = item.at_css("title").text
  content = item.at_xpath("content:encoded").text
  tag = item.at_xpath("category").each do |category|
    category
  end
  puts "#{title}: #{tag}"
end
which returns:
Brothers Karamazov: [#<Nokogiri::XML::Attr:0x80878518 name="domain" value="post_tag">, #<Nokogiri::XML::Attr:0x80878504 name="nicename" value="dostoyevsky">]
and which seems a bit more manageable. It screws up my plans for taking the Author from a capitalized tag, but, well, it's not so big of a deal. How could I pull just the second value?
You're using at_xpath and expecting it to return more than one result, when the at_ methods only return the first result.
You want something like:
tags = item.xpath("category").map(&:text)
which will return an array.
As for identifying the author, you can use a regex to select the items that start with a capital letter:
author = tags.select{|w| w =~ /^[A-Z]/}
Which will choose any capitalized tags. This leaves the tags untouched. If you wanted instead to separate the authors from the tags, you can use partition:
author, tags = item.xpath("category").map(&:text).partition{|w| w =~ /^[A-Z]/}
Note that in the above examples, author is an array and will contain all matching items (i.e. more than one capitalized tag).
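The partition step can be tried on its own without the XML file. A minimal sketch, assuming the tag texts for the sample item have already been collected into an array (the all_tags and authors names are illustrative only):

```ruby
# Tag texts as they would come back from item.xpath("category").map(&:text)
# for the sample <item> in the question (hard-coded here so no XML file is needed).
all_tags = ["Dostoyevsky", "humanity", "quotes", "the Devil"]

# partition returns two arrays: elements for which the block is truthy, then the rest.
authors, tags = all_tags.partition { |w| w =~ /^[A-Z]/ }

# authors is an array; take the first entry if a single author string is wanted.
author = authors.first

puts author
puts tags.inspect
```

Note that "the Devil" stays with the tags because only the first character is tested.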

Using Xpath and HtmlAgilityPack to find all elements with innertext containing a specific word or words

I am trying to build a simple search-engine using HtmlAgilityPack and Xpath with C# (.NET 4).
I want to find every node containing a user-defined search word, but I can't seem to get the XPath right.
For Example:
<HTML>
<BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY>
</HTML>
If the specified searchword is "Mr T" I'd want the following nodes: <H1>, The second <div>, <H2> and the second <p>.
I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.
Any hints to get me in the right direction would be very appreciated.
Use:
//*[text()[contains(., 'Mr T')]]
This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.
This can also be written shorter as:
//text()[contains(., 'Mr T')]/..
This selects the parent(s) of any text node that contains the string 'Mr T'.
In XPath, to find elements whose text contains a specific keyword, use this form ('keyword' is the word you want to search for):
//*[text()[contains(., 'keyword')]]
Use the same form in C#, where keyword is the string variable holding the search term:
doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
Use the following:
doc.DocumentNode.SelectNodes("//*[contains(text()[1], '" + searchword + "')]")
This selects all elements (*) whose first text child (text()[1]) contains the searchword.
Case-insensitive solution:
var xpathForFindText =
"//*[text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '" + lowerFocusKwd + "')]]";
var result=doc.DocumentNode.SelectNodes(xpathForFindText);
Note:
Be careful: lowerFocusKwd must not contain a single quote ('), because it would terminate the string literal early and make the XPath expression invalid.
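One common way around the quote restriction is to choose the delimiter based on the keyword, and to fall back to XPath's concat() when the keyword contains both quote types. A minimal sketch in Ruby: xpath_literal is a hypothetical helper name, not part of HtmlAgilityPack or Nokogiri, and the same logic would have to be ported to C#.

```ruby
# Hypothetical helper: turn an arbitrary string into a valid XPath 1.0 string expression.
def xpath_literal(s)
  return "'#{s}'" unless s.include?("'")    # no single quote: wrap in single quotes
  return "\"#{s}\"" unless s.include?('"')  # no double quote: wrap in double quotes
  # Both quote types present: split on ' and rejoin the pieces with concat().
  pieces = s.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{pieces.join(%q(, "'", ))})"
end

keyword = "Mr T's"
xpath = "//*[text()[contains(., #{xpath_literal(keyword)})]]"
puts xpath
```

The resulting string can then be passed to the XPath engine in place of the hand-concatenated expression.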

To get text after the tag, containing another text

For example:
<p>
<b>Member Since:</b> Aug. 07, 2010<br><b>Time Played:</b> <span class="text_tooltip" title="Actual Time: 15.09:37:06">16 days</span><br><b>Last Game:</b>
<span class="text_tooltip" title="07/16/2011 23:41">1 minute ago</span>
<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br><b>Frags / Deaths:</b> 26,955 / 42,553<br><b>Hits / Shots:</b> 690,695 / 4,229,566<br><b>Accuracy:</b> 16%<br>
</p>
I want to get 1,017. It is the text that follows the tag containing the text Wins:.
If I used regex, it would be [/<b>Wins:<\/b> ([^<]+)/,1], but how do I do it with Nokogiri and XPath?
Or would it be better to parse this part of the page with regex?
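For reference, the regex approach mentioned above can be sketched in plain Ruby on a shortened copy of the sample markup. It assumes the label and the value are always separated by exactly one space.

```ruby
# Shortened copy of the sample markup from the question.
html = '<br><b>Wins:</b> 1,017<br><b>Losses / Quits:</b> 883 / 247<br>'

# Capture everything after "<b>Wins:</b> " up to the next tag.
wins = html[/<b>Wins:<\/b> ([^<]+)/, 1]

puts wins
```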
Here
doc = Nokogiri::HTML(html)
puts doc.at('b[text()="Wins:"]').next.text
You can use this XPath: //*[*/text() = 'Wins:']/text() It will return 1,017.
About regex: RegEx match open tags except XHTML self-contained tags
I would use pure XPath like:
"//b[.='Wins:']/following::node()[1]"
I've heard thousands of times (and from gurus) "never use regex to parse XML". Can you provide some "shocking" reference demonstrating that this statement is no longer valid?
Use:
//*[. = 'Wins:']/following-sibling::node()[1]
In case this is ambiguous (selects more than one node), more strict expressions can be specified:
//*[. = 'Wins:']/following-sibling::node()[self::text()][1]
Or:
(//*[. = 'Wins:'])[1]/following-sibling::node()[1]
Or:
(//*[. = 'Wins:'])[1]/following-sibling::node()[self::text()][1]

Resources