Using Scrapy to Scrape Content after a particular keyword/string - xpath

I'm trying to scrape content after the occurrence of a particular keyword/string.
Suppose the HTML is as follows:
<meta property="og:url" content="https://www.example.com/tshirt/pcid111-31">
<meta property="og:url" content="https://www.example.com/tshirt/pcid3131-33">
<meta property="og:url" content="https://www.example.com/tshirt/pcid545424524-84">
1) How can I extract all the data inside the content attribute of elements whose property="og:url"?
2) I also want to extract anything that comes after pcid. Can someone suggest a way to do this?
Not sure if this would work:
item["example"] = sel.xpath("//meta[@property='og:url']/text()").extract()[0].replace("*pcid","")
Does replace accept wildcard characters?

This will extract the content attributes of elements whose property="og:url":
og_urls = response.xpath("//meta[@property='og:url']/@content").extract()
For extracting stuff from the URL it's usually best to use a regex. In your case it would be:
import re

for url in og_urls:
    # "pcid(.+)" captures any characters after 'pcid' (greedy)
    ids = re.findall(r"pcid(.+)", url)
    # re.findall() returns a list; you probably want only the first match, and there will most likely be only one anyway
    pcid = ids[0] if ids else ''
    print(pcid)
Or you can split the URL at 'pcid' and take the last part, e.g.
for url in og_urls:
    pcid = url.split('pcid')[-1]
    print(pcid)
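On the wildcard question: str.replace() treats its first argument literally, so "*pcid" would only match an actual asterisk followed by pcid. A minimal sketch of a pattern-based replacement with re.sub, assuming the sample URLs from the question are hard-coded in place of the XPath result:

```python
import re

# Sample values standing in for the extracted og:url content attributes.
og_urls = [
    "https://www.example.com/tshirt/pcid111-31",
    "https://www.example.com/tshirt/pcid3131-33",
    "https://www.example.com/tshirt/pcid545424524-84",
]

# str.replace() is literal: there is no "*pcid" substring, so nothing changes.
unchanged = og_urls[0].replace("*pcid", "")

# re.sub() takes a regex: drop everything up to and including 'pcid'.
ids = [re.sub(r"^.*pcid", "", url) for url in og_urls]

print(unchanged)
print(ids)
```

If a URL might not contain pcid at all, re.sub leaves it untouched, so checking with re.search first may be safer.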

Try this
x = hxs.select("//meta[@property='og:url']/@content").extract()
for content in x:
    print(content.split('pcid')[1])
Output:
111-31
3131-33
545424524-84

Related

The <title> of the page is changing. How to get it with XPath?

Got the page with dynamic <title> tag depending on language selected by user, e.g.
<title>English</title> or <title>Italiano</title>
I'm trying to select that page among many others with XPath selector:
//*[contains(@title, 'English') or contains(@title, 'Italiano')]
but it doesn't work at all.
Also tried
(//*[contains(@title, 'English')] | //*[contains(@title, 'Italiano')])[1] - no positive result
title is not an attribute, so no need to add @:
//*[contains(title, 'English') or contains(title, 'Italiano')]
This will return parent node. If you want to select title node then try
//title[.='English' or .='Italiano']

Get element in particular index nokogiri

How can I get the element at index 2.
For example in following HTML I want to display the third element i.e a DIV:
<HTMl>
<DIV></DIV>
<OL></OL>
<DIV> </DIV>
</HTML>
I have been trying the following:
p1 = html_doc.css('body:nth-child(2)')
puts p1
I don't think you're understanding how we use a parser like Nokogiri, because it's a lot easier than you make it out to be.
I'd use:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<HTMl>
<DIV>1</DIV>
<OL></OL>
<DIV>2</DIV>
</HTML>
EOT
doc.at('//div[2]').to_html # => "<div>2</div>"
That's using at which returns the first Node that matches the selector. //div[2] is an XPath selector that will return the second <div> found. search could be used instead of at, but it returns a NodeSet, which is like an array, and would mean I'd need to extract that particular node.
Alternately, I could use CSS instead of XPath:
doc.search('div:nth-child(3)').to_html # => "<div>2</div>"
Which, to me, is not really an improvement over the XPath as far as readability.
Using search to find all occurrences of a particular tag, means I have to select the particular element from the returned NodeSet:
doc.search('div')[1].to_html # => "<div>2</div>"
Or:
doc.search('div').last.to_html # => "<div>2</div>"
The downside to using search this way, is it will be slower and needlessly memory intensive on big documents since search finds all occurrences of the nodes that match the selector in the document, and which are then thrown away after selecting only one. search, css and xpath all behave that way, so, if you only need the first matching node, use at or its at_css and at_xpath equivalents and provide a sufficiently definitive selector to find just the tag you want.
'body:nth-child(2)' doesn't work because you're not using it right, according to ":nth-child()" and how I understand it works. nth-child looks at the tag supplied and matches it only when it is the "nth" child of its parent. So, you're asking for a body tag that is the second child of its "html" parent, not for a child of body. For reference, a correctly formed HTML document would be:
<html>
<head></head>
<body></body>
</html>
(How you tell Nokogiri to parse the document determines how the resulting DOM is structured.)
Instead, use div:nth-child(3), which says "find a div that is the third child of its parent", which here is "body", and results in the second div tag.
Back to how Nokogiri can be told to parse a document; Meditate on the difference between these:
doc = Nokogiri::HTML(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>foo</p>
# >> </body></html>
and:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <p>foo</p>
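The difference between the XPath div[2] and the CSS div:nth-child(3) used above can be sketched outside Nokogiri as well. This is a rough illustration in Python's standard xml.etree.ElementTree, assuming a well-formed stand-in for the question's markup (ElementTree parses XML, not loose HTML, and has no CSS support, so nth-child is emulated by indexing the parent's children):

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the question's markup (ElementTree needs valid XML).
root = ET.fromstring(
    "<html><body><div>1</div><ol></ol><div>2</div></body></html>"
)
body = root.find("body")

# XPath-style div[2]: the second <div> among its parent's <div> children.
second_div = root.find(".//div[2]")

# CSS-style :nth-child(3): the third child of the parent, whatever its tag.
third_child = list(body)[2]

print(second_div.text, third_child.tag)
```

Both lookups land on the same element here only because the third child of body happens to be the second div.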
If you can modify the HTML, add ids and classes so you can easily target what you are looking for (also add the body tag).
If you cannot modify the HTML, keep your selector simple and access the second element of the array:
html_doc.css('div')[1]

Xpath/HtmlAgilityPack: Getting the specific attributes from href tag

I'm using the HtmlAgilityPack to parse href tags in an html file. The href tags look like this:
<h3 class="product-name"><a href="http://www.somewebsite.com/blahblah">Super Cool Product</a></h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[@class='product-name']//a")
                  where
                      lnks.Attributes["href"] != null &&
                      lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value,
                      Text = lnks.InnerText
                  };
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and URL separately and put them into separate strings, instead of pulling them out together into one string. I'm guessing there is some sort of XPath notation I can use to do this. I would be extremely thankful if someone could point me in the right direction.
Thanks,
Miles

Gsub and regular expression

I have a web page. The HTML source contains this text:
<meta property="og:title" content="John"/>
John is an example, the name may vary.
I am sure that og:title will appear only once in the text.
This is my code:
$browser.goto( url )
x = $browser.html.gsub( /^.*<meta property="og:title" content="(.+?)".>/m, '\1' )
I expected to find the name John in my variable x
The '\1' should give me the first part I put in the parenthesis, i.e. (.+?), i.e. John, right?
Also, I used a dot . to match a slash / , is there a better way?
Using Watir API:
x = browser.meta.attribute_value "content"
I was not able to access the meta element using either CSS or XPath.
If you only want the value of content:
html = '<meta property="og:title" content="John"/>'
=> "<meta property=\"og:title\" content=\"John\"/>"
html[/property="og:title" content="([^"]+)"/, 1]
=> "John"
If you're not familiar with regex, "([^"]+)" might throw you. It means "from the first ", grab everything until the next "". In effect it means "grab everything inside the double quotes".
The gsub in the question will return all of the HTML, with the matched part (everything from the start of the string up to and including the />) replaced by 'John'. So the result is "John" followed by whatever HTML came after the /> of that meta tag.
If you only want to extract the name, and that tag occurs only once, you can use something like:
$browser.html =~ /<meta property="og:title" content="(.+?)"/
x = $1
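The same distinction between substituting and extracting can be sketched in Python's re module, where re.DOTALL plays the role of Ruby's /m flag. The html string below is an assumed stand-in for the page source, not the real page:

```python
import re

# Assumed stand-in for the page source.
html = '<head><meta property="og:title" content="John"/><title>x</title></head>'

# Extraction: search for the tag and read capture group 1.
match = re.search(r'<meta property="og:title" content="([^"]+)"', html)
name = match.group(1) if match else None

# Substitution (what gsub does): only the matched span is replaced,
# so everything after the tag is still present in the result.
substituted = re.sub(
    r'^.*<meta property="og:title" content="(.+?)".>',
    r"\1",
    html,
    flags=re.DOTALL,
)

print(name)
print(substituted)
```

Searching and reading the capture group avoids having to match, and then discard, the rest of the document.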

Using Xpath and HtmlAgilityPack to find all elements with innertext containing a specific word or words

I am trying to build a simple search-engine using HtmlAgilityPack and Xpath with C# (.NET 4).
I want to find every node containing a userdefined searchword, but I can't seem to get the XPath right.
For Example:
<HTML>
<BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY>
</HTML>
If the specified searchword is "Mr T" I'd want the following nodes: <H1>, The second <div>, <H2> and the second <p>.
I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.
Any hints to get me in the right direction would be very appreciated.
Use:
//*[text()[contains(., 'Mr T')]]
This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.
This can also be written shorter as:
//text()[contains(., 'Mr T')]/..
This selects the parent(s) of any text node that contains the string 'Mr T'.
In XPath, to find elements whose text contains a specific keyword, use the following form ('keyword' is the word you want to search for):
//*[text()[contains(., 'keyword')]]
In C# you follow the same form, where keyword is a string variable:
doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
Use the following:
doc.DocumentNode.SelectNodes("//*[contains(text()[1], '" + searchword + "')]")
This selects all elements (*) whose first text child (text()[1]) contains the searchword.
Case-insensitive solution:
var xpathForFindText =
    "//*[text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '" + lowerFocusKwd + "')]]";
var result = doc.DocumentNode.SelectNodes(xpathForFindText);
Note:
Be careful: lowerFocusKwd must not contain a single quote ('), because it would end the string literal early and leave the XPath malformed.
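One way around the quote problem is to build the string literal defensively. XPath 1.0 has no escape character inside string literals, so a value containing both quote types has to be assembled with concat(). A minimal sketch in Python, where xpath_literal and contains_text_xpath are hypothetical helper names and not part of HtmlAgilityPack; the same string-building logic should carry over to C#:

```python
def xpath_literal(value):
    """Return value as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote types present: join the single-quote-free pieces with concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"


def contains_text_xpath(keyword):
    """Build the case-insensitive 'text contains keyword' XPath shown above."""
    return (
        "//*[text()[contains(translate(., "
        "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), "
        + xpath_literal(keyword.lower())
        + ")]]"
    )


print(contains_text_xpath("Mr T"))
print(xpath_literal("fool's"))
```

Note that translate() as written only folds ASCII letters, so non-ASCII keywords would still be matched case-sensitively.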

Resources