XPath/HtmlAgilityPack: Getting the specific attributes from href tag

I'm using the HtmlAgilityPack to parse href tags in an HTML file. The href tags look like this:
<h3 class="product-name"><a href="http://www.somewebsite.com/blahblah">Super Cool Product</a></h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[@class='product-name']//a")
                  where
                      lnks.Attributes["href"] != null &&
                      lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value,
                      Text = lnks.InnerText
                  };
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and URL separately and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of XPath notation I can use to do this. I would be extremely thankful if someone could point me in the right direction.
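One way to read the goal is to keep the link text and the URL as two separate values per link rather than one joined string. Below is a minimal sketch of that idea using Python's standard library (not HtmlAgilityPack). It assumes each h3 wraps an <a href="..."> element; the sample markup and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed sample markup: an <a> inside each product-name heading.
html = """
<div>
  <h3 class="product-name">
    <a href="http://www.somewebsite.com/blahblah">Super Cool Product</a>
  </h3>
</div>
"""

root = ET.fromstring(html)

# Same path as in the question, with "@" introducing the attribute test.
links = root.findall(".//h3[@class='product-name']//a")

# Keep the name and the URL as two separate strings per link.
products = [
    ((a.text or "").strip(), a.get("href"))
    for a in links
    if a.get("href") and (a.text or "").strip()
]

for name, url in products:
    print(name)
    print(url)
```

In the C# query above, the anonymous type already exposes Url and Text as separate properties, so each item's Url and Text can be read on their own in the same way.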
Thanks,
Miles

Related

ImportXML function in Google Dynamic XML path

I am trying to import the headlines and landing page URLs from the "New + Updated" section of this page:
https://www.nytimes.com/wirecutter/
The issue is that the class "_988f698c" keeps changing as the headline is being replaced with a new headline/topic.
I need a workaround to use IMPORTXML function which will dynamically capture the class of that object in that position. The current formula is:
=IMPORTXML("https://www.nytimes.com/wirecutter/","//*[@class='_988f698c']")
Here is the html tag for example. The class "_988f698c" refreshes every hour or so with new headlines coming in.
<li class="e9a6bea7">
<a class="_988f698c" href="https://www.nytimes.com/wirecutter/reviews/gir-spatula-review/">Why We Love GIR Spatulas</a>
<p class="_9d1f22a9">today
</p>
</li>
Is there a way I can do this?
Step back up the tree a little and look for an alternative path that does not rely on the randomly generated class names.
For the title, use:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a"
)
For the URL attached to the title:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a/@href"
)
For the text indicating the day of publication:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/p"
)
If you want to collect everything together, use | to combine the paths:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a |
//ul[@data-testid='new-and-updated']/li/a/@href |
//ul[@data-testid='new-and-updated']/li/p"
)
Only use this if you are absolutely sure that all three values will always exist. If one is missing, the results shift position in the sheet rows, which breaks any formulas that depend on fixed values in specific cells.
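Outside of Sheets, the alignment problem can be avoided by reading each li as one record, so that a missing value leaves an empty field instead of shifting later rows. Below is a minimal sketch using Python's standard library. The sample markup (including a second item without a date) and the function name read_items are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the second <li> has no <p>, to show a missing value.
page = """
<div>
  <ul data-testid="new-and-updated">
    <li>
      <a href="https://www.nytimes.com/wirecutter/reviews/gir-spatula-review/">Why We Love GIR Spatulas</a>
      <p>today</p>
    </li>
    <li>
      <a href="https://example.com/second">Second headline</a>
    </li>
  </ul>
</div>
"""

def read_items(markup):
    """Return one (title, url, date) tuple per <li>, with None for gaps."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall(".//ul[@data-testid='new-and-updated']/li"):
        link = li.find("a")  # relative to this <li> only
        when = li.find("p")
        rows.append((
            link.text.strip() if link is not None and link.text else None,
            link.get("href") if link is not None else None,
            when.text.strip() if when is not None and when.text else None,
        ))
    return rows

for row in read_items(page):
    print(row)
```

Because every tuple has exactly three fields, the title, URL, and date of one item always stay on the same row.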

Parsing a nested tag, moving it outside of the parent, and changing its type using Nokogiri

I have HTML coming from an API that I want to clean up and format.
I'm trying to find any <strong> tag that is the first element inside a <p> tag, move it out so that it sits just before the <p> tag, and convert it to an <h4>.
For example:
<p><strong>This is what I want to pull out to an h4 tag.</strong>Here's the rest of the paragraph.</p>
becomes:
<h4>This is what I want to pull out to an h4 tag.</h4><p>Here's the rest of the paragraph.</p>
EDIT: Apologies for the nature of the question being too 'please write this for me'. I posted the solution I came up with below. I just had to take the time to really learn how Nokogiri works, but it is quite powerful and it seems like you can do almost anything with it.
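require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.css("p").each do |paragraph|
  first = paragraph.children.first
  # Guard against empty paragraphs, where first is nil.
  if first&.element? && first.name == "strong"
    first.name = "h4"                      # rename <strong> to <h4>
    paragraph.add_previous_sibling(first)  # move it just before the <p>
  end
end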

scrapy xpath : selector with many <tr> <td>

Hello, I want to ask a question.
I scraped a website with XPath, and the result looks like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to use XPath to analyze these results again.
I want to save the first cell as the address, the second as the telephone, and the last one as the map.
But I can't get it to work.
Please guide me. Thank you!
Here is the code. It's wrong: it catches other things as well.
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As the Scrapy documentation explains, to work with relative XPaths you have to prefix the expression with a dot (for example .//) so that elements are extracted relative to the previous XPath. Otherwise the expression starts again from the document root and you get all matching elements from the whole document. The Scrapy documentation gives this example:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case your code must be something like the following. Since each s in your output is already a <tr>, the relative path starts at its <td> children:
for s in store:
    address = s.xpath("./td[1]/text()").extract()
    tel = s.xpath("./td[2]/text()").extract()
    map = s.xpath("./td[3]/text()").extract()
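Here is a small self-contained sketch of the per-row idea. It uses Python's standard xml.etree.ElementTree rather than Scrapy so that it runs without a crawl; the sample table and the variable names are assumptions. Each cell path is evaluated relative to the current row:

```python
import xml.etree.ElementTree as ET

# Assumed sample data shaped like the rows in the question.
table = """
<table>
  <tr><td>address1</td><td>telephone1</td><td>map1</td></tr>
  <tr><td>address2</td><td>telephone2</td><td>map2</td></tr>
</table>
"""

root = ET.fromstring(table)

records = []
for row in root.findall(".//tr"):
    # These paths are resolved against this row only, not the whole document.
    records.append({
        "address": row.findtext("td[1]"),
        "telephone": row.findtext("td[2]"),
        "map": row.findtext("td[3]"),
    })

for record in records:
    print(record)
```

Each row yields exactly one record, which is the behaviour a dot-prefixed path gives inside the Scrapy loop.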
Hope this helps,

XPath Expression Issue in Html Agility Pack

I'm using Html Agility Pack to perform basic web scraping of Google search results. As a newbie to XPath, I made sure my path expression is correct (with the help of FirePath). However, the returned HtmlNodeCollection is always null.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDoc = web.Load("http://www.google.com/search?num=10&q=Hello+World");
// get search result URLs
var items = htmlDoc.DocumentNode.SelectNodes("//div[@id='ires']/ol[@id='rso']/li/div[@class='vsc']/h3/a/@href");
foreach (HtmlNode node in items)
{
    Console.WriteLine(node.Attributes);
}
Am I missing something? Can anyone please enlighten me?
Thanks in advance,
HAP can only process the raw HTML that is returned from the URL; it will not run any JavaScript that is on the page. You need to adjust your query accordingly.
In the raw HTML, the ires div exists, but the rso isn't inserted until the JavaScript runs, hence you get no results. There are other transformations done here which you'll have to adjust for as well.
Here's a fragment of the HTML:
<div id="ires">
<ol>
<li class="g">
<h3 class="r">
...
A more appropriate xpath to use for this would be:
var xpath = "//li[contains(concat(' ',@class,' '),' g ')]" +
            "/h3[contains(concat(' ',@class,' '),' r ')]" +
            "/a/@href";
It'd be easier to find all li with the g class as those correspond to all the results. You'll want to filter all h3 with the r class otherwise you'd include other results (such as image results).
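The concat/contains pattern pads the class attribute with spaces so that a class name only matches as a whole token, not as part of a longer name. Below is a minimal Python sketch of the same check, written as plain string logic rather than XPath; the function name has_class is hypothetical.

```python
def has_class(class_attr, token):
    """Mirror contains(concat(' ', @class, ' '), ' token ')."""
    # Pad both sides with a space so only whole, space-separated names match.
    return f" {token} " in f" {class_attr} "

print(has_class("g", "g"))        # True: the only class
print(has_class("g card", "g"))   # True: one of several classes
print(has_class("big", "g"))      # False: "g" is only part of another name
print("g" in "big")               # True: a plain substring test over-matches
```

Like the XPath expression, this sketch treats only spaces as separators, so class lists separated by tabs or newlines would need normalizing first.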

Help in XPath expression

I have an XML document which contains nodes like the following:
<a class="custom">test</a>
<a class="xyz"></a>
I was trying to get the nodes whose class is NOT "custom", and I wrote an expression like the following:
XmlNodeList nodeList = document.SelectNodes("//*[self::A[@class!='custom'] or self::a[@class!='custom']]");
Now I want to get the IMG tags as well, so I want to add the following expression to the one above:
//*[self::IMG or self::img]
...so that I get all the IMG nodes as well as any A tag that does not have "custom" as the value of its class attribute.
Any help will be appreciated.
EDIT:
I tried the following, but it is invalid syntax, as it returns a boolean and not a node list:
XmlNodeList nodeList = document.SelectNodes("//*[self::A[@class!='custom'] or self::a[@class!='custom']] && [self::IMG or self::img]");
I'm not sure what you are asking, but have you tried something like the following?
"//A[@class!='custom'] | //a[@class!='custom'] | //IMG | //img"

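To make the selection rule of that union explicit, here is a minimal Python sketch that applies the same conditions with plain code instead of an XPath union. The sample document and the function name wanted are assumptions. Note that in XPath, @class!='custom' only matches elements that actually have a class attribute, and the sketch mirrors that.

```python
import xml.etree.ElementTree as ET

# Assumed sample document covering each case.
doc = ET.fromstring(
    '<root>'
    '<a class="custom">test</a>'
    '<a class="xyz"></a>'
    '<a>no class attribute</a>'
    '<img src="x.png"/>'
    '<IMG src="y.png"/>'
    '</root>'
)

def wanted(el):
    """True for any img/IMG, or any a/A whose class exists and is not 'custom'."""
    name = el.tag.lower()  # XML names are case-sensitive, so fold case here
    if name == "img":
        return True
    if name == "a":
        cls = el.get("class")
        return cls is not None and cls != "custom"
    return False

# iter() walks the tree in document order, like the result of an XPath union.
selected = [el for el in doc.iter() if wanted(el)]

for el in selected:
    print(el.tag, el.attrib)
```

The <a> without a class attribute is skipped here, just as it would be by the XPath predicate.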