How to make OR condition in Xpath query - xpath

I have two different kinds of HTML.
The first is:
<div class="course-general-info hide-for-medium-up">
<!-- TODO: check date format here -->
<div class="headline"><strong>2014-01-27</strong></div>
<div class="subline">Course start</div>
<div class="headline"><strong>2014-04-27</strong></div>
<div class="subline">Course end</div>
<div class="headline">Basis</div>
<div class="subline">Level</div>
</div>
The second is:
<div class="course-general-info hide-for-medium-up">
<div class="headline">Available Soon</div>
<div class="subline">Course start</div>
<!-- TODO: check date format here -->
<div class="headline">Basis</div>
<div class="subline">Level</div>
</div>
I need to fetch one of the following values with a single XPath query:
2014-01-27 or
Available Soon
My separate XPath queries are:
courseDurationDate = courseDetailData.xpath('//div[@class = "headline"]/strong/text()').extract()
CourseDutaionAvailableSoon = courseDetailData.xpath('//div[@class = "headline"]/text()').extract()
Please help me write an OR condition in an XPath query. Thanks in advance.

Assuming only one of the two div variants appears at a time, this should do it:
(//div[@class = "headline"]/strong |
//div[@class = "headline" and not(strong)])[1]/text()

A one-liner:
div[@class='course-general-info hide-for-medium-up']/div[@class='headline'][1]/descendant::text()
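The idea behind the one-liner (take the first headline div and collect all text beneath it, whether or not it is wrapped in strong) can be sketched with Python's standard-library ElementTree. This is only an illustration under assumptions: the two snippets are trimmed copies of the question's markup, and `first_headline_text` is a hypothetical helper name, not part of Scrapy.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two variants from the question.
FIRST = """<div class="course-general-info hide-for-medium-up">
<div class="headline"><strong>2014-01-27</strong></div>
<div class="subline">Course start</div>
<div class="headline"><strong>2014-04-27</strong></div>
<div class="subline">Course end</div>
</div>"""

SECOND = """<div class="course-general-info hide-for-medium-up">
<div class="headline">Available Soon</div>
<div class="subline">Course start</div>
</div>"""


def first_headline_text(markup):
    """Return all text inside the first headline div (hypothetical helper)."""
    info = ET.fromstring(markup)
    # find() returns the first matching child in document order.
    first = info.find("div[@class='headline']")
    # itertext() walks the div and its descendants, so <strong> is optional.
    return "".join(first.itertext()).strip()


print(first_headline_text(FIRST))
print(first_headline_text(SECOND))
```

Both variants go through the same code path, which is what the question asks for.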

I tried the following XPath query and it worked for me:
courseDurationDate = courseDetailData.xpath('//div[@class = "course-general-info hide-for-medium-up"]/div[@class = "headline"]/strong/text() | \
//div[@class = "course-general-info hide-for-medium-up"]/div[@class = "headline" and not (strong)][1]/text()').extract()

Related

Issues with preceding sibling/parent/ancestor

<div class='productHolder'>
<a href="https://ap.com" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</div>
</div>
</a>
</div>
Note: there are over 20 productHolder elements on the same page.
I am able to get the price data. How can I use parent or preceding-sibling to get the href?
I use the following code to get the price:
rawPrice = response.xpath("//*[contains(text(),'$')]/text()")[counter].extract()
I've spent 2 hours trying preceding-sibling, parent, and even changing the code to use other values, but I run into issues elsewhere.
Any help is appreciated, cheers!
Were you looking for something like:
from io import StringIO
from lxml import etree
html = """
<div class='productHolder'>
<a href="https://ap.com" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</div>
</div>
</a>
</div>
"""
root = etree.parse(StringIO(html), etree.HTMLParser())
print(root.xpath('//*[contains(text(),"$")]/../@href')[0])
Result:
https://ap.com
Of course you can easily build from this:
item = root.xpath('//*[contains(text(),"$")]')
print(item[0].text)
print(item[0].xpath('../@href')[0])
Result:
$980
https://ap.com
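With more than 20 productHolder blocks on a page, it may be simpler to avoid the parent axis and a running counter altogether: loop over each link and look down for its price, so the two values stay paired. The sketch below uses standard-library ElementTree instead of Scrapy or lxml, and assumes well-formed markup; the second product and both URLs are made up for illustration, and `price_links` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two product blocks (URLs and second item are invented).
PAGE = """<body>
<div class="productHolder"><a href="https://ap.com/hammer" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Hammer</div>
<div class="aptime-9" dataInfo="price">$980</div>
</a></div>
<div class="productHolder"><a href="https://ap.com/saw" class="tea-time-with-ap">
<div class="aptime-8" dataInfo="name">Saw</div>
<div class="aptime-9" dataInfo="price">$120</div>
</a></div>
</body>"""


def price_links(root):
    """Return (price, href) pairs, one per product link (hypothetical helper)."""
    pairs = []
    for link in root.findall(".//div[@class='productHolder']/a"):
        # Search relative to this link, so the price always belongs to it.
        price = link.find("div[@dataInfo='price']")
        if price is not None:
            pairs.append((price.text, link.get("href")))
    return pairs


for price, href in price_links(ET.fromstring(PAGE)):
    print(price, href)
```

In Scrapy the same shape would be a loop over the link selectors with a relative query inside the loop.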

How to select this element with Scrapy XPATH?

Only requirement: it needs to refer to the thread-navigation class, because that page has many other pagination elements
<section id="thread-navigation" class="group">
<div class="float-left">
<div class="pagination talign-mleft">
<span class="pages">Pages (6):</span>
<span class="pagination_current">1</span>
2
3
4
5
6
Next » //<--- this one
</div>
</div>
</section>
I was trying something like this:
r.xpath('//*[@class="thread-navigation" and contains (., "Next")]').get()
But it always returns None
Thank you
The section has no class named thread-navigation; thread-navigation is the value of its @id attribute. So try this XPath 1.0 expression:
r.xpath('//a[ancestor::*/@id="thread-navigation" and contains (text(), "Next")]/@href').get()
Its result is
I want this text?page=2
Alternatively, this XPath:
'//section[@id="thread-navigation"]//a/@href'
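As a rough illustration of the two steps (scope by id first, then pick the link whose text contains "Next"), here is a sketch using standard-library ElementTree rather than Scrapy. The anchor tags and href values are assumptions, since the question's markup does not show them, and `next_href` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: anchors and hrefs are invented for illustration.
PAGE = """<body>
<div class="pagination"><a href="?other=9">Next</a></div>
<section id="thread-navigation" class="group">
<div class="pagination talign-mleft">
<span class="pages">Pages (3):</span>
<span class="pagination_current">1</span>
<a href="?page=2">2</a>
<a href="?page=3">3</a>
<a href="?page=2">Next</a>
</div>
</section>
</body>"""


def next_href(root):
    """Return the href of the 'Next' link inside #thread-navigation, or None."""
    section = root.find(".//section[@id='thread-navigation']")
    if section is None:
        return None
    # Only links below the section are considered; other paginations are ignored.
    for link in section.iter("a"):
        if "Next" in "".join(link.itertext()):
            return link.get("href")
    return None


print(next_href(ET.fromstring(PAGE)))
```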

Parsing through response created with XPath

Using Scrapy, I want to extract some data from a site with well-formed HTML. With XPath I am able to extract a list of items, but I am not able to extract data from the elements in the list using XPath.
All XPaths have been tested using XPather. I have also tested using a local file that contains the webpage, with the same result.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[@id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When I use the following XPath to extract "The text I want", I don't get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?
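The expression starts with /, which makes it absolute: it is evaluated from the document root, not from products[0]. Start it with . to make it relative to the selected element:
XPATH_PRODUCT_NAME = ".//div[@class='item__name']/a[@class='item__name-link']/text()"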
products[0].xpath(XPATH_PRODUCT_NAME).extract()
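The same select-then-query-relatively pattern can be tried outside Scrapy. The sketch below uses standard-library ElementTree as a stand-in for Scrapy selectors, on a trimmed copy of the question's markup; the second product and its href are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed product list; the second item is a made-up example.
PAGE = """<div id="product-list-146620"><div>
<div class="col"><div class="item"><div class="item__text-container">
<div class="item__name"><a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a></div>
</div></div></div>
<div class="col"><div class="item"><div class="item__text-container">
<div class="item__name"><a class="item__name-link" href="/c.aspx?ref_gr=9802">Another product</a></div>
</div></div></div>
</div></div>"""

root = ET.fromstring(PAGE)

# Step 1: select the product nodes (the /div/div part of the original query).
products = root.findall("./div/div")

# Step 2: query each product with a path anchored at "." so it stays relative.
names = [p.find(".//div[@class='item__name']/a").text for p in products]
print(names)
```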

Simple dom document iteration

I have an HTML as so:
<html>
<body>
<div class="somethingunneccessary"></div>
<div class="container">
<div>
<p>text1</p>
<p>text2</p>
<p>text3</p>
</div>
<div>
<p>text4</p>
<p>text5</p>
<p>text6</p>
</div>
<div>
<p>text7</p>
<p>text8</p>
<p>text9</p>
</div>
<div>
<p>text10</p>
<p>text11</p>
<p>text12</p>
</div>
<div>
<p>text13</p>
<p>text14</p>
<p>text15</p>
</div>
</div>
</body>
</html>
What I'm trying to accomplish is the following:
1./ Loop over the div elements within the div having a class container.
2./ During the iteration I want to grab the text from the 3rd p tag.
The looping part is essential instead of just slicing out the p tags by themselves
I've got some code done but it doesn't do looping:
$doc=new DOMDocument();
$doc->loadHTML($htmlsource);
$xpath = new DOMXpath($doc);
$commentxpath = $xpath->query("/html/body/div[2]/div[5]/p[3]");
$commentdata = $commentxpath->item(0)->nodeValue;
How do I loop through each inner div element and extract the 3rd p tag?
Like I said, the looping is essential.
During the iteration I want to grab the text from the 3rd p tag
Try:
"//div[@class='container']/div/p[3]"
This should return the third p of every div inside the div with class container.
You may have to query on attributes (see the question "php xpath get attribute value"):
$xpath->query("/html/body/div[@class='container']");
Just try
/html/body/div/div//p
That should return only the p elements XD
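Since the question insists on an explicit loop, the pattern is: first select the inner div elements, then run a relative query for the third p inside each iteration. The sketch below shows that pattern in Python with standard-library ElementTree rather than PHP's DOMXPath, on a shortened copy of the question's markup; in PHP the analogous step would be passing the current div as the context node to the query.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document (three inner divs instead of five).
HTML = """<html><body>
<div class="somethingunneccessary"></div>
<div class="container">
<div><p>text1</p><p>text2</p><p>text3</p></div>
<div><p>text4</p><p>text5</p><p>text6</p></div>
<div><p>text7</p><p>text8</p><p>text9</p></div>
</div>
</body></html>"""

root = ET.fromstring(HTML)

third_paragraphs = []
# Outer step: loop over the div children of the container.
for block in root.findall(".//div[@class='container']/div"):
    # Inner step: relative query for the third p child of this div.
    third = block.find("p[3]")
    if third is not None:
        third_paragraphs.append(third.text)

print(third_paragraphs)
```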

XPath - Get textcontent() and HTML

Let's say I have the following HTML:
<div class="some-class">
<p> some paragraph</p>
<h2>a heading</h2>
</div>
I want to grab everything in <div class='some-class'>, including the HTML. The following only grabs the text:
$descriptions = $xpath->query("//div[contains(@class, 'some-class')]");
foreach($descriptions as $description)
print $description->textContent;
What's the best way of getting the contained HTML tags as well?
Use this function - I've never found any built in function but this works well:
function getInnerHTML($node)
{
$innerHTML = "";
$children = $node->childNodes;
foreach ($children as $child) {
$tmp_doc = new DOMDocument();
$tmp_doc->appendChild($tmp_doc->importNode($child,true));
$innerHTML .= $tmp_doc->saveHTML();
}
return $innerHTML;
}
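The function above serializes each child node and concatenates the results. For comparison, the same idea can be sketched in Python with standard-library ElementTree; this is an analogous illustration, not a translation of the PHP DOM API, and `inner_markup` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def inner_markup(node):
    """Serialize everything inside node, without node's own tag (hypothetical helper)."""
    # Text that appears before the first child element.
    parts = [node.text or ""]
    for child in node:
        # tostring() includes the child's tag, its content, and any trailing text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


div = ET.fromstring(
    '<div class="some-class"><p> some paragraph</p><h2>a heading</h2></div>'
)
print(inner_markup(div))
```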
I believe you are looking to retrieve the outerXml - have a look at DOMDocument::saveXML. Or have I misunderstood you - do you just need the xml serialization of the <div> element and its attribute axis?
Edit I mean do you want:
<div class="some-class">
<p> some paragraph</p>
<h2>a heading</h2>
</div>
or just
<div class="some-class" />
?

Resources