I'm having trouble extracting specific pieces of text that sit between two tags.
Specifically, I want the text after the em tag ("Text after em tag. I want to get this.") and the text after the second p tag ("text after this p tag. I also want to get this.").
Is there any way of doing that?
Thanks in advance.
<article>
<h1 id='h1'>Heading 1</h1>
<img src='mypath/pictures/pic.jpg'></img>
<p></p>
<div id='div1'>
<time datetime='2016'>2016</time>
</div>
<br></br>
<em>my location, TN, United States</em>
Text after em tag. I want to get this.
<p></p>
text after this p tag. I also want to get this.
<div id='div2'>
</div>
</article>
You can get the text nodes that follow an element with the axis
following-sibling::text()
So, to get the first text node after each em element:
//em/following-sibling::text()[1]
The same works for the p tag. You can then join the results (string-join() requires XPath 2.0 or later):
string-join(em/following-sibling::text()[1] | p/following-sibling::text()[1] , ',')
I hope this helps!
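For comparison, here is a minimal sketch of the same idea using only Python's standard library. It does not use the following-sibling axis (ElementTree's XPath subset lacks it); instead it relies on ElementTree exposing the text that directly follows an element as that element's `tail`. The `texts_after` helper is a hypothetical name, and the markup is a trimmed copy of the sample from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup (well-formed, so ElementTree can parse it).
HTML = """<article>
<em>my location, TN, United States</em>
Text after em tag. I want to get this.
<p></p>
text after this p tag. I also want to get this.
<div id='div2'>
</div>
</article>"""


def texts_after(root, tag):
    """Return the non-blank text that directly follows each <tag> child of root."""
    return [
        child.tail.strip()
        for child in root.findall(tag)
        if child.tail and child.tail.strip()
    ]


article = ET.fromstring(HTML)
after_em = texts_after(article, "em")
after_p = texts_after(article, "p")

# Join both results with a comma, mirroring the string-join() call.
joined = ",".join(after_em + after_p)
print(joined)
```

Blank tails are filtered out because an empty `<p></p>` followed only by a line break would otherwise contribute a whitespace-only string.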
Using Scrapy, I want to extract some data from a well-formed HTML site. With XPath I am able to extract a list of items, but I am not able to extract data from the elements in that list.
All XPath expressions have been tested using XPather. I have also tested with a local file that contains the webpage and get the same issue.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[@id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When I use the following XPath to extract "The text I want", I don't get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?
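Try the following code. The original expression starts with /, which makes it absolute (evaluated from the document root) rather than relative to products[0]; a leading .// searches inside the selected node instead.
XPATH_PRODUCT_NAME = ".//div[@class='item__name']/a[@class='item__name-link']/text()"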
products[0].xpath(XPATH_PRODUCT_NAME).extract()
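The relative lookup can be tried without a crawler. This is a rough sketch using the standard library's ElementTree rather than Scrapy, on a trimmed copy of one product item from the question; since ElementTree has no `text()` step, the link element is selected and its `.text` is read.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one product item from the question.
PRODUCT = """<div data-pageindex="1" data-guid="13157582" class="col ">
  <div class="item item-card item-card--static">
    <div class="item-card__inner">
      <div class="item__text-container">
        <div class="item__name">
          <a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
        </div>
      </div>
    </div>
  </div>
</div>"""

product = ET.fromstring(PRODUCT)

# ".//" searches all descendants of the current node, so the exact nesting
# depth between the product <div> and the name <div> does not matter.
links = product.findall(".//div[@class='item__name']/a[@class='item__name-link']")
names = [link.text for link in links]
print(names)
```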
I am trying to scrape "Description" from this HTML structure:
<div class="menu-index-page__item-content">
<h6 class="menu-index-page__item-title">
<span> Item title </span>
</h6>
<p class="menu-index-page__item-desc">
<span>
<span>
<span>Description</span>
</span>
</span>
Each tag also carries an attribute that I don't know how to handle:
data-reactid=".3wrqgx5340.3.5.0.4:$523105.2.$3959254.$menuItemContent.1.0"
Each data-reactid is different, so if I target this attribute I will scrape things I don't want.
I've tried .search and .xpath, using tags and classes, but nothing seems to work.
Is there a way to say: give me the p tag that has class="menu-index-page__item-desc" and scrape the 3rd span from there?
You can get the required value with this XPath:
//text()[contains(.,'Description')]
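If the literal text is not known in advance, another option is to target the p tag by its class and walk down to the innermost span, which is what the question asks for. The sketch below uses Python's standard library; the closing tags that the question's snippet leaves open and the short `data-reactid` values are assumptions added so the markup parses.

```python
import xml.etree.ElementTree as ET

# The question's structure, with the open tags closed and placeholder
# data-reactid values (both are assumptions for this example).
MENU = """<div class="menu-index-page__item-content">
  <h6 class="menu-index-page__item-title">
    <span data-reactid="x.1"> Item title </span>
  </h6>
  <p class="menu-index-page__item-desc" data-reactid="x.2">
    <span data-reactid="x.3">
      <span data-reactid="x.4">
        <span data-reactid="x.5">Description</span>
      </span>
    </span>
  </p>
</div>"""

item = ET.fromstring(MENU)

# Select by class only; the varying data-reactid attributes are ignored.
spans = item.findall(".//p[@class='menu-index-page__item-desc']/span/span/span")
descriptions = [span.text for span in spans]
print(descriptions)
```

Matching on the class and the span nesting keeps the title span out of the result, even though it is also a span inside the same container.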
Your code and XPath:
I am trying to parse a webpage and get all the content inside a div tag with the class div1. I tried ('div[@class="div1"]'), which gives me the content below:
<div class="div1">
<p>
something something <br>
abc<br>
def
</p>
</div>
However, I am trying to get everything inside the div tag, not including the div tag itself, as shown below:
<p>
something something <br>
abc<br>
def
</p>
Try changing your XPath to
div[@class="div1"]/child::*
Quote from https://www.w3.org/TR/xpath/#location-paths:
child::* selects all element children of the context node
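As an illustration of selecting only the element children, here is a small sketch with Python's standard library: iterating over an ElementTree element yields its element children, which corresponds to `child::*`. It assumes the `<br>` tags are written as `<br/>` so the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# The question's markup, with <br> written as <br/> (an assumption so that
# the XML parser accepts it).
DIV = """<div class="div1"><p>
something something <br/>
abc<br/>
def
</p></div>"""

div = ET.fromstring(DIV)

# Iterating an element yields its element children (the child::* set).
children = list(div)

# Serialize the children only, so the wrapping <div> is left out.
inner = "".join(ET.tostring(child, encoding="unicode") for child in children)
print(inner)
```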
For one thing, you're looking for @id when it's @class.
<div id="info" class="">
<span>
<span class="pl"> author</span>:
<a class="" href="/search/author">Peter</a>
</span><br/>
<span class="pl">publisher:</span> god cor<br/>
<span class="pl">year:</span> 2011-6<br/>
<span class="pl">page:</span> 360<br/>
<span class="pl">price:</span> 39.50<br/>
From the above HTML, I want to extract those numbers with XPath. How can I do that?
Thanks.
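Each number is a text node that follows its label span inside the div, not a child of the a tag, so select it relative to the span. The XPath for each number is (in order as shown above):
//*[@id="info"]/span[contains(., "year")]/following-sibling::text()[1] --> 2011-6
//*[@id="info"]/span[contains(., "page")]/following-sibling::text()[1] --> 360
//*[@id="info"]/span[contains(., "price")]/following-sibling::text()[1] --> 39.50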
You can find the XPath of any element by opening the HTML file in Chrome, right-clicking the page, and choosing "Inspect". When you find the tag you want, right-click it and choose Copy -> Copy XPath.
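To check which text belongs to which label, the label/value pairs can be collected in one pass. This is a sketch with Python's standard library that reads the text following each label span through ElementTree's `tail`; the closing `</div>` is an assumption, since the question's snippet is cut off, and the `fields` dictionary is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# The question's markup with the <a> tag repaired and a closing </div> added
# (the closing tag is an assumption; the original snippet is truncated).
INFO = """<div id="info" class="">
<span>
<span class="pl"> author</span>:
<a class="" href="/search/author">Peter</a>
</span><br/>
<span class="pl">publisher:</span> god cor<br/>
<span class="pl">year:</span> 2011-6<br/>
<span class="pl">page:</span> 360<br/>
<span class="pl">price:</span> 39.50<br/>
</div>"""

info = ET.fromstring(INFO)

fields = {}
# Only direct-child label spans are matched; the author label is nested one
# level deeper and its value lives in an <a> tag, so it is not included.
for label in info.findall("span[@class='pl']"):
    name = label.text.strip().rstrip(":")
    fields[name] = (label.tail or "").strip()

print(fields)
```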
HTML structure looks like this:
<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>
I would like to extract the text from the parent itself and from its child tags, apart from the A and B children.
I have tried
/div[@class='Parent']//text()
which extracts text from all the descendant nodes, so I added a constraint: /div[@class='Parent']//text()[not(self::div)]
but it did not change anything.
Thanks for any advice.
/div[@class='Parent']/*[not(self::div and (@id='A' or @id='B'))]//text() | /div[@class='Parent']/text()
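The same selection can be sketched in plain Python with the standard library, which makes the two halves of the union explicit: the parent's own text nodes, plus all text inside the children that are not div A or div B. The `SKIP_IDS` set and the `kept` list are names chosen for this example.

```python
import xml.etree.ElementTree as ET

PARENT = """<div class="Parent">
<div id="A">more tags and text</div>
<div id="B">more tags and text</div>
more tags
<p> and text </p>
</div>"""

SKIP_IDS = {"A", "B"}  # ids of the child divs to leave out

parent = ET.fromstring(PARENT)

parts = [parent.text or ""]  # text before the first child
for child in parent:
    is_skipped = child.tag == "div" and child.get("id") in SKIP_IDS
    if not is_skipped:
        # All text inside a kept child, at any depth.
        parts.extend(child.itertext())
    # Text after a child belongs to the parent, so it is always kept.
    parts.append(child.tail or "")

kept = [part.strip() for part in parts if part.strip()]
print(kept)
```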