Extract text inside anchor tag using xpath

I am trying to ascertain how many pages there are for any search result on a site, so that I can scrape data for all of the pages using lxml and XPath.
There is a pagination tab with the following structure:
Page: 1 2 3 ... 7 next
The HTML content for it is something like this:
<ul class="ulclass">
<li></li>
<li>
<span> You are on the first page</span>
"1"
</li>
<li>
<a href="link to second page">
<span></span>
"2"
</a>
</li>
<li>
</li>
...
<li>
<a href="link to last page">
<span></span>
"7"
</a>
</li>
My approach is to extract the page numbers 1, 2, 3, 7 so that I can repeat the scraping for each of the 7 pages, because otherwise only the first page of results gets scraped.
I have written the following XPath, but it does not return the correct page numbers:
xpath('//ul[@class="ulclass"]/li/a/text()')

If I expand your example to form this,
<ul class="ulclass">
<li><span>You are on the first page</span>"1"</li>
<li><a href="link to page 2"><span></span>"2"</a></li>
<li><a href="link to page 3"><span></span>"3"</a></li>
<li><a href="link to page 4"><span></span>"4"</a></li>
<li><a href="link to page 5"><span></span>"5"</a></li>
<li><a href="link to page 6"><span></span>"6"</a></li>
<li><a href="link to page 7"><span></span>"7"</a></li>
</ul>
then using scrapy in Python I can get this:
>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('//ul[@class="ulclass"]/li/a/text()').extract()
['"2"', '"3"', '"4"', '"5"', '"6"', '"7"']
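Since the question mentions lxml, the same extraction can be done without scrapy. A minimal sketch, assuming the expanded HTML above is saved as temp.htm:

>>> from lxml import html
>>> tree = html.parse('temp.htm')  # lxml wraps the fragment in <html><body> automatically
>>> tree.xpath('//ul[@class="ulclass"]/li/a/text()')
['"2"', '"3"', '"4"', '"5"', '"6"', '"7"']

The length of that list (or its last entry) then tells you how many times to repeat the scrape.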

Related

After parsing a valid expression, there is still more data in the expression pageCount

I'm making an e-commerce website and, in the products section (inside admin), I was trying to display only 10 products per page. I'm new to Spring and, while writing the code, I encountered an error (given in the title) when trying to add the Next page button. However, the code works fine with the Previous button and all the page numbers. Here's my code for the pagination section:
<nav class="mt-3" th:if="${count > perPage}">
<ul class="pagination">
<li class="page-item" th:if="${page > 0}">
<a th:href="#{${#httpServletRequest.requestURI}} + '?page=__${page-1}__'" class="page-link">Previous</a>
</li>
<li class="page-item" th:each="number: ${#numbers.sequence(0, pageCount-1)}" th:classappend="${page==number} ? 'active' : ''">
<a th:href="#{${#httpServletRequest.requestURI}} + '?page=__${number}__'" class="page-link" th:text="${number+1}"></a>
</li>
<li class="page-item" th:if="${page pageCount-1}">
<a th:href="#{${#httpServletRequest.requestURI}} + '?page=__${page+1}__'" class="page-link">Next</a>
</li>
</ul>
</nav>
The first 2 li's work fine and I get the list of pages and also the previous button. But on adding the Next button, I get the error mentioned above.
First of all, please always provide the actual error message. Otherwise we are just guessing.
My guess is that th:if expects a boolean expression and what you have doesn't look like boolean to me: th:if="${page pageCount-1}"
Change that to something like page == pageCount-1, but again it depends on what you want to display there.
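For example, if the intent is to show Next on every page except the last one, the condition would be a real comparison. A sketch using lt (Thymeleaf's textual alias for <, which avoids escaping problems inside the attribute):

<li class="page-item" th:if="${page lt pageCount - 1}">
    <a th:href="#{${#httpServletRequest.requestURI}} + '?page=__${page+1}__'" class="page-link">Next</a>
</li>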

How to properly get the value contained inside a section using XPath?

I have the following HTML (a snippet grabbed from the web page I want to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
How do I get the value of the 1st span inside the 2nd li using XPath? The result should be 20.90.
I have tried //div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1], but I am not getting any values. I must say this is being used from a Google Sheet with the IMPORTXML function (I'm not sure which version of XPath it uses). Can I get some help?
Update
Apparently Google Sheets does not support such a "complex" XPath expression, since the expression itself seems to work fine when tested outside Sheets:
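A quick way to reproduce that check outside Sheets is lxml in Python. A minimal sketch, assuming the snippet above is saved locally as page.html:

from lxml import html

tree = html.parse('page.html')  # lxml wraps the fragment in <html><body> automatically
# the original path expression, wrapped in normalize-space() to drop the surrounding whitespace
print(tree.xpath('normalize-space(//div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1])'))
# prints: 20.90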
Update 1
As requested I've shared the Google Sheet I am using to test this, here is the link
What you need is:
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too:
=IMPORTXML(A1;"//div[@class='ulListContainer']/ul[1]/li[2]/span[1]")
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website link/URL

Proper xpath Syntax for Extracting Two Text Values

I am trying to scrape a web page for NAME OF COMPANY and CITY AND STATE OF COMPANY shown below.
I have an xpath code snippet that identifies both text elements at the same time:
//span[starts-with(@class,"text-align")]/text()[2]
This XPath snippet pulls the first text value (COMPANY NAME). How do I get the second text element (CITY, STATE)?
A snip of the web page code looks like this:
<div>
<ul class="pv-top-card-v3--experience-list">
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="position_see_more" data-ember-action="" data-ember-action-172="172">
<img src="https://media.licdn.com/dms/image/C4E0BAQFhA8h46hvabA/company-logo_100_100/0?e=1582761600&v=beta&t=VAeZqaGu3Lu6Ol_n5kiiI74FSRuSOZA1ggAI5qTVRjE" id="ember173" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember174" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE NAME OF A COMPANY
<!----></span>
</a>
</li>
<li>
<a class="pv-top-card-v3--experience-list-item" href="#" data-control-name="education_see_more" data-ember-action="" data-ember-action-176="176">
<img src="https://media.licdn.com/dms/image/C560BAQEr2uQX-x2EwQ/company-logo_100_100/0?e=1582761600&v=beta&t=aDbYLUDMvlS4DpwOLjOaQj3Dj60C_cYLC5UUvGoyld0" id="ember177" class="EntityPhoto-square-1 flex-shrink-zero ember-view">
<span id="ember178" class="text-align-left ml2 t-14 t-black t-bold full-width lt-line-clamp lt-line-clamp--multi-line ember-view" style="-webkit-line-clamp: 2"> THIS IS THE CITY AND STATE OF COMPANY
<!----></span>
</a>
</li>
</ul>
</div>
The xpath string is picking up the two span elements using class. I can't use the span id attributes because they are dynamic and change with each page (one page per company).
Can someone advise how I extract the desired text?
Thanks.
Point the expression at the li level:
//ul/li[2]/a/span[starts-with(@class,"text-align")]
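The same idea can be checked with lxml in Python; a minimal sketch, assuming the snippet above is saved as profile.html:

from lxml import html

tree = html.parse('profile.html')
# first li -> company name, second li -> city and state
company = tree.xpath('normalize-space(//ul/li[1]/a/span[starts-with(@class,"text-align")])')
city_state = tree.xpath('normalize-space(//ul/li[2]/a/span[starts-with(@class,"text-align")])')
print(company)     # THIS IS THE NAME OF A COMPANY
print(city_state)  # THIS IS THE CITY AND STATE OF COMPANY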

xPath expression is not finding text inside HTML

I have following XML:
<div>
<ul>
<li>
<a>
Logout 1
</a>
</li>
<li>
<a>
Logout 2
</a>
</li>
<li>
<a>
Logout 3
</a>
</li>
<li>
<a>
Logout 4
</a>
</li>
</ul>
</div>
And I want to check whether an a tag with the text Logout 4 exists. I do this with the following expression:
/div/ul/li/a[text() = 'Logout 4']
This doesn't seem to work. Can anyone tell me what I am doing wrong?
I am testing my xPath on this site btw: http://www.xpathtester.com/xpath
Your XPath didn't return any result because the inner text of the a element has leading and trailing spaces, which you can clear using normalize-space():
/div/ul/li/a[normalize-space() = 'Logout 4']
or, if you really want to evaluate only the first child text node within a:
/div/ul/li/a[normalize-space(text()) = 'Logout 4']
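To verify, a small lxml sketch in Python, assuming the XML above is saved as menu.xml:

from lxml import etree

tree = etree.parse('menu.xml')
# the raw text() comparison fails because of the surrounding whitespace...
print(bool(tree.xpath('/div/ul/li/a[text() = "Logout 4"]')))             # False
# ...while normalize-space() strips it before comparing
print(bool(tree.xpath('/div/ul/li/a[normalize-space() = "Logout 4"]')))  # True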

Select visible xpath in list

I am trying to get the error message off of a page on a site. The list contains several possible errors, so I can't check by id, but I do know that the one with display:list-item is the one I want. This is my rule, but it doesn't seem to work; what is wrong with it? What I want returned is the error text in the element.
//*[@id='errors']/ul/li[contains(@style,'display:list-item')]
Example dom elements:
<div id="errors" class="some class" style="display: block;">
<div class="some other class"></div>
<div class="some other class 2">
<span class="displayError">Please correct the errors listed in red below:</span>
<ul>
<li style="display:none;" id="invalidId">Enter a valid id</li>
<li style="display:list-item;" id="genericError">Something bad happened</li>
<li style="display:none;" id="somethingBlah" ............ </li>
....
</ul>
</div>
The correct XPath should be:
//*[@id='errors']//ul/li[contains(@style,'display:list-item')]
After //*[@id='errors'] you need an extra /, because <ul> is not directly beneath it. Using // again scans all descendant elements for <ul>.
If you can avoid using //, it is better: faster and less resource-intensive.
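A quick way to confirm this with lxml in Python, assuming the DOM above (with the elided parts filled in) is saved as errors.html:

from lxml import html

tree = html.parse('errors.html')
visible = tree.xpath("//*[@id='errors']//ul/li[contains(@style,'display:list-item')]/text()")
print(visible)  # ['Something bad happened']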
