I have been trying to figure this out for a while and can't get my head around it. I have tried using following-sibling, but it's not working for me. The classes are really generic across the board, so I was trying to use the text within the <strong> tag to identify the element and then pull the sibling content:
<div class="generic-class">
<p class="generic-class2">
<strong>Content title</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title2</strong>
"
Needed Content "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title3</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title4</strong>
"
Dont Need "
<br>
</p>
</div>
I tried the expression below with no success. I then realised that the text is actually inside the <p> tag, so it's not a sibling element:
normalize-space(//*[@class="generic-class"]/p/strong/following-sibling::text())
Would there be a way of me finding the text in the <strong> tag "Content title2" and then getting the text in the parent?
Any help would be amazing, thanks!
This one should return "Needed Content":
normalize-space(//p/strong[.="Content title2"]/following-sibling::text())
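To see the same idea outside an XPath engine, here is a minimal sketch in Python using only the standard library. It assumes the markup has been made well-formed XML (the `<br>` is written as `<br/>`, and a hypothetical `<root>` wrapper is added), and `text_after_title` is an illustrative helper name, not part of any library. In ElementTree, the text that follows an element inside its parent is stored in that element's `tail`, which plays the role of `following-sibling::text()` here.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed version of the markup from the question (assumption).
SAMPLE = """
<root>
  <div class="generic-class"><p class="generic-class2"><strong>Content title</strong>
    Dont Need <br/></p></div>
  <div class="generic-class"><p class="generic-class2"><strong>Content title2</strong>
    Needed Content <br/></p></div>
</root>
"""

def text_after_title(xml_text, title):
    """Return the whitespace-normalized text that follows the matching <strong>."""
    root = ET.fromstring(xml_text)
    for strong in root.iter("strong"):
        if (strong.text or "").strip() == title:
            # .tail holds the text between </strong> and the next tag.
            return " ".join((strong.tail or "").split())
    return None

result = text_after_title(SAMPLE, "Content title2")
print(result)
```

The `" ".join(... .split())` step does roughly what `normalize-space()` does in the XPath version.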
Having the following HTML (snippet grabbed from the web page I wanted to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
How do I get the value of the 1st span in the 2nd li using XPath? The result should be 20.90.
I have tried the following: //div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1], but I am not getting any values. I must say this is being used from a Google Sheet with the IMPORTXML function (I am not sure which version of XPath it uses). Can I get some help?
Update
Apparently Google Sheets does not support such a "complex" XPath expression, since the expression itself seems to work fine:
Update 1
As requested I've shared the Google Sheet I am using to test this, here is the link
What you need is :
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too :
=IMPORTXML(A1;"//div[@class='ulListContainer']/ul[1]/li[2]/span[1]")
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website link/URL
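If you want to check the original path against the pasted snippet independently of Google Sheets, a small sketch like the one below may help. It uses Python's standard-library ElementTree, which supports only a subset of XPath, and it assumes the snippet is made well-formed (self-closed `<img/>`, a hypothetical `<root>` wrapper, and the `...` placeholder dropped). It says nothing about what the live page or IMPORTXML actually receives.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the snippet from the question (assumption).
SAMPLE = """
<root>
  <div class="ulListContainer">
    <section class="stockUpdater">
      <ul class="column4">
        <li><img src="1.png" alt=""/><strong>Buy*</strong><strong>Sell*</strong></li>
        <li><header>$USD</header><span class="">20.90</span><span class="">23.15</span></li>
      </ul>
      <ul></ul>
    </section>
  </div>
</root>
"""

root = ET.fromstring(SAMPLE)
# Same steps as the XPath in the question, made relative to the wrapper element.
span = root.find(".//div[@class='ulListContainer']/section/ul[1]/li[2]/span[1]")
value = span.text.strip() if span is not None else None
print(value)
```

If this matches on the snippet but IMPORTXML still returns nothing, the markup served to Google Sheets is probably different from what the browser shows.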
Using Scrapy, I want to extract some data from a site with well-formed HTML. With XPath I am able to extract a list of items, but I am not able to extract data from the elements in that list.
All XPaths have been tested using XPather. I have also tested with a local file that contains the webpage and get the same issue.
Here goes:
# Get the webpage
fetch("https://www.someurl.com")
# The following gives me the expected items from the HTML
products = response.xpath("//*[@id='product-list-146620']/div/div")
The items are like this:
<div data-pageindex="1" data-guid="13157582" class="col ">
<div class="item item-card item-card--static">
<div class="item-card__inner">
<div class="item__image item__image--overlay">
<a href="/www.something.anywhere?ref_gr=9801" class="ratio_custom" style="padding-bottom:100%">
</a>
</div>
<div class="item__text-container">
<div class="item__name">
<a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
</div>
</div>
</div>
</div>
</div>
When using the following XPath to extract "The text I want", I don't get anything:
XPATH_PRODUCT_NAME = "/div/div/div/div/div[contains(@class,'item__name')]/a/text()"
products[0].xpath(XPATH_PRODUCT_NAME).extract()
The output is empty, why?
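Your expression starts with /, so Scrapy evaluates it from the document root rather than from products[0], and nothing matches. Start it with .// to make it relative to the selected element:
XPATH_PRODUCT_NAME = ".//div[@class='item__name']/a[@class='item__name-link']/text()"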
products[0].xpath(XPATH_PRODUCT_NAME).extract()
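The relative-path idea can be tried without Scrapy. The sketch below uses Python's standard-library ElementTree as a stand-in for Scrapy selectors, on a trimmed, well-formed copy of one item; the `product-list` id and the shortened class names are assumptions made for the example. Each product element is queried with a path that starts at that element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one product item, wrapped in a hypothetical list container.
SAMPLE = """
<div id="product-list">
  <div class="col">
    <div class="item">
      <div class="item__text-container">
        <div class="item__name">
          <a class="item__name-link" href="/c.aspx?ref_gr=9801">The text I want</a>
        </div>
      </div>
    </div>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)
products = root.findall("div")  # the per-product elements

# ".//" means "search below the element this is called on".
RELATIVE_NAME_PATH = ".//div[@class='item__name']/a[@class='item__name-link']"
names = [
    link.text
    for product in products
    for link in product.findall(RELATIVE_NAME_PATH)
]
print(names)
```

In Scrapy the equivalent call would be the `products[0].xpath(...)` line above; only the leading `.` changes where the search begins.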
I need span title text (RS_GPO) as my xpath output
Here is code:
<TD id="celleditableGrid07" nowrap="nowrap" style='padding:0px;' >
<DIV class='stacked-row'>
<span id="form(202567).form(TITLE).text" >
<span title='RPS_AEM3'>RPS_AEM3</span>
</span>
</DIV>
<DIV class='stacked-row-bottom'>
<span id="form(202567).form(CONTENT).text" >
<span title='RS_GPO'>RS_GPO</span>
</span>
</DIV>
My intention is to capture the text "RS_GPO" in a variable, because this text is system generated.
Thanks in Advance.
//span[@title='RS_GPO']
OR
//div[@class='stacked-row-bottom']/span[@id='form(202567).form(CONTENT).text']/span[@title='RS_GPO']
//span[contains(@id,'form(CONTENT).text')]/span
If you want the content of the title attribute instead of the element's text content, then:
//span[contains(@id,'form(CONTENT).text')]/span/@title
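Since the value is system generated, matching on the stable part of the id (as in the last two expressions) is probably safer than matching on the title itself. Here is a rough sketch of that approach in Python with the standard library only. ElementTree has no `contains()`, so the substring test is done in plain Python; the markup is a lowercased, closed copy of the snippet, and `content_value` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the snippet from the question (assumption).
SAMPLE = """
<td id="celleditableGrid07">
  <div class="stacked-row">
    <span id="form(202567).form(TITLE).text"><span title="RPS_AEM3">RPS_AEM3</span></span>
  </div>
  <div class="stacked-row-bottom">
    <span id="form(202567).form(CONTENT).text"><span title="RS_GPO">RS_GPO</span></span>
  </div>
</td>
"""

def content_value(xml_text):
    """Return (text, title) of the span nested in the CONTENT span."""
    root = ET.fromstring(xml_text)
    for outer in root.iter("span"):
        # Plain-Python stand-in for contains(@id, 'form(CONTENT).text').
        if "form(CONTENT).text" in outer.get("id", ""):
            inner = outer.find("span")
            if inner is not None:
                return inner.text, inner.get("title")
    return None, None

text, title = content_value(SAMPLE)
print(text, title)
```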
For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.
I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:
<html>
<body>
<h1>
<span class="text">Go </span>
</h1>
<p>
<span class="text">totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<h1>
<span class="text">debugger </span>
</h1>
<p>
<span class="text">should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle </span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
I'd like to:
Extract the content of <h1..."text".
Insert (and concatenate) this extracted content into the content of <p..."text".
Only do this for the <p> tag that immediately follows the <h1> tag.
Do this for all of the <h1> tags on the page.
So this is what I want it to look like:
<html>
<body>
<p>
<span class="text">Go totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<p>
<span class="text">debugger should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle</span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
With the code starting off like this,
package main
import (
"fmt"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
html_code := strings.NewReader(`code_example_above`)
doc, _ := goquery.NewDocumentFromReader(html_code)
I know that I can read <h1..."text" with:
h1_tag := doc.Find("h1 .text")
I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:
doc.Find("p .text").Before("h1 .text")
^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".
Then, I found out how to get a step closer to what I want:
doc.Find("p .text").First().Before("h1 .text")
^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).
I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)
My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.
If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.
My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.
EDIT
I found out something else I can do:
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
nex := s.Next().Text()
fmt.Println(s.Text(), nex, "\n\n")
})
^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?
Or, as mattn pointed out, I could also use doc.Find("h1+p").
I'm still having trouble appending <h1..."text" to <p..."text". I'll post it as another question because you can break this one down into multiple questions, and Mattn already answered one.
I'm not sure exactly what you are trying to write with goquery, but maybe what you want is the adjacent sibling selector.
h1+p
This selects each p tag that immediately follows an h1 tag.
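To make the pairing concrete, here is a rough sketch of the "each h1 with the p right after it" logic. It is written in Python with the standard-library ElementTree rather than Go and goquery, so treat it as an illustration of the idea and not as goquery code. `merge_headings` is a hypothetical helper, and the sample is the markup from the question with the closing `</html>` corrected.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html><body>
<h1><span class="text">Go </span></h1>
<p><span class="text">totally </span><span class="post">kicks </span></p>
<p><span class="text">hacks </span><span class="post">its </span></p>
<h1><span class="text">debugger </span></h1>
<p><span class="text">should </span><span class="post">be </span></p>
<p><span class="text">called </span><span class="post">ogle </span></p>
<h3><span class="statement">true</span></h3>
</body></html>"""

def merge_headings(xml_text):
    """Prepend each h1 text to the p that immediately follows it, then drop the h1."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    children = list(body)  # snapshot, so removing from body is safe
    for current, following in zip(children, children[1:]):
        # Same condition as the h1+p selector: a p directly after an h1.
        if current.tag != "h1" or following.tag != "p":
            continue
        heading = current.find("span[@class='text']")
        target = following.find("span[@class='text']")
        if heading is None or target is None:
            continue
        target.text = (heading.text or "") + (target.text or "")
        body.remove(current)
    return root

merged = merge_headings(SAMPLE)
texts = [s.text for s in merged.iter("span") if s.get("class") == "text"]
print(texts)
```

In goquery the same pairing would presumably be built from `doc.Find("h1").Each(...)` together with `s.Next()`, as in the EDIT above.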
I've recently decided to update a website by adding rich snippets - microdata.
The thing is, I'm a newbie to this kind of thing and I have a small question about it.
I'm trying to define the Organization as you can see from the code below:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
Now, my problem is the following: I'd also like to tag the LOGO in order to make a complete Organization profile. However, the logo sits in the header of my page, while the div I've posted above sits in the footer, and the style/layout of the page doesn't permit me to add the logo there and also make it visible.
So, how can I solve this thing? What's the best solution?
Thanks.
You can use the itemref attribute.
Give your logo in the header an id and add the corresponding itemprop:
<img src="acme-logo.png" alt="ACME Inc." itemprop="logo" id="logo" />
Now add itemref="logo" to your div in the footer:
<div class="block-content" itemscope itemtype="http://schema.org/Organization" itemref="logo">
…
</div>
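As a rough mental model of what itemref does, the sketch below collects an item's properties from its own descendants plus any element whose id is listed in itemref. It is a simplified illustration in Python with the standard library, not a conforming Microdata parser: it ignores nested items, assumes well-formed XML (so `itemscope` is written as `itemscope=""`), and `collect_properties` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Header logo plus footer Organization item, reduced from the example above.
SAMPLE = """<html><body>
<header><img src="acme-logo.png" alt="ACME Inc." itemprop="logo" id="logo"/></header>
<footer>
<div itemscope="" itemtype="http://schema.org/Organization" itemref="logo">
<p itemprop="name">SOME ORGANIZATION</p>
<p itemprop="telephone">0444 330 226</p>
</div>
</footer>
</body></html>"""

def collect_properties(xml_text):
    """Gather itemprop values for the first item, following itemref ids."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    item = next(el for el in root.iter() if el.get("itemscope") is not None)
    # Property sources: the item's descendants, then everything itemref points at.
    sources = list(item.iter())[1:]
    for ref in (item.get("itemref") or "").split():
        if ref in by_id:
            sources.extend(by_id[ref].iter())
    props = {}
    for el in sources:
        name = el.get("itemprop")
        if name:
            # URL-like elements carry their value in src/href, others in text.
            props[name] = el.get("src") or el.get("href") or (el.text or "").strip()
    return props

props = collect_properties(SAMPLE)
print(props)
```

The point is that the logo ends up as a property of the Organization even though it lives in a different part of the page.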
If this is not possible in your case, you could "duplicate" the logo so that it’s included in your div, but not visible. Microdata allows meta and link elements in the body for this case. You should use the link element, as http://schema.org/Organization expects a URL for the logo property. (Alternatively, add it via meta as a separate ImageObject.)
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
…
<link itemprop="logo" href="logo.png" />
…
</div>
Side note: I don’t think that you are using the hr element correctly in your example. If you simply want to display a horizontal line, you should use CSS (e.g. border-top on the p) instead.
Dan, you could simply add in the logo schema with this code:
<img itemprop="logo" src="http://www.example.com/logo.png" />
So in your example, you could simply tag it as:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<img itemprop="logo" src="http://www.example.com/logo.png" />
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
I believe that should work for your particular case, and you wouldn't have to mark up the logo separately. Keep in mind that an img element is rendered by default, so the logo will be visible in the footer unless you hide it with CSS. Hope that helps.