goquery- Concatenate a tag with the one that follows - go

For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.
I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:
<html>
<body>
<h1>
<span class="text">Go </span>
</h1>
<p>
<span class="text">totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<h1>
<span class="text">debugger </span>
</h1>
<p>
<span class="text">should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle </span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
I'd like to:
Extract the content of <h1..."text".
Insert (and concatenate) this extracted content into the content of <p..."text".
Only do this for the <p> tag that immediately follows the <h1> tag.
Do this for all of the <h1> tags on the page.
So this is what I want it to look like:
<html>
<body>
<p>
<span class="text">Go totally </span>
<span class="post">kicks </span>
</p>
<p>
<span class="text">hacks </span>
<span class="post">its </span>
</p>
<p>
<span class="text">debugger should </span>
<span class="post">be </span>
</p>
<p>
<span class="text">called </span>
<span class="post">ogle</span>
</p>
<h3>
<span class="statement">true</span>
</h3>
</body>
</html>
With the code starting off like this,
package main
import (
"fmt"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
html_code := strings.NewReader(`code_example_above`)
doc, _ := goquery.NewDocumentFromReader(html_code)
I know that I can read <h1..."text" with:
h1_tag := doc.Find("h1 .text")
I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:
doc.Find("p .text").Before("h1 .text")
^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".
Then, I found out how to get a step closer to what I want:
doc.Find("p .text").First().Before("h1 .text")
^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).
I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)
My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.
If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.
My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.
EDIT
I found out something else I can do:
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
nex := s.Next().Text()
fmt.Println(s.Text(), nex, "\n\n")
})
^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would output the next instance of <h1>, but it outputs the next tag in doc--the *goquery.Selection that it's iterating through. Is that correct?
Or, as mattn pointed out, I could also use doc.Find("h1+p").
I'm still having trouble appending <h1..."text" to <p..."text". I'll post it as another question because you can break this one down into multiple questions, and Mattn already answered one.

I'm not sure exactly what you're trying to write with goquery, but what you're looking for may be the adjacent sibling selector:
h1+p
This matches every p tag that immediately follows an h1 tag.

Related

How to properly get the value contained inside a section using XPath?

having the following HTML (snippet grabbed from the web page I wanted to scrape):
<div class="ulListContainer">
<section class="stockUpdater">
<ul class="column4">
<li>
<img src="1.png" alt="">
<strong>
Buy*
</strong>
<strong>
Sell*
</strong>
</li>
<li>
<header>
$USD
</header>
<span class="">
20.90
</span>
<span class="">
23.15
</span>
</li>
</ul>
<ul>...</ul>
</section>
</div>
How do I get the value of the first span in the second li using XPath? The result should be 20.90.
I have tried the following: //div[@class="ulListContainer"]/section/ul[1]/li[2]/span[1] but I am not getting any values. I should mention that this is being used from a Google Sheet with the IMPORTXML function (I'm not sure which version of XPath it uses). Can I get some help?
Update
Apparently Google Sheets does not support such "complex" XPath expression since it seems to work fine:
Update 1
As requested I've shared the Google Sheet I am using to test this, here is the link
What you need is :
=IMPORTXML(A1;"//li[contains(text(),'USD')]/span[1]")
Removing section from your original XPath will work too :
=IMPORTXML(A1;"//div[@class='ulListContainer']/ul[1]/li[2]/span[1]")
Try this:
=IMPORTXML("URL","//span[1]")
Change URL to the actual website link/URL

Xpath Sibling Text

I have been trying to figure this out for a while and can't get my head around it. I have tried using following-sibling but it's not working for me. The classes are really generic across the board. I was trying to use the text within the <strong> tag to identify then pull the sibling content:
<div class="generic-class">
<p class="generic-class2">
<strong>Content title</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title2</strong>
"
Needed Content "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title3</strong>
"
Dont Need "
<br>
</p>
</div>
<div class="generic-class">
<p class="generic-class2">
<strong>Content title4</strong>
"
Dont Need "
<br>
</p>
</div>
I tried using the expression below with no success. I then realised that the text is actually inside the <p> tag, so it's not a sibling of it:
normalize-space(//*[@class="generic-class"]/p/strong/following-sibling::text())
Would there be a way of me finding the text in the <strong> tag "Content title2" and then getting the text in the parent?
Any help would be amazing, thanks!
This one should return "Needed Content":
normalize-space(//p/strong[.="Content title2"]/following-sibling::text())

Thymeleaf switch block returns incorrect value

I have a switch block in my thymeleaf page where I show an image depending on the reputation score of the user:
<h1>
<span th:text="#{user.reputation} + ${reputation}">Reputation</span>
</h1>
<div th:if="${reputation lt 0}">
<img th:src="@{/css/img/troll.png}"/>
</div>
<div th:if="${reputation} == 0">
<img th:src="@{/css/img/smeagol.jpg}"/>
</div>
<div th:if="${reputation gt 0} and ${reputation le 5}">
<img th:src="@{/css/img/samwise.png}"/>
</div>
<div th:if="${reputation gt 5} and ${reputation le 15}">
<img th:src="@{/css/img/frodo.png}"/>
</div>
<div th:if="${reputation gt 15}">
<img th:src="@{/css/img/gandalf.jpg}"/>
</div>
This statement always returns smeagol (so reputation 0), even though the reputation of this user is 7.
EDIT:
I was wrong, the image showing was a rogue line:
<!--<img th:src="@{/css/img/smeagol.jpg}"/>-->
but I commented it out. Now there is no image showing.
EDIT2:
changed my comparators (see original post) and now I get the following error:
The value of attribute "th:case" associated with an element type "div" must not contain the '<' character.
EDIT3:
Works now, updated original post to working code
According to the documentation, Thymeleaf's switch statement works just like Java's - and the example suggests the same.
In other words: you cannot do
<th:block th:switch="${reputation}">
<div th:case="${reputation} < 0">
[...]
but would need to do
<th:block th:switch="${reputation}">
<div th:case="0">
[...]
which is not what you want.
Instead, you will have to use th:if, i.e. something like this:
<div th:if="${reputation lt 0}">
<img th:src="@{/css/img/troll.png}"/>
</div>
Change
<div th:case="0">
<img th:src="@{/css/img/smeagol.jpg}"/>
</div>
to
<div th:case="${reputation == 0}">
<img th:src="@{/css/img/smeagol.jpg}"/>
</div>

import.io selecting css class with xpath that contain certain value/character

<div id="mDetails">
<span class="textLabel">Bar Number:</span>
<p class="profileText">YYYYYYYYYYYYYYYYYYYY</p>
<span class="textLabel">Address:</span>
<p class="profileText">YYYYYYYYYYYYYYYYYYY<br>YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY<br>United States</p>
<span class="textLabel">Phone:</span>
<p class="profileText">123465798</p>
<span class="textLabel">Fax:</span>
<p class="profileText">987654321</p>
<span class="textLabel">Email:</span>
<p class="profileText">regina@rbr3.com</p>
<span class="textLabel">County:</span>
<p class="profileText">YYYYYYYYYYYYYYY</p>
<span class="textLabel">Circuit:</span>
<p class="profileText">YYYYYYYYYY</p>
<span class="textLabel">Admitted:</span>
<p class="profileText">00/00/0000</p>
<span class="textLabel">History:</span>
<p class="profileText">YYYYYYYYYYYYYYYYY</p>
I'm trying to select the email only if it's available, because when I use //*[@class="profileText"] it returns everything with this class. I only want a result when @ is present in the value.
After adjusting the input XML by changing both <br> to <br/> (otherwise it's not valid XML), the following XPath selects all p elements that have the class profileText and contain @:
//p[@class='profileText'][contains(.,'@')]
returns
<p class="profileText">regina@rbr3.com</p>
In case you only want to get the value, you can use string():
string(//p[@class='profileText'][contains(.,'@')])
returns
regina@rbr3.com
Note that string() would only return the value of the first match, while the first XPath returning the p elements returns all matches.

Rich Snippets : Microdata itemprop out of the itemtype?

I've recently decided to update a website by adding rich snippets (microdata).
The thing is, I'm a newbie to this kind of thing and I have a small question about it.
I'm trying to define the Organization as you can see from the code below:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
Now, my problem is the following: I'd like to also tag the LOGO in order to make a complete Organization profile, but the logo is in the header of my page, while the div I've posted above is in the footer, and the style/layout of the page doesn't permit me to add the logo there and also make it visible.
So, how can I solve this thing? What's the best solution?
Thanks.
You can use the itemref attribute.
Give your logo in the header an id and add the corresponding itemprop:
<img src="acme-logo.png" alt="ACME Inc." itemprop="logo" id="logo" />
Now add itemref="logo" to your div in the footer:
<div class="block-content" itemscope itemtype="http://schema.org/Organization" itemref="logo">
…
</div>
If this is not possible in your case, you could "duplicate" the logo so that it’s included in your div, but not visible. Microdata allows meta and link elements in the body for this case. You should use the link element, as http://schema.org/Organization expects a URL for the logo property. (Alternatively, add it via meta as a separate ImageObject).
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
…
<link itemprop="logo" href="logo.png" />
…
</div>
Side note: I don’t think that you are using the hr element correctly in your example. If you simply want to display a horizontal line, you should use CSS (e.g. border-top on the p) instead.
Dan, you could simply add in the logo schema with this code:
<img itemprop="logo" src="http://www.example.com/logo.png" />
So in your example, you could simply tag it as:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<img itemprop="logo" src="http://www.example.com/logo.png" />
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
I believe that should work for your particular case and it won't actually show the logo and you wouldn't have to mark up the logo separately. Hope that helps.

Resources