I have this piece of HTML:
<p class=MsoNormal style='margin-bottom:12.0pt'><span style='font-size:8.5pt;
font-family:"Arial",sans-serif;mso-fareast-font-family:"Times New Roman"'>1. A
következő feldolgozása szükséges: <br>
<br>
1000806457 bevásárlókosár kiegészítése 107,28 EUR értékkel <br>
<br>
Kattintson a következő nyomógombra a rendszerbe való bejelentkezéshez és online
engedélyezéshez: <br>
<a
href="%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20https:/cip7a.reh.rehau.de:8431/sap/bc/ui5_ui5/ui2/ushell/shells/abap/FioriLaunchpad.html?sap-client=100#ShoppingCartItem-approve ">Bejelentk.
</a><br>
<br>
2. Áttekintés 1000806457, XXX 26.10.2021 11:03 sz. bevásárlókosárhoz <o:p></o:p></span></p>
I tried several ways to get this text: '2. Áttekintés 1000806457, XXX 26.10.2021 11:03 sz. bevásárlókosárhoz', e.g. with this code:
var HTMLPList = from p in htmlEmail.DocumentNode.SelectNodes("//p[contains(@style,'margin-bottom:12.0pt')]").Cast<HtmlNode>()
select new { pText = p.InnerText };
But I always get a null reference error. So the question is: how do I get the text above? I also tried a 'contains' condition matching the text 'Áttekintés', but without success.
Thanks.
If you are not specifically looking for XPath, here is a solution with CSS selectors (which is what I am used to working with):
You want the last non-empty text node among the child nodes of the span inside the paragraph element.
var doc = new HtmlDocument();
doc.LoadHtml(@"<p class=MsoNormal style='margin-bottom:12.0pt'>
<span style=""font-size:8.5pt; font-family:'Arial',sans-serif;mso-fareast-font-family:'Times New Roman'"">
1. A következő feldolgozása szükséges: <br>
<br>
1000806457 bevásárlókosár kiegészítése 107, 28 EUR értékkel<br>
<br>
Kattintson a következő nyomógombra a rendszerbe való bejelentkezéshez és online
engedélyezéshez: <br>
<a href= '%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20https:/cip7a.reh.rehau.de:8431/sap/bc/ui5_ui5/ui2/ushell/shells/abap/FioriLaunchpad.html?sap-client=100#ShoppingCartItem-approve ' > Bejelentk.
</a><br>
<br>
2.Áttekintés 1000806457, XXX 26.10.2021 11:03 sz.bevásárlókosárhoz<o:p></o:p>
</span>
</p>");
// Using HtmlAgilityPack.CssSelectors.NetCore (provides QuerySelector)
var p = doc.QuerySelector("p span")
    .ChildNodes
    .LastOrDefault(c => c.NodeType == HtmlNodeType.Text && !string.IsNullOrWhiteSpace(c.InnerText));
// p is null when the span contains no non-empty text node
Console.WriteLine(p?.InnerText.Trim());
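For comparison, here is a minimal Python sketch of the same idea (take the last non-empty text node that is a direct child of the span), using only the standard library's html.parser. The DirectTextCollector class, the VOID_TAGS set, the shortened sample markup and the example.invalid link are my own illustrative assumptions; none of this is part of HtmlAgilityPack.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class DirectTextCollector(HTMLParser):
    """Collect non-empty text nodes whose immediate parent is a <target> tag."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []   # currently open (non-void) tags
        self.texts = []   # stripped direct text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and self.stack[-1] == self.target and data.strip():
            self.texts.append(data.strip())


# Shortened version of the e-mail markup from the question (assumption).
sample = (
    "<p class=MsoNormal><span>1. A következő feldolgozása szükséges: <br><br>"
    "<a href='https://example.invalid/'>Bejelentk.</a><br><br>"
    "2. Áttekintés 1000806457, XXX 26.10.2021 11:03 sz. bevásárlókosárhoz"
    "<o:p></o:p></span></p>"
)

collector = DirectTextCollector("span")
collector.feed(sample)
collector.close()
last_text = collector.texts[-1] if collector.texts else None
print(last_text)
```

The link text is skipped because its immediate parent is the a element, not the span, which is the same filtering the ChildNodes query above relies on.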
Related
<div class="sResMain">
<b>
dogukan1905
</b>
<img src="http://eu.ipstatic.net/images/male.gif" width="11" height="11" class="sResSex">
20
<br>
<div class="sResMainTxt">
<div class="sResTxtField">I study at aircraft technology...</div></div></div>
I want to select the number (20) that sits between the img and br tags, but I couldn't.
From what you posted, the text you are trying to parse belongs directly to <div class="sResMain">, and it is the only text that <div class="sResMain"> itself owns. Jsoup has a method that returns just the text held in a node's immediate text-node children: ownText() on Element.
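import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
// htmlStr holds the HTML shown in the question
Document doc = Jsoup.parse(htmlStr);
Elements elements = doc.select(".sResMain");
for (Element e : elements) {
    // ownText() excludes the text of child elements such as <b> and the nested divs
    String text = e.ownText();
    System.out.println(text);
}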
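To make the "own text" idea concrete, here is a rough Python sketch with the standard library's xml.etree.ElementTree. It assumes the snippet has been rewritten as well-formed XML (self-closing img and br, shortened src), because ElementTree is not an HTML parser; the own_text helper is a hypothetical name of mine, not a Jsoup or ElementTree API.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the markup from the question (assumption).
snippet = """<div class="sResMain">
<b>dogukan1905</b>
<img src="male.gif" width="11" height="11" class="sResSex"/>
20
<br/>
<div class="sResMainTxt"><div class="sResTxtField">I study at aircraft technology...</div></div>
</div>"""


def own_text(element):
    """Join the text that belongs to the element itself, ignoring descendants.

    In ElementTree, text that follows a child tag is stored on that child's
    .tail, so the element's own text is its .text plus every child's .tail.
    """
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return " ".join(part.strip() for part in parts if part.strip())


root = ET.fromstring(snippet)
print(own_text(root))
```

Here the number lives in the tail of the img element, which is why selecting "the text between img and br" with an element selector alone does not work.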
This is my HTML file data:
<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>
</div>
</div>
</div>
From the code above I need to fetch the values of the following tag classes:
coursetitle
coursetitle href link
pricetag
timeline inline-block
university
description
instructor name
But coursetitle appears in two places and I need it only once. Likewise, the instructor name is not wrapped in any specific tag that I could fetch.
My XPath queries are:
novoedData = HtmlXPathSelector(response)
courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()
But the number of values in each list variable is different:
len(courseTitle) = 40 (doubled because of the repetition)
len(courseDetailLink) = 40 (doubled because of the repetition)
len(courseInstructorName) = 160 (unwanted whitespace text nodes are included because this value has no specific tag)
len(coursePriceType) = 20 (correct count, no repetition)
len(courseShortSummary) = 20 (correct count, no repetition)
len(courseUniversity) = 20 (correct count, no repetition)
Kindly modify my XPath queries to solve this problem. Thanks in advance.
You don't need re:test here; simply do:
>>> s = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')[0]
>>> len(s)
1
>>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
[u'https://novoed.com/hc']
Because @class="row-fluid" is an exact match, it skips the "hovered row-fluid" block, so each title is returned only once. Also note that once s points at the right node, you can continue with relative queries from it.
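A sketch of "continue from the container" that runs without Scrapy: the following uses Python's xml.etree.ElementTree (a swapped-in stand-in for the Scrapy selector, with a much smaller XPath subset) and a trimmed, well-formed copy of one course box. The dictionary keys and the shortened description are my own assumptions. It also picks up the instructor name, which is the text that follows the timeline div.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of one course box from the question (assumption).
page = """<root>
<article class='course-box'>
  <div class='row-fluid'>
    <div class='span10'>
      <h2 class='coursetitle'><a href='https://novoed.com/hc'>Hippocrates Challenge</a></h2>
      <figure class='pricetag'>Free</figure>
      <div class='timeline independent-text'>
        <div class='timeline inline-block'>Starting Spring 2014</div>
      </div>
      By Jill Helms
      <div class='university'>Stanford University</div>
    </div>
  </div>
  <div class='hovered row-fluid'>
    <div class='span10'>
      <h2 class='coursetitle'><a href='https://novoed.com/hc'>Hippocrates Challenge</a></h2>
      <p class='description'>Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine.</p>
    </div>
  </div>
</article>
</root>"""

courses = []
for box in ET.fromstring(page).iter("article"):
    # Exact class match: 'row-fluid' does not match 'hovered row-fluid'.
    summary = box.find("./div[@class='row-fluid']/div[@class='span10']")
    hovered = box.find("./div[@class='hovered row-fluid']/div[@class='span10']")
    link = summary.find("./h2[@class='coursetitle']/a")
    timeline = summary.find("./div[@class='timeline independent-text']")
    courses.append({
        "title": link.text.strip(),
        "href": link.get("href"),
        "price": summary.find("./figure[@class='pricetag']").text.strip(),
        "timeline": "".join(timeline.itertext()).strip(),
        # The instructor has no tag of its own: it is the text after the timeline div.
        "instructor": (timeline.tail or "").strip(),
        "university": summary.find("./div[@class='university']").text.strip(),
        "description": hovered.find("./p[@class='description']").text.strip(),
    })

print(courses)
```

Building one record per article keeps all fields of a course together, so the lists can no longer drift out of step the way six independent document-wide queries can.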
Before explaining, I am using VB.net and HtmlAgilityPack.
I have the below html, all three sections have the same format. I am using htmlagilitypack to extract the data from the Title and Date. My code extracts the title correctly but the date is only extracted from the first instance and repeated 3 times:
HtmlAgilityPack code:
For Each h4 As HtmlNode In docnews.DocumentNode.SelectNodes("//h4[(@class='title')]")
Dim date1 As HtmlNode = docnews.DocumentNode.SelectSingleNode("//span[starts-with(@class, 'date ')]")
Dim newsdate As String = date1.InnerText
MessageBox.Show(h4.InnerText)
MessageBox.Show(newsdate)
Next
I thought that, being inside the loop over each h4, I would get its associated date accordingly...
HTML code:
<div class="article-header" style="" data-itemid="920729" data-source="ABC" data-preview="Text 1">
<h4 class="title">Text for Mr. A</h4>
<div class="byline">
<span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span>
<span class="source" title="AGE">18</span>
</div>
<div class="preview">Text 1 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920720" data-source="ABC" data-preview="Text 2">
<h4 class="title">Text for Mr. B</h4>
<div class="byline">
<span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span>
<span class="source" title="AGE">25</span>
</div>
<div class="preview">Text 2 Preview</div>
</div>
<div class="article-header" style="" data-itemid="920719" data-source="ABC" data-preview="Text 3">
<h4 class="title">Text for Mr. C</h4>
<div class="byline">
<span class="date timestamp"><span title="22 October 2013">22-10-2013</span></span>
<span class="source" title="AGE">20</span>
</div>
<div class="preview">Text 3 Preview</div>
</div>
Final Output should be:
Text for Mr. A
29-11-2013
Text for Mr. B
27-11-2013
Text for Mr. C
22-10-2013
What I am getting with my code:
Text for Mr. A
29-11-2013
Text for Mr. B
29-11-2013
Text for Mr. C
29-11-2013
Any help is much appreciated.
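You need to anchor your second XPath so that it searches around the current h4 instead of the whole document:
Dim date1 As HtmlNode = h4.ParentNode.SelectSingleNode(".//span[starts-with(@class, 'date ')]")
                        ^^^^^^^^^^^^^                   ^^^
The leading .// tells XPath to search below the node on which the query is executed, whereas a leading // always searches the whole document, which is why your code returned the first date every time. By calling SelectSingleNode on h4.ParentNode (the HtmlAgilityPack property for the parent) you get the date below the div that contains the h4.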
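The difference between a document-wide and an anchored lookup can be reproduced in a few lines of Python with xml.etree.ElementTree. This is only a sketch under assumptions: the markup is a trimmed copy of the question's HTML, ElementTree has no starts-with() and no parent pointer, so the class is matched exactly and the loop runs over the article-header containers rather than going from each h4 to its parent.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first two articles from the question (assumption).
news = """<root>
<div class="article-header">
  <h4 class="title">Text for Mr. A</h4>
  <div class="byline"><span class="date timestamp"><span title="29 November 2013">29-11-2013</span></span></div>
</div>
<div class="article-header">
  <h4 class="title">Text for Mr. B</h4>
  <div class="byline"><span class="date timestamp"><span title="27 November 2013">27-11-2013</span></span></div>
</div>
</root>"""

root = ET.fromstring(news)
DATE_PATH = ".//span[@class='date timestamp']/span"

# Lookup on the document root inside the loop: always the first date.
global_dates = [root.find(DATE_PATH).text for _ in root.iter("h4")]

# Lookup anchored on each article container: the date that belongs to it.
pairs = [
    (article.find("./h4[@class='title']").text, article.find(DATE_PATH).text)
    for article in root.findall("./div[@class='article-header']")
]

print(global_dates)
print(pairs)
```

The first list repeats one date for every heading, which mirrors the output in the question; the second pairs each title with its own date.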
This is my first attempt parsing a webpage using Nokogiri.
I am trying to extract the addresses from a webpage and store them in a CSV file. So far, I've only been able to extract the City, State, and Zip fields.
I don't know how to extract the facility name, address, phone numbers, and company information. The address may contain one or two street components.
For the phone, there may be one or more phone numbers. The phone numbers may be regular numbers or fax numbers, but they are only indicated in the text as opposed to a tag. For the company, I'd like to be able to extract the URL and the name.
Each address on the page is enclosed as follows:
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
<!-- address entry -->
<div id='4567' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>456 DEF Rd</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>555.555.5555</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
Here's my very basic set-up.
require 'nokogiri'
require 'open-uri'
require 'csv'
doc = Nokogiri::HTML(open('[URL]'))
Cities = Array.new
States = Array.new
Zips = Array.new
doc.css("p[class='address']").css("span[class='city']").each do |city|
Cities << city.content
end
doc.css("p[class='address']").css("span[class='state']").each do |state|
States << state.content
end
doc.css("p[class='address']").css("span[class='zip']").each do |zip|
Zips << zip.content
end
CSV.open("myCSV.csv", "wb") do |row|
row << ["City", "State", "Zip"]
(0..Cities.length - 1).each do |index|
row << [Cities[index], States[index], Zips[index]]
end
end
Storing the information in separate arrays here seems very clunky. I'd basically like to make a row entry in a CSV table for each occurrence of the address node in the source document, and then populate it with fields if they exist:
Facility St_1 St_2 City State Zip Phone Fax URL Company
======== ===== ===== ===== ====== ==== ====== ==== ==== ============
xxxxxxxx xxxx xxxx xxxxx xxxx xxxxx xxxx xxxxxxxx
xxxxxxxx xxxx xxxxx xxxx xxxxx xxxx xxxxx xxxx xxxx xxxxxxxx
Can someone help me?
You probably have some edge cases that this won't handle, but this takes care of your example. You'll need to change the doc to read from the real page instead of the data segment, and you'll need to change the csv to print to a file instead of display inline like I've done.
require 'nokogiri'
require 'open-uri'
require 'csv'
doc = Nokogiri::HTML(DATA.read)
CompanyInfo = Struct.new :facility, :street1, :street2, :city, :state, :zip, :phone, :fax, :url, :company
company_infos = []
doc.css("div.address").each do |address_div|
facility = address_div.at_css('.address_header .header_name').text.strip
info = address_div.css('div.address_details .info')
street1, street2 = info.css('.street').map(&:text)
city = info.at_css('.city').text
state = info.at_css('.state').text
zip = info.at_css('.zip').text
phone, fax = info.css('.phone .tel').map(&:text)
url = info.at_css('.company a')['href']
company = info.at_css('.company a').text
company_infos << CompanyInfo.new(facility, street1, street2, city, state, zip, phone, fax, url, company)
end
csv = CSV.generate do |csv|
csv << %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]
company_infos.each do |company_info|
csv << company_info.to_a
end
end
csv # => "Facility,Street1,Street2,City,State,Zip,Phone,Fax,URL,Company\nFacility Name,123 ABC St,Unit 1,New York,NY,10022,999.999.9999,888.888.8888,{URL},Company Name\n"
__END__
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
You're asking for a lot, but I'll get you started:
fields = %w{street1 street2 phone fax city state zip}
doc.search('div.address').each do |div|
address = {}
address['street1'], address['street2'] = *div.search('span.street').map(&:text)
address['phone'], address['fax'] = *div.search('span.tel').map(&:text)
['city', 'state', 'zip'].each{|f| address[f] = div.at("span.#{f}").text}
csv << fields.map{|f| address[f]}
end
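Both answers assign phone and fax by position, so an entry that lists only a fax would land in the phone column. As a hedged illustration of the "one row per address, blank when a field is missing" layout, here is a Python sketch using the standard csv and xml.etree.ElementTree modules. It assumes the page fragment is well-formed (as the posted sample is); FIELDS, text_of and address_rows are hypothetical names of mine, and the label check relies on the literal "Phone:" and "Fax:" prefixes shown in the question.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["Facility", "Street1", "Street2", "City", "State", "Zip",
          "Phone", "Fax", "URL", "Company"]

# Second address entry from the question, wrapped in a root element (assumption).
SAMPLE = """<root>
<div id='4567' class='address'>
  <div class='address_header'>
    <h1 class='header_name'><strong><a href='{URL}'>Facility Name</a></strong></h1>
  </div>
  <div class='address_details'><div class='info'>
    <p class='address'>
      <span class='street'>456 DEF Rd</span><br />
      <span class='city'>New York</span>,
      <span class='state'>NY</span>
      <span class='zip'>10022</span>
    </p>
    <p class='phone'>Phone: <span class='tel'>555.555.5555</span></p>
    <p class='company'>Company: <a href='{URL}'>Company Name</a></p>
  </div></div>
</div>
</root>"""


def text_of(node):
    """Stripped text of a node and its descendants, or '' if the node is missing."""
    return "".join(node.itertext()).strip() if node is not None else ""


def address_rows(markup):
    """Yield one list per address div, in FIELDS order, with '' for absent fields."""
    for entry in ET.fromstring(markup).findall(".//div[@class='address']"):
        info = entry.find(".//div[@class='info']")
        streets = [text_of(s) for s in info.findall(".//span[@class='street']")]
        streets = (streets + ["", ""])[:2]  # pad or trim to exactly two columns
        numbers = {"phone": "", "fax": ""}
        for p in info.findall("./p[@class='phone']"):
            label = (p.text or "").strip().rstrip(":").lower()  # "Phone:" -> "phone"
            if label in numbers:
                numbers[label] = text_of(p.find("./span[@class='tel']"))
        company = info.find("./p[@class='company']/a")
        yield [
            text_of(entry.find(".//h1[@class='header_name']")),
            *streets,
            text_of(info.find(".//span[@class='city']")),
            text_of(info.find(".//span[@class='state']")),
            text_of(info.find(".//span[@class='zip']")),
            numbers["phone"],
            numbers["fax"],
            company.get("href", "") if company is not None else "",
            text_of(company),
        ]


rows = list(address_rows(SAMPLE))
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerows(rows)
print(buffer.getvalue())
```

Writing to a StringIO keeps the sketch self-contained; opening a real file with newline="" and passing it to csv.writer would produce the CSV on disk instead.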
This is the HTML I am parsing:
<div class="audio" id="audio59779184_153635497_-28469067_16663">
<table width="100%" cellspacing="0" cellpadding="0"><tbody><tr>
<td>
<a onclick="playAudioNew('59779184_153635497_-28469067_16663')"><div class="play_new" id="play59779184_153635497_-28469067_16663"></div></a>
<input id="audio_info59779184_153635497_-28469067_16663" type="hidden" value="http://cs5888.userapi.com/u59779184/audio/0fc0fc5d8799.mp3,245">
</td>
<td class="info">
<div class="duration fl_r" onmousedown="if (window.audioPlayer) audioPlayer.switchTimeFormat('59779184_153635497_-28469067_16663', event);">4:05</div>
<div class="audio_title_wrap">
<b>Don Omar feat. Lucenzo and Pallada</b> – <span id="title59779184_153635497_-28469067_16663"> Danza Kuduro (Dj Fleep Mashup)(21.05.12).ılııllı.♫♪Новая Клубная Музыка♫♪.ıllıılı.http://vkontakte.ru/public28469067 </span>
</div>
</td>
</tr></tbody></table>
<div class="player_wrap">
<div class="playline" id="line59779184_153635497_-28469067_16663"><div></div></div>
<div class="player" id="player59779184_153635497_-28469067_16663" ondragstart="return false;" onselectstart="return false;">
<table width="100%" border="0" cellspacing="0" cellpadding="0"><tbody><tr id="audio_tr59779184_153635497_-28469067_16663" valign="top">
<td style="padding: 0px; width: 100%; position: relative;">
<div class="audio_white_line" id="audio_white_line59779184_153635497_-28469067_16663" onmousedown="audioPlayer.prClick(event);"></div>
<div class="audio_load_line" id="audio_load_line59779184_153635497_-28469067_16663" onmousedown="audioPlayer.prClick(event);"><!-- --></div>
<div class="audio_progress_line" id="audio_progress_line59779184_153635497_-28469067_16663" onmousedown="audioPlayer.prClick(event);">
<div class="audio_pr_slider" id="audio_pr_slider59779184_153635497_-28469067_16663"><!-- --></div>
</div>
</td>
<td id="audio_vol59779184_153635497_-28469067_16663" style="position: relative;"></td>
</tr></tbody></table>
</div>
</div>
</div>
And the code I'm using:
require 'watir'
require 'nokogiri'
require 'open-uri'
ff = Watir::Browser.new
ff.goto 'http://vk.com/wall-28469067_16663'
htmlSource = ff.html
doc = Nokogiri::HTML(htmlSource, nil, 'UTF-8')
doc.xpath('//div[@class="audio"]/@id').each do |idSongs|
  divSong = doc.css('div#'+idSongs)
  aa = idSongs.text
  link = doc.xpath("//input[@id='#{aa}']//@value")
  puts link
  puts '========================='
end
ff.close
If I write:
aa = 'audio_info59779184_153625626_-28469067_16663'
then puts link prints a good result: "http://cs5333.userapi.com/u14251690/audio/bcf80f297520.mp3,217".
Why does puts link print " " when aa = idSongs.text?
To answer the question asked, link returns "", because it's an empty NodeSet. In other words, Nokogiri didn't find what you were looking for. A NodeSet behaves like an Array, so when you try to puts an empty array you get "".
Because it's a NodeSet you should iterate over it, as you would an array. (The same is true of your doc.css, which would also return a NodeSet.)
It is empty because Nokogiri can't find what you asked for. You're searching for the contents of aa, which are:
"audio59779184_153635497_-28469067_16663"
Substituting that into "//input[@id='#{aa}']" gives:
"//input[@id='audio59779184_153635497_-28469067_16663']"
but it should be:
"//input[@id='audio_info59779184_153635497_-28469067_16663']"
Searching for that finds content:
doc.search("//input[@id='audio_info59779184_153635497_-28469067_16663']").size # => 1
Short answer to "Why is it, if aa = idSongs.text does puts link return " " ?": you're trying to find an input element that has the same DOM id as the div you've already matched. No such input exists, so Nokogiri returns an empty NodeSet, which prints as an empty string.
It looks like they reuse the audio identifier in several places, so to make your code more versatile, extract it once and then prefix it with whatever you need to access, like so:
doc.xpath('//div[@class="audio"]/@id').each do |idSongs|
  divSong = doc.css('div#'+idSongs)
  aa = idSongs.text
  identifier = (match = aa.match(/^audio(.*)$/)) ? match[1] : ""
  link = doc.xpath("//input[@id='audio_info#{identifier}']//@value")
  puts link
  puts '========================='
  ## now if you want the title (a span in the posted HTML, so read its text):
  title = doc.xpath("//span[@id='title#{identifier}']").text
  puts title
end
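The same "strip the prefix, reuse the suffix" approach can be sketched in Python with re and xml.etree.ElementTree. The markup below is a flattened, well-formed excerpt of the posted HTML with a shortened title, audio_entries is a hypothetical helper, and reading the number after the comma as a duration in seconds is my assumption (it matches the 4:05 shown in the page).

```python
import re
import xml.etree.ElementTree as ET

# Flattened, well-formed excerpt of the posted markup (assumption).
wall = """<root>
<div class="audio" id="audio59779184_153635497_-28469067_16663">
  <input id="audio_info59779184_153635497_-28469067_16663" type="hidden"
         value="http://cs5888.userapi.com/u59779184/audio/0fc0fc5d8799.mp3,245"/>
  <span id="title59779184_153635497_-28469067_16663"> Danza Kuduro (Dj Fleep Mashup) </span>
</div>
</root>"""


def audio_entries(root):
    """Yield one dict per div.audio, looking up siblings by the shared id suffix."""
    for div in root.iter("div"):
        match = re.fullmatch(r"audio(.+)", div.get("id", ""))
        if div.get("class") != "audio" or match is None:
            continue
        suffix = match.group(1)
        info = root.find(f".//input[@id='audio_info{suffix}']")
        title = root.find(f".//span[@id='title{suffix}']")
        if info is None:
            continue  # no matching hidden input for this player
        # Assumed format: "<url>,<duration in seconds>".
        url, _, seconds = info.get("value", "").rpartition(",")
        yield {
            "id": suffix,
            "url": url,
            "seconds": int(seconds),
            "title": text.strip() if (text := title.text if title is not None else "") else "",
        }


entries = list(audio_entries(ET.fromstring(wall)))
print(entries)
```

Skipping entries without a matching input avoids the silent empty result that caused the confusion in the question.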