XPath Query to select hyperlink - xpath

The following is a subset of xml from a twitter atom feed:
<entry>
<id>tag:search.twitter.com,2005:18232030105964545</id>
<published>2010-12-24T09:10:29Z</published>
<link type="text/html" rel="alternate" href="http://twitter.com/KTNKenya/statuses/18232030105964545"/>
<title>Synovate Poll: PM Raila Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... http://fb.me/yjmMbmBx</title>
<content type="html">Synovate Poll: PM <b>Raila</b> Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... <a href="http://fb.me/yjmMbmBx">http://fb.me/yjmMbmBx</a></content>
<updated>2010-12-24T09:10:29Z</updated>
<link type="image/png" rel="image" href="http://a3.twimg.com/profile_images/701825859/NEW_KTN_normal.png"/>
<google:location>nairobi, kenya</google:location>
<twitter:geo>
</twitter:geo>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>KTNKenya (KTN Kenya)</name>
<uri>http://twitter.com/KTNKenya</uri>
</author>
</entry>
From the <title>...</title> element, I need to select the hyperlink http://fb.me/yjmMbmBx via an XPath query. How do I do it? Is it possible?
*I'm an XPath newbie.
Thanks.

You have two options:
Use <title> (XPath: "/entry/title/text()") and extract the URL yourself, e.g. using a regex or by finding the last instance of "http://" in the string.
Get the content data first:
/entry/content[@type="html"]/text()
Then you need to parse this as HTML, extract any link tags, and use the href attribute of those tags. How you do this last part depends on the language/environment you are working in.
Update: Added basic example code for option 1 above, as requested:
xmlpp::Element *node = parser.get_document()->get_root_node();
xmlpp::NodeSet results = node->find("/entry/title/text()");
std::string link = "";
if (!results.empty()) {
    xmlpp::ContentNode* content = dynamic_cast<xmlpp::ContentNode*>(results.front());
    std::string text = content ? content->get_content() : "";
    // rfind returns std::string::npos (an unsigned size type) on failure, so do not store it in an int
    std::string::size_type res = text.rfind("http://");
    if (res == std::string::npos)
        res = text.rfind("https://");
    if (res != std::string::npos)
        link = text.substr(res);
}
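Option 2 above (read the content element, then parse it as HTML) can be sketched in Python with only the standard library. This is a rough illustration, not the original answer's code: the trimmed ENTRY string, the HrefCollector class and the links_in_content function are hypothetical names, and it assumes the HTML inside <content> is escaped, as it normally is in an Atom feed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical, trimmed-down entry. The HTML inside <content> is assumed to be
# escaped (&lt;a ...&gt;), which is how Atom feeds usually carry type="html".
ENTRY = (
    "<entry>"
    "<title>Synovate Poll ... http://fb.me/yjmMbmBx</title>"
    '<content type="html">Synovate Poll ... '
    '&lt;a href="http://fb.me/yjmMbmBx"&gt;http://fb.me/yjmMbmBx&lt;/a&gt;'
    "</content>"
    "</entry>"
)


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def links_in_content(entry_xml):
    """Return the hrefs found in the entry's <content type="html"> element."""
    root = ET.fromstring(entry_xml)  # root is the <entry> element
    content = root.find("content[@type='html']")
    collector = HrefCollector()
    collector.feed(content.text or "")  # the XML parser has already unescaped the HTML
    collector.close()
    return collector.hrefs


print(links_in_content(ENTRY))
```

The XML parser and the HTML parser are kept separate on purpose: the feed is XML, but the payload of <content> is an HTML string and has to be parsed a second time.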

With atom prefix bound to http://www.w3.org/2005/Atom namespace URI, use:
/atom:feed/atom:entry/atom:title[contains(.,'http://')]
This selects every atom:title element child of atom:entry, having the string "http://" contained in its string value.
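Note that the expression above selects the whole title element, not the URL inside it, so the URL still has to be cut out of the string value afterwards. A minimal Python sketch of that combination follows, using the standard library's limited XPath support; the FEED sample, the NS mapping name and the title_links function are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Prefix-to-URI binding, as described in the answer above.
NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical two-entry feed; only the first title contains a link.
FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><title>Synovate Poll ... http://fb.me/yjmMbmBx</title></entry>"
    "<entry><title>No link in this one</title></entry>"
    "</feed>"
)


def title_links(feed_xml):
    """Return the first URL found in each atom:title that contains one."""
    root = ET.fromstring(feed_xml)  # root is the atom:feed element
    links = []
    for title in root.findall("atom:entry/atom:title", NS):
        match = re.search(r"https?://\S+", title.text or "")
        if match:
            links.append(match.group(0))
    return links


print(title_links(FEED))
```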

Related

extract Xpath for string in a div class

I have the HTML below:
<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>
I would like to extract "GGRM.JK" from the HTML. This XPath:
//div[contains(@class, "symbol")]
returns the element, not the text "GGRM.JK".
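Since it seems you are using Python, try the following:
import lxml.html as lh
data = """[your html above]"""
doc = lh.fromstring(data)
# version 1: read the class attribute and split it on the single quotes
target = doc.xpath('//div[contains(@class, "symbol")]/@class')[0]
print(target.split("'")[1])
# version 2: read the href of a link inside the div
# (assumes the div contains an <a> whose href ends in "=GGRM.JK"; the snippet above does not show one)
target2 = doc.xpath('//div[contains(@class, "symbol")]/a/@href')[0]
print(target2.split('=')[1])
In either case, the output should be
GGRM.JK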
The shortest way to get the substring you want with XPath only, without postprocessing, is to use the functions substring-after and substring-before.
Here is an example of how to get 'GGRM.JK' from both the class and href attributes.
import lxml.html as lh
htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>"""
htmlDom = lh.fromstring(htmlText)
# the snippet above has no <a> inside the div, so this returns an empty string for it
fromHref = htmlDom.xpath('substring-after(//div/a/@href, "=")')
print(fromHref)
fromClass = htmlDom.xpath('substring-before(substring-after(//div/@class, ": \'"), "\'")')
print(fromClass)
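If lxml is not available, the same value can be pulled out of the class attribute with the standard library alone. This is only a sketch: it assumes the fragment is well-formed enough to be parsed as XML, and the SNIPPET constant and symbol_from_class function are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# The fragment from the question; it is well-formed, so an XML parser accepts it.
SNIPPET = """<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>"""


def symbol_from_class(fragment):
    """Return the quoted value that follows 'symbol :' in the div's class attribute."""
    div = ET.fromstring(fragment)
    match = re.search(r"symbol\s*:\s*'([^']+)'", div.get("class", ""))
    return match.group(1) if match else None


print(symbol_from_class(SNIPPET))
```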

Unable to get nokogiri xmlelement text value

I have the following XML content.
<info>
<meta name="alias">alias1</meta>
<meta name="score">.60</meta>
</info>
<info>
<meta name="alias">alias2</meta>
<meta name="score">.50</meta>
</info>
I need to get back the alias and score values for each <info> element, but I am having difficulty doing so.
doc.xpath("//info").each do |info_entry|
info_entry.xpath("meta").each do |meta_entry|
if meta_entry['name'] == 'alias'
the_alias = meta_entry.xpath('text()').text
elsif meta_entry['name'] == 'score'
score = meta_entry.xpath('text()').text
end
# add struct containing alias and score to list
end
end
However, I'm not getting anything from text. I've tried many different things: inner_text, inner_html, content, value; nothing works. I've also tried meta_entry.at, meta_entry.search, and so on.
Is there something I'm missing? Any advice would be appreciated.
You need to get rid of xpath('text()'). And you can get rid of the conditional and build a workable data structure as you go, like this:
def meta_contents(doc)
doc.xpath("//info").map do |info_entry|
info_entry.xpath("meta").map do |meta_entry|
[meta_entry["name"], meta_entry.text]
end.to_h
end
end
>> meta_contents(doc)
#> [{"alias"=>"alias1", "score"=>".60"}, {"alias"=>"alias2", "score"=>".50"}]
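For comparison, the same "one hash per <info>" structure can be built in Python with the standard library. This is a sketch under one assumption: the two <info> elements are wrapped in a hypothetical <root> element, since an XML document needs a single root. The XML constant and the function name simply mirror the Ruby answer.

```python
import xml.etree.ElementTree as ET

# The <info> elements from the question, wrapped in an assumed <root> element.
XML = """<root>
<info>
<meta name="alias">alias1</meta>
<meta name="score">.60</meta>
</info>
<info>
<meta name="alias">alias2</meta>
<meta name="score">.50</meta>
</info>
</root>"""


def meta_contents(xml_text):
    """Return one dict per <info>, mapping each meta name to its text."""
    root = ET.fromstring(xml_text)
    return [
        {meta.get("name"): (meta.text or "") for meta in info.findall("meta")}
        for info in root.iter("info")
    ]


print(meta_contents(XML))
```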

Nokogiri: How to get node name with namespace prefix

I am trying (for testing purposes) to parse a Google Merchant XML feed, defined as:
<?xml version="1.0" encoding="UTF-8"?>
<feed xml:lang="cs" xmlns="http://www.w3.org/2005/Atom" xmlns:g="http://base.google.com/ns/1.0">
<link rel="alternate" type="text/html" href="http://www.example.com"/>
<link rel="self" type="application/atom+xml" href="http://www.example.com/cs/feed/google.xml"/>
<title>EasyOptic</title>
<updated>2014-08-01T16:31:11Z</updated>
<entry>
<title>Sluneční Brýle Producer 1 133a code_color_1 Color 1 133a RayBan</title>
<link href="http://www.example.com/cs/katalog/price-category-1-style-1-optical-glasses-producer-1-rayban-133a-code_color_1-color-1"/>
<summary>Moc krásný a velmi levný produkt</summary>
<updated>2014-08-01T16:31:11Z</updated>
<g:id>EO111</g:id>
<g:condition>new</g:condition>
<g:price>100 Kč</g:price>
<g:availability>in stock</g:availability>
<g:image_link>http://www.example.com/images/fallback/default.png</g:image_link>
<g:additional_image_link>http://www.example.com/images/fallback/default.png</g:additional_image_link>
<g:brand>Producer 1</g:brand>
<g:mpn>EO111</g:mpn>
<g:gender>female</g:gender>
<g:google_product_category>Apparel &amp; Accessories &gt; Clothing Accessories &gt; Sunglasses</g:google_product_category>
<g:product_type>Sluneční Brýle </g:product_type>
</entry>
<entry>
<title>Sluneční Brýle Producer 1 133a code_color_1 Color 1 133a RayBan</title>
<link href="http://www.example.com/cs/katalog/price-category-1-style-1-optical-glasses-producer-1-rayban-133a-code_color_1-color-1"/>
<summary>Moc krásný a velmi levný produkt</summary>
<updated>2014-08-01T16:31:10Z</updated>
<g:id>EO111</g:id>
<g:condition>new</g:condition>
<g:price>100 Kč</g:price>
<g:availability>in stock</g:availability>
<g:image_link>http://www.example.com/images/fallback/default.png</g:image_link>
<g:additional_image_link>http://www.example.com/images/fallback/default.png</g:additional_image_link>
<g:brand>Producer 1</g:brand>
<g:mpn>EO111</g:mpn>
<g:gender>female</g:gender>
<g:google_product_category>Apparel &amp; Accessories &gt; Clothing Accessories &gt; Sunglasses</g:google_product_category>
<g:product_type>Sluneční Brýle </g:product_type>
</entry>
</feed>
with this ruby script:
require 'nokogiri'
def have_node_with_children(body, path_type, path, children_names)
doc = Nokogiri::XML(body)
case path_type
when :xpath
nodes = doc.xpath(path)
when :css
nodes = doc.css(path)
else
nodes = doc.xpath(path)
end
nodes.each do |node|
nchildren_names=[]
for child in node.children
nchildren_names << child.name unless child.to_s.strip =="" #nokogiri takes formating spaces as blank node with name "text"
end
puts("demanded_nodes: #{children_names.sort.join(", ")} , nodes found: #{nchildren_names.sort.join(", ")} ")
missing = children_names - nchildren_names
over = nchildren_names - children_names
puts("Missing: #{missing.sort.join(", ")} , Over: #{over.sort.join(", ")} ")
end
end
EXPECTED_ENTRY_NODES=[
'title',
'link',
'summary',
'updated',
'g:id',
'g:condition',
'g:price',
'g:availability',
'g:image_link',
'g:additional_image_link',
'g:brand',
'g:mpn',
'g:gender',
'g:google_product_category',
'g:product_type'
]
file=File.open('google.xml')
have_node_with_children(file.read,:xpath,'//xmlns:entry',EXPECTED_ENTRY_NODES)
It finds the 'entry' node (thanks for this tip).
But when collecting its children, child.name returns the name without the namespace prefix (e.g. <g:brand>.name => 'brand'),
so the comparison with the demanded fields fails.
Does anybody have a tip on how to get the node name together with its namespace prefix?
If I delete the namespace definitions everything works fine, but I cannot change the original XML.
I use this check in an RSpec request test, so other namespaces with possibly identical base node names can appear.
xml_doc = Nokogiri::XML(xml)
xml_doc.xpath("//xmlns:entry").each do |entry|
  entry.xpath("./*").each do |element| # step through all element nodes that are direct children of <entry>
    prefix = element.namespace&.prefix # nil for the default namespace (or no namespace at all)
    puts prefix ? "#{prefix}:#{element.name}" : element.name
  end
  break # only show output for the first <entry>
end
--output:--
title
link
summary
updated
g:id
g:condition
g:price
g:availability
g:image_link
g:additional_image_link
g:brand
g:mpn
g:gender
g:google_product_category
g:product_type
Now about this:
for child in node.children
A well-grounded Rubyist rarely uses a for loop, because a for loop just calls each() under the hood, so Rubyists call each() directly:
node.children.each do |child|
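The same prefix-recovery idea can be sketched in Python. The standard library's ElementTree reports tags as {uri}local and drops the prefixes, so this sketch records the prefix declarations while parsing and maps each URI back to its prefix. The shortened FEED sample and the prefixed_child_names function are hypothetical, and the approach assumes each namespace URI is bound to only one prefix in the document.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the feed from the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:g="http://base.google.com/ns/1.0">
  <entry>
    <title>Sunglasses</title>
    <g:id>EO111</g:id>
    <g:brand>Producer 1</g:brand>
  </entry>
</feed>"""


def prefixed_child_names(xml_text, parent_name):
    """Return the children of the first `parent_name` element as prefix:local names."""
    uri_to_prefix = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events yield (prefix, uri) pairs; the default namespace has prefix "".
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            uri_to_prefix[uri] = prefix
        elif root is None:
            root = payload  # the first "start" event is the document root

    def qname(tag):
        # ElementTree stores namespaced tags as "{uri}local".
        if tag.startswith("{"):
            uri, local = tag[1:].split("}", 1)
            prefix = uri_to_prefix.get(uri, "")
            return f"{prefix}:{local}" if prefix else local
        return tag

    parent = next(el for el in root.iter() if qname(el.tag) == parent_name)
    return [qname(child.tag) for child in parent]


print(prefixed_child_names(FEED, "entry"))
```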

How can I reach this node with Nokogiri?

Here's the start of my html:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 12 (filtered medium)">
<!--[if !mso]><style>v\\:* {behavior:url(#default#VML);}\no\\:* {behavior:url(#default#VML);}\nw\\:* {behavior:url(#default#VML);}\n.shape {behavior:url(#default#VML);}\n</style><![endif]--><style><!--\n/* Font Definitions */\n@font-face\n\t{font-family:"Cambria Math";\n\tpanose-1:2 4 5 3 5 4 6 3 2 4;}\n@font-face\n\t{font-family:Calibri;\n\tpanose-1:2 15 5 2 2 2 4 3 2 4;}\n@font-face\n\t{font-family:Tahoma;\n\tpanose-1:2 11 6 4 3 5 4 4 2 4;}\n/* Style Definitions */\np.MsoNormal, li.MsoNormal, div.MsoNormal\n\t{margin:0in;\n\tmargin-bottom:.0001pt;\n\tfont-size:12.0pt;\n\tfont-family:"Times New Roman","serif";}\na:link, span.MsoHyperlink\n\t{mso-style-priority:99;\n\tcolor:blue;\n\ttext-decoration:underline;}\na:visited, span.MsoHyperlinkFollowed\n\t{mso-style-priority:99;\n\tcolor:purple;\n\ttext-decoration:underline;}\np\n\t{mso-style-priority:99;\n\tmso-margin-top-alt:auto;\n\tmargin-right:0in;\n\tmso-margin-bottom-alt:auto;\n\tmargin-left:0in;\n\tfont-size:12.0pt;\n\tfont-family:"Times New Roman","serif";}\nspan.EmailStyle18\n\t{mso-style-type:personal-reply;\n\tfont-family:"Calibri","sans-serif";\n\tcolor:#1F497D;}\n.MsoChpDefault\n\t{mso-style-type:export-only;\n\tfont-size:10.0pt;}\n@page WordSection1\n\t{size:8.5in 11.0in;\n\tmargin:1.0in 1.0in 1.0in 1.0in;}\ndiv.WordSection1\n\t{page:WordSection1;}\n--> </style>
<!--[if gte mso 9]><xml>\n<o:shapedefaults v:ext="edit" spidmax="1026" />\n</xml><![endif]--> <!--[if gte mso 9]> <xml>\n<o:shapelayoutv:ext="edit">\n<o:idmapv:ext="edit"data="1"/>\n</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><p> </p></span></p>
<p class="MsoNormal"><a name="_MailEndCompose"><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><p> </p></span></a></p>
<div><div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in"><p class="MsoNormal"><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> EMAIL SENDER NAME [mailto:EMAILADDRESS@FAKE.COM] <br><b>Sent:</b>!! DATE I NEED TO GRAB HERE !! <br><b>To:</b> EMAIL ADDRESS HERE <br><b>Subject:</b> SUBJECT LINE HERE <p></p></span></p></div></div>
I need to grab the date the email was sent. Here's what I've tried:
label_tag_name = 'div div p span br b'
if label_tag = @doc.at_css(%Q{#{label_tag_name}:contains("#{label}:")})
  @attributes[field] = label_tag.text.gsub("#{label}:",'').gsub("\\n", "").strip
end
I also tried some shorter paths in the label_tag_name, basically adding another HTML tag to the beginning.
Every time though, the sent date is coming back nil.
The bit of your source you're interested in is (I've removed attributes for clarity):
<div>
<div>
<p>
<b>
<span>From:</span>
</b>
<span> EMAIL SENDER NAME [mailto:EMAILADDRESS@FAKE.COM] <br>
<b>Sent:</b>!! DATE I NEED TO GRAB HERE !! <br>
<b>To:</b> EMAIL ADDRESS HERE <br>
<b>Subject:</b> SUBJECT LINE HERE <p></p>
</span></p></div></div>
Note that br tags in HTML are void elements (they never have content), so it's pointless looking for child elements of them.
The target could be described with the CSS selector div div p span, but note that there are two nodes that match that, and at_css returns the first. You could use div div p>span to specify only spans that are immediate children of the p. The actual target is a text node inside this element (there's only one matching span in the document now). In particular, it's the next node after the first b tag. So if we expand the CSS selector to div div p>span b, we can use the Nokogiri next method to get the target string:
date_string = @doc.at_css('div div p>span b').next
If you want the other fields, you could use css instead of at_css:
date_string = @doc.css('div div p>span b')[0].next
to_string = @doc.css('div div p>span b')[1].next
subject_string = @doc.css('div div p>span b')[2].next
I'll leave getting the sender name for something for you to do!
There isn't much to navigate on in that document. Use a selector that gets you to the closest point reliably then grab the text with a regex:
> doc.css("div.WordSection1 p.MsoNormal span").text[/Sent:\n(.*)/, 1]
=> " !! DATE I NEED TO GRAB HERE !! To:"
I'd start with this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title></title>
</head>
<body>
<div class="WordSection1">
<div>
<div>
<b>Sent:</b>!! DATE I NEED TO GRAB HERE !!<br>
<b>To:</b> EMAIL ADDRESS HERE<br>
<b>Subject:</b> SUBJECT LINE HERE</span></p>
</div>
</div>
</div>
</body>
</html>
EOT
text = doc.at('div.WordSection1').text
sent_date = text[/Sent:(.+?)To:/m, 1].strip # /m lets . match the newline between the Sent: and To: lines
puts sent_date
Which outputs this:
!! DATE I NEED TO GRAB HERE !!
The sample HTML is a mess so you can't easily see the particular trees you want in that forest. Strip out everything that isn't essential for navigation, then build your search.
And, while a parser is a great tool, sometimes it's easier to use it to get to the text you want, then grab the particular thing via a string search.
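The "use the parser to reach the text, then finish with a string search" approach can be sketched in Python with the standard library. The stripped-down HTML sample, the TextCollector class and the sent_date function are assumptions made for illustration; this is not a translation of any particular answer's code.

```python
import re
from html.parser import HTMLParser

# Hypothetical, stripped-down version of the e-mail markup from the question.
HTML = """<div class="WordSection1"><div><div>
<b>Sent:</b>!! DATE I NEED TO GRAB HERE !!<br>
<b>To:</b> EMAIL ADDRESS HERE<br>
<b>Subject:</b> SUBJECT LINE HERE
</div></div></div>"""


class TextCollector(HTMLParser):
    """Collects all character data, ignoring the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def sent_date(html_text):
    """Return the text between the 'Sent:' and 'To:' labels, or None if absent."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()
    text = "".join(collector.chunks)
    # DOTALL lets '.' cross the line break that separates the two labels.
    match = re.search(r"Sent:(.+?)To:", text, re.DOTALL)
    return match.group(1).strip() if match else None


print(sent_date(HTML))
```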

How to extract links, text and timestamp from webpage via Html Agility Pack

I am using Html Agility Pack and am trying to extract the links and link text from the following HTML code. The webpage is fetched from a remote site and then saved locally as a whole. From this local copy I am trying to extract the links and link text. The page naturally contains other HTML, such as other links and text, but that has been removed here for clarity.
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
But the above are the most unique content to work from when trying to extract the links and linktext.
This is what I would like to see as the result
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
This is my code so far:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(@class, 'Subject2')]")
(lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0)
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
Time = lnks. Attributes["time"].Value
};
foreach (var link in linksOnPage)
{
// Loop through.
Response.Write("<link>" + link.Url + "</link>");
Response.Write("<title>" + link.Text + "</title>");
Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
And it's not working; I am getting nothing.
So any suggestions and help would be highly appreciated.
Thanks in advance.
Update: I have managed to get the syntax correct now, so the links from the above examples are selected with the following code:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[@class='Subject2']//a")
This selects the links nicely, with URL and text, but how do I go about also getting the timestamp?
That is, select the timestamp out of this:
<span class="time">2012-01-20 09:35</span></a>
which follows each link, and have it output with each link inside the output loop, as shown above? Thanks for any help with this.
Your HTML example is malformed; that's why you get unexpected results.
To find your first and second values you'll have to get the <a> inside your <span class='Subject2'>: the first value is the href attribute value, the second is the InnerText of the anchor. To get the third value you'll have to get the following sibling of the <span class='Subject2'> tag and get its InnerText.
Here is how you can do it:
var nodes = document.DocumentNode.SelectNodes("//span[@class='Subject2']//a");
foreach (var node in nodes)
{
if (node.Attributes["href"] != null)
{
var link = new XElement("link", node.Attributes["href"].Value);
var description = new XElement("description", node.InnerText);
var timeNode = node.SelectSingleNode(
"..//following-sibling::span[@class='time']");
if (timeNode != null)
{
var time = new XElement("pubDate", timeNode.InnerText);
Response.Write(link);
Response.Write(description);
Response.Write(time);
}
}
}
This outputs something like:
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open</link>
<description>Description 1 text here</description>
<pubDate>2012-01-20 08:35</pubDate>
<link>/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open</link>
<description>Description 2 text here</description>
<pubDate>2012-01-20 09:35</pubDate>
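Because the markup is malformed, an event-based parser that simply reports tags in document order can be easier to reason about than a tree. The sketch below shows that idea in Python with the standard library and also turns the timestamp into an RFC 2822 date of the kind the question asked for in pubDate. The LinkTimeParser class and extract_links function are hypothetical names, and since the page gives no time zone, the formatted date carries no real offset information.

```python
from datetime import datetime
from email.utils import format_datetime
from html.parser import HTMLParser

# The (malformed) markup from the question.
HTML = """<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>"""


class LinkTimeParser(HTMLParser):
    """Collects (href, link text, timestamp) tuples in document order."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished tuples
        self.current = None  # the link currently being assembled
        self.field = None    # "text" or "time": where character data is stored

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.current = {"href": attrs["href"], "text": "", "time": ""}
            self.field = "text"
        elif tag == "span" and attrs.get("class") == "time" and self.current:
            self.field = "time"

    def handle_data(self, data):
        if self.current is not None and self.field:
            self.current[self.field] += data

    def handle_endtag(self, tag):
        # The closing tag of the time span completes one link.
        if tag == "span" and self.field == "time":
            stamp = datetime.strptime(self.current["time"].strip(), "%Y-%m-%d %H:%M")
            self.items.append((self.current["href"], self.current["text"].strip(), stamp))
            self.current, self.field = None, None


def extract_links(html_text):
    parser = LinkTimeParser()
    parser.feed(html_text)
    parser.close()
    return parser.items


for href, text, stamp in extract_links(HTML):
    print(f"<link>{href}</link>")
    print(f"<title>{text}</title>")
    # The datetime is naive (no time zone), so the offset in the output is a placeholder.
    print(f"<pubDate>{format_datetime(stamp)}</pubDate>")
```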

Resources