XPath problem with multiple OR expressions like (a|b|c) [duplicate] - xpath

I have simplified html:
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
I want to find only one and two, on the condition that the parent tag is main or support and the child is a span or div.
I wonder why that code does not work:
import lxml.html as HTML_PARSER
html = """
<html>
<main>
<span>one</span>
</main>
<not_important>
<div>skip_me</div>
</not_important>
<support>
<div>two</div>
</support>
</html>
"""
parent = '//main | //support'
child = '/span | /div'
doc = HTML_PARSER.fromstring(html)
print(doc)
xpath = '(%s)(%s)' % (parent, child)
print(xpath)
parsed = doc.xpath(xpath)
print(parsed)
I get the error Invalid expression. Why?
Both (//main | //support) and (/span | /div) are valid XPaths on their own.
A simple combination like (//main | //support)/span is also valid.
So why is the more complicated combination (//main | //support)(/span | /div) invalid, and how can I fix it?
In my real case //main, //support, /span and /div are really complicated XPaths, so I want a general solution of the form (xpath1 | xpath2)(xpath3 | xpath4).

This will find it; however, I'm not 100% sure it's what you want:
//*[name() = 'main' or name() = 'support']/*[name() = 'span' or name() = 'div']/text()

Your XPath is not valid in XPath 1.0 (the version that lxml uses).
Try
xpath = '//div[parent::support]|//span[parent::main]'
or
parent = ['main', 'support']
child = ['span', 'div']
xpath = '//*[self::{0[0]} or self::{0[1]}]/*[self::{1[0]} or self::{1[1]}]'.format(parent, child)

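You can use the self:: axis:
(//main | //support)[*[self::div or self::span]]
This selects the main and support elements that have a span or div child. To select the children themselves, move the test into a step:
(//main | //support)/*[self::div or self::span]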
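For the general (xpath1 | xpath2)(xpath3 | xpath4) case, one option that stays within XPath 1.0 is to expand the cross product yourself and join the pieces with |. The sketch below is only string building; the helper name combine is made up for illustration, and it assumes every child expression is a relative path with no leading slash.

```python
from itertools import product


def combine(parents, children):
    """Expand (p1 | p2)(c1 | c2) into the union (p1)/c1 | (p1)/c2 | (p2)/c1 | (p2)/c2.

    Each parent is wrapped in parentheses so that it may itself contain a union.
    Each child must be a relative location path such as 'span' or 'div/a'.
    """
    return " | ".join("(%s)/%s" % (p, c) for p, c in product(parents, children))


xpath = combine(["//main", "//support"], ["span", "div"])
print(xpath)
# (//main)/span | (//main)/div | (//support)/span | (//support)/div
```

The resulting string can be passed to doc.xpath(...) as in the question. A node-set has no duplicates, so overlapping branches do no harm.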

Related

extract Xpath for string in a div class

I have the HTML below:
<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>
I would like to extract "GGRM.JK" from the HTML.
//div[contains(@class, "symbol")]
returns the element, not the text "GGRM.JK".
Since it seems you are using Python, try the following:
import lxml.html as lh
data = """[your html above]"""
doc = lh.fromstring(data)
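# version 1
target = doc.xpath('//div[contains(@class, "symbol")]/@class')[0]
print(target.split("'")[1])
# version 2 (assumes the div contains a link whose href ends in =GGRM.JK, which the snippet above does not show)
target2 = doc.xpath('//div[contains(@class, "symbol")]/a/@href')[0]
print(target2.split('=')[1])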
In either case, the output should be
GGRM.JK
The shortest way to get the substring you want with XPath only, without postprocessing, is to use the functions substring-after and substring-before.
Here is an example of how to get 'GGRM.JK' from both the class and href attributes.
import lxml.html as lh
htmlText = """<div class="sic_cell {symbol : 'GGRM.JK'}">
Gudang Garam Tbk.
</div>"""
htmlDom = lh.fromstring(htmlText)
fromHref = htmlDom.xpath('substring-after(//div/a/@href, "=")')
print(fromHref)
fromClass = htmlDom.xpath('substring-before(substring-after(//div/@class, ": \'"), "\'")')
print(fromClass)
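To see what the nested substring-before(substring-after(...)) call computes, here is a rough plain-Python mirror of the two XPath functions. The helper names are mine, not part of lxml, and the sketch assumes a non-empty marker string.

```python
def substring_after(text, marker):
    # Like XPath substring-after: everything after the first marker, "" if the marker is absent.
    return text.partition(marker)[2]


def substring_before(text, marker):
    # Like XPath substring-before: everything before the first marker, "" if the marker is absent.
    head, sep, _ = text.partition(marker)
    return head if sep else ""


class_attr = "sic_cell {symbol : 'GGRM.JK'}"
symbol = substring_before(substring_after(class_attr, ": '"), "'")
print(symbol)  # GGRM.JK
```

The inner call drops everything up to and including the opening quote, and the outer call cuts at the closing quote.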

xpath could not recognize predicate for a tag

I am trying to use Scrapy's XPath support to scrape a page, but it seems unable to match a tag by predicate when I build the predicate inside a for loop:
# This package will contain the spiders of your Scrapy project
from cunyfirst.items import CunyfirstSectionItem
import scrapy
import json
class CunyfristsectionSpider(scrapy.Spider):
    name = "cunyfirst-section-spider"
    start_urls = ["file:///Users/haowang/Desktop/section.htm"]
    def parse(self, response):
        url = response.url
        yield scrapy.Request(url, self.parse_page)
    def parse_page(self, response):
        n = -1
        for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
            print(response.xpath("//a[@name ='MTG_CLASSNAME$10']/text()"))
            n += 1
            class_num = section.xpath('text()').extract_first()
            # print(class_num)
            classname = "MTG_CLASSNAME$" + str(n)
            date = "MTG_DAYTIME$" + str(n)
            instr = "MTG_INSTR$" + str(n)
            print(classname)
            class_name = response.xpath("//a[@name = classname]/text()")
I am looking for <a> tags whose name is "MTG_CLASSNAME$" + str(n), with n being 0, 1, 2, ..., but my XPath query returns empty output. I am not sure why.
PS.
I am basically trying to scrape course and their info from https://hrsa.cunyfirst.cuny.edu/psc/cnyhcprd/GUEST/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?FolderPath=PORTAL_ROOT_OBJECT.HC_CLASS_SEARCH_GBL&IsFolder=false&IgnoreParamTempl=FolderPath%252cIsFolder&PortalActualURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentURL=https%3a%2f%2fhrsa.cunyfirst.cuny.edu%2fpsc%2fcnyhcprd%2fGUEST%2fHRMS%2fc%2fCOMMUNITY_ACCESS.CLASS_SEARCH.GBL&PortalContentProvider=HRMS&PortalCRefLabel=Class%20Search&PortalRegistryName=GUEST&PortalServletURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsp%2fcnyepprd%2f&PortalURI=https%3a%2f%2fhome.cunyfirst.cuny.edu%2fpsc%2fcnyepprd%2f&PortalHostNode=ENTP&NoCrumbs=yes
with filter applied: Kingsborough CC, fall 18, BIO
Thanks!
Well... I've visited the website you put in the question description, I used element inspection and searched for "MTG_CLASSNAME" and I got 0 matches...
So here are some tools:
In your settings.py, set:
LOG_FILE = "log.txt"
LOG_STDOUT=True
Then print the response body (response.body) at the top of the parse_page function and search for it in log.txt.
Check whether what you are looking for is there.
If it is, use https://www.freeformatter.com/xpath-tester.html (or a similar tool) to check your XPath expression.
In addition, change for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]"):
to for section in response.xpath("//a[contains(@name,'MTG_CLASS_NBR')]").extract():, which will raise an error when you get the data that you are looking for.
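One more thing worth checking in the question's code: in "//a[@name = classname]" the word classname sits inside the string, so XPath reads it as a child element name rather than as the Python variable. A minimal sketch of splicing the value into the query follows. It uses the standard library's ElementTree instead of Scrapy so it runs on its own, and the sample markup is a hypothetical stand-in, since the real page is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the saved page; only the name attributes follow the question.
SAMPLE = """
<table>
  <tr>
    <td><a name="MTG_CLASS_NBR$0">12345</a></td>
    <td><a name="MTG_CLASSNAME$0">BIO 1100-LEC</a></td>
  </tr>
  <tr>
    <td><a name="MTG_CLASS_NBR$1">67890</a></td>
    <td><a name="MTG_CLASSNAME$1">BIO 1200-LAB</a></td>
  </tr>
</table>
"""


def class_names(markup):
    """Pair each class number with the text of the matching MTG_CLASSNAME$n anchor."""
    root = ET.fromstring(markup)
    sections = [a for a in root.iter("a") if "MTG_CLASS_NBR" in a.get("name", "")]
    pairs = []
    for n, section in enumerate(sections):
        classname = "MTG_CLASSNAME$" + str(n)
        # The Python value is formatted into the query string as a quoted literal.
        match = root.find(".//a[@name='%s']" % classname)
        pairs.append((section.text, match.text if match is not None else None))
    return pairs


print(class_names(SAMPLE))
```

In Scrapy the same formatted string can be passed to response.xpath. As far as I know, Scrapy selectors also accept XPath variables as keyword arguments, e.g. response.xpath("//a[@name=$name]/text()", name=classname), which avoids quoting problems.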

Include jekyll / liquid template data in a YAML variable?

I am using the YAML heading of a markdown file to add an excerpt variable to blog posts that I can use elsewhere. In one of these excerpts I refer to an earlier blog post via markdown link markup, and I use the liquid template data variable {{ site.url }} in place of the base URL of the site.
So I have something like (trimmed it somewhat)
---
title: "Decluttering ordination plots in vegan part 2: orditorp()"
status: publish
layout: post
published: true
tags:
- tag1
- tag2
excerpt: In the [earlier post in this series]({{ site.url }}/2013/01/12/
decluttering-ordination-plots-in-vegan-part-1-ordilabel/ "Decluttering ordination
plots in vegan part 1: ordilabel()") I looked at the `ordilabel()` function
----
However, jekyll and the Maruku md parser don't like this, which makes me suspect that you can't use liquid markup in the YAML header.
Is it possible to use liquid markup in the YAML header of pages handled by jekyll?
If it is, what am I doing wrong in the example shown?
If it is not allowed, how else can I achieve what I intended? I am currently developing my site on my laptop and don't want to hard code the base URL as it'll have to change when I am ready to deploy.
The errors I am getting from Maruku are:
| Maruku tells you:
+---------------------------------------------------------------------------
| Must quote title
| ---------------------------------------------------------------------------
| the [earlier post in this series]({{ site.url }}/2013/01/12/decluttering-o
| --------------------------------------|-------------------------------------
| +--- Byte 40
and
| Maruku tells you:
+---------------------------------------------------------------------------
| Unclosed link
| ---------------------------------------------------------------------------
| the [earlier post in this series]({{ site.url }}/2013/01/12/decluttering-or
| --------------------------------------|-------------------------------------
| +--- Byte 41
and
| Maruku tells you:
+---------------------------------------------------------------------------
| No closing ): I will not create the link for ["earlier post in this series"]
| ---------------------------------------------------------------------------
| the [earlier post in this series]({{ site.url }}/2013/01/12/decluttering-or
| --------------------------------------|-------------------------------------
| +--- Byte 41
Today I ran into a similar problem. As a solution I created the following simple Jekyll filter plugin, which expands nested Liquid templates (e.g. Liquid variables in the YAML front matter):
module Jekyll
  module LiquifyFilter
    def liquify(input)
      Liquid::Template.parse(input).render(@context)
    end
  end
end
Liquid::Template.register_filter(Jekyll::LiquifyFilter)
Filters can be added to a Jekyll site by placing them in the '_plugins' sub-directory of the site-root dir. The above code can be simply pasted into a yoursite/_plugins/liquify_filter.rb file.
After that a template like...
---
layout: default
first_name: Harry
last_name: Potter
greetings: Greetings {{ page.first_name }} {{ page.last_name }}!
---
{{ page.greetings | liquify }}
... should render some output like "Greetings Harry Potter!". The expansion works also for deeper nested structures - as long as the liquify filter is also specified on the inner liquid output-blocks. Something like {{ site.url }} works of course, too.
Update - looks like this is now available as a Ruby gem: https://github.com/gemfarmer/jekyll-liquify.
I don't believe it's possible to nest liquid variables inside YAML. At least, I haven't figured out how to do it.
One approach that will work is to use Liquid's replace filter. Specifically, define a string that you want to use for the variable replacement (e.g. !SITE_URL!). Then, use the replace filter to switch that to your desired Jekyll variable (e.g. site.url) during the output. Here's a cut-down .md file that behaves as expected on my jekyll 0.11 install:
---
layout: post
excerpt: In the [earlier post in this series](!SITE_URL!/2013/01/12/)
---
{{ page.excerpt | replace: '!SITE_URL!', site.url }}
Testing that on my machine, the URL is inserted properly and then translated from markdown into an HTML link as expected. If you have more than one item to replace, you can string multiple replace calls together.
---
layout: post
my_name: Alan W. Smith
multi_replace_test: 'Name: !PAGE_MY_NAME! - Site: [!SITE_URL!](!SITE_URL!)'
---
{{ page.multi_replace_test | replace: '!SITE_URL!', site.url | replace: '!PAGE_MY_NAME!', page.my_name }}
An important note is that you must explicitly set the site.url value. You don't get that for free with Jekyll. You can either set it in your _config.yml file with:
url: http://alanwsmith.com
Or, define it when you call jekyll:
jekyll --url http://alanwsmith.com
If you need to replace values in one data/yml file with values from another data/yml file, I wrote a plugin. It's not very elegant, but it works.
I have since made some code improvements: it now catches all occurrences in one string and works with nested values.
module LiquidReplacer
  class Generator < Jekyll::Generator
    REGEX = /\!([A-Za-z0-9]|_|\.){1,}\!/
    def replace_str(str)
      out = str
      str.to_s.to_enum(:scan, REGEX).map {
        m = Regexp.last_match.to_s
        val = m.gsub('!', '').split('.')
        vv = $site_data[val[0]]
        val.delete_at(0)
        val.length.times.with_index do |i|
          if val.nil? || val[i].nil? || vv.nil? ||vv[val[i]].nil?
            puts "ERROR IN BUILDING YAML WITH KEY:\n#{m}"
          else
            vv = vv[val[i]]
          end
        end
        out = out.gsub(m, vv)
      }
      out
    end
    def deeper(in_hash)
      if in_hash.class == Hash || in_hash.class == Array
        _in_hash = in_hash.to_a
        _out_hash = {}
        _in_hash.each do |dd|
          case dd
          when Hash
            _dd = dd.to_a
            _out_hash[_dd[0]] = deeper(_dd[1])
          when Array
            _out_hash[dd[0]] = deeper(dd[1])
          else
            _out_hash = replace_str(dd)
          end
        end
      else
        _out_hash = replace_str(in_hash)
      end
      return _out_hash
    end
    def generate(site)
      $site_data = site.data
      site.data.each do |data|
        site.data[data[0]] = deeper(data[1])
      end
    end
  end
end
Place this code in site/_plugins/liquid_replacer.rb.
In a yml file, write !something.someval! where you would otherwise write site.data.something.someval, i.e. the same path without the site.data part.
Example:
_data/one.yml
foo: foo
_data/two.yml
bar: "!one.foo!bar"
calling {{ site.data.two.bar }} will produce foobar
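For readers who want the lookup logic without the Jekyll machinery, here is a rough Python mirror of the placeholder resolution. It is not the plugin itself: the function name resolve is made up, and unlike the Ruby code it leaves an unknown placeholder untouched instead of printing an error.

```python
import re

# Matches placeholders such as !one.foo! (letters, digits, underscores and dots between bangs).
PLACEHOLDER = re.compile(r"!([A-Za-z0-9_.]+)!")


def resolve(text, data):
    """Replace each !a.b.c! in text with data['a']['b']['c']."""
    def lookup(match):
        value = data
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                return match.group(0)  # unknown key: keep the placeholder as written
            value = value[key]
        return str(value)
    return PLACEHOLDER.sub(lookup, text)


site_data = {"one": {"foo": "foo"}, "two": {"bar": "!one.foo!bar"}}
print(resolve(site_data["two"]["bar"], site_data))  # foobar
```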
=======
OLD CODE
======
module LiquidReplacer
  class Generator < Jekyll::Generator
    REGEX = /\!([A-Za-z0-9]|_|\.){1,}\!/
    def generate(site)
      site.data.each do |d|
        d[1].each_pair do |k,v|
          v.to_s.match(REGEX) do |m|
            val = m[0].gsub('!', '').split('.')
            vv = site.data[val[0]]
            val.delete_at(0)
            val.length.times.with_index do |i|
              vv = vv[val[i]]
            end
            d[1][k] = d[1][k].gsub(m[0], vv)
          end
        end
      end
    end
  end
end
Another approach would be to add an IF statement to your head.html.
Instead of using page.layout as I did in the example below, you could use any variable from the page's YAML header.
<title>
{% if page.layout == 'post' %}
Some text with {{ site.url }} variable
{% else %}
{{ site.description | escape }}
{% endif %}
</title>

How to extract links, text and timestamp from webpage via Html Agility Pack

I am using Html Agility Pack and am trying to extract the links and link text from the following HTML code. The webpage is fetched from a remote site and then saved locally as a whole. From this local copy I am trying to extract the links and link text. The page naturally contains other HTML, such as other links and text, but that has been removed here for clarity.
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>
But the above is the most distinctive content to work from when trying to extract the links and link text.
This is what I would like to see as the result
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305</link>
<title>Description 1 text here</title>
<pubDate>Wed, 20 Jan 2012 07:35:00 +0100</pubDate>
<link>/some/today.nsf/0/ EC8A39XXXX264X5BC125798B0029E312</link>
<title>Description 2 text here</title>
<pubDate> Wed, 20 Jan 2012 08:35:00 +0100</pubDate>
This is my code so far:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[starts-with(@class, 'Subject2')]")
(lnks.Name == "a" &&
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0)
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
Time = lnks. Attributes["time"].Value
};
foreach (var link in linksOnPage)
{
// Loop through.
Response.Write("<link>" + link.Url + "</link>");
Response.Write("<title>" + link.Text + "</title>");
Response.Write("<pubDate>" + link.Time + "</pubDate>");
}
And it's not working; I am getting nothing.
So any suggestions and help would be highly appreciated.
Thanks in advance.
Update: I have now managed to get the syntax correct for selecting the links from the above examples, with the following code:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//span[@class='Subject2']//a")
This selects the links nicely with url and text, but how do I go about also getting the time stamp?
That is, select the timestamp out of this:
<span class="time">2012-01-20 09:35</span></a>
which follows each link, and output it with each link inside the output loop shown above? Thanks for any help with this.
Your HTML example is malformed; that's why you get unexpected results.
To find your first and second values you'll have to get the <a> inside your <span class='Subject2'>: the first value is the href attribute value, the second is the InnerText of the anchor. To get the third value you'll have to get the following sibling of the <span class='Subject2'> tag and get its InnerText.
Here is how you can do it:
var nodes = document.DocumentNode.SelectNodes("//span[@class='Subject2']//a");
foreach (var node in nodes)
{
    if (node.Attributes["href"] != null)
    {
        var link = new XElement("link", node.Attributes["href"].Value);
        var description = new XElement("description", node.InnerText);
        var timeNode = node.SelectSingleNode(
            "..//following-sibling::span[@class='time']");
        if (timeNode != null)
        {
            var time = new XElement("pubDate", timeNode.InnerText);
            Response.Write(link);
            Response.Write(description);
            Response.Write(time);
        }
    }
}
This outputs something like:
<link>/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open</link>
<description>Description 1 text here</description>
<pubDate>2012-01-20 08:35</pubDate>
<link>/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open</link>
<description>Description 2 text here</description>
<pubDate>2012-01-20 09:35</pubDate>
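Because the markup is malformed (the span closes before the anchor does), a tree-based query depends on how the parser repairs it. As an alternative, here is a rough event-based sketch in Python using the standard library's html.parser, which just follows the tags in the order they appear. The class and function names are made up for illustration, and it is a different technique from the Html Agility Pack code above.

```python
from html.parser import HTMLParser

SAMPLE = """<span class="Subject2"><a href="/some/today.nsf/0/EC8A39D274864X5BC125798B0029E305?open">
Description 1 text here</span> <span class="time">2012-01-20 08:35</span></a><br>
<span class="Subject2"><a href="/some/today.nsf/0/EC8A39XXXX264X5BC125798B0029E312?open">
Description 2 text here</span> <span class="time">2012-01-20 09:35</span></a><br>"""


class LinkTimeParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.records = []    # finished (href, title, time) tuples
        self._armed = False  # True right after <span class="Subject2">
        self._href = None
        self._title = []
        self._time = []
        self._state = None   # None, "title" or "time"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "Subject2":
            self._armed = True
        elif tag == "a" and self._armed and "href" in attrs:
            self._href, self._title, self._time = attrs["href"], [], []
            self._state, self._armed = "title", False
        elif tag == "span" and attrs.get("class") == "time" and self._href is not None:
            self._state = "time"

    def handle_data(self, data):
        if self._state == "title":
            self._title.append(data)
        elif self._state == "time":
            self._time.append(data)

    def handle_endtag(self, tag):
        if self._state == "title" and tag in ("a", "span"):
            # The title ends at whichever closing tag comes first.
            self._state = None
        elif self._state == "time" and tag == "span":
            self.records.append(
                (self._href, "".join(self._title).strip(), "".join(self._time).strip())
            )
            self._href, self._state = None, None


def extract(markup):
    parser = LinkTimeParser()
    parser.feed(markup)
    parser.close()
    return parser.records


for href, title, time in extract(SAMPLE):
    print(href, title, time)
```

The timestamps come out exactly as they appear in the page; converting them to the pubDate format asked for in the question is a separate step.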

XPath Query to select hyperlink

The following is a subset of xml from a twitter atom feed:
<entry>
<id>tag:search.twitter.com,2005:18232030105964545</id>
<published>2010-12-24T09:10:29Z</published>
<link type="text/html" rel="alternate" href="http://twitter.com/KTNKenya/statuses/18232030105964545"/>
<title>Synovate Poll: PM Raila Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... http://fb.me/yjmMbmBx</title>
<content type="html">Synovate Poll: PM <b>Raila</b> Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... <a href="http://fb.me/yjmMbmBx">http://fb.me/yjmMbmBx</a></content>
<updated>2010-12-24T09:10:29Z</updated>
<link type="image/png" rel="image" href="http://a3.twimg.com/profile_images/701825859/NEW_KTN_normal.png"/>
<google:location>nairobi, kenya</google:location>
<twitter:geo>
</twitter:geo>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>KTNKenya (KTN Kenya)</name>
<uri>http://twitter.com/KTNKenya</uri>
</author>
</entry>
From the <title>...</title> element, I need to select the hyperlink http://fb.me/yjmMbmBx via an XPath query. How do I do it? Is it possible?
*I'm an XPath newbie.
Thanks.
You have two options:
1. Use <title> (XPath: /entry/title/text()) and extract the URL yourself, e.g. using a regex or by finding the last instance of "http://" in the string.
2. Get the content data first:
/entry/content[@type="html"]/text()
Then parse this as HTML, extract any <a> tags, and use the href attribute of those tags. How you do this last part depends on the language/environment you are working in.
Update: Added basic example code for option 1 above, as requested:
xmlpp::Element *node = parser.get_document()->get_root_node();
xmlpp::NodeSet results = node->find("/entry/title/text()");
xmlpp::ContentNode* content = dynamic_cast<xmlpp::ContentNode*>(results.front());
std::string text = content->get_content();
std::string link = "";
// rfind returns std::string::npos when nothing is found, so keep the unsigned type
std::string::size_type res = text.rfind("http://");
if (res == std::string::npos)
    res = text.rfind("https://");
if (res != std::string::npos)
    link = text.substr(res);
With the atom prefix bound to the http://www.w3.org/2005/Atom namespace URI, use:
/atom:feed/atom:entry/atom:title[contains(.,'http://')]
This selects every atom:title element child of atom:entry, having the string "http://" contained in its string value.
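XPath 1.0 can select the title element, but cutting the URL out of its text is easier in the host language. A minimal Python sketch of that combination follows, using the standard library's ElementTree and a regex. The feed wrapper and the shortened title are a hypothetical stand-in for the real feed, and the function name title_links is mine.

```python
import re
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical, shortened stand-in for the Twitter feed in the question.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Synovate Poll: PM Raila Odinga remains the preffered presidential candidate http://fb.me/yjmMbmBx</title>
  </entry>
</feed>"""


def title_links(feed_xml):
    """Return every http(s) URL found in the text of the atom:title elements."""
    root = ET.fromstring(feed_xml)
    links = []
    for title in root.findall("atom:entry/atom:title", ATOM):
        links.extend(re.findall(r"https?://\S+", title.text or ""))
    return links


print(title_links(FEED))  # ['http://fb.me/yjmMbmBx']
```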

Resources