knime xpath node multiple tag selection - xpath

I am trying to extract XML fragments from an HTML source. The source looks like this:
.
.
.
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
I need the parent-child relationship, so I first need to extract each section as a node cell with the XPath node. However, I could not manage to get the range of XML from "h5" to "/ul"; I need the "h5" and "ul" tags together. The output must look like this:
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
I searched many links and tried everything I found, but none of these XPath expressions worked:
/.../*[self::dns:h5 or self::dns:ul]
/.../*[self::dns:h5|self::dns:ul]
/.../*[self::h5 or self::ul]
Any ideas? Thanks.

If you use Python, you can do this:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>'''
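doc = SimplifiedDoc(html)
items = doc.children
lastName = None
for item in items:
    if item.tag == 'h5':
        # Remember the heading; it labels the <ul> that follows.
        lastName = item.text
    else:
        # The <ul> sibling: collect its links under the last heading seen.
        links = item.getElementsByTag('a')
        print(lastName, links)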
result:
A [{'href': 'link', 'tag': 'a', 'html': 'linktext\n '}, {'href': 'link2', 'tag': 'a', 'html': 'linktext2\n '}]
B []
C []
D []
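The same pairing of each h5 with the ul that follows it can also be sketched with Python's standard library alone. This is a minimal sketch, not the KNIME solution: the `SAMPLE` fragment, the `<root>` wrapper, and the `group_sections` helper are assumptions made for illustration, and the input must be well-formed XML for `xml.etree.ElementTree` to parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page fragment in the question.
SAMPLE = """<root>
<h5><u>A</u></h5>
<ul class="listss">
  <li><d><a href="link">linktext</a></d></li>
  <li><d><a href="link2">linktext2</a></d></li>
</ul>
<h5><u>B</u></h5>
<ul class="listss">
  <li><d><a href="link3">linktext3</a></d></li>
</ul>
</root>"""


def group_sections(xml_text):
    """Return a list of (heading, [(href, text), ...]) pairs."""
    root = ET.fromstring(xml_text)
    sections = []
    for child in root:
        if child.tag == "h5":
            # A heading opens a new section.
            heading = "".join(child.itertext()).strip()
            sections.append((heading, []))
        elif child.tag == "ul" and sections:
            # A list belongs to the most recent heading.
            for a in child.iter("a"):
                sections[-1][1].append((a.get("href"), (a.text or "").strip()))
    return sections


sections = group_sections(SAMPLE)
for heading, links in sections:
    print(heading, links)
```

Walking the siblings in document order keeps the heading and its list together, which is the parent-child relation the question asks for.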

Related

How to extract the value of an neighbour attribute node via XPath?

I have two different web pages and I want to extract some value using XPath.
Which XPath expression can extract 2386028 from the first page and 4019606 from the second? I need a single expression that works for both pages.
First page fragment:
<ul class="g-ul b-properties">
<li class="b-properties__header">General</li>
<li class="b-properties__item">
<span class="b-properties__label">
<span>VendorCode</span>
</span>
<span class="b-properties__value">2386028</span>
</li>
<li class="b-properties__item">...</li>
<li class="b-properties__item">...</li>
<li class="b-properties__item">...</li>
and second page fragment:
<div class="b-properties-holder" id="tab_3">
<ul class="g-ul b-properties">
<li class="b-properties__header">General</li>
<li class="b-properties__item">
<span class="b-properties__label">
<span>Trademark</span>
</span>
<span class="b-properties__value">
<a class="link b-properties-link" href="/trademark/moist-diane/?sort=-date&currency=USD">Moist Diane</a>
</span>
</li>
<li class="b-properties__item">
<span class="b-properties__label">
<span>VendorCode</span>
</span>
<span class="b-properties__value">4019606</span>
</li>
<li class="b-properties__item">...</li>
<li class="b-properties__item">...</li>
You can select the <li> element whose <span class="b-properties__label"> contains a <span> with the value VendorCode, and then get the value of the <span class="b-properties__value"> under that <li> element.
For example:
//li[span[@class="b-properties__label"]/span="VendorCode"]/span[@class="b-properties__value"]/text()
Alternatively, you can select the <span class="b-properties__label"> element that has a <span> child with the value VendorCode, and get its following sibling:
//span[@class="b-properties__label" and span="VendorCode"]/following-sibling::span/text()
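The label-then-value lookup can be checked against both page shapes with a short Python sketch. It uses only the standard library, whose XPath subset has no following-sibling axis, so the label test is done in Python instead. `PAGE_ONE`, `PAGE_TWO`, and `property_value` are hypothetical names, and the fragments are simplified, well-formed stand-ins for the real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-ins for the two page fragments.
PAGE_ONE = """<ul class="g-ul b-properties">
  <li class="b-properties__header">General</li>
  <li class="b-properties__item">
    <span class="b-properties__label"><span>VendorCode</span></span>
    <span class="b-properties__value">2386028</span>
  </li>
</ul>"""

PAGE_TWO = """<div class="b-properties-holder" id="tab_3">
  <ul class="g-ul b-properties">
    <li class="b-properties__item">
      <span class="b-properties__label"><span>Trademark</span></span>
      <span class="b-properties__value"><a href="/trademark/moist-diane/">Moist Diane</a></span>
    </li>
    <li class="b-properties__item">
      <span class="b-properties__label"><span>VendorCode</span></span>
      <span class="b-properties__value">4019606</span>
    </li>
  </ul>
</div>"""


def property_value(xml_text, label):
    """Return the value text of the <li> whose label span reads `label`."""
    root = ET.fromstring(xml_text)
    for li in root.iter("li"):
        name = li.find("span[@class='b-properties__label']/span")
        if name is not None and (name.text or "").strip() == label:
            value = li.find("span[@class='b-properties__value']")
            if value is not None:
                # itertext() also covers values wrapped in a link.
                return "".join(value.itertext()).strip()
    return None


print(property_value(PAGE_ONE, "VendorCode"))
print(property_value(PAGE_TWO, "VendorCode"))
```

Because the lookup is keyed on the label text rather than on the position of the list item, it does not matter that VendorCode is the first property on one page and the second on the other.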

Scrapy and XPath - Select links and link text between sections

Scrapy is a really powerful tool, but it can be frustrating when it comes to XPath.
From the following html, I want to extract the links and the link texts (Title 1, Title 2 etc.) between <b>January 2017</b> and <b>February 2017</b> and group them per “Part”.
The actual html.
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scrapy</title>
</head>
<body>
<hr size=1>
<h2 style="margin-top: 36px; margin-bottom: 24px">
Abcd efgh for 2017
</h2>
Part 1 |
Part 2 |
Part 3 |
Part 4 |
A very bold title
<hr size="1" style="margin-top: 36px; margin-bottom: 24px">
<a name="part1"></a>
<h3>Part 1</h3>
<ul>
</ul>
<a name="part2"></a>
<h3>Part 2</h3>
<ul>
</ul>
<a name="part3"></a>
<h3>Part 3</h3>
<ul>
</ul>
<a name="part4"></a>
<h3>Part 4</h3>
<ul>
</ul>
<div style="margin-top: 36px; margin-bottom: 24px">
<a name="non_rep"></a>
<h3>Abcd efgh</h3>
</div>
<b>January 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>February 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>March 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>April 2017</b>
...
...
So on so forth
</body>
</html>
The result should be:
January 2017
Part1 1
Title: Title 1, link: /cgi-bin/o.pl?file=/a/1.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/a/1.htm
Part1 2
Title: Title 1, link: /cgi-bin/o.pl?file=/a/2.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/a/22.htm
Part1 3
Title: Title 1, link: /cgi-bin/o.pl?file=/a/3.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/a/33.htm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
February 2017
Part1 1
Title: Title 1, link: /cgi-bin/o.pl?file=/b/1.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/b/1.htm
Part1 2
Title: Title 1, link: /cgi-bin/o.pl?file=/b/2.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/b/22.htm
Part1 3
Title: Title 1, link: /cgi-bin/o.pl?file=/b/3.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/b/33.htm
I tried //text()[following-sibling::b/text()='January 2017']/following::a[contains(@href, 'cgi-bin')]/text() and similar spells to no avail.
How should I approach this?
The whole setup is a bit nasty because the tree structure is very flat. However, it follows a pattern: a <b> node with text, and right below it a <ul> with the data.
So we can find everything we want with some loops and the following-sibling::ul[1] XPath.
It's a bit ugly because of the triple loop, but otherwise it's pretty simple:
# Runs inside a spider callback, where `response` is available.
# any <b> node that contains 201x (a year)
nodes = response.xpath(r"//b[re:test(text(),'201\d')]")
for node in nodes:
    # get date node data
    name = node.xpath('text()').extract_first()
    parts = node.xpath('following-sibling::ul[1]//li/b')
    for part in parts:
        # the same with part node data
        part_name = part.xpath('text()').extract_first()
        links = part.xpath("../following-sibling::ul[1]//a")
        for link in links:
            # finally, we have date, part and link data! Put it together.
            item = dict()
            item['date_name'] = name
            item['part_name'] = part_name
            item['link_name'] = link.xpath('text()').extract_first()
            item['link_url'] = link.xpath('@href').extract_first()
            yield item
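To try the grouping logic without a running spider, here is a rough standard-library sketch. It replaces the following-sibling::ul[1] lookups with an ordered walk over siblings, which gives the same pairing on this flat layout. `SAMPLE` and `collect_items` are hypothetical, and the `<a>` elements and hrefs are assumptions modelled on the expected output in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed miniature of the page (links are assumed).
SAMPLE = """<body>
<b>January 2017</b>
<ul>
  <li><b>Part1 1</b></li>
  <ul>
    <li><a href="/cgi-bin/o.pl?file=/a/1.htm">Title 1</a></li>
    <li><a href="/cgi-bin/o.pl?file=/a/11.htm">Title 2</a></li>
  </ul>
  <li><b>Part1 2</b></li>
  <ul>
    <li><a href="/cgi-bin/o.pl?file=/a/2.htm">Title A</a></li>
  </ul>
</ul>
<b>February 2017</b>
<ul>
  <li><b>Part1 1</b></li>
  <ul>
    <li><a href="/cgi-bin/o.pl?file=/b/1.htm">Title 1</a></li>
  </ul>
</ul>
</body>"""


def collect_items(xml_text):
    """Return one dict per link, tagged with its month and part."""
    body = ET.fromstring(xml_text)
    items = []
    month = None
    for node in body:
        if node.tag == "b":
            # A month heading; the next <ul> sibling holds its data.
            month = (node.text or "").strip()
        elif node.tag == "ul" and month is not None:
            part = None
            for child in node:
                heading = child.find("b") if child.tag == "li" else None
                if heading is not None:
                    # A part heading; the next nested <ul> holds its links.
                    part = (heading.text or "").strip()
                elif child.tag == "ul":
                    for a in child.iter("a"):
                        items.append({
                            "date_name": month,
                            "part_name": part,
                            "link_name": (a.text or "").strip(),
                            "link_url": a.get("href"),
                        })
    return items


items = collect_items(SAMPLE)
for item in items:
    print(item)
```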

Parsing HTML - Only showing one item not the list

protected async override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    string htmlPage = "";
    using (var client = new HttpClient())
    {
        // htmlPage = await client.GetStringAsync("http://m.buses.co.uk/stop.aspx?stopid=6884");
        // htmlPage = await client.GetStringAsync("http://www.imdb.com/movies-in-theaters/");
        htmlPage = await client.GetStringAsync("http://m.buses.co.uk/destinations.aspx");
    }
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(htmlPage);
    List<Movie> movies = new List<Movie>();
    foreach (var div in htmlDocument.DocumentNode.SelectNodes("//div[starts-with(@class, 'menu')]"))
    {
        Movie newMovie = new Movie();
        // newMovie.Cover = div.SelectSingleNode(".//div[@class='image']//img").Attributes["src"].Value;
        // newMovie.Title = div.SelectSingleNode(".//h4[@itemprop='name']").InnerText.Trim();
        // newMovie.Summary = div.SelectSingleNode(".//div[@class='outline']").InnerText.Trim();
        // newMovie.Summary = div.SelectSingleNode(".//div[@class='services']").InnerText.Trim();
        newMovie.Summary = div.SelectSingleNode(".//a[starts-with(@href, 'place.aspx')]").InnerText.Trim();
        movies.Add(newMovie);
    }
    lstMovies.ItemsSource = movies;
}
I am trying to get the list of Popular Destinations, that is, the names of the places; the part I am interested in is below. With the code above I can get the first place, Amex Stadium, but it does not display any more than that.
<div class="menu">
<ul>
<li>
<a href="place.aspx?placeid=1154">
Amex Stadium
</a>
</li>
<li>
<a href="place.aspx?placeid=1136">
Brighton Marina
</a>
</li>
<li>
<a href="place.aspx?placeid=907">
Brighton Station
</a>
</li>
<li>
<a href="place.aspx?placeid=910">
Brighton University Moulsecoomb
</a>
</li>
<li>
<a href="place.aspx?placeid=916">
Churchill Square
</a>
</li>
<li>
<a href="place.aspx?placeid=918">
Coldean
</a>
</li>
<li>
<a href="place.aspx?placeid=924">
County Hospital
</a>
</li>
<li>
<a href="place.aspx?placeid=943">
Eastbourne
</a>
</li>
<li>
<a href="place.aspx?placeid=957">
George Street Hove
</a>
</li>
<li>
<a href="place.aspx?placeid=965">
Hangleton
</a>
</li>
<li>
<a href="place.aspx?placeid=972">
Hollingbury
</a>
</li>
<li>
<a href="place.aspx?placeid=993">
Lewes
</a>
</li>
<li>
<a href="place.aspx?placeid=997">
Longhill School
</a>
</li>
<li>
<a href="place.aspx?placeid=1006">
Mile Oak
</a>
</li>
<li>
<a href="place.aspx?placeid=1011">
Newhaven
</a>
</li>
<li>
<a href="place.aspx?placeid=1134">
North Street
</a>
</li>
<li>
<a href="place.aspx?placeid=1020">
Old Steine
</a>
</li>
<li>
<a href="place.aspx?placeid=1026">
Patcham
</a>
</li>
<li>
<a href="place.aspx?placeid=1028">
Peacehaven
</a>
</li>
<li>
<a href="place.aspx?placeid=1035">
Portslade Station
</a>
</li>
<li>
<a href="place.aspx?placeid=1042">
Queens Park
</a>
</li>
<li>
<a href="place.aspx?placeid=1047">
Rottingdean
</a>
</li>
<li>
<a href="place.aspx?placeid=1057">
Seaford
</a>
</li>
<li>
<a href="place.aspx?placeid=1062">
Shoreham
</a>
</li>
<li>
<a href="place.aspx?placeid=1135">
St Peter's Church
</a>
</li>
<li>
<a href="place.aspx?placeid=1074">
Steyning
</a>
</li>
<li>
<a href="place.aspx?placeid=1076">
Sussex University
</a>
</li>
<li>
<a href="place.aspx?placeid=1080">
Tunbridge Wells
</a>
</li>
<li>
<a href="place.aspx?placeid=1082">
Uckfield
</a>
</li>
<li>
<a href="place.aspx?placeid=1091">
Westdene
</a>
</li>
<li>
<a href="place.aspx?placeid=1092">
Whitehawk
</a>
</li>
<li>
<a href="place.aspx?placeid=1095">
Woodingdean
</a>
</li>
</ul>
</div>
Your selection is wrong. The statement
SelectNodes("//div[starts-with(@class, 'menu')]")
only selects <div class="menu">, and since there is only one such div, the loop runs once and you get only one place. You should change it to:
SelectNodes("//div[starts-with(@class, 'menu')]/ul/li")
and then use:
newMovie.Summary = div.SelectSingleNode("a[starts-with(@href, 'place.aspx')]").InnerText.Trim();
Notice that I removed .// from this latter selection.
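The difference between iterating over the one div and iterating over its list items can be seen in a small Python sketch. It uses the standard library rather than HtmlAgilityPack, and since that library's XPath subset has no starts-with(), the class is matched exactly and the href prefix is tested in Python. `SAMPLE` and `place_names` are hypothetical, and the fragment is a shortened, well-formed stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the destinations page.
SAMPLE = """<html><body>
<div class="menu">
  <ul>
    <li><a href="place.aspx?placeid=1154"> Amex Stadium </a></li>
    <li><a href="place.aspx?placeid=1136"> Brighton Marina </a></li>
    <li><a href="place.aspx?placeid=907"> Brighton Station </a></li>
  </ul>
</div>
</body></html>"""


def place_names(xml_text):
    """Return the link text of every place under the menu div."""
    root = ET.fromstring(xml_text)
    names = []
    # One iteration per <li>, not one per <div>.
    for li in root.findall(".//div[@class='menu']/ul/li"):
        link = li.find("a")  # relative to the current <li>
        if link is not None and link.get("href", "").startswith("place.aspx"):
            names.append((link.text or "").strip())
    return names


names = place_names(SAMPLE)
print(names)
```

Selecting the div alone would yield a single node here, which is why the original loop produced only one place.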

change header values based on variable codeigniter

I have the following in my header:
<div id="wrapper" class="homepage itemlist com_k2 category">
<div id="rt-header">
<div class="rt-container">
<div class="rt-grid-3 rt-alpha">
<div class="rt-block"></div>
</div>
<div class="rt-grid-9 rt-omega">
<div class="rt-fusionmenu">
<div class="nopill">
<div class="rt-menubar">
<ul class="menutop level1 ">
<li class="parent root f-main-parent firstItem f-mainparent-item"> <a class="daddy item bullet" href="www.domain.com"> <span> الرئيسية </span> </a> </li>
<li class="active root"> <a class="orphan item bullet" href="load_kitchen_list"> <span> المطبخ </span> </a> </li>
<li class="root"> <a class="orphan item bullet" href=""> <span style="font-size:medium;"> الملف الشخصي </span> </a> </li>
<li class="root"> <a class="orphan item bullet" href="main/contactus"> <span> للإتصال بنا</span> </a> </li>
</ul>
</div>
</div>
</div>
</div>
<div class="clear"></div>
</div>
</div>
If you inspect the elements, you will see that one of them has the active class. It makes sure that the current page is highlighted in the navigation bar. How can I change this based on the view that is loaded? This is how I load views:
$this->load->view('layout/header');
$this->load->view('home');
$this->load->view('layout/footer');
So in this case the main navigation tab should be active.
Regards,
I completed something similar recently. Probably not the best way, but it works for me.
In your controller, add this before loading your views:
$data['contact'] = 'active';
Then pass the $data variable from the controller to the view:
$this->load->view('header', $data);
And in your view, add this to the list item:
<li class="contact <?php if (isset($contact)) { echo $contact; } ?>">
From the controller:
Give your menus names and put the name of the loaded menu or page in a variable, as follows:
$data['active'] = 'menu1';
$this->load->view('layout/header',$data);
$this->load->view('home');
$this->load->view('layout/footer');
In the view:
<div id="wrapper" class="homepage itemlist com_k2 category">
<div id="rt-header">
<div class="rt-container">
<div class="rt-grid-3 rt-alpha">
<div class="rt-block"></div>
</div>
<div class="rt-grid-9 rt-omega">
<div class="rt-fusionmenu">
<div class="nopill">
<div class="rt-menubar">
<ul class="menutop level1 ">
<li class="<?php if(isset($active) && $active == 'menu1') echo 'active'; ?> parent root f-main-parent firstItem f-mainparent-item"> <a class="daddy item bullet" href="www.domain.com"> <span> الرئيسية </span> </a> </li>
<li class="<?php if(isset($active) && $active == 'menu2') echo 'active'; ?> root"> <a class="orphan item bullet" href="load_kitchen_list"> <span> المطبخ </span> </a> </li>
<li class="<?php if(isset($active) && $active == 'menu3') echo 'active'; ?> root"> <a class="orphan item bullet" href=""> <span style="font-size:medium;"> الملف الشخصي </span> </a> </li>
<li class="<?php if(isset($active) && $active == 'menu4') echo 'active'; ?> root"> <a class="orphan item bullet" href="main/contactus"> <span> للإتصال بنا</span> </a> </li>
</ul>
</div>
</div>
</div>
</div>
<div class="clear"></div>
</div>
</div>
Give each menu item a distinct class or id, then set the active class via jQuery or plain JavaScript:
<li class="kitchen root"> <a class="orphan item bullet" href="load_kitchen_list"> <span> المطبخ </span> </a> </li>
<li class="anotherclass root"> <a class="orphan item bullet" href=""> <span style="font-size:medium;"> الملف الشخصي </span> </a> </li>
<li class="contact root"> <a class="orphan item bullet" href="main/contactus"> <span> للإتصال بنا</span> </a> </li>
Then for example in the contact page you'd simply do this via jQuery:
$(document).ready(function(){
$(".contact").addClass("active");
});

Xpath - Get parent class by matching two child nodes

I'd like to use XPath to select a link with class="watchListItem" that contains a span with class="icon icon_checked" and an h3 with the text "a test". I can use XPath to match either the link and the span, or the link and the h3, but not the link, span, and h3 together.
Here's what I've tried:
//*[@class = 'watchListItem']/span[@class = 'icon icon_checked']
//*[@class= 'watchListItem']/h3[text()='AA']
I'm looking for something like this:
//*[@class = 'watchListItem']//*[span[@class = 'icon icon_checked'] and h3[text()='AA']]
<li>
<a class="watchListItem" data-id="thisid1" href="javascript:void(0);">
<span class="icon icon_checked"/>
<h3 class="itemList_heading">a test</h3>
</a>
</li>
<li>
<a class="watchListItem" data-id="thisid2" href="javascript:void(0);">
<span class="icon icon_unchecked"/>
<h3 class="itemList_heading">another test</h3>
</a>
</li>
<li>
<a class="watchListItem" data-id="thisid3" href="javascript:void(0);">
<span class="icon icon_checked"/>
<h3 class="itemList_heading">yet another test</h3>
</a>
</li>
You can use child:: location paths inside the predicate, like so:
//a[@class="watchListItem"
and child::span[@class="icon icon_checked"]
and child::h3[text()="yet another test"]]
This would select the anchor with data-id="thisid3".
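The "both children must match" condition can be checked against the sample markup with a short Python sketch. The standard library's XPath subset does not support nested predicates joined with and, so the two child tests are written in Python. `SAMPLE` wraps the question's list items in a `<ul>` to get a single root, and `checked_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The question's list items, wrapped in a <ul> so there is a single root.
SAMPLE = """<ul>
<li><a class="watchListItem" data-id="thisid1" href="javascript:void(0);">
  <span class="icon icon_checked"/><h3 class="itemList_heading">a test</h3>
</a></li>
<li><a class="watchListItem" data-id="thisid2" href="javascript:void(0);">
  <span class="icon icon_unchecked"/><h3 class="itemList_heading">another test</h3>
</a></li>
<li><a class="watchListItem" data-id="thisid3" href="javascript:void(0);">
  <span class="icon icon_checked"/><h3 class="itemList_heading">yet another test</h3>
</a></li>
</ul>"""


def checked_items(xml_text, heading):
    """Return data-ids of links with a checked icon AND the given h3 text."""
    root = ET.fromstring(xml_text)
    ids = []
    for a in root.iter("a"):
        if a.get("class") != "watchListItem":
            continue
        has_checked = a.find("span[@class='icon icon_checked']") is not None
        h3 = a.find("h3")
        if has_checked and h3 is not None and (h3.text or "").strip() == heading:
            ids.append(a.get("data-id"))
    return ids


print(checked_items(SAMPLE, "a test"))
print(checked_items(SAMPLE, "yet another test"))
print(checked_items(SAMPLE, "another test"))
```

The last call returns an empty list because the "another test" item carries the icon_unchecked span, so only one of the two child conditions holds for it.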
