Scrapy is a really powerful tool, but it can be frustrating when it comes to XPath.
From the following HTML, I want to extract the links and the link texts (Title 1, Title 2 etc.) between <b>January 2017</b> and <b>February 2017</b> and group them per “Part”.
The actual HTML:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scrapy</title>
</head>
<body>
<hr size=1>
<h2 style="margin-top: 36px; margin-bottom: 24px">
Abcd efgh for 2017
</h2>
Part 1 |
Part 2 |
Part 3 |
Part 4 |
A very bold title
<hr size="1" style="margin-top: 36px; margin-bottom: 24px">
<a name="part1"></a>
<h3>Part 1</h3>
<ul>
</ul>
<a name="part2"></a>
<h3>Part 2</h3>
<ul>
</ul>
<a name="part3"></a>
<h3>Part 3</h3>
<ul>
</ul>
<a name="part4"></a>
<h3>Part 4</h3>
<ul>
</ul>
<div style="margin-top: 36px; margin-bottom: 24px">
<a name="non_rep"></a>
<h3>Abcd efgh</h3>
</div>
<b>January 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>February 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>March 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>April 2017</b>
...
...
So on so forth
</body>
</html>
The result should be:
January 2017
Part1 1
Title: Title 1, link: /cgi-bin/o.pl?file=/a/1.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/a/1.htm
Part1 2
Title: Title 1, link: /cgi-bin/o.pl?file=/a/2.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/a/22.htm
Part1 3
Title: Title 1, link: /cgi-bin/o.pl?file=/a/3.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/a/33.htm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
February 2017
Part1 1
Title: Title 1, link: /cgi-bin/o.pl?file=/b/1.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/b/1.htm
Part1 2
Title: Title 1, link: /cgi-bin/o.pl?file=/b/2.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/b/22.htm
Part1 3
Title: Title 1, link: /cgi-bin/o.pl?file=/b/3.htm
Title: Title 1, link: /cgi-bin/o.pl?file=/b/33.htm
I tried //text()[following-sibling::b/text()='January 2017']/following::a[contains(@href, 'cgi-bin')]/text() and similar expressions, to no avail.
How should I approach this?
The whole setup is a bit nasty because the tree structure is very flat. However, it does follow a pattern: a <b> node holding the label, with a <ul> holding the data right below it.
So we can find everything we want with some loops and a following-sibling::ul[1] XPath.
It's a bit ugly because of the triple loop, but if you ignore that it's pretty simple:
def parse(self, response):
    # any <b> node that contains 201x (a year)
    nodes = response.xpath(r"//b[re:test(text(),'201\d')]")
    for node in nodes:
        # get date node data
        name = node.xpath('text()').extract_first()
        # the part labels sit in the first <ul> after the date
        parts = node.xpath('following-sibling::ul[1]//li/b')
        for part in parts:
            # the same with part node data
            part_name = part.xpath('text()').extract_first()
            # step up to the <li>, then take the first <ul> after it
            links = part.xpath("../following-sibling::ul[1]//a")
            for link in links:
                # finally, we have date, part and link data! Put it together.
                item = dict()
                item['date_name'] = name
                item['part_name'] = part_name
                item['link_name'] = link.xpath('text()').extract_first()
                item['link_url'] = link.xpath('@href').extract_first()
                yield item
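The pairing idea behind the loops above (a label node followed by the list that belongs to it) can also be tried outside Scrapy. The following is only a sketch using the standard library's ElementTree, which has no following-sibling axis, so it walks the siblings in document order and remembers the last date and part it saw. The sample markup is a hypothetical, well-formed miniature of the page: the <a> elements and the hrefs are assumptions modelled on the expected output in the question, and /a/11.htm is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the page; anchors and hrefs are assumptions.
SAMPLE = """
<body>
  <b>January 2017</b>
  <ul>
    <li><b>Part1 1</b></li>
    <ul>
      <li><a href="/cgi-bin/o.pl?file=/a/1.htm">Title 1</a></li>
      <li><a href="/cgi-bin/o.pl?file=/a/11.htm">Title 2</a></li>
    </ul>
    <li><b>Part1 2</b></li>
    <ul>
      <li><a href="/cgi-bin/o.pl?file=/a/2.htm">Title A</a></li>
    </ul>
  </ul>
  <b>February 2017</b>
  <ul>
    <li><b>Part1 1</b></li>
    <ul>
      <li><a href="/cgi-bin/o.pl?file=/b/1.htm">Title 1</a></li>
    </ul>
  </ul>
</body>
"""


def group_links(markup):
    """Return (date, part, title, href) tuples in document order."""
    rows = []
    date = None
    for node in ET.fromstring(markup):
        if node.tag == "b":
            # a top-level <b> starts a new date section
            date = node.text.strip()
        elif node.tag == "ul" and date is not None:
            part = None
            for child in node:
                heading = child.find("b")
                if child.tag == "li" and heading is not None:
                    # an <li><b>...</b></li> names the part
                    part = heading.text.strip()
                elif child.tag == "ul":
                    # the nested <ul> holds the links of the current part
                    for link in child.iter("a"):
                        rows.append((date, part, link.text.strip(), link.get("href")))
    return rows


rows = group_links(SAMPLE)
for row in rows:
    print(row)
```

Real pages are rarely well-formed XML, so in a spider the XPath version above remains the more practical choice; this sketch is only meant to make the grouping logic easy to check by hand.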
This is the application HTML with the drag-and-drop content:
<li>
<a class='place_video help_fields_controller_place_video' data-kind='stimuli' data-type='place_video' href='#'>
Place Video
<b></b>
</a>
</li>
<li>
<a class='place_photo help_fields_controller_place_photo' data-kind='stimuli' data-type='place_photo' href='#'>
Place Photo
<b></b>
</a>
</li>
<li>
<a class='place_file help_fields_controller_place_file' data-kind='stimuli' data-type='place_file' href='#'>
Place File
<b></b>
</a>
</li>
<li>
<a class='place_link help_fields_controller_place_link' data-kind='stimuli' data-type='place_link' href='#'>
Survey Link
<b></b>
</a>
</li>
</ul>
<br style='clear:left'>
</div>
</div>
<div id='follow'>
<div class='box collapsable activity_controls js_collectors' href='/fields/new'>
<h3 class='clearfix box_header'>
<div class='controls'>
</div>
<a action='collapse' class='collapse' href='#'></a>
<span class='help_activities_controller_collectors'>
Collectors
</span>
</h3>
<div class='content'>
<ul id='sortable_collectors'>
<li>
<a class='paragraph_text help_fields_controller_paragraph_text' data-kind='collector' data-type='paragraph_text' href='#'>
Paragraph Text
<b></b>
</a>
</li>
<li>
<a class='single_line_text help_fields_controller_single_line_text' data-kind='collector' data-type='single_line_text' href='#'>
Single Line Text
<b></b>
</a>
</li>
<li>
<a class='website help_fields_controller_website' data-kind='collector' data-type='website' href='#'>
Website
<b></b>
</a>
</li>
<li>
<a class='date help_fields_controller_date' data-kind='collector' data-type='date' href='#'>
Date
<b></b>
</a>
</li>
<li>
<a class='request_photo help_fields_controller_request_photo' data-kind='collector' data-type='request_photo' href='#'>
Request Photo
<b></b>
</a>
</li>
<li>
<a class='request_file help_fields_controller_request_file' data-kind='collector' data-type='request_file' href='#'>
Request File
<b></b>
</a>
</li>
<li>
<a class='request_video help_fields_controller_request_video' data-kind='collector' data-type='request_video' href='#'>
Request Video
<b></b>
</a>
</li>
<li>
<a class='single_choice help_fields_controller_single_choice' data-kind='collector' data-type='single_choice' href='#'>
Single Choice
<b></b>
</a>
</li>
<li>
<a class='multiple_choice help_fields_controller_multiple_choice' data-kind='collector' data-type='multiple_choice' href='#'>
Multiple Choice
<b></b>
</a>
</li>
<li>
<a class='drop_down help_fields_controller_drop_down' data-kind='collector' data-type='drop_down' href='#'>
Drop Down
<b></b>
</a>
</li>
<li>
<a class='time help_fields_controller_time' data-kind='collector' data-type='time' href='#'>
Time
<b></b>
</a>
</li>
<li>
<a class='number help_fields_controller_number' data-kind='collector' data-type='number' href='#'>
Number
<b></b>
</a>
</li>
<li id='location-collector'>
<a class='location help_fields_controller_location' data-kind='collector' data-type='location' href='#'>
Location
<b></b>
</a>
</li>
</ul>
<br style='clear: left'>
</div>
</div>
</div>
</div>
</td>
<td class='work_area'>
<div id='activity_fields'>
<input type="hidden" name="activity[form_attributes][id]" id="activity_form_attributes_id" />
<ul class='empty_sort_table' id='sortable'>
<li class='blank-slate'>
Build your activity by drag and dropping stimuli and collectors from the palette.
</li>
</ul>
</div>
</td>
</tr>
</table>
I would like to drag the instructional text and drop it on the activity builder area. This is the Cypress code for it:
it('Activity Builder: Activity Library, Activity & Scheduling', () => {
const dataTransfer = new DataTransfer()
//Activity Scheduling
leftMenuNavigation.clickActivityScheduling()
cy.findByRole('heading', { name: /activities & scheduling/i }).should(
'be.visible'
)
//creating activity
cy.findByText(/^New Activity$/i)
.should('be.visible')
.parent()
.click()
cy.findByText(/activity/i, {
selector: '.goog-menuitem-content'
}).click({ force: true })
cy.findByRole('link', { name: /instructional text/i }).trigger(
'dragstart',
{
dataTransfer
}
)
cy.findByRole('cell', {
name: /build your activity by drag and dropping stimuli and collectors from the palette\./i
})
.findByRole('listitem')
.trigger('drop')
.trigger('dragend')
})
})
})
I am trying to extract XML fragments from an HTML source. The source looks like this:
.
.
.
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
I actually need the parent-child relation, so I first need to extract each group of nodes with XPath. But I couldn't manage to get the range of markup from "h5" to "/ul"; I need the "h5" and "ul" tags together. The output must look like this:
<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
I searched tons of links and tried everything, but none of these XPath expressions worked:
/.../*[self::dns:h5 or self::dns:ul]
/.../*[self::dns:h5|self::dns:ul]
/.../*[self::h5 or self::ul]
Any ideas? Thanks.
If you use Python, you can do this:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<h5>
<u>A</u>
</h5>
<ul class="listss">
<li>
<d>
<a href="link">
linktext
</a>
</d>
</li>
<li>
<d>
<a href="link2">
linktext2
</a>
</d>
</li>
</ul>
<h5>
<u>B</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>C</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>
<h5>
<u>D</u>
</h5>
<ul class="listss">
.\
.(SAME TAGS AS ABOVE)
./
</ul>'''
doc = SimplifiedDoc(html)
items = doc.children
lastName = None
for item in items:
    # an <h5> names the section; the <ul> that follows holds its links
    if item.tag == 'h5':
        lastName = item.text
    else:
        links = item.getElementsByTag('a')
        print(lastName, links)
result:
A [{'href': 'link', 'tag': 'a', 'html': 'linktext\n '}, {'href': 'link2', 'tag': 'a', 'html': 'linktext2\n '}]
B []
C []
D []
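If the goal is the markup itself (each <h5> together with the <ul> that follows it) rather than the list of links, one possible approach is to pair every element with its next sibling. The sketch below uses only the standard library and rests on a few assumptions: the fragment is well-formed, a wrapper element named root is added so it parses as a single tree, and the second section's link (link3) is a hypothetical stand-in for the elided "same tags as above" content.

```python
import xml.etree.ElementTree as ET

# The <root> wrapper and the link3 entry are assumptions for illustration.
SOURCE = """<root>
<h5><u>A</u></h5>
<ul class="listss">
  <li><d><a href="link">linktext</a></d></li>
  <li><d><a href="link2">linktext2</a></d></li>
</ul>
<h5><u>B</u></h5>
<ul class="listss">
  <li><d><a href="link3">linktext3</a></d></li>
</ul>
</root>"""


def sections(markup):
    """Yield (letter, fragment): the <h5> plus the <ul> right after it."""
    children = list(ET.fromstring(markup))
    # zip pairs each child with its immediate next sibling
    for heading, listing in zip(children, children[1:]):
        if heading.tag == "h5" and listing.tag == "ul":
            letter = "".join(heading.itertext()).strip()
            fragment = (ET.tostring(heading, encoding="unicode")
                        + ET.tostring(listing, encoding="unicode"))
            yield letter, fragment


result = dict(sections(SOURCE))
print(result["A"])
```

Each fragment is a plain string, so it can be re-parsed on its own to keep the parent-child relation per section.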
How should you handle h tags in the main menu of your site if the main menu renders above the h1?
Is it ok to break heading hierarchy in this case to provide the additional semantics / hierarchy, or should a different element be used instead of heading tags?
For example:
You have a menu that shows a secondary menu on hover / focus / click.
It uses a checkbox input to indicate the state of the menu so as to allow the menu to stay open.
The secondary menus use h2 > ul > li > a to organize and provide additional hierarchy to the overall menu.
Third-level link lists follow the heading hierarchy, e.g., h3 > ul > li > a
CodePen link for more clarity :: https://codepen.io/uxmfdesign/pen/yLBrMmd
<header>
<div class="main-menu__container">
<input
class="main-menu__toggle webaim-hidden"
type="checkbox"
id="main-menu-toggle-block-main-menu">
<!-- toggle open close -->
<label
class="main-menu__label"
for="main-menu-toggle-block-main-menu"
aria-label="Toggle Main menu"
>
Main menu
</label>
<nav class="main-menu__nav">
<div
class="main-menu-group"
data-main-menu-group="">
<!-- toggle open close of menu section -->
<input
id="main-menu-group-toggle-block-main-menu-0"
type="checkbox"
name="main-menu-group"
class="main-menu-group__toggle webaim-hidden"
data-main-menu-group-toggle="">
<label
for="main-menu-group-toggle-block-main-menu-0"
aria-label="Open Our services"
aria-haspopup="true"
data-icon=""
class="main-menu-group__toggle-label">
Section title
</label>
<div class="main-menu-group__wrapper">
<h2 class="main-menu-group__heading">
<span>Section title</span>
</h2>
<ul class="main-menu-nav-list">
<li class="main-menu-nav-list__item">
<h3 class="main-menu-nav-list__title">
<a class="main-menu-nav-list__title-link" href="/ca-domains">
Sub-section title
</a>
</h3>
<ul class="main-menu-subnav-list__menu">
<li class="main-menu-subnav-list__item main-menu-subnav-list__item--with-sub">
<a class="main-menu-subnav-list__link" href="/ca-domains/register-your-ca">
sub-page name
</a>
</li>
<li class="main-menu-subnav-list__item">
<a class="main-menu-subnav-list__link" href="/ca-domains/optimize-your-ca">
sub-page name
</a>
</li>
</ul>
</li>
<li class="main-menu-nav-list__item">
<h3 class="main-menu-nav-list__title">
<a class="main-menu-nav-list__title-link" href="/ca-domains">
Sub-section title
</a>
</h3>
<ul class="main-menu-subnav-list__menu">
<li class="main-menu-subnav-list__item main-menu-subnav-list__item--with-sub">
<a class="main-menu-subnav-list__link" href="/ca-domains/register-your-ca">
sub-page name
</a>
</li>
<li class="main-menu-subnav-list__item">
<a class="main-menu-subnav-list__link" href="/ca-domains/optimize-your-ca">
sub-page name
</a>
</li>
</ul>
</li>
</ul>
</div>
</div>
<div
class="main-menu-group"
data-main-menu-group="">
<!-- toggle open close of menu section -->
<input
id="main-menu-group-toggle-block-main-menu-0"
type="checkbox"
name="main-menu-group"
class="main-menu-group__toggle webaim-hidden"
data-main-menu-group-toggle="">
<label
for="main-menu-group-toggle-block-main-menu-0"
aria-label="Open Our services"
aria-haspopup="true"
data-icon=""
class="main-menu-group__toggle-label">
Section title
</label>
<div class="main-menu-group__wrapper">
<h2 class="main-menu-group__heading">
<span>Section title</span>
</h2>
<ul class="main-menu-nav-list">
<li class="main-menu-nav-list__item">
<h3 class="main-menu-nav-list__title">
<a class="main-menu-nav-list__title-link" href="/ca-domains">
Sub-section title
</a>
</h3>
<ul class="main-menu-subnav-list__menu">
<li class="main-menu-subnav-list__item main-menu-subnav-list__item--with-sub">
<a class="main-menu-subnav-list__link" href="/ca-domains/register-your-ca">
sub-page name
</a>
</li>
<li class="main-menu-subnav-list__item">
<a class="main-menu-subnav-list__link" href="/ca-domains/optimize-your-ca">
sub-page name
</a>
</li>
</ul>
</li>
<li class="main-menu-nav-list__item">
<h3 class="main-menu-nav-list__title">
<a class="main-menu-nav-list__title-link" href="/ca-domains">
Sub-section title
</a>
</h3>
<ul class="main-menu-subnav-list__menu">
<li class="main-menu-subnav-list__item main-menu-subnav-list__item--with-sub">
<a class="main-menu-subnav-list__link" href="/ca-domains/register-your-ca">
sub-page name
</a>
</li>
<li class="main-menu-subnav-list__item">
<a class="main-menu-subnav-list__link" href="/ca-domains/optimize-your-ca">
sub-page name
</a>
</li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
</div>
</header>
<main>
<h1>
The problem heading is here
</h1>
</main>
I have the following html:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scrapy</title>
</head>
<body>
<table style="border: #ffffff 0px solid" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td align="center">
<div style="margin-top:7px;margin-bottom:7px;font-size:16px;font-weight:bold;font-color:white" width="100%">
Scrapy Rocks
</div>
</td>
</tr>
</table>
<table cellpadding="0" cellspacing="0" width="100%" style="margin-top:25px">
<tr>
<td align="left" valign="top"></td>
<td valign="top">
<font size="-1">
<div style="margin-right:10; margin-top:5; text-align: right">
AAA |
BBB |
CCC
</div>
</font>
</td>
</tr>
<tr>
<td align="left" valign="top">
<div>
<a href="http://example.com" target="_blank">
<img src="/images/a.jpg" border="0" vspace="0" width="100" height="100" valign="middle"/>
</a>
<a href="/index.html">
<img src="/images/aaa.gif" border="0" vspace="0" width="100" height="100" valign="middle"/>
</a>
</div>
</td>
<td valign="top">
<div style="margin-right:10; margin-top:5; text-align: right"></div>
</td>
</tr>
</table>
<hr size=1>
<h2 style="margin-top: 36px; margin-bottom: 24px">
Abcd efgh for 2017
</h2>
Part 1 |
Part 2 |
Part 3 |
Part 4 |
A very bold title
<hr size="1" style="margin-top: 36px; margin-bottom: 24px">
<a name="part1"></a>
<h3>Part 1</h3>
<ul>
</ul>
<a name="part2"></a>
<h3>Part 2</h3>
<ul>
</ul>
<a name="part3"></a>
<h3>Part 3</h3>
<ul>
</ul>
<a name="part4"></a>
<h3>Part 4</h3>
<ul>
</ul>
<div style="margin-top: 36px; margin-bottom: 24px">
<a name="non_rep"></a>
<h3>Abcd efgh</h3>
</div>
<b>January 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>February 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
<b>March 2017</b>
<ul>
<li>
<b>Part1 1</b>
</li>
<ul>
<li>
Title 1
</li>
<br>
<li>
Title 2
</li>
<br>
</ul>
<li>
<b>Part1 2</b>
</li>
<ul>
<li>
Title A
</li>
<br>
<li>
Title B
</li>
<br>
</ul>
<li>
<b>Part1 3</b>
</li>
<ul>
<li>
Some text 1
</li>
<br>
<li>
Some Text 2
</li>
</ul>
</ul>
</body>
</html>
What I need here is to extract the text between the body tags (using Scrapy XPath), but I don't want the tables' text at all.
What I tried to get all the text was:
def parse(self, response):
    """
    -*-
    """
    item = DummyItem()
    title = response.xpath('//title/text()').extract()
    body = "\n ".join(
        response.xpath(
            '//body//*[not(self::script or self::style)]/text()'
        ).extract()
    )
    item['title'] = title
    item['body'] = body
    yield item
With the above snippet, I managed to extract all the text, tables included, which I don't want.
Then I replaced the "body" assignment with:
body = "\n ".join(
    response.xpath(
        '//body//*[not(self::table or self::script or self::style)]/text()'
    ).extract()
)
It didn't do the job; it still extracts the tables' text.
Any ideas on how to tackle it?
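You want "all text nodes that are not in a <table>", or, put differently, "all text nodes that do not have a <table> ancestor".
In XPath, that's /html/body//text()[not(ancestor::table)]. Your not(self::table) test only skips the <table> element itself; its descendants (<tr>, <td>, <div>, ...) still match * and contribute their text.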
text_nodes = response.xpath("/html/body//text()[not(ancestor::table)]").extract()
Now you only need to strip whitespace from the resulting items and remove empty strings from the list:
body = "\n ".join(filter(None, map(str.strip, text_nodes)))
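To see the "no <table> ancestor" rule and the clean-up step working together without a Scrapy project, here is a rough standard-library sketch. It does not use XPath at all: it swaps in html.parser and tracks how many skipped elements are currently open, which plays the role of not(ancestor::table). The class name TextOutsideTables and the sample markup are assumptions made for this illustration; skipping script and style as well is carried over from the question's original predicate.

```python
from html.parser import HTMLParser


class TextOutsideTables(HTMLParser):
    """Collect body text that has no <table>, <script> or <style> ancestor."""

    SKIPPED = {"table", "script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.depth = 0      # number of skipped elements currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # keep text only when inside <body> and outside every skipped element
        if self.in_body and self.depth == 0:
            self.chunks.append(data)


sample = """<html><head><title>Scrapy</title></head><body>
<table><tr><td><div>Scrapy Rocks</div></td></tr></table>
<h2>Abcd efgh for 2017</h2>
<b>January 2017</b>
<ul><li>Title 1</li></ul>
</body></html>"""

parser = TextOutsideTables()
parser.feed(sample)
parser.close()

# same clean-up as above: strip each chunk, drop the empty ones, join
body = "\n ".join(filter(None, map(str.strip, parser.chunks)))
print(body)
```

The depth counter matters because the text lives in descendants of the table (here a <div> inside a <td>), not in the <table> element itself.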
protected async override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    string htmlPage = "";
    using (var client = new HttpClient())
    {
        // htmlPage = await client.GetStringAsync("http://m.buses.co.uk/stop.aspx?stopid=6884");
        // htmlPage = await client.GetStringAsync("http://www.imdb.com/movies-in-theaters/");
        htmlPage = await client.GetStringAsync("http://m.buses.co.uk/destinations.aspx");
    }
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(htmlPage);
    List<Movie> movies = new List<Movie>();
    foreach (var div in htmlDocument.DocumentNode.SelectNodes("//div[starts-with(@class, 'menu')]"))
    {
        Movie newMovie = new Movie();
        // newMovie.Cover = div.SelectSingleNode(".//div[@class='image']//img").Attributes["src"].Value;
        // newMovie.Title = div.SelectSingleNode(".//h4[@itemprop='name']").InnerText.Trim();
        // newMovie.Summary = div.SelectSingleNode(".//div[@class='outline']").InnerText.Trim();
        // newMovie.Summary = div.SelectSingleNode(".//div[@class='services']").InnerText.Trim();
        newMovie.Summary = div.SelectSingleNode(".//a[starts-with(@href, 'place.aspx')]").InnerText.Trim();
        movies.Add(newMovie);
    }
    lstMovies.ItemsSource = movies;
}
I am trying to get the list of Popular Destinations, i.e. the names of the places; below is the part I am interested in. With the code above I am able to get the first place, Amex Stadium, but it's not displaying any more than that.
<div class="menu">
<ul>
<li>
<a href="place.aspx?placeid=1154">
Amex Stadium
</a>
</li>
<li>
<a href="place.aspx?placeid=1136">
Brighton Marina
</a>
</li>
<li>
<a href="place.aspx?placeid=907">
Brighton Station
</a>
</li>
<li>
<a href="place.aspx?placeid=910">
Brighton University Moulsecoomb
</a>
</li>
<li>
<a href="place.aspx?placeid=916">
Churchill Square
</a>
</li>
<li>
<a href="place.aspx?placeid=918">
Coldean
</a>
</li>
<li>
<a href="place.aspx?placeid=924">
County Hospital
</a>
</li>
<li>
<a href="place.aspx?placeid=943">
Eastbourne
</a>
</li>
<li>
<a href="place.aspx?placeid=957">
George Street Hove
</a>
</li>
<li>
<a href="place.aspx?placeid=965">
Hangleton
</a>
</li>
<li>
<a href="place.aspx?placeid=972">
Hollingbury
</a>
</li>
<li>
<a href="place.aspx?placeid=993">
Lewes
</a>
</li>
<li>
<a href="place.aspx?placeid=997">
Longhill School
</a>
</li>
<li>
<a href="place.aspx?placeid=1006">
Mile Oak
</a>
</li>
<li>
<a href="place.aspx?placeid=1011">
Newhaven
</a>
</li>
<li>
<a href="place.aspx?placeid=1134">
North Street
</a>
</li>
<li>
<a href="place.aspx?placeid=1020">
Old Steine
</a>
</li>
<li>
<a href="place.aspx?placeid=1026">
Patcham
</a>
</li>
<li>
<a href="place.aspx?placeid=1028">
Peacehaven
</a>
</li>
<li>
<a href="place.aspx?placeid=1035">
Portslade Station
</a>
</li>
<li>
<a href="place.aspx?placeid=1042">
Queens Park
</a>
</li>
<li>
<a href="place.aspx?placeid=1047">
Rottingdean
</a>
</li>
<li>
<a href="place.aspx?placeid=1057">
Seaford
</a>
</li>
<li>
<a href="place.aspx?placeid=1062">
Shoreham
</a>
</li>
<li>
<a href="place.aspx?placeid=1135">
St Peter's Church
</a>
</li>
<li>
<a href="place.aspx?placeid=1074">
Steyning
</a>
</li>
<li>
<a href="place.aspx?placeid=1076">
Sussex University
</a>
</li>
<li>
<a href="place.aspx?placeid=1080">
Tunbridge Wells
</a>
</li>
<li>
<a href="place.aspx?placeid=1082">
Uckfield
</a>
</li>
<li>
<a href="place.aspx?placeid=1091">
Westdene
</a>
</li>
<li>
<a href="place.aspx?placeid=1092">
Whitehawk
</a>
</li>
<li>
<a href="place.aspx?placeid=1095">
Woodingdean
</a>
</li>
</ul>
</div>
Your selection is wrong. The statement
SelectNodes("//div[starts-with(@class, 'menu')]")
only selects <div class="menu">, and since there is only one such div, the loop runs once and SelectSingleNode returns just the first matching link, so you only get one place. You should change it to:
SelectNodes("//div[starts-with(@class, 'menu')]/ul/li")
and then use:
newMovie.Summary = div.SelectSingleNode("a[starts-with(@href, 'place.aspx')]").InnerText.Trim();
Notice that I removed .// from the latter selection.
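The difference between the two selections is language-neutral, so it can be checked quickly outside the app. The sketch below is in Python with the standard library's ElementTree rather than Html Agility Pack, and it makes some assumptions: the markup is a trimmed, well-formed copy of the menu from the question, and because ElementTree does not support starts-with(), an exact @class='menu' match is used instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the menu from the question (first three entries only).
page = ET.fromstring("""
<html><body>
  <div class="menu">
    <ul>
      <li><a href="place.aspx?placeid=1154">Amex Stadium</a></li>
      <li><a href="place.aspx?placeid=1136">Brighton Marina</a></li>
      <li><a href="place.aspx?placeid=907">Brighton Station</a></li>
    </ul>
  </div>
</body></html>
""")

# Selecting the container gives a single node, hence a single loop iteration.
containers = page.findall(".//div[@class='menu']")

# Selecting the list items gives one node per place.
items = page.findall(".//div[@class='menu']/ul/li")

# Inside each <li>, a relative path without .// reaches the direct child link.
names = [li.find("a").text.strip() for li in items]

print(len(containers), names)
```

Looping over the list items and then taking the child link of each mirrors the corrected SelectNodes / SelectSingleNode pair.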