XPath nested ul list

I am banging my head against a wall here; it's probably something simple that I am missing.
I have an HTML unordered list (ul) like the following:
<ul>
<li>Elm 1</li>
<li>Elm 2 - with children
<ul>
<li>Nested Elm</li>
<li>Another Elm</li>
</ul>
</li>
</ul>
Using XPath (version 1, compatible with Scrapy), how would I get the text out of all the li elements, including the nested ones?
Thanks for any help!

If you need xpath, use response.xpath('//ul//li/text()').extract().
If you can use css, it is shorter: response.css('ul li::text').extract()

Try with a simple xpath selector:
from scrapy.selector import Selector
selector = Selector(text="""
<ul>
<li>Elm 1</li>
<li>Elm 2 - with children
<ul>
<li>Nested Elm</li>
<li>Another Elm</li>
</ul>
</li>
</ul>""")
print(selector.xpath('//li/text()').extract())
This outputs:
['Elm 1', 'Elm 2 - with children\n ', 'Nested Elm', 'Another Elm', '\n ']
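The output above still carries whitespace-only strings and trailing line breaks from the markup around the nested ul. A minimal post-processing sketch, assuming the list printed above is exactly what extract() returned (the names raw and cleaned are illustrative and not part of Scrapy):

```python
# The list below is copied from the output shown above.
raw = ['Elm 1', 'Elm 2 - with children\n ', 'Nested Elm', 'Another Elm', '\n ']

# Strip surrounding whitespace and drop entries that become empty.
cleaned = [text.strip() for text in raw if text.strip()]

print(cleaned)
```

If you would rather filter inside the expression, //li/text()[normalize-space()] is valid XPath 1.0 and drops the whitespace-only nodes, although the remaining strings may still need stripping.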

Related

Unexpected pandoc behavior converting markdown list to html

Pandoc describes its behavior clearly in the section "Compact and loose lists".
However, the conversion of
# Test
- Item 1
- Item 2
    - Subitem 1
    - Subitem 2
results in
<h1 id="test">Test</h1>
<ul>
<li><p>Item 1</p></li>
<li><p>Item 2</p>
<ul>
<li>Subitem 1</li>
<li>Subitem 2</li>
</ul></li>
</ul>
My understanding is that the output should be
<h1 id="test">Test</h1>
<ul>
<li><p>Item 1</p></li>
<li>Item 2
<ul>
<li>Subitem 1</li>
<li>Subitem 2</li>
</ul></li>
</ul>
I'm using pandoc 2.10.1. Any thoughts?
This was changed in pandoc 2.7 in order to get pandoc's behavior more in line with that of CommonMark. The changelog contains this entry:
Markdown reader:
Improve tight/loose list handling (#5285). Previously the
algorithm allowed list items with a mix of Para and Plain, which
is never wanted.
The mentioned issue is #5285.
It seems that the documentation was not updated. This should be reported.

CKEditor bulletlist (and orderedlist) behavior wrong when I cut it?

I am using CKEditor and I love it,
but there is trouble when I create a bulleted (or ordered) list and cut it.
1. Make a list like the one below:
List item1
List item2
List item3
2. Then cut the text from the end of "List item3" to the start of "List item1" (not delete, but cut using Cmd+X).
3. The ul (or ol) tag remains, as shown below.
The remaining HTML looks like this:
<ul>
<li>​​​​​​​
<ul>
<li>
<ul></ul>
</li>
</ul>
</li>
</ul>
Can I avoid this behavior, or are there any workarounds?

xpath for locating li with text does not work

Using the XPath //ul//li[contains(text(),"outer")] to find a li in the outer ul does not work:
<ul>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 1
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 2
<ul >
<li> inner ul li 1 </li>
<li> inner ul li 2 </li>
</ul>
</li>
</ul>
Any idea how to find a li with a specific text in the outer ul?
Thank you
This will work for you: //ul//li[contains(.,"outer")]
I assume you only want to consider the text nodes that are direct children of the li. In that case you are right to use text() (contains(.,"outer") would also consider text from any descendant of the li).
Therefore try this:
//ul/li[text()[contains(.,'outer')]]
Running this with Saxon, the original XPath expression gives:
XPTY0004: A sequence of more than one item is not allowed as the first argument of
contains() ("", "", ...)
Now, I guess Selenium is probably using XPath 1.0 rather than XPath 2.0, and in 1.0 the contains() function has "first item semantics" - it converts its argument to a string, which if the argument is a node-set containing more than one node, involves considering only the first node. And the first text node is probably whitespace.
If you want to test whether some child text node contains "outer", use
//ul//li[text()[contains(.,"outer")]]
Another reason for switching to XPath 2.0...
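The "first item semantics" described above can be illustrated without an XPath engine. The following is a rough sketch that uses only Python's standard library and a shortened copy of the question's markup; direct_text_nodes is a hypothetical helper that emulates what text() selects, and the two boolean checks only approximate the two XPath predicates.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the markup from the question (one outer li only).
MARKUP = """<ul>
<li>
<span> not unique text, </span>
<span> not unique text, </span>
outer ul li 1
<ul>
<li> inner ul li 1 </li>
</ul>
</li>
</ul>"""

def direct_text_nodes(elem):
    """Hypothetical helper: text nodes that are direct children of elem."""
    nodes = []
    if elem.text is not None:
        nodes.append(elem.text)       # text before the first child element
    for child in elem:
        if child.tail is not None:
            nodes.append(child.tail)  # text following each child element
    return nodes

outer_li = ET.fromstring(MARKUP).find("li")
nodes = direct_text_nodes(outer_li)

# Roughly what contains(text(), "outer") tests in XPath 1.0: first node only.
first_only = "outer" in nodes[0]
# Roughly what text()[contains(., "outer")] tests: any direct text node.
any_node = any("outer" in node for node in nodes)

print(repr(nodes[0]), first_only, any_node)
```

Here the first text node is just the line break after the opening li tag, which is why the original expression finds nothing while the per-node predicate matches.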
For the above issue, this solution will work:
//ul//li[contains(.,"outer")]
"." selects the current node, so contains() tests the full string value of the li, including the text of its descendants.

Begin ordered list from 0 in Markdown

I'm new to Markdown. I was writing something like:
# Table of Contents
0. Item 0
1. Item 1
2. Item 2
But that generates a list that starts with 1, effectively rendering something like:
# Table of Contents
1. Item 0
2. Item 1
3. Item 2
I want to start the list from zero. Is there an easy way to do that?
If not, I could simply rename all of my indices, but this is annoying when there are several items. Beginning a list from zero seems so natural to me, it's like beginning the index of an array from zero.
Simply: NO
Longer: YES, BUT
When you create an ordered list in Markdown, it is parsed to an HTML ordered list, i.e.:
# Table of Contents
0. Item 0
1. Item 1
2. Item 2
Will create:
<h1>Table of Contents</h1>
<ol>
<li>Item 0</li>
<li>Item 1</li>
<li>Item 2</li>
</ol>
So, as you can see, there is no information about the starting number. If you want to start at a certain number, unfortunately, you have to use plain HTML and write:
<ol start="0">
<li>Item 0</li>
<li>Item 1</li>
<li>Item 2</li>
</ol>
You can use the HTML start attribute:
<ol start="0">
<li> item 1</li>
<li> item 2</li>
<li> item 3</li>
</ol>
It's currently supported in all browsers: Internet Explorer 5.5+, Firefox 1+, Safari 1.3+, Opera 9.2+, Chrome 2+
Optionally, you can use the type attribute for other numbering styles:
type="1" - decimal (default style)
type="a" - lower-alpha
type="A" - upper-alpha
type="i" - lower-roman
type="I" - upper-roman
Via html: use <ol start="0">
Via CSS:
ol {
counter-reset: num -1; /* reset counter to -1 (any counter name is possible) */
}
ol li {
list-style-type: none; /* remove default numbers */
}
ol li:before {
counter-increment: num; /* increment counter */
content: counter(num) ". ";
}
Update: Depends on the implementation.
The current version of CommonMark requires the start attribute. Some implementations already support this, e.g. pandoc and markdown-it. For more details see babelmark.
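If your Markdown implementation ignores the starting number, the plain-HTML fallback from the answers above can be generated rather than typed by hand. This is a minimal sketch under that assumption; ordered_list_html is a hypothetical helper name and not part of any Markdown library.

```python
from html import escape

def ordered_list_html(items, start=0):
    """Build an <ol> with an explicit start attribute (hypothetical helper)."""
    lines = [f'<ol start="{start}">']
    # Escape each item so characters such as < and & stay valid HTML.
    lines += [f"<li>{escape(item)}</li>" for item in items]
    lines.append("</ol>")
    return "\n".join(lines)

print(ordered_list_html(["Item 0", "Item 1", "Item 2"]))
```

The resulting block can be pasted into the Markdown source, since most Markdown implementations pass raw HTML through unchanged.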

Watir-webdriver : Accessing elements using Indexing

I am trying to access a li element using indexing:
<div class="item-list">
<ul>
<li class="views-row views-row-1 views-row-odd views-row-first">
<li class="views-row views-row-2 views-row-even">
<li class="views-row views-row-3 views-row-odd">
<li class="views-row views-row-4 views-row-even">
<li class="views-row views-row-5 views-row-odd">
<li class="views-row views-row-6 views-row-even">
<li class="views-row views-row-7 views-row-odd">
<li class="views-row views-row-8 views-row-even">
<li class="views-row views-row-9 views-row-odd views-row-last">
</ul>
</div>
The code I am using is
@browser.div(:class,'item-list').ul.li(:index => 2)
These are elements on a page, and I will be using a loop to access each of them. I thought indexing would solve the problem, but when I run my code I get the following error:
expected #<Watir::LI:0x2c555f80 located=false selector={:index=>2, :tag_name=>"li"}> to exist (RSpec::Expectations::ExpectationNotMetError)
How can I access these elements using indexing?
If you've got class-naming that nice, forget indexing! Do a partial match on the "views-row" parameter:
@browser.li(:class => /views-row-1/)
This can easily be parameterized for looping (although I don't know what you're doing with the information so this loop will not be very exciting).
x = 0
until x==9
x+=1
puts @browser.li(:class => /views-row-#{x}/).text
end
You could also blindly loop through the li's contained in your div if you'd like:
@browser.div(:class,'item-list').lis.each do |li|
puts li.text
end
According to the Watir wiki, Watir supports the :index method on the li element. So unless it is a bug in watir-webdriver, I think the index should work.
You may want to try the watir mailing list to see if this is a problem for others.
