Unexpected pandoc behavior converting markdown list to html - pandoc

Pandoc describes its behavior clearly in the section "Compact and loose lists".
However, the conversion of
# Test
- Item 1
- Item 2
- Subitem 1
- Subitem 2
results in
<h1 id="test">Test</h1>
<ul>
<li><p>Item 1</p></li>
<li><p>Item 2</p>
<ul>
<li>Subitem 1</li>
<li>Subitem 2</li>
</ul></li>
</ul>
My understanding is that the output should be
<h1 id="test">Test</h1>
<ul>
<li><p>Item 1</p></li>
<li>Item 2
<ul>
<li>Subitem 1</li>
<li>Subitem 2</li>
</ul></li>
</ul>
I'm using pandoc 2.10.1. Any thoughts?

This was changed in pandoc 2.7 in order to get pandoc's behavior more in line with that of CommonMark. The changelog contains this entry:
Markdown reader:
Improve tight/loose list handling (#5285). Previously the
algorithm allowed list items with a mix of Para and Plain, which
is never wanted.
The mentioned issue is #5285.
It seems that the documentation was not updated. This should be reported.
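The rule behind that changelog entry can be illustrated with a minimal sketch. This is not pandoc's implementation: `render_flat_list` is a hypothetical helper that handles only flat `- ` lists and applies a simplified form of the CommonMark rule. The point it shows is that looseness is decided once per list, so the items of one list never mix `<p>`-wrapped and bare text.

```python
def render_flat_list(markdown: str) -> str:
    """Render a flat '- ' bullet list to HTML (hypothetical helper, not pandoc code)."""
    lines = markdown.strip("\n").split("\n")
    items = [line[2:] for line in lines if line.startswith("- ")]
    # Simplified rule: a blank line between any two items makes the WHOLE list loose.
    loose = any(line.strip() == "" for line in lines)
    rendered = [f"<li><p>{t}</p></li>" if loose else f"<li>{t}</li>" for t in items]
    return "\n".join(["<ul>", *rendered, "</ul>"])


tight = render_flat_list("- Item 1\n- Item 2")
loose = render_flat_list("- Item 1\n\n- Item 2\n- Item 3")
print(tight)
print(loose)
```

In the loose case every item gets a paragraph, including "Item 3", which has no blank line directly before it.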

Related

xpath nested ul list

I am banging my head against a wall here; it's probably something simple that I am missing.
I have an HTML unordered list (ul) like the following:
<ul>
<li>Elm 1</li>
<li>Elm 2 - with children
<ul>
<li>Nested Elm</li>
<li>Another Elm</li>
</ul>
</li>
</ul>
Using XPath (version 1, compatible with Scrapy), how would I get the text out of all the li elements, including the nested ones?
Thanks for any help!
If you need xpath, use response.xpath('//ul//li/text()').extract().
If you can use css, it is shorter: response.css('ul li::text').extract()
Try with a simple xpath selector:
from scrapy.selector import Selector
selector = Selector(text="""
<ul>
<li>Elm 1</li>
<li>Elm 2 - with children
<ul>
<li>Nested Elm</li>
<li>Another Elm</li>
</ul>
</li>
</ul>""")
print(selector.xpath('//li/text()').extract())
This outputs:
['Elm 1', 'Elm 2 - with children\n ', 'Nested Elm', 'Another Elm', '\n ']
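The whitespace-only entries in that output come from the text nodes around the nested ul. A minimal sketch of cleaning the extracted list in plain Python, assuming the list shown above is what the selector returned:

```python
# Text fragments as returned by the XPath query above.
raw = ['Elm 1', 'Elm 2 - with children\n ', 'Nested Elm', 'Another Elm', '\n ']

# Strip surrounding whitespace and drop fragments that are empty afterwards.
cleaned = [text.strip() for text in raw if text.strip()]
print(cleaned)
```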

How do I get the inner html content in this xpath expression?

I have some HTML code
<li><h3>Number Theory - Even Factors</h3>
<p lang="title">Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?</p>
<ol class="xyz">
<li>1183</li>
<li>1200</li>
<li>1050</li>
<li>840</li>
</ol>
<ul class="exp">
<li class="grey fleft">
<span class="qlabs_tooltip_bottom qlabs_tooltip_style_33" style="cursor:pointer;">
<span>
<strong>Correct Answer</strong>
Choice (A).</br>1183
</span>
Correct answer
</span>
</li>
<li class="primary fleft">
Explanatory Answer
</li>
<li class="grey1 fleft">Factors - Even numbers</li>
<li class="orange flrt">Medium</li>
</ul>
</li>
In the HTML snippet above, I am trying to extract the contents of <p lang="title">. Notice how it has <sup></sup> and <sub></sub> tags being used inside.
My XPath expression .//p[@lang="title"]/text() does not retrieve the sub and sup contents. How do I get the output below?
Desired Output
Number N = 2<sup>6</sup>*5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?
XPath
You can simply get innerHTML with node() as below:
//p[@lang="title"]/node()
Note that it returns an array of nodes
Python
You can get the required innerHTML with the Python code below (BeautifulSoup 4, which provides decode_contents):
from bs4 import BeautifulSoup
def innerHTML(element):
    """Receive an element and return its innerHTML as a string."""
    return element.decode_contents(formatter="html")
html = """<html>
<head>...
<body>...
Your HTML source code
..."""
# Name the parser explicitly to avoid a parser-selection warning.
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find('p', {"lang": "title"})
print(innerHTML(paragraph))
Output:
'Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?'
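If no third-party library is available, the same idea can be sketched with the standard library alone. This is an alternative, not the method above: it assumes the fragment is well-formed XML (real-world HTML often is not), and `inner_html` is an illustrative name.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Return the markup inside an element: its leading text plus each serialized child."""
    parts = [element.text or ""]
    for child in element:
        # ET.tostring includes the child's tail text, so nothing between tags is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


p = ET.fromstring('<p lang="title">Number N = 2<sup>6</sup> * 5<sup>5</sup>; how many?</p>')
result = inner_html(p)
print(result)
```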

Wrap lines with tag using different logic in sublime text 2

I have hundreds of list items to code. Each list item contains a title and a description on two lines, so I need to wrap every two lines with a tag. Is there any way to do this in Sublime Text 2? I am using Windows.
This is the output needed:
<ul>
<li>
this is the title
this is the descrpition
</li>
<li>
this is the title
this is the descrpition
</li>
</ul>
The raw text looks like this:
this is title
this is description
this is title
this is description
=====
I have tried using Ctrl+Shift+G with ul>li*, but unfortunately it wraps each line with <li>.
If it is possible with Sublime Text, I actually need this type of structure:
<ul>
<li>
<span class="title">this is the title</span>
<span class="description">this is the descrpition</span>
</li>
<li>
<span class="title">this is the title</span>
<span class="description">this is the descrpition</span>
</li>
</ul>
How about a two step process using find and replace?
I am assuming that:
your original text is not indented at all;
your indentation is two spaces; and
you will handle the wrap with <ul> and resultant indentation yourself after this is done.
Original state:
this is title
this is description
this is title
this is description
Step one
With regular expression matching enabled, do a find and replace using these values:
FIND WHAT :: ((.*\n){1,2})
REPLACE WITH :: <li>\n\1</li>\n
Result:
<li>
this is title
this is description
</li>
<li>
this is title
this is description
</li>
Step two
With regular expression matching enabled, do a second find and replace using these values:
FIND WHAT :: (<li>\n)(.*)\n(.*)
REPLACE WITH :: \1 <span class="title">\2</span>\n <span class="description">\3</span>
Result:
<li>
<span class="title">this is title</span>
<span class="description">this is description</span>
</li>
<li>
<span class="title">this is title</span>
<span class="description">this is description</span>
</li>
What do you think?
Close enough to be useful?
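For anyone who would rather script it, the same two substitutions can be sketched with Python's re module. The patterns are the ones given above; the two-space indent in the second replacement and the variable names are assumptions for illustration.

```python
import re

raw = (
    "this is title\n"
    "this is description\n"
    "this is title\n"
    "this is description\n"
)

# Step one: wrap every pair of lines in <li> ... </li>.
step_one = re.sub(r"((.*\n){1,2})", r"<li>\n\1</li>\n", raw)

# Step two: wrap the two lines after each <li> in title/description spans.
step_two = re.sub(
    r"(<li>\n)(.*)\n(.*)",
    r'\1  <span class="title">\2</span>\n  <span class="description">\3</span>',
    step_one,
)
print(step_two)
```

As in the editor, the surrounding <ul> still has to be added by hand.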

XPath and negation searches

I have the following code sample in an xmlns root:
<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>
I would like to perform a different operation on the first item in <li> depending on the type of tag it resides in, or no tag, i.e. the first <li> in the sample.
EDIT:
My logic in pursuing the task turns out to be incorrect.
How do I query a <li> that has no descendants as in the first list item?
I tried negation:
@doc.xpath("//xmlns:ol[@class='stan']//xmlns:li/xmlns:*[1][not(p|pre)]")
That gives me the exact opposite of what I think I am asking for.
I think I am making the expression more complicated since I can't find the right solution.
UPDATE:
Navin Rawat has answered this one in the comments. The correct code would be:
@doc.xpath("//xmlns:ol[@class='stan']/xmlns:li[not(xmlns:*)]")
CORRECTION:
The correct question involves both an XPath search and a Nokogiri method.
Given the above xhtml code, how do I search for first descendant using xpath? And how do I use xpath in a conditional statement, e.g.:
@doc.xpath("//xmlns:ol[@class='stan']/xmlns:li").each do |e|
  if e.xpath("e has no descendants")
    perform task
  elsif e.xpath("e first descendant is <p>")
    perform second task
  elsif e.xpath("e first descendant is <pre>")
    perform third task
  end
end
I am not asking for complete code. Just the part in parentheses in the above Nokogiri code.
Pure XPath answer...
If you have the following XML :
<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>
To select an <li> that has no child element, as in the first list item, use:
//ol/li[count(*)=0]
If you have a namespace problem, please post the whole XML (with the root element and namespace declarations) so that we can help you deal with it.
EDIT after our discussion, here is your final tested code :):
@doc.xpath("//xmlns:ol[@class='footnotes']/xmlns:li").each do |e|
  if e.xpath("count(*)=0")
    puts "No children"
  elsif e.xpath("count(*[1]/self::xmlns:p)=1")
    puts "First child is <p>"
  elsif e.xpath("count(*[1]/self::xmlns:pre)=1")
    puts "First child is <pre>"
  end
end
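For comparison, the same three-way branch can be sketched with Python's standard-library ElementTree. This is only an illustration of the logic, not a replacement for the Nokogiri code: it assumes the elements carry no namespace (so no xmlns: prefix is needed), and `classify` is an illustrative name.

```python
import xml.etree.ElementTree as ET

xml = """<ol class="stan">
<li>Item one.</li>
<li>
<p>Paragraph one.</p>
<p>Paragraph two.</p>
</li>
<li>
<pre>Preformated one.</pre>
<p>Paragraph one.</p>
</li>
</ol>"""


def classify(li):
    """Describe an <li> by its first child element, if it has one."""
    children = list(li)  # child elements only; text nodes are not included
    if not children:
        return "No children"
    if children[0].tag == "p":
        return "First child is <p>"
    if children[0].tag == "pre":
        return "First child is <pre>"
    return "Other"


root = ET.fromstring(xml)
results = [classify(li) for li in root.findall("li")]
print(results)
```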

Begin ordered list from 0 in Markdown

I'm new to Markdown. I was writing something like:
# Table of Contents
0. Item 0
1. Item 1
2. Item 2
But that generates a list that starts with 1, effectively rendering something like:
# Table of Contents
1. Item 0
2. Item 1
3. Item 2
I want to start the list from zero. Is there an easy way to do that?
If not, I could simply rename all of my indices, but this is annoying when there are several items. Beginning a list from zero seems so natural to me, it's like beginning the index of an array from zero.
Simply: NO
Longer: YES, BUT
When you create an ordered list in Markdown, it is parsed into an HTML ordered list, i.e.:
# Table of Contents
0. Item 0
1. Item 1
2. Item 2
Will create:
<h1>Table of Contents</h1>
<ol>
<li>Item 0</li>
<li>Item 1</li>
<li>Item 2</li>
</ol>
So as you can see, there is no data about the starting number. If you want to start at a certain number, unfortunately, you have to use pure HTML and write:
<ol start="0">
<li>Item 0</li>
<li>Item 1</li>
<li>Item 2</li>
</ol>
You can use the HTML start attribute:
<ol start="0">
<li> item 1</li>
<li> item 2</li>
<li> item 3</li>
</ol>
It's currently supported in all browsers: Internet Explorer 5.5+, Firefox 1+, Safari 1.3+, Opera 9.2+, Chrome 2+
Optionally, you can use the type attribute for other numbering styles:
type="1" - decimal (default style)
type="a" - lower-alpha
type="A" - upper-alpha
type="i" - lower-roman
type="I" - upper-roman
Via html: use <ol start="0">
Via CSS:
ol {
  counter-reset: num -1; /* reset counter to -1 (any counter name is possible) */
}
ol li {
  list-style-type: none; /* remove default numbers */
}
ol li:before {
  counter-increment: num; /* increment counter */
  content: counter(num) ". ";
}
Update: Depends on the implementation.
The current version of CommonMark requires the start attribute. Some implementations already support this, e.g. pandoc and markdown-it. For more details see babelmark.
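What "supporting the start attribute" means can be shown with a minimal sketch. This is not the code of pandoc, markdown-it, or any other implementation: `render_ordered_list` is a hypothetical helper for flat lists only, which takes the number on the first item as the list's start and omits the attribute when that number is 1.

```python
import re
from html import escape

# Matches items such as "0. Item 0" or "3) Item".
ITEM = re.compile(r"^(\d+)[.)]\s+(.*)$")


def render_ordered_list(lines):
    """Render flat ordered-list lines to HTML, keeping the first item's number."""
    start = None
    items = []
    for line in lines:
        match = ITEM.match(line)
        if not match:
            continue  # this sketch ignores anything that is not a list item
        if start is None:
            start = int(match.group(1))  # only the first number matters
        items.append(match.group(2))
    attr = "" if start in (None, 1) else f' start="{start}"'
    body = "\n".join(f"<li>{escape(text)}</li>" for text in items)
    return f"<ol{attr}>\n{body}\n</ol>"


html_out = render_ordered_list(["0. Item 0", "1. Item 1", "2. Item 2"])
print(html_out)
```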
