Customising Pandoc writer element output - pandoc

Is it possible to customise element outputs for a pandoc writer?
Given reStructuredText input
.. topic:: Topic Title

   Content in the topic
Using the HTML writer, Pandoc will generate
<div class="topic">
<p><strong>Topic Title</strong></p>
<p>Content in the topic</p>
</div>
Is there a supported way to change the HTML output? Say, <strong> to <mark>, or adding another class to the parent <div>.
edit: I've assumed the formatting is the responsibility of the writer, but it's also possible it's decided when the AST is created.

This is what pandoc filters are for. Possibly the easiest way is to use Lua filters, as those are built into pandoc and don't require additional software to be installed.
The basic idea is that you'd match on an AST element created from the input, and produce raw output for your target format. So if all Strong elements were to be output as <mark> in HTML, you'd write
function Strong (element)
  -- the result will be the element's contents, which will no longer be 'strong'
  local result = element.content
  -- wrap contents in `<mark>` element
  result:insert(1, pandoc.RawInline('html', '<mark>'))
  result:insert(pandoc.RawInline('html', '</mark>'))
  return result
end
You'd usually want to inspect pandoc's internal representation by running pandoc --to=native YOUR_FILE.rst. This makes it easier to write a filter.
There is a similar question on the pandoc-discuss mailing list; it deals with LaTeX output, but is also about handling of custom rst elements. You might find it instructional.
Nota bene: the above can be shortened by using a feature of pandoc that outputs spans and divs with a class of a known HTML element as that element:
function Strong (element)
  return pandoc.Span(element.content, {class = 'mark'})
end
But I think it's easier to look at the general case first.
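Lua is not the only option: pandoc can also pipe its JSON-encoded AST through any executable given with --filter. As a rough sketch of the same idea in Ruby, assuming the usual JSON shapes for Strong, Span and Div nodes (a hash with "t" and "c" keys), the following turns every Strong into a span with class mark and adds a second class to topic divs. The helper name walk, the extra class callout and the file name mark.rb are made up for illustration.

```ruby
require 'json'

# Recursively walk pandoc's JSON AST and rewrite Strong and Div nodes.
def walk(node)
  case node
  when Array
    node.map { |child| walk(child) }
  when Hash
    rewritten = node.transform_values { |value| walk(value) }
    case rewritten['t']
    when 'Strong'
      # A Span carries [attr, inlines]; attr is [id, classes, key-value pairs].
      { 't' => 'Span', 'c' => [['', ['mark'], []], rewritten['c']] }
    when 'Div'
      (id, classes, kvs), blocks = rewritten['c']
      # 'callout' is a hypothetical extra class for topic divs.
      classes += ['callout'] if classes.include?('topic')
      { 't' => 'Div', 'c' => [[id, classes, kvs], blocks] }
    else
      rewritten
    end
  else
    node
  end
end

# Demo on a hand-written fragment shaped like the topic from the question.
blocks = [
  { 't' => 'Div', 'c' => [['', ['topic'], []], [
    { 't' => 'Para', 'c' => [
      { 't' => 'Strong', 'c' => [{ 't' => 'Str', 'c' => 'Topic' }] }
    ] }
  ]] }
]
result = walk(blocks)
puts JSON.pretty_generate(result)

# To use this as a real filter, replace the demo with
#   puts JSON.generate(walk(JSON.parse($stdin.read)))
# and run: pandoc --filter ./mark.rb YOUR_FILE.rst
```

Replacing the node with a Span relies on the same pandoc behaviour as the shortened Lua version above; emitting RawInline nodes around the contents would be the JSON equivalent of the first Lua filter.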

Related

Finding elements: different syntax -> different results - Explanation needed

When I use:
cy.get('b').contains('xdz') // find 1 element
but when I use:
cy.get('b:contains("xdz")') // find 2 elements
Can someone explain to me what the difference is?
cy.get('b').contains('xdz') is invoking a Cypress command, which is designed to only return a single element. This is by design so that you can narrow a search by text content.
cy.get('b:contains("xdz")') uses the jQuery pseudo-selector :contains() to test the text inside each <b> element, and it returns all matching elements.
Pseudo-selectors are extensions to the CSS selector syntax that apply jQuery methods during the selection. In this case :contains(sometext) is roughly shorthand for $el.text().includes('sometext'). Because it's part of the selector, it returns all matching elements.
It's worthwhile understanding the jQuery selector variations; as this example illustrates, they can give you different results in different situations.
contains('xdz') is a Cypress command which always yields only the first element containing the text. You can read more about it in this GitHub thread.
:contains("xdz") is a jQuery selector and it returns all elements containing the text. You can read more about it in the jQuery docs.

Capybara / Ruby - Trying to get only the Text from all ambiguous css selector and convert it to string

I'm trying to get the text from every element matched by a specific CSS selector that is ambiguous in the HTML. I would like to access each of the matching elements, get its text, and then return all of that information.
I've figured out how to find all the elements matching the ambiguous selector, but I don't know how to get just the text from each one.
The ambiguous selector is (it finds 3 matches)
.list-card-title .js-card-name
I've already tried commands like:
arr = Array(3)
arr = find_all('.list-card-title.js-card-name').to_a
puts arr.to_s
When I use puts arr, I get the following output:
[#<Capybara::Node::Element tag="span" path="/HTML/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[3]/DIV[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/DIV[1]/DIV[2]/A[1]/DIV[3]/SPAN[1]">, #<Capybara::Node::Element tag="span" path="/HTML/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[3]/DIV[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/DIV[1]/DIV[2]/A[2]/DIV[3]/SPAN[1]">, #<Capybara::Node::Element tag="span" path="/HTML/BODY[1]/DIV[2]/DIV[2]/DIV[1]/DIV[2]/DIV[3]/DIV[1]/DIV[1]/DIV[3]/DIV[1]/DIV[1]/DIV[1]/DIV[2]/A[3]/DIV[3]/SPAN[1]">]
To get the text of elements you need to call text on each of the elements. In your case the easiest way to do that would be
find_all('.list-card-title.js-card-name').map(&:text)
which will return an array of the text contained in each of the elements. If you then want all of that concatenated into one string you could do
find_all('.list-card-title.js-card-name').map(&:text).join
Note: you have tagged your question with automated-tests. Are you actually testing an app/site, or are you instead doing web scraping? If you are testing, you'd be much better off writing your tests using Capybara's expectation/assertion methods (and the :text options they accept) rather than finding elements, extracting/manipulating the contained text and then doing something (I assume asserting on) with that.
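To make the map(&:text) step concrete without a browser session, here is a minimal sketch in which FakeElement is a hypothetical stand-in for Capybara::Node::Element (only its text method is modelled), and the card names are invented sample data.

```ruby
# Hypothetical stand-in for Capybara::Node::Element: only #text is modelled.
FakeElement = Struct.new(:text)

# Pretend this is what find_all('.list-card-title.js-card-name') returned.
elements = [
  FakeElement.new('Card A'),
  FakeElement.new('Card B'),
  FakeElement.new('Card C')
]

texts = elements.map(&:text) # one string per element
joined = texts.join(', ')    # a single string, comma-separated here

puts texts.inspect
puts joined
```

In a real test suite the assertion-based route suggested above would look more like expect(page).to have_css('.list-card-title.js-card-name', text: 'Card A'), which lets Capybara retry until the text appears.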

Forcing string interpolation in Jade

I am trying to use Jade to do some string interpolation + i18n
I wrote a custom tag
mixin unsubscribe
  a(title='unsubscribe_link', href='#{target_address}/',
    target='_blank', style='color:#00b2e2;text-decoration:none;')
    = __("Click here")
Then I got the following to work
p
  | #[+unsubscribe] to unsubscribe
However, in order to support i18n I would also like to wrap the whole string in a translation call (the translation function is called __()).
But when I wrap the string in a code block it no longer renders the custom tag.
p
  | #{__("#[+unsubscribe] to unsubscribe")}
p
  = __("#[+unsubscribe] to unsubscribe")
will output literally [+unsubscribe] to unsubscribe. Is there a way to force interpolation of the string returned from the function?
Edit 1
As has been pointed out, nesting the "Click here" doesn't really make sense, since it will be creating separate strings.
My goal with all this is really to create a simplified text string that can be passed off to a translation service:
So ideally it should be:
"#[+unsubscribe('Click here')] to unsubscribe"
and I would get back
"Klicken Sie #[+unsubscribe hier] um Ihr auszutragen"
My reasoning for this is that because using something like gettext will match by exact strings, I would like to abstract out all the logic behind the tag.
What you really want to achieve is this:
<p>
<a href='the link' title='it should also be translated!'
target='_blank' class='classes are better'>Click here</a> to unsubscribe
</p>
And for some reason you don't want to include tags in the translation. Well, unfortunately separating 'Click here' from 'to unsubscribe' will result in incorrect translations for some languages - the translator needs a context. So it is better to use the tag.
And by the way: things like __('Click here') don't allow for different translations of the string based on context. I have no idea what translation tool you're using, but it should definitely use identifiers rather than English texts.
Going back to your original question, I believe you can use a parametrized mixin to do it:
mixin unsubscribe(title, target_address, click_here, to_unsubscribe)
  a(title=title, href=target_address, target='_blank', style='color:#00b2e2;text-decoration:none;')= click_here
  span= to_unsubscribe
This will of course result in an additional <span> tag, and it still does not solve the real issue (separating "Click here" from "to unsubscribe"), nor does it give you a way to re-order the sentence. I guess the only valid option would be to have interpolation built into the translation engine and to write out the tag unescaped. Otherwise you'd need to redesign the page to avoid a link inside the sentence.
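As an illustration of what "interpolation built into the translation engine" could look like, here is a small Ruby sketch, independent of Jade. The TRANSLATIONS table, the unsubscribe_sentence helper and the %{link_start}/%{link_end} placeholder names are all assumptions made for this example; the German text is taken as-is from the question, and the address is assumed to be trusted.

```ruby
# Hypothetical translation table: each sentence stays whole, so translators
# see the full context and may move the link anywhere in the sentence.
TRANSLATIONS = {
  en: '%{link_start}Click here%{link_end} to unsubscribe',
  de: 'Klicken Sie %{link_start}hier%{link_end} um Ihr auszutragen'
}.freeze

# Build the sentence for a locale, substituting the markup for the link.
# target_address is assumed to be a trusted, already-safe URL.
def unsubscribe_sentence(locale, target_address)
  link_start = %(<a href="#{target_address}/" target="_blank">)
  format(TRANSLATIONS.fetch(locale), link_start: link_start, link_end: '</a>')
end

puts unsubscribe_sentence(:en, 'https://example.com/unsubscribe')
puts unsubscribe_sentence(:de, 'https://example.com/unsubscribe')
```

The resulting string already contains markup, so the template would have to write it out unescaped, as noted above.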

Ruby Nokogiri - How to prevent Nokogiri from printing HTML character entities

I have an HTML file which I am parsing using Nokogiri, and I then generate an HTML file out of it like this
require 'nokogiri'

htmltext = File.read('input.html')
h_doc = Nokogiri::HTML(htmltext)
# ... modify h_doc ...
File.open('output.html', 'w+') do |file|
  file.write(h_doc)
end
The question is how to prevent Nokogiri from printing HTML character entities (&lt;, &gt;, &amp;) in the final generated HTML file.
Instead of HTML character entities (&lt;, &gt;, &amp;) I want to print the actual characters (<, >, etc.).
As an example, it is printing the HTML like
<title>&lt;%= ("/emailclient=sometext") %&gt;</title>
and I want it to output like this
<title><%= ("/emailclient=sometext")%></title>
So... you want Nokogiri to output incorrect or invalid XML/HTML?
Best suggestion I have, replace those sequences with something else beforehand, cut it up with Nokogiri, then replace them back. Your input is not XML/HTML, there is no point expecting Nokogiri to know how to handle it correctly. Because look:
<div>To write "&amp;", you need to write "&amp;amp;".</div>
This renders:
To write "&", you need to write "&amp;".
If you had your way, you'd get this HTML:
<div>To write "&", you need to write "&amp;".</div>
which would render as:
To write "&", you need to write "&".
Even worse in this scenario, say, in XHTML:
<div>Use the &lt;script&gt; tag for JavaScript</div>
if you replace the entities, you get an undisplayable file, due to the unclosed <script> tag:
<div>Use the <script> tag for JavaScript</div>
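The "replace beforehand, then replace back" suggestion at the start of this answer could be sketched roughly as follows. The helper names protect_erb and restore_erb and the ERBPLACEHOLDER token are invented for this example, and the Nokogiri step in the middle is only indicated by a comment so that the snippet runs on plain Ruby.

```ruby
# Swap every <% ... %> template tag for an alphanumeric token that an HTML
# parser will leave untouched, remembering the original tags by index.
def protect_erb(html)
  tags = []
  safe_html = html.gsub(/<%.*?%>/m) do |tag|
    tags << tag
    "ERBPLACEHOLDER#{tags.size - 1}X"
  end
  [safe_html, tags]
end

# Put the original template tags back in place of the tokens.
def restore_erb(html, tags)
  html.gsub(/ERBPLACEHOLDER(\d+)X/) { tags[Regexp.last_match(1).to_i] }
end

source = '<title><%= ("/emailclient=sometext") %></title>'
safe_html, tags = protect_erb(source)
puts safe_html

# ... here safe_html would be parsed, modified and serialised with Nokogiri ...

puts restore_erb(safe_html, tags)
```

This only works if the token cannot occur in the real content and the tags sit in places where plain text is allowed.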
EDIT I still think you're trying to get Nokogiri to do something it is not designed to do: handle template HTML. I'd rather assume that your documents normally don't contain those sequences, and post-correct them:
doc.traverse do |node|
  if node.text?
    node.content = node.content.gsub(/^(\s*)(\S.+?)(\s*)$/,
                                     "\\1<%= \\2 %>\\3")
  end
end
puts doc.to_html.gsub('&lt;%=', '<%=').gsub('%&gt;', '%>')
You absolutely can prevent Nokogiri from transforming your entities. It's a built-in function even, no voodoo or hacking needed. Be warned, I'm not a Nokogiri guru and I've only got this to work when I'm acting directly on a node inside a document, but I'm sure a little digging can show you how to do it with a standalone node too.
When you create or load your document you need to include the NOENT option. That's it. You're done, you can now add entities to your heart's content.
It is important to note that there are about half a dozen ways to call a doc with options, below is my personal favorite method.
require 'nokogiri'
noko_doc = File.open('<my/doc/path>') { |f| Nokogiri.<XML_or_HTML>(f, &:noent)}
xpath = '<selector_for_element>'
noko_doc.at_<css_or_xpath>(xpath).set_attribute('I_can_now_safely_add_preformatted_entities!', '&&&&&')
puts noko_doc.at_xpath(xpath).attributes['I_can_now_safely_add_preformatted_entities!']
>>> &&&&&
As for the usefulness of this feature... I find it incredibly useful. There are plenty of cases where you are dealing with preformatted data that you do not control, and it would be a serious pain to have to manage incoming entities just so Nokogiri could put them back the way they were.

performance issue of watir table object processing. How to make Nokogiri html table into array?

The following works but is always very slow, seemingly halting my scraping program and its Firefox or Chrome browser for whole minutes per page:
pp recArray = $browser.table(:id,"recordTable").to_a
Getting the HTML table's text or html source is fast though:
htmlcode = $browser.table(:id,"recordTable").html # .text shows only plaintext portion like lynx
How might I be able to create the same recArray (each element from a <TR>) using for example a Nokogiri object holding only that table's html?
recArray = Nokogiri::HTML(htmlcode). ??
I wrote a blog post about that a few days ago: http://zeljkofilipin.com/watir-nokogiri/
If you have further questions, ask.
You want each tr in the table?
Nokogiri::HTML($browser.html).css('table[id="recordTable"] > tr')
This gives a NodeSet which can be more useful than Array. Of course there's still to_a
Thought it would be useful to sum up all the steps here and there:
The question was how to produce the same array object filled with strings from the page's text content that a Watir::Webdriver Table #to_a might produce, but much faster:
recArray = Nokogiri::HTML(htmlcode). ??
So instead of this as I was doing before:
recArray=$browser.table(:class, 'detail-table w-Positions').to_a
I send the whole page's html as a string to Nokogiri to let it do the parsing:
recArray=Nokogiri::HTML($browser.html).css('table[class="detail-table w-Positions"] tr').to_a
Which found me the rows of the table I want and put them into an array.
Not done yet since the elements of that array are still Nokogiri (Table Row?) types, which barfed when I attempted things like .join(",") (useful for writing into a .CSV file or database for instance)
So the following iterates through each row element, turning each into an array of pure Ruby String types, containing only the text content of each table cell stripped of html tags:
recArray= recArray.map {|row| row.css("td").map {|c| c.text}.to_a } # Could of course be merged with above to even longer, nastier one-liner
Each cell had previously also been a Nokogiri Element type, done away with the .text mapping.
Significant speedup achieved.
Next I wonder what it would take to simply override the #to_a method of every Watir::Webdriver Table object globally in my Ruby code files....
(I realize that may not be 100% compatible but it would spare me so much code rewriting. Am willing to try in my personal.lib.rb include file.)

Resources