Smart Quotes and Ligatures in pandoc - pandoc

I have a file text.txt which contains very basic latex/markdown. For example, it might be the following.
Here is some basic maths: $f(x) = ax + b$ defines a straight line, often called a "linear" function---but it's not _actually_ a linear function, eg $f(0) \ne 0$.
I would like to convert this into html using WebTeX. However, I don't want smart quotes (a " should be output as a plain straight quote, not curled at either end) or smart dashes (--- should stay as literally three dashes, not become an em-dash).
It seems that the smart option is good for this: pandoc manual, github 1, github 2. However, I can't quite work out the correct syntax. I have tried, for example, the following.
pandoc text.txt -f markdown-smart -t markdown-smart -s --webtex -o tex.html
Unfortunately this doesn't work.
I solved this while writing the question, so I'll post the answer below! (Spoiler alert: simply remove -t markdown-smart.)

Simply remove -t markdown-smart.
pandoc text.txt -f markdown-smart -s --webtex -o tex.html
I believe that -t markdown-smart says "write to markdown, without the smart extension". But we are not trying to output markdown; we want html. If you look at the file produced with -t, you can see that it contains the markdown code for embedding the various WebTeX images, and pasting it into a markdown editor should render it.
To get html, simply remove that option.
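As a quick sanity check (not part of the original post), running the same file through both reader specs makes the difference visible:
pandoc text.txt -f markdown-smart -s --webtex -o tex.html
pandoc text.txt -f markdown -s --webtex -o tex.html
The first command keeps the quotes straight and the --- as three literal dashes in the output, while the second (smart is enabled by default for markdown input) curls the quotes and turns --- into an em-dash.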

Related

Pandoc/PowerShell option to use to avoid â€? in place of the closing double quotes

I use the following Powershell script to convert the raw Markdown-Plain-Text in my clipboard into pastable things that can be used in an arbitrary browser. I use it most heavily for writing emails in Gmail, and for Google Docs.
paste.exe | pandoc -f markdown -t HTML | Set-Clipboard -AsHtml ; echo 'Conversion done.'
It has been working amazingly well, except for its conversion of the closing double quotes.
When I type, I do not distinguish between opening and closing quotation marks, so:
Either it is pandoc that wanted to help but screwed up a bit,
Or it is the Set-Clipboard PowerShell command that needs a bit more attention.
Experts, please advise what "magic flag" to put in, so that I can avoid manually cleaning up the â€? markers all over the place.
You can disable Pandoc's smart extension, which is enabled by default for markdown, latex, and context output.
pandoc -f markdown-smart -t HTML
Note that you "disable" an extension by appending -EXTENSION to the format, where EXTENSION is the extension name. Therefore the format is markdown-smart. Conversely you can enable an extension with +EXTENSION. So you might read markdown-smart as "markdown minus smart".
As an aside, the name smart is likely borrowed from SmartyPants, a postprocessor to the original Markdown implementation which replaces straight quotes with curly quotes among other things. I found the extension by opening the Pandoc User Guide and searching for smart. Now you know. ;-)
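Applied to the pipeline from the question, the fix would look like this (everything else unchanged):
paste.exe | pandoc -f markdown-smart -t HTML | Set-Clipboard -AsHtml ; echo 'Conversion done.'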

Converting from docx to markdown how to get rid of span underline in links?

Since a recent pandoc update (now I'm at 2.2.1) the links in a docx document are converted to [<span class="underline">graphic novel hero</span>](https://www.amazon.com/exec/obidos/ASIN/1596432594/braipick-20), adding an unneeded span to link labels. Is there any black magic (besides adding a sed call to the pipeline) to get rid of them and return to pure commonmark?
The pandoc options I use are: pandoc -f docx --atx-headers --wrap=none --extract-media=. -t commonmark-smart myFile.docx
Thanks for clarifying!
If you use -t commonmark the spans that the docx-reader generates are converted to raw HTML, so you could use:
pandoc -t commonmark-raw_html
Alternatively, use the markdown-writer, which is more flexible in terms of extensions (but as of 2018 not yet 100%-commonmark-compliant):
pandoc -t markdown-bracketed_spans-raw_html-native_spans
See the MANUAL for more details.
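Plugged into the command line from the question, the first suggestion would look something like this (all other options unchanged from the original):
pandoc -f docx --atx-headers --wrap=none --extract-media=. -t commonmark-raw_html myFile.docx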

Strange characters appearing in bash variable expansion

Trying to do the following on CentOS 7 works as I expect:
pod_in_question=$(curl -u uname:password -k very.cluster.com/api/v1/namespaces/default/pods/ | grep -i '"name": "myapp-' | cut -d '"' -f 4)
echo "$pod_in_question"
curl -u uname:password -k -X DELETE "very.cluster.com/api/v1/namespaces/default/pods/${pod_in_question}"
However, trying the same thing on MacOS (10.12.1) yields:
curl: (3) [globbing] bad range in column 92
When I try to curl the last line with a -g option it substitutes with a malformed name such as: myapp-\x1b[m\x1b[Kl1eti
The echo statement would always execute just fine and show something like myapp-v7454 which I later want to put into the last curl statement. So where are these other characters coming from?
A robust solution - Basic cURL CLI debugging.
This answer has been revised after it was identified that the OP's issue comes from color codes being injected into the value that ends up in the cURL command.
There's a proposed answer which explains clearly what the embedded special characters mean, with instructions to override the grep behaviour so it does not output color. That is certainly good practice for grep use in piping. There are, however, a number of best practices that can help diagnose this or a similar issue with cURL and ultimately lead to the most robust solution.
Re-creating the problem
Assuming it's a JSON Content-Type, we use echo {'"name": "myapp-7414"'} to simulate the output from cURL
We filter the text and set a variable with it that we use in a cURL command
We force grep to output color, since by default it would not colorize output that is being piped rather than written to a tty.
Recreation:
myvar=$(echo {'"name": "myapp-7414"'} | grep --color=always -i '"name": "myapp-' | cut -d '"' -f 4)
curl "https://www.google.com/${myvar}"
Output:
curl: (3) [globbing] bad range in column 32
First up:
'{}' are special characters to cURL, period.
The best practise for URL syntax in cURL:
If Variable Expansion is required:
Apply the -g switch to disable potential globbing done by cURL
Otherwise:
Use $variable as part of a "quoted" url string, instead of ${variable}
Second: In addition to -g, we add --libcurl /tmp/libcurl so we can get some insight into what cURL is seeing.
Recreation with -g and --libcurl:
curl -g --libcurl /tmp/libcurl "https://www.google.com/${myvar}"
Output:
<p>Your client has issued a malformed or illegal request <ins>That’s all we know.
Perfect, at least now everything is getting to the server and back! Let's see what cURL sent out to the server:
cat /tmp/libcurl
Sure enough, we find this line (note the \033[m escape sequence embedded in the URL):
curl_easy_setopt(hnd, CURLOPT_URL, "https://www.google.com/myapp-\033[m7414");
So we know that:
The shell is doing something strange with our variable.
cURL knows not to try to glob once we send the -g switch. That way, if there is an error with the shell variable, we can actually see what it is. We shouldn't be debugging a globbing error if we're not trying to use URL ranges.
The special characters are colors. They represent the --color=always that we added to simulate the OP's environment.
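As an extra check (not in the original answer), piping the filtered value through cat -v makes the embedded escape sequences visible without a hex dump:
myvar=$(echo {'"name": "myapp-7414"'} | grep --color=always -i '"name": "myapp-' | cut -d '"' -f 4)
echo "$myvar" | cat -v
The ^[[m^[[K that shows up is the color-reset sequence grep appended after the matched text.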
At this point, since it looks like we're working with JSON data, why not just use a widely available, high-performance JSON parsing tool? That has a number of benefits, including:
Not relying on any environment that could affect string filtering
Can request exactly the data we want (i.e. "name")
The app name "myapp" can change and we won't have to re-write the code to retrieve it.
It's cleaner and accounts for things I haven't considered yet.
If we used jq, for example (and while we're at it, we no longer need the -g switch: the '{}' characters are gone from the variable and we are already double-quoting the URL):
myvar=$(echo {'"name": "myapp-7414"'} | jq -r .name)
curl --libcurl /tmp/libcurl "https://www.google.com/$myvar"
Now we get:
<p>The requested URL /myapp-7414 was not found on this server. That’s all we know.
Great, it's all working now. Obviously the test URL here, www.google.com, is not going to know what myapp-7414 is.
So we've gone from:
Globbing bad range, to:
Malformed URL, to:
URL not found on server.
We could also, as suggested elsewhere, override the grep output with --color=never (as noted: if grep has to be used, --color=never is good practice when piping strings). However, given the portability issues already experienced because of string filtering, and the fact that we are handed structured data on a plate that can be parsed reliably, the more robust solution is to parse it properly, if possible.
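Applied to the OP's original pipeline, a jq-based extraction might look something like the following; the .items[].metadata.name path is an assumption (a standard Kubernetes PodList response), so adjust it to the actual shape of the JSON:
pod_in_question=$(curl -u uname:password -k very.cluster.com/api/v1/namespaces/default/pods/ | jq -r '.items[].metadata.name | select(startswith("myapp-"))')
curl -u uname:password -k -X DELETE "very.cluster.com/api/v1/namespaces/default/pods/$pod_in_question"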
The substitution you showed at the last part looks like one of your calls injected ANSI escape sequences. It's possible that grep isn't detecting non-TTY output and is colorizing.
On a terminal that supports ANSI escape sequences, your particular codes might not be visible. The codes ^[[m^[[K (ESC [ m and ESC [ K) reset the display attributes and clear to the end of the current line. That's why you thought the echo command proved your data was correct.
You can examine the raw data with:
echo "$pod_in_question" | hexdump -C
And you should see there are other characters in there which did not appear in your terminal before. When you put these "invisible" codes into the URL, curl tries to encode them and then fails when it encounters a control character (ESC).
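For illustration only (hypothetical output, assuming the variable holds myapp- followed by the \x1b[m\x1b[K reset sequences and then v7454, as in the question), the hex dump would look roughly like this, the 1b bytes being the escape character:
00000000  6d 79 61 70 70 2d 1b 5b  6d 1b 5b 4b 76 37 34 35  |myapp-.[m.[Kv745|
00000010  34 0a                                             |4.|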
The solution is to add the argument --color=never to your grep call, which will disable colorization.
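Applied to the command from the question, only the grep call changes:
pod_in_question=$(curl -u uname:password -k very.cluster.com/api/v1/namespaces/default/pods/ | grep --color=never -i '"name": "myapp-' | cut -d '"' -f 4)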

Grep every word from a file starting a pattern

So I have a file, let's call it "page.html". Within this file there are some links/file paths I want to extract. I've been working in bash trying to get this right but can't seem to do it. The words/links/paths I want to grab all start with "/funny/hello/there/". The goal is for all of these words to go to the terminal so I can use them.
This is kinda what I've tried so far, with no luck:
grep -E '^/funny/hello/there/' page.html
and
grep -Po '/funny/hello/there/.*?' page.html
Any help would be greatly appreciated, Thanks.
Here is sample data from the file:
`<td data-title="Blah" class="Blah" >
fdsksldjfah
</td>`
My output gives me all the different lines that look like this:
fdsksldjfah
The "/fkljaskdjfl" are all something different though.
What I want the output to look like:
/funny/hello/there/fkljaskdjfl
/funny/hello/there/kfjasdflas
/funny/hello/there/kdfhakjasa
You can use this grep command:
grep -o "/funny/hello/there/[^'\"[:blank:]]*" page.html
However, one should avoid parsing HTML using shell utilities and use dedicated HTML DOM parsers instead.
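As a sketch of the parser route (assuming the paths live in href attributes of anchor tags, which the truncated sample above doesn't actually show), xmllint could pull the attributes out, with a final filter only stripping the attribute wrapper:
xmllint --html --xpath '//a/@href' page.html 2>/dev/null | grep -o '/funny/hello/there/[^"]*'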

Can I set command line arguments using the YAML metadata

Pandoc supports a YAML metadata block in markdown documents. This can set the title and author, etc. It can also manipulate the appearance of the PDF output by changing the font size, margin width and the frame sizes given to figures that are included. Lots of details are given here.
I'd like to use the metadata block to remember the command line arguments that I'm supposed to be using, such as --toc and --number-sections. I tried this, adding the following to the top of my markdown:
---
title: My Title
toc: yes
number-sections: yes
---
Then I used the command line:
pandoc -o guide.pdf articheck_guide.md
This did produce a table of contents, but didn't number the sections. I wondered why this was, and if there is a way I can specify this kind of thing from the document so that I don't need to add it on the command line.
YAML metadata are not passed to pandoc as arguments, but as variables. When you call pandoc on your MWE, it does not produce this:
pandoc -o guide.pdf articheck_guide.md --toc --number-sections
as we might think it would. Rather, it calls:
pandoc -o guide.pdf articheck_guide.md -V toc:yes -V number-sections:yes
Why, then, does your MWE produce a toc? Because the default latex template makes use of a toc variable:
~$ pandoc -D latex | grep toc
$if(toc)$
\setcounter{tocdepth}{$toc-depth$}
So setting toc to any value should produce a table of contents, at least in latex output. In this template there is no number-sections variable, so that one doesn't work. However, there is a numbersections variable:
~$ pandoc -D latex | grep number
$if(numbersections)$
Setting numbersections to any value will produce numbering in latex output with the default template:
---
title: My Title
toc: yes
numbersections: yes
---
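With this metadata block in place, the original command line should now produce both the table of contents and numbered sections:
pandoc -o guide.pdf articheck_guide.md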
The trouble with this solution is that it only works with some output formats. I thought I had read somewhere on the pandoc mailing list that we would soon be able to use YAML metadata blocks as intended (i.e. as arguments rather than variables), but I can't find it anymore, so maybe it won't happen very soon.
Have a look at panzer (GitHub repository).
This was recently announced and released by Mark Sprevak -- a piece of software that adds the notion of 'styles' to Pandoc.
It's basically a wrapper around Pandoc. It exploits the concept of YAML metadata blocks to the maximum.
The 'styles' provide a way to set all options for a Pandoc document conversion process with one line ("I want this document to be an article/CV/notes/letter.").
You can regard this as more general abstraction than Pandoc templates. Styles are combinations of...
...Pandoc command line options,
...metadata settings,
...templates,
...instructions to run filters, and
...instructions to run pre/postprocessors.
These settings can be customized on a per-output-type as well as a per-document basis. Styles can be...
...combined and
...can bear inheritance relations to each other.
panzer styles simplify Makefiles: they bundle everything concerning the look of a document in one place -- the YAML metadata (a block in the Markdown file, or a separate file).
You just add one line of metadata (style: ...) to your document, and it will be treated as a letter/article/CV/notebook or whatever.
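A minimal sketch of such a metadata block ("Article" here is just a hypothetical style name, not necessarily one that panzer ships with):
---
title: My Title
style: Article
---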
