Organizing references by year in Pandoc when generating HTML - pandoc

I am relatively new to Pandoc, and I am trying to generate an HTML file with my publications to put up on my website. I'd like the publication list to be numbered and organized by year, with the most recent first and the oldest last.
I can get the numbering fine with the proper CSL file, but I can't get the year sorting. The problem is that I'm not first author on all my publications, so what ends up happening is that entries are sorted alphabetically by author first and only then by date, which is not what I want.
I can get the result I want when generating a PDF by using biblatex with the option sorting=ydnt (Year (Descending), Name, Title), but since Pandoc doesn't use biblatex to generate the reference list for HTML, I can't use that tactic here.
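For context, the biblatex route that works for the PDF looks roughly like this (a minimal sketch; the biber backend is an assumption, and selectedpubs.bib is the bibliography file from the metadata below):
% sorting=ydnt sorts by year (descending), then name, then title
\documentclass{article}
\usepackage[backend=biber, sorting=ydnt]{biblatex}
\addbibresource{selectedpubs.bib}
\begin{document}
\nocite{*}  % include every entry from the .bib file, cited or not
\printbibliography
\end{document}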
The only way I can see to possibly solve this is to find a citation style in the Zotero style repository that does what I want, but I haven't been able to find one. So I'm trying to modify one to do it, but without success.
This answer teaches a way to change the sorting style, so I'm trying to manually change the sorting style of the Proceedings of the Royal Society B style. Specifically I'm changing
<sort>
  <key variable="citation-number"/>
</sort>
to
<sort>
  <key macro="issued" sort="descending"/>
  <key macro="author"/>
</sort>
But that doesn't work (probably because it only changes the sorting of the in-text citations, not of the reference list). I've tried a couple of other things, but I can't find anything that works!
This doesn't matter much, I guess, but I'm using Pandoc 2.7.3 with citeproc version 0.16.2, and the file I'm running it on is:
---
bibliography: selectedpubs.bib
nocite: '@*'
linestretch: 1.5
fontsize: 12pt
output:
  html:
    output: pubpage.html
    filter: pandoc-citeproc
    csl: prsb2.csl
...
The file prsb2.csl is just the Proceedings of the Royal Society B csl.
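For reference, the command line I'd expect to match that metadata is roughly the following (just a sketch; the source file name pubs.md is an assumption, and with Pandoc 2.7.x citeproc runs as the external pandoc-citeproc filter):
pandoc pubs.md -s --filter pandoc-citeproc --csl prsb2.csl -o pubpage.html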

You have the right idea, but misunderstood the linked thread. Instead of changing the sort keys for the citation, you'll want to add sorting to the bibliography, i.e.
<bibliography second-field-align="flush" et-al-min="11" et-al-use-first="10">
  <sort>
    <key macro="issued" sort="descending"/>
    <key macro="author"/>
  </sort>
  <layout>
Instead of modifying a style, you could also use the APA-CV style that already exists in the repository.
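If you go the APA-CV route, a minimal sketch of the metadata would be to download that style and point csl at it (the local file name apa-cv.csl is an assumption):
---
bibliography: selectedpubs.bib
nocite: '@*'
csl: apa-cv.csl
...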

Related

Inconsistent line spacing in reStructuredText document

I'm building RST files for my company's documentation. One irritating thing is that enumerated lists don't seem to have any consistency in terms of line spacing.
Is there a simple way to solve this?
Robert
It's a well-known problem of docutils, the library on which Sphinx is built.
From Sphinx issue tracker on GitHub:
tk0miya wrote:
In my short investigation:
The behavior comes from docutils (base library of Sphinx).
In docutils.writers.html4css1.HTMLTranslator, docutils generates <p> tag if list includes any items excepting paragraphs and nested lists.
To fix this, set self.compact_simple in visit_list_item instead of visit_bullet_list and visit_enumerated_list.
But we have to know why docutils check whole of list.
Source: sphinx-doc/sphinx #2258 - Nested field lists inside list items cause unwanted space in HTML output
See related issues:
https://github.com/rtfd/sphinx_rtd_theme/issues/119
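To make the quoted explanation concrete, here's a sketch of the difference as I understand it from that issue (the list content is made up): the first list contains only bare paragraphs and renders compactly, while the second nests a field list inside an item, which makes docutils treat the whole list as non-simple and wrap each item's text in <p> tags, producing the looser spacing.

.. a "simple" list: items are bare paragraphs, rendered compactly

#. First step
#. Second step

.. a non-simple list: the nested field list triggers the <p> wrapping

#. First step

   :note: a nested field list

#. Second step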
I'm unsure how to apply Paebbels' answer; however, I was able to get rid of the <p> tags by switching to the html4 writer, adding this line to my conf.py:
html4_writer = True
This obviously changes it to the html4 writer, so you'll need to determine whether this is acceptable or not.

Microdata for a dictionary: can I use Yandex?

I would like to use microdata/microformats/etc. for the part of my website that is an online dictionary. Basically, I just want to tag words and definitions to help search engines grab the most important data on every page belonging to the dictionary, and maybe have Google use them as "rich snippets" on results pages.
The main problem is that it's hard to find a dedicated vocabulary for words and definitions (no problem for recipes, movies and hotels, though), and I'm not sure whether I have to use the "http://schema.org/Article" tree for my lexicographic work. (To my mind, it only makes sense to tag something when the markup is specific enough.)
I have found something interesting at Yandex for words and encyclopedia entries, and I want to ask what to do with it. See here:
https://yandex.ru/support/webmaster/microdata/what-is-microdata.xml?lang=en
https://yandex.com/support/webmaster/microdata/term-definition-markup.xml
It looks like it is very close to what I'm asking for. But I'm sorry, I don't know what Yandex is... will it work with Google?
I'm asking here whether that page from Yandex is a working model, whether it is still in use, and what the pros and cons are. Will Google be able to use the specific vocabulary from Yandex and understand my Yandex-tagged data? Is it worth using that vocabulary for an online dictionary, or is there something else I have missed that would be of better use?
(http://webmaster.yandex.ru/vocabularies/term-def.xml, which should be the vocabulary URL, gives me a 404.)
One more question, please: am I allowed to write (duplicate) the most important data in the header, with something like the following? (I believe I am, because Google's microdata testing tool proves able to extract the data from this code.)
<html itemscope itemtype="http://webmaster.yandex.ru/vocabularies/term-def.xml">
  <meta itemprop="term" content="My term" />
  <meta itemprop="definition" content="My definition" />
Just to mention that I was interested in, though not satisfied by, these closely related discussions:
https://webmasters.stackexchange.com/questions/55073/what-meta-tag-or-structured-data-should-i-use-for-a-dictionary-web-application
schema.org and an online dictionary
Yandex is Russia's version of Google, and typically they both recognize and honor each other's search engine result implementations.
The articles you are referencing are incredibly outdated; I recommend seeking out fresher sources, preferably ones where the term being defined uses the proper HTML element.
Here's the Yandex URL that is 404ing; the Wayback Machine is your friend!
Back to fresher documentation/resources: in this case, the correct element as of 2016-10-05 is the <dfn> element. I know you want added semantics, but semantics is the proper place to start. I'd follow that up by marking the entire dictionary up as a definition list (<dl>), placing each term, wrapped in the <dfn> element, into a <dt>, and the definitions of the term into the corresponding <dd>s.
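As an illustration of that structure (a sketch only; the example word and definitions are made up), the markup would look something like:
<dl>
  <dt><dfn id="lexicography">lexicography</dfn></dt>
  <dd>The practice of compiling dictionaries.</dd>
  <dd>The principles and theory underlying dictionary-making.</dd>
</dl>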
I wouldn't waste time trying to find the perfect ontology here; implement the rel="tag" microformat on all of the definitions, and you can always come back and add a more suitable vocabulary later.
I've written a blog post about this, but a much more valuable resource is HTML5 Doctor's glossary implementation. More importantly, view its source: view-source:http://html5doctor.com/element-index/ (why Stack Overflow doesn't recognize the 'view-source' scheme is beyond me).
More References/Resources:
Microformats Definition Examples has some very interesting ideas/code snippets
Utilizing the Underused but Semantically Awesome Definition List - written prior to HTML5's redefinition of <dl>, but still relevant

Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.
I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.
Let me outline what I have found, as preamble to my question:
The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and so on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuuon jun, or '50-sound order').
Therefore, Kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.
So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:
Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.
宜しくお願いします。 (Thank you in advance.)
For data, dig into Google's Japanese IME (Mozc) data files here:
https://github.com/google/mozc/tree/master/src/data
There is a lot of interesting data there, including IPA dictionaries.
Edit:
You may also try MeCab; it can use the IPA dictionary and can convert kanji to katakana for most words:
https://taku910.github.io/mecab/
and there are Ruby bindings for it too:
https://taku910.github.io/mecab/bindings.html
and here is someone who tested Ruby with MeCab using the tagger option -Oyomi:
http://hirai2.blog129.fc2.com/blog-entry-4.html
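To give a rough idea, a minimal Ruby sketch using the SWIG-generated binding could look like this (assuming MeCab and an IPA dictionary are installed; the sample input is arbitrary):
require 'MeCab'  # SWIG-generated binding from the mecab ruby package

# -Oyomi asks MeCab for the reading (katakana) instead of a full analysis
tagger = MeCab::Tagger.new('-Oyomi')

# prints the katakana reading of the input, one line per input line
puts tagger.parse('広尾病院')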
Just a quick follow-up to explain the actual solution we eventually used. Thanks to all who recommended MeCab; it appears to have done the trick.
We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.
So, what we did is:
We converted all the source data (a list of 4000 hospitals with name, address, etc) into .csv format (encoded as UTF-8, of course).
Then, for developer use, we wrote a ruby script that:
Uses mecab to translate the contents of that file into Japanese phonetic readings
(the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str) # -h1 means to hiragana, -w means output utf8
Using the awesomely convenient new Ruby 1.9.2 version of CSV, combine the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, à la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on (see the sketch after this list).
Use the data from the resulting .csv file to seed our Rails app with its built-in values.
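The sketch referenced in the list above might look roughly like this (the input file names are the ones mentioned in this answer, but the output name and the two-column layout are simplifications; the real files have more columns):
require 'csv'
require 'nkf'

original  = CSV.read('seed_hospitals.csv')            # NAME, ADDRESS
converted = CSV.read('seed_hospitals.converted.csv')  # mecab -Oyomi output

CSV.open('seed_hospitals.with_yomigana.csv', 'w') do |out|
  original.zip(converted).each do |(name, address), (name_yomi, address_yomi)|
    # NKF: -h1 converts katakana to hiragana, -w outputs UTF-8
    out << [name, NKF.nkf('-h1 -w', name_yomi.to_s),
            address, NKF.nkf('-h1 -w', address_yomi.to_s)]
  end
end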
From time to time the client updates the source data, so we will need to do this whenever that happens.
As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.
UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
Many thanks to all who helped me!
Nice to hear people are working with Japanese.
I think you're spot on with your assessment of the problem difficulty. I just asked one of the Japanese guys in my lab, and the way to do it seems to be as you describe:
Take a list of Kanji
Infer (guess) the yomigana
Sort yomigana by gojuon.
The hard part is obviously step two. I have two guys in my lab: 高橋 and 高谷. Naturally, when sorting reports etc. by name they appear nowhere near each other.
EDIT
If you're fluent in Japanese, have a look here: http://mecab.sourceforge.net/
It's a pretty popular tool, so you should be able to find English documentation too (the man page for mecab has English info).
I'm not familiar with MeCab, but I think using MeCab is a good idea.
That said, I'll introduce another method.
If your app is written in Microsoft VBA, you can call the "GetPhonetic" function. It's easy to use.
see : http://msdn.microsoft.com/en-us/library/aa195745(v=office.11).aspx
Sorting prefectures by their pronunciation is not common. Most Japanese are used to prefectures being sorted by 「都道府県コード」 (prefecture code).
e.g. 01:北海道, 02:青森県, …, 13:東京都, …, 27:大阪府, …, 47:沖縄県
These codes are defined in "JIS X 0401" and in "ISO 3166-2:JP".
see (Wikipedia Japanese) :
http://ja.wikipedia.org/wiki/%E5%85%A8%E5%9B%BD%E5%9C%B0%E6%96%B9%E5%85%AC%E5%85%B1%E5%9B%A3%E4%BD%93%E3%82%B3%E3%83%BC%E3%83%89
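To illustrate the point, a minimal Ruby sketch of sorting by code (the codes and names are just the examples above; a real table would list all 47 prefectures):
# JIS X 0401 prefecture codes => names (subset for illustration)
PREFECTURES = {
  '01' => '北海道',
  '02' => '青森県',
  '13' => '東京都',
  '27' => '大阪府',
  '47' => '沖縄県',
}

# sorting by the zero-padded code string gives the conventional order
PREFECTURES.sort_by { |code, _name| code }.each do |code, name|
  puts "#{code}: #{name}"
end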

XSL-FO (Apache FOP): long text flows into adjacent cells/blocks, obscuring content there

Could anyone suggest a way to make long words (like serial numbers) wrap? I tried some commercial software and there is no such issue there. Is it a FOP bug, or is there a solution available?
I can't insert a zero-width space after each character of every word in the document; that solution sounds insane to me.
You can specify the wrap-option attribute in your fo:block like so:
<fo:block wrap-option="wrap"> ... stuff </fo:block>
Here's the XSL-FO specification for this attribute:
XSL Definition:
Value: no-wrap | wrap | inherit
Initial: wrap
Applies to: fo:block, fo:inline, fo:page-number, fo:page-number-citation
Inherited: yes
Percentages: N/A
Media: visual
Specifies how line-wrapping (line-breaking) of the content of the formatting object is to be handled.
Values have the following meanings:
no-wrap: No line-wrapping will be performed. In the case when lines are longer than the available width of the content-rectangle, the overflow will be treated in accordance with the "overflow" property specified on the reference-area.
wrap: Line-breaking will occur if the line overflows the available block width. No special markers or other treatment will occur.
Implementations must support the "no-wrap" value, as defined in this Recommendation, when the value of "linefeed-treatment" is "preserve".
You can also define the wrap-option attribute in an fo:table-cell
<fo:table-cell wrap-option="wrap"> ... </fo:table-cell>
and the fo:blocks within will inherit the property.
Zkoh's answer (wrapping) will help you only if the text contains multiple words separated by whitespace. In the case of long words (as mentioned in the question), hyphenation is the way to go (as Daniel suggested).
There can be quite a few problems with hyphenation in FOP:
FOP uses hyphenation algorithms from TeX, and because of some licensing issues those algorithms (at least for some languages) are not part of the standard FOP binary distribution (as stated here); they must be downloaded separately from the OFFO web site. There are two kinds of hyphenation pattern files on that site: the XML format (which needs to be compiled first to be used with FOP) and a JAR file (already compiled). Be sure to download the compiled version! Installation is straightforward and well documented: just drop the OFFO binary into FOP's lib folder and that's it.
Don't forget to specify the language of your document and, if needed, enable hyphenation at the block level (it's inherited, so add it to the root element and you should be fine); see the FOP FAQ and the sketch after this list.
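A minimal sketch of where those attributes can go (both are inherited, so setting them on fo:root enables hyphenation document-wide; this assumes the matching OFFO hyphenation patterns for the chosen language are on FOP's classpath):
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format"
         language="en"
         hyphenate="true">
  <!-- layout-master-set, page-sequences and blocks as usual -->
</fo:root>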
Would hyphenation solve your problem? You should be able to enable hyphenation with a hyphenate="true" attribute. Placement of this attribute will depend on where you want to enable hyphenation.
Here's a link to FOP's hyphenation compliance: Apache FOP Compliance Page
Here's a link to the XSL spec: XSL Spec #hyphenate
If not, you may need to experiment with some keeps properties (like keep-together.within-line).
Use keep-together.within-column="always" instead of keep-together="always" to keep long lines within a table cell.
The question is about serial numbers, not about dictionary words. Specifying hyphenate="true" is useful only when the hyphenation dictionary or hyphenation algorithm can successfully hyphenate the words in the text. Serial numbers would rarely produce sequences that can usefully be hyphenated as if they were words.
You can, of course, use XSLT to add zero-width spaces in text in table cells rather than doing it manually. StackOverflow likes duplicate questions (see https://stackoverflow.blog/2010/11/16/dr-strangedupe-or-how-i-learned-to-stop-worrying-and-love-duplication/), but, all the same, please see the answers in XSL-FO: Force Wrap on Table Entries.
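For illustration, here's a sketch of that XSLT route (XSLT 1.0, recursive; the serial element name is a made-up placeholder for wherever the serial numbers live). It inserts a zero-width space (&#x200B;) after every character so FOP can break the line anywhere:
<xsl:template match="serial/text()">
  <xsl:call-template name="intersperse-zwsp">
    <xsl:with-param name="text" select="."/>
  </xsl:call-template>
</xsl:template>

<xsl:template name="intersperse-zwsp">
  <xsl:param name="text"/>
  <xsl:if test="string-length($text) &gt; 0">
    <!-- emit one character, then a zero-width space, then recurse -->
    <xsl:value-of select="substring($text, 1, 1)"/>
    <xsl:text>&#x200B;</xsl:text>
    <xsl:call-template name="intersperse-zwsp">
      <xsl:with-param name="text" select="substring($text, 2)"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>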
For the text overflow problem, use the keep-together="auto" attribute. (The original answer included before/after screenshots showing the text overflow issue and the fixed version after applying keep-together="auto".)

How to efficiently work with gettext PO files when making small edits to large text values

Looking for tips and/or tools on how to efficiently work with gettext PO files when making small edits to large msgid values.
Example: We have lots of multi-sentence/multi-paragraph messages that are stored in our PO message catalog files. If we make a very minor change to a message, perhaps editing a single sentence or even correcting punctuation, we lose our original translation when we run the msgmerge utility.
Rather than re-translate long messages (that have already gone through an editorial approval process) from scratch, our translators go back to backup copies of their PO files and manually search for the text of the last msgid/msgstr translation pair, which they then diff against the current msgid values to see what has changed. They then copy and paste the last translation and edit it to reflect the updated msgid value.
That's a lot of work! Certainly there must be a better way of handling this type of workflow?
Is there a best practice way to archive and find previous translations that are no longer in a PO file? One idea that comes to mind is to store a unique msg id in the text of our messages or in the comments that precede our message and use this id to retrieve previous msgid/msgstr translation pairs for review. Or are there PO editors or online services that make this process more efficient?
Thank you,
Malcolm
I've been looking for a way to make minor changes to msgids without disturbing existing translations - for instance, typo fixes in the source text. Here's a recipe I've just worked out that doesn't involve websites:
Use msgen from GNU gettext to generate an English-to-English po file:
msgen project.pot >corrections.po
Manually edit the msgstrs in "corrections.po" to reflect the typo fixes made in the source text, so we have a mapping from uncorrected to corrected strings. (I haven't thought about how to automate this bit.)
For each "real" translation (for example ca.po): abuse poswap from the Translate Toolkit (translate-toolkit in Ubuntu) to change the msgids:
poswap -i corrections.po -t ca.po -o ca.new.po
This does seem to lose header comments and obsolete strings from GNU gettext po files, but manually fixing those up is much less work than manually tweaking msgids in each translation (and could probably easily be scripted).
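That scripting could be as simple as a loop over the translations (a sketch only; file names are assumptions, and you'll want backups of the .po files first):
for po in *.po; do
    [ "$po" = corrections.po ] && continue
    poswap -i corrections.po -t "$po" -o "${po%.po}.new.po"
done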
(Obviously, this should only be used in exceptional circumstances, where you're absolutely sure that none of the translators need the opportunity to re-review their translations.)
Virtaal's translation memory support can probably help with this. If your original units are in the translation memory, they will be shown (with differences) when they fall within a certain margin of change (based on Levenshtein distance). They will still contain the original (unmodified) translation, but at least the original text is more easily accessible and the differences are highlighted.
I'm not 100% sure, but Pootle might also offer a web based solution. If you need any help, ask in #pootle on FreeNode.
The more general improvement is, of course, to separate/segment the units as far as possible.
