Groff: how to successfully compile a manpage containing Cyrillic in UTF-8?

Colleagues, please tell me what arguments need to be passed to groff so that it compiles a manpage written in Cyrillic, in UTF-8 encoding.
I have a number of pages written in markdown and converted to nroff by the lowdown utility.
Groff version 1.19.2 for FreeBSD 11.4 amd64.
If I run it with options like 'groff -man -Tutf8', it is very unhappy:
standard input:3: warning: can't find character with input code 157
standard input:3: warning: can't find character with input code 144
standard input:3: warning: can't find character with input code 151
standard input:3: warning: can't find character with input code 146
standard input:3: warning: can't find character with input code 152
standard input:3: warning: can't find character with input code 149
standard input:4: warning: can't find character with input code 143
How do I get groff to work with such pages?

If you use such characters, you must use a font that contains them. Depending on your system, that may be dejavu, for example.
groff -man -Tutf8 -f dejavu
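The warnings about single input codes suggest groff is reading the UTF-8 bytes one at a time rather than decoding them. groff 1.19.2 predates the preconv preprocessor, but if a newer groff (1.20 or later, e.g. from ports) is an option, letting it decode the input usually helps. A minimal sketch, assuming such a groff and a hypothetical page file cyr.1:
preconv -e utf8 cyr.1 | groff -man -Tutf8
groff -k -K utf8 -man -Tutf8 cyr.1
Here -k tells groff to run preconv itself and -K names the input encoding.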

Related

pandoc convert to pdf fails when € is in a header

Windows 10 \euro in LaTeX
I have MiKTeX 2.9.6300 64-bit, and I can run xelatex eurosym, where eurosym.tex contains the following LaTeX code:
\documentclass{article}
\usepackage{eurosym}
\begin{document}
\section{2\euro{}} % (this line fails if eurosym package isn't called)
2€ % (this € symbol is lost if compiled with pdflatex instead of xelatex)
\end{document}
- output is as expected.
Pandoc on Win10 eurosym call fails
My local copy of pandoc-templates/default.latex correctly contains the call:
$if(euro)$
\usepackage{eurosym}
$endif$
And I have pandoc v1.19.2.1 installed, so I would expect pandoc eurosym.md -o eurosym.pdf --latex-engine=xelatex to work when eurosym.md contains just:
# 2€
2€
- but in fact I get:
! Improper alphabetic constant.
<to be read again>
\euro
l.70 \section{2\euro{}}
pandoc.exe: Error producing PDF
- why is the command \euro{} failing here in Pandoc?
Same with Pandoc on Arch Linux
With pandoc 1.19.2.1-90, calling texlive-bin 2016.41290-12, and using jgm's default.latex, I get:
! Improper alphabetic constant.
<to be read again>
\euro
l.35 \section{2\euro{}}
pandoc: Error producing PDF
But I can get it to work as expected by commenting out line 76 of jgm's default.latex, \usepackage[unicode=true]{hyperref} (and then lines 80-98 and line 253, both \hypersetup{...} clauses, and line 99, \urlstyle{same}, which require hyperref).
Discovering that there is a more current default.latex, I tried with that one, and Pandoc failed on \euro again.
The conclusion appears to be that eurosym and hyperref are somehow mutually incompatible in Pandoc.
So I've created a New Issue: eurosym in Pandoc seems to be incompatible with hyperref #3801.
You must use \usepackage{eurosym} only with pdflatex. XeLaTeX is unicode-aware and you can just use the euro symbol directly.
The default pandoc template therefore wraps it in an "if pdftex" check, which you might have missed when modifying your template:
\ifnum 0\ifxetex 1\fi\ifluatex 1\fi=0 % if pdftex
...
$if(euro)$
\usepackage{eurosym}
$endif$
...
\fi
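If you want to rebase your modifications on the template your pandoc actually ships, you can print it and pass it back explicitly; a sketch with hypothetical file names (-D is short for --print-default-template):
pandoc -D latex > my-default.latex
pandoc eurosym.md -o eurosym.pdf --latex-engine=xelatex --template=my-default.latex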

Mac issue with file encoding

I have a script which is reading some data from one server and storing it in a file. But the file seems somehow corrupt. I can print it to the display, but checking the file with file produces
bash$ file -I filename
filename: text/plain; charset=unknown-8bit
Why is it telling me that the encoding is unknown? The first line of the file displays for me as
“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody
A hex dump reveals that the first three bytes are 0xE2, 0x80, 0x9C followed by the regular ASCII text The Galaxy A5...
What's wrong? Why does file tell me the encoding is unknown, and what is it actually?
Based on the information in the question, the file is a perfectly good UTF-8 file. The first three bytes encode LEFT DOUBLE QUOTATION MARK (U+201C) aka a curly quote.
Maybe your version of file is really old.
You can use iconv to convert the file into the desired charset, e.g.
iconv --from-code=UTF8 --to-code=YOURTARGET
To get a list of supported targets, use the --list flag.
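A concrete sketch with hypothetical file names: note that the curly quote has no slot in plain ISO-8859-1, so either pick a target that contains it (such as Windows-1252) or ask iconv to transliterate.
iconv --from-code=UTF-8 --to-code=WINDOWS-1252 filename > filename.cp1252
iconv --from-code=UTF-8 --to-code=ISO-8859-1//TRANSLIT filename > filename.latin1
The //TRANSLIT suffix (where the iconv build supports it) substitutes an approximation, e.g. a plain " for the curly quote.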

How to grep for exact hexadecimal value of characters

I am trying to grep for the hexadecimal values of a range of UTF-8 encoded characters, and I want only that specific range of characters to be returned.
I currently have this:
grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt
But this returns every character that has any of those hex values in its hex representation, i.e. it returns 00B9 - FFB9 as long as the B9 is present.
Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?
Sample Input:
STRING_OPEN
Open
æ–­å¼€
Ouvert
Abierto
Открыто
Abrir
Now, using my grep statement, it should return the 3rd and 6th lines, but it also matches some Russian and Chinese text in my file, because the ranges for those languages include the hex values I'm searching for, like these:
断开
Открыто
I can't give out more sample input unfortunately as it's work related.
EDIT: Actually the below code snippet worked!
grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt
It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically get "uncorrupted", i.e. when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–­å¼€, and in the text file it's shown as 断开.
Since you're using -P, you're probably using GNU grep, because that is a GNU grep extension. Your command works using GNU grep 2.21 with pcre 8.37 and a UTF-8 locale, however there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or it is possible that your locale is set to one that uses single-byte characters.
If you don't want to upgrade, it is possible to match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to something that uses single-byte characters (like C) will ensure that it will match individual bytes even in versions that correctly support multi-byte characters.
LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt
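To double-check which bytes a given character occupies before building such a byte-level pattern, od can dump them; a quick sketch, assuming a UTF-8 terminal:
printf '¹' | od -An -tx1    # c2 b9
printf '¿' | od -An -tx1    # c2 bf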

Confusion with multibyte character set and `putc`

I was trying to understand how ARGF#putc works. I was actually doing tests with multibyte character sets.
Here is the sample:
$stdout.putc 63 #<~~~ A
#?=> 63
$stdout.putc 191
#?=> 191
$stdout.putc 181
#?=> 181
$stdout.putc 166
#?=> 166
Now my question is: why does every statement except line A print ??
My Ruby version is:
D:\Rubyscript\My ruby learning days>ruby -v
ruby 2.0.0p0 (2013-02-24) [i386-mingw32]
It depends on the default encoding (or code page on windows) for your console. You can run chcp in cmd.exe to check.
ASCII covers the characters and control characters from \x00 to \x7F. A single-byte character set such as ISO-8859-1 contains the ASCII characters plus additional characters in \x80-\xFF. Judging from your post, your console's default code page isn't compatible with ISO-8859-1, so the console doesn't know how to represent those \x80-\xFF characters.
You need to do some encoding conversion before printing it to your console.
putc 191.chr.force_encoding('ISO-8859-1').encode('UTF-8')
# UTF-8 is the default encoding used in my Linux environment
# you need to replace it with your console's default encoding
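For example, on a cmd.exe console whose active code page is 850 (only an example; run chcp to see yours), the conversion target would be that code page rather than UTF-8:
chcp
ruby -e "putc 191.chr.force_encoding('ISO-8859-1').encode('CP850')"
191 is ¿ in ISO-8859-1, and CP850 has its own byte for that character, so the console can display it.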

Reading ASCII-encoded files with Ruby 1.9 in a UTF-8 environment

I just upgraded from Ruby 1.8 to 1.9, and most of my text processing scripts now fail with the error invalid byte sequence in UTF-8. I need to either strip out the invalid characters or specify that Ruby should use ASCII encoding instead (or whatever encoding the C stdio functions write, which is how the files were produced) -- how would I go about doing either of those things?
Preferably the latter, because (as near as I can tell) there's nothing wrong with the files on disk -- if there are weird, invalid characters they don't appear in my editor...
What's your locale set to in the shell? On Linux-based systems you can check this by running the locale command, and change it with e.g.
$ export LANG=en_US
My guess is that you are using locale settings with UTF-8 encoding, which causes Ruby to assume that the text files were created according to UTF-8 encoding rules. You can see this by trying
$ LANG=en_GB ruby -e 'warn "foo".encoding.name'
US-ASCII
$ LANG=en_GB.UTF-8 ruby -e 'warn "foo".encoding.name'
UTF-8
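Alternatively, if you would rather tell Ruby what the files actually are than change your locale, you can set the default external encoding per run; a sketch assuming the files are Latin-1 (or plain ASCII) and a hypothetical script name:
$ ruby -E ISO-8859-1 my_script.rb
$ ruby -E ISO-8859-1:UTF-8 my_script.rb   # also transcode to UTF-8 internally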
For a more general treatment of how string encoding has changed in Ruby 1.9 I thoroughly recommend
http://blog.grayproductions.net/articles/ruby_19s_string
(code examples assume bash or similar shell - C-shell derivatives are different)

Resources