How can I print unicode characters in LaTeX? - utf-8

I need to print a bunch of unicode characters using LaTeX, and cannot find a solution.
Here is the simplest (not)working example:
\documentclass[10pt]{article}
\begin{document}
Test: $\beta$ βᵝᵦꞵ𝛃𝛽𝜷𝝱𝞫
\end{document}
The output is:
Test: β with XeLaTeX and LuaLaTex
With PdfLaTex I get the standard error:
Package inputenc error: Unicode character (...) not set up for use with LaTex
I am aware of the possibility to re-define all unicode characters to a single stndardized one, that of \beta. However, that is not the solution, as I need to print the characters exactly as displayed above or in any decent text editor.
The file I use is encoded in UTF-8. I am using TexMaker, also set-up for UTF-8.

Related

Generate UTF-8 character list

I have a UTF-8 file which I convert to ISO-8859-1 before sending the file to a consuming system that does not understand the UTF-8. Our current issue is that when we run the iconv process on the UTF-8 file, some characters are getting converted to '?'. Currently, for every failing character, we have been providing a fix.
I am trying to understand if it is possible to create a file which has all possible UTF-8 characters? The intent is to downgrade them using iconv and identify the characters that are getting replaced with '?'
Rather than looking at every possible Unicode character (over 140k of them), I recommend performing an iconv substitution and then seeing where your actual problems are. For example:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="<U+%04X>"
This will convert characters that aren't in ISO-8859-1 to a "<U+####>" syntax. You can then search your output for these.
If your data will be read by something that handles C-style escapes (\u####), you can also use:
iconv -f UTF-8 -t ISO-8859-1 --unicode-subst="\\u%04x"
An exhaustive list of all Unicode characters seems rather impractical for this use case. There are tens of thousands of characters in non-Latin scripts which don't have any obvious near-equivalent in Latin-1.
Instead, probably look for a mapping from Latin characters which are not in Latin-1 to corresponding homographs or near-equivalents.
Some programming languages have existing libraries for this; a common and simple transformation is to attempt to strip any accents from characters which cannot be represented in Latin-1, and use the unaccented variant if this works. (You'll want to keep the accent for any character which can be normalized to Latin-1, though. Maybe also read about Unicode normalization.)
Here's a quick and dirty Python attempt.
from unicodedata import normalize
def latinize(string):
"""
Map string to Latin-1, replacing characters which can be approximated
"""
result = []
for char in string:
try:
byte = normalize("NFKC", char).encode('latin-1')
except UnicodeEncodeError:
byte = normalize("NFKD", char).encode('ascii', 'ignore')
result.append(byte)
return b''.join(result)
def convert(fh):
for line in fh:
print(latinize(line), end='')
def main():
import sys
if len(sys.argv) > 1:
for filename in sys.argv[1:]:
with open(filename, 'r') as fh:
convert(fh)
else:
convert(sys.stdin)
if __name__ == '__main__':
main()
Demo: https://ideone.com/sOEBW9

Pandoc - is tex output with dollar sign math possible?

Is it possible to have .tex files output from pandoc have math mode with dollar signs ($)? The manual says:
LaTeX: It will appear verbatim surrounded by \(...\) (for inline math) or \[...\] (for display math).
I also found this Github issue from 2016 where the author says it could be selectable. Is there now a pandoc argument or another way of having the .tex output use dollar signs?
You can do this using a pandoc filter. E.g.:
-- Do nothing unless we are targeting TeX.
if not FORMAT:match('tex$') then return {} end
function Math (m)
local delimiter = m.mathtype == 'InlineMath' and '$' or '$$'
return pandoc.RawInline('tex', delimiter .. m.text .. delimiter)
end
Save to file dollar-math.lua, and pass it to pandoc via --lua-filter=dollar-math.lua.

Is Bash support Unicode 6.0?

When I use unicode 6.0 character(for example, 'beer mug') in Bash(4.3.11), it doesn't display correctly.
Just copy and paste character is okay, but if you use utf-16 hex code like
$ echo -e '\ud83c\udf7a',
output is '??????'.
What's the problem?
You can't use UTF-16 with bash and a unix(-like) terminal. Bash strings are strings of bytes, and the terminal will (if you have it configured correctly) be expecting UTF-8 sequences. In UTF-8, surrogate pairs are illegal. So if you want to show your beer mug, you need to provide the UTF-8 sequence.
Note that echo -e interprets unicode escapes in the forms \uXXXX and \UXXXXXXXX, producing the corresponding UTF-8 sequence. So you can get your beer mug (assuming your terminal font includes it) with:
echo -e '\U0001f37a'

Can a makefile contain UTF8 chracters?

I'm currently trying to figure out if anything can be done about dmake resulting in this error message on a makefile with a simple filename containing utf8 characters:
Name contains non-printable character [0xffffffe0]
In my research i've been unable to find any mention of whether GNU make or dmake are even supposed to be able to handle makefiles with UTF8 characters in them.
Thus my question is: Can a makefile can contain UTF8 characters and if that answer is known, where is that documented?
To answer myself:
GNU make can deal with UTF-8 just fine.
dmake, being a mostly abandoned reimplementation of make, can only deal with ASCII.
Make on windows does not work with UTF-8. You will get the "missing separator" even with a blank file. Use notepad.exe to convert the makefile to ANSI. NOTE: there is a little dropdown list box next to the save button.

No norwegian characters in LaTeX

I have translated a document from English to Norwegian in the LaTeX format, and while using norwegian special characters, I get an error using
\usepackage[utf8x]{inputenc}
to try and display the norwegian (scandinavian) special characters in PostScript/PDF/DVI format, saying
Package utf8x Error: MalformedUTF-8sequence.
So while that didn't work, I tried out another possible solution:
\usepackage{ucs}
\usepackage[norsk]babel
And when I tried to save that in Emacs I get this message:
These default coding systems were tried to encode text
in the buffer `lol.tex':
(utf-8-unix (905 . 4194277) (916 . 4194245) (945 . 4194278) (950
. 4194277) (954 . 4194296) (990 . 4194277) (1010 . 4194277) (1013
. 4194278) (1051 . 4194277) (1078 . 4194296) (1105 . 4194296))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these: \345 \305 \346 \345 \370 \345 \345 \346 \345 \370 ...
Thanks to Emacs I have the possibility to check out the properties of those characters and the first one tells me:
character: \345 (4194277, #o17777745, #x3fffe5)
preferred charset: eight-bit (Raw bytes 128-255)
code point: 0xE5
syntax: w which means: word
buffer code: #xE5
file code: not encodable by coding system utf-8-unix
display: not encodable for terminal
Which doesn't tell me much. When I try to build this with texi2dvi --dvipdf filename.text I get a perfectly fine PDF, all without the special norwegian characters.
When I am about to save Emacs also ask me:
"Select coding system (default raw-text):"
And I type in utf-8 to choose its coding system. I have also tried to choose default raw-text to see if I get some different result. But nothing.
At last I tried
\lstset{inputencoding=utf8x, extendedchars=\true}
... a code I came over while trying to google the solution to this problem. Which gives me this error:
Undefined control sequence.
So basically, I have tried every encoding option I have been able to find and nothing works. I am desperately trying to make this work since the norwegian translation must be published before the deadline.
As an additional information I may add that I found out later on that I only had the en_US.UTF-8 in my locale, so I added nb_NO.UTF-8 and nb_NO.ISO-8859-15 and ran locale-gen + reboot without any changes.
I hope I provided enough information to get some assistance, the characters in question is æ ø å.
Apparently your emacs is having a hard time saving the file as UTF-8 (which doesn't make much sense since it should be able to represent all characters using that encoding). You should try using another editor with multiple encoding support to save the file as UTF-8.
While you're unable to save the file in UTF-8, LaTeX will not be able to correctly read it, unless you specify your current file encoding as inputenc package parameter. You may want to try to, for instance, save the file as-is in emacs but specifying \usepackage[latin1]{inputenc} which should do the trick if emacs is writing the file using something in the *iso-8859-** family.
I solved this error by setting the coding system for saving file:
C-x C-m f utf-8-unix

Resources