I have to define a string that contains Persian text, but when I run javac on my program I get "unmappable character for encoding cp1252". I tried javac -encoding ISO-8859-1, as suggested here.
Now my program compiles without errors, but only the numbers in the text are shown and the Persian characters disappear. For example, my string is "من 2 کتاب و 3 کامپیوتر دارم", which means "I have 2 books and 3 computers". The displayed text is just: 2 3.
By the way, when I use NetBeans I get no errors and all of the text is shown; it is javac that causes these problems.
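A minimal sketch of the UTF-8 route (the class name and the explicit UTF-8 PrintStream are illustrative additions): save the source file as UTF-8 and pass the same encoding to javac, so the compiler reads the Persian characters as they were written.

public class PersianDemo {
    public static void main(String[] args) throws Exception {
        String s = "من 2 کتاب و 3 کامپیوتر دارم";
        // Print through an explicit UTF-8 stream; the default Windows
        // console encoding (cp1252) cannot represent Persian letters.
        java.io.PrintStream out = new java.io.PrintStream(System.out, true, "UTF-8");
        out.println(s);
    }
}

Compile with javac -encoding UTF-8 PersianDemo.java. Telling javac the wrong encoding (such as ISO-8859-1 for a UTF-8 file) silences the warning but makes the compiler misread the Persian characters.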
Related
I have a file that I'm pretty sure is in a weird encoding. I've previously converted similar files to UTF-8 by assuming they were encoded in windows-1255, using iconv (iconv -f windows-1255 -t utf-8 $file), and that has worked.
My current file contains a ß character that is throwing me off: iconv breaks when it hits it (with an "illegal input sequence" error). Is there a different encoding I should be using?
WINDOWS-1255 (= Hebrew) does not contain an eszett (ß), so iconv behaves correctly. Other legacy codepages that do have that character at code point 00DF:
WINDOWS-1250 = Latin 2 / Central European
WINDOWS-1252 = Latin 1 / Western European
WINDOWS-1254 = Turkish
WINDOWS-1257 = Baltic
WINDOWS-1258 = Vietnamese
Only the document owner knows which codepage is the correct one. If it's one of the WINDOWS-125x at all.
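If you want to test the candidates quickly, running iconv once per codepage and inspecting the results will usually settle it; the output file name below is just an example:
iconv -f windows-1252 -t utf-8 "$file" > "$file.utf8"
Repeat with windows-1250, windows-1254, windows-1257, and windows-1258, and keep whichever output reads sensibly.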
I need to print a bunch of unicode characters using LaTeX, and cannot find a solution.
Here is the simplest (non-)working example:
\documentclass[10pt]{article}
\begin{document}
Test: $\beta$ βᵝᵦꞵ𝛃𝛽𝜷𝝱𝞫
\end{document}
The output is:
Test: β with XeLaTeX and LuaLaTeX
With pdfLaTeX I get the standard error:
Package inputenc error: Unicode character (...) not set up for use with LaTeX
I am aware of the possibility of redefining all of these Unicode characters to a single standardized one, namely \beta. However, that is not the solution, as I need to print the characters exactly as displayed above and in any decent text editor.
The file I use is encoded in UTF-8. I am using Texmaker, which is also set up for UTF-8.
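Since the file is valid UTF-8 and XeLaTeX/LuaLaTeX already read it without complaint, the missing glyphs are most likely a font-coverage problem rather than an encoding problem. A sketch that switches to a Unicode font via fontspec (DejaVu Sans is only an example; any font containing these glyphs will do):

\documentclass[10pt]{article}
\usepackage{fontspec}
\setmainfont{DejaVu Sans} % example font; pick one that covers the beta variants
\begin{document}
Test: $\beta$ βᵝᵦꞵ𝛃𝛽𝜷𝝱𝞫
\end{document}

Compile with xelatex or lualatex; how many of the variants appear still depends on the chosen font's actual coverage.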
I am running Windows 7 and (have to) use Turbo Grep (Borland something) to search in a file.
I have 2 versions of this file, one encoded in UTF-8 and one in ANSI.
If I run the following grep on the ANSI file, I get the expected results, but I get no results with the same statement on the UTF-8 file:
grep -ni "[äöü]" myfile.txt
[-n for line numbers, -i for ignoring cases]
The Turbo Grep Version is :
Turbo GREP 5.6 Copyright (c) 1992-2010 Embarcadero Technologies, Inc.
Syntax: GREP [-rlcnvidzewoqhu] searchstring file[s] or #filelist
GREP ? for help
Help for this command lists:
Options are one or more option characters preceded by "-", and optionally
followed by "+" (turn option on), or "-" (turn it off). The default is "+".
-r+ Regular expression search -l- File names only
-c- match Count only -n- Line numbers
-v- Non-matching lines only -i- Ignore case
-d- Search subdirectories -z- Verbose
-e Next argument is searchstring -w- Word search
-o- UNIX output format Default set: [0-9A-Z_]
-q- Quiet: supress normal output
-h- Supress display of filename
-u xxx Create a copy of grep named 'xxx' with current options set as default
A regular expression is one or more occurrences of: One or more characters
optionally enclosed in quotes. The following symbols are treated specially:
^ start of line $ end of line
. any character \ quote next character
* match zero or more + match one or more
[aeiou0-9] match a, e, i, o, u, and 0 thru 9 ;
[^aeiou0-9] match anything but a, e, i, o, u, and 0 thru 9
Is there a problem with the encoding of these characters in UTF-8? Might there be a problem with Turbo Grep and UTF-8?
Thanks in advance
Yes, there is a difference: Windows 7 uses UTF-16 little-endian, not UTF-8; UTF-8 is used by Unix, Linux, and Plan 9, to cite a few operating systems.
Jon Skeet explains:
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default code page for my system" which is obtained via Encoding.Default, and is often Windows-1252
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
UTF-16 is more similar to ANSI in this respect, which is why the ANSI file works well.
If you use only ASCII, both encodings are usable, but with special characters such as ä, ö, ü, etc., you need to use UTF-16 on Windows and UTF-8 on the other systems.
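One concrete way to see the problem: in UTF-8 each umlaut occupies two bytes, so a byte-oriented tool comparing single bytes against [äöü] will never match. A quick Java check (windows-1252 stands in for "ANSI" here):

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        // "ä" is one byte (0xE4) in the ANSI codepage...
        byte[] ansi = "ä".getBytes("windows-1252");
        // ...but two bytes (0xC3 0xA4) in UTF-8.
        byte[] utf8 = "ä".getBytes("UTF-8");
        System.out.println(ansi.length + " vs " + utf8.length); // prints "1 vs 2"
    }
}

A pragmatic workaround, assuming an iconv port is available on Windows, is to convert the file before searching (iconv -f utf-8 -t windows-1252 myfile.txt > myfile.ansi.txt) and run Turbo GREP on the result.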
I have translated a document from English to Norwegian in LaTeX, and when using Norwegian special characters I get an error using
\usepackage[utf8x]{inputenc}
to try to display the Norwegian (Scandinavian) special characters in PostScript/PDF/DVI output, saying
Package utf8x Error: MalformedUTF-8sequence.
So while that didn't work, I tried out another possible solution:
\usepackage{ucs}
\usepackage[norsk]{babel}
And when I tried to save that in Emacs I get this message:
These default coding systems were tried to encode text
in the buffer `lol.tex':
(utf-8-unix (905 . 4194277) (916 . 4194245) (945 . 4194278) (950
. 4194277) (954 . 4194296) (990 . 4194277) (1010 . 4194277) (1013
. 4194278) (1051 . 4194277) (1078 . 4194296) (1105 . 4194296))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these: \345 \305 \346 \345 \370 \345 \345 \346 \345 \370 ...
Thanks to Emacs I can check the properties of those characters, and the first one tells me:
character: \345 (4194277, #o17777745, #x3fffe5)
preferred charset: eight-bit (Raw bytes 128-255)
code point: 0xE5
syntax: w which means: word
buffer code: #xE5
file code: not encodable by coding system utf-8-unix
display: not encodable for terminal
That doesn't tell me much. When I try to build this with texi2dvi --dvipdf filename.text I get a perfectly fine PDF, except that the special Norwegian characters are all missing.
When I am about to save, Emacs also asks me:
"Select coding system (default raw-text):"
and I type in utf-8 to choose that coding system. I have also tried the default raw-text to see if I get a different result, but nothing changes.
At last I tried
\lstset{inputencoding=utf8x, extendedchars=\true}
... a snippet I came across while trying to google a solution to this problem. It gives me this error:
Undefined control sequence.
So basically, I have tried every encoding option I have been able to find, and nothing works. I am desperate to make this work, since the Norwegian translation must be published before the deadline.
As additional information, I may add that I later found out I only had en_US.UTF-8 in my locale, so I added nb_NO.UTF-8 and nb_NO.ISO-8859-15 and ran locale-gen, then rebooted, without any change.
I hope I have provided enough information to get some assistance; the characters in question are æ, ø, and å.
Apparently your Emacs is having a hard time saving the file as UTF-8 (which doesn't make much sense, since that encoding can represent all characters). You could try another editor with multiple-encoding support to save the file as UTF-8.
As long as you're unable to save the file in UTF-8, LaTeX will not read it correctly unless you declare the file's actual encoding as the inputenc package parameter. You may want to try, for instance, saving the file as-is in Emacs but specifying \usepackage[latin1]{inputenc}, which should do the trick if Emacs is writing the file using something in the iso-8859-* family.
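A minimal sketch of that workaround, assuming the file on disk really is in a Latin-1-family encoding:

\documentclass{article}
\usepackage[latin1]{inputenc} % must match the file's actual on-disk encoding
\usepackage[norsk]{babel}
\begin{document}
æ ø å
\end{document}

If æ ø å then come out correctly, the file was indeed Latin-1; switch back to a UTF-8 inputenc option only once the editor actually saves UTF-8.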
I solved this error by setting the coding system used for saving the file:
C-x C-m f utf-8-unix
(C-x C-m f is the same as C-x RET f, which runs set-buffer-file-coding-system.)
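Alternatively, a file-local variable on the first line of the .tex file tells Emacs the intended coding system every time the file is visited (% starts a comment in LaTeX, so the line is harmless):

% -*- coding: utf-8 -*-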
I'm creating a grammar using JavaCC and have run across a small problem. I'm trying to allow any valid character within the extended ASCII set to be recognized by the resulting compiler. After looking at some of the JavaCC examples (primarily the one showing the JavaCC grammar itself), I set up the following token to recognize my characters:
< CHARACTER:
( (~["'"," ","\\","\n","\r"])
| ("\\"
( ["n","t","b","r","f","\\","'","\""]
| ["0"-"7"] ( ["0"-"7"] )?
| ["0"-"3"] ["0"-"7"] ["0"-"7"]
)
)
)
>
If I'm understanding this correctly, it should match the octal representations of all of the ASCII characters, from 0 to 377 (which covers all 256 characters in the extended ASCII set). It performs as expected for all keyboard characters (a-z, 0-9, ?, ., / etc.) and even for most special characters (©, ¬, ®).
However, whenever I attempt to parse the trademark symbol (™), my parser throws an End of File exception, indicating that it is unable to recognize the symbol. Is there some obvious way to enhance my definition of a character so that the trademark symbol is accepted?
I had a similar issue recognizing special symbols in a text file (either CP1252- or ISO-8859-1-encoded) which was read into a String before parsing. My solution was adding the UNICODE_INPUT option to the grammar header:
options {
UNICODE_INPUT=true;
}
Worked like a breeze.
More information on JavaCC options: http://javacc.java.net/doc/javaccgrm.html
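For completeness, a sketch of how the decoded text can reach the generated parser; MyParser is a placeholder for whatever class JavaCC generates from your grammar, and the file name and charset are assumptions matching the situation above:

import java.io.*;

public class ParseDemo {
    public static void main(String[] args) throws Exception {
        // Decode the file explicitly so the parser sees real characters,
        // not raw bytes.
        Reader in = new InputStreamReader(
                new FileInputStream("input.txt"), "windows-1252");
        MyParser parser = new MyParser(in); // generated parsers accept a Reader
        // parser.Start(); // invoke your grammar's start production here
    }
}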
It turns out that what I wanted my grammar to do was accept all valid Unicode characters, not ASCII characters; the ™ symbol is part of the Unicode specification and not part of any extended ASCII character set. Changing my token for a valid character as outlined below solved my problem (a valid Unicode escape here having the format U+00FF):
< CHARACTER:
  ( (~["'"," ","\\","\n","\r"])
  | ("\\"
      ( ["n","t","b","r","f","\\","'","\""]
      | ["u","U"] ["+"]
        ["0"-"9","a"-"f","A"-"F"] ["0"-"9","a"-"f","A"-"F"]
        ["0"-"9","a"-"f","A"-"F"] ["0"-"9","a"-"f","A"-"F"]
      )
    )
  )
>