LineNumberReader to support UTF-8 encoding

When I try to read text from a file using LineNumberReader, the text does not come through correctly.
The text I am trying to read from the file:
¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · ₹
Sample code:
FileInputStream fis = null;
try {
    fis = new FileInputStream("C:\\Users\\JavaUser4\\Desktop\\checkImort.txt");
    // Decode the file's bytes explicitly as UTF-8.
    InputStreamReader streamReader = new InputStreamReader(fis, "UTF-8");
    LineNumberReader reader = new LineNumberReader(streamReader);
    String sLine = reader.readLine();
    System.out.println(sLine);
} catch (Exception ex) {
    ex.printStackTrace(); // don't swallow exceptions silently
} finally {
    try {
        if (fis != null) { // guard against an NPE when the open itself failed
            fis.close();
        }
    } catch (IOException ex) {
    }
}
Output:
? ? ? ? ? ? $ ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Please help me read this text correctly using LineNumberReader. I would prefer to stay with LineNumberReader, because the RandomAccessFile I used before was a perfect fit for my requirement:
1. Open a file containing UTF-8 encoded text.
2. Set the line number from which to start reading the file.
3. Read 25 lines from the text file.
4. Get the last offset position.
5. Exit.
6. Open the file again.
7. Set the line number from which to start reading the next 25 lines of the same file.
8. Read 25 lines from the text file.
9. Get the last offset.
10. And so on.
The drawback was that RandomAccessFile does not support UTF-8 encoding, so I moved to LineNumberReader, but the same thing is happening here. Please help.
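A minimal sketch of that batch-reading flow with LineNumberReader over a UTF-8 stream (readBatch is an illustrative helper, not from the original post; since a Reader cannot seek by byte offset, "resuming" means re-skipping the lines already read):
static List<String> readBatch(String path, int startLine, int count) throws IOException {
    // Requires java.io.*, java.nio.charset.StandardCharsets, java.util.*.
    try (LineNumberReader reader = new LineNumberReader(
            new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
        List<String> batch = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (reader.getLineNumber() <= startLine) continue; // still skipping earlier batches
            batch.add(line);
            if (batch.size() == count) break; // e.g. count = 25
        }
        return batch;
    }
}
Call it with startLine = 0, then 25, then 50, and so on; unlike RandomAccessFile there is no byte offset to store.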

You're doing the reading correctly (assuming the file is actually in UTF-8 encoding).
The problem is with the output.
The output stream you're writing to is probably configured as ISO-8859-1 or one of its variants (I'd guess you're running this on Windows, as this is a common problem there).
Note that a "?" in the output is often caused by a character that can't be represented in the output encoding. So your String contains the correct characters (you should be able to check that in a debugger), but the output stream can't write them.
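If that is the case, one fix is to print through a PrintStream with an explicit charset instead of relying on the platform default (a minimal sketch; the PrintStream(OutputStream, boolean, String) constructor throws UnsupportedEncodingException):
PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8"); // autoflush, explicit UTF-8
utf8Out.println(sLine);
Even then, the Windows console itself may lack glyphs for some of these currency symbols; redirecting the output to a file and opening it in a UTF-8-aware editor is a more reliable check.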

Related

When using kendo.drawing.exportPDF, text pasted from Word shows wrong characters

All punctuation in the HTML that was brought over by pasting from MS Word shows as a little square instead of " or '. The characters show as "FS" and "GS" in Notepad++ and in the plain HTML.
I tried the "DejaVu Sans" font, but it did not help at all.
Any advice?
I ended up replacing all the offending characters:
// Escape regex metacharacters in str1, then replace every occurrence of it with str2.
String.prototype.replaceAll = function(str1, str2, ignore)
{
    return this.replace(new RegExp(str1.replace(/([\,\!\\\^\$\{\}\[\]\(\)\.\*\+\?\|\<\>\-\&])/g, function(c){return "\\" + c;}), "g"+(ignore?"i":"")), str2);
};
Then for each cell of text:
function cleanHTML(input) {
    var output = input.replaceAll('“','"');
    output = output.replaceAll('”', '"');
    output = output.replaceAll("’", "'");
    output = output.replaceAll("‘", "'");
    // The next three literals contain non-printing control characters
    // (the "FS"/"GS" characters mentioned above); they do not render here.
    output = output.replaceAll("", "'");
    output = output.replaceAll("", "-");
    output = output.replaceAll("", "'");
    return output;
}
Whatever other character shows up as a square later, I will add a replacement for it.
I hope someone will benefit from this.

Can't write IP to text file without formatting issues

I'm having trouble reading an IP from a text file and properly writing it to another text file. It shows the written IP in the file as: "ÿþ1 9 2 . 1 6 8 . 1 1 0 . 4"
#Read the first line for the IP
def get_server_ip
File.open("d:\\ip_addr.txt") do |line|
a = line.readline()
b = a.to_s
end
end
#append the ip to file2
def append_ip
FileUtils.cp('file1.txt', 'file2.txt')
file_names = ['file2.txt']
file_names.each do |file_name|
text = File.read(file_name)
b = get_server_ip
new_contents = text.gsub('ip_here', b)
File.open(file_name, "w") {|file| file.puts new_contents }
end
end
I've tried .strip and .delete(' ') with no luck. Can anyone see the issue?
Thank you
The file was generated with Notepad on Windows. It is encoded as UTF-16LE.
The first two bytes in the file have the codes 0xFF and 0xFE; this is the Byte Order Mark of UTF-16LE.
Each character is encoded on 2 bytes (16 bits), least significant byte first (little-endian order).
The spaces between the printable characters in the output are, in fact, NUL characters (characters with code 0).
What you can do (apart from converting the file to a more decent format like UTF-8 or even ISO-8859-1) is to pass 'rb:BOM|UTF-16LE' as the second argument of File#open.
r tells File#open to open the file in read-only mode (which it also does by default);
b means "binary mode"; it is required by BOM|UTF-16LE;
:BOM|UTF-16LE tells Ruby to read and ignore the BOM if it is present in the file, and to expect the rest of the file to be encoded as UTF-16LE.
If you can, I recommend converting the file encoding with a decent editor (even Notepad can be used) to UTF-8 or ISO-8859-1, and all these problems vanish.

Visual Studio - Input string was not in a correct format

I have part of my code here (a file parser program) that gives me the error: Input string was not in a correct format
For Each h1 As Char In PRIM_BIT.ToCharArray
rawbit = Convert.ToString(Convert.ToInt32(h1, 16), 2)
pribitval = pribitval & StrDup(4 - rawbit.Length, "0") & rawbit
Next
I tried to use int.TryParse, but it doesn't work. Is there a way to parse this?
Check the value of h1 when the error occurs.
h1 has to be a valid hexadecimal digit, such as 0-9, a-f, or A-F.
Also, h1 cannot be empty.
Edit:
If you want to bypass this and proceed, you can use a Try...Catch statement:
For Each h1 As Char In PRIM_BIT.ToCharArray
    Try
        rawbit = Convert.ToString(Convert.ToInt32(h1, 16), 2)
        pribitval = pribitval & StrDup(4 - rawbit.Length, "0") & rawbit
    Catch ex As Exception
        'Do something when an error occurs, or simply do nothing.
    End Try
Next
I still recommend checking what went wrong when the error occurs.

Write UTF-8 files from R

Whereas R seems to handle Unicode characters well internally, I'm not able to output a data frame in R with such UTF-8 Unicode characters. Is there any way to force this?
data.frame(c("hīersumian","ǣmettigan"))->test
write.table(test,"test.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
The output text file reads:
hiersumian <U+01E3>mettigan
I am using R version 3.0.2 in a Windows environment (Windows 7).
EDIT
It's been suggested in the answers that R is writing the file correctly in UTF-8, and that the problem lies with the software I'm using to view the file. Here's some code where I'm doing everything in R. I'm reading in a text file encoded in UTF-8, and R reads it correctly. Then R writes the file out in UTF-8 and reads it back in again, and now the correct Unicode characters are gone.
read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
myinputfile[1,1]
write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
myoutputfile[1,1]
Console output:
> read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
> myinputfile[1,1]
[1] hīersumian
Levels: hīersumian ǣmettigan
> write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
> read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
> myoutputfile[1,1]
[1] <U+FEFF>hiersumian
Levels: <U+01E3>mettigan <U+FEFF>hiersumian
>
This "answer" serves rather the purpose of clarifying that there is something odd going on behind the scenes:
"hīersumian" doesn't even make it into the data frame it seems. The "ī"-symbol is in all cases converted to "i".
options("encoding" = "native.enc")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-8")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-16")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
The following sequence successfully writes "ǣmettigan" to the text file:
t2 <- data.frame(a = c("ǣmettigan"), stringsAsFactors=F)
getOption("encoding")
# [1] "native.enc"
Encoding(t2[,"a"]) <- "UTF-16"
write.table(t2,"test.txt",row.names=F,col.names=F,quote=F)
It does not work with "encoding" set to "UTF-8" or "UTF-16", and specifying "fileEncoding" as well leads either to broken output or to no output at all.
Somewhat disappointing, as until now I had managed to fix all my Unicode issues one way or another.
I may be missing something OS-specific, but data.table appears to have no problem with this (or, perhaps more likely, it's an update to R internals since this question was originally posed):
t1 = data.table(a = c("hīersumian", "ǣmettigan"))
tmp = tempfile()
fwrite(t1, tmp)
system(paste('cat', tmp))
# a
# hīersumian
# ǣmettigan
fread(tmp)
# a
# 1: hīersumian
# 2: ǣmettigan
I found a blog post that basically says it's the Windows way of encoding text (there is a lot more detail in the post). The user should write the file in binary using:
writeBin(charToRaw(x), con, endian="little")
https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

Is there a way to recover a byte array that represents a String that was saved with a different encoding?

We have a database where we save byte arrays (HBase).
All our Strings are encoded as bytes, and we do the conversion manually.
However, some old data has been wrongfully saved, and I wonder if there's a way to recover them.
What happened is that we had some original text that was encoded, let's say, in ISO-8859-1,
BUT the process that saved these Strings as byte arrays did something similar to new String(original_bytes, UTF8).getBytes(UTF8)
(where original_bytes represents the String in ISO-8859-1).
I can't find a way to recover the original_bytes array. Is it actually possible?
I tried to reproduce it using this simple Java sample code:
String s = "é";
System.out.println("s: " + s);
System.out.println("s.getBytes: " + Arrays.toString(s.getBytes()));
System.out.println("s.getBytes(UTF8): " + Arrays.toString(s.getBytes(Charsets.UTF_8)));
System.out.println("new String(s.getBytes()): " + new String(s.getBytes()));
System.out.println("new String(s.getBytes(), UTF-8): " + new String(s.getBytes(), Charsets.UTF_8));
byte [] iso = s.getBytes(Charsets.ISO_8859_1);
System.out.println("iso " + Arrays.toString(iso));
System.out.println("new String(iso)" + new String(iso));
System.out.println("new String(iso, ISO)" + new String(iso, Charsets.ISO_8859_1));
System.out.println("new String(iso).getBytes()" + Arrays.toString(new String(iso).getBytes()));
System.out.println("new String(iso).getBytes(ISO)" + Arrays.toString(new String(iso).getBytes(Charsets.ISO_8859_1)));
System.out.println("new String(iso, UTF8).getBytes()" + Arrays.toString(new String(iso, Charsets.UTF_8).getBytes()));
System.out.println("new String(iso, UTF8).getBytes(UTF8)" + Arrays.toString(new String(iso, Charsets.UTF_8).getBytes(Charsets.UTF_8)));
Output (on a computer with a default charset of UTF-8):
s: é
s.getBytes: [-61, -87]
s.getBytes(UTF8): [-61, -87]
new String(s.getBytes()): é
new String(s.getBytes(), UTF-8): é
iso [-23]
new String(iso)�
new String(iso, ISO)é
new String(iso).getBytes()[-17, -65, -67]
new String(iso).getBytes(ISO)[63]
new String(iso, UTF8).getBytes()[-17, -65, -67]
new String(iso, UTF8).getBytes(UTF8)[-17, -65, -67]
new String(new String(iso).getBytes(), Charsets.ISO_8859_1) �
Unfortunately no, it's not possible in every case.
UTF-8 has quite a few byte sequences that are illegal and that will (usually) be replaced by some replacement character when decoded. If your original_bytes contained any of those sequences, that information is lost for sure.
Your best bet is to do the reverse, which will probably get you as close to the original String as possible:
byte[] originalISOData = ...; // the bytes you want back
// Simulate the corruption: decode the ISO bytes as UTF-8, then re-encode as UTF-8.
byte[] badUTF8 = new String(originalISOData, "UTF-8").getBytes("UTF-8");
// Reverse it: decode the stored bytes as UTF-8, then encode as ISO-8859-1.
// Anything that was mangled into the replacement character cannot be recovered.
byte[] partialReconstruction = new String(badUTF8, "UTF-8").getBytes("ISO-8859-1");
tl;dr decoding non-UTF-8 data as UTF-8 is not generally a lossless operation. A valid UTF-8 decoder will replace all malformed byte sequences with replacement characters (or even abort the decoding, depending on the decoder and its settings).
You can use the Bytes class provided by the HBase API. For example, to convert a byte array into a String you can use Bytes.toString(byteArray).
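A minimal round-trip sketch (assuming org.apache.hadoop.hbase.util.Bytes is on the classpath; its String conversions use UTF-8):
byte[] raw = Bytes.toBytes("é");       // String -> UTF-8 bytes, as stored in HBase
String back = Bytes.toString(raw);     // UTF-8 bytes -> String
System.out.println(back);              // prints "é" (given a UTF-8-capable console)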
