Write UTF-8 files from R - windows

Whereas R seems to handle Unicode characters well internally, I'm not able to output a data frame in R with such UTF-8 Unicode characters. Is there any way to force this?
data.frame(c("hīersumian","ǣmettigan"))->test
write.table(test,"test.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
The output text file reads:
hiersumian <U+01E3>mettigan
I am using R version 3.0.2 in a Windows environment (Windows 7).
EDIT
It's been suggested in the answers that R is writing the file correctly in UTF-8, and that the problem lies with the software I'm using to view the file. Here's some code where I'm doing everything in R. I'm reading in a text file encoded in UTF-8, and R reads it correctly. Then R writes the file out in UTF-8 and reads it back in again, and now the correct Unicode characters are gone.
read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
myinputfile[1,1]
write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
myoutputfile[1,1]
Console output:
> read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
> myinputfile[1,1]
[1] hīersumian
Levels: hīersumian ǣmettigan
> write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
> read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
> myoutputfile[1,1]
[1] <U+FEFF>hiersumian
Levels: <U+01E3>mettigan <U+FEFF>hiersumian
>

This "answer" serves rather the purpose of clarifying that there is something odd going on behind the scenes:
"hīersumian" doesn't even make it into the data frame it seems. The "ī"-symbol is in all cases converted to "i".
options("encoding" = "native.enc")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-8")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-16")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
The following sequence successfully writes "ǣmettigan" to the text file:
t2 <- data.frame(a = c("ǣmettigan"), stringsAsFactors=F)
getOption("encoding")
# [1] "native.enc"
Encoding(t2[,"a"]) <- "UTF-16"
write.table(t2,"test.txt",row.names=F,col.names=F,quote=F)
It is not going to work with "encoding" as "UTF-8" or "UTF-16" and also specifying "fileEncoding" will either lead to a defect or no output.
Somewhat disappointing as so far I managed to get all Unicode issues fixed somehow.

I may be missing something OS-specific, but data.table appears to have no problem with this (or perhaps more likely it's an update to R internals since this question was originally posed):
t1 = data.table(a = c("hīersumian", "ǣmettigan"))
tmp = tempfile()
fwrite(t1, tmp)
system(paste('cat', tmp))
# a
# hīersumian
# ǣmettigan
fread(tmp)
# a
# 1: hīersumian
# 2: ǣmettigan

I found a blog post that basically says its windows way of encoding text. Lots more detail in post. User should write the file in binary using
writeBin(charToRaw(x), con, endian="little")
https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/

Related

Ruby Zlib compression gives different outputs for the same input

I have this ruby method for compressing a string -
def compress_data(data)
output = StringIO.new
gz = Zlib::GzipWriter.new(output)
gz.write(data)
gz.close
compressed_data = output.string
compressed_data
end
When I call this method with the same input, I get different outputs at different times. I am trying to get the byte array for the compressed outputs and compare them.
The output is Different when I run the below -
input = "hello world"
output1 = (compress_data input).bytes.to_a
sleep 1
output2 = (compress_data input).bytes.to_a
if output1 == output2
puts 'Same'
else
puts 'Different'
end
The output is Same when I remove the sleep. Does the compression algorithm have something to do with the current time?
Option 1 - fixed mtime:
Yes. The compression time is stored in the header. You can use the mtime method to set the time to a fixed value, which will resolve your problem:
gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close
Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and it appears this functionality is missing. Seems like a bug. So you have to set it to something else than 0 (but see comments below - it will be fixed in future releases).
Option 2 - skip the header:
Another option is to just skip the header when checking for similar data. The header is 10 bytes long, so to only check the data:
data = compress_data(input).bytes[10..-1]
Note that you do not need to call to_a on bytes. It is already an Array:
String.bytes -> an_array
Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.

How to decoding IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")'
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/
def decode_ifc(str)
str.gsub(ESCAPE_SEQUENCE_EXPR) do
$1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
If someone is interested, I wrote here a Python Code that decode 3 of the IFC encodings : \X, \X2\ and \S\
import re
def decodeIfc(txt):
# In regex "\" is hard to manage in Python... I use this workaround
txt = txt.replace('\\', 'µµµ')
txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
txt = txt.replace('µµµ','\\')
return txt
def decodeIfcX2(match):
# X2 encodes characters with multiple of 4 hexadecimal numbers.
return ''.join(list(map(lambda x : chr(int(x,16)), re.findall('([0-9A-F]{4})',match.group(1)))))
def decodeIfcS(match):
return chr(ord(match.group(1))+128)
def decodeIfcX(match):
# Sometimes, IFC files were made with old Mac... wich use MacRoman encoding.
num = int(match.group(1), 16)
if (num <= 127) | (num >= 160):
return chr(num)
else:
return bytes.fromhex(match.group(1)).decode("macroman")

python regex specific blocks of text from large text file

I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.

r: dprint: size of image of table alteration

I am using the dprint package with knitr , mainly so that I can highlight rows from a table, which I have got working, but the output image leaves a fairly large space for a footnote, and it is taking up unnecessary space.
Is there away to get rid of it?
Also since I am fairly new to dprint, if anybody has better ideas/suggestions as to how to highlight tables and make them look pretty without any footnotes... or ways to tidy up my code that would be great!
An example of the Rmd file code is below...
```{r fig.height=10, fig.width=10, dev='jpeg'}
library("dprint")
k <- data.frame(matrix(1:100, 10,10))
CBs <- style(frmt.bdy=frmt(fontfamily="HersheySans"), frmt.tbl=frmt(bty="o", lwd=1),
frmt.col=frmt(fontfamily="HersheySans", bg="khaki", fontface="bold", lwd=2, bty="_"),
frmt.grp=frmt(fontfamily="HersheySans",bg="khaki", fontface="bold"),
frmt.main=frmt(fontfamily="HersheySans", fontface="bold", fontsize=12),
frmt.ftn=frmt(fontfamily="HersheySans"),
justify="right", tbl.buf=0)
x <- dprint(~., data=k,footnote=NA, pg.dim=c(10,10), margins=c(0.2,0.2,0.2,0.2),
style=CBs, row.hl=row.hl(which(k[,1]==5), col='red'),
fit.width=TRUE, fit.height=TRUE,
showmargins=TRUE, newpage=TRUE, main="TABLE TITLE")
```
Thanks in advance!
I haven't used dprint before, but I see a couple of different things that might be causing problems:
The start of your code chunk has defined the image width and height, which dprint seems to be trying to use.
You are setting both fit.height and fit.width. I think only one of those is used (in other words, the resulting image isn't stretched to fit both height and width, but only the one that seems to make most sense, in this case, width).
After tinkering around for a minute, here's what I did that minimizes the footnote. However, I don't know if there is a more efficient way to do this.
```{r dev='jpeg'}
library("dprint")
k <- data.frame(matrix(1:100, 10,10))
CBs <- style(frmt.bdy=frmt(fontfamily="HersheySans"),
frmt.tbl=frmt(bty="o", lwd=1),
frmt.col=frmt(fontfamily="HersheySans", bg="khaki",
fontface="bold", lwd=2, bty="_"),
frmt.grp=frmt(fontfamily="HersheySans",bg="khaki",
fontface="bold"),
frmt.main=frmt(fontfamily="HersheySans", fontface="bold",
fontsize=12),
frmt.ftn=frmt(fontfamily="HersheySans"),
justify="right", tbl.buf=0)
x <- dprint(~., data=k, style=CBs, pg.dim = c(7, 4.5),
showmargins=TRUE, newpage=TRUE,
main="TABLE TITLE", fit.width=TRUE)
```
Update
Playing around to determine the sizes of the images is a total drag. But, if you run the code in R and look at the structure of x, you'll find the following:
str(x)
# List of 3
# $ cord1 : num [1:2] 0.2 6.8
# $ cord2 : Named num [1:2] 3.42 4.78
# ..- attr(*, "names")= chr [1:2] "" ""
# $ pagenum: num 2
Or, simply:
x$cord2
# 3.420247 4.782485
These are the dimensions of your resulting image, and this information can probably easily be plugged into a function to make your plots better.
Good luck!
So here's my solution...with some examples...
I've just copied and pasted my Rmd file to demonstrate how to use it.
you should be able to just copy and paste it into a blank Rmd file and then knit to HTML to see the results...
Ideally what I would have liked would have been to make it all one nice neat function rather than splitting it up into two (i.e. setup.table & print.table) but since chunk options can't be changed mid chunk as suggested by Yihui, it had to be split up into two functions...
`dprint` + `knitr` Examples to create table images
===========
```{r}
library(dprint)
# creating the sytle object to be used
CBs <- style(frmt.bdy=frmt(fontfamily="HersheySans"),
frmt.tbl=frmt(bty="o", lwd=1),
frmt.col=frmt(fontfamily="HersheySans", bg="khaki",
fontface="bold", lwd=2, bty="_"),
frmt.grp=frmt(fontfamily="HersheySans",bg="khaki",
fontface="bold"),
frmt.main=frmt(fontfamily="HersheySans", fontface="bold",
fontsize=12),
frmt.ftn=frmt(fontfamily="HersheySans"),
justify="right", tbl.buf=0)
# creating a setup function to setup printing a table (will probably put this function into my .Rprofile file)
setup.table <- function(df,width=10, style.obj='CBs'){
require(dprint)
table.style <- get(style.obj)
a <- tbl.struct(~., df)
b <- char.dim(a, style=table.style)
p <- pagelayout(dtype = "rgraphics", pg.dim = NULL, margins = NULL)
f <- size.simp(a[[1]], char.dim.obj=b, loc.y=0, pagelayout=p)
# now to work out the natural table width to height ratio (w.2.h.r) GIVEN the style
w.2.h.r <- as.numeric(f$tbl.width/(f$tbl.height +b$linespace.col+ b$linespace.main))
height <- width/w.2.h.r
table.width <- width
table.height <- height
# Setting chunk options to have right fig dimensions for the next chunk
opts_chunk$set('fig.width'=as.numeric(width+0.1))
opts_chunk$set('fig.height'=as.numeric(height+0.1))
# assigning relevant variables to be used when printing
assign("table.width",table.width, envir=.GlobalEnv)
assign("table.height",table.height, envir=.GlobalEnv)
assign("table.style", table.style, envir=.GlobalEnv)
}
# function to print the table (will probably put this function into my .Rprofile file as well)
print.table <- function(df, row.2.hl='2012-04-30', colour='lightblue',...) {
x <-dprint(~., data=df, style=table.style, pg.dim=c(table.width,table.height), ..., newpage=TRUE,fit.width=TRUE, row.hl=row.hl(which(df[,1]==row.2.hl), col=colour))
}
```
```{r}
# Giving it a go!
# Setting up two differnt size tables
small.df <- data.frame(matrix(1:100, 10,10))
big.df <- data.frame(matrix(1:800,40,20))
```
```{r}
# Using the created setup.table function
setup.table(df=small.df, width=10, style.obj='CBs')
```
```{r}
# Using the print.table function
print.table(small.df,4,'lightblue',main='table title string') # highlighting row 4
```
```{r}
setup.table(big.df,13,'CBs') # now setting up a large table
```
```{r}
print.table(big.df,38,'orange', main='the big table!') # highlighting row 38 in orange
```
```{r}
d <- style() # the default style this time will be used
setup.table(big.df,15,'d')
```
```{r}
print.table(big.df, 23, 'indianred1') # this time higlihting row 23
```

Encoding issue with Sqlite3 in Ruby

I have a list of sql queries beautifully encoded in utf-8. I read them from files, perform the inserts and than do a select.
# encoding: utf-8
def exec_sql_lines(file_name)
puts "----> #{file_name} <----"
File.open(file_name, 'r') do |f|
# sometimes a query doesn't fit one line
previous_line=""
i = 0
while line = f.gets do
puts i+=1
if(line[-2] != ')')
previous_line += line[0..-2]
next
end
puts (previous_line + line) # <---- (1)
$db.execute((previous_line + line))
previous_line =""
end
a = $db.execute("select * from Table where _id=6")
puts a <---- (2)
end
end
$db=SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text in (1) is different than the one in (2). Polish letters get broken. If I use IRB and call $db.open ; $db.encoding is says UTF-8.
Why do Polish letters come out broken? How to fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in manipulating with database output. I need to fix it's content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
There are three default encoding.
In you code you set the source encoding.
Perhaps there is a problem with External and Internal Encoding?
A quick test in windows:
#encoding: utf-8
File.open(__FILE__,'r'){|f|
p f.external_encoding
p f.internal_encoding
p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even if UTF-8 is used, the data are read as cp850.
In your case:
Does File.open(filename,'r:utf-8') help?

Resources