Reading UTF-8 text files with ReadList

Is it possible to use ReadList to read UTF-8 (or any other) encoded text files using ReadList[..., Word], or is it ASCII-only? If it's ASCII-only, is it possible to "fix" the encoding of the already read data with good performance (i.e. preserving the performance advantages of ReadList over Import)?
Import[..., CharacterEncoding -> "UTF8"] works, but it's quite a bit slower than ReadList. $CharacterEncoding has no effect on ReadList.
Download a sample UTF-8 encoded file here.
For testing performance on a large input, see the test file in this question.
Here are the timings of the answers on a large-ish text file:
Import
In[2]:= Timing[
data = Import[file, "Text"];
]
Out[2]= {5.234, Null}
Heike
In[4]:= Timing[
data = ReadList[file, String];
FromCharacterCode[ToCharacterCode[data], "UTF8"];
]
Out[4]= {4.328, Null}
Mr. Wizard
In[5]:= Timing[
string = FromCharacterCode[BinaryReadList[file], "UTF-8"];
]
Out[5]= {2.281, Null}

This seems to work
FromCharacterCode[ToCharacterCode[ReadList["raw.php.txt", Word]], "UTF-8"]
The timings I get for the linked test file are
FromCharacterCode[ToCharacterCode[ReadList["test.txt", Word]], "UTF-8"]; // Timing
(* ==> {0.000195, Null} *)
Import["test.txt", "Text"]; // Timing
(* ==> {0.01784, Null} *)
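The round trip above presumably works because ReadList returns one character per raw byte, so turning the characters back into codes recovers the original byte values, which can then be decoded as UTF-8. A minimal sketch of the same idea, written in Python rather than Wolfram Language (the function name redecode_utf8 and the sample text are made up for illustration):

```python
def redecode_utf8(misread: str) -> str:
    """Reinterpret a string whose characters are really raw UTF-8 bytes."""
    # latin-1 maps code point n to byte n for n in 0..255, so this recovers
    # the byte values that were read as one character each.
    raw = misread.encode("latin-1")
    # Decode those bytes with the encoding the file was actually written in.
    return raw.decode("utf-8")


# Simulate a byte-per-character read of UTF-8 encoded text.
original = "hīersumian ǣmettigan"
misread = original.encode("utf-8").decode("latin-1")

fixed = redecode_utf8(misread)
print(fixed == original)
```

The sketch only applies when every character of the misread string has a code below 256; anything else means the data was not read byte by byte.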

If I leave out Word, this works:
$CharacterEncoding = "UTF-8";
ReadList["UTF8.txt"]
This however is a failure, because the data is not read as strings.
Please try this on a larger file and report its performance:
FromCharacterCode[BinaryReadList["UTF8.txt"], "UTF-8"]

Related

With ruamel.yaml how can I conditionally convert flow maps to block maps based on line length?

I'm working on a ruamel.yaml (v0.17.4) based YAML reformatter (using the RoundTrip variant to preserve comments).
I want to allow a mix of block- and flow-style maps, but in some cases, I want to convert a flow-style map to use block-style.
In particular, if the flow-style map would be longer than the max line length^, I want to convert that to a block-style map instead of wrapping the line somewhere in the middle of the flow-style map.
^ By "max line length" I mean the best_width that I configure by setting something like yaml.width = 120 where yaml is a ruamel.yaml.YAML instance.
What should I extend to achieve this? The emitter is where the line-length gets calculated so wrapping can occur, but I suspect that is too late to convert between block- and flow-style. I'm also concerned about losing comments when I switch the styles. Here are some possible extension points, can you give me a pointer on where I'm most likely to have success with this?
Emitter.expect_flow_mapping() probably too late for converting flow->block
Serializer.serialize_node() probably too late as it consults node.flow_style
RoundTripRepresenter.represent_mapping() maybe? but this has no idea about line length
I could also walk the data before calling yaml.dump(), but this has no idea about line length.
So, where should I (and where can I) adjust the flow_style based on whether a flow-style map would trigger line wrapping?

I think the most accurate approach is this: when you encounter a flow-style mapping during the dumping process, first emit it to a buffer, then take the length of that buffer, and if that length combined with the column you are in exceeds the width, emit block style instead.
Any attempt to guesstimate the length of the output without actually writing that part of the tree is going to be hard, if not impossible. Among other things, the dumping process actually dumps scalars and reads them back to make sure no quoting needs to be forced (e.g. when you dump a string that reads back like a date). It also handles single key-value pairs in a list in a special way ([1, a: 42, 3] instead of the more verbose [1, {a: 42}, 3]). So a simple calculation based on the lengths of the scalars that are the keys and values, plus the separating commas, colons and spaces, is not going to be precise.
A different approach is to dump your data with a large line width and parse the output and make a set of line numbers for which the line is too long according to the width that you actually want to use. After loading that output back you can walk over the data structure recursively, inspect the .lc attribute to determine the line number on which a flow style mapping (or sequence) started and if that line number is in the set you built beforehand change the mapping to block style. If you have nested flow-style collections, you might have to repeat this process.
If you run the following, the initial dumped value for quote will be on one line.
The change_to_block method as presented changes all flow-style mappings/sequences that start
on a line that is too long.
import sys
import ruamel.yaml

yaml_str = """\
movie: bladerunner
quote: {[Batty, Roy]: [
         I have seen things you people wouldn't believe.,
         Attack ships on fire off the shoulder of Orion.,
         I watched C-beams glitter in the dark near the Tannhäuser Gate.,
       ]}
"""


class Blockify:
    def __init__(self, width, only_first=False, verbose=0):
        self._width = width
        self._yaml = None
        self._only_first = only_first
        self._verbose = verbose

    @property
    def yaml(self):
        if self._yaml is None:
            self._yaml = y = ruamel.yaml.YAML(typ=['rt', 'string'])
            y.preserve_quotes = True
            y.width = 2**16
        return self._yaml

    def __call__(self, d):
        pass_nr = 0
        changed = [True]
        while changed[0]:
            changed[0] = False
            try:
                s = self.yaml.dumps(d)
            except AttributeError:
                print("use 'pip install ruamel.yaml.string' to install plugin that gives 'dumps' to string")
                sys.exit(1)
            if self._verbose > 1:
                print(s)
            too_long = set()
            max_ll = -1
            for line_nr, line in enumerate(s.splitlines()):
                if len(line) > self._width:
                    too_long.add(line_nr)
                if len(line) > max_ll:
                    max_ll = len(line)
            if self._verbose > 0:
                print(f'pass: {pass_nr}, lines: {sorted(too_long)}, longest: {max_ll}')
                sys.stdout.flush()
            new_d = self.yaml.load(s)
            self.change_to_block(new_d, too_long, changed, only_first=self._only_first)
            d = new_d
            pass_nr += 1
        return d, s

    @staticmethod
    def change_to_block(d, too_long, changed, only_first):
        if isinstance(d, dict):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            # don't change keys if any value is changed
            for v in d.values():
                Blockify.change_to_block(v, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
            if changed[0]:  # don't change keys if value has changed
                return
            for k in d:
                Blockify.change_to_block(k, too_long, changed, only_first)
                if only_first and changed[0]:
                    return
        if isinstance(d, (list, tuple)):
            if d.fa.flow_style() and d.lc.line in too_long:
                d.fa.set_block_style()
                changed[0] = True
                return  # don't convert nested flow styles, might not be necessary
            for elem in d:
                Blockify.change_to_block(elem, too_long, changed, only_first)
                if only_first and changed[0]:
                    return


blockify = Blockify(96, verbose=2)  # set verbose to 0, to suppress progress output
yaml = ruamel.yaml.YAML(typ=['rt', 'string'])
data = yaml.load(yaml_str)
blockified_data, string_output = blockify(data)
print('-'*32, 'result:', '-'*32)
print(string_output)  # string_output has no final newline
which gives:
movie: bladerunner
quote: {[Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]}
pass: 0, lines: [1], longest: 186
movie: bladerunner
quote:
  [Batty, Roy]: [I have seen things you people wouldn't believe., Attack ships on fire off the shoulder of Orion., I watched C-beams glitter in the dark near the Tannhäuser Gate.]
pass: 1, lines: [2], longest: 179
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
pass: 2, lines: [], longest: 67
-------------------------------- result: --------------------------------
movie: bladerunner
quote:
  [Batty, Roy]:
  - I have seen things you people wouldn't believe.
  - Attack ships on fire off the shoulder of Orion.
  - I watched C-beams glitter in the dark near the Tannhäuser Gate.
Please note that with ruamel.yaml<0.18 the sequence [Batty, Roy] will never be put in block style,
because the tuple subclass CommentedKeySeq never gets a line number attached.

Write UTF-8 files from R

Whereas R seems to handle Unicode characters well internally, I'm not able to output a data frame in R with such UTF-8 Unicode characters. Is there any way to force this?
data.frame(c("hīersumian","ǣmettigan"))->test
write.table(test,"test.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
The output text file reads:
hiersumian <U+01E3>mettigan
I am using R version 3.0.2 in a Windows environment (Windows 7).
EDIT
It's been suggested in the answers that R is writing the file correctly in UTF-8, and that the problem lies with the software I'm using to view the file. Here's some code where I'm doing everything in R. I'm reading in a text file encoded in UTF-8, and R reads it correctly. Then R writes the file out in UTF-8 and reads it back in again, and now the correct Unicode characters are gone.
read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
myinputfile[1,1]
write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
myoutputfile[1,1]
Console output:
> read.table("myinputfile.txt",encoding="UTF-8")->myinputfile
> myinputfile[1,1]
[1] hīersumian
Levels: hīersumian ǣmettigan
> write.table(myinputfile,"myoutputfile.txt",row.names=F,col.names=F,quote=F,fileEncoding="UTF-8")
> read.table("myoutputfile.txt",encoding="UTF-8")->myoutputfile
> myoutputfile[1,1]
[1] <U+FEFF>hiersumian
Levels: <U+01E3>mettigan <U+FEFF>hiersumian
>

This "answer" mainly serves to clarify that something odd is going on behind the scenes:
"hīersumian" doesn't even make it into the data frame, it seems. The "ī" symbol is converted to "i" in all cases.
options("encoding" = "native.enc")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-8")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
options("encoding" = "UTF-16")
t1 <- data.frame(a = c("hīersumian "), stringsAsFactors=F)
t1
# a
# 1 hiersumian
The following sequence successfully writes "ǣmettigan" to the text file:
t2 <- data.frame(a = c("ǣmettigan"), stringsAsFactors=F)
getOption("encoding")
# [1] "native.enc"
Encoding(t2[,"a"]) <- "UTF-16"
write.table(t2,"test.txt",row.names=F,col.names=F,quote=F)
It does not work with "encoding" set to "UTF-8" or "UTF-16", and additionally specifying "fileEncoding" leads either to a defect or to no output.
Somewhat disappointing, as so far I had managed to get all Unicode issues fixed somehow.

I may be missing something OS-specific, but data.table appears to have no problem with this (or perhaps more likely it's an update to R internals since this question was originally posed):
t1 = data.table(a = c("hīersumian", "ǣmettigan"))
tmp = tempfile()
fwrite(t1, tmp)
system(paste('cat', tmp))
# a
# hīersumian
# ǣmettigan
fread(tmp)
# a
# 1: hīersumian
# 2: ǣmettigan

I found a blog post that basically says this is caused by the way Windows encodes text; the post has much more detail. It suggests writing the file in binary mode using
writeBin(charToRaw(x), con, endian="little")
https://tomizonor.wordpress.com/2013/04/17/file-utf8-windows/
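The idea behind the binary-write suggestion is to encode the text to UTF-8 bytes yourself and write those bytes unchanged, so that no platform-dependent text layer can re-encode them. A minimal sketch of that idea, in Python rather than R (the helper names and the temporary file are made up for illustration):

```python
import os
import tempfile


def write_utf8_bytes(path: str, text: str) -> None:
    """Encode text as UTF-8 and write the bytes in binary mode."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def read_utf8_bytes(path: str) -> str:
    """Read the raw bytes back and decode them as UTF-8."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


# Round-trip the two words from the question through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
write_utf8_bytes(path, "hīersumian\nǣmettigan\n")
round_trip = read_utf8_bytes(path)
print(round_trip)
```

Because the file is opened in binary mode on both sides, the bytes on disk are exactly the UTF-8 encoding of the string, with no byte order mark and no newline translation.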

ruby serialport gem, who is responsible to check parity errors?

gems
serialport (1.0.4)
Authors: Guillaume Pierronnet, Alan Stern, Daniel E. Shipton, Tobin
Richard, Hector Parra, Ryan C. Payne
Homepage: http://github.com/hparra/ruby-serialport/
Library for using RS-232 serial ports.
I am using this gem, and my device's specifications are as follows.
9600bps
7bits
1 stop bit
EVEN parity
When I receive data as shown below, the unpacked data still contains the parity bit.
sp = SerialPort.new("/dev/serial-device", 9600, 7, 1, SerialPort::EVEN)
data = sp.gets
data.chars.each do |char|
puts char.unpack("B*")
end
E.g., if sp receives a, the unpacked data is 11100001 instead of 01100001, because of the EVEN parity bit.
To convert the byte back to what it should be, I do this:
data = sp.gets #gets 11100001 for 'a' (even parity)
data.bytes.to_a.each do |byte|
puts (byte & 127).chr
end
Now, to me, this is far too low-level. I was expecting the serialport gem to do this parity check, but as far as I can tell from its documentation, it doesn't say anything about parity checking.
Am I missing a method that is already implemented in the gem, or is my workaround above necessary because it's my responsibility to check the parity and detect errors?
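For reference, masking with 127 only discards the parity bit; it does not verify it. A minimal sketch of a manual even-parity check, in Python rather than Ruby, assuming (as in the example above) that the parity bit arrives in the most significant bit of each received byte. The function name strip_even_parity is made up for illustration and is not part of the serialport gem:

```python
def strip_even_parity(byte: int) -> tuple[int, bool]:
    """Split a received byte into its 7 data bits and a parity verdict.

    Assumes 7 data bits in bits 0..6 and an even-parity bit in bit 7.
    """
    byte &= 0xFF
    # With even parity, the total number of 1 bits (data plus parity) is even.
    parity_ok = bin(byte).count("1") % 2 == 0
    return byte & 0x7F, parity_ok


# 'a' is 1100001 (three 1 bits), so the even-parity bit is 1: 11100001.
data, ok = strip_even_parity(0b11100001)
print(chr(data), ok)

# Flipping one bit in transit makes the count odd and the check fail.
_, ok_corrupted = strip_even_parity(0b11100011)
print(ok_corrupted)
```

Whether such a check is needed at the application level depends on whether the driver or hardware already drops or flags bytes with bad parity, which the text above does not settle.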

SerialPort::ODD, SerialPort::MARK, SerialPort::SPACE
(MARK and SPACE are not supported on Posix)
Raise an argError on bad argument.
SerialPort::new and SerialPort::open without a block return an
instance of SerialPort. SerialPort::open with a block passes
a SerialPort to the block and closes it when the block exits
(like File::open).
** Instance methods **
modem_params() -> aHash
modem_params=(aHash) -> aHash
get_modem_params() -> aHash
set_modem_params(aHash) -> aHash
set_modem_params(baudrate [, databits [, stopbits [, parity]]])
Get and set the modem parameters. Hash keys are "baud", "data_bits",
"stop_bits", and "parity" (see above).
Parameters not present in the hash or set to nil remain unchanged.
Default parameter values for the set_modem_params method are:
databits = 8, stopbits = 1, parity = (databits == 8 ?
SerialPort::NONE : SerialPort::EVEN).
baud() -> anInteger
baud=(anInteger) -> anInteger
data_bits() -> 4, 5, 6, 7, or 8
data_bits=(anInteger) -> anInteger
stop_bits() -> 1 or 2
stop_bits=(anInteger) -> anInteger
parity() -> anInteger: SerialPort::NONE, SerialPort::EVEN,
SerialPort::ODD, SerialPort::MARK, or SerialPort::SPACE
parity=(anInteger) -> anInteger
Get and set the corresponding modem parameter.
flow_control() -> anInteger
flow_control=(anInteger) -> anInteger
Get and set the flow control: SerialPort::NONE, SerialPort::HARD,
SerialPort::SOFT, or (SerialPort::HARD | SerialPort::SOFT).
Note: SerialPort::HARD mode is not supported on all platforms.
SerialPort::HARD uses RTS/CTS handshaking; DSR/DTR is not
supported.
read_timeout() -> anInteger
read_timeout=(anInteger) -> anInteger
write_timeout() -> anInteger
write_timeout=(anInteger) -> anInteger
Get and set timeout values (in milliseconds) for reading and writing.
A negative read timeout will return all the available data without
waiting, a zero read timeout will not return until at least one
byte is available, and a positive read timeout returns when the
requested number of bytes is available or the interval between the
arrival of two bytes exceeds the timeout value.
Note: Read timeouts don't mix well with multi-threading.
Note: Under Posix, write timeouts are not implemented.
break(time) -> nil
Send a break for the given time.
time -> anInteger: tenths-of-a-second for the break.
Note: Under Posix, this value is very approximate.
signals() -> aHash
Return a hash with the state of each line status bit. Keys are
"rts", "dtr", "cts", "dsr", "dcd", and "ri".
Note: Under Windows, the rts and dtr values are not included.
rts()
dtr()
cts()
dsr()
dcd()
ri() -> 0 or 1
rts=(0 or 1)
dtr=(0 or 1) -> 0 or 1
Get and set the corresponding line status bit.
Note: Under Windows, rts() and dtr() are not implemented

PlotLegends` default options

I'm trying to redefine an option of the PlotLegends package after having loaded it,
but I get for example
Needs["PlotLegends`"]
SetOptions[ListPlot,LegendPosition->{0,0.5}]
=> SetOptions::optnf: LegendPosition is not a known option for ListPlot.
I expect such a thing as the options in the PlotLegends package aren't built-in to Plot and ListPlot.
Is there a way to redefine the default options of the PlotLegends package?

The problem is not really in the defaults for PlotLegends`. To see it, you should inspect the ListPlot implementation:
In[28]:= Needs["PlotLegends`"]
In[50]:= DownValues[ListPlot]
Out[50]=
{HoldPattern[ListPlot[PlotLegends`Private`a:PatternSequence[___,
Except[_?OptionQ]]|PatternSequence[],PlotLegends`Private`opts__?OptionQ]]:>
PlotLegends`Private`legendListPlot[ListPlot,PlotLegends`Private`a,
PlotLegend/.Flatten[{PlotLegends`Private`opts}],PlotLegends`Private`opts]
/;!FreeQ[Flatten[{PlotLegends`Private`opts}],PlotLegend]}
What you see from here is that options must be passed explicitly for it to work, and moreover, PlotLegend option must be present.
One way to achieve what you want is to use my option configuration manager, which imitates global options by passing local ones. Here is a version where option-filtering is made optional:
ClearAll[setOptionConfiguration, getOptionConfiguration, withOptionConfiguration];
SetAttributes[withOptionConfiguration, HoldFirst];
Module[{optionConfiguration}, optionConfiguration[_][_] = {};
setOptionConfiguration[f_, tag_, {opts___?OptionQ}, filterQ : (True | False) : True] :=
optionConfiguration[f][tag] =
If[filterQ, FilterRules[{opts}, Options[f]], {opts}];
getOptionConfiguration[f_, tag_] := optionConfiguration[f][tag];
withOptionConfiguration[f_[args___], tag_] :=
f[args, Sequence @@ optionConfiguration[f][tag]];
];
To use this, first define your configuration and a short-cut macro, as follows:
setOptionConfiguration[ListPlot,"myConfig", {LegendPosition -> {0.8, -0.8}}, False];
withMyConfig = Function[code, withOptionConfiguration[code, "myConfig"], HoldAll];
Now, here you go:
withMyConfig[
ListPlot[{#, Sin[#]} & /@ Range[0, 2 Pi, 0.1], PlotLegend -> {"sine"}]
]

LegendPosition works in ListPlot without problems (for me at least). You haven't forgotten to load the package using Needs["PlotLegends`"], have you?

@Leonid, I added to setOptionConfiguration the possibility of setting default options for f without having to use a short-cut macro.
I use the trick exposed by Alexey Popkov in What is in your Mathematica tool bag?
Example:
Needs["PlotLegends`"];
setOptionConfiguration[ListPlot, "myConfig", {LegendPosition -> {0.8, -0.8}},SetAsDefault -> True]
ListPlot[{#, Sin[#]} & /@ Range[0, 2 Pi, 0.1], PlotLegend -> {"sine"}]
Here is the implementation
Options[setOptionConfiguration] = {FilterQ -> False, SetAsDefault -> False};
setOptionConfiguration[f_, tag_, {opts___?OptionQ}, OptionsPattern[]] :=
Module[{protectedFunction},
optionConfiguration[f][tag] =
If[OptionValue@FilterQ, FilterRules[{opts},
Options[f]]
,
{opts}
];
If[OptionValue@SetAsDefault,
If[(protectedFunction = MemberQ[Attributes[f], Protected]),
Unprotect[f];
];
DownValues[f] =
Union[
{
(*I want this new rule to be the first in the DownValues of f*)
HoldPattern[f[args___]] :>
Block[{$inF = True},
withOptionConfiguration[f[args], tag]
] /; ! TrueQ[$inF]
}
,
DownValues[f]
];
If[protectedFunction,
Protect[f];
];
];
];

Nice formatting of numbers inside Messages

When a string is displayed via StyleBox, numbers inside it are nicely formatted by default:
StyleBox["some text 1000000"] // DisplayForm
I mean that the numbers look as if they had small additional spaces: "1 000 000".
But in Messages all numbers are displayed without formatting:
f::NoMoreMemory =
"There are less than `1` bytes of free physical memory (`2` bytes \
is free). $Failed is returned.";
Message[f::NoMoreMemory, 1000000, 98000000]
Is there a way to get numbers inside Messages to be formatted?

I'd use Style to apply the AutoNumberFormatting option:
You can use it to target specific messages:
f::NoMoreMemory =
"There are less than `1` bytes of free physical memory (`2` bytes is free). $Failed is returned.";
Message[f::NoMoreMemory,
Style[1000000, AutoNumberFormatting -> True],
Style[98000000, AutoNumberFormatting -> True]]
or you can use it with $MessagePrePrint to apply it to all the messages:
$MessagePrePrint = Style[#, AutoNumberFormatting -> True] &;
Message[f::NoMoreMemory, 1000000, 98000000]

I think you want $MessagePrePrint
$MessagePrePrint =
NumberForm[#, DigitBlock -> 3, NumberSeparator -> " "] &;
Or, incorporating Sjoerd's suggestion:
With[
{opts =
AbsoluteOptions[EvaluationNotebook[],
{DigitBlock, NumberSeparator}]},
$MessagePrePrint = NumberForm[#, Sequence @@ opts] &];
Adapting Brett Champion's method, I believe this allows for copy & paste as you requested:
$MessagePrePrint = StyleForm[#, AutoNumberFormatting -> True] &;
