Error in nchar() when reading in a Stata file in R on Mac

I'm learning R and am simply trying to read in a Stata data file, but I'm getting the error below:
X <- Stata.file(Stata_File)
Error in nchar(varlabs) : invalid multibyte string 253
Multiple Mac users here are encountering this error with the program but it works fine on a PC. A google search of this error seems to say it has something to do with the R package but I can't find a solution. Any ideas? Thanks for your help!!
The R code up to the error point is below:
Root <- "/Users/Desktop/R_Training"
PathIn <- paste(Root,"Data/Example_0",sep="/")
# The 2007 Dominican Republic household member file (96 MB)
Stata_File <- "drpr51fl.dta"
# Load the memisc package:
library(memisc)
# Set the working directory:
setwd(PathIn)
# (1) Determine which variables we want:
# The Stata.file function (from memisc) reads the "header"
# of our Stata file so you can see what it contains
# and choose the variables you want.
X <- Stata.file(Stata_File)
**Error in nchar(varlabs) : invalid multibyte string 253**
Below is my session info:
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] memisc_0.95-33 MASS_7.3-13 lattice_0.19-30

This is what worked for me. You can make R treat strings as plain bytes, so that it no longer tries to validate them as multibyte (UTF-8) text, by issuing the following command:
Sys.setlocale('LC_ALL','C')
Now run the previous command and all should be fine.

It seems like the encoding of the strings in the file isn't what the program thinks it is...
I guess the file was generated on a PC? Does it contain non-ASCII column names or data strings?
Since you seem to have UTF-8 encoding, and (US/Western European) PCs typically use Latin-1, that could be the problem. I'd expect the same problem on Linux then (also UTF-8).
Possible work-arounds:
Does the Stata.file method have an "encoding" option? Then you might try 'latin1' and hope for the best...
Another possibility is to start R with the --encoding=latin1 option.
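To illustrate the suspected Latin-1 versus UTF-8 mismatch, here is a minimal sketch in Python (used only to show the byte mechanics; it is not part of the R workflow above). The label text is hypothetical: it assumes a Spanish variable label was stored as Latin-1 bytes, which is a guess and not something confirmed from the actual file.

```python
# Hypothetical variable label as it might be stored in a Latin-1 .dta file.
raw_label = b"A\xf1os de educaci\xf3n"  # "Años de educación" encoded as Latin-1


def decode_label(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1 if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# A UTF-8 reader rejects the bytes outright, much like nchar() does in a UTF-8 locale.
try:
    raw_label.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf8_ok)
print(decode_label(raw_label))
```

The same bytes are perfectly valid once they are read with the encoding they were written in, which is why telling the reader about 'latin1' (or sidestepping validation with the C locale) can make the error go away.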

Related

Reading dput() gists from github into R

I am trying to read a gist containing a dput from Github:
library(RCurl)
data <- getURL("https://gist.githubusercontent.com/aronlindberg/848b8efef154d0e7fdb4/raw/5bf4bb864cc4c1db0f66da1be85515b4fa19bf6b/pull_lists")
pull_lists <- dget(textConnection(data))
This generates:
Error: '\U' used without hex digits in character string starting ""## -1,7 +1,9 ##
module ActionDispatch
module Http
module URL
- # Returns the complete \U"
Which I think is a Ruby error message rather than an R error. Now consider this:
data <- getURL("https://gist.githubusercontent.com/aronlindberg/b6b934b39e3c3378c3b2/raw/9b1efe9340c5b1c8acfdc90741260d1d554b2af0/data")
pull_lists2 <- dget(textConnection(data))
This seems to work fine. The former gist is rather large, 1.7 MB. Could this be why I can't read it from GitHub? If not, why?
The gist that you created does not have a .R file in it, since pull_lists does not have an extension. I forked your gist to this one and added the extension. Now it is possible to source the gist and save it to a value.
library("devtools")
pull_lists <- source_gist("a7b157cec3b9259fc5d1")

pandoc document conversion failed with error 43 : pdflatex: The memory dump file could not be found

RStudio : 0.98.994
OS: Microsoft Windows 7 Ultimate Edition, 64-bit Service Pack 1
MiKTeX: 2.9.4503
Hi,
I get the following error when I try to knit a PDF document.
pandoc.exe: Error producing PDF from TeX source.
This is pdfTeX, Version 3.1415926-1.40.11 (MiKTeX 2.9)
pdflatex: The memory dump file could not be found.
pdflatex: Data: pdflatex.fmt
I also tried devtools::install_github('rstudio/rmarkdown') but was still getting an error when I added fig.align='center' to a ggplot2 plot in my document. It would work as HTML, but not as PDF.
After seeing isomorphismes's post I clicked on the gear symbol next to the knit PDF button, then under the advanced tab I changed the LaTeX Engine to xelatex. After that I no longer received the error message and my PDF document was created without problems.
Thank you.
I found the answer here: http://rmarkdown.rstudio.com/tufte_handout_format.html#comment-1582377678
The problem is that you need to add \usepackage[utf8]{inputenc} to the preamble of the tufte-handout.tex file in the rmarkdown package.
This was fixed here: https://github.com/rstudio/rmarkdown/commit/484d5b8e903e0e0c75c82f707efa35f9fd9a52b0
To update your rmarkdown package, you can run the following directly in the RStudio console:
devtools::install_github("rstudio/rmarkdown")
None of the above worked for me when knitting to PDF (and I wanted to keep the scientific notation). The problem was that latex code was generated that included "\times" without the necessary bracketing by $. In the markdown I simply bracketed the inline R code with $'s, like so:
$p = `r signif(cor.HF$p.value, 2)`$
Voila!
happy to share with you my solution.
---
title: "Untitled"
author: "-----"
date: "21/6/2017"
output:
  pdf_document:
    latex_engine: xelatex
---
I was able to fix it in my case. I hit this error when generating a PDF from Rmd whenever I inserted a floating-point value into text and R chose to display it in scientific notation. For example, instead of "520274.72" it produced "5.2027472 × 10^5", which leads to the LaTeX code \textbf{5.2027472\times 10\^{}{5}} that does not compile. I fixed it by wrapping the value in format(..., scientific=FALSE).
replace
`r round(txn_pd,2)`
with
`r format(round(txn_pd,2),scientific=FALSE)`
I had the same problem and devtools::install_github('rstudio/rmarkdown') didn't work for me. I needed to run
rmarkdown::render('in.md',
  output_format = rmarkdown::pdf_document(latex_engine = 'xelatex')
)
with the new argument (use xelatex) on its own line.
I encountered this problem while trying to add the inline R code `r test1$p.value`, which is a very small p-value from a t test. The error output was as follows:
> ! Missing $ inserted.
> <inserted text>
> $
>l.147 9.0044314\times
>
>pandoc: Error producing PDF
>Error: pandoc document conversion failed with error 43
>Execution halted
I think the problem is that the pdflatex engine has trouble displaying the small p-value in exponential notation.
I solved the problem by clicking on the gear symbol next to the knit button, then under Output Options, on the Advanced tab, I changed the LaTeX Engine to lualatex. Alternatively, you can just report the p-value as p < 0.001.
If you are using inline values from your R code that end up in scientific notation (because they are very small or very large), format them explicitly:
replace `r x`
with `r format(x, digits=n)`, where n is the number of significant digits you want.
For me it was because I was putting + signs in my headers, for example "gene + treatment". This caused the error; when I removed the + sign, it worked.
In my case it was solved simply by editing the author field in:
---
title: "Document Title"
author: '-----'
date: "21-03-2017"
output: pdf_document
---
the default '-----' would yield the error, but replacing it with anything (for example 'Juan') solved the issue.
I just ran into this problem and already solved it. I didn't use any code as other people did in their posts.
I will assume that you have installed all the basics: R, RStudio, the rmarkdown package, the knitr package, and the basic MiKTeX installation (I know this is very basic, but I want first-timers to know that you need all of these to make this work).
If you run into this problem, go to the R GUI and upgrade the rmarkdown package; it should work after that. Note that changing the LaTeX Engine to xelatex, as the highest-voted answer suggests, may not work for you; at least it did not for me. I left my LaTeX engine as it was (pdflatex).
I had a similar issue. My solution was to remove the "leading" period in the YAML title argument:
Does not work:
---
title: “1. Title”
output: pdf_document
---
output file: example.knit.md
! Argument of \reserved#a has an extra }.
\par l.79 \end{enumerate}}
pandoc: Error producing PDF Error: pandoc document conversion failed
with error 43 Execution halted
Works:
---
title: “1 Title”
output: pdf_document
---
I did try to use the xelatex engine, but I still got an error saying that xetex.def was not found. This is another way to work around it:
output:
  pdf_document:
    keep_tex: yes
    latex_engine: xelatex
Then open the .tex file in your TeX editor and build the PDF as usual.
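I faced a similar issue. In my case, the error occurred because I put a percent sign inside the $ signs, like this:
$95%$
In LaTeX an unescaped % starts a comment, so the closing $ is swallowed. I removed the % sign (escaping it as \% should also work), and everything worked fine.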

What is a reliable way of getting allowed locale names in R?

I'm trying to find a reliable way of finding locale codes to pass to Sys.setlocale.
The ?Sys.setlocale help page just states that the allowed values are OS dependent, and gives these examples:
Sys.setlocale("LC_TIME", "de") # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE.utf8") # Modern Linux etc.
Sys.setlocale("LC_TIME", "de_DE.UTF-8") # ditto
Sys.setlocale("LC_TIME", "de_DE") # Mac OS X, in UTF-8
Sys.setlocale("LC_TIME", "German") # Windows
Under Linux, the possibilities can be retrieved using
locales <- system("locale -a", intern = TRUE)
## [1] "C" "C.utf8" "POSIX"
## [4] "af_ZA" "af_ZA.utf8" "am_ET"
## ...
I don't have Solaris or Mac machines to hand, but I guess the equivalent lists for those systems can be derived from that output using something like:
library(stringr)
unique(str_split_fixed(locales, "_", 2)[, 1]) #Solaris
unique(str_split_fixed(locales, "\\.", 2)[, 1]) #Mac
Locales on Windows are much more problematic: they require long names of the form “language_country”, for example:
Sys.setlocale("LC_ALL", "German_Germany")
I can't find a reliable reference for the list of locales under Windows. Calling locale -a from the Windows command line fails unless cygwin is installed, and then it returns the same values as under Linux (I'm guessing it's accessing values in a standard C library.)
There doesn't seem to be a list of locales packaged with R (I thought there might be something similar to share/zoneinfo/zone.tab, which contains time zone details).
My current best strategy is to browse this webpage from Microsoft and form the name by manipulating the SUBLANG column of the table.
http://msdn.microsoft.com/en-us/library/dd318693.aspx
Some guesswork is needed, for example the locale related to SUBLANG_ENGLISH_UK is English_United Kingdom.
Sys.setlocale("LC_ALL", "English_United Kingdom")
Where there are variants in different alphabets, parentheses are needed.
Sys.setlocale("LC_ALL", "Uzbek (Latin)_Uzbekistan")
Sys.setlocale("LC_ALL", "Uzbek (Cyrillic)_Uzbekistan")
This guesswork wouldn't be too bad, but many locales don't work at all, including most Indian locales.
Sys.setlocale("LC_ALL", "Hindi_India")
Sys.setlocale("LC_ALL", "Tamil_India")
Sys.setlocale("LC_ALL", "Sindhi_Pakistan")
Sys.setlocale("LC_ALL", "Nynorsk_Norway")
Sys.setlocale("LC_ALL", "Amharic_Ethiopia")
The Windows Region and Language dialog box (Windows\System32\intl.cpl) has a similar but not identical list of available locales, but I don't know where that is populated from.
There are several related questions:
1. Mac and Solaris people: please can you check to see if my code for getting locales works under your OS.
2. Indian/Pakistani/Norwegian/Ethiopian people using Windows: Please can you tell me what Sys.getlocale() returns for you.
3. Other Windows people: Is there any better documentation on which locales are available?
Update: After clicking links in the question that Ben B mentioned, I stumbled across this better list of locales in Windows. By manually changing the locale using the Region and Language dialog and calling Sys.getlocale(), I deduced that Nynorsk is "Norwegian-Nynorsk_Norway". There are still many oddities, for example
Sys.setlocale(, "Inuktitut (Latin)_Canada")
is fine, but
Sys.setlocale(, "Inuktitut (Syllabics)_Canada")
fails (as do most of the Indian languages). Starting R in any of these locales causes a warning, and R's locale to revert to C.
I'm still interested to hear from any Indians, etc., as to what locale you have.
In answer to your first question, here's the output on my Mac:
> locales <- system("locale -a", intern = TRUE)
> library(stringr)
> unique(str_split_fixed(locales, "\\.", 2)[, 1])
[1] "af_ZA" "am_ET" "be_BY" "bg_BG" "ca_ES" "cs_CZ" "da_DK" "de_AT" "de_CH"
[10] "de_DE" "el_GR" "en_AU" "en_CA" "en_GB" "en_IE" "en_NZ" "en_US" "es_ES"
[19] "et_EE" "eu_ES" "fi_FI" "fr_BE" "fr_CA" "fr_CH" "fr_FR" "he_IL" "hi_IN"
[28] "hr_HR" "hu_HU" "hy_AM" "is_IS" "it_CH" "it_IT" "ja_JP" "kk_KZ" "ko_KR"
[37] "lt_LT" "nl_BE" "nl_NL" "no_NO" "pl_PL" "pt_BR" "pt_PT" "ro_RO" "ru_RU"
[46] "sk_SK" "sl_SI" "sr_YU" "sv_SE" "tr_TR" "uk_UA" "zh_CN" "zh_HK" "zh_TW"
[55] "C" "POSIX"
I'm not sure what I'm expecting to see with Sys.setlocale() but it doesn't throw any errors:
> Sys.setlocale(locale="he_IL")
[1] "he_IL/he_IL/he_IL/C/he_IL/en_AU.UTF-8"
> Sys.getlocale()
[1] "he_IL/he_IL/he_IL/C/he_IL/en_AU.UTF-8"
Thanks all. I went to the URL that Richie suggested, http://msdn.microsoft.com/en-us/library/dd318693.aspx, and tried LANG_BELARUSIAN in windows. That didn't work, so I lopped off the "LANG_" and included "BELARUSIAN" by itself. Worked fine.
> bk.date1
[1] "Ma 2012 august 14 11:28:30 "
ymd_hms(bk.date1, locale = "BELARUSIAN")
[1] "2012-08-14 11:28:30 UTC"

AlignIO gives 'AssertionError' when reading emboss alignment files

I have been stuck on a problem for three days... searched everywhere, posted on Biostar, still waiting for EMBL to respond to emails... would make a bounty if I had more rep.
After aligning sequences with EMBOSSwin needle() (pairwise global alignments) I get alignment files in pair format, with a .needle file extension. I want to use Biopython to read these alignments for later analysis.
I use AlignIO.read(open('alignment.needle'),'emboss') following the instructions in Biopython's AlignIO wiki but I keep getting an AssertionError.
My code:
>>> from Bio import AlignIO
>>> alignment = AlignIO.read(open("data/all/out/pair1_alignment.needle"), "emboss")
My error:
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\Python27\lib\Bio\AlignIO\__init__.py", line 423, in read
    first = next(iterator)
  File "C:\Python27\lib\Bio\AlignIO\__init__.py", line 370, in parse
    for a in i:
  File "C:\Python27\lib\Bio\AlignIO\EmbossIO.py", line 150, in __next__
    assert seq.replace("-", "") != ""
AssertionError
Example Alignment File:
Download the alignment file here
Versions:
Windows 7
Python version 2.7.3
Biopython version 1.63
EMBOSS version 2.10.0-0.8
Clues:
I suspect this may be related to a warning message I kept getting when actually making the alignments, which was outputted by EMBOSS needle() function:
Warning: Sequence character string not found in ajSeqCvtKS
Duplicate post on BioStars, http://www.biostars.org/p/87226/#87399
This appears to be down to a subtle change in the EMBOSS output. You have an extremely old version, EMBOSS version 2.10.0 (February 2005), and your output file has lines like this:
gag 1288 -------------------------------------------------- 1287
Using a newer version of EMBOSS (e.g. 6.3.0), gives lines like this:
gag 1287 -------------------------------------------------- 1287
The Biopython parser is expecting the latter for alignment sections with no letters (e.g. when one sequence is much longer than the other), where the start and end coordinates agree. Please update your copy of EMBOSS, and then the parser should be happy. The current EMBOSS release is version 6.5.0.
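As a rough way to check whether a given alignment file contains the older style of line, here is a small diagnostic sketch. It is not Biopython's parser logic; the regular expression and the function name find_suspect_gap_lines are my own assumptions about the pair-format layout, based only on the two example lines above.

```python
import re

# Matches a sequence line such as "gag 1288 ---------- 1287":
# identifier, start coordinate, residues/gaps, end coordinate.
SEQ_LINE = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\S+)\s+(\d+)\s*$")


def find_suspect_gap_lines(lines):
    """Return (line_number, name, start, end) for gap-only lines whose
    start and end coordinates differ."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        if line.lstrip().startswith("#"):
            continue  # EMBOSS header/comment line
        match = SEQ_LINE.match(line)
        if match is None:
            continue  # match-marker or blank line
        name, start, residues, end = match.groups()
        if residues.replace("-", "") == "" and start != end:
            suspects.append((number, name, int(start), int(end)))
    return suspects


old_style = "gag 1288 -------------------------------------------------- 1287"
new_style = "gag 1287 -------------------------------------------------- 1287"
print(find_suspect_gap_lines([old_style, new_style]))
```

To scan a real file, pass an open file handle instead of the list, since the function only needs an iterable of lines. Any lines it reports are candidates for the coordinate mismatch described above.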
The problem is that you're passing the wrong format file to Biopython. An explanation follows.
Formatting
The format of the file you've linked to is srspair (see the header of pair1_aligned.fasta). It's worth noting that this is not the FASTA format - that's an entirely different format.
Delving into the source of Biopython's EmbossIO, we can see that the EmbossIterator (which is called by AlignIO.read when the format is 'emboss') is only meant to handle the formats pair and simple (see Alignment formats for an explanation of the various formats).
Solution
If you export EMBOSS's output in the pair format (then call AlignIO.read as you have before), that should solve your problem.

What might explain an "invalid stored block lengths" error?

I am running a Rails (3.2.3) application with Ruby 1.9.3p194 on the basic Ubuntu lucid32 image in a Vagrant virtual box. The virtual box is running on Leopard, for what it's worth. I'm trying to use rubyzip in the application to decompress a zip archive - 2009_da_lmp.zip. Using code directly from examples in the rubyzip repository, I can confirm that I can list the archive file contents:
#f is the absolute path to 2009_da_lmp.zip (string)
Zip::ZipFile.open(f) { |zf| zf.entries[0] }
=> 20090101_da_lmp.csv #that is indeed a file in the archive.
Using some more code from the examples in the repository, I try to get at an actual file in the archive:
Zip::ZipInputStream.open(f) { |zis|
  entry = zis.get_next_entry
  print "first line of '#{entry.name}' (#{entry.size} bytes: ) "
  puts "'#{zis.gets.chomp}'" }
=> first line of '20090101_da_lmp.csv' (826610 bytes: ) Zlib::DataError:
invalid stored block lengths #and a long stack trace I can provide
#if that might help
The Mac OS decompression utility unzips the archive fine. I was wondering if it was some kind of encoding-related thing (my locale is set to en_US.UTF-8 to make using PostgreSQL in dev less painful), but I don't know how to tell if that's the case. I can't find any information on what might cause this error.
This is a typical error found when feeding random data to an inflater. In fact you will get this error about 1/4 of the time from random data (when the low three bits of the first byte are 000 or 001). So I would guess that the inflation is simply starting at the wrong byte for some reason.
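The 1/4 figure and the error itself can be reproduced with a short sketch. It uses Python's zlib module rather than rubyzip, purely as an illustration of what the inflater does with bytes that are not the start of a deflate stream; the input bytes are arbitrary and the helper names are my own.

```python
import zlib


def is_stored_block_header(first_byte: int) -> bool:
    """True when the low three bits are 000 or 001, i.e. BTYPE == 00 (stored block)."""
    return (first_byte >> 1) & 0b11 == 0


def try_raw_inflate(data: bytes) -> str:
    """Inflate data as a raw deflate stream; return 'ok' or the zlib error text."""
    try:
        zlib.decompressobj(-zlib.MAX_WBITS).decompress(data)
        return "ok"
    except zlib.error as exc:
        return str(exc)


# Fraction of possible first bytes that announce a stored block.
stored_fraction = sum(is_stored_block_header(b) for b in range(256)) / 256

# Arbitrary bytes starting with 0x00: the LEN/NLEN fields that follow
# (0x0201 and 0x0403) are not one's complements of each other.
message = try_raw_inflate(b"\x00\x01\x02\x03\x04")

print(stored_fraction)
print(message)
```

A stored block must be followed by a 16-bit length and its one's complement, and arbitrary bytes almost never satisfy that check, which is why starting inflation at the wrong offset tends to surface as this particular error.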

Resources