What is a reliable way of getting allowed locale names in R? - windows

I'm trying to find a reliable way of finding locale codes to pass to Sys.setlocale.
The ?Sys.setlocale help page just states that the allowed values are OS dependent, and gives these examples:
Sys.setlocale("LC_TIME", "de") # Solaris: details are OS-dependent
Sys.setlocale("LC_TIME", "de_DE.utf8") # Modern Linux etc.
Sys.setlocale("LC_TIME", "de_DE.UTF-8") # ditto
Sys.setlocale("LC_TIME", "de_DE") # Mac OS X, in UTF-8
Sys.setlocale("LC_TIME", "German") # Windows
Under Linux, the possibilities can be retrieved using
locales <- system("locale -a", intern = TRUE)
## [1] "C" "C.utf8" "POSIX"
## [4] "af_ZA" "af_ZA.utf8" "am_ET"
## ...
I don't have Solaris or Mac machines to hand, but I guess that the values for those systems can be generated from that output using something like:
library(stringr)
unique(str_split_fixed(locales, "_", 2)[, 1]) #Solaris
unique(str_split_fixed(locales, "\\.", 2)[, 1]) #Mac
Locales on Windows are much more problematic: they require long names of the form “language_country”, for example:
Sys.setlocale("LC_ALL", "German_Germany")
I can't find a reliable reference for the list of locales under Windows. Calling locale -a from the Windows command line fails unless cygwin is installed, and then it returns the same values as under Linux (I'm guessing it's accessing values in a standard C library.)
There doesn't seem to be a list of locales packaged with R (I thought there might be something similar to share/zoneinfo/zone.tab, which contains time zone details).
My current best strategy is to browse this webpage from Microsoft and form the name by manipulating the SUBLANG column of the table.
http://msdn.microsoft.com/en-us/library/dd318693.aspx
Some guesswork is needed, for example the locale related to SUBLANG_ENGLISH_UK is English_United Kingdom.
Sys.setlocale("LC_ALL", "English_United Kingdom")
Where there are variants in different alphabets, parentheses are needed.
Sys.setlocale("LC_ALL", "Uzbek (Latin)_Uzbekistan")
Sys.setlocale("LC_ALL", "Uzbek (Cyrillic)_Uzbekistan")
This guesswork wouldn't be too bad, but many locales don't work at all, including most Indian locales.
Sys.setlocale("LC_ALL", "Hindi_India")
Sys.setlocale("LC_ALL", "Tamil_India")
Sys.setlocale("LC_ALL", "Sindhi_Pakistan")
Sys.setlocale("LC_ALL", "Nynorsk_Norway")
Sys.setlocale("LC_ALL", "Amharic_Ethiopia")
The Windows Region and Language dialog box (Windows\System32\intl.cpl) has a similar but not identical list of available locales, but I don't know where it is populated from.
There are several related questions:
1. Mac and Solaris people: please can you check to see if my code for getting locales works under your OS.
2. Indian/Pakistani/Norwegian/Ethiopian people using Windows: Please can you tell me what Sys.getlocale() returns for you.
3. Other Windows people: Is there any better documentation on which locales are available?
Update: After clicking links in the question that Ben B mentioned, I stumbled across this better list of locales in Windows. By manually changing the locale using the Region and Language dialog and calling Sys.getlocale(), I deduced that Nynorsk is "Norwegian-Nynorsk_Norway". There are still many oddities, for example
Sys.setlocale(, "Inuktitut (Latin)_Canada")
is fine, but
Sys.setlocale(, "Inuktitut (Syllabics)_Canada")
fails (as do most of the Indian languages). Starting R in any of these locales causes a warning, and R's locale to revert to C.
I'm still interested to hear from any Indians, etc., as to what locale you have.

In answer to your first question, here's the output on my Mac:
> locales <- system("locale -a", intern = TRUE)
> library(stringr)
> unique(str_split_fixed(locales, "\\.", 2)[, 1])
[1] "af_ZA" "am_ET" "be_BY" "bg_BG" "ca_ES" "cs_CZ" "da_DK" "de_AT" "de_CH"
[10] "de_DE" "el_GR" "en_AU" "en_CA" "en_GB" "en_IE" "en_NZ" "en_US" "es_ES"
[19] "et_EE" "eu_ES" "fi_FI" "fr_BE" "fr_CA" "fr_CH" "fr_FR" "he_IL" "hi_IN"
[28] "hr_HR" "hu_HU" "hy_AM" "is_IS" "it_CH" "it_IT" "ja_JP" "kk_KZ" "ko_KR"
[37] "lt_LT" "nl_BE" "nl_NL" "no_NO" "pl_PL" "pt_BR" "pt_PT" "ro_RO" "ru_RU"
[46] "sk_SK" "sl_SI" "sr_YU" "sv_SE" "tr_TR" "uk_UA" "zh_CN" "zh_HK" "zh_TW"
[55] "C" "POSIX"
I'm not sure what I'm expecting to see with Sys.setlocale() but it doesn't throw any errors:
> Sys.setlocale(locale="he_IL")
[1] "he_IL/he_IL/he_IL/C/he_IL/en_AU.UTF-8"
> Sys.getlocale()
[1] "he_IL/he_IL/he_IL/C/he_IL/en_AU.UTF-8"

Thanks all. I went to the URL that Richie suggested, http://msdn.microsoft.com/en-us/library/dd318693.aspx, and tried LANG_BELARUSIAN in Windows. That didn't work, so I lopped off the "LANG_" and used "BELARUSIAN" by itself, which worked fine.
> bk.date1
[1] "Ma 2012 august 14 11:28:30 "
ymd_hms(bk.date1, locale = "BELARUSIAN")
[1] "2012-08-14 11:28:30 UTC"

Related

Best Practices for multiple OS consistency when using Ruby's Dir.glob

I noticed recently during a debugging session that Dir.glob (aka Dir[]) behaves differently depending on the OS. Specifically the order the files are returned in is different.
What are recommended ways to use Dir.glob in Ruby when you know the code will be used on a variety of OSes?
Example Difference:
I cloned the project DeckSchrubber in Linux and Windows
Windows:
irb(main):003:0> puts Dir['./*']
./CHANGELOG.md
./LICENSE
./main.go
./README.md
./types.go
./util
=> nil
Linux:
irb(main):011:0> puts Dir['./*']
./main.go
./LICENSE
./util
./types.go
./README.md
./CHANGELOG.md
=> nil
Once again I am asking for solutions and idioms to ensure the output is canonical.
Usually, file-system libraries behave differently on Mac and Linux. I don't consider Windows as a platform for Ruby.
So, in my experience, it was enough to add a conditional that checks the current platform name and sorts the result in the required way. As far as I remember, the only difference was the order of the returned files.

Bash: Add languages to Chromium

Is it possible to use Bash to add languages to Chromium? That is, do the equivalent of going to Settings - Advanced - Languages in the Chromium GUI, activating the languages you want, and then activating spell-checking for the same languages? I had a look at this, but there doesn't seem to be anything that fits the bill.
Figured it out. The best way seems to be to add a Python block to read and manipulate the Preferences-file using the JSON library. Before you do anything, you need to get your bearings in the Preferences-file. What are the relevant elements that you need to change?
If you go to Preferences in the Chromium GUI, you can see that there are two relevant settings:
1) Languages:
2) Dictionaries (for spell check):
These can be found in the Preferences-file by pretty-printing the file in the terminal (improving it with pygmentize) or saving a pretty-printed output to a file:
less Preferences | python -m json.tool | pygmentize -g
or
~/.config/chromium/Default$ less Preferences | python -m json.tool >> ~/Documents/output.txt
Searching through the file for language settings, you will find two relevant elements:
"intl": {
"accept_languages": "en-US,en,nb,fr-FR,gl,de,gr,pt-PT,es-ES,sv"
},
and
"spellcheck": {
"dictionaries": [
"en-US",
"nb",
"de",
"gr",
"pt-PT",
"es-ES",
"sv"
],
"dictionary": ""
}
Before you do anything else, it is wise to back up the Preferences-file. Next, you can alter the language settings by adding the following Python block to the bash script:
python - << EOF
import json
import os
# Path to the Chromium profile's Preferences-file
prefs_path = os.path.expanduser("~/.config/chromium/Default/Preferences")
with open(prefs_path, 'r') as infile:
    data = json.load(infile)
# Overwrite the language and spell-check settings
data['intl'] = {"accept_languages": "en-US,en,nb,fr-FR,gl,de,pt-PT,es-ES,sv"}
data['spellcheck'] = {"dictionaries":["en-US","nb","de","pt-PT","es-ES","sv"],"dictionary":""}
with open(prefs_path, 'w') as outfile:
    json.dump(data, outfile)
EOF
In this case, the script will remove Greek from the available languages and the spellchecker. Note that in order to add languages, you need to know the language code accepted by Chromium.
You can find more on reading and writing JSON here and here, and more on how to include Python scripts in bash scripts here.
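One thing to watch in the block above is that it replaces the whole "intl" and "spellcheck" objects, so any other keys stored under them would be dropped. A minimal sketch of a more conservative update follows, assuming only the two keys shown earlier need to change. The helper name set_languages, the sample data and "some_other_key" are made up for illustration and are not part of Chromium.

```python
import json


def set_languages(prefs, accept_languages, dictionaries):
    """Update language settings in a parsed Preferences dict, in place.

    Only "accept_languages" and "dictionaries" are touched; any other keys
    under "intl" or "spellcheck" are left as they were.
    """
    prefs.setdefault("intl", {})["accept_languages"] = ",".join(accept_languages)
    prefs.setdefault("spellcheck", {})["dictionaries"] = list(dictionaries)
    return prefs


# Hypothetical, cut-down stand-in for the real Preferences-file.
sample = json.loads("""
{
  "intl": {"accept_languages": "en-US,en,nb,gr", "some_other_key": true},
  "spellcheck": {"dictionaries": ["en-US", "nb", "gr"], "dictionary": ""}
}
""")

updated = set_languages(sample, ["en-US", "en", "nb"], ["en-US", "nb"])
print(json.dumps(updated, indent=2))
```

In a real script you would load the dict from the Preferences-file and dump it back, as in the heredoc above; the in-memory sample just keeps the sketch self-contained.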

Using weekdays with any locale under Windows

I'm trying to get the day of the week, and have it work consistently in any locale. In locales with Latin alphabets, everything is fine.
Sys.getlocale()
## [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252"
weekdays(Sys.Date())
## [1] "Tuesday"
I have two related problems with other locales.
If I set
Sys.setlocale("LC_ALL", "Arabic_Qatar")
## [1] "LC_COLLATE=Arabic_Qatar.1256;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=Arabic_Qatar.1256;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
then I sometimes (correctly) get
weekdays(Sys.Date())
## [1] "الثلاثاء"
and sometimes get
weekdays(Sys.Date())
## [1] "ÇáËáÇËÇÁ"
depending upon my setup. The problem is, I can't figure out what is causing the difference.
I thought it might be something to do with getOption("encoding"), but I've tried explicitly setting options(encoding = "native.enc") and options(encoding = "UTF-8") and it makes no difference.
I've tried several recent versions of R, and the problem is consistent across all of them.
At the moment, the string displays correctly in R GUI, but incorrectly when I use an IDE (Architect and RStudio tested).
What should I set to ensure that weekdays always displays correctly?
It may be helpful to know that weekdays(Sys.Date()) is equivalent to format(as.POSIXlt(Sys.Date()), "%A"), which calls an internal format.POSIXlt method.
Secondly, it seems overkill to change all of the locale. I thought I should just be able to set the time options. However, if I set individual components of the locale, weekdays returns a string of question marks.
for(category in c("LC_TIME", "LC_CTYPE", "LC_COLLATE", "LC_MONETARY"))
{
  Sys.setlocale(category, "Arabic_Qatar")
  print(Sys.getlocale())
  print(weekdays(Sys.Date()))
}
## [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
## [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
## [1] "LC_COLLATE=Arabic_Qatar.1256;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
## [1] "LC_COLLATE=Arabic_Qatar.1256;LC_CTYPE=Arabic_Qatar.1256;LC_MONETARY=Arabic_Qatar.1256;LC_NUMERIC=C;LC_TIME=Arabic_Qatar.1256"
## [1] "????????"
What parts of the locale affect how the weekdays are printed?
Update: The problem seems to be Windows-related. When I run the code on a Linux box with locale "ar_QA.UTF8", the weekdays are correctly displayed.
Further update: As agstudy mentioned in his answer, setting locales under Windows is odd, since you can't just use ISO codes like "en-GB". For Windows 7/Vista/Server 2003/XP you can set a locale using setlocale language strings or National Language Support values. For Qatari Arabic, there is no setlocale language string, so we must use an NLS value. We have several choices:
Sys.setlocale("LC_TIME", "ARQ") # the language abbreviation name
Sys.setlocale("LC_TIME", "Arabic_Qatar") # corresponding to the language/country pair "Arabic (Qatar)"
Sys.setlocale("LC_TIME", "Arabic_Qatar.1256") # explicitly including the ANSI codepage
Sys.setlocale("LC_TIME", "Arabic") # would sometimes be a possibility too, but it defaults to Saudi Arabic
So the problem isn't that R cannot support Arabic locales under Windows (though I'm not entirely convinced of the robustness of Sys.setlocale).
Desperate last ditch attempt: Trying to magically fix things by using Windows Management Instrumentation Command to change the OS locale doesn't work, since R doesn't appear to recognise the changes.
system("wmic os set locale=MS_4001")
## Updating property(s) of '\\PC402729\ROOT\CIMV2:Win32_OperatingSystem=#'
## Property(s) update successful.
system("wmic os get locale") # same as before
The system of naming locales is OS-specific. I recommend reading the section on locales in the R Installation and Administration manual for a complete explanation.
Under Windows:
The supported languages are listed in MSDN Language Strings. Surprisingly, Arabic is not there. The "Language string" column contains the legal input for setting the locale in R, and even the list of country/region strings contains no Arabic-speaking country.
Of course you can change your global locale settings (Control Panel --> Region --> ..), but this changes the locale globally, and it is not certain to give the right output without encoding problems.
Under Linux (Ubuntu in my case):
Arabic is generally not supported by default, but it is easy to set up using locale.
locale -a ## to list all already supported language
sudo locale-gen ar_QA.UTF-8 ## install it in case does not exist
Under RStudio now:
Sys.setlocale('LC_TIME','ar_QA.UTF-8')
[1] "ar_QA.UTF-8"
> format(Sys.Date(),'%A')
[1] "الثلاثاء"
Note also that in the R console the output is not as pretty as in RStudio, because it is written from left to right, not from right to left.
The RStudio/Architect problem
This can be solved, slightly messily, by explicitly changing the encoding of the weekdays string to UTF-8.
current_codepage <- as.character(l10n_info()$codepage)
iconv(weekdays(Sys.Date()), from = current_codepage, to = "utf8")
Note that codepages only exist on Windows; l10n_info()$codepage is NULL on Linux.
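To see why the IDE output looks the way it does, here is a small sketch in Python (not R, purely to illustrate the bytes involved). It assumes the weekday string is held as code page 1256 bytes and that the front end reads those bytes as code page 1252. Under that assumption it reproduces the garbled string from the question, which is consistent with the iconv() fix above.

```python
weekday = "الثلاثاء"  # the correctly displayed weekday from the question

# Bytes as a process running under Arabic_Qatar.1256 would hold them
# (assumption: the string is stored in the locale's ANSI code page).
raw = weekday.encode("cp1256")

# A front end that assumes the Western code page 1252 shows mojibake.
misread = raw.decode("cp1252")
print(misread)  # prints ÇáËáÇËÇÁ

# Decoding with the code page the bytes were written in recovers the text.
recovered = raw.decode("cp1256")
print(recovered == weekday)  # prints True
```

This is the same idea as the iconv() call: say which code page the bytes came from, and convert them to an encoding the display layer understands.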
The LC_TIME problem
It turns out that under Windows you have to set both the LC_CTYPE and LC_TIME locale categories, and you have to set LC_CTYPE before LC_TIME, or it won't work.
In the end, we need different implementations for different OSes.
Windows version:
get_today_windows <- function(locale = NULL)
{
  if(!is.null(locale))
  {
    lc_ctype <- Sys.getlocale("LC_CTYPE")
    lc_time <- Sys.getlocale("LC_TIME")
    on.exit(Sys.setlocale("LC_CTYPE", lc_ctype))
    on.exit(Sys.setlocale("LC_TIME", lc_time), add = TRUE)
    Sys.setlocale("LC_CTYPE", locale)
    Sys.setlocale("LC_TIME", locale)
  }
  today <- weekdays(Sys.Date())
  current_codepage <- as.character(l10n_info()$codepage)
  iconv(today, from = current_codepage, to = "utf8")
}
get_today_windows()
## [1] "Tuesday"
get_today_windows("French_France")
## [1] "mardi"
get_today_windows("Arabic_Qatar")
## [1] "الثلاثاء"
get_today_windows("Serbian (Cyrillic)")
## [1] "уторак"
get_today_windows("Chinese (Traditional)_Taiwan")
## [1] "星期二"
Linux version:
get_today_linux <- function(locale = NULL)
{
  if(!is.null(locale))
  {
    lc_time <- Sys.getlocale("LC_TIME")
    on.exit(Sys.setlocale("LC_TIME", lc_time), add = TRUE)
    Sys.setlocale("LC_TIME", locale)
  }
  weekdays(Sys.Date())
}
get_today_linux()
## [1] "Tuesday"
get_today_linux("fr_FR.utf8")
## [1] "mardi"
get_today_linux("ar_QA.utf8")
## [1] "الثلاثاء"
get_today_linux("sr_RS.utf8")
## [1] "уторак"
get_today_linux("zh_TW.utf8")
## [1] "週二"
Enforcing the .utf8 encoding in the locale seems important: get_today_linux("zh_TW") doesn't display properly.

Error in nchar() when reading in stata file in R on Mac

I'm learning R and am simply trying to read in a stata data file but am getting the error below:
X <- Stata.file(Stata_File)
Error in nchar(varlabs) : invalid multibyte string 253
Multiple Mac users here are encountering this error with the program but it works fine on a PC. A google search of this error seems to say it has something to do with the R package but I can't find a solution. Any ideas? Thanks for your help!!
The R code up to the error point is below:
Root <- "/Users/Desktop/R_Training"
PathIn <- paste(Root,"Data/Example_0",sep="/")
# The 2007 Dominican Republic household member file (96 MB)
Stata_File <- "drpr51fl.dta"
# Load the memisc package:
library(memisc)
# Set the working directory:
setwd(PathIn)
# (1) Determine which variables we want:
# The Stata.file function (from memisc) reads the "header"
# of our Stata file so you can see what it contains
# and choose the variables you want.
X <- Stata.file(Stata_File)
Error in nchar(varlabs) : invalid multibyte string 253
Below is my session info:
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] memisc_0.95-33 MASS_7.3-13 lattice_0.19-30
This is what worked for me. You can force R to recognize every character by issuing the following command:
Sys.setlocale('LC_ALL','C')
Now run the previous command and all should be fine.
It seems like the encoding of strings in the file isn't what the program thinks it is...
I guess the file was generated on a PC? Does it contain non-ASCII column names or data strings?
Since you seem to have UTF-8 encoding, and (US/Western Europe) PCs typically use latin-1, that could be the problem. I'd expect the same problem on Linux then (also UTF-8).
Possible work-arounds:
Does the Stata.file method have an "encoding" option? Then you might try 'latin1' and hope for the best...
Another possibility is to start R with the --encoding=latin1 option.
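The mismatch described above can be shown with a short sketch in Python (not R, just to illustrate the bytes). The label "año" is a made-up example, not taken from the Stata file in question. It shows that a string saved as latin-1 is not a valid UTF-8 byte sequence, which is the kind of situation that would produce an "invalid multibyte string" complaint in a UTF-8 session.

```python
label = "año"  # hypothetical variable label with a non-ASCII character

# How a latin-1 system would store the label: one byte per character.
latin1_bytes = label.encode("latin-1")

# A UTF-8 session trying to interpret those bytes fails, because the byte
# for "ñ" in latin-1 is not a complete UTF-8 sequence on its own.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)

# Telling the reader the true encoding recovers the original label.
print(latin1_bytes.decode("latin-1"))
```

This is why the suggested work-arounds all amount to telling R the file's real encoding, or switching to a locale that does not validate multibyte sequences.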

debugging littler/Rscripts

How do I debug Rscripts that are run from the command line?
I am currently using the getopt package to pass command line options, but when there's a bug, it is hard for me to:
see what exactly went wrong;
debug interactively in R (since the script expects command line options).
Does anyone have example code they are willing to share?
You could pass your command line arguments into an interactive shell with --args and then source('') the script.
$ R --args -v
R version 2.8.1 (2008-12-22)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> require(getopt)
Loading required package: getopt
> opt = getopt(c(
+ 'verbose', 'v', 2, "integer"
+ ));
> opt
$verbose
[1] 1
> source('my_script.R')
You could now use the old browser() function to debug.
I either use old-school print statements, or interactive analysis. For that, I first save state using save(), and then load that into an interactive session (for which I use Emacs/ESS). That allows for interactive work using the script code on a line-by-line basis.
But I often write/test/debug the code in interactive mode first before I deploy in a littler script.
Another option is to work with the options(error) functionality. Here's a simple example:
options(error = quote({dump.frames(to.file=TRUE); q()}))
You can create as elaborate a script as you want on an error condition, so you should just decide what information you need for debugging.
Otherwise, if there are specific areas you're concerned about (e.g. connecting to a database), then wrap them in a tryCatch() call.
