How to sort words with accents? - bash

I was wondering how to sort alphabetically a list of Spanish words [with accents].
Excerpt from the word list:
Chocó
Cundinamarca
Córdoba

Cygwin uses GNU utilities, which are usually well-behaved when it comes to locales - a notable and regrettable exception is awk (gawk).
The following is based on Cygwin 1.7.31-3, current as of this writing.
Cygwin by default uses the locale implied by the current Windows user's UI language, combined with UTF-8 character encoding.
Note that it's NOT based on the setting for date/time/number/currency formats, and changing that makes no difference. The limitation of basing the locale on the UI language is that it invariably uses that language's "home" region; e.g., if your UI language is Spanish, Cygwin will use es_ES, i.e., Spain's locale. The only way to change that is to explicitly override the default - see below.
You can override this in a variety of ways, preferably by defining a persistent Windows environment variable named LANG (see below; for an overview of all methods, see https://superuser.com/a/271423/139307).
To see what locale is in effect in Cygwin, run locale and inspect the value of the LANG variable.
If that doesn't show es_*.utf8 (where * represents your region in the Spanish-speaking world, e.g., CO for Colombia, ES for Spain, ...), set the locale as follows:
In Windows, open the Start menu and search for 'environment', then select Edit environment variables for your account, which opens the Environment Variables dialog.
Edit or create a variable named LANG with the desired locale, e.g., es_CO.utf8 -- UTF-8 character encoding is usually the best choice.
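Alternatively, you could set the variable from a command prompt with setx, which creates or updates a persistent per-user environment variable (only shells started afterwards pick it up):
setx LANG es_CO.utf8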
Any Cygwin bash shell you open from then on should reflect the new locale - verify by running locale and ensuring that the LC_* values match the LANG value and that no warnings are reported.
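For example, with LANG=es_CO.utf8 in effect, you would see something like this (a sketch; the exact categories listed may vary with your Cygwin version):
$ locale
LANG=es_CO.utf8
LC_CTYPE="es_CO.utf8"
LC_NUMERIC="es_CO.utf8"
LC_TIME="es_CO.utf8"
LC_COLLATE="es_CO.utf8"
LC_MONETARY="es_CO.utf8"
LC_MESSAGES="es_CO.utf8"
LC_ALL=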
At that point, the following:
sort <<<$'Chocó\nCundinamarca\nCórdoba'
should produce (i.e., ó will sort directly after o, as desired):
Chocó
Córdoba
Cundinamarca
Note: locale en_US.utf8 would produce the same output - apparently, it generically sorts accented characters directly after their base characters - which may or may not be what a specific non-US locale actually does.
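For contrast, the C locale compares raw byte values, which places the UTF-8 bytes of ó after all unaccented ASCII letters:
$ LC_ALL=C sort <<<$'Chocó\nCundinamarca\nCórdoba'
Chocó
Cundinamarca
Córdoba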

Related

Is it possible to stop `powershell` wrapping output in ANSI sequences?

I launch powershell via CreateProcess (Win32) and read raw bytes from it.
I see that it produces a lot of invisible characters.
For example: \u{1b}[2J\u{1b}[m\u{1b}[
Is there any way to stop it?
Granted, it's possible to strip them manually, but I hope there's another way.
You mention powershell (powershell.exe), i.e. the CLI of Windows PowerShell.
Windows PowerShell (unlike PowerShell (Core) 7+, see below) itself does not use coloring / formatting based on VT / ANSI escape sequences.
The implication is that third-party code is producing the VT sequences in your case, so you must deactivate (or reconfigure) it to avoid such sequences in the output.
A prime candidate is a custom prompt function; such functions often involve coloring for a better command-line experience.
In programmatic use of powershell.exe, however, you would only see what the prompt function prints if you feed PowerShell commands to the CLI's stdin - either by passing argument -File - to the CLI (to instruct it to read commands from stdin) or by default (when neither -File nor -Command is specified).
To exclude the prompt-function output from the output altogether, use -Command -, as discussed in the answer to your previous question.
If you do want it, but want to use the default prompt string, suppress $PROFILE loading with the -NoProfile parameter, which is generally preferable in programmatic processing.
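For instance, the following invocation (a sketch; the same arguments apply when you build the command line for CreateProcess) reads commands from stdin, skips profile loading, and excludes prompt output:
echo Get-Date | powershell.exe -NoProfile -Command -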
Controlling use of colored output (VT / ANSI escape sequences) in PowerShell (Core) 7.2+
In PowerShell (Core) 7+ (pwsh.exe) - but not in Windows PowerShell (powershell.exe) - PowerShell itself situationally uses VT (ANSI) escape sequences to produce formatted/colored output, such as in the output of Select-String and, in v7.2+, in formatted output in general, notably column headers in tabular output / property names in list output.
In PowerShell 7.2+ you can categorically suppress these as follows:
Note: Categorically disabling VT (ANSI) sequences is generally not necessary, because PowerShell automatically suppresses them when output is not sent to the host (display); that is, $PSStyle.OutputRendering defaults to Host[1].
This amounts to the same behavior that many Unix utilities (sensibly) exhibit: coloring is by default only applied when printing to the display (terminal), not when piping to another command or redirecting to a file.
However, note that $PSStyle.OutputRendering only applies to objects that are formatted by PowerShell's for-display formatting system and only when such formatted representations are converted to string data, either explicitly with Out-String, or implicitly with > / Out-File or when piping to an external program.[2]
From outside PowerShell, before launching it:
By defining the NO_COLOR environment variable, with any value, such as 1.
This causes PowerShell to set $PSStyle.OutputRendering to PlainText on startup, which instructs PowerShell not to use VT / ANSI escape sequences.
A growing number of external programs also respect this env. variable - see no-color.org
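For example, from bash (the same applies to defining the variable in the environment block you pass to CreateProcess):
NO_COLOR=1 pwsh -NoProfile -Command 'Get-Item /'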
Alternatively, by setting the TERM environment variable to xtermm / xterm-mono:
Note that value dumb, at least as of PowerShell Core 7.2.0-preview.9 (the most recent version as of this writing), doesn't work: while it does cause $host.UI.SupportsVirtualTerminal to then reflect $false, as documented, $PSStyle.OutputRendering remains at its default, Host, and actual formatted output (such as from Get-Item /) still uses colors.
However, setting TERM may be more effective than NO_COLOR in getting external programs not to emit VT sequences - though ultimately there's no guarantee.
That said, modifying TERM, especially setting it to value dumb, is best avoided in general, because external programs (on Unix-like platforms) may rely on the TERM variable for inferring (also non-color-related) capabilities of the hosting terminal application (which setting the variable to dumb takes away altogether and - at least hypothetically - value xtermm / xterm-mono may misrepresent the true fundamental terminal type).
From inside PowerShell:
With $PSStyle.OutputRendering = 'PlainText'
Note that this alone - unlike the environment variables discussed above - does not affect the behavior of external programs.
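For example, to apply this setting for a single invocation from bash (the single quotes keep the shell from expanding $PSStyle; inside PowerShell itself you would simply run the assignment directly):
pwsh -NoProfile -Command '$PSStyle.OutputRendering = "PlainText"; Get-ChildItem'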
Note that third-party PowerShell code that uses VT sequences, especially if it predates PowerShell (Core) 7+, may not respect any of the standard mechanisms described above for disabling them (though may conceivably offer a custom mechanism).
[1] This applies since v7.2.0-preview.9, where Host was made the default and the previous default, Automatic, was removed altogether. In preview versions of v7.3.0, Ansi was temporarily the default, but since the official v7.3.0 release the (sensible) default is again Host.
[2] Notably, this means that string data that has embedded ANSI / VT escape sequences is not subject to $PSStyle.OutputRendering in v7.3+ (a change from v7.2), because strings aren't handled by the formatting system (they print as-is).
You can disable ANSI output rendering by setting the environment variable TERM to dumb:
#include <windows.h>
// The child process created by CreateProcess inherits this variable.
SetEnvironmentVariable(TEXT("TERM"), TEXT("dumb"));
// proceed with your call to CreateProcess

How do I turn OFF colors for ls output in Terminal on OSX

My ls output colors all directories differently from files, regardless of whether I type ls or /bin/ls. I don't have any LS_COLORS-style settings in .bashrc or related files that I can find.
How do I turn off these colors? (I am quite happy with just ls -F)
Thanks!
As noted in a comment, OSX ls pays attention to CLICOLOR. The ls manual page is the place to look. It appears to be the same program as in FreeBSD, which uses the terminal database (in contrast to GNU ls). Likewise, note that the variable is LSCOLORS, rather than LS_COLORS:
CLICOLOR
Use ANSI color sequences to distinguish file types. See LSCOLORS below. In addition
to the file types mentioned in the -F option some extra attributes (setuid bit set,
etc.) are also displayed. The colorization is dependent on a terminal type with the
proper termcap(5) capabilities. The default “cons25” console has the proper capabilities,
but to display the colors in an xterm(1), for example, the TERM variable must be
set to “xterm-color”. Other terminal types may require similar adjustments. Colorization
is silently disabled if the output isn't directed to a terminal unless the CLICOLOR_FORCE variable is defined.
CLICOLOR_FORCE
Color sequences are normally disabled if the output isn't directed to a terminal. This
can be overridden by setting this flag. The TERM variable still needs to reference a
color capable terminal however otherwise it is not possible to determine which color
sequences to use.
TERM
The CLICOLOR functionality depends on a terminal type with color capabilities.
The wording about "termcap(5)" is outdated; both FreeBSD and OSX have used terminfo databases for more than ten years.
The GNU ls manual page does say LS_COLORS (the two are not the same). The dircolors manual page makes an oblique reference to a "precompiled database" (this is unrelated to terminfo/termcap, and its use of TERM to get similar results is a source of confusion).
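In short, a minimal check and fix, assuming the colors come from CLICOLOR being exported somewhere in your shell startup files:
# see whether CLICOLOR (or CLICOLOR_FORCE) is set
env | grep CLICOLOR
# turn colors off for the current shell session; remove the corresponding
# export from your startup files to make the change permanent
unset CLICOLOR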

Makes sense to use wchar_t/wmain in a windows c++ console application?

I have been writing a new command line application in C++. One platform we support is, of course, Windows.
The Windows console, by default, uses an OEM code page that depends on the locale (for example, on my machine it is CP437 / DOS.Western; I think on a Cyrillic version of Windows it would be CP866, and so on). These OEM code pages contain only 256 characters.
I think what this means is the Windows console translates the input key strokes into characters based on the default code page. (And, depending on the currently selected fonts, if there is a corresponding glyph, it is displayed).
In such a case, does it make sense to use wmain/wchar_t and wide char types in my application?
Is there any advantage of using wide types? Or is there any grave problem if just char * is used?
When wide char types are used, what is the encoding of the command line arguments and environment strings - wchar_t * argv[] and wchar_t * envp[], I mean? Are they converted to UTF-16 by the Windows CRT, or are they untouched?
Thanks for your contributions.
You seem to be assuming that Windows internally works in the specified codepage. That's not true. Windows internally works in Unicode (UTF-16). For legacy software that uses char instead of wchar_t, input and output are translated into the specified codepage.
I think what this means is the Windows console translates the input key strokes into characters based on the default code page
This is not correct. The mapping of key strokes to (Unicode) characters is defined by the keyboard layout. This is totally independent of the code page. E.g you could use a Chinese keyboard layout on a system using a Cyrillic code page.
Not only does it make total sense to use wchar_t, it is the recommended way.
Yes, there is an advantage: your program can process all characters supported by Windows. If you use char, you can't handle any characters that are not in the current code page.
They are not converted - they stay what they are, namely UTF-16 characters.
Unfortunately, the command prompt itself is an 'ANSI' application, so it suffers from all of the limitations of 'ANSI', and this affects your application if you use it from the command prompt. However, a console application can be used in other ways, without a command prompt window, and then it can support Unicode fully.

Is there a naming convention for locale-specific static files?

I have some static resources (images and HTML files) that will be localized. One piece of software I've seen do this is Apache, which appends the locale to the name; for example, test_en_US.html or test_de_CH.html. I'm wondering whether this naming scheme is considered standard, or whether every project does it differently.
While there is no documented standard for naming localized files, I'd recommend using the format filename[_language[_country]], where
language is the ISO 639 two-letter language code
country is the ISO 3166 two-letter country code
For example:
myFile.txt (non-localized file)
myFile_en.txt (localized for global English)
myFile_en_US.txt (localized for US English)
myFile_en_GB.txt (localized for UK English)
Why? This is the most typical format used by operating systems, globalization tools (such as Trados and WorldServer), and programming languages. So unless you have a particular fondness for a different format, I see no reason to deviate from what most other folks are doing. It may save you some integration headaches down the road.
While there doesn't appear to be a standard convention as to where in the file name to place them, the international codes for language (e.g. "en") and region (e.g. "en-US") are both very common and very straightforward. Variations I've seen, setting aside "enUS" vs. "en_US" vs. "en-US":
foo.enUS.ext
foo.ext_enUS
enUS.foo.ext
foo/enUS.ext
enUS/foo.ext
…ad nauseam
I personally favor the first and last variants: the former for grouping files by name/resource (good for situations in which only a limited number of files need to be localized), the latter for grouping files by locale (better for situations with a large number of localized files).
You should always use the "de facto" standard, which is the Unix/POSIX way with gettext. And you should use gettext for your localization!
Therefore the one and only correct way is to use locale names like this:
en
en_US
en_GB
Some applications, and especially Java developers, sometimes use en-US (hyphenated instead of underscored), and it is ALL WRONG!!!
The gettext standard is this and only this:
locale
|_ en_US
   |_ LC_MESSAGES
      |_ appname.mo
Where:
locale - name of the directory; it can vary, but it is highly recommended to stick with the name "locale"
en_US - any standard locale, like es_ES, es_PT, ...
LC_MESSAGES - mandatory and cannot be changed!
appname.mo - the msgfmt-compiled appname.po file (appname is whatever you want)
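For example, to compile a Spanish catalog into that layout (a sketch; msgfmt ships with GNU gettext, and appname.po is your translated catalog):
mkdir -p locale/es_ES/LC_MESSAGES
msgfmt -o locale/es_ES/LC_MESSAGES/appname.mo appname.po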

Why doesn't sort sort the same on every machine?

Using the same sort command with the same input produces different results on different machines. How do I fix that?
The man-page on OS X says:
******* WARNING ******* The locale specified by the environment affects sort order. Set LC_ALL=C to get
the traditional sort order that uses native byte values.
which might explain things.
If some of your systems have no locale support, they default to that locale (C), so you don't have to set it on those. If you have systems that support locales and you want the same behavior, set LC_ALL=C on them. That is the way to get every system I know of to behave the same way.
If you don't have any locale-less systems, just making sure they share a locale would probably be enough.
For more canonical information, see The Single UNIX® Specification, Version 2: its description of locale, environment variables, setlocale(), and the description of the sort(1) utility.
This can be the result of locale differences:
$ echo 'CO2_
CO_' | env LC_ALL=C sort
CO2_
CO_
$ echo 'CO2_
CO_' | env LC_ALL=en_US sort
CO_
CO2_
Setting the LC_ALL environment variable to the same value should correct the problem.
This is probably due to different settings of the locale environment variables. sort will use these settings to determine how to compare strings. By setting these environment variables the way you want before calling sort, you should be able to force it to behave in one specific way.
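For example (a sketch using a hypothetical words.txt; any locale installed on all of the machines will do):
# pin the collation so every machine sorts identically
LC_ALL=en_US.UTF-8 sort words.txt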
For more than you ever wanted to know about sort, read the specification of sort in the Single Unix Specification v3. It states
Comparisons [...] shall be performed using the collating sequence of the current locale.
IOW, how sort sorts is dependent on the locale (language) settings of the environment that the script is running under.
