vim sort words alphabetically ignore accents - sorting

I want to use vim :sort to alphabetize a list of French words and have sort treat accented letters (é) as unaccented (e). French dictionaries are arranged this way. For example, the list "eduquer ébats" should sort to "ébats eduquer". However, a plain :sort in Vim leaves it as "eduquer ébats". Is there a :sort flag I can set to accomplish this?

At the bottom of :help :sort, there's this note:
The details about sorting depend on the library function used. There is no
guarantee that sorting obeys the current locale. You will have to try it out.
Vim does do a "stable" sort.
First, ensure that you're running in a French locale. This can be done inside Vim via
:lang fr_FR
but it's probably even better to set the LANG environment variable in the shell (assuming Linux; on Windows, you probably need to set your user's language accordingly).
If that does not work, you can fall back to an external sort (commonly provided on Linux; a Windows port of GNU sort is also available). Sort from Vim via
:%! LANG=fr_FR sort ...

You can try sorting with Perl's Unicode::Collate module. It's a Perl core module.
Assuming your word list is encoded as UTF-8:
:%!perl -CIO -MUnicode::Collate -e '$col = Unicode::Collate->new(level => 1); print for $col->sort(<>)'

Apparently, there is no direct vim sort method to accomplish what I want. My workaround consists of setting up two macros.
To recap: each line of my text file contains a French "term : definition" pair. Some terms contain accented characters. To get the lines alphabetized so that accented letters are treated as unaccented, I wrote a macro that copies the "term", opens a new line, and pastes the "term" on that separate line. It then invokes a second macro that converts accented characters to unaccented ones in the pasted "term", e.g., let @m=':s/^Vu00e0/a/ge'; the full macro is a long string that searches for every accented character used in French.
Once that is done, I cut and paste the modified "term" to the head of the original line and wind up with "unaccentedterm:accentedterm:definition". Then I run vim :sort and use a quick vim macro to strip out the first field, the unaccentedterm.
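The same "build an unaccented key, sort on it, then throw the key away" idea can be sketched outside Vim. This is only a minimal Python sketch, not the macros described above: the names strip_accents and sort_ignoring_accents are made up for illustration, and it assumes that dropping combining marks after NFD normalization is a good enough stand-in for French dictionary order.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (é -> e, à -> a)."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_ignoring_accents(lines):
    """Sort lines by their unaccented, case-folded form; the lines stay unchanged."""
    # The key plays the role of the temporary "unaccentedterm" prefix.
    return sorted(lines, key=lambda line: strip_accents(line).casefold())


if __name__ == "__main__":
    for line in sort_ignoring_accents(["eduquer : to educate", "ébats : frolics"]):
        print(line)
```

Because the key is computed on the fly, nothing has to be pasted in front of the line and stripped afterwards. Ligatures such as œ have no decomposition, so this sketch leaves them as they are.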
Many thanks to all who jumped in to help.

Related

No current text sorting works for numbers

In the DOS era, text sorting of numbers used to work normally. ASCII order was correctly taken into account by any editor. Example: the list 100,1,20,3,10,2 would have been arranged in the correct order: 1,2,3,10,20,100. Nowadays any text editor seems to disregard numbers (and special characters), resulting in something like 1,10,100,2,20,3, which is practically a mess. The same applies to other characters.
How can I get a correct sort nowadays?
Note: I'm trying to use this to put many IP addresses in order.
What sort or any editor does:
103.207.39.0
124.248.228.0
125.75.132.0
13.107.6.0
136.243.202.0
139.217.27.0
14.139.200.0
14.53.187.0
144.76.109.0
148.251.204.0
This is the desired output:
13.107.6.0
14.53.187.0
14.139.200.0
103.207.39.0
124.248.228.0
125.75.132.0
136.243.202.0
139.217.27.0
144.76.109.0
148.251.204.0
Open your file of IP addresses in Notepad++. Do a regex Find-and-Replace:
Find what: (?:^|(?<=\.))\d(\d)?(?=\.|$)
Replace with: \x20(?1:\x20)$0
Make sure the search mode is "Regular expression" and click Replace All.
Now sort the lines using Edit > Line Operations > Sort lines Lexicographically Ascending
Now do another regex Find-and-Replace in order to get rid of the spaces:
Find what: \x20
Replace with nothing: make sure the search mode is "Regular expression" and click Replace All.
source: https://notepad-plus-plus.org/community/topic/14354/can-i-sort-ip-addresses-in-numeric-value
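If a script is an option, the padding trick is not needed: the addresses can be compared octet by octet as numbers. The following is a minimal Python sketch of that alternative, not part of the Notepad++ recipe above. The name ip_sort_key is made up for illustration, and it assumes each line holds one well-formed dotted IPv4 address.

```python
def ip_sort_key(address):
    """Turn '14.53.187.0' into (14, 53, 187, 0) so octets compare numerically."""
    return tuple(int(octet) for octet in address.strip().split("."))


addresses = [
    "103.207.39.0",
    "124.248.228.0",
    "125.75.132.0",
    "13.107.6.0",
    "136.243.202.0",
    "139.217.27.0",
    "14.139.200.0",
    "14.53.187.0",
    "144.76.109.0",
    "148.251.204.0",
]

for address in sorted(addresses, key=ip_sort_key):
    print(address)
```

Comparing tuples of integers puts 13.x before 103.x and 14.53.x before 14.139.x, which is the order asked for in the question.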

Terminal overwriting same line when too long

In my terminal, when I'm typing over the end of a line, rather than start a new line, my new characters overwrite the beginning of the same line.
I have seen many StackOverflow questions on this topic, but none of them have helped me. Most have something to do with improperly bracketed colors, but as far as I can tell, my PS1 looks fine.
Here it is below, generated using bash -x:
PS1='\[\033[01;32m\]\w \[\033[1;36m\]☔︎ \[\033[00m\] '
Yes, that is in fact an umbrella with rain; I have my Bash prompt update with the weather using a script I wrote.
EDIT:
My BashWeather script actually can put any one of a few weather characters, so it would be great if we could solve for all of these, or come up with some other solution:
☂☃☽☀︎☔︎
If the umbrella with rain is particularly problematic, I can change that to the regular umbrella without issue.
The symbol being printed ☔︎ consists of two Unicode codepoints: U+2614 (UMBRELLA WITH RAIN DROPS) and U+FE0E (VARIATION SELECTOR-15). The second of these is a zero-length qualifier, which is intended to enforce "text style", as opposed to "emoji style", on the preceding symbol. If you're viewing this with a font that can distinguish the two styles, the following (U+2614 followed by U+FE0F, VARIATION SELECTOR-16) might be the emoji version: ☔️ Otherwise, you can see a table of text and emoji variants in Working Group document N4182 (the umbrella is near the top of page 3).
In theory, U+FE0E should be recognized as a zero-length codepoint, like any other combining character. However, it will not hurt to surround the variation selector in PS1 with the "non-printing" escape sequence \[…\].
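To check what a prompt symbol is really made of, it can help to list its codepoints and their UTF-8 bytes. This is a small Python sketch for inspection only; the helper name describe is made up for illustration, and it has nothing to do with how bash itself measures the prompt.

```python
import unicodedata


def describe(text):
    """Return (codepoint, Unicode name, UTF-8 bytes as hex) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), ch.encode("utf-8").hex())
        for ch in text
    ]


# The umbrella followed by the text-style variation selector.
for row in describe("\u2614\ufe0e"):
    print(*row)
```

For the umbrella this lists two separate codepoints of three UTF-8 bytes each, so the selector can be wrapped in \[…\] on its own.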
It's a bit awkward to paste an isolated variant selector directly into a file, so I'd recommend using bash's unicode-escape feature:
WEATHERCHAR=$'\u2614\[\ufe0e\]'
#...
PS1=...${WEATHERCHAR}...
Note that \[ and \] are interpreted before parameter expansion, so WEATHERCHAR as defined above cannot be dynamically inserted into the prompt. An alternative would be to make the dynamically-inserted character just the $'\u2614' umbrella (or whatever), and insert the $'\[\ufe0e\]' in the prompt template along with the terminal color codes, etc.
Of course, it is entirely possible that the variant indicator isn't needed at all. It certainly makes no useful difference on my Ubuntu system, where the terminal font I use (Deja Vu Sans Mono) renders both variants with a box around the umbrella, which is simply distracting, while the fonts used in my browser seem to render the umbrella identically with and without variants. But YMMV.
This almost works for me, so should probably not be considered a complete solution. This is a stripped down prompt that consists of only an umbrella and a space:
PS1='\342\230\[\224\357\270\] '
I use the octal escapes for the UTF-8 encoding of the umbrella character, putting the last three bytes inside \[...\] so that bash doesn't think they take up space on the screen. I initially put the last four bytes in, but at least in my terminal, there is a display error where the umbrella is followed by an extra character (the question-mark-in-a-diamond glyph for missing characters), so the umbrella really does occupy two spaces.
This could be an issue with bash and 5-byte UTF-8 sequences; using a character with a 4-byte UTF-8 encoding poses no problem:
# U+10400 DESERET CAPITAL LETTER LONG I
# (looks like a lowercase delta)
PS1='\360\220\220\200 '

Windows SED command - simple search and replace without regex

How should I use the 'sed' command to find and replace a given word, words, or sentence without treating any of the characters as special?
In other words, how do I treat the find and replace parameters as plain text?
In the following example I want to replace 'sagar' with '+sagar', so I have to give the following command:
sed "s/sagar/\\+sagar/g"
I know that \ should be escaped with another \, but I can't do this manipulation by hand,
as there are so many special characters and combinations of them.
I am going to take the find and replace parameters as input from a user screen.
I want to execute sed from C# code.
Simply put, I do not want to use sed's regular expressions. I want my command text to be treated as plain text.
Is this possible?
If so how can I do it?
While there may be sed versions that have an option like --noregex_matching, most of them don't have that option. Because you're getting the search and replace input by prompting a user, your best bet is to scan the user input strings for regex special characters and escape them as appropriate.
Also, will your users expect, for example, their all-caps search input to correctly match and replace a lower- or mixed-case string? In that case, recall that you could rewrite their target string as [Ss][Aa][Gg][Aa][Rr], and replace with +Sagar.
Note that far fewer special characters are used on the replacement side: '&', meaning "the complete string that was matched", and the numbered replacement groups, like \1, \2, .... Given users that have no knowledge or expectation that they can use such characters, the likelihood of them using \1 in their required substitution is pretty low. More likely they may have a valid use for &, so you'll have to scan (at least) for that and replace it with \&. In a basic sed, that's about it. (There may be others in the latest GNU seds, or in some of the seds that have their genesis as PC tools.)
For a replacement string, you shouldn't have to escape the + char at all. Probably yes for \. Again, you can scan your user's "naive" input and add escape chars as needed.
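The scan-and-escape step could look roughly like this. It is a minimal Python sketch rather than C#, and it rests on assumptions: the sed in use runs in basic (non -E) regex mode, the input is a single line, and the function name sed_literal_command is made up for illustration. The same loop is easy to port to C#.

```python
# Characters that are special in a basic regular expression (pattern side).
BRE_SPECIAL = set("\\.*[]^$")
# Characters that are special on the replacement side of s///.
REPLACEMENT_SPECIAL = set("\\&")


def _escape(text, special):
    """Prefix every character found in `special` with a backslash."""
    return "".join("\\" + ch if ch in special else ch for ch in text)


def sed_literal_command(find, replace, delim="/"):
    """Build an s command that treats both user strings as plain text."""
    # The delimiter must be escaped on both sides as well.
    pattern = _escape(find, BRE_SPECIAL | {delim})
    replacement = _escape(replace, REPLACEMENT_SPECIAL | {delim})
    return f"s{delim}{pattern}{delim}{replacement}{delim}g"


if __name__ == "__main__":
    print(sed_literal_command("sagar", "+sagar"))
```

Characters such as + are deliberately left alone, because in basic mode a backslash in front of them can turn them into operators in some seds. Passing the resulting string to sed as a single process argument, instead of through a shell, avoids a second round of quoting.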
Finally, if you're doing this for a "package" that will be distributed, and you'll be relying on the users' version of sed, beware that there are many versions of sed floating around: some have their roots in Unix/Linux, and others, particularly super-sed, (I'm pretty sure) got started as PC standalones and have a very different feature set.
IHTH.

How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/\(.\)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python solution. It reads the contents of a file, decodes it as UTF-8, turns it into a set (thus throwing away duplicates), throws away ASCII characters (0-127), sorts the rest, and then joins them back together with a newline between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
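The one-liner above is Python 2 (unicode, xrange and the print statement). On Python 3 the same idea could be sketched as follows; the function names are made up for illustration, and the file is assumed to be valid UTF-8.

```python
def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by codepoint."""
    return sorted({ch for ch in text if ord(ch) > 127})


def non_ascii_chars_in_file(path):
    """Read a UTF-8 file and return its distinct non-ASCII characters."""
    with open(path, encoding="utf-8") as handle:
        return non_ascii_chars(handle.read())
```

Calling print("\n".join(non_ascii_chars_in_file("inputfile"))) then lists one character per line, as before.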

Search and replace Greek letters in notebook

I often use Greek letters in my calculations. Is there any way to replace all occurrences of, say, ø with µ? From a computational/mathematical standpoint, it makes no difference what the variable name is. But sometimes we are conditioned to use certain variables and become used to them. So if I happen to use an odd variable and need to share my notebook with a colleague, I'd like to change the variable before sending it. Is there an efficient way to search and replace Greek letters in mma?
Using Mathematica 7 or later, you can use the "Find" dialog to do this. Type \[Phi] as the search string and \[Mu] as the replacement string. This may also work in Mathematica 6 or earlier, but I don't have those versions at hand at the moment to try it.
See the "Listing of Named Characters" in the Mathematica help for the escape codes that you can use.
The find and replace dialog should work for this.
Assuming version 8, you can either use long names (\[OSlash]) to input the names, or shortcuts (shift-esc o / shift-esc).
(shift-esc being necessary because plain old esc in the find dialog will dismiss the dialog.)
In earlier versions the long name method should work. (The long name won't collapse into the character, but after finding/replacing it's all fine.)
In mma7, one possibility that is sometimes handy is Use Selection for Find. That is, select the greek letter (or whatever you want to replace), then (from the EDIT menu) -> Find -> Use Selection for Find.
When the Search/Replace dialog box is now invoked, \[Phi] (for example) will be in the Find dialog box. On a Macintosh, the shortcut is command E (followed by command F). This also works for \[CapitalDelta] etc.

Resources