Using Swedish letters ÖÄÅ with sort in Batch

I need to sort a table of contents that contains Swedish letters in Batch.
I can echo them by using chcp 65001 at the top of my script.
Is there a way to sort the list according to the Swedish alphabetical order (something like LC_ALL=C.UTF-8 in bash)?

Thank you all. I've fixed it by using chcp 1252 and changing the encoding of the input text file to UTF-16 LE.
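For comparison, this is roughly what the LC_ALL-style, locale-aware collation mentioned in the question looks like in Python; a minimal sketch, assuming a Swedish locale such as sv_SE.UTF-8 is installed (toc.txt is a made-up file name):

import locale

# Swedish collation sorts Å, Ä, Ö after Z, unlike plain byte order.
locale.setlocale(locale.LC_COLLATE, 'sv_SE.UTF-8')

with open('toc.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

for line in sorted(lines, key=locale.strxfrm):
    print(line)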

Related

vim sort words alphabetically ignore accents

I want to use vim :sort to alphabetize a list of French words and have sort treat accented characters (é) as unaccented (e). French dictionaries are arranged in this fashion. For example, sorting the list "eduquer ébats" should yield "ébats eduquer". However, a simple sort with vim yields the first list. Is there a :sort flag I can set to accomplish this?
At the bottom of :help :sort, there's this note:
The details about sorting depend on the library function used. There is no
guarantee that sorting obeys the current locale. You will have to try it out.
Vim does do a "stable" sort.
First, ensure that you're running in a French locale. This can be done inside Vim via
:lang fr_FR
but it's probably even better to set the LANG environment variable in the shell (assuming Linux; on Windows, you probably need to set your user's language accordingly).
If that does not work, you can fall back to an external sort (which is commonly provided on Linux; Windows ports of GNU sort are also available). Sort from Vim via
:%! LANG=fr_FR sort ...
You can try sorting with the Unicode::Collate module from Perl; it's a Perl core module. Passing level => 1 restricts comparison to primary differences, so accents (and case) are ignored.
Assuming your word list is encoded in UTF-8:
:%!perl -CIO -MUnicode::Collate -e '$col = Unicode::Collate->new(level => 1); print for $col->sort(<>)'
Apparently, there is no direct vim sort method to accomplish what I want. My workaround consists of setting up two macros, as mentioned above.
To recap: each line of my text file contains a French "term : definition" pair, and some terms contain accented characters. To get the lines alphabetized so that accented letters are treated as unaccented, I wrote a macro that copies the "term", opens a new line, and pastes the "term" on that separate line, then invokes a macro that converts accented characters to unaccented ones in the pasted "term", e.g., let @m=':s/^Vu00e0/a/ge'; my macro is a long string that searches for all the accented characters used in French.
Once that is done, I cut and paste the modified "term" to the head of the original line and wind up with "unaccentedterm:accentedterm:definition". Then I run vim :sort, and finally a quick vim macro strips out the first term, the unaccented one.
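The same accent-stripping idea can also be scripted outside vim. A minimal Python sketch that builds an unaccented sort key via Unicode decomposition (glossary.txt is a hypothetical UTF-8 file with one "term : definition" per line):

import unicodedata

def sort_key(line):
    # Decompose accented characters (é -> e + combining accent),
    # then drop the combining marks so accents are ignored.
    decomposed = unicodedata.normalize('NFD', line)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

with open('glossary.txt', encoding='utf-8') as f:
    lines = f.read().splitlines()

for line in sorted(lines, key=sort_key):
    print(line)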
Many thanks to all who jumped in to help.

Notepad++ - Binary text error

I have a huge txt file made using Python. When I try to sort it using Notepad++/TextFX, it returns the error: "This tool is not compatible with binary text. Please select text without [NUL] characters." Does this mean that I have non-printable chars in this txt file? Is it possible to convert this file to a compatible format so I could sort it using TextFX?
EDIT: I used mode 'a' in Python to write this file.
Thank you for your advice.
Using TextFX in Notepad++, you could try the following:
Mark the suspicious part or the whole text
Select TextFX, TextFX Characters, Zap all nonprintable characters to #. (The last entry in that submenu.)
All the problematic characters should have been replaced with "#"; you can then search for "#".
Another idea is the function Search, "Find characters in range": check "My range:" and enter "0" and "0" as the range to find [NUL] characters.
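Since the file came out of a Python script in the first place, one option is to clean it there. A minimal sketch that drops the NUL bytes TextFX complains about (both file names are placeholders):

# Read the file as raw bytes and drop the NUL bytes that TextFX
# flags as "binary text".
with open('huge.txt', 'rb') as f:
    data = f.read()

with open('huge_clean.txt', 'wb') as f:
    f.write(data.replace(b'\x00', b''))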

remove set of characters surrounding value

I'm redirecting the output of an API call to a file; however, I always get the following characters surrounding the value I need:
domainid='^[[39;49;00m^[[33;01m75307d12-e3f4-4a96-ac23-e2a9439f8299^[[39;49;00m'
Desired output
domainid='75307d12-e3f4-4a96-ac23-e2a9439f8299'
I really have no idea how to clean the output and make it look like the above.
Any suggestions will be highly appreciated.
Thank you
Those are ANSI control characters, or escape sequences; they are typically used to add colors, underlining, and so forth to your output.
The first order of business is to check whether your API command-line tool supports a no-color mode. That would solve your problem at the source.
Barring that, try this Server Fault answer, which has a command to clear ANSI sequences out of a text file using sed.
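A sketch of the same stripping idea in Python rather than sed, assuming the sequences are SGR (color) codes like the ones shown above (output.log is a made-up file name):

import re

# ESC [ ... m -- SGR (color) escape sequences such as ^[[39;49;00m
ANSI_SGR = re.compile(r'\x1b\[[0-9;]*m')

with open('output.log', encoding='utf-8', errors='replace') as f:
    for line in f:
        print(ANSI_SGR.sub('', line), end='')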
You could remove the undesired characters by replacing the line with just the submatches you want to keep:
... | sed -r "s/(domainid=).*([0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}).*/\1'\2'/i"

How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique non-ASCII characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/(.)/\1\n/g" inputfile | sort -u
They both work nicely and seem to generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python solution. It reads the contents of a file, decodes it as UTF-8, turns it into a set (thus throwing away duplicates), throws away the ASCII characters (0-127), sorts the rest, and joins them back together with a newline between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join, which would list the characters without a newline in between.)
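Note that the one-liner above is Python 2 (unicode, xrange, print statement). A Python 3 equivalent, where read() already returns decoded text, would look like:

python3 -c "print('\n'.join(sorted(set(open('inputfile', encoding='utf-8').read()) - set(map(chr, range(128))))))"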

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an Excel document into a .txt file. There are a few characters that I assume mean something to Excel but that show up as an unrecognised character (i.e., the '?' symbol in gedit, or one of those rectangles in some other text editors). I want to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could perhaps work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode characters; that means your '?'s and rectangles could actually be unrecognized multi-byte sequences.
If you want to properly handle the spreadsheet contents, I recommend first exporting the data to CSV using (Open|Libre)Office and choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi-byte sequences, I find this regex handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
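For reference, the same substitution sketched in Python (the sample input is made up):

import re

def zap_unrecognized(line):
    # Replace anything that is not an ASCII letter, digit,
    # hyphen, or underscore with '*', like the Ruby gsub above.
    return re.sub(r'[^0-9a-zA-Z\-_]', '*', line)

print(zap_unrecognized("naïve text pasted from Excel"))  # na*ve*text*pasted*from*Excel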
