Notepad++ - Binary text error - sorting

I have a huge txt file generated with Python. When I try to sort it using Notepad++/TextFX, it returns the error: "This tool is not compatible with binary text. Please select text without [NUL] characters." Does this mean that I have non-printable characters in this txt file? Is it possible to convert the file to a compatible format so that I can sort it using TextFX?
EDIT: I used mode 'a' in Python to write this file.
Thank you for your advice.

Using TextFX in Notepad++, you could try the following:
Mark the suspicious part or the whole text.
Select TextFX, TextFX Characters, Zap all nonprintable characters to #. (The last entry in that submenu.)
All the problematic characters should now have been replaced with "#"; you can then search for "#".
Another idea is the function Search, "Find characters in range". Check "My range:" and enter "0" and "0" as the range to find [NUL] characters.
Lars
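
Since the file was produced by Python in the first place, another option is to remove the NUL bytes with a short script before opening the file in Notepad++. The following is only a minimal sketch: the function names and the chunk size are made up for illustration, and it assumes the NULs are stray bytes rather than part of a UTF-16 encoding (in which case the file should be re-encoded instead).

```python
def strip_nul_bytes(data: bytes) -> bytes:
    """Return data with every NUL (0x00) byte removed."""
    return data.replace(b"\x00", b"")


def clean_file(src_path: str, dst_path: str, chunk_size: int = 1 << 20) -> int:
    """Copy src_path to dst_path without NUL bytes.

    Works in binary mode and in chunks, so a huge file does not have to
    fit in memory. Returns the number of NUL bytes that were dropped.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            cleaned = strip_nul_bytes(chunk)
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Calling clean_file("input.txt", "cleaned.txt") (both file names are placeholders) writes a copy that TextFX should accept, and a non-zero return value confirms that NUL characters really were present.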

Related

How to delete quotation mark in text file printed

I'm honestly a novice at Scilab.
I'm using the print function to create a .txt file containing my character matrix.
But when I open the txt file, double quotes appear. I just want the words without the "".
This is how I'm using print:
Compterendu(1,1)= "Medecin demandeur: "
fileresname= fullfile(RES_PATH, "compterendu.txt")
print(fileresname,Compterendu)
And, compterendu.txt was printed out like this.
Would be so grateful for any help!!
Thanks
Why do you use "print"? According to the documentation, it produces the same text as when you type the expression or the variable name on the command line. Hence it prints double quotes around strings. If you need something more basic, use lower-level I/O commands such as mputl.
S.

Need to strip out invalid characters in CSV file

I am generating a CSV file from a Microsoft SQL database that was provided to me, but somehow there are invalid characters in about two dozen places throughout the text (there are many thousands of lines of data). When I open the CSV in my text editor, they display as red, upside-down question marks (there are two of them in the attached screenshot).
When I copy the character and view the "find/replace" dialog in my text editor, I see this:
\x{0D}
...but I have no idea what that means. I need to modify my script that generates the CSV so it strips these characters out, but I don't know how to identify them. My script is written in Classic ASP.
You can also use RegEx to remove unwanted characters:
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "[^A-Za-z]"
strCSV = objRegEx.Replace(strCSV, "")
This code is from the following article, which explains in detail what it does:
How Can I Remove All the Non-Alphabetic Characters in a String?
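In your case you will want to allow more characters in the Pattern. Keep it as a negated character class (a leading ^ inside the brackets) so that Replace removes everything that is not listed; inside a VBScript string literal, the double quote has to be written as "":
[^a-zA-Z0-9!#$&()\-`.+,/"]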
You can simply use the Replace function and specify Chr(191) (or "¿" directly):
Replace(yourCSV, Chr(191), "")
or
Replace(yourCSV, "¿", "")
This will remove the character. If you need to replace it with something else, change the last parameter from "" to a different value ("-" for example).
In general, you can use charmap.exe (Character Map) from the Run menu, select Arial, find a symbol and copy it to the clipboard. You can then check its value using Asc("¿"); this returns the character code to use with Chr().
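
Note that the \x{0D} shown in the find/replace dialog is the hexadecimal code 0D, that is decimal 13, the carriage return (CR) control character; the red "¿" is most likely just how the editor renders it. In Classic ASP that would correspond to Chr(13) (the vbCr constant) rather than Chr(191), although replacing every Chr(13) would also affect any vbCrLf line endings. The sketch below illustrates the idea in Python rather than in ASP; the function names are invented for the example, and it assumes that only CRs which are not part of a CRLF line ending should be removed.

```python
import re

# A carriage return that is NOT immediately followed by a line feed,
# i.e. a stray CR inside a field rather than part of a CRLF line ending.
STRAY_CR = re.compile(r"\r(?!\n)")


def strip_stray_cr(text: str) -> str:
    """Remove stray carriage returns but keep CRLF line endings intact."""
    return STRAY_CR.sub("", text)


def describe(ch: str) -> str:
    """Show the code point of a single character in hex and decimal."""
    return f"U+{ord(ch):04X} (decimal {ord(ch)})"
```

Here describe("\r") gives "U+000D (decimal 13)", which matches the \x{0D} reported by the editor.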

A hint for end of ASCII data in a binary file

I'm developing software that stores its data in a binary file format. However, as a courtesy to innocent shell users who might cat such a file to inspect its contents, I'm thinking of putting an ASCII-compatible "magic string" at the start of the file that gives the name and the version of the binary format.
I'm thinking of having at least ten rows (\n) in the message so that head with its default settings doesn't reach the binary part.
Now, I wonder if there is any control character or escape code that would hint to the shell that the following content isn't interpretable as printable text, and should be just ignored? I tried 0x00 (the null byte) and 0x04 (ctrl-D) but they seem to be just ignored when catting the file.
cat simply copies the bytes of the file to its output and does not interpret them. There is no way to trigger an end-of-file from within the data, since EOF is not actually a character.
The other way around works, of course: you can specify a format in which the reader only starts interpreting the data as binary after a certain character.
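
As a rough illustration of that second point, the reader can skip a human-readable header and treat everything after a marker as binary. This is a sketch under assumptions: the format name MYFORMAT, the marker line and both function names are invented for the example and are not part of any real format.

```python
import io

MAGIC = b"MYFORMAT 1\n"                           # hypothetical name/version line
END_OF_HEADER = b"--- binary data follows ---\n"  # hypothetical marker line


def write_container(stream, payload: bytes, notes=()) -> None:
    """Write the text header, the marker line, then the raw binary payload."""
    stream.write(MAGIC)
    for note in notes:
        stream.write(note.encode("ascii") + b"\n")
    stream.write(END_OF_HEADER)
    stream.write(payload)


def read_container(stream) -> bytes:
    """Check the magic line, skip header lines, return the bytes after the marker."""
    if stream.readline() != MAGIC:
        raise ValueError("not a MYFORMAT file")
    for line in iter(stream.readline, b""):
        if line == END_OF_HEADER:
            return stream.read()
    raise ValueError("end-of-header marker not found")


# Round trip through an in-memory file.
buffer = io.BytesIO()
write_container(buffer, b"\x00\x01\xff", notes=["binary file, do not edit"])
buffer.seek(0)
payload = read_container(buffer)
```

Only the first occurrence of the marker matters, so the payload itself may contain any bytes. A shell user running cat will still see the binary part on the terminal; the header only tells them what they are looking at.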

How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/\(.\)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python (Python 2) solution. It reads the contents of a file, decodes it as UTF-8 and then turns it into a set (thus throwing away duplicates), throws away the ASCII characters (0-127), sorts the result and then joins it back together again with a newline between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
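
The one-liners above rely on Python 2 names (unicode, xrange and the print statement). For Python 3, the same idea might be sketched as follows; the function names are illustrative, and the file is assumed to be valid UTF-8.

```python
def unique_non_ascii(text: str):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})


def unique_non_ascii_in_file(path: str, encoding: str = "utf-8"):
    """Read a text file with the given encoding and list its non-ASCII characters."""
    with open(path, encoding=encoding) as handle:
        return unique_non_ascii(handle.read())
```

Printing "\n".join(unique_non_ascii_in_file("inputfile")) then gives one character per line, as in the original version.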

Parsing out abnormal characters

I have to work with text that was previously copy/pasted from an excel document into a .txt file. There are a few characters that I assume mean something to excel but that show up as an unrecognised character (i.e. that '?' symbol in gedit, or one of those rectangles in some other text editors.). I wanted to parse those out somehow, but I'm unsure of how to do so. I know regular expressions can be helpful, but there really isn't a pattern that matches unrecognisable characters. How should I set about doing this?
You could maybe work with http://spreadsheet.rubyforge.org/ to read and parse the data.
I suppose you're getting these characters because the text file contains invalid Unicode; that is, your '?'s and rectangles could actually be unrecognized multi-byte sequences.
If you want to handle the spreadsheet contents properly, I recommend that you first export the data to CSV using (Open|Libre)Office, choosing UTF-8 as the file encoding.
https://en.wikipedia.org/wiki/Comma-separated_values
If you are not worried about multi-byte sequences, I find this regex handy:
line.gsub( /[^0-9a-zA-Z\-_]/, '*' )
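
For comparison, here is a sketch of the same whitelist approach in Python, together with a lossy decode that makes invalid byte sequences visible instead of failing. The function names are made up for illustration. Note that the character class, exactly like the Ruby one above, also replaces spaces and punctuation, so in practice it would probably need to be widened.

```python
import re

# Same character class as the Ruby example: anything that is not an ASCII
# letter, a digit, a hyphen or an underscore counts as unexpected.
NOT_ALLOWED = re.compile(r"[^0-9a-zA-Z\-_]")


def mask_unexpected(line: str, placeholder: str = "*") -> str:
    """Replace every character outside the whitelist with the placeholder."""
    return NOT_ALLOWED.sub(placeholder, line)


def decode_lossy(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, turning invalid sequences into U+FFFD instead of raising."""
    return raw.decode(encoding, errors="replace")
```

Decoding with errors="replace" first and then searching for "\ufffd" is one way to locate the places where the file does not contain valid UTF-8.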
