I need to prepend the BOM to a NamedBlobFile field on my Dexterity type to make Windows users happy, but leave the file in blobstorage intact.
What is the recommended way to hook into Zope's file streaming without changing the actual file on the filesystem?
I'd go a different direction:
Ban Notepad. Really, broken software is broken software; you can only go so far to fix its shortcomings.
Save UTF-8 files with the BOM; U+FEFF means "ZERO WIDTH NO-BREAK SPACE", and 'normal' software will mostly ignore it.
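For what it's worth, "prepending the BOM" only means emitting the three bytes EF BB BF in front of the stored data, so you can do it at download time without touching blobstorage. A minimal sketch in plain Python, assuming you have a custom download view where the stored bytes are available as data; the helper name and the commented usage are illustrative, not actual plone.namedfile API:

import codecs

def with_utf8_bom(data):
    # Prepend the UTF-8 BOM unless the data already starts with one.
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

# e.g. in a custom download view, just before streaming to the response:
# self.request.response.write(with_utf8_bom(data))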
[Edit/Disclaimer]: Comments pointed out that I need to clarify which encoding the user uses. I will update accordingly.
I have a customer from China who recently reported an issue with their filenames on Windows. The software works with most Chinese characters, but it seems he has found one file that fails.
Unfortunately, they are not able to send me the filename, as neither zipping nor transmitting the file through other media seems to preserve it.
What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not by UTF-8?
Unicode strings are encoded as a series of bytes. An encoding is the set of rules an operating system uses to turn those bytes back into the characters you see.
Given that Windows uses (a variation of) Unicode, and you say you have a character that is not in Unicode, that means there is simply no way to represent that character.
Imagine if Unicode only contained the numbers 0-9, and you asked someone how to encode the letter A. There is no answer, because only 0-9 are defined.
You could make up a new Unicode code point for your character, but then operating systems won't know what to do with it unless you also make your own font files.
I somehow doubt that that's what you want to do, but it is an option. Could your customer rename the file before sending it to you?
As far as I know, in 1987 PC-DOS 3.3 as well as MS-DOS 3.3 were released, and they had several code pages (850, 860, 863, 865).
Does it mean that user could write text using Portuguese (cp860) and, say, Nordic (cp865) symbols in one file?
Or was it more like one code page per operating system? For example, PC-DOS from Portugal had only code page 860 and the user could use symbols only from that code page, while PC-DOS from Scandinavia had only code page 865.
The same question applies to Windows: starting with which version did it support multilingual text documents?
DOS had no real knowledge of code pages; strings were just ASCII strings (zero- or dollar-terminated).
Code pages were used mostly for display: changing the code page changes how a given byte value is drawn on screen.
What you describe is a frequent problem: mixed encodings in one text. If you are old enough, you will remember many such problems on the web. A text file carries no tag or metadata about its code page, so if you mix encodings you simply see the characters interpreted according to the active code page; change the screen's code page and you get a new interpretation of the same bytes.
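To see that reinterpretation for yourself, here is a small illustrative Python snippet (nothing DOS-specific, just the same idea): one stored byte, decoded under two of the code pages mentioned in the question.

# The same byte renders as a different character depending on the code page.
data = b'\x94'
print(data.decode('cp860'))  # 'õ' under the Portuguese code page
print(data.decode('cp865'))  # 'ö' under the Nordic code page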
You can do anything you want in your own file. It's communicating how to read it to others that would be a problem.
So, no, not really. Using more than one character encoding in a file and calling it a text file would be more trouble than it's worth.
The settings of an operating system do not have a direct relationship to the contents of a file. Programs that exchange files between systems (such as over the Internet) might use an understanding of the source character encoding and the local character-encoding setting and do a lossy transcoding.
Nothing has changed except that, with the advent of Unicode more than 25 years ago, more scripts than you can imagine became available in one character set. So, if there is any transcoding to be done, ideally it would only be to UTF-8.
I wrote a script with German special characters e.g. ü.
However, whenever I close R and reopen the script the characters are substituted:
Before "für"; "hinzufügen"; "Ø" - After "für"; "hinzufügen"; "Ã".
I tried to remedy it using "Save with Encoding..." and choosing UTF-8, as stated here, but it did not work.
What am I missing?
You don't say what OS you're using, but this kind of thing really only happens on Windows nowadays, so I'll assume that.
The problem is that Windows has a local encoding that is not UTF-8. It is commonly something like Latin1 in English-speaking countries. I'm not sure what encoding people use in German-speaking countries, if that's where you are. From the junk you saw, it looks as though you saved the file in UTF-8, then read it using your local encoding. The encodings for writing and reading have to match if you want things to work.
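You can reproduce this kind of mangling outside of R with a few lines of Python (the file name is made up): write the text as UTF-8, then read it back assuming Latin-1.

import io

# Write "für" as UTF-8 ...
with io.open('demo.txt', 'w', encoding='utf-8') as f:
    f.write(u'f\u00fcr')

# ... and read it back with the wrong (Latin-1) encoding: you get "fÃ¼r".
with io.open('demo.txt', 'r', encoding='latin-1') as f:
    print(f.read())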
In RStudio you can try "Reopen with encoding..." and specify UTF-8, and you'll probably get your original back, as long as you haven't saved it after the bad read. If you did that, you've got a much harder cleanup to do.
Motivation: I am rewriting a doc -- text files to be processed later. The new sources now use UTF-8. Large portions of the sources are the same. I need to find differences.
Details: The old doc sources use the cp1250 encoding, the new sources use the UTF-8. Both new and old sources use the same line endings (CR+LF). I am using the Unicode version of the WinMerge application (WinMergeU.exe), version 2.12.4.0.
It almost works, but... When the lines differ, they are initially marked as a block in dark yellow, and the differing portions are marked in a lighter colour. When moving the red block cursor there, the panes below show the differing parts.
However, a block of text is also marked in dark yellow in cases where (the Unicode representation of) the text is the same. The red block cursor also moves to those portions of the files. In such cases, the two panes below (that show the differences) contain the same text and nothing is marked as different. See the picture below:
The very first line differs -- this is OK. But the second line has visually the same content. The only character outside the ASCII range there is Ú. It has a different byte representation in the two source encodings. This causes the line to be marked as different, but the panes below do not mark anything in the line as different.
See also the following paragraphs, which are exactly the same (only the encoding of the sources differs; the same line endings are used).
It looks as if the initial comparison were based on the binary representation of the lines. Is there any setting to tell WinMerge that the comparison (I mean the block marking) should be based on the Unicode content?
I tried hard, but no luck, yet.
Update: The above question was for the latest stable 2.12.4. The beta version 2.13.22 works just perfectly for me. See my answer below.
This doesn't really answer your question about WinMerge, but have you considered using another diff program? One of my favorites is KDiff3 - http://kdiff3.sourceforge.net/
When I do a compare in KDiff3 using one UTF-8 file and another 'Unicode' (UTF-16) file, I get the following:
Here is the compare screen - note that the encodings of the files are different, but the files are considered to be equal from a text standpoint:
I think it really should not be the task of a merge tool to allow the merging of files stored in different encodings.
An encoding is a function that maps bytes (stored on the disk or in memory) to characters (displayed on screen). Unfortunately, by default the encoding of a file is not stored together with the file. Therefore, any program that wants to open the file and display its contents needs to guess the encoding. While this sometimes works, it is also an error prone procedure.
Now, the character sets of different encodings do not fully overlap in general. So what is the merge tool supposed to do if you merge a character C from file A in encoding X into file B in encoding Y, and character C is not part of the character set of encoding Y?
Thus, I think the task of a merge tool should be to merge the binary content. Anything else is a dirty hack and doomed to fail at some level. (A merge tool maker may decide to provide character-level merging, which might also work most of the time. But there is some guesswork involved.)
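To make that concrete, here is an illustrative Python snippet (this is not something WinMerge does; it just shows the underlying problem): a character that exists on the UTF-8 side simply has no byte assigned to it in cp1250, so a merge into the cp1250 file has to either fail or guess a replacement.

c = u'\u03b1'                            # GREEK SMALL LETTER ALPHA, fine in UTF-8
try:
    c.encode('cp1250')                   # fails: cp1250 has no such character
except UnicodeEncodeError as err:
    print(err)
print(c.encode('cp1250', 'replace'))    # prints '?' -- a lossy guess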
Therefore, I'd also recommend you first translate the old files to UTF-8 and then merge those with the new versions.
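A minimal sketch of that conversion step in Python (the file names are placeholders): read the old file as cp1250 and write it back out as UTF-8, so both sides of the diff end up in the same encoding.

import io

# newline='' keeps the original CR+LF line endings untouched.
with io.open('old_cp1250.txt', 'r', encoding='cp1250', newline='') as src:
    text = src.read()

with io.open('old_utf8.txt', 'w', encoding='utf-8', newline='') as dst:
    dst.write(text)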
Just for your information. The question was for the latest stable 2.12.4. I have tried the beta version 2.13.22, and it works just perfectly for me. See the difference for exactly the same files -- only the first lines in the files were removed. (My big thanks to the authors.)
Edit -> Options
Select 'Compare' from categories pane on left.
Check box 'Ignore carriage return differences' (UNIX, Windows, Mac)
I would recommend converting the files to the same encoding before diffing.
If you are working with a version control system I'd recommend the following:
Create a fresh checkout of the files
Convert all files to UTF-8
Commit the files
Copy your new files over
Use WinMerge
That way you end up with two commits in the history - one for the encoding change and another for the content changes - and WinMerge will work as expected.
What about the File -> File Encoding... option in WinMerge? It allows you to set the encoding for each file independently.
I'm trying to store some Windows PowerShell scripts in a Mercurial repository. It seems the PowerShell editor likes to save files as UTF-16 Unicode. This means that there are lots of \0 bytes, which is what Mercurial uses to distinguish between "text" and "binary" files. I understand that this makes no difference to how Mercurial stores the data, but it does mean that it displays binary diffs, which are kind of hard to read. Is there a way to tell Mercurial that these really are text files? Presumably I would need to convince Mercurial to use an external Unicode-aware diff program for particular file types.
This may not be relevant to you; read the last paragraph if it doesn't sound like it is.
I'm not sure whether this is what you need, but I've needed diffs of UTF-16LE content rather than just "binary files are different"; when I searched around some months ago I found a thread and a bug report discussing it; here's part of it. I can't find the original source of this mini-extension now (though it does just what that patch does), but what I got was an extension, BOM.py:
#!/usr/bin/env python
from mercurial import hg, util
import codecs

# Byte-order marks that identify a file as text even though it contains NUL bytes.
boms = [
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE
]

def binary(s):
    if s:
        # A file starting with a known BOM is treated as text...
        for bom in boms:
            if s.startswith(bom):
                return False
        # ...otherwise fall back to Mercurial's usual NUL-byte heuristic.
        return '\0' in s
    return False

def reposetup(ui, repo):
    # Monkey-patch Mercurial's text/binary detection.
    util.binary = binary
This gets loaded in the .hgrc (or your users\username\mercurial.ini) like this:
[extensions]
bom = ~/.hgexts/BOM.py
Note the path will vary between Windows and Linux; on my Windows copy I put the path as \...\whatever (it's on a USB disk where the drive letter can change). Unfortunately relative paths are taken relative to the current working directory rather than the repository root or any such thing, but if you are saving it on your C: drive, you can just put the full path.
In Linux (my main development environment), this works well; in Command Prompt (which I still use regularly), it generally works well. I've never tried it in PowerShell, but I would expect it to be better than Command Prompt in its support for arbitrary null bytes in the command line.
I'm not sure if this is what you want at all; from the way you've said "binary diffs" I suspect you may already have this, or be doing hg diff -a, which achieves the same thing. In that case, all I can think of is writing another extension which takes the UTF-16LE and attempts to decode it to UTF-8. I'm not sure of the syntax for such an extension, but I might try it out.
Edit: having now trawled the mercurial source through commands.py, cmdutil.py, patch.py and mdiff.py, I see that binary diffs are done with a base85 encoding (patch.b85diff) rather than the normal diff. I wasn't aware of that, I thought it just forced it to diff it. In that case, perhaps this text is relevant after all. I await a response to see if it is!
I have worked around this by creating a new file with NotePad++ and saving it as a PowerShell file (.ps1 extension). NotePad++ will create the file as a plain text ANSI file. Once created I can open the file in the PowerShell editor and make any changes as necessary without the editor modifying the file encoding.
Disclaimer: I encountered this just moments ago, so I am not sure if there are any repercussions, but so far my scripts appear to work as normal and my diffs are showing up nicely.
If my other answer does not do what you want, I think this one may; although I haven't tested it on Windows at all yet, it's working well in Linux. It does what is potentially a nasty thing, in wrapping mercurial.mdiff.unidiff with a new function which converts utf-16le to utf-8. This will not affect hg st, but will affect hg diff. One potential pitfall is that the BOM will also be changed from UTF-16LE BOM to the UTF-8 BOM.
Anyway, I think it may be useful to you, so here it is.
Extension file utf16decodediff.py:
import codecs
from mercurial import mdiff

unidiff = mdiff.unidiff

def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
    """
    A simple wrapper around mercurial.mdiff.unidiff which first decodes
    UTF-16LE text.
    """
    if a.startswith(codecs.BOM_UTF16_LE):
        try:
            # Gets reencoded as utf-8 to be a str rather than a unicode; some
            # extensions may expect a str and may break if it's wrong.
            a = a.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass
    if b.startswith(codecs.BOM_UTF16_LE):
        try:
            b = b.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass
    return unidiff(a, ad, b, bd, fn1, fn2, r, opts)

mdiff.unidiff = new_unidiff
In .hgrc:
[extensions]
utf16decodediff = ~/.hgexts/utf16decodediff.py
(Or equivalent paths.)