Does Notepad convert an ANSI file to Unicode before displaying it? - winapi

If the Notepad Edit control is Unicode, then when loading an ANSI file does Notepad first convert its contents to Unicode and then display it, or does Notepad keep two memory buffers (one for ANSI and one for Unicode)?

Yes, Notepad does a conversion, which is evident from the fact that it calls IsTextUnicode() to discover the text's encoding when no BOM is present, and thus suffers from the infamous "Bush hid the facts" bug, which is discussed on Raymond Chen's blog:
Some files come up strange in Notepad
The Notepad file encoding problem, redux
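For illustration, here is a minimal TypeScript sketch of that convert-on-load approach (the function name and the no-BOM heuristic are mine, not Notepad's; it assumes Node.js with full ICU so TextDecoder accepts the labels used below):
import { readFileSync } from "node:fs";
// Hypothetical loader: detect the encoding once, then keep a single
// Unicode string in memory, as a Unicode edit control would.
function loadAsUnicode(path: string): string {
  const bytes = readFileSync(path);
  // BOM checks come first; these are unambiguous.
  if (bytes[0] === 0xff && bytes[1] === 0xfe)
    return new TextDecoder("utf-16le").decode(bytes.subarray(2));
  if (bytes[0] === 0xfe && bytes[1] === 0xff)
    return new TextDecoder("utf-16be").decode(bytes.subarray(2));
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf)
    return new TextDecoder("utf-8").decode(bytes.subarray(3));
  // No BOM: guess, as IsTextUnicode() does. A crude stand-in heuristic:
  // ASCII text stored as UTF-16LE has a zero in every second byte.
  let zeros = 0;
  for (let i = 1; i < bytes.length; i += 2) if (bytes[i] === 0) zeros++;
  const looksUtf16 = zeros * 4 > bytes.length;
  return new TextDecoder(looksUtf16 ? "utf-16le" : "windows-1252").decode(bytes);
}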

Ghostscript - Indentation of postscript code

Is there an option for me to ask Ghostscript to indent the PostScript it creates?
Everything starts at the beginning of a line and I find it difficult to follow.
Alternatively, I am using Emacs and ps-mode.
If anyone knows how to indent code in this mode, I would appreciate a tip (apologies, as this may not be relevant to this StackExchange).
No, there is no option for indenting the output.
PostScript is pretty much regarded as a write-only language anyway, and the output of ps2write (which is what I assume you are using, though you don't say) is particularly difficult, since it fundamentally outputs PDF syntax with a PostScript program on the front to parse it back into PostScript operations.
Why do you want to read it?
[EDIT]
You can always edit your question, you don't need to post a new answer.
I'm afraid what you want to do isn't as simple as you might think.
It might be possible for this use case if the PDF files you receive are always created the same way, but there are significant problems.
The font you use as a substitute for the missing font must be encoded the same way. Say, for example, the font in the PDF file is encoded so that 0x41 is 'A'; you need to make sure that the replacement font is also encoded so that 0x41 is an 'A'. So just the findfont, scalefont, setfont sequence is not always going to be sufficient; sometimes you will need to re-encode the font.
CIDFonts will be a major stumbling block. Firstly because ps2write simply doesn't emit CIDFonts at all. These were not part of level 2 PostScript. As a result all text in a CIDFont will be embedded as bitmaps. If your original file doesn't contain the CIDFont then you'll get the fallback CIDFont bitmapped.
Secondly CIDFonts can use multiple-byte character codes, of variable length. You can't simply replace a CIDFont with a Font, it just won't work.
The best solution, obviously, is to have the PDF files created with the required fonts embedded. This is best practice. If you can't get that, then I'd suggest that rather than trying to hand-edit PostScript, you use the fontmap.GS and cidfmap files which Ghostscript uses to find fonts.
Ghostscript already has a load of code to do font substitution automatically, using both Fonts and CIDFonts as substitutes, and it does all the hard work of re-encoding the fonts or building CMaps as required. If you are on Windows, much of this may already be done for you: when you install Ghostscript it will ask if you want to create font mappings. If you said yes, then it will already have created mappings for the fonts installed on your system.
Add the font substitutions you want to use in those files (they have comments explaining the layout) and then use the pdfwrite device to make a new PDF file. Set EmbedAllFonts to true (you may need to add an AlwaysEmbed font array as well, listing the fonts specifically) and SubsetFonts to false.
That should create a new PDF file where the missing fonts have been replaced by your defined substitutes; those substitutes will have been embedded in the new PDF file, and they will not have been subset (Acrobat will generally refuse to edit text in a subset font).
The switches I mentioned above are standard Adobe Distiller parameters, but they are also documented for pdfwrite in the Ghostscript documentation, which likewise covers adding fonts and, specifically, CIDFont substitution.
Basically I'd suggest you define your substitutions and let Ghostscript do the work for you.
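As a hedged illustration, a pdfwrite run with those switches might look like this (the input and output file names are placeholders; check the documentation for your Ghostscript version):
gs -sDEVICE=pdfwrite -dEmbedAllFonts=true -dSubsetFonts=false -o out.pdf in.pdf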
This is not an answer to the problem but rather an answer to KenS's question about "Why do you want to read it?"
I tried to put it in the comment box but it was too long.
I am a retired engineer with a strong programming background.
I would like to read and understand the PostScript code for the reason shown below.
I play duplicate bridge as a hobby. I receive a PDF file of what is known as a convention card (a single-page document of bridge agreements).
Frequently I would like to edit these files.
When I open one with Adobe Illustrator, I have to spend a significant amount of time replacing fonts that are not on my system with fonts that I do have.
I can take the PDF and export it as a PostScript file using Ghostscript.
I was going to write a little program to replace the missing fonts with the fonts that I use as substitutes.
The plan was to leave the PostScript file otherwise unaltered and insert things like
/HelveticaMonospacedPro-RG findfont
12 scalefont setfont
just above where the text is written.
I was planning on using the fonts that I have on my system (e.g., HelveticaMonospacedPro-RG).

editing files with bitbucket adds this to the start of files: M-oM-;M-?

Firstly, what is M-oM-;M-? ?
When I push a commit to bitbucket, and someone uses the online editor to make a small change, it changes the first line from:
<?xml version="1.0" encoding="utf-8"?>
to:
M-oM-;M-?<?xml version="1.0" encoding="utf-8"?>
I can see these special characters using cat -A <myfile>
This is a problem because this breaks my *.csproj files and fails to load projects in Visual Studio.
Bitbucket Support gave me articles about .gitattributes, and config, which I've already tried, but the issue persists:
$ git config core.autocrlf
true
$ cat .gitattributes
*.js text
*.cs text
*.xml text
*.csproj text
*.sln text
*.config text
*.cshtml text
*.json text
*.sql text
*.ts text
*.xaml text
I've also tried:
$ cat .gitattributes
*.js text eol=crlf
*.cs text eol=crlf
*.xml text eol=crlf
*.csproj text eol=crlf
*.sln text eol=crlf
*.config text eol=crlf
*.cshtml text eol=crlf
*.json text eol=crlf
*.sql text eol=crlf
*.ts text eol=crlf
*.xaml text eol=crlf
Is there some setting that I'm missing to help prevent this set of characters from being inserted into the start of my files?
First: M-o, M-;, and M-? are representation techniques to show non-ASCII characters as ASCII. Specifically, they're an encoding technique to show that bit 7 (0x80) is set, and the remaining bits are then displayed as if the characters were ASCII. Lowercase o is code 0x6f, ; is 0x3b, and ? is 0x3f. Putting the high bit (0x80) back into all three, and dropping the 0x and using uppercase, we get the values EF, BB, and BF. If nothing else, you should memorize this sequence—EF BB BF—or at least remember that it exists, because it's the UTF-8 encoding of a Unicode Byte Order Mark or BOM, U+FEFF (which you should also memorize, at least that it exists).
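To make the bit arithmetic concrete, here is a tiny TypeScript check (purely illustrative):
// Recover the real bytes from the M- notation: set bit 7 (0x80) on each character.
const bom = ["o", ";", "?"].map((c) => (c.charCodeAt(0) | 0x80).toString(16).toUpperCase());
console.log(bom.join(" ")); // EF BB BF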
For more on Unicode in general, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
When storing Unicode as UTF-16, the byte order mark has a purpose: it tells you whether the stored data is UTF-16-LE or UTF-16-BE. But when storing Unicode as UTF-8, the byte order mark is almost entirely useless. I personally believe it should never be used. Microsoft, on the other hand, apparently believes it should always (or almost always) be used. See the Wikipedia quote below.
... and someone uses the online editor ...
This online editor, apparently, is either written by Microsoft, or by someone who thinks Microsoft is correct. They are inserting a UTF-8 byte order mark in your plain-text file.
Bitbucket Support gave me articles about .gitattributes ...
Unless the online editor looks inside .gitattributes files, this won't help: it's that editor that is adding the BOM.
That said, since Git 2.18, Git has had the notion of a working-tree-encoding attribute. Some editors might actually look at this. I may not understand the Microsoft philosophy correctly—I already noted that I disagree with it. I think, though, that they say: store a BOM in any UTF-8 encoded file if the "main" copy of that file should be stored in UTF-16 format. (Side note: the UTF-8 BOM tells you nothing about whether the UTF-16 file would be UTF-16-LE or UTF-16-BE, so—again in my opinion—it's pretty useless as an indicator. See also In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?)
In any case, if this editor does look at some configuration option, setting the configuration option—whatever it is—would help. If it does not, nothing you do here will help. Note that working-tree-encoding, while related to Unicode encoding, does not imply that a BOM should or should not be included. So, if your Git is 2.18 or later, you have this extra knob you can twiddle, but that's not what it's for. If it does actually help, that's great, but also quite wrong. :-)
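For concreteness, such an attribute looks like this in .gitattributes (shown purely to illustrate the knob; as noted, it controls how Git re-encodes the file between the repository and the working tree, and it is not a BOM switch):
*.csproj text working-tree-encoding=UTF-16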
The thing that's weirdest about this is:
[The BOM] breaks my *.csproj files and fails to load projects in Visual Studio.
Visual Studio is a Microsoft product. The Wikipedia page notes that:
Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII.
One would think that if their editors insist on adding BOMs, their other programs would be able to handle BOMs.
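If you need a stopgap on the receiving side, a small TypeScript sketch (a hypothetical helper, not a Bitbucket or Git feature; assumes Node.js) can strip the BOM again after such an edit:
import { readFileSync, writeFileSync } from "node:fs";
// Remove a leading UTF-8 BOM (EF BB BF) from a file, if present.
function stripUtf8Bom(path: string): void {
  const bytes = readFileSync(path);
  if (bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    writeFileSync(path, bytes.subarray(3));
  }
}
stripUtf8Bom("MyProject.csproj"); // hypothetical file name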

Visual Studio, cshtml file, understanding how arabic characters are treated

Take the character "ب". It shows correctly on Stack Overflow, and I can see it in a cshtml file and in a js file.
The character "ُ", on the other hand, shows here correctly. However, it shows as a question mark in the cshtml file and the js file. If I copy it to Notepad it shows as a Dammah (a loop normally drawn above a letter which indicates a 'u' sound).
Why is it a question mark in the cshtml file if Notepad understands it? Also, Visual Studio understands other Arabic characters, so why not this one?
All I can think of is that a Dammah (as far as I know) always sits above another letter, so it can't be used in isolation?
What I'm trying to do is detect words that have a Dammah in them via Javascript
I'm completely new to Unicode and non-ASCII characters, so this may be a stupid question; apologies if so.
It often happens that an application uses a default font that does not support all the Unicode characters a use case requires. In such a case, try changing the font to a more compatible one. "Courier New" works mostly well, and "Arial Unicode MS" also does a good job. But no font covers absolutely everything, so you may need to switch between two or three fonts to cover all required characters. For Arabic, "Arial" is a good choice, but there are many interesting alternatives.
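As for the actual goal of detecting words with a Dammah from JavaScript: the Dammah is the combining mark U+064F, so the check does not depend on fonts at all. A minimal sketch in TypeScript (the function name is mine):
// Return the words in a string that contain a Dammah (U+064F).
function wordsWithDammah(text: string): string[] {
  return text.split(/\s+/).filter((word) => word.includes("\u064F"));
}
console.log(wordsWithDammah("كُتُب وقلم")); // only the first word carries a Dammah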

Why is UTF8 saved as 16 bit encoding from two Mac OSX programs?

I'm having a fairly large problem with text encoding, and in the process of trying to solve it I saved a text file consisting entirely of non-exotic characters, from TextEdit and TextWrangler on Mac OS X, choosing UTF-8, only to find that each character occupied 16 bits when I viewed the files in a hex editor. This seems wrong to me. What am I missing? Are these bugs I should be reporting?
My mistake. The text editors are being very clever and I was underestimating them.
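One quick way to verify what an editor actually wrote: for plain ASCII text, UTF-8 uses one byte per character while UTF-16 uses two (plus two more for a BOM). A short TypeScript check of that arithmetic (assuming Node.js):
import { Buffer } from "node:buffer";
// For ASCII-only text, UTF-8 is 1 byte per character, UTF-16 is 2.
const text = "hello";
console.log(Buffer.byteLength(text, "utf8"));    // 5
console.log(Buffer.byteLength(text, "utf16le")); // 10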

Text editor/viewer with ANSI codes rendering support for Windows [closed]

I need some tool to display text containing ANSI codes correctly on Windows. No full support needed, but at least coloring/bold is a must.
Reason: My logger/debug module produces nicely rendered rich output with important sections colored using ANSI codes. This helps a lot when debugging on the serial terminal, but if I dump the debug output to a file or copy-paste it into a text editor on Windows (interactive remote debugging is not always viable), at best all the ANSI codes are stripped, and at worst they are rendered as junk characters obscuring the real data. Rudimentary editing capabilities would be appreciated, to be able to pick out specific parts, annotate, and so on.
The open-source editor Atom has the package language-ansi-styles. It supports all kinds of formatting except the ;r;g;b (24-bit color) sequences.
You might have some more luck with ASCII/ANSI utilities, like the ones listed here:
List of ASCII/ANSI/NFO utilities
Note: some files on that page might be outdated; you might find newer versions of these utilities on their respective homepages.
For example, the latest version of NFOPad can be found here.
I've been looking for a solution to display the ANSI colors as well (for program debug output readability) and stumbled upon Sublime Text (paid software with a trial; http://www.sublimetext.com/) with the ANSIescape package (https://github.com/aziz/SublimeANSI, also installable through Package Control).
It supports coloring and the bold escape is recognized but not displayed, although a special color can be assigned to it in the settings file. Also worth noting that this plugin shows text in read-only mode, and needs to be turned off if editing is necessary.
A screenshot is provided on the GitHub page, and I have personally tried it and verified that it works.
If you're primarily interested in viewing the file instead of editing it, Ansifilter will convert it to HTML, which you can then view and at least search in your browser, or to RTF if WordPad would be good enough (hard to imagine). It also looks like there's a Notepad++ plugin version on the download page, so that might be perfect if it lets you load the result into Notepad++.
http://www.andre-simon.de/doku/ansifilter/ansifilter.html
There's also a different plugin for vim which colors text according to ANSI codes.
http://www.vim.org/scripts/script.php?script_id=302
However, while it highlights the text in the correct color, it leaves the ANSI codes themselves in there (in a faded, near-background color) which probably will mess up any alignment formatting in the file, as well as making it harder to move around the file (lots of "empty space" to wade the cursor through, searching for a word won't match if there's an ansi code in the middle of it, etc.). There's a patch it can take advantage of to hide the codes too, but that would require patching and then recompiling vim itself from source.
Yeah, suggesting vim is pretty unhelpful if you aren't a vim user already, it has too huge of a learning curve, I know. But it might be useful to the vim users out there.
I know it won't be of much help - but I was looking for the exact same thing on linux; was just trying to view some log outputs that had bash ANSI color codes inside. Unfortunately, those ANSI color codes were spread across several lines - meaning 'cat'-ing the file and piping into 'less -R', 'most' and similar tools, would simply display the starting line where the color originated, but not the subsequent lines that should've been colored.
Funnily enough, I thought the usual Linux tools like 'nano', 'gedit', 'vim' and whatnot would have capabilities for ANSI color codes in a text file, but information on ANSI color support in these editors is very sparse. I've only found info on ANSI color for the text editor 'joe':
Cheap ANSI Color! - http://tldp.org/LDP/LG/issue01to08/articles.html#ansi
but couldn't get the recommendations there to work (also couldn't get 'emacs' to work either, at least not by directly reading a text file with ANSI color characters inside).
The good thing: it seems that what you need, if you need ANSI color in text, is to look for ASCII art / NFO utilities as recommended above. The one that I finally found, and that worked for me, was tetradraw (via www.linux.org/apps/AppId_42.html; it can be installed with sudo apt-get in Ubuntu). Actually, tetradraw is the name of the drawing/editor part; there is a separate viewer that also handles ANSI color codes, tetraview.
Well, who would have thought that you need to track down an ASCII art utility in order to read log files :)
Anyway, I hope this may somehow help in the further search for ANSI color text editors for Windows, too. Cheers!
If you just want to view then the terminal program "Tera Term" can do this. Just click "File" -> "Replay Log" and select your file containing the ANSI codes.
You can download Tera Term here:
http://logmett.com/index.php?/download/tera-term-477-freeware.html
In Emacs, just eval the following before opening your .nfo file:
(add-to-list 'auto-coding-alist '("\\.nfo\\'" . cp437-dos))
I spent a while testing multiple programs from the URL referred to by Andras Vass, with no results (they either don't show colors, or they keep showing the ANSI codes as a mess of characters).
Tired of searching, I finally found Ansifilter (not the Notepad++ plugin referred to by Jeffson), the only one that works for me.
I have added it to the Windows context menu, so I can now easily open my ANSI text files.
I would be surprised if Emacs can't do that, at least with the embedded shell.
There are:
http://www.emacswiki.org/emacs/AnsiTerm
http://www.emacswiki.org/emacs/MultiTerm
http://www.emacswiki.org/emacs/ansi-color.el
Update: as has been pointed out, those are just terminal-output colorizers. But you can edit the shell buffer contents in Emacs too, e.g. cat the file and colorize it.
But wait a minute, I had just found these:
http://vaperized.com/ansiexpress.htm
http://www.syaross.org/thedraw/
http://picoe.ca/products/pablodraw/
If the debug logging of your application goes through one class/function, you could try to split the output so that:
ANSI-like logging is shown on the terminal/console
HTML-like logging is written to file
For your application, all logging goes to this class, and this class splits the output between terminal/console and file.
Make a 'standard' in your logging class for specifying colors and boldness (e.g. predefined codes like Ctrl-A meaning red and Ctrl-B meaning bold, specific methods in the logging class for setting the color and boldness, or maybe even the ANSI codes themselves), and translate this in your central logging class to:
the correct ANSI codes on terminal
the correct HTML codes in file
Alternatively, I think that instead of HTML you could also use rich text (RTF), but I don't know all the possibilities of rich text, so you may have to look this up.
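A minimal TypeScript sketch of that split-logger idea (the class name, methods, and tiny color table are all mine, purely illustrative; assumes Node.js):
import { appendFileSync } from "node:fs";
// One logging entry point; ANSI goes to the console, HTML goes to a file.
// Real code should also HTML-escape the message text.
class SplitLogger {
  constructor(private htmlPath: string) {}
  log(message: string, color: "red" | "green" | "none" = "none"): void {
    const ansi = { red: "\x1b[31m", green: "\x1b[32m", none: "" }[color];
    const reset = color === "none" ? "" : "\x1b[0m";
    console.log(ansi + message + reset);
    const html = color === "none" ? message : '<span style="color:' + color + '">' + message + "</span>";
    appendFileSync(this.htmlPath, html + "<br>\n");
  }
}
const logger = new SplitLogger("debug.html");
logger.log("starting up");
logger.log("checksum mismatch", "red");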
You could try Notepad++ (see http://notepad-plus.sourceforge.net/uk/site.htm). It's pretty powerful (Scintilla-based) and has an option to view non-printable characters (like line breaks and the like).
