Parse out ^M characters - macos

I've got a Cocoa app that parses text from a standard text file. When using terminal programs like nano and pico to edit the file, I'll sometimes notice that a ^M character shows up. I can't reproduce this on later versions of Mac OS X, but it seems to abound in version 10.5.
Oddly, when I take a file that has the ^M character in it from a 10.5 system, it magically goes away in 10.6+; I'm assuming this is because Mac OS started converting the Linux-style line endings into Mac-style ones (??). Consequently, this has made the problem somewhat complicated to fix at debug time, since I only have Xcode installed on 10.7.
I need a way to find the ^M character and replace it with something more standard (like \n) while I'm parsing the file. What type of character do I need to look for? It doesn't seem to be a \n and likewise, no combination of \r\n seems to do the trick either. The ^M still hangs around.

This is easy to manage within Xcode.
Simply select the file whose line endings you want to change, then open the utilities panel and change the Line Endings setting.

I'm not a macOS user, in general, so I'm only guessing this will work. On typical Unix-like systems, you can press Ctrl-V to make the next control character you type be inserted literally instead of interpreted. Thus, for instance, you can get the ^M you want by first pressing Ctrl-V, then pressing your Return key. Note that the ^M that appears on your screen in this case is not the same (from the perspective of your software) as the ^M you get by typing the ^ character followed by the M character. In this manner, you can do things like use regular expressions to replace that ^M control character with instances of \n instead.
You could also, as Jakrabbit suggests, use dos2unix to filter the file -- assuming it's available on your Mac.
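For instance, a minimal sketch of that approach from the shell (the file names here are made up):
# Delete every carriage return -- the character nano displays as ^M.
tr -d '\r' < input.txt > output.txt
# Or with sed, using bash's $'...' quoting to produce a literal CR;
# interactively you could type Ctrl-V Ctrl-M at that spot instead.
sed $'s/\r$//' input.txt > output.txt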

"^M" is a representation of the ASCII line feed character. This is character code 13 in ASCII (and UTF-8), so when parsing the file, look for characters with a value of 13 and just ignore them.

^M is the carriage return character, which DOS/Windows puts at the end of every line.
I would just use the dos2unix program to get rid of them all.
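Assuming dos2unix is installed (on a Mac it typically isn't by default, but Homebrew packages it), the conversion is a one-liner; the file name below is made up:
brew install dos2unix   # only if it's missing
dos2unix notes.txt      # rewrites CRLF line endings as LF, in place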

Related

Weird wrapping of bash prompt with coloring (`\[` and `\]` being used)

I was working on my own bash prompt when I hit strange behaviour (in both iTerm and Terminal.app on macOS). I managed to boil it down to this minimal working example:
~/.bash_profile:
WHITE="\[\033[1;37m\]"
COLOR_NONE="\[\033[0m\]"
# This always wraps correctly
PS1="\u:\w"
# This (added coloring) wraps incorrectly when using small window.
# PS1="${WHITE}\u:\w${COLOR_NONE}"
Now create a long directory name say
mkdir ~/very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name_very_long_name
and cd to it. The problem is: with the first version of PS1 it wraps perfectly, while adding just one color breaks wrapping in small windows. Can anybody clarify or give a workaround? Adding \n to the end is an option, but it looks ugly for short prompts.
Thanks!
Sources I have already seen:
BashFAQ/053 about colors
SO question about \[ and \]
Issue with \x01 and \x02
UPD:
bash version 3.2.57(1)-release (x86_64-apple-darwin17)
Bash has always had trouble handling long prompts with invisible characters, so the usual practice is to avoid them. For example, you could automatically trim the length of the prompt by omitting the beginning of the path if it is too long, or you could automatically output a trailing newline if the path is very long. (In such cases, you might want to use $COLUMNS, which will normally tell you the width of the window.)
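As a rough sketch of that second idea (the 20-column margin is an arbitrary choice, and ${#PWD} only approximates the width of \w, which abbreviates your home directory to ~):
WHITE="\[\033[1;37m\]"
COLOR_NONE="\[\033[0m\]"
set_prompt() {
  PS1="${WHITE}\u:\w${COLOR_NONE}"
  # If the current directory would eat most of the window, drop the
  # cursor onto its own line; $COLUMNS tracks the window width.
  if (( ${#PWD} > COLUMNS - 20 )); then
    PS1="${PS1}\n"
  fi
}
PROMPT_COMMAND=set_prompt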
In patch 19 to bash v4.4 (which I realise is not really relevant to your environment, since you seem to still be using the antique version of bash provided by a default OS X install), a long-standing bug was corrected which could trigger a segfault in certain very rare cases of multiline prompts containing both invisible and multibyte characters. The behaviour definitely changed between v4.4.18 and v4.4.19, but even with that patch, very long prompts cause problems once the prompt extends to a third line.
There is a comment in lib/readline/display.c which indicates that the readline library assumes that prompts will not exceed two lines. I suggest that you use that as a limit.

Searching for a character in the bash prompt

In vim, to find a character you can use 'f' to jump to the next matching character in the line your cursor is on, and 'F' to jump to the previous one.
Is there a way to move around like that on the bash command line?
I know that you can set bash to be in vi mode by saying set -o vi, and this works great on my local machine, but I administer a lot of machines where I can't change that setting.
Ignoring for a moment the security issues associated with everybody in your office sharing the same user, you could add a key binding to the readline command character-search:
# ~/.inputrc
C-]: character-search
To use the search, type Ctrl-] followed by the character you want to search for. You can bind the command to any key sequence, not just Ctrl-], but for obvious reasons you probably don't want to emulate vi mode by binding it to the letter f.
This would be less invasive than turning on vi mode so most users would probably not even notice the change. However, somebody could easily stumble upon your key sequence by accident and become very confused. You would also have to use three keystrokes instead of the two you're accustomed to with vi.
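If you want to experiment before committing anything to ~/.inputrc, bind applies the same setting to the current shell only; character-search-backward is the reverse-direction analogue of vi's F:
bind '"\C-]": character-search'
bind '"\e\C-]": character-search-backward'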

Weird space character in string, that's not a space?

I've been having trouble with a Capistrano script, or in fact, a bash command that was causing my script to fail. I kept on getting errors from the script saying:
No such file or directory
So here's the script bit.
run "sudo ln -s #{shared_path}/readme.txt  #{shared_path}/readme-symlink.txt"
Upon closer inspection it turns out that there are two spaces between the readme.txt and readme-symlink.txt bits. By accident I found that one is a space, and the other is a weird character that looks like a space but isn't. In Sublime Text, configured to display whitespace, only one whitespace dot shows up after readme.txt, followed by another "space" that isn't rendered as one.
So here's my question: what on earth is this character? I'm just so confused about how someone managed to get that in there by typing on a normal keyboard.
So I pasted the string at http://www.asciivalue.com/index.php, and the second space has a character code of 160. According to http://www.ascii-code.com/ this is a space, but a non-breaking space, which, I believe, the command line isn't too happy about.
Removing the nbsp fixes my script, and I can go on with my life again.
I'm just stumped about how the person that created the file got a nbsp in there in the first place.
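For what it's worth, on a Mac keyboard Option-Space types a non-breaking space, so a finger resting on Option while hitting the spacebar is one likely way it got in there. Here is a quick way to hunt these down from the shell (the file name is made up; 0xC2 0xA0 is the UTF-8 encoding of U+00A0):
# Show every line that contains a non-breaking space.
grep -n $'\xc2\xa0' deploy.rb
# Replace each one with an ordinary space, in place.
perl -i -pe 's/\xc2\xa0/ /g' deploy.rb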

How do I make Emacs dired mode display unicode characters in windows?

I have emacs 23.3 running on windows XP and I work on some files whose filenames contain a combination of English & devanagari or tamil characters (e.g., que.प्रश्न.txt or ans.பதில்.txt).
When I visit the directory containing these files in Dired, the file names don't appear correctly, even though I can see them in Windows Explorer. Dired displays names like "deva~1.txt" for filenames that begin with English characters, but for names composed entirely of non-English characters it displays something like "47d1~1.txt".
I suppose this has something to do with what Windows internally returns to Emacs, but I notice that running dir at a command prompt in the same directory displays the full names (even though cmd renders all the non-English characters as ? symbols).
Is there any way I can get Dired to render filenames with non-English characters correctly?
It's actually a limitation of Emacs's implementation. Emacs uses Windows primitives that date back to before Unicode, so any filename with chars that cannot be encoded in your "codepage" will be replaced with the mangled foo~1 name (if your file system is VFAT) or something else in other cases. Hopefully we will soon switch over to the "new" Windows primitives that use UTF-16 (IIRC) and do not suffer from such problems any more. But you may have to wait for Emacs-25.1 for that. It may happen sooner if you give us a hand, tho ;-)

Emacs failure in loading charset map when saving file with unicode

I created an ordinary text file on Windows 7 64-bit using GNU Emacs 23.3.1. I can edit the file with other programs such as LinqPad (the file happens to be a LinqPad script, extension .linq). Everything is fine until I put a Unicode character in the file, a character such as the Greek letter λ (lambda). I can input the letter in emacs and it displays correctly. However, emacs refuses to save the file, reporting the following error:
Failure in loading charset map: 8859-7
If I input the λ in LinqPad, emacs will read and display it, but will not save the file.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λ's, but instead pairs of odd characters such as Î». That fits the intuition (or rather "untuition", pun intended) that the unicode chars are being stored as pairs of bytes. So it looks like this is a kind of ambiguous situation (storing unicode in text files), but it also looks like LinqPad and Visual Studio "do the obvious thing."
I want to use emacs because it's the only program that I have that reflows sequences of commented lines (lines after //, reflows them with Alt-Q), and I want to use greek characters in my comments because I'm describing a mathematical program.
I'll be grateful for advice and answers.
UPDATE: some advice in other questions said to try M-x describe-char, also bound to C-x = ; both of those give me the same failure message as above, so they're on the right track, just not answers.
This once happened to me when I had upgraded all packages (including Emacs) without realising I still had an Emacs session open during the upgrade. Next time I asked it to save some Unicode, it tried to load 8859-7 and failed because the path was different in the upgraded version. I had to redo the edit after restarting Emacs.
I just noticed that Notepad++ has other unexpected behavior with this file: it does not display the λs, but instead pairs of odd characters such as Î».
Î» is what you get when you interpret the byte sequence 0xCE, 0xBB using the encoding ISO-8859-1, or Windows code page 1252 (Western European). Code page 1252 is probably the default (‘ANSI’) code page on your machine.
0xCE, 0xBB is the UTF-8 encoding of the character λ (U+03BB Greek small letter lambda). So to display it correctly you need to tell your text editor that the file is saved in UTF-8 and not ANSI.
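You can watch that round trip happen from a Unix-ish shell (assuming a UTF-8 terminal and an iconv that knows CP1252):
printf 'λ' | xxd                               # shows the bytes ce bb
printf '\xce\xbb' | iconv -f CP1252 -t UTF-8   # prints Î»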
In Notepad++, choose UTF-8 from the menu bar ‘Encoding’ entry.
In Emacs, C-x C-m c utf-8-dos (or unix or whatever) as a prefix to opening or saving the file. Hopefully by saving in UTF-8 you'll avoid whatever the problem is with the ISO 8859-7 (Greek) map; you certainly don't want to be saving any files in 8859-7, or indeed anything but UTF-8, if you can help it.
