Viewing unicode strings in Intellij Idea debugger


How can I view Unicode strings in the IntelliJ IDEA debugger's data view?
For example, consider the following code:
String str = "सुझाव"; //some Unicode string: utf-8
String str1 = "\u0938\u0941\u091d\u093e\u0935";
System.out.println(str);
The debugger's data view does not display the variable str correctly; it shows garbage instead.
The Set Value field, however, shows the proper value.
Any suggestions or workarounds?
I am probably missing an encoding setting somewhere.

Check the project encoding. I had the same issue: IDEA was showing variable contents decoded as WIN-1252.
Under Editor > File Encodings there are two settings: IDE Encoding and Project Encoding. After changing Project Encoding to UTF-8, everything looked OK.
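To illustrate what a wrong encoding setting does, here is a minimal sketch that decodes UTF-8 bytes with a single-byte charset. It uses ISO-8859-1 as a stand-in for WIN-1252 (an assumption made so the round trip is lossless), and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class MojibakeSketch {
    // Decode UTF-8 bytes with a single-byte charset, as a misconfigured tool would.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the original bytes and decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0938\u0941\u091d\u093e\u0935";
        String garbled = misread(original);
        System.out.println(garbled.length());                  // prints 15: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original));  // prints true
    }
}
```

If the garbage in the data view has roughly three characters per original character, a mismatch like this is a likely cause.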

The issue is in the debugger font.
Since you can view your Unicode string in the code editor, the characters themselves and the encoding of the file are fine.
A font may not have glyphs for every Unicode character. That's the issue. The code editor and the debugger use different fonts, and the debugger's font doesn't have glyphs for those symbols. Unfortunately, it turns out you can't change the font of the debugger.
I've just had a similar issue resolved: Why Intellij IDEA doesn't display 𝔊 symbol?

Blank squares mean that those characters are not supported by your IDE font. Changing the IDE font should solve it (File -> Settings -> Appearance & Behavior -> Appearance).

This should work. I copied your code and I can see your characters in my debugger.
Perhaps your file encoding is wrong. Look at the lower right corner of your IntelliJ window: does it say UTF-8 or something else?
Alternatively, does it work if you use the "convert to Basic Latin" operation (position your cursor on the string and press Alt+Enter)?
String str = "\u0938\u0941\u091d\u093e\u0935"; //some Unicode string: utf-8
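One way to tell a font problem from an encoding problem is to look at the code points rather than the glyphs. The helper below is only a sketch (the class and method names are invented here); it could be called from a print statement or from the debugger's expression evaluation.

```java
class CodePointDump {
    // Render each code point as U+XXXX so the content can be checked
    // without relying on any font.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u0938\u0941\u091d\u093e\u0935"));
        // prints: U+0938 U+0941 U+091D U+093E U+0935
    }
}
```

If the dump shows the expected code points, the string is intact and only the display is at fault.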

Related

Has anyone heard of this strange bug with the standard Windows message box?

Years ago, I was messing around with Visual Basic and I discovered a bug with the MsgBox function. I tried searching for it, but nobody had ever said anything about it. It's not just with Visual Basic though; it's with anything that uses the standard Windows MessageBox API call.
The bug is triggered when the title text has more than one character, and the first character is a lowercase 'y' with an umlaut ('ÿ'). What's so special about this character? It's almost definitely not the character itself, but rather its character code that's special. 'ÿ' is character 255 (0xFF), meaning it's the highest value that can be stored in an unsigned byte, and all its bits are set to 1.
What does this bug do? Well, there are two different possibilities, which depend on the number of characters in the title text. If the title has an even number of characters (other than 2), no message box appears, and you just hear the alert sound. If the title has two characters, or any odd number of characters greater than 1 (with a single character the bug isn't triggered), then this happens:
And that's not all--the message will also be truncated to one line. It seems like the kind of bug that would have shown up in at least one semi-high-profile incident, considering how often this API call is used. Are there any reports of this on the Internet, or anything showing what could cause it? Maybe it's a Unicode-related glitch, like that "bush hid the facts" glitch in Notepad?
I made a program in case you want to play around with this; download it here.
Alternatively, copy the following into Notepad, save it with a .vbs extension, and double-click it to display the dialog box seen above:
MsgBox "Windows 3.1 font, anyone?", 0, "ÿ ODD NUMBER!"
Or for a different font:
MsgBox "I CAN HAS CHEEZBURGER?", 0, "ÿ HImpact"
EDIT: It seems that if the first four characters are ÿ's, it doesn't ever display the message, even if there's an odd number of characters.

This is a bug with dialog templates generally. It is not a message box bug as such.
For example, in Visual Studio create the default win32 application. In the .rc file, change the caption in the template for the about box from
CAPTION "About sampleapp"
to
CAPTION "ÿT"
and the bug will manifest itself when you display the about box.
In the DLGTEMPLATEEX documentation note that the menu and class name have type sz_Or_Ord which means either a null-terminated string or 0xFFFF followed by a single word resource identifier.
Windows incorrectly applies a similar scheme to the dialog title: if the first character is 0xFF then it treats the title as being two WORDs long, but only when it is trying to locate the font information. When it is displaying the title it correctly treats the title as a string.
In other words, Windows is looking for the font information inside the title string. In most cases this won't specify a valid font, so Windows defaults to the system font.
To prove this, I constructed a dialog template in memory (based on this). Once this was working I deleted the code that writes the font information to the template and used the dialog title "ÿa\xd\x200\x21SimSun". This displays the dialog in italic SimSun because Windows is reading the font information from the title string.
This bug is likely a hangover from 16-bit Windows, where (I guess) 0xFF was used as the resource ID marker.
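The sz_Or_Ord rule and the faulty variant described above can be sketched as follows. This is only a model of the behaviour as this answer describes it, not Windows' actual code; the class and method names are hypothetical, and skipTitleAsDescribed in particular encodes the answer's claim rather than anything documented.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class SzOrOrdSketch {
    // Offset just past a sz_Or_Ord field: either 0xFFFF followed by one
    // WORD ordinal, or a null-terminated UTF-16 string.
    static int skipSzOrOrd(ByteBuffer buf, int pos) {
        if ((buf.getShort(pos) & 0xFFFF) == 0xFFFF) {
            return pos + 4; // marker WORD + ordinal WORD
        }
        while (buf.getShort(pos) != 0) {
            pos += 2;
        }
        return pos + 2; // step over the terminating null WORD
    }

    // HYPOTHETICAL model of the faulty font lookup: a title whose first
    // character is 0x00FF is assumed to be exactly two WORDs long.
    static int skipTitleAsDescribed(ByteBuffer buf, int pos) {
        if ((buf.getShort(pos) & 0xFFFF) == 0x00FF) {
            return pos + 4;
        }
        return skipSzOrOrd(buf, pos);
    }

    // Wrap a string as little-endian UTF-16, the layout used in the template.
    static ByteBuffer utf16le(String s) {
        return ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_16LE))
                .order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        ByteBuffer title = utf16le("\u00FFTITLE\0");
        System.out.println(skipSzOrOrd(title, 0));          // prints 14: the real end of the title
        System.out.println(skipTitleAsDescribed(title, 0)); // prints 4: font data is read from inside the title
    }
}
```

Under this model the font fields are read starting at the third character of the title, which matches the experiment with the "ÿa..." title above.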

A strange bug. I suspect the symptoms are the result of the way the MessageBox() actually displays the dialog.
Internally, MessageBox() builds a dialog template dynamically. If you look at the description of a DLGTEMPLATE structure you'll find the following nugget of information:
In a standard template for a dialog box, the DLGTEMPLATE structure is
always immediately followed by three variable-length arrays that
specify the menu, class, and title for the dialog box. When the
DS_SETFONT style is specified, these arrays are also followed by a
16-bit value specifying point size and another variable-length array
specifying a typeface name.
So, the in-memory layout of a dialog template has the font specification immediately following the dialog box title.
Visual Basic does not use Unicode and so the function you're calling is actually MessageBoxA(). This is simply a thunk that converts the passed-in strings from multibyte to Unicode and then calls MessageBoxW().
I believe what's happening is that, for some reason, the conversion of that string from multibyte to Unicode is either going wrong, or returning a spurious length value. This has the knock-on effect, when the dialog template is built in memory, of corrupting the memory immediately following the title string - which, as we know, is the font specification.

How to display unicode Arabic string in VS output window?

I have a Unicode string in Arabic that I want to display in the output window rather than in the console, so I can only use OutputDebugStringW. I call SetConsoleOutputCP(1256) to set the Arabic code page, but the output is still just "????". What should I do?

This is a documented restriction for OutputDebugStringW():
OutputDebugStringW converts the specified string based on the current system locale information and passes it to OutputDebugStringA to be displayed. As a result, some Unicode characters may not be displayed correctly.
Calling SetConsoleOutputCP() doesn't solve the problem, because it changes the code page for the console window, not for the debugger. You'd have to change your system locale: Control Panel + Region, Administrative tab. If Arabic is your favorite language then changing it to 1256 is the appropriate thing to do. It will of course have system-wide effects.
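The "????" output is what a lossy Unicode-to-code-page conversion typically looks like. The sketch below reproduces the effect in Java; ISO-8859-1 stands in for a non-Arabic system code page (an assumption for illustration), and the names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyConversionSketch {
    // Round-trip text through a charset, roughly what a W -> A conversion
    // does: characters the charset cannot represent become '?'.
    static String throughCodePage(String text, Charset codePage) {
        return new String(text.getBytes(codePage), codePage);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062d\u0628\u0627"; // "marhaba" (hello)
        System.out.println(throughCodePage(arabic, StandardCharsets.ISO_8859_1)); // prints ?????
        System.out.println(throughCodePage(arabic, StandardCharsets.UTF_8).equals(arabic)); // prints true
    }
}
```

With a code page that does contain the Arabic letters, such as 1256, the round trip would preserve them, which is why changing the system locale helps.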

What is the Difference between "Csharp editor" and "Csharp editor with encoding"?

In the Open with menu of a .cs file there's Csharp editor and Csharp editor with encoding. I opened a solution with both and didn't see a difference.
What's the difference between them?

Unless your .cs file includes characters outside of the normal ASCII range, you won't see a difference in the actual contents of the file. The difference is whether or not the editor tries to detect the character encoding you saved your file with when you open it again, or asks you specifically.
By default, when you save a new .cs file, VS uses the current ANSI code page to encode the characters. (You can switch this to use UTF-8 by default with the appropriate options.) However, you can instead choose "Save with Encoding...", which will prompt you for the specific character encoding you want to save it in.
Internally, your code is being handled as UTF-16, since that's what Windows uses as its native string format. On disk, however, UTF-16 would most likely blow up your source files to double their size, since most of the C# code you write probably fits into a single byte per character. So, when writing to disk, VS writes out your data in a particular code page that defines how to convert the UTF-16 characters into some other, possibly 8-bit character set.
When you reload a file in VS, it attempts to figure out what encoding that file was in, and if it can't, it will fall back on the current ANSI code page. (You can force it to fall back to UTF-8 via some options, but it won't ever fall back to a different encoding.)
When you reload a file "With Encoding", you get the same prompt as when you saved the file, asking you which encoding was used. This way, if Studio gets it wrong, you can fix it.
Unless you do a lot of internationalized programming, where you have foreign-language strings embedded in your .cs file from a language other than the default, you probably don't need to use the explicit "with encoding" save or loads. But, they are there if you need them.
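The size argument is easy to check. This sketch is written in Java rather than C# and uses invented names; it compares the encoded size of an ASCII-only snippet in UTF-8 and UTF-16.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeSketch {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16LE is used so that no byte order mark is counted.
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String source = "class Program { }";
        System.out.println(utf8Size(source));  // prints 17
        System.out.println(utf16Size(source)); // prints 34
    }
}
```

For text outside the ASCII range the ratio changes, which is where the choice of encoding starts to matter.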

If you open with encoding you can save with whatever character encoding is appropriate for your culture or region.

How to display unicode control characters in visual studio text visualizer?

I get some text string from service, which contains Unicode control characters
(e.g. \u202B or \u202A and others for Arabic language support).
But while debugging I can't see them in the default text visualizer. I need to make such characters visible so I can determine which of them my text contains. There is a "show all characters" checkbox in the text visualizer, but it doesn't work as I expect.
Any suggestions?
Thanks in advance

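Those are the explicit directional embedding codes: \u202A is LRE (left-to-right embedding) and \u202B is RLE (right-to-left embedding). They mark a run of text that should be laid out in the given direction inside text that runs the other way, e.g. a left-to-right run inside a right-to-left paragraph.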
http://unicode.org/reports/tr9/#Directional_Formatting_Codes
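Since these characters have no glyph of their own, one workaround is to replace them with a visible marker before inspecting the string. The sketch below is in Java and its names are hypothetical; the same idea carries over to other languages that expose Unicode general categories.

```java
class ShowFormatChars {
    // Replace invisible format (Cf) and control (Cc) characters with a
    // visible <U+XXXX> marker; leave everything else untouched.
    static String reveal(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int type = Character.getType(c);
            if (type == Character.FORMAT || type == Character.CONTROL) {
                out.append(String.format("<U+%04X>", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(reveal("\u202Babc\u202C")); // prints <U+202B>abc<U+202C>
    }
}
```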

Nabla Special Character Shows as Null Character

I have a page that uses the nabla character for bullets on a menu. The code for it is ∇. However, on my machine as well as some others around the office, this character comes up as the null character, [], in IE. In Firefox, though, it displays the actual character. My question is whether there is a fix for this so I can make sure anyone viewing this site will see the actual character. Is there a browser font that needs to be installed on any machine that views this site, or is it an issue that can be fixed from my end.

The empty square character is not the null character, it is a visual placeholder - a representation of a Unicode character for which there is no associated glyph in the current font.
There are a couple of possible issues:
Your page does not have an explicit encoding associated with it, and your IE is configured to use Win-1252 as the default instead of Unicode.
The font family you specify in the CSS is missing on your computer, and IE's fallback font is different from Firefox's and does not have the nabla character.
Make sure your page explicitly specifies a Unicode encoding, and use the IE developer tools and Firebug to examine the actual rendered style for the bullet and see what font is being used by the two browsers.
