How to get a utf code from symbol in linux

How to get a utf code from symbol in linux - shell

I'm struggling with a special symbol in a text file on linux. I actually successfully pasted it between the following letters "a‏a" (my cursor in Geany stops but no character is displayed).
I'd like to know what's the easiest way to get its utf8 code (in the form U+0000). I'm using ubuntu and geany and I tried hexdump on a file containing it but I'm obviously missing something.

You could open the file with vim, put the text cursor over the character, then type 'ga' (without quotes) and it will display the character code in decimal, hex and octal in the status line.

Related

CMD: clip command issue

I have found a weird issue using the clip command of Widows CMD.
I created a simple text file containing this text:
^.*A{0,0}.*$
Then I ran the command clip < PATH_TO_THE_TEXT_FILE in CMD.
Finally, I tried pasting the copied text into text editors such as Notepad and Notepad++, and what I got was some weird Japanese characters. This issue can be reproduced every time and on different PCs.
Can you please tell me what is causing this issue and how can I make the clip command copy the actual text in the text file, and not the weird Japanese characters?

what is causing this
Incorrect encoding conversion, ASCII String -> bytes -> UTF-16 String
// JavaScript code to emulate this
bytes = new TextEncoder("UTF-8").encode( '^.*A{0,0}.*$' )
new TextDecoder("UTF-16LE").decode( bytes )
how can I make the clip command copy the actual text
type PATH_TO_THE_TEXT_FILE | clip
Sorry, update the answer.
Update: clip works if there is a newline in the end of the file.

Based on what 7cc said, I figued out how to make it work.
I need to create the text file using the encoding UTF-16LE, and then it works.

Corruption when using certain batch variable names in custom build command

I have a VS2013 project with a custom build command. In the command script I set an environment variable, and read it out again in the same script. I can confirm by calling set that setting the variable works. However, depending on the variable name, I can't read it out again.
The following works as expected when run as a batch script:
set AVAR=xxx
set ABLAH=xxx
set BBLAH=xxx
set DEV=xxx
set #ABLAH=xxx
echo %AVAR%
echo %ABLAH%
echo %BBLAH%
echo %DEV%
echo %#ABLAH%
But produces the following output in the project:
1> xxx
1> «LAH
1> »LAH
1> ÞV
1> xxx
In this case, the name AVAR works, but many others don't. Also, variables starting with # seem to work. Any idea what is going on?

I've found the solution. Visual Studio (msbuild) converts %XX escape sequences like in URLs. I only expected it to so in URLs, like browsers do. However, it seems to replace them everywhere.
So when it encounters %ABCDE%, it recognizes %AB and inserts the character « = 0xAB, giving «CDE% to the batch interpreter. But if the code is not a valid hexadecimal number, it silently ignores it, and the interpreter sees the right characters. That's why variable names with # at the beginning always worked.
So the solution is to escape at least all % in front valid hex codes 00-FF, better even all of them, with %25.
An easy solution would be to just edit the corresponding commands in the GUI (via project properties), and not directly in the .vcxproj or .props file. This way, VS inserts the correct escape codes. In my case this was not possible since the commands were defined as user macros (Property Pages: Common Properties/User Macros). My commands span multiple lines, but the user macro editor only supports single lines.
Another thing to watch out for is that it not only replaces percent signs. Other symbols have special meaning and have to be replaced, too. (This goes beyond XML entities, like & -> &.) Here is a list of special characters from MSDN. The characters are: % $ # ' ; ? *. It doesn't seem to be necessary to replace all of them all the time, but if you notice funky behavior then this is a thing to look at. You can try to enter these characters through the GUI and see how and if VS escapes them in the project file.
On other character to note especially is the semicolon. If you define a property with unescaped semicolons, like <MyPaths>DirA;DirB</MyPaths>, msbuild/VS will internally convert them to newlines (well, or it splits the property into a list or something). But it will still show the paths as separated with semicolons in the property pages! Except when you click the dropdown button next to a property and select <Edit...>, then it will show the paths as a list or separated by newlines! This is completely invisible most of the time, except when you set a property not in XML or the GUI, but you are reading the output of a command into a property. In this case the command must output newlines, if you want the effect of a semicolon. Otherwise you don't get multiple paths, but one long path with semicolons in it.

Batch files are usually in North American and Western European countries "ASCII" files using an OEM code page like code page 850 (OEM multilingual Latin I) or code page 437 (OEM US) and not code page Windows-1252 as used usually for single byte encoded text files. The code page to use for a batch file depends on local settings for non Unicode files in console. The code page does not matter if just characters with a code value smaller 128 are used in batch file, i.e. the batch file is a real ASCII file.
Therefore make sure that you edit and save the batch file as ASCII file using the right code page and not as Unicode file using UTF-8, UTF-16 Little Endian or UTF-16 Big Endian. Editor of Visual Studio uses by default UTF-8 encoding for the files. This is the wrong encoding for batch files.
Character « has in table of code page 850 the code value 174 decimal (0xAB). In table of code page 1252 code value 174 is for character ® which is an indication that you want to output in batch file characters encoded in UTF-8 (also code value 174 for character ®) or Windows-1252.
A simple batch code for demonstration stored as ANSI file with code page Windows-1252.
#echo off
cls
echo This batch file was saved as ANSI file using code page Windows-1252.
echo.
echo Registered trademark symbol ® has code value 174 in Windows-1252.
echo.
echo But active code page is not Windows 1252 in console window.
echo.
chcp
echo.
echo Therefore the left guillemet character is output instead of registered
echo trademark symbol as this character has in code page 850 code value 174.
echo.
echo Press any key to continue ...
pause>nul
And batch files are for DOS/Windows and should therefore use carriage return + line-feed as line terminator instead of just line-feed (UNIX) or just carriage return (old MAC).
Some text editors display line terminator type and encoding respectively code page somewhere in status bar at bottom of main application window for active file.

Issue with encoding of a character (not able to sed or .gsub)

I am dealing with some multilingual data(English and Arabic) in a json file with a weird character i am not able to parse. I am not sure what the character is. I tried getting the ASCII value via vim and this is what i got
"38 0x26"
This is the status line in vim i used to get the value (http://vim.wikia.com/wiki/Showing_the_ASCII_value_of_the_current_character).
:set statusline=%<%f%h%m%r%=%b\ 0x%B\ \ %l,%c%V\ %P
This is how the character looks in vim -
I tried 'sed' and '.gsub' to replace this character unsuccessfully.
Is there a way where i can replace this character(preferably with .gsub ruby) with '&' or something else?
Thanks

try with something like
sed 's/[[:alpnum:][:space:]\[\]{}()\.\*\\\/_(AllAsciiVariationYouWant)/&/g;t
s/./?/g' YourFile
where (AllAsciiVariationYouWant) is all character that you want to keep as is (without the surrounding "()" )

JSON is encoded in UTF-8 (Unicode). If you're seeing funky-looking characters in your file, it's probably because your editor is not treating Unicode characters properly. That could be caused by the use of a terminal emulator that doesn't support Unicode; an incorrect $LANG setting; vim not being able to correctly determine the encoding of the file; and likely other reasons.
What terminal program are you using? What's your $LANG environment variable set to (echo $LANG)? If you're certain your terminal supports Unicode, try:
LANG=en_US.utf-8 vim your_file_here.json
(The above example assumes that U.S. English is appropriate for the file, which it may not be.)
As for replacing characters in the file, vim's substitution command can be used:
:%s/old text/new text/g
The above command will run the substitute command on all lines in the file (%), replacing every instance of "old text" with "new text". (The g at the end tells vim to replace every instance on a line, not just the first it finds.)

Unknown Character

Facing a typical issue of some unknown character.
Actually trying to compile some packages in database through script and got an error as below:
SP2-0734: unknown command beginning "?SET DEF..." - rest of line ignored.
When i open the log file in notepad++ it shows the line as shown above.
Now, if I open the same log file in scite editor it shows the same file as:
SP2-0734: unknown command beginning "ï»¿SET DEF..." - rest of line ignored.
Not getting what could be the issue.
Any help would be welcomed.

Your script has an unprintable character at the start (as you discovered from comments), which some editors don't display at all, and others display as an unknown character. "ï»¿" is the byte order mark:
The UTF-8 representation of the BOM is the byte sequence
0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as
ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
From that article some editors (notable Notepad) add that automatically. It should be safe to open the file with a hex editor and remove the extra character, and you'll then be able to run the script normally.

BASH: Replacing special character groups

I have a rather tricky request...
We use a special application which is connected to a oracle database. For control reasons the application uses special characters which are defined by the application and saved in a long field of the database.
My task is to query the long field periodically and check for changes. To do that, I write the content by using a bash script in a file and compare the old and the new file with md5sum.
When there's a difference, I want to send the old file via mail. The problem is, that the old file contains these special characters and I don't know how to replace them with for example a string which describes them.
I tried to replace them on the basis of their ASCII code, but this didn't work. I've also tried to replace them by their appearance in the file. (They look like this: ^P ) This didn't work neither.
When viewing the file by text editor like nano the characters are visible like described above. But when using cat on the file, the content is only displayed until the first appearance of such a control character.
As far as I know there is know possibility to replace them while querying from the database because of the fact that the content is in a LONG field.
I hope you can help me.
Thank you in advance.
Marco

^P is the Control-P character, which is decimal 16 or hexadecimal 0x10, also known as the Data Link Escape (DLE) character in ASCII.
To replace all occurrences of 0x10 in a file with another string we can use our friend gsed:
gsed "s/\x10/Data Link Escape/g" yourfile.txt
This should replace all occurrences of characters containing the hex value 0x10 with the text string "Data Link Escape". You'll probably want to use a different string - this is just an example.
Depending on the system you're using you may be able to use the standard sed command if your version of sed recognizes the \xNN single-character escape codes. If there are multiple hex characters you need to replace you may want to create a file containing your sed commands, one for each hexadecmial character you need to replace, and tell sed or gsed to use the commands in the file - consult the sed or gsed man pages for how to do this.
Share and enjoy.

You can use xxd to change the string to its hex representation, then use xxd -r to convert back.
Or, you can use uuencode and uudecode.

One option is to run the file through cat -v. This replaces nonprinting characters with visible representations (using the ^ notation for control characters):
$ echo $'\x10\x12\x13\x14\x16' | cat -v
^P^R^S^T^V

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio