c-file encoded in utf-16 is not read properly by gcc - gcc

Doing some encoding tests, I saved a c-file with encoding 'UTF-16 LE' (using sublimeText).
The c file contains the following:
#include <stdio.h>
void main() {
    char* letter = "é";
    printf("%s\n", letter);
}
Compiling this file with gcc returns the error:
test.c:1:3: error: invalid preprocessing directive #i; did you mean #if?
1 | # i n c l u d e < s t d i o . h >
It's as if gcc inserted a space before each character when reading the c-file.
My question is: can we submit c-files encoded in some format other than UTF-8? Why was it not possible for gcc to detect the encoding of my file and read it properly?

Because of a design choice.
From GNU Manual, Character-sets:
At present, GNU CPP does not implement conversion from arbitrary file encodings to the source character set. Use of any encoding other than plain ASCII or UTF-8, except in comments, will cause errors. Use of encodings that are not strict supersets of ASCII, such as Shift JIS, may cause errors even if non-ASCII characters appear only in comments. We plan to fix this in the near future.
GCC was born to build GNU, and so it comes from the Unix world, where UTF-16 is not an accepted character set for standard files (and GNU passes source files between different programs, e.g. cpp the preprocessor, gcc the compiler, etc.).
But also, who uses UTF-16 for source code? And for C, which hates all the \0 bytes in strings? The encoding of the source code has nothing to do with the program's run-time behavior (that is governed by the default locales for reading files, printing strings, etc.).
If this causes you problems, just use a pre-preprocessor (which is not so uncommon) to convert your source code into something gcc can use (kept hidden from you, so you can keep editing in UTF-16).

Related

How to Pass Unicode Arguments to G++?

I am trying to use G++ to compile a simple C++ program. I am running Windows 10 and I have installed MinGW. So I tried to compile this file C:\Users\Vesk\Desktop\Информатика\Hello World.cpp with G++ by typing g++ "C:\Users\Vesk\Desktop\Информатика\Hello World.cpp" -o "C:\Users\Vesk\Desktop\Информатика\Hello World.exe" in the Command Prompt. G++ though didn't compile the file and gave me this error message:
g++: error: C:\Users\Vesk\Desktop\???????????\Hello World.cpp: Invalid argument
g++: fatal error: no input files
compilation terminated.
'Информатика' is just a word written in Cyrillic, so I was confused what the problem was. But then I just renamed the 'Информатика' folder to 'Informatics'. I tried to compile the file again with g++ "C:\Users\Vesk\Desktop\Informatics\Hello World.cpp" -o "C:\Users\Vesk\Desktop\Informatics\Hello World.exe". And lo and behold it worked. G++ compiled the file and the executable was there in the folder and working. But is there any way to actually compile a file if its path contains Cyrillic (or other Unicode) characters? If so, how?
Windows uses UTF-16 for Unicode file names. To my knowledge, it does not support UTF-8 as a locale although that would be very useful.
I tried a very old MinGW G++ 4.6.3 and indeed it does not support Unicode characters in file paths that are outside the current locale. I don't know about more recent MinGW GCC. A first possible solution would be to use a Russian locale.
For a Windows application to properly support Unicode file names, it needs to handle paths as wchar_t wide characters. The int main(int argc, const char* argv[]) classical signature for example must be replaced by int wmain(int argc, const wchar_t* argv[]). For a portable software like GCC, this is a complication that may not be worth it. Extremely few people will put characters in source file paths that are outside their current locale.
I tried G++ 10.2.0 on Cygwin and it works. This is because all Cygwin software links with cygwin1.dll, which, among other services, automatically converts all UTF-8 paths to UTF-16.
You should first get the command line in UTF-16 encoding with the GetCommandLineW function (https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinew) and then separate the tokens with CommandLineToArgvW (https://learn.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw).
If you want UTF-8 encoded strings you need to convert them; a simple, open-source and useful tool for converting strings between encodings in C++20 can be found here.

GCC Compiler options -wno-four-char-constants and -wno-multichar

Couldn't find any documentation on -Wno-four-char-constants, however I suspect that it is similar to -Wno-multichar. Am I correct?
They're related but not the same thing.
Compiling with the -Wall --pedantic flags, the assignment:
int i = 'abc';
produces:
warning: multi-character character constant [-Wmultichar]
with both GCC and CLANG, while:
int i = 'abcd';
produces:
GCC warning: multi-character character constant [-Wmultichar]
CLANG warning: multi-character character constant [-Wfour-char-constants]
The standard (C99 standard with corrigenda TC1, TC2 and TC3 included, subsection 6.4.4.4 - character constants) states that:
The value of an integer character constant containing more than one character (e.g., 'ab'), [...] is implementation-defined.
A multi-char always resolves to an int but, since the order in which the characters are packed into one int is not specified, portable use of multi-character constants is difficult (the exact value is implementation-dependent).
Also compilers differ in how they handle incomplete multi-chars (such as 'abc').
Some compilers pad on the left, some on the right, regardless of endian-ness (some compilers may not pad at all).
Someone who can accept the portability problems of a complete multi-char may anyway want a warning for an incomplete one (-Wmultichar -Wno-four-char-constants).

Is it possible to get GCC to compile UTF-8 with BOM source files?

I develop C++ cross-platform using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.
In Visual Studio I can use unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).
For example:
// A = π.r²
double π = 3.14;
GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:
wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program
wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program
Which brings me to the question:
Is there a way to get GCC to compile UTF-8 files without first removing the BOM?
I'm using:
Windows 7
Visual Studio 2010
and:
Ubuntu Oneiric 11.10
GCC 4.6.1 (as provided by apt-get install gcc)
Edit:
As the first commenter pointed out, my problem was not the BOM but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.
According to the GCC Wiki, this isn't supported yet. You can use -fextended-identifiers and pre-process your code to convert the identifiers to UCN. From the linked page:
perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;'
See also g++ unicode variable name and Unicode Identifiers and Source Code in C++11?
While Unicode identifiers are supported in gcc, UTF-8 input is not. Therefore, Unicode identifiers have to be encoded using \uXXXX and \UXXXXXXXX escape codes. However, a simple one-line patch to the cpp preprocessor allows gcc and g++ to process UTF-8 input, provided a recent version of iconv that supports C99 conversions is also installed. Details are present at
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
However, the patch is so simple it can be given right here.
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
--- 1711,1717 ----
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, "C99", input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
Even with the patch, two command line options are needed to enable UTF-8 input. In particular, try something like
$ /usr/local/gcc-5.2/bin/gcc \
-finput-charset=UTF-8 -fextended-identifiers \
-o circle circle.c

Compile using gcc without plain text constants in ELF image

I have some string constants in my C code. When I compile it using gcc, the strings are stored in a.out as plain text. They can be hand-edited in a.out. I want them to be encoded in some format so that no one can change the strings by editing a.out. Are there any objcopy or gcc options to avoid this?
Is it then at least possible to compile the code so that the ELF executes only after an integrity self-check, and terminates with an error if it fails?
That is, it could store some kind of md5sum at the end and check it on each execution.
I believe Win32 apps have this: hand-editing a Windows exe makes it an invalid Win32 app, because the checksum fails.
Is this possible with GCC/Linux?

Are Multiline macros in GCC supported

Are multi-line macros supported (compilable) in gcc version 3.2.4? I am trying to build my source, which has multi-line macros, on a Linux host using the above-mentioned gcc version.
I get a compilation error at the macro, which is multi-line.
#define YYCOPY(To, From, Count) \
do \
{ \
YYSIZE_T yyi; \
for (yyi = 0; yyi < (Count); yyi++) \
(To)[yyi] = (From)[yyi]; \
} \
while (0)
If they are not supported, what is the workaround for this? Would converting the macro to a function or some other compiler option help?
thank you.
-AD
Backslashes to continue the macro are standard preprocessor behavior. Check for extra spaces or other invisible characters after your backslashes.
The ANSI C specification requires compilers to support this -- specifically, the standard says that if a line ends in a backslash immediately before the newline, the preprocessor is to treat that line and the subsequent line as one logical line, as if both the backslash and the newline did not exist. If a preprocessor does not do this, it is not a conforming preprocessor (or more technically, a translator, as the standard calls it).
GCC strives to be as conformant as possible to the ANSI C standard. Yes, it supports multiline macros defined with backslashes at the end of lines.
The reason you're getting compiler errors is something else. You're not using the macro properly. Without posting the exact error messages you're receiving and the code which invokes the macro, it's impossible to say what you're doing wrong.
