How to Pass Unicode Arguments to G++? - cmd

I am trying to use G++ to compile a simple C++ program. I am running Windows 10 and I have installed MinGW. So I tried to compile this file C:\Users\Vesk\Desktop\Информатика\Hello World.cpp with G++ by typing g++ "C:\Users\Vesk\Desktop\Информатика\Hello World.cpp" -o "C:\Users\Vesk\Desktop\Информатика\Hello World.exe" in the Command Prompt. G++ though didn't compile the file and gave me this error message:
g++: error: C:\Users\Vesk\Desktop\???????????\Hello World.cpp: Invalid argument
g++: fatal error: no input files
compilation terminated.
'Информатика' is just a word written in Cyrillic, so I was confused what the problem was. But then I just renamed the 'Информатика' folder to 'Informatics'. I tried to compile the file again with g++ "C:\Users\Vesk\Desktop\Informatics\Hello World.cpp" -o "C:\Users\Vesk\Desktop\Informatics\Hello World.exe". And lo and behold it worked. G++ compiled the file and the executable was there in the folder and working. But is there any way to actually compile a file if its path contains Cyrillic (or other Unicode) characters? If so, how?

Windows uses UTF-16 for Unicode file names. To my knowledge, it does not support UTF-8 as a locale although that would be very useful.
I tried a very old MinGW G++ 4.6.3, and indeed it does not support file paths containing Unicode characters that are outside the current locale. I don't know about more recent MinGW GCC releases. A first possible solution would be to use a Russian locale.
For a Windows application to properly support Unicode file names, it needs to handle paths as wchar_t wide characters. The classical int main(int argc, const char* argv[]) signature, for example, must be replaced by int wmain(int argc, const wchar_t* argv[]). For portable software like GCC, this is a complication that may not be worth it. Extremely few people will put characters in source file paths that are outside their current locale.
I tried G++ 10.2.0 on Cygwin and it works. This is because all Cygwin software links with cygwin1.dll, which, among other services, automatically converts all UTF-8 paths to UTF-16.

You should first get the command line in UTF-16 encoding with the GetCommandLineW function (https://learn.microsoft.com/en-us/windows/win32/api/processenv/nf-processenv-getcommandlinew) and then split it into tokens with CommandLineToArgvW (https://learn.microsoft.com/en-us/windows/win32/api/shellapi/nf-shellapi-commandlinetoargvw).
If you want UTF-8 encoded strings, you need to convert them. A simple, useful open-source tool for converting strings between encodings in C++20 can be found here.
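The two steps above (fetch the UTF-16 command line, then convert to UTF-8) can be sketched roughly as follows. This is only an illustration, not the tool referred to above: the names append_utf8, utf16_to_utf8 and utf8_args are made up for the example, the conversion is hand-written, and the Windows-only part is guarded so that the rest compiles elsewhere. With some toolchains you may need to link against Shell32 for CommandLineToArgvW.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>
#include <vector>
#endif

// Hypothetical helper: encode one Unicode code point as UTF-8 and append it.
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}

#ifdef _WIN32
// Windows only: read the process arguments as UTF-16 and return them as UTF-8.
std::vector<std::string> utf8_args() {
    std::vector<std::string> args;
    int argc = 0;
    LPWSTR* wargv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (wargv == nullptr) return args;
    for (int i = 0; i < argc; ++i) {
        std::u16string s;
        for (const wchar_t* w = wargv[i]; *w != L'\0'; ++w)
            s += static_cast<char16_t>(*w);  // wchar_t is 16 bits on Windows
        args.push_back(utf16_to_utf8(s));
    }
    LocalFree(wargv);  // the array returned by CommandLineToArgvW must be freed
    return args;
}
#endif
```

A program would call utf8_args() at the start of main instead of trusting the narrow argv, which has already been converted to the current code page by the time main sees it.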

Related

My compiler doesn't parse escape sequences as expected

I am trying to run static analysis on my code using a tool. The Makefile contains:
export TASK=MY_TASK_NAME
my_static_code_tool.exe <arguments> -- gcc <arguments..> -D__TASK_NAME__=\"$(TASK)\" -o missionFile.o missionFile.c
I find that this executes without an issue on RedHat but fails to run on my Cygwin environment. I assign __TASK_NAME__ variable to an unsigned char in a C file such as:
const unsigned char TASK_NAME[] = __TASK_NAME__;
I get the error as:
gcc: no input files
I am very sure my arguments are all correct and I am referring to sources in the correct directory. To me it looks as if the -- stops the parsing of escape sequences in the command on Windows. Can anybody help me with a workaround?
The -- is used by the tool to introduce the compiler and its arguments [and thereby inform the tool that the following is compiler specific]. The GCC had all the required source/files/configuration defined in the Makefile. However it was not processed completely in the Cygwin shell (the command processing stopped with the escaping hence the corresponding gcc error).
The solution I employed to make this work was pre-processor stringification.
C file:
#define STRINGIFY_IT(str) STRING_OF(str)
#define STRING_OF(str) #str
const unsigned char TASK_NAME[] = STRINGIFY_IT(__TASK_NAME__);
Makefile:
export TASK=MY_TASK_NAME
my_static_code_tool.exe <arguments> -- gcc <arguments..> -D__TASK_NAME__=$(TASK) -o missionFile.o missionFile.c
So, if you face such problems with third-party tools in the future, try not to pass string arguments to GCC through the command line, as they need to be escaped and may break the command.

Don't understand gcc that well, but I can't find why it's not working

I'm trying to compile a simple "hello world"
file_name
#include <stdio.h>
void main () {
printf ("Hello World\n");
}
then I try: gcc file_name and I get "File not recognized. File format not recognized"
I however am 100% sure I did the exact same thing a few weeks back (just to see if it works, as now) and it worked, so I just don't get it.
gcc -ver // returns 4.6.1 if this helps
Also how is gcc -o supposed to work ? The manual (man gcc) is just gibberish at times (for me)
Let's say your program is saved as helloworld.c. Typing gcc -o myprog helloworld.c would compile helloworld.c into myprog. That way, when you want to run the program, all you type in the command line is ./myprog
gcc tries to guess the language used (e.g. C or C++) from the extension of the file, so you need to make sure the file has the proper extension (usually .cpp for C++ and .c for C source files). Alternatively, the -x option states the language explicitly regardless of the extension (for example, gcc -x c file_name).
As for the -o command-line option: the name given after it is the name of the output file. When compiling and linking in one step, as in the example above, that is the executable. With -c it is the object file, and object files are later linked together to form an executable.

Is it possible to get GCC to compile UTF-8 with BOM source files?

I develop cross-platform C++ using Microsoft Visual Studio on Windows and GCC on Ubuntu Linux.
In Visual Studio I can use unicode symbols like "π" and "²" in my code. Visual Studio always saves the source files as UTF-8 with BOM (Byte Order Mark).
For example:
// A = π.r²
double π = 3.14;
GCC happily compiles these files only if I remove the BOM first. If I do not remove the BOM, I get errors like these:
wwga_hydutils.cpp:28:9: error: stray ‘\317’ in program
wwga_hydutils.cpp:28:9: error: stray ‘\200’ in program
Which brings me to the question:
Is there a way to get GCC to compile UTF-8 files without first removing the BOM?
I'm using:
Windows 7
Visual Studio 2010
and:
Ubuntu Oneiric 11.10
GCC 4.6.1 (as provided by apt-get install gcc)
Edit:
As the first commenter pointed out, my problem was not the BOM, but having non-ASCII characters outside of string constants. GCC does not like non-ASCII characters in symbol names, but it turns out GCC is fully compatible with UTF-8 with BOM.
According to the GCC Wiki, this isn't supported yet. You can use -fextended-identifiers and pre-process your code to convert the identifiers to UCN. From the linked page:
perl -pe 'BEGIN { binmode STDIN, ":utf8"; } s/(.)/ord($1) < 128 ? $1 : sprintf("\\U%08x", ord($1))/ge;'
See also g++ unicode variable name and Unicode Identifiers and Source Code in C++11?
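For illustration, this is roughly what the example from the question looks like once the identifier has been rewritten as a UCN, which is what the Perl filter above produces (it emits the longer \U000003c0 spelling of the same character). The circle_area function is a made-up name for the sketch. On the GCC 4.6 mentioned in the question this would need -fextended-identifiers; as far as I know, newer releases accept UCNs in identifiers by default.

```cpp
// "pi" spelled as a universal character name (U+03C0) instead of raw UTF-8.
const double \u03C0 = 3.14;

// Hypothetical helper: area of a circle, A = pi * r^2,
// using the UCN-spelled identifier.
double circle_area(double r) {
    return \u03C0 * r * r;
}
```

The source file itself is then plain ASCII, so the compiler never sees the bytes that caused the "stray" errors.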
While Unicode identifiers are supported in GCC, UTF-8 input is not. Therefore, Unicode identifiers have to be encoded using \uXXXX and \UXXXXXXXX escape codes. However, a simple one-line patch to the cpp preprocessor allows gcc and g++ to process UTF-8 input, provided a recent version of iconv that supports C99 conversions is also installed. Details are available at
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
However, the patch is so simple it can be given right here.
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
--- 1711,1717 ----
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, "C99", input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
Even with the patch, two command line options are needed to enable UTF-8 input. In particular, try something like
$ /usr/local/gcc-5.2/bin/gcc \
-finput-charset=UTF-8 -fextended-identifiers \
-o circle circle.c

Undefining linker symbols in gcc

We have a program that runs on an embedded OS. We normally embed a version string in the output binary that identifies all the versions that went into generating the binary. Usually the compilers we use can make sure that the version string is in the binary by creating an "undefined" symbol, which is then resolved by our version string.
However, we have now moved to a Linux based system and gcc.
gcc is removing the version string from the final exe. The final exe is created through linking in a bunch of libraries. Each library has a version string embedded.
gcc is removing the version string because nothing is referencing the string and we have turned on -Os optimisations.
Is there a way of making sure that gcc does not strip a collection of strings (there are about 5-10 version strings we need to embed)?
Thanks.
Try working with --retain-symbols-file (option to the linker)
From the ld man page:
--retain-symbols-file filename
Retain only the symbols listed in the file filename, discarding all others. filename is simply a flat file, with one symbol name per line. This option is especially useful in environments (such as VxWorks) where a large global symbol table is accumulated gradually, to conserve run-time memory.
--retain-symbols-file does not discard undefined symbols, or symbols needed for relocations.
You may only specify --retain-symbols-file once in the command line. It overrides -s and -S.
EDIT I just noticed the last line of the docs quoted above. It will override the 'strip all' option, so I'm not sure this will help you...
OK, to solve this we did the following in a C file:
const char _string_[] = "some string";
Then include the object file in the final link:
gcc <snip> -Wl,--start-group string.o <snip> -Wl,--end-group -Wl,--strip-all -o final.exe

Compile using gcc without plain text constants in ELF image

I have some string constants in C code. When I compile it with gcc, the strings are stored in a.out in plain text, where they can be hand-edited. I want them to be encoded in some format so that no one can change the strings by editing a.out. Are there any objcopy or gcc options to prevent this?
If not, is it at least possible to compile the code so that the ELF executes only after an integrity self-check and terminates with an error if the check fails?
That is, it could store some kind of md5sum at the end and check it at each execution.
I believe Win32 apps have this, and hand-editing a Windows exe makes it an invalid Win32 app because the checksum fails.
Is this possible with GCC on Linux?
