Reading a UTF-8 encoded file into std::u32string without intermediate buffering - c++11

Having worked with Unicode and C++ for quite a long time, I thought this would be a simple thing to accomplish, especially with the new C++11 std::codecvt_utf8 facet. It turned out to be a difficult task, though. What I want is to read a file encoded in UTF-8 into a u32string (converting it from UTF-8 to UTF-32 implicitly). Sure, I could load the entire content into a buffer and convert that using std::wstring_convert. But that doubles the memory footprint when loading a file. So I tried to use a std::wifstream and imbue a locale with a UTF-8 facet like this:
std::wifstream stream(fileName, std::ios::binary);
stream.imbue(std::locale(stream.getloc(), new std::codecvt_utf8<char32_t, 0x10ffff, std::consume_header>));
std::u32string data;
for (char32_t c; stream >> c; )
data += c;
which looks like a straightforward implementation. Only it doesn't compile. wifstream's element type is wchar_t, so you can only use wchar_t in the loop, like this:
std::u32string data;
for (wchar_t c; stream >> c; )
data += c;
(at least with clang; VC++ also accepts char32_t there, but that doesn't change anything). After fixing this, several other problems remain, though:
In Visual C++, wchar_t is only 16 bits wide (so no UTF-32 there; we don't consider surrogate pairs here).
Using char32_t for the facet essentially disables conversion. The iteration over the stream returns the original UTF-8 content, both in clang and VC++.
Using wchar_t also for the facet makes it work in clang, but not in VC++, because in clang wchar_t is 32 bits wide, while (as mentioned already) it is only 16 bits in VC++. A sketch of that variant follows.
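For reference, here is the clang-only variant from the last point assembled into a small helper function (loadUtf8 is a made-up name). Note the added std::noskipws, since the formatted operator>> would otherwise skip whitespace. It relies on wchar_t being 32 bits wide, which is exactly why it does not deliver UTF-32 under VC++:
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper illustrating the clang-only variant described above.
std::u32string loadUtf8(const char *fileName)
{
    std::wifstream stream(fileName, std::ios::binary);
    stream.imbue(std::locale(stream.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
    stream >> std::noskipws; // keep whitespace characters instead of skipping them

    std::u32string data;
    for (wchar_t c; stream >> c; )
        data += c; // each 32-bit wchar_t value converts to char32_t on append
    return data;
}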
So, what is the correct approach here? Being locked into wchar_t for the facet, I cannot even use a different data type. I also tried defining a basic_ifstream<char32_t>, but that requires additional typedefs, so I didn't follow that path further.

It seems there is no way to use a facet imbued into a stream for this, so I went with an intermediate buffer after all, which is quite an elegant solution too, except that it doubles (more or less) the memory needed to load the content. Use a byte (file) stream opened in binary mode to call this:
void load(std::istream &stream)
{
    // _data is the std::u32string member that receives the converted content.
    static std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utfConverter;
    std::string s((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());
    _data = utfConverter.from_bytes(s);
}
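A minimal usage sketch, assuming load and _data are members of some document-like class (Document here is just a made-up name for illustration); note also that std::wstring_convert and std::codecvt_utf8 were later deprecated in C++17:
#include <fstream>

int main()
{
    std::ifstream file("input.txt", std::ios::binary); // plain byte stream, no imbued facet
    Document doc;   // hypothetical class providing load() and holding _data
    doc.load(file); // _data now contains the UTF-32 text
}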

Related

how to include text file as string at compile time without adding c++11 string literal prefix and suffix in the text file

I'm aware of many similar questions on this site. I really like the solution mentioned in the following link:
https://stackoverflow.com/a/25021520/884553
with some modifications, you can include a text file at compile time, for example:
constexpr const char* s =
#include "file.txt"
BUT to make this work you have to add a string literal prefix and suffix to your original file, for example:
R"(
This is the original content,
and I don't want this file to be modified. but i
don't know how to do it.
)";
My question is: is there a way to make this work but not modifying file.txt?
(I know I can use command-line tools to make a copy, prepend and append to the copy, and remove the copy after compiling. I'm looking for a more elegant solution than that, hopefully with no need for other tools.)
Here's what I've tried (but it doesn't work):
#include <iostream>
int main() {
    constexpr const char* s =
#include "bra.txt" // R"(
#include "file.txt" //original file without R"( and )";
#include "ket.txt" // )";
    std::cout << s << "\n";
    return 0;
}
/opt/gcc8/bin/g++ -std=c++1z a.cpp
In file included from a.cpp:5:
bra.txt:1:1: error: unterminated raw string
R"(
^
a.cpp: In function ‘int main()’:
a.cpp:4:27: error: expected primary-expression at end of input
constexpr const char* s =
^
a.cpp:4:27: error: expected ‘}’ at end of input
a.cpp:3:12: note: to match this ‘{’
int main() {
^
No, this cannot be done.
There is a C++2a proposal, called std::embed, to allow inclusion of such resources at compile time.
The motivation part of the P1040R1 proposal:
Motivation
Every C and C++ programmer -- at some point -- attempts to #include large chunks of non-C++ data into their code. Of course, #include expects the format of the data to be source code, and thusly the program fails with spectacular lexer errors. Thusly, many different tools and practices were adapted to handle this, as far back as 1995 with the xxd tool. Many industries need such functionality, including (but hardly limited to):
Financial Development
representing coefficients and numeric constants for performance-critical algorithms;
Game Development
assets that do not change at runtime, such as icons, fixed textures and other data
Shader and scripting code;
Embedded Development
storing large chunks of binary, such as firmware, in a well-compressed format
placing data in memory on chips and systems that do not have an operating system or file system;
Application Development
compressed binary blobs representing data
non-C++ script code that is not changed at runtime; and
Server Development
configuration parameters which are known at build-time and are baked in to set limits and give compile-time information to tweak performance under certain loads
SSL/TLS Certificates hard-coded into your executable (requiring a rebuild and potential authorization before deploying new certificates).
In the pursuit of this goal, these tools have proven to have inadequacies and contribute poorly to the C++ development cycle as it continues to scale up for larger and better low-end devices and high-performance machines, bogging developers down with menial build tasks and trying to cover-up disappointing differences between platforms.
MongoDB has been kind enough to share some of their code below. Other companies have had their example code anonymized or simply not included directly out of shame for the things they need to do to support their workflows. The author thanks MongoDB for their courage and their support for std::embed.
The request for some form of #include_string or similar dates back quite a long time, with one of the oldest stack overflow questions asked-and-answered about it dating back nearly 10 years. Predating even that is a plethora of mailing list posts and forum posts asking how to get script code and other things that are not likely to change into the binary.
This paper proposes <embed> to make this process much more efficient, portable, and streamlined. Here’s an example of the ideal:
#include <embed>
int main (int, char*[]) {
    constexpr std::span<const std::byte> fxaa_binary = std::embed( "fxaa.spirv" );
    // assert this is a SPIRV file, compile-time
    static_assert( fxaa_binary[0] == 0x03 && fxaa_binary[1] == 0x02
        && fxaa_binary[2] == 0x23 && fxaa_binary[3] == 0x07
        , "given wrong SPIRV data, check rebuild or check the binaries!" );
    auto context = make_vulkan_context();
    // data kept around and made available for binary
    // to use at runtime
    auto fxaa_shader = make_shader( context, fxaa_binary );
    for (;;) {
        // ...
        // and we’re off!
        // ...
    }
    return 0;
}

Embedding big data file into executable binary

I am working on a C++11 application that is supposed to ship as a single executable binary file. Optionally, users can provide their own CSV data files to be used by the application. To simplify things, assume each element is in format key,value\n. I have created a structure such as:
typedef struct Data {
    std::string key;
    std::string value;
    Data(std::string key, std::string value) : key(key), value(value) {}
} Data;
By default, the application should use data defined in a single header file. I've made a simple Python script to parse the default CSV file and put it into a header file like this:
#ifndef MYPROJECT_DEFAULTDATA
#define MYPROJECT_DEFAULTDATA
#include "../database/DefaultData.h"
namespace defaults {
    std::vector<Data> default_data = {
        Data("SomeKeyA","SomeValueA"),
        Data("SomeKeyB","SomeValueB"),
        Data("SomeKeyC","SomeValueC"),
        /* and on, and on, and on... */
        Data("SomeKeyASFHOIEGEWG","SomeValueASFHOIEGEWG")
    };
}
#endif //MYPROJECT_DEFAULTDATA
The only problem is that the file is big. I'm talking 116'087 lines (12M) big, and it will probably be replaced with an even bigger file in the future. When I include it, my IDE tries to parse it and update its indices. That slows everything down to the point where I can hardly write anything.
I'm looking for a way to either:
prevent my IDE (CLion) from parsing it, or
make a switch in CMake that would use this file only for release executables, or
somehow inject the data directly into the executable.
Since your build process already includes a pre-processing step that generates C++ code from a CSV, this should be easy.
Step 1: Put most of the generated data in the .cpp file, not a header.
Step 2: Generate your code so that it doesn't use vector or string.
Here's how to do these:
struct Data
{
    string_view key;
    string_view value;
};
You will need an implementation of string_view or a similar type. While string_view was standardized in C++17, it doesn't rely on C++17 features, so it can be implemented in C++11 (a minimal sketch follows).
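A minimal sketch of such a type, assuming only C++11 (this is not the full std::string_view interface, just enough for the generated table; in C++17 you would simply use std::string_view):
#include <cstddef>

// Read-only view over a character range: a pointer plus a length.
struct string_view
{
    const char* ptr;
    std::size_t len;

    constexpr const char* data() const { return ptr; }
    constexpr std::size_t size() const { return len; }
    constexpr char operator[](std::size_t i) const { return ptr[i]; }
};
Because this type is an aggregate, it supports the braced initialization used in the generated .cpp file below.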
As for the data structure itself, this is what gets generated in the header:
namespace defaults {
extern const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data;
}
{{GENERATED_ARRAY_COUNT}} is the number of items in the array. That's all the generated header should expose. The generated .cpp file is a bit more complex:
static const char ptr[] =
"SomeKeyA" "SomeValueA"
"SomeKeyB" "SomeValueB"
"SomeKeyC" "SomeValueC"
...
"SomeKeyASFHOIEGEWG" "SomeValueASFHOIEGEWG"
;
namespace defaults
{
    const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data =
    {
        {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
        {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
        ...
        {{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
    };
}
ptr is a string which is a concatenation of all of your individual strings. There is no need to put spaces or \0 characters or whatever between the individual strings. However, if you do need to pass these strings to APIs that take NULL-terminated strings, you'll either have to copy them into a std::string or have the generator stick \0 characters after each generated sub-string.
The point is that ptr should be a single, giant block of character data.
{{GENERATED_OFFSET}} and {{GENERATED_SIZE}} are offsets and sizes within the giant block of character data that represents a single substring.
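As a concrete illustration (not part of the original answer), here is what the generator might emit for the three sample pairs shown earlier. The offsets follow directly from the string lengths: "SomeKeyA" is 8 characters and "SomeValueA" is 10, so "SomeKeyB" starts at offset 18, and so on. The extra pair of braces initializes std::array's inner array explicitly:
namespace defaults
{
    const std::array<Data, 3> default_data =
    {{
        {{ptr + 0,  8}, {ptr + 8,  10}},  // "SomeKeyA",  "SomeValueA"
        {{ptr + 18, 8}, {ptr + 26, 10}},  // "SomeKeyB",  "SomeValueB"
        {{ptr + 36, 8}, {ptr + 44, 10}},  // "SomeKeyC",  "SomeValueC"
    }};
}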
This method will solve two of your problems. It will be much faster at load time, since it performs zero dynamic allocations. And it puts the generated strings in the .cpp file, thus making your IDE cooperate.

Convert AnsiString to UnicodeString in Lazarus with FreePascal

I found similar topics here but none of them had the solution to my question, so I am asking it in a new thread.
A couple of days ago, I changed the format in which the preferences of an application I am developing are saved, from INI to JSON.
I use the jsonConf unit for this.
A sample of the code I use to save a key-value pair in the file is shown below.
procedure TMyClass.SaveSettings();
var
  c: TJSONConfig;
begin
  c := TJSONConfig.Create(nil);
  try
    c.Filename := m_settingsFilePath;
    c.SetValue('/Systems/CustomName', m_customName);
  finally
    c.Free;
  end;
end;
In my code, m_customName is an AnsiString variable. The TJSONConfig.SetValue procedure requires both the key and the value to be of type UnicodeString. The application compiles fine, but I get warnings such as:
Warning: Implicit string type conversion from "AnsiString" to "UnicodeString".
Some messages warn that there is potential data loss.
Of course I could go and change everything to the UnicodeString type, but that is too risky. I haven't seen any issues so far from ignoring these warnings, but they show up all the time and it might cause issues on a different PC.
How do I fix this?
To avoid the warning, do an explicit conversion, because this way you tell the compiler that you know what you are doing (I hope...). In the case of c.SetValue the expected type is UnicodeString (UTF-16). m_customName should be declared as a string unless there is a good reason to do otherwise (see below), because otherwise you may trigger unwanted internal conversions.
A string in Lazarus is UTF-8-encoded by default. Therefore, you can use the function UTF8Decode() for the conversion from UTF-8 to Unicode, or UTF8ToUTF16() (unit LazUtf8).
var
c: TJSONConfig;
m_customName: String;
...
c.SetValue('/Systems/CustomName', UTF8Decode(m_customName));
You say above that the key-value pairs are in a file. Then the conversion depends on the encoding of the file. Normally I open the file in a good text editor and find the encoding somewhere - NotePad++, for example, displays the name of the encoding in the right corner of the statusbar. Suppose the encoding is that of codepage 1252 (Latin-1). These are ansistrings, therefore, you can declare the strings read from the file as ansistring. Because UTF8 strings are so common in Lazarus there is no direct conversion from ansistring to Unicode, and you must convert to UTF8 first. In the unit lconvencoding you find many conversion routines between various encodings. Select CP1252toUTF8() to go to UTF8, and then apply UTF8Decode() to finally get Unicode.
var
c: TJSONConfig;
m_customName: ansistring;
...
c.SetValue('/Systems/CustomName', UTF8Decode(CP1252ToUTF8(m_customName)));
The FreePascal compiler 3.0 can handle many of these conversions automatically using strings with predefined encodings. But I think explicit conversions make it very clear what is happening. And FPC 3.0 still emits the warnings which you want to avoid...

How to split blob into Byte Array In shell script?

I have a blob in a postgresql database, and I have inserted a C structure into it.
struct temp {
    uint64_t a;
    uint64_t b;
    uint64_t c;
};
Now I write a query in the shell to retrieve it:
select resource,.....,blob_column from rtable where rId = 1
I get the result as a blob from the database. The result is:
x00911ee3561ac801cb0783462586cf01af00000000000000
But now, in the shell script, I need to iterate over this and display the result on the console. I tried different things like awk, split, convert_from and the convert function, but nothing is helping me.
Can someone tell me how I can read this hex string and get back the integers?
Is this some kind of exercise in programmer torture? I can't imagine why you'd possibly do this, not least because your struct-as-a-blob could be subject to padding and alignment that will vary from compiler to compiler and platform to platform. Even then, it'll vary between architectures because of endianness differences. At least you used fixed-width types.
Assuming you only care about little-endian and your compilers don't add any padding or alignment (likely for a struct that's just three 64-bit fields), it's possible. That doesn't make it a great idea.
My preferred approach would be to use some Python code with struct, e.g.
python - "x00911ee3561ac801cb0783462586cf01af00000000000000" <<__END__
import sys
import struct
# Python 2: strip the leading "x", hex-decode, then unpack three little-endian uint64s
print "{} {} {}".format(*struct.unpack('<QQQ', sys.argv[1][1:].decode("hex")))
__END__
as this can even handle endianness and packing using appropriate modifiers, and you can easily consume the output in a shell script.
If that's not convenient/suitable, it's also possible in bash, just absolutely horrible. For little-endian, unpadded/packed-unaligned:
To decode each value (adapted from https://stackoverflow.com/a/3678208/398670):
$ x=00911ee3561ac801
$ echo $(( 16#${x:14:2}${x:12:2}${x:10:2}${x:8:2}${x:6:2}${x:4:2}${x:2:2}${x:0:2} ))
so, for the full deal:
x=x00911ee3561ac801cb0783462586cf01af00000000000000
uint64_dec() {
    echo $(( 16#${1:14:2}${1:12:2}${1:10:2}${1:8:2}${1:6:2}${1:4:2}${1:2:2}${1:0:2} ))
}
uint64_dec ${x:1:16}
uint64_dec ${x:17:16}
uint64_dec ${x:33:16}
produces:
128381549860000000
130470408871937995
175
Now, I feel dirty and need to go wash. I strongly suggest the following:
CREATE TYPE my_struct AS (a numeric, b numeric, c numeric);
then using my_struct instead of a bytea field. Or just use three numeric columns. You can't use bigint because Pg doesn't have a 64-bit unsigned integer.

Restoring .proto file from descriptor string. Possible?

Is it possible to decompile a string containing a Protocol Buffers descriptor back into a .proto file?
Say I have a long string like
\n\file.proto\u001a\u000ccommon.proto\"\u00a3\u0001\n\nMsg1Request\u0012\u0017\n\u0006common\u0018\u0001 ... etc.
I need to restore the .proto, not necessarily exactly as it was, but it should be compilable.
In C++, the FileDescriptor interface has a method DebugString() which formats the descriptor contents in .proto syntax -- i.e. exactly what you want. In order to use it, you first need to write code to convert the raw FileDescriptorProto to a FileDescriptor, using the DescriptorPool interface.
Something like this should do it:
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <iostream>
int main() {
    google::protobuf::FileDescriptorProto fileProto;
    fileProto.ParseFromFileDescriptor(0);  // read the serialized FileDescriptorProto from stdin
    google::protobuf::DescriptorPool pool;
    const google::protobuf::FileDescriptor* desc =
        pool.BuildFile(fileProto);
    std::cout << desc->DebugString() << std::endl;
    return 0;
}
You need to feed this program the raw bytes of the FileDescriptorProto, which you can get by using Java to encode your string to bytes using the ISO-8859-1 charset.
Also note that the above doesn't work if the file imports any other files -- you would have to load those imports into the DescriptorPool first (see the sketch below).
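A sketch of one way to handle that multi-file case, assuming the input is a serialized FileDescriptorSet (for example produced by protoc with --descriptor_set_out and --include_imports) whose files are listed with dependencies first:
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <iostream>

int main() {
    google::protobuf::FileDescriptorSet fileSet;
    fileSet.ParseFromFileDescriptor(0); // read the whole serialized set from stdin

    google::protobuf::DescriptorPool pool;
    for (const google::protobuf::FileDescriptorProto& fileProto : fileSet.file()) {
        // BuildFile requires every import of fileProto to be in the pool already,
        // which is why the set has to be ordered with dependencies first.
        const google::protobuf::FileDescriptor* desc = pool.BuildFile(fileProto);
        if (desc != nullptr)
            std::cout << desc->DebugString() << std::endl;
    }
    return 0;
}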
Yes, it should be possible to get something close to the original definition. I do not know of any existing code to do it (hopefully someone else will).
Have a look at how Protocol Buffers itself handles the string.
Basically:
Convert the string to bytes (using charset="ISO-8859-1" in Java); it will then be a Protocol Buffers message (format=FileDescriptorProto in Java). The FileDescriptorProto is built as part of the Protocol Buffers install.
Extract the data from the Protocol Buffers message.
Here is a FileDescriptor protocol displayed in the Protocol Buffer editor (screenshot not reproduced here).
