Embedding big data file into executable binary - c++11

I am working on a C++11 application that is supposed to ship as a single executable binary file. Optionally, users can provide their own CSV data files to be used by the application. To simplify things, assume each element is in format key,value\n. I have created a structure such as:
typedef struct Data {
std::string key;
std::string value;
Data(std::string key, std::string value) : key(key), value(value) {}
} Data;
By default, the application should use data defined in a single header file. I've made a simple Python script to parse default CSV file and put it into header file like:
#ifndef MYPROJECT_DEFAULTDATA
#define MYPROJECT_DEFAULTDATA
#include "../database/DefaultData.h"
namespace defaults {
std::vector<Data> default_data = {
Data("SomeKeyA","SomeValueA"),
Data("SomeKeyB","SomeValueB"),
Data("SomeKeyC","SomeValueC"),
/* and on, and on, and on... */
Data("SomeKeyASFHOIEGEWG","SomeValueASFHOIEGEWG")
};
}
#endif //MYPROJECT_DEFAULTDATA
The only problem is, that file is big. I'm talking 116'087 (12M) lines big, and it will probably be replaced with even bigger file in the future. When I include it, my IDE is trying to parse it and update indices. It slows everything down to the point where I can hardly write anything.
I'm looking for a way to either:
prevent my IDE (CLion) from parsing it or
make a switch in cmake that would use this file only with release executables or
somehow inject data directly into executable

Since your build process already includes a pre-process, which generates C++ code from a CSV, this should be easy.
Step 1: Put most of the generated data in the .cpp file, not a header.
Step 2: Generate your code so that it doesn't use vector or string.
Here's how to do these:
struct Data
{
string_view key;
string_view value;
};
You will need an implementation of string_view or a similar type. While it was standardized in C++17, it doesn't rely on C++17 features.
As for the data structure itself, this is what gets generated in the header:
namespace defaults {
extern const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data;
}
{{GENERATED_ARRAY_COUNT}} is the number of items in the array. That's all the generated header should expose. The generated .cpp file is a bit more complex:
static const char ptr[] =
"SomeKeyA" "SomeValueA"
"SomeKeyB" "SomeValueB"
"SomeKeyC" "SomeValueC"
...
"SomeKeyASFHOIEGEWG" "SomeValueASFHOIEGEWG"
;
namespace defaults
{
const std::array<Data, {{GENERATED_ARRAY_COUNT}}> default_data =
{
{{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
{{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
...
{{ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}, {ptr+{{GENERATED_OFFSET}}, {{GENERATED_SIZE}}}},
};
}
ptr is a single string formed by concatenating all of your individual strings. There is no need to put spaces, \0 characters, or any other separator between them. However, if you need to pass these strings to APIs that take null-terminated strings, you'll either have to copy them into a std::string or have the generator emit a \0 character after each substring.
The point is that ptr should be a single, giant block of character data.
Each {{GENERATED_OFFSET}} and {{GENERATED_SIZE}} pair is the offset and size, within that giant block of character data, of a single substring.
This method will solve two of your problems. It will be much faster at load time, since it performs zero dynamic allocations. And it puts the generated strings in the .cpp file, thus making your IDE cooperate.
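As a concrete illustration, here is a minimal sketch of what the generated .cpp file could look like for two records. It assumes a tiny hand-written stand-in for string_view (the hypothetical StrRef below) so that it compiles as C++11, and the name blob replaces ptr purely for readability. The offsets and sizes are the values a generator would compute. Note the doubled outer braces: std::array wraps a built-in array, so brace-initialized elements need one extra level of braces.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical minimal stand-in for string_view, usable in C++11.
struct StrRef {
    const char* data;
    std::size_t size;
    // Copies the referenced characters; only needed for null-terminated APIs.
    std::string str() const { return std::string(data, size); }
};

struct Data {
    StrRef key;
    StrRef value;
};

// One contiguous block holding every key and value back to back.
static const char blob[] =
    "SomeKeyA" "SomeValueA"
    "SomeKeyB" "SomeValueB";

// Offsets and sizes exactly as a generator would emit them.
const std::array<Data, 2> default_data = {{
    {{blob + 0, 8},  {blob + 8, 10}},
    {{blob + 18, 8}, {blob + 26, 10}},
}};
```

Looking up default_data[1].key.str() yields "SomeKeyB" without any allocation having happened at start-up; the only allocation is the explicit copy made by str().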

Related

how to include text file as string at compile time without adding c++11 string literal prefix and suffix in the text file

I'm aware of many similar questions on this site. I really like the solution mentioned in the following link:
https://stackoverflow.com/a/25021520/884553
With some modification, you can include a text file at compile time, for example:
constexpr const char* s =
#include "file.txt"
But to make this work you have to add a raw string literal prefix and suffix to your original file, for example:
R"(
This is the original content,
and I don't want this file to be modified. but i
don't know how to do it.
)";
My question is: is there a way to make this work without modifying file.txt?
(I know I can use command-line tools to make a copy, prepend and append to the copy, and remove the copy after compiling. I'm looking for a more elegant solution than this, ideally one that needs no other tools.)
Here's what I've tried (it does not work):
#include <iostream>
int main() {
constexpr const char* s =
#include "bra.txt" // R"(
#include "file.txt" //original file without R"( and )";
#include "ket.txt" // )";
std::cout << s << "\n";
return 0;
}
/opt/gcc8/bin/g++ -std=c++1z a.cpp
In file included from a.cpp:5:
bra.txt:1:1: error: unterminated raw string
R"(
^
a.cpp: In function ‘int main()’:
a.cpp:4:27: error: expected primary-expression at end of input
constexpr const char* s =
^
a.cpp:4:27: error: expected ‘}’ at end of input
a.cpp:3:12: note: to match this ‘{’
int main() {
^
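No, this cannot be done. A raw string literal is a single preprocessing token, and a token cannot start in one included file and end in another, which is why bra.txt is rejected as an unterminated raw string.
There is a C++2a proposal, std::embed, to allow inclusion of such resources at compile time.
The motivation section of the P1040R1 proposal: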
Motivation
Every C and C++ programmer -- at some point -- attempts to #include large chunks of non-C++ data into their code. Of course, #include expects the format of the data to be source code, and thusly the program fails with spectacular lexer errors. Thusly, many different tools and practices were adapted to handle this, as far back as 1995 with the xxd tool. Many industries need such functionality, including (but hardly limited to):
Financial Development
representing coefficients and numeric constants for performance-critical algorithms;
Game Development
assets that do not change at runtime, such as icons, fixed textures and other data
Shader and scripting code;
Embedded Development
storing large chunks of binary, such as firmware, in a well-compressed format
placing data in memory on chips and systems that do not have an operating system or file system;
Application Development
compressed binary blobs representing data
non-C++ script code that is not changed at runtime; and
Server Development
configuration parameters which are known at build-time and are baked in to set limits and give compile-time information to tweak performance under certain loads
SSL/TLS Certificates hard-coded into your executable (requiring a rebuild and potential authorization before deploying new certificates).
In the pursuit of this goal, these tools have proven to have inadequacies and contribute poorly to the C++ development cycle as it continues to scale up for larger and better low-end devices and high-performance machines, bogging developers down with menial build tasks and trying to cover-up disappointing differences between platforms.
MongoDB has been kind enough to share some of their code below. Other companies have had their example code anonymized or simply not included directly out of shame for the things they need to do to support their workflows. The author thanks MongoDB for their courage and their support for std::embed.
The request for some form of #include_string or similar dates back quite a long time, with one of the oldest stack overflow questions asked-and-answered about it dating back nearly 10 years. Predating even that is a plethora of mailing list posts and forum posts asking how to get script code and other things that are not likely to change into the binary.
This paper proposes <embed> to make this process much more efficient, portable, and streamlined. Here’s an example of the ideal:
#include <embed>
int main (int, char*[]) {
constexpr std::span<const std::byte> fxaa_binary = std::embed( "fxaa.spirv" );
// assert this is a SPIRV file, compile-time
static_assert( fxaa_binary[0] == 0x03 && fxaa_binary[1] == 0x02
&& fxaa_binary[2] == 0x23 && fxaa_binary[3] == 0x07
, "given wrong SPIRV data, check rebuild or check the binaries!" )
auto context = make_vulkan_context();
// data kept around and made available for binary
// to use at runtime
auto fxaa_shader = make_shader( context, fxaa_binary );
for (;;) {
// ...
// and we’re off!
// ...
}
return 0;
}
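Until something like std::embed is available, the workaround mentioned in the question (a build step that writes a wrapped copy of the file and leaves the original untouched) remains the practical route. The following is only a sketch of the wrapping logic, not part of the proposal: wrap_as_raw_literal is a hypothetical helper, and the "EMBED" delimiter is an arbitrary choice. It relies on the fact that a raw string delimiter may be up to 16 characters long, and lengthens the delimiter until the closing sequence cannot occur inside the content.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Wrap `content` in a raw string literal, choosing a delimiter such that
// the closing sequence )delim" never occurs inside the content itself.
std::string wrap_as_raw_literal(const std::string& content) {
    std::string delim = "EMBED";
    // Lengthen the delimiter until the terminator is absent from the content.
    while (content.find(")" + delim + "\"") != std::string::npos) {
        if (delim.size() == 16)  // raw string delimiters are limited to 16 chars
            throw std::runtime_error("no usable delimiter found");
        delim += '_';
    }
    return "R\"" + delim + "(" + content + ")" + delim + "\"";
}
```

A build step could read file.txt, write the result of this function to a generated file, and #include that generated file in place of the original.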

Reading an UTF-8 encoded file into std::u32string without intermediate buffering

Having worked with Unicode and C++ for quite a long time, I thought this would be a simple thing to accomplish, especially with the new C++11 std::codecvt_utf8 facet. It turned out to be a difficult task, though. What I want is to read a file encoded in UTF-8 into a u32string (converting it from UTF-8 to UTF-32 implicitly). Sure, I could load the entire content into a buffer and convert that using std::wstring_convert, but that doubles the memory footprint when loading a file. So I tried to use a std::wifstream and imbue a locale with a UTF-8 facet like this:
std::wifstream stream(fileName, std::ios::binary);
stream.imbue(std::locale(stream.getloc(), new std::codecvt_utf8<char32_t, 0x10ffff, std::consume_header>));
std::u32string data;
for (char32_t c; stream >> c; )
data += c;
which looks like a straightforward implementation, except that it doesn't compile. wifstream's element type is wchar_t, so you can only use wchar_t in the loop, like this:
std::u32string data;
for (wchar_t c; stream >> c; )
data += c;
(at least with clang, VC++ also accepts char32_t there, but that doesn't change anything). After fixing this several other problems remain, though:
In Visual C++ wchar_t is only 16bit (no UTF-32 then, we don't consider surrogate pairs here).
Using char32_t for the facet essentially disables conversion. The iteration over the stream returns the original UTF-8 content, both in clang and VC++.
Using wchar_t also for the facet makes it work in clang, but not in VC++, because in clang wchar_t is 32bit wide, while (as mentioned already) it is only 16bit in VC++.
So, what is the correct approach here? Because the stream is locked to wchar_t, I cannot even use a different data type for the facet. I also tried defining a basic_ifstream<char32_t>, but that requires additional typedefs, so I didn't follow that path further.
It seems there is no way to make this work by imbuing a stream with a facet, so I went with an intermediate buffer. It is an elegant solution too, except that it (more or less) doubles the memory needed to load the content. Use a byte (file) stream in binary mode to call this:
void load(std::istream &stream)
{
static std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utfConverter;
std::string s((std::istreambuf_iterator<char>(stream)), std::istreambuf_iterator<char>());
_data = utfConverter.from_bytes(s);
}
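If the extra buffer matters, one alternative is to skip the codecvt facet entirely and decode UTF-8 by hand while reading from the stream buffer. The sketch below is a different technique from the wstring_convert approach above, not a drop-in equivalent: read_utf8 is a hypothetical helper, it performs only basic validity checks (it does not reject overlong encodings or surrogate code points), and it does not strip a byte order mark.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <stdexcept>
#include <string>

// Decode UTF-8 from a byte stream straight into a std::u32string.
std::u32string read_utf8(std::istream& in) {
    std::u32string out;
    std::istreambuf_iterator<char> it(in), end;
    while (it != end) {
        unsigned char lead = static_cast<unsigned char>(*it);
        ++it;
        char32_t cp;
        int extra;  // number of continuation bytes that must follow
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        for (; extra > 0; --extra) {
            if (it == end) throw std::runtime_error("truncated UTF-8 sequence");
            unsigned char c = static_cast<unsigned char>(*it);
            ++it;
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

The function takes any std::istream, so it can be called with a std::ifstream opened in binary mode just like load() above. The result still grows by reallocation as characters are appended, so peak memory is not strictly the final size, but no full copy of the raw bytes is held.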

My exists function says the file exists, but winapi functions say it does not

I copied code that's supposed to change desktop wallpaper. I have this constant in my program:
const char * image_name = "button_out.gif";
Later, I write the image on disk using Magick++:
image.write(image_name);
The image appears in program's working directory. If I run the program directly from explorer the working directory equals the program location.
Because the code prints the 0x80070002 - File not found error, I added an exists function at the beginning:
#include <sys/stat.h>
bool exists(const char* name) {
struct stat buffer;
return (stat (name, &buffer) == 0);
}
void SetWallpaper(LPCWSTR file){
if(!exists((const char* )file)) {
wcout << "The file "<<file<<" does not exist!" << endl;
return;
}
... actually try to set a wallpaper ...
}
However, the error is not printed and the code proceeds.
Now the question is:
Does my exists function work properly?
Where does windows look for that image?
Full code to set a Magick++ generated image as background in case I have missed something relevant in this question.
Problem 1: String Conversions
Your primary problem is that you are attempting to use LPCWSTR (a const wchar_t *) and const char * interchangeably. I see a number of issues in your source, in particular:
You start with const char * image_name.
You then cast it to a LPCWSTR to pass to SetWallpaper. This basically guarantees that SetWallpaper will fail, as desktop->SetWallpaper is not able to handle non wide-character strings.
You then cast it back to a const char * to pass to stat() via exists(). This should work in your situation (since the original string really is a char *) but isn't correct because your string parameter to SetWallpaper is supposedly a proper LPCWSTR.
You need to pick a string format (wide-character vs. what Windows terms "ANSI") and stick to that format, using consistent APIs throughout.
The easiest option is probably just to leave most of your code untouched, but modify SetWallpaper to take a const char * and convert to a wide-character string when needed (for this you can use mbstowcs). So, for example:
void SetWallpaper(const char * file){ // <- Use a const char* parameter.
...
// Convert to a wide-character string to pass to COM:
wchar_t wcfile[MAX_PATH + 1];
mbstowcs(wcfile, file, sizeof(wcfile) / sizeof(wchar_t));
// Pass the converted wide-character string:
desktop->SetWallpaper(wcfile, 0);
...
}
The other option would be to use wide-character strings throughout, i.e.:
LPCWSTR image_name = L"button_out.gif";
Modify exists() to take a LPCWSTR and use _wstat() instead.
Use wide-character versions of all other API functions.
However, I am unsure how that would interact with the ImageMagick API, which may not have wide-character support. So it's up to you. Choose whatever approach is the easiest to implement but make sure you are consistent. The general rule is do not cast between LPCWSTR and const char *; if you are ever in a situation where you need to change one to the other, you cannot cast, you must convert (via mbstowcs or wcstombs).
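To make the "convert, don't cast" rule concrete, here is a small sketch of a conversion helper built on mbstowcs. The name to_wide is hypothetical. It assumes the common behavior (specified by POSIX and also documented for the Microsoft runtime) that calling mbstowcs with a null destination returns the required length, and it converts according to the current C locale, so plain ASCII file names work without further setup.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Convert a narrow (multibyte) string to a wide string using the C locale.
std::wstring to_wide(const char* narrow) {
    // Assumption: a null destination makes mbstowcs return the required
    // length (excluding the terminator), or (size_t)-1 on invalid input.
    std::size_t len = std::mbstowcs(nullptr, narrow, 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    std::wstring wide(len, L'\0');
    if (len > 0)
        std::mbstowcs(&wide[0], narrow, len);  // fills exactly len characters
    return wide;
}
```

With such a helper, the result of to_wide(image_name).c_str() is a genuine wide-character string that can be passed where an LPCWSTR is expected, and it avoids the fixed MAX_PATH buffer.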
Problem 2: SetWallpaper default directory is not current working directory
At this point, your string usage will be consistent. Now that you have that problem ironed out, if SetWallpaper fails while exists() does not, then SetWallpaper is not looking where you think it is. As you discovered in your comment, SetWallpaper looks in the desktop by default. In this case, while I have not tested it, you may be able to work around this by passing an absolute path to SetWallpaper. For this, you can use GetFullPathName to determine the absolute file name given your relative path. Remember to be consistent with your string types, though.
Also, if stat() continues to fail, then that problem is either that your working directory is not what you think it is, or your filename is not what you think it is. To that end you will want to perform the following tests:
Print the current working directory at the point you check for the file's existence, verify it is correct.
Print the filename when you check for its existence, verify it is correct.
You should be good to go once you work all the above issues out.

Restoring .proto file from descriptor string. Possible?

Is it possible to decompile a string containing Protocol Buffers descriptor back to .proto file?
Say I have a long string like
\n\file.proto\u001a\u000ccommon.proto\"\u00a3\u0001\n\nMsg1Request\u0012\u0017\n\u0006common\u0018\u0001 ... etc.
I need to restore .proto, not necessary exactly as it was but compilable.
In C++, the FileDescriptor interface has a method DebugString() which formats the descriptor contents in .proto syntax -- i.e. exactly what you want. In order to use it, you first need to write code to convert the raw FileDescriptorProto to a FileDescriptor, using the DescriptorPool interface.
Something like this should do it:
#include <google/protobuf/descriptor.h>
#include <google/protobuf/descriptor.pb.h>
#include <iostream>
int main() {
google::protobuf::FileDescriptorProto fileProto;
fileProto.ParseFromFileDescriptor(0);
google::protobuf::DescriptorPool pool;
const google::protobuf::FileDescriptor* desc =
pool.BuildFile(fileProto);
std::cout << desc->DebugString() << std::endl;
return 0;
}
You need to feed this program the raw bytes of the FileDescriptorProto, which you can get by using Java to encode your string to bytes using the ISO-8859-1 charset.
Also note that the above doesn't work if the file imports any other files -- you would have to load those imports into the DescriptorPool first.
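The ISO-8859-1 step works because that charset maps each code point from U+0000 to U+00FF to the single byte with the same value, so the descriptor string is really just raw bytes stored one per character. If you would rather stay in C++, the same conversion can be sketched as follows. The helper name latin1_to_bytes is hypothetical, and it assumes you have already unescaped the \uXXXX sequences into a std::u32string.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Encode a string as ISO-8859-1: every code point must fit in one byte.
std::string latin1_to_bytes(const std::u32string& text) {
    std::string bytes;
    bytes.reserve(text.size());
    for (char32_t cp : text) {
        if (cp > 0xFF)
            throw std::runtime_error("code point outside ISO-8859-1 range");
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return bytes;
}
```

The resulting std::string could then be handed to fileProto.ParseFromString() instead of reading the bytes from standard input.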
Yes, it should be possible to get something close to the original definition. I do not know of any existing code to do it (hopefully someone else will).
Have a look at how Protocol Buffers itself handles the string.
Basically:
Convert the string to bytes (using charset="ISO-8859-1" in Java); it will then be a Protocol Buffers message (format=FileDescriptorProto in Java). The FileDescriptorProto is built as part of the Protocol Buffers install.
Extract the data in the Protocol Buffers message.
Here is a File-Descriptor protocol displayed in the Protocol-Buffer editor

Printing out the names of implicitly linked dll's from .idata section in a portable executable

I am trying to write code that prints the names of all the imported DLLs in an exe by using the Name field of the IMAGE_IMPORT_DESCRIPTOR structure in the .idata section of the exe, but the program seems to be getting stuck in an infinite loop. Can someone please tell me how to get the names printed out correctly?
#include<iostream>
#include<Windows.h>
#include<stdio.h>
#include<WinNT.h>
int main()
{
FILE *fp;
int i;
if((fp = fopen("c:\\Linked List.exe","rb"))==NULL)
std::cout<<"unable to open";
IMAGE_DOS_HEADER imdh;
fread(&imdh,sizeof(imdh),1,fp);
fseek(fp,imdh.e_lfanew,0);
IMAGE_NT_HEADERS imnth;
fread(&imnth,sizeof(imnth),1,fp);
IMAGE_SECTION_HEADER *pimsh;
pimsh = (IMAGE_SECTION_HEADER *)malloc(sizeof(IMAGE_SECTION_HEADER) * imnth.FileHeader.NumberOfSections);
long t;
fread(pimsh,sizeof(IMAGE_SECTION_HEADER),imnth.FileHeader.NumberOfSections,fp);
for(i=0;i<imnth.FileHeader.NumberOfSections;i++)
{
if(!strcmp((char *)pimsh->Name,".idata"))
t = pimsh->PointerToRawData;
pimsh++;
}
fseek(fp,t,0);
IMAGE_IMPORT_DESCRIPTOR iid;
char c;
while(1)
{
fread(&iid,sizeof(iid),1,fp);
if(iid.Characteristics == NULL)
break;
t = ftell(fp);
fseek(fp,(long)iid.Name,0);
while(c=fgetc(fp))
printf("%c",c);
printf("\n");
fseek(fp,t,0);
}
}
There are several problems.
You can't assume the import section is called ".idata". You should locate the imports using IMAGE_OPTIONAL_HEADER.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].
Most offsets within a PE file are Relative Virtual Addresses (RVAs), not file offsets. To convert an RVA to an offset you need to determine which section the virtual address is in, then calculate an offset based on where the section is in the file. Specifically, the IMAGE_IMPORT_DESCRIPTOR.Name field contains an RVA, not a file offset.
Your code will be much simpler (and quicker) if you use a memory-mapped file rather than file I/O.
This MSDN article explains RVAs, the data directory, etc. It also includes pedump, an application with full source code for dumping PE files, which is a useful reference.
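The RVA-to-offset conversion described above can be sketched in a few lines. To keep the example portable it uses a simplified, hypothetical Section struct rather than the real IMAGE_SECTION_HEADER; the comments note which header field each member stands for. The function name rva_to_offset is likewise an assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for the relevant IMAGE_SECTION_HEADER fields.
struct Section {
    std::uint32_t virtual_address;  // VirtualAddress
    std::uint32_t virtual_size;     // Misc.VirtualSize
    std::uint32_t raw_offset;       // PointerToRawData
    std::uint32_t raw_size;         // SizeOfRawData
};

// Find the section containing `rva` and translate it to a file offset.
// Returns false if no section covers the address.
bool rva_to_offset(const std::vector<Section>& sections,
                   std::uint32_t rva, std::uint32_t& offset) {
    for (const Section& s : sections) {
        // Fall back to the raw size when the virtual size is recorded as 0.
        std::uint32_t span = s.virtual_size ? s.virtual_size : s.raw_size;
        if (rva >= s.virtual_address && rva - s.virtual_address < span) {
            offset = s.raw_offset + (rva - s.virtual_address);
            return true;
        }
    }
    return false;
}
```

In the program from the question, both the import directory's address and each IMAGE_IMPORT_DESCRIPTOR.Name value would go through such a conversion before being passed to fseek.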
The answer by mox is right on all points, however I would also like to add another solution - load the file as an image to read the data.
This is achieved very simply using LoadLibraryEx with just one line of code.
Base = LoadLibraryEx("c:\\Linked List.exe", 0, DONT_RESOLVE_DLL_REFERENCES);
This loads and maps your executable as an image, so there is no need for opening, reading, or mapping the file yourself, or for converting RVAs to raw offsets.
With the DONT_RESOLVE_DLL_REFERENCES flag the image is uninitialized, so all import data is untouched, and entrypoint code is not executed. The executable is just mapped into memory.
You can simply use Base + Rva to find imported dll name - or any other kind of PE information.
Free the executable image after use with FreeLibrary(Base).

Resources