Save data in the .exe file - c++11

Is there a way to create a console application program that asks for input data (e.g. users' birth dates, favourite food, everything), does anything I program it to do, and keeps that data stored in that .exe when I close it?
This way, when I open it again, all that data will still be saved there, so I just have to update or modify it.

Don't save data inside your executable but outside of it
Is there a way to create a [...] program that asks for input data [...] then keeps that data stored in that .exe when I close it?
There is no simple way (but you don't need that). What you want is related to persistence and application checkpointing.
In practice, you probably want to store data in a file -not your executable- (perhaps using some textual format like JSON) or in some database (perhaps as simple as some sqlite, or interacting with some RDBMS like PostGreSQL). For things like birthdays and food preferences, an sqlite database file is probably a good approach (see some SQLite tutorial). Put effort into a good design of your database schema.
This way, when I open it again, all that data will still be saved there
That data will still be there if you keep it in some outside file (perhaps a simple myappdata.sqlite one). You can easily design your program to create that file if it does not exist (this happens only the first time you run your program; on later runs, your program will successfully read that data from that outside file at startup).
In most current operating systems (read this textbook to learn more about OSes), notably Windows, MacOSX, Linux, Android, ..., the executable is supposed to be read-only. And it might be running in several processes at the same time (in such a case, what should happen? Think of the ACID properties).
The usual practice is to store data outside of the executable (most programs, including your text processor, your compiler, your web browser, ..., do exactly that). You don't explain why you want to store some data inside the executable, and doing so is unusual and highly operating-system and executable-format specific (for Linux, study carefully elf(5)...).
I would suggest saving the data in some optional file (or database) - its file path could have some wired-in constant default, etc. At startup, you check for the existence of that data (e.g. with access(2) on POSIX, or just by handling the failure case of fopen or sqlite3_open etc.). If it does not exist, you initialize your program data somehow. At exit (or at save time), you write that data. BTW, most programs do exactly that.
Notice that on most operating systems and computers, a piece of software is not simply a single executable file, but much more (e.g. required libraries and dependencies, configuration files, data files, build automation scripts such as a Makefile, etc.). Its installation is a well-identified technical process (sometimes quite a complex one), and package managers are helpful.
My feeling is that without specific motivation, you should not even try to store (mutable) data (persistently) in your executable (it is complex, brittle since very OS & compiler and build-chain specific, unusual, and opens vulnerabilities).
For completeness, some programs did actually write data by rewriting their own executable. On Linux, GNU emacs does that (in practice, only during its installation procedure) in its unexec.c file (very brittle, since OS- and compiler-specific), but that feature is disputed and is likely to disappear.
Many other systems deal cleverly with orthogonal persistence: SBCL has some save-lisp-and-die primitive (it usually persists the state in some other "image" file). Poly/ML has some export facility. J.Pitrat's CAIA system (see this paper and his blog; a 2016 tarball of CAIA is available -with permission- on my home page) is able to regenerate entirely all its C code and all the required data (in thousands of files). FullPliant is persisting its state in a well organized file tree. Such persistence or checkpointing techniques are tied to garbage collection (so you should then read the GC handbook) and are using techniques and algorithms close to copying garbage collection.
FWIW, my current project, bismon, is orthogonally persisting its entire heap, but does that outside of the main executable (in an ideal world, I would like to re-generate all its C or C++ source code; I am far from that goal).
My recommendation is to keep your software in several files: its executable, the C++ source code related to it, its data files (and probably many more dependencies, i.e. shared libraries or DLLs, font and image files, required binaries, etc.). Then you don't need to overwrite your executable when persisting your state. Since you mention C++ (which is not homoiconic), you could generate the C++ code of your system (then called a Quine program) with its persistent data (and leave the recompilation of all that generated C++ to the system's C++ compiler). I also recommend making your self-generating program free software (if you do that, be nice enough to edit your question to give its URL).
In C++, you might keep the data inside the executable (again, it is a bad idea, and I hope to have convinced you to avoid that approach) in the following way: you add one C or C++ source file (e.g. mydata.cc) which contains only data (e.g. some big const char data[]="... many lines of data ...";) - BTW, the XBM file format could be inspirational. You keep all the other *.o object files (in a place known to your program). To save data, you regenerate that mydata.cc file (with the new data for your current state) at each save operation, and then you run the appropriate commands (perhaps using std::system in your code) to compile that mydata.cc and link it with the kept *.o files into a fresh executable. So every save operation requires the recompilation of mydata.cc and its linking with the other *.o object files (and of course the C++ compiler and linker, perhaps with additional build automation tools, become required dependencies of your program). Such an approach is not simpler than keeping an external data file (and requires keeping those *.o object files anyway).
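For illustration only, a minimal sketch of such a (discouraged) save operation could look like the following. The helper current_state_as_cpp_literal, the g++ command line and the list of kept object files are all assumptions you would have to adapt to your own build:

// hypothetical sketch: regenerate mydata.cc, then rebuild the executable
#include <cstdlib>
#include <fstream>
#include <string>

// assumed to exist elsewhere: renders the current state as a C++ string literal
extern std::string current_state_as_cpp_literal();

void save_state_by_recompiling()
{
    // 1. regenerate mydata.cc with the current data as a C++ literal
    std::ofstream out("mydata.cc");
    out << "const char data[] = " << current_state_as_cpp_literal() << ";\n";
    out.close();

    // 2. recompile it and relink with the other kept object files
    //    (Linux-flavored command line; adapt to your compiler and build)
    int rc = std::system("g++ -c mydata.cc -o mydata.o && "
                         "g++ main.o other.o mydata.o -o myapp.new && "
                         "mv myapp.new myapp");
    if (rc != 0) {
        // the save failed; the old executable is left untouched
    }
}

Even in this sketch, the compiler toolchain becomes a runtime dependency of your program, which is exactly why the external data file is the simpler route.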
This way, when I open it again, all that data will still be saved there
If your goal is just to get back data that was written in the past, just keep that data in some optional database or file (as many programs do: your word processor asks you to save its document before exiting if you started it without any document and wrote a few words into it), outside of your executable, and write it before exiting your program. No need to overwrite your executable!

What you want to do requires the ability to write into (and possibly read from) an executable while it is running. As far as I know this is not possible.
While it is possible to change the behaviour of a running executable based on the user input which it is pre-conditioned to receive (think of a video game), it is not possible to store those inputs directly into the exe.
Video games store the progress and points of the player (which are the result of the player's inputs) in file(s) outside the running .exe.
So you will have to store the data in a file outside of the .exe file.
I normally use google protocol buffers to do this.
A good explanation of them can be found here.
They are free, simple to use and supported for C++.
They are better than other formats like XML.
Some of the advantages are mentioned here
Protocol buffers have many advantages over XML for serializing structured data.
Protocol buffers:
are simpler
are 3 to 10 times smaller
are 20 to 100 times faster
are less ambiguous
generate data access classes that are easier to use programmatically
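Using them from C++ for the use case above boils down to a few calls on a generated message class. A minimal, hedged sketch (the Person message and the file name mydata.bin are made up; the generated class comes from a .proto file you would compile with protoc):

// assumed .proto, compiled by protoc into person.pb.h / person.pb.cc:
//   message Person { string name = 1; string birth_date = 2; }
#include <fstream>
#include "person.pb.h"   // hypothetical generated header

void save_person(const Person& p)
{
    std::ofstream out("mydata.bin", std::ios::binary);
    p.SerializeToOstream(&out);            // standard protobuf C++ API
}

bool load_person(Person& p)
{
    std::ifstream in("mydata.bin", std::ios::binary);
    return in && p.ParseFromIstream(&in);  // false if the file is missing or corrupt
}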

managing birth dates
As I explain in my other answer (which you should read before this one), you don't want to save data inside your .exe file.
But I am guessing that you want to keep the user's birth date (and other data) from one run to the next. This answer focuses mostly on that "users birth date" aspect and guesses that your question is some XY problem (you really care about birth dates, not about overwriting an executable).
So, you decide to keep them somewhere (but outside of your executable). That could be a textual file; perhaps using JSON or YAML format, or some other textual file format that you define properly, specified in EBNF notation in some document; or a binary file (perhaps protocol buffers as suggested by P.W, or some sqlite "database file", or your own binary format that you need to document properly). It is very important to document properly the file format you are using.
Dealing with a textual file (whose format you have well defined) is easy with just fopen. You first need to define a file path, perhaps as simple as
#define MYAPP_DATA_PATH "mydata.txt"
or better
const char* myapp_data_path = "mydata.txt";
(in reality, you'd better use some absolute file path to be able to run your program from various working directories, and provide some way to redefine it, e.g. through program options, i.e. command-line arguments)
You might also need to organize some data structure (a global variable MyData global_data; perhaps) keeping that data. In C++, you'll define some class MyData; and you want it to have at least member functions like void MyData::add_birth_date(const std::string& person, const std::chrono::system_clock::time_point& birthdate); and void MyData::remove_birth_date(const std::string& person);. Probably you'll have more classes like class Person; etc.
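A minimal sketch of such a class (the member names and the choice of std::map are just suggestions):

#include <chrono>
#include <map>
#include <string>

class MyData {
public:
    using time_point = std::chrono::system_clock::time_point;

    void add_birth_date(const std::string& person, const time_point& birthdate) {
        birthdates_[person] = birthdate;
    }
    void remove_birth_date(const std::string& person) {
        birthdates_.erase(person);
    }
private:
    std::map<std::string, time_point> birthdates_;   // person -> birth date
};

MyData global_data;   // the global state mentioned above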
using textual format
So your application starts first by filling global_data if a file mydata.txt exists (otherwise, your global_data keeps its empty initial state). That is simple, you'll have some initialization function like:
void initial_fill_global_data(void) {
    std::ifstream input(myapp_data_path);
    // the file opening could have failed... then we return immediately
    if (!input || !input.good() || input.fail())
        return;
    // ... otherwise parse the input and call global_data.add_birth_date(...) ...
}
Of course, you need to parse that input. Use well known parsing techniques that would call global_data.add_birth_date appropriately. Notice that for the JSON format, you'll find good C++ libraries (such as jsoncpp) to make that really easy.
Before exiting your application, you should save that file. So you would call a save_global_data function which writes the contents of MyData into the mydata.txt file. BTW, you could even register it with std::atexit.
The functions initial_fill_global_data and save_global_data could be member functions (perhaps static ones) of your MyData class.
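As a hedged sketch of the saving side (using the jsoncpp API; the JSON layout and the way global_data is walked are assumptions):

#include <cstdlib>
#include <fstream>
#include <json/json.h>   // jsoncpp

void save_global_data()
{
    Json::Value root;
    // fill root from global_data, e.g. root["birthdates"]["alice"] = "2001-02-03";
    std::ofstream out(myapp_data_path);
    Json::StreamWriterBuilder builder;
    out << Json::writeString(builder, root);
}

int main(int argc, char** argv)
{
    std::atexit(save_global_data);   // write the file on every normal exit
    initial_fill_global_data();
    // ... rest of the program ...
}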
You might want your program to lock the data file, so that two processes running your program won't wreak havoc. This is operating-system specific (e.g. flock(2) on Linux).
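On Linux, a minimal sketch of taking such a lock for the whole lifetime of the process (the path handling and error handling are simplified):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// returns the locked file descriptor, or -1 on failure
int lock_data_file(const char* path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) != 0) {   // another process holds the lock
        close(fd);
        return -1;
    }
    return fd;   // keep it open; the lock is released on close or process exit
}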
using an sqlite database file
I also suggested keeping your data in an sqlite database file. Read some sqlite tutorial and refer to the sqlite C & C++ interface reference documentation. Then you need to think of a well-designed database schema. And you don't need to keep all the data in memory anymore, since sqlite is capable of managing a big amount of data (many gigabytes), more than what fits in memory.
Obviously you need a global database pointer. So declare some global sqlite3* db;. Of course, myapp_data_path is now some "mydata.sqlite" path. Your main starts by opening that (and creating an empty database if necessary) using
int opsta = sqlite3_open(myapp_data_path, &db);
if (opsta != SQLITE_OK) {
    std::cerr << "failed to open database " << myapp_data_path
              << " with error#" << opsta << "=" << sqlite3_errstr(opsta)
              << std::endl;
    exit(EXIT_FAILURE);
}
If the database did not exist, it is created empty. In that case, you need to define appropriate tables and indexes in it. My first suggestion could be something as simple as
char* errormsg = NULL;
int errcod = sqlite3_exec(db,
                          "CREATE TABLE IF NOT EXISTS data_table ("
                          " name STRING NOT NULL UNIQUE,"
                          " birthdate INT"
                          ")",
                          NULL, NULL,   // no result callback is needed for DDL
                          &errormsg);
if (errcod != SQLITE_OK) {
    std::cerr << "failed to create data_table " << errormsg << std::endl;
    exit(EXIT_FAILURE);
}
Of course, you need to think of some more clever database schema (in reality you'll want several tables, some database normalization, and you should cleverly add indexes on your tables), and to prepare the queries (turning them into sqlite3_stmt-s) used in your program.
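For example, a hedged sketch of inserting one row with a prepared statement (the table and column names match the simple schema above; error handling is abbreviated):

#include <sqlite3.h>
#include <string>

void add_birth_date_db(sqlite3* db, const std::string& name, long long birthdate)
{
    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db,
            "INSERT INTO data_table(name, birthdate) VALUES(?1, ?2)",
            -1, &stmt, nullptr) != SQLITE_OK)
        return;                                   // handle the error in real code
    sqlite3_bind_text(stmt, 1, name.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_int64(stmt, 2, birthdate);
    sqlite3_step(stmt);                           // expect SQLITE_DONE here
    sqlite3_finalize(stmt);
}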
Obviously you should not save data inside your executable. In all the approaches above, your myapp program behaves the way you want. The first time it runs, it initializes some data -outside of the myapp executable- on the disk if that data is missing. On later runs, it reuses and updates that data. But the myapp executable itself is never rewritten while it is running.

Related

Ruby PStore file too large

I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am not able to write the file to disk anymore and I receive the following error:
Errno::EINVAL: Invalid argument - <filename>
I am aware that this is probably a limitation of IO but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution should be to switch to a proper database in the backend, but because of some limitations of the specific Ruby (Sketchup) I am using this is not always possible.
I am going to assume that your data has a field that could be used as a crude key.
Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets.
For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file.
A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore.
Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
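The bucketing idea itself is language-agnostic; a minimal sketch of deriving the bucket path from a record's name (shown in C++ for illustration, with made-up names and an arbitrary prefix length of 4):

#include <string>

// e.g. "rojjberg" -> "r/rojj-datafile.pstore"
std::string bucket_path_for(const std::string& name, std::size_t prefix_len = 4)
{
    if (name.empty())
        return "misc-datafile.pstore";            // fallback bucket
    std::string prefix = name.substr(0, prefix_len);
    return std::string(1, name[0]) + "/" + prefix + "-datafile.pstore";
}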

Initializing large arrays efficiently in Xcode

I need to store a large number of different kind of confidential data in my project.
The data can be represented as encoded NSStrings. I would rather initialize this in code than read it from a file, since that is more secure.
So I would need about 100k lines like this:
[myData addObject: @"String"];
or like this
myData[n++] = @"String";
Putting these lines in Xcode causes compile time to increase dramatically, up to several hours (by the way, in Eclipse it takes a fraction of a second to compile 100k lines like this).
What would be feasible secure alternatives?
(please do not suggest reading from a file since this makes the data much easier to crack)
Strings in your code can be readily dumped with tools like strings.
Anyway, if you want to incorporate a data file directly into the executable, you can do that with the -sectcreate linker option. Add something like -Wl,-sectcreate,MYSEG,MYSECT,path to the Other Linker Commands build setting. In your code, you can use getsectdata() to access that data section.
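A hedged sketch of reading such a section back (the MYSEG/MYSECT names are the made-up ones from the linker flag above; note that on a position-independent executable the returned address may still need to be adjusted for the ASLR slide):

#include <mach-o/getsect.h>

// returns a pointer into the MYSEG,MYSECT section created at link time
const char* embedded_data(unsigned long* size_out)
{
    return getsectdata("MYSEG", "MYSECT", size_out);
}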
However, you must not consider any of the data that you actually deliver to the end user, whether in code or resource files, as "confidential". It isn't and can never be.
I would put the strings in a plist file and read it into an NSArray at run time. For security encrypt the file.

Move or copy and truncate a file that is in use

I want to be able to (programmatically) move (or copy and truncate) a file that is constantly in use and being written to. This would ensure that the file being written to never grows too big.
Is this possible? Either Windows or Linux is fine.
To be specific what I'm trying to do is log video with FFMPEG and create hour long videos.
It is possible in both Windows and Linux, but it would take cooperation between the applications involved. If the application that is writing the new data to the file is not aware of what the other application is doing, it probably would not work (well ... there is some possibility ... back to that in a moment).
In general, to get this to work, you would have to open the file shared. For example, if using the Windows API CreateFile, both applications would likely need to specify FILE_SHARE_READ and FILE_SHARE_WRITE. This would allow both (multiple) applications to read and write the file "concurrently".
Beyond sharing the file, though, it would also be necessary to coordinate the operations between the applications. You would need to use some kind of locking mechanism (either by locking some part of the file or some shared mutex/semaphore). Note that if you use file locking, you could lock some known offset in the file to act as a "semaphore" (it can even be a byte value beyond the physical end of the file). If one application were appending to the file at the same exact time that the other application were truncating it, then it would lead to unpredictable results.
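A hedged sketch of the Windows side of that idea (opening the file shared, then taking an exclusive lock on a single byte used as the "semaphore" mentioned above; the chosen offset and error handling are simplified):

#include <windows.h>

HANDLE open_shared(const wchar_t* path)
{
    return CreateFileW(path,
                       GENERIC_READ | GENERIC_WRITE,
                       FILE_SHARE_READ | FILE_SHARE_WRITE,   // let both apps in
                       nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
}

bool lock_semaphore_byte(HANDLE h, unsigned long long offset)
{
    OVERLAPPED ov = {};
    ov.Offset     = static_cast<DWORD>(offset & 0xFFFFFFFFu);
    ov.OffsetHigh = static_cast<DWORD>(offset >> 32);
    // exclusive lock on one byte, blocking until it becomes available
    return LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &ov) != 0;
}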
Back to the comment about both applications needing to be aware of each other ... It is possible that if both applications opened the file exclusively and kept retrying the operations until they succeeded, then perform the operation, then close the file, it would essentially allow them to work without "knowledge" of each other. However, that would probably not work very well and not be very efficient.
Having said all that, you might want to consider alternatives for efficiency reasons. For example, if it were possible to have the writing application write to new files periodically, it might be more efficient than having to "move" the data constantly out of one file to another. Also, if you needed to maintain some portion of the file (e.g., move out the first 100 MB to another file and then move the second 100 MB to the beginning) that could be a fairly expensive operation as well.
logrotate would be a good option on Linux; it comes stock on just about any distro. I'm sure there's a similar Windows service out there somewhere.

Transaction implementation for a simple file

I'm a part of a team writing an application for embedded systems. The application often suffers from data corruption caused by power shortage. I thought that implementing some kind of transactions would stop this from happening. One scenario would include copying the area of a file before writing to some additional storage (transaction log). What are other possibilities?
Databases use a variety of techniques to assure that the state is properly persisted.
The DBMS often retains a replicated control file -- several synchronized copies on several devices. Two is enough. More if you're paranoid. The control file provides a few key parameters used to locate the other files and their expected states. The control file can include a "database version number".
Each file has a "version number" in several forms. A lot of the time it's stored in plain form plus in some XOR-complement, so that the two version numbers can be trivially checked to have the correct relationship and to match the control file version number.
All transactions are written to a transaction journal. The transaction journal is then written to the database files.
Before writing to database files, the original data block is copied to a "before image journal", or rollback segment, or some such.
When the block is written to the file, the sequence numbers are updated, and the block is removed from the transaction journal.
You can read up on RDBMS techniques for reliability.
There are a number of ways to do this; generally the only assumption required is that small writes (<4k) are atomic. For example, here's how CouchDB does it:
A 4k header contains, amongst other things, the file offset of the root of the BTree containing all the data.
The file is append-only. When updates are required, write the update to the end of the file, followed by any modified BTree nodes, up to and including the root. Then, flush the data, and write the new address of the root node to the header.
If the program dies while writing an update but before writing the header, the extra data at the end of the file is discarded. If it fails after writing the header, the write is complete and all is well. Because the file is append-only, these are the only failure scenarios. This also has the advantage of providing multi-version concurrency control with no read locks.
When the file grows too long, simply read out all the 'live' data and write it to a new file, then delete the original.
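A hedged sketch of that commit protocol (greatly simplified; only the fixed-size header holding a root offset is taken from the description above, and a real implementation would fsync rather than merely fflush):

#include <cstdint>
#include <cstdio>

const long HEADER_SIZE = 4096;   // fixed-size header at the start of the file

// append the new data, flush it, then publish it by rewriting the header
bool commit(std::FILE* f, const void* data, std::size_t len)
{
    std::fseek(f, 0, SEEK_END);
    long new_root = std::ftell(f);                 // offset of the newly written root
    if (new_root < HEADER_SIZE) return false;      // header must be preallocated
    if (std::fwrite(data, 1, len, f) != len) return false;
    if (std::fflush(f) != 0) return false;         // data must be durable first

    std::fseek(f, 0, SEEK_SET);                    // now update the header
    std::int64_t root = new_root;
    if (std::fwrite(&root, sizeof root, 1, f) != 1) return false;
    return std::fflush(f) == 0;                    // the header write is the commit
}

If the process dies before the header write, the old header still points at the old root and the trailing bytes are simply ignored, which is exactly the failure model described above.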
You can avoid implementing such transaction logs yourself by using existing transaction managers around file systems, e.g. XADisk.
The old link is no longer available; a GitHub repo is here.

Are there alternatives for creating large container files that are cross platform?

Previously, I asked the question.
The problem is that the demands on our file structure are very high.
For instance, we're trying to create a container with up to 4500 files and 500mb data.
The file structure of this container consists of
SQLite DB (under 1mb)
Text based xml-like file
Images inside a dynamic folder structure that make up the rest of the 4,500ish files
After the initial creation, the image files are read-only, with the exception of deletion.
The small db is used regularly when the container is accessed.
Tar, Zip and the like are all too slow (even with 0 compression). Slow is subjective, I know, but untarring a container of this size takes over 20 seconds.
Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT).
Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on, I'll go in to more detail.
2) 4500 files and 500MB of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. Just an I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files into a single blob. While you are concatenating them, you track each filename, the file length, and the offset at which the file starts within the blob. You write that information out into a block of data, sorted by name. We'll call this the Table of Contents, or TOC block.
Next, you concatenate the two pieces together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the beginning of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then, at the very end, append the offset to the start of the TOC. To read it, you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to build the archive in two passes (with the TOC at the front, you don't know its size or the file offsets until every file has been scanned).
If you lay out your TOC in blocks (say 1K byte in size), then you can easily perform a binary search on the TOC. Simply fill each block with the File information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC, start in the middle, read the first file name, and go from there. Soon, you'll find the block, and then you read in the block and scan it for the file. This makes it efficient for reading without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
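A hedged sketch of the lookup described above, using the simple layout (TOC first, then the data block); the fixed-size, NUL-terminated-name TOC record is a made-up choice, and a real reader would binary-search the sorted TOC instead of scanning it:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// made-up fixed-size TOC record: 64-byte NUL-terminated name, offset, length
struct TocEntry {
    char          name[64];
    std::uint64_t offset;   // offset within the data block
    std::uint64_t length;
};

// find 'wanted' in the TOC, then seek to toc_size + offset and read length bytes
std::vector<char> read_entry(std::FILE* blob, std::uint64_t toc_entries,
                             const std::string& wanted)
{
    std::uint64_t toc_size = toc_entries * sizeof(TocEntry);
    for (std::uint64_t i = 0; i < toc_entries; ++i) {
        TocEntry e;
        std::fseek(blob, static_cast<long>(i * sizeof(TocEntry)), SEEK_SET);
        if (std::fread(&e, sizeof e, 1, blob) != 1) break;
        if (wanted == e.name) {
            std::vector<char> buf(e.length);
            std::fseek(blob, static_cast<long>(toc_size + e.offset), SEEK_SET);
            std::fread(buf.data(), 1, buf.size(), blob);
            return buf;
        }
    }
    return {};   // not found
}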
I suggest you pad the files to block sizes as well; disks like to work with regularly sized blocks of data, and this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look into some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
Working on the assumption that you're only going to need read-only access to the files, why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length? All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language but it's pretty straightforward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank-you for expanding your question, it helps a lot in providing better answers.
Given that you're going to need an SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008 so I'm not positive about the capabilities of SQLite, but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend putting files inside the database, but given that the total size of all images is around 500MB for 4500 images, you're looking at a little over 100K per image, right? If you're using a dynamic path to store the images then in a slightly more normalized database you could have an "ImagePaths" table that maps each path to an ID; then you can look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also be in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OSX without issue. You can simply rely on your SQLite engine to provide the performance and compatibility you need.
How you optimize it depends on your usage, for example if you're frequently needing to get all images at a certain path then having a PathID (as an integer for performance) would be fast, but if you're showing all images that start with "A" and simply show the path as a property then an index on the ImageName column would be of more use.
I am a little concerned, though, that this sounds like premature optimization. You really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps, if you have both Mac and PC versions) uses a simple repository or similar, and then you can change the storage/retrieval method at will without any implications for your application.
Check Solid File System - it seems to be what you need.
