I am currently working on a tool to extract archives from a game for the purpose of data mining. I currently extract metadata from the archives (number of files per archive, filenames, packed/unpacked sizes, etc.) and write them to a std::wstring for further analysis. I have stumbled over an issue with converting filenames to wide characters using std::wstring_convert.
My code looks something like this now:
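#include <array>
#include <codecvt>
#include <cstdint>
#include <locale>
#include <sstream>
#include <string>

struct IndexEntry {
    int32_t file_id;
    std::array<char, 260> filename;
    // more fields
};

std::wstring foo(const IndexEntry& entry) {
    std::wstringstream buffer;
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    buffer << entry.file_id << L'\n';
    // from_bytes throws std::range_error on bytes that are not valid UTF-8
    buffer << converter.from_bytes(entry.filename.data()) << L'\n';
    // add rest of the IndexEntry fields to the stream
    return buffer.str();
}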
The IndexEntry struct is filled by reading from files with a std::ifstream in binary mode. The error happens with converter.from_bytes(). Some of the filenames contain 0x81 as a character and when the converter encounters these, it throws a std::range_error exception.
Is there a way to tell wstring_convert to replace characters it cannot convert with something else? Or is there a generally better way to handle this conversion?
This whole project is mostly a learning exercise. I wanted to do all internal string handling with wstring, so I can get some experience dealing with strings in different encodings. Unfortunately, I have no idea what exact encoding was used to generate these archive files.
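One possible direction, sketched here under assumptions rather than as a drop-in fix: decode the bytes by hand and substitute U+FFFD for anything that is not well-formed UTF-8, so the conversion never throws. The function name decodeUtf8Lossy is made up for this illustration, and it returns std::u32string instead of std::wstring to avoid depending on the platform's wchar_t width. It does not recover the original characters; if the archives use a legacy code page, the replaced bytes are simply lost.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes `bytes` as UTF-8 into UTF-32 code points.
// Every byte that does not belong to a well-formed sequence becomes U+FFFD
// and decoding resumes at the next byte, so bad input never throws.
std::u32string decodeUtf8Lossy(const std::string& bytes) {
    constexpr char32_t kReplacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // continuation bytes expected after the lead
        char32_t cp = 0;         // code point being assembled
        char32_t smallest = 0;   // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; smallest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; smallest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; smallest = 0x10000; }
        else {                           // e.g. a stray 0x81
            out.push_back(kReplacement);
            ++i;
            continue;
        }

        bool ok = i + extra < bytes.size();  // all continuation bytes present?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (ok && (cp < smallest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (ok) {
            out.push_back(cp);
            i += extra + 1;
        } else {
            out.push_back(kReplacement);
            ++i;
        }
    }
    return out;
}
```

With this, a filename containing a lone 0x81 comes back with U+FFFD in that position instead of raising std::range_error. On a platform where wchar_t is 16 bits, turning the result into a std::wstring would need an extra UTF-16 encoding step that is not shown here.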
One of the most widely used functions for output generation in OMNeT++ is recordScalar.
virtual void recordScalar (cComponent *component, const char *name, double value, opp_string_map *attributes=nullptr)=0
Is there a more comprehensive function than recordScalar that stores structured data as the value instead of a double, or do we have to code it ourselves?
Alternatively, should we write a similar function that writes these outputs to a text file in JSON format?
By structured data, I mean a C++ struct type like this:
struct logtype {
    int src;
    int dest;
    int messagescount; // the count of messages transmitted between src and dest and vice versa
};
Thanks
OMNeT++ does not provide a ready-to-use tool for storing complex structures. However, OMNeT++ models are written in C++, so you can write your own method that stores the data in a text file, a JSON file, or any other kind of file.
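As a rough illustration of such a hand-written method, here is a minimal sketch that formats one logtype record as a JSON object and appends it to a file, one object per line. The names toJson and appendRecord are hypothetical and not part of OMNeT++; the sketch uses only the C++ standard library.

```cpp
#include <fstream>
#include <sstream>
#include <string>

struct logtype {
    int src;
    int dest;
    int messagescount;  // messages exchanged between src and dest
};

// Hypothetical helper: formats one record as a single-line JSON object.
std::string toJson(const logtype& entry) {
    std::ostringstream out;
    out << "{\"src\":" << entry.src
        << ",\"dest\":" << entry.dest
        << ",\"messagescount\":" << entry.messagescount << "}";
    return out.str();
}

// Hypothetical helper: appends the record to the file at `path`,
// one JSON object per line. Returns false if the file cannot be written.
bool appendRecord(const std::string& path, const logtype& entry) {
    std::ofstream file(path, std::ios::app);
    if (!file) {
        return false;
    }
    file << toJson(entry) << '\n';
    return static_cast<bool>(file);
}
```

A module could call appendRecord wherever it would otherwise call recordScalar, for example once per src/dest pair at the end of the simulation. The fields here are all integers, so no string escaping is needed; a struct with text fields would need proper JSON escaping or a JSON library.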
Is it possible to use custom encoding with the Jackson CsvMapper? I need the CSV to be encoded with Windows-1252, not the default UTF-8.
Here is a simplified version of my code:
class CSVService(private val mapper: CsvMapper) {
fun <T : CsvResponse> writeAsCSV(result: List<T>, resultClass: KClass<T>, separatorChar: Char = ',', encoding: Charset = Charsets.WINDOWS_1252): String {
mapper.schemaFor(resultClass.java).withHeader().withColumnSeparator(separatorChar)
return mapper.writeValueAsBytes(result).toString(encoding) // does not work.
}
}
It seems like the object mapper writeValueAsBytes method has UTF-8 hard-coded.
Can someone show, with an example, how to configure the object mapper to use a different encoding?
Thanks in advance.
Any help is appreciated.
Using com.fasterxml.jackson.dataformat:jackson-dataformat-csv:2.11.2
The Jackson Javadoc specifies that writeValueAsBytes always encodes its output as UTF-8.
To customize the charset, you'll have to create an in-memory writer as an intermediate output.
Example:
val buffer = ByteArrayOutputStream()
OutputStreamWriter(buffer, encoding).use {
    mapper.writeValue(it, result)
}
val bytes = buffer.toByteArray()
EDIT:
On second look, I see you try to convert the byte array back to a string. That operation undoes any effort to customize the encoding. Java represents text internally as UTF-16, and the charset passed to toString(encoding) is only used to decode the bytes from their originating charset (in your case Windows-1252 by default).
It has no effect on later writing, because it is the writer that is responsible for the encoding, not the content. So, once you've written your byte array, either return it as is, or set the encoding later on the final encoder.
When I try to define Unicode string literals in C++(17) I see some very odd results during debugging, which I would like to discuss. Look at the following variable definitions:
std::string u8 { u8"⬀⬁" };
std::string u8_1 { u8"\u2B00\u2B01" };
std::u16string u16 { u"⬀⬁" };
std::u16string u16_1 { 0x2B00, 0x2B01 };
std::u32string u32 { U"⬀⬁" };
std::u32string u32_1 { 0x2B00, 0x2B01 };
std::string u8_2 { u8"\u2B00⬀⬁" };
std::u16string u16_2 { u"\u2B00⬀⬁" };
std::u32string u32_2 { U"\u2B00⬀⬁" };
During debugging I now get the following strings:
As you can see, the values are pretty surprising. Strings defined with an initializer list or escape codes appear correct, while those specified as normal characters appear as if they were UTF-8 encoded and the bytes were written to the strings. This is correct for a UTF-8 string (u8_1 here), but not for UTF-16 and UTF-32 strings (u16 and u32 here). What's even more odd is that the u8 variable contains values that are doubly UTF-8 encoded. You can easily prove that with an online UTF converter. If the debugger were the problem I wouldn't see any correct values, but since it shows correct values for strings specified with escape sequences, I assume something else is responsible for messing up the strings.
What would explain the results is if the file is UTF-8 encoded and its bytes were copied directly into the string variables without any conversion. I think the string in the C++ file should instead be converted from the source encoding (which is indeed UTF-8) to the correct target encoding. Needless to say, this works as expected in Xcode.
Is this a known problem? Can this somehow be worked around to avoid having to use numeric values instead?
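To take the debugger out of the picture, the literals can be compared against the expected code units at run time. This is only a sketch; the helper names isArrowPairUtf16 and isArrowPairUtf32 are invented for illustration, and the expected values are the two code points U+2B00 and U+2B01 used in the definitions above.

```cpp
#include <string>

// Hypothetical helper: true when `literal` holds exactly the UTF-16 code
// units for U+2B00 U+2B01 (both are in the BMP, so one unit each).
bool isArrowPairUtf16(const std::u16string& literal) {
    return literal == std::u16string{0x2B00, 0x2B01};
}

// Hypothetical helper: the same check for a UTF-32 string.
bool isArrowPairUtf32(const std::u32string& literal) {
    return literal == std::u32string{0x2B00, 0x2B01};
}
```

Passing a literal written with escapes, such as u"\u2B00\u2B01", should make the check return true regardless of the file's encoding. Passing the same text typed as raw characters tests how the compiler decoded the source file: if that call returns false, the compiler probably read the file with a source character set other than UTF-8.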
I'm creating a file that isn't really a CSV file, but SuperCSV can make creating it easier. Each line of the file has a different length and follows a layout that doesn't separate the individual pieces of information. So, to know what information a line holds, you need to look at the first 2 characters (the name of the register), count the characters, and extract each field by size.
I've configured SuperCSV to use an empty delimiter; however, the created file contains a space where there should be nothing.
public class TarefaGerarArquivoRegistrosFiscais implements ITarefa {
private static final CsvPreference FORMATO_ANEXO_IV = new CsvPreference.Builder('"', '\0' , "\r\n").build();
public void processar() {
try {
writer = new CsvListWriter(getFileWriter(), FORMATO_ANEXO_IV);
writer.write(geradorRegistroU1.gerar());
} finally {
if (writer != null)
writer.close();
}
}
}
Am I doing something wrong? Is '\0' the correct code for a null char?
It's probably not what you want to hear, but I wouldn't recommend using Super CSV for this (and I'm a committer!). Its sole purpose is to deal with delimited files - and you're not using delimiters.
You could misuse Super CSV by creating a wrapper object (containing your List) whose toString() method simply concatenates all of the values together, then passing that single object to writer.write(), but it's an awful hack.
I'd recommend either finding another library more suited to your problem, or writing your own solution.
I have this structure defined and a class in my project. It is a class that holds id numbers generated by GetIdUsingThisString(char *), which is a function that loads a texture file into GPU and returns an id(OpenGL).
The problem is, when I try to read a specific file, the program crashes. When I run this program in VS with debugging it works fine, but running the .exe (or running without debugging from MSVS) crashes the program. Using the just-in-time debugger I found out that, for the num of that specific file, Master[num].name actually contains "\x5" appended to the end of the file path, and this happens only for this one file. Nothing outside this method could cause it, and I also use forward slashes (/) in paths, not backslashes (\).
struct WIndex{
char* name;
int id;
};
class Test_Class
{
public:
Test_Class(void);
int AddTex(char* path);
struct WIndex* Master;
TextureClass* tex;
//some other stuff...
};
Constructor:
Test_Class::Test_Class(void)
{
num=0;
Master=(WIndex*)malloc(1*sizeof(WIndex));
Master[0].name=(char*)malloc(strlen("Default")*sizeof(char));
strcpy(Master[0].name,"Default");
Master[0].id=GetIdUsingThisString(Master[0].name);
}
Adding a new texture:(The bug)
int Test_Class::AddTex(char* path)
{
num++;
Master=(WIndex*)realloc(Master,(num+1)*sizeof(WIndex));
Master[num].name=(char*)malloc(strlen(path)*sizeof(char));
strcpy(Master[num].name,path); // <--- HERE
Master[num].id=GetIdUsingThisString(path);
return Master[num].id;
}
At runtime, when AddTex is called with this file, path has the right value, while Master[num].name shows the modified value (with "\x5" appended) after strcpy.
Question:
Is there something wrong with copying (strcpy) to a dynamically allocated string? If I use char name[255] as a part of the WIndex structure, everything works fine.
More info:
This exact file is called "flat blanc.tga". If I put it in the folder where I intended it to be, fread in GetIdUsingThisString throws corrupted heap errors. If I put it in a different folder it is ok. If I change its name to anything else, it's ok again. If I put a different file there and give it that same name, it is ok too(!!!). I need the program to be free of this kind of bug because I won't know which textures will be loaded (if I knew, I could simply replace them).
Master[num].name=(char*)malloc(strlen(path)*sizeof(char));
Should be
Master[num].name=(char*)malloc( (strlen(path)+1) * sizeof(char));
There was no room for the terminating null character.
From http://www.cplusplus.com/reference/cstring/strcpy/:
Copies the C string pointed by source into the array pointed by
destination, including the terminating null character (and
stopping at that point).
The same happens here:
Master[0].name=(char*)malloc(strlen("Default")*sizeof(char));
strcpy(Master[0].name,"Default");
Based on the definitions (below) - you should use strlen(string)+1 for malloc.
A C string is as long as the number of characters between the beginning of the string and the terminating null character (without including the terminating null character itself).
The strcpy() function shall copy the string pointed to by s2 (including the terminating null byte)
Also see discussions in How to allocate the array before calling strcpy?
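Putting the strlen(string)+1 rule into one place makes it harder to forget. The following is a minimal sketch; duplicateString is a hypothetical helper name, not something from the question's code.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns a heap copy of `source`, or nullptr if the
// allocation fails. The caller releases the copy with std::free().
char* duplicateString(const char* source) {
    // strlen does not count the terminator, so reserve one extra byte.
    const std::size_t size = std::strlen(source) + 1;
    char* copy = static_cast<char*>(std::malloc(size));
    if (copy != nullptr) {
        std::memcpy(copy, source, size);  // copies the terminating '\0' too
    }
    return copy;
}
```

With such a helper, both Master[0].name and Master[num].name could be assigned from duplicateString(...) instead of repeating the malloc and strcpy pair. In C++ code, storing the name as a std::string would avoid the manual size calculation altogether.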