Use custom charset in Jackson CsvMapper - spring-boot

Is it possible to use custom encoding with the Jackson CsvMapper? I need the CSV to be encoded with Windows-1252, not the default UTF-8.
Here is a simplified version of my code:
class CSVService(private val mapper: CsvMapper) {
    fun <T : CsvResponse> writeAsCSV(result: List<T>, resultClass: KClass<T>, separatorChar: Char = ',', encoding: Charset = Charsets.WINDOWS_1252): String {
        mapper.schemaFor(resultClass.java).withHeader().withColumnSeparator(separatorChar)
        return mapper.writeValueAsBytes(result).toString(encoding) // does not work.
    }
}
It seems that the ObjectMapper writeValueAsBytes method has UTF-8 hard-coded.
Can someone show, with an example, how to configure the ObjectMapper to use a different encoding?
Thanks in advance.
Any help is appreciated.
Using com.fasterxml.jackson.dataformat:jackson-dataformat-csv:2.11.2

The Jackson Javadoc specifies that writeValueAsBytes always encodes its output as UTF-8.
To use a different charset, you have to write through an intermediate in-memory writer that performs the encoding.
Example:
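// The schema only takes effect if it is passed to the writer.
val schema = mapper.schemaFor(resultClass.java).withHeader().withColumnSeparator(separatorChar)
val buffer = ByteArrayOutputStream()
// The OutputStreamWriter performs the encoding, so the charset is chosen here.
OutputStreamWriter(buffer, encoding).use {
    mapper.writer(schema).writeValue(it, result)
}
val bytes = buffer.toByteArray()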
EDIT:
On second look, I see that you convert the byte array back to a string. That operation defeats any attempt to customize the encoding. Java represents text internally as UTF-16, and the charset passed to toString(encoding) is used to decode the bytes from their originating charset (Windows-1252 by default in your case). Since writeValueAsBytes produced UTF-8 bytes, decoding them as Windows-1252 garbles every non-ASCII character.
The resulting string does not carry an encoding into later writes, because it is the writer that is responsible for the encoding, not the content. So, once you have written your byte array, either return it as is, or set the encoding later on the final encoder.
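To make the point about the writer concrete, here is a minimal sketch in plain Java, without Jackson. The class name EncodingDemo and the sample text are made up for illustration. It writes the same string through an OutputStreamWriter with two charsets, then shows what happens when UTF-8 bytes are decoded as Windows-1252.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    /** Encodes text by writing it through an OutputStreamWriter bound to the given charset. */
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the writer flushes its internal buffer into the byte array stream.
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String text = "caf\u00e9"; // "cafe" with an acute accent on the e

        byte[] utf8 = encode(text, StandardCharsets.UTF_8); // the accented e becomes C3 A9
        byte[] win = encode(text, cp1252);                  // the accented e becomes E9
        System.out.println(utf8.length + " bytes as UTF-8, " + win.length + " bytes as Windows-1252");

        // Decoding UTF-8 bytes as Windows-1252 does not re-encode anything;
        // it misreads C3 A9 as two separate characters (U+00C3 and U+00A9).
        String garbled = new String(utf8, cp1252);
        System.out.println(garbled.length() + " characters after the wrong decode");
    }
}
```

The byte array returned by encode is what should be sent or stored; turning it back into a String only discards the encoding again.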

Related

VS 2017 shows incorrect C++ Unicode literals during debugging

When I try to define Unicode string literals in C++(17) I see some very odd results during debugging, which I would like to discuss. Look at the following variable definitions:
std::string u8 { u8"⬀⬁" };
std::string u8_1 { u8"\u2B00\u2B01" };
std::u16string u16 { u"⬀⬁" };
std::u16string u16_1 { 0x2B00, 0x2B01 };
std::u32string u32 { U"⬀⬁" };
std::u32string u32_1 { 0x2B00, 0x2B01 };
std::string u8_2 { u8"\u2B00⬀⬁" };
std::u16string u16_2 { u"\u2B00⬀⬁" };
std::u32string u32_2 { U"\u2B00⬀⬁" };
During debugging I now get the following strings:
As you can see, the values are pretty surprising. Strings defined with an initializer list or escape codes appear correct, while those specified as normal characters appear as if they were UTF-8 encoded and the resulting bytes were written to the strings one by one. This is correct for a UTF-8 string (u8_1 here), but not for UTF-16 and UTF-32 strings (u16 and u32 here). Even odder, the u8 variable contains values that are doubly UTF-8 encoded. You can easily verify that with an online UTF converter. If the debugger were the problem, I wouldn't see any correct values, but since it correctly shows the values that were specified with escape sequences, I assume something else is messing up the strings.
One explanation for the results would be that the file is UTF-8 encoded and its bytes were taken over directly into the string variables, without any conversion. I think the original string in the C++ file should instead be converted from the source encoding (which is indeed UTF-8) to the correct target encoding. Needless to say, this works as expected in Xcode.
Is this a known problem? Can it somehow be worked around to avoid having to use numeric values instead?
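The "doubly UTF-8 encoded" effect can be reproduced outside of C++. The sketch below is written in Java and uses a made-up class name, DoubleEncodingDemo. It assumes the UTF-8 source bytes were misread as Windows-1252 before being encoded to UTF-8 again; that is only a guess at what happened here, not a confirmed diagnosis.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    /** Simulates UTF-8 bytes being misread in a single-byte code page and then re-encoded as UTF-8. */
    public static byte[] doubleEncode(String text, Charset assumedSourceCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, assumedSourceCharset); // one character per source byte
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String arrow = "\u2B00"; // first character of the literals in the question
        // Correct UTF-8: E2 AC 80
        System.out.println(hex(arrow.getBytes(StandardCharsets.UTF_8)));
        // Misread as Windows-1252, then encoded again: C3 A2 C2 AC E2 82 AC
        System.out.println(hex(doubleEncode(arrow, Charset.forName("windows-1252"))));
    }
}
```

Comparing the second line of output with the bytes shown by the debugger for the u8 variable would confirm or rule out this assumption.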

How to merge a serialized protobuf with another protobuf without deserializing the first

I'm trying to understand if it's possible to take a serialized protobuf that makes up part of another protobuf and merge them together without having to deserialize the first protobuf.
For example, given a protobuf wrapper:
syntax = "proto2";
import "content.proto";
message WrapperContent {
    required string metatData = 1;
    required Content content = 2;
}
And then imagine we get a serialized version of Content, defined below (i.e. Content is coming from a remote client):
syntax = "proto2";
message Content {
    required string name = 1;
    required bytes payload = 2;
}
Do you know of any way I can inject the serialized Content into the WrapperContent without first having to deserialize Content?
The reason I'm trying to inject Content without deserializing it is that I want to save the overhead of deserializing the message.
If the answer is "no, it's not possible", that is still helpful.
Thanks, Mike.
In protobuf, submessages are stored like bytes fields.
So you can make a modified copy of your wrapper:
message WrapperContentBytes {
    required string metatData = 1;
    required bytes content = 2;
}
and write the already serialized content data into the content field.
Decoders can still use the unmodified WrapperContent message, which decodes the submessage as well. The binary data on the wire is the same, so decoders cannot tell the difference.
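To see why the two declarations are interchangeable on the wire, here is a minimal hand-written sketch of the length-delimited encoding (wire type 2) that protobuf uses for strings, bytes, and submessages alike. It is plain Java and deliberately does not use the protobuf library; the class WireFormatDemo and its methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class WireFormatDemo {
    /** Writes a non-negative int as a base-128 varint, least significant group first. */
    static void writeVarint(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Writes one length-delimited field: tag (field number and wire type 2), length, raw payload. */
    static void writeLengthDelimited(ByteArrayOutputStream out, int fieldNumber, byte[] payload) {
        writeVarint(out, (fieldNumber << 3) | 2);
        writeVarint(out, payload.length);
        out.write(payload, 0, payload.length);
    }

    /** Encodes Content { name = 1; payload = 2; } as the remote client would send it. */
    public static byte[] encodeContent(String name, byte[] payload) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLengthDelimited(out, 1, name.getBytes(StandardCharsets.UTF_8));
        writeLengthDelimited(out, 2, payload);
        return out.toByteArray();
    }

    /** Encodes the wrapper. Field 2 receives the already serialized Content bytes unchanged. */
    public static byte[] encodeWrapper(String metatData, byte[] serializedContent) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLengthDelimited(out, 1, metatData.getBytes(StandardCharsets.UTF_8));
        writeLengthDelimited(out, 2, serializedContent);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] content = encodeContent("a", new byte[] {1});
        byte[] wrapper = encodeWrapper("m", content);
        System.out.println(wrapper.length + " bytes, content copied in without being parsed");
    }
}
```

Field 2 is written by the same routine whether the schema calls it Content or bytes, which is why the serialized Content can be copied in without being parsed.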

Can wstring_convert just replace invalid characters?

I am currently working on a tool to extract archives from a game for the purpose of data mining. I currently extract metadata from the archives (number of files per archive, filenames, packed/unpacked sizes, etc.) and write them to a std::wstring for further analysis. I have stumbled over an issue with converting filenames to wide characters using std::wstring_convert.
My code looks something like this now:
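#include <array>
#include <codecvt>
#include <cstdint>
#include <locale>
#include <sstream>
#include <string>
struct IndexEntry {
    std::int32_t file_id;
    std::array<char, 260> filename;
    // more fields
};
std::wstring foo(IndexEntry entry) {
    std::wstringstream buffer;
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    buffer << entry.file_id << L'\n';
    // from_bytes throws std::range_error when the input is not valid UTF-8
    buffer << converter.from_bytes(entry.filename.data()) << L'\n';
    // add rest of the IndexEntry fields to the stream
    return buffer.str();
}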
The IndexEntry struct is filled by reading from files with a std::ifstream in binary mode. The error happens with converter.from_bytes(). Some of the filenames contain 0x81 as a character and when the converter encounters these, it throws a std::range_error exception.
Is there a way to tell wstring_convert to replace characters it can not convert with something else? Or is there a generally better way to handle this conversion?
This whole project is mostly a learning exercise. I wanted to do all internal string handling with wstring, so I can get some experience dealing with strings in different encodings. Unfortunately, I have no idea what exact encoding was used to generate these archive files.

Serializing an object to a JSON input stream using GSON?

I'm not sure if I'm asking for the right thing, but is it possible to make the Gson.toJson(...) family of methods work in "streaming mode" while serializing to JSON? Sometimes using Appendable is not possible:
final String json = gson.toJson(value);
final byte[] bytes = json.getBytes(charset);
try ( final InputStream inputStream = new ByteArrayInputStream(bytes) ) {
    inputStreamConsumer.accept(inputStream);
}
The example above is not perfect in this scenario, because:
It generates a string json as a temporary buffer.
The json string produces a new byte array just to wrap it up into a ByteArrayInputStream instance.
I think it's not a big problem to write a CharSequence to InputStream adapter and get rid of creating the byte array clone, but I still couldn't get rid of generating the string temporary buffer to use the inputStreamConsumer efficiently. So, I'd expect something like:
try ( final InputStream inputStream = gson.toJsonInputStream(value) ) {
    inputStreamConsumer.accept(inputStream);
}
Is it possible using just GSON somehow?
According to this comment, this cannot be done using GSON.
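Outside of GSON itself, one common workaround is a pipe: a background thread serializes into a PipedOutputStream while the consumer reads from the connected PipedInputStream, so no full String or byte array is built. The sketch below is plain Java. The Serializer interface and the PipedSerialization class are hypothetical names; with GSON, the callback would presumably wrap gson.toJson(value, writer).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PipedSerialization {
    /** Hypothetical callback standing in for a call such as gson.toJson(value, writer). */
    public interface Serializer {
        void writeTo(Writer writer) throws IOException;
    }

    /** Returns a stream whose bytes are produced on a background thread while the caller reads. */
    public static InputStream toInputStream(Serializer serializer, Charset charset) {
        PipedInputStream in = new PipedInputStream();
        final PipedOutputStream out;
        try {
            out = new PipedOutputStream(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Thread producer = new Thread(() -> {
            // Closing the writer closes the pipe, which signals end of stream to the reader.
            try (Writer writer = new OutputStreamWriter(out, charset)) {
                serializer.writeTo(writer);
            } catch (IOException e) {
                // Sketch only: a failure ends the stream early instead of being reported to the reader.
            }
        });
        producer.setDaemon(true);
        producer.start();
        return in;
    }

    /** Reads a stream to the end and decodes it; used here only to show the round trip. */
    public static String readAll(InputStream in, Charset charset) {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        try (InputStream stream = in) {
            int n;
            while ((n = stream.read(chunk)) != -1) {
                collected.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(collected.toByteArray(), charset);
    }

    public static void main(String[] args) {
        InputStream in = toInputStream(w -> w.write("{\"id\":1}"), StandardCharsets.UTF_8);
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```

The trade-off is an extra thread per stream, and errors on the producer side need a reporting channel that this sketch leaves out.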

SuperCSV with null delimiter

I'm creating a file that isn't really a CSV file, but SuperCSV can help make creating it easier. The lines have different lengths and follow a layout that doesn't separate the individual pieces of information. So, to know what information a line contains, you need to look at the first 2 characters (the name of the register), count the characters, and extract each value by size.
I've configured SuperCSV to use an empty delimiter; however, the created file contains a space where it should have nothing.
public class TarefaGerarArquivoRegistrosFiscais implements ITarefa {
    private static final CsvPreference FORMATO_ANEXO_IV = new CsvPreference.Builder('"', '\0' , "\r\n").build();
    public void processar() {
        try {
            writer = new CsvListWriter(getFileWriter(), FORMATO_ANEXO_IV);
            writer.write(geradorRegistroU1.gerar());
        } finally {
            if (writer != null)
                writer.close();
        }
    }
}
Am I doing something wrong? Is '\0' the correct code for a null char?
It's probably not what you want to hear, but I wouldn't recommend using Super CSV for this (and I'm a committer!). Its sole purpose is to deal with delimited files - and you're not using delimiters.
You could misuse Super CSV by creating a wrapper object (containing your List) whose toString() method simply concatenates all of the values together, then passing that single object to writer.write(), but it's an awful hack.
I'd recommend either finding another library more suited to your problem, or writing your own solution.
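As a rough idea of what "writing your own solution" could look like, here is a minimal sketch that concatenates values into a fixed-width line without any delimiter. The class FixedWidthRecord, the column widths, and the choice to pad with spaces and truncate overlong values are all assumptions; the real layout has to come from the file specification.

```java
import java.util.Arrays;
import java.util.List;

public class FixedWidthRecord {
    /** Pads the value on the right with spaces, or truncates it, to exactly the given width. */
    static String fit(String value, int width) {
        if (value.length() >= width) {
            return value.substring(0, width);
        }
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    /** Concatenates the values with no delimiter and ends the line with CRLF. */
    public static String formatLine(List<String> values, int[] widths) {
        if (values.size() != widths.length) {
            throw new IllegalArgumentException("expected " + widths.length + " values, got " + values.size());
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < widths.length; i++) {
            line.append(fit(values.get(i), widths[i]));
        }
        return line.append("\r\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical "U1" register: 2-character register name, then fields of width 5 and 3.
        String line = formatLine(Arrays.asList("U1", "ABC", "7"), new int[] {2, 5, 3});
        System.out.print(line);
    }
}
```

Each register type would get its own widths array, and the resulting lines can be written with any ordinary Writer.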
