SuperCSV with null delimiter - supercsv

I'm creating a file that isn't really a CSV file, but SuperCSV can help me make the creation of this file easier. The structure of the file uses a different length for each line, following a layout that doesn't separate the different fields. So, to know which information is in a line, you need to look at the first 2 characters (the name of the register), count the characters and extract each field by its size.
I've configured SuperCSV to use an empty delimiter; however, the created file contains a space where it should contain nothing.
public class TarefaGerarArquivoRegistrosFiscais implements ITarefa {

    private static final CsvPreference FORMATO_ANEXO_IV =
            new CsvPreference.Builder('"', '\0', "\r\n").build();

    public void processar() {
        try {
            writer = new CsvListWriter(getFileWriter(), FORMATO_ANEXO_IV);
            writer.write(geradorRegistroU1.gerar());
        } finally {
            if (writer != null)
                writer.close();
        }
    }
}
Am I doing something wrong? Is '\0' the correct code for a null char?

It's probably not what you want to hear, but I wouldn't recommend using Super CSV for this (and I'm a committer!). Its sole purpose is to deal with delimited files - and you're not using delimiters.
You could misuse Super CSV by creating a wrapper object (containing your List) whose toString() method simply concatenates all of the values together, then passing that single object to writer.write(), but it's an awful hack.
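For illustration, a minimal sketch of that hack (ConcatenatingRow is an invented name, and it assumes the values you currently pass to writer.write() come as a List):
class ConcatenatingRow {
    private final List<?> values;

    ConcatenatingRow(List<?> values) {
        this.values = values;
    }

    @Override
    public String toString() {
        // glue every value together with no delimiter at all
        StringBuilder sb = new StringBuilder();
        for (Object value : values) {
            sb.append(value);
        }
        return sb.toString();
    }
}
// then write the whole row as one "column", so Super CSV never inserts a delimiter
writer.write(new ConcatenatingRow(geradorRegistroU1.gerar()));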
I'd recommend either finding another library more suited to your problem, or writing your own solution.

Related

How to make Gson read value as String?

My JSON objects look like this
{"phoneNbr":"123456789","firstName":"Mark","previousNames":[{"previous1":"Peter","previous2":"Steve"}]}
{"phoneNbr":"234567891","firstName":"Hank","previousNames":null}
The previousNames value can be anything. I want it to always be treated as a STRING. However, when I try to parse it, GSON complains because it expects an array.
My PersonJsonDAO class looks like this:
private String phoneNbr;
private String firstName;
private String previousNames;
I try to parse it, but GSON says "Expected a string but was BEGIN_ARRAY":
PersonJsonDAO personJsonDAO= new Gson().fromJson(jsonString, PersonJsonDAO.class);
How can I force GSON to accept previousNames as a String?
GSON is treating it as an array, because it is indeed an array :)
I can think of 4 different alternatives to meet your desired behavior:
A preprocessing step that turns everything after '"previousNames":' into a string: search for the first occurrence of '"previousNames":[', insert a '"' there, remove (backspace over) all the double quotes up to the occurrence of ']', and add another double quote right before it.
A much easier solution, if you don't mind the slight computational overhead (which in your case is probably tiny): just parse it into JSON as a first step, like you did, but declare previousNames as an array of Strings, and then call:
personJsonDAO.getString("previousNames");
However, this will leave you with the previousNames field as an array of Strings.
Another option is to leave it as a JsonElement in the deserialization process, like this:
class PersonJsonDAO {
    ....
    @SerializedName("previousNames")
    JsonElement previousNames;
    ....
}
If the above alternatives are not enough, and you insist on having the previousNames field as a String, then the most comprehensive and correct approach would be to override GSON's deserialization process, delegating to the default behaviour for everything except the previousNames culprit, which you would return as a String.
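For completeness, a rough sketch of that last approach (it assumes PersonJsonDAO exposes setters for its fields, which aren't shown above):
import java.lang.reflect.Type;
import com.google.gson.*;

class PersonJsonDAODeserializer implements JsonDeserializer<PersonJsonDAO> {
    @Override
    public PersonJsonDAO deserialize(JsonElement json, Type typeOfT, JsonDeserializationContext context)
            throws JsonParseException {
        JsonObject obj = json.getAsJsonObject();
        PersonJsonDAO dao = new PersonJsonDAO();
        dao.setPhoneNbr(obj.get("phoneNbr").getAsString());
        dao.setFirstName(obj.get("firstName").getAsString());
        JsonElement prev = obj.get("previousNames");
        // whatever previousNames happens to be (array, object, null), keep its raw JSON text
        dao.setPreviousNames(prev == null || prev.isJsonNull() ? null : prev.toString());
        return dao;
    }
}

Gson gson = new GsonBuilder()
        .registerTypeAdapter(PersonJsonDAO.class, new PersonJsonDAODeserializer())
        .create();
PersonJsonDAO personJsonDAO = gson.fromJson(jsonString, PersonJsonDAO.class);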

Java Stream BufferedReader file stream

I am using Java 8 Streams to create a stream from a CSV file.
I am using BufferedReader.lines(), and I read this in the docs for BufferedReader.lines():
After execution of the terminal stream operation there are no guarantees that the reader will be at a specific position from which to read the next character or line.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.Reader;

public class Streamy {
    public static void main(String args[]) {
        Reader reader = null;
        BufferedReader breader = null;
        try {
            reader = new FileReader("refined.csv");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        breader = new BufferedReader(reader);
        long l1 = breader.lines().count();
        System.out.println("Line Count " + l1); // this works correctly
        long l2 = breader.lines().count();
        System.out.println("Line Count " + l2); // this gives 0
    }
}
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file. What is the way around this problem?
It looks like after reading the file for the first time, the reader does not go back to the beginning of the file.
No - and I don't know why you would expect it to given the documentation you quoted. Basically, the lines() method doesn't "rewind" the reader before starting, and may not even be able to. (Imagine the BufferedReader wraps an InputStreamReader which wraps a network connection's InputStream - once you've read the data, it's gone.)
What is the way around this problem?
Two options:
Reopen the file and read it from scratch
Save the result of lines() to a List<String>, so that you're then not reading from the file at all the second time. For example:
List<String> lines = breader.lines().collect(Collectors.toList());
As an aside, I'd strongly recommend using Files.newBufferedReader instead of FileReader - the latter always uses the platform default encoding, which isn't generally a good idea.
And for that matter, to load all the lines into a list, you can just use Files.readAllLines... or Files.lines if you want the lines as a stream rather than a list. (Note the caveats in the comments, however.)
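For instance, a small sketch (it assumes the file is UTF-8; adjust the charset as needed):
// stream the lines lazily and let try-with-resources close the underlying file
try (Stream<String> lines = Files.lines(Paths.get("refined.csv"), StandardCharsets.UTF_8)) {
    System.out.println("Line Count " + lines.count());
}

// or read everything into memory once and reuse the list as often as you like
List<String> allLines = Files.readAllLines(Paths.get("refined.csv"), StandardCharsets.UTF_8);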
The cited fragment from the JavaDoc probably needs to be clarified. Usually you would expect that after reading the whole file the reader will point to the end of the file. But with streams it depends on whether a short-circuit terminal operation is used and whether the stream is parallel. For example, if you use
String magicLine = breader.lines()
        .filter(str -> str.startsWith("magic"))
        .findAny()
        .orElse(null);
Your reader will likely stop after the first matching line (because there is no need to read further), or read the whole input file if no such line is found. If you perform the same operation on a parallel stream, the resulting position will be unpredictable, because the input is split into implementation-dependent chunks in which the search is performed. That's why the documentation is written this way.
As for workarounds, please read the @JonSkeet answer. And consider closing your streams via the try-with-resources construct.
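For example (a sketch; assumes the surrounding method declares throws IOException):
try (BufferedReader breader = new BufferedReader(new FileReader("refined.csv"))) {
    // the reader is closed automatically, even if an exception is thrown
    System.out.println("Line Count " + breader.lines().count());
}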
If there are no guarantees that the reader will be at a specific line, why wouldn't you create two readers?
reader1=new FileReader("refined.csv");
reader2=new FileReader("refined.csv");

C++(Visual Studio 2012): Copying a function's parameter char* to a dynamically allocated one

I have this structure and class defined in my project. It is a class that holds id numbers generated by GetIdUsingThisString(char *), which is a function that loads a texture file into the GPU and returns an id (OpenGL).
The problem is that when I try to read one specific file, the program crashes. When I run the program in VS with debugging it works fine, but running the .exe crashes it (as does running without debugging from MSVS). Using the just-in-time debugger I found out that, for the num of that specific file, Master[num].name actually contains "\x5" appended (concatenated) at the end of the file path, and this only happens for this one file. Nothing outside this method could do it, and I also use this type of slash / in paths, not \ .
struct WIndex{
    char* name;
    int id;
};
class Test_Class
{
public:
    Test_Class(void);
    int AddTex(char* path);
    struct WIndex* Master;
    TextureClass* tex;
    //some other stuff...
};
Constructor:
Test_Class::Test_Class(void)
{
    num=0;
    Master=(WIndex*)malloc(1*sizeof(WIndex));
    Master[0].name=(char*)malloc(strlen("Default")*sizeof(char));
    strcpy(Master[0].name,"Default");
    Master[0].id=GetIdUsingThisString(Master[0].name);
}
Adding a new texture (the bug):
int Test_Class::AddTex(char* path)
{
    num++;
    Master=(WIndex*)realloc(Master,(num+1)*sizeof(WIndex));
    Master[num].name=(char*)malloc(strlen(path)*sizeof(char));
    strcpy(Master[num].name,path);   // <--- HERE
    Master[num].id=GetIdUsingThisString(path);
    return Master[num].id;
}
At runtime, calling AddTex with this file shows path with the right value, while Master[num].name shows the modified value (with "\x5" appended) after strcpy.
Question:
Is there something wrong with copying (strcpy) to a dynamically allocated string? If I use char name[255] as part of the WIndex structure, everything works fine.
More info:
This exact file is called "flat blanc.tga". If I put it in the folder where I intended it to be, fread in GetIdUsingThisString throws corrupted heap errors. If I put it in a different folder it is ok. If I change its name to anything else, it's ok again. If I put a different file there and give it that same name, it is ok too (!!!). I need the program to be free of this kind of bug, because I won't know in advance which textures will be loaded (if I knew, I could simply replace them).
Master[num].name=(char*)malloc(strlen(path)*sizeof(char));
Should be
Master[num].name=(char*)malloc( (strlen(path)+1) * sizeof(char));
There was no place for the terminating NULL character.
From http://www.cplusplus.com/reference/cstring/strcpy/:
Copies the C string pointed by source into the array pointed by
destination, including the terminating null character (and
stopping at that point).
The same happens here:
Master[0].name=(char*)malloc(strlen("Default")*sizeof(char));
strcpy(Master[0].name,"Default");
Based on the definitions below, you should use strlen(string)+1 for the malloc.
A C string is as long as the number of characters between the beginning of the string and the terminating null character (without including the terminating null character itself).
The strcpy() function shall copy the string pointed to by s2 (including the terminating null byte)
Also see discussions in How to allocate the array before calling strcpy?

Which files are ignored as input by mapper?

I'm chaining multiple MapReduce jobs and want to pass along/store some meta information (e.g. configuration or the name of the original input) with the results. At least the file "_SUCCESS" and also anything in the directory "_logs" seems to be ignored.
Are there any filename patterns which are by default ignored by the InputReader? Or is this just a fixed limited list?
The FileInputFormat uses the following hiddenFileFilter by default:
private static final PathFilter hiddenFileFilter = new PathFilter(){
    public boolean accept(Path p){
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
    }
};
So if you use any FileInputFormat (such as TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat), hidden files (whose names start with "_" or ".") will be ignored.
You can use FileInputFormat.setInputPathFilter to set your custom PathFilter. Remember that the hiddenFileFilter is always active.
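For example, a sketch of a custom filter and how to register it (the class name and the ".meta" rule are invented; shown with the new MapReduce API):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MetaAwareFilter implements PathFilter {
    @Override
    public boolean accept(Path p) {
        // exclude our own metadata files; hiddenFileFilter has already excluded
        // anything starting with "_" or ".", and this filter cannot re-include them
        return !p.getName().endsWith(".meta");
    }
}

// in the driver:
FileInputFormat.setInputPathFilter(job, MetaAwareFilter.class);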

Creating custom InputFormat and RecordReader for Binary Files in Hadoop MapReduce

I'm writing an M/R job that processes large time-series data files written in a binary format that looks something like this (new lines here for readability; the actual data is continuous, obviously):
TIMESTAMP_1---------------------TIMESTAMP_1
TIMESTAMP_2**********TIMESTAMP_2
TIMESTAMP_3%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%TIMESTAMP_3
.. etc
Where the timestamp is simply an 8 byte struct, identifiable as such by the first 2 bytes. The actual data is bounded between duplicate-value timestamps, as displayed above, and contains one or more predefined structs. I would like to write a custom InputFormat that will emit these key/value pairs to the mappers:
< TIMESTAMP_1, --------------------- >
< TIMESTAMP_2, ********** >
< TIMESTAMP_3, %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% >
Logically, I'd like to keep track of the current TIMESTAMP and aggregate all the data until that TIMESTAMP is detected again, then send out my <TIMESTAMP, DATA> pair as a record. My problem is syncing between splits inside the RecordReader; for example, a certain reader may receive the following split:
# a split occurs inside my data
reader X: TIMESTAMP_1--------------
reader Y: -------TIMESTAMP_1 TIMESTAMP_2****..
# or inside the timestamp
or even: #######TIMES
TAMP_1-------------- ..
What's a good way to approach this? Is there an easy way to access the file offsets so that my CustomRecordReader can sync between splits and not lose data? I feel I have some conceptual gaps in how splits are handled, so perhaps an explanation of these may help. Thanks.
In general it is not simple to create an input format which supports splits, since you have to be able to find out where to move from the split boundary in order to get consistent records. XmlInputFormat is a good example of a format that does so.
I would suggest first considering whether you indeed need splittable input. You can define your input format as not splittable and avoid all these issues.
If your files are generally not much larger than the block size, you lose nothing. If they are, you will lose part of the data locality.
You can subclass a concrete subclass of FileInputFormat, for example SequenceFileAsBinaryInputFormat, and override the isSplitable() method to return false:
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat;

public class NonSplitableBinaryFile extends SequenceFileAsBinaryInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }

    @Override
    public RecordReader getRecordReader(InputSplit split, JobConf job,
            Reporter reporter) throws IOException {
        //return your customized record reader here
    }
}
