Read in big csv file, validate and write out using uniVocity parser - validation

I need to parse a big csv file (2 GB). The values have to be validated, the rows containing "bad" fields must be dropped, and a new file containing only the valid rows has to be written out.
I've selected the uniVocity parser library to do that. Please help me understand whether this library is well suited for the task and what approach should be used.
Given the file size, what is the best way to organize read->validate->write in uniVocity? Should I read in all rows at once or use an iterator style? Where should parsed and validated rows be stored before they are written to the file?
Is there a way in uniVocity to access a row's values by index? Something like row.getValue(3)?

I'm the author of this library; let me try to help you out:
First, do not try to read all rows at once as you will fill your memory with LOTS of data.
You can get the row values by index.
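For example, each row in the code below is delivered as a plain String[], so a value can be read positionally (the bounds check is just a defensive sketch for rows shorter than expected):
//inside rowProcessed(String[] row, ParsingContext context), shown further down
String fourthValue = row.length > 3 ? row[3] : null;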
The fastest approach to read/validate/write would be to use a RowProcessor that has a CsvWriter and decides when to write or skip a row. I think the following code will help you a bit:
Define the output:
private CsvWriter createCsvWriter(File output, String encoding){
    CsvWriterSettings settings = new CsvWriterSettings();
    //configure the writer ...
    try {
        return new CsvWriter(new OutputStreamWriter(new FileOutputStream(output), encoding), settings);
    } catch (IOException e) {
        throw new IllegalArgumentException("Error writing to " + output.getAbsolutePath(), e);
    }
}
Redirect the input:
//this creates a row processor for our parser. It validates each row and sends it to the csv writer.
private RowProcessor createRowProcessor(File output, String encoding){
    final CsvWriter writer = createCsvWriter(output, encoding);
    return new AbstractRowProcessor() {
        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            if (shouldWriteRow(row)) {
                writer.writeRow(row);
            } else {
                //skip row
            }
        }

        private boolean shouldWriteRow(String[] row) {
            //your validation here
            return true;
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    };
}
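For illustration only, a concrete shouldWriteRow could look like the sketch below; the rules (a minimum column count and a numeric third column) are assumptions about what makes a row "bad", not something the library prescribes:
private boolean shouldWriteRow(String[] row) {
    //example rule: drop rows with missing fields or a non-numeric third column
    if (row == null || row.length < 3 || row[2] == null) {
        return false;
    }
    try {
        Double.parseDouble(row[2]);
        return true;
    } catch (NumberFormatException e) {
        return false;
    }
}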
Configure the parser:
public void readAndWrite(File input, File output, String encoding) {
    CsvParserSettings settings = new CsvParserSettings();
    //configure the parser here

    //tells the parser to send each row to the custom processor, which will validate and redirect all rows to the CsvWriter
    settings.setRowProcessor(createRowProcessor(output, encoding));

    CsvParser parser = new CsvParser(settings);
    try {
        parser.parse(new InputStreamReader(new FileInputStream(input), encoding));
    } catch (IOException e) {
        throw new IllegalStateException("Unable to open input file " + input.getAbsolutePath(), e);
    }
}
For better performance you can also wrap the row processor in a ConcurrentRowProcessor.
settings.setRowProcessor(new ConcurrentRowProcessor(createRowProcessor(output, encoding)));
With this, the writing of rows will be performed in a separate thread.
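A minimal usage sketch, with placeholder file names and encoding:
readAndWrite(new File("input.csv"), new File("valid-rows.csv"), "UTF-8");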

Related

Spring Batch FlatFileItemReader Handle Additional Fields Added to CSV

So I've got a csv file that's being ingested on a scheduled basis. The csv file has a set of columns with their names specified in the header row; the catch is, new columns are constantly being added to this csv. Currently, when a new field is added, the ingest flow breaks and I get a FlatFileParseException. I have to go in and update the code with the new column names in order to get it working again.
What I'm looking to accomplish instead is to have the code correctly pick out the columns it needs when new columns are added, and not throw an exception.
@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(
    ...
) {
    final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
        final String[] fooColumnNames = { "foo", "bar" };
        setNames(fooColumnNames);
        // setStrict(false);
    }};

    return new FlatFileItemReader<>() {{
        setLineMapper(new DefaultLineMapper<>() {{
            setLineTokenizer(fooLineTokenizer);
            setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
                setTargetType(Foo.class);
            }});
        }});
        ...
    }};
}
I've tried using setStrict(false) in the lineTokenizer, and this gets rid of the exception; however, the problem then becomes fields being set to the wrong values, taken from the new columns that were added rather than from the original columns the data was being pulled from.
Any ideas on how to add a bit more fault tolerance to this flow, so I don't have to constantly update the fooColumnNames whenever columns are added to the csv?
I tried modifying the code using the setStrict(false) parameter and toying with custom implementations of the lineTokenizer, but I'm still struggling to get fault tolerance when new columns are added to the csv.
I don't know about fault tolerance, but it could be possible to retrieve the columns dynamically.
Add a listener to your step to retrieve the columns in beforeStep and pass them to your step execution context:
public class ColumnRetrieverListener implements StepExecutionListener {
    private Resource resource;
    //Getter and setter

    @Override
    public void beforeStep(StepExecution stepExecution) {
        String[] columns = getColumns();
        stepExecution.getExecutionContext().put("columns", columns);
    }

    private String[] getColumns() {
        //Parse first line of resource to get columns
    }
}
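A minimal sketch of getColumns(), assuming the resource is a plain CSV whose first line is a comma-separated header (the delimiter and charset are assumptions):
private String[] getColumns() {
    //read only the header line of the injected resource
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(resource.getInputStream(), StandardCharsets.UTF_8))) {
        String header = reader.readLine();
        return header == null ? new String[0] : header.split(",");
    } catch (IOException e) {
        throw new IllegalStateException("Unable to read header from " + resource, e);
    }
}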
Use the data passed to the execution context to set the lineTokenizer:
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
    final String[] fooColumnNames = (String[]) stepExecution.getExecutionContext().get("columns");
    setNames(fooColumnNames);
}};
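If the tokenizer is built inside a step-scoped bean, the same value could also be injected through Spring Batch's late binding instead of being fetched manually; a sketch, assuming the "columns" key used above:
@Bean
@StepScope
DelimitedLineTokenizer fooLineTokenizer(@Value("#{stepExecutionContext['columns']}") String[] columns) {
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
    tokenizer.setNames(columns);
    return tokenizer;
}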

Save Multiple Records using Web Flux and Mongo DB

I'm working on a project which uses Spring WebFlux and MongoDB, and I'm very new to reactive programming and WebFlux.
I have a scenario of saving into 3 collections using one service. For each collection I'm generating an id using a sequence and then saving the documents. I have a FieldMaster which has a List<FieldInfo>, and every FieldInfo has a List<FieldOption>. I need to save FieldMaster, FieldInfo and FieldOption. Below is the code I'm using. The code works only when I'm running in debug mode; otherwise it gets blocked on the line below:
Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue());
Here is the full code:
public Mono<FieldMaster> createMasterData(Mono<FieldMaster> fieldmaster)
{
    return fieldmaster.flatMap(fm -> {
        return sequencesCollection.getNextSequence(FIELDMASTER).flatMap(seqVal -> {
            LOGGER.info("Generated Sequence value :" + seqVal.getSeqValue());
            fm.setId(Integer.parseInt(seqVal.getSeqValue()));

            List<FieldInfo> fieldInfo = fm.getFieldInfo();
            fieldInfo.forEach(field -> {
                // saving Field Goes Here
                Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue()); // stops execution at this line
                LOGGER.info("Generated Sequence value Field Sequence:" + field_seq_id);
                field.setId(field_seq_id);
                field.setMasterFieldRefId(fm.getId());
                mongoTemplate.save(field).block();
                LOGGER.info("Field Details Saved");

                List<FieldOption> fieldOption = field.getFieldOptions();
                fieldOption.forEach(option -> {
                    // saving Field Option Goes Here
                    Integer opt_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDOPTION).block().getSeqValue());
                    LOGGER.info("Generated Sequence value Options Sequence:" + opt_seq_id);
                    option.setId(opt_seq_id);
                    option.setFieldRefId(field_seq_id);
                    mongoTemplate.save(option).log().block();
                    LOGGER.info("Field Option Details Saved");
                });
            });
            return mongoTemplate.save(fm).log();
        });
    });
}
First, in reactive programming it is not good to use .block(), because you turn non-blocking code into blocking code. If you want to read from one stream and save into 3 collections, you can do it like this.
There are many different ways to do this for performance purposes, but it depends on the amount of data.
Here is a sample using simple data and the concat operator, but there are also zip and merge; it depends on your needs.
public void run(String... args) throws Exception {
    Flux<Integer> dbData = Flux.range(0, 10);
    dbData.flatMap(integer -> Flux.concat(saveAllInFirstCollection(integer), saveAllInSecondCollection(integer), saveAllInThirdCollection(integer))).subscribe();
}

Flux<Integer> saveAllInFirstCollection(Integer integer) {
    System.out.println(integer);
    //process and save in collection
    return Flux.just(integer);
}

Flux<Integer> saveAllInSecondCollection(Integer integer) {
    System.out.println(integer);
    //process and save in collection
    return Flux.just(integer);
}

Flux<Integer> saveAllInThirdCollection(Integer integer) {
    System.out.println(integer);
    //process and save in collection
    return Flux.just(integer);
}
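For the record, zip could be used in much the same way if the three saves should be combined into a single result per element; a sketch reusing the same placeholder methods:
dbData.flatMap(integer -> Flux.zip(
        saveAllInFirstCollection(integer),
        saveAllInSecondCollection(integer),
        saveAllInThirdCollection(integer)))
    .subscribe();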

Parallel Stream repeating items

I am retrieving big chunks of data from a DB and using this data to write it somewhere else. In order to avoid a long processing time, I'm trying to use parallel streams to write it. When I run this as a sequential stream, it works perfectly. However, if I change it to parallel, the behavior is odd: it prints the same object multiple times (more than 10).
@PostConstruct
public void retrieveAllTypeRecords() throws SQLException {
    logger.info("Retrieve batch of Type records.");
    try {
        Stream<TypeRecord> typeQueryAsStream = jdbcStream.getTypeQueryAsStream();
        typeQueryAsStream.forEach((type) -> {
            logger.info("Printing Type with field1: {} and field2: {}.", type.getField1(), type.getField2()); //the same object gets printed here multiple times
            //write this object somewhere else
        });
        logger.info("Completed full retrieval of Type data.");
    } catch (Exception e) {
        logger.error("error: " + e);
    }
}
public Stream<TypeRecord> getTypeQueryAsStream() throws SQLException {
    String sql = typeRepository.getQueryAllTypesRecords(); //retrieves SQL query in String format
    TypeMapper typeMapper = new TypeMapper();
    JdbcStream.StreamableQuery query = jdbcStream.streamableQuery(sql);
    Stream<TypeRecord> stream = query.stream()
        .map(row -> {
            return typeMapper.mapRow(row); //maps column values to object values
        });
    return stream;
}
public class StreamableQuery implements Closeable {
    (...)

    public Stream<SqlRow> stream() throws SQLException {
        final SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
        final SqlRow sqlRow = new SqlRowAdapter(rowSet);

        Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
            @Override
            public boolean hasNext() {
                return !rowSet.isLast();
            }

            @Override
            public SqlRow next() {
                if (!rowSet.next()) {
                    throw new NoSuchElementException();
                }
                return sqlRow;
            }
        }, Spliterator.CONCURRENT);
        return StreamSupport.stream(supplier, Spliterator.CONCURRENT, true); //this boolean sets the stream as parallel
    }
}
I've also tried using typeQueryAsStream.parallel().forEach((type) -> ...) but the result is the same.
Example of output:
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[main] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
Well, look at your code:
final SqlRow sqlRow = new SqlRowAdapter(rowSet);
Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
    …
    @Override
    public SqlRow next() {
        if (!rowSet.next()) {
            throw new NoSuchElementException();
        }
        return sqlRow;
    }
}, Spliterator.CONCURRENT);
You are returning the same object every time. You achieve your desired effects by implicitly modifying the state of this object when calling rowSet.next().
This obviously can't work when multiple threads try to access that single object concurrently. Even buffering some items to hand them over to another thread will cause trouble. Therefore, such interference can cause problems with sequential streams as well, as soon as stateful intermediate operations are involved, like sorted or distinct.
Assuming that typeMapper.mapRow(row) will produce an actual data item which has no interference to other data items, you should integrate this step into the stream source, to create a valid stream.
public Stream<TypeRecord> stream(TypeMapper typeMapper) throws SQLException {
    SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
    SqlRow sqlRow = new SqlRowAdapter(rowSet);
    Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
            Long.MAX_VALUE, Spliterator.CONCURRENT|Spliterator.ORDERED) {
        @Override
        public boolean tryAdvance(Consumer<? super TypeRecord> action) {
            if(!rowSet.next()) return false;
            action.accept(typeMapper.mapRow(sqlRow));
            return true;
        }
    };
    return StreamSupport.stream(sp, true); //this boolean sets the stream as parallel
}
Note that for a lot of use cases, like this one, implementing a Spliterator is simpler than implementing an Iterator (which needs to be wrapped via spliteratorUnknownSize anyway). Also, there is no need to encapsulate this instantiation into a Supplier.
As a final note, the current implementation does not perform well for streams with an unknown size, as it treats Long.MAX_VALUE like a very large number, ignoring the "unknown" semantic assigned to it by the specification. It will be very beneficial to the parallel performance to provide an estimated size; it doesn't need to be precise. In fact, with the current implementation, even a completely made-up number, say 1000, may perform better than correctly using Long.MAX_VALUE to denote an entirely unknown size.
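For example, if a rough row count is known up front (say from table statistics or a prior count query), it could be passed as the estimate instead of Long.MAX_VALUE; the number below is purely illustrative:
Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
        1000 /* rough estimate of the row count; it does not need to be exact */,
        Spliterator.CONCURRENT|Spliterator.ORDERED) {
    … // tryAdvance as shown above
};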

Loading Files in UDF

I have a requirement to populate a field based on the evaluation of a UDF. The input to the UDF would be some other fields in the input as well as a csv sheet. Presently, the approach I have taken is to load the CSV file, GROUP it ALL and then pass it as a bag to the UDF along with the other required parameters. However, it's taking a very long time to complete the process (roughly about 3 hours) for source data of 170k records and csv records of about 150k.
I'm sure there must be a much more efficient way to handle this, and hence I need your inputs.
source_alias = LOAD 'src.csv' USING
PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray);
csv_alias = LOAD 'csv_file.csv' USING
PigStorage(',') AS (c1:chararray,c2:chararray,c3:chararray);
grpd_csv_alias = GROUP csv_alias ALL;
final_alias = FOREACH source_alias GENERATE f1 AS f1,
myUDF(grpd_csv_alias, f2) AS derived_f2;
Here is my UDF at a high level.
public class myUDF extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        String f2Response = "N";
        DataBag csvAliasBag = (DataBag)input.get(0);
        String f2 = (String) input.get(1);
        try {
            Iterator<Tuple> bagIterator = csvAliasBag.iterator();
            while (bagIterator.hasNext()) {
                Tuple localTuple = (Tuple)bagIterator.next();
                String col1 = ((String)localTuple.get(1)).trim().toLowerCase();
                String col2 = ((String)localTuple.get(2)).trim().toLowerCase();
                String col3 = ((String)localTuple.get(3)).trim().toLowerCase();
                String col4 = ((String)localTuple.get(4)).trim().toLowerCase();
                //<Custom logic to populate f2Response based on the value in f2 as well as col1, col2, col3 and col4>
            }
            return f2Response;
        }
        catch(Exception e){
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
I believe the process is taking too long because of building and passing csv_alias to the UDF for each row in the source file.
Is there any better way to handle this?
Thanks
For small files, you can put them on the distributed cache. This copies the file to each task node as a local file, and then you load it yourself. Here's an example from the Pig docs UDF section. I would not recommend parsing the file each time, however; store your results in a class variable and check to see if it's been initialized. If the csv is on the local file system, use getShipFiles. If the csv you're using is on HDFS, use the getCacheFiles method. Notice that for HDFS there's a file path followed by a # and some text. To the left of the # is the HDFS path and to the right is the name you want it to be called when it's copied to the local file system.
public class Udfcachetest extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String concatResult = "";

        FileReader fr = new FileReader("./smallfile1");
        BufferedReader d = new BufferedReader(fr);
        concatResult += d.readLine();

        fr = new FileReader("./smallfile2");
        d = new BufferedReader(fr);
        concatResult += d.readLine();

        return concatResult;
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/user/pig/tests/data/small#smallfile1"); // This is an hdfs file
        return list;
    }

    public List<String> getShipFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/home/hadoop/pig/smallfile2"); // This is a local file
        return list;
    }
}
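A minimal sketch of the "parse once, reuse" advice above, assuming the cached file is a comma-separated csv; the field name and the way the parsed rows are used are illustrative, not taken from the original post:
public class Udfcachetest extends EvalFunc<String> {
    //parsed rows of the cached csv; built lazily, once per task
    private List<String[]> csvRows;

    public String exec(Tuple input) throws IOException {
        if (csvRows == null) {
            csvRows = new ArrayList<String[]>();
            BufferedReader reader = new BufferedReader(new FileReader("./smallfile1"));
            String line;
            while ((line = reader.readLine()) != null) {
                csvRows.add(line.split(","));
            }
            reader.close();
        }
        //use csvRows together with the other fields of the input tuple here
        return String.valueOf(csvRows.size());
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/user/pig/tests/data/small#smallfile1"); // hdfs path; local name is smallfile1
        return list;
    }
}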

Hadoop Pig one line contain more than one record

Currently, I have a data file which I process line by line. Most of the lines contain one record I need, such as: id,name,total
But some lines contain more than one record, such as: id1,name1,total1,id2,name2,total2
I wrote my own load function and tried to return a tuple composed of a list of tuples. But I don't know how I can process data like the following:
((id1,name1,total1),(id2,name2,total2))...
And another question about the load function: if I find that some line contains an invalid value, should I return an empty tuple or just advance the reader to the next line?
Thanks.
I came up with a solution, which is to define my own load and store functions.
For the load, define the file input.
For the store, define the output in my putNext function as below:
@Override
public void putNext(Tuple t) throws IOException {
    List<Object> all = t.getAll();
    for (Object o : all) {
        logger.info(o.getClass());
        Tuple tuple = (Tuple) o;
        try {
            recordWriter.write(null, new Text(tuple.toString()));
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
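Using the custom storer from a Pig script would then look something like this (the class name and output path are placeholders):
STORE final_alias INTO 'output_dir' USING com.example.MyStorer();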
