Parallel Stream repeating items - java-8

I am retrieving big chunks of data from DB and using this data to write it somewhere else. In order to avoid a long processing time, I'm trying to use parallel streams to write it. When I run this as sequential streams, it works perfectly. However, if I change it to parallel, the behavior is odd: it prints the same object multiple times (more than 10).
@PostConstruct
public void retrieveAllTypeRecords() throws SQLException {
    logger.info("Retrieve batch of Type records.");
    try {
        Stream<TypeRecord> typeQueryAsStream = jdbcStream.getTypeQueryAsStream();
        typeQueryAsStream.forEach((type) -> {
            logger.info("Printing Type with field1: {} and field2: {}.", type.getField1(), type.getField2()); //the same object gets printed here multiple times
            //write this object somewhere else
        });
        logger.info("Completed full retrieval of Type data.");
    } catch (Exception e) {
        logger.error("error: " + e);
    }
}
public Stream<TypeRecord> getTypeQueryAsStream() throws SQLException {
    String sql = typeRepository.getQueryAllTypesRecords(); //retrieves SQL query in String format
    TypeMapper typeMapper = new TypeMapper();
    JdbcStream.StreamableQuery query = jdbcStream.streamableQuery(sql);
    Stream<TypeRecord> stream = query.stream()
            .map(row -> {
                return typeMapper.mapRow(row); //maps column values to object values
            });
    return stream;
}
public class StreamableQuery implements Closeable {
    (...)
    public Stream<SqlRow> stream() throws SQLException {
        final SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
        final SqlRow sqlRow = new SqlRowAdapter(rowSet);
        Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
            @Override
            public boolean hasNext() {
                return !rowSet.isLast();
            }

            @Override
            public SqlRow next() {
                if (!rowSet.next()) {
                    throw new NoSuchElementException();
                }
                return sqlRow;
            }
        }, Spliterator.CONCURRENT);
        return StreamSupport.stream(supplier, Spliterator.CONCURRENT, true); //this boolean sets the stream as parallel
    }
}
I've also tried using typeQueryAsStream.parallel().forEach((type) -> ...), but the result is the same.
Example of output:
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[main] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.

Well, look at your code:
final SqlRow sqlRow = new SqlRowAdapter(rowSet);
Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
    …
    @Override
    public SqlRow next() {
        if (!rowSet.next()) {
            throw new NoSuchElementException();
        }
        return sqlRow;
    }
}, Spliterator.CONCURRENT);
You are returning the same object every time. You achieve your desired effects by implicitly modifying the state of this object when calling rowSet.next().
This obviously can’t work when multiple threads try to access that single object concurrently. Even buffering some items to hand them over to another thread will cause trouble. Therefore, such interference can cause problems with sequential streams as well, as soon as stateful intermediate operations are involved, like sorted or distinct.
Assuming that typeMapper.mapRow(row) will produce an actual data item which has no interference to other data items, you should integrate this step into the stream source, to create a valid stream.
public Stream<TypeRecord> stream(TypeMapper typeMapper) throws SQLException {
    SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
    SqlRow sqlRow = new SqlRowAdapter(rowSet);
    Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
            Long.MAX_VALUE, Spliterator.CONCURRENT | Spliterator.ORDERED) {
        @Override
        public boolean tryAdvance(Consumer<? super TypeRecord> action) {
            if (!rowSet.next()) return false;
            action.accept(typeMapper.mapRow(sqlRow));
            return true;
        }
    };
    return StreamSupport.stream(sp, true); //this boolean sets the stream as parallel
}
Note that for a lot of use cases, like this one, implementing a Spliterator is simpler than implementing an Iterator (which needs to be wrapped via spliteratorUnknownSize anyway). Also, there is no need to encapsulate this instantiation into a Supplier.
As a final note, the current implementation does not perform well for streams with an unknown size, as it treats Long.MAX_VALUE like a very large number, ignoring the “unknown” semantic assigned to it by the specification. It will be very beneficial to parallel performance to provide an estimated size; it doesn’t need to be precise. In fact, with the current implementation, even a completely made-up number, say 1000, may perform better than correctly using Long.MAX_VALUE to denote an entirely unknown size.
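For illustration, here is a minimal sketch of the same source with an estimated size. The estimatedRowCount value, and how you obtain it (a cheap SELECT COUNT(*), table statistics, or just a configured guess), is an assumption and not part of the original code:

// Sketch only: estimatedRowCount is a hypothetical, rough figure; it does not
// have to be precise to improve how the parallel stream splits the work.
long estimatedRowCount = 1000L;
Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
        estimatedRowCount, Spliterator.CONCURRENT | Spliterator.ORDERED) {
    @Override
    public boolean tryAdvance(Consumer<? super TypeRecord> action) {
        if (!rowSet.next()) return false;
        action.accept(typeMapper.mapRow(sqlRow));
        return true;
    }
};
return StreamSupport.stream(sp, true);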

Related

Kafka Streams exactly-once re-balance aggregation state data loss

Running 3 Kafka Streams instances with exactly-once, but experiencing loss of data when restarting one of the streams instances (the other 2 doing re-balance).
If I restart the instance quickly (within session.timeout.ms), without the other 2 doing re-balance, everything is working as expected.
Input and output topics are created with 6 partitions.
Running 3 Kafka brokers.
Producing data with a single python producer in a loop (acks='all').
Outputting data to SQL with a single Kafka Connect configured with consumer.override.isolation.level=read_committed
I am expecting the aggregated data to have the same count as the output of my python loop. And this works just fine as long as Kafka Streams is not going into re-balance state.
In short, the streams instance does the following:
Collects session data and updates a session state.
Delta updates on the session state are then re-partitioned and summed using windowed aggregation.
Grepping through my own debug output I'm inclined to believe the problem is related to transferring the aggregation state:
Record A which is an update to session X is adding 0 to the aggregation.
Output from the aggregation is now 6
Record B which is an update to session X is adding 1 to the aggregation.
Output from the aggregation is now 7
Rebalance
Update to session X (which may or may not be a replay of Record A) is adding 0 to the aggregation.
Output from the aggregation is now 6
Simplified and stripped-down version of the code (not really a Java developer, so sorry for non-optimal syntax):
public static void main(String[] args) throws Exception {
    props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
    props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

    final StoreBuilder<KeyValueStore<MediaKey, SessionState>> storeBuilder = Stores.keyValueStoreBuilder(
            Stores.persistentKeyValueStore(SESSION_STATE_STORE),
            mediaKeySerde,
            sessionStateSerde
    );
    builder.addStateStore(storeBuilder);

    KStream<String, IncomingData> incomingData = builder.stream(
            SESSION_TOPIC, Consumed.with(Serdes.String(), mediaDataSerde));

    KGroupedStream<MediaKey, AggregatedData> mediaData = incomingData
            .transform(new SessionProcessingSupplier(SESSION_STATE_STORE), SESSION_STATE_STORE)
            .selectKey(...)
            .groupByKey(...);

    KTable<Windowed<MediaKey>, AggregatedData> aggregatedMedia = mediaData
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
            .aggregate(
                    new Initializer<AggregatedData>() {...},
                    new Aggregator<MediaKey, AggregatedData, AggregatedData>() {
                        @Override
                        public AggregatedData apply(MediaKey key, AggregatedData input, AggregatedData aggregated) {
                            // ... Add stuff to "aggregated"
                            return aggregated;
                        }
                    },
                    Materialized.<MediaKey, AggregatedData, WindowStore<Bytes, byte[]>>as("aggregated-media")
                            .withValueSerde(aggregatedDataSerde)
            );

    aggregatedMedia.toStream()
            .map(new KeyValueMapper<Windowed<MediaKey>, AggregatedData, KeyValue<MediaKey, PostgresOutput>>() {
                @Override
                public KeyValue<MediaKey, PostgresOutput> apply(Windowed<MediaKey> mediaidKey, AggregatedData data) {
                    // ... Some re-formatting and then
                    return new KeyValue<>(mediaidKey.key(), output);
                }
            })
            .to(POSTGRES_TOPIC, Produced.with(mediaKeySerde, postgresSerde));

    final Topology topology = builder.build();
    final KafkaStreams streams = new KafkaStreams(topology, props);
    // Shutdown hook
}
and:
public class SessionProcessingSupplier implements TransformerSupplier<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> {
    @Override
    public Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> get() {
        return new Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>>() {
            private ProcessorContext context;
            private KeyValueStore<String, Processing.SessionState> stateStore;

            @Override
            public void init(ProcessorContext processorContext) {
                this.context = processorContext;
                this.stateStore = (KeyValueStore<String, Processing.SessionState>) context.getStateStore(sessionStateStoreName);
            }

            @Override
            public KeyValue<String, Processing.AggregatedData> transform(String sessionid, Processing.IncomingData data) {
                Processing.SessionState state = this.stateStore.get(sessionid);
                // ... Update or create session state
                return new KeyValue<String, Processing.AggregatedData>(sessionid, output);
            }

            @Override
            public void close() {
                // nothing to clean up in this simplified version
            }
        };
    }
}

Update KTable based on partial data attributes

I am trying to update a KTable with partial data of an object.
Eg. User object is
{"id":1, "name":"Joe", "age":28}
The object is being streamed into a topic and grouped by key into KTable.
Now the user object is partially updated as follows: {"id":1, "age":33} and streamed into the table. But the updated table looks like this: {"id":1, "name":null, "age":28}.
The expected output is {"id":1, "name":"Joe", "age":33}.
How can I use Kafka Streams and Spring Cloud Stream to achieve the expected output? Any suggestions would be appreciated. Thanks.
Here is the code
@Bean
public Function<KStream<String, User>, KStream<String, User>> process() {
    return input -> input
            .map((key, user) -> new KeyValue<String, User>(user.getId(), user))
            .groupByKey(Grouped.with(Serdes.String(), new JsonSerde<>(User.class)))
            .reduce((user1, user2) -> {
                user1.merge(user2);
                return user1;
            }, Materialized.as("allusers"))
            .toStream();
}
and I modified the User object with the code below:
public void merge(Object newObject) {
    assert this.getClass().getName().equals(newObject.getClass().getName());
    for (Field field : this.getClass().getDeclaredFields()) {
        for (Field newField : newObject.getClass().getDeclaredFields()) {
            if (field.getName().equals(newField.getName())) {
                try {
                    field.set(this, newField.get(newObject) == null ? field.get(this) : newField.get(newObject));
                } catch (IllegalAccessException ignore) {
                }
            }
        }
    }
}
Is this the right approach, or is there another approach in KStreams?
I've tested your merge code, and it seems to be working as expected. But since your result after the reduce is {"id":1, "name":null, "age":28}, I can think of two things:
Your state isn't being updated at all, since no attribute has changed.
Maybe you have a serialization problem, since the string attribute is null, but the other int attributes are fine.
My guess is that, because you are mutating the original object and returning the same reference, Kafka Streams doesn't detect that as a change and won't store the new state. Actually, you shouldn't mutate your object anyway, since it could lead to non-determinism depending on your pipeline.
Try to change your merge function to create a new User object, and see if the behavior changes.
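For example, a minimal sketch of a non-mutating reduce, assuming the User fields are boxed types (so a missing attribute arrives as null) and the usual getters/setters exist; adjust to your actual class:

.reduce((current, update) -> {
    // build a fresh object instead of mutating 'current'
    User merged = new User();
    merged.setId(update.getId() != null ? update.getId() : current.getId());
    merged.setName(update.getName() != null ? update.getName() : current.getName());
    merged.setAge(update.getAge() != null ? update.getAge() : current.getAge());
    return merged;
}, Materialized.as("allusers"))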
So here is the recommended generic approach for merging the two objects; please feel free to comment here. For this to work, the object being merged should have an empty constructor.
public <T> T mergeObjects(T first, T second) {
    Class<?> clazz = first.getClass();
    Field[] fields = clazz.getDeclaredFields();
    Object newObject = null;
    try {
        newObject = clazz.getDeclaredConstructor().newInstance();
        for (Field field : fields) {
            field.setAccessible(true);
            Object value1 = field.get(first);
            Object value2 = field.get(second);
            Object value = (value2 == null) ? value1 : value2;
            field.set(newObject, value);
        }
    } catch (InstantiationException | IllegalAccessException | IllegalArgumentException
            | InvocationTargetException | NoSuchMethodException | SecurityException e) {
        e.printStackTrace();
    }
    return (T) newObject;
}
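For completeness, this is how it would plug into the reduce shown earlier (a sketch; it assumes mergeObjects is reachable from the lambda, e.g. as a static utility method):

.reduce((current, update) -> mergeObjects(current, update), Materialized.as("allusers"))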

Save Multiple Records using Web Flux and Mongo DB

I'm working on a project which uses Spring WebFlux and MongoDB, and I'm very new to reactive programming and WebFlux.
I have a scenario of saving into 3 collections using one service. For each collection I'm generating an id using a sequence and then saving the document. I have a FieldMaster which has a List<FieldInfo>, and every FieldInfo has a List<FieldOption>. I need to save FieldMaster, FieldInfo and FieldOption. Below is the code I'm using. The code works only when I'm running in debug mode; otherwise it gets blocked on the line below:
Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue());
Here is the full code
public Mono<FieldMaster> createMasterData(Mono<FieldMaster> fieldmaster) {
    return fieldmaster.flatMap(fm -> {
        return sequencesCollection.getNextSequence(FIELDMASTER).flatMap(seqVal -> {
            LOGGER.info("Generated Sequence value :" + seqVal.getSeqValue());
            fm.setId(Integer.parseInt(seqVal.getSeqValue()));

            List<FieldInfo> fieldInfo = fm.getFieldInfo();
            fieldInfo.forEach(field -> {
                // saving Field goes here
                Integer field_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDINFO).block().getSeqValue()); // stops execution at this line
                LOGGER.info("Generated Sequence value Field Sequence:" + field_seq_id);
                field.setId(field_seq_id);
                field.setMasterFieldRefId(fm.getId());
                mongoTemplate.save(field).block();
                LOGGER.info("Field Details Saved");

                List<FieldOption> fieldOption = field.getFieldOptions();
                fieldOption.forEach(option -> {
                    // saving Field Option goes here
                    Integer opt_seq_id = Integer.parseInt(sequencesCollection.getNextSequence(FIELDOPTION).block().getSeqValue());
                    LOGGER.info("Generated Sequence value Options Sequence:" + opt_seq_id);
                    option.setId(opt_seq_id);
                    option.setFieldRefId(field_seq_id);
                    mongoTemplate.save(option).log().block();
                    LOGGER.info("Field Option Details Saved");
                });
            });

            return mongoTemplate.save(fm).log();
        });
    });
}
First, in reactive programming it is not good to use .block(), because you turn non-blocking code into blocking code. If you want to read from one stream and save into 3 collections, you can do it like this.
There are many different ways to do that for performance purposes, but it depends on the amount of data.
Here you have a sample using simple data and the concat operator, but there are also zip and merge; it depends on your needs.
public void run(String... args) throws Exception {
    Flux<Integer> dbData = Flux.range(0, 10);
    dbData.flatMap(integer -> Flux.concat(
            saveAllInFirstCollection(integer),
            saveAllInSecondCollection(integer),
            saveAllInThirdCollection(integer)))
        .subscribe();
}

Flux<Integer> saveAllInFirstCollection(Integer integer) {
    System.out.println(integer);
    //process and save in collection
    return Flux.just(integer);
}

Flux<Integer> saveAllInSecondCollection(Integer integer) {
    System.out.println(integer);
    //process and save in collection
    return Flux.just(integer);
}

Flux<Integer> saveAllInThirdCollection(Integer integer) {
    System.out.println(integer);
    //process and save in collection
    return Flux.just(integer);
}
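Applied to the method in the question, a rough sketch of the same idea (assuming getNextSequence and mongoTemplate.save return Monos, as they appear to) would chain the per-field and per-option saves with flatMap instead of calling block() inside the pipeline:

public Mono<FieldMaster> createMasterData(Mono<FieldMaster> fieldmaster) {
    return fieldmaster.flatMap(fm ->
            sequencesCollection.getNextSequence(FIELDMASTER).flatMap(seqVal -> {
                fm.setId(Integer.parseInt(seqVal.getSeqValue()));
                // save every FieldInfo (and its FieldOptions) reactively, then save the master
                return Flux.fromIterable(fm.getFieldInfo())
                        .flatMap(field -> sequencesCollection.getNextSequence(FIELDINFO)
                                .flatMap(fieldSeq -> {
                                    field.setId(Integer.parseInt(fieldSeq.getSeqValue()));
                                    field.setMasterFieldRefId(fm.getId());
                                    return mongoTemplate.save(field)
                                            .thenMany(Flux.fromIterable(field.getFieldOptions())
                                                    .flatMap(option -> sequencesCollection.getNextSequence(FIELDOPTION)
                                                            .flatMap(optSeq -> {
                                                                option.setId(Integer.parseInt(optSeq.getSeqValue()));
                                                                option.setFieldRefId(field.getId());
                                                                return mongoTemplate.save(option);
                                                            })))
                                            .then();
                                }))
                        .then(mongoTemplate.save(fm));
            }));
}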

Hibernate queries getting slower and slower

I'm working on a process that checks and updates data from an Oracle database. I'm using Hibernate and the Spring framework in my application.
The application reads a CSV file, processes the content, then persists entities:
public class Main {
    public static void main(String[] args) {
        Input input = ReadCSV(path);
        EntityList resultList = Process.process(input);
        WriteResult.write(resultList);
        ...
    }
}
// Process class that loops over the input
public class Process {
    public EntityList process(Input input) {
        EntityList results = ...;
        ...
        for (Line line : input.readLine()) {
            results.add(ProcessLine.process(line));
            ...
        }
        return results;
    }
}
// retrieving and updating entities
class ProcessLine {
    @Autowired
    DomaineRepository domaineRepository;
    @Autowired
    CompanyDomaineService companydomaineService;

    @Transactional
    public MyEntity process(Line line) {
        // getCompanyByXX is a CrudRepository method with @Query that returns an entity object
        MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
        MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
        if (companyToDetach == null || companyToAttach == null) {
            throw new CustomException("Custom Exception");
        }
        // attachCompany retrieves some relationEntity, then removes companyToDetach and adds companyToAttach; this updates relationEntity.company.
        companydomaineService.attachCompany(companyToAttach, companyToDetach);
        return companyToAttach;
    }
}
public class WriteResult {
    @Autowired
    DomaineRepository domaineRepository;

    @Transactional
    public void write(EntityList results) {
        for (MyEntity result : results) {
            domaineRepository.save(result);
        }
    }
}
The application works well on files with few lines, but when I try to process large files (200 000 lines), the performance slows drastically and I get a SQL timeout.
I suspect cache issues, but I'm also wondering whether saving all the entities at the end of the processing is a bad practice?
The problem is your for loop, which is doing individual saves on the results and thus issues single inserts, slowing things down. Hibernate and Spring support batch inserts, and they should be used whenever possible.
Something like domaineRepository.saveAll(results).
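For example, a minimal sketch of the write step using saveAll (assuming a Spring Boot setup; the property names below are standard Hibernate/Spring ones, but where you set them depends on your configuration):

@Transactional
public void write(EntityList results) {
    // One repository call instead of one save() per entity.
    // For real JDBC batching also enable it, e.g. in application.properties:
    //   spring.jpa.properties.hibernate.jdbc.batch_size=50
    //   spring.jpa.properties.hibernate.order_inserts=true
    // Note: insert batching is disabled when ids use GenerationType.IDENTITY.
    domaineRepository.saveAll(results);
}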
Since you are processing a lot of data, it might be better to do things in batches: instead of getting one company to attach at a time, get a list of companies to attach and process those, then get a list of companies to detach and process those.
public EntityList process(Input input) {
    EntityList results;
    List<Code> companiesToAdd = new ArrayList<>();
    List<Siret> companiesToRemove = new ArrayList<>();
    for (Line line : input.readLine()) {
        companiesToAdd.add(line.getCode());
        companiesToRemove.add(line.getSiret());
        ...
    }
    results = process(companiesToAdd, companiesToRemove);
    return results;
}

public List<MyEntity> process(List<Code> companiesToAdd, List<Siret> companiesToRemove) {
    List<MyEntity> attachList = domaineRepository.getCompanyByCodeIn(companiesToAdd);
    List<MyEntity> detachList = domaineRepository.getCompanyBySiretIn(companiesToRemove);
    if (attachList.isEmpty() || detachList.isEmpty()) {
        throw new CustomException("Custom Exception");
    }
    companydomaineService.attachCompany(attachList, detachList);
    return attachList;
}
The above code is just pseudo-code to point you in the right direction; you will need to work out what works for you.
For every line you read, you are doing 2 read operations here:
MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
You can read more than one line, use an IN query, and then process that list of companies.
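For reference, a sketch of what those batched lookups could look like as Spring Data derived queries; the method names, field types and id type here are illustrative assumptions, so adjust them to your actual entity:

public interface DomaineRepository extends CrudRepository<MyEntity, Long> {
    // Derived "In" queries: one round trip for a whole batch of codes/sirets
    List<MyEntity> findByCodeIn(Collection<String> codes);
    List<MyEntity> findBySiretIn(Collection<String> sirets);
}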

Querying single database row using rxjava2

I am using rxjava2 for the first time on an Android project, and am doing SQL queries on a background thread.
However, I am having trouble figuring out the best way to do a simple SQL query and handle the case where the record may or may not exist. Here is the code I am using:
public Observable<Record> createRecordObservable(int id) {
    Callable<Record> callback = new Callable<Record>() {
        @Override
        public Record call() throws Exception {
            // do the actual sql stuff, e.g.
            // select * from Record where id = ?
            return record;
        }
    };
    return Observable.fromCallable(callback).subscribeOn(Schedulers.computation());
}
This works well when there is a record present. But when no record matches the id, it is treated like an error. Apparently this is because RxJava2 doesn't allow the Callable to return a null.
Obviously I don't really want this. An error should only occur if the database failed or something, whereas an empty result is perfectly valid. I read somewhere that one possible solution is wrapping Record in a Java 8 Optional, but my project is not Java 8, and anyway that solution seems a bit ugly.
This is surely such a common, everyday task that I'm sure there must be a simple and easy solution, but I couldn't find one so far. What is the recommended pattern to use here?
Your use case seems appropriate for the new RxJava2 reactive type Maybe, which emits 1 or 0 items.
Maybe.fromCallable will treat a returned null as no item emitted.
You can see this discussion regarding nulls with RxJava2; I guess there are not many choices besides using something Optional-like in other cases where you need null/empty values.
Thanks to @yosriz, I have it working with Maybe. Since I can't put code in comments, I'll post a complete answer here:
Instead of Observable, use Maybe like this:
public Maybe<Record> lookupRecord(int id) {
    Callable<Record> callback = new Callable<Record>() {
        @Override
        public Record call() throws Exception {
            // do the actual sql stuff, e.g.
            // select * from Record where id = ?
            return record;
        }
    };
    return Maybe.fromCallable(callback).subscribeOn(Schedulers.computation());
}
The good thing is that the returned record is allowed to be null. To detect which situation occurred in the subscriber, the code looks like this:
lookupRecord(id)
    .observeOn(AndroidSchedulers.mainThread())
    .subscribe(new Consumer<Record>() {
        @Override
        public void accept(Record r) {
            // record was loaded OK
        }
    }, new Consumer<Throwable>() {
        @Override
        public void accept(Throwable throwable) {
            // there was an error
        }
    }, new Action() {
        @Override
        public void run() {
            // there was an empty result
        }
    });
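As a follow-up, if the project later moves to Java 8 sources (or uses Retrolambda/desugaring), the same three-argument subscribe can be written with lambdas; the behavior is identical to the anonymous classes above:

lookupRecord(id)
    .observeOn(AndroidSchedulers.mainThread())
    .subscribe(
        record -> { /* record was loaded OK */ },
        throwable -> { /* there was an error */ },
        () -> { /* there was an empty result */ });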
