I'm using a [java] Kafka producer to push data to Kafka topic x, and a [java] high-level consumer/BulkProcessor to read from topic x and index the data into Elasticsearch. The producer pushes 10 docs each time. When I start my Java code for the BulkProcessor for the first time after running the producer, I see only 9 records pushed to ES, all with "_version": 1. The 10th record is not in ES.
However, the beforeBulk() and afterBulk() methods show the following output:
Going to execute new bulk composed of 10 actions
Executed bulk composed of 10 actions
From this moment onwards, if I delete the Elasticsearch index and run the producer again, I consistently see 10 records. I have no idea why this is happening. Any help is appreciated.
Note: ES version 2.2.0
Kafka: 0.9.0.0
EDIT [Added relevant code]
public Consumer(KafkaStream a_stream, int a_threadNumber, String esHost, String esCluster, int bulkSize, String topic) {
/*Create transport client*/
// bulkProcessor is an instance field, also used later in run()
this.bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
public void beforeBulk(long executionId, BulkRequest request) {
System.out.format("Going to execute new bulk composed of %d actions\n", request.numberOfActions());
}
public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
System.out.format("Executed bulk composed of %d actions\n", response.getItems().length);
}
public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
System.out.format("Error executing bulk: %s\n", failure);
}
}).setBulkActions(bulkSize)
.setBulkSize(new ByteSizeValue(200, ByteSizeUnit.MB))
.setFlushInterval(TimeValue.timeValueSeconds(1))
.build();
}
public void run() {
ConsumerIterator<byte[], byte[]> it = m_stream.iterator();
while (it.hasNext()) {
byte[] x = it.next().message();
try {
bulkProcessor.add(new IndexRequest(index, type, id.toString()).source(modifyMsg(x).toString()));
}
catch (Exception e) {
logger.warn("bulkProcessor failed: " + m_threadNumber + " " + e.getMessage());
}
}
logger.info("Shutting down Thread: " + m_threadNumber);
}
Docs going to ES are of the following form:
{"index":"temp1","type":"temp2","id":"0","event":"we're doomed"}
{"index":"temp1","type":"temp2","id":"1","event":"we're doomed"}
{"index":"temp1","type":"temp2","id":"2","event":"we're doomed"}
...
{"index":"temp1","type":"temp2","id":"9","event":"we're doomed"}
[EDIT]
If I add the following line to my run() method, the problem goes away.
public void run() {
...
bulkProcessor.add(new IndexRequest("")); //Added this line
while (it.hasNext()) {
...
}
...
}
I feel like such a fool. In the line bulkProcessor.add(new IndexRequest(index, type, id.toString()).source(modifyMsg(x).toString())); the method modifyMsg() was initializing index, type and id, which were set to empty strings in the constructor. That's why my first index request was failing: it had an invalid (empty) index name.
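In hindsight, the problem would have shown up immediately if afterBulk() surfaced per-item failures instead of only counting actions. A small sketch against the same listener (using the ES 2.x BulkResponse API):
@Override
public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
    // Report rejected items, e.g. the request with the invalid (empty) index name
    if (response.hasFailures()) {
        System.out.format("Bulk had failures: %s\n", response.buildFailureMessage());
    }
    System.out.format("Executed bulk composed of %d actions\n", response.getItems().length);
}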
Related
public void createIndexWithCustomMappings(String indexName, String fieldsMapping) throws Exception {
    CreateIndexResponse createIndexResponse = client.admin().indices()
        .prepareCreate(indexName).setSettings(fieldsMapping).execute().get();
}
I have code which creates an index in Elasticsearch from a Spring Boot application. Currently it uses the TransportClient, which is now deprecated according to the Elasticsearch documentation and has been replaced by the High Level REST Client.
For creating an index using the High Level REST Client, I have seen this code:
CreateIndexRequest request = new CreateIndexRequest(indexName);
CreateIndexResponse createIndexResponse = client.indices().create(request, RequestOptions.DEFAULT);
Here fieldsMapping is a JSON file containing the analyzer, tokenizer and filter details, and it is passed to this method as a String. I am not able to find methods in the Java REST High Level Client that correspond to setSettings(fieldsMapping).execute().get() as done above with the TransportClient.
Any idea how this setSettings(fieldsMapping) can be done with the Java High Level REST Client?
You can use the implementation from the ElasticsearchRestTemplate itself.
Using Elasticsearch 6.x:
This is how you create the index with settings:
@Override
public boolean createIndex(String indexName, Object settings) {
CreateIndexRequest request = new CreateIndexRequest(indexName);
if (settings instanceof String) {
request.settings(String.valueOf(settings), Requests.INDEX_CONTENT_TYPE);
} else if (settings instanceof Map) {
request.settings((Map) settings);
} else if (settings instanceof XContentBuilder) {
request.settings((XContentBuilder) settings);
}
try {
return client.indices().create(request, RequestOptions.DEFAULT).isAcknowledged();
} catch (IOException e) {
throw new ElasticsearchException("Error for creating index: " + request.toString(), e);
}
}
This is how you update the mappings for the index:
@Override
public boolean putMapping(String indexName, String type, Object mapping) {
Assert.notNull(indexName, "No index defined for putMapping()");
Assert.notNull(type, "No type defined for putMapping()");
PutMappingRequest request = new PutMappingRequest(indexName).type(type);
if (mapping instanceof String) {
request.source(String.valueOf(mapping), XContentType.JSON);
} else if (mapping instanceof Map) {
request.source((Map) mapping);
} else if (mapping instanceof XContentBuilder) {
request.source((XContentBuilder) mapping);
}
try {
return client.indices().putMapping(request, RequestOptions.DEFAULT).isAcknowledged();
} catch (IOException e) {
throw new ElasticsearchException("Failed to put mapping for " + indexName, e);
}
}
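For illustration, the two helpers above could be called together like this (the index name, type and JSON strings below are just placeholders):
String settingsJson = "{\"index\":{\"number_of_shards\":1,\"number_of_replicas\":0}}"; // placeholder settings
String mappingJson = "{\"properties\":{\"title\":{\"type\":\"text\"}}}"; // placeholder mapping
createIndex("my-index", settingsJson);
putMapping("my-index", "_doc", mappingJson);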
Using Elasticsearch 7.x:
You need to create an IndexCoordinates variable with IndexCoordinates.of("indexName"),
get the IndexOperations from the ElasticsearchRestTemplate for that index,
and create your index via the indexOperations variable like this:
IndexOperations indexOperations = elasticsearchTemplate.indexOps(indexCoordinates);
String indexSettings = ""; // Pass settings JSON string here
String mappingJson = ""; // Pass mapping JSON string here
Document mapping = Document.parse(mappingJson);
Map<String, Object> settings = JacksonUtil.fromString(indexSettings, new TypeReference<>() {});
indexOperations.create(settings, mapping);
indexOperations.refresh(); //(Optional) refreshes the doc count
It really depends on which spring-data-elasticsearch version you are using. Feel free to check out the documentation as well:
https://docs.spring.io/spring-data/elasticsearch/docs/current/reference/html/#new-features
Hope this helps with your elasticsearch journey! Feel free to ask more questions regarding the java implementation :)
I'm running 3 Kafka Streams instances with exactly-once processing enabled, but I'm experiencing data loss when restarting one of the instances (the other 2 then rebalance).
If I restart the instance quickly (within session.timeout.ms), without the other 2 rebalancing, everything works as expected.
Input and output topics are created with 6 partitions.
Running 3 Kafka brokers.
Producing data with a single python producer in a loop (acks='all').
Outputting data to SQL with a single Kafka Connect configured with consumer.override.isolation.level=read_committed
I am expecting the aggregated data to have the same count as the output of my python loop, and this works just fine as long as Kafka Streams does not go into a rebalance.
In short, the streams instance does the following:
Collects session data and updates a session state.
Delta updates on the session state are then re-partitioned and summed using windowed aggregation.
Grepping through my own debug output, I'm inclined to believe the problem is related to transferring the aggregation state:
Record A, which is an update to session X, adds 0 to the aggregation.
Output from the aggregation is now 6.
Record B, which is an update to session X, adds 1 to the aggregation.
Output from the aggregation is now 7.
Rebalance.
An update to session X (which may or may not be a replay of Record A) adds 0 to the aggregation.
Output from the aggregation is now 6.
Simplified and stripped-down version of the code (not really a Java developer, so sorry for non-optimal syntax):
public static void main(String[] args) throws Exception {
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
final StoreBuilder<KeyValueStore<MediaKey, SessionState>> storeBuilder = Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(SESSION_STATE_STORE),
mediaKeySerde,
sessionStateSerde
);
builder.addStateStore(storeBuilder);
KStream<String, IncomingData> incomingData = builder.stream(
SESSION_TOPIC, Consumed.with(Serdes.String(), mediaDataSerde));
KGroupedStream<MediaKey, AggregatedData> mediaData = incomingData
.transform(new SessionProcessingSupplier(SESSION_STATE_STORE), SESSION_STATE_STORE)
.selectKey(...)
.groupByKey(...);
KTable<Windowed<MediaKey>, AggregatedData> aggregatedMedia = mediaData
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
.aggregate(
new Initializer<AggregatedData>() {...},
new Aggregator<MediaKey, AggregatedData, AggregatedData>() {
@Override
public AggregatedData apply(MediaKey key, AggregatedData input, AggregatedData aggregated) {
// ... Add stuff to "aggregated"
return aggregated;
}
},
Materialized.<MediaKey, AggregatedData, WindowStore<Bytes, byte[]>>as("aggregated-media")
.withValueSerde(aggregatedDataSerde)
);
aggregatedMedia.toStream()
.map(new KeyValueMapper<Windowed<MediaKey>, AggregatedData, KeyValue<MediaKey, PostgresOutput>>() {
@Override
public KeyValue<MediaKey, PostgresOutput> apply(Windowed<MediaKey> mediaidKey, AggregatedData data) {
// ... Some re-formatting and then
return new KeyValue<>(mediaidKey.key(), output);
}
})
.to(POSTGRES_TOPIC, Produced.with(mediaKeySerde, postgresSerde));
final Topology topology = builder.build();
final KafkaStreams streams = new KafkaStreams(topology, props);
// Shutdown hook
}
and:
public class SessionProcessingSupplier implements TransformerSupplier<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> {
@Override
public Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>> get() {
return new Transformer<String, Processing.IncomingData, KeyValue<String, Processing.AggregatedData>>() {
@Override
public void init(ProcessorContext processorContext) {
this.context = processorContext;
this.stateStore = (KeyValueStore<String, Processing.SessionState>) context.getStateStore(sessionStateStoreName);
}
@Override
public KeyValue<String, Processing.AggregatedData> transform(String sessionid, Processing.IncomingData data) {
Processing.SessionState state = this.stateStore.get(sessionid);
// ... Update or create session state
return new KeyValue<String, Processing.AggregatedData>(sessionid, output);
}
};
}
}
I am trying to update a KTable with partial data of an object.
Eg. User object is
{"id":1, "name":"Joe", "age":28}
The object is streamed into a topic and grouped by key into a KTable.
Now the user object is partially updated as follows, {"id":1, "age":33}, and streamed into the same topic. But the updated table looks like this: {"id":1, "name":null, "age":28}.
The expected output is {"id":1, "name":"Joe", "age":33}.
How can I use Kafka Streams and Spring Cloud Stream to achieve the expected output? Any suggestions would be appreciated. Thanks.
Here is the code
@Bean
public Function<KStream<String, User>, KStream<String, User>> process() {
return input -> input.map((key, user) -> new KeyValue<String, User>(user.getId(), user))
.groupByKey(Grouped.with(Serdes.String(), new JsonSerde<>(User.class))).reduce((user1, user2) -> {
user1.merge(user2);
return user1;
}, Materialized.as("allusers")).toStream();
}
and I modified the User object with the code below:
public void merge(Object newObject) {
assert this.getClass().getName().equals(newObject.getClass().getName());
for (Field field : this.getClass().getDeclaredFields()) {
for (Field newField : newObject.getClass().getDeclaredFields()) {
if (field.getName().equals(newField.getName())) {
try {
field.set(this, newField.get(newObject) == null ? field.get(this) : newField.get(newObject));
} catch (IllegalAccessException ignore) {
}
}
}
}
}
Is this the right approach, or is there another approach in KStreams?
I've tested your merge code, and it seems to be working as expected. But since your result after the reduce is {"id":1, "name":null, "age":28}, I can think of two things:
Your state isn't being updated at all, since no attribute has changed.
Maybe you have a serialization problem, since the string attribute is null, but the other int attributes are fine.
My guess is that, because you are mutating the original object and returning the same value, Kafka Streams doesn't detect that as a change and won't store the new state. In any case, you shouldn't mutate your object, since it could lead to non-determinism depending on your pipeline.
Try to change your merge function to create a new User object, and see if the behavior changes.
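For illustration, a non-mutating merge could look roughly like this (a sketch assuming User has getters/setters and nullable wrapper fields; it would replace the user1.merge(user2); return user1; body in the reduce):
// Returns a fresh User instead of mutating the existing state value
public static User merge(User current, User update) {
    User merged = new User();
    merged.setId(update.getId() != null ? update.getId() : current.getId());
    merged.setName(update.getName() != null ? update.getName() : current.getName());
    merged.setAge(update.getAge() != null ? update.getAge() : current.getAge());
    return merged;
}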
So here is the recommended generic approach for merging the two objects; please feel free to comment. For this to work, the object being merged should have an empty (no-argument) constructor.
public <T> T mergeObjects(T first, T second) {
Class<?> clazz = first.getClass();
Field[] fields = clazz.getDeclaredFields();
Object newObject = null;
try {
newObject = clazz.getDeclaredConstructor().newInstance();
for (Field field : fields) {
field.setAccessible(true);
Object value1 = field.get(first);
Object value2 = field.get(second);
Object value = (value2 == null) ? value1 : value2;
field.set(newObject, value);
}
} catch (InstantiationException | IllegalAccessException | IllegalArgumentException
| InvocationTargetException | NoSuchMethodException | SecurityException e) {
e.printStackTrace();
}
return (T) newObject;
}
I am trying to develop a batch process using Spring Batch + Spring Boot (Java config), but I have a problem doing so. I have a piece of software that has a database and a Java API, and I read records from there. The batch process should retrieve all the documents whose expiration date is earlier than a certain date, update the date, and save them again in the same database.
My first approach was reading the records 100 by 100; the ItemReader retrieves 100 records, I process them one by one, and finally I write them again. In the reader, I put this code:
public class DocumentItemReader implements ItemReader<Document> {
public List<Document> documents = new ArrayList<>();
@Override
public Document read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
if(documents.isEmpty()) {
getDocuments(); // This method retrieves 100 documents and stores them in the "documents" list.
if(documents.isEmpty()) return null;
}
Document doc = documents.get(0);
documents.remove(0);
return doc;
}
}
So, with this code, the reader reads from the database until no records are found. When the "getDocuments()" method doesn't retrieve any documents, the list is empty and the reader returns null (so the Job finishes). Everything worked fine here.
However, the problem appears if I want to use several threads. In this case, I started using the Partitioner approach instead of multi-threading. The reason for doing that is that I read from the same database, so if I repeat the full step with several threads, all of them will find the same records, and I cannot use pagination (see below).
Another problem is that database records are updated dynamically, so I cannot use pagination. For example, let's suppose I have 200 records, all of them about to expire, so the process is going to retrieve them. Now imagine I retrieve 10 with one thread and, before anything else, that thread processes one and updates it in the same database. The next thread cannot retrieve records 11 to 20, as the first record no longer appears in the search (it has been processed, its date has been updated, so it no longer matches the query).
It is a little difficult to understand, and some things may sound strange, but in my project:
I am forced to use the same database to read and write.
I can have millions of documents, so I cannot read all the records at the same time. I need to read them 100 by 100, or 500 by 500.
I need to use several threads.
I cannot use pagination, as the query to the database will retrieve different documents each time it is executed.
So, after hours of thinking, I believe the only possible solution is to repeat the job until the query retrieves no documents. Is this possible? I want to do something like the step does (do something until null is returned): repeat the job until the query returns zero records.
If this is not a good approach, I will appreciate other possible solutions.
Thank you.
Maybe you can add a partitioner to your step that will:
Select all the ids of the data that need to be updated (and other columns if needed).
Split them into x partitions (x = the gridSize parameter) and write them to temporary files (one per partition).
Register the file name to read in the executionContext.
Then your reader no longer reads from the database but from the partitioned file (see the reader sketch after the partitioner below).
It seems complicated but it's not that bad. Here is an example which handles millions of records using a JDBC query, but it can easily be transposed to your use case:
public class JdbcToFilePartitioner implements Partitioner {
/** number of records per database fetch */
private int fetchSize = 100;
/** working directory */
private File tmpDir;
/** limit the number of items to select */
private Long nbItemMax;
/** extra context parameters copied into each partition context (used below) */
private Map<String, Object> contextParameters;
@Override
public Map<String, ExecutionContext> partition(final int gridSize) {
// Create a context for each partition
Map<String, ExecutionContext> executionsContexte = createExecutionsContext(gridSize);
// Fill the partition files with the ids to handle
getIdsAndFillPartitionFiles(executionsContexte);
return executionsContexte;
}
/**
* @param gridSize number of partitions
* @return map of execution context, one for each partition
*/
private Map<String, ExecutionContext> createExecutionsContext(final int gridSize) {
final Map<String, ExecutionContext> map = new HashMap<>();
for (int partitionId = 0; partitionId < gridSize; partitionId++) {
map.put(String.valueOf(partitionId), createContext(partitionId));
}
return map;
}
/**
* @param partitionId id of the partition to create context
* @return created executionContext
*/
private ExecutionContext createContext(final int partitionId) {
final ExecutionContext context = new ExecutionContext();
String fileName = tmpDir + File.separator + "partition_" + partitionId + ".txt";
context.put(PartitionerConstantes.ID_GRID.getCode(), partitionId);
context.put(PartitionerConstantes.FILE_NAME.getCode(), fileName);
if (contextParameters != null) {
for (Entry<String, Object> entry : contextParameters.entrySet()) {
context.put(entry.getKey(), entry.getValue());
}
}
return context;
}
private void getIdsAndFillPartitionFiles(final Map<String, ExecutionContext> executionsContexte) {
List<BufferedWriter> fileWriters = new ArrayList<>();
try {
// BufferedWriter for each partition
for (int i = 0; i < executionsContexte.size(); i++) {
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(executionsContexte.get(String.valueOf(i)).getString(
PartitionerConstantes.FILE_NAME.getCode())));
fileWriters.add(bufferedWriter);
}
// Fetch the data
ScrollableResults results = runQuery();
// Get the result and fill the files
int currentPartition = 0;
int nbWriting = 0;
while (results.next()) {
fileWriters.get(currentPartition).write(results.get(0).toString());
fileWriters.get(currentPartition).newLine();
currentPartition++;
nbWriting++;
// If we have already written to all partitions, start again at the first one
if (currentPartition >= executionsContexte.size()) {
currentPartition = 0;
}
// If we reach the max number of items to read, stop
if (nbItemMax != null && nbItemMax != 0 && nbWriting >= nbItemMax) {
break;
}
}
// closing
results.close();
session.close();
for (BufferedWriter bufferedWriter : fileWriters) {
bufferedWriter.close();
}
} catch (IOException | SQLException e) {
throw new UnexpectedJobExecutionException("Error writing partition file", e);
}
}
private ScrollableResults runQuery() {
...
}
}
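To complete the picture, the reader of the partitioned step can then be a step-scoped flat-file reader that picks the file name up from the step execution context. A rough sketch (the bean name is illustrative, and 'FILE_NAME' stands for whatever PartitionerConstantes.FILE_NAME.getCode() returns; the classes are standard Spring Batch ones):
@Bean
@StepScope
public FlatFileItemReader<String> partitionedIdReader(
        @Value("#{stepExecutionContext['FILE_NAME']}") String fileName) {
    // Reads one id per line from the partition file written by the partitioner
    return new FlatFileItemReaderBuilder<String>()
            .name("partitionedIdReader")
            .resource(new FileSystemResource(fileName))
            .lineMapper(new PassThroughLineMapper())
            .build();
}
Each chunk can then load the documents by id, update the date and write them back, which avoids the moving-query problem entirely.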
I am retrieving big chunks of data from a DB and using this data to write it somewhere else. In order to avoid long processing times, I'm trying to use parallel streams to write it. When I run this as a sequential stream, it works perfectly. However, if I change it to parallel, the behavior is odd: it prints the same object multiple times (more than 10).
@PostConstruct
public void retrieveAllTypeRecords() throws SQLException {
logger.info("Retrieve batch of Type records.");
try {
Stream<TypeRecord> typeQueryAsStream = jdbcStream.getTypeQueryAsStream();
typeQueryAsStream.forEach((type) -> {
logger.info("Printing Type with field1: {} and field2: {}.", type.getField1(), type.getField2()); //the same object gets printed here multiple times
//write this object somewhere else
});
logger.info("Completed full retrieval of Type data.");
} catch (Exception e) {
logger.error("error: " + e);
}
}
public Stream<TypeRecord> getTypeQueryAsStream() throws SQLException {
String sql = typeRepository.getQueryAllTypesRecords(); //retrieves SQL query in String format
TypeMapper typeMapper = new TypeMapper();
JdbcStream.StreamableQuery query = jdbcStream.streamableQuery(sql);
Stream<TypeRecord> stream = query.stream()
.map(row -> {
return typeMapper.mapRow(row); //maps columns values to object values
});
return stream;
}
public class StreamableQuery implements Closeable {
(...)
public Stream<SqlRow> stream() throws SQLException {
final SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
final SqlRow sqlRow = new SqlRowAdapter(rowSet);
Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
@Override
public boolean hasNext() {
return !rowSet.isLast();
}
@Override
public SqlRow next() {
if (!rowSet.next()) {
throw new NoSuchElementException();
}
return sqlRow;
}
}, Spliterator.CONCURRENT);
return StreamSupport.stream(supplier, Spliterator.CONCURRENT, true); //this boolean sets the stream as parallel
}
}
I've also tried using typeQueryAsStream.parallel().forEach((type) -> ...), but the result is the same.
Example of output:
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[main] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
[ForkJoinPool.commonPool-worker-1] INFO TypeService - Saving Type with field1: L6797 and field2: P1433.
Well, look at your code:
final SqlRow sqlRow = new SqlRowAdapter(rowSet);
Supplier<Spliterator<SqlRow>> supplier = () -> Spliterators.spliteratorUnknownSize(new Iterator<SqlRow>() {
…
@Override
public SqlRow next() {
if (!rowSet.next()) {
throw new NoSuchElementException();
}
return sqlRow;
}
}, Spliterator.CONCURRENT);
You are returning the same object every time. You achieve your desired effects by implicitly modifying the state of this object when calling rowSet.next().
This obviously can’t work when multiple threads try to access that single object concurrently. Even buffering some items to hand them over to another thread will cause trouble. Therefore, such interference can cause problems with sequential streams as well, as soon as stateful intermediate operations are involved, like sorted() or distinct().
Assuming that typeMapper.mapRow(row) produces an actual data item that does not interfere with other data items, you should integrate this step into the stream source to create a valid stream.
public Stream<TypeRecord> stream(TypeMapper typeMapper) throws SQLException {
SqlRowSet rowSet = new ResultSetWrappingSqlRowSet(preparedStatement.executeQuery());
SqlRow sqlRow = new SqlRowAdapter(rowSet);
Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
Long.MAX_VALUE, Spliterator.CONCURRENT|Spliterator.ORDERED) {
@Override
public boolean tryAdvance(Consumer<? super TypeRecord> action) {
if(!rowSet.next()) return false;
action.accept(typeMapper.mapRow(sqlRow));
return true;
}
};
return StreamSupport.stream(sp, true); //this boolean sets the stream as parallel
}
Note that for a lot of use cases, like this one, implementing a Spliterator is simpler than implementing an Iterator (which needs to be wrapped via spliteratorUnknownSize anyway). Also, there is no need to encapsulate this instantiation into a Supplier.
As a final note, the current implementation does not perform well for streams with an unknown size, as it treats Long.MAX_VALUE like a very large number, ignoring the “unknown” semantic assigned to it by the specification. It will be very beneficial to the parallel performance to provide an estimated size; it doesn’t need to be precise. In fact, with the current implementation, even a completely made-up number, say 1000, may perform better than correctly using Long.MAX_VALUE to denote an entirely unknown size.
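For example, if a rough row count is available up front (the separate count query below is an assumption, not part of the original code), it can be passed as the estimate instead of Long.MAX_VALUE:
// Hypothetical: obtain a rough size estimate before streaming
long estimatedRows = jdbcTemplate.queryForObject("SELECT COUNT(*) FROM type_table", Long.class);

Spliterator<TypeRecord> sp = new Spliterators.AbstractSpliterator<TypeRecord>(
        estimatedRows, Spliterator.CONCURRENT | Spliterator.ORDERED) {
    @Override
    public boolean tryAdvance(Consumer<? super TypeRecord> action) {
        if (!rowSet.next()) return false;
        action.accept(typeMapper.mapRow(sqlRow));
        return true;
    }
};
return StreamSupport.stream(sp, true);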