Duplication with Chunks in Spring Batch

I have a huge file that I need to read and dump into a DB. Any invalid records (invalid length, duplicate keys, etc.) need to be written to an Error Report. Due to the huge size of the file, we tried using a chunk size (commit interval) of 1000/5000/10000. In the process I found that the data was being processed redundantly due to the usage of chunks, so my Error Report is incorrect: it has not only the actual invalid records from the input file but also duplicates from the chunks.
Code snippet:
@Bean
public Step readAndWriteStudentInfo() {
    return stepBuilderFactory.get("readAndWriteStudentInfo")
            .<Student, Student>chunk(5000)
            .reader(studentFileReader())
            .faultTolerant()
            .skipPolicy(skipper)
            .listener(listener)
            .processor(new ItemProcessor<Student, Student>() {
                @Override
                public Student process(Student student) throws Exception {
                    // Returning null filters the item out of the chunk
                    if (processedRecords.contains(student)) {
                        return null;
                    } else {
                        processedRecords.add(student);
                        return student;
                    }
                }
            })
            .writer(studentDBWriter())
            .build();
}
@Bean
public ItemReader<Student> studentFileReader() {
    FlatFileItemReader<Student> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource(studentInfoFileName));
    reader.setLineMapper(new DefaultLineMapper<Student>() {{
        setLineTokenizer(new FixedLengthTokenizer() {{
            setNames(classProperties50);
            setColumns(range50);
        }});
        setFieldSetMapper(new BeanWrapperFieldSetMapper<Student>() {{
            setTargetType(Student.class);
        }});
    }});
    reader.setSaveState(false);
    reader.setLinesToSkip(1);
    reader.setRecordSeparatorPolicy(new TrailerSkipper());
    return reader;
}
@Bean
public ItemWriter<Student> studentDBWriter() {
    JdbcBatchItemWriter<Student> writer = new JdbcBatchItemWriter<>();
    writer.setSql(insertQuery);
    writer.setDataSource(dataSource);
    writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<Student>());
    return writer;
}
I've tried various chunk sizes: 10, 100, 1000, 5000. The accuracy of my error report deteriorates as the chunk size increases. Writing to the Error Report happens in my implementation of SkipPolicy; kindly let me know if that code is needed as well.
How do I ensure that my writer picks up a unique set of records in each chunk?
Skipper Implementation:
@Override
public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
    String exception = t.getClass().getSimpleName();
    if (t instanceof FileNotFoundException) {
        return false;
    }
    switch (exception) {
    case "FlatFileParseException":
        FlatFileParseException ffpe = (FlatFileParseException) t;
        String errorMessage = "Line no = " + ffpe.getLineNumber() + " " + ffpe.getMessage()
                + " Record is [" + ffpe.getInput() + "].\n";
        writeToRecon(errorMessage);
        return true;
    case "SQLException":
        SQLException sE = (SQLException) t;
        String sqlErrorMessage = sE.getErrorCode() + " Record is [" + sE.getCause() + "].\n";
        writeToRecon(sqlErrorMessage);
        return true;
    case "BatchUpdateException":
        BatchUpdateException batchUpdateException = (BatchUpdateException) t;
        String btchUpdtExceptionMsg = batchUpdateException.getMessage() + " " + batchUpdateException.getCause();
        writeToRecon(btchUpdtExceptionMsg);
        return true;
    default:
        return false;
    }
}
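One pattern worth trying (a sketch, not a verified fix for this exact job): in a fault-tolerant step, a failure during writing makes Spring Batch roll back and re-process the chunk item by item, so the SkipPolicy can be consulted more than once for the same record, which duplicates the report entries. A SkipListener, by contrast, is invoked exactly once per skipped item. Moving the report writing there and leaving shouldSkip as a pure yes/no decision should keep the Error Report accurate; the listener below assumes the same writeToRecon method used by the skipper above.
// Sketch: report each skipped record from a SkipListener instead of the
// SkipPolicy. Spring Batch calls these methods once per skipped item,
// so chunk re-processing no longer duplicates report lines.
public class ReconSkipListener implements SkipListener<Student, Student> {

    @Override
    public void onSkipInRead(Throwable t) {
        if (t instanceof FlatFileParseException) {
            FlatFileParseException ffpe = (FlatFileParseException) t;
            writeToRecon("Line no = " + ffpe.getLineNumber() + " " + ffpe.getMessage()
                    + " Record is [" + ffpe.getInput() + "].\n");
        }
    }

    @Override
    public void onSkipInProcess(Student item, Throwable t) {
        writeToRecon(t.getMessage() + " Record is [" + item + "].\n");
    }

    @Override
    public void onSkipInWrite(Student item, Throwable t) {
        writeToRecon(t.getMessage() + " Record is [" + item + "].\n");
    }
}
Registering it is one extra call on the step builder, e.g. .faultTolerant().skipPolicy(skipper).listener(new ReconSkipListener()).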

Related

How to delete a large amount of data one by one from a table with its relations using the transactional annotation

I have a large amount of data that I want to purge from the database. There are about 6 tables, of which 3 have a many-to-many relationship with cascadeType. All the others are log and history tables independent of the other 3.
I want to purge this data one by one, and if any record fails while being deleted, I have to undo only the current record, show it in the console, and keep deleting the others.
I am trying to use the transactional annotation with Spring Boot, but all purging stops if an error occurs.
How do I manage this kind of need?
Here is what I did:
@Transactional
private void purgeCards(List<CardEntity> cardsTobePurge) {
    List<Long> nextCardsNumberToUpdate = getNextCardsWhichWillNotBePurge(cardsTobePurge);
    TransactionTemplate lTransTemplate = new TransactionTemplate(transactionManager);
    lTransTemplate.setPropagationBehavior(TransactionTemplate.PROPAGATION_REQUIRED);
    lTransTemplate.execute(new TransactionCallback<Object>() {
        @Override
        public Object doInTransaction(TransactionStatus status) {
            cardsTobePurge.forEach(cardTobePurge -> {
                Long nextCardNumberOfCurrent = cardTobePurge.getNextCard();
                if (nextCardsNumberToUpdate.contains(nextCardNumberOfCurrent)) {
                    CardEntity cardToUnlink = cardRepository.findByCardNumber(nextCardNumberOfCurrent);
                    unLink(cardToUnlink);
                }
                log.info(BATCH_TITLE + " Removing card Number : " + cardTobePurge.getCardNumber() + " with Id : "
                        + cardTobePurge.getId());
                List<CardHistoryEntity> historyEntitiesOfThisCard = cardHistoryRepository.findByCard(cardTobePurge);
                List<LogCreationCardEntity> logCreationEntitiesForThisCard = logCreationCardRepository
                        .findByCardNumber(cardTobePurge.getCardNumber());
                List<LogCustomerMergeEntity> logCustomerMergeEntitiesForThisCard = logCustomerMergeRepository
                        .findByCard(cardTobePurge);
                cardHistoryRepository.deleteAll(historyEntitiesOfThisCard);
                logCreationCardRepository.deleteAll(logCreationEntitiesForThisCard);
                logCustomerMergeRepository.deleteAll(logCustomerMergeEntitiesForThisCard);
                cardRepository.delete(cardTobePurge);
            });
            return Boolean.TRUE;
        }
    });
}
As a solution to my question:
I worked with TransactionTemplate to manage transactions manually, so if an exception is raised, a rollback is applied only to the current iteration and processing continues with the other cards.
private void purgeCards(List<CardEntity> cardsTobePurge) {
    int[] counter = { 0 }; // to simulate the exception
    List<Long> nextCardsNumberToUpdate = findNextCardsWhichWillNotBePurge(cardsTobePurge);
    cardsTobePurge.forEach(cardTobePurge -> {
        Long nextCardNumberOfCurrent = cardTobePurge.getNextCard();
        CardEntity cardToUnlink = null;
        counter[0]++; // to simulate the exception
        if (nextCardsNumberToUpdate.contains(nextCardNumberOfCurrent)) {
            cardToUnlink = cardRepository.findByCardNumber(nextCardNumberOfCurrent);
        }
        purgeCard(cardTobePurge, nextCardsNumberToUpdate, cardToUnlink, counter);
    });
}
private void purgeCard(@NonNull CardEntity cardToPurge, List<Long> nextCardsNumberToUpdate, CardEntity cardToUnlink,
        int[] counter) {
    TransactionTemplate lTransTemplate = new TransactionTemplate(transactionManager);
    lTransTemplate.setPropagationBehavior(TransactionTemplate.PROPAGATION_REQUIRED);
    lTransTemplate.execute(new TransactionCallbackWithoutResult() {
        @Override
        public void doInTransactionWithoutResult(TransactionStatus status) {
            try {
                if (cardToUnlink != null)
                    unLink(cardToUnlink);
                log.info(BATCH_TITLE + " Removing card Number : " + cardToPurge.getCardNumber() + " with Id : "
                        + cardToPurge.getId());
                List<CardHistoryEntity> historyEntitiesOfThisCard = cardHistoryRepository.findByCard(cardToPurge);
                List<LogCreationCardEntity> logCreationEntitiesForThisCard = logCreationCardRepository
                        .findByCardNumber(cardToPurge.getCardNumber());
                List<LogCustomerMergeEntity> logCustomerMergeEntitiesForThisCard = logCustomerMergeRepository
                        .findByCard(cardToPurge);
                cardHistoryRepository.deleteAll(historyEntitiesOfThisCard);
                logCreationCardRepository.deleteAll(logCreationEntitiesForThisCard);
                logCustomerMergeRepository.deleteAll(logCustomerMergeEntitiesForThisCard);
                cardRepository.delete(cardToPurge);
                if (counter[0] == 2) // to simulate the exception
                    throw new Exception(); // to simulate the exception
            } catch (Exception e) {
                status.setRollbackOnly();
                if (cardToPurge != null)
                    log.error(BATCH_TITLE + " Problem with card Number : " + cardToPurge.getCardNumber()
                            + " with Id : " + cardToPurge.getId(), e);
                else
                    log.error(BATCH_TITLE + " Card entity is null", e);
            }
        }
    });
}
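The same per-card isolation can be made more explicit with PROPAGATION_REQUIRES_NEW, which suspends any surrounding transaction and opens a fresh one per call. A condensed sketch under that assumption, reusing the transactionManager, repositories, and unLink from above:
private void purgeCardInNewTransaction(CardEntity cardToPurge, CardEntity cardToUnlink) {
    TransactionTemplate template = new TransactionTemplate(transactionManager);
    // REQUIRES_NEW: each card is deleted in its own transaction, so a failure
    // rolls back only this card's deletes, never the whole batch.
    template.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRES_NEW);
    template.execute(status -> {
        try {
            if (cardToUnlink != null) {
                unLink(cardToUnlink);
            }
            cardHistoryRepository.deleteAll(cardHistoryRepository.findByCard(cardToPurge));
            logCreationCardRepository.deleteAll(
                    logCreationCardRepository.findByCardNumber(cardToPurge.getCardNumber()));
            logCustomerMergeRepository.deleteAll(logCustomerMergeRepository.findByCard(cardToPurge));
            cardRepository.delete(cardToPurge);
        } catch (Exception e) {
            status.setRollbackOnly();
            log.error(BATCH_TITLE + " Problem with card Number : " + cardToPurge.getCardNumber(), e);
        }
        return null;
    });
}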

Confluent HDFS Connector is losing messages

Community, could you please help me understand why ~3% of my messages don't end up in HDFS? I wrote a simple producer in Java to generate 10 million messages.
public static final String TEST_SCHEMA = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"str1\", \"type\":\"string\" },"
+ " { \"name\":\"str2\", \"type\":\"string\" },"
+ " { \"name\":\"int1\", \"type\":\"int\" }"
+ "]}";
public KafkaProducerWrapper(String topic) throws UnknownHostException {
// store topic name
this.topic = topic;
// initialize kafka producer
Properties config = new Properties();
config.put("client.id", InetAddress.getLocalHost().getHostName());
config.put("bootstrap.servers", "myserver-1:9092");
config.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
config.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
config.put("schema.registry.url", "http://myserver-1:8089");
config.put("acks", "all");
producer = new KafkaProducer<>(config);
// parse schema
Schema.Parser parser = new Schema.Parser();
schema = parser.parse(TEST_SCHEMA);
}
public void send() {
// generate key
int key = (int) (Math.random() * 20);
// generate record
GenericData.Record r = new GenericData.Record(schema);
r.put("str1", "text" + key);
r.put("str2", "text2" + key);
r.put("int1", key);
final ProducerRecord<String, GenericRecord> record = new ProducerRecord<>(topic, "K" + key, (GenericRecord) r);
producer.send(record, new Callback() {
public void onCompletion(RecordMetadata metadata, Exception e) {
if (e != null) {
logger.error("Send failed for record {}", record, e);
messageErrorCounter++;
return;
}
logger.debug("Send succeeded for record {}", record);
messageCounter++;
}
});
}
public String getStats() { return "Messages sent: " + messageCounter + "/" + messageErrorCounter; }
public long getMessageCounter() {
return messageCounter + messageErrorCounter;
}
public void close() {
producer.close();
}
public static void main(String[] args) throws InterruptedException, UnknownHostException {
// initialize kafka producer
KafkaProducerWrapper kafkaProducerWrapper = new KafkaProducerWrapper("my-test-topic");
long max = 10000000L;
for (long i = 0; i < max; i++) {
kafkaProducerWrapper.send();
}
logger.info("producer-demo sent all messages");
while (kafkaProducerWrapper.getMessageCounter() < max)
{
logger.info(kafkaProducerWrapper.getStats());
Thread.sleep(2000);
}
logger.info(kafkaProducerWrapper.getStats());
kafkaProducerWrapper.close();
}
And I use the Confluent HDFS Connector in standalone mode to write data to HDFS. The configuration is as follows:
name=hdfs-consumer-test
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=my-test-topic
hdfs.url=hdfs://my-cluster/kafka-test
hadoop.conf.dir=/etc/hadoop/conf/
flush.size=100000
rotate.interval.ms=20000
# increase timeouts to avoid CommitFailedException
consumer.session.timeout.ms=300000
consumer.request.timeout.ms=310000
heartbeat.interval.ms=60000
session.timeout.ms=100000
The connector writes the data into HDFS, but after waiting for 20000 ms (due to rotate.interval.ms) not all messages are received.
scala> spark.read.avro("/kafka-test/topics/my-test-topic/partition=*/my-test-topic*")
.count()
res0: Long = 9749015
Any idea what the reason for this behavior is? Where is my mistake? I'm using Confluent 3.0.1 / Kafka 0.10.0.1.
Are you seeing that the last few messages are not moved to HDFS? If so, it's likely you are running into the issue described here: https://github.com/confluentinc/kafka-connect-hdfs/pull/100
Try sending one more message to the topic after rotate.interval.ms has expired to validate that this is what you are running into. If you need to rotate based on time, it's probably a good idea to upgrade to pick up the fix.
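To make that check concrete, a sketch of the nudge appended at the end of the producer's main method; the 25-second sleep is an arbitrary value chosen only to exceed the configured rotate.interval.ms of 20000 ms:
// Wait past rotate.interval.ms, then publish one extra record so the
// connector re-evaluates its time-based rotation and commits the files
// left open for the previously written messages.
Thread.sleep(25000);           // > rotate.interval.ms (20000 ms)
kafkaProducerWrapper.send();   // the "one more message" suggested above
kafkaProducerWrapper.close();  // close() also waits for buffered records to complete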

How to wait for ParseCloud.callFunctionInBackground

I have a utility function which I use to check whether the user has done a few operations or not. If not, then I need to send the results back to MainActivity.
public static boolean checkUserIsDone(String receiverPhoneNumber, String senderNumber ) {
hasApp = false;
HashMap<String, Object> parseParams = new HashMap<>();
parseParams.put("phoneNumber", receiverPhoneNumber);
parseParams.put("senderNumber", senderNumber);
ParseCloud.callFunctionInBackground("GetUserByPhoneNumber", parseParams, new FunctionCallback<String>() {
public void done(String result, ParseException e) {
if (e == null && result.equals("success")) {
hasApp = true;
} else {
hasApp = false;
Log.e("Error: ", e.getMessage());
}
}
});
return hasApp;
}
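As written, checkUserIsDone returns hasApp before the background callback has fired, so the caller almost always sees the stale initial value. A minimal sketch of a blocking alternative, assuming the synchronous ParseCloud.callFunction variant; note it must run off the Android main thread (for example from an AsyncTask or a background executor), or Android will reject the network call:
public static boolean checkUserIsDone(String receiverPhoneNumber, String senderNumber) {
    HashMap<String, Object> parseParams = new HashMap<>();
    parseParams.put("phoneNumber", receiverPhoneNumber);
    parseParams.put("senderNumber", senderNumber);
    try {
        // Blocks until the cloud function returns; call from a worker thread.
        String result = ParseCloud.callFunction("GetUserByPhoneNumber", parseParams);
        return "success".equals(result);
    } catch (ParseException e) {
        Log.e("Error: ", e.getMessage());
        return false;
    }
}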

Reading a file with newlines as a tuple in pig

Is it possible to change the record delimiter from newline to some other string, so as to read a file with newlines into a single tuple in Pig?
Yes.
A = LOAD '...' USING PigStorage(',') AS (...); -- comma is the delimiter for fields
SET textinputformat.record.delimiter '<delimiter>'; -- record delimiter; by default it is '\n', but it can be changed to any delimiter
As mentioned here
You can use PigStorage
A = LOAD '/some/path/COMMA-DELIM-PREFIX*' USING PigStorage(',') AS (f1:chararray, ...);
B = LOAD '/some/path/SEMICOLON-DELIM-PREFIX*' USING PigStorage(';') AS (f1:chararray, ...);
You can even try writing a load/store UDF.
There are Java code examples below for both load and store.
Load Functions: the LoadFunc abstract class has the main methods for loading data, and for most use cases it suffices to extend it. You can read more here.
Example
The loader implementation in the example is a loader for text data
with line delimiter as '\n' and '\t' as default field delimiter (which
can be overridden by passing a different field delimiter in the
constructor) - this is similar to current PigStorage loader in Pig.
The implementation uses an existing Hadoop supported Inputformat -
TextInputFormat - as the underlying InputFormat.
public class SimpleTextLoader extends LoadFunc {
protected RecordReader in = null;
private byte fieldDel = '\t';
private ArrayList<Object> mProtoTuple = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
private static final int BUFFER_SIZE = 1024;
public SimpleTextLoader() {
}
/**
* Constructs a Pig loader that uses specified character as a field delimiter.
*
* @param delimiter
* the single byte character that is used to separate fields.
* ("\t" is the default.)
*/
public SimpleTextLoader(String delimiter) {
this();
if (delimiter.length() == 1) {
this.fieldDel = (byte)delimiter.charAt(0);
} else if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
switch (delimiter.charAt(1)) {
case 't':
this.fieldDel = (byte)'\t';
break;
case 'x':
fieldDel =
Integer.valueOf(delimiter.substring(2), 16).byteValue();
break;
case 'u':
this.fieldDel =
Integer.valueOf(delimiter.substring(2)).byteValue();
break;
default:
throw new RuntimeException("Unknown delimiter " + delimiter);
}
} else {
throw new RuntimeException("PigStorage delimeter must be a single character");
}
}
@Override
public Tuple getNext() throws IOException {
try {
boolean notDone = in.nextKeyValue();
if (!notDone) {
return null;
}
Text value = (Text) in.getCurrentValue();
byte[] buf = value.getBytes();
int len = value.getLength();
int start = 0;
for (int i = 0; i < len; i++) {
if (buf[i] == fieldDel) {
readField(buf, start, i);
start = i + 1;
}
}
// pick up the last field
readField(buf, start, len);
Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
mProtoTuple = null;
return t;
} catch (InterruptedException e) {
int errCode = 6018;
String errMsg = "Error while reading input";
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}
}
private void readField(byte[] buf, int start, int end) {
if (mProtoTuple == null) {
mProtoTuple = new ArrayList<Object>();
}
if (start == end) {
// NULL value
mProtoTuple.add(null);
} else {
mProtoTuple.add(new DataByteArray(buf, start, end));
}
}
@Override
public InputFormat getInputFormat() {
return new TextInputFormat();
}
@Override
public void prepareToRead(RecordReader reader, PigSplit split) {
in = reader;
}
@Override
public void setLocation(String location, Job job)
throws IOException {
FileInputFormat.setInputPaths(job, location);
}
}
Store Functions: the StoreFunc abstract class has the main methods for storing data, and for most use cases it should suffice to extend it.
Example
The storer implementation in the example is a storer for text data
with line delimiter as '\n' and '\t' as default field delimiter (which
can be overridden by passing a different field delimiter in the
constructor) - this is similar to current PigStorage storer in Pig.
The implementation uses an existing Hadoop supported OutputFormat -
TextOutputFormat as the underlying OutputFormat.
public class SimpleTextStorer extends StoreFunc {
protected RecordWriter writer = null;
private byte fieldDel = '\t';
private static final int BUFFER_SIZE = 1024;
private static final String UTF8 = "UTF-8";
public SimpleTextStorer() {
}
public SimpleTextStorer(String delimiter) {
    this();
    if (delimiter.length() == 1) {
        this.fieldDel = (byte)delimiter.charAt(0);
    } else if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
switch (delimiter.charAt(1)) {
case 't':
this.fieldDel = (byte)'\t';
break;
case 'x':
fieldDel =
Integer.valueOf(delimiter.substring(2), 16).byteValue();
break;
case 'u':
this.fieldDel =
Integer.valueOf(delimiter.substring(2)).byteValue();
break;
default:
throw new RuntimeException("Unknown delimiter " + delimiter);
}
} else {
throw new RuntimeException("SimpleTextStorer delimiter must be a single character");
}
}
ByteArrayOutputStream mOut = new ByteArrayOutputStream(BUFFER_SIZE);
@Override
public void putNext(Tuple f) throws IOException {
int sz = f.size();
for (int i = 0; i < sz; i++) {
Object field;
try {
field = f.get(i);
} catch (ExecException ee) {
throw ee;
}
putField(field);
if (i != sz - 1) {
mOut.write(fieldDel);
}
}
Text text = new Text(mOut.toByteArray());
try {
writer.write(null, text);
mOut.reset();
} catch (InterruptedException e) {
throw new IOException(e);
}
}
@SuppressWarnings("unchecked")
private void putField(Object field) throws IOException {
//string constants for each delimiter
String tupleBeginDelim = "(";
String tupleEndDelim = ")";
String bagBeginDelim = "{";
String bagEndDelim = "}";
String mapBeginDelim = "[";
String mapEndDelim = "]";
String fieldDelim = ",";
String mapKeyValueDelim = "#";
switch (DataType.findType(field)) {
case DataType.NULL:
break; // just leave it empty
case DataType.BOOLEAN:
mOut.write(((Boolean)field).toString().getBytes());
break;
case DataType.INTEGER:
mOut.write(((Integer)field).toString().getBytes());
break;
case DataType.LONG:
mOut.write(((Long)field).toString().getBytes());
break;
case DataType.FLOAT:
mOut.write(((Float)field).toString().getBytes());
break;
case DataType.DOUBLE:
mOut.write(((Double)field).toString().getBytes());
break;
case DataType.BYTEARRAY: {
byte[] b = ((DataByteArray)field).get();
mOut.write(b, 0, b.length);
break;
}
case DataType.CHARARRAY:
// oddly enough, writeBytes writes a string
mOut.write(((String)field).getBytes(UTF8));
break;
case DataType.MAP:
boolean mapHasNext = false;
Map<String, Object> m = (Map<String, Object>)field;
mOut.write(mapBeginDelim.getBytes(UTF8));
for(Map.Entry<String, Object> e: m.entrySet()) {
if(mapHasNext) {
mOut.write(fieldDelim.getBytes(UTF8));
} else {
mapHasNext = true;
}
putField(e.getKey());
mOut.write(mapKeyValueDelim.getBytes(UTF8));
putField(e.getValue());
}
mOut.write(mapEndDelim.getBytes(UTF8));
break;
case DataType.TUPLE:
boolean tupleHasNext = false;
Tuple t = (Tuple)field;
mOut.write(tupleBeginDelim.getBytes(UTF8));
for(int i = 0; i < t.size(); ++i) {
if(tupleHasNext) {
mOut.write(fieldDelim.getBytes(UTF8));
} else {
tupleHasNext = true;
}
try {
putField(t.get(i));
} catch (ExecException ee) {
throw ee;
}
}
mOut.write(tupleEndDelim.getBytes(UTF8));
break;
case DataType.BAG:
boolean bagHasNext = false;
mOut.write(bagBeginDelim.getBytes(UTF8));
Iterator<Tuple> tupleIter = ((DataBag)field).iterator();
while(tupleIter.hasNext()) {
if(bagHasNext) {
mOut.write(fieldDelim.getBytes(UTF8));
} else {
bagHasNext = true;
}
putField((Object)tupleIter.next());
}
mOut.write(bagEndDelim.getBytes(UTF8));
break;
default: {
int errCode = 2108;
String msg = "Could not determine data type of field: " + field;
throw new ExecException(msg, errCode, PigException.BUG);
}
}
}
@Override
public OutputFormat getOutputFormat() {
return new TextOutputFormat<WritableComparable, Text>();
}
@Override
public void prepareToWrite(RecordWriter writer) {
this.writer = writer;
}
@Override
public void setStoreLocation(String location, Job job) throws IOException {
job.getConfiguration().set("mapred.textoutputformat.separator", "");
FileOutputFormat.setOutputPath(job, new Path(location));
if (location.endsWith(".bz2")) {
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
} else if (location.endsWith(".gz")) {
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
}
}
}

C3P0 Statement.close deadlock

Google returns lots of people with deadlock issues in C3P0, but none of the solutions appear to apply (most people suggest setting maxStatements = 0 and maxStatementsPerConnection = 0, both of which we have).
I am using a ComboPooledDataSource from C3P0, initialised as:
cpds = new ComboPooledDataSource();
cpds.setDriverClass("org.postgresql.Driver");
cpds.setJdbcUrl("jdbc:postgresql://" + host + ":5432/" + database);
cpds.setUser(user);
cpds.setPassword(pass);
My query function looks like:
public static List<Map<String, Object>> query(String q) {
Connection c = null;
Statement s = null;
ResultSet r = null;
try {
c = cpds.getConnection();
s = c.createStatement();
s.executeQuery(q);
r = s.getResultSet();
/* parse result set into results List<Map> */
return results;
}
catch(Exception e) { MyUtils.logException(e); }
finally {
closeQuietly(r);
closeQuietly(s);
closeQuietly(c);
}
return null;
}
No queries are returning, despite the query() method reaching the return results; line. The issue is that the finally block is hanging. I have determined that the closeQuietly(s); is the line that is hanging indefinitely.
The closeQuietly() method in question is as you would expect:
public static void closeQuietly(Statement s) {
try { if(s != null) s.close(); }
catch(Exception e) { MyUtils.logException(e); }
}
Why would this method hang on s.close()? I guess it is something to do with the way I am using C3P0.
My complete C3P0 configuration (almost entirely defaults) can be viewed here -> http://pastebin.com/K8XDdiBg
MyUtils.logException() looks something like:
public static void logException(Exception e) {
StackTraceElement ste[] = e.getStackTrace();
String message = " !ERROR!: ";
for(int i = 0; i < ste.length; i++) {
if(ste[i].getClassName().contains("packagename")) {
message += String.format("%s at %s:%d", e.toString(), ste[i].getFileName(), ste[i].getLineNumber());
break;
}
}
System.err.println(message);
}
Everything runs smoothly if I remove the closeQuietly(s); line. Closing both the ResultSet and the Connection works without problem, apart from Connection starvation of course.
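For comparison, a sketch of the same query method using try-with-resources (Java 7+), which closes the ResultSet, Statement, and Connection in reverse order of acquisition. It does not explain the hang by itself, but it rules out close-ordering mistakes in the finally block and leaves the pool configuration as the main suspect:
public static List<Map<String, Object>> query(String q) {
    // Resources are closed automatically in reverse declaration order:
    // ResultSet first, then Statement, then Connection.
    try (Connection c = cpds.getConnection();
         Statement s = c.createStatement();
         ResultSet r = s.executeQuery(q)) {
        /* parse result set into results List<Map> */
        return results;
    } catch (Exception e) {
        MyUtils.logException(e);
        return null;
    }
}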
