I am using Spring Batch to generate CSV feed files. I used a writer similar to the one below to produce my output file.
@Bean
public FlatFileItemWriter<CustomObj> writer()
{
    BeanWrapperFieldExtractor<CustomObj> extractor = new BeanWrapperFieldExtractor<>();
    extractor.setNames(new String[] {"name", "emailAddress", "dob"});

    DelimitedLineAggregator<CustomObj> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");
    lineAggregator.setFieldExtractor(extractor);

    FlatFileItemWriter<CustomObj> writer = new FlatFileItemWriter<>();
    writer.setResource(outputResource);
    writer.setAppendAllowed(true);
    //writer.setHeaderCallback(headerCallback);
    writer.setLineAggregator(lineAggregator);
    return writer;
}
Output:
name;emailAddress;dob
abc;abc@xyz.com;10-10-20
But now we have a requirement to make this writer generic: instead of passing the object, we pass a Map<String, String>, and the object values are stored in the Map as key/value pairs.
E.g.: name -> abc, emailAddress -> abc@xyz.com, dob -> 10-10-20
We tried to use a writer similar to the one below.
The problem is that no FieldExtractor is set, so the header and the values can get out of sync.
The PassThroughFieldExtractor simply passes along all the values in the collection (the Map) in whatever order they come, and if the Map contains extra fields it prints all of them.
The header and the values are not bound together in this case.
Is there any way to implement a custom field extractor that keeps the ordering of the values consistent with the header, even if the ordering of the header changes?
@Bean
public FlatFileItemWriter<Map<String, String>> writer()
{
    DelimitedLineAggregator<Map<String, String>> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");
    lineAggregator.setFieldExtractor(new PassThroughFieldExtractor<>());

    FlatFileItemWriter<Map<String, String>> writer = new FlatFileItemWriter<>();
    writer.setResource(outputResource);
    writer.setAppendAllowed(true);
    writer.setLineAggregator(lineAggregator);
    return writer;
}
Output:
name;emailAddress;dob
abc@xyz.com;abc;10-10-20
Expected output
Case 1:
name;emailAddress;dob
abc;abc@xyz.com;10-10-20
Case 2:
emailAddress;dob
abc@xyz.com;10-10-20
You need a custom field extractor that extracts values from the map in the same order as the headers. Spring Batch does not provide such an extractor, so you need to implement it yourself. For example, you can pass the headers to the extractor at construction time and extract the values from the map according to the header order.
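Here is a minimal sketch of such an extractor; the class name MapFieldExtractor is illustrative, and it assumes the header names are known when the writer is built:

public class MapFieldExtractor implements FieldExtractor<Map<String, String>> {

    private final String[] headers;

    public MapFieldExtractor(String[] headers) {
        this.headers = headers;
    }

    @Override
    public Object[] extract(Map<String, String> item) {
        // Pull the values in header order so each column always lines up with the header row.
        Object[] fields = new Object[headers.length];
        for (int i = 0; i < headers.length; i++) {
            fields[i] = item.get(headers[i]);
        }
        return fields;
    }
}

You would then wire it in with lineAggregator.setFieldExtractor(new MapFieldExtractor(new String[] {"name", "emailAddress", "dob"})) and reuse the same array in the header callback, so the header and the values cannot drift apart (this also covers case 2: a shorter header array simply extracts fewer columns).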
Related
So I've got a csv file that's being ingested on a scheduled basis. The csv file has a set of columns whose names are specified in the header row; the catch is that new columns are constantly being added to this csv. Currently, when a new column is added, the ingest flow breaks and I get a FlatFileParseException. I have to go in and update the code with the new column names in order to make it work again.
What I'm looking to accomplish instead is that when new columns are added, the code correctly picks out the columns it needs and does not throw an exception.
@Bean
@StepScope
FlatFileItemReader<Foo> fooReader(
    ...
) {
    final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
        final String[] fooColumnNames = { "foo", "bar" };
        setNames(fooColumnNames);
        // setStrict(false);
    }};
    return new FlatFileItemReader<>() {{
        setLineMapper(new DefaultLineMapper<>() {{
            setLineTokenizer(fooLineTokenizer);
            setFieldSetMapper(new BeanWrapperFieldSetMapper<>() {{
                setTargetType(Foo.class);
            }});
        }});
        ...
    }};
}
I've tried using setStrict(false) on the line tokenizer, and that gets rid of the exception, but then fields end up being set to the wrong values, taken from the newly added columns instead of the original columns the data was supposed to come from.
Any ideas on how to add a bit more fault tolerance to this flow, so I don't have to constantly update fooColumnNames whenever columns are added to the csv?
I tried the setStrict(false) parameter and toyed with custom lineTokenizer implementations, but I'm still struggling to get fault tolerance when new columns are added to the csv.
I don't know about fault tolerance as such, but it is possible to retrieve the columns dynamically.
Add a listener to your step that reads the columns in beforeStep and puts them into the step execution context:
public class ColumnRetrieverListener implements StepExecutionListener {

    private Resource resource;
    //Getter and setter

    @Override
    public void beforeStep(StepExecution stepExecution) {
        String[] columns = getColumns();
        stepExecution.getExecutionContext().put("columns", columns);
    }

    private String[] getColumns() {
        //Parse first line of resource to get columns
    }
}
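The getColumns() method above is deliberately left as a placeholder. One possible implementation, assuming a comma delimiter and a header in the first line (the exception handling is just a sketch), could be:

private String[] getColumns() {
    // Read only the header row of the resource and split it on the delimiter.
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(resource.getInputStream()))) {
        String headerLine = reader.readLine();
        return headerLine != null ? headerLine.split(",") : new String[0];
    } catch (IOException e) {
        throw new IllegalStateException("Unable to read header from resource", e);
    }
}

(Uses java.io.BufferedReader, java.io.InputStreamReader and java.io.IOException.)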
Use the columns stored in the execution context to set up the lineTokenizer:
final DelimitedLineTokenizer fooLineTokenizer = new DelimitedLineTokenizer(",") {{
    final String[] fooColumnNames = (String[]) stepExecution.getExecutionContext().get("columns");
    setNames(fooColumnNames);
}};
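Since the reader in the question is already step-scoped, an alternative sketch (my assumption, not part of the original answer) is to late-bind the stored columns into the tokenizer with the stepExecutionContext expression; this relies on the listener's beforeStep running before the step-scoped bean is created, which is worth verifying in your setup:

@Bean
@StepScope
DelimitedLineTokenizer fooLineTokenizer(
        @Value("#{stepExecutionContext['columns']}") String[] columns) {
    // "columns" is the array stored by ColumnRetrieverListener in beforeStep.
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
    tokenizer.setNames(columns);
    return tokenizer;
}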
I am developing a Spring Boot application.
I have an input.csv file to read and write values to a DB; to do so I am using the Spring Batch framework.
I am able to read and insert values into the DB. Now I have created an output.csv file with a header of several columns, and I set a few of the column values from the input.csv file.
But for one column in the output.csv file, I want to record the response, i.e. 00 -> success, 11 -> fail.
Ex: My header looks like {"S_NO;NAME;Dept;Salary;Rsp_code"}
The values of the first 4 columns I set from the input.csv file.
For the column Rsp_code, I want to set the hardcoded values 00 and 11:
if a row is successfully inserted into the DB, set Rsp_code to 00, otherwise to 11.
Is there a way to do this? My header code is as follows:
ItemWriter<Valueholder> databaseCsvItemWriter() {
    FlatFileItemWriter<Valueholder> csvFileWriter = new FlatFileItemWriter<>();

    String exportFileHeader = "S_NO;NAME;Dept;Salary;Rsp_code";
    StringHeaderWriter headerWriter = new StringHeaderWriter(exportFileHeader);
    csvFileWriter.setHeaderCallback(headerWriter);

    String exportFilePath = "C:\\temp\\useroutput.csv";
    csvFileWriter.setResource(new FileSystemResource(exportFilePath));

    LineAggregator<Valueholder> lineAggregator = createStudentLineAggregator();
    csvFileWriter.setLineAggregator(lineAggregator);

    return csvFileWriter;
}

private LineAggregator<Valueholder> createStudentLineAggregator() {
    DelimitedLineAggregator<Valueholder> lineAggregator = new DelimitedLineAggregator<>();
    lineAggregator.setDelimiter(";");

    FieldExtractor<Valueholder> fieldExtractor = createStudentFieldExtractor();
    lineAggregator.setFieldExtractor(fieldExtractor);
    return lineAggregator;
}

private FieldExtractor<Valueholder> createStudentFieldExtractor() {
    BeanWrapperFieldExtractor<Valueholder> extractor = new BeanWrapperFieldExtractor<>();
    extractor.setNames(new String[] {"id", "name", "dept", "salary"});
    return extractor;
}
Data transformation is a typical use case for an item processor, so you can use an item processor to do the mapping (a sketch follows the list below). The advantage of this approach is that you can:
- unit test the mapping in isolation
- re-use the item processor in another step/job if needed
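A minimal sketch of that idea, assuming the processor itself attempts (or can otherwise learn the outcome of) the DB insert; ValueholderDao, insert() and setRspCode() are illustrative names, not part of the question's code:

public class RspCodeItemProcessor implements ItemProcessor<Valueholder, Valueholder> {

    private final ValueholderDao dao; // hypothetical DAO that performs the insert

    public RspCodeItemProcessor(ValueholderDao dao) {
        this.dao = dao;
    }

    @Override
    public Valueholder process(Valueholder item) {
        try {
            dao.insert(item);       // insert the row into the DB
            item.setRspCode("00");  // success
        } catch (Exception e) {
            item.setRspCode("11");  // failure
        }
        return item;                // the writer then writes the enriched item to the CSV
    }
}

Whatever variant you choose, remember to add "rspCode" to the names of the BeanWrapperFieldExtractor (keeping it in the same position as in the header), otherwise the new column will never show up in the file.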
In my custom processor I have added the field below:
public static final PropertyDescriptor CACHE_VALUE = new PropertyDescriptor.Builder()
        .name("Cache Value")
        .description("Cache Value")
        .required(true)
        .addValidator(StandardValidators.NON_EMPTY_VALIDATOR)
        .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES)
        .build();
Here I expect to read flowfile attributes like ${fieldName},
as well as a regex like .* to read the full content, or some part of the content like $.nodename.subnodename.
For that I have added the code below:
for (FlowFile flowFile : flowFiles) {
    final String cacheKey = context.getProperty(CACHE_KEY).evaluateAttributeExpressions(flowFile).getValue();
    String cacheValue = null;
    cacheValue = context.getProperty(CACHE_VALUE).evaluateAttributeExpressions(flowFile).getValue();
    if (".*".equalsIgnoreCase(cacheValue.trim())) {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        session.exportTo(flowFile, bytes);
        cacheValue = bytes.toString();
    }
    cache.put(cacheKey, cacheValue);
    session.transfer(flowFile, REL_SUCCESS);
}
How can I handle reading only some part of the content, like $.nodename.subnodename?
Do I need to parse the JSON, or is there another way?
You will either have to parse the JSON yourself, or use an EvaluateJsonPath processor before reaching this processor to extract content values out to attributes via JSON Path expressions, and then in your custom code, reference the value of the attribute.
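If you do parse it yourself, a hedged sketch using the Jayway JsonPath library (an assumption on my part; it is a separate dependency, not something NiFi provides for you) could slot into the existing loop like this:

if (cacheValue != null && cacheValue.trim().startsWith("$.")) {
    // Read the full flow file content, then evaluate the JSON Path expression against it.
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    session.exportTo(flowFile, bytes);
    String json = new String(bytes.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);
    Object result = com.jayway.jsonpath.JsonPath.read(json, cacheValue.trim());
    cacheValue = String.valueOf(result);
}

The EvaluateJsonPath route keeps the custom processor simpler, because the JSON Path evaluation and its error handling stay in a standard processor.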
I am loading a file from HDFS into a JavaRDD and want to update that RDD. For that I am converting it to an IndexedRDD (https://github.com/amplab/spark-indexedrdd), but I am not able to, as I am getting a ClassCastException.
Basically I will make key/value pairs and update the keys. IndexedRDD supports updates. Is there any way to do the conversion?
JavaPairRDD<String, String> mappedRDD = lines.flatMapToPair(new PairFlatMapFunction<String, String, String>()
{
    @Override
    public Iterable<Tuple2<String, String>> call(String arg0) throws Exception {
        String[] arr = arg0.split(" ", 2);
        System.out.println("length " + arr.length);
        List<Tuple2<String, String>> results = new ArrayList<Tuple2<String, String>>();
        results.add(new Tuple2<String, String>(arr[0], arr.length > 1 ? arr[1] : ""));
        return results;
    }
});

IndexedRDD<String, String> test = (IndexedRDD<String, String>) mappedRDD.collectAsMap();
collectAsMap() returns a java.util.Map containing all the entries from your JavaPairRDD, and nothing related to Spark: the point of that function is to collect the values on one node and work with plain Java. Therefore you cannot cast it to IndexedRDD or any other RDD type, as it is just a normal Map.
I haven't used IndexedRDD, but from the examples you can see that you need to create it by passing a PairRDD to its constructor:
// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()
So in your code it should be:
IndexedRDD<String,String> test = new IndexedRDD<String,String>(mappedRDD.rdd());
I am trying to update multiple values in an index using the Java API through an Elasticsearch script, but I am not able to update the fields.
Sample code:
1:
UpdateResponse response = request.setScript("ctx._source").setScriptParams(scriptParams).execute().actionGet();
2:
UpdateResponse response = request.setScript("ctx._source.").setScriptParams(scriptParams).execute().actionGet();
If I include the dot in "ctx._source." I get an IllegalArgumentException, and if I leave the dot out I get no exception, but the values are not updated in the index.
Can anyone tell me how to resolve this?
First of all, your script (ctx._source) doesn't do anything, as one of the commenters already pointed out. If you want to update, say, field "a", then you would need a script like:
ctx._source.a = "foobar"
This would assign the string "foobar" to field "a". You can do more than simple assignment, though. Check out the docs for more details and examples:
http://www.elasticsearch.org/guide/reference/api/update/
Updating multiple fields with one script is also possible. You can use semicolons to separate different MVEL instructions. E.g.:
ctx._source.a = "foo"; ctx._source.b = "bar"
Elasticsearch has an update Java API. Look at the following code:
client.prepareUpdate("index", "type", "1153")
        .addScriptParam("assignee", assign)
        .addScriptParam("newobject", responsearray)
        .setScript("ctx._source.assignee = assignee; ctx._source.responsearray = newobject")
        .execute().actionGet();
Here, the assign variable contains an object value and the responsearray variable contains a list of data.
You can do the same using the Spring Data Elasticsearch client, as in the following code. I am also listing the imports used in the code.
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.index.query.QueryBuilder;
import org.springframework.data.elasticsearch.core.query.UpdateQuery;
import org.springframework.data.elasticsearch.core.query.UpdateQueryBuilder;
private UpdateQuery updateExistingDocument(String Id) {
    // Add UpdatedDateTime, CreatedDateTime, CreatedBy, UpdatedBy fields to an existing document in Elasticsearch
    UpdateRequest updateRequest = new UpdateRequest().doc("UpdatedDateTime", new Date(), "CreatedDateTime", new Date(), "CreatedBy", "admin", "UpdatedBy", "admin");

    // Create the updateQuery
    UpdateQuery updateQuery = new UpdateQueryBuilder().withId(Id).withClass(ElasticSearchDocument.class).build();
    updateQuery.setUpdateRequest(updateRequest);

    // Execute the update
    elasticsearchTemplate.update(updateQuery);

    return updateQuery;
}
XContentType contentType = org.elasticsearch.client.Requests.INDEX_CONTENT_TYPE;

public XContentBuilder getBuilder(User assign) {
    try {
        XContentBuilder builder = XContentFactory.contentBuilder(contentType);
        builder.startObject();
        Map<String, ?> assignMap = objectMap.convertValue(assign, Map.class);
        builder.field("assignee", assignMap);
        builder.endObject();
        return builder;
    } catch (IOException e) {
        log.error("custom field index", e);
        return null;
    }
}

IndexRequest indexRequest = new IndexRequest();
indexRequest.source(getBuilder(assign));

UpdateQuery updateQuery = new UpdateQueryBuilder()
        .withType(<IndexType>)
        .withIndexName(<IndexName>)
        .withId(String.valueOf(id))
        .withClass(<IndexClass>)
        .withIndexRequest(indexRequest)
        .build();
// Execute it the same way as above: elasticsearchTemplate.update(updateQuery);