FlatFileItemWriterBuilder.headerCallback() - get number of rows written - Spring Batch

Is it possible to get the total number of rows written from FlatFileItemWriter.headerCallback()?
I am a spring-batch newbie and I looked at "putting count of lines into header of flat file" and "Spring Batch - Counting Processed Rows".
However, I can't seem to implement the logic using the advice given there. It makes sense that the writer count will only be available after the file is processed. However, I am trying to get the row count just before the file is officially written.
I tried to look for a hook like @AfterStep to grab the total rows, but I keep going in circles.
@Bean
@StepScope
public FlatFileItemWriter<MyFile> generateMyFileWriter(Long jobId, Date eventDate) {
String filePath = "C:\\MYFILE\\COMPLETED";
Resource file = new FileSystemResource(filePath);
DelimitedLineAggregator<MyFile> myFileLineAggregator = new DelimitedLineAggregator<>();
myFileLineAggregator.setDelimiter(",");
myFileLineAggregator.setFieldExtractor(getMyFileFieldExtractor());
return new FlatFileItemWriterBuilder<MyFile>()
.name("my-file-writer")
.resource(file)
.headerCallback(new MyFileHeaderWriter(file.getFilename()))
.lineAggregator(myFileLineAggregator)
.build();
}
private FieldExtractor<MyFile> getMyFileFieldExtractor() {
final String[] fieldNames = new String[]{
"typeRecord",
"idSystem"
};
return item -> {
BeanWrapperFieldExtractor<MyFile> extractor = new BeanWrapperFieldExtractor<>();
extractor.setNames(fieldNames);
return extractor.extract(item);
};
}
Notice I am using the MyFileHeaderWriter class (below) in headerCallback(new MyFileHeaderWriter(file.getFilename())) (above). I am trying to initialize the value of qtyRecordsCreated there.
class MyFileHeaderWriter implements FlatFileHeaderCallback {
private final String header;
private String dtxCreated;
private String tmxCreated;
private String fileName; //15 byte file name
private String qtyRecordsCreated; //number of rows in file including the header row
MyFileHeaderWriter(String sbfFileName) {
SimpleDateFormat dateCreated = new SimpleDateFormat("yyDDD");
SimpleDateFormat timeCreated = new SimpleDateFormat("HHmm");
Date now = new Date();
this.dtxCreated = dateCreated.format(now);
this.tmxCreated = timeCreated.format(now);
this.fileName = sbfFileName;
this.qtyRecordsCreated = "";
String[] headerValues = {dtxCreated,tmxCreated,fileName,qtyRecordsCreated};
this.header = String.join(",", headerValues);
}
@Override
public void writeHeader(Writer writer) throws IOException {
writer.write(header);
}
}
How can I get the number of rows written into the header row?
Can the FlatFileFooterCallback be used to fetch the number of rows, and then update the header with the number of rows in the file afterwards?

You can achieve this with an ItemProcessor. Try this, it worked for me:
public class EmployeeProcessor implements ItemProcessor<Employee, Employee> {
@Override
public Employee process(Employee employee) throws Exception {
return employee;
}
@AfterStep
public void afterStep(StepExecution stepExecution) {
ExecutionContext stepContext = stepExecution.getExecutionContext();
stepContext.put("count", stepExecution.getReadCount());
System.out.println("COUNT" + stepExecution.getReadCount());
}
}
And in your writer, to get the value:
int count = stepContext.getInt("count");
Hope it works for you.
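If you need the count in the header itself, here is a minimal sketch (not part of the answer above; the bean names and the promotion step are assumptions): run the counting step first, promote "count" from the step context to the job context with an ExecutionContextPromotionListener, and late-bind it into a step-scoped header callback in the step that writes the file:
@Bean
@StepScope
public FlatFileHeaderCallback myFileHeaderCallback(@Value("#{jobExecutionContext['count']}") Integer count) {
// "count" was stored by the processor's @AfterStep above and promoted to the job context
return writer -> writer.write("HEADER," + count);
}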

Related

Read New File While Doing Processing For A Field In Spring Batch

I have a fixed-length input file that I am reading using Spring Batch.
I have already implemented the Job, Step, Processor, etc.
Here is the sample code.
@Configuration
public class BatchConfig {
private JobBuilderFactory jobBuilderFactory;
private StepBuilderFactory stepBuilderFactory;
@Value("${inputFile}")
private Resource resource;
@Autowired
public BatchConfig(JobBuilderFactory jobBuilderFactory, StepBuilderFactory stepBuilderFactory) {
this.jobBuilderFactory = jobBuilderFactory;
this.stepBuilderFactory = stepBuilderFactory;
}
@Bean
public Job job() {
return this.jobBuilderFactory.get("JOB-Load")
.start(fileReadingStep())
.build();
}
@Bean
public Step fileReadingStep() {
return stepBuilderFactory.get("File-Read-Step1")
.<Employee,EmpOutput>chunk(1000)
.reader(itemReader())
.processor(new CustomFileProcesser())
.writer(new CustomFileWriter())
.faultTolerant()
.skipPolicy(skipPolicy())
.build();
}
@Bean
public FlatFileItemReader<Employee> itemReader() {
FlatFileItemReader<Employee> flatFileItemReader = new FlatFileItemReader<Employee>();
flatFileItemReader.setResource(resource);
flatFileItemReader.setName("File-Reader");
flatFileItemReader.setLineMapper(LineMapper());
return flatFileItemReader;
}
@Bean
public LineMapper<Employee> LineMapper() {
DefaultLineMapper<Employee> defaultLineMapper = new DefaultLineMapper<Employee>();
FixedLengthTokenizer fixedLengthTokenizer = new FixedLengthTokenizer();
fixedLengthTokenizer.setNames(new String[] { "employeeId", "employeeName", "employeeSalary" });
fixedLengthTokenizer.setColumns(new Range[] { new Range(1, 9), new Range(10, 20), new Range(20, 30)});
fixedLengthTokenizer.setStrict(false);
defaultLineMapper.setLineTokenizer(fixedLengthTokenizer);
defaultLineMapper.setFieldSetMapper(new CustomFieldSetMapper());
return defaultLineMapper;
}
@Bean
public JobSkipPolicy skipPolicy() {
return new JobSkipPolicy();
}
}
For processing I have added some sample code of what I need, but if I add a BufferedReader here then it takes much more time to do the job.
@Component
public class CustomFileProcesser implements ItemProcessor<Employee, EmpOutput> {
@Override
public EmpOutput process(Employee item) throws Exception {
EmpOutput emp = new EmpOutput();
emp.setEmployeeSalary(checkSal(item.getEmployeeSalary()));
return emp;
}
public String checkSal(String sal) {
// need to read the another file
// required to do some kind of validation
// after that final result need to return
File f1 = new File("C:\\Users\\John\\New\\salary.txt");
FileReader fr;
try {
fr = new FileReader(f1);
BufferedReader br = new BufferedReader(fr);
String s = br.readLine();
while (s != null) {
String value = s.substring(5, 7);
if(value.equals(sal))
sal = value;
else
sal = "5000";
s = br.readLine();
}
} catch (Exception e) {
e.printStackTrace();
}
return sal;
}
// other fields need to be checked by reading different files.
// These new files contain more than 30k records.
// all are fixed-length files.
// I need to get the field by giving the index
}
While doing the processing for one or more fields, I need to check another file by reading it (a file I will read from the filesystem/cloud).
While processing the data for 5 fields I need to read 5 different files; I will check the field details inside those files and then generate the result, and that result will be processed further.
You can cache the content of the file in memory and do your check against the cache instead of re-reading the entire file from disk for each item.
You can find an example here: Spring Batch With Annotation and Caching.
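A rough sketch of that idea, reusing the salary.txt path and substring offsets from the question (the rest is illustrative): load the file once before the step starts and check each item against an in-memory set.
public class CustomFileProcesser implements ItemProcessor<Employee, EmpOutput> {
private final Set<String> salaryCache = new HashSet<>();
@BeforeStep
public void loadCache(StepExecution stepExecution) throws IOException {
// read salary.txt once per step instead of once per item
try (BufferedReader br = new BufferedReader(new FileReader("C:\\Users\\John\\New\\salary.txt"))) {
String s;
while ((s = br.readLine()) != null) {
salaryCache.add(s.substring(5, 7));
}
}
}
@Override
public EmpOutput process(Employee item) throws Exception {
EmpOutput emp = new EmpOutput();
String sal = item.getEmployeeSalary();
// same check as checkSal(), but against the cache
emp.setEmployeeSalary(salaryCache.contains(sal) ? sal : "5000");
return emp;
}
}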

How to remove r-00000 extention from reducer output in mapreduce

I am able to rename my reducer output file correctly, but r-00000 is still persisting.
I have used MultipleOutputs in my reducer class.
Here are the details of that. Not sure what I am missing or what extra I have to do?
public class MyReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
private Logger logger = Logger.getLogger(MyReducer.class);
private MultipleOutputs<NullWritable, Text> multipleOutputs;
String strName = "";
public void setup(Context context) {
logger.info("Inside Reducer.");
multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
}
@Override
public void reduce(NullWritable Key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
for (Text value : values) {
final String valueStr = value.toString();
StringBuilder sb = new StringBuilder();
sb.append(strArrvalueStr[0] + "|!|");
multipleOutputs.write(NullWritable.get(), new Text(sb.toString()),strName);
}
}
public void cleanup(Context context) throws IOException,
InterruptedException {
multipleOutputs.close();
}
}
I was able to do it explicitly after my job finishes, and that's OK for me. There is no delay in the job.
if (b){
DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd-HHmm");
Calendar cal = Calendar.getInstance();
String strDate=dateFormat.format(cal.getTime());
FileSystem hdfs = FileSystem.get(getConf());
FileStatus fs[] = hdfs.listStatus(new Path(args[1]));
if (fs != null){
for (FileStatus aFile : fs) {
if (!aFile.isDir()) {
hdfs.rename(aFile.getPath(), new Path(aFile.getPath().toString()+".txt"));
}
}
}
}
A more suitable approach to the problem would be changing the OutputFormat.
For example, if you are using TextOutputFormat, just get the source code of the TextOutputFormat class and modify the method below to get the proper filename (without the r-00000 suffix). We then need to set the modified output format in the driver.
public synchronized static String getUniqueFile(TaskAttemptContext context, String name, String extension) {
/*TaskID taskId = context.getTaskAttemptID().getTaskID();
int partition = taskId.getId();*/
StringBuilder result = new StringBuilder();
result.append(name);
/*
* result.append('-');
* result.append(TaskID.getRepresentingCharacter(taskId.getTaskType()));
* result.append('-'); result.append(NUMBER_FORMAT.format(partition));
* result.append(extension);
*/
return result.toString();
}
So whatever name is passed through MultipleOutputs, the filename will be created according to it.
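To wire the modified format in, a hypothetical driver snippet (MyTextOutputFormat stands for whatever you named your copy of TextOutputFormat):
// point the job at the customized format so the -r-00000 style suffix is no longer appended
job.setOutputFormatClass(MyTextOutputFormat.class);
// optional alternative: wrap it in LazyOutputFormat so the empty default part files are
// not created at all when everything is written through MultipleOutputs
// LazyOutputFormat.setOutputFormatClass(job, MyTextOutputFormat.class);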

How to Load Coprocessor Step by Step

Can anyone explain how to load a region coprocessor through the shell? I cannot get proper information about loading and deploying coprocessors. Thanks in advance.
Please follow the steps below:
Step 1: Create an interface and extend org.apache.hadoop.hbase.ipc.CoprocessorProtocol
Step 2: Define the method in the interface you want to execute once the co-processor call is made
Step 3: Create an instance of HTable
Step 4: Call the HTable.coprocessorExec() method with all required parameters
Please find the example below:
In the example, we are trying to get list of students whose registration number falls within some range which we are interested in.
Creating Interface Protocol:
public interface CoprocessorTestProtocol extends org.apache.hadoop.hbase.ipc.CoprocessorProtocol{
List<Student> getStudentList(byte[] startRegistrationNumber, byte[] endRegistrationNumber) throws IOException;
}
Sample Student Class:
public class Student implements Serializable{
byte[] registrationNumber;
String name;
public void setRegistrationNumber(byte[] registrationNumber){
this.registrationNumber = registrationNumber;
}
public byte[] getRegistrationNumber(){
return this.registrationNumber;
}
public void setName(String name){
this.name = name;
}
public String getName(){
return this.name;
}
public String toString(){
return "Student[ registration number = " + Bytes.toInt(this.getRegistrationNumber()) + " name = " + this.getName() + " ]";
}
}
Model Class: [Where the business logic to get data from HBase is written]
public class MyModel extends org.apache.hadoop.hbase.coprocessor.BaseEndpointCoprocessor implements CoprocessorTestProtocol{
@Override
public List<Student> getStudentList(byte[] startRegistrationNumber, byte[] endRegistrationNumber) throws IOException{
Scan scan = new Scan();
scan.setStartRow(startRegistrationNumber);
scan.setStopRow(endRegistrationNumber);
InternalScanner scanner = ((RegionCoprocessorEnvironment) getEnvironment()).getRegion().getScanner(scan);
List<KeyValue> currentTempObj = new ArrayList<KeyValue>();
List<Student> studentList = new ArrayList<Student>();
try{
Boolean hasNext = false;
Student student;
do{
currentTempObj.clear();
hasNext = scanner.next(currentTempObj);
if(!currentTempObj.isEmpty()){
student = new Student();
for(KeyValue keyValue: currentTempObj){
byte[] qualifier = keyValue.getQualifier();
if(Arrays.equals(qualifier, Bytes.toBytes("registrationNumber")))
student.setRegistrationNumber(keyValue.getValue());
else if(Arrays.equals(qualifier, Bytes.toBytes("name")))
student.setName(Bytes.toString(keyValue.getValue()));
}
studentList.add(student);
}
}while(hasNext);
}catch (Exception e){
// catch the exception the way you want
}
finally{
scanner.close();
}
return studentList;
}
}
Client class: [where the call to co-processor is made]
public class MyClient{
public static List<Student> displayStudentInfo(int startRegistrationNumber, int endRegistrationNumber) throws Throwable {
final byte[] startKey=Bytes.toBytes(startRegistrationNumber);
final byte[] endKey=Bytes.toBytes(endRegistrationNumber);
String zkPeers = SystemInfo.getHBaseZkConnectString();
Configuration configuration=HBaseConfiguration.create();
configuration.set(HConstants.ZOOKEEPER_QUORUM, zkPeers);
HTableInterface table = new HTable(configuration, TABLE_NAME);
Map<byte[],List<Student>> allRegionOutput;
allRegionOutput = table.coprocessorExec(CoprocessorTestProtocol.class, startKey,endKey,
new Batch.Call<CoprocessorTestProtocol, List<Student>>() {
public List<Student> call(CoprocessorTestProtocol instance)throws IOException{
return instance.getStudentList(startKey, endKey);
}
});
table.close();
List<Student> anotherList = new ArrayList<Student>();
for (List<Student> studentData: allRegionOutput.values()){
anotherList.addAll(studentData);
}
return anotherList;
}
public static void main(String[] args) throws Throwable {
if (args.length < 2) {
System.out.println("Usage : startRegistrationNumber endRegistrationNumber");
return;
}
int startRegistrationNumber = Integer.parseInt(args[0]);
int endRegistrationNumber = Integer.parseInt(args[1]);
for (Student student : displayStudentInfo(startRegistrationNumber, endRegistrationNumber)){
System.out.println(student);
}
}
}
Please note: have a special look at the InternalScanner.next(...) call in the example. It returns a boolean (whether more rows remain) and stores the current row's KeyValues in the List argument.
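To the original question about loading through the shell: one common way (the jar path, table name and priority below are placeholders) is to attach the endpoint class as a table attribute, assuming the coprocessor jar has already been copied to HDFS:
hbase> disable 'students'
hbase> alter 'students', METHOD => 'table_att', 'coprocessor' => 'hdfs:///user/hbase/coprocessor.jar|com.example.MyModel|1001|'
hbase> enable 'students'
hbase> describe 'students'
describe should then list the coprocessor attribute on the table. Alternatively, the class can be registered cluster-wide via the hbase.coprocessor.region.classes property in hbase-site.xml, which requires the jar on the region servers' classpath and a restart.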

Internal references in propertyfiles for Spring MessageSource

I have several webpages with similar forms on them.
One field that exists in several of the pages are email-address.
I want to be able to use a page-specific message code, but I would like to be able to reference another message code in order to have a single declaration. That way I can change the wording of the email-address label in one place and have it changed on all the webpages, while still being able to change the text for a single page with only property-file updates.
I'm looking for functionality like this:
message.properties:
label.email=Email address
webpage1.label.email=${label.email}
webpage2.label.email=${label.email}
However,
when using the following jsp-code:
<spring:message code="webpage1.label.email"/>
I get the literal ${label.email} instead of "Email address" in my webpages.
Any hints?
You can replace the DefaultPropertiesPersister with this one:
This will allow you to reference other entries, e.g.:
user=User
user.add=Add ${user}
user.delete=Delete ${user}
Simply specify this persister with your MessageSource, e.g. messageSource.setPropertiesPersister(new RecursivePropertiesPersister());
Source:
public class RecursivePropertiesPersister extends DefaultPropertiesPersister {
private final static Pattern PROP_PATTERN = Pattern.compile("\\$'?\\{'?([^}']*)'?\\}'?");
private final static String CURRENT = "?LOOP?";
@Override
public void load(Properties props, Reader reader) throws IOException {
Properties propsToLoad = new Properties();
super.load(propsToLoad, reader);
replace(propsToLoad, props);
}
@Override
public void load(Properties props, InputStream is) throws IOException {
Properties propsToLoad = new Properties();
super.load(propsToLoad, is);
replace(propsToLoad,props);
}
protected void replace ( Properties src, Properties dest) {
for (Map.Entry entry: src.entrySet()) {
String key = (String) entry.getKey();
String value = (String)entry.getValue();
replace(src, dest, key, value);
}
}
protected String replace(Properties src, Properties dest, String key, String value) {
String replaced = (String) dest.get(key);
if (replaced != null) {
// already replaced (or loop), just return the string
return replaced;
}
dest.put(key,CURRENT); // prevent loops
final Matcher matcher = PROP_PATTERN.matcher(value);
final StringBuffer sb = new StringBuffer();
while (matcher.find()) {
final String subkey = matcher.group(1);
final String replacement = (String)src.get(subkey);
matcher.appendReplacement(sb,replace(src,dest,subkey,replacement));
}
matcher.appendTail(sb);
final String resolved = sb.toString();
dest.put(key, resolved);
return resolved;
}
}
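For completeness, a possible wiring sketch (the bean setup is illustrative; the persister class is the one above):
@Bean
public MessageSource messageSource() {
ReloadableResourceBundleMessageSource messageSource = new ReloadableResourceBundleMessageSource();
messageSource.setBasename("classpath:messages");
messageSource.setDefaultEncoding("UTF-8");
// plug in the recursive persister so ${...} references are resolved when the files are loaded
messageSource.setPropertiesPersister(new RecursivePropertiesPersister());
return messageSource;
}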

Distributed Cache Hadoop not retrieving the file content

I am getting a garbage-like value instead of the data from the file I want to use as a distributed cache.
The Job Configuration is as follows:
Configuration config5 = new Configuration();
JobConf conf5 = new JobConf(config5, Job5.class);
conf5.setJobName("Job5");
conf5.setOutputKeyClass(Text.class);
conf5.setOutputValueClass(Text.class);
conf5.setMapperClass(MapThree4c.class);
conf5.setReducerClass(ReduceThree5.class);
conf5.setInputFormat(TextInputFormat.class);
conf5.setOutputFormat(TextOutputFormat.class);
DistributedCache.addCacheFile(new URI("/home/users/mlakshm/ap1228"), conf5);
FileInputFormat.setInputPaths(conf5, new Path(other_args.get(5)));
FileOutputFormat.setOutputPath(conf5, new Path(other_args.get(6)));
JobClient.runJob(conf5);
In the Mapper, I have the following code:
public class MapThree4c extends MapReduceBase implements Mapper<LongWritable, Text,
Text, Text >{
private Set<String> prefixCandidates = new HashSet<String>();
Text a = new Text();
public void configure(JobConf conf5) {
Path[] dates = new Path[0];
try {
dates = DistributedCache.getLocalCacheFiles(conf5);
System.out.println("candidates: "+candidates);
String astr = dates.toString();
a = new Text(astr);
} catch (IOException ioe) {
System.err.println("Caught exception while getting cached files: " +
StringUtils.stringifyException(ioe));
}
}
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer st = new StringTokenizer(line);
st.nextToken();
String t = st.nextToken();
String uidi = st.nextToken();
String uidj = st.nextToken();
String check = null;
output.collect(new Text(line), a);
}
}
The output value I am getting from this mapper is [Lorg.apache.hadoop.fs.Path;@786c1a82
instead of the value from the distributed cache file.
That looks like what you get when you call toString() on an array, and if you look at the javadocs for DistributedCache.getLocalCacheFiles(), an array is exactly what it returns. If you need to actually read the contents of the files in the cache, you can open/read them with the standard Java APIs.
From your code:
Path[] dates = DistributedCache.getLocalCacheFiles(conf5);
Implies that:
String astr = dates.toString(); // is a reference to the above array (i.e. dates), which is what you see in the output as [Lorg.apache.hadoop.fs.Path;@786c1a82.
You need to do the following to see the actual paths:
for(Path cacheFile: dates){
output.collect(new Text(line), new Text(cacheFile.getName()));
}
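For example, configure() could read the cached file's contents with plain java.io (a sketch along the lines of the original code, with minimal error handling):
public void configure(JobConf conf5) {
try {
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf5);
if (cacheFiles != null && cacheFiles.length > 0) {
// getLocalCacheFiles() returns local filesystem paths, so standard java.io can read them
BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
StringBuilder content = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
content.append(line).append('\n');
}
reader.close();
a = new Text(content.toString());
}
} catch (IOException ioe) {
System.err.println("Caught exception while reading cached file: " + StringUtils.stringifyException(ioe));
}
}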
