Can I use Spring Batch to transfer a file (video/txt) available over HTTP, read it, and write it to my PC? - spring-boot

Right now I am trying to write an HTTP file (sample: "https://www.w3.org/TR/PNG/iso_8859-1.txt") to my local PC using Spring Batch, but I am unable to read the file.
My reader is as follows:
public FlatFileItemReader<String> reader(){
    FlatFileItemReader<String> reader = new FlatFileItemReader<String>();
    reader.setResource(new UrlResource("https://www.w3.org/TR/PNG/iso_8859-1.txt"));
    reader.setRecordSeparatorPolicy(new RecordSeparatorPolicy() {

        @Override
        public boolean isEndOfRecord(String s) {
            if (s.length() == 100)
                return true;
            return false;
        }

        @Override
        public String postProcess(String s) {
            return s;
        }

        @Override
        public String preProcess(String s) {
            return s;
        }
    });
    return reader;
}
I am reading 100-character chunks from the input and passing them to the writer, but I am getting this error:
org.springframework.batch.item.file.FlatFileParseException: Unexpected end of file before record complete
at org.springframework.batch.item.file.FlatFileItemReader.applyRecordSeparatorPolicy(FlatFileItemReader.java:303) ~[spring-batch-infrastructure-4.2.4.RELEASE.jar:4.2.4.RELEASE]
at org.springframework.batch.item.file.FlatFileItemReader.readLine(FlatFileItemReader.java:220) ~[spring-batch-infrastructure-4.2.4.RELEASE.jar:4.2.4.RELEASE]
at org.springframework.batch.item.file.FlatFileItemReader.doRead(FlatFileItemReader.java:178) ~[spring-batch-infrastructure-4.2.4.RELEASE.jar:4.2.4.RELEASE]
...
How can I read the complete .txt file from HTTP, process it, and write it to the desired location?
Also, what if I pass a video file to the reader instead of a txt file? What should the FlatFileItemReader<???> type be in this case?

You don't need that custom record separator policy; you can use a PassThroughLineMapper instead.
Also, what if I pass a video file to the reader instead of a txt file? What should the FlatFileItemReader type be in this case?
You can use byte[] as the item type, or use the SimpleBinaryBufferedReaderFactory.
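For reference, here is a minimal sketch of the reader from the question rewritten with a PassThroughLineMapper (same URL as in the question; the @Bean method is assumed to live in the existing batch configuration class):
import java.net.MalformedURLException;

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.UrlResource;

@Bean
public FlatFileItemReader<String> reader() throws MalformedURLException {
    FlatFileItemReader<String> reader = new FlatFileItemReader<>();
    // each line of the remote file becomes one String item; no custom record separator policy is needed
    reader.setResource(new UrlResource("https://www.w3.org/TR/PNG/iso_8859-1.txt"));
    reader.setLineMapper(new PassThroughLineMapper());
    return reader;
}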

Related

Spring Boot SFTP, dynamic directory in SFTP

I am trying to upload files to dynamic directories over SFTP. When I upload several files, the first file is always uploaded to the last directory, and only the remaining files are uploaded to the correct directory. In debug mode I saw that the first file is always uploaded to the temporaryDirectory that Spring already sets up, and I don't know how to set this temporaryDirectory to the right value. Please help me solve the problem.
Or, if you have another way to upload files and create the proper dynamic directory, please let me know.
Here is the code:
private String sftpRemoteDirectory = "documents/";

@MessagingGateway
public interface UploadGateway {

    @Gateway(requestChannel = "toSftpChannel")
    void upload(File file, @Header("dirName") String dirName);
}

@Bean
@ServiceActivator(inputChannel = "toSftpChannel")
public MessageHandler handler() {
    SftpMessageHandler handler = new SftpMessageHandler(sftpSessionFactory());
    SimpleDateFormat formatter = new SimpleDateFormat("yyMMdd");
    String newDynamicDirectory = "E" + formatter.format(new Date()) + String.format("%04d", Integer.parseInt("0001") + 1);
    handler.setRemoteDirectoryExpression(new LiteralExpression(sftpRemoteDirectory + newDynamicDirectory));
    handler.setFileNameGenerator(message -> {
        String dirName = (String) message.getHeaders().get("dirName");
        handler.setRemoteDirectoryExpression(new LiteralExpression(sftpRemoteDirectory + dirName));
        handler.setAutoCreateDirectory(true);
        if (message.getPayload() instanceof File) {
            return (((File) message.getPayload()).getName());
        } else {
            throw new IllegalArgumentException("File expected as payload!");
        }
    });
    return handler;
}
You are using a LiteralExpression, which is evaluated only once; you need an expression that is evaluated at runtime, for each message:
handler.setRemoteDirectoryExpressionString("'" + sftpRemoteDirectory + "' + headers['dirName']");
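Put together, the handler bean from the question might look like this sketch (assuming the sftpSessionFactory() bean and the sftpRemoteDirectory field shown above):
@Bean
@ServiceActivator(inputChannel = "toSftpChannel")
public MessageHandler handler() {
    SftpMessageHandler handler = new SftpMessageHandler(sftpSessionFactory());
    handler.setAutoCreateDirectory(true);
    // the SpEL expression is evaluated for every message, so each file lands in the directory named by its 'dirName' header
    handler.setRemoteDirectoryExpressionString("'" + sftpRemoteDirectory + "' + headers['dirName']");
    // keep the original file name; no per-message mutation of the handler is needed
    handler.setFileNameGenerator(message -> ((File) message.getPayload()).getName());
    return handler;
}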

Spring Cloud Stream file source app - history of processed files and polling files under subdirectories

I'm building a data pipeline with the Spring Cloud Stream File Source app at the start of the pipeline, and I need some help working around some missing features.
My file source app (based on org.springframework.cloud.stream.app:spring-cloud-starter-stream-source-file) works perfectly well except for missing features that I need help with. I need:
To delete files after they have been polled and a message has been sent
To poll into subdirectories
With respect to item 1, I read that the delete feature doesn't exist in the file source app (it is available in the SFTP source). Every time the app is restarted, files that were processed in the past are picked up again; can the history of processed files be made permanent? Is there an easy alternative?
To support those requirements you definitely need to modify the code of the mentioned File Source project: https://docs.spring.io/spring-cloud-stream-app-starters/docs/Einstein.BUILD-SNAPSHOT/reference/htmlsingle/#_patching_pre_built_applications
I would suggest forking the project and pulling it from GitHub as is, since you are going to modify its existing code. Then follow the instructions in the mentioned doc on how to build the target binder-specific artifact that will be compatible with the SCDF environment.
Now about the questions:
To poll sub-directories for the same file pattern, you need to configure a RecursiveDirectoryScanner on the Files.inboundAdapter():
/**
 * Specify a custom scanner.
 * @param scanner the scanner.
 * @return the spec.
 * @see FileReadingMessageSource#setScanner(DirectoryScanner)
 */
public FileInboundChannelAdapterSpec scanner(DirectoryScanner scanner) {
Note that all the filters must be configured on this DirectoryScanner instead.
Otherwise this assertion in the framework will fail:
// Check that the filter and locker options are _NOT_ set if an external scanner has been set.
// The external scanner is responsible for the filter and locker options in that case.
Assert.state(!(this.scannerExplicitlySet && (this.filter != null || this.locker != null)),
        () -> "When using an external scanner the 'filter' and 'locker' options should not be used. " +
                "Instead, set these options on the external DirectoryScanner: " + this.scanner);
To keep track of the files, it is better to consider a FileSystemPersistentAcceptOnceFileListFilter backed by an external persistence store through a ConcurrentMetadataStore implementation: https://docs.spring.io/spring-integration/reference/html/#metadata-store. This must be used instead of that preventDuplicates(), because FileSystemPersistentAcceptOnceFileListFilter ensures only-once logic for us as well.
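As a rough sketch of that idea (the PropertiesPersistingMetadataStore and its base directory below are only illustrative; any ConcurrentMetadataStore implementation, for example Redis- or JDBC-backed, can be plugged in instead):
@Bean
public PropertiesPersistingMetadataStore metadataStore() {
    // keeps the "already seen" file names in a properties file, so a restart does not re-emit old files
    PropertiesPersistingMetadataStore metadataStore = new PropertiesPersistingMetadataStore();
    metadataStore.setBaseDirectory("/var/file-source/metadata"); // assumed location
    return metadataStore;
}

@Bean
public FileSystemPersistentAcceptOnceFileListFilter persistentAcceptOnceFilter() {
    // "file-source-" is an arbitrary key prefix for the metadata store entries
    return new FileSystemPersistentAcceptOnceFileListFilter(metadataStore(), "file-source-");
}
This filter would then be registered on the RecursiveDirectoryScanner (for example inside a CompositeFileListFilter) in place of preventDuplicates().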
Deleting the file right after sending might not be an option, since you may send the File object as is, and it then has to still be available on the consuming side.
Also, you can add a ChannelInterceptor to the source.output() channel and implement its postSend() to perform ((File) message.getPayload()).delete(); this happens when the message has been successfully sent to the binder destination.
@EnableBinding(Source.class)
@Import(TriggerConfiguration.class)
@EnableConfigurationProperties({FileSourceProperties.class, FileConsumerProperties.class,
        TriggerPropertiesMaxMessagesDefaultUnlimited.class})
public class FileSourceConfiguration {

    @Autowired
    @Qualifier("defaultPoller")
    PollerMetadata defaultPoller;

    @Autowired
    Source source;

    @Autowired
    private FileSourceProperties properties;

    @Autowired
    private FileConsumerProperties fileConsumerProperties;

    private Boolean alwaysAcceptDirectories = false;
    private Boolean deletePostSend;
    private Boolean movePostSend;
    private String movePostSendSuffix;

    @Bean
    public IntegrationFlow fileSourceFlow() {
        FileInboundChannelAdapterSpec messageSourceSpec = Files.inboundAdapter(new File(this.properties.getDirectory()));
        RecursiveDirectoryScanner recursiveDirectoryScanner = new RecursiveDirectoryScanner();
        messageSourceSpec.scanner(recursiveDirectoryScanner);
        FileVisitOption[] fileVisitOption = new FileVisitOption[1];
        recursiveDirectoryScanner.setFilter(initializeFileListFilter());
        initializePostSendAction();
        IntegrationFlowBuilder flowBuilder = IntegrationFlows
                .from(messageSourceSpec,
                        new Consumer<SourcePollingChannelAdapterSpec>() {

                            @Override
                            public void accept(SourcePollingChannelAdapterSpec sourcePollingChannelAdapterSpec) {
                                sourcePollingChannelAdapterSpec
                                        .poller(defaultPoller);
                            }

                        });
        ChannelInterceptor channelInterceptor = new ChannelInterceptor() {

            @Override
            public void postSend(Message<?> message, MessageChannel channel, boolean sent) {
                if (sent) {
                    File fileOriginalFile = (File) message.getHeaders().get("file_originalFile");
                    if (fileOriginalFile != null) {
                        if (movePostSend) {
                            fileOriginalFile.renameTo(new File(fileOriginalFile + movePostSendSuffix));
                        }
                        else if (deletePostSend) {
                            fileOriginalFile.delete();
                        }
                    }
                }
            }

            // Override more interceptor methods to capture some logs here

        };
        MessageChannel messageChannel = source.output();
        ((DirectChannel) messageChannel).addInterceptor(channelInterceptor);
        return FileUtils.enhanceFlowForReadingMode(flowBuilder, this.fileConsumerProperties)
                .channel(messageChannel)
                .get();
    }

    private void initializePostSendAction() {
        deletePostSend = this.properties.isDeletePostSend();
        movePostSend = this.properties.isMovePostSend();
        movePostSendSuffix = this.properties.getMovePostSendSuffix();
        if (deletePostSend && movePostSend) {
            String errorMessage = "The 'delete-file-post-send' and 'move-file-post-send' attributes are mutually exclusive";
            throw new IllegalArgumentException(errorMessage);
        }
        if (movePostSend && (movePostSendSuffix == null || movePostSendSuffix.trim().length() == 0)) {
            String errorMessage = "The 'move-post-send-suffix' is required when 'move-file-post-send' is set to true.";
            throw new IllegalArgumentException(errorMessage);
        }
        // Add additional validation to ensure the user didn't configure a file move that will result in cyclic processing of the file
    }

    private FileListFilter<File> initializeFileListFilter() {
        final List<FileListFilter<File>> filtersNeeded = new ArrayList<FileListFilter<File>>();
        if (this.properties.getFilenamePattern() != null && this.properties.getFilenameRegex() != null) {
            String errorMessage = "The 'filename-pattern' and 'filename-regex' attributes are mutually exclusive.";
            throw new IllegalArgumentException(errorMessage);
        }
        if (StringUtils.hasText(this.properties.getFilenamePattern())) {
            SimplePatternFileListFilter patternFilter = new SimplePatternFileListFilter(this.properties.getFilenamePattern());
            if (this.alwaysAcceptDirectories != null) {
                patternFilter.setAlwaysAcceptDirectories(this.alwaysAcceptDirectories);
            }
            filtersNeeded.add(patternFilter);
        }
        else if (this.properties.getFilenameRegex() != null) {
            RegexPatternFileListFilter regexFilter = new RegexPatternFileListFilter(this.properties.getFilenameRegex());
            if (this.alwaysAcceptDirectories != null) {
                regexFilter.setAlwaysAcceptDirectories(this.alwaysAcceptDirectories);
            }
            filtersNeeded.add(regexFilter);
        }
        FileListFilter<File> createdFilter = null;
        if (!Boolean.FALSE.equals(this.properties.isIgnoreHiddenFiles())) {
            filtersNeeded.add(new IgnoreHiddenFileListFilter());
        }
        if (Boolean.TRUE.equals(this.properties.isPreventDuplicates())) {
            filtersNeeded.add(new AcceptOnceFileListFilter<File>());
        }
        if (filtersNeeded.size() == 1) {
            createdFilter = filtersNeeded.get(0);
        }
        else {
            createdFilter = new CompositeFileListFilter<File>(filtersNeeded);
        }
        return createdFilter;
    }

}

Read in a big CSV file, validate, and write out using the uniVocity parser

I need to parse a big CSV file (2 GB). The values have to be validated, rows containing "bad" fields must be dropped, and a new file containing only valid rows ought to be output.
I've selected the uniVocity parser library to do this. Please help me understand whether this library is well suited for the task and what approach should be used.
Given the file size, what is the best way to organize read -> validate -> write in uniVocity? Should I read in all rows at once or use an iterator style? Where should parsed and validated rows be stored before they are written to the file?
Is there a way in uniVocity to access a row's values by index? Something like row.getValue(3)?
I'm the author of this library; let me try to help you out:
First, do not try to read all rows at once as you will fill your memory with LOTS of data.
You can get the row values by index.
The fastest approach to read/validate/write would be to use a RowProcessor that has a CsvWriter and decides when to write or skip a row. I think the following code will help you a bit:
Define the output:
private CsvWriter createCsvWriter(File output, String encoding){
    CsvWriterSettings settings = new CsvWriterSettings();
    //configure the writer ...
    try {
        return new CsvWriter(new OutputStreamWriter(new FileOutputStream(output), encoding), settings);
    } catch (IOException e) {
        throw new IllegalArgumentException("Error writing to " + output.getAbsolutePath(), e);
    }
}
Redirect the input:
//this creates a row processor for our parser. It validates each row and sends them to the csv writer.
private RowProcessor createRowProcessor(File output, String encoding){
    final CsvWriter writer = createCsvWriter(output, encoding);
    return new AbstractRowProcessor() {

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            if (shouldWriteRow(row)) {
                writer.writeRow(row);
            } else {
                //skip row
            }
        }

        private boolean shouldWriteRow(String[] row) {
            //your validation here
            return true;
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    };
}
Configure the parser:
public void readAndWrite(File input, File output, String encoding) {
    CsvParserSettings settings = new CsvParserSettings();
    //configure the parser here

    //tells the parser to send each row to the custom processor, which will validate and redirect all rows to the CsvWriter
    settings.setRowProcessor(createRowProcessor(output, encoding));

    CsvParser parser = new CsvParser(settings);
    try {
        parser.parse(new InputStreamReader(new FileInputStream(input), encoding));
    } catch (IOException e) {
        throw new IllegalStateException("Unable to open input file " + input.getAbsolutePath(), e);
    }
}
For better performance you can also wrap the row processor in a ConcurrentRowProcessor.
settings.setRowProcessor(new ConcurrentRowProcessor(createRowProcessor(output, encoding)));
With this, the writing of rows will be performed in a separate thread.
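A hypothetical call site for the methods above (the class name MyCsvValidator and the file paths are placeholders):
// parse input.csv, keep only the rows that pass shouldWriteRow(), and write them to output.csv
MyCsvValidator validator = new MyCsvValidator();
validator.readAndWrite(new File("/data/input.csv"), new File("/data/output.csv"), "UTF-8");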

Loading Files in UDF

I have a requirement to populate a field based on the evaluation of a UDF. The inputs to the UDF are some other fields in the input as well as a CSV sheet. Presently, the approach I have taken is to load the CSV file, GROUP it ALL, and then pass it as a bag to the UDF along with the other required parameters. However, it's taking a very long time to complete the process (roughly about 3 hours) for source data of 170k records and CSV records of about 150k.
I'm sure there must be a much more efficient way to handle this, hence I need your input.
source_alias = LOAD 'src.csv' USING PigStorage(',') AS (f1:chararray, f2:chararray, f3:chararray);
csv_alias = LOAD 'csv_file.csv' USING PigStorage(',') AS (c1:chararray, c2:chararray, c3:chararray);
grpd_csv_alias = GROUP csv_alias ALL;
final_alias = FOREACH source_alias GENERATE f1 AS f1, myUDF(grpd_csv_alias, f2) AS derived_f2;
Here is my UDF at a high level:
public class myUDF extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String f2Response = "N";
        DataBag csvAliasBag = (DataBag) input.get(0);
        String f2 = (String) input.get(1);
        try {
            Iterator<Tuple> bagIterator = csvAliasBag.iterator();
            while (bagIterator.hasNext()) {
                Tuple localTuple = (Tuple) bagIterator.next();
                String col1 = ((String) localTuple.get(1)).trim().toLowerCase();
                String col2 = ((String) localTuple.get(2)).trim().toLowerCase();
                String col3 = ((String) localTuple.get(3)).trim().toLowerCase();
                String col4 = ((String) localTuple.get(4)).trim().toLowerCase();
                // <Custom logic to populate f2Response based on the value in f2 as well as col1, col2, col3 and col4>
            }
            return f2Response;
        }
        catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
I believe the process is taking too long because of building and passing csv_alias to the UDF for each row in the source file.
Is there any better way to handle this?
Thanks
For small files, you can put them on the distributed cache. This copies the file to each task node as a local file, and then you load it yourself. Here's an example from the UDF section of the Pig docs. I would not recommend parsing the file each time, however; store your results in a class variable and check whether it has been initialized (a sketch of that caching approach follows the example below). If the CSV is on the local file system, use getShipFiles. If the CSV you're using is on HDFS, use the getCacheFiles method. Notice that for HDFS there's a file path followed by a # and some text: to the left of the # is the HDFS path, and to the right is the name it will be given when it's copied to the local file system.
public class Udfcachetest extends EvalFunc<String> {

    public String exec(Tuple input) throws IOException {
        String concatResult = "";

        FileReader fr = new FileReader("./smallfile1");
        BufferedReader d = new BufferedReader(fr);
        concatResult += d.readLine();

        fr = new FileReader("./smallfile2");
        d = new BufferedReader(fr);
        concatResult += d.readLine();

        return concatResult;
    }

    public List<String> getCacheFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/user/pig/tests/data/small#smallfile1"); // This is an HDFS file
        return list;
    }

    public List<String> getShipFiles() {
        List<String> list = new ArrayList<String>(1);
        list.add("/home/hadoop/pig/smallfile2"); // This is a local file
        return list;
    }
}
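Following the "parse once and keep it in a class variable" advice above, here is a sketch of a UDF that lazily loads the cached CSV into a class-level map on the first call (the HDFS path, column layout, and Y/N lookup logic are placeholders for the custom logic in the question):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyLookupUdf extends EvalFunc<String> {

    // parsed once per task JVM and reused for every input tuple
    private Map<String, String> lookup;

    @Override
    public String exec(Tuple input) throws IOException {
        if (lookup == null) {
            lookup = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader("./csv_lookup"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cols = line.split(",");
                    // placeholder: key on the first column, keep the second
                    lookup.put(cols[0].trim().toLowerCase(), cols[1].trim().toLowerCase());
                }
            } finally {
                reader.close();
            }
        }
        String f2 = (String) input.get(0);
        return lookup.containsKey(f2 == null ? "" : f2.trim().toLowerCase()) ? "Y" : "N";
    }

    @Override
    public List<String> getCacheFiles() {
        // HDFS path on the left of '#', local alias on the right
        return Collections.singletonList("/user/pig/data/csv_file.csv#csv_lookup");
    }
}
With this approach the FOREACH no longer needs grpd_csv_alias; the UDF is called with just f2, e.g. myUDF(f2).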

Spring FlatFileReader Jagged CSV file

I'm reading data via Spring Batch and I'm going to dump it into a database table.
My csv file of musical facts is formatted like this:
question; valid answer; potentially another valid answer; unlikely, but another;
All rows have a question and at least one valid answer, but there can be more. The simple way to hold this data in a POJO is with one String field for the question and a List<String> for the answers.
Below is a simple line mapper to read a CSV file, but I don't know how to make the necessary changes to accommodate a jagged CSV file in this manner.
@Bean
public LineMapper<MusicalFactoid> musicalFactoidLineMapper() {
    DefaultLineMapper<MusicalFactoid> musicalFactoidDefaultLineMapper = new DefaultLineMapper<>();
    musicalFactoidDefaultLineMapper.setLineTokenizer(new DelimitedLineTokenizer() {{
        setDelimiter(";");
        setNames(new String[]{"question", "answer"}); // <- this will not work!
    }});
    musicalFactoidDefaultLineMapper.setFieldSetMapper(new BeanWrapperFieldSetMapper<MusicalFactoid>() {{
        setTargetType(MusicalFactoid.class);
    }});
    return musicalFactoidDefaultLineMapper;
}
What do I need to do?
Write your own LineMapper. As far as I can see, you don't have any complex logic.
Something like this:
public class MyLineMapper implements LineMapper<MusicalFactoid> {

    @Override
    public MusicalFactoid mapLine(String line, int lineNumber) {
        MusicalFactoid dto = new MusicalFactoid();
        String[] splitted = line.split(";");
        dto.setQuestion(splitted[0]);
        for (int idx = 1; idx < splitted.length; idx++) {
            dto.addAnswer(splitted[idx]);
        }
        return dto;
    }
}
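The mapper above assumes a MusicalFactoid POJO along these lines (a sketch; only the accessors used by the mapper are shown):
import java.util.ArrayList;
import java.util.List;

public class MusicalFactoid {

    private String question;
    private final List<String> answers = new ArrayList<>();

    public void setQuestion(String question) {
        this.question = question;
    }

    public String getQuestion() {
        return question;
    }

    public void addAnswer(String answer) {
        answers.add(answer);
    }

    public List<String> getAnswers() {
        return answers;
    }
}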
