I am reading files one by one from a remote directory over SFTP. To do this I use an LS gateway, followed by a splitter, a GET gateway, and finally a rename of the remote file.
The sequence of the files is very important to me, so each file name contains a counter. If, after streaming a file with the GET gateway, some issue occurs with the data or during data processing, I don't want the next files in the sequence to be read.
IntegrationFlows.from(() -> path, e -> e.poller(Pollers.fixedDelay(60, TimeUnit.SECONDS)))
        // list the remote *.csv files
        .handle(Sftp.outboundGateway(sftpSessionFactory(), LS, "payload")
                .regexFileNameFilter(".*csv"))
        // one message per remote file
        .split()
        // stream each file instead of copying it to a local directory
        .handle(Sftp.outboundGateway(sftpSessionFactory(), GET, "payload.remoteDirectory + payload.filename")
                .options(STREAM)
                .temporaryFileSuffix("_reading"))
        .handle(readCsvData(), e -> e.advice(afterReadingCsv()))
        .filter(this, "checkSuccess")
        .enrichHeaders(h -> h
                .headerExpression(FileHeaders.RENAME_TO,
                        "headers['file_remoteDirectory'] + 'archive/' + headers['file_remoteFile']")
                .headerExpression(FileHeaders.REMOTE_FILE, "headers['file_remoteFile']")
                .headerExpression(FileHeaders.REMOTE_DIRECTORY, "headers['file_remoteDirectory']"))
        // move the processed file into the archive folder
        .handle(Sftp.outboundGateway(sftpSessionFactory(), MV,
                "headers['file_remoteDirectory'] + headers['file_remoteFile']")
                .renameExpression("headers['file_renameTo']"))
        .get();
I don't even want to rename the file if some issue occurs during data processing. I am able to stop that flow, but I am not sure how I can stop it from reading the subsequent files.
I have a simple flow of GetFile -> PutHDFS. The flow works when Keep Source File = false.
I want to keep the source file, but when I change Keep Source File = true, the flow doesn't work.
I gave chmod 777 to the input directory, but it didn't help.
Any idea what I should do?
Thanks
I have two files on an SFTP server, both large in size: folder_A/A.txt and folder_B/B.txt. I want to append the contents of B.txt to A.txt and store the result as folder_C/C.txt on the SFTP server. One way is to download the files, read their content, create the new file, and then upload it to folder_C/C.txt. Is there a more efficient way to do this task with Spring Boot, without actually downloading the files and sending the data back over the network?
Something like this:
RemoteFileTemplate<LsEntry> template = new RemoteFileTemplate<>(sftpSessionFactory);
template.execute((SessionCallbackWithoutResult<LsEntry>) session -> {
    // stream each source file into the target file on the same server,
    // appending rather than overwriting
    session.append(session.readRaw("folder_A/A.txt"), "folder_C/C.txt");
    session.append(session.readRaw("folder_B/B.txt"), "folder_C/C.txt");
});
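Note that readRaw() still streams the remote content through the client (SFTP has no server-side copy); it just avoids touching the local file system. Also, per the Session contract, session.finalizeRaw() should be called once a raw stream has been fully consumed.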
See more info in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/sftp.html#sftp-rft
I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
#Unzip adapter trimmed fastq file
rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz',
    output:
        temp('{sample}.adapterTrim.round2.fastq')
    conda:
        '../envs/rep_element.yaml'
    shell:
        'gunzip -c {input[0]} > {output[0]}'

#Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
    input:
        '{sample}.adapterTrim.round2.fastq'
    output:
        'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
    params:
        bt2=config["ref"]["bt2_index_path"], eid=config["ref"]["enst2id"]
    conda:
        '../envs/rep_element.yaml'
    shell:
        'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
        '{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently listed in the rule all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify in the rule that this file should not be uploaded, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap the file names in remote().
However, as a general solution, the documentation contains the following:
The remote() wrapper is mutually exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote(), or explicitly wrap the file in local().
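For illustration, a minimal sketch of the second option, assuming the pipeline is run with a default remote provider such as --default-remote-provider S3 (which would explain the implicit upload):
rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz',
    output:
        # local() keeps this file out of the default remote provider, so it
        # is never uploaded to S3, and temp() can then delete it once the
        # downstream rule has consumed it
        temp(local('{sample}.adapterTrim.round2.fastq'))
    shell:
        'gunzip -c {input[0]} > {output[0]}'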
I'm using Spring Batch plus Spring Integration SFTP on my project.
I don't want the download to be triggered on app startup. I want the download process to be triggered in step 1 and to go on to step 2 AFTER all the files have been downloaded locally; I'm not sure how to implement that.
You need to use <int-sftp:outbound-gateway> with the MGET command:
The message payload resulting from an mget operation is a List<File> object - a List of File objects, each representing a retrieved file.
The remote directory is provided in the file_remoteDirectory header, and the pattern for the filenames is provided in the file_remoteFile header.
https://docs.spring.io/spring-integration/docs/4.3.12.RELEASE/reference/html/sftp.html#sftp-outbound-gateway
In the Java DSL it looks like this:
.handle(Sftp.outboundGateway(sessionFactory(),
                AbstractRemoteFileOutboundGateway.Command.MGET, "payload")
        .options(AbstractRemoteFileOutboundGateway.Option.RECURSIVE)
        .regexFileNameFilter("(subSftpSource|.*1.txt)")
        .localDirectoryExpression("'" + getTargetLocalDirectoryName() + "' + #remoteDirectory")
        .localFilenameExpression("#remoteFileName.replaceFirst('sftpSource', 'localTarget')"))
where the payload is a SpEL expression for evaluating the remote directory. In this case it is really just the payload of the request message:
String dir = "sftpSource/";
registration.getInputChannel().send(new GenericMessage<>(dir + "*"));
If your remote directory is static and isn't changed from the Batch job, you can provide it as a LiteralExpression - expression="'myRemoteDir'" in the XML definition.
Since the result of this MGET command is a List<File>, you should consider using a splitter as the next step.
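To trigger the download from step 1 rather than on startup, one option is to front the flow with a messaging gateway that a Batch tasklet calls and that blocks until the MGET reply arrives. A minimal sketch, assuming hypothetical names (sftpMgetChannel, SftpDownloadGateway, the local directory):
// Hypothetical gateway: step 1's tasklet calls download(...) and blocks
// until the whole List<File> has been fetched, so step 2 only starts
// after all the files are on the local disk.
@MessagingGateway
public interface SftpDownloadGateway {

    @Gateway(requestChannel = "sftpMgetChannel")
    List<File> download(String remotePattern);

}

@Bean
public IntegrationFlow mgetFlow(SessionFactory<LsEntry> sessionFactory) {
    return IntegrationFlows.from("sftpMgetChannel")
            .handle(Sftp.outboundGateway(sessionFactory,
                            AbstractRemoteFileOutboundGateway.Command.MGET, "payload")
                    .localDirectory(new File("/tmp/local")))   // placeholder target dir
            .get();
}
The tasklet then simply sends the remote pattern, e.g. downloadGateway.download("sftpSource/*"), and returns RepeatStatus.FINISHED once the call comes back.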
I have a flow where the first processor is GetFile, which reads from a source directory and runs every [x] seconds or minutes.
If I copied a file into the source directory and GetFile started to read the file at that moment in time, would I get partial data over the wire?
Yes, that can happen. A common pattern is to copy the file into the source directory with a dot at the front, so that it is excluded by GetFile at first; once the file is complete, it can be renamed, and then GetFile picks up the entire thing.
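For illustration, the same pattern from the producer's side (a hedged sketch; the directory and file names are placeholders, and it relies on GetFile's default Ignore Hidden Files = true behavior):
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class DotRenameProducer {

    public static void main(String[] args) throws Exception {
        Path sourceDir = Paths.get("/data/in");         // placeholder source dir
        Path hidden = sourceDir.resolve(".myfile.csv"); // dot prefix hides it from GetFile

        // write the full content while the file is still hidden...
        Files.write(hidden, "a,b,c\n".getBytes(StandardCharsets.UTF_8));

        // ...then rename atomically, so GetFile only ever sees a complete file
        Files.move(hidden, sourceDir.resolve("myfile.csv"), StandardCopyOption.ATOMIC_MOVE);
    }
}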