Use Spring Integration to merge two files into a single file via SFTP - spring

I have two large files on an SFTP server: folder_A/A.txt and folder_B/B.txt. I want to append the contents of B.txt to A.txt and store the result in folder_C/C.txt on the same SFTP server. One way is to download both files, read their contents, create a new file, and upload it to folder_C/C.txt. Is there a more efficient way to do this with Spring Boot, without actually downloading the files, so that the merge happens over the network?

Something like this:
RemoteFileTemplate<LsEntry> template = new RemoteFileTemplate<>(sftpSessionFactory);
template.execute((SessionCallbackWithoutResult<LsEntry>) session -> {
    // Stream A.txt into C.txt, then append B.txt after it.
    // The data is streamed through this application but never saved as a local file.
    session.append(session.readRaw("folder_A/A.txt"), "folder_C/C.txt");
    session.append(session.readRaw("folder_B/B.txt"), "folder_C/C.txt");
});
See more info in docs: https://docs.spring.io/spring-integration/docs/current/reference/html/sftp.html#sftp-rft

Related

create a CSV file in ADLS from databricks

I am creating a CSV file in an ADLS folder.
For example, the file name is sample.txt.
Instead of a single file, I see a sample.txt/ directory containing part-000 files.
Is there a way to create a single sample.txt file instead of a directory in PySpark?
Both df.write() and df.save() create a folder with multiple files inside it.
Using coalesce(1) I can combine the multiple part-000 files into one, but how do I create a single CSV file?
Unfortunately, Spark doesn’t support writing a data file without a containing folder.
As a workaround, first use coalesce or repartition to create a single part (partition) file.
df\
.coalesce(1)\
.write\
.format("csv")\
.mode("overwrite")\
.save("mydata")
The example above produces a mydata directory containing a part-000* file and some hidden files.
All of the data is in that one CSV part file, but its name is not user-friendly, so we can rename it and move it out of the directory.
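# Directory written by the save() call (adjust to the path you actually used)
data_location = "/mydata.csv/"
files = dbutils.fs.ls(data_location)
# Pick the single part file, the only entry whose path ends in .csv
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
# Move it next to the directory; with the path above the result is /mydata.csv.csv
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")
# Remove the now-unneeded directory and its marker files
dbutils.fs.rm(data_location, recurse=True)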
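The same rename-and-clean-up pattern can be sketched against a local filesystem using only the standard library, which may be useful for trying the logic outside Databricks. `promote_single_part_file` is a hypothetical helper, not part of Spark or dbutils, and it assumes the output directory holds exactly one `part-*.csv` file.

```python
import shutil
from pathlib import Path


def promote_single_part_file(output_dir: str, target_file: str) -> Path:
    """Move the single part-*.csv file out of output_dir, then delete output_dir.

    Hypothetical local analogue of the dbutils.fs.ls / mv / rm steps.
    """
    out = Path(output_dir)
    # Data files are named part-*; marker files such as _SUCCESS are skipped.
    part_files = sorted(
        p for p in out.iterdir()
        if p.name.startswith("part-") and p.suffix == ".csv"
    )
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    target = Path(target_file)
    shutil.move(str(part_files[0]), str(target))
    shutil.rmtree(out)  # remove the directory and any leftover marker files
    return target
```

Raising an error when more than one part file is present is a deliberate choice here: it signals that coalesce(1) was not applied, rather than silently picking the first file.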
Set up the account key and configure access to the storage account, then move the file from the Databricks location to ADLS. To move the file, use dbutils.fs.mv (or dbutils.fs.cp to copy it, as in the snippet that follows).
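```python
storage_account_name = "Storage account name"
storage_account_access_key = "storage account access key"
# abfss:// paths use the dfs endpoint, so the key is set for dfs.core.windows.net
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)
# abfss URI format: abfss://<container>@<account>.dfs.core.windows.net/<path>
dbutils.fs.cp('/mydata.csv.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```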

Temp file not being deleted

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
#Unzip adapter trimmed fastq file
rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz',
    output:
        temp('{sample}.adapterTrim.round2.fastq')
    conda:
        '../envs/rep_element.yaml'
    shell:
        'gunzip -c {input[0]} > {output[0]}'
#Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
    input:
        '{sample}.adapterTrim.round2.fastq'
    output:
        'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
    params:
        bt2=config["ref"]["bt2_index_path"], eid=config["ref"]["enst2id"]
    conda:
        '../envs/rep_element.yaml'
    shell:
        'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
        '{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently written in the rule-all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap file names in remote.
However, as a general solution, documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote, or explicitly wrap the file in local.

how to determine if a file is completely downloaded using kqueue?

I want to implement a function that monitors a directory and performs some action when a new file has been downloaded from the Internet, but I find it difficult to determine whether the file is completely downloaded. Is there a way to do that?
Tools that compute a file's hash give you a fingerprint of its current contents. Compare that hash with the hash of the original file: if they are identical, the file has downloaded successfully.
md5 (native to BSD) is available, but it only works on a local file.
If you are retrieving the remote file via HTTP, there is no way to get its hash without downloading it first (whether to STDOUT or piped to a file, using wget -O- or curl).
If the file host provides a second file containing the MD5 hash of the file being downloaded, you can compare the hash of the local download against the hash published by the provider.
Anything more sophisticated requires writing a fuller program, such as the combination of this question and its accepted answer:
Python Compare local and remote file MD5 Hash
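As a rough illustration of the comparison described above, here is a minimal sketch that hashes a local file in chunks and checks it against a digest published by the provider. The function names are hypothetical, and how you obtain the published digest (for example from a separate .md5 file on the host) is left as an assumption.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a local file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large downloads are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, published_hex: str) -> bool:
    """Compare the local file's digest with the digest supplied by the file host."""
    return md5_of_file(path) == published_hex.strip().lower()
```

A mismatch can mean either an incomplete download or a corrupted one, so a monitor would typically retry the check later before giving up.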
Besides MD5, there is a simple way to do this:
A partially downloaded file usually has a temporary filename and is renamed to its original filename once fully downloaded. You can make your program ignore, or monitor only, certain filename extensions.
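A minimal sketch of that filename filter might look like the following. The suffix list is an assumption based on common browser conventions (for example `.part` and `.crdownload`); adjust it to whatever tool actually performs the downloads.

```python
# Suffixes commonly used for in-progress downloads (assumed list, not exhaustive).
PARTIAL_SUFFIXES = (".part", ".crdownload", ".download", ".tmp")


def looks_complete(filename: str) -> bool:
    """Return True if the name does not carry a known partial-download suffix."""
    return not filename.lower().endswith(PARTIAL_SUFFIXES)


def completed_names(filenames):
    """Filter an iterable of file names down to those that look fully downloaded."""
    return [name for name in filenames if looks_complete(name)]
```

In a directory monitor, the rename from the temporary name to the final name is then the event that triggers the action, since only the final name passes the filter.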

Informatica Post command task

I am working with multiple source files and a single source instance. I created three flat files and one destination table to experiment with multiple sources. I am using the ‘File list’ concept, for which I created a text file containing all the flat file names.
Example:
Filename : File_list.txt
File content : Price1.txt
Price2.txt
Price3.txt
In the above example, Price1.txt, Price2.txt and Price3.txt are flat file names. I specified File_list.txt as the source file when running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now, once the data has been inserted into the destination, I need to delete the source files from that directory.
How can I achieve this?
You'll need to write a custom script that takes File_list.txt as input and performs the delete operations. You can then call it using the Post-Session Success Command session component, or as a separate Command Task in the workflow linked using a $YourSessionName.Status = SUCCEEDED condition.
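One possible shape for such a script is sketched below. `delete_listed_files` is a hypothetical helper, and it assumes the list file holds one file name per line, relative to the source directory, as in the File_list.txt example above.

```python
from pathlib import Path
from typing import List


def delete_listed_files(list_file: str, source_dir: str) -> List[str]:
    """Delete every file named in list_file from source_dir; return the names deleted."""
    deleted = []
    for line in Path(list_file).read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines in the list
        path = Path(source_dir) / name
        if path.is_file():
            path.unlink()
            deleted.append(name)
    return deleted
```

The list file itself is left in place, and names that no longer exist are skipped rather than treated as errors, so the script can be rerun safely after a partial failure.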

Reading files from multiple directories in Logstash?

I read my log files (cron_log, auth_log, mail_log, etc) using this config:
file{
path => '/path/to/log/file/*_log'
}
So I read my log files and check:
if [path] =~ "cron" { ... }   # handle cron log entries
if [path] =~ "auth" { ... }   # handle auth log entries
Now I have directories like Server1, Server2, Server3, and so on. In Server1 there are subdirectories such as authlog and cronlog. Inside authlog there are date-wise subdirectories (like 2014.05.26, 2014.05.27), which finally contain the log file for the day that I have to parse.
Until now I had one config file that read files using *_log; I ran that config file and all log files in /path/to/log/file/*_log were parsed.
Now I have to read from many directories (as explained above).
Will I have to write a separate config file for each directory?
What's the best way to achieve this using Logstash?
Ruby globs interpret ** as including all subdirectories.
So, for example, you could give the file input a path such as:
/path/to/date/folders/**/*_log
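To get a feel for which files such a pattern picks up, the matching can be previewed outside Logstash. The sketch below uses Python's `glob` with `recursive=True`, where `**` likewise matches zero or more directory levels; it only illustrates the pattern and is not how Logstash itself resolves paths. `find_logs` is a hypothetical helper.

```python
import glob
import os


def find_logs(root: str) -> list:
    """Return all files ending in _log at any depth below root."""
    # ** spans any number of nested directories (including none) when recursive=True.
    pattern = os.path.join(root, "**", "*_log")
    return sorted(glob.glob(pattern, recursive=True))
```

With a layout such as Server1/authlog/2014.05.26/auth_log, a single pattern covers every server and date directory, so one config file is enough.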
