[Fluentd] How to unzip files in fluentd - ruby

I am trying to process log files with a .gz extension in fluentd using the cat_sweep plugin, and my attempt failed. As shown in the config below, I am trying to process all files under the /opt/logfiles/* location. However, when the file format is .gz, cat_sweep is unable to process the file and starts deleting it; but if I unzip the file manually inside /opt/logfiles/, cat_sweep is able to process the file.
<source>
@type cat_sweep
file_path_with_glob /opt/logfiles/*
format none
tag raw.log
waiting_seconds 0
remove_after_processing true
processing_file_suffix .processing
error_file_suffix .error
run_interval 5
</source>
So now I need some plugin that can unzip a given file. I searched for plugins that can unzip a zipped file, and came close when I found the in_exec plugin, which acts like a terminal where I can run something like gzip -d file_path.
Link to the plugin:
http://docs.fluentd.org/v0.12/articles/in_exec
But the problem I see here is that I cannot pass the path of the file to be unzipped at run time.
Can someone help me with some pointers?

Looking at your requirement, you can still achieve it using the in_exec module.
What you have to do is simply create a shell script that accepts the path to look for .gz files in and the wildcard pattern to match file names. Inside the shell script you unzip the files found in the folder_path that was passed, matching the given wildcard pattern. Basically, your shell invocation should look like:
sh unzip.sh <folder_path_to_monitor> <wildcard_to_files>
Use the above command in the exec source in your config, which will look like this:
<source>
@type exec
format json
tag unzip.sh
command sh unzip.sh <folder_path_to_monitor> <wildcard_to_files>
run_interval 10s
</source>
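For completeness, a minimal sketch of what unzip.sh could look like (the script name and both arguments are the hypothetical placeholders used above, so adapt them to your layout):
#!/bin/sh
# unzip.sh <folder_path_to_monitor> <wildcard_to_files>
# Decompresses every matching .gz file in place, so the plain-text result
# is picked up by the cat_sweep source on its next sweep.
folder_path="$1"
pattern="$2"
for f in "$folder_path"/$pattern; do
  [ -e "$f" ] || continue       # no match: the glob stayed literal, skip it
  case "$f" in
    *.gz) gzip -d "$f" ;;       # gzip -d removes the .gz and leaves the unzipped file
  esac
done
Note that with this setup the exec source only acts as a periodic trigger for the decompression; the unzipped files are still read by the cat_sweep source.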

Related

Temp file not being deleted

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
# Unzip adapter-trimmed fastq file
rule unzip_fastq:
    input:
        '{sample}.adapterTrim.round2.fastq.gz',
    output:
        temp('{sample}.adapterTrim.round2.fastq')
    conda:
        '../envs/rep_element.yaml'
    shell:
        'gunzip -c {input[0]} > {output[0]}'
# Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
    input:
        '{sample}.adapterTrim.round2.fastq'
    output:
        'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
    params:
        bt2=config["ref"]["bt2_index_path"],
        eid=config["ref"]["enst2id"]
    conda:
        '../envs/rep_element.yaml'
    shell:
        'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
        '{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently listed in the rule all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap the file names in remote().
However, as a general solution, the documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote(), or explicitly wrap the file in local().
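As a concrete sketch of that advice (assuming the per-file S3 remote provider is what uploads your final outputs; the bucket name here is a made-up placeholder):
from snakemake.remote.S3 import RemoteProvider as S3RemoteProvider
S3 = S3RemoteProvider()

rule unzip_fastq:
    input:
        S3.remote('my-bucket/{sample}.adapterTrim.round2.fastq.gz')
    output:
        # temp() only -- deliberately NOT wrapped in S3.remote(), so the
        # intermediate fastq stays local and is deleted after downstream use
        temp('{sample}.adapterTrim.round2.fastq')
    shell:
        'gunzip -c {input[0]} > {output[0]}'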

Dockerfile: copy a list of files, when the list is taken from a local file

I've got a file containing a list of paths that I need to copy with the Dockerfile COPY command on docker build.
My use case is this: I've got a Python requirements.txt file which pulls in multiple other requirements files from around the project, via -r PATH.
Now, I want to docker COPY all the requirements files alone, run pip install, and then copy the rest of the project (for caching and such). So far I haven't managed to do so with the docker COPY command.
I don't need help fetching the paths from the file - I've managed that - just whether this can be done at all, and if so, how?
Thanks!
Not possible in the sense that the COPY directive allows it out of the box; however, if you know the extensions, you can use a wildcard in the path, such as COPY folder*something*name somewhere/.
For simple requirements.txt fetching that could be:
# but you need to distinguish it somehow
# otherwise it'll overwrite the files and keep the last one
# e.g. rename package/requirements.txt to package-requirements.txt
# and it won't be an issue
COPY */requirements.txt ./
RUN for item in $(ls requirement*); do pip install -r $item; done
But if it gets a bit more complex (as in collecting only specific files by some custom pattern, etc.), then no. For that case, simply use templating, either with a plain f-string, the format() function, or Jinja: create a Dockerfile.tmpl (or whatever you'd want to name the template file), collect the paths, insert them into the templated Dockerfile, and once ready dump the result to a file and run docker build on it.
Example:
# Dockerfile.tmpl
FROM alpine
{{replace}}
# organize files into coherent structures so you don't have too many COPY directives
files = {
    "pattern1": [...],  # list of paths matching the first pattern
    "pattern2": [...],  # list of paths matching the second pattern
    # ...
}

# read the template
with open("Dockerfile.tmpl", "r") as file:
    text = file.read()

# build one COPY directive per group of files
insert = "\n".join([
    f"COPY {' '.join(values)} destination/{key}/"
    for key, values in files.items()
])

# write the rendered Dockerfile
with open("Dockerfile", "w") as file:
    file.write(text.replace("{{replace}}", insert))
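With the placeholder patterns above filled in, the rendered Dockerfile would end up looking something like this (the paths are made-up examples):
FROM alpine
COPY api/requirements.txt worker/requirements.txt destination/pattern1/
COPY libs/core/requirements.txt destination/pattern2/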
Alternatively, you might want to do something like this:
FROM ...
ARG files
COPY ${files} ./
and run it with:
docker build --build-arg files="$(cat list_of_files_to_copy.txt)" .

Wanted to use results of the find command in a custom script that I am building

I want to validate my XMLs for well-formedness, but some of my files do not have a single root (which is fine per my business requirements, e.g. <ri>...</ri><ri>..</ri> is valid XML in my context). xmlwf can do this, but it flags a file if it does not have a single root, so I want to build a custom script which internally uses xmlwf. My custom script should:
iterate through the list of files passed as input (e.g. sample.xml or s*.xml or *.xml)
for each file, prepare a temporary file as <A> + contents of file + </A>
and call xmlwf on that temp file.
Can someone help with this?
You could add text to the beginning and end of the file using cat and bash, so that your file has a root added to it for validation purposes.
cat <(echo '<root>') sample.xml <(echo '</root>') | xmlwf
This way you don't need to write temporary files out.
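To cover the "iterate over a list of files" part of the question, here is a minimal sketch that applies the same pipeline to every file passed on the command line (the script name is hypothetical, and the wrapper element is the <root> used above rather than <A>):
#!/bin/bash
# check_xml.sh sample.xml s*.xml ...
# Wraps each file in a synthetic root element and feeds it to xmlwf,
# so multi-root files are not flagged and no temporary files are written.
for f in "$@"; do
  echo "checking: $f"
  cat <(echo '<root>') "$f" <(echo '</root>') | xmlwf
done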

Fail to Download File Using Jmeter

http://blazemeter.com/blog/how-performance-test-upload-and-download-scenarios-apache-jmeter
I've used the above link for reference.
I'm using 'Save Responses to a file' with the download call,
but the file fails to download.
Can anyone tell me what exactly I need to specify in
FileName Prefix:
Variable Name:
and where to specify the download location?
Provide an absolute file path with a file name.
FileName Prefix: c:\workspace\jmeter-download\myfile
For example, if your test downloads a PDF file, you would see 'myfile.pdf' under 'c:\workspace\jmeter-download'.
I've used a path like D:/xyzFolder/filename as the FileName Prefix and it's working fine for me.
If anyone wants the file to be downloaded into the same directory as the .jmx file, just use ~/pre as the FileName Prefix.

Reading files from multiple directories in Logstash?

I read my log files (cron_log, auth_log, mail_log, etc) using this config:
file {
  path => '/path/to/log/file/*_log'
}
So I read my log files and check:
if [path] =~ "cron"  -----match--------
if [path] =~ "auth"  -----match--------
Now I have directories like Server1, Server2, Server3, ... In Server1 there are subdirectories authlog, cronlog, ... Inside authlog there are date-wise subdirectories (like 2014.05.26, 2014.05.27), which finally contain the log file for the day, which I have to parse.
So presently I have one config file which reads files using *_log; I run that config file and all log files present in /path/to/log/file/*_log get parsed.
Now I have to read from many directories (as explained above).
Will I have to write a separate config file for each directory?
What's the best way to achieve this using Logstash?
Ruby globs interpret ** as including all subdirectories.
So, for example, you could give the file input a path such as:
/path/to/date/folders/**/*_log
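For the layout described in the question (Server1/authlog/2014.05.26/...), a sketch of how the file input could look; the path segments here are assumptions based on the question, so adjust the trailing filename pattern to whatever your per-day files are actually called:
input {
  file {
    path => '/path/to/Server*/**/*_log'
  }
}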
