Extracting words from a log file - bash

I am trying to extract job IDs from a log file, and I'm having trouble extracting them in bash. I've tried using sed.
This is what my log file looks like:
> 2018-06-16 02:39:39,331 INFO org.apache.flink.client.cli.CliFrontend
> - Running 'list' command.
> 2018-06-16 02:39:39,641 INFO org.apache.flink.runtime.rest.RestClient
> - Rest client endpoint started.
> 2018-06-16 02:39:39,741 INFO org.apache.flink.client.cli.CliFrontend
> - Waiting for response...
> Waiting for response...
> 2018-06-16 02:39:39,953 INFO org.apache.flink.client.cli.CliFrontend
> - Successfully retrieved list of jobs
> ------------------ Running/Restarting Jobs -------------------
> 15.06.2018 18:49:44 : 1280dfd7b1de4c74cacf9515f371844b : jETTY HTTP Server -> servlet with content decompress -> pull from
> collections -> CSV to Avro encode -> Kafka publish (RUNNING)
> 16.06.2018 02:37:07 : aa7a691fa6c3f1ad619b6c0c4425ba1e : jETTY HTTP Server -> servlet with content decompress -> pull from
> collections -> CSV to Avro encode -> Kafka publish (RUNNING)
> --------------------------------------------------------------
> 2018-06-16 02:39:39,956 INFO org.apache.flink.runtime.rest.RestClient
> - Shutting down rest endpoint.
> 2018-06-16 02:39:39,957 INFO org.apache.flink.runtime.rest.RestClient
> - Rest endpoint shutdown complete.
I am using the following code to extract the lines containing the jobId:
extractRestResponse=`cat logFile.txt`
echo "extractRestResponse: "$extractRestResponse
w1="------------------ Running/Restarting Jobs -------------------"
w2="--------------------------------------------------------------"
extractRunningJobs="sed -e 's/.*'"$w1"'\(.*\)'"$w2"'.*/\1/' <<< $extractRestResponse"
runningJobs=`eval $extractRunningJobs`
echo "running jobs :"$runningJobs
However, this doesn't give me any result. I also notice that all newlines are lost when I print the extractRestResponse variable.
I also tried using this command but it doesn't give me any result:
extractRestResponse="sed -n '/"$w1"/,/"$w2"/{//!p}' logFile.txt"

With sed:
sed -n '/^-* Running\/Restarting Jobs -*/,/^--*/{//!p;}' logFile.txt
Explanations:
Input lines are echoed to standard output by default after the commands have been applied; the -n flag suppresses this behavior.
/^-* Running\/Restarting Jobs -*/,/^--*/: matches the lines from ^-* Running\/Restarting Jobs -* up to ^--* (inclusive)
//!p;: print lines except those matching the addresses
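If only the job IDs themselves are needed, the extracted lines can be piped through grep; a minimal sketch, assuming the IDs are always 32 hexadecimal characters as in the sample log:
sed -n '/^-* Running\/Restarting Jobs -*/,/^--*/{//!p;}' logFile.txt | grep -oE '[0-9a-f]{32}'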

awk to the rescue!
awk '/^-+$/{f=0} f; /^-+ Running\/Restarting Jobs -+$/{f=1}' logfile
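The same one-liner, spelled out with comments to show how the flag f works (functionally identical to the line above):
awk '
  /^-+$/                             { f=0 }  # a line of only dashes closes the block
  f                                           # while the flag is set, print the current line
  /^-+ Running\/Restarting Jobs -+$/ { f=1 }  # the header line opens the block
' logfile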

You could improve your original substitution:
sed -e 's/.*'"$w1"'\(.*\)'"$w2"'.*/\1/' <<< $extractRestResponse
by using # as the delimiter:
sed -n "s#.*$w1\(.*\)$w2.*#\1#p" <<< $extractRestResponse
The output is the text between $w1 and $w2:
> 15.06.2018 18:49:44 : 1280dfd7b1de4c74cacf9515f371844b : jETTY HTTP Server -> servlet with content decompress -> pull from collections -> CSV to Avro encode -> Kafka publish (RUNNING) 16.06.2018 02:37:07 : aa7a691fa6c3f1ad619b6c0c4425ba1e : jETTY HTTP Server -> servlet with content decompress -> pull from collections -> CSV to Avro encode -> Kafka publish (RUNNING)
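As an aside, the newlines are not actually lost from extractRestResponse; they only disappear because the variable is expanded unquoted in the echo, which word-splits the value and rejoins it with spaces. Quoting the expansion preserves them:
echo "extractRestResponse: $extractRestResponse"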

Related

baseDir issue with nextflow

This might be a very basic question for you guys; however, I have just started with nextflow and I am struggling with the simplest example.
I will first explain what I have done and then the problem.
Aim: I aim to make a workflow for my bioinformatics analyses like the one here (https://www.nextflow.io/example4.html)
Background: I have installed all the packages that were needed and they all work from the console without any error.
My run: I have used the same script as in the example, only replacing the directory names. Here is how I have arranged the directories:
location of script
~/raman/nflow/script.nf
location of Fastq files
~/raman/nflow/Data/T4_1.fq.gz
~/raman/nflow/Data/T4_2.fq.gz
Location of transcriptomic file
~/raman/nflow/Genome/trans.fa
The script
#!/usr/bin/env nextflow
/*
* The following pipeline parameters specify the refence genomes
* and read pairs and can be provided as command line options
*/
params.reads = "$baseDir/Data/T4_{1,2}.fq.gz"
params.transcriptome = "$baseDir/HumanGenome/SalmonIndex/gencode.v42.transcripts.fa"
params.outdir = "results"
workflow {
    read_pairs_ch = channel.fromFilePairs( params.reads, checkIfExists: true )
    INDEX(params.transcriptome)
    FASTQC(read_pairs_ch)
    QUANT(INDEX.out, read_pairs_ch)
}
process INDEX {
    tag "$transcriptome.simpleName"
    input:
    path transcriptome
    output:
    path 'index'
    script:
    """
    salmon index --threads $task.cpus -t $transcriptome -i index
    """
}
process FASTQC {
    tag "FASTQC on $sample_id"
    publishDir params.outdir
    input:
    tuple val(sample_id), path(reads)
    output:
    path "fastqc_${sample_id}_logs"
    script:
    """
    fastqc "$sample_id" "$reads"
    """
}
process QUANT {
    tag "$pair_id"
    publishDir params.outdir
    input:
    path index
    tuple val(pair_id), path(reads)
    output:
    path pair_id
    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $index -1 ${reads[0]} -2 ${reads[1]} -o $pair_id
    """
}
Output:
(base) ntr#ser:~/raman/nflow$ nextflow script.nf
N E X T F L O W ~ version 22.10.1
Launching `script.nf` [modest_meninsky] DSL2 - revision: 032a643b56
executor > local (2)
executor > local (2)
[- ] process > INDEX (gencode) -
[28/02cde5] process > FASTQC (FASTQC on T4) [100%] 1 of 1, failed: 1 ✘
[- ] process > QUANT -
Error executing process > 'FASTQC (FASTQC on T4)'
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Command executed:
fastqc "T4" "T4_1.fq.gz T4_2.fq.gz"
Command exit status:
0
Command output:
(empty)
Command error:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
Work dir:
/home/ruby/raman/nflow/work/28/02cde5184f4accf9a05bc2ded29c50
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
I believe I have an issue with my understanding of baseDir. I am assuming that baseDir is the directory that holds my script.nf file. I am not sure what is going wrong or how I can fix it.
Could anyone please help or guide me?
Thank you
Caused by:
Missing output file(s) `fastqc_T4_logs` expected by process `FASTQC (FASTQC on T4)`
Nextflow complains when it can't find the declared output files. This can occur even if the command completes successfully, i.e. with exit status 0. The problem here is that fastqc simply skips files that don't exist or can't be read (e.g. permissions problems), but it does produce these warnings:
Skipping 'T4' which didn't exist, or couldn't be read
Skipping 'T4_1.fq.gz T4_2.fq.gz' which didn't exist, or couldn't be read
The solution is to just make sure all files exist. Note that the fromFilePairs factory method produces a list of files in the second element. Therefore quoting a space-separated pair of filenames is also problematic. All you need is:
script:
"""
fastqc ${reads}
"""

Nextflow: Missing output file(s) expected by process

I'm currently making a start on using Nextflow to develop a bioinformatics pipeline. Below, I've created a params.files variable which contains my FASTQ files, and then fed this into the fasta_files channel.
The process trimming and its script take this channel as the input; ideally, I would then output all the "$sample".trimmed.fq.gz files into the output channel, trimmed_channel. However, when I run this script, I get the following error:
Missing output file(s) `trimmed_files` expected by process `trimming` (1)
The nextflow script I'm trying to run is:
#! /usr/bin/env nextflow
params.files = files("$baseDir/FASTQ/*.fastq.gz")
println "fastq files for trimming:$params.files"
fasta_files = Channel.fromPath(params.files)
println "files in the fasta channel: $fasta_files"
process trimming {
    input:
    file fasta_file from fasta_files
    output:
    path trimmed_files into trimmed_channel
    // the shell script to be run:
    """
    #!/usr/bin/env bash
    mkdir trimming_report
    cd /home/usr/Nextflow
    #Finding and renaming my FASTQ files
    for file in FASTQ/*.fastq.gz; do
        [ -f "\$file" ] || continue
        name=\$(echo "\$file" | awk -F'[/]' '{ print \$2 }') #renaming fastq files.
        sample=\$(echo "\$name" | awk -F'[.]' '{ print \$1 }') #renaming fastq files.
        echo "Found" "\$name" "from:" "\$sample"
        if [ ! -e FASTQ/"\$sample"_trimmed.fq.gz ]; then
            trim_galore -j 8 "\$file" -o FASTQ #trim the files
            mv "\$file"_trimming_report.txt trimming_report #moves to the directory trimming report
        else
            echo ""\$sample".trimmed.fq.gz exists skipping trim galore"
        fi
    done
    trimmed_files="FASTQ/*_trimmed.fq.gz"
    echo \$trimmed_files
    """
}
The script in the process works fine. However, I'm wondering if I'm misunderstanding or missing something obvious. If I've forgotten to include something, please let me know; any help is appreciated!
Nextflow does not export the variable trimmed_files to its own scope unless you tell it to do so using the env output qualifier; however, doing it that way would not be very idiomatic.
Since you know the pattern of your output files ("FASTQ/*_trimmed.fq.gz"), simply pass that pattern as output:
path "FASTQ/*_trimmed.fq.gz" into trimmed_channel
Some things you do, but probably want to avoid:
Changing directory inside your NF process: don't do this, as it entirely breaks the concept of Nextflow's work folder setup.
Writing a bash loop inside an NF process: if you set up your channels correctly, there should only be one task per spawned process.
Pallie has already provided some sound advice and, of course, the right answer, which is: environment variables must be declared using the env qualifier.
However, given your script definition, I think there might be some misunderstanding about how best to skip the execution of previously generated results. The cache directive is enabled by default and when the pipeline is launched with the -resume option, additional attempts to execute a process using the same set of inputs, will cause the process execution to be skipped and will produce the stored data as the actual results.
This example uses Nextflow DSL 2 for my convenience, but it is not strictly required:
nextflow.enable.dsl=2
params.fastq_files = "${baseDir}/FASTQ/*.fastq.gz"
params.publish_dir = "./results"
process trim_galore {
    tag { "${sample}:${fastq_file}" }
    publishDir "${params.publish_dir}/TrimGalore", saveAs: { fn ->
        fn.endsWith('.txt') ? "trimming_reports/${fn}" : fn
    }
    cpus 8
    input:
    tuple val(sample), path(fastq_file)
    output:
    tuple val(sample), path('*_trimmed.fq.gz'), emit: trimmed_fastq_files
    path "${fastq_file}_trimming_report.txt", emit: trimming_report
    """
    trim_galore \\
        -j ${task.cpus} \\
        "${fastq_file}"
    """
}
workflow {
    Channel.fromPath( params.fastq_files )
        | map { tuple( it.getSimpleName(), it ) }
        | set { sample_fastq_files }
    results = trim_galore( sample_fastq_files )
    results.trimmed_fastq_files.view()
}
Run using:
nextflow run script.nf \
-ansi-log false \
--fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'
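As described above, the cache directive together with -resume is what lets previously completed tasks be skipped on subsequent runs; for example, re-launching the same pipeline with:
nextflow run script.nf \
    -resume \
    -ansi-log false \
    --fastq_files '/home/usr/Nextflow/FASTQ/*.fastq.gz'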

Shell 'echo' error with ENV VAR being parsed from 'redis-cli INFO' output

I am trying to monitor Redis health with the Zabbix agent and want to parse the output of the redis-cli INFO command:
$ redis-cli INFO
# Server
redis_version:4.0.9
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:9435c3c2879311f3
...
with a shell script and compose a JSON file with a subset of the INFO values:
{
"redis_version": "4.0.9",
"used_memory": 100,
"used_memory_rss": 200
}
But I get an error when I try to echo the parsed values to the console. This is my demo script:
#!/bin/bash
FILE_ECHO="/tmp/test_out.txt"
FILE_REDIS="/tmp/test_redis.txt"
echo "redis_version:4.0.9" >${FILE_ECHO}
/usr/bin/redis-cli INFO >${FILE_REDIS}
ECHO_VERSION=$(grep "redis_version" "${FILE_ECHO}")
REDIS_VERSION=$(grep "redis_version" "${FILE_REDIS}")
echo "{\"node\": \"${ECHO_VERSION}\"}"
echo "{\"node\": \"${REDIS_VERSION}\"}"
I expect that the output will be:
{"node": "redis_version:4.0.9"}
{"node": "redis_version:4.0.9"}
in both cases, but actually I have:
{"node": "redis_version:4.0.9"}
"}node": "redis_version:4.0.9
This is my test_redis.txt.
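The mangled second line is what echo produces when the grepped value ends in a carriage return: redis-cli INFO replies use CRLF line endings, so the trailing \r returns the cursor to the start of the line and the closing "} overwrites the first two characters. A minimal sketch of stripping it, assuming that is indeed the cause here:
REDIS_VERSION=$(grep "redis_version" "${FILE_REDIS}" | tr -d '\r')
echo "{\"node\": \"${REDIS_VERSION}\"}"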

How do I add a header to a message using the header-enricher Spring Cloud Stream App Starter?

I'm trying to add a header with a key of "order_id" and a value based on a property in the payload to my messages. I then send the result to a log sink where I can inspect the headers after the header processor. Here's the stream:
stream create --name add-header-to-message-stream
--definition
":aptly-named-destination
> add-order_id-header: header-enricher
--header.enricher.headers='order_id=payload.order.id \\n fizz=\"buzz\"'
| log
--log.expression=headers"
I do not see keys of "order_id" or "fizz" in the headers map when I tail the log sink. I'm able to deploy the stream and run data through the pipeline with no errors. How do I add headers to my messages?
This works fine for me, but only with a single header...
dataflow:>stream create foo --definition "time --fixedDelay=5 |
header-enricher --headers='foo=payload.substring(0, 1)' |
log --expression=#root " --deploy
With the result:
2017-06-21 08:28:38.459 INFO 70268 --- [-enricher.foo-1] log-sink : GenericMessage [payload=06/21/17 08:28:38, headers={amqp_receivedDeliveryMode=PERSISTENT, amqp_receivedRoutingKey=foo.header-enricher, amqp_receivedExchange=foo.header-enricher, amqp_deliveryTag=1, foo=0, amqp_consumerQueue=foo.header-enricher.foo, amqp_redelivered=false, id=302f1d5b-ba90
I am told that this...
--headers='foo=payload.substring(0, 1) \n bar=payload.substring(1,2)'
...or this...
--headers='foo=payload.substring(0, 1) \u000a bar=payload.substring(1,2)'
should work, but I get a parse error...
Cannot find terminating ' for string time --fixedDelay=5 | header-enricher --headers='foo=payload.substring(0, 1)
bar=payload.substring(1,2)' | log --expression=#root
...I am reaching out to the shell/deployer devs and will provide an update if I have one.
I tested with a literal value (single header) too...
dataflow:>stream create foo --definition "time --fixedDelay=5 |
header-enricher --headers='foo=\"bar\"' |
log --expression=#root " --deploy
2017-06-21 08:38:17.684 INFO 70916 --- [-enricher.foo-1] log-sink : GenericMessage [payload=06/21/17 08:38:17, headers={amqp_receivedDeliveryMode=PERSISTENT, amqp_receivedRoutingKey=foo.header-enricher, amqp_receivedExchange=foo.header-enricher, amqp_deliveryTag=8, foo=bar, amqp_consumerQueue=foo.header-enricher.foo, amqp_redelivered=false, id=a92f4908-af13-53aa-205d-e25e204d04a3, amqp_consumerTag=amq.ctag-X51lhhRWBbEDVSyzp3rGmg, contentType=text/plain, timestamp=1498048697684}]
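Given that a single header works, the original stream reduced to one header might look like this (a sketch combining the question's definition with the working --headers form above, untested):
stream create --name add-header-to-message-stream --definition ":aptly-named-destination > header-enricher --headers='order_id=payload.order.id' | log --expression=headers" --deploy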

PIG Streaming: _some_ output files are missing

The problem can be reproduced using a simple test.
The "pig" script is as follows:
SET pig.noSplitCombination true;
dataIn = LOAD 'input/Test';
DEFINE macro `TestScript` input('DummyInput.txt') output('A.csv', 'B.csv', 'C.csv', 'D.csv', 'E.csv') ship('TestScript');
dataOut = STREAM dataIn through macro;
STORE dataOut INTO 'output/Test';
The actual script is a complex R program but here is a simple "TestScript" that reproduces the problem and doesn't require R:
# Ignore the input coming from the 'DummyInput.txt' file
# For now just create some output data files
echo "File A" > A.csv
echo "File B" > B.csv
echo "File C" > C.csv
echo "File D" > D.csv
echo "File E" > E.csv
The input 'DummyInput.txt' is some dummy data for now.
Record1
Record2
Record3
For the test, I've loaded the dummy data into HDFS using the following script. This will result in 200 input files.
for i in {0..199}
do
hadoop fs -put DummyInput.txt input/Test/Input$i.txt
done
When I run the pig job, it runs without errors, and 200 mappers run as expected. However, I expect to see 200 files in each of the HDFS output directories. Instead, I find that a number of the output files are missing:
1 200 1400 output/Test/B.csv
1 200 1400 output/Test/C.csv
1 189 1295 output/Test/D.csv
1 159 1078 output/Test/E.csv
The root "output/Test" has 200 files, which is correct. Folders "B.csv" and "C.csv" have 200 files as well. However, folders "D.csv" and "E.csv" have missing files.
We have looked at the logs but can't find anything that points to why the local output files are not being copied from the data nodes to HDFS.

Resources