Grep does not always find the correct value from a file - shell

I am trying to extract two #define values from a C header file to use them in a shell script. So I use grep to find them and then print them. However, the variables are sometimes empty.
// main.h
#define DEVICE_NO 1
#define FW_VERSION 1
And the script file is
#!/bin/bash -
read_version()
{
echo $(grep $1 "$projectdir/Inc/main.h" | cut -d ' ' -f 3-)
}
device_no=$(read_version "DEVICE_NO")
fw_version=$(read_version "FW_VERSION")
echo "DEVICE_NO = $device_no, FW_VERSION = $fw_version"
So the expectation is that the output would be:
DEVICE_NO = 1, FW_VERSION = 1
but sometimes it turns out to be
5
DEVICE_NO = , FW_VERSION = 1
It randomly misses one or both of the values. The header file does not change so it's not coming from there.
UPDATE
As suggested in the comments, I thought the Windows line endings might be the problem, so I piped the output to tr and removed \r, but it did not make any difference. I also tried var=$(grep FW_VERSION file); $(echo ${var//[$'\t\r\n']} | cut ... to no avail.
I tried using awk instead of cut but got the same result.
I redirected stderr inside the command to standard output ($(grep $1 file | cut -d ' ' -f 3 2>&1)) but did not get any extra information.
I split the command into a grep part and a cut part; the grep never misses, but the cut randomly gives an empty string as output.
I still have no idea where that 5 is coming from; there is nothing in the cut or awk manuals that would write a 5 to either standard output or stderr.
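A minimal single-pass sketch that sidesteps the echo/cut word-splitting, assuming the header lines look exactly like the snippet above (and allowing for a stray carriage return):
read_version()
{
# Print the third field of the matching #define line, dropping any trailing \r.
awk -v name="$1" '$1 == "#define" && $2 == name { sub(/\r$/, "", $3); print $3 }' "$projectdir/Inc/main.h"
}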

Related

Parsing CSV records when a value is multiline

Source file looks like this:
"google.com", "vuln_example1
vuln_example2
vuln_example3"
"facebook.com", "vuln_example2"
"reddit.com", "stupidly_long_vuln_name1"
"stackoverflow.com", ""
I've been trying to get the output to be something like this, but the line breaks seem to cause me no end of problems. I'm using a "while read line" job to do this because I do some processing on the columns (e.g. vulnerability count and URL in this example). This is output into a Jenkins job (yuk).
The basic summary of the problem is getting the linebreaks in the csv to be output into the third column while retaining the table structure. I've got a sort of weird example of the desired output below.
||hostname ||Vulnerability count|| Vulnerability list || URL ||
|google.com |3 |vuln_example1 |http://cve.com/vuln_example1|
| | |vuln_example2 |http://cve.com/vuln_example2|
| | |vuln_example3 |http://cve.com/vuln_example3|
|facebook.com |1 |vuln_example2 |http://cve.com/vuln_example2|
|reddit.com |1 |stupidly_long_vuln_name1 |http://cve.com/stupidly_long_vuln_name1|
|stackoverflow.com |0 | ||
Looking at this... I've got a feeling it might be easier to explain by showing some code and example output.
Parsing your input with the command line below makes the problem easier (I'm assuming the inputs are correct):
perl -0777 -pe 's/([^"])\s*\n/\1 /g ; s/[",]//g' < sample.txt
This line invokes Perl to perform two regex substitutions:
s/([^"])\s*\n/\1 /g: This substitution removes an end of line if it doesn't terminate by a quote " (i.e. if a host entry, with all vulnerabilities isn't yet complete).
s/[",]//g removes all quotes and commas remaining.
For each host entry like this one:
"google.com", "vuln_example1
vuln_example2
vuln_example3"
You'll get:
google.com vuln_example1 vuln_example2 vuln_example3
Then you can assume that each line holds a host followed by its set of vulnerabilities.
The example below stores the vulnerabilities in an array and loops through it, formatting and printing each row:
# Replace this by your custom function
# to get an URL for a given vulnerability
function get_vuln_url () {
# This just displays a random url for an non-empty arg
[[ -z "$1" ]] || echo "http://host/$1.htm"
}
# Format your line (see printf help)
function print_row () {
printf "%-20s|%5s|%-30s|%s\n" "$#"
}
# The perl line reformat
perl -0777 -pe 's/([^"])\s*\n/\1 /g ; s/[",]//g' < sample.txt |
while read -r line ; do
arr=(${line})
print_row "${arr[0]}" "$((${#arr[#]} - 1))" "${arr[1]}" "$(get_vuln_url ${arr[1]})"
#echo -e "${arr[0]}\t|$vul_count\t|${arr[1]}\t|$(get_vuln_url ${arr[1]})"
for v in "${arr[@]:2}" ; do
print_row " " " " "$v" "$(get_vuln_url ${arr[1]})"
done
done
Output:
google.com          |    3|vuln_example1                 |http://host/vuln_example1.htm
                    |     |vuln_example2                 |http://host/vuln_example2.htm
                    |     |vuln_example3                 |http://host/vuln_example3.htm
facebook.com        |    1|vuln_example2                 |http://host/vuln_example2.htm
reddit.com          |    1|stupidly_long_vuln_name1      |http://host/stupidly_long_vuln_name1.htm
stackoverflow.com   |    0|                              |
Update.
If you don't have Perl, and if your file doesn't contain tabs, you can use this command as a workaround instead:
tr '\n' '\t' < sample.txt | sed -r -e 's/([^"])\s*\t/\1 /g' -e 's/[",]//g' -e 's/\t/\n/g'
tr '\n' '\t' replaces every line ending with a tab.
The sed part acts like the Perl line, except that it works on tabs instead of line endings and then restores the remaining tabs back to line endings.
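Run on the sample file above, both variants should produce the same reformatted stream, roughly:
google.com vuln_example1 vuln_example2 vuln_example3
facebook.com vuln_example2
reddit.com stupidly_long_vuln_name1
stackoverflow.com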

Unix bash - using cut to regex lines in a file, match regex result with another similar line

I have a text file, file.txt, with several thousand lines. It contains a lot of junk lines which I am not interested in, so I use the cut command to regex for the lines I am interested in first. Each entry I am interested in is listed twice in the text file: once in a "definition" section and once in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there find its corresponding "value" section entry.
The first entry starts with ' gl_ ', while the 2nd entry would look like ' "gl_ ', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
do
if [[ $line == gl_* ]] ; then (param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
do
if [[ $glline == '"'$param* ]] ; then val=$(cut -d'\' -f 3 $glline) |
"$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This seems to throw some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed, and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
some\junk
"gl_one\1\Value1
some\junk
"gl_two\1\Value2
"gl_three\1\Value3
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', and then stores that value (i.e. gl_one) in the variable 'param'.
It then starts the nested while loop that looks for the line that starts with a ' " ' in front of the gl_, and is equivalent to the 'param' value. In other words, the
script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter, to save them together in a .csv file with their corresponding "gl_ values.
Wanted regex output stored in variables would be something like this:
first while loop:
$param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop:
$val = Value1
Then it stores these variables to the file.csv, with semi-colon separators.
Currently, I have an error at the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this, I am looking for general ideas and comments on the script, i.e. I am not entirely sure I am matching the quoted "gl_ parameters correctly, or whether the semicolons as .csv separators are added correctly.
Edit: Overall, the script runs now, but extremely slow due to the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file?
Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_',
then sed to remove the leading '"' from the lines that contain one [I have assumed there are no further '"' in the line].
The lines are sorted,
sed joins each pair of lines by removing the newline between them,
and awk then prints the required columns according to your requirements.
Output is routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
/^gl_/{ # if definition
N; # append next line to buffer
s/\n"gl_[^\\]*//; # if value, strip first column
t; # and start next loop
}
D; # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }' \
>>/filepath/file.csv
sort lines so gl_... appears immediately before "gl_... (LANG=C pins the collation order) - assumes each definition appears before its value
sed to help ensure matching definition and value (may still fail if duplicate/missing value), and tidy for awk
awk to pull out relevant fields
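For the sample data above, the intermediate stream after the sort and sed stages should look roughly like this (a sketch, not verified against the real file):
gl_one\User Defined\1\String\1\\1\Some Text\1\Value1
gl_three\User Defined\1\Time\1\\1\Datetime now\1\Value3
gl_two\User Defined\1\String\1\\1\Some Text also\1\Value2
and the awk stage then emits rows such as $project;gl_one;Value1;User Defined;String;Some Text.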

Alternating output in bash for loop from two grep

I'm trying to search through files and extract two pieces of relevant information every time they appear in the file. The code I currently have:
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
echo $samples $reads >> reads.txt
done
It is collecting every matching line per file (the files have varying numbers of instances of these phrases) and gives me one row of output per file:
PopA_15.fq 1081264
PopA_16.fq PopA_17.fq 1008416 554791
PopA_18.fq PopA_20.fq PopA_21.fq 604610 531227 595129
...
I want it to match each instance (i.e. the 1st instance of both greps next to each other):
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
...
How do I do this? Thank you
Assuming that your Input_file is the same as the sample shown, with an even number of columns on each line, the first half PopA values and the second half the digit values, the following awk may help.
awk '{for(i=1;i<=(NF/2);i++){print $i,$((NF/2)+i)}}' Input_file
Output will be as follows.
PopA_15.fq 1081264
PopA_16.fq 1008416
PopA_17.fq 554791
PopA_18.fq 604610
PopA_20.fq 531227
PopA_21.fq 595129
In case you want to pass the output of a command to awk, you could do your command | awk '...'; there is then no need to add Input_file to the awk command.
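For example (a sketch; some_command stands for whatever pipeline produces the paired columns):
some_command | awk '{for(i=1;i<=(NF/2);i++){print $i,$((NF/2)+i)}}'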
This is what ended up working for me...any tips for more efficient code are definitely welcome
#!/bin/bash
echo "Utilized reads from ustacks output" > reads.txt
str1="utilized reads:"
str2="Parsing"
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
reads=$(grep $str1 $file | cut -d ':' -f 3)
samples=$(grep $str2 $file | cut -d '/' -f 8)
paste <(echo "$samples" | column -t) <(echo "$reads" | column -t) >> reads.txt
done
This provides the desired output described above.
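A possible single-pass variant (untested sketch, assuming each "Parsing" line precedes its matching "utilized reads:" line and that the field positions match the cut commands above):
for file in /home/desaixmg/novogene/stacks/sample01/conda_ustacks.o*; do
    awk '/Parsing/         { split($0, a, "/"); sample = a[8] }
         /utilized reads:/ { split($0, b, ":"); print sample, b[3] }' "$file"
done >> reads.txt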

sh random error when saving function output

I'm trying to make a script that takes a .txt file containing lines like:
davda103:David:Davidsson:800104-1234:TNCCC_1:TDDB46 TDDB80:
and then sort them, etc. That's just the background; my problem lies here:
#!/bin/sh -x
cat $1 |
while read a
do
testsak = `echo $a | cut -f 1 -d :`; <---**
echo $testsak;
done
Where the arrow is, when I try to run this code I get some kind of weird error.
+ read a
+ cut -f+ echo 1 -d :davda103:David:Davidsson:800104-1234:TNCCC_1:TDDB46
TDDB80:
+ testsak = davda103
scriptTest.sh: testsak: Det går inte att hitta
+ echo
(I have my Linux in Swedish because of school -.-) Anyway, that error just says that it can't find... something. Any ideas what could be causing my problem?
You have extra spaces around the assignment operator, remove them:
testsak=`echo $a | cut -f 1 -d :`; <---**
The spaces around the equal sign
testsak = `echo $a | cut -f 1 -d :`; <---**
causes bash to interpret this as a command testsak with arguments = and the result of the command substitution. Removing the spaces will fix the immediate error.
A much more efficient way to extract the value from $a is to let read do it (and use input redirection instead of cat):
while IFS=: read -r testsak the_rest; do
echo $testsak
done < "$1"

Why is this command within my code giving different result than the same command in terminal?

**Edit: Okay, so I've tried implementing everyone's advice so far.
-I've added quotes around each variable "$1" and "$codon" to avoid whitespace.
-I've added the -ioc flag to grep to avoid caps.
-I tried using tr -d' ', however that leads to a runtime error because it says -d' ' is an invalid option.
Unfortunately I am still seeing the same problem. Or a different problem, which is that it tells me that every codon appears exactly once. Which is a different kind of wrong.
Thanks for everything so far - I'm still open to new ideas. I've updated my code below.**
I have this bash script that is supposed to count all permutations of (A C G T) in a given file.
One line of the script is not giving me the desired result and I don't know why - especially because I can enter the exact same line of code in the command prompt and get the desired result.
The line, executed in the command prompt, is:
cat dnafile | grep -o GCT | wc -l
This line tells me how many times the regular expression "GCT" appears in the file dnafile. When I run this command the result I get is 10 (which is accurate).
In the code itself, I run a modified version of the same command:
cat $1 | grep -o $codon | wc -l
Where $1 is the file name, and $codon is the 3-letter combination. When I run this from within the program, the answer I get is ALWAYS 0 (which is decidedly not accurate).
I was hoping one of you fine gents could enlighten this lost soul as to why this is not working as expected.
Thank you very, very much!
My code:
#!/bin/bash
#countcodons <dnafile> counts occurrences of each codon in the sequence contained within <dnafile>
if [[ $# != 1 ]]
then echo "Format is: countcodons <dnafile>"
exit
fi
nucleos=(a c g t)
allCods=()
#mix and match nucleotides to create all codons
for x in {0..3}
do
for y in {0..3}
do
for z in {0..3}
do
perm=${nucleos[$x]}${nucleos[$y]}${nucleos[$z]}
allCods=("${allCods[@]}" "$perm")
done
done
done
#for each codon, use grep to count # of occurrences in file
len=${#allCods[*]}
for (( n=0; n<len; n++ ))
do
codon=${allCods[$n]}
occs=`cat "$1" | grep -ioc "$codon" | wc -l`
echo "$codon appears: $occs"
# if (( $occs > 0 ))
# then
# echo "$codon : $occs"
# fi
done
exit
You're generating your sequences in lowercase. Your code greps for gct, not GCT. You want to add the -i switch to grep. Try:
occs=`grep -ioc $codon $1`
You've got your logic backwards - you shouldn't have to read your input file once for every codon, you should only have to read it once and check each line for every codon.
You didn't supply any sample input or expected output so it's untested but something like this is the right approach:
awk '
BEGIN {
nucleosStr="a c g t"
split(nucleosStr,nucleos)
#mix and match nucleotides to create all codons
for (x in nucleos) {
for (y in nucleos) {
for (z in nucleos) {
perm = nucleos[x] nucleos[y] nucleos[z]
allCodsStr = allCodsStr (allCodsStr?" ":"") perm
}
}
}
split(allCodsStr,allCods)
}
{
#for each codon, count # of occurances in file
for (n in allCods) {
codon = allCods[n]
if ( tolower($0) ~ codon ) {
occs[n]++
}
}
}
END {
for (n in allCods) {
printf "%s appears: %d\n", allCods[n], occs[n]
}
}
' "$1"
I expect you'll see a huge performance improvement with that approach if your file is moderately large.
Try:
occs=`cat $1 | grep -o $codon | wc -l | tr -d ' '`
The problem is that wc indents the output, so $occs has a bunch of spaces at the beginning.
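Combining the two answers (case-insensitive matching plus counting the -o matches and stripping the wc padding) would give something like this sketch:
occs=`grep -io "$codon" "$1" | wc -l | tr -d ' '`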
