Output the value/word after one pattern has been found in a string in a variable (grep, awk, sed, perl etc.) - bash

I have a program that prints data to the console like so (separated by spaces):
variable1 value1
variable2 value2
variable3 value3
variable4 value4
EDIT: Actually the output can look like this:
data[variable1]: value1
pre[variable2] value2
variable3: value3
flag[variable4] value4
In the end I want to search for a part of the name, e.g. for variable2 or variable3, but only get value2 or value3 as output.
EDIT: This single value should then be stored in a variable for further processing within the bash script.
I first tried to put all the console output into a file and process it from there with e.g.
# value3_var="$(grep "variable3" file.log | cut -d " " -f2)"
This works fine but is too slow: I need to process ~20 of these variables per run, which takes ~1-2 seconds on my system, and I need to do this for ~500 runs. EDIT: I do not need to process all of the ~20 'searches' with one call of e.g. awk. If there is a way to do it automatically, that's fine, but ~20 separate calls in the bash script are fine here too.
Therefore I thought about putting the console output directly into a variable to avoid the slow file access. But the newlines then get lost (at least when the variable is expanded unquoted, as below), which again makes the output harder to process:
# console_output=$(./programm_call)
# echo $console_output
variable1 value1 variable2 value2 variable3 value3 variable4 value4
EDIT: It actually looks like this:
# console_output=$(./programm_call)
# echo $console_output
data[variable1]: value1 pre[variable2] value2 variable3: value3 flag[variable4] value4
I found solutions for this kind of string arrangement, but they seem to work only with a text file. At least I was not able to use the string stored in $console_output with these examples:
How to print the next word after a found pattern with grep,sed and awk?
So, how can I output the next word after a found pattern, when providing a (long) string as variable?
PS: grep on my system does not support the -P option...

I'd suggest using awk:
$ cat ip.txt
data[variable1]: value1
pre[variable2] value2
variable3: value3
flag[variable4] value4
$ cat var_list
variable1
variable3
$ awk 'NR==FNR{a[$1]; next}
{for(k in a) if(index($1, k)) print $2}' var_list ip.txt
value1
value3
To use the output of another command as the input, use ./programm_call | awk '...' var_list - where the trailing - tells awk to read stdin as the second input.
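For example, combining the pipe with the script above (assuming ./programm_call prints the four sample lines from the question):
$ ./programm_call | awk 'NR==FNR{a[$1]; next}
{for(k in a) if(index($1, k)) print $2}' var_list -
value1
value3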
This single value should then be stored in a variable for further processing within the bash script.
If you are doing further text processing, you could do it within awk and thus avoid a possible slower bash loop. See Why is using a shell loop to process text considered bad practice? for details.
Speed up suggestions:
Use LC_ALL=C awk '..' if input is ASCII (Note that as pointed out in comments, this doesn't apply for all cases, so you'll have to test it for your use case)
Use mawk if available, that is usually faster. GNU awk may still be faster for some cases, so again, you'll have to test it for your use case
Use ripgrep, which is usually faster than other grep programs.
$ ./programm_call | rg -No -m1 'variable1\S*\s+(\S+)' -r '$1'
value1
$ ./programm_call | rg -No -m1 'variable3\S*\s+(\S+)' -r '$1'
value3
Here, -o option is used to get only the matched portion. -r is used to get only the required text by replacing the matched portion with the value from the capture group. -m1 option is used to stop searching input once the first match is found. -N is used to disable line number prefix.
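Since the goal is to store each value in a shell variable for further processing, wrap whichever pipeline you choose in command substitution, for example:
$ value1_var="$(./programm_call | rg -No -m1 'variable1\S*\s+(\S+)' -r '$1')"
$ echo "$value1_var"
value1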

Exit after the first grep match, like so:
value3_var="$(grep -m1 "variable3" file.log | cut -d " " -f2)"
Or use Perl, also exiting after the first match. This eliminates the need for a pipe to another process:
value3_var="$(perl -le 'print $1, last if /^variable3\s+(.*)/' file.log)"

If I'm understanding your requirements correctly, how about feeding the output of programm_call directly to the awk script instead of assigning it to a shell variable:
./programm_call | awk '
# the following block is invoked line by line of the input
{
a[$1] = $2
}
# the following block is executed after all lines are read
END {
# please modify the print statement depending on your required output format
print "variable1 = " a["variable1"]
print "variable3 = " a["variable3"]
}'
Output:
variable1 = value1
variable3 = value3
As you see, the script can process all (~20) variables at once.
[UPDATE]
Assumptions including the provided information:
The ./program_call prints approx. 50 pairs of "variable value"
variable and value are delimited by blank character(s)
variable may be enclosed with [ and ]
variable may be followed by :
We are interested in up to 20 variables out of the ~50 pairs
We use just one of those variables at a time
We don't want to invoke ./program_call every time we access a single variable
We want to access the variable values from within the bash script
We may use an associative array to fetch a value via its variable name
Then it will be convenient to read the variable-value pairs directly within the bash script:
#!/bin/bash
declare -A hash # declare an associative array
while read -r key val; do # read key (variable name) and value
key=${key#*[} # remove leading "[" and the characters before it
key=${key%:} # remove trailing ":"
key=${key%]} # remove trailing "]"
hash["$key"]="$val" # store the key and value pair
done < <(./program_call) # feed the output of "./program_call" to the loop
# then you can access the values via the variable name here
foo="${hash["variable2"]}" # the variable "foo" is assigned to "value2"
# do something here
bar="${hash["variable3"]}" # the variable "bar" is assigned to "value3"
# do something here
Some people criticize bash as too slow for processing text lines, but we process just about 50 lines in this case. I tested a simulation: generating 50 lines, processing the output with the script above, and repeating the whole process 1,000 times. It completed within a few seconds, meaning one batch ends within a few milliseconds.
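For reference, a minimal version of that timing test (simulate is a hypothetical stand-in for ./program_call):
#!/bin/bash
# hypothetical stand-in for ./program_call: print 50 "variable value" pairs
simulate() { for i in {1..50}; do echo "variable$i value$i"; done; }
time for run in {1..1000}; do
    declare -A hash=()            # reset the array for each batch
    while read -r key val; do
        hash["$key"]=$val
    done < <(simulate)
done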

This is how to do the job efficiently AND robustly (your approach, and all of the other current answers, will produce false matches for some inputs and some values of the variables you want to search for):
$ cat tst.sh
#!/usr/bin/env bash
vars='variable2 variable3'
awk -v vars="$vars" '
BEGIN {
split(vars,tmp)
for (i in tmp) {
tags[tmp[i]":"]
tags["["tmp[i]"]"]
tags["["tmp[i]"]:"]
}
}
$1 in tags || ( (s=index($1,"[")) && (substr($1,s) in tags) ) {
print $2
}
' "${#:--}"
$ ./tst.sh file
value2
value3
$ cat file | ./tst.sh
value2
value3
Note that the only loop is in the BEGIN section, where it populates a hash table (tags[]) with the strings from the input that could match your variable list. While processing the input the script doesn't have to loop; it just does a hash lookup of the current $1, which is very efficient as well as robust (e.g. it will not fail on partial matches or on regexp metacharacters).
As shown, it'll work whether the input is coming from a file or a pipe. If that's not all you need then edit your question to clarify your requirements and improve your example to show a case where this does not do what you want.
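As a contrived illustration of the false-match risk mentioned above: the index()-based lookup in the first answer also matches any name that merely contains the search string (the variable33 line below is hypothetical):
$ printf 'variable3: value3\npre[variable33] value33\n' |
awk 'NR==FNR{a[$1]; next}
{for(k in a) if(index($1, k)) print $2}' <(echo variable3) -
value3
value33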

Related

Unix bash - using cut to regex lines in a file, match regex result with another similar line

I have a text file, file.txt, with several thousand lines. It contains a lot of junk lines I am not interested in, so I select the lines I am interested in first. Each entry I am interested in is listed twice in the text file: once in a "definition" section, and once in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there find its corresponding "value" section entry.
The first entry starts with ' gl_ ', while the 2nd entry would look like ' "gl_ ', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
do
if [[ $line == gl_* ]] ; then (param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
do
if [[ $glline == '"'$param* ]] ; then val=$(cut -d'\' -f 3 $glline) |
"$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This seems to throw some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed, and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
some\junk
"gl_one\1\Value1
some\junk
"gl_two\1\Value2
"gl_three\1\Value3
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', and stores that value (i.e. gl_one) in the variable 'param'. It then starts the nested while loop that looks for the line with a '"' in front of the gl_ that is equivalent to the 'param' value. In other words, the script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter, to save them together in a .csv file with their corresponding "gl_ values.
Wanted regex output stored in variables would be something like this:
first while loop:
$param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop:
$val = Value1
Then it stores these variables to the file.csv, with semi-colon separators.
Currently, I have an error for the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this, I am looking for general ideas and comments on the script, i.e. I am not entirely sure I am matching the quotation-mark parameters "gl_ correctly, or whether the semicolons as .csv separators are added correctly.
Edit: Overall, the script runs now, but extremely slow due to the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file?
Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_',
then sed to remove the leading '"' from the lines that contain one (I have assumed there are no further '"' in the lines).
The lines are sorted,
then sed joins each pair of lines, replacing the newline between them with a backslash,
and awk prints the required columns according to your requirements.
The output is routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
/^gl_/{ # if definition
N; # append next line to buffer
s/\n"gl_[^\\]*//; # if value, strip first column
t; # and start next loop
}
D; # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }' \
>>/filepath/file.csv
sort the lines so gl_... appears immediately before "gl_... (LANG=C forces byte-wise collation) - this assumes each definition appears before its value
sed helps ensure that definition and value match (it may still fail on duplicate/missing values) and tidies the data for awk
awk pulls out the relevant fields
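If the sort-based pairing proves fragile (e.g. duplicate or missing value lines), a single awk pass can also pair definitions with values directly. A sketch under the same field assumptions as above (value in field 3 of the "gl_ lines; definition fields 2, 4 and 8):
awk -F'\\' -v p="$project" -v OFS=';' '
    /^gl_/  { def[$1] = $0 }           # remember each definition line by name
    /^"gl_/ { name = substr($1, 2)     # strip the leading double quote
              if (name in def) {
                  split(def[name], d, "\\")
                  print p, name, $3, d[2], d[4], d[8]
              }
            }' file.txt >> /filepath/file.csv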

how to find the position of a string in a file in unix shell script

Can you please help me solve this puzzle? I am trying to print the location of a string (i.e., its line #) in a file, first to the std output, and then capture that value in a variable to be used later. The string is "my string", the file name is "myFile", which is defined as follows:
this is first line
this is second line
this is my string on the third line
this is fourth line
the end
Now, when I use this command directly at the command prompt:
% awk 's=index($0, "my string") { print "line=" NR, "position= " s}' myFile
I get exactly the result I want:
% line= 3, position= 9
My question is: if I define a variable VAR="my string", why can't I get the same result when I do this:
% awk 's=index($0, $VAR) { print "line=" NR, "position= " s}' myFile
It just won't work! I even tried putting the $VAR in quotation marks, to no avail. I tried using VAR (without the $ sign), no luck. I tried everything I could possibly think of... Am I missing something?
awk variables are not the same as shell variables. You need to define them with the -v flag.
For example:
$ awk -v var="..." '$0~var{print NR}' file
will print the line number(s) of the pattern matches. Or, for your case with index():
$ awk -v var="$Var" 'p=index($0,var){print NR,p}' file
Using all uppercase names may not be a good convention, since you may accidentally overwrite other variables.
To capture the output into a shell variable:
$ info=$(awk ...)
For multi-line output assigned to a shell array, you can do
$ values=( $(awk ...) ); echo ${values[0]}
however, if the output contains more than one field, each field will be assigned its own array index. You can change this by setting the IFS variable, such as
$ IFS=$(echo -en "\n\b"); values=( $(awk ...) )
which will capture the complete lines as the array values.
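On bash 4 and later, mapfile (a.k.a. readarray) captures complete lines into an array without any IFS juggling:
$ mapfile -t values < <(awk ...)
$ echo "${values[0]}"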

Bash Columns SED and BASH Commands without AWK?

I wrote two different scripts, but I am stuck on the same problem.
I am making a table from a file ($2) that I get in the args, and $1 is the number of columns. It is a little bit hard to explain, so I am going to show you the input and output.
The problem is that I don't know how I can save every column in a different variable, so I can build it into my HTML code later
#printf #TR##TD#$...#/TD##TD#$...#/TD##TD#$..#/TD##/TR##TD#$...
So the input looks like this:
Name\tSize\tType\tprobe
bla\t4711\tfile\t888888888
abcde\t4096\tdirectory\t5555
eeeee\t333333\tblock\t6666
aaaaaa\t111111\tpackage\t7777
sssss\t44444\tfile\t8888
bbbbb\t22222\tfolder\t9999
Code:
c=1
column=$1
file=$2
echo "$( < $file)"| while read Line ; do
Name=$(sed "s/\\\t/ /g" $file | cut -d' ' -f$c,-$column)
printf "$Name \n"
#let c=c+1
#printf "<TR><TD>$Name</TD><TD>$Size</TD><TD>$Type</TD></TR>\n"
exit 0
done
Output:
Name Size Type probe
bla 4711 file 888888888
abcde 4096 directory 5555
eeeee 333333 block 6666
aaaaaa 111111 package 7777
sssss 44444 file 8888
bbbbb 22222 folder 9999
This is a tailor-made job for awk. See this script:
awk -F'\t' '{printf "<tr>";for(i=1;i<=NF;i++) printf "<td>%s</td>", $i;print "</tr>"}' input
<tr><td>bla</td><td>4711</td><td>file</td><td>888888888</td></tr>
<tr><td>abcde</td><td>4096</td><td>directory</td><td>5555</td></tr>
<tr><td>eeeee</td><td>333333</td><td>block</td><td>6666</td></tr>
<tr><td>aaaaaa</td><td>111111</td><td>package</td><td>7777</td></tr>
<tr><td>sssss</td><td>44444</td><td>file</td><td>8888</td></tr>
<tr><td>bbbbb</td><td>22222</td><td>folder</td><td>9999</td></tr>
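If the first input line is a header that should get <th> cells instead (as the bash version below produces), a hypothetical tweak of the same one-liner, still assuming the input really is tab-separated:
awk -F'\t' '{c = (NR==1 ? "th" : "td")
             printf "<tr>"
             for(i=1;i<=NF;i++) printf "<%s>%s</%s>", c, $i, c
             print "</tr>"}' input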
In bash:
celltype=th
while IFS=$'\t' read -a columns; do
rowcontents=$( printf "<$celltype>%s</$celltype>" "${columns[@]}" )
printf '<tr>%s</tr>\n' "$rowcontents"
celltype=td
done < <( sed $'s/\\\\t/\t/g' "$2")
Some explanations:
IFS=$'\t' read -a columns reads a line from standard input, using only the tab character to separate fields, and putting each field into a separate element of the array columns. We change IFS so that other whitespace, which could occur in a field, is not treated as a field delimiter.
On the first line read from standard input, <th> elements will be output by the printf line. After resetting the value of celltype at the end of the loop body, all subsequent rows will consist of <td> elements.
When setting the value of rowcontents, we take advantage of the fact that printf reuses its format string as many times as necessary to consume all of its arguments.
Input is via process substitution from the sed command, which requires a crazy amount of quoting. First, the entire argument is quoted with $'...', which tells bash to replace escaped characters. bash converts this to the literal string s/\\t/^T/g, where I am using ^T to represent a literal ASCII 09 tab character. When sed sees this argument, it performs its own escape replacement, so the search text is a literal backslash followed by a literal t, to be replaced by a literal tab character.
The first argument, the column count, is unnecessary and is ignored.
Normally, you avoid making the while loop part of a pipeline because you set parameters in the loop that you want to use later. Here, all the variables are truly local to the while loop, so you could avoid the process substitution and use a pipeline if you wish:
sed $'s/\\\\t/\t/g' "$2" | while IFS=$'\t' read -a columns; do
...
done
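The printf format-reuse behavior can be seen in isolation like so:
$ printf '<td>%s</td>' one two three; echo
<td>one</td><td>two</td><td>three</td>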

awk split on a different token

I am trying to initialize an array from a string split using awk.
I am expecting the tokens to be delimited by ",", but somehow they aren't.
The input is a string returned by curl from the address http://www.omdbapi.com/?i=&t=the+campaign
I've tried to remove any extra carriage return or things that could cause confusion, but in all clients I have checked it looks to be a single line string.
{"Title":"The Campaign","Year":"2012","Rated":"R", ...
and this is the output:
-metadata {"Title":"The -metadata Campaign","Year":"2012","Rated":"R","....
It should have been
-metadata {"Title":"The Campaign"
Here's my piece of code:
__tokens=($(echo $omd_response | awk -F ',' '{print}'))
for i in "${__tokens[#]}"
do
echo "-metadata" $i"
done
Any help is welcome
I would take seriously the comment by @cbuckley: use a json-aware tool rather than trying to parse the line with simple string tools. Otherwise, your script will break if a quoted string has a comma inside, for example.
In any event, you don't need awk for this exercise, and it isn't helping you, because the way awk breaks the string up is only of interest to awk. Once the string is printed to stdout, it is the same string as before. If you want the shell to use , as a field delimiter, you have to tell the shell to do so.
Here's one way to do it:
(
OLDIFS=$IFS
IFS=,
tokens=($omd_response)
IFS=$OLDIFS
for token in "${tokens[#]}"; do
# something with token
done
)
The ( and ) are just to execute all that in a subshell, making the shell variables temporaries. You can do it without.
First, please accept my apologies: I don't have a recent bash at hand, so I can't try the code below (no arrays!).
But it should work, and if not you should be able to tweak it (or ask below, providing a little context on what you see, and I'll help fix it).
nb_fields=$(echo "${omd_response}" | tr ',' '\n' | wc -l | awk '{ print $1 }')
#The nb_fields will be correct UNLESS ${omd_response} contains a trailing ",",
#in which case it would be 1 too big, and below would create an empty
#__tokens[last_one], giving an extra `-metadata ""`. Easily corrected if it happens.
#The code below assumes there is at least 1 field... You should maybe check that.
#1) we create the __tokens[] array
for field in $( seq 1 $nb_fields )
do
#optional: if field is 1 or $nb_fields, add processing to get rid of the { or } ?
__tokens[$field]=$(echo "${omd_response}" | cut -d ',' -f ${field})
done
#2) we use it to output what we want
for i in $( seq 1 $nb_fields )
do
printf '-metadata "%s" ' "${__tokens[$i]}"
#will output all on 1 line.
#You could add a \n just before the last ' so each goes on its own line
done
so I loop on field numbers, instead of on what could be some space-or-tab separated values

Bash script: regexp reading numerical parameters from text file

Greetings!
I have a text file with parameter set as follows:
NameOfParameter Value1 Value2 Value3 ...
...
I want to find needed parameter by its NameOfParameter using regexp pattern and return a selected Value to my Bash script.
I tried to do this with grep, but it returns a whole line instead of Value.
Could you help me find an approach, please?
It was not clear whether you want all the values together or only one specific one. In either case, use the power of the cut command to cut the columns you want from the file: -f 2- will cut columns 2 and onward (so everything except the parameter name), and -d " " will ensure that the columns are considered space-separated as opposed to the default tab-separated.
egrep '^NameOfParameter ' your_file | cut -f 2- -d " "
Bash:
values=($(grep '^NameOfParameter ' your_file))
echo ${values[0]} # NameOfParameter
echo ${values[1]} # Value1
echo ${values[2]} # Value2
# etc.
for value in "${values[@]:1}" # iterate over values, skipping NameOfParameter
do
echo "$value"
done
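Alternatively, read can split the matching line into an array in one step (a sketch reusing your_file from the first answer):
read -ra values <<< "$(grep -m1 '^NameOfParameter ' your_file)"
echo "${values[1]}" # Value1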
