Convert a key:value file with comments into a JSON document with UNIX tools - bash

I have a file in a subset of YAML with data such as the following:
# This is a comment
# This is another comment
spark:spark.ui.enabled: 'false'
spark:spark.sql.adaptive.enabled: 'true'
yarn:yarn.nodemanager.log.retain-seconds: '259200'
I need to convert that into a JSON document looking like this (note that values that look like booleans and integers must still remain strings):
{
"spark:spark.ui.enabled": "false",
"spark:spark.sql.adaptive.enabled": "true",
"yarn:yarn.nodemanager.log.retain-seconds", "259200"
}
The closest I got was this:
cat << EOF > ./file.yaml
> # This is a comment
> # This is another comment
>
>
> spark:spark.ui.enabled: 'false'
> spark:spark.sql.adaptive.enabled: 'true'
> yarn:yarn.nodemanager.log.retain-seconds: '259200'
> EOF
echo {$(cat file.yaml | grep -o '^[^#]*' | sed '/^$/d' | awk -F": " '{sub($1, "\"&\""); print}' | paste -sd "," - )}
which, apart from looking rather gnarly, doesn't give the correct answer; it returns:
{"spark:spark.ui.enabled": 'false',"spark:spark.sql.adaptive.enabled": 'true',"yarn:yarn.nodemanager.log.retain-seconds": '259200'}
which, if I pipe it to jq, causes a parse error (the values are single-quoted, which isn't valid JSON).
I'm hoping I'm missing a much much easier way of doing this but I can't figure it out. Can anyone help?

Implemented in pure jq (tested with version 1.6):
#!/usr/bin/env bash
jq_script=$(cat <<'EOF'
def content_for_line:
  "^[[:space:]]*([#]|$)" as $ignore_re |          # regex for comments, blank lines
  "^(?<key>.*): (?<value>.*)$" as $content_re |   # regex for actual k/v pairs
  "^'(?<value>.*)'$" as $quoted_re |              # regex for values in single quotes
  if test($ignore_re) then {} else                # comments and blank lines add nothing to the data
  if test($content_re) then (                     # non-empty: match against $content_re
    capture($content_re) as $content |            # ...and put the groups into $content
    $content.key as $key |                        # string before ": " becomes $key
    (if ($content.value | test($quoted_re)) then  # if value contains literal quotes...
      ($content.value | capture($quoted_re)).value # ...take string from inside quotes
    else
      $content.value                              # no quotes to strip
    end) as $value |                              # result of the above block becomes $value
    {"\($key)": "\($value)"}                      # and return a map from one key to one value
  ) else
    # we get here if a line didn't match $ignore_re *or* $content_re
    error("Line \(.) is not recognized as a comment, empty, or valid content")
  end
  end;
# iterate over our input lines, passing each one to content_for_line and merging the result
# into the object we're building, which we eventually return as our result.
reduce inputs as $item ({}; . + ($item | content_for_line))
EOF
)
# jq -R: read input as raw strings
# jq -n: don't read from stdin until requested with "input" or "inputs"
jq -Rn "$jq_script" <file.yaml >file.json
Unlike syntax-unaware tools, this can never generate output that isn't valid JSON; and it can easily be extended with application-specific logic (e.g., to emit some values but not others as numeric literals rather than string literals) by adding an additional filter stage to inspect and modify the output of content_for_line.
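For reference, running it against the sample file.yaml above should produce (jq pretty-prints by default):
{
  "spark:spark.ui.enabled": "false",
  "spark:spark.sql.adaptive.enabled": "true",
  "yarn:yarn.nodemanager.log.retain-seconds": "259200"
}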

Here's a no-frills but simple solution:
def tidy: sub("^ *'?"; "") | sub(" *'?$"; "");
def kv: split(":") | [ (.[:-1] | join(":")), (.[-1] | tidy) ];
reduce (inputs | select(test("^ *#|^ *$") | not) | kv) as $row ({};
  .[$row[0]] = $row[1] )
Invocation
jq -n -R -f tojson.jq input.txt
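Run against the sample input, this should emit the same JSON object as the longer jq script in the previous answer.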

You can do it all in awk using gsub and sprintf, for example:
(edited to add the "," separating JSON records)
awk 'BEGIN { ol=0; print "{" }
/^[^#]/ {
    if (ol) print ","
    gsub ("\047", "\042")
    $1 = sprintf ("  \"%s\":", substr ($1, 1, length ($1) - 1))
    printf "%s %s", $1, $2
    ol++
}
END { print "\n}" }' file.yaml
(note: jq is nevertheless the proper tool for JSON formatting)
Explanation
awk 'BEGIN { ol=0; print "{" } call awk setting the output-line counter ol=0 (used to control the "," output) and printing the header "{",
/^[^#]/ { only match non-comment lines,
if (ol) print "," if the output-line count ol is greater than zero, output a separating ",",
gsub ("\047", "\042") replace all single quotes with double quotes,
$1 = sprintf ("  \"%s\":", substr ($1, 1, length ($1) - 1)) add 2 leading spaces and double quotes around the first field (dropping its last character, the trailing ':') and then append a ':' at the end,
printf "%s %s", $1, $2 output the reformatted fields,
ol++ increment the output-line count, and
END { print "\n}" }' close by printing a newline and the "}" footer.
Example Use/Output
Just select/paste the awk command above (changing the filename as needed)
$ awk 'BEGIN { ol=0; print "{" }
> /^[^#]/ {
>     if (ol) print ","
>     gsub ("\047", "\042")
>     $1 = sprintf ("  \"%s\":", substr ($1, 1, length ($1) - 1))
>     printf "%s %s", $1, $2
>     ol++
> }
> END { print "\n}" }' file.yaml
{
  "spark:spark.ui.enabled": "false",
  "spark:spark.sql.adaptive.enabled": "true",
  "yarn:yarn.nodemanager.log.retain-seconds": "259200"
}

Related

lowercase and remove punctuation from a csv

I have a giant CSV file (6 GB) whose rows look like this:
"87687","institute Polytechnic, Brazil"
"342424","university of India, India"
"24343","univefrsity columbia, Bogata, Colombia"
and I would like to remove all punctuation and lowercase the second column, yielding:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
What would be the most efficient way to do this in the terminal?
Tried:
cat TEXTFILE | tr -d '[:punct:]' > OUTFILE
problem: the result is not lowercase, and tr seems to act on both columns, not just the second.
A robust, reliable way is a real CSV parser in Perl, using just one process.
Since it works line by line, the 6 GB file size should not be an issue.
#!/usr/bin/perl
use strict; use warnings;  # harness
use Text::CSV;             # load the needed module (install it)
use feature qw/say/;       # say = print("...\n")

# create an instance of a new CSV parser
my $csv = Text::CSV->new({ auto_diag => 1 });

# open a file handle or exit with an error
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";

while (my $row = $csv->getline ($fh)) {  # parse line by line
    $_ = $row->[1];            # work on column 2 only
    s/[\s[:punct:]]//g;        # remove both space(s) and punct(s)
    $_ = lc $_;                # lowercase the current value $_
    $row->[1] = qq/"$_"/;      # store the change and (re)"quote"
    say join ",", @$row;       # print the whole current row
}
close $fh;                     # close the file handle
Output
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
install
cpan Text::CSV
Here's an approach using xsv and process substitution:
paste -d, \
  <(xsv select 1 infile.csv) \
  <(xsv select 2 infile.csv | sed 's/[[:blank:][:punct:]]*//g;s/.*/\L&/')
The sed command first removes all blanks and punctuation, then lowercases the entire match.
This also works when the first field contains blanks and commas, and retains quoting where required.
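Applied to the sample input, this should yield the cleaned rows, e.g.:
87687,institutepolytechnicbrazil
342424,universityofindiaindia
24343,univefrsitycolumbiabogatacolombia
(xsv normalizes quoting on output, so fields that don't need quotes may be emitted without them.)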
Using sed
$ sed -E ':a;s/([^,]*,)([^ ,]*)[ ,]([[:alpha:]]+)/\1\L\2\3/;ta' input_file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia
I suggest using this awk solution, which should work with any version of awk:
awk 'BEGIN{FS=OFS="\",\""} {
gsub(/[^[:alnum:]"]+/, "", $2); $2 = tolower($2)} 1' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Details:
We make "," input and output field separators in BEGIN block
gsub(/[^[:alnum:]"]+/, "", $2): Strip all non-alphanumeric characters except "
$2 = tolower($2): Lowercase second column
One GNU awk (for gensub()) idea:
awk '
BEGIN { FS=OFS="\"" }
      { $4=gensub(/[^[:alnum:]]/,"","g",tolower($4)) }
1' file
This generates:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Another sed approach -
sed -E 's/ +//g; s/([^"]),/\1/g; s/"([^"]*)"/"\L\1"/g' file
I don't like how that leaves no flexibility, and makes you rewrite the logic if you find something else you want to remove, though.
Another in awk -
awk -F'[", ]+' '
{ printf "\"%s\",\"", $2;
for(c=3;c<=NF;c++) printf "%s", tolower($c);
print "\"";
}' file
This approach lets you define and add any additional offending characters into the field delimiters without editing your logic.
$: pat=$"[\"',_;:!##\$%)(* -]+"
$: echo "$pat"
["',_;:!##$%)(* -]+
$: cat file
"87687","institute 'Polytechnic, Brazil"
"342424","university; of-India, India"
"24343","univefrsity )columbia, Bogata, Colombia"
$: awk -F"$pat" '{printf "\"%s\",\"", $2; for(c=3;c<=NF;c++) printf "%s", tolower($c); print "\"" }' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
(I hate the way that lone single quote throws the markup color/format parsing off, lol)
Another way, using Ruby. The data has been edited to show that only the second field is modified.
% ruby -r 'csv' -e 'f = open("file");
CSV.parse(f) do |i|
  puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Using FastCSV gives a huge speedup
gem install fastcsv
% ruby -r 'fastcsv' -e 'f = open("file");
FastCSV.raw_parse(f) do |i|
  puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Data
% cat file
"8768, 7","institute Polytechnic, Brazil"
"342 424","university of India, India"
"243 43","univefrsity columbia, Bogata, Colombia"
With your shown samples and attempts, please try the following GNU awk code, which uses its match function. The regex (^"[^"]*",")([^"]*)(".*)$ creates 3 capture groups and stores them in arr; the program then reassembles the line from those pieces to meet the OP's requirement.
awk '
match($0,/(^"[^"]*",")([^"]*)(".*)$/,arr){
    gsub(/[^[:alnum:]]+/,"",arr[2])
    print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
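Run against the sample input, this should print:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"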
This might work for you (GNU sed):
sed -E s'/("[^"]*",)/\1\n/;h;s/.*\n//;s/[[:punct:] ]//g;s/.*/"\L&"/;H;g;s/\n.*\n//' file
Divide and rule.
Partition the line into two fields, make a copy, process the second field (removing punctuation and spaces, lowercasing, and re-quoting), and then re-assemble the fields.
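Applied to the sample input, this should likewise produce:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"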
An alternative, perhaps?
sed -E ':a;s/^("[^"]*",".*)[^[:alpha:]"](.*)/\L\1\2/;ta' file
Here is a way to do so in PHP.
Note: fputcsv only outputs double quotes when a field needs them, so only the first column could ever be quoted; after cleaning, the second column has no spaces or special characters left.
$max_line_length = 100;
if (($fp = fopen("file.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($fp, $max_line_length, ",")) !== FALSE) {
        $data[1] = strtolower(preg_replace('/[\s[:punct:]]/', '', $data[1]));
        fputcsv(STDOUT, $data, ',', '"');
    }
    fclose($fp);
}
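Per the note above about quoting, the expected output (a sketch, not verified against a particular PHP version) is unquoted:
87687,institutepolytechnicbrazil
342424,universityofindiaindia
24343,univefrsitycolumbiabogatacolombia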

Print string variable that stores the output of a command in Bash [duplicate]

I need to place the output of a command in Bash into a string variable.
Each value should be separated by a space. There are many options to do that, but I cannot use the mapfile or read options (I'm using Bash < 4 on macOS).
This is how I capture the output of the command:
values="$(mycommand | awk 'NR > 2 { printf "%s\n", $2 }')"
where mycommand is just a cloud command that gets some values. echo $values then shows (which I think is a string ending with \n for each value):
55369972
75369973
85369974
95369975
This is what I'm trying to do: iterate over the variable values so I can print each value individually.
Desired output in the for loop:
value: 55369972
value: 75369973
value: 85369974
value: 95369975
but I'm getting this:
value: 55369972 75369973 85369974 95369975
# Getting the id field of the values
values="$(mycommand| awk 'NR > 2 { printf "%s\n", $2 }')"
# Replacing the new line with a space so I can iterate over each value
new_values="${values//$'\n'/ }"
# new_values=("${values//$'\n'/ }")
# Checking if I can print each value correctly
for i in "${new_values[#]}"
# for i in "$new_values"
do
echo "value: ${i}"
done
Also, I cannot use things like
# shellcheck disable=xxx
values=($(echo "${values}" | tr "\n" " "))
As I'm getting error messages when checking the code...
Any idea what I'm doing wrong in my code?
try this:
#!/bin/bash
values="$(mycommand | awk 'NR > 2 { printf "%s\n", $2 }')"
for v in $values; do
    echo value: "$v"
done
Your step that replaces the newlines with spaces renders it as a string. If you want to split that string into a list, you should put it in brackets (based on this answer).
This should do what you are expecting:
# Getting the id field of the values
values="$(mycommand| awk 'NR > 2 { printf "%s\n", $2 }')"
# Replacing the new line with a space
new_values=("${values//$'\n'/ }")
# Checking if I can print the values correctly
for i in ${new_values}
do
echo "value: ${i}"
done
where new_values=("${values//$'\n'/ }") is the crucial part; you then need to avoid putting it in quotes when you iterate over it (or you turn it back into a string)
Since I can't paste code into the comments, I'm posting an answer, but the credit goes to @akathimy above.
This works for me (solution #1):
#!/bin/bash
# Getting the id field of the values
values="55369972 75369973 85369974 95369975"
#
for v in $values; do
    echo value: "$v"
done
and this also (solution #2):
#!/bin/bash
# Getting the id field of the values
values="55369972
75369973
85369974
95369975"
#
for v in $values; do
    echo value: "$v"
done
Edit: and what about this one (solution #3)?
#!/bin/bash
# Getting the id field of the values
values=("55369972
75369973
85369974
95369975")
#
for v in ${values[@]}; do
    echo value: "$v"
done
This last one works for me, and perhaps also for you. Let me know.

How to split a string on a multi-character delimiter in bash?

Why doesn't the following bash code work?
for i in $( echo "emmbbmmaaddsb" | split -t "mm" )
do
echo "$i"
done
expected output:
e
bb
aaddsb
The recommended tool for string substitution is sed's s/regexp/replacement/ command, for one regexp occurrence, or s/regexp/replacement/g for a global substitution; you do not even need a loop or variables.
Pipe your echo output into it and substitute the characters mm with the newline character \n:
echo "emmbbmmaaddsb" | sed 's/mm/\n/g'
The output is:
e
bb
aaddsb
Since you're expecting newlines, you can simply replace all instances of mm in your string with a newline. In pure native bash:
in='emmbbmmaaddsb'
sep='mm'
printf '%s\n' "${in//$sep/$'\n'}"
If you wanted to do such a replacement on a longer input stream, you might be better off using awk, as bash's built-in string manipulation doesn't scale well to more than a few kilobytes of content. The gsub_literal shell function (backending into awk) given in BashFAQ #21 is applicable:
# Taken from http://mywiki.wooledge.org/BashFAQ/021
# usage: gsub_literal STR REP
# replaces all instances of STR with REP. reads from stdin and writes to stdout.
gsub_literal() {
    # STR cannot be empty
    [[ $1 ]] || return

    # string manip needed to escape '\'s, so awk doesn't expand '\n' and such
    awk -v str="${1//\\/\\\\}" -v rep="${2//\\/\\\\}" '
        # get the length of the search string
        BEGIN {
            len = length(str);
        }
        {
            # empty the output string
            out = "";

            # continue looping while the search string is in the line
            while (i = index($0, str)) {
                # append everything up to the search string, and the replacement string
                out = out substr($0, 1, i-1) rep;

                # remove everything up to and including the first instance of the
                # search string from the line
                $0 = substr($0, i + len);
            }

            # append whatever is left
            out = out $0;

            print out;
        }
    '
}
...used, in this context, as:
gsub_literal "mm" $'\n' <your-input-file.txt >your-output-file.txt
A more general example, which collects the pieces into an array instead of replacing the multi-character delimiter with a single-character one, is given below:
Using parameter expansions (from the comment of @gniourf_gniourf):
#!/bin/bash
str="LearnABCtoABCSplitABCaABCString"
delimiter=ABC
s=$str$delimiter
array=()
while [[ $s ]]; do
    array+=( "${s%%"$delimiter"*}" )
    s=${s#*"$delimiter"}
done
declare -p array
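If I'm reading the loop right, declare -p should report:
declare -a array=([0]="Learn" [1]="to" [2]="Split" [3]="a" [4]="String")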
A more crude kind of way
#!/bin/bash
# main string
str="LearnABCtoABCSplitABCaABCString"
# delimiter string
delimiter="ABC"
# length of main string
strLen=${#str}
# length of delimiter string
dLen=${#delimiter}
# iterator over the main string
i=0
# length tracker for ongoing substring
wordLen=0
# starting position for ongoing substring
strP=0
array=()
while [ $i -lt $strLen ]; do
    if [ "$delimiter" == "${str:$i:$dLen}" ]; then
        array+=( "${str:strP:wordLen}" )
        strP=$(( i + dLen ))
        wordLen=0
        i=$(( i + dLen ))
    fi
    i=$(( i + 1 ))
    wordLen=$(( wordLen + 1 ))
done
array+=( "${str:strP:wordLen}" )
declare -p array
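If I've traced it correctly, this crude version should produce the same array as the parameter-expansion approach above:
declare -a array=([0]="Learn" [1]="to" [2]="Split" [3]="a" [4]="String")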
Reference - Bash Tutorial - Bash Split String
With awk you can use gsub to replace all regex matches.
As in your question, to replace every run of two or more 'm' characters with a newline, run:
echo "emmbbmmaaddsb" | awk '{ gsub(/mm+/, "\n" ); print; }'
e
bb
aaddsb
The ‘g’ in gsub() stands for “global,” which means replace everywhere.
You may also substitute a space and print just the Nth resulting field, for example:
echo "emmbbmmaaddsb" | awk '{ gsub(/mm+/, " " ); print $2; }'
bb

Parse out key=value pairs into variables

I have a bunch of different kinds of files I need to look at periodically, and what they have in common is that the lines have a bunch of key=value type strings. So something like:
Version=2 Len=17 Hello Var=Howdy Other
I would like to be able to reference the names directly from awk... so something like:
cat some_file | ... | awk '{print Var, $5}' # prints Howdy Other
How can I go about doing that?
The closest you can get is to parse the variables into an associative array first thing on every line. That is to say,
awk '{ delete vars; for(i = 1; i <= NF; ++i) { n = index($i, "="); if(n) { vars[substr($i, 1, n - 1)] = substr($i, n + 1) } } Var = vars["Var"] } { print Var, $5 }'
More readably:
{
    delete vars;                 # clean up previous variable values
    for(i = 1; i <= NF; ++i) {   # walk through fields
        n = index($i, "=");      # search for =
        if(n) {                  # if there is one:
            # remember value by name. The reason I use
            # substr over split is the possibility of
            # something like Var=foo=bar=baz (that will
            # be parsed into a variable Var with the
            # value "foo=bar=baz" this way).
            vars[substr($i, 1, n - 1)] = substr($i, n + 1)
        }
    }
    # if you know precisely what variable names you expect to get, you can
    # assign to them here:
    Var     = vars["Var"]
    Version = vars["Version"]
    Len     = vars["Len"]
}
{
    print Var, $5                # then use them in the rest of the code
}
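Run against the sample line from the question, this should print:
Howdy Other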
$ cat file | sed -r 's/[[:alnum:]]+=/\n&/g' | awk -F= '$1=="Var"{print $2}'
Howdy Other
Or, avoiding the useless use of cat:
$ sed -r 's/[[:alnum:]]+=/\n&/g' file | awk -F= '$1=="Var"{print $2}'
Howdy Other
How it works
sed -r 's/[[:alnum:]]+=/\n&/g'
This places each key/value pair on its own line.
awk -F= '$1=="Var"{print $2}'
This reads the key-value pairs. Since the field separator is chosen to be =, the key ends up as field 1 and the value as field 2. Thus, we just look for lines whose first field is Var and print the corresponding value.
Since discussion in commentary has made it clear that a pure-bash solution would also be acceptable:
#!/bin/bash
case $BASH_VERSION in
    ''|[0-3].*) echo "ERROR: Bash 4.0 required" >&2; exit 1;;
esac

while read -r -a words; do            # iterate over lines of input
    declare -A vars=( )               # refresh variables for each line
    set -- "${words[@]}"              # update positional parameters
    for word; do
        if [[ $word = *"="* ]]; then  # if a word contains an "="...
            vars[${word%%=*}]=${word#*=} # ...then set it as an associative-array key
        fi
    done
    echo "${vars[Var]} $5"            # Here, we use content read from that line.
done <<<"Version=2 Len=17 Hello Var=Howdy Other"
The <<<"Input Here" could also be <file.txt, in which case lines in the file would be iterated over.
If you wanted to use $Var instead of ${vars[Var]}, then substitute printf -v "${word%%=*}" %s "${word#*=}" in place of vars[${word%%=*}]=${word#*=}, and remove other references to vars. Note that this doesn't allow for a good way to clean up variables between lines of input, as the associative-array approach does.
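A minimal sketch of that printf -v variant, assuming the same sample input:
#!/bin/bash
while read -r -a words; do
    set -- "${words[@]}"              # update positional parameters
    for word; do
        if [[ $word = *"="* ]]; then
            # assign the value directly to a shell variable named after the key
            printf -v "${word%%=*}" %s "${word#*=}"
        fi
    done
    echo "$Var $5"                    # should print: Howdy Other
done <<<"Version=2 Len=17 Hello Var=Howdy Other"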
I will try to explain a very generic way to do this, which you can easily adapt if you want to print out other things.
Assume you have a string which has a format like this:
key1=value1 key2=value2 key3=value3
or more generic
key1_fs2_value1_fs1_key2_fs2_value2_fs1_key3_fs2_value3
where fs1 and fs2 are two different field separators.
You would like to make a selection or perform some operations with these values. To do this, the easiest way is to store them in an associative array:
array["key1"] => value1
array["key2"] => value2
array["key3"] => value3
array["key1","full"] => "key1=value1"
array["key2","full"] => "key2=value2"
array["key3","full"] => "key3=value3"
This can be done with the following function in awk:
function str2map(str,fs1,fs2,map,    n,tmp) {
    n=split(str,map,fs1)
    for (;n>0;n--) {
        split(map[n],tmp,fs2);
        map[tmp[1]]=tmp[2]; map[tmp[1],"full"]=map[n]
        delete map[n]
    }
}
So, after processing the string, you have the full flexibility to do operations in any way you like:
awk '
function str2map(str,fs1,fs2,map,    n,tmp) {
    n=split(str,map,fs1)
    for (;n>0;n--) {
        split(map[n],tmp,fs2);
        map[tmp[1]]=tmp[2]; map[tmp[1],"full"]=map[n]
        delete map[n]
    }
}
{ str2map($0," ","=",map) }
{ print map["Var","full"] }
' file
The advantage of this method is that you can easily adapt your code to print any other key you are interested in, or even make selections based on this, example:
(map["Version"] < 3) { print map["var"]/map["Len"] }
The simplest and easiest way is to use string substitution like this:
property='my.password.is=1234567890=='
name=${property%%=*}    # strip the longest suffix starting at the first '=' -> key
value=${property#*=}    # strip the shortest prefix up to the first '=' -> value (later '='s kept)
echo "'$name' : '$value'"
The output is:
'my.password.is' : '1234567890=='
Using bash's set command, we can split the line into positional parameters, like awk does.
For each word, we try to read a name=value pair delimited by =.
When we find a value, we assign it to the variable named $key using bash's printf -v feature.
#!/usr/bin/env bash

line='Version=2 Len=17 Hello Var=Howdy Other'

set $line
for word in "$@"; do
    IFS='=' read -r key val <<< "$word"
    test -n "$val" && printf -v "$key" "$val"
done

echo "$Var $5"
output
Howdy Other
SYNOPSIS
An awk-based solution that doesn't require manually checking the fields to locate the desired key pair:
- the approach avoids splitting unnecessary fields or arrays - a regex match is only performed, via a function call, when needed
- only the FIRST occurrence of the input key's value is returned; subsequent matches along the row are NOT returned
- I just called it S() cuz it's the closest letter to $
- I only included an array (_) of the 3 test values for demo purposes. Those aren't needed; in fact, no state information is kept at all
- caveat: the key match must be exact - this version of the code isn't for case-insensitive or fuzzy/agile matching
Tested and confirmed working on
- gawk 5.1.1
- mawk 1.3.4
- mawk-2/1.9.9.6
- macos nawk
CODE
# gawk profile, created Fri May 27 02:07:53 2022
{m,n,g}awk '
function S(__,_) {
    return \
    ! match($(_=_<_), "(^|["(_="[:blank:]]")")"(__)"[=][^"(_)"*") \
    ? "^$" \
    : substr(__=substr($-_, RSTART, RLENGTH), index(__,"=")+_^!_)
}
BEGIN { OFS = "\f"                   # This array is only for testing
    _["Version"] _["Len"] _["Var"]   # purposes. Feel free to discard at will
} {
    for (__ in _) {
        print __, S(__) } }'
OUTPUT
Var
Howdy
Len
17
Version
2
So either call the fields in BAU fashion ($5, $0, $NF, etc.), or call S(QUOTED_KEY_VALUE), case-sensitive, like S("Version") to get back 2.
As a safeguard, to prevent mis-interpreting null strings or invalid inputs as $0, a non-match returns ^$ instead of an empty string.
As a bonus, it can safely handle multibyte Unicode, both for values and even for keys, regardless of whether your awk is UTF-8-aware or not:
1 ✜
🤡
2 Version
2
3 Var
Howdy
4 Len
17
5 ✜=🤡 Version=2 Len=17 Hello Var=Howdy Other
I know this question is specifically about awk, but I'm mentioning this since many people come here for solutions to break down name=value pairs (with or without awk as such).
I found the way below simple, straightforward, and very effective at handling multiple spaces/commas as well:
Source: http://jayconrod.com/posts/35/parsing-keyvalue-pairs-in-bash
change="foo=red bar=green baz=blue"
#use below if var is in CSV (instead of space as delim)
change=`echo $change | tr ',' ' '`
for change in $changes; do
set -- `echo $change | tr '=' ' '`
echo "variable name == $1 and variable value == $2"
#can assign value to a variable like below
eval my_var_$1=$2;
done
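Run as-is, this should print:
variable name == foo and variable value == red
variable name == bar and variable value == green
variable name == baz and variable value == blue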
