How to split string into component parts in Linux Bash/Shell

I'm writing the second version of my post-receive git hook.
I have a GL_REPO variable which conforms to:
/project.name/vhost-type/versioncodename
It may or may not have a trailing and/or preceding slash.
I misunderstood how the following expansion works, and as a result my current code assigns $versioncodename to every variable:
# regex out project codename
PROJECT_NAME=${GL_REPO##*/}
echo "project codename is: $PROJECT_NAME"
# extract server target vhost-type -fix required
VHOST_TYPE=${GL_REPO##*/}
echo "server target is: $VHOST_TYPE"
# get server project - fix required
PROJECT_CODENAME=${GL_REPO##*/}
echo "server project is: $PROJECT_CODENAME"
What is the correct method for taking these elements one at a time from the back of the string, or guaranteeing that a three part string allocates these variables?
I guess it might be better to split into an array?

#!/bin/bash
GL_REPO=/project.name/vhost-type/versioncodename
GL_REPO=${GL_REPO#/} # remove preceding slash, if any
IFS=/ read -a arr <<< "$GL_REPO"
PROJECT_NAME="${arr[0]}"
VHOST_TYPE="${arr[1]}"
PROJECT_CODENAME="${arr[2]}"
UPDATE: an alternative solution by anishsane:
IFS=/ read PROJECT_NAME VHOST_TYPE PROJECT_CODENAME <<< "$GL_REPO"
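Since the question says GL_REPO may or may not carry a leading and/or trailing slash, here is a minimal sketch that strips both before splitting. The function name split_repo is hypothetical and not part of the original answer, and the sketch assumes exactly three components.

```shell
#!/bin/bash
# Hypothetical helper: split "/a/b/c/" into three named parts.
split_repo() {
    local repo=$1
    repo=${repo#/}   # drop one leading slash, if present
    repo=${repo%/}   # drop one trailing slash, if present
    # IFS is changed only for the duration of this read
    IFS=/ read -r PROJECT_NAME VHOST_TYPE PROJECT_CODENAME <<< "$repo"
}

split_repo "/project.name/vhost-type/versioncodename/"
echo "project: $PROJECT_NAME, vhost: $VHOST_TYPE, codename: $PROJECT_CODENAME"
```

If the input has more than three components, read puts the remainder (slashes included) into the last variable, so a stricter script may want to validate the input first.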

You can use cut with a field separator to pull out items by order:
NAME=$(echo "$GL_REPO" | cut -d / -f 1)
You can repeat the same for the other fields. You can either ignore a leading or trailing slash (with a leading slash, field 1 is empty and the name moves to field 2), or strip a leading slash with ${GL_REPO##/} (similarly, you can strip a trailing slash with ${GL_REPO%%/}).

This is another way:
GL_REPO="/project.name/vhost-type/versioncodename"
GL_REPO="${GL_REPO/#\//}"
#^replace preceding slash (if any) with empty string.
IFS="/" arr=($GL_REPO)   # note: both are plain assignments, so IFS stays "/" afterwards
echo "PN: ${arr[0]} VHT: ${arr[1]} VC: ${arr[2]}"
Using Bash Pattern Matching:
GL_REPO="/project.name/vhost-type/versioncodename"
patt="([^/]+)/([^/]+)/([^/]+)"
[[ $GL_REPO =~ $patt ]]
echo "PN: ${BASH_REMATCH[1]} VHT: ${BASH_REMATCH[2]} VC: ${BASH_REMATCH[3]}"
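The regex above is unanchored and its result is not checked, so a malformed value would silently leave BASH_REMATCH empty. A minimal sketch of a stricter variant, assuming exactly three components and optional surrounding slashes (the function name parse_repo is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: print the three components separated by spaces,
# or return non-zero if the input does not have exactly three parts.
parse_repo() {
    local patt='^/?([^/]+)/([^/]+)/([^/]+)/?$'
    [[ $1 =~ $patt ]] || return 1
    printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}

if parts=$(parse_repo "/project.name/vhost-type/versioncodename"); then
    echo "parsed: $parts"
else
    echo "unexpected GL_REPO format" >&2
fi
```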

Related

Extract value for a key in a key/pair string

I have key value pairs in a string like this:
key1 = "value1"
key2 = "value2"
key3 = "value3"
In a bash script, I need to extract the value of one of the keys; for example, for key2 I should get value2, without the quotes.
My bash script needs to work in both Redhat and Ubuntu Linux hosts.
What would be the easiest and most reliable way of doing this?
I tried something like this simplified script:
pattern='key2\s*=\s*\"(.*?)\".*$'
if [[ "$content" =~ $pattern ]]
then
key2="${BASH_REMATCH[1]}"
echo "key2: $key2"
else
echo 'not found'
fi
But it does not work consistently.
Any better/easier/more reliable way of doing this?

To separate the key and value from your $content variable, you can use:
[[ $content =~ (^[^ ]+)[[:blank:]]*=[[:blank:]]*[[:punct:]](.*)[[:punct:]]$ ]]
That will properly populate the BASH_REMATCH array with both values where your key is in BASH_REMATCH[1] and the value in BASH_REMATCH[2].
Explanation
In bash, [[...]] treats what appears on the right side of =~ as an extended regular expression and matches it according to man 3 regex. See man 1 bash under the section heading for [[ expression ]] (4th paragraph). Sub-expressions in parentheses (..) are saved in the array variable BASH_REMATCH, with BASH_REMATCH[0] containing the entire matched portion of the string (your $content) and each remaining element containing a sub-expression enclosed in (..), in the order the parentheses appear in the regex.
The Regular Expression (^[^ ]+)[[:blank:]]*=[[:blank:]]*[[:punct:]](.*)[[:punct:]]$ is explained as:
(^[^ ]+) - '^' anchored at the beginning of the line, [^ ]+ match one or more characters that are not a space. Since this sub-expression is enclosed in (..) it will be saved as BASH_REMATCH[1], followed by;
[[:blank:]]* - zero or more whitespace characters, followed by;
= - an equal sign, followed by;
[[:blank:]]* - zero or more whitespace characters, followed by;
[[:punct:]] - a punctuation character (matching the '"', which avoids caveats associated with using quotes within the regex), followed by the sub-expression;
(.*) - zero or more characters (the rest of the characters), and since it is a sub-expression in (..) the characters will be stored in BASH_REMATCH[2], followed by;
[[:punct:]] - a punctuation character (matching the '"' ... ditto), at the;
$ - end of line anchor.
So if you match your input lines, each holding a key and a value separated by an = sign, against this regex, it will separate the key and value into the array BASH_REMATCH as you wanted.
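A short sketch of the regex from this answer applied to one sample line. Storing the regex in a variable (here called re, a name chosen for illustration) avoids quoting problems on the right side of =~.

```shell
#!/bin/bash
re='(^[^ ]+)[[:blank:]]*=[[:blank:]]*[[:punct:]](.*)[[:punct:]]$'
content='key2 = "value2"'

if [[ $content =~ $re ]]; then
    key=${BASH_REMATCH[1]}     # text before the blanks and '='
    value=${BASH_REMATCH[2]}   # text between the two punctuation characters
    echo "key: $key, value: $value"
else
    echo "no match"
fi
```

Note that [[:punct:]] accepts any punctuation character, not only a double quote, so this is looser than a pattern that names the quote explicitly.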

Bash's =~ operator uses POSIX extended regular expressions (ERE), not PCRE, so you cannot rely on \s or the non-greedy .*?.
As an alternative, please try:
while IFS= read -r content; do
# pattern='key2\s*=\s*\"(.*)\".*$'
pattern='key2[[:blank:]]*=[[:blank:]]*"([^"]*)"'
if [[ $content =~ $pattern ]]
then
key2="${BASH_REMATCH[1]}"
echo "key2: $key2"
(( found++ ))
fi
done < input-file.txt
if (( found == 0 )); then
echo "not found"
fi

When you start talking about key-value pairs, it is best to use an associative array:
declare -A map
Now looking at your lines, they look like key = "value" where we assume that:
value is always encapsulated by double quotes, but also could contain a quote
an unknown number of white spaces is before and/or after the equal sign.
So assuming we have a variable line which contains key = "value", the following operations will extract that value:
key="${line%%=*}"; key="${key// /}"                               # text before the first '=', spaces removed
value="${line#*=}"; value="${value#*\"}"; value="${value%\"*}"    # text between the first and last double quote
IFS=$' \t=' read -r key _ <<<"$line"    # alternative way to extract just the key
This allows us now to have something like:
declare -A map
while IFS= read -r line; do
key="${line%%=*}"; key="${key// /}"
value="${line#*=}"; value="${value#*\"}"; value="${value%\"*}"
map["$key"]="$value"
done <inputfile
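A self-contained sketch of this approach that also performs the lookup the question asks for. It assumes bash 4 or later (for declare -A) and uses a here-document in place of inputfile; the guard that skips lines without an equals sign is an addition for illustration.

```shell
#!/bin/bash
declare -A map

while IFS= read -r line; do
    [[ $line == *=* ]] || continue          # skip lines without '='
    key="${line%%=*}"; key="${key// /}"     # text before first '=', spaces removed
    value="${line#*=}"                      # text after first '='
    value="${value#*\"}"                    # drop through the opening quote
    value="${value%\"*}"                    # drop from the closing quote
    map["$key"]="$value"
done <<'EOF'
key1 = "value1"
key2 = "value2"
key3 = "value3"
EOF

echo "key2: ${map[key2]}"
```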

With awk:
awk -v key="key2" '$1 == key { gsub("\"","",$3);print $3 }' <<< "$string"
This reads the contents of the variable called string and passes the required key in as an awk variable called key. If the first space-delimited field equals the key, it removes the quotes from the third field with the gsub function and prints it.
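The awk command above relies on spaces around the equals sign, because it splits fields on whitespace. A hedged variant that splits on the equals sign itself, with optional surrounding blanks, might look like the following; the function name lookup is hypothetical, and values that themselves contain an equals sign would be cut short.

```shell
#!/bin/bash
# Hypothetical helper: read "key = "value"" lines on stdin and print the
# unquoted value for the key given as $1.
lookup() {
    awk -F '[ \t]*=[ \t]*' -v key="$1" \
        '$1 == key { gsub(/"/, "", $2); print $2 }'
}

printf '%s\n' 'key1 = "value1"' 'key2="value2"' 'key3  =  "value3"' | lookup key2
```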

Ok, after spending so many hours, this is how I solved the problem:
If you don't know where your script will run or what type of file (win/mac/linux) you are reading:
Try to avoid non-greedy matching in Linux bash instead of tweaking different switches.
Don't trust the end-of-line match $ when you might get data from Windows or Mac.
This post solved my problem: Non greedy text matching and extrapolating in bash
This pattern works for me in many Linux environments and with all types of line endings:
pattern='key2\s*=\s*"([^"]*)"'
The value is in BASH_REMATCH[1]
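To illustrate the line-ending concern, here is a minimal sketch that strips a trailing carriage return before matching and uses [[:blank:]] rather than \s. The function name get_value is hypothetical, and the key is interpolated into the regex as-is, so it is assumed to contain no regex metacharacters.

```shell
#!/bin/bash
# Hypothetical helper: read lines on stdin, print the value for key $1.
get_value() {
    local key=$1 line pattern
    pattern="^[[:blank:]]*${key}[[:blank:]]*=[[:blank:]]*\"([^\"]*)\""
    # '|| [[ -n $line ]]' also handles a last line without a newline
    while IFS= read -r line || [[ -n $line ]]; do
        line=${line%$'\r'}                  # drop a Windows carriage return
        if [[ $line =~ $pattern ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
            return 0
        fi
    done
    return 1
}

printf 'key1 = "value1"\r\nkey2 = "value2"\r\n' | get_value key2 || echo "not found"
```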

In bash how can I get the last part of a string after the last hyphen [duplicate]

I have this variable:
A="Some variable has value abc.123"
I need to extract this value i.e abc.123. Is this possible in bash?

Simplest is
echo "$A" | awk '{print $NF}'
Edit: explanation of how this works...
awk breaks the input into different fields, using whitespace as the separator by default. Hardcoding 5 in place of NF prints out the 5th field in the input:
echo "$A" | awk '{print $5}'
NF is a built-in awk variable that gives the total number of fields in the current record. The following returns the number 5 because there are 5 fields in the string "Some variable has value abc.123":
echo "$A" | awk '{print NF}'
Combining $ with NF outputs the last field in the string, no matter how many fields your string contains.

Yes; this:
A="Some variable has value abc.123"
echo "${A##* }"
will print this:
abc.123
(The ${parameter##word} notation is explained in §3.5.3 "Shell Parameter Expansion" of the Bash Reference Manual.)

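Some examples using parameter expansion
A="Some variable has value abc.123"
Remove the longest leading match of "* " (everything up to the last space):
echo "${A##* }"
abc.123
Remove the shortest trailing match of " *" (the last word):
echo "${A% *}"
Some variable has value
Remove the shortest trailing match of ".*" (everything from the last dot):
echo "${A%.*}"
Some variable has value abc
Remove the longest trailing match of " *" (everything from the first space):
echo "${A%% *}"
Some
Read more: Shell-Parameter-Expansion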

The documentation is a bit painful to read, so I've summarised it in a simpler way.
Note that the '*' needs to swap places with the ' ' depending on whether you use # or %. (The * is just a wildcard, so you may need to take off your "regex hat" while reading.)
${A% *} - remove shortest trailing * (strip the last word)
${A%% *} - remove longest trailing * (strip the last words)
${A#* } - remove shortest leading * (strip the first word)
${A##* } - remove longest leading * (strip the first words)
Of course a "word" here may contain any character that isn't a literal space.
You might commonly use this syntax to trim filenames:
${A##*/} removes all containing folders, if any, from the start of the path, e.g.
/usr/bin/git -> git
/usr/bin/ -> (empty string)
${A%/*} removes the last file/folder/trailing slash, if any, from the end:
/usr/bin/git -> /usr/bin
/usr/bin/ -> /usr/bin
${A%.*} removes the last extension, if any (just be wary of things like my.path/noext):
archive.tar.gz -> archive.tar
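The filename cases above can be combined into one short sketch. The sample path and the variable names are illustrative only.

```shell
#!/bin/bash
path="/usr/bin/archive.tar.gz"

file=${path##*/}   # strip longest leading "*/"  : archive.tar.gz
dir=${path%/*}     # strip shortest trailing "/*": /usr/bin
stem=${file%.*}    # strip shortest trailing ".*": archive.tar
ext=${file##*.}    # strip longest leading "*."  : gz

printf '%s\n' "dir=$dir" "file=$file" "stem=$stem" "ext=$ext"
```

Applying the extension expansions to the file name rather than the full path sidesteps the my.path/noext caveat mentioned above.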

How do you know where the value begins? If it always starts at the 5th word, you could use e.g.:
B=$(echo "$A" | cut -d ' ' -f 5-)
This uses the cut command to slice out part of the line, using a simple space as the word delimiter.

As pointed out by Zedfoxus here. A very clean method that works on all Unix-based systems. Besides, you don't need to know the exact position of the substring.
A="Some variable has value abc.123"
echo "$A" | rev | cut -d ' ' -f 1 | rev
# abc.123

More ways to do this:
(Run each of these commands in your terminal to test this live.)
For all answers below, start by typing this in your terminal:
A="Some variable has value abc.123"
The array example (#3 below) is a really useful pattern, and depending on what you are trying to do, sometimes the best.
1. with awk, as the main answer shows
echo "$A" | awk '{print $NF}'
2. with grep:
echo "$A" | grep -o '[^ ]*$'
the -o says to only retain the matching portion of the string
the [^ ] part says "don't match spaces"; ie: "not the space char"
the * means: "match 0 or more instances of the preceding match pattern (which is [^ ]), and the $ means "match the end of the line." So, this matches the last word after the last space through to the end of the line; ie: abc.123 in this case.
3. via regular bash "indexed" arrays and array indexing
Convert A to an array, with elements being separated by the default IFS (Internal Field Separator) char, which is space:
Option 1 (will "break in mysterious ways", as @tripleee put it in a comment here, if the string stored in the A variable contains certain special shell characters, so Option 2 below is recommended instead!):
# Capture space-separated words as separate elements in array A_array
A_array=($A)
Option 2 [RECOMMENDED!]. Use the read command, as I explain in my answer here, and as is recommended by the bash shellcheck static code analyzer tool for shell scripts, in ShellCheck rule SC2206, here.
# Capture space-separated words as separate elements in array A_array, using
# a "herestring".
# See my answer here: https://stackoverflow.com/a/71575442/4561887
IFS=" " read -r -a A_array <<< "$A"
Then, print only the last element in the array:
# Print only the last element via bash array right-hand-side indexing syntax
echo "${A_array[-1]}" # last element only
Output:
abc.123
Going further:
This pattern is also useful because it lets you easily do the opposite: obtain all words except the last one, like this:
array_len="${#A_array[@]}"
array_len_minus_one=$((array_len - 1))
echo "${A_array[@]:0:$array_len_minus_one}"
Output:
Some variable has value
For more on the ${array[@]:start:length} array slicing syntax above, see my answer here: Unix & Linux: Bash: slice of positional parameters, and for more info on the bash "Arithmetic Expansion" syntax, see here:
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Arithmetic-Expansion
https://www.gnu.org/savannah-checkouts/gnu/bash/manual/bash.html#Shell-Arithmetic
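Putting the array steps together, a compact sketch that gets both the last word and everything before it. It avoids the negative index so that it should also run on bash versions older than 4.3; the variable names words, count, last and rest are chosen for illustration.

```shell
#!/bin/bash
A="Some variable has value abc.123"

read -r -a words <<< "$A"        # split on the default IFS into an array
count=${#words[@]}               # number of words
last=${words[count-1]}           # last word, without using [-1]
rest="${words[*]:0:count-1}"     # all but the last word, joined by spaces

printf '%s\n' "last: $last" "rest: $rest"
```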

You can use a Bash regex:
A="Some variable has value abc.123"
[[ $A =~ [[:blank:]]([^[:blank:]]+)$ ]] && echo "${BASH_REMATCH[1]}" || echo "no match"
Prints:
abc.123
That works with any [:blank:] delimiter in the current locale (usually [ \t]). If you want to be more specific:
A="Some variable has value abc.123"
pat='[ ]([^ ]+)$'
[[ $A =~ $pat ]] && echo "${BASH_REMATCH[1]}" || echo "no match"

echo "Some variable has value abc.123"| perl -nE'say $1 if /(\S+)$/'

Basic string manipulation from filenames in bash

I have some file names in bash that I acquired with
$ ones=$(find SRR*pass*1*.fq)
$ echo $ones
SRR6301033_pass_1_trimmed.fq
SRR6301034_pass_1_trimmed.fq
SRR6301037_pass_1_trimmed.fq
...
I then converted into an array so I can iterate over this list and perform some operations with filenames:
# convert to array
$ ones=(${ones// / })
and the iteration:
for i in $ones;
do
fle=$(basename $i)
out=$(echo $fle | grep -Po '(SRR\d*)')
echo "quants/$out.quant"
done
which produces:
quants/SRR6301033
SRR6301034
...
...
SRR6301220
SRR6301221.quant
However I want this:
quants/SRR6301033.quant
quants/SRR6301034.quant
...
...
quants/SRR6301220.quant
quants/SRR6301221.quant
Could somebody explain why what I'm doing doesn't work and how to correct it?

Why do this in such a complicated way? You can get rid of all the unnecessary roundabouts and just use a for loop and built-in parameter expansion techniques to get this done.
# Initialize an empty indexed array
array=()
# Loop over files ending with '.fq'. If there are no such files, the
# pattern *.fq stays unexpanded, the '-f' test on it fails, and
# 'continue' skips that iteration
for file in *.fq; do
[ -f "$file" ] || continue
# Get the part of filename after the last '/' ( same as basename )
bName="${file##*/}"
# Remove the part after '.' (removing extension)
woExt="${bName%%.*}"
# In the resulting string, remove the part after first '_'
onlyFir="${woExt%%_*}"
# Append the result to the array, prefixing/suffixing strings 'quant'
array+=( quants/"$onlyFir".quant )
done
Now print the array to see the result
for entry in "${array[@]}"; do
printf '%s\n' "$entry"
done
Ways your attempt could fail
With ones=$(find SRR*pass*1*.fq) you are storing the results in a variable and not in an array. A variable has no way to distinguish if the contents are a list or a single string separated by spaces
With echo $ones, i.e. an unquoted expansion, the string content is subject to word splitting. You might not see a difference as long as your filenames contain no spaces, but a filename containing one would be split into parts that are treated as different files
The part ${ones// / } makes no sense as a way of converting the string to an array: it replaces each space with a space, and the unquoted expansion is still subject to word splitting
for i in $ones; is error-prone for the reasons above: a filename with spaces could be interpreted as several separate files instead of one.
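To contrast with the failure modes listed above, here is a minimal sketch that keeps the file names in a real array from the start and quotes every expansion. The function name quant_names is hypothetical; the glob is the one from the question.

```shell
#!/bin/bash
# Hypothetical helper: map each file name to quants/<accession>.quant
quant_names() {
    local f
    for f in "$@"; do
        f=${f##*/}                              # drop any directory part
        printf 'quants/%s.quant\n' "${f%%_*}"   # keep text before the first '_'
    done
}

shopt -s nullglob                 # an unmatched glob expands to nothing
files=( SRR*pass*1*.fq )          # one array element per matching file
quant_names "${files[@]}"
```

Because the glob is expanded directly into the array, file names containing spaces stay intact as single elements.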

Adding a comma after $variable

I'm writing a for loop in bash to run a command and I need to add a comma after one of my variables. I can't seem to do this without an extra space added. When I move "," right next to $bams then it outputs *.sorted,
#!/bin/bash
bams=*.sorted
for i in $bams
do echo $bams ","
done;
Output should be this:
'file1.sorted','file2.sorted','file3.sorted'
The eventual end goal is to be able to insert a list of files into a --flag in the format above. Not sure how to do that either.

First, a literal answer (if your goal were to generate a string of the form 'foo','bar','baz', rather than to run a program with a command line equivalent to somecommand --flag='foo','bar','baz', which is quite different):
shopt -s nullglob # generate a null result if no matches exist
printf -v var "'%s'," *.sorted # put list of files, each w/ a comma, in var
echo "${var%,}" # echo contents of var, with last comma removed
Or, if you don't need the literal single quotes (and if you're passing your result to another program on its command line with the single quotes being syntactic rather than literal, you absolutely don't want them):
files=( *.sorted ) # put *.sorted in an array
IFS=, # set the comma character as the field separator
somecommand --flag "${files[*]}" # run your program with the comma-separated list
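Setting IFS=, at the top level, as above, changes word splitting for the rest of the script. A hedged alternative is to confine the change to a function with local; the function names join_by_comma and quote_join are hypothetical.

```shell
#!/bin/bash
# Join all arguments with commas; IFS is changed only inside the function.
join_by_comma() {
    local IFS=,
    printf '%s\n' "$*"
}

# Same idea, but with literal single quotes around each item.
quote_join() {
    (( $# )) || return 0          # nothing to print for an empty list
    local out
    printf -v out "'%s'," "$@"    # printf repeats the format per argument
    printf '%s\n' "${out%,}"      # drop the trailing comma
}

shopt -s nullglob
files=( *.sorted )
join_by_comma "${files[@]}"
quote_join "${files[@]}"
```

A call such as somecommand --flag "$(join_by_comma "${files[@]}")" would then pass the list as a single argument.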

try this -
lst=$( echo *.sorted | sed 's/ /,/g' ) # stack filenames with commas
echo $lst
if you really need the single-ticks around each filename, then
lst="'$( echo *.sorted | sed "s/ /','/g" )'" # commas AND quotes

#!/bin/bash
bams=*.sorted
for i in $bams
do flag+="${flag:+,}'$i'"
done
echo $flag

Split filename and get the element between first and last occurrence of underscore

I am trying to split many folder names in a for loop and extract the element between first and last underscore of filename. Filenames can look like ENCSR000AMA_HepG2_CTCF or ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF.
My problem is that folder names differ from each other in the total number of underscores, so I cannot use something like:
IN=$d
folderIN=(${IN//_/ })
tf_name=${folderIN[-1]%/*} #get last element which is the TF name
cell_line=${folderIN[-2]%/*}; #get second last element which is the cell line
dataset_name=${folderIN[0]%/*}; #get first element which is the dataset name
cell_line can be one or more words separated by underscores, but it's always between the first and last underscore.
Any help?

Do this with a two-step bash parameter expansion; two steps are needed only because bash, unlike zsh, does not support nested parameter expansion.
Use "${string%_*}" to strip everything after the last occurrence of '_' and "${tempString#*_}" to strip everything from the beginning to the first occurrence of '_':
string="ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF"
tempString="${string%_*}"
printf "%s\n" "${tempString#*_}"
endothelial_cell_of_umbilical_vein
Another example,
string="ENCSR000AMA_HepG2_CTCF"
tempString="${string%_*}"
printf "%s\n" "${tempString#*_}"
HepG2
You can modify this logic to apply on each of the file-names in your folder.
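As a sketch of how this could be wrapped for use in a loop, the function below extracts all three parts the question mentions. The function name split_name is hypothetical; the variable names come from the question. It assumes the name contains at least two underscores.

```shell
#!/bin/bash
# Hypothetical helper: sets dataset_name, cell_line and tf_name.
split_name() {
    local name=$1 tmp
    dataset_name=${name%%_*}   # text before the first '_'
    tf_name=${name##*_}        # text after the last '_'
    tmp=${name%_*}             # drop the last '_' and what follows
    cell_line=${tmp#*_}        # drop through the first '_'
}

for d in ENCSR000AMA_HepG2_CTCF ENCSR000ALA_endothelial_cell_of_umbilical_vein_CTCF; do
    split_name "$d"
    printf '%s | %s | %s\n' "$dataset_name" "$cell_line" "$tf_name"
done
```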

Could use regex.
extract_words() {
[[ "$1" =~ ^([^_]+)_(.*)_([^_]+)$ ]] && echo "${BASH_REMATCH[2]}"
}
while read -r from_line
do
extracted=$(extract_words "$from_line")
echo "$from_line" "[$extracted]"
done < list_of_filenames.txt
EDIT: I moved the "extraction" into a standalone bash function for reuse and easy modification for more complex cases, like:
extract_words() {
perl -lnE 'say $2 if /^([^_]+)_(.*)_([^_]+)$/' <<< "$1"
}
