How do I split a string on the nth delimiter? - bash

For every line in my file, I want to print everything on that line before the 4th dash.
Input:
TCGA-HC-8216-10A-11D-A323-01
TCGA-J4-8200-10A-11D-A323-01
TCGA-EJ-A65E-10A-11D-A323-01
and I want to split each line on the fourth dash "-"
Output:
TCGA-HC-8216-10A
TCGA-J4-8200-10A
TCGA-EJ-A65E-10A
I know I can split on every dash like this:
#!/usr/bin/env bash
IN="TCGA-HC-8216-01A-11D-A323-01
TCGA-J4-8200-10A-11D-A323-01
TCGA-EJ-A65E-10A-11D-A323-01"
arr=$(echo $IN | tr "-" "\n")
for x in $arr
do
echo "> [$x]"
done
but this splits and prints each part of the string between every dash.

Use cut
cut -d- -f1-4 <<'EOF'
TCGA-HC-8216-01A-11D-A323-01
TCGA-J4-8200-10A-11D-A323-01
TCGA-EJ-A65E-10A-11D-A323-01
EOF
This cuts each input line on the delimiter - (-d) and returns fields 1-4 (-f), that is, one through four, which is everything before the fourth dash.
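To make the field count a parameter, the same cut call can be wrapped in a small function. This is only a sketch: the name before_nth and its argument order are made up for illustration, and it assumes a single-character delimiter, since cut -d accepts only one character.

```shell
#!/usr/bin/env bash
# before_nth is a hypothetical helper, not part of any standard tool.
# It reads lines on stdin and keeps everything before the n-th delimiter.
before_nth() {
    local delim=$1 n=$2
    # Fields 1..n are exactly the text before the n-th delimiter.
    cut -d "$delim" -f "1-$n"
}

printf '%s\n' 'TCGA-HC-8216-01A-11D-A323-01' | before_nth - 4
# prints: TCGA-HC-8216-01A
```

Lines with fewer than n delimiters come out unchanged, because cut simply prints whatever fields exist.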

#!/bin/bash
IN="TCGA-HC-8216-01A-11D-A323-01
TCGA-J4-8200-10A-11D-A323-01
TCGA-EJ-A65E-10A-11D-A323-01"
arr=$(echo "$IN" | cut -d '-' -f1-4)
echo "$arr"
Prints:
TCGA-HC-8216-01A
TCGA-J4-8200-10A
TCGA-EJ-A65E-10A

Using pure bash and pattern matching:
#!/bin/bash
IN="TCGA-HC-8216-01A-11D-A323-01
TCGA-J4-8200-10A-11D-A323-01
TCGA-EJ-A65E-10A-11D-A323-01"
re='([^-]+-){3}[^-]+'
for line in $IN
do
if [[ $line =~ $re ]]; then
trunc=${BASH_REMATCH[0]}
echo "$trunc"
fi
done
Output:
TCGA-HC-8216-01A
TCGA-J4-8200-10A
TCGA-EJ-A65E-10A
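A variant of the pure-bash idea that avoids the regex: split the string into an array and re-join the first n elements. This is a sketch under assumptions; first_fields is a hypothetical name, and it expects a single-character delimiter.

```shell
#!/usr/bin/env bash
# first_fields is a hypothetical helper: first_fields DELIM N STRING
first_fields() {
    local delim=$1 n=$2 parts
    # Split the string on the delimiter into an array.
    IFS=$delim read -ra parts <<< "$3"
    # Re-join the first n elements; "${parts[*]}" joins on the first char of IFS.
    local IFS=$delim
    printf '%s\n' "${parts[*]:0:n}"
}

first_fields - 4 'TCGA-HC-8216-01A-11D-A323-01'
# prints: TCGA-HC-8216-01A
```

No external process is started, which may matter when the function is called once per line in a large loop.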

Using grep with ERE:
arr=$(echo "$IN" | grep -oE "^([^-]*-){3}[^-]*")
With BRE:
arr=$(echo "$IN" | grep -o "^\([^-]*-\)\{3\}[^-]*")
Example:
#!/bin/bash
IN="TCGA-HC-8216-01A-11D-A323-01
TCGA-J4-8200-10A-11D-A323-01
TCGA-EJ-A65E-10A-11D-A323-01"
arr=$(echo "$IN" | grep -oE "^([^-]*-){3}[^-]*")
for x in $arr
do
echo "> [$x]"
done
Output:
> [TCGA-HC-8216-01A]
> [TCGA-J4-8200-10A]
> [TCGA-EJ-A65E-10A]

Related

Grep command returns nothing in shell script

I am trying to extract the rows of one file that match strings listed in another file, but the grep command returns nothing.
#!/bin/bash
input="export.txt"
file="filename.csv"
val=`head -n 1 $file`
echo $val>export.csv
cat export.txt | while read line
do
val=`echo $line | tr -d '\n'`
echo $val
valu=`grep $val $file`
echo $valu
done
You can simply do this:
grep -f list.txt input.txt
This extracts every line of input.txt that matches any of the patterns in list.txt (one pattern per line).
If for some reason you want to save each match, you can store them in a Bash array:
IFS=$'\n' read -d '' -a values <<< "$( grep -f list.txt input.txt )"
You can then print a particular match (index 1 is the second one):
echo "${values[1]}"
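As a self-contained illustration of the above, here is a sketch that builds two throwaway files and collects the matches with mapfile instead of read. The file contents are invented for the demo, and mapfile assumes bash 4 or later.

```shell
#!/usr/bin/env bash
# Throwaway sample files; the names and contents are made up for the demo.
tmpdir=$(mktemp -d)
printf '%s\n' 'apple' 'cherry' > "$tmpdir/list.txt"
printf '%s\n' 'apple pie' 'banana split' 'cherry tart' > "$tmpdir/input.txt"

# mapfile (bash 4+) stores one matching line per array element.
mapfile -t values < <(grep -f "$tmpdir/list.txt" "$tmpdir/input.txt")

printf '%s\n' "${#values[@]}"   # prints: 2
printf '%s\n' "${values[1]}"    # prints: cherry tart

rm -r "$tmpdir"
```

Each line of the pattern file is treated as a regular expression; adding -F makes grep treat the lines as fixed strings. An empty line in the pattern file matches every input line, which is a common source of surprising results.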

Bash script to stdout stuck with redirect

My bash script is the following:
#!/bin/bash
if [ ! -f "$1" ]; then
exit
fi
while read line;do
str1="[GAC]*T"
num=$"(echo $line | tr -d -c 'T' | wc -m)"
for((i=0;i<$num;i++))do
echo $line | sed "s/$str1/&\n/" | head -n1 -q
str1="${str1}[GAC]*T"
done
str1="[GAC]*T"
done < "$1
While it works normally as it should (take the filename input and print it line by line until the letter T and next letter T and so on) it prints to the terminal.
Input:
GATTT
ATCGT
Output:
GAT
GATT
GATTT
AT
ATCGT
When I'm using the script with | tee outputfile the outputfile is correct but when using the script with > outputfile the terminal hangs / is stuck and does not finish. Moreover it works with bash -x scriptname inputfile > outputfile but is stuck with bash scriptname inputfile > outputfile.
I made modifications to your original script, please try:
if [ ! -f "$1" ]; then
exit
fi
while IFS='' read -r line || [[ -n "$line" ]];do
str1="[GAC]*T"
num=$(echo $line | tr -d -c 'T' | wc -m)
for((i=0;i<$num;i++));do
echo $line | sed "s/$str1/&\n/" | head -n1 -q
str1="${str1}[GAC]*T"
done
str1="[GAC]*T"
done < "$1"
For input:
GATTT
ATCGT
This script outputs:
GAT
GATT
GATTT
AT
ATCGT
Modifications made to your original script were:
Line while read line; do changed to while IFS='' read -r line || [[ -n "$line" ]]; do. Why I did this is explained here: Read a file line by line assigning the value to a variable
Line num=$"(echo $line | tr -d -c 'T' | wc -m)" changed to num=$(echo $line | tr -d -c 'T' | wc -m)
Line for((i=0;i<$num;i++))do changed to for((i=0;i<$num;i++));do
Line done < "$1 changed to done < "$1"
Now you can do: ./scriptname inputfile > outputfile
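For comparison, the same output can be produced without sed or head by walking each line character by character. This is a sketch, not the answer's method: prefixes_to_each_t is a hypothetical name, and it reads from stdin rather than taking a filename.

```shell
#!/usr/bin/env bash
# prefixes_to_each_t is a hypothetical name. It reads lines on stdin and,
# for every T in a line, prints the prefix that ends at that T.
prefixes_to_each_t() {
    local line i
    # The "|| [[ -n $line ]]" part keeps a last line that lacks a trailing newline.
    while IFS='' read -r line || [[ -n "$line" ]]; do
        for ((i = 0; i < ${#line}; i++)); do
            if [[ ${line:i:1} == T ]]; then
                printf '%s\n' "${line:0:i+1}"
            fi
        done
    done
}

# The input deliberately has no trailing newline.
printf 'GATTT\nATCGT' | prefixes_to_each_t
```

For this input it prints GAT, GATT, GATTT, AT and ATCGT, one per line, and redirecting with > outputfile works as usual.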
Try:
sed -r 's/([^T]*T+)/\1\n/g' gatc.txt > outputfile
instead of your script.
It takes some optional non-Ts, followed by at least one T and inserts a newline after the T.
cat gatc.txt
GATGATTGATTTATATCGT
sed -r 's/([^T]*T+)/\1\n/g' gatc.txt
GAT
GATT
GATTT
AT
AT
CGT
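For multi-line input, a second sed deletes the empty lines this leaves behind. Note that each line is split into consecutive chunks, not cumulative prefixes: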
echo "GATTT
ATCGT" | sed -r 's/([^T]*T+)/\1\n/g;' | sed '/^$/d'
GATTT
AT
CGT

Extract data between delimiters from a Shell Script variable

I have this shell script variable, var. It keeps 3 entries separated by new line. From this variable var, I want to extract 2, and 0.078688. Just these two numbers.
var="USER_ID=2
# 0.078688
Suhas"
These are the code I tried:
echo "$var" | grep -o -P '(?<=\=).*(?=\n)' # For extracting 2
echo "$var" | awk -v FS="(# |\n)" '{print $2}' # For extracting 0.078688
None of the above working. What is the problem here? How to fix this ?
Just use tr on its own: keep the digits, the dot (.) and the space, and delete everything else.
tr -cd '0-9. ' <<<"$var"
2 0.078688
From the tr man page, on the -c and -d flags:
tr [OPTION]... SET1 [SET2]
-c, -C, --complement
use the complement of SET1
-d, --delete
delete characters in SET1, do not translate
To store it in variables,
IFS=' ' read -r var1 var2 < <(tr -cd '0-9. ' <<<"$var")
printf "%s\n" "$var1"
2
printf "%s\n" "$var2"
0.078688
Or in an array as
IFS=' ' read -ra numArray < <(tr -cd '0-9. ' <<<"$var")
printf "%s\n" "${numArray[@]}"
2
0.078688
Note: The -c and -d flags of tr are POSIX compliant and will work on any system that has tr installed.
echo "$var" |grep -oP 'USER_ID=\K.*'
2
echo "$var" |grep -oP '# \K.*'
0.078688
Your solution is nearly perfect; you need to change \n to $, which represents the end of the line.
echo "$var" |awk -F'# ' '/#/{print $2}'
0.078688
echo "$var" |awk -F'=' '/USER_ID/{print $2}'
2
You can do it with pure bash using a regex:
#!/bin/bash
var="USER_ID=2
# 0.078688
Suhas"
[[ ${var} =~ =([0-9]+).*#[[:space:]]([0-9\.]+) ]] && result1="${BASH_REMATCH[1]}" && result2="${BASH_REMATCH[2]}"
echo "${result1}"
echo "${result2}"
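If the layout is fixed, the builtin read can do the same job without a regex. A minimal sketch, assuming exactly the sample layout (KEY=VALUE on the first line, "# NUMBER" on the second); extract_values is a hypothetical name.

```shell
#!/usr/bin/env bash
# extract_values is a hypothetical helper. It assumes exactly the sample
# layout: "KEY=VALUE" on line 1 and "# NUMBER" on line 2.
extract_values() {
    local key hash id score
    {
        IFS='=' read -r key id      # line 1: split on "="
        read -r hash score          # line 2: split on whitespace
    } <<< "$1"
    printf '%s\n' "$id" "$score"
}

var="USER_ID=2
# 0.078688
Suhas"
extract_values "$var"
```

With the sample variable this prints 2 and then 0.078688. Unlike the regex, it does not check that the values are numeric.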
With awk:
First value:
echo "$var" | grep 'USER_ID' | awk -F "=" '{print $2}'
Second value:
echo "$var" | grep '#' | awk '{print $2}'
Assuming your data always has the same format as your sample:
# For extracting 2
echo "$var" | sed -e '/.*=/!d' -e 's///'
echo "$var" | awk -F '=' 'NR==1{ print $2}'
# For extracting 0.078688
echo "$var" | sed -e '/.*#[[:blank:]]*/!d' -e 's///'
echo "$var" | awk -F '#' 'NR==2{ print $2}'

Check if a string contains "-" and "]" at the same time

I have the following two regexes in Bash:
1. ^[-a-zA-Z0-9\,\.\;\:]*$
2. ^[]a-zA-Z0-9\,\.\;\:]*$
The first matches when the string contains a "-" along with the other allowed characters.
The second matches when it contains a "]".
I put these characters at the beginning of the bracket expression because I can't escape them.
How can I match both values at the same time?
You can also place the - at the end of the bracket expression, since a range must be closed on both ends.
^[]a-zA-Z0-9,.;:-]*$
You don't have to escape any of the other characters, either. Colons, semicolons, and commas have no special meaning in any part of a regular expression, and a period loses its special meaning inside a bracket expression.
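A quick way to check the bracket expression is to try it with bash's own =~ operator. This is a small sketch; is_allowed is a hypothetical helper name and the sample strings are arbitrary.

```shell
#!/usr/bin/env bash
# "]" comes first and "-" last inside the brackets, so both are literal.
pattern='^[]a-zA-Z0-9,.;:-]*$'

# is_allowed is a hypothetical helper name.
is_allowed() {
    [[ $1 =~ $pattern ]]
}

for s in 'as]df-gh' 'as,.;:-df' 'as df gh' '?????'; do
    if is_allowed "$s"; then
        printf 'match:    %s\n' "$s"
    else
        printf 'no match: %s\n' "$s"
    fi
done
```

The first two strings match; the last two do not, because the space and ? are outside the set. Keeping the pattern in a variable avoids quoting problems on the right-hand side of =~.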
Basically you can use this:
grep -E '^.*\-.*\[|\[.*\-.*$'
It matches either a - followed by zero or more arbitrary chars and a [ or a [ followed by zero or more chars and a -
However since you don't accept arbitrary chars, you need to change it to:
grep -E '^[a-zA-Z0-9,.;:]*\-[a-zA-Z0-9,.;:]*\[|\[[a-zA-Z0-9,.;:]*\-[a-zA-Z0-9,.;:]*$'
Maybe, this can help you
#!/bin/bash
while read p; do
echo $p | grep -E '\-.*\]|\].*\-' | grep "^[]a-zA-Z0-9,.;:-]*$"
done <$1
user-host:/tmp$ cat test
-i]string
]adfadfa-
string-
]string
str]ing
]123string
123string-
?????
++++++
user-host:/tmp$ ./test.sh test
-i]string
]adfadfa-
There are two questions in your post.
One is in the description:
How can I match both values at the same time?
That is an OR match, which can be done with a single bracket expression that combines your two:
pattern='^[]a-zA-Z0-9,.;:-]*$'
That will match a line that either contains one (or several) -…OR…]…OR any of the included characters. That would be all the lines (except ?????, ++++++ and as df gh) in the test script below.
Two is in the title:
… a string contains “-” and “]” at the same time
That is an AND match. The simplest (and slowest) way to do it is:
echo "$line" | grep '-' | grep ']' | grep '^[]a-zA-Z0-9,.;:-]*$'
The first two calls to grep select only the lines that:
contain both (one or several) - and (one or several) ]
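The same AND logic can be written without spawning three grep processes. A minimal sketch, assuming bash; has_dash_and_bracket is a hypothetical name, and glob patterns stand in for the first two greps.

```shell
#!/usr/bin/env bash
# has_dash_and_bracket is a hypothetical helper name.
has_dash_and_bracket() {
    local s=$1
    local allowed='^[]a-zA-Z0-9,.;:-]*$'
    # The glob patterns test for the two required characters;
    # the regex rejects anything outside the allowed set.
    [[ $s == *-* && $s == *']'* && $s =~ $allowed ]]
}

has_dash_and_bracket 'as]df-gh' && echo 'as]df-gh: both present'
has_dash_and_bracket 'asdfgh-'  || echo 'asdfgh-: rejected'
```

Both echo lines are printed: the first string has a dash and a bracket, the second has only a dash.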
Test script:
#!/bin/bash
printlines(){
cat <<-\_test_lines_
asdfgh
asdfgh-
asdfgh]
as]df
as,df
as.df
as;df
as:df
as-df
as]]]df
as---df
asAS]]]DFdf
as123--456DF
as,.;:-df
as-dfg]h
as]dfg-h
a]s]d]f]g]h
a]s]d]f]g]h-
s-t-r-i-n-g]
as]df-gh
123]asdefgh
123asd-fgh-
?????
++++++
as df gh
_test_lines_
}
pattern='^[]a-zA-Z0-9,.;:-]*$'
printf '%s\n' "Testing the simple pattern of $pattern"
while read line; do
resultgrep="$( echo "$line" | grep "$pattern" )"
printf '%13s %-13s\n' "$line" "$resultgrep"
done < <(printlines)
echo "#############################################################"
echo
p1='-'; p2=']'; p3='^[]a-zA-Z0-9,.;:-]*$'
printf '%s\n' "Testing a 'grep AND' of '$p1', '$p2' and '$p3'."
while read line; do
resultgrep="$( echo "$line" | grep "$p1" | grep "$p2" | grep "$p3" )"
[[ $resultgrep ]] && printf '%13s %-13s\n' "$line" "$resultgrep"
done < <(printlines)
echo "#############################################################"
echo
printf '%s\n' "Testing an 'AWK AND' of '$p1', '$p2' and '$p3'."
while read line; do
resultawk="$( echo "$line" |
awk -v p1="$p1" -v p2="$p2" -v p3="$p3" '$0~p1 && $0~p2 && $0~p3' )"
[[ $resultawk ]] && printf '%13s %-13s\n' "$line" "$resultawk"
done < <(printlines)
echo "#############################################################"
echo
printf '%s\n' "Testing a 'bash AND' of '$p1', '$p2' and '$p3'."
while read line; do
rgrep="$( echo "$line" | grep "$p1" | grep "$p2" | grep "$p3" )"
[[ ( $line =~ $p1 ) && ( $line =~ $p2 ) && ( $line =~ $p3 ) ]]
rbash=${BASH_REMATCH[0]}
[[ $rbash ]] && printf '%13s %-13s %-13s\n' "$line" "$rgrep" "$rbash"
done < <(printlines)
echo "#############################################################"
echo

print line without the first word into a variable

This is my code
title=""
line=""
fname=$1
numoflines=$(wc -l < $fname)
for ((i=2 ; i<=$numoflines ; i++))
do
...
done
In the for loop I want to put the first word of every line into $title
and the rest of the line, without the first word, into $line
(using bash).
Thanks.
I am assuming that by print to a variable you mean add the contents of each line to the variable. To do this, you can use the bash built-in function read:
while read -r t l; do title+="$t"; line+="$l"; done < "$fname"
This will add the first word of every line to $title and the rest of the line to $line.
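If the values should stay separate per line rather than being concatenated, arrays are one option. This sketch is a variation on the loop above, not part of the original answer; the array names titles and lines are made up, and a here-document stands in for the file.

```shell
#!/usr/bin/env bash
# titles and lines are hypothetical array names; the here-document
# stands in for the input file.
titles=()
lines=()
while read -r first rest; do
    titles+=("$first")   # first word of the line
    lines+=("$rest")     # everything after the first word
done <<'EOF'
This is my line.
My cat is green.
EOF

printf '%s | %s\n' "${titles[1]}" "${lines[1]}"
# prints: My | cat is green.
```

read assigns the first word to the first variable and the whole remainder to the last one, which is why two variable names are enough.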
You can do something like this:
echo "$fname"
This is my line.
My cat is green.
title=$(awk '{print $1}' <<< "$fname")
line=$(awk '{$1="";sub(/^ /,"")}1' <<< "$fname")
echo "$title"
This
My
echo "$line"
is my line.
cat is green.
Alternative approach using the cut command:
file="./myfile.txt"
title=$(cut -f1 -d ' ' "$file")
line=$(cut -f2- -d ' ' "$file")
#check print
pr -tm <(echo -e "TITLES\n$title") <(echo -e "LINES\n$line")
For the following myfile.txt:
My cat is green.
Green cats are strange.
it prints:
TITLES                              LINES
My                                  cat is green.
Green                               cats are strange.
do
Tempo="$( sed -n "${i} {s/^[[:blank:]]*\([^[:blank:]]*\)[[:blank:]]*\(.*\)/title='\1';line='\2'/p;q;}" ${fname} )"
eval "${Tempo}"
done
# or
do
sed -n "${i} {p;q;}" "${fname}" | read title line
# but not every shell keeps these variables: bash runs read in a subshell here, so they are lost after the pipe
done
