How to remove last part of a string with different length in bash

I am trying to collect the lines from a file that don't start with # as their first character.
With this code I am able to get them:
while IFS= read -r line
do
[[ -z "$line" ]] && continue
[[ "$line" =~ ^# ]] && continue
#echo "LINEREADED: $line"
done < $file
So the output I have is something like this:
modules/core_as/xxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxx [100]
My question is: how can I get only the string without the [100]?
I know there are commands like sed or trim, but the problem is that the string is not always the same length; sometimes it is different, like:
cross_modules/core_as/xxxx/xxxxxxxxx [100-103]
or
cross_modules/core_as/xxxxxxxxxxxx/xxxxxxxxx [100-103]
or anything like that...
And in all these cases I only need the string without the [....] and without the trailing space after the last x, whatever the length of the string is, like cross_modules/core_as/xxxxxxxxxxxx/xxxxxxxxx
echo ${caseReaded:1:${#caseReaded}-7}
This also does the job but is not generic for any length.
Does anyone know how I can do this?

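You can strip the trailing part of a string in bash with parameter expansion. ${line% [*} removes the shortest suffix that matches the pattern " [*", that is, the last " [" and everything after it: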
echo "${line% [*}"
cross_modules/core_as/xxxx/xxxxxxxxx
modules/core_as/xxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxx
cross_modules/core_as/xxxxxxxxxxxx/xxxxxxxxx
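As a rough sketch, the expansion above can be combined with the read loop from the question. The function name strip_suffix is made up for illustration, and the sketch assumes every data line ends in a single " [...]" group:

```bash
#!/usr/bin/env bash
# strip_suffix is a hypothetical helper name, not part of the original answer.
# It prints every non-empty, non-comment line of the file named in $1
# with the trailing " [...]" part removed.
strip_suffix() {
local file=$1 line
while IFS= read -r line; do
[[ -z $line ]] && continue        # skip empty lines
[[ $line == '#'* ]] && continue   # skip lines starting with #
# % removes the shortest suffix matching the pattern " [*"
printf '%s\n' "${line% [*}"
done < "$file"
}
```

It would be called as strip_suffix "$file". Using %% instead of % would remove the longest matching suffix, which only matters if " [" can occur more than once in a line.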

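If the only spaces are the ones before [, you can let read split the line and discard everything after the first field. IFS must keep its default value for this; with IFS= the line would not be split at all:
while read -r line _
do
[[ -z $line ]] && continue
[[ $line =~ ^# ]] && continue
echo "$line"
done < "$file"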

Use grep to keep all lines not starting with #, then print the first field with cut. This works if the first field doesn't contain spaces:
grep -v '^#' "$file" | cut -f1 -d' '
If the part before [100] contains spaces, this may be the way to go:
grep -v '^#' "$file" | sed -E 's/^(.*) .*$/\1/'
The last one works because .* in sed is greedy: the captured group extends up to the last space on the line, so only the text after that last space is left for the trailing .*$ and is removed.

Related

Detect double new lines with bash script

I am attempting to return the line numbers of the empty lines (the "double newlines") in a file. An input example:
2938
383

3938
3

383
33333
But my script is not working and I can't see why. My script:
input="./input.txt"
declare -i count=0
while IFS= read -r line;
do
((count++))
if [ "$line" == $'\n\n' ]; then
echo "$count"
fi
done < "$input"
So I would expect, 3, 6 as output.
I just receive a blank response in the terminal when I execute. So there isn't a syntax error, something else is wrong with the approach I am taking. Bit stumped and grateful for any pointers..
Also "just use awk" doesn't help me. I need this structure for additional conditions (this is just a preliminary test) and I don't know awk syntax.
The issue is that "$line" == $'\n\n' can never match: read removes the trailing newline, so an empty input line arrives as an empty string. Instead, you can match an empty line with the regex pattern ^$:
if [[ "$line" =~ ^$ ]]; then
Now it should work.
It's also much easier with awk:
$ awk '$0 == ""{ print NR }' test.txt
3
6
As Roman suggested, line read by read terminates with a delimiter, and that delimiter would not show up in the line the way you're testing for.
If the pattern you are searching for looks like an empty line (which I infer is how a "double newline" always manifests), then you can just test for that:
while read -r; do
((count++))
if [[ -z "$REPLY" ]]; then
echo "$count"
fi
done < "$input"
Note that IFS is for field-splitting data on lines, and since we're only interested in empty lines, IFS is moot.
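To see why a comparison against $'\n\n' never succeeds, it can help to print what read actually delivers. This is only an illustrative sketch; show_line_lengths is a made-up helper name:

```bash
#!/usr/bin/env bash
# show_line_lengths is a hypothetical helper: it prints the length
# of every line it reads from standard input.
show_line_lengths() {
local line
while IFS= read -r line; do
# read has already removed the trailing newline, so an empty
# input line arrives here as a zero-length string
printf '%d\n' "${#line}"
done
}

printf 'a\n\nbc\n' | show_line_lengths   # prints 1, 0 and 2 on separate lines
```

The empty second line has length 0, which is exactly what the -z test above checks for.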
Or if the file is small enough to fit in memory and you want something faster:
mapfile -t -O1 foo < "$input"
declare -p foo
for n in "${!foo[@]}"; do
if [[ -z "${foo[$n]}" ]]; then
echo "$n"
fi
done
Reading the file all at once (mapfile) then stepping through an array may be easier on resources than stepping through a file line by line.
You can also just use GNU awk:
gawk -v RS= -F '\n' '{ print (i += NF); i += length(RT) - 1 }' input.txt
With FS set to ".+", every non-empty line, including one that consists only of whitespace, splits into two empty fields, so NF is 0 only for truly zero-length lines ($0 == "") and only their line numbers are printed. This works in mawk, gawk and nawk:
echo '2938
383

3938
3

383
33333' |
awk -F'.+' '!NF && $!NF = NR'
3
6
This sed one-liner should do the job at once:
sed -n '/^$/=' input.txt
Simply writes the current line number (the = command) if the line read is empty (the /^$/ matches the empty line).

Need help for string manipulation in a bash script

I'm not used to the syntax of bash scripts. I'm trying to read a file. For each line I want to keep only the part of the string before the delimiter '/' and write it to a new file if the word has a particular length. I've downloaded a dictionary, but the format does not meet my expectations. Since there are 84000 words, I don't really want to manually remove what comes after the '/' for each word. I thought it would be an easy thing, and I followed a couple of ideas from other similar questions on this site, but it seems that I'm missing something somewhere because it still doesn't work. I can't get the length right. The file Test_Input contains one word per line. Here's the code:
#!/usr/bin/bash
filename="Test_Input.txt"
while read -r line
do
sub= echo $line | cut -d '/' -f1
length= echo ${#sub}
if $length >= 4 && $length <= 10;
then echo $sub >> Test_Output.txt
fi
done < "$filename"
Several items:
I assume that you have been using single back-quotes in the assignments, and not literally sub= echo $line | cut -d '/' -f1, as this would have certainly failed. Alternatively, you can also use sub=$(), as in $(echo $line | cut -d '/' -f1)
The conditions in an if clause need to be encompassed by single or double [], like this: if [[ $length -ge 4 ]] && [[ $length -le 10 ]];
Which brings me to the next point: inside [ ] and [[ ]], < and > compare strings, and <= is not supported at all. Use -ge for "greater or equal" and -le for "less or equal", or an arithmetic (( )) test.
If your line does not contain any / characters, in your version sub will contain the whole line. This might not be what you want, so I'd advise to also add the -s flag to cut.
You don't need somevar=$(echo $someothervar). Just use somevar=$someothervar
Here's a version that works:
#!/usr/bin/env bash
filename="Test_Input.txt"
while read -r line
do
# -s makes cut skip lines that contain no "/"
sub=$(echo "$line" | cut -s -d '/' -f 1)
length=${#sub}
if [[ $length -ge 4 ]] && [[ $length -le 10 ]];
then echo "$sub" >> Test_Output.txt
fi
done < "$filename"
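The same filter can also be sketched without calling cut for every line, using parameter expansion and an arithmetic test. This is an assumption-laden variant, not the answerer's code; filter_words is a made-up name:

```bash
#!/usr/bin/env bash
# filter_words is a hypothetical helper: it reads the file named in $1 and
# prints the part before the first "/" when that part is 4 to 10 characters long.
filter_words() {
local line sub
while IFS= read -r line; do
[[ $line == */* ]] || continue   # like cut -s: skip lines without a "/"
sub=${line%%/*}                  # remove the longest suffix that starts with "/"
if (( ${#sub} >= 4 && ${#sub} <= 10 )); then
printf '%s\n' "$sub"
fi
done < "$1"
}
```

It could be used as filter_words Test_Input.txt > Test_Output.txt. Inside (( )), >= and <= are ordinary numeric comparisons, so the -ge/-le spelling is not needed there.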
Of course, you could also just use sed:
sed -n -r '/^[^/]{4,10}\// s;/.*$;;p' Test_Input.txt > Test_Output.txt
Explanation:
-n Don't print anything unless explicitly marked for printing.
-r Use the extended regex
/<searchterm>/ <operation> Search for lines that match a certain criteria, and perform this operation:
Searchterm is: ^[^/]{4,10}\/ From the beginning of the line, there should be between 4 and 10 non-slash characters, followed by the slash
Operation is: s;/.*$;;p replace everything between the first slash and the end of the line with nothing, then print.
awk is well suited to this:
awk -F/ 'length($1) >= 4 && length($1) <= 10 {print $1}' Test_Input.txt > newfile

bash loop through file replace string

I have a file called file.txt that contains the following:
123
223
Lane,id,s_id_sample_id
1,3_range.single_try,N76
2,44_range.single_try,N77
3,92_out_range.double_try,N79
I would like to loop through this file and do the following:
begin from the line after 'Lane', then split on commas and take the second column (id)
then take the id column and split it on underscores, then
search and replace all dots and underscores with 'X' EXCEPT THE LAST TWO UNDERSCORES. So do not search and replace the last underscore (e.g. double_try).
So I would like to end up with:
123
223
Lane,id,s_id_sample_id
1,3Xrange_single_try,N76
2,44Xrange_single_try,N77
3,92XoutXrange_double_try,N79
This is what I have done:
while IFS=',' read -r f1 f2; do
sed -e 's/_/X/g;s/\./X/g;s/'
echo "$f1,$f2"
done < "$file" > output
mv output $file
The problem is how can I specify to ignore the last two underscores?
This works by first replacing the last two dots or underscores with '#', then replacing the remaining dots and underscores with 'X', and finally, replacing all the '#' characters with underscores:
IFS=','
while read -r f1 f2 f3; do
f2=$(sed 's/[._]\([^._]\+\)[._]\([^._]\+\)$/#\1#\2/;s/[._]/X/g;s/#/_/g' <<< "$f2")
echo -n "$f1"
[[ -n $f2 ]] && echo -n ",$f2"
[[ -n $f3 ]] && echo -n ",$f3"
echo
done < "$file" > output
mv output "$file"
If '#' is likely to occur in your input data, you may want to use a different character. Anything that you can be reasonably sure won't occur in your input will do.
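If a placeholder character is a concern, the same idea can be sketched in pure bash with a regular expression that captures the last two segments separately, so no '#' is needed. The function name protect_last_two is made up for illustration:

```bash
#!/usr/bin/env bash
# protect_last_two is a hypothetical helper: it replaces every dot and underscore
# in $1 with X, except the last two separators, which both become underscores.
protect_last_two() {
local id=$1 head
# group 1: everything before the last two separators (greedy)
# groups 2 and 3: the last two separator-free segments
local re='^(.*)[._]([^._]+)[._]([^._]+)$'
if [[ $id =~ $re ]]; then
head=${BASH_REMATCH[1]}
printf '%s_%s_%s\n' "${head//[._]/X}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
else
# fewer than two separators: same result as the sed version above
printf '%s\n' "${id//[._]/X}"
fi
}

protect_last_two '92_out_range.double_try'   # prints 92XoutXrange_double_try
```

Inside the loop above, f2=$(protect_last_two "$f2") would take the place of the sed call.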

Checking for empty lines in a file

I don't have a code example here since I'm not sure how to do this at all, but I have a file. A legal empty line is one that contains only the newline character. Spaces or tabs are illegal.
How do I check if a line is "legally empty"?
If it doesn't have any words (I can check this with wc -w), how do I check if it has no spaces or tabs either, just new-line?
So I've tried something like this:
while read line; do
if [[ "$line" =~ ^$ ]]; then
echo empty line
continue
fi
done < $1
But it's not working. If I put a " " in an otherwise empty line, it still considers it empty.
If you want the line numbers of those empty lines:
perl -lne 'print $. if(/^$/)' your_file
If you want to delete those lines without Perl:
grep . your_file >new_file
If you want to delete those empty line in place using Perl:
perl -i -lne 'print if(/./)' your_file
Terminology: a line that contains only white space is a blank line. A line that contains nothing (except for the newline terminator) is an empty line.
The read builtin strips off leading and trailing whitespace. So if it encounters a blank line, it sets its argument to an empty string, regardless of the amount of whitespace. To avoid this behavior and return the input line unmodified, set the field separator characters to nothing (by default, they are space, tab and newline): set the IFS variable to the empty string. See Why is while IFS= read used so often, instead of IFS=; while read..? for a more detailed explanation. While you're at it, pass the -r option to read, unless you want backslash-newline sequences to be a line continuation.
while IFS= read -r line; do
if [ -z "$line" ]; then
echo empty line
fi
done <"$1"
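The whitespace stripping described above can be seen directly by reading the same blank string twice. This is a small sketch with made-up function names:

```bash
#!/usr/bin/env bash
# Both helpers are hypothetical; each reads its first argument with read
# and prints the result between square brackets.
read_default_ifs() {
local IFS=$' \t\n' line   # the default IFS value, set explicitly for the demo
read -r line <<< "$1"
printf '[%s]\n' "$line"
}
read_empty_ifs() {
local line
IFS= read -r line <<< "$1"
printf '[%s]\n' "$line"
}

read_default_ifs '  '   # prints []
read_empty_ifs '  '     # prints [  ]
```

With the default IFS the two spaces are stripped and the line looks empty; with IFS= they survive, so -z can tell an empty line from a blank one.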
If you want to tell whether a line is blank:
while IFS= read -r line; do
case "$line" in
'') echo "empty line";;
*[![:space:]]*) echo "non-blank line";;
*) echo "non-empty blank line";;
esac
done <"$1"
You can use Bash regular expression matching if you prefer:
while IFS= read -r line; do
if [[ "$line" =~ ^$ ]]; then
echo "empty line"
elif [[ "$line" =~ ^[[:space:]]+$ ]]; then
echo "non-empty blank line"
else
echo "non-blank line"
fi
done <"$1"
These can be done with pattern matching too (using shell wildcards, which have a different syntax from common regular expressions):
while IFS= read -r line; do
if [[ "$line" == "" ]]; then
echo "empty line"
elif [[ "$line" != *[![:space:]]* ]]; then
echo "non-empty blank line"
else
echo "non-blank line"
fi
done <"$1"
If you merely want to look for empty lines in the file and aren't processing the lines in any other way, you can use grep:
if grep -qxF '' <"$1"; then
echo "$1 contains an empty line"
fi
If you're looking for blank lines that are not empty:
if grep -qEx '[[:space:]]+' <"$1"; then
echo "$1 contains a non-empty blank line"
fi
You can check for an empty line with the regex
^$
^ is the beginning of a line, $ is the end of a line, the above regex matches if there are no other characters.
You can now use that in e.g. sed
sed '/^$/d' input.txt
This prints the file content to the console with all empty lines removed. The file itself remains unchanged.
If you want to remove the empty lines from the file (meaning, changing the file content), then run:
sed -i '/^$/d' input.txt

Extract numbers from strings

I have a file containing on each line a string of the form
string1.string2:\string3{string4}{number}
and what I want to extract is the number. I've searched and tried for a while to get this done using sed or bash, but failed. Any help would be much appreciated.
Edit 1: The strings may contains numbers.
$ echo 'string1.string2:\string3{string4}{number}' |\
cut -d'{' -f3 | cut -d'}' -f 1
number
Using sed:
sed 's/[^}]*}{\([0-9]*\)}/\1/' input_file
Description:
[^}]*} : match anything that is not } and the following }
{\([0-9]*\)}: capture the following digits within {...}
/\1/ : substitute all with the captured number
Use grep:
grep -o '{[0-9]\+}' input_file | tr -d '{}'
In bash:
sRE='[[:alnum:]]+'
nRE='[[:digit:]]+'
[[ $str =~ $sRE\.$sRE:\\$sRE\{$sRE\}\{($nRE)\} ]] && number=${BASH_REMATCH[1]}
You can drop the first part of the regular expression, if your text file is sufficiently uniform:
[[ $str =~ \\$sRE\{$sRE\}\{($nRE)\} ]] && number=${BASH_REMATCH[1]}
or even
[[ $str =~ \{$sRE\}\{($nRE)\} ]] && number=${BASH_REMATCH[1]}
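If the number is always in the last {...} group of the line, parameter expansion alone may be enough. The following is a sketch under that assumption; extract_number is a made-up name:

```bash
#!/usr/bin/env bash
# extract_number is a hypothetical helper: it prints the digits inside the last
# {...} group of $1, assuming the string ends with that group.
extract_number() {
local str=$1 tail
tail=${str##*\{}   # remove everything up to and including the last "{"
tail=${tail%\}}    # remove the trailing "}"
# print only if what remains consists of digits
[[ $tail =~ ^[0-9]+$ ]] && printf '%s\n' "$tail"
}

extract_number 'string1.string2:\string3{string4}{42}'   # prints 42
```

Because only the last brace group is examined, digits inside the earlier strings do not interfere.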
