Need help for string manipulation in a bash script - bash

I'm not use to the syntax of bash script. I'm trying to read a file. For each line I want to keep only the part of the string before the delimiter '/' and put it back into a new file if the word respect a perticular length. I've download a dictionary, but the format does not meet my expectation. Since there is 84000 words, I don't really want to manualy remove what after the '/' for each word. I though it would be an easy thing and I follow couple of idea in other similar question on this site, but it seem that I'm missing something somewhere because it still doesn't work. I can't get the length right. The file Test_Input contains one word per line. Here's the code:
#!/usr/bin/bash
filename="Test_Input.txt"
while read -r line
do
sub= echo $line | cut -d '/' -f1
length= echo ${#sub}
if $length >= 4 && $length <= 10;
then echo $sub >> Test_Output.txt
fi
done < "$filename"

Several items:
I assume that you have been using single back-quotes in the assignments, and not literally sub= echo $line | cut -d '/' -f1, as this would have certainly failed. Alternatively, you can also use sub=$(), as in $(echo $line | cut -d '/' -f1)
The conditions in an if clause need to be encompassed by single or double [], like this: if [[ $length -ge 4 ]] && [[ $length -le 10 ]];
Which brings me to the next point: <= doesn't reliably work in bash. Just use -ge for "greater or equal" and -le for "less or equal".
If your line does not contain any / characters, in your version sub will contain the whole line. This might not be what you want, so I'd advise to also add the -s flag to cut.
You don't need somevar=$(echo $someothervar). Just use somevar=$someothervar
Here's a version that works:
#!/usr/bin/env bash
filename="Test_Input.txt"
while read -r line
do
sub=$(echo $line | cut -s -d '/' -f 1)
length=${#sub}
if [[ $length -ge 4 ]] && [[ $length -le 10 ]];
then echo $sub >> Test_Output.txt
fi
done < "$filename"
Of course, you could also just use sed:
sed -n -r '/^[^/]{4,10}\// s;/.*$;;p' Test_Input.txt > Test_Output.txt
Explanation:
-n Don't print anything unless explicitly marked for printing.
-r Use the extended regex
/<searchterm>/ <operation> Search for lines that match a certain criteria, and perform this operation:
Searchterm is: ^[^/]{4,10}\/ From the beginning of the line, there should be between 4 and 10 non-slash characters, followed by the slash
Operation is: s;/.*$;;p replace everything between the first slash and the end of the line with nothing, then print.

awk is the best tool for this
awk -F/ 'length($1) >= 4 && length($1) <= 10 {print $1} > newfile

Related

Detect double new lines with bash script

I am attempting to return the line number of lines that have a break. An input example:
2938
383
3938
3
383
33333
But my script is not working and I can't see why. My script:
input="./input.txt"
declare -i count=0
while IFS= read -r line;
do
((count++))
if [ "$line" == $'\n\n' ]; then
echo "$count"
fi
done < "$input"
So I would expect, 3, 6 as output.
I just receive a blank response in the terminal when I execute. So there isn't a syntax error, something else is wrong with the approach I am taking. Bit stumped and grateful for any pointers..
Also "just use awk" doesn't help me. I need this structure for additional conditions (this is just a preliminary test) and I don't know awk syntax.
The issue is that "$line" == $'\n\n' won't match a newline as it won't be there after consuming an empty line from the input, instead you can match an empty line with regex pattern ^$:
if [[ "$line" =~ ^$ ]]; then
Now it should work.
It's also match easier with awk command:
$ awk '$0 == ""{ print NR }' test.txt
3
6
As Roman suggested, line read by read terminates with a delimiter, and that delimiter would not show up in the line the way you're testing for.
If the pattern you are searching for looks like an empty line (which I infer is how a "double newline" always manifests), then you can just test for that:
while read -r; do
((count++))
if [[ -z "$REPLY" ]]; then
echo "$count"
fi
done < "$input"
Note that IFS is for field-splitting data on lines, and since we're only interested in empty lines, IFS is moot.
Or if the file is small enough to fit in memory and you want something faster:
mapfile -t -O1 foo < i
declare -p foo
for n in "${!foo[#]}"; do
if [[ -z "${foo[$n]}" ]]; then
echo "$n"
fi
done
Reading the file all at once (mapfile) then stepping through an array may be easier on resources than stepping through a file line by line.
You can also just use GNU awk:
gawk -v RS= -F '\n' '{ print (i += NF); i += length(RT) - 1 }' input.txt
By using FS = ".+", it ensures only truly zero-length (i.e. $0 == "") line numbers get printed, while skipping rows consisting entirely of [[:space:]]'s
echo '2938
383
3938
3
383
33333' |
{m,g,n}awk -F'.+' '!NF && $!NF = NR'
3
6
This sed one-liner should do the job at once:
sed -n '/^$/=' input.txt
Simply writes the current line number (the = command) if the line read is empty (the /^$/ matches the empty line).

Q: find longest string using for loop (Bash)

Im learning bash, and I have an assignment where I need to iterate through a list of strings in bash using a for loop, and return the longest string.
This is what I've written:
max=-1
word=""
list=`cat random-text.txt | tr -s [:space:] " " | sed -r 's/([.* ])/\1\n/g' | grep -E "^a.*" | sed -r 's/(.*)[[:space:]]/\1/' | tr -s [:space:] " "`
for i in $list; do
int=`$i | wc -c`
if [ $int > $max ]; then
max=$int
word=$i
fi
done
echo The longest word in $infile that starts with $char is $i
that's probably a bit messy, but I'm having trouble using the for loop (I need the echo function at the end to return the longest string I have found iterating through the array.
** that's a part of a longer script I've written, I
Thanks in advance, much appreciated!
for some reason, while I run this script I get an error which says: "Command 'an' not found
That's because you erroneously used $i | to feed the content of variable i to wc; correct is <<<$i instead (with Bash). But better use just int=${#i}.
Then in $int > $max the > is interpreted as an output redirection; the correct arithmetic comparison operator is -gt.
Finally you don't echo the longest word found, but rather the last processed one; change $i to $word there.

How to remove last part of a string with different length in bash

I am trying to collect the lines from a file which doesn't start with a # as its first caracter.
I have this code I am able to get them:
while IFS= read -r line
do
[[ -z "$line" ]] && continue
[[ "$line" =~ ^# ]] && continue
#echo "LINEREADED: $line"
done < $file
So the output I have is something like this:
modules/core_as/xxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxx [100]
My question is how can I get only the string without the [100]?
I know there is some commands like sed or trim but the problem is that the string is not always that length, sometimes is different like:
cross_modules/core_as/xxxx/xxxxxxxxx [100-103]
or
cross_modules/core_as/xxxxxxxxxxxx/xxxxxxxxx [100-103]
or anything like that...
And in all this cases I only need the string without the [....] and without the last blank space at the end of last x, whichever the length of the string is, like cross_modules/core_as/xxxxxxxxxxxx/xxxxxxxxx
echo ${caseReaded:1:${#caseReaded}-7}
This also do the job but is not generic for any length.
Does anyone knows how I can get this?
You can strip a certain part of a string in bash
echo "${line% [*}"
cross_modules/core_as/xxxx/xxxxxxxxx
modules/core_as/xxxx/xxxxxxxxxxxxxxxxxxxxxxxxxxx
cross_modules/core_as/xxxxxxxxxxxx/xxxxxxxxx
If the spaces are only before [:
while IFS= read -r line _
do
[[ -z $line ]] && continue
[[ $line =~ ^# ]] && continue
done < "$file"
grep to match all lines not starting with # and then display the first field using cut, which works if the first field doesn't contain spaces:
grep -v ^# "$file" | cut -f1 -d' '
If the thing before [100] contains spaces, this may be the way to go:
grep -v ^# "$file" | sed -E 's/^(.*) .*$/\1/'
The last one works because the .* match in sed is greedy so only the last space will be left to match the outer condition .*$.

How to cut variables which are beteween quotes from a string

I had problem with cut variables from string in " quotes. I have some scripts to write for my sys classes, I had a problem with a script in which I had to read input from the user in the form of (a="var1", b="var2")
I tried the code below
#!/bin/bash
read input
a=$($input | cut -d '"' -f3)
echo $a
it returns me a error "not found a command" on line 3 I tried to double brackets like
a=$(($input | cut -d '"' -f3)
but it's still wrong.
In a comment the OP gave a working answer (should post it as an answer):
#!/bin/bash
read input
a=$(echo $input | cut -d '"' -f2)
b=$(echo $input | cut -d '"' -f4)
echo sum: $(( a + b))
echo difference: $(( a - b))
This will work for user input that is exactly like a="8", b="5".
Never trust input.
You might want to add the check
if [[ ${input} =~ ^[a-z]+=\"[0-9]+\",\ [a-z]+=\"[0-9]+\"$ ]]; then
echo "Use your code"
else
echo "Incorrect input"
fi
And when you add a check, you might want to execute the input (after replacing the comma with a semicolon).
input='testa="8", testb="5"'
if [[ ${input} =~ ^[a-z]+=\"[0-9]+\",\ [a-z]+=\"[0-9]+\"$ ]];
then
eval $(tr "," ";" <<< ${input})
set | grep -E "^test[ab]="
else
echo no
fi
EDIT:
#PesaThe commented correctly about BASH_REMATCH:
When you use bash and a test on the input you can use
if [[ ${input} =~ ^[a-z]+=\"([0-9]+)\",\ [a-z]+=\"([0-9])+\"$ ]];
then
a="${BASH_REMATCH[1]}"
b="${BASH_REMATCH[2]}"
fi
To extract the digit 1 from a string "var1" you would use a Bash substring replacement most likely:
$ s="var1"
$ echo "${s//[^0-9]/}"
1
Or,
$ a="${s//[^0-9]/}"
$ echo "$a"
1
This works by replacing any non digits in a string with nothing. Which works in your example with a single number field in the string but may not be what you need if you have multiple number fields:
$ s2="1 and a 2 and 3"
$ echo "${s2//[^0-9]/}"
123
In this case, you would use sed or grep awk or a Bash regex to capture the individual number fields and keep them distinct:
$ echo "$s2" | grep -o -E '[[:digit:]]+'
1
2
3

How to check if word is in alphabetical order

I 'd like to find a bash only (no sed, awk, perl, ...) for finding out if a word is in alphabetical order, in other words every letter is.
example:
bdjkz is true,
ahjmno is true,
sdgla is false.
I'm already struggling just comparing ascii values for characters, so if anyone could point me in the right direction for that it would help a lot!
Thanks
Pure bash solution (no external tool used), using Parameter Expansion to address characters inside strings:
function compare () {
word=$1
for (( pos=0; pos<${#word}-1; pos++ )) ; do
[[ ${word:pos:1} < ${word:pos+1:1} ]] || return 1
done
return 0
}
Tested with
for word in bdjkz ahjmno sdgla ; do
if compare $word ; then
echo $word ordered
else
echo $word not ordered
fi
done
If you can utilize other command line tools (but not awk, sed, perl), you can try:
[[ "YOURSTRING" = "$(echo "YOURSTRING" | grep -o '.' | sort -n |tr -d '\n')" ]] && \
echo "Alphabetic order"
[[ ... ]] is testing the expresion
"YOURSTRING" = string comparison
"$( ... )" capture the inner workings output in a string
echo "YOURSTRING" | grep -o '.' print every character on a line from "YOURSTRING" (-o '.': print only the matches for any single character - NOTE: you might need a new version of grep for this option)
... sort -n | sort the output from 4.
... tr -d '\n' rejoin the characters from 5. (by deleting the trailing new line characters)
You can use:
p='bdjkz'
q=$(fold -w1 <<< "$p"|sort|tr -d "\n")
[[ "$p" == "$q" ]] && echo "in alphabetical order" || echo "not in alphabetical order"
s=($(echo "existingString" | grep -o .)) # put each character of input string in an array.
k=($(printf '%s\n' "${s[#]}" | sort)) # sorts the input string
if [[ "${s[*]}" == "${k[*]}" ]]; then # comparing the input string array with sorted array
echo "alphabetical"
else
echo "not alphabetical"
fi

Resources