Removing all characters except a-z and 0-9 in bash

I am trying to remove all characters other than a-z and 0-9 in a bash script. Here is what I have so far:
#!/bin/bash
i=-1
cat rtrans.txt | while read line
do
i=$((i+1))
for word in $line
do
echo "$i $word"|tr A-Z a-z|sed 's/[\._-]//g'
done
done > input1.test
However, with sed it seems I have to list every non-alphanumeric character I want to remove.
Is there a better way of doing this?

You can use a negated character class:
echo "$i $word" | tr A-Z a-z | sed -e 's/[^a-z0-9 ]//g'
The leading ^ negates the class, so this removes every character that is not in [a-z0-9 ]. The space is kept in the class so that the separator between $i and $word survives.
If you want to split a file into words and number the resulting lines consecutively, you can also try
tr -s ' \t' '\n' <rtrans.txt | tr A-Z a-z | sed -e 's/[^a-z0-9]//g' | nl -n ln -w1 -s ' '

You can use bash parameter substitution, ${var//Pattern/Replacement}, which replaces every match of Pattern in var.
In your case, to remove from $word all characters besides a-z, A-Z and 0-9:
echo "$i ${word//[^a-zA-Z0-9]/}"
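Putting the pieces from the answers above together, here is a minimal sketch of the whole loop without the cat and the unquoted expansion. The helper names clean_word and number_words are made up for this example, and it assumes that lowercasing first and then deleting everything outside a-z and 0-9 is what is wanted.

```shell
#!/bin/bash
# Sketch only: clean_word and number_words are hypothetical helper names.

# Lowercase one word, then delete every byte outside a-z and 0-9.
clean_word() {
    printf '%s' "$1" | LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C tr -cd 'a-z0-9'
}

# Read lines on stdin; print "<line index> <cleaned word>" for each word.
number_words() {
    local i=0 line word
    local -a words
    while IFS= read -r line; do
        read -r -a words <<< "$line"   # split on whitespace, no globbing
        for word in "${words[@]}"; do
            printf '%s %s\n' "$i" "$(clean_word "$word")"
        done
        i=$((i + 1))
    done
}
```

It would be called as number_words < rtrans.txt > input1.test. Line numbering starts at 0, as in the original script.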

Related

How to remove invisible chars in bash (tr and sed not working)

I want to remove invisible chars from a response:
Here is my code:
test_id=`clasp run testRunner`
echo "visible"
echo "$test_id"
echo "invisible"
echo "$test_id" | cat -v
echo "invisible2"
echo "$test_id" | tr -dc '[:print:]' | cat -v
echo "invisible3"
echo "$test_id" | sed 's/[^a-zA-Z0-9]//g' | cat -v
echo "invisible4"
printf '%q\n' "$test_id"
Here's the output:
visible
1d5422fb
invisible
^[[2K^[[1G1d5422fb
invisible2
[2K[1G1d5422fbinvisible3
2K1G1d5422fb
invisible4
$'\E[2K\E[1G1d5422fb'
The following code works with your example:
shopt -s extglob
test_id=$'\e[2K\e[1G1d5422fb'
test_id="${test_id//$'\e['*([^a-zA-Z])[a-zA-Z]}"
echo "$test_id" | cat -v
The crucial part is the third line, which applies a string substitution to the expanded variable. It matches (and removes) all occurrences of the pattern
$'\e[' - a single Esc character followed by [
*( ... ) - (this is what extglob is needed for) zero or more occurrences of ...
[^a-zA-Z] - a single non-alphabetic character
[a-zA-Z] - a single alphabetic character
In your example this gets rid of the two escape sequences \e[2K (erase line) and \e[1G (move cursor to column 1).
Instead of removing the escape sequences, you could prevent them from being generated, which I guess you can do with
test_id=$(TERM=dumb clasp run testRunner)
Another option is to strip them with perl:
echo "solution"
echo "$test_id" | perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)//g' | cat -v
as per @Dave's edit on his own question.
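For reuse, the extglob substitution from the first answer can be wrapped in a function. This is only a sketch: strip_csi is a made-up name, and it only handles sequences of the form Esc [ ... letter (such as the erase-line and cursor-move codes in the question, or colour codes), not every escape sequence a terminal program might emit.

```shell
#!/bin/bash
shopt -s extglob   # needed for the *( ... ) pattern used below

# Hypothetical helper: remove "Esc [ <non-letters> <letter>" sequences.
strip_csi() {
    local s=$1
    s="${s//$'\e['*([^a-zA-Z])[a-zA-Z]}"
    printf '%s\n' "$s"
}

strip_csi $'\e[2K\e[1G1d5422fb'
```

With the value from the question this should print just 1d5422fb.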

Replace '>' with '>\n' in several files in shell/bash

I have several files in a single folder and I want to replace the character > with >\n everywhere in all of those files.
But whatever I do, the \n character does not get added after the > character.
I have tried the following:
echo '>ABCCHACAC' | tr '\>' '>\\n'
echo '>ABCCHACAC' | tr '>' '>\\n'
echo '>ABCCHACAC' | tr '>' '>\n'
echo '>ABCCHACAC' | tr '>' '\>\n'
echo '>ABCCHACAC' | tr '>' '\>\\n'
echo '>ABCCHACAC' | tr '>' '\>\\n'
But I get the same input string as output, whereas the correct output I want is:
>
ABCCHACAC
And I am using this script to do the same thing on many files:
for f in *.txt
do
tr ">" ">\n" < "$f" > $(basename "$f" .txt)_newline_added.txt
done
tr is for one-for-one character replacements, not replacing strings. E.g. if you translate abc with def, it replaces all a with d, all b with e, and all c with f. When the second string is longer than the first, the extra characters are ignored. So tr '>' '>\n' means to replace > with > and ignores \n.
Use sed to perform string replacements.
sed 's/>/>\n/g' "$f" > "$(basename "$f" .txt)_newline_added.txt"
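The one-for-one behaviour of tr described above is easy to check on the command line. A small sketch (the variable names are just for illustration):

```shell
#!/bin/bash
# tr maps the Nth character of the first set to the Nth character of the second.
mapped=$(printf 'cab\n' | tr 'abc' 'def')
echo "$mapped"

# The second set is longer here, so its extra "\n" is never used
# and the input comes back unchanged.
unchanged=$(printf '>ABCCHACAC\n' | tr '>' '>\n')
echo "$unchanged"
```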
In addition to Barmar's answer, if you're using a BSD based *nix (eg. OS X) you'll either need to include an escaped literal newline, or possibly use tr in addition to sed.
Escaped literal newline:
$ sed 's/^>/>\
/' "$f"
sed with tr:
$ sed 's/^>/>▾/' "$f" | tr '▾' '\n'
↳ Insert newline (\n) using sed
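If you would rather not depend on how a particular sed treats \n in the replacement, awk is another option that should behave the same on GNU and BSD systems. A minimal sketch, with add_newline_after_gt being a made-up helper name:

```shell
#!/bin/bash
# Hypothetical helper: put a newline after every ">" read from stdin.
add_newline_after_gt() {
    awk '{ gsub(/>/, ">\n"); print }'
}

# Possible use on many files (left commented out; adjust names as needed):
# for f in *.txt; do
#     add_newline_after_gt < "$f" > "$(basename "$f" .txt)_newline_added.txt"
# done
```

Unlike the anchored s/^>/.../ examples above, this replaces every > on a line, which is what the question asked for.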

How to remove special characters from strings but keep underscores in shell script

I have a string that is something like "info_A!__B????????C_*". I want to remove the special characters from it but keep underscores and letters. I tried the [:word:] (ASCII letters and _) character set, but it says "invalid character set". Any idea how to handle this? Thanks.
text="info_!_????????_*"
if [ -z `echo $text | tr -dc "[:word:]"` ]
......
Using bash parameter expansion:
$ var='info_A!__B????????C_*'
$ echo "${var//[^[:alnum:]_]/}"
info_A__BC_
A sed one-liner would be
sed 's/[^[:alnum:]_]//g' <<< 'info_!????????*'
gives you
info_
An awk one-liner would be
awk '{gsub(/[^[:alnum:]_]/,"",$0)} 1' <<< 'info_!??A_??????*pi9ngo^%$_mingo745'
gives you
info_A_pi9ngo_mingo745
If you don't wish to have numbers in the output then change :alnum: to :alpha:.
My tr doesn't understand [:word:]. I had to do like this:
$ x=$(echo 'info_A!__B????????C_*' | tr -cd '[:alnum:]_')
$ echo $x
info_A__BC_
Not sure if it's a robust way, but it worked for your sample text.
sed one-liner:
echo "SamPlE_#tExT%, really ?" | sed -e 's/[^a-zA-Z_]//g'
SamPlE_tExTreally
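To tie this back to the if [ -z ... ] test in the question, here is a sketch that wraps the parameter expansion in two small functions. The names strip_special and is_only_special are invented for this example.

```shell
#!/bin/bash
# Hypothetical helper: keep only letters, digits and underscores.
strip_special() {
    printf '%s\n' "${1//[^[:alnum:]_]/}"
}

# Hypothetical helper: succeed if nothing is left after stripping.
is_only_special() {
    [[ -z "${1//[^[:alnum:]_]/}" ]]
}

text='info_A!__B????????C_*'
if is_only_special "$text"; then
    echo "nothing but special characters"
else
    strip_special "$text"
fi
```

Because everything happens inside the shell, no tr or sed process is started and the [:word:] problem does not come up.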

bash check if a string has a character more than once

The title actually almost explains it all. I would like to check if a string contains a letter (not a specific letter, really any letter) more than once.
for example:
user:
test.sh this list
script:
if [ "$1" has some letter more than once ]
then
do something
fi
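Use a POSIX character class. Note that this only checks whether the string contains at least two letters, not whether the same letter occurs twice: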
if [[ $1 =~ [[:alpha:]].*[[:alpha:]] ]]; then
echo "more than one letter"
fi
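This regex (in bash) will tell you whether a lowercase letter is repeated, and which one it is. It relies on the backreference \1, which POSIX extended regular expressions do not guarantee, so it depends on the system's regex library: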
#!/bin/bash
regex="([a-z]).*\1"
if [[ $1 =~ $regex ]]; then
echo "more than one letter \"${BASH_REMATCH[1]}\""
fi
Call as:
$ script.sh "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZz"
more than one letter "z"
Of course, the range of letters could be extended to cover both lower and upper case:
[a-zA-Z]
But that only matches plain ASCII letters if LC_COLLATE is set to "C". With a UTF-8 locale, accented characters may also fall within the a-z range, as this shows:
$ ./sc.sh abcdefghijklémnopéqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZz
more than one letter "é"
This keeps the range limited to what ASCII considers a letter:
$ LC_COLLATE=C ./sc.sh abcdefghijklémnopéqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZz
more than one letter "z"
Instead of a range you could use one of the POSIX character classes:
[[:alnum:]] [[:alpha:]] [[:lower:]] [[:upper:]]
Please note that what those classes match also depends on the locale in use.
If you want to go by using just basic commands, you can use something like this ...
#!/bin/bash
PATH=/bin/:/usr/bin/:$PATH
if [ `echo $* | tr -d ' ' | sed 's/\(.\)/\1\n/g' | sort | uniq -c | tr -s ' ' | sort -n | grep -v '^ 1 ' | wc -l` -ge 1 ]
then
echo "Input contains duplicate characters"
fi
In case it is unclear, it is easy to try out each step on the command line: run echo test input | tr -d ' ' and look at the output, then add the sed part, and so on.
The first tr -d ' ' ensures that spaces in your input are not counted as duplicates. For example, if the input is "abcd efgh ijkl", the only repeated character is the space. With tr -d ' ' in place the script does not report duplicates for that input; without it, it does.
Cheers.
-- Parag
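If the regex backreference is not available and you want to avoid the long pipeline, a plain bash loop also works. This is a sketch under the assumption that "letter" means whatever [[:alpha:]] matches in the current locale and that comparison is case-sensitive; has_repeated_letter is a made-up name.

```shell
#!/bin/bash
# Hypothetical helper: print the first letter that occurs again later in
# the string and return 0; return 1 if no letter is repeated.
has_repeated_letter() {
    local s=$1 i ch rest
    for ((i = 0; i < ${#s}; i++)); do
        ch=${s:i:1}
        [[ $ch == [[:alpha:]] ]] || continue   # skip spaces, digits, etc.
        rest=${s:i+1}                          # everything after position i
        if [[ $rest == *"$ch"* ]]; then
            printf '%s\n' "$ch"
            return 0
        fi
    done
    return 1
}

if letter=$(has_repeated_letter "this list"); then
    echo "repeated letter: $letter"
fi
```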

How to remove extra spaces in bash?

How to remove extra spaces in variable HEAD?
HEAD="  how to  remove    extra        spaces                     "
Result:
how to remove extra spaces
Try this:
echo "$HEAD" | tr -s " "
or maybe you want to save it in a variable:
NEWHEAD=$(echo "$HEAD" | tr -s " ")
Update
To remove leading and trailing whitespaces, do this:
NEWHEAD=$(echo "$HEAD" | tr -s " ")
NEWHEAD=${NEWHEAD%% }
NEWHEAD=${NEWHEAD## }
Using awk:
$ echo "$HEAD" | awk '$1=$1'
how to remove extra spaces
Take advantage of the word-splitting effects of not quoting your variable
$ HEAD=" how to remove extra spaces "
$ set -- $HEAD
$ HEAD=$*
$ echo ">>>$HEAD<<<"
>>>how to remove extra spaces<<<
If you don't want to use the positional parameters, use an array
ary=($HEAD)
HEAD=${ary[*]}
echo "$HEAD"
One dangerous side-effect of not quoting is that filename expansion will be in play. So turn it off first, and re-enable it after:
$ set -f
$ set -- $HEAD
$ set +f
This horse isn't quite dead yet: Let's keep beating it!*
Read into array
Other people have mentioned read, but since using unquoted expansion may cause undesirable expansions all answers using it can be regarded as more or less the same. You could do
set -f
read HEAD <<< $HEAD
set +f
or you could do
read -rd '' -a HEAD <<< "$HEAD" # Assuming the default IFS
HEAD="${HEAD[*]}"
Extended Globbing with Parameter Expansion
$ shopt -s extglob
$ HEAD="${HEAD//+( )/ }" HEAD="${HEAD# }" HEAD="${HEAD% }"
$ printf '"%s"\n' "$HEAD"
"how to remove extra spaces"
*No horses were actually harmed – this was merely a metaphor for getting six+ diverse answers to a simple question.
Here's how I would do it with sed:
string='  how to   remove extra    spaces  '
echo "$string" | sed -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
=> how to remove extra spaces # (no spaces at beginning or end)
The first sed expression replaces every run of one or more spaces with a single space, and the other two expressions remove the single leading and trailing space that may remain.
echo -e " abc \t def "|column -t|tr -s " "
column -t will:
remove the spaces at the beginning and at the end of the line
convert tabs to spaces
tr -s " " will squeeze multiple spaces to single space
BTW, to see the whole output you can use cat - -A, which shows all special characters including tabs and EOL:
echo -e " abc \t def "|cat - -A
output: abc ^I def $
echo -e " abc \t def "|column -t|tr -s " "|cat - -A
output:
abc def$
Whitespace can take the form of both spaces and tabs. Although they are non-printing characters and look alike to us, sed and other tools treat them as different characters and only operate on what you ask for. That is, if you tell sed to delete some number of spaces, it will do so, but the expression will not match tabs. The inverse is also true: give sed a tab and it will not match spaces, even if there are as many of them as a tab is wide.
A more extensible solution that will work for removing either/both additional space in the form of spaces and tabs (I've tested mixing both in your specimen variable) is:
echo $HEAD | sed 's/^[[:blank:]]*//g'
or we can tighten up @Frontear's excellent suggestion of using xargs without the tr:
echo $HEAD | xargs
However, note that xargs also removes newlines. So if you were to cat a file and pipe it to xargs, all the extra whitespace, including newlines, is removed and everything is put on the same line ;-).
Both of the foregoing achieved your desired result in my testing.
Try this one:
echo '  how to   remove extra    spaces  ' | sed 's/^ *//' | sed 's/ *$//' | sed 's/  */ /g'
or
HEAD="  how to   remove extra    spaces  "
HEAD=$(echo "$HEAD" | sed 's/^ *//' | sed 's/ *$//' | sed 's/  */ /g')
I would make use of tr to remove the extra spaces, and xargs to trim the back and front.
TEXT=" This is some text "
echo $(echo $TEXT | tr -s " " | xargs)
# [...]$ This is some text
Echoing the variable without quotes does what you want:
HEAD=" how to remove extra spaces "
echo $HEAD
# or assign to new variable
NEW_HEAD=$(echo $HEAD)
echo $NEW_HEAD
output: how to remove extra spaces
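Several of the answers above rely on unquoted expansion, which also triggers filename expansion. As a sketch of a way around that, the read-into-array idea can be wrapped in a function that never expands the value unquoted. The name normalize_spaces is invented for this example, and it assumes that tabs and newlines should be treated like spaces.

```shell
#!/bin/bash
# Hypothetical helper: trim the ends and squeeze inner whitespace runs
# to one space, without word splitting or globbing on the caller's side.
normalize_spaces() {
    local -a words
    local IFS=$' \t\n'
    # -d '' reads to end of input; read returns non-zero at EOF, hence || true
    read -r -d '' -a words <<< "$1" || true
    IFS=' '
    printf '%s\n' "${words[*]}"   # join the words with single spaces
}

HEAD="  how to  remove    extra        spaces     "
HEAD=$(normalize_spaces "$HEAD")
printf '"%s"\n' "$HEAD"
```

A value containing * or ? comes back unchanged apart from the whitespace, because the string is only ever split by read and never subjected to pathname expansion.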
