Listing all words containing more than 1 capitalized letter - bash

I want to search for all of the acronyms placed within a document so I can correct their formatting. I think I can assume that all acronyms are words containing at least 2 capital letters in them (e.g.: "EU"), as I've never seen a one-word acronym or acronym only containing 1 capital letter, but sometimes they have a small "o" for "of" in them or another small letter. How can I print out a list showing all of the possible matches once?

This might work for you:
tr -s '[:space:]' '\n' <input.txt | sed '/\<[[:upper:]]\{2,\}\>/!d' | sort -u

The -o option of grep can help you:
grep -o '\b[[:alpha:]]*[[:upper:]][[:alpha:]]*[[:upper:]][[:alpha:]]*'

Almost only Bash:
for word in $(cat file.txt) ; do
    if [[ $word =~ [[:upper:]].*[[:upper:]] ]] ; then # at least 2 capital letters
        echo "${word//[^[:alpha:]]/}" # remove non-alphabetic characters
    fi
done

Will this work for you:
sed 's/[[:space:]]\+/\n/g' "$your_file" | sort -u | grep -E '[[:upper:]].*[[:upper:]]'
Translation:
Replace all runs of whitespace in $your_file with newlines. This will put each word on its own line.
Sort the file and remove duplicates.
Find all lines that contain two uppercase letters separated by zero or more characters.
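The three steps above can be tried on a small sample. This is only a sketch: list_acronyms is a hypothetical helper name, the sample sentence is made up, and tr is used for the splitting step so that nothing depends on GNU sed accepting \n in the replacement.

```shell
# list_acronyms (hypothetical name): read text on stdin and print each word
# that contains at least two uppercase letters, once.
list_acronyms() {
    tr -s '[:space:]' '\n' |                  # one word per line
        sort -u |                             # sort and drop duplicates
        grep -E '[[:upper:]].*[[:upper:]]'    # keep words with 2+ capitals
}

printf 'The EU and the DoD met the EU\n' | list_acronyms
# prints:
# DoD
# EU
```

Punctuation attached to a word (for example "EU,") is kept as part of the word, so it may need stripping afterwards.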

One way using perl.
Example:
Content of infile:
One T
Two T
THREE
Four
Five SIX
Running the perl command:
perl -ne 'printf qq[%s\n], $1 while /\b([[:upper:]]{2,})\b/g' infile
Result:
THREE
SIX

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format [duplicate]

I can't figure how to tell sed dot match new line:
echo -e "one\ntwo\nthree" | sed 's/one.*two/one/m'
I expect to get:
one
three
instead I get original:
one
two
three
sed is a line-based tool. I don't think there is such an option.
You can use h/H(hold), g/G(get).
$ echo -e 'one\ntwo\nthree' | sed -n '1h;1!H;${g;s/one.*two/one/p}'
one
three
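The one-liner above is easier to follow when written as a multi-line sed script with comments. This is a sketch of the same commands; join_then_replace is a hypothetical name used only for this illustration.

```shell
# join_then_replace (hypothetical name): collect all input lines in the
# hold space, then run the substitution once on the joined text.
join_then_replace() {
    sed -n '
# line 1: copy the pattern space into the hold space (h)
1h
# every other line: append a newline plus the line to the hold space (H)
1!H
# last line: fetch the hold space (g), substitute, print on success (p)
${
g
s/one.*two/one/p
}'
}

printf 'one\ntwo\nthree\n' | join_then_replace
# prints:
# one
# three
```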
Maybe you should try vim
:%s/one\_.*two/one/g
If you use GNU sed, a plain . can match any character, including line break chars. The GNU sed manual describes . as follows:
.
Matches any character, including newline.
All you need to use is a -z option:
echo -e "one\ntwo\nthree" | sed -z 's/one.*two/one/'
# => one
# three
However, one.*two might not be what you need since * is always greedy in POSIX regex patterns. So, one.*two will match the leftmost one, then any 0 or more chars as many as possible, and then the rightmost two. If you need to remove one, then any 0+ chars as few as possible, and then the leftmost two, you will have to use perl:
perl -i -0 -pe 's/one.*?two//sg' file # Non-Unicode version
perl -i -CSD -Mutf8 -0 -pe 's/one.*?two//sg' file # S&R in a UTF8 file
The -0 option sets the input record separator to the NUL character, so a text file that contains no NUL bytes is read as a whole and not line by line (-0777 is the explicit slurp mode). -i enables in-place file modification, the s flag makes . match any char including line break chars, and .*? matches any 0 or more chars, as few as possible, because *? is non-greedy. The -CSD -Mutf8 part makes sure your input is decoded and your output is re-encoded correctly.
You can use python this way:
$ echo -e "one\ntwo\nthree" | python3 -c 'import re, sys; s=sys.stdin.read(); s=re.sub("(?s)one.*two", "one", s); print(s, end="")'
one
three
$
This reads all of standard input (sys.stdin.read()), then substitutes "one" for "one.*two" with the dot-matches-all setting enabled (using (?s) at the start of the regular expression), and then prints the modified string (end="" stops print from adding an extra newline).
This might work for you:
<<<$'one\ntwo\nthree' sed '/two/d'
or
<<<$'one\ntwo\nthree' sed '2d'
or
<<<$'one\ntwo\nthree' sed 'n;d'
or
<<<$'one\ntwo\nthree' sed 'N;N;s/two.//'
Sed does match all characters (including the \n) using a dot . but usually it has already stripped the \n off, as part of the cycle, so it is no longer present in the pattern space to be matched.
Only certain commands (N,H and G) preserve newlines in the pattern/hold space.
N appends a newline to the pattern space and then appends the next line.
H does exactly the same except it acts on the hold space.
G appends a newline to the pattern space and then appends whatever is in the hold space too.
The hold space is empty until you place something in it so:
sed G file
will insert an empty line after each line.
sed 'G;G' file
will insert 2 empty lines etc etc.
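N can be seen at work in the same way. The sketch below joins each pair of input lines: because N pulls the next line into the pattern space, the embedded newline is there for s to replace. join_pairs is a hypothetical name, and the input has an even number of lines so that N never runs out of input.

```shell
# join_pairs (hypothetical name): merge line 1 with line 2, line 3 with line 4, ...
join_pairs() {
    sed 'N;s/\n/ /'
}

printf 'a\nb\nc\nd\n' | join_pairs
# prints:
# a b
# c d
```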
How about two sed calls:
(get rid of the 'two' first, then get rid of the blank line)
$ echo -e 'one\ntwo\nthree' | sed 's/two//' | sed '/^$/d'
one
three
Actually, I prefer Perl for one-liners over Python:
$ echo -e 'one\ntwo\nthree' | perl -pe 's/two\n//'
one
three
The discussion below is based on GNU sed.
sed works line by line, so you cannot directly tell it to make the dot match a newline. However, there are some tricks that achieve the same thing. You can use a loop of sorts to put all the text into the pattern space, and then do the operation.
To put everything in the pattern space, use:
:a;N;$!ba;
To make "dot match newline" indirectly, you use:
(\n|.)
So the result is:
root@u1804:~# echo -e "one\ntwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
Note that in this case, (\n|.) matches newlines as well as all other characters. See the example below:
root@u1804:~# echo -e "oneXXXXXX\nXXXXXXtwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
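The loop can also be spelled out as a multi-line sed script with one comment per command. This is a sketch of the same script; slurp_and_replace is a hypothetical name, and -E is used as the more widely supported spelling of -r.

```shell
# slurp_and_replace (hypothetical name): read all input into the pattern
# space, then substitute across line boundaries.
slurp_and_replace() {
    sed -E '
# label a marks the top of the loop
:a
# N appends the next input line to the pattern space
N
# unless this is the last line ($!), branch (b) back to label a
$!ba
# the whole input is now in the pattern space
s/one(\n|.)*two/one/'
}

printf 'oneXXXXXX\nXXXXXXtwo\nthree\n' | slurp_and_replace
# prints:
# one
# three
```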

Excluding only four-digit and five-digit numbers of a txt file

I have a file on Linux that is basically a list of file names.
The file names contain all kinds of characters: A-Z a-z . _ and numbers.
Examples:
hello-34.87-world
foo-34578-bar
fo.23-5789-foobar
and a lot more...
The goal is to get a list of only the four- and five-digit numbers.
So the outcome should be:
34578
5789
I thought it would be a good idea to work with vi,
so that I could use a single command like:
:%s/someregularexpression//g
Thanks for your help.
Without having to store the filenames in a file
$ shopt -s extglob nullglob
$ for file in *-[0-9][0-9][0-9][0-9]?([0-9])-*; do
    tmp=${file#*-} # remove up to the first hyphen
    tmp=${tmp%-*} # remove the last hyphen and after
    echo "$tmp"
done
5789
34578
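Just use sed. The alternations around the digits keep the match from starting or ending inside a longer run of digits, and a shorter number earlier on the line (such as the 23 in fo.23-5789-foobar) does not hide the match:
sed -nE 's/(^|.*[^0-9])([0-9]{4,5})([^0-9].*|$)/\2/p' < myfile.txt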
If you use vim, and a line doesn't have more than one such number per line, you can try the following:
:%s/^.\{-}\(\d\{4,5\}\).\{-}$/\1/g
And see :help \{- for non-greedy search.
This works with 1 instance per line and surrounded by 1 pair of dashes
grep -P '\d{4,5}' mytxt | \
while read -r buff
do
    buff=${buff#*-}
    echo "${buff%-*}"
done
Normally you do not want to parse ls, because of spaces, newlines and other special characters in file names. In this case you don't care.
First replace every non-digit character with a newline.
Then only look for lines with 4 or 5 digits. After the replacement you only have digits, so this can be done by looking for lines of 4 or 5 characters.
ls | tr -c '0-9' '\n' | grep -E '^.{4,5}$'
When you already have the filenames in a file and you are in vi, use
:% !tr -c '0-9' '\n' | grep -E '^.{4,5}$'
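The split-then-filter idea can be checked against the sample names from the question without touching the file system. This is a minimal sketch: four_or_five_digits is a hypothetical name, and printf stands in for ls.

```shell
# four_or_five_digits (hypothetical name): print every run of exactly
# 4 or 5 digits found on stdin, one per line.
four_or_five_digits() {
    tr -c '0-9' '\n' |              # every non-digit becomes a newline
        grep -E '^[0-9]{4,5}$'      # keep lines that are 4 or 5 digits long
}

printf '%s\n' hello-34.87-world foo-34578-bar fo.23-5789-foobar | four_or_five_digits
# prints:
# 34578
# 5789
```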

How to overcome greedy match everything when looking for a particular string later?

echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.*([0-9]+) guys.*/\1/g'
The above command currently outputs just 5. Essentially I'd like to parse the number of "guys" in a random sentence that could have numbers (or not.. I'd also like to parse just echo "365 guys") preceding the number of guys. My .* is matching the 36 and preventing it from appearing in the \1. How can I write a sed command (or any other regex/perl/awk) to accomplish what I want?
Use the "frugal" quantifier *? in Perl:
perl -pe 's/.*?([0-9]+) guys.*/$1/'
With GNU grep:
$ grep -Po '\b[0-9]+(?= guys\b)' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
-P activates support for PCREs, which enables advanced regex features.
-o specifies that only the matching parts of input lines should be printed.
\b matches only on a word boundary, including at the start of a line;
this prevents matching numbers that aren't stand-alone numbers but part of other words, such as in foo365 guys, and words that start with guys, such as guysanddolls.
(?= guys) is a look-ahead assertion that matches the enclosed subexpression without including it in the matched string returned.
As demonstrated, this may match multiple patterns on a given line, with each number extracted printed on its own output line.
If that is undesired, grep cannot be used, because -o invariably returns all of a line's matches; see the perl command below for a solution.
Inspired by Sobrique's comment on choroba's answer, here is the perl equivalent of the above grep command:
$ perl -lne 'print for m/\b(\d+) guys\b/g' <<<"365 guys or 366 guys, but not foo12 guys."
365
366
Simply omit the g to only match at most 1 number per line.
Since your number is preceded by a blank, you can make it a part of the regex:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | sed -E 's/.* ([0-9]+) guys.*/\1/g'
# => 365
In Bash:
$ s="A number is about to show up 1 and now I want to parse 365 guys and some extra junk"
$ [[ $s =~ ([0-9]+)\ +guys.*$ ]] && echo ${BASH_REMATCH[1]}
365
Or, with awk:
$ echo "$s" | awk '/guys/{for (i=1;i<=NF;i++) if ($i=="guys" && $(i-1)+0==$(i-1)) print $(i-1)}'
365
With standard sed regex you can benefit from the greedy match if you reverse both the string and the pattern:
echo ... | rev | sed -E 's/.*syug ([0-9]+).*/\1/g' | rev
obviously this is a hack, but desperate times...
@Andrew Cassidy, try:
echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" |
awk '/guys/{print VAL;exit} {VAL=$0}' RS=" "
This might work for you (GNU sed):
sed -r 's/.*\b([0-9]+) guys.*/\1/' file
or perhaps:
sed -r 's/.*\<([0-9]+) guys.*/\1/' file
Make the numeric part of the pattern match a word boundary.
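\b and \< are GNU extensions. Where they are not available, a similar effect can be sketched by requiring that whatever precedes the number, if anything, ends in a non-digit. This is an alternative technique, not the answer's own command; count_guys is a hypothetical name. Unlike the \b versions, it also accepts a number glued to a word, such as foo12 guys.

```shell
# count_guys (hypothetical name): print the number directly before " guys".
# The optional group (.*[^0-9])? may swallow any prefix, but it has to end
# in a non-digit, so it cannot eat the leading digits of the number.
count_guys() {
    sed -nE 's/^(.*[^0-9])?([0-9]+) guys.*/\2/p'
}

echo "A number is about to show up 1 and now I want to parse 365 guys and some extra junk" | count_guys
# prints: 365
echo "365 guys" | count_guys
# prints: 365
```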

Bash - Extract numbers from String

I got a string which looks like this:
"abcderwer 123123 10,200 asdfasdf iopjjop"
Now I want to extract numbers, following the scheme xx,xxx where x is a number between 0-9. E.g. 10,200. Has to be five digit, and has to contain ",".
How can I do that?
Thank you
You can use grep:
$ echo "abcderwer 123123 10,200 asdfasdf iopjjop" | grep -Eo '[0-9]{2},[0-9]{3}'
10,200
In pure Bash:
pattern='([[:digit:]]{2},[[:digit:]]{3})'
[[ $string =~ $pattern ]]
echo "${BASH_REMATCH[1]}"
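For completeness, here is the same snippet as a runnable sketch with the string from the question filled in. It needs Bash, not plain sh; the if/else and the "no match" message are additions for illustration.

```shell
#!/usr/bin/env bash
string='abcderwer 123123 10,200 asdfasdf iopjjop'
pattern='([[:digit:]]{2},[[:digit:]]{3})'

# =~ fills BASH_REMATCH only when the match succeeds, so test the result
# before using the captured group.
if [[ $string =~ $pattern ]]; then
    echo "${BASH_REMATCH[1]}"
else
    echo "no match" >&2
fi
# prints: 10,200
```

Keeping the pattern in a variable, as the answer does, avoids having to quote or escape the regex inside [[ ]].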
Simple pattern matching (glob patterns) is built into the shell. Assuming you have the strings in $* (that is, they are command-line arguments to your script, or you have used set on a string you have obtained otherwise), try this:
for token; do
    case $token in
        [0-9][0-9],[0-9][0-9][0-9] ) echo "$token" ;;
    esac
done
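A sketch of how the loop can be fed from a string: the function name print_matching_tokens is hypothetical, and the string is left unquoted on purpose so that the shell splits it into words. The unquoted expansion is also subject to globbing, which is harmless for this sample.

```shell
# print_matching_tokens (hypothetical name): print each argument that has
# the shape dd,ddd. A case pattern must match the whole token.
print_matching_tokens() {
    for token in "$@"; do
        case $token in
            [0-9][0-9],[0-9][0-9][0-9]) echo "$token" ;;
        esac
    done
}

string='abcderwer 123123 10,200 asdfasdf iopjjop'
print_matching_tokens $string    # unquoted: word splitting is intended
# prints: 10,200
```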
Check out pattern matching and regular expressions.
Links:
Bash regular expressions
Patterns and pattern matching
SO question
and as mentioned above, one way to utilize pattern matching is with grep.
Other uses: the shell expands patterns (globbing) in the arguments of commands such as echo, and find supports regular expressions.
A slightly non-typical solution:
< input tr -cd '0-9, ' | tr ' ' '\012' | grep '^..,...$'
(The first tr removes everything except commas, spaces, and digits. The
second tr replaces spaces with newlines, putting each "number" on a separate
line, and the grep discards everything except those that match your criterion.)
The following example using your input data string should solve the problem using sed.
$ echo abcderwer 123123 10,200 asdfasdf iopjjop | sed -ne 's/^.*\([0-9]\{2\},[0-9]\{3\}\).*$/\1/p'
10,200

Regexp in bash for number between "quotes"

Input:
hello world "22" bye world
I need a regex that will work in bash that can get me the numbers between the quotes. The regex should match 22.
Thanks!
Hmm have you tried \"([0-9]+)\" ?
In Bash >= 3.2:
while read -r line
do
    [[ $line =~ .*\"([0-9]+)\".* ]]
    echo "${BASH_REMATCH[1]}"
done < inputfile.txt
Same thing using sed so it's more portable:
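while read -r line
do
    # pass the current line to sed; otherwise sed would read the rest of inputfile.txt itself
    result=$(printf '%s\n' "$line" | sed -n 's/.*"\([0-9][0-9]*\)".*/\1/p')
    echo "$result"
done < inputfile.txt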
Pure Bash, no Regex. Number is in array element 1.
IFS=\" # input field separator is a double quote now
while read -a line ; do
    echo -e "${line[1]}"
done < "$infile"
Bash itself has only basic regex support (the =~ operator inside [[ ]]). Most regex work is done with external programs that can use regexes, amongst them grep and sed.
grep's main functionality is to filter lines that match a given regex, ie you give it some data to stdin or a file and it prints the lines that match the regex.
sed transforms data. It doesn't just return the matching lines; you can tell it what to return with the s/regex/replacement/ command. The replacement part can contain references to groups (\x where x is the number of the group). With the -r option (extended regular expressions) groups are written ( ); without it they are written \( \).
So what we need is sed. Your input contains some stuff (^.*), a ", some digits ([0-9]+), a ", and some stuff (.*$). We later need to reference the digits, so we need to make the digits a group. So our complete matching regex is: ^.*"([0-9]+)".*$. We want to replace that with only the digits, so the replacement part is just \1.
Building the complete sed command is left as an exercise to you :-)
(Note that sed prints lines that don't match unchanged. If your input is only the line you provided above, that's fine. If there are other lines you'd like to silently skip, you need to specify the option -n (no automatic printing) and add a p to the end of the s command, which instructs it to print the line when a substitution was made. That way it only prints the matching line(s).)
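One possible solution to the exercise, as a sketch: extract_quoted_number is a hypothetical name, -E is used as the more widely supported spelling of -r, and the second input line is made up to show that non-matching lines are skipped.

```shell
# extract_quoted_number (hypothetical name): print the digits found between
# double quotes, and print nothing for lines without such a number.
extract_quoted_number() {
    sed -nE 's/^.*"([0-9]+)".*$/\1/p'
}

printf '%s\n' 'hello world "22" bye world' 'no quoted number here' | extract_quoted_number
# prints: 22
```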
