How to extract numbers from string but keep decimals? - bash

I'm trying to extract numbers from a string in Bash using
cat out.log | tr -dc '0-9'
but this keeps only the digits and drops the decimal points. How can I keep the decimals?

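You need to add . and likely a space to the set of characters that tr keeps. Note that tr takes a plain list of characters, not a regex bracket expression, so no enclosing brackets are needed (with them, literal [ and ] would be kept as well):
$ echo "12.32foo 44.2 bar" | tr -dc '. [:digit:]'
12.32 44.2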

grep -Eo '[[:digit:]]+([.][[:digit:]]+)?' <out.log
-o ensures that only the parts matching the pattern are written to stdout. The pattern greedily matches one or more digits, optionally followed by a decimal point and more digits. Note that this deliberately does not accept "numbers" such as .56 and 14. in full, since they are considered malformed: only the digit runs 56 and 14 are extracted from them. If you want to include them, you can easily adjust the pattern.
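As a sketch of that adjustment, the pattern below also accepts a leading or a trailing decimal point. The function name extract_numbers and the sample input are made up for illustration; this is one possible pattern, not the only one.

```shell
# Hypothetical helper: print every number found on stdin, one per line.
# Accepts 12.32 and 7, and also 14. and .56 (unlike the stricter pattern above).
extract_numbers() {
  grep -Eo '[[:digit:]]+[.]?[[:digit:]]*|[.][[:digit:]]+'
}

printf 'a=.56 b=14. c=12.32 d=7\n' | extract_numbers
```

One caveat: with this looser pattern a full stop that merely ends a sentence after an integer is captured as part of the number.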

Related

Cut number of character in the beginning of string and end of the string

I need to cut a number of characters from the beginning and end of a string. The string does not have a specific format and can be random numbers and words. I am trying to remove 5 characters from the beginning and 11 from the end of the string.
Input string:
342136001788006DEEFF0000060000806000006HSV40002HP
Output string:
6001788006DEEFF000006000080600000
The characters 34213 (the first 5) and 6HSV40002HP (the last 11) are removed from the input.
It's OK, I found my answer with the cut command. I was so focused on awk and sed, but cut did the job in the end:
cut -c6-38 test.txt
You found the cut command, which is the best solution in this case.
You wondered how you should do this with sed, which will be interesting for more complex situations.
The naive solution is (using ; to separate the 2 substitutions and $ for end-of-line):
echo '342136001788006DEEFF0000060000806000006HSV40002HP' |
sed 's/.....//;s/...........$//'
You do not want to count the dots; you can say how often a pattern repeats with pattern\{count\}.
And you can capture a pattern and recall it later with 's/..\(pattern\)../\1/'.
echo '342136001788006DEEFF0000060000806000006HSV40002HP' |
sed 's/.\{5\}\(.*\).\{11\}/\1/'
When your sed supports the flag -r, you can avoid all these backslashes:
echo '342136001788006DEEFF0000060000806000006HSV40002HP' |
sed -r 's/.{5}(.*).{11}/\1/'
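The cut -c6-38 answer hard-codes the end position, so it only fits 49-character strings. A minimal sketch that computes the range from the string length instead is shown below; the helper name trim_ends is made up here, and it assumes plain ASCII input that is longer than the number of characters being removed.

```shell
# Hypothetical helper: drop $2 characters from the start and $3 from the end of $1.
trim_ends() {
  str=$1
  first=$(( $2 + 1 ))          # 1-based position of the first kept character
  last=$(( ${#str} - $3 ))     # 1-based position of the last kept character
  printf '%s\n' "$str" | cut -c"${first}-${last}"
}

trim_ends '342136001788006DEEFF0000060000806000006HSV40002HP' 5 11
```

For the sample string this works out to the same cut -c6-38 range as in the accepted answer.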

Dynamic delimiter in Unix

Input:
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split the characters to get the alphabetic field and the number after that.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression replaces the input with the expected output, except that instead of the newline it inserts a "$" sign, which we then replace with a newline using the tr command.
Where do the strings come from? Are they read from a file (or other source external to the script), or are they stored in the script? If they're in the script, you should simply reformat the data so it is easier to manage. Therefore, it is sensible to assume they come from an external data source such as a file or being piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read alpha number
do
…process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
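One of those ways around the problem, sketched under assumptions: feed the loop from a here-document instead of a pipe, so the loop runs in the current shell. The sample data and the variable names count and last_alpha are invented for illustration.

```shell
# Hypothetical sample data standing in for the external source.
data='1234ABC89,234
0520001DEF78,66
46545455KRJ21,00'

# Run sed first and keep its output in a variable.
fields=$(printf '%s\n' "$data" | sed 's/^[0-9]*\([A-Z]*\)/\1 /')

count=0
last_alpha=
# The loop reads from a here-document, not a pipe, so it is not run in a
# subshell and the variables it sets are still visible after "done".
while read -r alpha number
do
  count=$((count + 1))
  last_alpha=$alpha
  printf '%s -> %s\n' "$alpha" "$number"
done <<EOF
$fields
EOF

echo "processed $count lines, last key: $last_alpha"
```

The trade-off is that the whole input is held in memory before the loop starts, which is fine for modest inputs.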
You said you have many strings like this, so I recommend, if possible, saving them to a file such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
it uses -E for extended regular expressions to save on typing; otherwise, for grouping for example, we would have to escape the parentheses as \( and \)
it uses grouping with ( and ), and searches for three groups:
first, digits, where + specifies one or more of them. Oddly, using [0-9] results in an extra blank space above the results, so use the POSIX class [[:digit:]]
next, POSIX alphabetic characters, whether lowercase or uppercase, where {3} specifies exactly 3 of them
last, . meaning any character, with + for one or more times
\2\t\3 then returns group 2 and group 3, separated by a tab
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.
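The \t escape in the replacement is not understood by every sed. A possible portable sketch puts a literal tab into a shell variable first; the helper name split_fields is hypothetical, and the regex is anchored with ^ as an extra assumption about the data.

```shell
# Hypothetical helper: turn "1234ABC89,234" into "ABC<TAB>89,234".
split_fields() {
  tab=$(printf '\t')   # a literal tab, so the sed script needs no \t escape
  sed -E "s/^[[:digit:]]+([[:alpha:]]{3})(.+)/\1${tab}\2/"
}

printf '%s\n' '1234ABC89,234' '0520001DEF78,66' '46545455KRJ21,00' | split_fields

# The tab-separated output can then be picked apart with cut, for example:
printf '%s\n' '1234ABC89,234' | split_fields | cut -f2
```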

Bash grep keyword plus trailing numbers upto first whitespace

I'm looking to filter tcpdump output and extract only two constant element names, each with its string of changing numbers, which is followed by whitespace and more unwanted data. Is there a way to extract only up to the first whitespace using grep or sed? I've been using bash for about a month and this is the first time my google-fu has failed me.
Example output: red23:34:23 black23:43 purple00:55:22 yellow32:43 green10:10 (color names are constant)
Looking to extract: black23:43 yellow32:43
The -o option in grep prints only the matching part, so to get just black and the numbers you might do this:
output='red23:34:23 black23:43 purple00:55:22 yellow32:43 green10:10'
echo "$output" | grep -Eo 'black[0-9]+:[0-9]+'
and you could parameterize it like so:
color='green'
echo "$output" | grep -Eo "${color}[0-9]+:[0-9]+"
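To get both wanted names in one pass and to stop at the first whitespace whatever the number format, a sketch along these lines may help. The helper name extract_fields is made up, and it assumes the names never occur as part of some longer word in the data.

```shell
output='red23:34:23 black23:43 purple00:55:22 yellow32:43 green10:10'

# Hypothetical helper: $1 is an ERE alternation of the wanted names, e.g. 'black|yellow'.
# [^[:space:]]* takes everything after the name up to the first whitespace.
extract_fields() {
  grep -Eo "($1)[^[:space:]]*"
}

echo "$output" | extract_fields 'black|yellow'
```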

list words from file using shell script in alphabetical order and with no punctuation

I am using Shell script and bash commands.
I have to generate an alphabetical list of the words in a file that contains many sentences; I am using song lyrics to work this out. I can return each word in alphabetical order, but the output still includes some apostrophes, question marks and full stops. To do this I use:
cat lyrics01.txt | tr "\"' " '\n' | sort -u >> lyrics01.wl
I know this starts a new line at each double quote, apostrophe and space, but I need it to delete the punctuation and output just the words in alphabetical order.
I have tried implementing this part:
-d ',.;:-+=()'
after the 'tr' from my original code but it will not work. Any help for a simpler way or even to solve this would be much appreciated.
Assuming you want lines split on words but not split on punctuation so that "The world isn't fair." becomes
The
world
isnt
fair
and not
The
world
isn
t
fair
<blank line>
the following should do what you want
sed 's/[[:punct:]]*//g;s/ /\n/g' lyrics01.txt | sort -u >> lyrics01.wl
Try sed as below:
sed 's/\([[:punct:] ]\)/\n/g' lyrics01.txt | sort -u >> lyrics01.wl
This will replace every punctuation mark or space with a newline character.
All of the examples seem to remove the single quote from the word "isn't"
If that is not what you want, I've tested and come up with this :
$ cat test.txt
The
world
isn't
fair.
Isn't it ?
$ sed "s/ /\n/g" test.txt | sed "s/[[:punct:]]$/\n/g" | grep .
The
world
isn't
fair
Isn't
it
$
It's not sorted, but this shows that you can retain punctuation that is not at the end of a word.
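One likely reason the -d ',.;:-+=()' attempt failed is that tr reads :-+ as a character range, and that range runs backwards, which GNU tr rejects. A tr-only sketch that avoids listing characters by hand is shown below; the helper name word_list is invented, and LC_ALL=C is set only so the sort order is predictable (uppercase before lowercase).

```shell
# Hypothetical helper: read text on stdin, print its unique words, one per line.
word_list() {
  tr -d '[:punct:]' |            # delete all punctuation, including apostrophes
    tr -s '[:space:]' '\n' |     # turn each run of whitespace into one newline
    LC_ALL=C sort -u             # byte order, duplicates removed
}

printf "The world isn't fair.\nIsn't it ?\n" | word_list
```

Like the first sed answer, this turns "isn't" into "isnt"; text that starts with whitespace would also leave one empty line in the list.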

Grep characters before and after match?

Using this:
grep -A1 -B1 "test_pattern" file
will produce one line before and after the matched pattern in the file. Is there a way to display not lines but a specified number of characters?
The lines in my file are pretty big so I am not interested in printing the entire line but rather only observe the match in context. Any suggestions on how to do this?
3 characters before and 4 characters after
$> echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}'
23_string_and
grep -E -o ".{0,5}test_pattern.{0,5}" test.txt
This will match up to 5 characters before and after your pattern. The -o switch tells grep to only show the match and -E to use an extended regular expression. Make sure to put the quotes around your expression, else it might be interpreted by the shell.
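The two widths can be turned into parameters. Below is a minimal sketch; the function name grep_context and its argument order are made up for illustration, and the pattern is inserted into the regular expression unescaped.

```shell
# Hypothetical helper: grep_context BEFORE AFTER PATTERN [FILE...]
# Prints each match with up to BEFORE characters before it and up to AFTER after it.
grep_context() {
  before=$1 after=$2 pattern=$3
  shift 3
  # With no FILE arguments "$@" expands to nothing and grep reads stdin.
  grep -E -o ".{0,${before}}${pattern}.{0,${after}}" "$@"
}

echo "some123_string_and_another" | grep_context 3 4 string
```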
You could use
awk '/test_pattern/ {
match($0, /test_pattern/); print substr($0, RSTART - 10, RLENGTH + 20);
}' file
You mean, like this:
grep -o '.\{0,20\}test_pattern.\{0,20\}' file
?
That will print up to twenty characters on either side of test_pattern. The \{0,20\} notation is like *, but specifies zero to twenty repetitions instead of zero or more. The -o says to show only the match itself, rather than the entire line.
I'll never easily remember these cryptic command modifiers so I took the top answer and turned it into a function in my ~/.bashrc file:
cgrep() {
# For files whose lines are tens of thousands of characters long.
# Use cgrep to print 30 characters before and after the search pattern.
if [ $# -eq 2 ] ; then
# Format was 'cgrep "search string" /path/to/filename'
grep -o -P ".{0,30}$1.{0,30}" "$2"
else
# Format was 'cat /path/to/filename | cgrep "search string"'
grep -o -P ".{0,30}$1.{0,30}"
fi
} # cgrep()
Here's what it looks like in action:
$ ll /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
-rw-r--r-- 1 rick rick 25780 Jul 3 19:05 /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
$ cat /tmp/rick/scp.Mf7UdS/Mf7UdS.Source | cgrep "Link to iconic"
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
$ cgrep "Link to iconic" /tmp/rick/scp.Mf7UdS/Mf7UdS.Source
1:43:30.3540244000 /mnt/e/bin/Link to iconic S -rwxrwxrwx 777 rick 1000 ri
The file in question is one continuous 25K line and it is hopeless to find what you are looking for using regular grep.
Notice the two different ways you can call cgrep, which parallel the way grep itself can be called.
There is a "niftier" way of creating the function where "$2" is only passed when set which would save 4 lines of code. I don't have it handy though. Something like ${parm2} $parm2. If I find it I'll revise the function and this answer.
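One way to pass the file argument only when it is set is the ${2:+"$2"} parameter expansion. Whether this is the idiom the answer had in mind is an assumption; the sketch also swaps -P for -E, since this pattern needs no Perl-specific syntax.

```shell
# Sketch of a shorter cgrep: cgrep "search string" [/path/to/filename]
cgrep() {
  # ${2:+"$2"} expands to "$2" when a second argument is set and non-empty,
  # and to nothing otherwise, in which case grep reads stdin.
  grep -E -o ".{0,30}$1.{0,30}" ${2:+"$2"}
}

echo "some123_string_and_another" | cgrep string
```

Both calling styles from the original function keep working, with or without a file name.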
With gawk, you can use the match function:
x="hey there how are you"
echo "$x" |awk --re-interval '{match($0,/(.{4})how(.{4})/,a);print a[1],a[2]}'
ere are
If you are OK with perl, there is a more flexible solution. The following prints three characters before the pattern, then the pattern itself, then 5 characters after it.
echo hey there how are you |perl -lne 'print "$1$2$3" if /(.{3})(there)(.{5})/'
ey there how
This can also be applied to words instead of just characters. The following prints one word before the actual matching string.
echo hey there how are you |perl -lne 'print $1 if /(\w+) there/'
hey
The following prints one word after the pattern:
echo hey there how are you |perl -lne 'print $2 if /(\w+) there (\w+)/'
how
The following prints one word before the pattern, then the actual word, and then one word after the pattern:
echo hey there how are you |perl -lne 'print "$1$2$3" if /(\w+)( there )(\w+)/'
hey there how
If using ripgrep, this is how you would do it:
rg -o ".{0,5}test_pattern.{0,5}" test.txt
You can use one regexp grep to find the match and a second grep to highlight it:
echo "some123_string_and_another" | grep -o -P '.{0,3}string.{0,4}' | grep string
23_string_and
With ugrep you can specify -ABC context with option -o (--only-matching) to show the match with extra characters of context before and/or after the match, fitting the match plus the context within the specified -ABC width. For example:
ugrep -o -C30 pattern testfile.txt
gives:
1: ... long line with an example pattern to match. The line could...
2: ...nother example line with a pattern.
On a terminal the same output is shown with color highlighting.
Multiple matches on a line are either shown with [+nnn more], or, with option -k (--column-number), each is shown individually with its context and the column number.
The context width is the number of Unicode characters displayed (UTF-8/16/32), not just ASCII.
Personally, I do something similar to the posted answers. But the dot key, like any key, can be tapped or held down, and I often don't need a lot of context (if I needed more I might use lines, as with grep -C, but like you I often don't want the lines before and after). So I find it much quicker to just type one dot per character of context: tapping the key for a few, or holding it down for more.
e.g. echo zzzabczzzz | grep -o '.abc..'
This prints the abc pattern with one character before and two after (in regex, a dot matches any character). Others used the dot too, but with curly braces to specify repetition.
If I wanted to be strict about between 0 and x characters, or exactly y characters, then I'd use the curly braces and -P, as others have done.
There is a setting for whether the dot matches a newline, but you can look into that if it's of concern or interest.

Resources