Extract the first character after the first number - shell

Let's say I have a file like this:
14-Hello14657
156:Good morning 487
1478456=Good bye 1 2
I would like to extract the first character after the first number of the line (and store it in a variable, one at a time). In this example, it would extract
-
:
=
I guess that I should probably use regular expressions but I am still learning it and I can't find a way to do this.

sed approach:
s="156:Good morning 487"
var1=$(sed 's/^[0-9]*\([^0-9]\).*/\1/' <<< $s)
echo $var1
:
Another approach is bash variable expansion + cut command:
s="1478456=Good bye 1 2"
echo ${s//[[:digit:]]/} | cut -c1
=

With GNU grep (the one installed on most Linux systems) you can use
grep -Po '^[0-9]+\K.' yourFile
To store the output in a variable, use
myVar="$(grep -Po '^[0-9]+\K.' yourFile)"
Using your example, the variable myVar will contain all three symbols:
-
:
=
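If you really do want the character one line at a time in a shell variable (as the question asks), a pure-bash sketch like the following should also work. It assumes the same file name yourFile as the grep example above; the variable names line and c are just illustrative:
while IFS= read -r line; do
    # capture the first character that follows the leading run of digits
    if [[ $line =~ ^[0-9]+(.) ]]; then
        c=${BASH_REMATCH[1]}
        echo "$c"
    fi
done < yourFile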

Related

Delete everything before a pattern

I'm trying to clean a text file.
I want to delete everything start before the first 12 numbers.
1:0:135103079189:0:0:2:0::135103079189:000011:00
A:908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Output desired:
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Here's my command but seems not working.
sed '/:\([0-9]\{12\}\)/d' t.txt
The d command in sed deletes the entire line when the given regex matches; you need the s command to search and replace only part of a line. However, for this problem sed isn't a great fit, as it doesn't support non-greedy regexes.
You can use perl instead:
$ perl -pe's/^.*?(?=\d{12}:)//' ip.txt
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
.*? matches zero or more characters, as few as possible
(?=\d{12}:) asserts that this position is followed by 12 digits and a :
use perl -i -pe for in-place editing
some possible corner cases
$ # this is matching part of field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe's/^.*?(?=\d{12}:)//'
135103079189:23:603307102606:1
$ # this is not matching 12-digit field at end of line
$ echo 'foo:123:135103079189' | perl -pe's/^.*?(?=\d{12}:)//'
foo:123:135103079189
$ # so, add start/end of line matching cases and restrict 12-digits to whole field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe 's/^(?:.*?:)?(?=\d{12}(:|$))//'
603307102606:1
$ echo 'foo:123:135103079189' | perl -pe's/^(?:.*?:)?(?=\d{12}(:|$))//'
135103079189
Could you please try the following.
awk --re-interval 'match($0,/[0-9]{12}/){print substr($0,RSTART)}' Input_file
Since I have an old version of awk I am using --re-interval; you can remove it if you have a newer version.
This might work for you (GNU sed):
sed -n 's/[0-9]\{12\}/\n&/;s/.*\n//p' file
We only want to print specific lines so use the -n option to turn off automatic printing. If a line contains a 12 digit number, insert a newline before it. Remove any characters before and including a newline and print the result.
If you want to print lines that do not contain a 12 digit number as is, use:
sed 's/[0-9]\{12\}/\n&/;s/.*\n//' file
The crux of the problem is to identify the start of a multi-character string, insert a unique marker and delete all characters before and including the unique marker. As sed uses the newline to delimit lines, only the user can introduce newlines into the pattern space and as a result, newlines will always be unique.
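To see the two steps in isolation, here is a quick demonstration on one of the sample lines (GNU sed, where \n in the replacement inserts a newline; the output shown is what it should produce):
$ echo 'A:908529896240:0:10250:2:0:1:' | sed 's/[0-9]\{12\}/\n&/'
A:
908529896240:0:10250:2:0:1:
$ echo 'A:908529896240:0:10250:2:0:1:' | sed 's/[0-9]\{12\}/\n&/;s/.*\n//'
908529896240:0:10250:2:0:1: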
Taking the nice answer from @Sundeep, in case you would like to use grep or pcregrep (macOS/BSD), you could give this a try:
$ grep -oP '^(?:.*?:)?(?=\d{12})\K.*' file
or
$ pcregrep -o '^(?:.*?:)?(?=\d{12})\K.*' file
The \K resets the start of the reported match, so everything matched before it is excluded from the output
Alternative thoughts - I almost think your data is too dirty for a quick sed fix, but if it's generally all similar to your sample data then certainly pick one of the sed answers above. However, if you want to be more particular about it, you can build up a set of commands to validate the values. I like doing this for debugging and when speed isn't critical.
Take this tiny sample of code. You could do this other ways, but here I'm getting the value for each part of the string, and I know the order because it's contiguous. You could then set up controls on which parts to keep as it builds, say, a new string per line. Overwrought for sure, but sometimes that is a better long-term approach.
#!/bin/bash
while IFS= read -r line; do
    IFS=':' read -r -a array <<< "$line"
    for ((i=0; i<${#array[@]}; i++)); do
        echo "part : ${array[$i]}"
    done
done < "test_data.txt"
You could then build the data back up however you want and more easily understand what's happening at every step of the way:
part : 1
part : 0
part : 135103079189
part : 0
part : 0
part : 2
part : 0
part :
part : 135103079189
part : 000011
part : 00
part : A
part : 908529896240
part : 0
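Building on that, here is a hedged sketch of how the same split-into-parts loop could rebuild each line starting from the first field that is exactly 12 digits (the file name test_data.txt and the 12-digit criterion are assumptions taken from the question):
#!/bin/bash
while IFS= read -r line; do
    IFS=':' read -r -a array <<< "$line"
    out=()
    keep=0
    for part in "${array[@]}"; do
        # start keeping fields once we hit one that is exactly 12 digits
        [[ $part =~ ^[0-9]{12}$ ]] && keep=1
        (( keep )) && out+=("$part")
    done
    # re-join the kept fields with ':' in a subshell so IFS is not disturbed
    (IFS=':'; echo "${out[*]}")
done < "test_data.txt"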

BASH Palindrome Checker

This is my first time posting on here so bear with me please.
I received a bash assignment but my professor is completely unhelpful and so are his notes.
Our assignment is to filter and print out palindromes from a file. In this case, the directory is:
/usr/share/dict/words
The word lengths range from 3 to 45, and I'm supposed to filter for lowercase letters only (the dictionary given has punctuation and uppercase as well as lowercase letters), e.g. "-dkas-das"; so something like "q-evvavve-q" may count as a palindrome, but I shouldn't be getting that as a proper result.
Anyways, I can get it to filter out x amount of words and return (not filtering only lowercase though).
grep "^...$" /usr/share/dict/words |
grep "\(.\).\1"
And I can use subsequent lines for 5 letter words and 7 and so on:
grep "^.....$" /usr/share/dict/words |
grep "\(.\)\(.\).\2\1"
But the prof does not want that. We are supposed to use a loop. I get the concept but I don't know the syntax, and like I said, the notes are very unhelpful.
What I tried was setting variables x=... and y=.. and in a while loop, having x=$x$y but that didn't work (syntax error) and neither did x+=..
Any help is appreciated. Even getting my non-lowercase letters filtered out.
Thanks!
EDIT:
If you're providing a solution or a hint to a solution, the simplest method is preferred.
Preferably one that uses 2 grep statements and a loop.
Thanks again.
Like this:
for word in `grep -E '^[a-z]{3,45}$' /usr/share/dict/words`;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Output using my dictionary:
aha
bib
bob
boob
...
wow
Update
As pointed out in the comments, reading in most of the dictionary into a variable in the for loop might not be the most efficient, and risks triggering errors in some shells. Here's an updated version:
grep -E '^[a-z]{3,45}$' /usr/share/dict/words | while read -r word;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Why use grep? Bash will happily do that for you:
#!/bin/bash
is_pal() {
local w=$1
while (( ${#w} > 1 )); do
[[ ${w:0:1} = ${w: -1} ]] || return 1
w=${w:1:-1}
done
}
while read word; do
is_pal "$word" && echo "$word"
done
Save this as banana, chmod +x banana and enjoy:
./banana < /usr/share/dict/words
If you only want to keep the words with at least three characters:
grep ... /usr/share/dict/words | ./banana
If you only want to keep the words that only contain lowercase and have at least three letters:
grep '^[[:lower:]]\{3,\}$' /usr/share/dict/words | ./banana
The multiple greps are wasteful. You can simply do
grep -E '^([a-z])[a-z]\1$' /usr/share/dict/words
in one fell swoop, and similarly, put the expressions on grep's standard input like this:
echo '^([a-z])[a-z]\1$
^([a-z])([a-z])\2\1$
^([a-z])([a-z])[a-z]\2\1$' | grep -E -f - /usr/share/dict/words
However, regular grep does not permit backreferences beyond \9. With grep -P you can use double-digit backreferences, too.
The following script constructs the entire expression in a loop. Unfortunately, grep -P does not allow for the -f option, so we build a big thumpin' variable to hold the pattern. Then we can actually also simplify to a single pattern of the form ^(.)(?:.|(.)(?:.|(.)....\3)?\2)?\1$, except we use [a-z] instead of . to restrict to just lowercase.
head=''
tail=''
for i in $(seq 1 22); do
head="$head([a-z])(?:[a-z]|"
tail="\\$i${tail:+)?}$tail"
done
grep -P "^${head%|})?$tail$" /usr/share/dict/words
The single grep should be a lot faster than individually invoking grep 22 or 43 times on the large input file. If you want to sort by length, just add that as a filter at the end of the pipeline; it should still be way faster than multiple passes over the entire dictionary.
The expression ${tail:+)?} evaluates to a closing parenthesis and question mark only when tail is non-empty, which is a convenient way to force the \1 back-reference to be non-optional. Somewhat similarly, ${head%|} trims the final alternation operator from the ultimate value of $head.
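For example, a hedged way to add the sort-by-length step mentioned above is to prefix each word with its length, sort numerically, then strip the prefix again (this reuses the head and tail variables built in the loop):
grep -P "^${head%|})?$tail$" /usr/share/dict/words |
awk '{ print length, $0 }' | sort -n | cut -d' ' -f2-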
Ok here is something to get you started:
I suggest to use the plan you have above, just generate the number of "." using a for loop.
This question will explain how to make a for loop from 3 to 45:
How do I iterate over a range of numbers defined by variables in Bash?
for i in {3..45};
do
* put your code above here *
done
Now you just need to figure out how to make "i" number of dots "." in your first grep and you are done.
Also, look into sed; it can nuke the non-lowercase words for you.
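Here is a rough sketch of that plan, sticking to two greps and a loop and building the dot/back-reference pattern from the loop counter. Note that grep back-references only go up to \9, so this covers lengths 3 to 19; the grep -P answer above is the way to go for the full 3 to 45 range:
#!/bin/bash
for ((n=3; n<=19; n++)); do
    half=$((n / 2))
    head=''
    tail=''
    for ((g=1; g<=half; g++)); do
        head="${head}\\(.\\)"    # one BRE capture group per character in the first half
        tail="\\${g}${tail}"     # matching back-reference, in reverse order
    done
    mid=''
    (( n % 2 )) && mid='.'       # odd lengths have an unpaired middle character
    grep '^[a-z]*$' /usr/share/dict/words | grep "^${head}${mid}${tail}\$"
done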
Another solution that uses a Perl-compatible regular expression (PCRE) with recursion, heavily inspired by this answer:
grep -P '^(?:([a-z])(?=[a-z]*(\1(?(2)\2))$))++[a-z]?\2?$' /usr/share/dict/words

'grep +A': print everything after a match [duplicate]

This question already has answers here: How to get the part of a file after the first line that matches a regular expression (12 answers). Closed 7 years ago.
I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after: http://www.yahoo.com, results looks like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3:http://www.yahoo.com
But I don't know how to get the part of the file after line number 3. Also, I know there is a flag in grep, -A, to print lines after your match. However, you need to specify how many lines you want after the match. I am wondering whether there is something to get around that issue. Like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line containing yahoo, we set the variable y=1 and then skip that line (the next command jumps to the next line, skipping any further processing of the current one). Without the next command, the yahoo line itself would be printed.
The second part is a short hand for:
y != 0 { print }
Which means: for each line, if variable y is non-zero, we print that line. In AWK, if you refer to a variable, that variable will be created and is either zero or the empty string, depending on context. Before encountering yahoo, variable y is 0, so the script does not print anything. After encountering yahoo, y is 1, so every line after that will be printed.
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the -e directive means "from the first appearance of a line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get index of the "yahoo" word
index=`grep -n "yahoo" filepath | cut -d':' -f1`
# Get the total number of lines in the file
totallines=`wc -l filepath | cut -d' ' -f1`
# Subtract totallines with index
result=`expr $totallines - $index`
# Print that many lines after the match (note that -A also prints the matching yahoo line itself)
grep -A $result "yahoo" filepath

Remove part of path on Unix

I'm trying to remove part of the path in a string. I have the path:
/path/to/file/drive/file/path/
I want to remove the first part /path/to/file/drive and produce the output:
file/path/
Note: I have several paths in a while loop, with the same /path/to/file/drive in all of them, but I'm just looking for the 'how to' on removing the desired string.
I found some examples, but I can't get them to work:
echo /path/to/file/drive/file/path/ | sed 's:/path/to/file/drive:\2:'
echo /path/to/file/drive/file/path/ | sed 's:/path/to/file/drive:2'
\2 being the second part of the string and I'm clearly doing something wrong...maybe there is an easier way?
If you wanted to remove a certain NUMBER of path components, you should use cut with -d'/'. For example, if path=/home/dude/some/deepish/dir:
To remove the first two components:
# (Add 2 to the number of components to remove to get the value to pass to -f)
echo $path | cut -d'/' -f4-
# output:
# some/deepish/dir
To keep the first two components:
echo $path | cut -d'/' -f-3
# output:
# /home/dude
To remove the last two components (rev reverses the string; after rev the empty field produced by the leading / sits at the end, so here you add 1 rather than 2):
echo $path | rev | cut -d'/' -f3- | rev
# output:
# /home/dude/some
To keep the last three components:
echo $path | rev | cut -d'/' -f-3 | rev
# output:
# some/deepish/dir
Or, if you want to remove everything before a particular component, sed would work:
echo $path | sed 's/.*\(some\)/\1/g'
# output:
# some/deepish/dir
Or after a particular component:
echo $path | sed 's/\(dude\).*/\1/g'
# output:
# /home/dude
It's even easier if you don't want to keep the component you're specifying:
echo $path | sed 's/some.*//g'
# output:
# /home/dude/
And if you want to be consistent you can match the trailing slash too:
echo $path | sed 's/\/some.*//g'
# output:
# /home/dude
Of course, if you're matching several slashes, you should switch the sed delimiter:
echo $path | sed 's!/some.*!!g'
# output:
# /home/dude
Note that these examples all use absolute paths, you'll have to play around to make them work with relative paths.
You can also use POSIX shell variable expansion to do this.
path=/path/to/file/drive/file/path/
echo ${path#/path/to/file/drive/}
The #.. part strips off a leading matching string when the variable is expanded; this is especially useful if your strings are already in shell variables, like if you're using a for loop. You can strip matching strings (e.g., an extension) from the end of a variable also, using %.... See the bash man page for the gory details.
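For instance, with a hypothetical filename just to illustrate the % and %% forms:
file=report.tar.gz
echo "${file%.gz}"     # report.tar  -- shortest match stripped from the end
echo "${file%%.*}"     # report      -- longest match stripped from the end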
If you don't want to hardcode the part you're removing:
$ s='/path/to/file/drive/file/path/'
$ echo ${s#$(dirname "$(dirname "$s")")/}
file/path/
One way to do this with sed is
echo /path/to/file/drive/file/path/ | sed 's:^/path/to/file/drive/::'
If you want to remove the first N parts of the path, you could of course use N calls to dirname, as in glenn's answer, but it's probably easier to use globbing:
path=/path/to/file/drive/file/path/
echo "${path#*/*/*/*/*/}" # file/path/
Specifically, ${path#*/*/*/*/*/} means "return $path minus the shortest prefix that contains 5 slashes".
Using ${path#/path/to/file/drive/} as suggested by evil otto is certainly the typical/best way to do this, but since there are many sed suggestions it is worth pointing out that sed is overkill if you are working with a fixed string. You can also do:
echo $PATH | cut -b 21-
To discard the first 20 characters. Similarly, you can use ${PATH:20} in bash or $PATH[21,-1] in zsh.
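Spelled out against the example path (and note that using the name PATH for your own variable shadows the real command-search PATH, so a lowercase name is safer):
path=/path/to/file/drive/file/path/
echo "$path" | cut -b 21-    # file/path/
echo "${path:20}"            # file/path/  (bash)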
Pure bash, without hard coding the answer
basenames()
{
    local d="${2}"
    for ((x=0; x<"${1}"; x++)); do
        d="${d%/*}"
    done
    echo "${2#"${d}"/}"
}
Argument 1 - How many levels do you want to keep (2 in the original question)
Argument 2 - The full path
Taken from vsi_common(original version)
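A quick usage sketch (note that a trailing slash in the path counts as one level here, so the question's example path needs 3 rather than 2 to yield file/path/):
basenames 2 /path/to/file/drive/file/path
# file/path
basenames 3 /path/to/file/drive/file/path/
# file/path/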
Here's a solution using simple bash syntax that accommodates variables (in case you don't want to hard code full paths), removes the need for piping stdin to sed, and includes a for loop, for good measure:
FULLPATH="/path/to/file/drive/file/path/"
SUBPATH="/path/to/file/drive/"
for i in $FULLPATH;
do
echo ${i#$SUBPATH}
done
As mentioned above by @evil otto, the # symbol is used to remove a prefix in this scenario.

Capturing Groups From a Grep RegEx

I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:
files="*.jpg"
for f in $files
do
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
echo $name
done
So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that in a variable.
I'd like to use grep only, if possible. If not, please no Python or Perl, etc.; sed or something like it – I would like to attack this from the *nix purist angle.
Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell? If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?
If you're using Bash, you don't even have to use grep:
files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files # unquoted in order to allow the glob to expand
do
if [[ $f =~ $regex ]]
then
name="${BASH_REMATCH[1]}"
echo "${name}.jpg" # concatenate strings
name="${name}.jpg" # same thing stored in a variable
else
echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
fi
done
It's better to put the regex in a variable. Some patterns won't work if included literally.
This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:
123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz
To eliminate the second and fourth examples, make your regex like this:
^[0-9]+_([a-z]+)_[0-9a-z]*
which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:
^[0-9]+_([a-z]+)_[0-9a-z]*$
then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):
name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg
The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).
The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.
In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.
The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
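As a small illustration of the difference (hypothetical input string; the look-behind here is legal because it is fixed-length, a single _):
$ echo '123_abc_d4e5' | grep -Po '(?<=_)[a-z]+(?=_)'
abc
$ echo '123_abc_d4e5' | grep -Po '[0-9]+_\K[a-z]+(?=_)'
abc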
This isn't really possible with pure grep, at least not generally.
But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).
Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:
echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'
The first grep would remove any lines that didn't match your overall pattern, the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)
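To make that aside concrete with an illustrative filename in the question's format:
$ echo '123_abc_d4e5.jpg' | grep -Ei '^[0-9]+_[a-z]+_[0-9a-z]*' | cut -d _ -f 2
abc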
I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
to the following:
name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')
to get only the contents of the capturing group 1.
The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.
The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.
With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.
Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.
Not possible in just grep I believe
for sed:
name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`
I'll take a stab at the bonus though:
echo "$name.jpg"
This is a solution that uses gawk. It's something I find I need to use often so I created a function for it
function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }
to use just do
$ echo 'hello world' | regex1 'hello\s(.*)'
world
str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
week="${BASH_REMATCH[1]}"
day="${BASH_REMATCH[2]}"
hour="${BASH_REMATCH[3]}"
echo $week --- $day ---- $hour
fi
output:
1 --- 2 ---- 1
A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:
f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}
Then name will have the value abc.
See Apple developer docs, search forward for 'Parameter Expansion'.
I prefer a one-line python or perl command; both are often included in major Linux distributions.
echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' | python -c $'
import re
import sys
for i in sys.stdin:
g=re.match(r\'.*href="(.*)"\',i);
if g is not None:
print g.group(1)
'
and to handle files:
ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
i=i.strip()
f=open(i,"r")
for j in f:
g=re.match(r\'.*href="(.*)"\',j);
if g is not None:
print g.group(1)
f.close()
'
The following example shows how to extract the 3-character sequence from a filename using a regex capture group:
for f in 123_abc_123.jpg 123_xyz_432.jpg
do
echo "f: " $f
name=$( perl -ne 'if (/[0-9]+_([a-z]+)_[0-9a-z]*/) { print $1 . "\n" }' <<< $f )
echo "name: " $name
done
Outputs:
f: 123_abc_123.jpg
name: abc
f: 123_xyz_432.jpg
name: xyz
So the if-regex conditional in perl filters out all non-matching lines; for the lines that do match, it applies the capture group(s), which you can access with $1, $2, ... respectively.
If you have bash, you can use extended globbing:
shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
or
ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
