Remove part of path on Unix - shell

I'm trying to remove part of the path in a string. I have the path:
/path/to/file/drive/file/path/
I want to remove the first part /path/to/file/drive and produce the output:
file/path/
Note: I have several paths in a while loop, with the same /path/to/file/drive in all of them, but I'm just looking for the 'how to' on removing the desired string.
I found some examples, but I can't get them to work:
echo /path/to/file/drive/file/path/ | sed 's:/path/to/file/drive:\2:'
echo /path/to/file/drive/file/path/ | sed 's:/path/to/file/drive:2'
\2 being the second part of the string and I'm clearly doing something wrong...maybe there is an easier way?

If you wanted to remove a certain NUMBER of path components, you should use cut with -d'/'. For example, if path=/home/dude/some/deepish/dir:
To remove the first two components:
# (Add 2 to the number of components to remove to get the value to pass to -f)
echo $path | cut -d'/' -f4-
# output:
# some/deepish/dir
To keep the first two components:
echo $path | cut -d'/' -f-3
# output:
# /home/dude
To remove the last two components (rev reverses the string; the reversed string has no leading empty field, so add only 1 here):
echo $path | rev | cut -d'/' -f3- | rev
# output:
# /home/dude/some
To keep the last three components:
echo $path | rev | cut -d'/' -f-3 | rev
# output:
# some/deepish/dir
Or, if you want to remove everything before a particular component, sed would work:
echo $path | sed 's/.*\(some\)/\1/g'
# output:
# some/deepish/dir
Or after a particular component:
echo $path | sed 's/\(dude\).*/\1/g'
# output:
# /home/dude
It's even easier if you don't want to keep the component you're specifying:
echo $path | sed 's/some.*//g'
# output:
# /home/dude/
And if you want to be consistent, you can match the slash before the component too:
echo $path | sed 's/\/some.*//g'
# output:
# /home/dude
Of course, if you're matching several slashes, you should switch the sed delimiter:
echo $path | sed 's!/some.*!!g'
# output:
# /home/dude
Note that these examples all use absolute paths; you'll have to play around to make them work with relative paths.
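The "add 2" rule for cut can be wrapped in a small function. This is only a sketch: drop_first is a hypothetical name, and it assumes an absolute path with no trailing slash.

```shell
# drop_first N PATH: remove the first N components of an absolute path.
# (drop_first is an illustrative name, not a standard utility.)
drop_first() {
    # Field 1 is the empty string before the leading slash, so the first
    # component is field 2 and the first field to keep is N + 2.
    printf '%s\n' "$2" | cut -d'/' -f"$(( $1 + 2 ))-"
}

drop_first 2 /home/dude/some/deepish/dir   # some/deepish/dir
```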

You can also use POSIX shell variable expansion to do this.
path=/path/to/file/drive/file/path/
echo ${path#/path/to/file/drive/}
The #.. part strips off a leading matching string when the variable is expanded; this is especially useful if your strings are already in shell variables, like if you're using a for loop. You can strip matching strings (e.g., an extension) from the end of a variable also, using %.... See the bash man page for the gory details.
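A minimal sketch of using the prefix expansion over several paths, as in the loop the question mentions. The function name strip_prefix and the sample paths are made up for illustration.

```shell
# strip_prefix PREFIX PATH...: print each PATH with PREFIX removed from the front.
# (strip_prefix is a hypothetical helper, not part of any shell.)
strip_prefix() {
    prefix=$1
    shift
    for p in "$@"; do
        # Quoting $prefix inside the expansion makes it match literally
        # instead of being treated as a glob pattern.
        printf '%s\n' "${p#"$prefix"}"
    done
}

strip_prefix /path/to/file/drive/ \
    /path/to/file/drive/file/path/ \
    /path/to/file/drive/other/dir/
# file/path/
# other/dir/

# The % form strips from the end instead:
name=archive.tar.gz
printf '%s\n' "${name%.gz}"   # archive.tar
```

If the prefix is not present, the path is printed unchanged, so this is safe to run over mixed input.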

If you don't want to hardcode the part you're removing:
$ s='/path/to/file/drive/file/path/'
$ echo ${s#$(dirname "$(dirname "$s")")/}
file/path/

One way to do this with sed is
echo /path/to/file/drive/file/path/ | sed 's:^/path/to/file/drive/::'

If you want to remove the first N parts of the path, you could of course use N calls to dirname, as in glenn's answer, but it's probably easier to use globbing:
path=/path/to/file/drive/file/path/
echo "${path#*/*/*/*/*/}" # file/path/
Specifically, ${path#*/*/*/*/*/} means "return $path minus the shortest prefix that contains 5 slashes".
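When N is not known in advance, the same effect can be had by removing one slash-terminated prefix at a time. A rough sketch (drop_leading is a hypothetical name):

```shell
# drop_leading N PATH: remove the shortest prefix containing N slashes.
drop_leading() {
    n=$1
    rest=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        rest=${rest#*/}    # drop everything up to and including the first slash
        i=$((i + 1))
    done
    printf '%s\n' "$rest"
}

drop_leading 5 /path/to/file/drive/file/path/   # file/path/
```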

Using ${path#/path/to/file/drive/} as suggested by evil otto is certainly the typical/best way to do this, but since there are many sed suggestions it is worth pointing out that sed is overkill if you are working with a fixed string. You can also do:
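echo "$path" | cut -b 21-
This discards the first 20 characters. Similarly, you can use ${path:20} in bash or $var[21,-1] in zsh (zsh subscripts are 1-based, and path is a special array there, so the string is assumed to live in a differently named variable such as var).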
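Rather than counting characters by hand, the offset can be derived from the length of the prefix. A small sketch, assuming plain ASCII paths; strip_by_length is a hypothetical name, and note that it does not check that the path really starts with the prefix.

```shell
# strip_by_length PREFIX FULL: cut as many leading characters as PREFIX has.
strip_by_length() {
    prefix=$1
    full=$2
    start=$(( ${#prefix} + 1 ))     # first character to keep (cut is 1-based)
    printf '%s\n' "$full" | cut -c "${start}-"
}

strip_by_length /path/to/file/drive/ /path/to/file/drive/file/path/   # file/path/
```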

Pure bash, without hard coding the answer
basenames()
{
local d="${2}"
for ((x=0; x<"${1}"; x++)); do
d="${d%/*}"
done
echo "${2#"${d}"/}"
}
Argument 1 - How many levels do you want to keep (2 in the original question)
Argument 2 - The full path
Taken from vsi_common (original version)
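One thing to watch with this approach: a trailing slash counts as an (empty) level, so with the question's path, which ends in /, asking for 2 levels yields path/ rather than file/path/. The variant below is only a sketch (keep_last is a hypothetical name) that drops one trailing slash first and avoids bash-only syntax.

```shell
# keep_last N PATH: print the last N components of PATH.
keep_last() {
    n=$1
    full=${2%/}            # drop one trailing slash, if any
    head=$full
    i=0
    while [ "$i" -lt "$n" ]; do
        head=${head%/*}    # peel one component off the end
        i=$((i + 1))
    done
    # What is left in $head is the part to remove from the front.
    printf '%s\n' "${full#"$head"/}"
}

keep_last 2 /path/to/file/drive/file/path/   # file/path
```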

Here's a solution using simple bash syntax that accommodates variables (in case you don't want to hard code full paths), removes the need for piping stdin to sed, and includes a for loop, for good measure:
FULLPATH="/path/to/file/drive/file/path/"
SUBPATH="/path/to/file/drive/"
for i in $FULLPATH;
do
echo ${i#$SUBPATH}
done
As mentioned above by evil otto, the # symbol is used to remove a prefix in this scenario.

Related

Bash command - how to grep and then truncate but keep grep-ed part?

I am trying to splice out a particular piece of string. I used:
myVar=$(grep --color 'GACCT[ATCG]*AGGTC' FILE.txt | cat)
then, I used the code below to remove everything before and after my desired portion.
myVar1=$(echo ${myVar##*GACCT})
echo ${myVar1%%AGGTC*}
The code works; however, it cuts off the GACCT and AGGTC at the beginning and end of the fragment that I want to keep. Is there any way to cut the beginning and end off while still keeping the GACCT and AGGTC?
Thank you!
If you have GNU grep, you can make use of
myVar=$(grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' FILE.txt)
See the online demo:
#!/bin/bash
s='GACCTAAATTTGGGCCCAGGTC'
# Original script
myVar=$(grep --color 'GACCT[ATCG]*AGGTC' <<< "$s" | cat)
myVar1=$(echo ${myVar##*GACCT})
echo ${myVar1%%AGGTC*}
# => AAATTTGGGCCC
# My suggestion:
grep --color=never -oP 'GACCT\K[ATCG]+(?=AGGTC)' <<< "$s"
# => AAATTTGGGCCC
With --color=never, your matches are not colored.
The -o option outputs only the matched text, and the -P option enables the PCRE regex engine. It is necessary here since the regex pattern contains PCRE-specific operators, like \K and (?=...).
More details
GACCT - a literal string
\K - operator that makes the regex engine "forget" what has been consumed
[ATCG]+ - one or more letters from the set
(?=AGGTC) - a positive lookahead that requires an AGGTC string immediately to the right of the current location.
Note you can get this result with pcregrep, too, if you install it:
myVar=$(pcregrep -o 'GACCT\K[ATCG]+(?=AGGTC)' FILE.txt)
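If the flanking GACCT and AGGTC should stay in the output, as the question asks, a plain -o match on the whole pattern is enough and no PCRE features are needed. A minimal sketch; extract_fragment is a hypothetical name, and -o is assumed to be available (it is in GNU and BSD grep).

```shell
# extract_fragment STRING: print the GACCT...AGGTC fragment, flanks included.
extract_fragment() {
    printf '%s\n' "$1" | grep -o 'GACCT[ATCG]*AGGTC'
}

extract_fragment 'ttGACCTAAATTTGGGCCCAGGTCaa'   # GACCTAAATTTGGGCCCAGGTC
```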

Remove pattern in first occurrence from right to left in file name in bash

Say I have a string file name aa.bb.cc.xx.txt
I would like to remove the first content between . and . (remove .xx) before the .txt to have aa.bb.cc.txt.
I don't want to use rev, cut and rev because this uses 3 commands
echo 'aa.bb.cc.xx.rpm' |rev | cut -d '.' --complement -s -f 2 |rev
Is there any better solution by using bash?
Thanks
If you know the file ends with .txt, you can remove that as well, then put it back on.
$ oldname=aa.bb.cc.xx.txt
$ echo "${oldname%.*.txt}.txt"
aa.bb.cc.txt
%.*.txt removes the shortest string matching the pattern .*.txt (in this case, .xx.txt).
If the extension could be an arbitrary string, you can save it by removing everything but the extension as a prefix, then restoring it.
$ echo "${oldname%.*.*}.${oldname##*.}"
##*. removes the longest matching prefix ending in ., in this case aa.bb.cc.xx.. Both operators require removing the . that delimits the matched prefix or suffix, which is why you need to add it back explicitly between the two expansions.
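The two expansions can be combined into a small function. This is a sketch under the assumption that names with fewer than two dots should be left alone; drop_second_to_last is a hypothetical name.

```shell
# drop_second_to_last NAME: remove the dot-separated field before the extension.
drop_second_to_last() {
    name=$1
    case $name in
        *.*.*)
            # ${name%.*.*} drops the last two fields; ${name##*.} is the extension.
            printf '%s\n' "${name%.*.*}.${name##*.}" ;;
        *)
            # Fewer than two dots: nothing to remove.
            printf '%s\n' "$name" ;;
    esac
}

drop_second_to_last aa.bb.cc.xx.txt   # aa.bb.cc.txt
drop_second_to_last aa.bb.cc.xx.rpm   # aa.bb.cc.rpm
```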
You can use sed as follows:
$ echo "aa.bb.cc.xx.txt" | sed 's/\.[^.]*\.txt$/.txt/'
aa.bb.cc.txt
If you want a general sed solution that works on any extension, you can do:
$ echo 'aa.bb.cc.xx.rpm' | sed 's/[^.]*\.\([^.]*\)$/\1/'
aa.bb.cc.rpm

How to get line WITH tab character using tail and head

I have made a script to practice my Bash, only to realize that this script does not take tabulation into account, which is a problem since it is designed to find and replace a pattern in a Python script (which obviously needs tabulation to work).
Here is my code. Is there a simple way to get around this problem ?
pressure=1
nline=$(cat /myfile.py | wc -l) # find the line length of the file
echo $nline
for ((c=0;c<=${nline};c++))
do
res=$( tail -n $(($(($nline+1))-$c)) myfile.py | head -n 1 | awk 'gsub("="," ",$1){print $1}' | awk '{print$1}')
#echo $res
if [ $res == 'pressure_run' ]
then
echo "pressure_run='${pressure}'" >> myfile_mod.py
else
echo $( tail -n $(($nline-$c)) myfile.py | head -n 1) >> myfile_mod.py
fi
done
Basically, it finds the line that has pressure_run=something and replaces it by pressure_run=$pressure. The rest of the file should be untouched. But in this case, all tabulation is deleted.
If you want to just do the replacement as quickly as possible, sed is the way to go as pointed out in shellter's comment:
sed "s/\(pressure_run=\).*/\1$pressure/" myfile.py
For Bash training, as you say, you may want to loop manually over your file. A few remarks for your current version:
Is /myfile.py really in the root directory? Later, you don't refer to it at that location.
cat ... | wc -l is a useless use of cat and better written as wc -l < myfile.py.
Your for loop is executed one more time than you have lines.
To get the next line, you do "show me all lines, but counting from the back, don't show me c lines, and then show me the first line of these". There must be a simpler way, right?
To get the left-hand side of an assignment, you say "in the first space-separated field, replace = with a space, then show me the first space-separated field of the result". There must be a simpler way, right? This is, by the way, where you strip out the leading tabs (your first awk command does it).
To print the unchanged line, you do the same complicated thing as before.
A band-aid solution
A minimal change that would get you the result you want would be to modify the awk command: instead of
awk 'gsub("="," ",$1){print $1}' | awk '{print$1}'
you could use
awk -F '=' '{ print $1 }'
"Fields are separated by =; give me the first one". This preserves leading tabs.
The replacements have to be adjusted a little bit as well; you now want to match something that ends in pressure_run:
if [[ $res == *pressure_run ]]
I've used the more flexible [[ ]] instead of [ ] and added a * to pressure_run (which must not be quoted): "if $res ends in pressure_run, then..."
The replacement has to use $res, which has the proper amount of tabs:
echo "$res='${pressure}'" >> myfile_mod.py
Instead of appending each line each loop (and opening the file each time), you could just redirect output of your whole loop with done > myfile_mod.py.
This prints the value of $pressure wrapped in literal single quotes (for example pressure_run='1'), as in your version, because single quotes inside double quotes are ordinary characters. If you want the bare value, remove the single quotes (the braces aren't needed here, but don't hurt):
echo "$res=$pressure" >> myfile_mod.py
This fixes your example, but it should be pointed out that enumerating lines and then getting one at a time with tail | head is a really bad idea. You traverse the file twice for every single line, and it's very error-prone and hard to read. (Thanks to tripleee for suggesting to mention this more clearly.)
A proper solution
This all being said, there are preferred ways of doing what you did. You essentially loop over a file, and if a line matches pressure_run=, you want to replace what's on the right-hand side with $pressure (or the value of that variable). Here is how I would do it:
#!/bin/bash
pressure=1
# Regular expression to match lines we want to change
re='^[[:space:]]*pressure_run='
# Read lines from myfile.py
while IFS= read -r line; do
# If the line matches the regular expression
if [[ $line =~ $re ]]; then
# Print what we matched (with whitespace!), then the value of $pressure
line="${BASH_REMATCH[0]}"$pressure
fi
# Print the (potentially modified) line
echo "$line"
# Read from myfile.py, write to myfile_mod.py
done < myfile.py > myfile_mod.py
For a test file that looks like
blah
test
pressure_run=no_tab
blah
something
	pressure_run=one_tab
		pressure_run=two_tabs
the result is
blah
test
pressure_run=1
blah
something
	pressure_run=1
		pressure_run=1
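For comparison, the sed one-liner can be made to keep the indentation in the same way, by capturing the leading whitespace together with the key. A sketch only: set_pressure is a hypothetical name, and the new value is assumed to contain no /, & or backslash.

```shell
# set_pressure VALUE: filter stdin, replacing the right-hand side of
# pressure_run= lines while keeping their leading whitespace.
set_pressure() {
    sed "s/^\([[:space:]]*pressure_run=\).*/\1$1/"
}

printf 'blah\n\tpressure_run=one_tab\n\t\tpressure_run=two_tabs\n' | set_pressure 1
```

The first line passes through untouched, and the two pressure_run lines come out with their one and two leading tabs intact.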
Recommended reading
How to read a file line-by-line (explains the IFS= and -r business, which is quite essential to preserve whitespace)
BashGuide
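To see what IFS= and -r change, compare a plain read with the whitespace-preserving form on a line that has a leading tab and a backslash. The function names here are made up for the demonstration.

```shell
# Plain read: strips leading/trailing whitespace and treats backslash as an escape.
read_plain() { read line; printf '[%s]\n' "$line"; }

# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
read_raw() { IFS= read -r line; printf '[%s]\n' "$line"; }

# The input line is: <TAB>x = "a\tb"   (a literal backslash followed by t)
printf '\tx = "a\\tb"\n' | read_plain   # [x = "atb"]
printf '\tx = "a\\tb"\n' | read_raw     # [<TAB>x = "a\tb"], tab and backslash kept
```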

Get just the filename from a path in a Bash script [duplicate]

How would I get just the filename without the extension and no path?
The following gives me no extension, but I still have the path attached:
source_file_filename_no_ext=${source_file%.*}
Many UNIX-like operating systems have a basename executable for a very similar purpose (and dirname for the path):
pax> full_name=/tmp/file.txt
pax> base_name=$(basename ${full_name})
pax> echo ${base_name}
file.txt
That unfortunately just gives you the file name, including the extension, so you'd need to find a way to strip that off as well.
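When the extension is known in advance, basename can strip it itself: it accepts an optional suffix as a second operand. A short sketch using the same example path:

```shell
full_name=/tmp/file.txt
# The second operand is a literal suffix to remove, not a pattern.
base_name=$(basename "$full_name" .txt)
printf '%s\n' "$base_name"   # file
```

This only helps for a fixed, known suffix; for arbitrary extensions the parameter expansions are more flexible.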
So, given you have to do that anyway, you may as well find a method that can strip off the path and the extension.
One way to do that (and this is a bash-only solution, needing no other executables):
pax> full_name=/tmp/xx/file.tar.gz
pax> xpath=${full_name%/*}
pax> xbase=${full_name##*/}
pax> xfext=${xbase##*.}
pax> xpref=${xbase%.*}
pax> echo "path='${xpath}', pref='${xpref}', ext='${xfext}'"
path='/tmp/xx', pref='file.tar', ext='gz'
That little snippet sets xpath (the file path), xpref (the file prefix, what you were specifically asking for) and xfext (the file extension).
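For names with more than one dot, the choice between the single and doubled operators decides how much counts as the extension. A brief sketch on the same file name:

```shell
full_name=/tmp/xx/file.tar.gz
xbase=${full_name##*/}          # file.tar.gz

printf '%s\n' "${xbase%.*}"     # file.tar  (shortest suffix removed)
printf '%s\n' "${xbase%%.*}"    # file      (longest suffix removed)
printf '%s\n' "${xbase##*.}"    # gz        (longest prefix removed)
printf '%s\n' "${xbase#*.}"     # tar.gz    (shortest prefix removed)
```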
The basename and dirname solutions are more convenient, but here are some alternative commands:
FILE_PATH="/opt/datastores/sda2/test.old.img"
echo "$FILE_PATH" | sed "s/.*\///"
This returns test.old.img like basename.
This gives the filename without the extension:
echo "$FILE_PATH" | sed -r "s/.+\/(.+)\..+/\1/"
It returns test.old.
And the following statement gives the directory, like the dirname command:
echo "$FILE_PATH" | sed -r "s/(.+)\/.+/\1/"
It returns /opt/datastores/sda2
Here is an easy way to get the file name from a path:
echo "$file_path" | rev | cut -d"/" -f1 | rev
To remove the extension you can use, assuming the file name has only ONE dot (the extension dot):
cut -d"." -f1
$ file=$(basename "$file_path"); file=${file%.*}
Some more alternative options because regexes (regi ?) are awesome!
Here is a simple regex to do the job:
regex="[^/]*$"
Example (grep):
FP="/hello/world/my/file/path/hello_my_filename.log"
echo $FP | grep -oP "$regex"
#Or using standard input
grep -oP "$regex" <<< $FP
Example (awk):
echo "$FP" | awk -v re="$regex" '{ match($1, re, a) } END { print a[0] }'
#Or using standard input (the three-argument match() needs gawk)
awk -v re="$regex" '{ match($1, re, a) } END { print a[0] }' <<< "$FP"
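For awk implementations without the gawk-only array argument to match(), the same last-component extraction can be sketched with RSTART and RLENGTH, which POSIX awk sets after a successful match. last_component is a hypothetical name.

```shell
# last_component PATH: print everything after the final slash.
last_component() {
    printf '%s\n' "$1" | awk '{
        # match() sets RSTART and RLENGTH to the position and length of the match.
        if (match($0, "[^/]*$"))
            print substr($0, RSTART, RLENGTH)
    }'
}

last_component /hello/world/my/file/path/hello_my_filename.log   # hello_my_filename.log
```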
If you need a more complicated regex:
For example your path is wrapped in a string.
StrFP="my string is awesome file: /hello/world/my/file/path/hello_my_filename.log sweet path bro."
#this regex matches a string not containing / and ends with a period
#then at least one word character
#so its useful if you have an extension
regex="[^/]*\.\w{1,}"
#usage
grep -oP "$regex" <<< $StrFP
#alternatively you can get a little more complicated and use lookarounds
#this regex matches a part of a string that starts with / that does not contain a /
##then uses the lazy operator ? to match any character at any amount (as little as possible hence the lazy)
##that is followed by a space
##this allows us to match just a file name in a string with a file path whether it has an extension or not
##also if the path doesnt have file it will match the last directory in the file path
##however this will break if the file path has a space in it.
regex="(?<=/)[^/]*?(?=\s)"
#to fix the above problem you can use sed to remove spaces from the file path only
## as a side note unfortunately sed has limited regex capability and it must be written out in long hand.
NewStrFP=$(echo $StrFP | sed 's:\(/[a-z]*\)\( \)\([a-z]*/\):\1\3:g')
grep -oP "$regex" <<< $NewStrFP
Total solution with Regexes:
This function can give you the filename with or without extension of a linux filepath even if the filename has multiple "."s in it.
It can also handle spaces in the filepath and if the file path is embedded or wrapped in a string.
#you may notice that the sed replace has gotten really crazy looking
#I just added all of the allowed characters in a linux file path
function Get-FileName(){
local FileString="$1"
local NoExtension="$2"
local FileString=$(echo $FileString | sed 's:\(/[a-zA-Z0-9\<\>\|\\\:\)\(\&\;\,\?\*]*\)\( \)\([a-zA-Z0-9\<\>\|\\\:\)\(\&\;\,\?\*]*/\):\1\3:g')
local regex="(?<=/)[^/]*?(?=\s)"
local FileName=$(echo $FileString | grep -oP "$regex")
if [[ "$NoExtension" != "" ]]; then
sed 's:\.[^\.]*$::g' <<< $FileName
else
echo "$FileName"
fi
}
## call the function with extension
Get-FileName "my string is awesome file: /hel lo/world/my/file test/path/hello_my_filename.log sweet path bro."
##call function without extension
Get-FileName "my string is awesome file: /hel lo/world/my/file test/path/hello_my_filename.log sweet path bro." "1"
If you have to mess with a windows path you can start with this one:
[^\\]*$
$ source_file_filename_no_ext=${source_file%.*}
$ echo ${source_file_filename_no_ext##*/}

Capturing Groups From a Grep RegEx

I've got this little script in sh (Mac OSX 10.6) to look through an array of files. Google has stopped being helpful at this point:
files="*.jpg"
for f in $files
do
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
echo $name
done
So far (obviously, to you shell gurus) $name merely holds 0, 1 or 2, depending on whether grep found that the filename matched the pattern provided. What I'd like is to capture what's inside the parens ([a-z]+) and store that to a variable.
I'd like to use grep only, if possible. If not, please no Python or Perl, etc. sed or something like it – I would like to attack this from the *nix purist angle.
Also, as a super-cool bonus, I'm curious as to how I can concatenate strings in shell. If the group I captured was the string "somename" stored in $name, and I wanted to add the string ".jpg" to the end of it, could I cat $name '.jpg'?
If you're using Bash, you don't even have to use grep:
files="*.jpg"
regex="[0-9]+_([a-z]+)_[0-9a-z]*"
for f in $files # unquoted in order to allow the glob to expand
do
if [[ $f =~ $regex ]]
then
name="${BASH_REMATCH[1]}"
echo "${name}.jpg" # concatenate strings
name="${name}.jpg" # same thing stored in a variable
else
echo "$f doesn't match" >&2 # this could get noisy if there are a lot of non-matching files
fi
done
It's better to put the regex in a variable. Some patterns won't work if included literally.
This uses =~ which is Bash's regex match operator. The results of the match are saved to an array called $BASH_REMATCH. The first capture group is stored in index 1, the second (if any) in index 2, etc. Index zero is the full match.
You should be aware that without anchors, this regex (and the one using grep) will match any of the following examples and more, which may not be what you're looking for:
123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz
To eliminate the second and fourth examples, make your regex like this:
^[0-9]+_([a-z]+)_[0-9a-z]*
which says the string must start with one or more digits. The caret represents the beginning of the string. If you add a dollar sign at the end of the regex, like this:
^[0-9]+_([a-z]+)_[0-9a-z]*$
then the third example will also be eliminated since the dot is not among the characters in the regex and the dollar sign represents the end of the string. Note that the fourth example fails this match as well.
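The effect of the anchors can be checked directly against the four sample names. This sketch uses grep -E rather than =~ only so that it runs in any POSIX shell; show_matches is a hypothetical name.

```shell
names='123_abc_d4e5
xyz123_abc_d4e5
123_abc_d4e5.xyz
xyz123_abc_d4e5.xyz'

# show_matches REGEX: print the sample names that match REGEX.
show_matches() {
    printf '%s\n' "$names" | grep -E "$1"
}

show_matches '[0-9]+_([a-z]+)_[0-9a-z]*'     # all four names
show_matches '^[0-9]+_([a-z]+)_[0-9a-z]*'    # 123_abc_d4e5 and 123_abc_d4e5.xyz
show_matches '^[0-9]+_([a-z]+)_[0-9a-z]*$'   # 123_abc_d4e5 only
```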
If you have GNU grep (around 2.5 or later, I think, when the \K operator was added):
name=$(echo "$f" | grep -Po '(?i)[0-9]+_\K[a-z]+(?=_[0-9a-z]*)').jpg
The \K operator (variable-length look-behind) causes the preceding pattern to match, but doesn't include the match in the result. The fixed-length equivalent is (?<=) - the pattern would be included before the closing parenthesis. You must use \K if quantifiers may match strings of different lengths (e.g. +, *, {2,4}).
The (?=) operator matches fixed or variable-length patterns and is called "look-ahead". It also does not include the matched string in the result.
In order to make the match case-insensitive, the (?i) operator is used. It affects the patterns that follow it so its position is significant.
The regex might need to be adjusted depending on whether there are other characters in the filename. You'll note that in this case, I show an example of concatenating a string at the same time that the substring is captured.
This isn't really possible with pure grep, at least not generally.
But if your pattern is suitable, you may be able to use grep multiple times within a pipeline to first reduce your line to a known format, and then to extract just the bit you want. (Although tools like cut and sed are far better at this).
Suppose for the sake of argument that your pattern was a bit simpler: [0-9]+_([a-z]+)_ You could extract this like so:
echo $name | grep -Ei '[0-9]+_[a-z]+_' | grep -oEi '[a-z]+'
The first grep would remove any lines that didn't match your overall pattern; the second grep (which has --only-matching specified) would display the alpha portion of the name. This only works because the pattern is suitable: "alpha portion" is specific enough to pull out what you want.
(Aside: Personally I'd use grep + cut to achieve what you are after: echo $name | grep {pattern} | cut -d _ -f 2. This gets cut to parse the line into fields by splitting on the delimiter _, and returns just field 2 (field numbers start at 1)).
Unix philosophy is to have tools which do one thing, and do it well, and combine them to achieve non-trivial tasks, so I'd argue that grep + sed etc is a more Unixy way of doing things :-)
I realize that an answer was already accepted for this, but from a "strictly *nix purist angle" it seems like the right tool for the job is pcregrep, which doesn't seem to have been mentioned yet. Try changing the lines:
echo $f | grep -oEi '[0-9]+_([a-z]+)_[0-9a-z]*'
name=$?
to the following:
name=$(echo $f | pcregrep -o1 -Ei '[0-9]+_([a-z]+)_[0-9a-z]*')
to get only the contents of the capturing group 1.
The pcregrep tool utilizes all of the same syntax you've already used with grep, but implements the functionality that you need.
The parameter -o works just like the grep version if it is bare, but it also accepts a numeric parameter in pcregrep, which indicates which capturing group you want to show.
With this solution there is a bare minimum of change required in the script. You simply replace one modular utility with another and tweak the parameters.
Interesting Note: You can use multiple -o arguments to return multiple capture groups in the order in which they appear on the line.
Not possible in just grep I believe
for sed:
name=`echo $f | sed -E 's/([0-9]+_([a-z]+)_[0-9a-z]*)|.*/\2/'`
I'll take a stab at the bonus though:
echo "$name.jpg"
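A variant of the sed approach that prints nothing at all for non-matching names, using -n with the p flag, and written with basic regular expressions so it does not depend on -E. This is a sketch; capture_name is a hypothetical name.

```shell
# capture_name FILE: print the letters between the first two underscores,
# or nothing if FILE does not look like digits_letters_rest.
capture_name() {
    printf '%s\n' "$1" |
        sed -n 's/^[0-9][0-9]*_\([a-z][a-z]*\)_[0-9a-z]*.*$/\1/p'
}

name=$(capture_name 123_abc_d4e5.jpg)
printf '%s\n' "${name}.jpg"    # abc.jpg
capture_name holiday.jpg        # prints nothing
```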
This is a solution that uses gawk. It's something I find I need to use often so I created a function for it
function regex1 { gawk 'match($0,/'$1'/, ary) {print ary['${2:-'1'}']}'; }
to use just do
$ echo 'hello world' | regex1 'hello\s(.*)'
world
str="1w 2d 1h"
regex="([0-9])w ([0-9])d ([0-9])h"
if [[ $str =~ $regex ]]
then
week="${BASH_REMATCH[1]}"
day="${BASH_REMATCH[2]}"
hour="${BASH_REMATCH[3]}"
echo $week --- $day ---- $hour
fi
output:
1 --- 2 ---- 1
A suggestion for you - you can use parameter expansion to remove the part of the name from the last underscore onwards, and similarly at the start:
f=001_abc_0za.jpg
work=${f%_*}
name=${work#*_}
Then name will have the value abc.
See Apple developer docs, search forward for 'Parameter Expansion'.
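The two expansions can be wrapped up and combined with the concatenation from the bonus question. A sketch, assuming file names with at least two underscores; middle_field is a hypothetical name.

```shell
# middle_field FILE: print what lies between the first and last underscore.
middle_field() {
    work=${1%_*}                  # remove from the last underscore onwards
    printf '%s\n' "${work#*_}"    # remove up to and including the first underscore
}

name=$(middle_field 001_abc_0za.jpg)
printf '%s\n' "${name}.jpg"   # abc.jpg
```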
I prefer a one-line python or perl command; both are often included in major Linux distributions:
echo $'
<a href="http://stackoverflow.com">
</a>
<a href="http://google.com">
</a>
' | python -c $'
import re
import sys
for i in sys.stdin:
    g = re.match(r\'.*href="(.*)"\', i)
    if g is not None:
        print(g.group(1))
'
and to handle files:
ls *.txt | python -c $'
import sys
import re
for i in sys.stdin:
    i = i.strip()
    with open(i, "r") as f:
        for j in f:
            g = re.match(r\'.*href="(.*)"\', j)
            if g is not None:
                print(g.group(1))
'
The following example shows how to extract the 3 character sequence from a filename using a regex capture group:
for f in 123_abc_123.jpg 123_xyz_432.jpg
do
echo "f: " $f
name=$( perl -ne 'if (/[0-9]+_([a-z]+)_[0-9a-z]*/) { print $1 . "\n" }' <<< $f )
echo "name: " $name
done
Outputs:
f: 123_abc_123.jpg
name: abc
f: 123_xyz_432.jpg
name: xyz
So the if-regex conditional in perl filters out all non-matching lines; for the lines that do match, it applies the capture group(s), which you can access with $1, $2, ... respectively.
If you have bash, you can use extended globbing:
shopt -s extglob
shopt -s nullglob
shopt -s nocaseglob
for file in +([0-9])_+([a-z])_+([a-z0-9]).jpg
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done
or
ls +([0-9])_+([a-z])_+([a-z0-9]).jpg | while read file
do
IFS="_"
set -- $file
echo "This is your captured output : $2"
done

Resources