Ubuntu: Terminal remove part of string - bash

I want to use the Ubuntu Terminal to rename several hundred files in a folder since I am not allowed to install anything.
The name of the files is in the following format:
ER201703_Company_Name_Something_9876543218_90087625374823.csv
Afterwards it should look like this:
ER201703_9876543218_90087625374823.csv
So I want to remove the middle part (Company_Name_Something), which sometimes contains 2, 3 or even 4 underscores. I wanted to build two strings, one for the front part and one for the back part. The front part is easy and already works, but I am struggling with the back part.
for name in *.csv;
do
charleng=${#name};
start=$(echo "$name" | grep -a '_9');
back=$(echo "$name" | cut -c $start-);
front=$(echo "$name" | cut -c1-9);
mv "$name""$front$back";
done
I am trying to find the position of _9 and keep everything from there to the end of the string.
Best regards
Jan

If rename is installed (I think that's the case for Ubuntu) you can use the following command instead of your loop.
rename -n 's/^(ER\d*)\w*?(_9\w*)/$1$2/' *.csv
Remove the -n (no act) to apply the changes.
Explanation
s/.../.../ substitutes matches of the left regex with the right pattern.
(ER\d*) matches the first part (ER followed by some digits) and stores it inside $1 for later use.
\w*? matches the company part, that is as few (non-greedy) word characters (letters, numbers, underscore, ...) as possible.
(_9\w*) matches the second part and stores it inside $2 for later use.
$1$2 is the substitution of the previously matched parts. We only omit the company part.
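If rename turns out not to be available, a similar substitution can be previewed with sed before any file is touched. This is only a minimal sketch, not the command from the answer: the function name preview_name is hypothetical, sed -E is assumed to be supported, and because sed has no non-greedy quantifier the \w*? is replaced by a greedy .* that is pinned down by the trailing _9<digits>_<digits>.csv.

```shell
#!/bin/sh
# Hypothetical helper: print the new name for one file name, without renaming anything.
preview_name() {
    # \1 = "ER" plus digits, \2 = "_9<digits>_<digits>.csv"; the company part between them is dropped.
    printf '%s\n' "$1" | sed -E 's/^(ER[0-9]*)_.*(_9[0-9]+_[0-9]+\.csv)$/\1\2/'
}

preview_name "ER201703_Company_Name_Something_9876543218_90087625374823.csv"
```

A name that does not match the pattern is printed unchanged, so the output can be compared with the input before deciding to rename.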

awk -F'_' '{printf "mv %s %s_%s_%s\n",$0,$1,$(NF-1),$NF}'
Example:
kent$ awk -F'_' '{printf "mv %s %s_%s_%s\n",$0,$1,$(NF-1),$NF}' <<<"ER201703_Company_Name_Something_9876543218_90087625374823.csv"
mv ER201703_Company_Name_Something_9876543218_90087625374823.csv ER201703_9876543218_90087625374823.csv
This one-liner prints one mv old new command for every file name it reads on standard input. If the output looks right, pipe it to sh (awk ... | sh) and the renames are performed.
If your file names can contain spaces, put double quotes around both names in the printf format.
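The same field selection (first, second-to-last and last field) can probably be done with shell parameter expansion alone, which avoids both awk and the quoting concerns of piping generated commands to sh. A minimal sketch, assuming every name has at least three underscore-separated fields; the function name strip_middle is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep the first, second-to-last and last "_"-separated fields.
strip_middle() {
    name=$1
    first=${name%%_*}      # everything before the first underscore
    last=${name##*_}       # everything after the last underscore
    rest=${name%_*}        # drop the last field ...
    middle=${rest##*_}     # ... and take the new last field (second-to-last overall)
    printf '%s_%s_%s\n' "$first" "$middle" "$last"
}

strip_middle "ER201703_Company_Name_Something_9876543218_90087625374823.csv"
```

In a loop this could be used as mv -- "$f" "$(strip_middle "$f")" once the printed names have been checked.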

I can offer an alternative solution that may be more generic.
rename 's/^([^_]+(?=_))(?:\w+(?=_\d+))(_\d+_\d+\.csv)$/$1$2/' *.csv
In case the naming of the log files changes, a more robust regular expression helps.
([^_]+(?=_)) - matches everything that is not an underscore, up to the first underscore, and stores it in $1
(?:\w+(?=_\d+)) - matches the word characters up to the numeric part; (?:...) is a non-capturing group, so nothing is stored
(_\d+_\d+\.csv) - matches the two groups of digits and the file extension and stores them in $2

Related

Remove pattern in first occurence from right to left in file name in bash

Say I have a string file name aa.bb.cc.xx.txt
I would like to remove the first content between . and . (remove .xx) before the .txt to have aa.bb.cc.txt.
I don't want to use rev, cut and rev because this uses 3 commands
echo 'aa.bb.cc.xx.rpm' |rev | cut -d '.' --complement -s -f 2 |rev
Is there any better solution by using bash?
Thanks
If you know the file ends with .txt, you can remove that as well, then put it back on.
$ oldname=aa.bb.cc.xx.txt
$ echo "${oldname%.*.txt}.txt"
aa.bb.cc.txt
%.*.txt removes the shortest suffix matching the pattern .*.txt (in this case, .xx.txt).
If the extension could be an arbitrary string, you can save it by removing everything but the extension as a prefix, then restoring it.
$ echo "${oldname%.*.*}.${oldname##*.}"
##*. removes the longest matching prefix ending in ., in this case aa.bb.cc.xx.. Both operators require removing the . that delimits the matched prefix or suffix, which is why you need to add it back explicitly between the two expansions.
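The two expansions can be wrapped in a small function to make the intermediate values visible. This is a sketch under the assumption that the name contains at least two dots; the function name drop_before_ext is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: drop the dot-separated component just before the extension.
drop_before_ext() {
    name=$1
    stem=${name%.*.*}     # shortest suffix of the form ".x.y" removed, e.g. aa.bb.cc
    ext=${name##*.}       # longest prefix ending in "." removed, e.g. txt
    printf '%s.%s\n' "$stem" "$ext"
}

drop_before_ext "aa.bb.cc.xx.txt"
drop_before_ext "aa.bb.cc.xx.rpm"
```

A name with fewer than two dots is not handled here: the suffix pattern does not match, so the extension would simply be appended again.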
You can use sed as follows:
$ echo "aa.bb.cc.xx.txt" | sed 's/\.[^.]*\.txt$/.txt/'
aa.bb.cc.txt
If you want a general sed solution that works on any extension, you can do:
$ echo 'aa.bb.cc.xx.rpm' | sed 's/[^.]*\.\([^.]*\)$/\1/'
aa.bb.cc.rpm

bash display parts of a variable that contains a path

I am trying to output parts of a file path but remove the file name and some levels of the path.
Currently I have a for loop doing a lot of things, but I am creating a variable from the full file path and would like to strip some bits out.
For example
for f in $(find /path/to/my/file -name '*.ext')
will give me
$f = /path/to/my/file/filename.ext
What I want to do is printf/echo some of that variable. I know I can do:
printf ${f#/path/to/}
my/file/filename.ext
But I would like to remove the filename and end up with:
my/file
Is there any easy way to do this without having to use sed/awk etc?
When you know which level of your path you want, you can use cut:
echo "/path/to/my/file/filename.ext" | cut -d/ -f4-5
When you want the last two levels of the path, you can use sed:
echo "/path/to/my/file/filename.ext" | sed 's#.*/\([^/]*/[^/]*\)/[^/]*$#\1#'
Explanation:
s/from/to/ and s#from#to# are equivalent; using # as the delimiter helps when from or to contains slashes.
s/xx\(remember_me\)yy/\1/ will replace "xxremember_meyy" by "remember_me"
s/\(r1\) and \(r2\)/==\2==\1==/ will replace "r1 and r2" by "==r2==r1=="
.* is the longest match with any characters
[^/]* is the longest match without a slash
$ is end of the string for a complete match
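Since the question asks for a way without sed or awk, shell parameter expansion on its own may be enough. A minimal sketch, assuming the leading part to strip is known in advance; the function name rel_dir and its two arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the directory part of a path, relative to a known prefix.
rel_dir() {
    path=$1
    prefix=$2
    dir=${path%/*}            # strip the file name: /path/to/my/file
    rel=${dir#"$prefix"}      # strip the leading prefix literally: my/file
    printf '%s\n' "$rel"
}

rel_dir "/path/to/my/file/filename.ext" "/path/to/"
```

Inside the existing loop the same two expansions can be applied directly to $f.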

Rename multiple files with bash for loop, mv, and sed

My goal is to rename a folder of files of the form 'img_MM-DD-YY_XX.jpg' to the form 'newyears_YYYY-MM-DD_XXX.jpg' by iterating through each filename and using sed to perform substitutions based on character positions. Unfortunately I cannot seem to get the position-based swaps to work.
e.g. s/.\{4\}[0-9][0-9]/.\{10\}[0-9][0-9]/ attempts to replace MM with YY
Here is my attempt (neglecting for now the _XX part):
for filename in images/*
do
newname=$(echo $filename | sed 's/.\{4\}[0-9][0-9]/.\{10\}[0-9][0-9]/;
s/.\{7\}[0-9][0-9]/.\{4\}[0-9][0-9]/;
s/.\{10\}[0-9][0-9]/.\{7\}[0-9][0-9]/;
s/img_/newyears_20/')
mv $filename $newname
done
Any ideas how I can fix this?
$ echo 'img_11-22-14_XX.jpg' | sed -r 's/[^_]*_([0-9]{2})-([0-9]{2})-([0-9]{2})/newyears_20\3-\1-\2/'
newyears_2014-11-22_XX.jpg
The above looks for anything up to and including the first underscore, followed by a date made of three two-digit groups. It replaces the initial part with newyears_ and reformats the date from mm-dd-yy to 20yy-mm-dd.
The two-digit mm, dd, or yy values are matched with ([0-9]{2}). The parentheses indicate that sed should capture the value for later use. The output side of the substitution is newyears_20\3-\1-\2. This writes the new prefix and adds a 20 to the front of the year. The year was the third captured value so it is denoted \3. Likewise, the month was the first captured value so it is denoted \1 and the day the second so it is \2.
To eliminate some backslashes, I used the -r option to invoke extended regular expressions. If you are on a Mac or other non-GNU system, use sed -E in place of sed -r. To stay with basic regular expressions instead, use:
sed 's/[^_]*_\([0-9]\{2\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/newyears_20\3-\1-\2/'
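To plug this substitution into the loop from the question, it may help to apply it to the file name only: the loop iterates over images/*, so [^_]*_ would otherwise also swallow the images/ directory part. The sketch below prints the mv command instead of running it; the function name plan_rename is hypothetical and the path is assumed to contain a directory part.

```shell
#!/bin/sh
# Hypothetical helper: print the mv command for one path instead of running it.
plan_rename() {
    path=$1
    dir=${path%/*}       # directory part, e.g. images
    base=${path##*/}     # file name, e.g. img_11-22-14_07.jpg
    # Reorder mm-dd-yy into 20yy-mm-dd on the file name only, not on the directory.
    new=$(printf '%s\n' "$base" |
        sed 's/[^_]*_\([0-9]\{2\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/newyears_20\3-\1-\2/')
    printf 'mv -- %s %s\n' "$path" "$dir/$new"
}

plan_rename "images/img_11-22-14_07.jpg"
```

Calling it as for f in images/*; do plan_rename "$f"; done gives a dry run that can be reviewed before the real mv is used.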
This is simple to do with awk
echo "img_MM-DD-YY_XX.jpg" | awk -F"[_-]" '{print "newyears_20"$4"-"$2"-"$3"_0"$5}'
newyears_20YY-MM-DD_0XX.jpg

Using BASH, how to increment a number that uniquely only occurs once in most lines of an HTML file?

The target is always going to be between two characters, 'E' and '/' and there will never be but one occurrence of this combination, e.g. 'E01/' in most lines in the HTML file and will always be between '01' and '90'.
So, I need to programmatically read the file and replace each occurrence of 'Enn/' where 'nn' in 'Enn/' will be between '01' and '90' and must maintain the '0' for numbers '01' to '09' in 'Enn/' while incrementing the existing number by 1 throughout the HTML file.
Is this doable and if so how best to go about it?
Edit: Target lines will be in one or the other formats:
<DT>ProgramName
<DT>Program Name
You can use sed inside BASH as a fantastic one-liner, either:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+(10#\2>=90?0:1)))/ge' FILENAME
or, if you are guaranteed the number stays below 99 so that no cap is needed:
sed -ri 's/(.*E)([0-9]{2})(\/.*)/printf "\1%02u\3" $((10#\2+1))/ge' FILENAME
Basically, you'll be doing an in-place search and replace. The first command will not increment past 90 (since you didn't specify the exact nature of the overflow condition). So E89/ -> E90/, E90/ -> E90/, and if by chance you have E91/, it will remain E91/. Put the command inside a loop to process multiple files.
A small explanation of the above command:
-r enables extended regular expressions, which avoids backslashes before ( ) and { }
-i writes the result back to the same file (be careful with overwriting!)
s/search/replace/ge is the substitute command being used
s/ starts the substitute command
(.*E) first group: all characters up to the last E that is followed by two digits and a slash (case sensitive)
([0-9]{2}) second group: digits 0 through 9, repeated twice (fixed width)
(\/.*) third group: the escaped trailing slash and everything after it
/ (slash separator) marks the end of the search pattern and the beginning of the replacement
printf "format" var is the command run for each replacement
\1 places the first group here
%02u is the output format for the incremented number
\3 places the third group here
$((expression)) is a shell arithmetic expansion that supplies the printf argument
10#\2 forces the second group to be read as a base-10 number (the 10# prefix is bash syntax, so this assumes the shell that sed invokes understands it)
+(10#\2>=90?0:1) adds 0 or 1 to the second group depending on whether it is >= 90 (as used in the first command)
+1 adds 1 to the second group (see the second command)
/ge: the g flag replaces every match on a line, and the e flag (a GNU extension) executes the line produced by the substitution as a shell command and replaces it with that command's output
GNU sed and awk are very powerful tools to do this sort of thing.
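The arithmetic at the heart of the command (read a zero-padded number as decimal, add one, cap at 90, pad again) can be tried on its own. This is a minimal sketch in plain sh rather than the answer's 10# form: the function name bump is hypothetical, and the leading zero is stripped by parameter expansion so that 08 and 09 are not parsed as octal.

```shell
#!/bin/sh
# Hypothetical helper: increment a two-digit, zero-padded number, capped at 90.
bump() {
    n=${1#0}                          # "08" -> "8", so it is not parsed as octal
    [ "$n" -lt 90 ] && n=$((n + 1))   # leave 90 (and anything above) unchanged
    printf '%02d\n' "$n"
}

bump 08
bump 89
bump 90
```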
You can use the following perl one-liner to increment the numbers while maintaining the ones with leading 0s.
perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
$ cat file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
$ perl -pe 's/E\K([0-9]+)/sprintf "%02d", 1+$1/e' file
<DT>ProgramName
<DT>Program Name
<DT>Program Name
<DT>Program Name
You can add the -i option to make changes in-place. I would recommend creating backup before doing so.
Not as elegant as one line sed!
Break the commands used into multiple commands and you can debug your bash or grep or sed.
# find the number
# use -o to grep to just return pattern
# use head -n1 for safety to just get 1 number
n=$(grep -o "E[0-9][0-9]\/" file.html |grep -o "[0-9][0-9]"|head -n1)
# 08 and 09 would be read as invalid octal, so force base 10
n1=10#$n
echo "Debug n1=$n1 n=$n"
n2=$(( n1 ))
# bash arithmetic is done inside (( ))
(( n2++ ))
echo "debug n2=$n2"
# use sed with -i -e for in-place edit to replace the number
# (written as -ie, sed would treat "e" as a backup suffix)
sed -i -e "s/E$n\//E$(printf '%02d' "$n2")\//" file.html
grep "E[0-9][0-9]" file.html
awk might be better. Maybe could do it in one awk command also.
The sed one-liner in other answer is awesome :-)
This works in bash; plain sh does not provide (( )) or the 10# prefix.
http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
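The idea of doing it in one awk command could look roughly like the sketch below. It is an assumption-laden illustration, not a tested replacement for the answers above: the function name bump_lines is hypothetical, only the first Enn/ on each line is changed, and values of 90 or more are left alone.

```shell
#!/bin/sh
# Hypothetical helper: read lines on stdin, bump the first Enn/ on each line, cap at 90.
bump_lines() {
    awk '{
        if (match($0, /E[0-9][0-9]\//)) {
            n = substr($0, RSTART + 1, 2) + 0      # numeric value of the two digits, read as decimal
            if (n < 90) n++
            # keep the text through "E", insert the padded number, keep the rest from "/" on
            $0 = substr($0, 1, RSTART) sprintf("%02d", n) substr($0, RSTART + 3)
        }
        print
    }'
}

printf '%s\n' 'link E08/ here' 'no target' 'link E90/ here' | bump_lines
```

To process a file, redirect it into the function and write the result to a new file, for example bump_lines < file.html > file.new.html, then compare the two before replacing the original.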

How to take string from a file name and use it as an argument

If a file name is in this format
assignment_number_username_filename.extension
Ex.
assignment_01_ssaha_homework1.txt
I need to extract just the username to use it in the rest of the script.
How do I take just the username and use it as an argument.
This is close to what I'm looking for but not exactly:
Extracting a string from a file name
if someone could explain how sed works in that scenario that would be just as helpful!
Here's what I have so far; I haven't used cut in a while so I'm getting error messages while trying to refresh myself.
#!/bin/sh
a = $1
grep $a /home | cut -c 1,2,4,5 echo $a`
You probably need command substitution, plus echo plus sed. You need to know that sed regular expressions can remember portions of the match. And you need to know basic regular expressions. In context, this adds up to:
filename="assignment_01_ssaha_homework1.txt"
username=$(echo "$filename" | sed 's/^[^_]*_[^_]*_\([^_]*\)_[^.]*\.[^.]*$/\1/')
The $(...) notation is command substitution. The commands in between the parentheses are run and the output is captured as a string. In this case, the string is assigned to the variable username.
In the sed command, the overall command applies a particular substitution (s/match/replace/) operation to each line of input (here, that will be one line). The [^_]* components of the regular expression match a sequence of (zero or more) non-underscores. The \(...\) part remembers the enclosed regex (the third sequence of non-underscores, aka the user name). The switch to [^.]* at the end recognizes the change in delimiter from underscore to dot. The replacement text \1 replaces the entire name with the remembered part of the pattern. In general, you can have several remembered subsections of the pattern. If the file name does not match the pattern, you'll get the input as output.
In bash, there are ways of avoiding the echo; you might well be able to use some of the more esoteric (meaning 'not available in other shells') mechanisms to extract the data. The echo-plus-sed approach above will work on the majority of modern POSIX-derived shells (Korn, Bash, and others).
filename="assignment_01_ssaha_homework1.txt"
username=$(echo "$filename" | awk -F_ '{print $3}')
Just bash:
filename="assignment_01_ssaha_homework1.txt"
tmp=${filename%_*}
username=${tmp##*_}
http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion
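The two-step expansion above picks the second-to-last underscore-separated field, which matches the username only while the file name part contains no underscore of its own. A variant that counts from the front instead is sketched here; the function name username_of is hypothetical, and the username itself is assumed not to contain underscores.

```shell
#!/bin/sh
# Hypothetical helper: print the third "_"-separated field of a file name.
username_of() {
    rest=${1#*_}        # drop "assignment_"
    rest=${rest#*_}     # drop the number and its underscore
    printf '%s\n' "${rest%%_*}"   # keep everything up to the next underscore
}

username=$(username_of "assignment_01_ssaha_homework1.txt")
echo "username: $username"
```

The captured value can then be passed on like any other variable, for example as "$username" in a later command of the script.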
