Extract date from filename using bash script - bash

I know that similar things have been asked before, but I haven't really been able to make head or tail of what's been posted.
I've got a whole bunch of files that contain the date in the format YYYYMMDD at some point in the filename. Luckily this is the only 8 digit substring in all the filenames!
I will need to write the dates into another file later, but that should be fine. I'm struggling to extract the date into a variable first...
I know I can get it with grep:
for d in $( ls *.csv | grep -Po "\d{8}" ); do
echo $d
done
However, as I want to get the full filename into a variable too while I iterate through them, that's not an option right now.
I've tried using sed, but I don't think I know how to use it:
for f in $( ls *.csv ); do
d=$( $f | sed -e 's/^.*\(\d{8}\).*$')
echo $d
done
Thanks for pointing me in the right direction!

Loop through your csv files like this (don't parse ls):
for f in *.csv; do
echo "$f"
d=$(echo "$f" | grep -oE '[0-9]{8}')
done
I've used grep in extended mode (-E) but perl mode is equally valid.
As you have tagged with bash, you can do d=$(grep -oE '[0-9]{8}' <<<"$f") instead if you prefer. You can also use built-in regular expression support, which is slightly more verbose but saves calling an external tool:
re='[0-9]{8}'
[[ $f =~ $re ]] && d="${BASH_REMATCH[0]}"
The array BASH_REMATCH contains the matches to the regular expression. If there is a match, we assign it to d.
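If the individual parts of the date are needed later (for example when writing them to another file), capture groups can split the match. This is only a minimal sketch: it assumes, as the question states, that the date is the only run of eight digits in the name, and split_date and the sample file name are made up for illustration.

```bash
#!/usr/bin/env bash
# split_date is a hypothetical helper, not part of the answer above.
# It prints YYYY-MM-DD for the first run of eight digits in its argument
# and returns non-zero when there is none.
split_date() {
    local re='([0-9]{4})([0-9]{2})([0-9]{2})'
    [[ $1 =~ $re ]] || return 1
    # BASH_REMATCH[0] is the whole match; [1], [2] and [3] are the groups.
    printf '%s-%s-%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}

# Example with a made-up file name; prints 2014-03-12
split_date "report_20140312_final.csv"
```

Inside the loop over *.csv you would call it as d=$(split_date "$f"), which keeps both the file name and the date available in the same iteration.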

#!/bin/bash
# ^-- important: bash, not /bin/sh
for f in *.csv; do # Don't use ls for iterating over filenames
[[ $f =~ [[:digit:]]{8} ]] && { # native built-in regex matching
number=${BASH_REMATCH[0]} # ...refer to the matched content...
echo "Found $number in filename $f" # ...and emit output.
}
done

Related

Substitute shortest match of pattern in filename

I have files with the following filename pattern:
C14_1_S1_R1_001_copy1.fastq.gz
That I would like to be renamed this way:
C14_1_S1_R1.fastq.gz
I have tested unsuccessfully the following pattern replacement strategy:
for f in *.fastq.gz; do echo mv "$f" "${f/_*./_}"; done
Any suggestion is welcome.
Your original filename has several underscore characters but you only want to remove from the second to last underscore. In that case, try:
mv "$f" "${f%_*_*}.fastq.gz"
Consider a directory with these files:
$ ls -1
C14_1_S1_R1_001_copy1.fastq.gz
C15_1_S1_R1_001_copy1.fastq.gz
If we run our loop and then run a new ls, we see the changed filenames:
$ for f in ./*.fastq.gz; do mv "$f" "${f%_*_*}.fastq.gz"; done
$ ls -1
C14_1_S1_R1.fastq.gz
C15_1_S1_R1.fastq.gz
The key here is that ${var%word} is suffix removal and it matches the shortest possible suffix that matches the glob word. Thus, ${f%_*_*} removes the second-to-last underscore character and everything after it. ${f%_*_*}.fastq.gz removes the second-to-last underscore character and everything after and then restores your desired suffix of .fastq.gz.
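To see the difference between shortest and longest matching, here is a small sketch using the file name from the question. The function name strip_copy is an assumption made for illustration only.

```bash
#!/usr/bin/env bash
f="C14_1_S1_R1_001_copy1.fastq.gz"

echo "${f%_*_*}"     # shortest suffix matching _*_* removed: C14_1_S1_R1
echo "${f%%_*_*}"    # longest suffix matching _*_* removed:  C14
echo "${f#*_}"       # shortest prefix matching *_ removed:   1_S1_R1_001_copy1.fastq.gz

# strip_copy is a hypothetical helper that prints the new name
# without renaming anything.
strip_copy() {
    printf '%s\n' "${1%_*_*}.fastq.gz"
}

strip_copy "$f"      # C14_1_S1_R1.fastq.gz
```

A single % takes the shortest match and a doubled %% the longest; # and ## do the same at the start of the string.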
str="C14_1_S1_R1_001_copy1.fastq.gz"
front=$(echo "${str}" | cut -d'_' -f1-4)
back=$(echo "${str}" | cut --complement -d'.' -f1)
echo "${front}.${back}"
With regex using the =~ test operator and BASH_REMATCH
#!/usr/bin/env bash
for file in *.fastq.gz; do
if [[ $file =~ ^(.+)(_[[:digit:]]+_copy.*[^\.])(\.fastq\.gz)$ ]]; then
echo mv -v "$file" "${BASH_REMATCH[1]}${BASH_REMATCH[3]}"
fi
done
Basically, it just splits C14_1_S1_R1_001_copy1.fastq.gz into three parts:
BASH_REMATCH[1] has C14_1_S1_R1
BASH_REMATCH[2] has _001_copy1
BASH_REMATCH[3] has .fastq.gz
Remove the echo if you're ok with the output so the files can be renamed.

select nth file in folder (using sed)?

I am trying to select the nth file in a folder of which the filename matches a certain pattern:
I've tried using this with sed, e.g.:
sed -n 3p /path/to/files/pattern.txt
but it appears to return the 3rd line of the first matching file.
I've also tried
sed -n 3p ls /path/to/files/*pattern*.txt
which doesn't work either.
Thanks!
Why sed, when bash is so much better at it?
Assuming some name n indicates the index you want:
Bash
files=(path/to/files/*pattern*.txt)
echo "${files[n]}"
Posix sh
i=0
for file in path/to/files/*pattern*.txt; do
if [ $i = $n ]; then
break
fi
i=$((i + 1))
done
echo "$file"
What's wrong with sed is that you would have to jump through many hoops to make it safe for the entire set of possible characters that can occur in a filename, and even if that doesn't matter to you you end up with a double-layer of subshells to get the answer.
file=$(printf '%s\n' path/to/files/*pattern*.txt | sed -n "$n"p)
Please, never parse ls.
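One thing to keep in mind is that bash array indices start at 0, so "${files[2]}" is the third file, whereas sed -n 3p counts from 1. A rough sketch of a 1-based helper with a bounds check follows; nth_match is a made-up name, and the path in the usage comment is the placeholder from the question.

```bash
#!/usr/bin/env bash
# nth_match is a hypothetical helper: print the n-th (1-based) of the
# arguments that follow n, or fail if there are fewer than n of them.
nth_match() {
    local n=$1
    shift
    (( n >= 1 && n <= $# )) || return 1
    printf '%s\n' "${@:n:1}"
}

# Usage (path and pattern are placeholders):
# nth_match 3 /path/to/files/*pattern*.txt
```

If nothing matches, bash passes the unexpanded pattern through as a single argument unless nullglob is set, so it may be worth checking that the result exists.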
ls -1 /path/to/files/*pattern*.txt | sed -n '3p'
or, if pattern is a regular expression:
ls -1 /path/to/files/ | egrep 'pattern' | sed -n '3p'
There are many other possibilities; which one to choose depends on whether you care more about performance or simplicity.

change lowercase file names to uppercase with awk ,sed or bash

I would like to change lowercase filenames to uppercase with awk/sed/bash
your help would be appreciated
aaaa.txt
vvjv.txt
acfg.txt
desired output
AAAA.txt
VVJV.txt
ACFG.txt
PREFACE:
If you don't care about the case of your extensions, simply use the 'tr' utility in a shell loop:
for i in *.txt; do mv "$i" "$(echo "$i" | tr '[a-z]' '[A-Z]')"; done
If you do care about the case of the extensions, then you should be aware that there is more than one way to do it (TIMTOWTDI). Personally, I believe the Perl solution, listed here, is probably the simplest and most flexible solution under Linux. If you have multiple file extensions, simply specify the number you wish to keep unchanged. The BASH4 solution is also a very good one, but you must be willing to write out the extension a few times, or alternatively, use another variable to store it. But if you need serious portability then I recommend the last solution in this answer which uses octals. Some flavours of Linux also ship with a tool called rename that may also be worth checking out. Its usage will vary from distro to distro, so type man rename for more info.
SOLUTIONS:
Using Perl:
# single extension
perl -e 's/\.[^\.]*$/rename $_, uc($`) . $&/e for @ARGV' *.txt
# multiple extensions
perl -e 's/(?:\.[^\.]*){2}$/rename $_, uc($`) . $&/e for @ARGV' *.tar.gz
Using BASH4:
# single extension
for i in *.txt; do j="${i%.txt}"; mv "$i" "${j^^}.txt"; done
# multiple extensions
for i in *.tar.gz; do j="${i%.tar.gz}"; mv "$i" "${j^^}.tar.gz"; done
# using a var to store the extension:
e='.tar.gz'; for i in *${e}; do j="${i%${e}}"; mv "$i" "${j^^}${e}"; done
Using GNU awk:
for i in *.txt; do
mv "$i" "$(echo "$i" | awk '{ sub(/\.txt$/,""); print toupper($0) ".txt" }')";
done
Using GNU sed:
for i in *.txt; do
mv "$i" "$(echo "$i" | sed -r -e 's/.*/\U&/' -e 's/\.TXT$/.txt/')";
done
Using BASH3.2:
for i in *.txt; do
stem="${i%.txt}";
for ((j=0; j<"${#stem}"; j++)); do
chr="${stem:$j:1}"
if [[ "$chr" == [a-z] ]]; then
chr=$(printf "%o" "'$chr")
chr=$((chr - 40))
chr=$(printf '\'"$chr")
fi
out+="$chr"
done
mv "$i" "$out.txt"
out=
done
In general for lowercase/upper case modifications "tr" ( translate characters ) utility is often used, it's from the set of command line utilities used for character replacement.
dtpwmbp:~ pwadas$ echo "xxx" | tr '[a-z]' '[A-Z]'
XXX
dtpwmbp:~ pwadas$
Also, for renaming files there's "rename" utility, delivered with perl ( man rename ).
SYNOPSIS
rename [ -v ] [ -n ] [ -f ] perlexpr [ files ]
DESCRIPTION
"rename" renames the filenames supplied according to the rule specified as the first argument. The perlexpr argument is a Perl expression which is expected to modify the $_ string in
Perl for at least some of the filenames specified. If a given filename is not modified by the expression, it will not be renamed. If no filenames are given on the command line,
filenames will be read via standard input.
For example, to rename all files matching "*.bak" to strip the extension, you might say
rename 's/\.bak$//' *.bak
To translate uppercase names to lower, you'd use
rename 'y/A-Z/a-z/' *
I would suggest using rename, if you only want to uppercase the filename and not the extension, use something like this:
rename -n 's/^([^.]*)\.(.*)$/\U$1\E.$2/' *
\U uppercases everything until \E, see perlreref(1). Remove the -n when you're happy with the output.
Bash 4 parameter expansion can perform case changes:
for i in *.txt; do
i="${i%.txt}"
mv "$i.txt" "${i^^?}.txt"
done
bash:
for f in *.txt; do
no_ext=${f%.txt}
mv "$f" "${no_ext^^}.txt"
done
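If the extension is not always .txt, the same expansion can be combined with ${f%.*} and ${f##*.}. The following is just a sketch under the assumption that only the part after the last dot should keep its case; upper_stem is a made-up name, and ^^ needs bash 4 or later.

```bash
#!/usr/bin/env bash
# upper_stem is a hypothetical helper (bash 4+ for ^^).
# It uppercases everything before the last dot and leaves the rest alone.
upper_stem() {
    local name=$1 stem ext
    if [[ $name == ?*.* ]]; then
        stem=${name%.*}       # everything before the last dot
        ext=${name##*.}       # everything after the last dot
        printf '%s.%s\n' "${stem^^}" "$ext"
    else
        # No dot after the first character: uppercase the whole name.
        printf '%s\n' "${name^^}"
    fi
}

# Usage:
# for f in *; do mv -- "$f" "$(upper_stem "$f")"; done
```

Note that for a name like archive.tar.gz only gz counts as the extension here, so the result is ARCHIVE.TAR.gz.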
for f in *.txt; do
mv "$f" "`tr [:lower:] [:upper:] <<< "${f%.*}"`.txt"
done
An easier, lightweight and portable approach would be:
for i in *.txt
do
fname=$(echo $i | cut -d"." -f1 | tr [a-z] [A-Z])
ext=$(echo $i | cut -d"." -f2)
mv $i $fname.$ext
done
This would work on almost every version of BASH since we are using most common external utilities (cut, tr) found on every Unix flavour.
Simply use (on terminal):
for i in *.txt; do mv $i `echo ${i%.*} | tr [:lower:] [:upper:]`.txt; done;
This might work for you (GNU sed):
printf "%s\n" *.txt | sed 'h;s/[^.]*/\U&/;H;g;s/\(.*\)\n/mv -v \1 /' | sh
or more simply:
printf "%s\n" *.txt | sed 'h;s/[^.]*/\U&/;H;g;s/\(.*\)\n/mv -v \1 /e'
for i in *.jar; do mv $i `echo ${i%} | tr [:upper:] [:lower:]`; done;
this works for me.

sed command to fix filenames in a directory

I run a script which generated about 10k files in a directory. I just discovered that there is a bug in the script which causes some filenames to have a carriage return (presumably a '\n' character).
I want to run a sed command to remove the carriage return from the filenames.
Anyone knows which params to pass to sed to clean up the filenames in the manner described?
I am running Linux (Ubuntu)
I don't know how sed would do this, but this Python script should do the trick.
This isn't sed, but I find python a lot easier to use when doing things like these:
#!/usr/bin/env python
import os
files = os.listdir('.')
for file in files:
    # Strip carriage returns and newlines from each name
    new_name = file.replace('\r', '').replace('\n', '')
    os.rename(file, new_name)
    print('Processed ' + new_name)
It strips any occurrences of both \r and \n from all of the filenames in a given directory.
To run it, save it somewhere, cd into your target directory (with the files to be processed), and run python /path/to/the/file.py.
Also, if you plan on doing more batch renaming, consider Métamorphose. It's a really nice and powerful GUI for this stuff. And, it's free!
Good luck!
Actually, try this: cd into the directory, type in python, and then just paste this in:
exec("import os\nfor file in os.listdir('.'):\n os.rename(file, file.replace('\\r', '').replace('\\n', ''))\n print('Processed ' + file.replace('\\r', '').replace('\\n', ''))")
It's a one-line version of the previous script, and you don't have to save it.
Version 2, with space replacement powers:
#!/usr/bin/env python
import os
for file in os.listdir('.'):
    os.rename(file, file.replace('\r', '').replace('\n', '').replace(' ', '_'))
    print('Processed ' + file.replace('\r', '').replace('\n', ''))
And here's the one-liner:
exec("import os\nfor file in os.listdir('.'):\n os.rename(file, file.replace('\\r', '').replace('\\n', '').replace(' ', '_'))\n print('Processed ' + file.replace('\\r', '').replace('\\n', ''))")
If there are no spaces in your filenames, you can do:
for f in *$'\n'; do mv "$f" $f; done
It won't work if the newlines are embedded, but it will work for trailing newlines.
If you must use sed:
for f in *$'\n'; do mv "$f" "$(echo "$f" | sed '/^$/d')"; done
Using the rename Perl script:
rename 's/\n//g' *$'\n'
or the util-linux-ng utility:
rename $'\n' '' *$'\n'
If the character is a return instead of a newline, change the \n or ^$ to \r in any places they appear above.
The reason you aren't getting any pure-sed answers is that fundamentally sed edits file contents, not file names; thus the answers that use sed all do something like echo the filename into a pipe (pseudo file), edit that with sed, then use mv to turn that back into a filename.
Since sed is out, here's a pure-bash version to add to the Perl, Python, etc scripts you have so far:
killpattern=$'[\r\n]' # remove both carriage returns and linefeeds
for f in *; do
if [[ "$f" == *$killpattern* ]]; then
mv "$f" "${f//$killpattern/}"
fi
done
...but since ${var//pattern/replacement} isn't available in plain sh (along with [[...]]), here's a version using sh-only syntax, and tr to do the character replacement:
for f in *; do
new="$(printf %s "$f" | tr -d "\r\n")"
if [ "$f" != "$new" ]; then
mv "$f" "$new"
fi
done
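One thing neither loop checks is whether the cleaned name already exists, in which case mv would overwrite it. Here is a rough bash sketch that skips such files instead; the names strip_line_breaks and clean_dir are invented for this example.

```bash
#!/usr/bin/env bash
# strip_line_breaks is a hypothetical helper: print its argument with all
# carriage returns and line feeds removed.
strip_line_breaks() {
    local name=$1
    name=${name//$'\r'/}
    name=${name//$'\n'/}
    printf '%s' "$name"
}

# clean_dir is a hypothetical helper: rename the entries of directory $1,
# skipping any whose cleaned name is already taken.
# The body runs in a subshell, so the cd does not affect the caller.
clean_dir() (
    cd -- "$1" || exit 1
    for f in *; do
        new=$(strip_line_breaks "$f")
        [[ $f == "$new" ]] && continue
        if [[ -e $new ]]; then
            printf 'skipping %q: target already exists\n' "$f" >&2
        else
            mv -- "$f" "$new"
        fi
    done
)

# Usage:
# clean_dir /path/to/generated/files
```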
EDIT: If you really want it with sed, take a look at this:
http://www.linuxquestions.org/questions/programming-9/merge-lines-in-a-file-using-sed-191121/
Something along these lines should work similar to the perl below:
for i in *; do echo mv "$i" `echo "$i"|sed ':a;N;s/\n//;ta'`; done
With perl, try something along these lines:
for i in *; do mv "$i" `echo "$i"|perl -pe 's/\n//g'`; done
This will rename all files in the current folder by removing all newline characters from them. If you need to go recursive, you can use find instead - be aware of the escaping in that case, though.
In fact there is a way to use sed:
carr='\n' # specify carriage return
files=( $(ls -f) ) # array of files in current dir
for i in ${files[@]}
do
if [[ -n $(echo "$i" | grep $carr) ]] # filenames with carriage return
then
mv "$i" "$(echo "$i" | sed 's/\\n//g')" # move!
fi
done
This actually works.

Best way to choose a random file from a directory in a shell script

What is the best way to choose a random file from a directory in a shell script?
Here is my solution in Bash but I would be very interested for a more portable (non-GNU) version for use on Unix proper.
dir='some/directory'
file=`/bin/ls -1 "$dir" | sort --random-sort | head -1`
path=`readlink --canonicalize "$dir/$file"` # Converts to full path
echo "The randomly-selected file is: $path"
Anybody have any other ideas?
Edit: lhunath makes a good point about parsing ls. I guess it comes down to whether you want to be portable or not. If you have the GNU findutils and coreutils then you can do:
find "$dir" -maxdepth 1 -mindepth 1 -type f -print0 \
| sort --zero-terminated --random-sort \
| sed 's/\d000.*//g'
Whew, that was fun! Also it matches my question better since I said "random file". Honestly though, these days it's hard to imagine a Unix system deployed out there having GNU installed but not Perl 5.
files=(/my/dir/*)
printf "%s\n" "${files[RANDOM % ${#files[@]}]}"
And don't parse ls. Read http://mywiki.wooledge.org/ParsingLs
Edit: Good luck finding a non-bash solution that's reliable. Most will break for certain types of filenames, such as filenames with spaces or newlines or dashes (it's pretty much impossible in pure sh). To do it right without bash, you'd need to fully migrate to awk/perl/python/... without piping that output for further processing or such.
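The array approach can be wrapped in a function that also copes with an empty directory. This is only a sketch; random_entry is a made-up name, and it picks among all visible entries, directories included.

```bash
#!/usr/bin/env bash
# random_entry is a hypothetical helper: print one randomly chosen entry
# of directory $1, or fail if it has no visible entries.
# The body runs in a subshell so that nullglob does not leak to the caller.
random_entry() (
    shopt -s nullglob             # an unmatched glob expands to nothing
    files=("$1"/*)
    (( ${#files[@]} > 0 )) || exit 1
    printf '%s\n' "${files[RANDOM % ${#files[@]}]}"
)

# Usage:
# file=$(random_entry /my/dir) || echo "no files" >&2
```

Since RANDOM only ranges from 0 to 32767, a directory with more entries than that would never have its later entries picked, and the modulo introduces a slight bias for counts that do not divide 32768.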
Is "shuf" not portable?
shuf -n1 -e /path/to/files/*
or find if files are deeper than one directory:
find /path/to/files/ -type f | shuf -n1
it's part of coreutils but you'll need 6.4 or newer to get it... so RH/CentOS does not include it.
# ******************************************************************
# ******************************************************************
function randomFile {
tmpFile=$(mktemp)
find . -type f > "$tmpFile"
total=$(wc -l < "$tmpFile")
randomNumber=$((RANDOM % total))
i=0
while read -r line; do
if [ "$i" -eq "$randomNumber" ]; then
# Do stuff with file
amarok "$line"
break
fi
i=$((i + 1))
done < "$tmpFile"
rm "$tmpFile"
}
Something like:
files=("$dir"/*)
let x="$RANDOM % ${#files[@]}"
echo "The randomly-selected file is ${files[$x]}"
$RANDOM in bash is a special variable that returns a random number, then I use modulus division to get a valid index, then reference that index in the array.
This boils down to: How can I create a random number in a Unix script in a portable way?
Because if you have a random number between 1 and N, you can use head -$N | tail to cut somewhere in the middle. Unfortunately, I know no portable way to do this with the shell alone. If you have Python or Perl, you can easily use their random support but AFAIK, there is no standard rand(1) command.
I think Awk is a good tool to get a random number. According to the Advanced Bash Guide, Awk is a good random number replacement for $RANDOM.
Here's a version of your script that avoids Bash-isms and GNU tools.
#! /bin/sh
dir='some/directory'
n_files=`/bin/ls -1 "$dir" | wc -l | cut -f1`
rand_num=`awk "BEGIN{srand();print int($n_files * rand()) + 1;}"`
file=`/bin/ls -1 "$dir" | sed -ne "${rand_num}p"`
path=`cd $dir && echo "$PWD/$file"` # Converts to full path.
echo "The randomly-selected file is: $path"
It inherits the problems other answers have mentioned should files contain newlines.
Splitting on spaces in file names (though not on embedded newlines) can be avoided by changing IFS as follows in Bash:
#!/bin/bash
OLDIFS=$IFS
IFS=$(echo -en "\n\b")
DIR="/home/user"
for file in $(ls -1 $DIR)
do
echo $file
done
IFS=$OLDIFS
Here's a shell snippet that relies only on POSIX features and copes with arbitrary file names (but omits dot files from the selection). The random selection uses awk, because that's all you get in POSIX. It's a very poor random number generator, since awk's RNG is seeded with the current time in seconds (so it's easily predictable, and returns the same choice if you call it multiple times per second).
set -- *
n=$(echo $# | awk '{srand(); print int(rand()*$0) + 1}')
eval "file=\$$n"
echo "Processing $file"
If you don't want to ignore dot files, the file name generation code (set -- *) needs to be replaced by something more complicated.
set -- *; [ -e "$1" ] || shift
set .[!.]* "$@"; [ -e "$1" ] || shift
set ..?* "$@"; [ -e "$1" ] || shift
if [ $# -eq 0 ]; then echo 1>&2 "empty directory"; exit 1; fi
If you have OpenSSL available, you can use it to generate random bytes. If you don't but your system has /dev/urandom, replace the call to openssl by dd if=/dev/urandom bs=3 count=1 2>/dev/null. Here's a snippet that sets n to a random value between 1 and $#, taking care not to introduce a bias. This snippet assumes that $# is at most 2^23-1.
while
n=$(($(openssl rand 3 | od -An -t u4) + 1))
[ $n -gt $((16777216 / $# * $#)) ]
do :; done
n=$((n % $# + 1))
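The idea of discarding values from the biased tail (rejection sampling) can also be sketched as a bash function. This version assumes /dev/urandom exists and reads the three bytes as separate decimal numbers so that the result does not depend on byte order; random_index is a hypothetical name.

```bash
#!/usr/bin/env bash
# random_index is a hypothetical helper: print a uniformly distributed
# integer between 1 and $1, assuming 1 <= $1 <= 16777216 and that
# /dev/urandom is available.
random_index() {
    local count=$1 limit b1 b2 b3 n
    limit=$(( 16777216 / count * count ))   # largest multiple of count <= 2^24
    while :; do
        # Read three random bytes as three unsigned decimal numbers.
        read -r b1 b2 b3 < <(od -An -N3 -tu1 /dev/urandom)
        n=$(( (b1 << 16) | (b2 << 8) | b3 ))    # 0 .. 2^24 - 1
        (( n < limit )) && break                # reject the biased tail
    done
    echo $(( n % count + 1 ))
}

# Usage with the positional parameters set by "set -- *":
# n=$(random_index "$#")
```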
BusyBox (used on embedded devices) is usually configured to support $RANDOM but it doesn't have bash-style arrays or sort --random-sort or shuf. Hence the following:
#!/bin/sh
FILES="/usr/bin/*"
for f in $FILES; do echo "$RANDOM $f" ; done | sort -n | head -n1 | cut -d' ' -f2-
Note trailing "-" in cut -f2-; this is required to avoid truncating files that contain spaces (or whatever separator you want to use).
It won't handle filenames with embedded newlines correctly.
Put each line of output from the command 'ls' into an associative array named line and then choose one of those like so...
ls | awk 'BEGIN { srand() } { line[NR]=$0 } END { print line[int(rand()*NR)+1] }'
My 2 cents, with a version that should not break when filenames with special chars exist:
#!/bin/bash --
dir='some/directory'
let number_of_files=$(find "${dir}" -type f -print0 | grep -zc .)
let rand_index=$((1+(RANDOM % number_of_files)))
printf "the randomly-selected file is: "
find "${dir}" -type f -print0 | head -z -n "${rand_index}" | tail -z -n 1
printf "\n"
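With bash 4.4 or later, the NUL-delimited output of find can also be read straight into an array, which avoids running find twice. A sketch, assuming a find that supports -print0; random_file_recursive is a made-up name.

```bash
#!/usr/bin/env bash
# random_file_recursive is a hypothetical helper (bash 4.4+ for mapfile -d).
# It prints one randomly chosen regular file below directory $1,
# or fails if there is none.
random_file_recursive() {
    local dir=$1
    local -a files
    # Read NUL-terminated names so spaces and newlines in names are safe.
    mapfile -d '' -t files < <(find "$dir" -type f -print0)
    (( ${#files[@]} > 0 )) || return 1
    printf '%s\n' "${files[RANDOM % ${#files[@]}]}"
}

# Usage:
# random_file_recursive some/directory
```

As with any use of RANDOM, the choice is limited to the first 32768 files and is slightly biased for counts that do not divide 32768.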

Resources