How can I save only a substring of file names from a directory without the file extension? - bash

I'm reading file names from a directory and want to save only the date part of each name as a string.
I am close to getting it, although I know there is probably an easier way. Here is what I have so far:
#files are in the format of "THIS_20200420.csv" so I want only "20200420"
declare -a arr
declare -a arr2
FILES=test2/*.csv
for file in $FILES
do
arr=(${arr[*]} "${file##*/}")
done
for i in "${arr[@]}"
do
arr2+=$(echo $i | cut -c6-13)
done
for item in "${arr2[@]}"
do
echo $item
done
The output shows the array having only one element, which is all the strings concatenated:
20200110202001202020021920200220202004202020042220200110202001202020021920200220202004202020042220200219202002202020042020200422
I'm bashing my head against my computer at this point.

arr=(
"THIS_20200420.csv"
"THIS_20200421.csv"
"THIS_20200422.csv"
"THIS_20200423.csv"
"THIS_20200424.csv"
"THIS_20200425.csv"
"THIS_20200426.csv"
"THIS_20200427.csv"
"THIS_20200428.csv"
"THIS_20200429.csv"
"THIS_20200430.csv" )
arr=( ${arr[@]//*_} )
arr=( ${arr[@]//.*} )
echo "arr: ${arr[@]}"
Explanation:
arr=( ${arr[@]//*_} ) matches everything up to and including the last '_' in each element and replaces it with the empty string.
arr=( ${arr[@]//.*} ) matches everything from the first '.' onward in each element and replaces it with the empty string.
For more information on parameter expansion, a good reference is TLDP's guide on parameter expansion.

Try this
declare -a arrayname=($(ls -1 test2/*.csv | grep -o '[0-9]\{8\}'))
Demo:
$ls -1 *csv
THIS_20200420.csv
THIS_20200421.csv
THIS_20200422.csv
THIS_20200423.csv
THIS_20200424.csv
THIS_20200425.csv
THIS_20200426.csv
THIS_20200427.csv
THIS_20200428.csv
THIS_20200429.csv
THIS_20200430.csv
$declare -a arrayname=($(ls -1 *csv | grep -o '[0-9]*'))
$echo ${arrayname[@]}
20200420 20200421 20200422 20200423 20200424 20200425 20200426 20200427 20200428 20200429 20200430
$echo ${arrayname[2]}
20200422
$

You could achieve this using a loop with awk:
$ for file in *.csv; do echo $file | awk -F '[^[:alnum:]]' '{print $2}'; done
The -F '[^[:alnum:]]' tells awk to use non alphanumeric characters as the delimiter.
Another way to do this is to use bash shell parameter expansion to echo only the part of the filename you want. This obviously only works if your filenames have consistent formatting:
$ for file in *.csv; do echo "${file:5:8}"; done
I thought it would be nice to use bash parameter expansion to strip the unwanted prefix and suffix but you can't have nested expansion (afaict) so this is the best I could come up with:
$ for file in *.csv; do echo "$(tmp=${file%.csv}; echo ${tmp#THIS_})"; done
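The two-step strip in the last command does not need a subshell or a nested expansion; a temporary variable inside a function does the same job. A minimal sketch, assuming the `.csv` suffix and the underscore-separated layout from the question (the function name `date_part` is made up for illustration):

```bash
#!/usr/bin/env bash
# date_part is a hypothetical helper name, not taken from the answers above.
date_part() {
    local name=${1##*/}          # drop any leading directory
    name=${name%.csv}            # drop the .csv suffix
    printf '%s\n' "${name#*_}"   # drop everything up to the first underscore
}

date_part "test2/THIS_20200420.csv"   # prints 20200420
```

Inside a loop this would be called as `date_part "$file"`.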

Meet Cut! A good friend of Linux Users
for file in ./*.csv; do echo $file | cut -d "_" -f 2 | cut -d "." -f 1 ; done
This one line should do the trick!

Use an array for the files assignment and parameter expansion.
#!/usr/bin/env bash
shopt -s nullglob
##: Save the files ending in .csv in an array so the glob expands properly;
## a plain variable assignment does not expand the glob *
files=(test2/*.csv)
##: Keep only the file names, without the path (longest match of */)
files=("${files[@]##*/}")
##: Remove the .csv extension
files=("${files[@]%.csv}")
##: Keep only the part after the first _ (shortest match of *_ from the beginning)
files=("${files[@]#*_}")
printf '%s ' "${files[@]}"
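As for the concatenated output in the question: it comes from `arr2+=$(...)`. Without parentheses, `+=` appends text to element 0 of the array instead of adding a new element. A short sketch of the difference, using two sample file names assumed for illustration:

```bash
#!/usr/bin/env bash
names=("THIS_20200420.csv" "THIS_20200421.csv")   # sample names, assumed

joined=()
separate=()
for n in "${names[@]}"; do
    joined+=${n:5:8}         # string append: everything lands in joined[0]
    separate+=("${n:5:8}")   # array append: one new element per file
done

echo "${#joined[@]} element: ${joined[0]}"          # 1 element: 2020042020200421
echo "${#separate[@]} elements: ${separate[*]}"     # 2 elements: 20200420 20200421
```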

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files that share an ID (and, if possible, move the original files into another directory), e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that a solution might be to place all files to be processed in a new directory and then, in a second step, move the files with the appended "merged" back to the original directory, or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language; I can think through how it should work but can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of the file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*) or more.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
while read -r line;
do
if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt "1" ]]
then
cat "${line}"_*.txt >> "${line}_merged.txt"
fi
done <<< "$(for i in *_*.txt;do echo "$i";done | awk -F_ '{ print $1 }' | sort -u)"
List the files matching *_*.txt and pipe the names into awk, printing the part before the first "_"; sort -u keeps each prefix once. Run the result through a while loop. For each prefix, use find to check whether more than one file has that prefix and, if so, cat those files into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative - one file at a time. To detect if there is one file of specific prefix, we have to look at all files at a time. So first an awk script to join list of filenames with common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[@]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[@]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to empty with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (ie. * -> list of files) is not executed inside " double quotes.
To assign the output of a command to a variable, use command substitution: var=$( something something | something )
IDList${k+n}_S*.txt
The ${var+pattern} form is a variable expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.
To add two numbers, use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
A bare $1 prints nothing here. To print the field, use {print $1}.
Remember to check your scripts with http://shellcheck.net
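The points above (command substitution, `${k+n}` versus arithmetic expansion, and printing a field in awk) can be seen together in one short sketch. The values of `k` and `n` are made up for illustration:

```bash
#!/usr/bin/env bash
k=100
n=1

alt=${k+n}        # k is set, so this expands to the literal text "n"
sum=$((k + n))    # arithmetic expansion: 101

# Command substitution captures the output of a pipeline in a variable.
id=$(echo "S${k}_R1.txt" | awk -F'[_.]' '{print $1}')

echo "alt=$alt sum=$sum id=$id"   # prints alt=n sum=101 id=S100
```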
A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done

Update numbers in filenames

I have a set of filenames which are ordered numerically like:
13B12363_1B1_0.png
13B12363_1B1_1.png
13B12363_1B1_2.png
13B12363_1B1_3.png
13B12363_1B1_4.png
13B12363_1B1_5.png
13B12363_1B1_6.png
13B12363_1B1_7.png
13B12363_1B1_8.png
13B12363_1B1_9.png
13B12363_1B1_10.png
[...]
13B12363_1B1_495.png
13B12363_1B1_496.png
13B12363_1B1_497.png
13B12363_1B1_498.png
13B12363_1B1_499.png
After some postprocessing, I removed some files and I would like to update the ordering number and replace the actual number by its new position. Looking at this previous question I end up doing something like:
(1) ls -v | cat -n | while read n f; do mv -i $f ${f%%[0-9]+.png}_$n.png; done
However, this command does not recognize the "ordering number + png" and just appends the new number at the end of the filename. Something like 13B12363_1B1_10.png_9.png
On the other hand, if I do:
(2) ls -v * | cat -n | while read n f; do mv $f ${f%.*}_$n.png; done
The ordering number is added without issues. Like 13B12363_1B1_10_9.png
So, for (1) it seems I am not specifying the digit correctly but I am not able to find the correct syntax. So far I tried [0-9], [0-9]+, [[:digits:]] and [[:digits:]]+. Which should be the proper one?
Additionally, in (2) I am wondering how I should specify rename (CentOS version) to remove the numbers between the second and the third underscore. Here I have to say that I have some filenames like 20B12363_22_10_9.png, so I should somehow specify second and third underscore.
Using Bash's built-in regex matching (POSIX extended regular expressions) and a null-delimited list of files.
Tested with sample
#!/usr/bin/env bash
prename=$1
# Bash setting to return empty result if no match found
shopt -s nullglob
# Create a temporary directory to prevent file rename collisions
tmpdir=$(mktemp -d) || exit 1
# Add a trap to remove the temporary directory on EXIT
trap 'rmdir -- "$tmpdir"' EXIT
# Initialize file counter
n=0
# Generate null delimited list of files
printf -- %s\\0 "${prename}_"*'.png' |
# Sort the null delimited list on 3rd field numeric order with _ separator
sort --zero-terminated --field-separator=_ --key=3n |
# Iterate the null delimited list
while IFS= read -r -d '' f; do
# If Bash Regex match the file name AND
# file has a different sequence number
if [[ "$f" =~ (.*)_([0-9]+)\.png$ ]] && [[ ${BASH_REMATCH[2]} -ne $n ]]; then
# Use captured Regex match group 1 to rename file with incrementing counter
# and move it to the temporary folder to prevent rename collision with
# existing file
echo mv -- "$f" "$tmpdir/${BASH_REMATCH[1]}_$((n)).png"
fi
# Increment file counter
n=$((n+1))
done
# Move back the renamed files in place
# (the glob must stay outside the quotes, otherwise it is not expanded)
mv --no-clobber -- "$tmpdir"/* ./
# $tmpdir removal is automatic on EXIT
# If something goes wrong, some files remain in it and it is not deleted
# so these can be dealt with manually
Remove the echo if the result matches your expectations.
Output from the sample
mv -- 13B12363_1B1_495.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_11.png
mv -- 13B12363_1B1_496.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_12.png
mv -- 13B12363_1B1_497.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_13.png
mv -- 13B12363_1B1_498.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_14.png
mv -- 13B12363_1B1_499.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_15.png
Do not parse ls.
Without -r, read interprets backslashes, and it splits on IFS. See BashFAQ: how to read a stream line by line.
In the ${f%%pattern} expansion the pattern is a glob, not a regex. The rules differ: + means a literal +.
You could shopt -s extglob and then ${f%%+([0-9]).png}. Or write a loop. Or match the _ too and do f=${f%%.png}; f="${f%_[0-9]*}_".
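A small sketch of the glob-versus-regex point, using one file name from the question. `extglob` is switched on with `shopt -s`; the variable names `plain` and `ext` are only for illustration:

```bash
#!/usr/bin/env bash
shopt -s extglob   # enables +(...), meaning "one or more of"

f=13B12363_1B1_10.png

plain=${f%%[0-9]+.png}    # + is a literal plus sign here, so nothing is removed
ext=${f%%+([0-9]).png}    # removes the trailing digits and .png

echo "$plain"   # 13B12363_1B1_10.png
echo "$ext"     # 13B12363_1B1_
```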
Or something along (untested):
find . -maxdepth 1 -mindepth 1 -type f -name '13B12363_1B1_*.png' |
sort -t_ -n -k3 |
sed 's/\(.*_\)[0-9][0-9]*\.png$/&\t\1/' |
{
n=1;
while IFS=$'\t' read -r from to; do
echo mv "$from" "$to$((n++)).png";
done;
}
Another alternative, with perl:
perl -e 'while(<@ARGV>){$o=$_;s/\d+(?=\D*$)/$i++.".renamed"/e;die if -e $_;rename $o,$_}while(<*.renamed>){$o=$_;s/\.renamed$//;die if -e $_;rename $o,$_}' $(ls -v|sed -E "s/$|^/'/g"|paste -sd ' ' -)
This solution should avoid rename collisions by first renaming the files with an extra ".renamed" extension and then removing the ".renamed" extension as the last step. There are also checks to detect rename collisions.
Anyways, please backup your data before trying :)
The perl script unrolled and explained:
while(<@ARGV>){ # loop through arguments.
# filenames are passed to "$_" variable
# save old file name
$o=$_;
# if not using variable, regex replacement (s///) uses topic variable ($_)
# e flag ==> evals the replacement
s/\d+(?=\D*$)/$i++.".renamed"/e; # works on $_
# Detect rename collision
die if -e $_;
rename $o,$_
}
while(<*.renamed>){
$o=$_;
s/\.renamed$//; # remove .renamed extension
die if -e $_;
rename $o,$_
}
The regex:
\d+ # one or more digits
(?=\D*$) # followed by 0 or more non-digits and end of string
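The two-phase idea behind the perl answer (a temporary `.renamed` suffix first, the final names second) can also be sketched in plain bash. This is only an illustration under stated assumptions: the helper name `renumber_pngs` is made up, names are assumed to look like PREFIX_<number>.png, and numbering starts at 0 as in the question's listing.

```bash
#!/usr/bin/env bash
# renumber_pngs DIR PREFIX  (hypothetical helper, not from the answers above)
# Renumbers DIR/PREFIX_<n>.png to PREFIX_0.png, PREFIX_1.png, ... in numeric order.
renumber_pngs() {
    local dir=$1 prefix=$2 f num i=0
    # Phase 1: take the existing numbers in numeric order and park each file
    # under its new number with a temporary .renamed suffix.
    while IFS= read -r num; do
        mv -- "$dir/${prefix}_${num}.png" "$dir/${prefix}_${i}.png.renamed" || return
        i=$((i + 1))
    done < <(
        for f in "$dir/$prefix"_*.png; do
            [[ -e $f ]] || continue      # the glob matched nothing
            f=${f##*_}                   # keep what follows the last underscore
            printf '%s\n' "${f%.png}"    # and drop the extension
        done | sort -n
    )
    # Phase 2: all originals are out of the way, so drop the temporary suffix.
    for f in "$dir"/"$prefix"_*.png.renamed; do
        [[ -e $f ]] || continue
        mv -- "$f" "${f%.renamed}" || return
    done
}
```

It would be called as `renumber_pngs . 13B12363_1B1`. As with the other answers, try it on a copy of the data first.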

UNIX :: Padding for files containing string and multipleNumber

I have many files not having consistent filenames.
For example
IMG_20200823_1.jpg
IMG_20200823_10.jpg
IMG_20200823_12.jpg
IMG_20200823_9.jpg
I would like to rename all of them and ensure they all follow same naming convention
IMG_20200823_0001.jpg
IMG_20200823_0010.jpg
IMG_20200823_0012.jpg
IMG_20200823_0009.jpg
I found that a bare number can be padded with:
printf "%04d\n"
However, I am not able to do this with my files, since their names mix a string, "_" and different numbers.
Could anyone help me?
Thanks !
With Perl's standalone rename or prename command:
rename -n 's/(\d+)(\.jpg$)/sprintf("%04d%s",$1,$2)/e' *.jpg
Output:
rename(IMG_20200823_10.jpg, IMG_20200823_0010.jpg)
rename(IMG_20200823_12.jpg, IMG_20200823_0012.jpg)
rename(IMG_20200823_1.jpg, IMG_20200823_0001.jpg)
rename(IMG_20200823_9.jpg, IMG_20200823_0009.jpg)
If everything looks fine, remove -n.
With Bash regular expressions:
re='(IMG_[[:digit:]]+)_([[:digit:]]+)'
for f in *.jpg; do
[[ $f =~ $re ]]
mv "$f" "$(printf '%s_%04d.jpg' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}")"
done
where BASH_REMATCH is an array containing the capture groups of the regular expression. At index 0 is the whole match; index 1 contains IMG_ and the first group of digits; index 2 contains the second group of digits. The printf command is used to format the second group with zero padding, four digits wide.
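A variation on the same idea, sketched as a function that prints the new name instead of calling mv. Two things differ from the loop above and are assumptions of this sketch: names that do not match are passed through unchanged, and `10#` forces base 10 so that an already padded number such as 09 is not rejected as invalid octal. The name `pad_name` is made up.

```bash
#!/usr/bin/env bash
# pad_name is a hypothetical helper; it prints the new name instead of renaming.
pad_name() {
    local re='^(.*_)([0-9]+)(\.[^.]+)$'
    if [[ $1 =~ $re ]]; then
        # 10# forces base 10, so a number like 09 is not read as invalid octal
        printf '%s%04d%s\n' "${BASH_REMATCH[1]}" "$((10#${BASH_REMATCH[2]}))" "${BASH_REMATCH[3]}"
    else
        printf '%s\n' "$1"   # no match: leave the name unchanged
    fi
}

pad_name IMG_20200823_9.jpg    # IMG_20200823_0009.jpg
pad_name IMG_20200823_09.jpg   # IMG_20200823_0009.jpg
```

In a rename loop this would be used as `mv -- "$f" "$(pad_name "$f")"`.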
Use a regex to extract the relevant sub-strings from the input and then pad it...
For each file.
Extract the prefix, number and suffix from the filename.
Pad the number with zeros.
Create the new filename.
Move files
The following code for bash:
echo 'IMG_20200823_1.jpg
IMG_20200823_10.jpg
IMG_20200823_12.jpg
IMG_20200823_9.jpg' |
while IFS= read -r file; do # foreach file
# Use GNU sed to extract parts on separate lines
tmp=$(<<<"$file" sed 's/\(.*_\)\([0-9]*\)\(\..*\)/\1\n\2\n\3\n/')
# Read the separate parts separated by newlines
{
IFS= read -r prefix
IFS= read -r number
IFS= read -r suffix
} <<<"$tmp"
# create new filename
newfilename="$prefix$(printf "%04d" "$number")$suffix"
# move the files
echo mv "$file" "$newfilename"
done
outputs:
mv IMG_20200823_1.jpg IMG_20200823_0001.jpg
mv IMG_20200823_10.jpg IMG_20200823_0010.jpg
mv IMG_20200823_12.jpg IMG_20200823_0012.jpg
mv IMG_20200823_9.jpg IMG_20200823_0009.jpg
Being puzzled by your hint at printf...
Current folder content:
$ ls -1 IMG_*
IMG_20200823_1.jpg
IMG_20200823_21.jpg
This is surely not a good solution, but with printf and sed we can do it:
$ printf "mv %3s_%8s_%d.%3s %3s_%8s_%04d.%3s\n" $(ls -1 IMG_* IMG_* | sed 's/_/ /g; s/\./ /')
mv IMG_20200823_1.jpg IMG_20200823_0001.jpg
mv IMG_20200823_21.jpg IMG_20200823_0021.jpg

Turning a list of abs pathed files to a comma delimited string of files in bash

I have been working in bash and need to create a string argument. Bash is newish to me, to the point that I don't know how to build a string from a list.
// foo.txt is a list of abs file names.
/foo/bar/a.txt
/foo/bar/b.txt
/delta/test/b.txt
should turn into: a.txt,b.txt,b.txt
OR: /foo/bar/a.txt,/foo/bar/b.txt,/delta/test/b.txt
code
s = ""
for file in $(cat foo.txt);
do
#what goes here? s += $file ?
done
myShellScript --script $s
I figured there would be an easy way to do this.
with for loop:
for file in $(cat foo.txt);do echo -n "$file",;done|sed 's/,$/\n/g'
with tr:
cat foo.txt|tr '\n' ','|sed 's/,$/\n/g'
only sed:
sed ':a;N;$!ba;s/\n/,/g' foo.txt
This seems to work:
#!/bin/bash
input="foo.txt"
while IFS= read -r var
do
basename "$var" >> tmp
done < "$input"
paste -d, -s tmp > result.txt
output: a.txt,b.txt,b.txt
basename gets you the file names you need and paste will put them in the order you seem to need.
The input field separator can be used with set to create split/join functionality:
# split the lines of foo.txt into positional parameters
IFS=$'\n'
set $(< foo.txt)
# join with commas
IFS=,
echo "$*"
For just the file names, add some sed:
IFS=$'\n'; set $(sed 's|.*/||' foo.txt); IFS=,; echo "$*"
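The same IFS trick can be kept local to a function so the global IFS is never changed. A minimal sketch; the function name `join_by` is made up, and the array holds the sample paths from the question rather than reading foo.txt:

```bash
#!/usr/bin/env bash
# join_by is a hypothetical helper: join_by SEP item...
join_by() {
    local IFS=$1   # "$*" joins on the first character of IFS
    shift
    printf '%s\n' "$*"
}

paths=(/foo/bar/a.txt /foo/bar/b.txt /delta/test/b.txt)   # sample data from the question

join_by , "${paths[@]}"        # /foo/bar/a.txt,/foo/bar/b.txt,/delta/test/b.txt
join_by , "${paths[@]##*/}"    # a.txt,b.txt,b.txt
```

With bash 4 or later, the array could be filled from the file with `mapfile -t paths < foo.txt`.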

Is it possible to do a grep with keywords stored in the array?

Is it possible to do a grep with keywords stored in an array?
Here is the possible code snippet; how can I correct it?
args=("key1" "key2" "key3")
cat file_name |while read line
echo $line | grep -q -w ${args[c]}
done
At the moment, I can search for only one keyword. I would like to search for all the keywords that are stored in the args array.
args=("key1" "key2" "key3")
pat=$(echo ${args[@]}|tr " " "|")
grep -Eow "$pat" file
Or with the shell
args=("key1" "key2" "key3")
while read -r line
do
for i in ${args[@]}
do
case "$line" in
*"$i"*) echo "found: $line";;
esac
done
done <"file"
You can use some bash expansion magic to prefix each element with -e and pass each element of the array as a separate pattern. This may avoid some precedence issues where your patterns may interact badly with the | operator:
$ grep ${args[@]/#/-e } file_name
The downside to this is that you cannot have any spaces in your patterns because that will split the arguments to grep. You cannot put quotes around the above expansion, otherwise you get "-e pattern" as a single argument to grep.
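One way around the space limitation is to build the argument list in a second array, so that each -e and its pattern stay separate, correctly quoted arguments. A sketch with made-up sample patterns and input lines:

```bash
#!/usr/bin/env bash
args=("key1" "two words" "key3")   # sample patterns, one containing a space

grep_args=()
for pat in "${args[@]}"; do
    grep_args+=(-e "$pat")   # -e and the pattern stay separate arguments
done

matches=$(printf '%s\n' "has key1" "has two words" "has nothing" | grep "${grep_args[@]}")
printf '%s\n' "$matches"
```

Against a real file the last step would be `grep "${grep_args[@]}" file_name`.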
This is one way:
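args=("key1" "key2" "key3")
keys=${args[@]/%/\\|} # result: key1\| key2\| key3\|
keys=${keys// } # result: key1\|key2\|key3\|
keys=${keys%\\|} # result: key1\|key2\|key3 (a trailing \| is an empty alternative that matches every line)
grep "${keys}" file_name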
Edit:
Based on Pavel Shved's suggestion:
( IFS="|"; keys="${args[*]}"; keys="${keys//|/\\|}"; grep "${keys}" file_name )
The first version as a one-liner:
keys=${args[@]/%/\\|}; keys=${keys// }; keys=${keys%\\|}; grep "${keys}" file_name
Edit2:
Even better than the version using IFS:
printf -v keys "%s\\|" "${args[@]}"; keys=${keys%\\|}; grep "${keys}" file_name
I tend to use process substitution for everything. It's convenient when combined with grep's -f option:
Obtain patterns from FILE, one per line.
(Depending on the context, you might even want to combine that with -F, -x or -w, etc., for awesome effects.)
So:
#! /usr/bin/env bash
t=(8 12 24)
seq 30 | grep -f <(printf '%s\n' "${t[@]}")
and I get:
8
12
18
24
28
I basically write a pseudo-file with one item of the array per line, and then tell grep to use each of these lines as a pattern.
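The output above includes 18 and 28 because each pattern may match anywhere in the line. Combining -f with -F (fixed strings) and -x (whole-line match), as hinted at earlier, restricts the result to exact matches. A short sketch with the same sample array:

```bash
#!/usr/bin/env bash
t=(8 12 24)

# -F: treat each pattern as a fixed string, -x: the whole line must match
exact=$(seq 30 | grep -Fx -f <(printf '%s\n' "${t[@]}"))
printf '%s\n' "$exact"
```

This prints 8, 12 and 24, one per line.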
The command
( IFS="|" ; grep --perl-regexp "${args[*]}" ) <file_name
searches the file for each keyword in an array. It does so by constructing regular expression word1|word2|word3 that matches any word from the alternatives given (in perl mode).
If there is a way to join array elements into a string, delimiting them with a sequence of characters (namely, \|), it could be done without perl regexp.
Perhaps something like this:
cat file_name | while read -r line
do
for arg in "${args[@]}"
do
echo "$line" | grep -q -w "$arg"
done
done
not tested!

Resources