How to only concatenate files with same identifier using bash script? - bash

I have a directory with files, some have the same ID, which is given in the first part of the file name before the first underscore (always). e.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate those identical IDs (and if possible placing the original files in another dir, e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imaging that perhaps a solution would be to place all files which will be processed by the code in a new directory and than in a second step move the files with the appended "merged" back to the original dir or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am use to R language and I can think how it should be but can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there are only one version of the file (e.g. S333_R1.txt) sometime two (S100*), three (S111*) or more of the same.
I am prepared for harsh critique for this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!

while read $fil;
do
if [[ "$(find . -maxdepth 1 -name $line"_*.txt" | wc -l)" -gt "1" ]]
then
cat $line_*.txt >> "$line_merged.txt"
fi
done <<< "$(for i in *_*.txt;do echo $i;done | awk -F_ '{ print $1 }')"
Search for files with _.txt and run the output into awk, printing the strings before "_". Run this through a while loop. Check if the number of files for each prefix pattern is greater than 1 using find and if it is, cat the files with that prefix pattern into a merged file.

for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
cat ${id}_*.txt > _${id}_merged.txt
mv ${id}_*.txt folder
fi
done
for f in _*_merged.txt ; do
mv ${f} ${f:1}
done

A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
cat "$file" >> "${prefix}_merged.txt"
done
That script is iterative - one file at a time. To detect if there is one file of specific prefix, we have to look at all files at a time. So first an awk script to join list of filenames with common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
f=$0; # remember the file path
gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
a[$0]=a[$0]" "f # Join path with leading space in associative array indexed with prefix
}
# Output prefix and filanames separated by spaces.
# TBH a tab would be a better separator..
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
#first array element is the prefix
prefix=${files[0]}
unset files[0]
# rest is the files
case "${#files[#]}" in
0) echo super error; ;;
# one file - preserve the filename
1) cat "${files[#]}" > "$outdir"/"${files[1]}"; ;;
# more files - do a _merged.txt suffix
*) cat "${files[#]}" > "$outdir"/"${prefix}_merged.txt"; ;;
esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to empty with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (ie. * -> list of files) is not executed inside " double quotes.
To assign a result of execution into a variable, use command substitution var=$( something seomthing | seomthing )
IDList${k+n}_S*.txt
The ${var+pattern} is a variable expansion that does not add two variables together. It uses pattern when var is set and does nothing when var is unset. See shell parameter expansion and this my answer on ${var-pattern}, but it's similar.
To add two numbers use arithemtic expansion $((k + n)).
awk -F'[_.]' '{$1}'
$1 is just invalid here. To print a line, print it {print %1}.
Remember to check your scripts with http://shellcheck.net

A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames and an associative array (which is supported by bash since the version 4.0) in order to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[#]}"; do
if ((count[$id] > 1)); then
mv "$id"_* "$backupdir"
cat "$backupdir/$id"_* > "$id"_merged.txt
fi
done

Related

Update numbers in filenames

I have a set of filenames which are ordered numerically like:
13B12363_1B1_0.png
13B12363_1B1_1.png
13B12363_1B1_2.png
13B12363_1B1_3.png
13B12363_1B1_4.png
13B12363_1B1_5.png
13B12363_1B1_6.png
13B12363_1B1_7.png
13B12363_1B1_8.png
13B12363_1B1_9.png
13B12363_1B1_10.png
[...]
13B12363_1B1_495.png
13B12363_1B1_496.png
13B12363_1B1_497.png
13B12363_1B1_498.png
13B12363_1B1_499.png
After some postprocessing, I removed some files and I would like to update the ordering number and replace the actual number by its new position. Looking at this previous question I end up doing something like:
(1) ls -v | cat -n | while read n f; do mv -i $f ${f%%[0-9]+.png}_$n.png; done
However, this command do not recognize the "ordering number + png" and just append the new number at the end of the filename. Something like 13B12363_1B1_10.png_9.png
On the other hand, if I do:
(2) ls -v * | cat -n | while read n f; do mv $f ${f%.*}_$n.png; done
The ordering number is added without issues. Like 13B12363_1B1_10_9.png
So, for (1) it seems I am not specifying the digit correctly but I am not able to find the correct syntax. So far I tried [0-9], [0-9]+, [[:digits:]] and [[:digits:]]+. Which should be the proper one?
Additionally, in (2) I am wondering how I should specify rename (CentOS version) to remove the numbers between the second and the third underscore. Here I have to say that I have some filenames like 20B12363_22_10_9.png, so I should somehow specify second and third underscore.
Using Bash's built-in Basic Regex Engine and a null delimited list of files.
Tested with sample
#!/usr/bin/env bash
prename=$1
# Bash setting to return empty result if no match found
shopt -s nullglob
# Create a temporary directory to prevent file rename collisions
tmpdir=$(mktemp -d) || exit 1
# Add a trap to remove the temporary directory on EXIT
trap 'rmdir -- "$tmpdir"' EXIT
# Initialize file counter
n=0
# Generate null delimited list of files
printf -- %s\\0 "${prename}_"*'.png' |
# Sort the null delimited list on 3rd field numeric order with _ separator
sort --zero-terminated --field-separator=_ --key=3n |
# Iterate the null delimited list
while IFS= read -r -d '' f; do
# If Bash Regex match the file name AND
# file has a different sequence number
if [[ "$f" =~ (.*)_([0-9]+)\.png$ ]] && [[ ${BASH_REMATCH[2]} -ne $n ]]; then
# Use captured Regex match group 1 to rename file with incrementing counter
# and move it to the temporary folder to prevent rename collision with
# existing file
echo mv -- "$f" "$tmpdir/${BASH_REMATCH[1]}_$((n)).png"
fi
# Increment file counter
n=$((n+1))
done
# Move back the renamed files in place
mv --no-clobber -- "$tmpdir/*" ./
# $tempdir removal is automatic on EXIT
# If something goes wrong, some files remain in it and it is not deleted
# so these can be dealt with manually
Remove the echo if the result matches your expectations.
Output from the sample
mv -- 13B12363_1B1_495.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_11.png
mv -- 13B12363_1B1_496.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_12.png
mv -- 13B12363_1B1_497.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_13.png
mv -- 13B12363_1B1_498.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_14.png
mv -- 13B12363_1B1_499.png /tmp/tmp.O2HmbyD7d5/13B12363_1B1_15.png
Do not parse ls.
read interprets \ and splits on IFS. bashfaq how to read a stream line by line
In ${f%%replacement} expansion the replacement is not regex, but globulation. Rules differ. + means literally +.
You could shopt -o extglob and then ${f%%+([0-9]).png}. Or write a loop. Or match the _ too and do f=${f%%.png}; f="${f%_[0-9]*}_".
Or something along (untested):
find . -maxdepth 1 -mindepth 1 -type f -name '13B12363_1B1_*.png' |
sort -t_ -n -k3 |
sed 's/\(.*\)[0-9]+\.png$/&\t\1/' |
{
n=1;
while IFS=$'\t' read -r from to; do
echo mv "$from" "$to$((n++)).png";
done;
}
Another alternative, with perl:
perl -e 'while(<#ARGV>){$o=$_;s/\d+(?=\D*$)/$i++.".renamed"/e;die if -e $_;rename $o,$_}while(<*.renamed>){$o=$_;s/\.renamed$//;die if -e $_;rename $o,$_}' $(ls -v|sed -E "s/$|^/'/g"|paste -sd ' ' -)
This solution should avoid rename collisions by: first renaming files adding extra ".renamed" extension. And then removing the ".renamed" extension as the last step. Also, There are checks to detect rename collision.
Anyways, please backup your data before trying :)
The perl script unrolled and explained:
while(<#ARGV>){ # loop through arguments.
# filenames are passed to "$_" variable
# save old file name
$o=$_;
# if not using variable, regex replacement (s///) uses topic variable ($_)
# e flag ==> evals the replacement
s/\d+(?=\D*$)/$i++.".renamed"/e; # works on $_
# Detect rename collision
die if -e $_;
rename $o,$_
}
while(<*.renamed>){
$o=$_;
s/\.renamed$//; # remove .renamed extension
die if -e $_;
rename $o,$_
}
The regex:
\d+ # one number or more
(?=\D*$) # followed by 0 or more non-numbers and end of string

How can I save only a substring of file names from a directory without the file extension?

I have a directory that I'm reading from and I want to save only the date representation as a string.
I am close to getting it , although I know there is probably an easier way. Here is what I have so far:
#files are in the format of "THIS_20200420.csv" so I want only "20200420"
declare -a arr
declare -a arr2
FILES=test2/*.csv
for file in $FILES
do
arr=(${arr[*]} "${file##*/}")
done
for i in "${arr[#]}"
do
arr2+=$(echo $i | cut -c6-13)
done
for item in "${arr2[#]}"
do
echo $item
done
the output shows the array only having one element which is all the strings concatenated:
20200110202001202020021920200220202004202020042220200110202001202020021920200220202004202020042220200219202002202020042020200422
Im bashing my head against my computer at this point.
arr=(
"THIS_20200420.csv"
"THIS_20200421.csv"
"THIS_20200422.csv"
"THIS_20200423.csv"
"THIS_20200424.csv"
"THIS_20200425.csv"
"THIS_20200426.csv"
"THIS_20200427.csv"
"THIS_20200428.csv"
"THIS_20200429.csv"
"THIS_20200430.csv" )
arr=( ${arr[#]//*_} )
arr=( ${arr[#]//.*} )
echo "arr: ${arr[#]}"
Explanation:
arr=( ${arr[#]//*_} ) will match all char up to '_' for each element, and replace them with empty string.
arr=( ${arr[#]//.*} ) will match all char after '.' for each element, and replace them with empty string.
For more information on parameter expansion, a good reference is TLDP's guide on parameter expansion.
Try this
declare -a arrayname=($(ls -1 test2/*.csv | grep -o '[0-9]*'))
Demo:
$ls -1 *csv
THIS_20200420.csv
THIS_20200421.csv
THIS_20200422.csv
THIS_20200423.csv
THIS_20200424.csv
THIS_20200425.csv
THIS_20200426.csv
THIS_20200427.csv
THIS_20200428.csv
THIS_20200429.csv
THIS_20200430.csv
$declare -a arrayname=($(ls -1 *csv | grep -o '[0-9]*'))
$echo ${arrayname[#]}
20200420 20200421 20200422 20200423 20200424 20200425 20200426 20200427 20200428 20200429 20200430
$echo ${arrayname[2]}
20200422
$
You could achieve this using a loop with awk:
$ for file in *.csv; do echo $file | awk -F '[^[:alnum:]]' '{print $2}'; done
The -F '[^[:alnum:]]' tells awk to use non alphanumeric characters as the delimiter.
Another way to do this is to use bash shell parameter expansion to echo only the part of the filename you want. This obviously only works if your filenames have consistent formatting:
$ for file in *.csv; do echo "${file:5:8}"; done
I thought it would be nice to use bash parameter expansion to strip the unwanted prefix and suffix but you can't have nested expansion (afaict) so this is the best I could come up with:
$ for file in *.csv; do echo "$(tmp=${file%.csv}; echo ${tmp#THIS_})"; done
Meet Cut! A good friend of Linux Users
for file in ./*.csv; do echo $file | cut -d "_" -f 2 | cut -d "." -f 1 ; done
This one line should do the trick!
Example:
Use an array for the files assignment and parameter expansion.
#!/usr/bin/env bash
shopt -s nullglob
##: Save the files ending in *.csv in an array
## so it expands properly, variable assignment does not expand the glob *
files=(test2/*.csv)
##: Remain only the files that end with .csv without the pathname, longest match
files=("${files[#]##*/}")
##: Remain only the file names without the .csv extention
files=("${files[#]%.csv}")
##: Remain only the filename after the _ from the beginning, shortest match.
files=("${files[#]#*_}")
printf '%s ' "${files[#]}"

Bash shell script to find missing files from filename

I have a folder that should contain 1485 files, named PA0001.png, PA0002.png ... up to PA1485.png
Some of them are missing and I'd like to write a shell script able to identify the missing ones and print them, as a list, in a .txt file (preferably without the leading string PA and the .png extension, but with the leading zeroes, if any)
I have no clue on how to proceed though, maybe using awk? But I'm still quite of a noob... Any help would be much appreciated!
You can get the list of the sequence number of missing files using bash loop
# Redirect output, per answer
exec > file.txt
for ((i=1 ; i<=1485 ; i++)) ; do
# Convert to 4 digit zero padded
printf -v id '%04d' $i
if [ ! -f "PA$id.png" ] ; then
echo $id
fi
done
Here's a slight refactoring of the existing answer, with explanations in the comments.
# Assign each number in the sequence to i; loop until we have done them all
for ((i=1 ; i<=1485 ; i++)) ; do
# Format the number with padding for the file name part
printf -v id '%04d' "$i"
# If a file with this name does not exist,
if [ ! -f "PA$id.png" ] ; then
# Print it to standard output
echo "$id"
fi
# Redirect the loop's standard output to a file
done >missing.txt
You can do exactly this without a single Bash loop:
#!/usr/bin/env bash
{
find . \
-maxdepth 1 \
-regextype posix-extended \
-regex '.*/([[:digit:]]){4}\.png' \
-printf '%f\n'
printf '%04d.png\n' {1..1485}
} | sort | uniq --unique
It combines the list of files with the list of expected files;
then sort and print the unique entries that are those that are only in the printed expected list, so are missing files.

How to compare 2 files word by word and storing the different words in result output file

Suppose there are two files:
File1.txt
My name is Anamika.
File2.txt
My name is Anamitra.
I want result file storing:
Result.txt
Anamika
Anamitra
I use putty so can't use wdiff, any other alternative.
not my greatest script, but it works. Other might come up with something more elegant.
#!/bin/bash
if [ $# != 2 ]
then
echo "Arguments: file1 file2"
exit 1
fi
file1=$1
file2=$2
# Do this for both files
for F in $file1 $file2
do
if [ ! -f $F ]
then
echo "ERROR: $F does not exist."
exit 2
else
# Create a temporary file with every word from the file
for w in $(cat $F)
do
echo $w >> ${F}.tmp
done
fi
done
# Compare the temporary files, since they are now 1 word per line
# The egrep keeps only the lines diff starts with > or <
# The awk keeps only the word (i.e. removes < or >)
# The sed removes any character that is not alphanumeric.
# Removes a . at the end for example
diff ${file1}.tmp ${file2}.tmp | egrep -E "<|>" | awk '{print $2}' | sed 's/[^a-zA-Z0-9]//g' > Result.txt
# Cleanup!
rm -f ${file1}.tmp ${file2}.tmp
This uses a trick with the for loop. If you use a for to loop on a file, it will loop on each word. NOT each line like beginners in bash tend to believe. Here it is actually a nice thing to know, since it transforms the files into 1 word per line.
Ex: file content == This is a sentence.
After the for loop is done, the temporary file will contain:
This
is
a
sentence.
Then it is trivial to run diff on the files.
One last detail, your sample output did not include a . at the end, hence the sed command to keep only alphanumeric charactes.

Phrase similar filenames as a string with separated by coma and different filenames by whitespace

I am trying to get a string from file names in a directory, grouping together with separated by coma and different file names separated by a single whitespace. Please see the Expected output in the end.
Files in the directory
usa_la2_sky_1.csv
usa_la2_sky_2.csv
usa_nyc1_sky_1.csv
usa_nyc1_sky_2.csv
I tried:
for f in *.csv ; do
input=$input,$f
done
echo $input | sed s/,//
Output with my above code:
usa_la2_sky_1.csv,usa_la2_sky_2.csv,usa_nyc1_sky_1.csv,usa_nyc1_sky_2.csv
Expected output:
usa_la2_sky_1.csv,usa_la2_sky_2.csv usa_nyc1_sky_1.csv,usa_nyc1_sky_2.csv
You can do it easily, but you need to know what the last filename was. You can handle that by saving in a variable (originally set empty). Then just compare the initial part of the the filename for each with a simple parameter expansion (POSIX compliant), e.g.
#!/bin/bash
last= ## last originally empty
for i in *.csv; do ## loop over each file
if [ -z "$last" ]; then ## if last empty, output file
printf "%s" "$i"
elif [ "$last" = "${i%_*}" ]; then ## if last matches beginning of file
printf ",%s" "$i" ## output comma and file
else
printf " %s" "$i" ## no match, output space and file
fi
last="${i%_*}" ## save beginning of filename in last
done
echo "" ## tidy up with final newline
Example Use/Output
With your files in a sample directory, e.g.
$ tree .
.
├── usa_la2_sky_1.csv
├── usa_la2_sky_2.csv
├── usa_nyc1_sky_1.csv
└── usa_nyc1_sky_2.csv
Running the script produces:
$ bash myscript
usa_la2_sky_1.csv,usa_la2_sky_2.csv usa_nyc1_sky_1.csv,usa_nyc1_sky_2.csv
Where you have comma-separated similar filenames in groups separated by a space (which is what I understood you were asking for).
Try this Shellcheck-clean pure Bash code:
#! /bin/bash -p
shopt -s nullglob # Globs that match nothing expand to nothing
input='' oldbase=''
for f in *.csv ; do
base=${f%_*}
[[ $base == "$oldbase" ]] && sep=, || sep=' '
input+=${input:+$sep}$f
oldbase=$base
done
printf '%s\n' "$input"
shopt -s nullglob prevents the code trying to process a spurious (literal) *.csv file if there are no CSV files in the current directory.
base=${f%_*} sets $base to the filename up to, but not including, the last _ character in it. (For example, $base for usa_la2_sky_1.csv is usa_la2_sky.) See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)).
input+=${input:+$sep}$f appends the current filename, possibly preceded by a separator, to the current input string. ${input:+$sep} expands to nothing if $input is empty, and to the value of $sep otherwise. The effect of this is to have no separator at the start of $input. See the "Use an alternate value" section in Parameter expansion [Bash Hackers Wiki].Another option is to simply always add the separator (input+=$sep$f) and remove the leading separator afterwards. One way to remove the leading separator is input=${input#?}.
This will do it:
ls *.csv | awk '{key=$0;sub(/_[^_]*csv/,"",key);a[key]=(key in a)?a[key]","$0:$0}
END{for (i in a){print a[i]}}' |
paste -s -d ' '
We use ls to list all files ending in .csv. Then we use awk to group the files. We make the key by stripping out each _1.csv suffix. All these string are stored in an array and separated by ",". In the end we will print these. Since you wanted separate the group by space I used paste -s for this. This will paste each line in one line separated by a spaces indicated by -d ' '.

Resources