Bash - Concatenate files in a directory ordered by date

I need some help with a simple script I'm writing. The script takes as input a directory that contains files like these:
FILENAME20160220.TXT
FILENAME20160221.TXT
FILENAME20160222.TXT
...
The script needs to take the directory as input and concatenate the files into a new file called:
FILENAME.20160220_20160222.TXT
The output filename needs to contain the earliest and latest dates found among the input files ("earliest"_"latest"). The script I've written so far is below, but it doesn't produce the necessary output. Can someone help me tinker with it?
declare FILELISTING="FILELISTING.TXT"
declare SOURCEFOLDER="/Cat_test/cat_test/"
declare TEMPFOLDER="/Cat_Test/cat_test/temp/"
# Create temporary folder
cd $SOURCEFOLDER
mkdir $TEMPFOLDER
chk_abnd $?
# Move files into temporary folder
mv *.TXT $SOURCEFOLDER $TEMPFOLDER
chk_abnd $?
# Change directory to temporary folder
cd $TEMPFOLDER
chk_abnd $?
# Iterate through files in temp folder and create temporary listing files
for FILE in $TEMPFOLDER
do
echo $FILE >> $FILELISTING
done
# Iterate through the lines of FILELISTING and store dates into array for sorting
while read lines
do
array[$i] = "${$line:x:y}"
(( i++ ))
done <$FILELISTING
# Sort dates in array
for ((i = 0; i < $n ; i++ ))
do
for ((j = $i; j < $n; j++ ))
do
if [ $array[$i] -gt $array[$j] ]
then
t=${array[i]}
array[$i]=${array[$j]}
array[$j]=$t
fi
done
done
# Get first and last date of array and construct output filename
OT_FILE=FILENAME.${array[1]}_${array[-1]}.txt
# Sort files in folder
# Cat files into one
cat *.ACCT > "$OT_FILE.temp"
chk_abnd $?
# Remove Hex 1A
# tr '\x1A' '' < "$OT_FILE.temp" > $OT_FILE
# Cleanup - Remove File Listing
rm $FILE_LISTING
chk_abnd $?
rm $OT_FILE.temp
chk_abnd $?

Assuming that the base list of your files can be identified using FILENAME*.TXT (nice and simple), ls can be used to generate an ordered list. By default that list is sorted ascending alphabetically, and thus, because of the date format you've chosen, in ascending date order.
You can get the earliest and latest dates as follows:
$ earliest=$( ls -1 FILENAME*.TXT | head -1 | cut -c9-16 )
$ echo $earliest
20160220
$ latest=$( ls -1 FILENAME*.TXT | tail -1 | cut -c9-16 )
$ echo $latest
20160222
Therefore your file name can be produced using:
filename="FILENAME.${earliest}_${latest}.TXT"
And the concatenation should be as simple as:
cat $( ls -1 FILENAME*.TXT ) > ${filename}
though if you are writing to the same directory, you may wish to direct the output first to a temporary name that doesn't meet this pattern and then rename it. Perhaps something like:
earliest=$( ls -1 FILENAME*.TXT | head -1 | cut -c9-16 )
latest=$( ls -1 FILENAME*.TXT | tail -1 | cut -c9-16 )
filename="FILENAME.${earliest}_${latest}.TXT"
cat $( ls -1 FILENAME*.TXT ) > temp_${filename}
mv temp_${filename} ${filename}
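If you'd rather avoid parsing ls output altogether, a glob also expands in ascending lexical (and therefore date) order, so the same result can be had with a bash array. A minimal sketch, assuming the fixed FILENAME prefix (so the date occupies characters 9-16) and bash 4.3+ for the negative index:
files=( FILENAME*.TXT )        # glob expands in lexical, hence date, order
earliest=${files[0]:8:8}       # skip the 8 characters of "FILENAME", take the 8-digit date
latest=${files[-1]:8:8}        # negative subscripts need bash 4.3+
filename="FILENAME.${earliest}_${latest}.TXT"
cat "${files[@]}" > "temp_${filename}"
mv "temp_${filename}" "${filename}"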

Here are some hints; cat does most of the work.
If your filenames have fixed size date fields, as in your example, lexical sorting is enough.
ls -1 FILENAME* > allfiles
aggname=$(sed -rn '1s/^([^0-9]*)([0-9]+).*/\1.\2/p; $s/^[^0-9]*([0-9]+).*/\1.TXT/p' allfiles |
    paste -sd_)
cat allfiles | xargs cat > $aggname
You can combine the last two steps into one, but it's more readable this way.
Don't reinvent the wheel.
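For the three example files, the two sed substitutions keep only the first and last lines of allfiles, trimmed to the pieces of the final name, and paste joins them. A quick check, assuming GNU sed:
$ sed -rn '1s/^([^0-9]*)([0-9]+).*/\1.\2/p; $s/^[^0-9]*([0-9]+).*/\1.TXT/p' allfiles
FILENAME.20160220
20160222.TXT
$ sed -rn '1s/^([^0-9]*)([0-9]+).*/\1.\2/p; $s/^[^0-9]*([0-9]+).*/\1.TXT/p' allfiles | paste -sd_
FILENAME.20160220_20160222.TXT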

Related

How to only concatenate files with same identifier using bash script?

I have a directory with files, some of which share the same ID, which is always the first part of the file name, before the first underscore. E.g.:
S100_R1.txt
S100_R2.txt
S111_1_R1.txt
S111_R1.txt
S111_R2.txt
S333_R1.txt
I want to concatenate the files with identical IDs (and, if possible, place the original files in another dir), e.g. output:
original files (folder)
S100_merged.txt
S111_merged.txt
S333_R1.txt
Small note: I imagine that perhaps a solution would be to place all files which will be processed by the code in a new directory, and then in a second step move the files with the appended "merged" back to the original dir, or something like this...
I am extremely new to bash scripting, so I really can't produce this code. I am used to the R language and I can picture how it should work, but I can't write it.
My pitiful attempt is something like this:
while IFS= read -r -d '' id; do
cat *"$id" > "./${id%.txt}_grouped.txt"
done < <(printf '%s\0' *.txt | cut -zd_ -f1- | sort -uz)
or this:
for ((k=100;k<400;k=k+1));
do
IDList= echo "S${k}_S*.txt" | awk -F'[_.]' '{$1}'
while [ IDList${k} == IDList${k+n} ]; do
cat IDList${k}_S*.txt IDList${k+n}_S*.txt S${k}_S*.txt S${k}_S*.txt >cat/S${k}_merged.txt &;
done
Sometimes there is only one version of a file (e.g. S333_R1.txt), sometimes two (S100*), three (S111*), or more.
I am prepared for harsh critique of this question because I am so far from a solution, but if someone would be willing to help me out I would greatly appreciate it!
while read line
do
    if [[ "$(find . -maxdepth 1 -name "${line}_*.txt" | wc -l)" -gt 1 ]]
    then
        cat "${line}"_*.txt > "${line}_merged.txt"
    fi
done <<< "$(for i in *_*.txt; do echo "$i"; done | awk -F_ '{ print $1 }' | sort -u)"
Search for files matching *_*.txt and pipe the names into awk, printing the part before the first "_" (deduplicated with sort -u). Run this through a while loop. For each prefix, check whether the number of matching files is greater than 1 using find, and if it is, cat the files with that prefix into a merged file.
for id in $(ls | grep -Po '^[^_]+' | uniq) ; do
    if [ $(ls ${id}_*.txt 2> /dev/null | wc -l) -gt 1 ] ; then
        cat ${id}_*.txt > _${id}_merged.txt
        mv ${id}_*.txt folder
    fi
done
for f in _*_merged.txt ; do
    mv ${f} ${f:1}
done
A plain bash loop with preprocessing:
# first get the list of files
find . -type f |
# then extract the prefix
sed 's#./\([^_]*\)_#\1\t&#' |
# then in a loop merge the files
while IFS=$'\t' read prefix file; do
    cat "$file" >> "${prefix}_merged.txt"
done
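For the example files, the sed stage just prepends each path's prefix and a tab (the separator below is a literal tab; order may vary since find does not sort):
$ find . -type f | sed 's#./\([^_]*\)_#\1\t&#'
S100	./S100_R1.txt
S100	./S100_R2.txt
S111	./S111_1_R1.txt
S111	./S111_R1.txt
S111	./S111_R2.txt
S333	./S333_R1.txt
Note that this simple loop merges every prefix, even a singleton like S333; the refinement below addresses that.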
That script is iterative - one file at a time. To detect whether there is only one file with a specific prefix, we have to look at all the files at once. So first, an awk script joins the filenames that share a common prefix:
find . -type f | # maybe `sort |` ?
# join filenames with common prefix
awk '{
    f=$0;                           # remember the file path
    gsub(/.*\//,"");gsub(/_.*/,""); # extract prefix from filepath and store it in $0
    a[$0]=a[$0]" "f                 # join path with leading space in associative array indexed by prefix
}
# Output prefix and filenames separated by spaces.
# TBH a tab would be a better separator...
END{for (i in a) print i a[i]}
' |
# Read input separated by spaces into a bash array
while IFS=' ' read -ra files; do
    # first array element is the prefix
    prefix=${files[0]}
    unset files[0]
    # the rest are the files
    case "${#files[@]}" in
        0) echo super error ;;
        # one file - preserve the filename
        1) cat "${files[@]}" > "$outdir"/"${files[1]}" ;;
        # more files - use a _merged.txt suffix
        *) cat "${files[@]}" > "$outdir"/"${prefix}_merged.txt" ;;
    esac
done
Tested on repl.
IDList= echo "S${k}_S*.txt"
Executes the command echo with the environment variable IDList exported and set to empty, with one argument equal to S<insert value of k here>_S*.txt.
Filename expansion (i.e. * -> list of files) is not performed inside double quotes.
To assign the result of a command to a variable, use command substitution: var=$( something something | something )
IDList${k+n}_S*.txt
${var+pattern} is a variable expansion; it does not add two variables together. It expands to pattern when var is set and to nothing when var is unset. See shell parameter expansion and my answer on ${var-pattern}, which is similar.
To add two numbers use arithmetic expansion: $((k + n)).
awk -F'[_.]' '{$1}'
$1 on its own is just invalid here. To print a line, print it: {print $1}.
Remember to check your scripts with http://shellcheck.net
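Putting those fixes together, the intent of the original lines could be expressed like this (illustrative only, not a complete solution; n is assumed to hold a number):
IDList=$( echo "S${k}_S*.txt" | awk -F'[_.]' '{ print $1 }' )  # command substitution; prints e.g. S100
next=$(( k + n ))                                              # arithmetic expansion adds k and n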
A pure bash way below. It uses only globs (no need for external commands like ls or find for this question) to enumerate filenames, and an associative array (supported by bash since version 4.0) to compute frequencies of ids. Parsing ls output to list files is questionable in bash. You may consider reading ParsingLs.
#!/bin/bash
backupdir=original_files # The directory to move the original files
declare -A count # Associative array to hold id counts
# If it is assumed that the backup directory exists prior to call, then
# drop the line below
mkdir "$backupdir" || exit
for file in [^_]*_*; do ((++count[${file%%_*}])); done
for id in "${!count[@]}"; do
    if ((count[$id] > 1)); then
        mv "$id"_* "$backupdir"
        cat "$backupdir/$id"_* > "$id"_merged.txt
    fi
done
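Run against the sample files, the script would leave the desired layout behind. A hypothetical session, with the script saved as merge_ids.sh:
$ ./merge_ids.sh
$ ls
S100_merged.txt  S111_merged.txt  S333_R1.txt  merge_ids.sh  original_files
$ ls original_files
S100_R1.txt  S100_R2.txt  S111_1_R1.txt  S111_R1.txt  S111_R2.txt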

Bash: Subshell behaviour of ls

I am wondering why I do not get the same output from:
ls -1 -tF | head -n 1
and
echo $(ls -1 -tF | head -n 1)
I am trying to get the last modified file, but when I use the command inside a subshell I sometimes get more than one file as the result.
Why is that, and how can I avoid it?
The problem arises because you are using an unquoted subshell, and the -F flag makes ls append shell-special characters to filenames.
-F, --classify
append indicator (one of */=>@|) to entries
Executable files are appended with *.
When you run
echo $(ls -1 -tF | head -n 1)
then
$(ls -1 -tF | head -n 1)
will return a filename, and if that file happens to be executable and its name is also a prefix of another file's name, then the unquoted expansion will glob and return both.
For example if you have
test.sh
test.sh.backup
then it will return
test.sh*
which when echoed expands to
test.sh test.sh.backup
Quoting the subshell prevents this expansion
echo "$(ls -1 -tF | head -n 1)"
returns
test.sh*
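The effect is easy to reproduce in an otherwise empty directory (a sketch; any executable whose name is a prefix of another file's name will do):
$ touch test.sh.backup; sleep 1; touch test.sh; chmod +x test.sh
$ echo $(ls -1 -tF | head -n 1)
test.sh test.sh.backup
$ echo "$(ls -1 -tF | head -n 1)"
test.sh*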
I just found the error:
If you use echo $(ls -1 -tF | head -n 1), the file globbing mechanism may produce additional matches, because if the result is an executable it has a * appended to its name.
So echo "$(ls -1 -tF | head -n 1)" avoids this.
I tried to put the "why -F" in a comment, but now I've decided to put it here:
I added the following lines to my .bashrc to have shortcuts that list the last modified files or directories:
function L {
    myvar=$1; h=${myvar:="1"};
    echo "last ${h} modified file(s):";
    export L=$(ls -1 -tF | fgrep -v / | head -n ${h} | sed 's/\(\*\|=\|@\)$//g');
    ls -l $L;
}
function LD {
    myvar=$1;
    h=${myvar:="1"};
    echo "last ${h} modified directories:";
    export LD=$(ls -1 -tF | fgrep / | head -n $h | sed 's/\(\*\|=\|@\)$//g');
    ls -ld $LD;
}
alias ol='L; xdg-open $L'
alias cdl='LD; cd $LD'
So now I can use L (or L 5) to list the last (or last 5) modified files, but not directories.
And with L; jmacs $L I can open my editor to edit it. Traditionally I used my alias lt='ls -lrt', but then I have to retype the name...
Now, after mkdir ... I use cdl to change to that dir.
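As an aside, the "last modified file" part can be computed without parsing ls at all, using bash's -nt (newer-than) test. A minimal sketch:
newest=
for f in *; do
    [[ -f $f ]] || continue                          # files only, like the fgrep -v /
    [[ -z $newest || $f -nt $newest ]] && newest=$f  # keep the newer of the two
done
echo "last modified file: $newest"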

Find files which share part of a filename

In my current directory there are many files. Some of the files share part of their filename.
e.g.:
XGAE_537493_GSR.FITS
TGFE_537493_RRF.FITS
EGRE_537497_HDR.FITS
TRTE_537497_YUH.FITS
TRXX_537499_YDF.FITS
.
.
Files 1 & 2 would be a match, as would files 3 & 4. File 5 has no match. Therefore, files 1,2,3 and 4 would be moved.
I want to move the files which share part of their filename, in order to separate them from the ones that don't.
I was attempting to do this using bash. I googled but couldn't find pages describing quite the process I need. So far, in pseudo-code, I have:
FOR F IN *
IF ${FILE:5:10} MATCHES ANY OTHER ${FILE:5:10}
MOVE ALL MATCHES TO ANOTHER DIRECTORY
Any information to help me move in the right direction would be appreciated.
Try this:
for f in ./*.FITS ; do
    middleBit=$(echo "$f" | cut -d'_' -f2)
    count=$(ls *_"${middleBit}"_*.FITS 2> /dev/null | wc -l)
    if [ "$count" -gt 1 ]
    then
        for match in *_"${middleBit}"_*.FITS ; do
            mv "$match" ./somewhere
        done
    fi
done
Using an associative array in bash 4 you can do it easily:
#!/bin/bash
declare -A arr
for f in *.FITS; do
    k="${f:5:6}"
    if [[ ${arr[$k]} ]]; then                              # seen this key before
        [[ ${arr[$k]} != moved ]] && mv "${arr[$k]}" /dest/  # move the remembered first file once
        mv "$f" /dest/; arr[$k]=moved
    else
        arr[$k]=$f                                         # remember the first file with this key
    fi
done
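Run against the five example files, the loop above would move the two matched pairs and leave the unmatched file behind. A hypothetical session, with /dest as the destination:
$ ls /dest
EGRE_537497_HDR.FITS  TGFE_537493_RRF.FITS  TRTE_537497_YUH.FITS  XGAE_537493_GSR.FITS
$ ls *.FITS
TRXX_537499_YDF.FITS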
If your file structure is fixed, you can scan the names and find duplicates in sub-fields of the file name with awk.
For example:
$ ls -1 | awk -F_ 'NF==3{f[$2]=(a[$2]++?f[$2] RS $0:$0)}
END{for(k in f) if(a[k]>1) print f[k]} '
EGRE_537497_HDR.FITS
TRTE_537497_YUH.FITS
TGFE_537493_RRF.FITS
XGAE_537493_GSR.FITS
(group order may vary)
You can then pipe the results to a cp command:
$ ... | xargs -I file cp file file.DUP
which adds the suffix .DUP to the duplicate file names, or
$ ... | xargs -I file mv file anotherlocation/
which moves them to anotherlocation.

Move files from directories listed in file

I have a directory structure like the following toy example
DirectoryTo
DirectoryFrom
-Dir1
---File1.txt
---File2.txt
---File3.txt
-Dir2
---File4.txt
---File5.txt
---File6.txt
-Dir3
---File1.txt
---File5.txt
---File7.txt
I'm trying to copy all the files from DirectoryFrom to DirectoryTo, keeping the newer file if there are duplicates.
DirectoryTo
-File1.txt
-File2.txt
-File3.txt
-File4.txt
-File5.txt
-File6.txt
-File7.txt
DirectoryFrom
-Dir1
---File1.txt
---File2.txt
---File3.txt
-Dir2
---File4.txt
---File5.txt
---File6.txt
-Dir3
---File1.txt
---File5.txt
---File7.txt
I've created a text file with a list of all the subdirectories. This list is in the order such that the NEWEST files will be listed first:
Filelist.txt
C:/DirectoryFrom/Dir1
C:/DirectoryFrom/Dir2
C:/DirectoryFrom/Dir3
So what I'd like to do is loop through each directory in Filelist.txt, copy the files, and NOT replace if the file already exists.
I'd like to do this at the command line, in a shell script, or possibly in Python. I'm pretty new to Python, but have a little experience with the command line. However, I've never done something this complicated.
In reality, I have ~60 folders, each with 50-200 files in them, to give you a feel for how many I have. Also, each file is ~75MB.
I've done something similar in R before, but it's slow and not really meant for this. But here's what I've tried for a shell script, edited to fit this toy example:
#!/bin/bash
for line in Filelist.txt
do
cp -n line C:/DirectoryTo/
done
If you have only one directory level in your DirectoryFrom, then you can use:
cp -n DirectoryFrom/*/* DirectoryTo
Explanation: copy every file that exists in a subdirectory of DirectoryFrom to DirectoryTo if it doesn't already exist there.
The -n flag prevents overwriting files that already exist.
cp will also skip any directories that sit inside the subdirectories of DirectoryFrom (without -r it copies only files).
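Because -n keeps the first copy it encounters and the glob expands in lexical order, Dir1 (which holds the newest files, per Filelist.txt) wins whenever a name is duplicated. A quick check with the toy layout:
$ cp -n DirectoryFrom/*/* DirectoryTo
$ ls DirectoryTo
File1.txt  File2.txt  File3.txt  File4.txt  File5.txt  File6.txt  File7.txt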
# Create test environment:
mkdir C:/DirectoryTo
mkdir C:/DirectoryFrom
cd C:/DirectoryFrom
mkdir Dir1 Dir2 Dir3
(
cat << EOF
Dir1/File1.txt
Dir1/File2.txt
Dir1/File3.txt
Dir2/File4.txt
Dir2/File5.txt
Dir2/File6.txt
Dir3/File1.txt
Dir3/File5.txt
Dir3/File7.txt
EOF
)| while read f
do
echo "$f : `date`"
echo "$f : `date`" > $f
sleep 1
done
# create Filelist.txt file :
(
cat << EOF
C:/DirectoryFrom/Dir1
C:/DirectoryFrom/Dir2
C:/DirectoryFrom/Dir3
EOF
) > Filelist.txt
# Generate the list of all files:
cd C:/DirectoryFrom
cat Filelist.txt | while read f; do ls -1 $f; done | sort -u > filenames.txt
cat filenames.txt
# list of all file paths, sorted by time order:
cd C:/DirectoryFrom
ls -1tr */* > all_filespath_sorted.txt
cat all_filespath_sorted.txt
# select the files to be copied:
cat filenames.txt | while read f; do cat all_filespath_sorted.txt | grep $f | tail -1 ; done
# copy the selected files:
cat filenames.txt | while read f; do cat all_filespath_sorted.txt | grep $f | tail -1 ; done | while read c
do
    echo $c
    cp -p $c C:/DirectoryTo
done
# verifying :
cd C:/DirectoryTo
ls -ltr
# or
ls -1 | while read f; do echo -e "\n$f\n-------"; cat $f; done
#------------------------------------------------
# Other solution for a limited number of files :
#------------------------------------------------
# To list files by order :
find `cat Filelist.txt | xargs` -type f | xargs ls -1tr
# To copy files, the newer will replace the older :
find `cat Filelist.txt | xargs` -type f | xargs ls -1tr | while read c
do
    echo $c
    cp -p $c C:/DirectoryTo
done

Incrementing number in filenames in bash

I'm trying to take a list of files and rename them, incrementing a number in their filenames. The directory contains a bunch of files named like:
senreg1.csv senreg2.csv senreg10.csv
senreg1.csv.1 senreg2.csv.1 senreg10.csv.1
senreg1.csv.2 senreg2.csv.2 senreg10.csv.2
senreg1.csv.3 senreg2.csv.3 ... senreg10.csv.3
senreg1.csv.4 senreg2.csv.4 senreg10.csv.4
... ... ...
senreg1.csv.10 senreg2.csv.10 senreg10.csv.10
senreg1.csv.11 senreg2.csv.11 senreg10.csv.11
I want to increment all of the files that end in 3 or higher so I can insert a new file with suffix 3, so I made a text file called 'renames.txt' containing all the filenames that I want to rename. Then, I tried using a for loop to do the actual renaming.
for f in `cat renames.txt`
do
newfile=`echo $f | awk 'BEGIN { FS = "."}; { printf $1 "." $2 "." $3+1 }'`
mv "$f" "$newfile"
done
I want to end up with something like:
senreg1.csv senreg2.csv senreg10.csv
senreg1.csv.1 senreg2.csv.1 senreg10.csv.1
senreg1.csv.2 senreg2.csv.2 senreg10.csv.2
senreg1.csv.4 senreg2.csv.4 ... senreg10.csv.4
senreg1.csv.5 senreg2.csv.5 senreg10.csv.5
... ... ...
senreg1.csv.11 senreg2.csv.11 senreg10.csv.11
senreg1.csv.12 senreg2.csv.12 senreg10.csv.12
But instead I get:
senreg1.csv senreg2.csv senreg10.csv
senreg1.csv.1 senreg2.csv.1 senreg10.csv.1
senreg1.csv.2 senreg2.csv.2 ... senreg10.csv.2
senreg1.csv.12 senreg2.csv.12 senreg10.csv.12
The contents of senregX.csv.12 are the same as the original senregX.csv.3. Hope this explanation made sense. Anybody know what's going on here?
You need to rename the files in reverse.
11 -> 12
10 -> 11
9 -> 10
and so on.
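A minimal sketch of the reverse-order rename, assuming 11 is the highest existing suffix as in the listing above:
for (( i = 11; i >= 3; i-- )); do         # highest suffix first
    for f in senreg*.csv."$i"; do
        [ -e "$f" ] || continue           # skip if the glob matched nothing
        mv "$f" "${f%.*}.$(( i + 1 ))"    # e.g. senreg1.csv.11 -> senreg1.csv.12
    done
done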
This script does what you want without a temporary file list, just to offer a different solution:
#!/bin/bash
for file in $(ls -1 *[0-9]) # list files ending with a number
do
    # get file name and id
    name=$(echo $file | sed 's/\(.*\)\.\([0-9]\+\)$/\1/g');
    id=$(echo $file | sed 's/.*\.\([0-9]\+\)$/\1/g');
    if [ $id -ge 3 ]
    then
        ((id += 1))
        # we need to back up the files because we may overwrite some of them
        cp $file "_$name.$id"
    fi
done
# remove old files
for file in $(ls -1 [!_]*[0-9])
do
    id=$(echo $file | sed 's/.*\.\([0-9]\+\)$/\1/g');
    if [ $id -ge 3 ]
    then
        rm $file;
    fi
done
# finish
for file in $(ls -1 _*[0-9])
do
    name=$(echo $file | tr -d '_');
    mv "$file" "$name";
done
