How to list the latest-timestamp files among partially duplicate names - bash

I have more than 10k files like the ones in the example below. I would like to filter out partial duplicates, i.e. files sharing the prefix 123456 are considered duplicates of each other, and from each such group of duplicates I need the file with the latest timestamp:
123456_20200425-012034.xml
123456_20200424-120102.xml
123456_20200425-121102.xml
234567_20200323-112232.xml
123456_20200423-111102.xml --- assume this file has the latest timestamp of all the duplicate files above
How can I do this using bash?
The output should also include files that have no duplicates: of the 10k files, some are not duplicated at all, and those should appear in the output as well.
The required output (latest-timestamp files) looks like this:
123456_20200423-111102.xml
234567_20200323-112232.xml

I have done like this:
list=$(ls | awk -F _ '{print $1}' | uniq)
for i in $list
do
    mv "$(find . -type f -name "$i*" -print | sort -n -t _ -k 2 | tail -1)" ../destination
done
1) Stored the unique prefixes in list.
2) Looped over the list, found the latest-timestamp file for each prefix, and moved it to the destination folder.

Because we can assume that globs are sorted alphanumerically, we can use a wildcard to iterate over the files and build a set of results:
#!/bin/bash
# change INPUTDIR to your input directory
INPUTDIR=.
seen=
store=()
for file in "$INPUTDIR"/* ; do
    if [[ "$seen" != *"${file%_*}"* ]] ; then
        store+=( "$file" )
        seen="$seen ${file%_*}"
    fi
done
# results
echo "${store[@]}"
Explanation:
Iterate over all files in a directory.
Take the part of the path before the underscore (e.g. ./123456). If we haven't seen it before (i.e. "$seen" != *"${file%_*}"*), add the file to our list of files to store. If we have seen it before, skip the file.
Print the results.
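If you need the newest timestamp per prefix rather than the first file encountered, a minimal variant of the same idea (a sketch assuming bash 4+ for associative arrays and the PREFIX_TIMESTAMP.xml naming from the question) is to let later glob entries overwrite earlier ones, since the ascending sort order means the last file seen for a prefix is its newest:
#!/bin/bash
INPUTDIR=.   # change to your input directory
declare -A latest
for file in "$INPUTDIR"/*_*.xml ; do
    name=${file##*/}             # strip the directory part
    latest[${name%%_*}]=$file    # later (newer) entries overwrite earlier ones
done
# one entry per prefix: the newest file
printf '%s\n' "${latest[@]}"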

Related

Merge same-name multi-part PDF files in order

How can I merge PDF files that share the same base name?
poetry_2.pdf
poetry_3.pdf
poetry_4.pdf
metaphysics_2.pdf
metaphysics_3.pdf
I am looking for:
poetry.pdf
metaphysics.pdf
This loop, which was meant to check the PDF files and merge them with pdfunite, failed:
for file1 in *_02.pdf ; do
    # get your second_ files
    file2=${file1/_02.pdf/_03.pdf}
    # merge them together
    pdfunite $file1 $file2 $file1.pdf
done
First, you need a list of prefixes (e.g. poetry, metaphysics). Then, iterate over that list and unite prefix_*.pdf into prefix.pdf.
Here we generate the list of prefixes by searching for files ending with _NUMBER.pdf and removing that last part. This assumes that filenames do not contain linebreaks.
printf '%s\n' *_*.pdf | sed -En 's/_[0-9]+\.pdf$//p' | sort -u |
while IFS= read -r prefix; do
    pdfunite "$prefix"_*.pdf "$prefix.pdf"
done

Concatenating multiple fastq files and renaming to parent folder

Currently, I have genome sequencing data as fq.gz files and each sample consists of a folder titled with the sample name containing 2-8 pairs of forward and reverse reads. I would like to concatenate all of the forward and reverse files into one forward and one reverse file for each sample while maintaining pair order.
My data are organized as follows:
/ParentFolder/SampleA/V549_1.fq.gz
/ParentFolder/SampleA/V549_2.fq.gz
/ParentFolder/SampleA/V550_1.fq.gz
/ParentFolder/SampleA/V550_2.fq.gz
/ParentFolder/SampleB/V588_1.fq.gz
/ParentFolder/SampleB/V588_2.fq.gz
/ParentFolder/SampleB/V599_1.fq.gz
/ParentFolder/SampleB/V599_2.fq.gz
In order to concatenate the files, I tried the following:
ls *_1.fq.gz | sort | xargs cat > SampleA_1.fq.gz
ls *_2.fq.gz | sort | xargs cat > SampleA_2.fq.gz
This works for one sample folder; however, since I have many dozens of samples, I would like to write a bash script that does this across all of my samples and names each concatenated file after its parent folder. I am still learning and a bit lost, so I would greatly appreciate any help with this problem.
I have attempted the following, without success:
for i in $(find ./ -type f -name "*.fq.gz"; done | sort | uniq)
do echo "Merging 1"
cat "$i"*_1.fq.gz > "$i"CG1-1_1.fq.gz
Thank you for any input/advice/solutions.
Consider the following logic:
for each suffix (_1, _2):
    find all the *_suffix.fq.gz files
    extract the list of folders that contain them
    for each folder:
        merge its *_suffix.fq.gz files into FOLDER_suffix.fq.gz in the parent directory
p=ParentFolder                # path to the directory that holds the sample folders
for suffix in 1 2 ; do
    # Find all dirs containing *_$suffix.fq.gz files.
    dirs=$(printf '%s\n' $p/*/*_$suffix.fq.gz | sed 's:/[^/]*$::' | uniq)
    for d in $dirs ; do
        # Merge, and store the result in the parent directory.
        (cd $d ; cat *_${suffix}.fq.gz > ../${d##*/}_${suffix}.fq.gz)
    done
done
Notes:
The code assumes there are no special characters (spaces, newlines, glob characters) in the folder names.
Smaller output files will be created if the merge step decompresses the original data and re-compresses it (using gzcat and gzip), as sketched below.
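For example, a sketch of that variant for one sample's forward reads (paths taken from the question; on many Linux systems gzcat is called zcat):
# decompress, concatenate, and re-compress to get a smaller merged file
(cd ParentFolder/SampleA && gzcat *_1.fq.gz | gzip > ../SampleA_1.fq.gz)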

How to delete files from a directory using a CSV in bash

I have 600,000+ images in a directory. The filenames look like this:
1000000-0.jpeg
1000000-1.jpeg
1000000-2.jpeg
1000001-0.jpeg
1000002-0.jpeg
1000003-0.jpeg
The first number is a unique ID and the second number is an index.
{unique-id}-{index}.jpeg
How would I load the unique IDs from a CSV file and remove each file whose unique ID matches one of the IDs in the CSV file?
The CSV file looks like this:
1000000
1000001
1000002
... or I can have it separated by semicolons like so (if necessary):
1000000;1000001;1000002
You can set the IFS variable to ; and loop over the values read into an array:
#! /bin/bash
while IFS=';' read -r -a ids ; do
    for id in "${ids[@]}" ; do
        rm "$id"-*.jpeg
    done
done < file.csv
Try running the script with echo rm ... first to verify it does what you want.
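For example, the dry run just replaces the rm line inside the loop with an echoed version:
    for id in "${ids[@]}" ; do
        echo rm "$id"-*.jpeg    # prints the commands instead of running them
    done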
If there's exactly one ID per line, this will show you all matching file names:
ls | grep -f unique-ids.csv
If that list looks correct, you can delete the files with:
ls | grep -f unique-ids.csv | xargs rm
Caveat: This is a quick and dirty solution. It'll work if the file names are all named the way you say. Beware it could easily be tricked into deleting the wrong things by a clever attacker or a particularly hapless user.
You could use find and sed:
find dir -regextype posix-egrep \
    -regex ".*($(sed 's/\;/|/g' ids.csv))-[0-9][0-9]*\.jpeg"
Replace dir with your search directory and ids.csv with your CSV file. To delete the files, you can add the -delete option, as shown below.
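For example, assuming ids.csv is the semicolon-separated form, you could preview the matches first and then switch to deletion:
# dry run: print the matching files
find dir -regextype posix-egrep \
    -regex ".*($(sed 's/\;/|/g' ids.csv))-[0-9][0-9]*\.jpeg" -print
# once the list looks right, replace -print with -delete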

Comparing two directories to produce output

I am writing a Bash script that will replace files in folder A (source) with folder B (target). But before this happens, I want to record 2 files.
The first file will contain a list of files in folder B that are newer than folder A, along with files that are different/orphans in folder B against folder A
The second file will contain a list of files in folder A that are newer than folder B, along with files that are different/orphans in folder A against folder B
How do I accomplish this in Bash? I've tried using diff -qr but it yields the following output:
Files old/VERSION and new/VERSION differ
Files old/conf/mime.conf and new/conf/mime.conf differ
Only in new/data/pages: playground
Files old/doku.php and new/doku.php differ
Files old/inc/auth.php and new/inc/auth.php differ
Files old/inc/lang/no/lang.php and new/inc/lang/no/lang.php differ
Files old/lib/plugins/acl/remote.php and new/lib/plugins/acl/remote.php differ
Files old/lib/plugins/authplain/auth.php and new/lib/plugins/authplain/auth.php differ
Files old/lib/plugins/usermanager/admin.php and new/lib/plugins/usermanager/admin.php differ
I've also tried this
(rsync -rcn --out-format="%n" old/ new/ && rsync -rcn --out-format="%n" new/ old/) | sort | uniq
but it doesn't give me the scope of results I require. The struggle is that the output isn't in the format I need: I only want files, not directories, to appear in the text files, e.g.:
conf/mime.conf
data/pages/playground/
data/pages/playground/playground.txt
doku.php
inc/auth.php
inc/lang/no/lang.php
lib/plugins/acl/remote.php
lib/plugins/authplain/auth.php
lib/plugins/usermanager/admin.php
List of files in directory B (new/) that are newer than directory A (old/):
find new -newermm old
This merely runs find and examines the content of new/ as filtered by -newerXY reference with X and Y both set to m (modification time) and reference being the old directory itself.
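Since the question only wants files (not directories) to appear in the lists, the same command can be narrowed with -type f:
find new -type f -newermm old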
Files that are missing in directory B (new/) but are present in directory A (old/):
A=old B=new
diff -u <(find "$B" |sed "s:$B::") <(find "$A" |sed "s:$A::") \
|sed "/^+\//!d; s::$A/:"
This sets variables $A and $B to your target directories, then runs a unified diff on their contents (using process substitution to locate with find and remove the directory name with sed so diff isn't confused). The final sed command first matches for the additions (lines starting with a +/), modifies them to replace that +/ with the directory name and a slash, and prints them (other lines are removed).
Here is a bash script that will create the file:
#!/bin/bash
# Usage: bash script.bash OLD_DIR NEW_DIR [OUTPUT_FILE]
# compare given directories
if [ -n "$3" ]; then  # the optional 3rd argument is the output file
    OUTPUT="$3"
else  # if it isn't provided, escape path slashes to underscores
    OUTPUT="${2////_}-newer-than-${1////_}"
fi
{
    find "$2" -newermm "$1"
    diff -u <(find "$2" |sed "s:$2::") <(find "$1" |sed "s:$1::") \
        |sed "/^+\//!d; s::$1/:"
} |sort > "$OUTPUT"
First, the script determines the output file: it either comes from the third argument or is built from the other two arguments, with slashes converted to underscores in case they are paths. For example, running bash script.bash /usr/local/bin /usr/bin would write its file list to _usr_local_bin-newer-than-_usr_bin in the current working directory.
This combines the two commands and then ensures they are sorted. There won't be any duplicates, so you don't need to worry about that (if there were, you'd use sort -u).
You can get your first and second files by changing the order of arguments as you invoke this script.
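For instance (the report file names here are only examples):
# files in new/ that are newer than old/, plus files present only in old/
bash script.bash old new report-new-vs-old.txt
# the reverse comparison for the second report file
bash script.bash new old report-old-vs-new.txt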

How to get the most recent timestamped file in BASH

I'm writing a deployment script that saves timestamped backup files to a backups directory. I'd like to do a rollback implementation that would roll back to the most recent file.
My backups directory:
$:ls
. 1341094065_public_html_bu 1341094788_public_html_bu
.. 1341094390_public_html_bu
1341093920_public_html_bu 1341094555_public_html_bu
I want to identify the most recent file (by timestamp in the filename) in the backup directory, and save its name to a variable, then cp it to ../public_html, and so on...
ls -t will sort files by mtime. ls -t | head -n1 will select the newest file. This is independent of any naming scheme you have, which may or may not be a plus.
...and a more "correct" way, which won't break when filenames contain newlines, nor when there are no matching files (an unexpanded glob):
for newestfile in ./* ; do : ; done
if test -e "$newestfile"; then
    : # do something with "$newestfile"
fi
The latest-timestamped filename should sort last alphabetically. So you can then use tail -n1 to extract it.
For files that don't have newlines in their names:
shopt -s nullglob
printf '%s\n' "$buDir"/* | tail -n 1
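A short sketch tying this back to the rollback described in the question (buDir and the ../public_html destination are assumptions based on the question's description; adapt as needed):
shopt -s nullglob
buDir=backups                                   # assumed backups directory
newest=$(printf '%s\n' "$buDir"/* | tail -n 1)
if [ -n "$newest" ]; then
    cp -r "$newest" ../public_html
fi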
