only copying files with unique content - bash

I am trying to filter through data, and would like to copy only the files which have just one representative of a certain group. For example, a file might look like:
sample_AAAAA_9824_r1
GGAAGCATCGTGGGAACTGCTTCACTAAGAAGGAAGTCACAGTTACTTCATAGATATCCATCACTAAAYGTGAGTAGATTGTGTTAATGTGTTATATATGACTGAAAAATTTTGCCTGGATCAGAATACGAAACCTTCTTGAGATATTGTAATGAATTTCAGTCATATGAGAAGTGATGGAGGGGGTGTGAATACATATACTGTGTCATTATCCATGCAGTATkATACTRCAAAGTTC-----
sample_AACCC_12358_r1
GGAAGCATCGTGGGAACTGCTTCACTAAGAAGGAAGTCACAGTTACTTCATAGATATCCATCACTAAATGTGAGTAGATTGTGTTAATGTGTTATATATGACTGAAAAWTTTTGCCTGGATCAGAATACGAAACCTTCTTGAGATATTGTAATGAATTTCAGTCATATGAGAAGTGATGGAGGGGGTGTGAATACATATACTGTGTCATTATCCATGCAGTATTATACTGCAAAGTTC-----
sample_AATTT_3905_r1
GGAAGCATCGTGGGAACTGCTTCACTAAGAAGGAAGTCACAGTTACTTCATAGATATCCATCACTAAATGTGAGTAGATTGTGTTAATGTGTTATATATGACTGAAAAATTTTGCCTGGATCAGAATACGAAACCTTCTTGAGATATTTTCAGTCATATGAGAATTGATGGAGGGGGTGTGAATACATATACTGTGTCATTATCCATGCAGTATGATACTACAAAGTTCCTTCCCATA-----
sample_ACGTA_178_r1
GGAAGCATCGTAGGAACTGCTTCACTAAGAAGGAAGTCACAGTTACTTCATAGATATCCATCACTAAATGTGAGTAGATTGTGTTAATGTGTTATATATGACTGAAAATTTTTGCCTGGATCAGAATACGAAACCTTCTTGAGATATTGTAATGAATTTCAGTCATATGAGAAGCGATGGAGGGGGTGTGAATACATATACTGTGTCATTATCCATGCAGTATGATACTACAAAGTTC-----
sample_ACTGC_9933_r1
GGAAGCATCGTRGGAACTGCTTCACTAAGAAGGAAGTCACAGTTACTTCATAGATATCCATCACTAAATGTGAGTAGATTGTGTTAATGTGTTATATATGACTGAAAAwTTTTGCCTGGATCAGAATACGAAACCTTCTTGAGATATTGTAATGAATTTCAGTCATATGAGAAGYGATGGAGGGGGTGTGAATACATATACTGTGTCATTATCCATGCAGTATGATACTACAAAGTTC-----
I have about 36000 of these files, and would like to copy only those which have just one entry per sample to a different folder (one sample is, for example, sample ACTGC). There are 26 sample "numbers", each consisting of 5 letters (e.g. AAAAA, AATTT, ACGTA, ...); the number that follows and the "r1" are irrelevant.
I have been looking through different bash scripts for this, but cannot find the exact thing I need. I can count the occurrence of each sample in a file, but this is probably not the way to go...
Any help is greatly appreciated,
Yannick

You can use a loop and cmp to compare the output of sort against the output of sort | uniq; if the two are identical, no sample code is repeated:
for f in files/*
do
    if cmp -s <(grep sample "${f}" | cut -d'_' -f2 | sort) \
              <(grep sample "${f}" | cut -d'_' -f2 | sort | uniq)
    then
        echo "copying file ${f} here..."
        # ... copy
    else
        echo "not copying file ${f} here"  # do nothing...!
    fi
done
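Equivalently, uniq -d can report duplicated sample codes directly, which lets the copy happen in the same loop. A minimal sketch; the unique_samples destination directory is an assumption:
dest=unique_samples                       # hypothetical destination folder
mkdir -p "$dest"
for f in files/*
do
    # uniq -d prints only duplicated sample codes; empty output means every sample occurs once
    if [ -z "$(grep sample "$f" | cut -d'_' -f2 | sort | uniq -d)" ]
    then
        cp "$f" "$dest/"
    fi
done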

Related

piping output to uniq or sort -u not returning expected result

I have tens of thousands of files that have names with similar, often repeating prefixes. I want to loop through all filenames and get a list of unique prefixes.
AB-61-GA_0001c.txt
AB-61-GA_aseguh.xml
AM-81-BU_0678.mp4
AM-81-BU_ochyu.doc
AM-92-LA_gatyt.csv
I want to end up with output:
AB-61-GA
AM-81-BU
AM-92-LA
For that I've put together the following shell script:
#!/bin/bash
for i in *.*
do
UNIQUEOBJECT=$(echo "$i" | cut -d '_' -f 1 | sort -u)
echo "$UNIQUEOBJECT"
done
For some reason I end up with the list of prefixes (everything before the underscore) with identical prefixes still repeating. Obviously this is just a lack of understanding of bash scripting on my part but what am I doing wrong?
Thanks
The problem is that your for loop is sending one filename at a time. So you sort and unique a single filename.
You could do something like (syntax may not be quite right as I don't have a Linux box for testing at the moment)
#!/bin/bash
UNIQUEOBJECT=$(for i in *.*
do
    echo "$i"
done | cut -d '_' -f 1 | sort -u)
echo "$UNIQUEOBJECT"
You need to generate the whole list before you sort; your original sorted inside the loop, one filename at a time.
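For what it's worth, once all the names are printed one per line, the loop isn't strictly needed; a pipeline alone gives the same result (a sketch, assuming no filename contains a newline):
printf '%s\n' *.* | cut -d '_' -f 1 | sort -u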

Count and remove extraneous files (bash)

I am getting stuck on finding a succinct solution to the following.
In a given directory, I have the following files:
10_MIDAP.nii.gz
12_MIDAP.nii.gz
14_MIDAP.nii.gz
16_restAP.nii.gz
18_restAP.nii.gz
I am only supposed to have two "MIDAP" files and one "restAP" file. The additional files may not contain the full data, so I need to remove them. These are likely to be smaller in size and/or have an earlier sequence number (e.g., 10).
I know how to count / echo the number of files:
MIDAP=$(find "$DATADIR" -name "*MIDAP.nii.gz" | wc -l)
RestAP=$(find "$DATADIR" -name "*restAP.nii.gz" | wc -l)
echo "MIDAP files = $MIDAP"
echo "RestAP files = $RestAP"
Any suggestions on how to succinctly remove the unneeded files, such that I end up with two "MIDAP" files and one "restAP" (in cases where there are extraneous files)? As of now, I imagine it would be something like this...
if (( $MIDAP > 2 )); then
...magic happens
fi
Thanks for any advice!
Here is an approach.
Create some test files:
$ for i in {1..10}; do touch ${i}_restAP; touch ${i}_MIDAP; done
Sort based on the numbers, and remove everything but the last file (or the last two):
$ find . -name '*restAP*' | sort -V | head -n -1 | xargs rm
$ find . -name '*MIDAP*' | sort -V | head -n -2 | xargs rm
$ ls -1
10_MIDAP
10_restAP
9_MIDAP
You may want to change the sort if the decision should be based on file size instead; a sketch of that follows.
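For the size-based variant, one option is to have find print each file's size, sort numerically, and delete everything but the largest. A sketch assuming GNU find and coreutils (use head -n -2 for the MIDAP files to keep two):
find . -name '*restAP*' -printf '%s\t%p\n' | sort -n | head -n -1 | cut -f2- | xargs -r -d '\n' rm --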

Append data to the end of a specific line in text file

I admit to being a novice at bash scripting, but I can't quite seem to figure out how to accomplish a key step in a script and couldn't find what I was looking for in other threads.
I am trying to extract some specific data (numerical values) from multiple .xml files and add those to a space or tab delimited text file. The files will be generated over time so I need a way to append a new dataset to the pre-existing text file.
For instance, I would like to extract values for 3 different categories, 1 per row or column, and the value for each category from multiple xml files. Basically, I want to build a continuous graph of the data from each of 3 categories over time.
I have the following code which will successfully extract the 3 numbers from the xml file and trim the unnecessary text:
#!/bin/sh
grep "<observation name=\"meanGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"meanBrightGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanBrightGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"std\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"std\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
This gives the output:
1.12
0.33
134.1
I would like to then read in another xml file to get:
1.12 1.45
0.33 0.54
134.1 144.1
I would be grateful for any help with doing this! Thanks in advance.
Erik
It's much safer to use proper XML handling tools. For example, in xsh, you can write something like
$f1 := open /Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml ;
$f2 := open /path/to/the/second/file.xml ;
echo ($f1 | $f2)//observation[@name="meanGhost"] ;
echo ($f1 | $f2)//observation[@name="meanBrightGhost"] ;
echo ($f1 | $f2)//observation[@name="std"] ;
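If xsh isn't available, xmllint from libxml2 can evaluate the same XPath, and paste can build up the column-per-file layout asked for. A minimal sketch, assuming the XML file is passed as the first argument and reusing the testxml.txt output file from the question:
#!/bin/sh
out=$HOME/Desktop/testxml.txt
new=$(mktemp)
for name in meanGhost meanBrightGhost std; do
    # string(...) yields the text content of the first matching element
    xmllint --xpath "string(//observation[@name='$name'])" "$1"
    echo
done > "$new"
if [ -s "$out" ]; then
    paste "$out" "$new" > "$out.tmp" && mv "$out.tmp" "$out"   # append a new column
else
    mv "$new" "$out"                                           # first run: create the file
fi
rm -f "$new"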

Using the first column of a file as input in a script

I am having some problems with using the first column ${1} as input to a script.
Currently the portions of the script looks like this.
#!/bin/bash
INPUT="${1}"
for NAME in `cat ${INPUT}`
do
SIZE="`du -sm /FAServer/na3250-a/homes/${NAME} | sed 's|/FAServer/na3250-a/homes/||'`"
DATESTAMP=`ls -ld /FAServer/na3250-a/homes/${NAME} | awk '{print $6}'`
echo "${SIZE} ${DATESTAMP}"
done
However, I want to modify INPUT="${1}" to take its value from a specific, previously generated file instead. This is so I can run the lines above in another script and use that file as the input. I also want the output to go to a new file.
So something like:
INPUT="$location/DisabledActiveHome ${1}" ???
Here's my full script below.
#!/bin/bash
# This script will search through Disabled Users OU and compare that list of
# names against the current active Home directories. This is to find out
# how much space those Home directories take up and which need to be removed.
# MUST BE RUN AS SUDO!
# Setting variables for _adm and storage path.
echo "Please provide your _adm account name:"
read _adm
echo "Please state where you want the files to be generated: (absolute path)"
read location
# String of commands to lookup information using ldapsearch
ldapsearch -x -LLL -h "REDACTED" -D $_adm#"REDACTED" -W -b "OU=Accounts,OU=Disabled_Objects,DC="XX",DC="XX",DC="XX"" "cn=*" | grep 'sAMAccountName'| egrep -v '_adm$' | cut -d' ' -f2 > $location/DisabledHome
# Get a list of all the active Home directories
ls /FAServer/na3250-a/homes > $location/ActiveHome
# Compare the Disabled accounts against Active Home directories
grep -o -f $location/DisabledHome $location/ActiveHome > $location/DisabledActiveHome
# Now get the size and datestamp for the disabled folders
INPUT="${1}"
for NAME in `cat ${INPUT}`
do
SIZE="`du -sm /FAServer/na3250-a/homes/${NAME} | sed 's|/FAServer/na3250-a/homes/||'`"
DATESTAMP=`ls -ld /FAServer/na3250-a/homes/${NAME} | awk '{print $6}'`
echo "${SIZE} ${DATESTAMP}"
done
I'm new to all of this so any help is welcome. I will be happy to clarify any and all questions you might have.
EDIT: A little more explanation because I'm terrible at these things.
The lines of code below came from a previous script and are a FOR loop:
INPUT="${1}"
for NAME in `cat ${INPUT}`
do
SIZE="`du -sm /FAServer/na3250-a/homes/${NAME} | sed 's|/FAServer/na3250-a/homes/||'`"
DATESTAMP=`ls -ld /FAServer/na3250-a/homes/${NAME} | awk '{print $6}'`
echo "${SIZE} ${DATESTAMP}"
done
It is executed by typing:
./Script ./file
The FILE that is being referenced has one column of user names and no other data:
User1
User2
User3
etc.
The script would take the file and look at the first user's name, which is referenced by
INPUT=${1}
then run a du command on that user and find out what the size of their HOME drive is. That would be reported by the SIZE variable. It does the same thing with the DATESTAMP, with regard to when the HOME drive was created for the user. When it is done with the tasks for that user, it moves on to the next name in the column until it is done.
So following that logic, I want to automate the entire process. Instead of doing this in two steps, I would like to make this all a one-step process.
The first part would be to generate the $location/DisabledActiveHome file, which would have all of the disabled users' names. Then to run the last portion to get the size and creation date of each HOME drive for all the users in the DisabledActiveHome file.
So to do that, I need to modify the
INPUT=${1}
line to reflect the previously generated file.
$location/DisabledActiveHome
I don't really understand your question, but I think you want this. Say your file is called file.txt and looks like this:
1 99
2 98
3 97
4 96
You can get the first column like this:
awk '{print $1}' file.txt
1
2
3
4
If you want to use that in your script, do this
while read NAME; do
echo $NAME
done < <(awk '{print $1}' file.txt)
1
2
3
4
Or you may prefer cut like this:
while read NAME; do
echo $NAME
done < <(cut -d" " -f1 file.txt)
1
2
3
4
Or this may suit even better
while read NAME OtherUnwantedJunk; do
echo $NAME
done < file.txt
1
2
3
4
This last, and probably best, solution above uses IFS, which is bash's Internal Field Separator, so if your file looked like this
1:99
2:98
3:97
4:96
you would do this
while IFS=":" read NAME OtherUnwantedJunk; do
echo $NAME
done < file.txt
1
2
3
4
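Applied to the script in the question, the read-based pattern would replace the INPUT/for-loop section with something like this (a sketch; the DisabledHomeSizes output file name is an assumption):
while read -r NAME _; do
    SIZE=$(du -sm "/FAServer/na3250-a/homes/$NAME" | sed 's|/FAServer/na3250-a/homes/||')
    DATESTAMP=$(ls -ld "/FAServer/na3250-a/homes/$NAME" | awk '{print $6}')
    echo "$SIZE $DATESTAMP"
done < "$location/DisabledActiveHome" > "$location/DisabledHomeSizes"   # hypothetical output file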
INPUT="$location/DisabledActiveHome" worked like a charm. I was confused about the syntax and the proper usage and output

Batch to rename files with metadata name

I recently accidentally formatted a 2TB hard drive (Mac OS Journaled)!
I was able to recover the files with Data Rescue 3; the only problem is that the program didn't give me back the files as they were, with their directory tree and names.
For example I had
|-Music
||-Enya
|||-Sonadora.mp3
|||-Now we are free.mp3
|-Documents
||-CV.doc
||-LetterToSomeone.doc
...and so on
And now I got
|-MP3
||-M0001.mp3
||-M0002.mp3
|-DOCUMENTS
||-D0001.doc
||-D0002.doc
So with a huge amount of data it would take me centuries to manually open each one, see what it is, and rename it.
Is there some batch script which can scan all my subfolders and recover the previous names? From metadata, perhaps?
Or do you know a better tool which will keep the same names and paths of the files? (It doesn't matter if I must pay; there's always a solution for that :P)
Thank you
My contribution, for your music at least...
The idea is to go through all of the MP3 files found and distribute them based on their ID3 tags.
I'd do something like:
find /MP3 -type f -iname "*.mp3" -print0 | while IFS= read -r -d '' i
do
    ARTIST=$(id3v2 -l "$i" | grep TPE1 | cut -d":" -f2 | sed -e 's/^[[:space:]]*//')     # This gets you the Artist
    ALBUM=$(id3v2 -l "$i" | grep TALB | cut -d":" -f2 | sed -e 's/^[[:space:]]*//')      # This gets you the Album title
    TRACK_NUM=$(id3v2 -l "$i" | grep TRCK | cut -d":" -f2 | sed -e 's/^[[:space:]]*//')  # The track ID/position, like "2/13"
    TRACK_NUM=${TRACK_NUM%%/*}                                                           # keep just the number; a "/" would create a subdirectory
    TR_TITLE=$(id3v2 -l "$i" | grep TIT2 | cut -d":" -f2 | sed -e 's/^[[:space:]]*//')   # Track title
    mkdir -p "/MUSIC/$ARTIST/$ALBUM/"
    cp "$i" "/MUSIC/$ARTIST/$ALBUM/$TRACK_NUM.$TR_TITLE.mp3"
done
Basically:
* It looks for all ".mp3" files in /MP3 (the find/read pairing is NUL-delimited, so filenames containing spaces survive)
* then analyses each file's ID3 tags, and parses them to fill 4 variables, using the "id3v2" tool (you'll need to install it first). The tags are cleaned to keep only the value; sed trims the leading spaces that would otherwise pollute the names.
* then creates (if needed) a tree in /MUSIC/ with the artist name and album name
* then copies each input file to the new tree, renaming it according to the tags.
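For context, the grep/cut/sed pipeline above slices up id3v2 -l output, which prints one line per ID3 frame in the form FRAME (description): value. A hypothetical listing (tag values invented for illustration) might look like:
$ id3v2 -l Sonadora.mp3
TPE1 (Lead performer(s)/Soloist(s)): Enya
TALB (Album/Movie/Show title): Some Album
TRCK (Track number/Position in set): 2/13
TIT2 (Title/songname/content description): Sonadora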
