Find pairs of files in one directory with a specific pattern - bash

I need to find pairs of files with a specific pattern in one directory:
HU_IP_number_something.bam & HU_inp_number_something.bam
NOC_IP_number_something.bam & NOC_inp_number_something.bam
Numbers are 1...N for each pair
I have a solution, but it works only for one set of files (HU_* or NOC_*) in the directory.
How can I improve it to find pairs when both HU_* and NOC_* are in one directory?
for ip in *IP*.bam
do
    num=$(echo "$ip" | sed 's/[^0-9]//g')
    input=$(find . -name "*_inp_${num}*.bam")
    echo "ip sample: $ip"
    echo "input sample: $input"
done
Examples of files in one directory:
HU_inp_1-sorted.bam
HU_IP_1-sorted.bam
NOC_inp_1-sorted.bam
NOC_IP_1-sorted.bam
for 1,2,3,...N

The following builds an array, a, on each iteration of a for loop.
$ for f in *IP*.bam; do s=${f#*_}; a=( *${s} ); declare -p a; done
declare -a a=([0]="HU_IP_number_something.bam" [1]="NOC_IP_number_something.bam")
declare -a a=([0]="HU_IP_number_something.bam" [1]="NOC_IP_number_something.bam")
This works by stepping through all the files matching your filespec, stripping off the first "field" (as delimited by the underscore separator), and using globbing to collect the relevant files into the array.
You can test the length of the array (${#a[@]}) to make sure you have exactly two entries.
If you want to group by the second field instead of the first, you need a little more processing:
$ for f in *IP*.bam; do s1=${f%%_*}; s2=${f#*_}; s2=${s2#*_}; a=( ${s1}*${s2} ); declare -p a; done
declare -a a=([0]="HU_IP_number_something.bam" [1]="HU_inp_number_something.bam")
declare -a a=([0]="NOC_IP_number_something.bam" [1]="NOC_inp_number_something.bam")
The technique here, using ${var#pattern} and ${var%pattern}, is called Parameter Expansion; you can find more details about it in the bash man page.
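For example, applied to one of the filenames above (a quick illustration at the prompt):

$ f=HU_IP_1-sorted.bam
$ echo "${f%%_*}"      # longest suffix match of _* removed: HU
$ echo "${f#*_}"       # shortest prefix match of *_ removed: IP_1-sorted.bam
$ a=( *"${f#*_}" ); echo "${#a[@]}"    # 2 when both the HU_ and NOC_ file exist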

Do you only want to match HU to HU and NOC to NOC? If so:
If you add a line
pre=$(echo "$ip" | awk -F "_" '{print $1}')
then change your input to
input=$(find . -name "${pre}_inp_${num}*.bam")
(The braces around ${pre} are required; $pre_inp_ would otherwise be read as a single variable name.)
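Combining the two ideas, the whole loop might look like this (an untested sketch; it assumes the number is followed by -, as in the example names, so that num=1 does not also match 12):

for ip in *IP*.bam
do
    num=$(echo "$ip" | sed 's/[^0-9]//g')
    pre=${ip%%_*}                                     # first field: HU or NOC
    input=$(find . -name "${pre}_inp_${num}-*.bam")   # the - stops 1 from matching 12
    echo "ip sample: $ip"
    echo "input sample: $input"
done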

How do I add whitespaces to a string while filling it up in a for-loop in Bash?

Have a string as follows:
files="applications/dbt/Dockerfile applications/dbt/cloudbuild.yaml applications/dataform/Dockerfile applications/dataform/cloudbuild.yaml"
Want to extract the first two directories and save it as another string like this:
"applications/dbt applications/dbt applications/dataform pplications/dataform"
But while filling up the second string, its being saved as
applications/dbtapplications/dbtapplications/dataformapplications/dataform
What i tried:
files="applications/dbt/Dockerfile applications/dbt/cloudbuild.yaml applications/dataform/Dockerfile applications/dataform/cloudbuild.yaml"
arr=($files)
#extracting the first two directories and saving it to a new string
for i in ${arr[@]}; do files2+=$(echo "$i" | cut -d/ -f 1-2); done
echo $files2
files2 echoes the following
applications/dbtapplications/dbtapplications/dataformapplications/dataform
Reusing your code as much as possible:
(assuming you only want to remove the last path component):
arr=( applications/dbt/Dockerfile applications/dbt/cloudbuild.yaml applications/dataform/Dockerfile applications/dataform/cloudbuild.yaml )
#extracting the first two directories and saving it to a new string
for file in "${arr[#]}"; do
files2+="${file%/*} "
done
echo "$files2"
applications/dbt applications/dbt applications/dataform applications/dataform
You could use a for loop as requested
for dir in ${files}; do
    file2+=$(printf '%s ' "${dir%/*}")
done
which will give output
$ echo "$file2"
applications/dbt applications/dbt applications/dataform applications/dataform
However, it would be much easier with sed
$ sed -E 's~([^/]*/[^/]*)[^ ]*~\1~g' <<< $files
applications/dbt applications/dbt applications/dataform applications/dataform
Convert the string into an array first, assuming there is no whitespace/blank/newline embedded in your strings/path names. Something like:
#!/usr/bin/env bash
files="applications/dbt/Dockerfile applications/dbt/cloudbuild.yaml applications/dataform/Dockerfile applications/dataform/cloudbuild.yaml"
mapfile -t array <<< "${files// /$'\n'}"
Now check the value of the array
declare -p array
Output
declare -a array=([0]="applications/dbt/Dockerfile" [1]="applications/dbt/cloudbuild.yaml" [2]="applications/dataform/Dockerfile" [3]="applications/dataform/cloudbuild.yaml")
Remove everything from the last / onward in each path name in the array.
new_array=("${array[@]%/*}")
Check the new value
declare -p new_array
Output
declare -a new_array=([0]="applications/dbt" [1]="applications/dbt" [2]="applications/dataform" [3]="applications/dataform")
Now the value is an array; assign it to a variable or do whatever you like with it. As was mentioned in the comment section, you could also use an array from the start.
Assign the first 2 directories/paths to a variable (a weird requirement):
new_var="${new_array[@]::2}"
declare -p new_var
Output
declare -- new_var="applications/dbt applications/dbt"
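For completeness, the same transformation fits in three lines using the techniques above (same assumption: no whitespace embedded in the paths):

read -r -a arr <<< "$files"   # split on spaces into an array
files2="${arr[*]%/*}"         # strip the last /component from each element, join with spaces
echo "$files2"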

BASH: File sorting according to file name

I need to sort 12000 files into 1000 groups, according to their names, and create for each group a new folder containing the files of that group. The name of each file is given in multi-column format (with _ separator), where the second column runs from 1 to 12 (number of the part) and the last column ranges from 1 to 1000 (number of the system), indicating that 1000 different systems (last column) were each split into 12 separate parts (second column).
Here is an example for a small subset based on 3 systems divided into 12 parts, 36 files in total.
7000_01_lig_cne_1.dlg
7000_02_lig_cne_1.dlg
7000_03_lig_cne_1.dlg
...
7000_12_lig_cne_1.dlg
7000_01_lig_cne_2.dlg
7000_02_lig_cne_2.dlg
7000_03_lig_cne_2.dlg
...
7000_12_lig_cne_2.dlg
7000_01_lig_cne_3.dlg
7000_02_lig_cne_3.dlg
7000_03_lig_cne_3.dlg
...
7000_12_lig_cne_3.dlg
I need to group these files by collecting the 12 parts (second column, 01..12) of each system (last column) together, thus creating 1000 folders, each containing the 12 files of one system, in the following manner:
Folder1, name: 7000_lig_cne_1, contains 12 files: 7000_{01 to 12}_lig_cne_1.dlg
Folder2, name: 7000_lig_cne_2, contains 12 files: 7000_{01 to 12}_lig_cne_2.dlg
...
Folder1000, name: 7000_lig_cne_1000, contains 12 files: 7000_{01 to 12}_lig_cne_1000.dlg
Assuming that all *.dlg files are present within the same dir, I propose a bash loop workflow, which only lacks some sorting function (sed, awk?), organized in the following manner:
#set the name of folder with all DLG
home=$PWD
FILES=${home}/all_DLG/7000_CNE
# set the name of protein and ligand library to analyse
experiment="7000_CNE"
#name of the output
output=${home}/sub_folders_to_analyse
#now here all magic comes
rm -r ${output}
mkdir ${output}
# sed solution
for i in ${FILES}/*.dlg # define this better to suit your needs
do
    n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
    # move the file to proper dir
    mkdir -p ${output}/"${experiment}_lig$n"
    cp "$i" ${output}/"${experiment}_lig$n"
done
Note: here I build the beginning of each folder name from ${experiment} and append the number from the final column, $n. Would it be possible instead to derive the name of each new folder automatically from the names of the copied files? Manually it could be achieved by skipping the second column in the name of the folder:
cp ./all_DLG/7000_*_lig_cne_987.dlg ./output/7000_lig_cne_987
Iterate over files. Extract the destination directory name from the filename. Move the file.
for i in *.dlg; do
    # extract last number with your favorite tool
    n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
    # move the file to proper dir
    echo mkdir -p "folder$n"
    echo mv "$i" "folder$n"
done
Notes:
Do not use upper case variables in your scripts. Use lower case variables.
Remember to quote variables expansions.
Check your scripts with http://shellcheck.net
Tested on repl
update: for OP's foldernaming convention:
for i in *.dlg; do
    base=${i%.dlg}                                     # 7000_01_lig_cne_987
    foldername="$HOME/output/${base%%_*}_${base#*_*_}" # -> .../7000_lig_cne_987
    echo mkdir -p "$foldername"
    echo mv "$i" "$foldername"
done
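To see what those expansions do on one sample name (illustration only):

$ i=7000_01_lig_cne_987.dlg
$ base=${i%.dlg}
$ echo "${base%%_*}_${base#*_*_}"
7000_lig_cne_987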
This might work for you (GNU parallel):
ls *.dlg |
parallel --dry-run 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d'
Pipe the output of the ls command, listing files ending in .dlg, to parallel, which creates the directories and moves the files into them.
Run the solution as is, and when satisfied that the output of the dry run is OK, remove the --dry-run option.
The solution could be one instruction:
parallel 'd={=s/^(7000_).*(lig.*)\.dlg/$1$2/=};mkdir -p $d;mv {} $d' ::: *.dlg
Using POSIX shell's built-in grammar only and sort:
#!/usr/bin/env sh
curdir=
# Create list of files with newline
# Safe since we know there is no special
# characters in name
printf -- %s\\n *.dlg |
# Sort the list by 5th key with _ as field delimiter
sort -t_ -k5 |
# Iterate reading the _ delimited fields of the sorted list
while IFS=_ read -r _ _ c d e; do
    # Compose the new directory name
    newdir="${c}_${d}_${e%.dlg}"
    # If we enter a new group / directory
    if [ "$curdir" != "$newdir" ]; then
        # Make the new directory current
        curdir="$newdir"
        # Create the new directory
        echo mkdir -p "$curdir"
        # Move all its files into it
        echo mv -- *_"$curdir.dlg" "$curdir/"
    fi
done
Optionally, as a sort and xargs argument stream:
printf -- %s\\n * |
sort -u -t_ -k5 |
xargs -n1 sh -c '
    d="lig_cne_${0##*_}"
    d="${d%.dlg}"
    echo mkdir -p "$d"
    echo mv -- *"_$d.dlg" "$d/"
'
Here is a very simple awk script that does the trick in a single sweep.
script.awk
BEGIN{FS="[_.]"} # make field separator "_" or "."
{ # for each filename
    dirName = $1"_"$3"_"$4"_"$5;                         # compute the target dir name from fields
    sysCmd = "mkdir -p " dirName "; cp " $0 " " dirName; # prepare bash command
    system(sysCmd);                                      # run bash command
}
running script.awk
ls -1 *.dlg | awk -f script.awk
oneliner awk script
ls -1 *.dlg | awk 'BEGIN{FS="[_.]"}{d=$1"_"$3"_"$4"_"$5;system("mkdir -p "d"; cp "$0 " "d);}'
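If you would rather not parse the output of ls, a glob fed through printf drives the same script (the field splitting still assumes no stray _ or . in the names):

printf '%s\n' *.dlg | awk -f script.awk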

find only the first file from many directories

I have a lot of directories:
13R
613
AB1
ACT
AMB
ANI
Each directory contains a lot of files:
20140828.13R.file.csv.gz
20140829.13R.file.csv.gz
20140830.13R.file.csv.gz
20140831.13R.file.csv.gz
20140901.13R.file.csv.gz
20131114.613.file.csv.gz
20131115.613.file.csv.gz
20131116.613.file.csv.gz
20131117.613.file.csv.gz
20141114.ab1.file.csv.gz
20141115.ab1.file.csv.gz
20141116.ab1.file.csv.gz
20141117.ab1.file.csv.gz
etc..
The purpose is to get the first file from each directory.
The result I expect is:
13R|20140828
613|20131114
AB1|20141114
That is, the name of the directory, a pipe, then the date from the filename.
I guess I need find and head commands plus awk, but I can't make it work; I need your help.
Here is what I have tested:
for f in $(ls -1);do ls -1 $f/ | head -1;done
But the folder name is missing.
By "the first file" I mean the first file returned in alphabetical order within the folder.
Thanks.
You can do this with a Bash loop.
Given:
/tmp/test
/tmp/test/dir_1
/tmp/test/dir_1/file_1
/tmp/test/dir_1/file_2
/tmp/test/dir_1/file_3
/tmp/test/dir_2
/tmp/test/dir_2/file_1
/tmp/test/dir_2/file_2
/tmp/test/dir_2/file_3
/tmp/test/dir_3
/tmp/test/dir_3/file_1
/tmp/test/dir_3/file_2
/tmp/test/dir_3/file_3
/tmp/test/file_1
/tmp/test/file_2
/tmp/test/file_3
Just loop through the directories and form an array from a glob and grab the first one:
prefix="/tmp/test"
cd "$prefix"
for fn in dir_*; do
    cd "$prefix"/"$fn"
    arr=(*)
    echo "$fn|${arr[0]}"
done
Prints:
dir_1|file_1
dir_2|file_1
dir_3|file_1
If your definition of 'first' is different from Bash's, just sort the array arr according to your definition before taking the first element.
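To produce the exact dir|date output requested, the same loop only needs two more expansions (a sketch run from the parent directory; assumes every directory is non-empty and the date is the part of the filename before the first dot):

for d in */; do
    arr=( "$d"* )              # the glob is already alphabetically sorted
    first=${arr[0]##*/}        # basename of the first file
    echo "${d%/}|${first%%.*}" # e.g. 13R|20140828
done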
You can also do this with find and awk:
$ find /tmp/test -mindepth 2 -print0 | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'
/tmp/test/dir_1/file_1
/tmp/test/dir_2/file_1
/tmp/test/dir_3/file_1
And insert a sort (or use gawk) to sort as desired
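For example, with GNU sort's -z option to keep the NUL delimiters intact (assumes GNU sort, and gawk for RS="\0"):

find /tmp/test -mindepth 2 -print0 | sort -z | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'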
sort has a unique option, -u. Only the directory part should be unique, so sort on the first field alone, -k1,1. This works because the glob expansion fed to printf is already sorted.
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
You will need to change the sed command if the date field may be followed by other digits.
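For instance, anchoring the date to exactly eight digits (assuming YYYYMMDD dates):

printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]\{8\}\).*#\1|\2#'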
This works for me:
for dir in $(find "$FOLDER" -type d); do
    FILE=$(ls -1 -p "$dir" | grep -v / | head -n1)
    if [ ! -z "$FILE" ]; then
        echo "$dir/$FILE"
    fi
done

Storing Result in Array and Substring in Bash

I have the list of committed files on svn stored in a variable, as follows:
REPOS="$1"
TXN="$2"
DETAILED_FILES=`svnlook changed -t $TXN $REPOS`
DETAILED_FILES looks like:
U data0.xml
A data1.xml
UU all_data.xml
How can I remove all the change-type prefixes (such as the U in U data0.xml)?
Also, is it possible to store these in an array?
And can I get the full path of these files by svnlook?
A more proper way would be:
repos=$1
txn=$2
files=()
while read -r _ f; do
    files+=( "$f" )
done < <(svnlook changed -t "$txn" "$repos")
Mind the quotes! (You used quotes where they are useless, though harmless, but omitted the mandatory ones.)
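A quick way to inspect the result:

declare -p files              # show the whole array
printf '%s\n' "${files[@]}"   # or print one filename per line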
Yes, just do:
FILES=( $(echo "$DETAILED_FILES" | cut -c 3-) )
Now FILES is an array and you can access the array elements by iterating over them:
for i in "${FILES[#]}"; do echo "$i"; done
Explicitly, ${FILES[0]} will get you the first element, ${FILES[1]} second and so on.
I am not familiar with svnlook, so I cannot answer your second question.

UNIX - Replacing variables in sql with matching values from .profile file

I am trying to write a shell script which will take an SQL file as input. Example SQL file:
SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'
Now the script should extract all variables, which in this case means everything starting with %%. So the output file will be something like this:
%%DB
%%TBLEXT
%%CITY
Now I should be able to extract the matching values from the user's .profile file for these variables and create the SQL file with the proper values.
SELECT *
FROM tempdb.TBL_abc
WHERE CITY = 'Chicago'
As of now I am trying to generate file1, which will contain all the variables. Below is a code sample:
sed "s/[(),']//g" "T:/work/shell/sqlfile1.sql" | awk '/%%/{print $NF}' | awk '/%%/{print $NF}' > sqltemp2.sql
gets me as far as
%%DB.TBL_%%TBLEXT
%%CITY
Can someone help me get to file1, listing the variables?
You can use grep and sort to get a list of unique variables, as per the following transcript:
$ echo "SELECT *
FROM %%DB.TBL_%%TBLEXT
WHERE CITY = '%%CITY'" | grep -o '%%[A-Za-z0-9_]*' | sort -u
%%CITY
%%DB
%%TBLEXT
The -o flag to grep instructs it to only print the matching parts of lines rather than the entire line, and also outputs each matching part on a distinct line. Then sort -u just makes sure there are no duplicates.
In terms of the full process, here's a slight modification to a bash script I've used for similar purposes:
# Define all translations.
declare -A xlat
xlat['%%DB']='tempdb'
xlat['%%TBLEXT']='abc'
xlat['%%CITY']='Chicago'
# Check all variables in input file.
okay=1
for key in $(grep -o '%%[A-Za-z0-9_]*' input.sql | sort -u) ; do
    if [[ "${xlat[$key]}" == "" ]] ; then
        echo "Bad key ($key) in file:"
        grep -n "${key}" input.sql | sed 's/^/ /'
        okay=0
    fi
done
if [[ ${okay} -eq 0 ]] ; then
    exit 1
fi
# Process input file doing substitutions. Fairly
# primitive use of sed, must change to use sed -i
# at some point.
# Note we sort keys based on descending length so we
# correctly handle extensions like "NAME" and "NAMESPACE",
# doing the longer ones first makes it work properly.
cp input.sql output.sql
for key in $( (
        for key in ${!xlat[@]} ; do
            echo ${key}
        done
    ) | awk '{print length($0)":"$0}' | sort -rnu | cut -d':' -f2) ; do
    sed "s/${key}/${xlat[$key]}/g" output.sql >output2.sql
    mv output2.sql output.sql
done
cat output.sql
It first checks that the input file doesn't contain any keys not found in the translation array. Then it applies sed substitutions to the input file, one per translation, to ensure all keys are substituted with their respective values.
This should be a good start, though there may be some edge cases such as if your keys or values contain characters sed would consider important (like / for example). If that is the case, you'll probably need to escape them such as changing:
xlat['%%UNDEFINED']='0/0'
into:
xlat['%%UNDEFINED']='0\/0'
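The translation table above is hardcoded. If the values really do live in the user's .profile as plain shell assignments (say DB=tempdb, TBLEXT=abc, CITY=Chicago; that format is an assumption), the table could be populated from it with indirect expansion. A sketch:

#!/usr/bin/env bash
. "$HOME/.profile"       # bring the assignments into this shell
declare -A xlat
for key in $(grep -o '%%[A-Za-z0-9_]*' input.sql | sort -u) ; do
    var=${key#%%}        # %%DB -> DB
    xlat[$key]=${!var}   # indirect expansion: the value of $DB
done
declare -p xlat          # verify the table before substituting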
