Looping over files and editing them based on array information - bash

I have a directory with a lot of files, which can be grouped based on their names. For example here I have 4 groups with 5 files in each:
ls ./
# group 1
NpXynWT_apo_300K_0.pdb
NpXynWT_apo_300K_1.pdb
NpXynWT_apo_300K_2.pdb
NpXynWT_apo_300K_3.pdb
NpXynWT_apo_300K_4.pdb
# group 2
NpXynWT_apo_340K_0.pdb
NpXynWT_apo_340K_1.pdb
NpXynWT_apo_340K_2.pdb
NpXynWT_apo_340K_3.pdb
NpXynWT_apo_340K_4.pdb
# group 3
NpXynWT_com_300K_0.pdb
NpXynWT_com_300K_1.pdb
NpXynWT_com_300K_2.pdb
NpXynWT_com_300K_3.pdb
NpXynWT_com_300K_4.pdb
# group 4
NpXynWT_com_340K_0.pdb
NpXynWT_com_340K_1.pdb
NpXynWT_com_340K_2.pdb
NpXynWT_com_340K_3.pdb
NpXynWT_com_340K_4.pdb
So here, each of the 5 files in a group differs only by the end suffix, from 0 to 4:
NpXynWT_apo_300K_0 ... NpXynWT_apo_300K_4
NpXynWT_apo_340K_0 ... NpXynWT_apo_340K_4
etc
I need to loop over all 20 of these files and
pre-process each file: add "MODEL " plus the file's number (so a number between 0 and 4) before the first line, and append "ENDMDL" after the last line;
cat the pre-processed files of the same group together.
In summary, the script should create 4 new "combined" files, each consisting of 5 subfiles from the initial list.
For the implementation I created an array of the groups and looped over it, providing an index from 0 to 4, with two loops: 1) pre-process each file; 2) cat the pre-processed files together:
# list of 4 groups
systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
# pre-process files
for model in "${systems[#]}"; do
i="0"
while [ $i -lt 5 ]; do
# EDIT EXISTING FILES
sed -i "1 i\MODEL $i" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
i=$((i+1))
done
done
# cat pre-processed filles
for model in "${systems[@]}"; do
cat "${pdbs}"/"${model}"_[0-4]_FA.pdb > "${output}/${model}.pdb"
done
1 - Would it be possible to merge the two loops? E.g. would the following be equivalent?
# pre-process the PDBs and cat them together
for model in "${systems[@]}"; do
##echo "$model"
i="0"
while [ $i -lt 5 ]; do
k=$((i+1))
## do something with pdb
sed -i "1 i\MODEL $k" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
#gedit "${pdbs}"/"${model}"_"$i"_FA.pdb
i=$((i+1))
done
# now we cat together the post-processed files
cat "${pdbs}"/"${model}"_[0-4]_FA.pdb > "${output}/${model}.pdb"
done
2 - Would it be possible to simplify the two file-editing operations in the first loop?
sed -i "1 i\MODEL $i" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb

How do I match info from the array "groups" to the files present in the folder?
Use find. It is there to find files.
groups=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
for group in "${groups[@]}"; do
find . -name "${group}_*.pdb" -type f
done
You can be even more exact by using -regex and similar find options.
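To then act on each match, a sketch (assuming GNU find and sort are available; the printf is just a placeholder for the sed/echo pre-processing above):
for group in "${groups[@]}"; do
find . -maxdepth 1 -type f -name "${group}_*.pdb" -print0 |
sort -z |
while IFS= read -r -d '' f; do
printf 'processing %s\n' "$f" # placeholder for the per-file edits
done
done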

Related

Merge multiple text files on one column in bash

How do you merge multiple plain text files (>2) on the first column? For example, I have three files like the below:
cat file1.txt
a 1
b 2
c 3
cat file2.txt
a 2
b 3
c 4
cat file3.txt
a 3
b 4
c 5
I am trying to merge these files into one file, joined on the first column, like this:
cat ideal.txt
a 1 2 3
b 2 3 4
c 3 4 5
How about join?
Join lines of two sorted files on a common field.
More information: https://www.gnu.org/software/coreutils/join
join file1.txt file2.txt > join1.txt
join join1.txt file3.txt > ideal.txt
cat ideal.txt
Here's a script (I named the file "jj") that you can use to work with any number of files. To run it, type: ./jj file1.txt file2.txt file3.txt
#!/usr/bin/env bash
# define temporary location, WIP/CACHE
tmp="/tmp/outjointmp"
# define target location
out="/tmp/outjoin"
# truncate both files, just in case there is any residue from anything
: > "$out"
: > "$tmp"
# first, copy the contents of the first file into the target file
cat "$1" > "$out"
# loop through all remaining arguments
while [[ $# -gt 1 ]]; do
join "$out" "$2" > "$tmp"
shift
# copy over the temp into destination file
cat "$tmp" > "$out"
done
cat "$out"
result&output:
$ ./jj file1.txt file2.txt file3.txt
a 1 2 3
b 2 3 4
c 3 4 5
A recursive function using process substitution should do the trick in order to join more than two files:
#!/bin/bash
join_rec() {
if (($# <= 2)); then
join "$#"
else
join "$1" <(join_rec "${#:2}")
fi
}
join_rec file*.txt > joined_file
assuming input files are sorted.
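If the inputs might not be sorted yet, a hedged variant sorts them on the fly with process substitution:
join <(sort file1.txt) <(sort file2.txt) > join1.txt
join <(sort join1.txt) file3.txt > ideal.txt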

shell script to create multiple files, incrementing from last file upon next execution

I'm trying to create a shell script that will create a specified number of files (a batch of files). When that count is reached, the script stops. When the script is re-executed, the numbering picks up from the last file created. So if the script creates files 1-10 on the first run, the next execution should create 11-20, and so on.
#!/bin/bash
NAME=XXXX
valid=true
NUMBER=1
while [ $NUMBER -le 5 ];
do
touch $NAME$NUMBER
((NUMBER++))
echo $NUMBER + "batch created"
if [ $NUMBER == 5 ];
then
break
fi
touch $NAME$NUMBER
((NUMBER+5))
echo "batch complete"
done
Based on my comment above and your description, you can write a script that will create 10 numbered files (by default) each time it is run, starting with the next available number. As mentioned, rather than just use a raw-unpadded number, it's better for general sorting and listing to use zero-padded numbers, e.g. 001, 002, ...
If you just use 1, 2, ... then you end up with odd sorting when you reach each power of 10. Consider the first 12 files numbered 1...12 without padding. A general listing sort would produce:
file1
file11
file12
file2
file3
file4
...
Where 11 and 12 are sorted before 2. Adding leading zeros with printf -v avoids the problem.
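For instance, the padding itself is a one-liner:
printf -v num "%03d" 7 ## num is now "007"
printf -v num "%03d" 12 ## num is now "012"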
Taking that into account, and allowing the user to change the prefix (first part of the file name) by giving it as an argument, and also change the number of new files to create by passing the count as the 2nd argument, you could do something like:
#!/bin/bash
prefix="${1:-file_}" ## beginning of filename
number=1 ## start number to look for
ext="txt" ## file extension to add
newcount="${2:-10}" ## count of new files to create
printf -v num "%03d" "$number" ## create 3-digit start number
fname="$prefix$num.$ext" ## form first filename
while [ -e "$fname" ]; do ## while filename exists
number=$((number + 1)) ## increment number
printf -v num "%03d" "$number" ## form 3-digit number
fname="$prefix$num.$ext" ## form filename
done
while ((newcount--)); do ## loop newcount times
touch "$fname" ## create filename
((! newcount)) && break; ## newcount 0, break (optional)
number=$((number + 1)) ## increment number
printf -v num "%03d" "$number" ## form 3-digit number
fname="$prefix$num.$ext" ## form filename
done
Running the script without arguments will create the first 10 files, file_001.txt - file_010.txt. Run a second time, it would create 10 more files file_011.txt to file_020.txt.
To create a new group of 5 files with the prefix of list_, you would do:
bash scriptname list_ 5
Which would result in the 5 files list_001.txt to list_005.txt. Running again with the same options would create list_006.txt to list_010.txt.
Since the scheme above with 3 digits is limited to 1000 files max (if you include 000), there isn't a big need to get the number from the last file written (bash can count to 1000 quite fast). However, if you used 7 digits, for 10 million files, then you would want to parse the last number with ls -1 | tail -n 1 (or version sort and choose the last file). Something like the following would do:
number=$(ls -1 "$prefix"* | tail -n 1 | grep -o '[1-9][0-9]*')
(note: that is ls -(one) not ls -(ell))
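A hedged sketch of the version-sort variant (assuming GNU sort -V; the 10# prefix forces base-10 so leading zeros aren't read as octal):
last=$(printf '%s\n' "$prefix"*."$ext" | sort -V | tail -n 1) ## highest-numbered file
number=$((10#$(grep -Eo '[0-9]+' <<< "$last" | tail -n 1))) ## trailing number, base-10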
Let me know if that is what you are looking for.

Bash: iterate from last to first member of an array [duplicate]

This question already has answers here:
Is there a way to get the git root directory in one command?
(22 answers)
Closed 2 years ago.
I'm attempting to find the "root" of a folder. I'm doing this in a Bash script with the following (at least in my head):
# Get current directory (e.g. /foo/bar/my/subdir)
CURR_DIR=$(pwd)
# Break down into array of folder names
DIR_ARRAY=(${CURR_DIR//\// })
# Iterate over items in DIR_ARRAY starting with "subdir"
<HELP WITH FOR LOOP SYNTAX>
# Each loop:
# build path to current item in DIR_ITER; e.g.
# iter N: DIR_ITER=/foo/bar/my/subdir
# iter N-1: DIR_ITER=/foo/bar/my
# iter N-2: DIR_ITER=/foo/bar
# iter 0: DIR_ITER=/foo
# In each loop:
# get the contents of directory using "ls -a"
# look for .git
# set ROOT=DIR_ITER
export ROOT
I've Googled for looping in Bash but it all uses the "for i in ARRAY" form, which doesn't guarantee reverse iteration order. What's the recommended way to achieve what I want to do?
One idea on reverse index referencing.
First our data:
$ CURR_DIR=/a/b/c/d/e/f
$ DIR_ARRAY=( ${CURR_DIR//\// } )
$ typeset -p DIR_ARRAY
declare -a DIR_ARRAY=([0]="a" [1]="b" [2]="c" [3]="d" [4]="e" [5]="f")
Our list of indices:
$ echo "${!DIR_ARRAY[#]}"
0 1 2 3 4 5
Our list of indices in reverse:
$ echo "${!DIR_ARRAY[#]}" | rev
5 4 3 2 1 0
Looping through our reverse list of indices (note: rev reverses the character stream, so this trick only works while every index is a single digit):
$ for i in $(echo "${!DIR_ARRAY[@]}" | rev)
do
echo $i
done
5
4
3
2
1
0
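If the array could ever grow past ten elements, a C-style countdown loop avoids the rev trick entirely:
for (( i=${#DIR_ARRAY[@]} - 1; i >= 0; i-- )); do
echo "${DIR_ARRAY[i]}"
done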
As for working your way up the directory structure using this 'reverse' index strategy:
$ LOOP_DIR="${CURR_DIR}"
$ for i in $(echo "${!DIR_ARRAY[@]}" | rev)
do
echo "${DIR_ARRAY[${i}]}:${LOOP_DIR}"
LOOP_DIR="${LOOP_DIR%/*}"
done
f:/a/b/c/d/e/f
e:/a/b/c/d/e
d:/a/b/c/d
c:/a/b/c
b:/a/b
a:/a
Though we could accomplish the same thing a) without the array and b) using some basic parameter expansions, eg:
$ LOOP_DIR="${CURR_DIR}"
$ while [ "${LOOP_DIR}" != '' ]
do
subdir="${LOOP_DIR##*/}"
echo "${subdir}:${LOOP_DIR}"
LOOP_DIR="${LOOP_DIR%/*}"
done
f:/a/b/c/d/e/f
e:/a/b/c/d/e
d:/a/b/c/d
c:/a/b/c
b:/a/b
a:/a
You can use dirname in a loop to find the parent folder, then move up until you find, e.g., the .git folder.
Quick example:
#!/usr/bin/env bash
set -eu
for arg in "$#"
do
current=$arg
while true
do
if [ -d "$current/.git" ]
then
echo "$arg: .git in $current"
break
fi
parent="$(dirname "$current")"
if [ "$parent" == "$current" ]
then
echo "No .git in $arg"
break
fi
current=$parent
done
done
For each parameter you pass to this script, it will print where it found the .git folder up the directory tree, or print an error if it didn't find it.
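A hypothetical invocation (script name, path, and output are illustrative only):
$ ./find_git_root.sh ~/code/myproject/src/utils
/home/user/code/myproject/src/utils: .git in /home/user/code/myproject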

Loop Script from Input File

I have a reference file with device names in it, for example WABEL8499IPM101. I'm using this script to take the base name (the device name without the last 3 digits), look it up in the reference file, and see which numbers are already used. If 101 is used, it will create a file for me with 102, and 103 if I request 2 total. I'm looking to use an input file to run it multiple times. I'm also trying to figure out how to start at 101 if no name is found when searching the reference file.
I would like to loop this using an input file instead of manually entering bash test.sh WABEL8499IPM 2 each time. I would like to be able to build an input file of all the names that need to be compared and then output the results. It would also be nice if, when there isn't a match, it started creating names at WABEL8499IPM101 instead of just WABEL8499IPM1.
Input file example:
ColumnA (BASE NAME) ColumnB (QUANTITY)
WABEL8499IPM 2
Script:
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
# base name, such as "WABEL8499IPM"
device_name=$1
# quantity, such as "2"
quantityNum=$2
# the largest in sequence, such as "WABEL8499IPM108"
max_sequence_name=$(cat $SRCFILE | grep -o -e "$device_name[0-9]*" | sort --reverse | head -n 1)
# extract the last 3digit number (such as "108") from max_sequence_name
max_sequence_num=$(echo $max_sequence_name | rev | cut -c 1-3 | rev)
# create new sequence_name
# such as ["WABEL8499IPM109", "WABEL8499IPM110"]
array_new_sequence_name=()
for i in $(seq 1 $quantityNum);
do
cnum=$((max_sequence_num + i))
array_new_sequence_name+=($(echo $device_name$cnum))
done
#CODE FOR CREATING OUTPUT FILE HERE
#for fn in ${array_new_sequence_name[@]}; do touch $fn; done;
# write log
for sqn in ${array_new_sequence_name[@]};
do
echo $sqn >> $LOGFILE
done
Usage:
bash test.sh WABEL8499IPM 2
Result in the log file:
WABEL8499IPM109
WABEL8499IPM110
Just wrap a loop around your code instead of assuming the args come in on the command line.
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
while read device_name quantityNum
do max_sequence_name=$( grep -o -e "$device_name[0-9]*" $SRCFILE |
sort --reverse | head -n 1)
max_sequence_num=${max_sequence_name: -3}
array_new_sequence_name=()
for i in $(seq 1 $quantityNum)
do cnum=$((max_sequence_num + i))
array_new_sequence_name+=("$device_name$cnum")
done
for sqn in ${array_new_sequence_name[@]};
do echo $sqn >> $LOGFILE
done
done < input.file
I'd maybe pass the input file as the parameter now.
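As for the other part of the question (starting at 101 when the base name isn't found in the reference file), a minimal hedged tweak inside the read loop: default the extracted number to 100 when the grep comes back empty, so the first generated name ends in 101:
max_sequence_num=${max_sequence_name: -3}
max_sequence_num=${max_sequence_num:-100} # no match: default so numbering starts at 101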

Shell script: segregate multiple files

I have this in my local directory ~/Report:
Rep_{ReportType}_{Date}_{Seq}.csv
Rep_0001_20150102_0.csv
Rep_0001_20150102_1.csv
Rep_0102_20150102_0.csv
Rep_0503_20150102_0.csv
Rep_0503_20150102_0.csv
Using shell-script,
How do I get multiple files from a local directory with a fixed batch size?
How do I segregate/group the files together by report type (0001 files are grouped together, 0102 grouped together, 0503 grouped together, etc.)
I will generate a sequence file (using forqlift) for EACH group/report type. The output would be Report0001.seq, Report0102.seq, Report0503.seq (3 sequence files). In which I will save to a different directory.
Note: In sequence files, the key is the filename of csv (Rep_0001_20150102.csv), and the value is the content of the file. It is stored as [String, BytesWritable].
This is my code:
1 reportTypes=(0001 0102 8902)
2
3 # collect all files matching expression into an array
4 filesWithDir=(~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-1].csv)
5
6 # take only the first hundred
7 filesWithDir=( "${filesWithDir[@]:0:100}" )
8
9 # files="${filesWithDir[@]##*/}" #### commented out since forqlift cannot create sequence file without the path/to/file
10 # echo ${files[@]}
11
12 shopt -s nullglob
13
14 # Line 21 is commented out since it has a bug. It collects files in
15 # current directory when it should be filtering the "files array" created
16 # in line 7
17
18
19 for i in ${reportTypes[@]}; do
20 printf -v val '%04d' "$i"
21 # files=("Rep_${val}_"*.csv)
# solution to BUG: (filter files array)
groupFiles=( $( for j in ${filesWithDir[@]} ; do echo $j ; done | grep ${val} ) )
22
23 # Generate sequence file for EACH Report Type
24 forqlift create --file="Report${val}.seq" "${groupFiles[@]}"
25 done
(Note: The sequence file output should be in current directory, not in ~/Report)
It's easy to take only a subset of an array:
# collect all files matching expression into an array
files=( ~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].csv )
# take only the first hundred
files=( "${files[#]:0:100}" )
The second part is trickier: Bash has associative arrays ("maps"), but the only legal values which can be stored in arrays are strings -- not other arrays -- so you can't store a list of filenames as a value associated with a single entry (without serializing the array to and from a string -- a moderately tricky thing to do safely, since file paths in UNIX can contain any character other than NUL, newlines included).
It's better, then, to just generate the array as you need it.
shopt -s nullglob # allow a glob to expand to zero arguments
for ((i=1; i<=1000; i++)); do
printf -v val '%04d' "$i" # pad digits: 12 -> 0012
files=( "Rep_${val}_"*.csv ) # collect files that match
## emit NUL-separated list of files, if any were found
#(( ${#files[@]} )) && printf '%s\0' "${files[@]}" >"Reports.$val.txt"
# Create a sequence file with forqlift
forqlift create --file="Reports-${val}.seq" "${files[#]}"
done
If you really don't want to do that, then we can put something together that uses namevars for redirection:
#!/bin/bash
# This only works with bash 4.3
re='^Rep_([[:digit:]]{4})_[[:digit:]]{8}(_[[:digit:]])?\.csv$' # match Rep_<type>_<date>[_<seq>].csv
counter=0
for f in *; do
[[ $f =~ $re ]] || continue # skip files not matching regex
if ((++counter > 100)); then break; fi # stop after 100 files
group=${BASH_REMATCH[1]} # retrieve first regex group
declare -g -a "array${group}" # declare an array
declare -n group_arr="array${group}" # redirect group_arr to that array
group_arr+=( "$f" ) # append to the array
done
for varname in "${!array@}"; do
declare -n group_arr="$varname"
## NUL-delimited form
#printf '%s\0' "${group_arr[@]}" \
# >"collection${varname#array}" # write to files named collection0001, etc.
# forqlift sequence file form
forqlift create --file="Reports-${varname#array}.seq" "${group_arr[#]}"
done
I would move away from shell scripts and start to look towards perl.
#!/usr/bin/env perl
use strict;
use warnings;
my %groups;
while ( my $filename = glob ( "~/Reports/Rep_*.csv" ) ) {
my ( $group, $id ) = ( $filename =~ m,/Rep_(\d{4})_(\d{8})(?:_\d)?\.csv$, );
next unless $group; #undefined means it didn't match;
#anything past 100 in a group is discarded:
if ( !exists $groups{$group} or @{$groups{$group}} < 100 ) {
push ( @{$groups{$group}}, $filename );
}
}
foreach my $group ( keys %groups ) {
print "$group contains:\n";
print join ("\n", @{$groups{$group}});
}
Another alternative is to cobble some bash commands together with a regexp.
See the implementation below:
# Explanation:
# ls -p = list all files and directories in the local directory by path
# grep -v / = ignore subdirectories
# grep -E "^Rep_[0-9]{4}_[0-9]{8}(_[0-9])?\.csv$" = look for files matching the regexp
# tail -100 = get the last 100 results
for file in $(ls -p | grep -v / | grep -E "^Rep_[0-9]{4}_[0-9]{8}(_[0-9])?\.csv$" | tail -100);
do echo $file;
# Use the regexp to extract the desired sequence
re="^Rep_([[:digit:]]{4})_([[:digit:]]{8})(_[[:digit:]])?\.csv$";
if [[ $file =~ $re ]]; then
sequence=${BASH_REMATCH[1]};
# Didn't end up using date, but in case you want it
# date=${BASH_REMATCH[2]};
# Just in case the sequence file doesn't exist
if [ ! -f "$sequence" ] ; then
touch "$sequence"
fi
# Output/Concat your filename to the sequence file, which you can
# read in later to do whatever administrative tasks you wish to do
# to them
echo "$file" >> "$sequence"
fi
done;
