I have this in my local directory ~/Report:
Rep_{ReportType}_{Date}_{Seq}.csv
Rep_0001_20150102_0.csv
Rep_0001_20150102_1.csv
Rep_0102_20150102_0.csv
Rep_0503_20150102_0.csv
Rep_0503_20150102_1.csv
Using a shell script:
How do I get multiple files from a local directory with a fixed batch size?
How do I segregate/group the files together by report type (0001 files are grouped together, 0102 grouped together, 0503 grouped together, etc.)?
I will generate a sequence file (using forqlift) for EACH group/report type. The output would be Report0001.seq, Report0102.seq, Report0503.seq (3 sequence files), which I will save to a different directory.
Note: in the sequence files, the key is the filename of the csv (e.g. Rep_0001_20150102_0.csv), and the value is the content of the file. It is stored as [String, BytesWritable].
This is my code:
reportTypes=(0001 0102 8902)

shopt -s nullglob

# collect all files matching expression into an array
filesWithDir=(~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-1].csv)

# take only the first hundred
filesWithDir=( "${filesWithDir[@]:0:100}" )

# files=( "${filesWithDir[@]##*/}" ) #### commented out since forqlift cannot create a sequence file without the path/to/file
# echo "${files[@]}"

for i in "${reportTypes[@]}"; do
    printf -v val '%04d' "$((10#$i))"   # force base 10 so 0102 isn't parsed as octal
    # files=("Rep_${val}_"*.csv)
    # The glob above is commented out since it has a bug: it collects files in the
    # current directory when it should be filtering the filesWithDir array built above.
    # solution to BUG: filter the filesWithDir array; match _0001_ rather than a bare
    # 0001, which would also hit the date field
    groupFiles=( $( for j in "${filesWithDir[@]}" ; do echo "$j" ; done | grep "_${val}_" ) )

    # Generate sequence file for EACH Report Type
    forqlift create --file="Report${val}.seq" "${groupFiles[@]}"
done
(Note: the sequence file output should be in the current directory, not in ~/Report)
It's easy to take only a subset of an array:
# collect all files matching expression into an array
files=( ~/Report/Rep_[0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[0-9].csv )
# take only the first hundred
files=( "${files[#]:0:100}" )
The second part is trickier: Bash has associative arrays ("maps"), but the only legal values which can be stored in arrays are strings -- not other arrays -- so you can't store a list of filenames as a value associated with a single entry (without serializing the array to and from a string -- a moderately tricky thing to do safely, since file paths in UNIX can contain any character other than NUL, newlines included).
It's better, then, to just generate the array as you need it.
shopt -s nullglob                # allow a glob to expand to zero arguments
for ((i=1; i<=1000; i++)); do
    printf -v val '%04d' "$i"    # pad digits: 12 -> 0012
    files=( "Rep_${val}_"*.csv ) # collect files that match
    ## emit NUL-separated list of files, if any were found
    #(( ${#files[@]} )) && printf '%s\0' "${files[@]}" >"Reports.$val.txt"
    # Create a sequence file with forqlift
    forqlift create --file="Reports-${val}.seq" "${files[@]}"
done
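One practical note (an observation about this sketch, not part of the original answer): the glob "Rep_${val}_"*.csv is relative, so the loop has to run with ~/Report as the current directory, and the .seq files would then land there too. To honor the requirement that the sequence files end up elsewhere, either prefix the glob with ~/Report/ and run from the output directory, or pass an output path inside --file=.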
If you really don't want to do that, then we can put something together that uses namevars for redirection:
#!/bin/bash
# This requires bash 4.3 or newer (namerefs)
re='^Rep_([[:digit:]]{4})_[[:digit:]]{8}_[[:digit:]]+\.csv$'
counter=0
for f in *; do
[[ $f =~ $re ]] || continue # skip files not matching regex
if ((++counter > 100)); then break; fi # stop after 100 files
group=${BASH_REMATCH[1]} # retrieve first regex group
declare -g -a "array${group}" # declare an array
declare -n group_arr="array${group}" # redirect group_arr to that array
group_arr+=( "$f" ) # append to the array
done
for varname in "${!array#}"; do
declare -n group_arr="$varname"
## NUL-delimited form
#printf '%s\0' "${group_arr[#]}" \
# >"collection${varname#array}" # write to files named collection0001, etc.
# forqlift sequence file form
forqlift create --file="Reports-${varname#array}.seq" "${group_arr[#]}"
done
I would move away from shell scripts and start looking towards Perl.
#!/usr/bin/env perl
use strict;
use warnings;

my %groups;

while ( my $filename = glob("~/Report/Rep_*.csv") ) {
    my ( $group, $id ) = ( $filename =~ m,/Rep_(\d{4})_(\d{8})_\d+\.csv$, );
    next unless $group;    # undefined means it didn't match
    # anything past 100 in a group is discarded:
    if ( !$groups{$group} or @{ $groups{$group} } < 100 ) {
        push( @{ $groups{$group} }, $filename );
    }
}

foreach my $group ( keys %groups ) {
    print "$group contains:\n";
    print join( "\n", @{ $groups{$group} } ), "\n";
}
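Run against the sample listing from the question, the output would look something like this (hypothetical paths, assuming the CSVs live under ~/Report):
0001 contains:
/home/user/Report/Rep_0001_20150102_0.csv
/home/user/Report/Rep_0001_20150102_1.csv
0102 contains:
/home/user/Report/Rep_0102_20150102_0.csv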
Another alternative is to cobble together some bash commands with regexps; see the implementation below.
# Explanation:
# ls -p = list all files and directories in the local directory, marking directories with /
# grep -v / = ignore subdirectories
# grep -E "..." = keep only files matching the regexp
# tail -100 = take (the last) 100 results
for file in $(ls -p | grep -v / | grep -E "^Rep_[0-9]{4}_[0-9]{8}_[0-9]+\.csv$" | tail -100); do
    echo "$file"
    # Use a regexp to extract the desired sequence
    re="^Rep_([[:digit:]]{4})_([[:digit:]]{8})_[[:digit:]]+\.csv$"
    if [[ $file =~ $re ]]; then
        sequence=${BASH_REMATCH[1]}
        # Didn't end up using the date, but in case you want it:
        # date=${BASH_REMATCH[2]}
        # Just in case the sequence file doesn't exist
        if [ ! -f "$sequence" ]; then
            touch "$sequence"
        fi
        # Output/concat your filename to the sequence file, which you can
        # read in later to do whatever administrative tasks you wish to do
        # to them
        echo "$file" >> "$sequence"
    fi
done
I'm trying to create a shell script that will create multiple files (or a batch of files) of a specified amount. When the amount is reached, the script stops. When the script is re-executed, the files pick up from the last file created. So if the script creates files 1-10 on the first run, then the next execution should create 11-20, and so on.
#!/bin/bash
NAME=XXXX
valid=true
NUMBER=1
while [ $NUMBER -le 5 ];
do
touch $NAME$NUMBER
((NUMBER++))
echo $NUMBER + "batch created"
if [ $NUMBER == 5 ];
then
break
fi
touch $NAME$NUMBER
((NUMBER+5))
echo "batch complete"
done
Based on my comment above and your description, you can write a script that will create 10 numbered files (by default) each time it is run, starting with the next available number. As mentioned, rather than just use a raw-unpadded number, it's better for general sorting and listing to use zero-padded numbers, e.g. 001, 002, ...
If you just use 1, 2, ... then you end up with odd sorting when you reach each power of 10. Consider the first 12 files numbered 1..12 without padding. A general listing sort would produce:
file1
file11
file12
file2
file3
file4
...
Where 11 and 12 are sorted before 2. Adding leading zeros with printf -v avoids the problem.
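For a quick illustration of what printf -v does here (a throwaway example, not part of the script):
printf -v num "%03d" 7    # num is now 007
printf -v num "%03d" 12   # num is now 012
echo "file$num"           # prints file012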
Taking that into account, and allowing the user to change the prefix (first part of the file name) by giving it as an argument, and also change the number of new files to create by passing the count as the 2nd argument, you could do something like:
#!/bin/bash
prefix="${1:-file_}" ## beginning of filename
number=1 ## start number to look for
ext="txt" ## file extension to add
newcount="${2:-10}" ## count of new files to create
printf -v num "%03d" "$number" ## create 3-digit start number
fname="$prefix$num.$ext" ## form first filename
while [ -e "$fname" ]; do ## while filename exists
number=$((number + 1)) ## increment number
printf -v num "%03d" "$number" ## form 3-digit number
fname="$prefix$num.$ext" ## form filename
done
while ((newcount--)); do ## loop newcount times
touch "$fname" ## create filename
((! newcount)) && break; ## newcount 0, break (optional)
number=$((number + 1)) ## increment number
printf -v num "%03d" "$number" ## form 3-digit number
fname="$prefix$num.$ext" ## form filename
done
Running the script without arguments will create the first 10 files, file_001.txt - file_010.txt. Run a second time, it would create 10 more files file_011.txt to file_020.txt.
To create a new group of 5 files with the prefix of list_, you would do:
bash scriptname list_ 5
Which would result in the 5 files list_001.txt to list_005.txt. Running again with the same options would create list_006.txt to list_010.txt.
Since the scheme above with 3 digits is limited to 1000 files max (if you include 000), there isn't a big need to get the number from the last file written (bash can count to 1000 quite fast). However, if you used 7-digits, for 10 million files, then you would want to parse the last number with ls -1 | tail -n 1 (or version sort and choose the last file). Something like the following would do:
number=$(ls -1 "$prefix"* | tail -n 1 | grep -o '[1-9][0-9]*')
(note: that is ls -(one) not ls -(ell))
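For completeness, a minimal sketch of the version-sort variant mentioned above (my sketch, assuming GNU coreutils, where sort -V does a natural version ordering, and reusing the script's prefix and ext variables):
last=$(printf '%s\n' "$prefix"*."$ext" | sort -V | tail -n 1)   # highest-numbered file
number=$(( 10#$(grep -o '[0-9]\+' <<<"$last" | tail -n 1) ))    # last digit run, forced to base 10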
Let me know if that is what you are looking for.
I have a directory with a lot of files, which can be grouped based on their names. For example here I have 4 groups with 5 files in each:
ls ./
# group 1
NpXynWT_apo_300K_0.pdb
NpXynWT_apo_300K_1.pdb
NpXynWT_apo_300K_2.pdb
NpXynWT_apo_300K_3.pdb
NpXynWT_apo_300K_4.pdb
# group 2
NpXynWT_apo_340K_0.pdb
NpXynWT_apo_340K_1.pdb
NpXynWT_apo_340K_2.pdb
NpXynWT_apo_340K_3.pdb
NpXynWT_apo_340K_4.pdb
# group 3
NpXynWT_com_300K_0.pdb
NpXynWT_com_300K_1.pdb
NpXynWT_com_300K_2.pdb
NpXynWT_com_300K_3.pdb
NpXynWT_com_300K_4.pdb
# group 4
NpXynWT_com_340K_0.pdb
NpXynWT_com_340K_1.pdb
NpXynWT_com_340K_2.pdb
NpXynWT_com_340K_3.pdb
NpXynWT_com_340K_4.pdb
Here, each of the 5 files of the same group differs only by the end suffix, from 0 to 4:
NpXynWT_apo_300K_0 ... NpXynWT_apo_300K_4
NpXynWT_apo_340K_0 ... NpXynWT_apo_340K_4
etc
I need to loop over all of these 40 files and:
pre-process each of the files: add "MODEL" plus the number of the file (a number in the range 0 to 4) before the first line, and "ENDMDL" after the last line;
cat together the pre-processed files of the same group.
In summary, as the result my script should create 4 new "combined" files, each consisting of 5 subfiles from the initial list.
For the realisation I created an array of the groups and looped over it with an index from 0 to 4, using two loops: 1) pre-processing of each file; 2) cat'ing the pre-processed files together:
# list of 4 groups
systems=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
# pre-process files
for model in "${systems[@]}"; do
    i="0"
    while [ $i -lt 5 ]; do
        # EDIT EXISTING FILES
        sed -i "1 i\MODEL $i" "${pdbs}"/"${model}"_"$i"_FA.pdb
        echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
        i=$[$i+1]
    done
done
# cat pre-processed files
for model in "${systems[@]}"; do
    cat "${pdbs}"/"${model}"_[0-4]_FA.pdb > "${output}/${model}.pdb"
done
1 - Would it be possible to merge both loops? E.g. would it be the same as:
# pre-processing PDBs and catting them
for model in "${systems[@]}"; do
##echo "$model"
i="0"
while [ $i -lt 5 ]; do
k=$[$i+1]
## do something with pdb
sed -i "1 i\MODEL $k" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
#gedit "${pdbs}"/"${model}"_"$i"_FA.pdb
i=$[$i+1]
done
# now we cat together the post-processed files
cat "${pdbs}"/"${model}"_[0-4]_FA.pdb > "${output}/${model}.pdb"
done
2 - Would it be possible to simplify the two file-editing operations from the first loop?
sed -i "1 i\MODEL $i" "${pdbs}"/"${model}"_"$i"_FA.pdb
echo "ENDMDL" >> "${pdbs}"/"${model}"_"$i"_FA.pdb
And how do I match info from the array "groups" to the files present in the folder?
Use find. It is there to find files.
groups=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
for group in "${groups[@]}"; do
find . -name "${group}_*.pdb" -type f
done
You can be even more exact by using -regex and similar find options.
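For instance, a sketch with -regex (GNU find assumed; note that -regex matches against the whole path, not just the basename):
groups=(NpXynWT_apo_300K NpXynWT_apo_340K NpXynWT_com_300K NpXynWT_com_340K)
for group in "${groups[@]}"; do
    find . -maxdepth 1 -type f -regex "\./${group}_[0-4]\.pdb"
done
As for question 2, the two editing operations can be folded into a single sed call (again a sketch, GNU sed assumed, since -i and the one-line i/a forms are GNU extensions):
# insert "MODEL $i" before the first line and append "ENDMDL" after the last, in one pass
sed -i -e "1i MODEL $i" -e '$a ENDMDL' "${pdbs}/${model}_${i}_FA.pdb"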
I have a directory full of directories containing exam subjects I would like to work on randomly to simulate the real exam.
They are classified by difficulty level:
0-0, 0-1 .. 1-0, 1-1 .. 2-0, 2-1 ..
I am trying to write a shell script allowing me to pick one subject (directory) randomly based on the parameter I pass when executing the script (0, 1, 2 ..).
I can't quite figure it, here is my progress so far:
ls | find . -name "1$~" | sort -r | head -n 1
What am I missing here?
There's no need for any external commands (ls, find, sort, head) for this at all:
#!/usr/bin/env bash
shopt -s nullglob   # make globs expand to nothing, not themselves, when no matches are found

dirs=( "$1"*/ )     # list directories starting with $1 into an array

# Validate that our glob actually had at least one match
(( ${#dirs[@]} )) || { printf 'No directories start with %q at all\n' "$1" >&2; exit 1; }

idx=$(( RANDOM % ${#dirs[@]} ))   # pick a random index into our array
echo "${dirs[$idx]}"              # and look up what's at that index
I need help with iterating over two arrays simultaneously in bash and using sed to replace a word with the words from the second array in each separate file. There are two lists:
List1 - contains the names of the directories; each directory contains a parameter file with the same name, "surfnet.par"
List1.txt
3IT6_1
3IT6_3
3IT6_6
3IT6_9
3IT6_11
3IT6_12
3IT6_19
3IT6_23
3IT6_54
3IT6_62
List2 - contains the numbers corresponding to each directory; each number has to replace a specific word (single occurrence) in the "surfnet.par" file of its directory
List2.txt
11351
11357
11371
11384
11350
11373
11383
11365
11377
11382
To make it clearer:
I want to replace the word "Resnum" in "surfnet.par" in the directory "3IT6_1" of List1 with "11351" of List2; likewise, replace the same word in "surfnet.par" of 3IT6_3 with 11357, 3IT6_6 with 11371, and so on.
I have tried pushing the lists into arrays and then using a for loop to replace the word, but failed: it took the first value of List2 and replaced it in all the "surfnet.par" files in the different directories. The script I have been using is as below:
#!/bin/bash
declare -a dir
declare -a res
dir=(`cat "List1.txt" `)
res=(`cat 'List2.txt'`)
for i in "${dir[#]}"
do
echo $i
cd $i
sed -e "s/Resnum/${res$[0]}/g" surfnet.par > surfnet2.par
cd ..
done
I will appreciate it very much if any of you can help me fix this code and point out the modification that needs to be made. In case my code doesn't make any sense, please provide a solution using bash, awk, sed or perl.
I think you may have been a whole lot closer to a solution than you thought. This is one situation in bash where making use of a c-style loop and iterating on an index can come in very handy. The following slight changes to your code should work, give it a try (note: I added a check on cd and used the starting directory as current to enable use of absolute paths):
#!/bin/bash
declare -a dir
declare -a res
dir=( $(<List1.txt) )
res=( $(<List2.txt) )
current="$PWD"
for ((i = 0; i < ${#dir[@]}; i++))
do
cd "$current/${dir[i]}" || {
echo "failed to change to ${dir[i]}"
continue
}
printf "%3d %-8s Resnum -> %s\n" $i ${dir[i]} ${res[i]}
sed -e "s/Resnum/${res[i]}/g" surfnet.par > surfnet2.par
done
Example Use
Tested with your ListX.txt files with cd and sed calls commented out.
$ bash resnum.sh
0 3IT6_1 Resnum -> 11351
1 3IT6_3 Resnum -> 11357
2 3IT6_6 Resnum -> 11371
3 3IT6_9 Resnum -> 11384
4 3IT6_11 Resnum -> 11350
5 3IT6_12 Resnum -> 11373
6 3IT6_19 Resnum -> 11383
7 3IT6_23 Resnum -> 11365
8 3IT6_54 Resnum -> 11377
9 3IT6_62 Resnum -> 11382
Note: in bash, for indexed arrays, the $ on the index variable is not required (e.g. ${dir[$i]} is fine as ${dir[i]}); the index is evaluated arithmetically, just as if it were enclosed in ((..)) as in the loop declaration.
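A quick demonstration of that point (throwaway example):
arr=(a b c d); i=3
echo "${arr[$i]}" "${arr[i]}" "${arr[i-1]}"   # prints: d d c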
Note2: you should probably add a validation that both values are available at the top of the loop before calling cd to change to the desired directory:
## validate both values available
[ -z "${dir[i]}" -o -z "${res[i]}" ] && {
    echo "'${dir[i]}' or '${res[i]}' missing."
continue
}
If you don't like to type too much, you can do this:
while read d s; do sed 's/target/'"$s"'/g' "$d"/f.txt > "$d"/f2.txt; done < <(paste list1 list2)
appropriately replace target with your search word, f.txt f2.txt list1 and list2 with the file names you use. It should be clear which is which.
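Unrolled, with the names from the question substituted in (same logic, just easier to read):
while read -r d s; do
    sed "s/Resnum/$s/g" "$d"/surfnet.par > "$d"/surfnet2.par
done < <(paste List1.txt List2.txt)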
Your question bears a Perl tag, so I assume that Perl solutions are acceptable.
Your question isn't very clear, but I think this program should help you:
use strict;
use warnings;
use v5.10.1;
use autodie;

my @dirs = slurp('Listdirs204.txt');
my @res  = slurp('LastHetatmRes.txt');

die "File sizes don't match" unless @dirs == @res;

for my $i ( 0 .. $#dirs ) {
    my ( $dir, $res ) = ( $dirs[$i], $res[$i] );
    my $file = "$dir/surfnet.par";
    my @lines = slurp($file);
    s/Resnum/$res/g for @lines;
    open my $fh, '>', $file;
    print $fh "$_\n" for @lines;
    close $fh;
}

sub slurp {
    open my $fh, '<', shift;
    my @lines = <$fh>;
    chomp @lines;
    @lines;
}
I am creating a script to run on OS X which will be run often by a novice user, and so want to protect a directory structure by creating a fresh one each time with an n+1 over the last:
target001 with the next run creating target002
I have so far:
lastDir=$(find /tmp/target* | tail -1 | cut -c 6-)
let n=$n+1
mkdir "$lastDir""$n"
However, the math isn't working here.
What about
mktemp?
Create a temporary file or directory, safely, and print its name.
TEMPLATE must contain at least 3 consecutive `X's in last component.
If TEMPLATE is not specified, use tmp.XXXXXXXXXX, and --tmpdir is
implied. Files are created u+rw, and directories u+rwx, minus umask
restrictions.
Use this line to calculate the new sequence number:
...
n=$(printf "%03d" $(( 10#$n + 1 )) )
mkdir "$lastDir""$n"
The 10# forces base-10 arithmetic. This assumes $n already holds the last sequence number, e.g. "001".
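To see why the 10# matters, try it with a value like "009": without the prefix, the leading zero makes bash parse the number as octal (a quick throwaway test):
n=009
echo $(( n + 1 ))       # error: value too great for base (09 is not valid octal)
echo $(( 10#$n + 1 ))   # 10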
No pipes and subprocesses:
targets=( /tmp/target* ) # all dirs in an array
lastdir=${targets[@]: (-1):1} # select filename from last array element
lastdir=${lastdir##*/} # remove path
lastnumber=${lastdir/target/} # remove 'target'
lastnumber=00$(( 10#$lastnumber + 1 )) # increment number (base 10), add leading zeros
mkdir /tmp/target${lastnumber: -3} # make dir; last 3 chars from lastnumber
A version with 2 parameters:
path='/tmp/x/y/z' # path without last part
basename='target' # last part
targets=( $path/${basename}* ) # all dirs in an array
lastdir=${targets[@]: (-1):1} # select path from last entry
lastdir=${lastdir##*/} # select filename
lastnumber=${lastdir/$basename/} # remove 'target'
lastnumber=00$(( 10#$lastnumber + 1 )) # increment number (base 10), add leading zeros
mkdir $path/$basename${lastnumber: -3} # make dir; last 3 chars from lastnumber
Complete solution using extended test [[ and BASH_REMATCH :
[[ $(find /tmp/target* | tail -1) =~ ^(.*)([0-9]{3})$ ]]
mkdir $(printf "${BASH_REMATCH[1]}%03d" $(( 10#${BASH_REMATCH[2]} + 1 )) )
Provided /tmp/target001 is your directory pattern.
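For example, if /tmp/target007 is the highest existing directory, BASH_REMATCH[1] captures "/tmp/target" and BASH_REMATCH[2] captures "007", so the mkdir line creates /tmp/target008.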
Like this:
lastDir=$(find /tmp/target* | tail -1)
let n=1+10#${lastDir##/tmp/target}
mkdir /tmp/target$(printf "%03d" $n)