look into file names and concatenate them in bash

I have never programmed in bash before.
I am reading all the files in a directory; I need to look at their names and check whether they have R1 or R2 in them. Depending on that, I need to concatenate all the files that have R1 in the name, and likewise all the files that have R2 in the name.
So as a final output I would like to have something like:
String 1 = file1_R1.gz file2_R1.gz file3_R1.gz...
String 2 = file1_R2.gz file2_R2.gz file3_R2.gz...
How can I do that? The only code that I have so far is:
#!/bin/bash
list=$(echo *.gz)
strR1="R1"
strR2="R2"
if [ "$list" = "*.gz" ] ; then list=""; fi
for str in $list
do
if echo "$strR1" | grep -q "$str"; then
echo "str";
else
echo "no file";
fi
done
I can read all the files in the directory, but in the if I never find any file with R1, even though I know there are at least 4 files with R1 in the name.
Thank you!

The immediate bug is that your grep operands are reversed: echo "$strR1" | grep -q "$str" asks whether the string R1 contains the filename, not the other way around (and echo "str" prints the literal word str rather than the file name). But why write a loop to do the filtering that wildcards will already do for you? (You're using the same thing to filter *.gz already.)
files1=`echo *R1*.gz`
files2=`echo *R2*.gz`
Since you said you want to concatenate all those files, that would then be
zcat *R1*.gz > result_R1
zcat *R2*.gz > result_R2
or something like that.
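If you do want the lists in variables first (as in String 1/String 2 above), a minimal sketch of the same idea, assuming the .gz files sit in the current directory and that merged_R1.gz and merged_R2.gz are acceptable output names:
#!/bin/bash
shopt -s nullglob                # unmatched globs expand to nothing
r1=( *R1*.gz )
r2=( *R2*.gz )
# Concatenated gzip members form one valid gzip stream, so the results
# stay compressed; zcat them later if you need the plain text.
(( ${#r1[@]} )) && cat "${r1[@]}" > merged_R1.gz
(( ${#r2[@]} )) && cat "${r2[@]}" > merged_R2.gz
On a second run merged_R1.gz would itself match *R1*.gz, so in practice you would write the results into a different directory.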

Related

How do you compress multiple folders at a time, using a shell?

There are n folders in the directory named after the date, for example:
20171002 20171003 20171005 ...20171101 20171102 20171103 ...20180101 20180102
Note: the dates are not continuous.
I want to compress every three folders in each month into one compression block.
For example:
tar jcvf mytar-20171002_1005.tar.bz2 20171002 20171003 20171005
How do I write a shell script to do this?
You need a for loop over your ls output, then parse each directory name.
dir_list=$(ls)
prev_month=""
count=0
group=()
for i in $dir_list; do
    month=${i:0:6}   # year plus month, e.g. 201710
    if [ "$month" = "$prev_month" ]; then
        count=$((count+1))
        group+=("$i")
        if [ "$count" -eq 3 ]; then
            # compress here: tar up "${group[@]}", then reset
            group=()
            count=0
        fi
    else
        if [ "${#group[@]}" -gt 0 ]; then
            # compress here: flush the partial group left over
            # from the previous month
            :
        fi
        group=("$i")
        count=1
        prev_month=$month
    fi
done
# (a final "compress here" is needed after the loop for any leftover group)
This code is not tested and may contain syntax errors. Each '# compress here' needs to be replaced with a tar command built from the group array.
Assuming you don't have too many directories (I think the limit is several hundred), then you can use Bash's array manipulation.
So, you first load all your directory names into a Bash array:
dirs=( $(ls) )
(I'm going to assume files have no spaces in their names, otherwise it gets a bit dicey)
Then you can use Bash's array slice syntax to pop 3 elements at a time from the array:
while [ "${#dirs[#]}" -gt 0 ]; do
dirs_to_compress=( "${dirs[#]:0:3}" )
dirs=( "${dirs[#]:3}" )
# do something with dirs_to_compress
done
The rest should be pretty easy.
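Filling in the "do something" comment with the tar call from the question, one sketch (note it simply takes three directories at a time, without splitting groups at month boundaries):
dirs=( $(ls) )
while [ "${#dirs[@]}" -gt 0 ]; do
    dirs_to_compress=( "${dirs[@]:0:3}" )
    dirs=( "${dirs[@]:3}" )
    first=${dirs_to_compress[0]}
    last=${dirs_to_compress[${#dirs_to_compress[@]}-1]}
    # archive name: first dir in full, then the MMDD of the last dir
    tar jcvf "mytar-${first}_${last:4:4}.tar.bz2" "${dirs_to_compress[@]}"
done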
You can achieve this with xargs, a bash while loop, and awk: xargs -n3 groups the listing three names per line, and awk builds each archive name from the first name plus characters 5-8 (the MMDD) of the last name on the line:
ls | xargs -n3 | while read line; do
tar jcvf $(echo $line | awk '{print "mytar-"$1"_"substr($NF,5,4)".tar.bz2"}') $line
done
unset folders
declare -A folders
g=3
for folder in $(ls -d */); do
    folders[${folder:0:6}]+="${folder%%/} "
done
for folder in "${!folders[@]}"; do
    for (( i=0; i < $(echo ${folders[$folder]} | tr ' ' '\n' | wc -l); i+=g )); do
        group=(${folders[$folder]})
        groupOfThree=(${group[@]:i:g})
        tar jcvf mytar-${groupOfThree[0]}_${groupOfThree[-1]:4:4}.tar.bz2 ${groupOfThree[@]}
    done
done
This script finds all folders in the current directory, separates them into groups by month, makes groups of at most three folders, and creates a .tar.bz2 for each group with the name you used in the question.
I tested it with those folders:
20171101 20171102 20171103 20171002 20171003 20171005 20171007 20171009 20171011 20171013 20180101 20180102
And the created tars are:
mytar-20171002_1005.tar.bz2
mytar-20171007_1011.tar.bz2
mytar-20171013_1013.tar.bz2
mytar-20171101_1103.tar.bz2
mytar-20180101_0102.tar.bz2
Hope that helps :)
EDIT: If you are using bash version < 4.2 then replace the line:
tar jcvf mytar-${groupOfThree[0]}_${groupOfThree[-1]:4:4}.tar.bz2 ${groupOfThree[@]}
by:
tar jcvf mytar-${groupOfThree[0]}_${groupOfThree[`expr ${#groupOfThree[@]} - 1`]:4:4}.tar.bz2 ${groupOfThree[@]}
That's because bash version < 4.2 doesn't support negative indices for arrays.
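For reference, the two subscript styles are equivalent:
arr=(a b c)
echo "${arr[-1]}"              # c  (bash >= 4.2)
echo "${arr[${#arr[@]}-1]}"    # c  (works in older bash too)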

How to modify the filename in shell script?

I have 3 text files. I want to modify the names of those files using a for loop, as below.
These are the files I have:
1234.xml
333.xml
cccc.xml
Output:
1234_R.xml
333_R.xml
cccc_R.xml
Depending on your distribution, you can use rename:
rename 's/(.*)(\.xml)/$1_R$2/' *.xml
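If your rename is instead the util-linux version (which does plain string substitution rather than taking a Perl expression), the equivalent call should be:
# util-linux rename: replace the first occurrence of ".xml" with "_R.xml"
rename .xml _R.xml *.xml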
The basic Unix command mv works for both moving and renaming:
mv 1234.xml 1234_R.xml
If you want to do it in bulk, do it like this:
[~/bash/rename]$ touch 1234.xml 333.xml cccc.xml
[~/bash/rename]$ ls
1234.xml 333.xml cccc.xml
[~/bash/rename]$ L=`ls *.xml`
[~/bash/rename]$ for x in $L; do mv $x ${x%%.*}_R.xml; done
[~/bash/rename]$ ls
1234_R.xml 333_R.xml cccc_R.xml
[~/bash/rename]$
You can use a for loop to iterate over a list of words (e.g. with the list of file names returned by ls) with the (bash) syntax:
for name [ [ in [ word ... ] ] ; ] do list ; done
Then use mv source dest to rename each file.
One nice trick here is to use basename, which strips the directory and an optional suffix from a filename (e.g. basename 333.xml .xml will just return 333).
If you put all this together, the following should work:
for f in `ls *.xml`; do mv $f `basename $f .xml`_R.xml; done
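Since parsing ls output is fragile with unusual filenames, a safer sketch of the same loop uses the glob directly and parameter expansion instead of basename:
for f in *.xml; do
    mv -- "$f" "${f%.xml}_R.xml"   # strip .xml, append _R.xml
done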

Indexing files and parsing name

I have a directory, ./grd_files/lat36/, that has 7 files in it (n36e114.grd, n36e115.grd, n36e116.grd, n36e117.grd, n36e118.grd, n36e119.grd, n36e120.grd). Also beneath ./grd_files/ are other folders named lat37, lat38, lat39. Each contains some files named in the same format as those in lat36, except that, for example, the file for the e114 longitude in the lat37 folder is called n37e114. Now, not all lat** folders contain all the longitudes, but I need them to.
I have written a part of the script to determine which lat** folder has the most columns in it (it is lat36 with 7 longitudes). I want to compare the longitudes that exist in lat36 folder to the other folders, and if a column is missing in another folder, I will make it. I can handle the if then statement, but I am stumped on how to compare the lists in bash.
I was thinking of making a list of the file names in the row1 folder and comparing that to the files in the other folders, but the names won't and shouldn't match -- only the column part of the name will and should match. So far I have tried to make an array of the file names and then parse it for just the column part of the name. Note that these are actually map tiles, so the names are really coordinates in northing (row) and easting (col), e.g. n36e114.grd. So I want to isolate all the e114-style parts of the names and check that they exist in the other rows. I hope that makes sense. Below is what I attempted, but I am not great with bash syntax, so I'm stumped. Thanks so much for the help.
col_list_raw=( $(find $maxdirectory -name "*.grd" -exec basename {} .grd \;) )
col_list=( for c in ${col_list_raw[@]}; do echo ${col_list_raw[$c]:3:7}; done )
where $maxdirectory is the one with the most columns.
UPDATE: I have removed what I described in italics above and attempted to incorporate the solution from John1024. Below is the code.
cd ./grd_files
for row in lat*/
do
ls "$row" | sed 's/.*lon/lon/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
grep -vFf "$f" ${latXX}.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
Where latXX is the folder with the most longitudes. John1024's first loop works nicely, and I get the correct lists for each of the lat** folders, but the second loop straight up compares the lists, returning:
lat37 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat38 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat39 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
I need that loop to compare only part of the file name, i.e. I want to check each folder for the existence of each longitude, so that if the file n37e114.grd exists, nothing happens, but if it does not exist, that information is returned and I can execute a command based on the missing file. I hope my edits clear up the naming convention and are understandable. Thanks again for the help. AM
SOLUTION:
Thanks to the help of @John1024 I was able to find a solution. I have reproduced the final solution below. After this, I read in the *.out files and run my command on each line of them.
cd ./grd_files
for lat in */
do
ls "$lat" | sed 's/[a-z][1-9][1-9].*\([a-z][0-9][0-9]*\).grd/\1/' >"${lat%/}.tmp"
done
for file in *.tmp
do
lat=$(echo $file | awk -F "." '{print $1}')
grep -vFf "$file" ${xXX}.tmp >${lat}missing.out
[ -s ${lat}missing.out ] && echo ${file%.tmp} is missing $(cat ${lat}missing.out)
done
The question includes two different naming schemes for the files. Both would work the same, but to keep it simple and intuitive, this answer uses the first scheme.
It is possible to loop through bash arrays to find the missing columns. However, grep is well-suited to this task, greatly simplifies the logic, and, if there are many columns and rows, it is likely much faster. Using grep:
cd ./grd_files
for row in row*/
do
ls "$row" | sed 's/.*col/col/' >"${row%/}.tmp"
done
for f in row*.tmp
do
grep -vFf "$f" row1.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
The first loop above creates lists of the columns that exist in each of the rows. These lists are saved in temporary files named row1.tmp, row2.tmp, etc.
The second loop compares each of those lists to the reference row, row1.tmp. The list of columns missing from a given row is saved in the temporary file missing.tmp. If missing.tmp has a nonzero size, then there are missing columns and a report is generated.
For cleanup, one might want to delete the tmp files. If so, add this line to the end of the script:
rm row*.tmp missing.tmp
Fancier version
Using process substitution, the need for many of the temporary files can be eliminated:
trap "rm missing.tmp" EXIT
for row in row*/
do
ls row1/ | sed 's/.*col/col/' | grep -vFf <(ls "$row" | sed 's/.*col/col/') >missing.tmp
[ -s missing.tmp ] && echo $row is missing $(cat missing.tmp)
done
This version also uses trap to assure that the sole remaining temporary file is removed when the script is finished.
Using the other naming scheme as per revised question
cd ./grd_files
for row in lat*/
do
ls "$row" | sed 's/.*n[0-9][0-9]e/e/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
grep -vFf "$f" ${latXX}.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
As I told you in the comment, supplying test data is good practice. In this case you would have gotten many more answers if you had supplied a script that creates a test case, something like the following:
mkdir grid
cd grid
mkdir lat3{5..9}
#if you don't know the {5..9} expansion, simply write
#mkdir lat35 lat36 lat37 lat38 lat39
touch lat35/n35e111.grd
touch lat36/n36e11{4..9}.grd lat36/n36e120.grd
touch lat37/n37e11{4,6,8}.grd
touch lat38/n38e11{4..9}.grd
#39 missing all files
Such a test-case script helps much more than a full page of words. ;) Or, if there is no script, at least supply the output of find, e.g. find grid -print. Your first edit helps a bit (I missed it), and +100 to @John1024's work.
Now about the solution.
Your final solution has one problem. What if the directory with the MOST LONGITUDES (your latXX) is missing some grid file that exists in other directories? E.g. it has the most grid files, but still not all. In the above test case, lat36 contains 7 files (the most of all), but it is still missing n36e111.grd (because 111 exists only in lat35).
Therefore I created an alternative solution that eliminates this problem and shows the result as the following matrix:
111 114 115 116 117 118 119 120
35: + no no no no no no no # the 111 is here
36: no + + + + + + + # the dir with a MOST of longitudes but missing 111
37: no + no + no + no no
38: no + + + + + + no
39: no no no no no no no no # missing all longitudes
The script:
start="./test/grid"
cd "$start" || err "can cd to $start" || exit 1
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
print_matrix() {
echo -ne "\t"
paste -s - <<<"$known_longs"
for lat in $known_lats
do
echo -en "$lat:"
for long in $known_longs
do
[[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && echo -en "\t+" || echo -en "\tno"
done
echo
done
}
print_matrix
The logic is simple:
search for all known longitudes, i.e. the filenames containing eNNN
search for all known latitudes, i.e. the directories named latNN
in a loop, test the existence of each file
The printed matrix above is probably not very useful by itself, because you probably want to do something with the found or missing files, so here is an action variant of the script:
start="./test/grid"
cd "$start" || err "can cd to $start" || exit 1
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
do_if_exists() {
local xlat="$1"
local xlong="$2"
filename="n${xlat}e${xlong}.grd"
#do nothing
}
do_if_missing() {
local xlat="$1"
local xlong="$2"
filename="n${xlat}e${xlong}.grd"
echo "from lat$xlat missing $filename"
}
do_actions() {
for lat in $known_lats
do
for long in $known_longs
do
[[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && do_if_exists $lat $long || do_if_missing $lat $long
done
done
}
do_actions
It performs an action for each missing file (here it just echoes what is missing), and the output is:
from lat35 missing n35e114.grd
from lat35 missing n35e115.grd
from lat35 missing n35e116.grd
from lat35 missing n35e117.grd
from lat35 missing n35e118.grd
from lat35 missing n35e119.grd
from lat35 missing n35e120.grd
from lat36 missing n36e111.grd
from lat37 missing n37e111.grd
from lat37 missing n37e115.grd
from lat37 missing n37e117.grd
from lat37 missing n37e119.grd
from lat37 missing n37e120.grd
from lat38 missing n38e111.grd
from lat38 missing n38e120.grd
from lat39 missing n39e111.grd
from lat39 missing n39e114.grd
from lat39 missing n39e115.grd
from lat39 missing n39e116.grd
from lat39 missing n39e117.grd
from lat39 missing n39e118.grd
from lat39 missing n39e119.grd
from lat39 missing n39e120.grd
Of course, it is possible to optimise further, for example:
run the find only once (helps if the directory tree is large), creating a list of filenames with the find command
don't test each file on disk, but test for the filename in the previously created list of filenames
as in the next script:
startdir="./test/grid"
(cd "$startdir" || err "can cd to $start" || exit 1
gridlist="/tmp/griglist.$$"
trap "rm -f $gridlist;exit" 0 2
find . -regex '\./lat[0-9][0-9]*.*' -print >$gridlist
known_longs=($(sed -n 's:^.*/n[0-9][0-9]*e\([0-9][0-9]*\)\.grd$:\1:p' $gridlist | sort -u))
known_lats=($(grep -oP '/lat\K\d+((?=/?)|$)' $gridlist | sort -u))
full_list() {
for lat in ${known_lats[#]}
do
for long in ${known_longs[#]}
do
echo "./lat${lat}/n${lat}e${long}.grd"
done
done
}
comm -13 $gridlist <(full_list)) | while read missing
do
#do something with the miising file
echo "$missing"
done

Getting different output files

I'm doing a test with these files:
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R2_001.fastq
I want the files that share the same code up to the first _ (underscore) and have the code R1 grouped into separate output files. Each output file should be named according to the code before the first _ (underscore).
This is my code, but I'm having trouble making the output files.
#!/bin/bash
for i in {900..995}; do
if [[ ${i} -eq ${i} ]]; then
cat comp${i}_*_R1_001.fastq
fi
done
I want to have two outputs:
One output will have all lines from:
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
and its name should be comp900_R1.out
The other output will have lines from:
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
and its name should be comp995_R1.out
Finally, as I said, this is a small test. I want my script to work with a lot of files that have the same characteristics.
Using awk:
ls -1 *.fastq | awk -F_ '$8 == "R1" {system("cat " $0 ">>" $1 "_R1.out")}'
This lists all *.fastq files into awk, splitting on _. If the 8th field $8 is R1, it appends (cat ... >>) the file's contents to a file named from the first field $1 plus _R1.out, which will be comp900_R1.out or comp995_R1.out. It is assumed that no filenames contain spaces or other special characters.
Result:
File comp900_R1.out containing all lines from
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
and file comp995_R1.out containing all lines from
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
My stab at a general solution:
#!/bin/bash
for f in *_R1_*; do
code=$(echo $f | cut -d _ -f 1)
cat $f >> ${code}_R1.out
done
It iterates over files with _R1_ in their names, then appends each file's contents to an output file named from its code.
cut pulls out the code by splitting the filename (-d _) and returning the first field (-f 1).
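The cut call could also be replaced by parameter expansion, avoiding one subprocess per file; a sketch of the same loop:
for f in *_R1_*; do
    cat "$f" >> "${f%%_*}_R1.out"   # ${f%%_*} strips everything from the first "_"
done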

Need to pick Latest File From a Dir Using Shell Script

I am new to shell scripting, and I have a requirement to pick the latest files from a directory using a shell script.
Directory Name : FTPDIR
The files in this directory look like:
APC5502015VP072020121826.csv
APC5502015VP082020122314.csv
APC5502015VP092020121451.csv
CBC5502015VP092020122045.csv
CBC5502015VP102020122045.csv
S5502015VP072020121620.csv
S5502015VP072020122314.csv
S5502015VP092020122045.csv
Note: I need to pick the one latest file from each group. Below is the output which I need to get after executing the shell script:
APC5502015VP092020121451.csv
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
For example, in the latest file APC5502015VP092020121451.csv, the number 092020121451 is the date part in the format MMDDYYYYHHMM, and the string part is APC5502015VP (the length of the string part is not fixed).
I need to pick those three files from the directory using a shell script.
Can you help me to resolve this?
It's going to be really problematic to do this safely in just bash. As Jonathan mentioned, "special" characters like spaces or newlines may bung up your script.
If we can assume that there won't be any of those, then we can do most of the job in bash, without involving other tools.
# Make an associative array to record types, in the second loop...
declare -A a
for file in *.csv; do
# First, we convert the filenames into something that can be sorted.
# The next three lines account for your "unknown length" in the first part
# of the filename. We assume the date+time is the 12 chars before ".csv".
new="$(rev <<<"$file")"
new="${new:4:12}"
new="$(rev <<<"$new")"
new="${new:4:4}${new:0:2}${new:2:2}${new:8:4}"
len=$(( ${#file} - 16 ))
echo "$new ${file:0:$len} $file"
done | sort -r | while read date type file; do   # sort -r: newest first, since we keep the first of each type
# Next, we print only the first of each "type"...
if [[ ${a[$type]} -eq 0 ]]; then
a[$type]=1
echo "$file"
fi
# And stop once we have collected three types.
if [[ ${#a[*]} -ge 3 ]]; then
break
fi
done
As I say, this doesn't handle newlines in filenames.
Note also that this uses rev and sort, which are not built in to bash. The rev parts could be done internally, using more code, which might make them execute faster, but you'd only see a difference in very extreme cases. There's not much we can do about sort, since there isn't a built-in within bash.
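For illustration, a pure-bash stand-in for rev might look like this sketch (a hypothetical helper; fine for short filenames, though slower than the external tool on large inputs):
# Reverse a single-line string, character by character.
rev_str() {
    local in=$1 out='' i
    for (( i=${#in}-1; i>=0; i-- )); do
        out+=${in:i:1}
    done
    printf '%s\n' "$out"
}
With that, new="$(rev <<<"$file")" could become new=$(rev_str "$file").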
This Perl script works on the given data. No doubt it could be improved.
#!/usr/bin/env perl
use strict;
use warnings;
my %bases;
while (<>)
{
chomp;
my $name = $_;
my($prefix, $mmdd, $yyyy, $hhmm) = ($name =~ m/(.*)(\d{4})(\d{4})(\d{4})\.csv/);
#print "$name = $prefix $yyyy $mmdd $hhmm\n";
my $stamp = "$yyyy$mmdd$hhmm";
if (!exists($bases{$prefix}) || ($stamp > $bases{$prefix}->{stamp}))
{
$bases{$prefix} = { name => $name, stamp => $stamp };
}
}
foreach my $prefix (sort keys %bases)
{
print "$bases{$prefix}->{name}\n";
}
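The script reads filenames via <> (standard input or file arguments), so a usage sketch, assuming it is saved as latest.pl (a name chosen here for illustration):
ls FTPDIR | perl latest.pl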
Output:
APC5502015VP092020121451.csv
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
This is the awk solution. Splitting each name on "VP", $1 is the string part and $2 is the date part plus .csv; it keeps the greatest date string seen for each prefix and prints each winner at the end:
cd FTPDIR
ls -1|awk -F"VP" '{split($2,a,".");if(a[1]>b[$1]){b[$1]=$2}}END{for(i in b)print i"VP"b[i]}'
Tested below:
> cat temp
APC5502015VP072020121826.csv
APC5502015VP082020122314.csv
APC5502015VP092020121451.csv
CBC5502015VP092020122045.csv
CBC5502015VP102020122045.csv
S5502015VP072020121620.csv
S5502015VP072020122314.csv
S5502015VP092020122045.csv
> awk -F"VP" '{split($2,a,".");if(a[1]>b[$1]){b[$1]=$2}}END{for(i in b)print i"VP"b[i]}' temp
CBC5502015VP102020122045.csv
S5502015VP092020122045.csv
APC5502015VP092020121451.csv
