Indexing files and parsing name - bash

I have a directory, ./grd_files/lat36/ that has 7 files in it (n36e114.grd, n36e115.grd, n36e116.grd, n36e117.grd, n36e118.grd, n36e119.grd, n36e120.grd). Also beneath ./grd_files/ are other folders named lat37, lat38, lat39. Each contains some files named in the same format as those in lat36, only instead of n36e114.grd, the file for the e114 longitude in the lat37 folder would be called n37e114.grd. Now, not all lat** folders contain all the longitudes, but I need them to.
I have written a part of the script to determine which lat** folder has the most columns in it (it is lat36 with 7 longitudes). I want to compare the longitudes that exist in lat36 folder to the other folders, and if a column is missing in another folder, I will make it. I can handle the if then statement, but I am stumped on how to compare the lists in bash.
I was thinking to make a list of the file names in the row1 folder, and compare that to the files in the other folders, but the names won't and shouldn't match -- only the column part of the name will and should match. So far I have tried to make an array of the file names and then parse it for just the column part of the name. Note that these are actually map tiles, so the names are really in the format of coordinates in northing (row) and easting (col), e.g. n36e114.grd. So I want to isolate all the e114 style parts of the names and check that they exist in the other rows. I hope that makes sense. Below is what I attempted, but I am not great in bash syntax so I'm stumped. Thanks so much for the help.
col_list_raw=( $(find $maxdirectory -name "*.grd" -exec basename {} .grd \;) )
col_list=( for c in ${col_list_raw[@]}; do echo ${col_list_raw[$c]:3:7}; done )
where $maxdirectory is the one with the most columns.
UPDATE: I have removed what I described in italics above and attempted to incorporate the solution from John1024. Below is the code.
cd ./grd_files
for row in lat*/
do
ls "$row" | sed 's/.*lon/lon/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
grep -vFf "$f" ${latXX}.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
Where latXX is the folder with the most longitudes. John1024's first loop works nicely, and I get the correct lists for each of the lat** folders, but the second loop compares the lists literally, returning:
lat37 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat38 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat39 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
I need that loop to compare only part of the file name, i.e. I want to check each folder for the existence of each longitude. So if the file n37e114.grd exists, nothing happens, but if it does not exist, that information is returned and I can execute a command based on the missing file. I hope my edits clear up the naming convention and are understandable. Thanks again for the help. AM
SOLUTION:
Thanks to the help of @John1024 I was able to find a solution. I have reproduced the final solution below. Following this, I read in the *.out files and run my command on each line of them.
cd ./grd_files
for lat in */
do
ls "$lat" | sed 's/[a-z][1-9][1-9].*\([a-z][0-9][0-9]*\).grd/\1/' >"${lat%/}.tmp"
done
for file in *.tmp
do
lat=$(echo $file | awk -F "." '{print $1}')
grep -vFf "$file" ${xXX}.tmp >${lat}missing.out
[ -s ${lat}missing.out ] && echo ${file%.tmp} is missing $(cat ${lat}missing.out)
done

The question includes two different naming schemes for the files. Both would work the same, but to keep it simple and intuitive, this answer uses the first scheme.
It is possible to loop through bash arrays to find the missing columns. However, grep is well-suited to this task, greatly simplifies the logic, and, if there are many columns and rows, it is likely much faster. Using grep:
cd ./grd_files
for row in row*/
do
ls "$row" | sed 's/.*col/col/' >"${row%/}.tmp"
done
for f in row*.tmp
do
grep -vFf "$f" row1.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
The first loop above creates lists of the columns that exist in each row. These lists are saved in temporary files named row1.tmp, row2.tmp, etc.
The second loop compares each of those lists to the reference row, row1.tmp. The columns that are in the reference row but missing from the current row are saved in the temporary file missing.tmp. If missing.tmp has a nonzero size, then there are missing columns and a report is generated.
For cleanup, one might want to delete the tmp files. If so, add this line to the end of the script:
rm row*.tmp missing.tmp
Fancier version
Using process substitution, the need for many of the temporary files can be eliminated:
trap "rm missing.tmp" EXIT
for row in row*/
do
ls row1/ | sed 's/.*col/col/' | grep -vFf <(ls "$row" | sed 's/.*col/col/') >missing.tmp
[ -s missing.tmp ] && echo $row is missing $(cat missing.tmp)
done
This version also uses trap to assure that the sole remaining temporary file is removed when the script is finished.
Using the other naming scheme as per revised question
cd ./grd_files
for row in lat*/
do
ls "$row" | sed 's/.*n[0-9][0-9]e/e/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
grep -vFf "$f" ${latXX}.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
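As a further variant, comm can do the same comparison without any temporary files. This is only a minimal sketch, assuming the lat*/nNNeNNN.grd layout from the revised question; the function names list_longs and missing_longs are made up for illustration and are not part of the scripts above.

```bash
#!/usr/bin/env bash
# list_longs DIR: print the eNNN part of every .grd tile in DIR, one per line, sorted
list_longs() {
    local f
    for f in "$1"/*.grd; do
        [ -e "$f" ] || continue      # skip the unexpanded glob when DIR has no tiles
        f=${f##*/}                   # strip the directory part
        f=${f%.grd}                  # strip the extension
        printf 'e%s\n' "${f##*e}"    # keep what follows the last "e"
    done | sort
}

# missing_longs REF DIR: longitudes present in REF but absent from DIR
# (comm needs sorted input, which list_longs already provides)
missing_longs() {
    comm -23 <(list_longs "$1") <(list_longs "$2")
}
```

Something like missing_longs lat36 lat37 would then print one eNNN per line for every longitude that lat37 lacks.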

As I told you in the comment, supplying test data is good practice. In this case you would have got many more answers if you had supplied a script that creates a test case, something like the following:
mkdir grid
cd grid
mkdir lat3{5..9}
#if you don't know the {5..9} expansion, simply write
#mkdir lat35 lat36 lat37 lat38 lat39
touch lat35/n35e111.grd
touch lat36/n36e11{4..9}.grd lat36/n36e120.grd
touch lat37/n37e11{4,6,8}.grd
touch lat38/n38e11{4..9}.grd
#39 missing all files
A script like this that creates a test case helps much more than a full page of words. ;) Or, if there is no script, at least supply the output of find, e.g. find grid -print. Your first edit helps a bit (I missed it), and +100 to @John1024's work.
Now about the solution.
Your final solution has one problem. What if the directory with the MOST LONGITUDES (your latXX) is missing some grid file that exists in some other directory? I.e. it has the most grid files, but still not all of them. In the test case above, lat36 contains 7 files (the most of all), but is still missing the file n36e111.grd (because 111 exists only in lat35).
Therefore I created an alternative solution, which eliminates this problem and shows the result as the following matrix:
     111  114  115  116  117  118  119  120
35:  +    no   no   no   no   no   no   no    # the 111 is here
36:  no   +    +    +    +    +    +    +     # the dir with the MOST longitudes, but missing 111
37:  no   +    no   +    no   +    no   no
38:  no   +    +    +    +    +    +    no
39:  no   no   no   no   no   no   no   no    # missing all longitudes
the script
start="./test/grid"
cd "$start" || { echo "can't cd to $start" >&2; exit 1; }
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
print_matrix() {
echo -ne "\t"
paste -s - <<<"$known_longs"
for lat in $known_lats
do
echo -en "$lat:"
for long in $known_longs
do
[[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && echo -en "\t+" || echo -en "\tno"
done
echo
done
}
print_matrix
The logic is easy:
search for all known longs, i.e. for the file names that contain eNNN
search for all known lats, i.e. for the directories named latNN
in a loop, test the existence of each file
The matrix printed above is probably not very useful, because you probably want to do something with the found or missing files, so here is an action variant of the script.
start="./test/grid"
cd "$start" || { echo "can't cd to $start" >&2; exit 1; }
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
do_if_exists() {
local xlat="$1"
local xlong="$2"
filename="n${xlat}e${xlong}.grd"
#do nothing
}
do_if_missing() {
local xlat="$1"
local xlong="$2"
filename="n${xlat}e${xlong}.grd"
echo "from lat$xlat missing $filename"
}
do_actions() {
for lat in $known_lats
do
for long in $known_longs
do
[[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && do_if_exists $lat $long || do_if_missing $lat $long
done
done
}
do_actions
This performs an action for each missing file (here it just echoes what is missing), and the output is:
from lat35 missing n35e114.grd
from lat35 missing n35e115.grd
from lat35 missing n35e116.grd
from lat35 missing n35e117.grd
from lat35 missing n35e118.grd
from lat35 missing n35e119.grd
from lat35 missing n35e120.grd
from lat36 missing n36e111.grd
from lat37 missing n37e111.grd
from lat37 missing n37e115.grd
from lat37 missing n37e117.grd
from lat37 missing n37e119.grd
from lat37 missing n37e120.grd
from lat38 missing n38e111.grd
from lat38 missing n38e120.grd
from lat39 missing n39e111.grd
from lat39 missing n39e114.grd
from lat39 missing n39e115.grd
from lat39 missing n39e116.grd
from lat39 missing n39e117.grd
from lat39 missing n39e118.grd
from lat39 missing n39e119.grd
from lat39 missing n39e120.grd
Of course, it is possible to optimise more, for example:
do the find only once (helps if the directory tree is large) by creating a list of file names with the find command
don't test each file; instead, test for the existence of the file name in the previously created list of file names
as in the following:
startdir="./test/grid"
(cd "$startdir" || { echo "can't cd to $startdir" >&2; exit 1; }
gridlist="/tmp/gridlist.$$"
trap 'rm -f "$gridlist"; exit' 0 2
# comm needs sorted input, so sort the list of existing paths
find . -regex '\./lat[0-9][0-9]*.*' -print | sort >"$gridlist"
known_longs=($(sed -n 's:^.*/n[0-9][0-9]*e\([0-9][0-9]*\)\.grd$:\1:p' "$gridlist" | sort -u))
known_lats=($(grep -oP '/lat\K\d+((?=/?)|$)' "$gridlist" | sort -u))
full_list() {
    for lat in "${known_lats[@]}"
    do
        for long in "${known_longs[@]}"
        do
            echo "./lat${lat}/n${lat}e${long}.grd"
        done
    done
}
# comm -13 keeps the lines found only in the second input: expected files that do not exist
comm -13 "$gridlist" <(full_list | sort)) | while read -r missing
do
    #do something with the missing file
    echo "$missing"
done

Related

How do you compress multiple folders at a time, using a shell?

There are n folders in the directory named after the date, for example:
20171002 20171003 20171005 ...20171101 20171102 20171103 ...20180101 20180102
tips: Dates are not continuous.
I want to compress every three folders in each month into one compression block.
For example:
tar jcvf mytar-20171002_1005.tar.bz2 20171002 20171003 20171005
How to write a shell to do this?
You need to loop over the directory names and parse each name.
prev_month=""
dir_list=()
for i in */; do
    [ -d "$i" ] || continue
    i=${i%/}
    month=${i:0:6} # here month is year plus month
    # close the current group when the month changes or it already holds 3 directories
    if [ "$month" != "$prev_month" ] || [ "${#dir_list[@]}" -eq 3 ]; then
        if [ "${#dir_list[@]}" -gt 0 ]; then
            # compress here
            dir_list=()
        fi
    fi
    dir_list+=("$i")
    prev_month=$month
done
if [ "${#dir_list[@]}" -gt 0 ]; then
    # compress here (the last group)
    :
fi
This code is not tested and may contain syntax errors. Each '# compress here' needs to be replaced by a tar command on the array dir_list, using its first and last elements to build the archive name.
Assuming you don't have too many directories (I think the limit is several hundred), then you can use Bash's array manipulation.
So, you first load all your directory names into a Bash array:
dirs=( $(ls) )
(I'm going to assume files have no spaces in their names, otherwise it gets a bit dicey)
Then you can use Bash's array slice syntax to pop 3 elements at a time from the array:
while [ "${#dirs[@]}" -gt 0 ]; do
dirs_to_compress=( "${dirs[@]:0:3}" )
dirs=( "${dirs[@]:3}" )
# do something with dirs_to_compress
done
The rest should be pretty easy.
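For what it's worth, here is one way the rest could look. It is a sketch rather than a tested solution: it assumes directory names are exactly YYYYMMDD with no spaces, and the helper names archive_name and group_by_three are invented for this example. It only prints the archive name followed by its members, so nothing is compressed until you pass each line to tar yourself.

```bash
#!/usr/bin/env bash
# archive_name DIR...: build mytar-<first>_<MMDD of last>.tar.bz2 for one group
archive_name() {
    local first=$1
    local last=${!#}                 # last positional parameter
    printf 'mytar-%s_%s.tar.bz2\n' "$first" "${last:4:4}"
}

# group_by_three DIR...: print one "archive members..." line per group of up to
# three directories; a group never spans two months (first six characters)
group_by_three() {
    local dirs=("$@") group month
    while [ "${#dirs[@]}" -gt 0 ]; do
        month=${dirs[0]:0:6}
        group=("${dirs[0]}")
        dirs=("${dirs[@]:1}")
        # keep taking directories while the group has room and the month matches
        while [ "${#dirs[@]}" -gt 0 ] && [ "${#group[@]}" -lt 3 ] &&
              [ "${dirs[0]:0:6}" = "$month" ]; do
            group+=("${dirs[0]}")
            dirs=("${dirs[@]:1}")
        done
        echo "$(archive_name "${group[@]}") ${group[*]}"
    done
}
```

Each output line could then be fed to tar, for example with while read -r archive members; do tar jcvf "$archive" $members; done.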
You can achieve this with xargs, a bash while loop, and awk:
ls | xargs -n3 | while read line; do
tar jcvf $(echo $line | awk '{print "mytar-"$1"_"substr($NF,5,4)".tar.bz2"}') $line
done
unset folders
declare -A folders
g=3
for folder in $(ls -d */); do
folders[${folder:0:6}]+="${folder%%/} "
done
for folder in "${!folders[@]}"; do
for((i=0; i < $(echo ${folders[$folder]} | tr ' ' '\n' | wc -l); i+=g)) do
group=(${folders[$folder]})
groupOfThree=(${group[@]:i:g})
tar jcvf mytar-${groupOfThree[0]}_${groupOfThree[-1]:4:4}.tar.bz2 ${groupOfThree[@]}
done
done
This script finds all folders in the current directory, separates them into groups by month, makes groups of at most three folders and creates a .tar.bz2 for each of them with the name you used in the question.
I tested it with those folders:
20171101 20171102 20171103 20171002 20171003 20171005 20171007 20171009 20171011 20171013 20180101 20180102
And the created tars are:
mytar-20171002_1005.tar.bz2
mytar-20171007_1011.tar.bz2
mytar-20171013_1013.tar.bz2
mytar-20171101_1103.tar.bz2
mytar-20180101_0102.tar.bz2
Hope that helps :)
EDIT: If you are using bash version < 4.2 then replace the line:
tar jcvf mytar-${groupOfThree[0]}_${groupOfThree[-1]:4:4}.tar.bz2 ${groupOfThree[@]}
by:
tar jcvf mytar-${groupOfThree[0]}_${groupOfThree[`expr ${#groupOfThree[@]} - 1`]:4:4}.tar.bz2 ${groupOfThree[@]}
That's because bash version < 4.2 doesn't support negative indices for arrays.

Bash: how to turn list of unique no data values into variables that can be used?

basepath=Desktop/DEM
dir=(ls -1 type -f)
cd $dir
for f in *.tif; do gdalinfo "$f" | grep -o 'NoData Value\=[-0-9]*' || echo "NoData Value=None"; done > test.txt
cat test.txt | sort | uniq > uniquenodata.txt #this is to find unique no data values in a directory
nodatalist=$(cat uniquenodata.txt)
rightnodata=-9999
I have made the BASH script above to find out the different no data values in a directory.
My goal is to have separate folders that have only one type of no data value, I need to somehow create a for loop that will convert the list of unique no data values ($nodatalist) and check each tif's no data value and send it to the corresponding folder that has these no data values. I am very new to BASH and do not know how to turn a list of values into a variable that can be used in a for loop.
A more efficient approach is to move the files immediately. Create the destination directory if it doesn't exist.
for f in *.tif; do
i=$(gdalinfo "$f" | grep -o 'NoData Value=[-0-9]*') && d=${i#NoData Value=} || d="None"
mkdir -p "$d"
mv "$f" "$d"/
done
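To make the parameter expansion in that loop easier to try out on its own, here is a small sketch that separates the parsing from the moving. It assumes gdalinfo prints a line containing "NoData Value=..." as in the question's grep pattern; the function name nodata_dir is hypothetical, and allowing a decimal point in the value is my own addition.

```bash
#!/usr/bin/env bash
# nodata_dir: read gdalinfo-style text on stdin and print the directory name
# to use for that file ("None" when no NoData value is reported)
nodata_dir() {
    local line
    # take the first match only; the dot in the bracket also allows values like -3.4
    line=$(grep -o 'NoData Value=[-0-9.]*' | head -n 1)
    if [ -n "$line" ]; then
        printf '%s\n' "${line#NoData Value=}"   # strip the label, keep the value
    else
        echo "None"
    fi
}
```

In the loop above, d=$(gdalinfo "$f" | nodata_dir) would then replace the grep line.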
As an aside, these lines look like a syntax error:
dir=(ls -1 type -f)
cd $dir
This will effectively cd ls if you have a directory by this name, because dir becomes an array whose first element is ls. Maybe you actually mean find -type f but this obviously doesn't produce a directory (-type f specifically selects regular files which aren't directories).
You can use variable indirection, a variable whose value is the name of another variable
for d in "${nodatalist[@]}"; do
echo "${!d}"
done
As this example illustrates
declare -a a=("b" "c" "d");
b=1
c=2
d=3
for i in "${a[@]}"; do
echo "name: $i, value: ${!i}"
done
Output:
name: b, value: 1
name: c, value: 2
name: d, value: 3
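If the goal is simply to loop over the unique values, reading the file into an array may be all that is needed. A minimal sketch, assuming uniquenodata.txt holds one "NoData Value=..." line per value as produced by the question's script; the function name read_values and the array name values are made up here.

```bash
#!/usr/bin/env bash
# read_values FILE: load one value per line of FILE into the global array "values",
# dropping the "NoData Value=" label and skipping empty lines
read_values() {
    values=()
    local line
    while IFS= read -r line; do
        if [ -n "$line" ]; then
            values+=("${line#NoData Value=}")
        fi
    done < "$1"
}
```

After read_values uniquenodata.txt, a loop such as for v in "${values[@]}"; do mkdir -p "$v"; done would create one folder per value.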

Finding the file name in a directory with a pattern

I need to find the latest file - filename_YYYYMMDD in the directory DIR.
The below is not working as the position is shifting each time because of the spaces between(occurring mostly at file size field as it differs every time.)
please suggest if there is other way.
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | cut -d " " -f9`
You can use awk to print the last field, as below:
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | awk '{print $NF}'`
Cut may not be an option here
If I understand correctly, you want to loop through each file in the directory and find the largest 'YYYYMMDD' value and the file name associated with it. You can use simple POSIX parameter expansion with substring removal to isolate the 'YYYYMMDD' and compare it against a variable initialized to zero, updating that variable to hold the largest 'YYYYMMDD' as you loop over all files in the directory. Each time you find a larger 'YYYYMMDD', store the name of the file.
For example, you could do something like:
#!/bin/sh
name=
latest=0
for i in *; do
test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }
done
printf "%s\n" "$name"
Example Directory
$ ls -1rt
filename_20120615
filename_20120612
filename_20120115
filename_20120112
filename_20110615
filename_20110612
filename_20110115
filename_20110112
filename_20100615
filename_20100612
filename_20100115
filename_20100112
Example Use/Output
$ name=; latest=0; \
> for i in *; do \
> test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }; \
> done; \
> printf "%s\n" "$name"
filename_20120615
Where the script selects filename_20120615 as the file with the greatest 'YYYYMMDD' of all files in the directory.
Since you are using only tools provided by the shell itself, it doesn't need to spawn subshells for each pipe or utility it calls.
Give it a test and let me know if that is what you intended, let me know if your intent was different, or if you have any further questions.
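A shorter variant is possible in bash, because fixed-width YYYYMMDD dates sort the same way as text and the shell returns glob matches in sorted order. This is a sketch under those two assumptions, not a drop-in replacement; the function name latest_file is invented for the example.

```bash
#!/usr/bin/env bash
# latest_file DIR PREFIX: print the PREFIX_YYYYMMDD file in DIR with the greatest date
latest_file() {
    local files=("$1"/"$2"_[0-9]*)            # pathname expansion is sorted
    [ -e "${files[0]}" ] || return 1          # no match: the glob stayed literal
    printf '%s\n' "${files[${#files[@]}-1]}"  # last element = greatest date
}
```

It would be called as report=$(latest_file "$DIR" filename).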

find only the first file from many directories

I have a lot of directories:
13R
613
AB1
ACT
AMB
ANI
Each directory contains a lot of files:
20140828.13R.file.csv.gz
20140829.13R.file.csv.gz
20140830.13R.file.csv.gz
20140831.13R.file.csv.gz
20140901.13R.file.csv.gz
20131114.613.file.csv.gz
20131115.613.file.csv.gz
20131116.613.file.csv.gz
20131117.613.file.csv.gz
20141114.ab1.file.csv.gz
20141115.ab1.file.csv.gz
20141116.ab1.file.csv.gz
20141117.ab1.file.csv.gz
etc..
The purpose is to get the first file from each directory.
The result I expect is:
13R|20140828
613|20131114
AB1|20141114
Which is the name of the directories pipe the date from the filename.
I guess I need find, head and awk, but I can't make it work; I need your help.
Here is what I have tested:
for f in $(ls -1);do ls -1 $f/ | head -1;done
But the folder name is missing.
By the first file, I mean the first file returned in alphabetical order within the folder.
Thanks.
You can do this with a Bash loop.
Given:
/tmp/test
/tmp/test/dir_1
/tmp/test/dir_1/file_1
/tmp/test/dir_1/file_2
/tmp/test/dir_1/file_3
/tmp/test/dir_2
/tmp/test/dir_2/file_1
/tmp/test/dir_2/file_2
/tmp/test/dir_2/file_3
/tmp/test/dir_3
/tmp/test/dir_3/file_1
/tmp/test/dir_3/file_2
/tmp/test/dir_3/file_3
/tmp/test/file_1
/tmp/test/file_2
/tmp/test/file_3
Just loop through the directories and form an array from a glob and grab the first one:
prefix="/tmp/test"
cd "$prefix"
for fn in dir_*; do
cd "$prefix"/"$fn"
arr=(*)
echo "$fn|${arr[0]}"
done
Prints:
dir_1|file_1
dir_2|file_1
dir_3|file_1
If your definition of 'first' is different than Bash's, just sort the array arr according to your definition before taking the first element.
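Adapted to the output format asked for in the question (directory name, a pipe, then the date from the file name), the same idea could look like the sketch below. It assumes every file name starts with the date followed by a dot; the function name first_dates is hypothetical.

```bash
#!/usr/bin/env bash
# first_dates ROOT: for each subdirectory of ROOT print "DIR|DATE", where DATE is
# the part before the first dot of the alphabetically first file in DIR
first_dates() {
    local dir files name
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        files=("$dir"*)                      # sorted glob; first element is the first file
        [ -e "${files[0]}" ] || continue     # skip empty directories
        dir=${dir%/}
        name=${files[0]##*/}
        printf '%s|%s\n' "${dir##*/}" "${name%%.*}"
    done
}
```

Run as first_dates . from the directory that holds 13R, 613 and so on.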
You can also do this with find and awk:
$ find /tmp/test -mindepth 2 -print0 | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'
/tmp/test/dir_1/file_1
/tmp/test/dir_2/file_1
/tmp/test/dir_3/file_1
And insert a sort (or use gawk) to sort as desired
sort has a unique option (-u). Only the directory should be unique, so use the first field as the sort key with -k1,1. The solution works when the list of files is sorted already.
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
You will need to change the sed command when the date field may be followed by another number.
This works for me:
for dir in $(find "$FOLDER" -type d); do
FILE=$(ls -1 -p $dir | grep -v / | head -n1)
if [ ! -z "$FILE" ]; then
echo "$dir/$FILE"
fi
done

Recursively check length of directory name

I need to determine if there are any directory names > 31 characters in a given directory (i.e. look underneath that root).
I know I can use something like find /path/to/root/dir -type d >> dirnames.txt
This will give me a text file of complete paths.
What I need is to get the actual number of characters in each directory name. Not sure if parsing the above results w/sed or awk makes sense. Looking for ideas/thoughts/suggestions/tips on how to accomplish this. Thanks!
This short script does it all in one go, i.e. finds all directory names and then outputs any which are greater than 31 characters in length (along with their length in characters):
for d in `find /path/to/root/dir -type d -exec basename {} \;` ; do
    len=${#d} # length of the name; echo | wc -c would also count the trailing newline
    if [ "$len" -gt 31 ] ; then
        echo "$d = $len characters"
    fi
done
Using your dirnames.txt file created by your find cmd, you can then sort the data by length of pathname, i.e.
awk '{print length($0) "\t" $0}' dirnames.txt | sort -k1,1nr > dirNamesWithSize.txt
This will present the longest path names (based on the value of length) at the top of the file.
I hope this helps.
Try this
find . -type d -exec bash -c '[ $(wc -c <<<"${1##*/}") -gt 32 ] && echo "${1}"' -- {} \; 2>/dev/null
The one bug, which I consider minor, is that it will over-count directory name length by 1 every time.
If what you wanted was the whole path rather than the last path component, then use this:
find . -type d | sed -e '/.\{32,\}/!d'
This version also has a bug, but only when file names have embedded newlines.
The output of both commands is a list of file names which match the criteria. Counting the length of each one is trivial from there.
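If both the off-by-one and the newline problem matter, the length can be taken with ${#name} while reading NUL-separated paths from find. The following is a sketch that assumes bash and a find that supports -print0; the function name long_dir_names is made up for this example.

```bash
#!/usr/bin/env bash
# long_dir_names ROOT [MAX]: print "LENGTH<TAB>PATH" for every directory under ROOT
# whose own name (last path component) is longer than MAX characters (default 31)
long_dir_names() {
    local max=${2:-31} path name
    while IFS= read -r -d '' path; do
        name=${path##*/}                     # last path component only
        if [ "${#name}" -gt "$max" ]; then
            printf '%d\t%s\n' "${#name}" "$path"
        fi
    done < <(find "$1" -type d -print0)      # NUL separators survive odd names
}
```

For example, long_dir_names /path/to/root/dir lists only the offending directories together with their name lengths.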

Resources