How to find files that contain MULTIPLE newlines at their end? - macos

I want to find all files that have multiple new line characters at the end of their content.
How is this possible?

This bash command prints all files in current directory and its subdirectories that terminate with at least one empty line at the end after a sequence of one or more lines (i.e. at least a sequence of two \n):
find . -type f -print | while read -r a; do tail -n 2 "$a" | ( read -r x && read -r y && [ x"$y" = x ] && echo "$a" ); done
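As a complement to the line-based check, here is a minimal sketch that looks only at the last two bytes of each file. The function name list_trailing_blank is made up for illustration, and the sketch assumes Unix line endings (a file ending in \r\n\r\n would not be reported).

```shell
#!/bin/sh
# Print every regular file under DIR (default: .) whose last two bytes
# are both newlines, i.e. that ends with at least one empty line.
# list_trailing_blank is a hypothetical helper name.
list_trailing_blank() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            # Dump the final two bytes as hex; 0a0a means two newlines.
            last=$(tail -c 2 "$f" | od -An -tx1 | tr -d " \n")
            if [ "$last" = "0a0a" ]; then
                printf "%s\n" "$f"
            fi
        done
    ' sh {} +
}
```

Calling list_trailing_blank . searches the current directory; because file names are passed as arguments rather than read line by line, names containing spaces should be handled too.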

FYI: It's possible to search for this with PhpStorm using the RegEx search term \n+\n\Z

Related

Rename multiple datetime files in Unix by inserting - and _ characters

I have many files in a directory that I want to rename so that they are recognizable according to a certain convention:
SURFACE_OBS:2019062200
SURFACE_OBS:2019062206
SURFACE_OBS:2019062212
SURFACE_OBS:2019062218
SURFACE_OBS:2019062300
etc.
How can I rename them in UNIX to be as follows?
SURFACE_OBS:2019-06-22_00
SURFACE_OBS:2019-06-22_06
SURFACE_OBS:2019-06-22_12
SURFACE_OBS:2019-06-22_18
SURFACE_OBS:2019-06-23_00
A bash shell loop using mv and parameter expansion could do it:
for file in *:[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]]
do
prefix=${file%:*}
suffix=${file#*:}
mv -- "${file}" "${prefix}:${suffix:0:4}-${suffix:4:2}-${suffix:6:2}_${suffix:8:2}"
done
This loop picks up every file that matches the pattern:
* -- anything
: -- a colon
[[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]][[:digit:]] -- 10 digits
... and then renames it by inserting dashes and an underscore in the desired locations.
I've chosen the wildcard for the loop carefully so that it tries to match the "input" files and not the renamed files. Adjust the pattern as needed if your actual filenames have edge cases that cause the wildcard to fail (and thus rename the files a second time).
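Before renaming anything, it may help to preview the result. The following is only a sketch: the helper names new_name and preview_renames are invented here, and it uses sed instead of substring expansion so that it also runs in a plain POSIX shell.

```shell
#!/bin/sh
# new_name NAME: print NAME with the 10 trailing digits after the colon
# rewritten as YYYY-MM-DD_HH; names that do not match are printed unchanged.
# (hypothetical helper, not part of the loop above)
new_name() {
    printf '%s\n' "$1" |
        sed 's/:\([0-9]\{4\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)$/:\1-\2-\3_\4/'
}

# preview_renames DIR: print the mv commands that would be run, without running them.
preview_renames() {
    for file in "$1"/*:[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -e "$file" ] || continue   # the glob matched nothing
        printf 'mv -- %s %s\n' "$file" "$(new_name "$file")"
    done
}
```

Running preview_renames . lists one mv line per matching file; already renamed files are skipped because they no longer end in ten consecutive digits.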
#!/bin/bash
strindex() {
# get position of character in string
x="${1%%"$2"*}"
[[ "$x" = "$1" ]] && echo -1 || echo "${#x}"
}
get_new_filename() {
# change filenames like: SURFACE_OBS:2019062218
# into filenames like: SURFACE_OBS:2019-06-22_18
src_str="${1}"
# add last underscore 2 characters from end of string
final_underscore_pos=${#src_str}-2
src_str="${src_str:0:final_underscore_pos}_${src_str:final_underscore_pos}"
# get position of colon in string
colon_pos=$(strindex "${src_str}" ":")
# get dash locations relative to colon position
y_dash_pos=${colon_pos}+5
m_dash_pos=${colon_pos}+8
# now add dashes in date
src_str="${src_str:0:y_dash_pos}-${src_str:y_dash_pos}"
src_str="${src_str:0:m_dash_pos}-${src_str:m_dash_pos}"
echo "${src_str}"
}
# accept path as argument or default to /tmp/baz/data
target_dir="${1:-/tmp/baz/data}"
while read -r line ; do
# since file renaming depends on position of colon extract
# base filename without path in case path has colons
base_dir=${line%/*}
filename_to_change=$(basename "${line}")
echo "mv ${line} ${base_dir}/$(get_new_filename "${filename_to_change}")"
# find cmd excludes files that have already been renamed
# (-name takes a shell glob, not a regular expression)
done < <(find "${target_dir}" -type f -name 'SURFACE*' ! -name '*_[0-9][0-9]')

Calling external command in find and using pipe

I am wondering if there is a way to search for all the files from a certain directory including subdirectories using a find command on AIX 6.x, before calling an external command (e.g. hlcat) to display/convert them into a readable format, which can then be piped through a grep command to find a pattern instead of using loops in the shell?
e.g. find . -type f -name "*.hl7" -exec hlcat {} | grep -l "pattern" \;
The above command does not work, so I have to use a while loop to display the content and search for the pattern as follows:
find . -type f -name "*.hl7" -print | while read file; do
hlcat $file | grep -l "pattern";
done
At the same time, these HL7 files have been renamed with round brackets, which prevents them from being opened unless the file name is wrapped in double quotes.
e.g. hlcat (patient) filename.hl7 will fail to open.
hlcat "(patient) filename.hl7" will work.
In short, I am looking for a clean, concise one-liner that uses find to view and search the content of these HL7 files with round brackets in their names.
Many thanks,
George
P.S. HL7 raw data is made up of one continuous line and is not readable unless it is converted into a workable reading format using tools such as hlcat.
Update: The easy way
find . -type f -name '*.hl7' -exec grep -iEl 'Barry|Jolene' {} +
note: You may get some false positives though. See below for a targeted search.
Searching for a first name in a bunch of HL7v2 files:
1. Looking into the HL7v2 file format
Example of HL7v2 PID segment:
PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M|||
PID Segment decomposition:
Seq  NAME                HHIC USE            LEN
0    PID keyword         Segment Type        3
3    Patient ID          Medical Record Num  250
5    Patient Name        Last^First^Middle   250
7    Date/Time Of Birth  YYYYMMDD            26
8    Sex                 F, M, or U          1
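To make the field positions concrete, here is a small sketch that pulls the last and first name out of a PID segment with awk. The function name pid_names is invented for illustration, and the sketch assumes each segment sits on its own line.

```shell
#!/bin/sh
# pid_names: read HL7v2 text on stdin and print "Last First" for each PID segment.
# Splitting on "|" makes the "PID" keyword field 1, so Seq 5 (Patient Name) is $6.
pid_names() {
    awk -F'|' '$1 == "PID" { split($6, name, "^"); print name[1], name[2] }'
}

# Demo on the sample segment shown earlier.
pid_names <<'EOF'
PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M|||
EOF
```

For the sample segment this prints KLEINSAMPLE BARRY.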
2. Writing targeted searches
With grep (AIX):
find . -type f -name '*.hl7' -exec grep -iEl '^PID\|([^|]*\|){4}[^^|]*\^(Barry|Jolene)\^' {} +
With awk:
find . -type f -name '*.hl7' -exec awk -v firstname='^(Barry|Jolene)$' '
BEGIN { FS="|" }
FNR == 1 { if( found ) print filename; found = 0; filename = FILENAME }
$1 == "PID" { split($6, name, "^"); if (toupper(name[2]) ~ toupper(firstname)) { found = 1 } }
END { if ( found ) print filename }
' {} +
remark: The good part about this awk solution is that you pass the first name regexp as an argument. This solution is easily extendable, for example for searching the last name.
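If the files really do have to pass through a converter such as hlcat before grep can read them, the pipe can live inside a small shell started by find. This is a sketch under assumptions: search_converted is a made-up name, and the converter is called with just the file name, as in the question.

```shell
#!/bin/sh
# search_converted CONVERTER PATTERN [DIR]
# Print the name of every *.hl7 file under DIR whose converted output
# matches PATTERN. File names are passed as arguments, so round brackets
# and spaces in names need no extra quoting by the caller.
search_converted() {
    conv=$1 pat=$2 dir=${3:-.}
    find "$dir" -type f -name '*.hl7' -exec sh -c '
        conv=$1; pat=$2; shift 2
        for f do
            if "$conv" "$f" | grep -q -- "$pat"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$conv" "$pat" {} +
}
```

A call would look like search_converted hlcat 'pattern' . with hlcat replaced by whatever converter is installed.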

Indexing files and parsing name

I have a directory, ./grd_files/lat36/, that has 7 files in it (n36e114.grd, n36e115.grd, n36e116.grd, n36e117.grd, n36e118.grd, n36e119.grd, n36e120.grd). Also beneath ./grd_files/ are other folders named lat37, lat38, lat39. Each contains some files named in the same format as those in lat36, only instead of n36e114.grd, the file for the e114 longitude in the lat37 folder would be called n37e114.grd. Now, not all lat** folders contain all the longitudes, but I need them to.
I have written a part of the script to determine which lat** folder has the most columns in it (it is lat36 with 7 longitudes). I want to compare the longitudes that exist in lat36 folder to the other folders, and if a column is missing in another folder, I will make it. I can handle the if then statement, but I am stumped on how to compare the lists in bash.
I was thinking of making a list of the file names in the row1 folder and comparing that to the files in the other folders, but the names won't and shouldn't match; only the column part of the name will and should match. So far I have tried to make an array of the file names and then parse it for just the column part of the name. Note that these are actually map tiles, so the names are really in the format of coordinates in northing (row) and easting (col), e.g. n36e114.grd. So I want to isolate all the e114 style parts of the names and make sure that they exist in the other rows. I hope that makes sense. Below is what I attempted, but I am not great at bash syntax so I'm stumped. Thanks so much for the help.
col_list_raw=( $(find $maxdirectory -name ".grd" -exec basename {} .grd \;) )
col_list=( for c in ${col_list_raw[@]}; do echo ${col_list_raw[$c]:3:7}; done )
where $maxdirectory is the one with the most columns.*
UPDATE: I have removed what I described in italics above and attempted to incorporate the solution from John1024. Below is the code.
cd ./grd_files
for row in lat*/
do
ls "$row" | sed 's/.*lon/lon/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
grep -vFf "$f" ${latXX}.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
Where latXX is the folder with the most longitudes. John1024's first loop works nicely, and I get the correct lists for each of the lat** folders, but the second loop straight up compares the lists , returning:
lat37 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat38 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
lat39 is missing n36e114.grd n36e115.grd n36e116.grd n36e117.grd n36e118.grd n36e119.grd n36e120.grd
I need that loop to compare only part of the file name, i.e. I want to check each folder for the existence of each longitude, so that if file n37e114.grd exists, nothing happens, but if it does not exist, that information is returned and I can execute a command based on the missing file. I hope my edits clear up the naming convention and are understandable. Thanks again for the help. AM
SOLUTION:
Thanks to the help of @John1024 I was able to find a solution. I have reproduced the final solution below. Following this, I read in the *.out files and run my command on each line of them.
cd ./grd_files
for lat in */
do
ls "$lat" | sed 's/[a-z][1-9][1-9].*\([a-z][0-9][0-9]*\).grd/\1/' >"${lat%/}.tmp"
done
for file in *.tmp
do
lat=$(echo $file | awk -F "." '{print $1}')
grep -vFf "$file" ${xXX}.tmp >${lat}missing.out
[ -s ${lat}missing.out ] && echo ${file%.tmp} is missing $(cat ${lat}missing.out)
done
The question includes two different naming schemes for the files. Both would work the same, but to keep it simple and intuitive, this answer uses the first scheme.
It is possible to loop through bash arrays to find the missing columns. However, grep is well-suited to this task, greatly simplifies the logic, and, if there are many columns and rows, it is likely much faster. Using grep:
cd ./grd_files
for row in row*/
do
ls "$row" | sed 's/.*col/col/' >"${row%/}.tmp"
done
for f in row*.tmp
do
grep -vFf "$f" row1.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
The first loop above creates lists of the columns that exist in each of the rows. These lists are saved in temporary files named row1.tmp, row2.tmp, etc.
The second loop compares each of those lists to the reference row, row1.tmp. The list of columns missing from that row is saved in the temporary file missing.tmp. If missing.tmp has a nonzero size, then there are missing columns and a report is generated.
For cleanup, one might want to delete the tmp files. If so, add this line to the end of the script:
rm row*.tmp missing.tmp
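One possible refinement, offered as a sketch: with -F alone, grep treats each pattern as a substring, so if one list entry could be contained in another (say col1 and col10, without a suffix), adding -x to require whole-line matches avoids false matches. The helper name missing_from is invented for illustration.

```shell
#!/bin/sh
# missing_from HAVE REF: print the lines of REF that do not appear,
# as whole lines, in HAVE. Both files hold one column name per line.
#   -F  fixed strings, -x  whole-line match, -v  invert, -f  patterns from file
missing_from() {
    grep -vxFf "$1" "$2"
}
```

In the second loop this would be used as missing_from "$f" row1.tmp >missing.tmp.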
Fancier version
Using process substitution, the need for many of the temporary files can be eliminated:
trap "rm missing.tmp" EXIT
for row in row*/
do
ls row1/ | sed 's/.*col/col/' | grep -vFf <(ls "$row" | sed 's/.*col/col/') >missing.tmp
[ -s missing.tmp ] && echo $row is missing $(cat missing.tmp)
done
This version also uses trap to assure that the sole remaining temporary file is removed when the script is finished.
Using the other naming scheme as per revised question
cd ./grd_files
for row in lat*/
do
ls "$row" | sed 's/.*n[0-9][0-9]e/e/' >"${row%/}.tmp"
done
for f in lat*.tmp
do
grep -vFf "$f" ${latXX}.tmp >missing.tmp
[ -s missing.tmp ] && echo ${f%.tmp} is missing $(cat missing.tmp)
done
cd ..
As I told you in the comment, supplying test data is good practice. In this case you would have got many more answers if you had supplied a script that creates a test case, something like the following:
mkdir grid
cd grid
mkdir lat3{5..9}
#if you don't know the {5..9} expansion, simply write
#mkdir lat35 lat36 lat37 lat38 lat39
touch lat35/n35e111.grd
touch lat36/n36e11{4..9}.grd lat36/n36e120.grd
touch lat37/n37e11{4,6,8}.grd
touch lat38/n38e11{4..9}.grd
#39 missing all files
A script that creates a test case like this helps much more than a full page of words. ;) Or, if no script, at least supply the output of find, like find grid -print. Your first edit helps a bit (I missed it), and +100 to @John1024's work.
Now about the solution.
Your final solution has one problem. What if the directory with the MOST LONGITUDES (your latXX) is missing some grid file that exists in some other directory? E.g. it has the most grid files, but still not all of them. In the above test case, lat36 contains 7 files (the most of all), but is still missing the file n36e111.grd (because 111 exists only in lat35).
Therefore I created an alternative solution, which eliminates this problem and shows the result as the following matrix:
      111  114  115  116  117  118  119  120
35:   +    no   no   no   no   no   no   no    # the 111 is here
36:   no   +    +    +    +    +    +    +     # the dir with the MOST longitudes, but missing 111
37:   no   +    no   +    no   +    no   no
38:   no   +    +    +    +    +    +    no
39:   no   no   no   no   no   no   no   no    # missing all longitudes
the script
start="./test/grid"
cd "$start" || { echo "can't cd to $start" >&2; exit 1; }
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
print_matrix() {
echo -ne "\t"
paste -s - <<<"$known_longs"
for lat in $known_lats
do
echo -en "$lat:"
for long in $known_longs
do
[[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && echo -en "\t+" || echo -en "\tno"
done
echo
done
}
print_matrix
The logic is simple:
find all known longitudes, i.e. the eNNN part of the file names
find all known latitudes, i.e. the directories named latNN
loop over every combination and test whether the file exists
The matrix printed above is probably not very useful, because you probably want to do something with the found or missing files, so here is an action variant of the script.
start="./test/grid"
cd "$start" || { echo "can't cd to $start" >&2; exit 1; }
known_longs=$(find . -type f -name \*.grd -print | sed 's:.*/n.*e\([0-9][0-9]*\)\.grd:\1:' | sort -u)
known_lats=$(find . -type d -print | grep -oP 'lat\K\d+(?=/?)' | sort -u)
do_if_exists() {
local xlat="$1"
local xlong="$2"
filename="n${xlat}e${xlong}.grd"
#do nothing
}
do_if_missing() {
local xlat="$1"
local xlong="$2"
filename="n${xlat}e${xlong}.grd"
echo "from lat$xlat missing $filename"
}
do_actions() {
for lat in $known_lats
do
for long in $known_longs
do
[[ -e "./lat${lat}/n${lat}e${long}.grd" ]] && do_if_exists $lat $long || do_if_missing $lat $long
done
done
}
do_actions
This performs an action for each missing file (here it just echoes what is missing), and the output is as follows:
from lat35 missing n35e114.grd
from lat35 missing n35e115.grd
from lat35 missing n35e116.grd
from lat35 missing n35e117.grd
from lat35 missing n35e118.grd
from lat35 missing n35e119.grd
from lat35 missing n35e120.grd
from lat36 missing n36e111.grd
from lat37 missing n37e111.grd
from lat37 missing n37e115.grd
from lat37 missing n37e117.grd
from lat37 missing n37e119.grd
from lat37 missing n37e120.grd
from lat38 missing n38e111.grd
from lat38 missing n38e120.grd
from lat39 missing n39e111.grd
from lat39 missing n39e114.grd
from lat39 missing n39e115.grd
from lat39 missing n39e116.grd
from lat39 missing n39e117.grd
from lat39 missing n39e118.grd
from lat39 missing n39e119.grd
from lat39 missing n39e120.grd
Of course, further optimisation is possible, for example:
run find only once and save the list of file names (this helps if the directory tree is large)
don't test each file on disk; instead check whether its name is in the previously created list
as in the following:
startdir="./test/grid"
(cd "$startdir" || { echo "can't cd to $startdir" >&2; exit 1; }
gridlist="/tmp/gridlist.$$"
trap 'rm -f "$gridlist"; exit' 0 2
# run find only once; comm needs sorted input
find . -regex '\./lat[0-9][0-9]*.*' -print | sort >"$gridlist"
known_longs=($(sed -n 's:^.*/n[0-9][0-9]*e\([0-9][0-9]*\)\.grd$:\1:p' "$gridlist" | sort -u))
known_lats=($(grep -oP '/lat\K\d+((?=/?)|$)' "$gridlist" | sort -u))
full_list() {
for lat in "${known_lats[@]}"
do
for long in "${known_longs[@]}"
do
echo "./lat${lat}/n${lat}e${long}.grd"
done
done
}
# lines only in the full list are the missing files
comm -13 "$gridlist" <(full_list | sort)) | while read -r missing
do
# do something with the missing file
echo "$missing"
done

Efficient way to find paths from a list of filenames

From a list of file names stored in a file f, what's the best way to find the relative path of each file name under dir, outputting this new list to file p? I'm currently using the following:
while read name
do
find dir -type f -name "$name" >> p
done < f
which is too slow for a large list, or a large directory tree.
EDIT: A few numbers:
Number of directories under dir: 1870
Number of files under dir: 80622
Number of filenames in f: 73487
All files listed in f do exist under dir.
The following piece of python code does the trick. The key is to run find once and store the output in a hashmap to provide an O(1) way to get from file_name to the list of paths for the filename.
#!/usr/bin/env python3
import os
import subprocess

# File names to look up, one per line in f
with open("f") as names_file:
    file_names = names_file.read().splitlines()

# Run find once and keep every path it reports
file_paths = subprocess.check_output(["find", ".", "-type", "f"], text=True).splitlines()

# Map each base name to the list of paths that carry it
file_names_to_paths = {}
for file_path in file_paths:
    file_name = os.path.basename(file_path)
    file_names_to_paths.setdefault(file_name, []).append(file_path)

# Write every path whose base name is listed in f
with open("p", "w") as out_file:
    for file_name in file_names:
        for path in file_names_to_paths.get(file_name, []):
            out_file.write(path + "\n")
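The same single-pass lookup idea can also be sketched without leaving the shell, using an awk array as the hash map. The function name paths_for_names is invented for illustration; the sketch assumes no file name contains a newline.

```shell
#!/bin/sh
# paths_for_names LIST DIR: print every file path under DIR whose base name
# appears as a line in LIST. find runs once; awk does the constant-time lookups.
paths_for_names() {
    find "$2" -type f | awk -F/ -v list="$1" '
        # Load the wanted names into an associative array.
        BEGIN { while ((getline line < list) > 0) want[line] = 1 }
        # With "/" as field separator, $NF is the base name of the path.
        ($NF in want)
    '
}
```

For the question's setup this would be called as paths_for_names f dir > p.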
Try this perl one-liner
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),<$p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
1- create a hashmap whose keys are filenames : %H=map{chomp;$_=>1}<>
2- define a recursive subroutine to traverse directories : sub R{}
2.1- recursive call for directories : map R($_),<$p/*> if -d$p
2.2- extract the filename from the path : ($b=$p)=~s|.*/||
2.3- print if hashmap contains filename : print"$p\n" if$H{$b}
3- call R with the current directory as path : R"."
EDIT : to traverse hidden directories (.*)
perl -e '%H=map{chomp;$_=>1}<>;sub R{my($p)=@_;map R($_),grep !m|/\.\.?$|,<$p/.* $p/*> if -d$p;($b=$p)=~s|.*/||;print"$p\n" if$H{$b}}R"."' f
I think this should do the trick:
xargs locate -b < f | grep ^dir > p
Edit: I can't think of an easy way to prefix dir/*/ to the list of file names, otherwise you could just pass that directly to xargs locate.
Depending on what percentage of the directory tree is considered a match, it might be faster to find every file, then grep out the matching ones:
find "$dir" -type f | grep -f <( sed 's+\(.*\)+/\1$+' "$f" )
The sed command pre-processes your list of file names into regular expressions that will only match full names at the end of a path.
Here is an alternative using bash and grep
#!/bin/bash
flist(){
for x in "$1"/*; do #*/ for markup
[ -d "$x" ] && flist "$x" || echo "$x"
done
}
dir=/etc #the directory you are searching
list=$(< myfiles) #the file with file names
#format the list for grep
list="/${list//
/\$\|/}"
flist "$dir" | grep "$list"
...if you need full posix shell compliance (busybox ash, hush, etc...) replace the $list substring manipulation with a variant of chepner's sed and replace $(< file) with $(cat file)

Recursively check length of directory name

I need to determine if there are any directory names > 31 characters in a given directory (i.e. look underneath that root).
I know I can use something like find /path/to/root/dir -type d >> dirnames.txt
This will give me a text file of complete paths.
What I need is to get the actual number of characters in each directory name. Not sure if parsing the above results w/sed or awk makes sense. Looking for ideas/thoughts/suggestions/tips on how to accomplish this. Thanks!
This short script does it all in one go, i.e. finds all directory names and then outputs any which are greater than 31 characters in length (along with their length in characters):
find /path/to/root/dir -type d -exec basename {} \; | while IFS= read -r d ; do
# ${#d} is the length of the name; wc -c would also count the trailing newline
len=${#d}
if [ "$len" -gt 31 ] ; then
echo "$d = $len characters"
fi
done
Using your dirnames.txt file created by your find cmd, you can then sort the data by length of pathname, i.e.
awk '{print length($0) "\t" $0}' dirnames.txt | sort -k1,1nr > dirNamesWithSize.txt
This will present the longest path names (based on the value of length) at the top of the file.
I hope this helps.
Try this
find . -type d -exec bash -c '[ $(wc -c <<<"${1##*/}") -gt 32 ] && echo "${1}"' -- {} \; 2>/dev/null
The one bug, which I consider minor, is that it will over-count directory name length by 1 every time.
If what you wanted was the whole path rather than the last path component, then use this:
find . -type d | sed -e '/.\{32,\}/!d'
This version also has a bug, but only when file names have embedded newlines.
The output of both commands is a list of file names which match the criteria. Counting the length of each one is trivial from there.
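For that last step, a minimal sketch that prints the length next to each offending directory might look like this. The name long_dir_names and the default limit of 31 are illustrative, and ${#name} counts characters or bytes depending on the shell and locale.

```shell
#!/bin/sh
# long_dir_names ROOT [LIMIT]: print "<length><TAB><path>" for every directory
# under ROOT whose last path component is longer than LIMIT (default 31).
long_dir_names() {
    root=$1 limit=${2:-31}
    find "$root" -type d -exec sh -c '
        limit=$1; shift
        for d do
            name=${d##*/}            # strip everything up to the last slash
            if [ "${#name}" -gt "$limit" ]; then
                printf "%s\t%s\n" "${#name}" "$d"
            fi
        done
    ' sh "$limit" {} +
}
```

Because the paths are handed over as arguments instead of being read line by line, embedded newlines in names should not split an entry.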

Resources