Using 'find' to select unknown patterns in file names with bash

Let's say I have a directory with 4 files in it.
path/to/files/1_A
path/to/files/1_B
path/to/files/2_A
path/to/files/2_B
I want to create a loop which, on each iteration, does something with two files: a matching X_A and X_B. I need to know how to find these files, which sounds simple enough using pattern matching. The problem is that there are too many files and I do not know the prefixes, i.e. the patterns (1_ and 2_ in the example). Is there some way to group files in a directory based on the first few characters of the filename? (Ultimately to store in a variable to be used in a loop.)

You could get all the 3-character prefixes by printing out all the file names, trimming them to three characters, and then getting the unique strings.
find -printf '%f\n' | cut -c -3 | sort -u
Then if you wanted to loop over each prefix, you could write a loop like:
find -printf '%f\n' | cut -c -3 | sort -u | while IFS= read -r prefix; do
    echo "Looking for $prefix*..."
    find -name "$prefix*"
done
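Building on that, if each prefix pairs exactly one X_A with one X_B file, a minimal sketch of the full loop might look like this (here cut -c -2 captures the two-character prefixes 1_ and 2_ from the example; adjust the range to however many leading characters identify a pair, and replace the echo with the real processing):
find . -maxdepth 1 -type f -printf '%f\n' | cut -c -2 | sort -u |
while IFS= read -r prefix; do
    file_a="./${prefix}A"
    file_b="./${prefix}B"
    echo "would process $file_a and $file_b"   # placeholder for the real work
done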

Related

How do I filter down a subset of files based upon time?

Let's assume I have done lots of work whittling down a list of files in a directory down to the 10 files that I am interested in. There were hundreds of files, and I have finally found the ones I need.
I can either pipe out the results of this (piping from ls), or I can say I have an array of those values (doing this inside a script). Doesn't matter either way.
Now, of that list, I want to find only the files that were created yesterday.
We can use tools like find -mtime 1 which are fine. But how would we do that with a subset of files in a directory? Can we pass a subset to find via xargs?
I can do this pretty easily with a for loop. But I was curious if you smart people knew of a one-liner.
If they're in an array:
files=(...)
find "${files[#]}" -mtime 1
If they're being piped in:
... | xargs -d'\n' -I{} find {} -mtime 1
Note that the second one will run a separate find command for each file which is a bit inefficient.
If any of the items are directories and you don't want to search inside of them, add -maxdepth 0 to disable find's recursion.
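If the per-file overhead matters, the piped form can be kept to a single find call by letting xargs hand all the names to one invocation (a sketch, assuming GNU xargs and filenames without embedded newlines):
... | xargs -d'\n' sh -c 'find "$@" -maxdepth 0 -mtime 1' sh
Here sh collects the names that xargs appends and passes them to find as multiple starting points; -maxdepth 0 keeps find from descending into any directories.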
Another option that won't recurse, though I'd just use John's find solution if I were you.
$: stat -c "%n %w" "${files[@]}" | sed -n "
/ $(date +'%Y-%m-%d' --date=yesterday) /{ s/ .*//; p; }"
The stat will print the name and creation date of files in the array.
The sed "greps" for the date you want and strips the date info before printing the filename.

How to create argument variable in bash script

I am trying to write a script that reports the number of characters in the nth-largest file in a sub-directory.
I was trying to pass n and the name of the sub-directory as arguments like $1 and $2.
Current directory: Greetings
Sub-directory: language_files, others
Sub-directory: English, German, French
Files: Goodmorning.csv, Goodafternoon.csv, Goodevening.csv ….
I would be in the directory “Greetings”; when I indicate a subdirectory (English, German, French), the script should show the nth-largest file in that subdirectory and count its number of characters as well.
For instance, to figure out the number of characters in the 2nd-largest file in English, I tried:
langs=$1
n=$2
for langs in language_files/;
Do count=$(find language_files/$1 name "*.csv" | wc -m | head -n -1 | sort -n -r | sed -n $2(p))
Done | echo "The file has $count bytes!"
The result I wanted was:
$ ./script1.sh English 2
The file has 1100 bytes!
The main problem of all the issue is the fact that I don't understand how variables and looping work in bash script.
No need for looping:
find language_files/"$1" -name "*.csv" | xargs wc -m | sort -nr | sed -n "$2{p;q}"
For byte counting you should use -c, since -m counts characters (the two may be the same in your case).
You don't use the loop variable in your script anyway.
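Wired into a script that matches the desired call ./script1.sh English 2, that one-liner might look like this sketch (the grep filters out the "total" line that wc emits when it is handed several files at once, and -c is used for bytes as suggested above):
#!/bin/bash
# usage: ./script1.sh English 2
count=$(find language_files/"$1" -name "*.csv" -exec wc -c {} + |
    grep -v ' total$' | sort -nr | sed -n "$2{s/^ *//; s/ .*//; p; q}")
echo "The file has $count bytes!"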
Bash loops are interesting. You are encouraged to learn more about them when you have some time. However, this particular problem might not need a loop. Set lang (you can call it langs if you prefer) and n appropriately, and then try this:
count=$(stat -c'%s %n' language_files/"$lang"/* | sort -nr | head -n"$n" | tail -n1 | sed -re 's/^[[:space:]]*([[:digit:]]+).*/\1/')
That should give you the $count you need. Then you can echo it however you like.
EXPLANATION
If you wish to learn how it works:
The stat command outputs various statistics about the named file (or files), in this case %s the file's size and %n the file's name.
The head and tail commands output, respectively, the first and last several lines of their input. Together, they select a specific line from the sorted list.
The sed command extracts just the size field from that line. (You can use cut instead, if you prefer.)
If you wish to be cleverer, then you can optimize as @karafka has done.
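For completeness, the stat-based approach wired into the same two-argument script shape would be something like this sketch (using cut for the extraction, as suggested, and restricting the glob to the .csv files from the question):
#!/bin/bash
# usage: ./script1.sh English 2
lang=$1
n=$2
count=$(stat -c '%s %n' language_files/"$lang"/*.csv | sort -nr | head -n "$n" | tail -n 1 | cut -d' ' -f1)
echo "The file has $count bytes!"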

How to read CSV file stored in variable

I want to read a CSV file using shell, but for some reason it doesn't work.
I use this to locate the most recently added csv file in my csv folder:
lastCSV=$(ls -t csv-output/ | head -1)
and this to count the lines.
wc -l $lastCSV
Output
wc: drupal_site_livinglab.csv: No such file or directory
If I echo the file it says: drupal_site_livinglab.csv
Your issue is that you're one directory up from the path you are trying to read. The quick fix would be wc -l "csv-output/$lastCSV".
Bear in mind that parsing ls -t though convenient, isn't completely robust, so you should consider something like this to protect you from awkward file names:
last_csv=$(find csv-output/ -mindepth 1 -maxdepth 1 -printf '%T@\t%p\0' |
    sort -znr | head -zn1 | cut -zf2-)
wc -l "$last_csv"
GNU find lists all files along with their last modification time, separating the output using null bytes to avoid problems with awkward filenames.
if you remove -maxdepth 1, this will become a recursive search
GNU sort arranges the files from newest to oldest, with -z to accept null byte-delimited input.
GNU head -z returns the first record from the sorted list.
GNU cut -z at the end discards the timestamp, leaving you with only the filename.
You can also replace find with stat (again, this assumes that you have GNU coreutils):
last_csv=$(stat csv-output/* --printf '%Y\t%n\0' | sort -znr | head -zn1 | cut -zf2-)
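If you would rather avoid external tools entirely, a pure-bash sketch (assuming the CSV files sit directly in csv-output/ and all end in .csv) can compare modification times with the -nt test:
shopt -s nullglob
last_csv=
for f in csv-output/*.csv; do
    [[ -z $last_csv || $f -nt $last_csv ]] && last_csv=$f
done
[[ -n $last_csv ]] && wc -l "$last_csv"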

Selecting single directory that satisfies certain pattern

I would like to be able to get name of the first directory that matches a certain pattern, say:
~/dir-a/dir-b/dir-*
That is, if the directory dir-b contained directories dir-1, dir-2, and dir-3, I would get dir-1 (or, alternatively, dir-3).
The option listed above works if there is only one subdirectory in dir-b, but obviously fails when there are more of them.
You can use bash arrays, like:
content=(~/dir-a/dir-b/dir-*) #stores the content of a directory into array "content"
echo "${content[0]}" #echoes the 1st
echo "${content[${#content[@]}-1]}" #echoes the last element of array "content"
#or, according to @konsolebox's comments
echo "${content[@]:(-1)}"
Another method, make a bash function like:
first() { set -- "$@"; echo "$1"; }
#and call it
first ~/dir-a/dir-b/dir-*
If you want to sort the files not by name but by modification time, you can use this script:
where=~/dir-a/dir-b
find "$where" -type f -print0 | xargs -0 stat -f "%m %N" | sort -rn | head -1 | cut -f2- -d" "
decomposed
the find finds files by defined criteria
the xargs runs the stat command for every found file and prints the result as "modification_time filename"
the sort sorts the result by the time
the head gets the first of them
and the cut cuts the unwanted time field
You can adjust the find with -mindepth 1 -maxdepth 1 so it doesn't descend deeper.
On Linux it can be shorter (using find's -printf format), but this works on OS X too...
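The shorter GNU form alluded to there might look like this sketch (%T@ prints the modification time as a sortable epoch value):
find ~/dir-a/dir-b -type f -printf '%T@ %p\n' | sort -rn | head -1 | cut -f2- -d" "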

Count files matching a pattern with GREP

I am on a windows server, and have installed GREP for win. I need to count the number of file names that match (or do not match) a specific pattern. I don't really need all the filenames listed out, I just need a total count of how many matched. The tree structure that I will be searching is fairly large, so I'd like to conserve as much processing as possible.
I'm not very familiar with grep, but it looks like I can use the -l option to search for file names matching a given pattern. So, for example, I could use
$grep -l -r this *.doc*
to search for all MS Word files in the current folder and all child folders. This would then return a listing of all those files. I don't want the listing, I just want a count of how many it found. Is this possible with GREP... or another tool?
thanks!
On Linux you would use
grep -l -r --include "*.doc*" this . | wc -l
to get the number of matching files. (Passing .doc directly to -r would not search all the Word files, which is why --include is used and the search is rooted at the current directory.)
And if you do not have wc, you can use grep again to count the matches:
grep -l -r --include "*.doc*" this . | grep -c .
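If what you actually need is the number of files whose names match the pattern, regardless of their contents, grep is not really the tool for that part. Assuming a GNU find port is also installed (an assumption; it is not part of GREP for win), a sketch would be:
find . -type f -name "*.doc*" | wc -l
This matches on the file name alone, so no file contents have to be read.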
