unix command for file seperation in two different folders - bash

I am currently in data folder which has following files and folders
Folders:
ISOLATE
JUKEBOX
Files:
XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.ISOLATE.quantifier.txt
XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.JUKEBOX.quantifier.txt
XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.ISOLATE.quantifier.txt
XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.JUKEBOX.quantifier.txt
...
I want to put the files with .ISOLATE in Folder ISOLATE and .JUKEBOX ones in the JUKEBOX folder. How could I perform this task using terminal?
There are more than 12000 files, so I cannot really change the naming scheme.
Thanks in advance

Try to use wildcards:
mv *.ISOLATE.quantifier.txt ISOLATE/
mv *.JUKEBOX.quantifier.txt JUKEBOX/
If the number of files is too high, you might need to move them in smaller loads.
find -name '*.ISOLATE.quantifier.txt' -maxdepth 1 -exec mv {} ISOLATE/ +
-exec with + should accumulate the command line arguments the same way as xargs, so you shouldn't overflow the maximal number of arguments.

Since you're dealing with huge # of files, you can use this mv with xargs:
printf '%s\0' *.ISOLATE.* | xargs -0 mv -t ISOLATE/
printf '%s\0' *.JUKEBOX.* | xargs -0 mv -t JUKEBOX/

In addition to trying wildcards (bash pattern match or globs), which at some point will hit an upper limit based on the number of files, you can also use find and xargs:
find . -name '*.ISOLATE.*.txt' -maxdepth 1 -print0 | xargs -0 -IFILE mv FILE ./ISOLATE
find . -name '*.JUKEBOX.*.txt' -maxdepth 1 -print0 | xargs -0 -IFILE mv FILE ./JUKEBOX
Doing this won't be subject to the maximum number of command line arguments that the glob solution may hit.
They key things in the commands above are:
-maxdepth 1 ensures that find won't keep looking into the ./ISOLOATE or ./JUKEBOX subdirectories
-print0 causes find to delimit the file names with a null byte rather than whitespace. This protects you against files that have spaces or other special characters in their names.
-0 causes xargs to use the null byte delimiter rather than whitespace for the same reason
-IFILE tells xargs to use the string FILE for each of the arguments. Typically xargs puts the filenames on the right, which wouldn't work with the mv command.
I tested the approach with a small shell script:
touch XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.ISOLATE.quantifier.txt
touch XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.JUKEBOX.quantifier.txt
touch XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.ISOLATE.quantifier.txt
touch XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.JUKEBOX.quantifier.txt
mkdir ISOLATE
mkdir JUKEBOX
find . -name '*.ISOLATE.*.txt' -maxdepth 1 -print0 | xargs -0 -IFILE mv FILE ./ISOLATE
find . -name '*.JUKEBOX.*.txt' -maxdepth 1 -print0 | xargs -0 -IFILE mv FILE ./JUKEBOX
find .
Which outputs:
$ bash example.sh
.
./example.sh
./ISOLATE
./ISOLATE/XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.ISOLATE.quantifier.txt
./ISOLATE/XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.ISOLATE.quantifier.txt
./JUKEBOX
./JUKEBOX/XXX-12-2345-67A-89T-1011-12.ab20.RenderBase20.JUKEBOX.quantifier.txt
./JUKEBOX/XXX-24-2345-67A-89T-2022-24.ab10.RenderBase20.JUKEBOX.quantifier.txt

Related

"find | xargs | ls" not running ls on filenames from find

So I have a directory with files and sub-directories in it. I want to get all the files recursively and then list them in long format, sorted by the modified date. Here's what I came up with.
find . -type f | xargs -d "\n" | ls -lt
However this only lists the files in the current directory and not the sub-directories. I don't understand why, given that the following prints out all the files.
find . -type f | xargs -d "\n" | cat
Any help appreciated.
xargs can only start ls if it's passed ls as an argument. When you pipe from xargs into ls, only one copy of ls is started -- by the parent shell -- and it isn't given any of the filenames from find | xargs as arguments -- instead they're on its stdin, but ls never reads its stdin, so it doesn't even know that they're there.
Thus, you need to remove the | character:
# Does what you specified in the common case, but buggy; don't use this
# (filenames can contain newlines!)
# ...also, xargs -d is GNU-only
find . -type f | xargs -d '\n' ls -lt
...or, better:
# uses NUL separators, which cannot exist inside filenames
# also, while a non-POSIX extension, this is supported in both GNU and BSD xargs
find . -type f -print0 | xargs -0 ls -lt
...or, even better than that:
# no need for xargs at all here; find -exec can do the same thing
# -exec ... {} + is POSIX-mandated functionality since 2008
find . -type f -exec ls -lt {} +
Much of the content in this answer is also covered in the Actions, Complex Actions, and Actions in Bulk sections of Using Find, which is well worth reading.

renaming series of files using xargs

I would like to rename several files picked by find in some directory, then use xargs and mv to rename the files, with parameter expansion. However, it did not work...
example:
mkdir test
touch abc.txt
touch def.txt
find . -type f -print0 | \
xargs -I {} -n 1 -0 mv {} "${{}/.txt/.tx}"
Result:
bad substitution
[1] 134 broken pipe find . -type f -print0
Working Solution:
for i in ./*.txt ; do mv "$i" "${i/.txt/.tx}" ; done
Although I finally got a way to fix the problem, I still want to know why the first find + xargs way doesn't work, since I don't think the second way is very general for similar tasks.
Thanks!
Remember that shell variable substitution happens before your command runs. So when you run:
find . -type f -print0 | \
xargs -I {} -n 1 -0 mv {} "${{}/.txt/.tx}"
The shell tries to expan that ${...} construct before xargs even
runs...and since that contents of that expression aren't a valid shell variable reference, you get an error. A better solution would be to use the rename command:
find . -type f -print0 | \
xargs -I {} -0 rename .txt .tx {}
And since rename can operate on multiple files, you can simplify
that to:
find . -type f -print0 | \
xargs -0 rename .txt .tx

how to grep large number of files?

I am trying to grep 40k files in the current directory and i am getting this error.
for i in $(cat A01/genes.txt); do grep $i *.kaks; done > A01/A01.result.txt
-bash: /usr/bin/grep: Argument list too long
How do one normally grep thousands of files?
Thanks
Upendra
This makes David sad...
Everyone so far is wrong (except for anubhava).
Shell scripting is not like any other programming language because much of the interpretation of lines comes from the power of the shell interpolating them before the command is actually executed.
Let's take something simple:
$ set -x
$ ls
+ ls
bar.txt foo.txt fubar.log
$ echo The text files are *.txt
echo The text files are *.txt
> echo The text files are bar.txt foo.txt
The text files are bar.txt foo.txt
$ set +x
$
The set -x allows you to see how the shell actually interpolates the glob and then passes that back to the command as input. The > points to the line that is actually being executed by the command.
You can see that the echo command isn't interpreting the *. Instead, the shell grabs the * and replaces it with the names of the matching files. Then and only then does the echo command actually executes the command.
When you have 40K plus files, and you do grep *, you're expanding that * to the names of those 40,000 plus files before grep even has a chance to execute, and that's where the error message /usr/bin/grep: Argument list too long is coming from.
Fortunately, Unix has a way around this dilemma:
$ find . -name "*.kaks" -type f -maxdepth 1 | xargs grep -f A01/genes.txt
The find . -name "*.kaks" -type f -maxdepth 1 will find all of your *.kaks files, and the -depth 1 will only include files in the current directory. The -type f makes sure you only pick up files and not a directory.
The find command pipes the names of the files into xargs and xargs will append the names of the file to the grep -f A01/genes.txtcommand. However, xargs has a trick up it sleeve. It knows how long the command line buffer is, and will execute the grep when the command line buffer is full, then pass in another series of file to the grep. This way, grep gets executed maybe three or ten times (depending upon the size of the command line buffer), and all of our files are used.
Unfortunately, xargs uses whitespace as a separator for the file names. If your files contain spaces or tabs, you'll have trouble with xargs. Fortunately, there's another fix:
$ find . -name "*.kaks" -type f -maxdepth 1 -print0 | xargs -0 grep -f A01/genes.txt
The -print0 will cause find to print out the names of the files not separated by newlines, but by the NUL character. The -0 parameter for xargs tells xargs that the file separator isn't whitespace, but the NUL character. Thus, fixes the issue.
You could also do this too:
$ find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;
This will execute the grep for each and every file found instead of what xargs does and only runs grep for all the files it can stuff on the command line. The advantage of this is that it avoids shell interference entirely. However, it may or may not be less efficient.
What would be interesting is to experiment and see which one is more efficient. You can use time to see:
$ time find . -name "*.kaks" -type f -maxdepth 1 -exec grep -f A01/genes.txt {} \;
This will execute the command and then tell you how long it took. Try it with the -exec and with xargs and see which is faster. Let us know what you find.
You can combine find with grep like this:
find . -maxdepth 1 -name '*.kaks' -exec grep -H -f A01/genes.txt '{}' \; > A01/A01.result.txt
you can use recursive feature of grep:
for i in $(cat A01/genes.txt); do
grep -r $i .
done > A01/A01.result.txt
though if you want to select only kaks files:
for i in $(cat A01/genes.txt); do
find . -iregex '.*\.kaks$' -exec grep $i \;
done > A01/A01.result.txt
Put another for loop inside your outer one:
for f in *.kaks; do
grep -H $i "$f"
done
By the way, are you interested in finding EVERY occurrence in each file, or merely if the search string exists in there one or more times? If it is "good enough" to know the string occurs in there one or more times you can specify "-n 1" to grep and it will not bother reading/searching the rest of the file after finding the first match, which could potentially save lots of time.
The following solution has worked for me:
Problem:
grep -r "example\.com" *
-bash: /bin/grep: Argument list too long
Solution:
grep -r "example\.com" .
["In newer versions of grep you can omit the “.“, as the current directory is implied."]
Source:
Reinlick, J. https://www.saotn.org/bash-grep-through-large-number-files-argument-list-too-long/

More elegant use of find for passing files grouped by directory?

This script has taken me too long (!!) to compile, but I finally have a reasonably nice script which does what I want:
find "$#" -type d -print0 | while IFS= read -r -d $'\0' dir; do
find "$dir" -iname '*.flac' -maxdepth 1 ! -exec bash -c '
metaflac --list --block-type=VORBIS_COMMENT "$0" 2>/dev/null | grep -i "REPLAYGAIN_ALBUM_PEAK" &>/dev/null
exit $?
' {} ';' -exec bash -c '
echo Adding ReplayGain tags to "$0"/\*.flac...
metaflac --add-replay-gain "${#:1}"
' "$dir" {} '+'
done
The purpose is to search the file tree for directories containing FLAC files, test whether any are missing the REPLAYGAIN_ALBUM_PEAK tag, and scan all the files in that directory for ReplayGain if they are missing.
The big stumbling block is that all the FLAC files for a given album must be passed to metaflac as one command, otherwise metaflac doesn't know they're all one album. As you can see, I've achieved this using find ... -exec ... +.
What I'm wondering is if there's a more elegant way to do this. In particular, how can I skip the while loop? Surely this should be unnecessary, because find is already iterating over the directories?
You can probably use xargs to achieve it.
For example, if you are looking for text foo in all your files you'll have something like
find . type f | xargs grep foo
xargs passes each result from left-end expression (find) to the right-end invokated command.
Then, if no command exists to achieve what you want to do, you can always create a function, and pass if to xargs
I can't comment on the flac commands themselves, but as for the rest:
find . -name '*.flac' \
! -exec bash -c 'metaflac --list --block-type=VORBIS_COMMENT "$1" | grep -qi "REPLAYGAIN_ALBUM_PEAK"' -- {} \; \
-execdir bash -c 'metaflac --add-replay-gain *.flac' \;
You just find the relevant files, and then treat the directory it's in.

How can I list all unique file names without their extensions in bash?

I have a task where I need to move a bunch of files from one directory to another. I need move all files with the same file name (i.e. blah.pdf, blah.txt, blah.html, etc...) at the same time, and I can move a set of these every four minutes. I had a short bash script to just move a single file at a time at these intervals, but the new name requirement is throwing me off.
My old script is:
find ./ -maxdepth 1 -type f | while read line; do mv "$line" ~/target_dir/; echo "$line"; sleep 240; done
For the new script, I basically just need to replace find ./ -maxdepth 1 -type f
with a list of unique file names without their extensions. I can then just replace do mv "$line" ~/target_dir/; with do mv "$line*" ~/target_dir/;.
So, with all of that said. What's a good way to get a unique list of files without their file names with bash script? I was thinking about using a regex to grab file names and then throwing them in a hash to get uniqueness, but I'm hoping there's an easier/better/quicker way. Ideas?
A weird-named files tolerant one-liner could be:
find . -maxdepth 1 -type f -and -iname 'blah*' -print0 | xargs -0 -I {} mv {} ~/target/dir
If the files can start with multiple prefixes, you can use logic operators in find. For example, to move blah.* and foo.*, use:
find . -maxdepth 1 -type f -and \( -iname 'blah.*' -or -iname 'foo.*' \) -print0 | xargs -0 -I {} mv {} ~/target/dir
EDIT
Updated after comment.
Here's how I'd do it:
find ./ -type f -printf '%f\n' | sed 's/\..*//' | sort | uniq | ( while read filename ; do find . -type f -iname "$filename"'*' -exec mv {} /dest/dir \; ; sleep 240; done )
Perhaps it needs some explaination:
find ./ -type f -printf '%f\n': find all files and print just their name, followed by a newline. If you don't want to look in subdirectories, this can be substituted by a simple ls;
sed 's/\..*//': strip the file extension by removing everything after the first dot. Both foo.tar ad foo.tar.gz are transformed into foo;
sort | unique: sort the filenames just found and remove duplicates;
(: open a subshell:
while read filename: read a line and put it into the $filename variable;
find . -type f -iname "$filename"'*' -exec mv {} /dest/dir \;: find in the current directory (find .) all the files (-type f) whose name starts with the value in filename (-iname "$filename"'*', this works also for files containing whitespaces in their name) and execute the mv command on each one (-exec mv {} /dest/dir \;)
sleep 240: sleep
): end of subshell.
Add -maxdepth 1 as argument to find as you see fit for your requirements.
Nevermind, I'm dumb. there's a uniq command. Duh. New working script is: find ./ -maxdepth 1 -type f | sed -e 's/.[a-zA-Z]*$//' | uniq | while read line; do mv "$line*" ~/target_dir/; echo "$line"; sleep 240; done
EDIT: Forgot close tag on code and a backslash.

Resources