Concatenating Thousands of Text Files Across Hundreds of Directories (while keeping some structure) - bash

I have a set of plain text files spread across 400+ directories, with tons and tons of subdirectories. There are about 300,000 text files. For example:
directory1/subdirectory1
directory1/subdirectory2
directory1/subdirectory1/subdirectory3
All of those text files within directory1 should end up in one big massive text file named directory1.txt. Then repeat with directory2.
What would be the quickest and simplest way to go into each of these four hundred directories and combine all of the text files in such a manner?
I know I could go to each of the four hundred directories, use commands such as find to bring all the text files together into one directory, and then use cat *.txt >> all.txt, but surely there must be an easy way to automate this process?

To concatenate all txt-files in a subtree:
#!/bin/sh
# Usage: cat-txt dirname
find "$1" -name \*.txt -print0 | xargs -0 cat >> "$1.txt"
Call cat-txt on all immediate subdirectories:
$ find . -mindepth 1 -maxdepth 1 -type d -exec cat-txt '{}' \;
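The two steps above can also be folded into a single function. This is only a sketch under a few assumptions: concat_subtrees is a hypothetical name, each output file is written next to its directory (so it is never read back in), and the order in which find visits files is unspecified.

```shell
# Sketch: concatenate every *.txt below each immediate subdirectory of a
# root directory into <subdirectory>.txt, written next to that subdirectory.
# concat_subtrees is a hypothetical helper name, not taken from the answers.
concat_subtrees() {
    root=${1:-.}
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue      # the glob stays literal if nothing matches
        dir=${dir%/}                   # drop the trailing slash
        # "-exec cat {} +" hands many files to each cat invocation
        find "$dir" -type f -name '*.txt' -exec cat {} + > "$dir.txt"
    done
}
```

Calling concat_subtrees . from the parent of the 400 directories would produce directory1.txt, directory2.txt and so on in that parent directory.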

I've tested this on my system, and it works flawlessly. You may want to tweak it to your parameters, but in one line I did everything you needed.
for I in */; do cat "$I"* > "${I%/}.txt"; done
You may want to change the glob to match only text files, otherwise you'll get binary data as well. Enjoy.
for I in */; do cat "$I"*.txt > "${I%/}.txt"; done

there must be an easy way to automate this process?
Why are you looking for one? Is this a one time activity or you're gonna repeat it every now and then?
I would just say go with something as simple as:
for ff in `find . -mindepth 1 -maxdepth 1 -type d`
do
find "$ff" -type f -name '*.txt' -exec cat {} \; >> "$ff.txt"
done

Related

how to count files only in specific subdirectories located deeply in the hierarchy?

I need to count all session files sess_* located in tmp directories (Debian machine) and get the path to each tmp directory along with its count.
All parent directories are in /somepath/to/clientsDirs.
The directory structure for one client is
../ClientDirX/webDirYX/someDirZx
../ClientDirX/webDirYX/someDirZy
../ClientDirX/webDirYX/tmp
../ClientDirX/webDirYX/someDirZz
../ClientDirX/webDirYX/...
../ClientDirX/webDirYX/someDirZN
../ClientDirX/webDirYY/someDirZx
../ClientDirX/webDirYY/someDirZy
../ClientDirX/webDirYY/tmp
../ClientDirX/webDirYY/someDirZz
../ClientDirX/webDirYY/...
../ClientDirX/webDirYY/someDirZN
All someDirZ and tmp directories have a varying number of subdirectories. Session files are only in the tmp dir itself, not in its subdirectories. A single tmp dir may contain millions of sess_* files, so the solution needs to be very time-efficient.
X, YY, etc. in directory names are always numbers, but not in a continuous line, e.g.:
ClientDir1/webDir3/*
ClientDir4/webDir31/*
ClientDir4/webDir35/*
ClientDir18/webDir2/*
Could you please help me count all sess_* files in each tmp dir by command line or bash script?
EDIT: answer changed after the meaning of the question changed.
The whole task is divided into 3 parts.
I changed the directory names to simpler ones.
1. Build a list of tmp directories to search (first script)
#!/bin/bash
find /var/log/clients/sd*/wd*/ -maxdepth 1 -type d -name "tmp" >list
Explanation:
-type d only searches for directories
-maxdepth 1 specifies the maximum search depth
-name specifies the name of the items sought
>list redirects the result to the file named list
* is shell globbing; here it matches any string of characters
We perform this task in a separate script for two reasons. First, its execution time will be significant. Second, the list of customers does not change very often, so it makes no sense to rebuild it every time.
2. Loop over the list items in bash (see the final script)
3. Search for sess_* files in the tmp directory without including subdirectories
find /path/to/tmp -maxdepth 1 -type f -name "sess_*" -exec printf "1" \; |wc -c
Explanation:
-type f only searches for files
-exec executes any system command, in this case printf
\; is the required terminator of the -exec command; it must be preceded by a space!
-exec printf is used because not every version of find has a built-in -printf action, so this will also work on BusyBox or outside of the GNU world
If your find has -printf, use it instead of -exec (-printf "1")
For more, see man find
Finally the second script:
#!/bin/bash
# Read the list line by line so that paths containing spaces stay intact
while IFS= read -r x
do
printf "%s \t" "$x"
find "$x" -maxdepth 1 -type f -name "sess_*" -exec printf "1" \; | wc -c
done < list
Example result:
/var/log/clients/sd1/wd1/tmp 3
/var/log/clients/sd2/wd1/tmp 62
EDIT:
Note: in some versions of GNU find (e.g. 4.7.0-git), if -maxdepth 1 is placed after -type f, the program prints a warning or does not work. It seems that these versions do not use the getopt mechanism for some reason. Other versions of find do not seem to have this problem.
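The three parts above can also be combined into one function. The following is a sketch only: count_sessions is a hypothetical name, the layout <base>/<client>/<web>/tmp is assumed from the question, and the count relies on the printf format being reused once per argument so that a single printf process handles a whole batch of files instead of one process per file.

```shell
# Sketch: print "<tmp directory><TAB><number of sess_* files>" for every
# <base>/<client>/<web>/tmp directory. count_sessions is a hypothetical name.
count_sessions() {
    base=$1
    for tmp in "$base"/*/*/tmp; do
        [ -d "$tmp" ] || continue
        # '%.0s1' prints zero characters of each path followed by a "1",
        # so the number of bytes produced equals the number of matches.
        n=$(find "$tmp" -maxdepth 1 -type f -name 'sess_*' \
                -exec printf '%.0s1' {} + | wc -c | tr -d ' ')
        printf '%s\t%s\n' "$tmp" "$n"
    done
}
```

With the simplified directory names used in the answer, the call would be count_sessions /var/log/clients.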

Renaming Subdirectories and Files

I have a script using a for loop that would rename folders and files. The script would take the list of files and folders and rename them conditionally. I would invoke the file using the command:
find test/* -exec ./replace.sh {} \;
My replace.sh script would contain something similar to:
for i in "$@"; do
mv "$OLDFILE" "$NEWFILE"
done
$OLDFILE and $NEWFILE have been set previously and I don't believe any problems will arise from them.
My problem arises when I hit upon subdirectories. Originally, I would have folders like:
folder_1
-file1
-file2
When my script changes folder_1 into folderX1, the next argument, folder_1/file1, would be invalid because the changed path would be folderX1/file1. I figured I could create a stack with a list of folders that are being changed and pop them out later to rename the files, but this seems hard in bash. Is there a better method that I am missing?
P.S I could run the program several times to go through all the subdirectories but this doesn't seem efficient.
You can add -depth to the find command. This will process the directory's files before the directory itself. See man find for details.
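Your find usage is problematic. The first argument is the start location for the search, so you don't want to use a glob there. If you want only the files in test/ and not any of its subdirectories, restrict the depth with -maxdepth 1 (or -depth 1 on BSD find, as in the example below); note that this differs from the bare -depth option Olaf suggested, which only changes the traversal order.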
You don't really need to use a separate script to handle this rename. It can be done within the find command line, if you don't mind a little mess.
To handle just the top-level of files, you could do this:
$ touch foo.txt bar.txt baz.ext
$ find test -depth 1 -type f -name \*.txt -exec bash -c 'f="{}"; mv -v "{}" "${f/.txt/.csv}"' \;
./foo.txt -> ./foo.csv
./bar.txt -> ./bar.csv
$
But your concern is valid -- find will build a list of matches, and if your -exec changes the list out from under find, some renames will fail.
I suspect your quickest solution is to do this in TWO stages (not several): one for files, followed by one for directories. (Or change the order, I don't think it should matter.)
$ mkdir foo_1; touch red_2 foo_1/blue_3
$ find . -type f -name \*_\* -exec bash -c 'f="{}"; mv -v "{}" "${f%_?}X${f##*_}"' \;
./foo_1/blue_3 -> ./foo_1/blueX3
./red_2 -> ./redX2
$ find . -type d -name \*_\* -exec bash -c 'f="{}"; mv -v "{}" "${f%_?}X${f##*_}"' \;
./foo_1 -> ./fooX1
Bash parameter expansion will get you a long way.
Another option, depending on your implementation of find, is the -d option:
-d Cause find to perform a depth-first traversal, i.e., directories
are visited in post-order and all entries in a directory will be
acted on before the directory itself. By default, find visits
directories in pre-order, i.e., before their contents. Note, the
default is not a breadth-first traversal.
So:
$ mkdir -p foo_1/bar_2; touch red_3 foo_1/blue_4 foo_1/bar_2/green_5
$ find . -d -name \*_\* -exec bash -c 'f="{}"; mv -v "{}" "${f%_?}X${f##*_}"' \;
./foo_1/bar_2/green_5 -> ./foo_1/bar_2/greenX5
./foo_1/bar_2 -> ./foo_1/barX2
./foo_1/blue_4 -> ./foo_1/blueX4
./foo_1 -> ./fooX1
./red_3 -> ./redX3
$
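As a variation on the depth-first approach, the path can be passed to the inner shell as an argument instead of being pasted into the script text via "{}", which avoids problems with quotes in file names. This is a sketch under assumptions: rename_underscores is a hypothetical name, only the last underscore in each name is replaced with X, and -mindepth 1 (available in GNU and BSD find) keeps the starting directory itself from being renamed.

```shell
# Sketch: rename entries below a root directory, children before parents.
# rename_underscores is a hypothetical helper name.
rename_underscores() {
    find "$1" -mindepth 1 -depth -name '*_*' -exec sh -c '
        for path do
            dir=${path%/*}      # parent directory, not renamed yet
            base=${path##*/}    # last path component only
            # replace the last "_" of the name with "X"
            mv -- "$path" "$dir/${base%_*}X${base##*_}"
        done
    ' sh {} +
}
```

Because -depth lists everything inside a directory before the directory itself, each path is still valid at the moment it is renamed.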

Bash: How to control iteration flow/loops?

For going over some recovered data, I am working on a script that recursively goes through folders & files and finally runs file on them, to check if they are likely fully recovered from a certain backup or not. (recovered files play, and are identified as mp3 or other audio, non-working files as ASCII-Text)
For now I would just be satisfied with having it go over my test folder structure, print all folders & corresponding files. (printing them mainly for testing, but also because I would like to log where the script currently is and how far along it is in the end, to verify what has been processed)
I tried using 2 for loops, one for the folders, then one for the files (so that ideally it would take one folder, list the files in it (or potentially delve into subfolders), give below each folder only the files in that folder, and then move on to the next).
Such as:
Folder1
- File 1
- File 2
-- Subfolder
-- File3
-- File4
Folder2
- File5
However this doesn't seem to work in the ways (such with for loops) that are normally proposed. I got as far as using "find . -type d" for the directories and "find . -type f" or "find * -type f" (so that it doesn't go in to subdirectories) However, when just printing the paths/files in order to check if it ran as I wanted it to, it became obvious that that didn't work.
It always seemed to first print all the directories (first loop) and then all the files (second loop). For keeping track of what it is doing and for making it easier to know what was checked/recovered I would like to do this in a more orderly fashion as explained above.
So is it that I just did something wrong, or is this maybe a general limitation of the for loop in bash?
Another problem that could be related: Although assigning the output of find to an array seemed to work, it wasn't accessible as an array ...
Example for loop:
for folder in '$(find . -type d)' ; do
echo $folder
let foldercounter++
done
Arrays:
folders=("$(find . -type d)")
#As far as I know this should assign the output as an array
#However, it is not really assigned properly somehow as
echo "$folders[1]"
# does not work (quotes necessary for spaces)
A find ... -exec ... solution @H.-Dirk Schmitt was referring to might look something like:
find . -type f -exec sh -c '
case $(file "$1") in
*Audio file*)
echo "$1 is an audio file"
;;
*ASCII text*)
echo "$1 is an ascii text file"
;;
esac
' _ {} ';'
For going over some recovered data, I am working on a script that recursively goes through folders & files and finally runs file on them, to check if they are likely fully recovered from a certain backup or not. (recovered files play, and are identified as mp3 or other audio, non-working files as ASCII-Text)
If you want to run file on every file and directory in the current directory, including its subdirectories and so on, you don't need to use a Bash for-loop, because you can just tell find to run file:
find -exec file '{}' ';'
(The -exec ... ';' option runs the command ... on every matched file or directory, replacing the argument {} with the path to the file.)
If you only want to run file on regular files (not directories), you can specify -type f:
find -type f -exec file '{}' ';'
If you (say) want to just print the names of directories, but run the above on regular files, you can use the -or operator to connect one directive that uses -type d and one that uses -type f:
find -type d -print -or -type f -exec file '{}' ';'
Edited to add: If desired, the effect of the above commands can be achieved in pure Bash (plus the file command, of course), by writing a recursive shell function. For example:
function foo () {
local file
for file in "$1"/* ; do
if [[ -d "$file" ]] ; then
echo "$file"
foo "$file"
else
file "$file"
fi
done
}
foo .
This differs from the find command in that it will sort the files more consistently, and perhaps in gritty details such as handling of dot-files and symbolic links, but is broadly the same, so may be used as a starting-point for further adjustments.
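For the folder-by-folder listing the question asks for, one possible sketch is to let find produce only the directories and list each directory's own files inside the loop. The name list_by_folder is hypothetical, and the loop assumes that no path contains a newline.

```shell
# Sketch: print each directory followed by the regular files directly in it.
# list_by_folder is a hypothetical helper name.
list_by_folder() {
    find "$1" -type d | LC_ALL=C sort | while IFS= read -r dir; do
        printf '%s\n' "$dir"
        for f in "$dir"/*; do
            [ -f "$f" ] || continue        # skip subdirectories and empty globs
            # replace this printf with: file "$f"   to classify each file
            printf '  - %s\n' "${f##*/}"
        done
    done
}
```

Reading find's output line by line with while read keeps names containing spaces intact, which a for loop over an unquoted $(find ...) would not.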

Moving large number of files [duplicate]

If I run the command mv folder2/*.* folder, I get "argument list too long" error.
I found some examples for ls and rm that deal with this error using find folder2 -name "*.*", but I have trouble applying them to mv.
find folder2 -name '*.*' -exec mv {} folder \;
-exec runs any command, {} inserts the filename found, \; marks the end of the exec command.
The other find answers work, but are horribly slow for a large number of files, since they execute one command for each file. A much more efficient approach is either to use + at the end of find, or use xargs:
# Using find ... -exec +
find folder2 -name '*.*' -exec mv --target-directory=folder '{}' +
# Using xargs
find folder2 -name '*.*' -print0 | xargs -0 mv --target-directory=folder
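Where mv lacks --target-directory (it is a GNU option), the same batching can be sketched with a small inline shell that receives the destination as $0 and the batch of files as "$@". The function name move_all is hypothetical, and the sketch assumes the destination already exists and is not inside the source.

```shell
# Sketch: move every entry directly inside src into dest, many per mv call.
# move_all is a hypothetical helper name.
move_all() {
    src=$1
    dest=$2
    # sh -c 'script' "$dest" file1 file2 ... sets $0=dest and "$@"=the files
    find "$src" -mindepth 1 -maxdepth 1 \
        -exec sh -c 'mv -- "$@" "$0"' "$dest" {} +
}
```

find splits the matches into batches that fit the system's argument limit, so the "argument list too long" error does not occur.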
find folder2 -name '*.*' -exec mv \{\} /dest/directory/ \;
First, thanks to Karl for his answer. I have only a minor correction to it.
My scenario:
Millions of folders inside /source/directory, each containing subfolders and files. The goal is to move them while keeping the same directory structure.
To do that I use this command:
find /source/directory -mindepth 1 -maxdepth 1 -name '*' -exec mv {} /target/directory \;
Here:
-mindepth 1 : makes sure you don't move the root folder
-maxdepth 1 : makes sure you search only for first-level children. All of their content is moved along with them, so you don't need to search for it.
The commands suggested in the answers above flattened the resulting directory structure, which was not what I was looking for, so I decided to share my approach.
This one-liner command should work for you.
Yes, it is quite slow, but works even with millions of files.
for i in /folder1/*; do mv "$i" /folder2; done
It will move all the files from folder /folder1 to /folder2.
find doesn't work with really long lists of files, it will give you the same error "Argument list too long". Using a combination of ls, grep and xargs worked for me:
$ ls|grep RadF|xargs mv -t ../fd/
It did the trick moving about 50,000 files where mv and find alone failed.

How to consolidate selected files from multiple sub-directories into one directory

I know this is probably elementary to unix people, but I haven't found a straightforward answer online.
I have a directory with sub-directories. Some of these sub-dirs have .mov files in them. I want to consolidate all the movs to a single directory. I don't need to worry about file naming conflicts because the files are from a digital camera and it names the files incrementally, but divides them into daily folders.
What is the Unix-fu for grabbing all these files and copying (or even better, moving them) to a directory in my home folder?
Thanks.
How about this?
find "$SOURCE_DIRECTORY" -type f -name '*.mov' -exec mv '{}' "$TARGET_DIRECTORY" ';'
If the source and target directories do not overlap this should work fine.
EDIT:
BTW, if you have mixed-case extensions (x.mov, y.Mov, Z.MOV) as is the case with many cameras, this would be better. It uses -iname which is case-insensitive when matching:
find "$SOURCE_DIRECTORY" -type f -iname '*.mov' -exec mv '{}' "$TARGET_DIRECTORY" ';'
Make sure to replace the $SOURCE_DIRECTORY and $TARGET_DIRECTORY variables with the actual directories and that they do not overlap (e.g. the target must not be somewhere under the source).
EDIT 2:
PS: I just noticed that khachik caught this one with his edit
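If name conflicts cannot be ruled out after all, the find command above can be extended so that it refuses to overwrite an existing file. This is a sketch under assumptions: collect_movs is a hypothetical name, the target directory must not lie under the source, and a skipped file simply stays where it was.

```shell
# Sketch: move all *.mov files (any letter case) below src into dest,
# skipping any file whose name already exists in dest.
# collect_movs is a hypothetical helper name.
collect_movs() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    find "$src" -type f -iname '*.mov' -exec sh -c '
        dest=$0
        for f do
            target=$dest/${f##*/}
            if [ -e "$target" ]; then
                printf "skipping %s (exists)\n" "$f" >&2
            else
                mv -- "$f" "$target"
            fi
        done
    ' "$dest" {} +
}
```

Replacing mv with cp in the inner script would copy instead of move.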
mv `find . -name "*.mov" | xargs` OUTPUTDIR/
Update after thkala's comment:
find . -iname "*.mov" | while IFS= read -r line; do mv "$line" OUTPUTDIR/; done
If you need to cope with weird filenames (spaces, special characters), try this:
$ cd <source parent directory>
$ find -name '*.mov' -print0 | xargs -0 echo mv -v -t <target directory>
Remove the "echo" above to actually do the move, rather than print what would happen.
"mv -v" gives verbose output, "mv -t ..." specifies the target directory (possibly GNU-specific).
"-print0" and "-0" are extensions to cope with weird filenames. On non-GNU systems you might need to remove those options, which will result in newline-separated data. This will still work on filenames with spaces, but not filenames with newlines (yes, it's possible).

Resources