Moving files whose names contain a line from a file - bash

I have a file where each line is a string of digits, such as
1234
2345
...
I need to move files whose names contain that number followed by other stuff into a directory. Examples:
1234_hello_other_stuff_2334.pdf
2345_more_stuff_3343.pdf
I tried using xargs to do this, but my bash scripting isn't the best. Can anyone share the proper command to accomplish what I want to do?

for i in `cat numbers.txt`; do
mv ${i}_* examples
done
or (look ma, no cat!)
while read i; do
mv ${i}_* examples
done < numbers.txt

You could use a for loop, but that could make for a really long command line. If you have 20000 lines in numbers.txt, you might hit shell limits. Instead, you could use a pipe:
cat numbers.txt | while read number; do
mv ${number}_*.pdf /path/to/examples/
done
or:
sed 's|.*|mv -v &_*.pdf /path/to/examples/|' numbers.txt | sh
You can leave off the | sh for testing. If there are other lines in the file and you only want to match lines with 4 digits, you could restrict your match:
sed -rn '/^[0-9]{4}$/s||mv -v &_*.pdf /path/to/examples/|p' numbers.txt | sh
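For the sample numbers above, leaving off the | sh prints the generated commands instead of running them, something like:
mv -v 1234_*.pdf /path/to/examples/
mv -v 2345_*.pdf /path/to/examples/
so you can eyeball them before doing the actual move.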

cat numbers.txt | xargs -n1 -I % find . -name '%*.pdf' -exec mv {} /path/to \;
% is your number (-n1 means one at a time), and '%*.pdf' tells find to match all files whose names begin with that number; then it just moves each match to /path/to ({} is the actual file name).
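If you'd rather stay in plain bash and avoid launching one find per number, a rough sketch along the same lines (the /path/to/examples destination and the .pdf extension are assumptions carried over from the question) would be:
shopt -s nullglob                  # unmatched globs expand to nothing instead of themselves
while IFS= read -r n; do
  [ -n "$n" ] || continue          # skip empty lines
  files=( "${n}"_*.pdf )           # every file starting with this number
  (( ${#files[@]} )) && mv -- "${files[@]}" /path/to/examples/
done < numbers.txt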

Related

Merge a huge number of files into one file by reading the files in ascending order

I want to merge a large number of files into a single file, and the merge should happen in ascending order of the file names. I have tried the command below and it works as intended, but the only problem is that after the merge the output.txt file contains all the data on a single line, because every input file has only one line of data without a trailing newline.
Is there any way to merge each file's data into output.txt as a separate line, rather than merging everything into a single line?
My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Actual output:
$ ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345
EDIT:
Since my input files won't contain any spaces or special characters like backslashes or quotes, I decided to use the command below, which works for me as expected.
find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
I tried with a file name containing a space, and below are the results with 2 different commands.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat "9999_ xyz_2.json"    # this file name contains a space
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Command:
find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt
Output:
Successfully completed the merge as expected, but with an error at the end.
abcdef
12345
Hello
awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)
Command:
Here I have used sort with the -zV options to avoid the error that occurred with the above command.
find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt
Output:
The command completed successfully, but the results are not as expected. Here the file name containing a space is treated as the last file after the sort, while the expectation is that it should be in the second position.
abcdef
Hello
12345
I would approach this with a for loop, and use echo to add the newline between each file:
for x in `ls -v -1 -d "$PWD/"9999_xyz_*.json`; do
cat $x
echo
done > output.txt
Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.
EDIT
You can optimize this a lot by using awk 1 as @oguzismail did in his answer:
ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt
This solution finishes in 4 seconds on my machine, with 12000 files as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while the xargs approach needs only a handful of awk processes, which is a lot more efficient.
Note: if you want to upvote this, make sure to upvote @oguzismail's answer too, since using awk 1 is his idea. But his answer with printf and sort -V is safer, so you probably want to use that solution anyway.
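For context, awk 1 works here because 1 is an always-true pattern, so awk prints every input record followed by its output record separator (a newline), even when the input file itself lacks a trailing newline. A quick way to convince yourself:
printf 'abcdef' | awk 1    # prints abcdef followed by a newline, unlike plain cat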
Don't parse the output of ls, use an array instead.
for fname in 9999_xyz_*.json; do
index="${fname##*_}"
index="${index%.json}"
files[index]="$fname"
done && awk 1 "${files[@]}" > output.txt
Another approach that relies on GNU extensions:
printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt
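To see why the version sort matters with these names, compare plain lexical sorting with sort -V on a few of the sample file names (GNU sort assumed):
printf '%s\n' 9999_xyz_1.json 9999_xyz_2.json 9999_xyz_12000.json | sort
# 9999_xyz_1.json
# 9999_xyz_12000.json
# 9999_xyz_2.json
printf '%s\n' 9999_xyz_1.json 9999_xyz_2.json 9999_xyz_12000.json | sort -V
# 9999_xyz_1.json
# 9999_xyz_2.json
# 9999_xyz_12000.json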

Handle files with space in filename and output file names

I need to write a Bash script that achieves the following goals:
1) move the newest n pdf files from folder 1 to folder 2;
2) correctly handles files that could have spaces in file names;
3) output each file name in a specific position in a text file. (In my actual usage, I will use sed to put the file names in a specific position of an existing file.)
I tried to make an array of filenames and then move them and do text output in a loop. However, the following array cannot handle files with spaces in filename:
pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 ls -1 -t | head -n$NUM))
Suppose a file has name "Filename with Space". What I get from the above array will have "with" and "Space" in separate array entries.
I am not sure how to avoid these words in the same filename being treated separately.
Can someone help me out?
Thanks!
-------------Update------------
Sorry for being vague on the third point as I thought I might be able to figure that out after achieving the first and second goals.
Basically, it is a text file that has a line starting with "%comment" near the end, and I will need to insert the filenames before that line in the format "file=PATH".
The PATH is the folder 2 that I have my pdfs moved to.
You can achieve this using mapfile in conjunction with the GNU versions of find | sort | cut | head, which have options to operate on NUL-terminated filenames:
mapfile -d '' -t pdfs < <(find "$DOWNLOADS" -name '*.pdf' -printf '%T@:%p\0' |
sort -z -t : -rnk1 | cut -z -d : -f2- | head -z -n "$NUM")
Commands used are:
mapfile -d '': To read array with NUL as delimiter
find: outputs each file's modification stamp in EPOCH + ":" + filename + NUL byte
sort: sorts reverse numerically on 1st field
cut: removes 1st field from output
head: outputs only first $NUM filenames
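Once the pdfs array is filled, a minimal sketch of the remaining two goals, moving the files and writing the file=PATH lines, could look like this (DEST and textfile.txt are assumed names; GNU sed is assumed for -i and the one-line i syntax):
DEST=/path/to/folder2                       # assumed destination (folder 2 in the question)
for f in "${pdfs[@]}"; do
  base=$(basename -- "$f")
  mv -- "$f" "$DEST/"
  # insert a file=PATH line just before the %comment line
  sed -i "/^%comment/i file=$DEST/$base" textfile.txt
done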
find downloads -name "*.pdf" -printf "%T@ %p\0" |
sort -z -t' ' -k1 -n |
cut -z -d' ' -f2- |
tail -z -n 3
find all *.pdf files in downloads
for each file, print its modification date (%T with the format specifier @, meaning seconds since the epoch with a fractional part), then a space, then the filename, terminated with \0
Sort the NUL-separated stream numerically on the first field, using space as the field separator
Remove the first field from the stream, i.e. the modification date, leaving only the filenames.
Get the newest files, in this example the 3 newest, by using tail. We could also sort in reverse and use head, no difference.
Don't use ls in scripts; ls is for nicely formatted output, not for parsing. You could do xargs -0 stat --printf "%Y %n\0" instead, which would basically move your script forward, as ls isn't meant to be used in scripts. The only drawback is that I couldn't make stat output the fractional part of the modification date.
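Spelled out, that stat-based variant would look something like this (a sketch; GNU stat assumed, and as noted %Y only has whole-second resolution):
find downloads -name "*.pdf" -print0 |
xargs -0 stat --printf '%Y %n\0' |
sort -z -t' ' -k1 -n |
cut -z -d' ' -f2- |
tail -z -n 3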
As for the second part, we need to save the NUL-delimited list to a file
find downloads ........ >"$tmp"
and then:
str='%comment'
{
grep -B$((2**32)) -x "$str" "$out" | grep -v "$str"
# I don't know what you expect to do with newlines in filenames, but I guess you don't have those
cat "$tmp" | sed -z 's/^/file=/' | sed 's/\x0/\n/g'
grep -A$((2**32)) -x "$str" "$out"
} | sponge "$out"
assuming the output file name is stored in the variable "$out":
filter all lines before the %comment and remove the line %comment itself from the file
output each filename with file= at the beginning; the second sed converts the NUL separators to newlines
then select all lines after %comment, including the %comment line itself
write the result back to "$out"; sponge (from moreutils) soaks up all the input before overwriting, so if you don't have sponge, remember to write to a temporary file first
Don't use pdfs=($(...)) on NUL-separated input. You can use mapfile to store the list in an array, as other answers have shown.
Then, to move the files, do something like
<"$tmp" xargs -0 -i mv {} "$outdir"
or faster, with a single move:
{ cat <"$tmp"; printf "%s\0" "$outdir"; } | xargs -0 mv
or alternatively:
<"$tmp" xargs -0 sh -c 'outdir="$1"; shift; mv "$#" "$outdir"' -- "$outdir"
Live example at tutorialspoint.
I suppose the following code will be close to what you want:
IFS=$'\n' pdfs=($(find -name "$DOWNLOADS/*.pdf" -print0 | xargs -0 -I{} ls -lt "{}" | tail -n +1 | head -n "$NUM"))
Then you can access the output through ${pdfs[0]}, ${pdfs[1]}, ...
Explanations
IFS=$'\n' makes the following line be split only on "\n".
The -I option tells xargs to substitute {} with each filename, so it can be quoted as "{}".
tail -n +1 is a trick to suppress an error message saying "xargs: 'ls' terminated by signal 13".
Hope this helps.
Bash v4 has an option called globstar; after enabling it, we can use ** to match zero or more subdirectories.
mapfile is a built-in command used for reading lines into an indexed array variable. The -t option removes the trailing newline from each line.
shopt -s globstar
mapfile -t pdffiles < <(ls -t1 **/*.pdf | head -n"$NUM")
typeset -p pdffiles
for f in "${pdffiles[#]}"; do
echo "==="
mv "${f}" /dest/path
sed "/^%comment/i${f}=/dest/path" a-text-file.txt
done

Move files based on a comparison with a file

I have 1000 files with the following names:
something-345-something.txt
something-5468-something.txt
something-100-something.txt
something-6200-something.txt
and a lot more...
And I have one txt file with only numbers in it, e.g.:
1000
500
5468
6200
699
etc...
Now I would like to move all files whose filenames contain a number that is in my txt file.
So in my example above, only the following files should be moved:
something-5468-something.txt
something-6200-something.txt
Is there an easy way to achieve this?
What about moving the files on the fly by doing this:
for i in `cat your-file.txt`; do
find . -iname "*-$i-*" -exec mv '{}' /target/dir \;
done
For every line in your text file, the find command will find only the files matching the pattern *-$i-* (e.g. something-6200-something.txt) and move them to your target dir.
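A variant of the same idea that avoids word splitting on the contents of the number file (a sketch; it assumes one number per line) would be:
while IFS= read -r i; do
  [ -n "$i" ] || continue
  find . -iname "*-$i-*" -exec mv -- '{}' /target/dir \;
done < your-file.txt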
Naive implementation: for file in $(ls); do grep $(echo -n $file | sed -nr 's/[^-]*-([0-9]+).*/\1/p') my-one-txt.txt && mv $file /tmp/somewhere; done
In English: for every file in the output of ls, parse the number part of the filename with sed and grep for it in your text file. grep returns a non-zero exit code if nothing is found, so mv is not executed in that case.
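A slightly tightened sketch of the same naive idea (grep -q keeps the output clean, and -x matches the whole line so that 500 in the list does not match a file numbered 5500; the glob and target directory are assumptions):
for file in *-*; do
  num=$(echo "$file" | sed -nr 's/[^-]*-([0-9]+).*/\1/p')
  [ -n "$num" ] || continue
  grep -qx "$num" my-one-txt.txt && mv -- "$file" /tmp/somewhere
done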
Script file named move (executable):
#!/bin/bash
TARGETDIR="$1"
FILES=`find . -type f` # build list of files
while read n # read numbers from standard input
do # n contains a number => filter list of files by that number:
echo "$FILES" | grep "\-$n-" | while read f
do # move file that passed the filter because its name matches n:
mv "$f" "$TARGETDIR"
done
done
Use it like this:
cd directory-with-files
./move target-directory < number-list.txt
Here's a crazy bit of bash hackery
shopt -s extglob nullglob
mv -t /target/dir *-@($(paste -sd "|" numbers.txt))-*
That uses paste to join all the lines in your numbers file with pipe characters, then uses bash extended pattern matching to find the files matching any one of the numbers.
I assume mv from GNU coreutils for the -t option.
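With the numbers.txt from the question, the paste step produces 1000|500|5468|6200|699, so the command effectively becomes:
mv -t /target/dir *-@(1000|500|5468|6200|699)-*
which, with the sample files, matches only something-5468-something.txt and something-6200-something.txt.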

Count how many files contain a string in the last line

I want to count how many files in the current directory have the string "A" in the last line.
First solution: tail -n 1 * | grep \"A\"| wc -l
This works fine, but when there are more files it fails with bash: /usr/bin/tail: Argument list too long.
Is there a way to get around it?
Bonus points if I can also optionally get which files contain it.
EDIT: my folder contains 343729 files
EDIT2: @tso usefully pointed, in his comment, to the question "I'm getting 'Argument list too long'. How can I process a large list in chunks?".
RESULTS:
@tso's solution for f in $(find . -type f); do tail -1 $f|grep \"A\"; done|wc -l takes about 20 minutes
@lars's solution grep -P "\"A\"*\Z" -r . | wc -l takes about 20 minutes
@mklement0's solution printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l takes about 10 minutes
@james's solution (in the comments) for i in * ; do awk 'END{if(/a/)print FILENAME}' "$i" ; done takes about 25 minutes
@codeforester's solution find . -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>' takes >20 minutes.
The @mklement0 and @codeforester solutions also have the advantage that if I want to change the grep pattern, running them a second time takes almost zero time; I guess it's due to some sort of caching.
I've accepted @mklement0's answer as it seems to be the fastest, but I'd still like to mention @tso and @lars for their contributions which, based on my personal knowledge, are easier and more adaptable solutions.
xargs is able to overcome the max. command-line length limitation by efficiently batching the invocations into as few calls as possible.
The shell's builtins, such as printf, are not subject to the max. command-line length.
Knowing this, you can use the following approach (which assumes that your xargs implementation supports the -0 option for NUL-terminated input, and that your tail implementation supports multiple file operands and the -q option for suppressing filename headers; both assumptions hold for the GNU/Linux and BSD/macOS implementations of these utilities):
printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l
How about using find, tail, and grep this way? This will be more efficient than having to loop through each file. Also, tail -1 will just read the last line of each file and hence is very I/O-efficient.
find . -maxdepth 1 -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>'
find will invoke tail -1 in batches, passing as many file names at a time as fit within ARG_MAX
tail will print the last line of each file, prefixing it with a header of the form "==> file_name <==" (illustrated below)
grep -EB 1 '^[^=]+A' will look for pattern A and fetch the previous line as well (it will exclude the file_name lines while looking for the match)
grep -c '^==>' will count the number of files with matching pattern
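To make the header trick concrete, the intermediate stream that the greps work on looks roughly like this (hypothetical file names):
==> ./notes1.txt <==
last line with A in it

==> ./notes2.txt <==
some other last line
grep -EB 1 '^[^=]+A' keeps each matching last line plus the ==> header above it, and grep -c '^==>' then counts those headers, i.e. the matching files.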
If you don't need to know the name of the files having a match, but just get the count of files, you could do this:
find . -maxdepth 1 -type f -exec tail -q -n 1 -- {} + | grep -c 'A'
Using GNU awk:
$ cat foo
b
a
$ cat bar
b
b
$ awk 'ENDFILE{if(/a/){c++; print FILENAME}}END{print c}' *
foo
1
try with find:
for f in $(find . -type f); do tail -1 "$f" | grep PATTERN; done | wc -l
If grep supports the -P option, this might work:
grep -P "A\Z" -r . | wc -l
See man pcrepattern. In short:
\Z matches at the end of the subject, and also before a newline at the end of the subject
\z matches only at the end of the subject
Try \Z and \z.
To see which files match, you would use only the grep part without the pipe to wc.
This will return the number of files:
grep -rlP "A\z" | wc -l
If you want to get the names then simply:
grep -rlP "A\Z"

In bash, how to batch-show the text of a certain line in files?

I want to batch-show the text of a certain line of the files in a certain directory. Usually this can be done with the following commands:
for file in `find ./ -name "results.txt"`;
do
sed -n '12p' < ${file};
done
On the 12th line of each file named "results.txt" there is the text I want to output.
But I wonder if we can use a pipeline to do this operation. I have tried the following commands:
find ./ -name "results.txt" | xargs sed -n '12p'
or
find ./ -name "results.txt" | xargs sed -n '12p' < {} \;
But neither works.
Could you give some advice or recommend some references, please?
All suggestions are welcome. Thanks in advance!
This should do it
find ./ -name results.txt -exec sed '12!d' {} ';'
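A small optional tweak on the same idea is to have sed quit right after printing line 12, so it doesn't read the rest of each file (GNU sed syntax shown):
find ./ -name results.txt -exec sed -n '12{p;q}' {} ';'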
@Steven Penny's answer is the most elegant and best-performing solution, but to shed some light on why your solution didn't work:
find ./ -name "results.txt" | xargs sed -n '12p'
causes all filenames(1) to be passed at once(2) to sed. Since sed counts lines cumulatively, across input files, only 1 line will be printed for all input files, namely line 12 from the first input file.
Keeping in mind that find's -exec action is the best solution, if you still wanted to solve this problem with xargs, you'd have to use xargs's -I option as follows, so as to ensure that sed is called once per input line (filename) (% is a self-chosen placeholder):
find ./ -name "results.txt" | xargs -I % sed '12q;d' %
Footnotes:
(1) with word splitting applied, which would break with paths with embedded spaces, but that's a separate issue.
(2) assuming they don't make the entire command exceed the max. length of a command line; either way, multiple filenames are passed at once.
As an aside: parsing command output with for as in your first snippet is NEVER a good idea - see http://mywiki.wooledge.org/ParsingLs and http://mywiki.wooledge.org/BashFAQ/001
Your use of xargs results in running sed with multiple file arguments. But as you can see, sed doesn't reset the record number to 1 when it starts reading a new file. For example, try running the following command against files with more than 12 lines each.
sed -n '12p' x.txt y.txt
If you want to use xargs, you might consider using awk:
find . -name 'results.txt' | xargs awk 'FNR==12'
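A quick way to see the difference between sed's cumulative line numbering and awk's per-file FNR (throwaway example files):
seq 20 > x.txt
seq 101 120 > y.txt
sed -n '12p' x.txt y.txt     # prints only 12, i.e. line 12 of the combined input
awk 'FNR==12' x.txt y.txt    # prints 12 and 112, i.e. line 12 of each file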
P.S: I personally like using the for loop.
