Concatenating multiple text files into a single file in Bash

What is the quickest and most pragmatic way to combine all *.txt files in a directory into one large text file?
Currently I'm using Windows with Cygwin, so I have access to Bash.
A Windows shell command would be nice too, but I doubt there is one.

This appends the output to all.txt
cat *.txt >> all.txt
This overwrites all.txt
cat *.txt > all.txt

Just remember, for all the solutions given so far, the shell decides the order in which the files are concatenated. For Bash, glob expansion sorts the names according to the current locale, which usually means alphabetical order. If the order is important, you should either name the files appropriately (01file.txt, 02file.txt, etc.) or specify each file in the order you want it concatenated.
$ cat file1 file2 file3 file4 file5 file6 > out.txt
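If the names are numbered, brace expansion is a compact way to spell out that explicit order (a sketch, assuming files named file1.txt through file6.txt):
cat file{1..6}.txt > out.txt    # expands to file1.txt file2.txt ... file6.txt, in that order
Brace expansion produces the names in exactly the order written, so the result is concatenated 1 through 6.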

The Windows shell command type can do this:
type *.txt > outputfile.txt
The type command also writes the file names to stderr, which are not captured by the > redirect operator (but they will show up on the console).

You can use Windows shell copy to concatenate files.
C:\> copy *.txt outputfile
From the help:
To append files, specify a single file for destination, but multiple files for source (using wildcards or file1+file2+file3 format).

Be careful: with a very large number of files, a single cat *.txt can exceed the system's argument-length limit ("Argument list too long"). Personally, I used this line:
for i in $(ls | grep ".txt");do cat $i >> output.txt;done
EDIT: As someone said in the comments, you can replace $(ls | grep ".txt") with $(ls *.txt)
EDIT: thanks to @gnourf_gnourf's expertise, using a glob is the correct way to iterate over files in a directory. Consequently, blasphemous expressions like $(ls | grep ".txt") must be replaced by *.txt (see the article here).
Good Solution
for i in *.txt; do cat "$i" >> output.txt; done
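Note that if output.txt lives in the same directory, the glob will match it too on a later run; a sketch that skips the destination file explicitly:
for i in *.txt; do
  [ "$i" = output.txt ] && continue   # don't append the destination file to itself
  cat "$i" >> output.txt
done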

How about this approach?
find . -type f -name '*.txt' -exec cat {} + >> output.txt
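If output.txt sits inside the tree being searched, it can end up appended to itself, and find's output order is not guaranteed; a sketch that excludes it and pins a sorted order (assumes GNU find/sort/xargs for -print0, -z and -0):
find . -maxdepth 1 -type f -name '*.txt' ! -name 'output.txt' -print0 | sort -z | xargs -0 cat > output.txt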

The most pragmatic way with the shell is the cat command. Other ways include:
awk '1' *.txt > all.txt
perl -ne 'print;' *.txt > all.txt
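A similar one-liner with sed also works, since an empty script simply auto-prints every input line (a sketch):
sed '' *.txt > all.txt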

type [source folder]\*.[File extension] > [destination folder]\[file name].[File extension]
For example:
type C:\*.txt > C:\1\all.txt
That will take all the .txt files in the C:\ folder and save them in the C:\1 folder under the name all.txt.
Or
type [source folder]\* > [destination folder]\[file name].[File extension]
For example:
type C:\* > C:\1\all.txt
That will take all the files present in the folder and put their content into C:\1\all.txt.

You can do it like this:
cat [directory_path]/**/*.[h,m] > test.txt
Note that ** only recurses when globstar is enabled (shopt -s globstar), and that [h,m] is a bracket expression matching h, a comma, or m. If you use brace expansion ({h,m}) to list the extensions instead, the output is grouped by extension (all .h files first, then all .m files), which changes the sequencing.
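A sketch of the difference, assuming bash with globstar enabled:
shopt -s globstar
cat [directory_path]/**/*.[hm] > test.txt      # [hm] matches .h and .m, interleaved within each directory
cat [directory_path]/**/*.{h,m} > test.txt     # {h,m} expands to two globs: every .h file first, then every .m file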

The most upvoted answers put every file name on a single command line, which will fail with "Argument list too long" if the file list is too big.
One solution that sidesteps this limit is fd, which batches the arguments itself:
fd -e txt -d 1 -X awk 1 > combined.txt
-d 1 limits the search to the current directory. If you omit this option then it will recursively find all .txt files from the current directory.
-X (otherwise known as --exec-batch) executes a command (awk 1 in this case) for all the search results at once.
Note, fd is not a "standard" Unix program, so you will likely need to install it

When you run into the problem where it cats all.txt into all.txt,
you can check whether all.txt already exists and, if it does, remove it first.
Like this:
[ -e all.txt ] && rm all.txt
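Or, more simply, remove any stale output up front and recreate it in one line (a sketch):
rm -f all.txt && cat *.txt > all.txt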

All of that is nasty....
ls | grep '\.txt$' | while IFS= read -r file; do cat "$file" >> ./output.txt; done
Easy stuff.

Related

Combine multiple files into one including the file name

I have been looking around trying to combine multiple text files into one, including the name of each file.
My current file content is:
1111,2222,3333,4444
What I'm after is:
File1,1111,2222,3333,4444
File1,1111,2222,3333,4445
File1,1111,2222,3333,4446
File1,1111,2222,3333,4447
File2,1111,2222,3333,114444
File2,1111,2222,3333,114445
File2,1111,2222,3333,114446
I found multiple examples of how to combine them all, but nothing that combines them while including the file name.
Could you please try the following, considering that your input files' extensions are .csv:
awk 'BEGIN{OFS=","} {print FILENAME,$0}' *.csv > output_file
After seeing the OP's comments: if the file extensions are .txt, then try:
awk 'BEGIN{OFS=","} {print FILENAME,$0}' *.txt > output_file
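If you want the file name without its extension, as shown in the question's sample output (File1 rather than File1.txt), a sketch that strips the suffix:
awk 'BEGIN{OFS=","} {name=FILENAME; sub(/\.txt$/,"",name); print name,$0}' *.txt > output_file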
Assuming all your files have a .txt extension and contain only one line as in the example, you can use the following code:
for f in *.txt; do echo "$f,$(cat "$f")"; done > output.log
where output.log is the output file.
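The one-liner above assumes each file holds a single line, as in the example; a sketch that prefixes every line when files contain several lines:
for f in *.txt; do
  while IFS= read -r line; do
    printf '%s,%s\n' "$f" "$line"   # file name, comma, then the original line
  done < "$f"
done > output.log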
Well, it works:
printf "%s\n" *.txt |
xargs -n1 -d $'\n' bash -c 'xargs -n1 -d $'\''\n'\'' printf "%s,%s\n" "$1" <"$1"' --
First, output a newline-separated list of files.
Then, for each file, xargs executes bash.
Inside bash, xargs is executed again, once for each line of the file,
and it runs printf "%s,%s\n" <filename> <line> for each line of input.
Tested in repl.
Solved using grep "" *.txt -I > $filename (note that grep separates the file name from the line with a : rather than a comma).

Print list of files in a directory to a text file (but not the text file itself) from terminal

I would like to print all the filenames of every file in a directory to a .txt file.
Let's assume that I had a directory with 3 files:
file1.txt
file2.txt
file3.txt
and I tried using ls > output.txt.
The thing is that when I open output.txt I find this list:
file1.txt
file2.txt
file3.txt
output.txt
Is there a way to avoid printing the name of the file where I'm redirecting the output? Or better is there a command able to print all the filenames of files in a directory except one?
printf '%s\n' * > output.txt
Note that this assumes that there's no preexisting output.txt file -
if so, delete it first.
printf '%s\n' * uses globbing (filename expansion) to robustly print the names of all files and subdirectories located in the current directory, line by line.
Globbing happens before output.txt is created via output redirection > output.txt (which still happens before the command is executed, which explains your problem), so its name is not included in the output.
Globbing also avoids the use of ls, whose use in scripting is generally discouraged.
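If an output.txt could already exist, a sketch that captures the glob into an array first and then filters the output file's name out before writing:
files=(*)
printf '%s\n' "${files[@]}" | grep -vFx 'output.txt' > output.txt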
In general, it is not good to parse the output of ls, especially while writing production quality scripts that need to be in good standing for a long time. See this page to find out why: Don't parse ls output
In your example, output.txt is a part of the output in ls > output.txt because shell arranges the redirection (to output.txt) before running ls.
The simplest way to get the right behavior for your case would be:
ls file*txt > output.txt # as long as you are looking for files named that way
or, store the output in a hidden file (or in a normal file in some other directory) and then move it to the final place:
ls > .output.txt && mv .output.txt output.txt
A more generic solution would be using grep -v:
ls | grep -vFx output.txt > output.txt
Or, you can use an array:
files=( "$(ls)" )
printf '%s\n' "${files[@]}" > output.txt
ls has an ignore option, and we can use the find command as well.
Using ls with ignore option
ls -I "output.txt" > output.txt
ls --ignore "output.txt" > output.txt
-I and --ignore are the same. As the man page says, this option means: do not list implied entries matching shell PATTERN.
Using find
find \! -name "output.txt" > output.txt
The -name option in find matches files/directories whose name matches the pattern.
! -name excludes those whose name matches the pattern.
find \! -name "output.txt" -printf '%P\n' > output.txt
%P strips the path and gives only names.
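Note that both find commands above also descend into subdirectories; a sketch restricted to the current directory (-printf is a GNU find extension):
find . -maxdepth 1 -type f \! -name "output.txt" -printf '%f\n' > output.txt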
The safest way, without assuming anything about the file names, is to use bash arrays (in memory) or a temporary file. A temporary file does not need memory, so it may be even safer. Something like:
#!/bin/bash
tmp=$(tempfile)
ls > "$tmp"
mv "$tmp" output.txt
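tempfile is not available on every system; a sketch using mktemp, which is more widely available:
tmp=$(mktemp) && ls > "$tmp" && mv "$tmp" output.txt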
Using the ls and awk commands, you can get the desired output:
ls -ltr | awk '/txt/ && !/output.txt/ {print $9}' > output.txt
This will print only the file names (the extra pattern keeps output.txt itself, which also matches /txt/, out of the list).
My way would be like:
ls *.txt > output.txt
Note that the shell always expands globs before running the command. In your specific case, the glob expansion goes like this:
# "ls *.txt > output.txt" will be expanded as
ls file1.txt file2.txt file3.txt > output.txt
The reason you get "output.txt" in your final output file is that the shell sets up the redirection before ls runs.
That means output.txt is created before ls reads the directory, so ls ends up listing it along with everything else.

Remove Lines in Multiple Text Files that Begin with a Certain Word

I have hundreds of text files in one directory. For all of them, I want to delete every line that begins with HETATM. I would need csh or bash code.
I would think you would use grep, but I'm not sure.
Use sed like this:
sed -i -e '/^HETATM/d' *.txt
to process all files in place.
-i means "in place".
-e means to execute the command that follows.
/^HETATM/ means "find lines starting with HETATM", and the following d means "delete".
Make a backup first!
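If you'd like sed to keep the backups for you, a sketch using an in-place suffix (the attached -i.bak form works with both GNU and BSD/macOS sed):
sed -i.bak -e '/^HETATM/d' *.txt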
If you really want to do it with grep, you could do this:
#!/bin/bash
for f in *.txt
do
grep -v "^HETATM" "$f" > $$.tmp && mv $$.tmp "$f"
done
It makes a temporary file of the output from grep (in file $$.tmp) and only overwrites your original file if the command executes successfully.
Using the -v option of grep to get all the lines that do not match:
grep -v '^HETATM' input.txt > output.txt

shell - cat - merge files content into one big file

I'm trying, using bash, to merge the content of a list of files (more than 1K) into a big file.
I've tried the following cat command:
cat * >> bigfile.txt
however, this command merges everything, including content that has already been merged.
e.g.
file1.txt
content1
file2.txt
content2
file3.txt
content3
file4.txt
content4
bigfile.txt
content1
content2
content3
content2
content3
content4
content2
but I would like just
content1
content2
content3
content4
inside the .txt file
The other way would be cat file1.txt file2.txt ... and so on... but I cannot do it for more than 1k files!
Thank you for your support!
The problem is that you put bigfile in the same directory, hence making it part of *. So something like
cat dir/* > bigfile
should just work as you want it, with your fileN.txt files located in dir/
You can keep the output file in the same directory, you just have to be a bit more sophisticated than *:
shopt -s extglob
cat !(bigfile.txt) > bigfile.txt
On re-reading your question, it appears that you want to append data to bigfile.txt, but
without adding duplicates. You'll have to pass everything through sort -u to filter out duplicates:
sort -u * -o bigfile.txt
The -o option to sort allows you to safely include the contents of bigfile.txt in the input to sort before the file is overwritten with the output.
EDIT: Assuming bigfile.txt is sorted, you can try a two-stage process:
sort -u file*.txt | sort -um - bigfile.txt -o bigfile.txt
First we sort the input files, removing duplicates. We pipe that output to another sort -u process, this one using the -m option as well which tells sort to merge two previously sorted files. The two files we will merge are - (standard input, the stream coming from the first sort), and bigfile.txt itself. We again use the -o option to allow us to write the output back to bigfile.txt after we've read it as input.
The other way would be cat file1.txt file2.txt ... and so on... but I cannot do it for more than 1k files!
This is what xargs is for:
find . -maxdepth 1 -type f -name "file*.txt" -print0 | xargs -0 cat > bigfile.txt
This is an old question, but I'll still give another approach with xargs.
List the files you want to concatenate:
ls | grep [pattern] > filelist
Review that the files are in the proper order with vi or cat. If you use a numeric suffix (1, 2, 3, ..., N), this should be no problem.
Create the final file
cat filelist | xargs cat >> [final file]
Remove the filelist
rm -f filelist
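Note that plain xargs splits its input on whitespace, so the pipeline above breaks on file names containing spaces; a sketch that reads one name per line instead (GNU xargs, since -d is not POSIX):
xargs -d '\n' cat < filelist >> bigfile.txt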
Hope this helps anyone
Try:
cat `ls -1 *` >> bigfile.txt
I don't have a unix machine handy at the moment to test it for you first.

Listing files in date order with spaces in filenames

I am starting with a file containing a list of hundreds of files (full paths) in a random order. I would like to list the details of the ten latest files in that list. This is my naive attempt:
$ ls -las -t `cat list-of-files.txt` | head -10
That works, so long as none of the files have spaces in, but fails if they do as those files are split up at the spaces and treated as separate files. File "hello world" gives me:
ls: hello: No such file or directory
ls: world: No such file or directory
I have tried quoting the files in the original list-of-files file, but the command substitution still splits the filenames at the spaces, treating the quotes as part of the filenames:
$ ls -las -t `awk '{print "\"" $0 "\""}' list-of-files.txt` | head -10
ls: "hello: No such file or directory
ls: world": No such file or directory
The only way I can think of doing this, is to ls each file individually (using xargs perhaps) and create an intermediate file with the file listings and the date in a sortable order as the first field in each line, then sort that intermediate file. However, that feels a bit cumbersome and inefficient (hundreds of ls commands rather than one or two). But that may be the only way to do it?
Is there any way to pass "ls" a list of files to process, where those files could contain spaces - it seems like it should be simple, but I'm stumped.
Instead of "one or more blank characters", you can force bash to use another field separator:
OIFS=$IFS
IFS=$'\n'
ls -las -t $(cat list-of-files.txt) | head -10
IFS=$OIFS
However, I don't think this code would be more efficient than doing a loop; in addition, that won't work if the number of files in list-of-files.txt exceeds the max number of arguments.
Try this:
xargs -a list-of-files.txt ls -last | head -n 10
I'm not sure whether this will work, but did you try escaping spaces with \? Using sed or something. sed "s/ /\\\\ /g" list-of-files.txt, for example.
This worked for me:
xargs -d\\n ls -last < list-of-files.txt | head -10
