Concatenating text files in bash

I have many text files in one folder, each containing a single float value on one line, and I would like to concatenate them in bash in order, for example: file_1.txt, file_2.txt ... file_N.txt. I would like to have them in one txt file, in order from 1 to N. Could someone please help me? Here is the code I have, but it concatenates them in the wrong (lexicographic) order. Thank you
for file in *.txt
do
cat ${file} >> output.txt
done

As much as I recommend against parsing the output of ls, here we go.
ls has a "version sort" option that will sort numbered files like you want. See below for a demo.
To concatenate, you want:
ls -v file*.txt | xargs cat > output
$ touch file{1..20}.txt
$ ls
file1.txt file12.txt file15.txt file18.txt file20.txt file5.txt file8.txt
file10.txt file13.txt file16.txt file19.txt file3.txt file6.txt file9.txt
file11.txt file14.txt file17.txt file2.txt file4.txt file7.txt
$ ls -1
file1.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file2.txt
file20.txt
file3.txt
file4.txt
file5.txt
file6.txt
file7.txt
file8.txt
file9.txt
$ ls -1v
file1.txt
file2.txt
file3.txt
file4.txt
file5.txt
file6.txt
file7.txt
file8.txt
file9.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file20.txt
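If you would rather not parse ls at all, and GNU sort and xargs are available (an assumption about the toolset), the same version sort can be applied to a NUL-delimited list produced by the shell's own glob, which also stays safe if a file name contains spaces:
printf '%s\0' file*.txt | sort -zV | xargs -0 cat > output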

for file in *.txt
do
cat ${file} >> output.txt
done
This works for me, as does:
for file in *.txt
do
cat $file >> output.txt
done
You don't need the {}.
But the simplest is still:
cat file*.txt > output.txt
So if you have more than 9 files, as suggested in the comment, you can do one of the following:
files=$(ls file*txt | sort -t"_" -k2g)
files=$(find . -name "file*txt" | sort -t "_" -k2g)
files=$(printf "%s\n" file_*.txt | sort -k1.6n) # Thanks to glenn jackman
and then:
cat $files
or
cat $(find . -name "file*txt" | sort -t "_" -k2g)
Best is still to number your files correctly: file_01.txt if you have fewer than 100 files, file_001.txt if fewer than 1000, and so on.
example :
ls file*txt
file_1.txt file_2.txt file_3.txt file_4.txt file_5.txt file_10.txt
They contain only their corresponding number.
$ cat $files
1
2
3
4
5
10
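If you want to follow the zero-padding advice above, here is a minimal rename sketch; it assumes the names strictly follow the file_<number>.txt pattern and that mv supports -n (no-clobber, a GNU/BSD option):
for f in file_*.txt; do
  n=${f#file_}; n=${n%.txt}               # extract the numeric part
  new=$(printf 'file_%02d.txt' "$n")      # zero-pad to two digits
  [ "$f" = "$new" ] || mv -n -- "$f" "$new"
done
After that, a plain cat file_*.txt > output.txt concatenates in the right order (for up to 99 files).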

Use this:
find . -type f -name "file*.txt" | sort -V | xargs cat -- >final_file
If the files are numbered, sorting doesn't happen in the natural order that we humans expect. For that, you have to use the -V option of the sort command.

As others have pointed out, if you have files file_1, file_2, file_3... file_123283, bash's glob expansion will put file_11 before file_2 because the names are sorted as text, not numerically.
You can use sort to get the order you want. Assuming that your files are file_#...
cat $(ls -1 file_* | sort -t_ -k2,2n)
The ls -1 lists your files one per line.
sort -t_ says to break the sorting fields down by underscores. This makes the second sorting field the numeric part of the file name.
-k2,2n says to sort by the second field numerically.
Then, you concatenate out all of the files together.
One issue is that you may exceed the maximum command-line length if you have a whole lot of files: before cat can see the file names, the $(...) must be fully expanded.
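One way to hedge against that limit, assuming the file names contain no whitespace, is to let the shell builtin printf emit the names (builtins are not subject to the exec argument-length limit) and have xargs split the sorted list across as many cat invocations as needed:
printf '%s\n' file_* | sort -t_ -k2,2n | xargs cat > output.txt
Because the redirection applies to the whole pipeline, the output still lands in a single file even if xargs runs cat more than once.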

This works for me...
for i in $(seq 0 $N); do [[ -f file_$i.txt ]] && cat file_$i.txt; done > newfile
Or, more concisely
for i in $(seq 0 $N); do cat file_$i.txt 2> /dev/null ;done > newfile

You can use ls for listing files:
for file in `ls *.txt`
do
cat ${file} >> output
done
Some sort techniques are discussed here: Unix's 'ls' sort by name

Glenn Jackman's answer is a simple solution for GNU/Linux systems.
David W.'s answer is a portable alternative.
Both solutions work well for the specific case at hand, but not in general: they break with filenames that contain embedded spaces or other metacharacters (characters that, when unquoted, have special meaning to the shell).
Here are solutions that work with filenames with embedded spaces, etc.:
Preferable solution for systems where sort -z and xargs -0 are supported (e.g., Linux, OSX, *BSD):
printf "%s\0" file_*.txt | sort -z -t_ -k2,2n | xargs -0 cat > out.txt
Uses NUL (null character, 0x0) to separate the filenames and so safely preserves their boundaries.
This is the most robust solution, because it even handles filenames with embedded newlines correctly (although such filenames are very rare in practice). Unfortunately, sort -z and xargs -0 are not POSIX-compliant.
POSIX-compliant solution, using xargs -I:
printf "%s\n" file_*.txt | sort -t_ -k2,2n | xargs -I % cat % > out.txt
Processing is line-based, and due to use of -I, cat is invoked once per input filename, making this method slower than the one above.

Related

Merge huge number of files into one file by reading the files in ascending order

I want to merge a large number of files into a single file, and the merge should happen in ascending order of the file names. I have tried the command below and it works as intended, but the only problem is that after the merge the output.txt file contains all the data on a single line, because every input file has only one line of data and no trailing newline.
Is there any way to merge each file's data into output.txt as a separate line, rather than merging everything into a single line?
My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Actual output:
$ ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345
EDIT:
Since my input files won't contain any spaces or special characters like backslash or quotes, I decided to use the below command which is working for me as expected.
find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
I tried with a file name containing a space; below are the results with 2 different commands.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_ xyz_2.json -- This File name contains a space
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Command:
find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt
Output:
Successfully completed the merge as expected, but with an error at the end.
abcdef
12345
hello
awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)
Command:
Here I have used sort with the -zV options to avoid the error that occurred with the above command.
find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt
Output:
The command completed successfully, but the results are not as expected: the file name containing the space is sorted last, whereas the expectation is that it should come second after the sort.
abcdef
hello
12345
I would approach this with a for loop, and use echo to add the newline between each file:
for x in `ls -v -1 -d "$PWD/"9999_xyz_*.json`; do
cat $x
echo
done > output.txt
Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.
EDIT
You can optimize this a lot by using awk 1 as @oguzismail did in his answer:
ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt
This solution finishes in 4 seconds on my machine, with 12000 files as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while xargs needs only a handful of awk processes, which is a lot more efficient.
Note: if you want to upvote this, make sure to upvote @oguzismail's answer too, since using awk 1 is his idea. But his answer with printf and sort -V is safer, so you probably want to use that solution anyway.
Don't parse the output of ls, use an array instead.
for fname in 9999_xyz_*.json; do
index="${fname##*_}"
index="${index%.json}"
files[index]="$fname"
done && awk 1 "${files[@]}" > output.txt
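Bash expands "${files[@]}" in ascending index order, which is what produces the numeric ordering here. A quick sanity check before concatenating:
printf '%s\n' "${files[@]}"   # should list the files from 1 to 12000 in order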
Another approach that relies on GNU extensions:
printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt

Creating a script that checks to see if each word in a file exists in another file

I am pretty new to Bash and scripting in general and could use some help. Each word in the first file is separated by \n, while the second file could contain anything. If a string from the first file is not found in the second file, I want to output it. Pretty much: "check if these words are in those words, and tell me the ones that are not."
File1.txt contains something like:
dog
cat
fish
rat
file2.txt contains something like:
dog
bear
catfish
magic ->rat
I know I want to use grep (or do I?) and the command would be (to my best understanding):
$foo.sh file1.txt file2.txt
Now for the script...
I have no idea...
grep -iv $1 $2
Give this a try. It is straightforward and not optimized, but it does the trick (I think):
while read line ; do
fgrep -q "$line" file2.txt || echo "$line"
done < file1.txt
There is a funny version below, with 4 parallel fgrep processes and the use of an additional result.txt file.
> result.txt
nb_parallel=4
while read line ; do
while [ $(jobs | wc -l) -gt "$nb_parallel" ]; do sleep 1; done
fgrep -q "$line" file2.txt || echo "$line" >> result.txt &
done < file1.txt
wait
cat result.txt
You can increase the value 4 in order to run more fgrep processes in parallel, depending on the number of CPUs/cores and the IOPS available.
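If GNU parallel happens to be installed (an assumption; it is a separate package, not part of coreutils), the same idea can be written more compactly, with -k keeping the output in input order:
parallel -j 4 -k 'grep -qF -- {} file2.txt || echo {}' :::: file1.txt > result.txt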
With the -f flag you can tell grep to use a file.
grep -vf file2.txt file1.txt
To get a good match on complete lines, use
grep -vFxf file2.txt file1.txt
As @anubhava commented, this will not match substrings. To fix that, we will use the result of grep -Fof file1.txt file2.txt (all the relevant keywords).
Combining these will give
grep -vFxf <(grep -Fof file1.txt file2.txt) file1.txt
Using awk you can do:
awk 'FNR==NR{a[$0]; next} {for (i in a) if (index(i, $0)) next} 1' file2 file1
rat
You can simply do the following (note that comm requires both inputs to be sorted; see the sketch below):
comm -2 -3 file1.txt file2.txt
and also:
diff -u file1.txt file2.txt
I know you were looking for a script, but I don't think there is any reason to write one, and if you still want a script you can just run the commands from a script.
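Since comm expects its inputs to be sorted, here is a minimal sketch that sorts both files on the fly using bash process substitution (not available in plain POSIX sh):
comm -23 <(sort file1.txt) <(sort file2.txt)
comm -23 suppresses the lines unique to file2.txt and the lines common to both, leaving only the words of file1.txt that never appear as a whole line in file2.txt.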
A similar awk solution:
$ awk 'NR==FNR{a[$0];next} {for(k in a) if(k~$0) next}1' file2 file1
rat

How can I combine a set of text files, leaving off the first line of each?

As part of a normal workflow, I receive sets of text files, each containing a header row. It's more convenient for me to work with these as a single file, but if I cat them naively, the header rows in files after the first cause problems.
The files tend to be large enough (10^3–10^5 lines, 5–50 MB) and numerous enough that it's awkward and/or tedious to do this in an editor or step-by-step, e.g.:
$ wc -l *
20251 1.csv
124520 2.csv
31158 3.csv
175929 total
$ tail -n 20250 1.csv > 1.tmp
$ tail -n 124519 2.csv > 2.tmp
$ tail -n 31157 3.csv > 3.tmp
$ cat *.tmp > combined.csv
$ wc -l combined.csv
175926 combined.csv
It seems like this should be doable in one line. I've isolated the arguments that I need but I'm having trouble figuring out how to match them up with tail and subtract 1 from the line total (I'm not comfortable with awk):
$ wc -l * | grep -v "total" | xargs -n 2
20251 foo.csv
124520 bar.csv
31158 baz.csv
87457 zappa.csv
7310 bingo.csv
29968 niner.csv
2086 hella.csv
$ wc -l * | grep -v "total" | xargs -n 2 | tail -n
tail: option requires an argument -- n
Try 'tail --help' for more information.
xargs: echo: terminated by signal 13
You don't need to use wc -l to calculate the number of lines to output; tail can skip the first line (or the first K lines), just by adding a + symbol when using the -n (or --lines) option, as described in the man page:
-n, --lines=K output the last K lines, instead of the last 10;
or use -n +K to output starting with the Kth
This makes combining all files in a directory without the first line of each file as simple as:
$ tail -q -n +2 * > combined.csv
$ wc -l *
20251 foo.csv
124520 bar.csv
31158 baz.csv
87457 zappa.csv
7310 bingo.csv
29968 niner.csv
2086 hella.csv
302743 combined.csv
605493 total
The -q flag suppresses headers in the output when globbing for multiple files with tail.
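One caveat: because combined.csv is created in the same directory, running the command a second time would feed the previous combined.csv back in through the glob. A small guard, assuming bash with extglob and that the data files end in .csv, is to exclude the output name:
shopt -s extglob
tail -q -n +2 !(combined).csv > combined.csv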
Both tail and sed answers work fine.
For the sake of an alternative here is an awk command that does the same job:
awk 'FNR > 1' *.csv > combined.csv
The FNR > 1 condition skips the first row of each file.
With GNU sed:
sed -ns '2,$p' 1.csv 2.csv 3.csv > combined.csv
or
sed -ns '2,$p' *.csv > combined.csv
Another sed alternative
sed -s 1d *.csv
This deletes the first line from each input file; without -s it would only delete the first line of the whole concatenated input.
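A usage sketch with a redirect (note that -s, like the -n/-s combination above, is a GNU sed extension):
sed -s 1d *.csv > combined.csv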

shell - cat - merge files content into one big file

I'm trying, using bash, to merge the content of a list of files (more than 1K) into a big file.
I've tried the following cat command:
cat * >> bigfile.txt
however, this command merges everything, including content that has already been merged.
e.g.
file1.txt
content1
file2.txt
content2
file3.txt
content3
file4.txt
content4
bigfile.txt
content1
content2
content3
content2
content3
content4
content2
but I would like just
content1
content2
content3
content4
inside the .txt file
The other way would be cat file1.txt file2.txt ... and so on... but I cannot do it for more than 1k files!
Thank you for your support!
The problem is that you put bigfile in the same directory, hence making it part of *. So something like
cat dir/* > bigfile
should just work as you want it, with your fileN.txt files located in dir/
You can keep the output file in the same directory, you just have to be a bit more sophisticated than *:
shopt -s extglob
cat !(bigfile.txt) > bigfile.txt
On re-reading your question, it appears that you want to append data to bigfile.txt, but
without adding duplicates. You'll have to pass everything through sort -u to filter out duplicates:
sort -u * -o bigfile.txt
The -o option to sort allows you to safely include the contents of bigfile.txt in the input to sort before the file is overwritten with the output.
EDIT: Assuming bigfile.txt is sorted, you can try a two-stage process:
sort -u file*.txt | sort -um - bigfile.txt -o bigfile.txt
First we sort the input files, removing duplicates. We pipe that output to another sort -u process, this one using the -m option as well which tells sort to merge two previously sorted files. The two files we will merge are - (standard input, the stream coming from the first sort), and bigfile.txt itself. We again use the -o option to allow us to write the output back to bigfile.txt after we've read it as input.
The other way would be cat file1.txt file2.txt ... and so on... but I cannot do it for more than 1k files!
This is what xargs is for:
find . -maxdepth 1 -type f -name "file*.txt" -print0 | xargs -0 cat > bigfile.txt
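Note that find does not guarantee any particular output order. If the numeric order of the fileN.txt names matters, and GNU sort is available (an assumption), a NUL-aware version sort can be slotted into the pipeline:
find . -maxdepth 1 -type f -name "file*.txt" -print0 | sort -zV | xargs -0 cat > bigfile.txt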
This is an old question but still I'll give another approach with xargs
list the files you want to concat
ls | grep [pattern] > filelist
Review that your files are in the proper order with vi or cat. If you use a numeric suffix (1, 2, 3, ..., N) this should be no problem.
Create the final file
cat filelist | xargs cat >> [final file]
Remove the filelist
rm -f filelist
Hope this helps anyone
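If any listed name contains spaces, plain xargs will split it into pieces; GNU xargs can be told to treat each line of filelist as a single argument (an assumption about the xargs implementation):
xargs -d '\n' cat < filelist > bigfile.txt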
Try:
cat `ls -1 *` >> bigfile.txt
I don't have a unix machine handy at the moment to test it for you first.

Unix: merge many files, while deleting first line of all files

I have >100 files that I need to merge, but for each file the first line has to be removed. What is the most efficient way to do this under Unix? I suspect it's probably a command using cat and sed '1d'. All files have the same extension and are in the same folder, so we probably could use *.extension to point to the files. Many thanks!
Assuming your filenames are sorted in the order you want your files appended, you can use:
ls *.extension | xargs -n 1 tail -n +2
EDIT: After Sorin's and Gilles' comments about the possible dangers of piping ls output, you could use:
find . -name "*.extension" | xargs -n 1 tail -n +2
Everyone has to be complicated. This is really easy:
tail -q -n +2 file1 file2 file3
And so on. If you have a large number of files you can load them into an array first:
list=(file1 file2 file3)
tail -q -n +2 "${list[@]}"
All the files with a given extension in the current directory?
list=(*.extension)
tail -q -n +2 "${list[@]}"
Or just
tail -q -n +2 *.extension
Just append each file after removing the first line.
#!/bin/bash
DEST=/tmp/out
FILES="space separated list of files"   # replace with your actual file names
: > "$DEST"                             # truncate the destination file
for FILE in $FILES
do
sed -e '1d' "$FILE" >> "$DEST"
done
tail outputs the last lines of a file. You can tell it how many lines to print, or how many lines to omit at the beginning (-n +N where N is the number of the first line to print, counting from 1 — so +2 omits one line). With GNU utilities (i.e. under Linux or Cygwin), FreeBSD or other systems that have the -q option:
tail -q -n +2 *.extension
tail prints a header before each file, and -q is not standard. If your implementation doesn't have it, or to be portable, you need to iterate over the files.
for x in *.extension; do tail -n +2 <"$x"; done
Alternatively, you can call Awk, which has a way to identify the first line of each file. This is likely to be faster if you have a lot of small files and slower if you have many large files.
awk 'FNR != 1' *.extension
ls -1 file*.txt | xargs nawk 'FNR!=1'
