shell - cat - merge file contents into one big file

I'm trying, using bash, to merge the content of a list of files (more than 1K) into a big file.
I've tried the following cat command:
cat * >> bigfile.txt
however, this command merges everything, including the content that has already been merged into bigfile.txt.
e.g.
file1.txt
content1
file2.txt
content2
file3.txt
content3
file4.txt
content4
bigfile.txt
content1
content2
content3
content2
content3
content4
content2
but I would like just
content1
content2
content3
content4
inside the .txt file
The other way would be cat file1.txt file2.txt ... and so on... but I cannot do it for more than 1k files!
Thank you for your support!

The problem is that you put bigfile in the same directory, hence making it part of *. So something like
cat dir/* > bigfile
should just work as you want it, with your fileN.txt files located in dir/
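Alternatively, if you'd rather not move the input files, a sketch that simply writes the output somewhere the glob cannot see it (the paths are just examples):
cat ./*.txt > /tmp/bigfile.txt   # the output file is outside the expansion of *.txt
mv /tmp/bigfile.txt .            # optionally move it back afterwards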

You can keep the output file in the same directory, you just have to be a bit more sophisticated than *:
shopt -s extglob
cat !(bigfile.txt) > bigfile.txt

On re-reading your question, it appears that you want to append data to bigfile.txt, but
without adding duplicates. You'll have to pass everything through sort -u to filter out duplicates:
sort -u * -o bigfile.txt
The -o option to sort allows you to safely include the contents of bigfile.txt in the input to sort before the file is overwritten with the output.
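To see why -o matters here, compare it with a plain redirection (a sketch; don't run the first line on data you care about):
sort -u * > bigfile.txt    # unsafe: the shell truncates bigfile.txt before sort ever reads it
sort -u * -o bigfile.txt   # safe: sort reads all of its input first, then writes bigfile.txt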
EDIT: Assuming bigfile.txt is sorted, you can try a two-stage process:
sort -u file*.txt | sort -um - bigfile.txt -o bigfile.txt
First we sort the input files, removing duplicates. We pipe that output to another sort -u process, this one using the -m option as well which tells sort to merge two previously sorted files. The two files we will merge are - (standard input, the stream coming from the first sort), and bigfile.txt itself. We again use the -o option to allow us to write the output back to bigfile.txt after we've read it as input.

The other way would be cat file1.txt file2.txt ... and so on... but I cannot do it for more than 1k files!
This is what xargs is for:
find . -maxdepth 1 -type f -name "file*.txt" -print0 | xargs -0 cat > bigfile.txt

This is an old question, but I'll still give another approach with xargs.
List the files you want to concatenate:
ls | grep [pattern] > filelist
Review that the files are in the proper order with vi or cat. If you use a numeric suffix (1, 2, 3, ..., N) this should be no problem.
Create the final file:
cat filelist | xargs cat >> [final file]
Remove the filelist:
rm -f filelist
Hope this helps someone.
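Putting those steps together (a sketch; [pattern] and bigfile.txt stand in for your own names):
ls | grep '[pattern]' > filelist       # 1. list the files to concatenate
xargs cat < filelist >> bigfile.txt    # 2. concatenate them in the listed order
rm -f filelist                         # 3. clean up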

Try:
cat `ls -1 *` >> bigfile.txt
I don't have a unix machine handy at the moment to test it for you first.

Related

Merge huge number of files into one file by reading the files in ascending order

I want to merge a large number of files into a single file, and the merge should happen in ascending order of the file names. I have tried the command below and it works as intended, but the only problem is that after the merge, output.txt contains all the data on a single line, because every input file has only one line of data without a trailing newline.
Is there any way to merge each file's data into output.txt as a separate line, rather than merging everything into a single line?
My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Actual output:
$ ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345
EDIT:
Since my input files won't contain any spaces or special characters like backslashes or quotes, I decided to use the command below, which works for me as expected:
find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
I also tried with a file name containing a space; below are the results with two different commands.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_ xyz_2.json -- This File name contains a space
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Command:
find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt
Output:
Successfully completed the merge as expected, but with an error at the end.
abcdef
12345
hello
awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)
Command:
Here I have used sort with the -zV options to avoid the error that occurred with the above command.
find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt
Output:
The command completed successfully, but the results are not as expected. Here the file name containing a space is treated as the last file after the sort; the expectation is that it should be in the second position after the sort.
abcdef
hello
12345
I would approach this with a for loop, and use echo to add the newline between each file:
for x in `ls -v -1 -d "$PWD/"9999_xyz_*.json`; do
cat $x
echo
done > output.txt
Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.
EDIT
You can optimize this a lot by using awk 1 as @oguzismail did in his answer:
ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt
This solution finishes in 4 seconds on my machine, with 12000 files as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while the xargs approach needs only a handful of awk processes, which is a lot more efficient.
Note: if you want to upvote this, make sure to upvote @oguzismail's answer too, since using awk 1 is his idea. But his answer with printf and sort -V is safer, so you probably want to use that solution anyway.
Don't parse the output of ls, use an array instead.
for fname in 9999_xyz_*.json; do
  index="${fname##*_}"     # keep everything after the last underscore
  index="${index%.json}"   # drop the .json extension, leaving the numeric index
  files[index]="$fname"
done && awk 1 "${files[@]}" > output.txt
Another approach that relies on GNU extensions:
printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt

Retain the latest file sets in a directory for a given file pattern

I have multiple sets of files in an FTP folder, and each set contains a text file and a marker file.
I need to get the latest set of files matching the pattern below from a given directory, based on arrival time.
File format:
<FileName>_<FileID>_<Date>_<TimeStamp>.csv
<FileName>_<FileID>_<Date>_<TimeStamp>.mrk
File1 has three sets coming at different times:
file1_123_20180306_654321.csv
file1_123_20180306_654321.mrk
file1_123_20180306_866321.csv
file1_123_20180306_866321.mrk
file1_123_20180306_976321.csv
file1_123_20180306_976321.mrk
File2 has two sets coming at different times:
file2_456_20180306_277676.csv
file2_456_20180306_277676.mrk
file2_456_20180306_788988.csv
file2_456_20180306_788988.mrk
If it's a single file I'm able to use the commands below, but I need help when it's a set.
ls -t *123*.mrk | head -1
ls -t *123*.csv | head -1
I need to retain only the latest set of files (from file1 and file2) and move the other files into a different folder.
Expected output:
file1_123_20180306_976321.csv
file1_123_20180306_976321.mrk
file2_456_20180306_788988.csv
file2_456_20180306_788988.mrk
How would I do this using shell or python2.6? Any help is much appreciated.
If a more or less exact answer to this question already exists, please point to it.
You may use this awk to get the latest file entry for each set from your two files:
printf '%s\0' *_*_*_*.csv *_*_*_*.mrk |
awk -v RS='\0' -v ORS='\0' -F '[_.]' 'NF{a[$1,$2,$3,$NF]=$0}
END{for (i in a) print a[i]}' |
xargs -0 -I {} echo mv '{}' /dest/dir
Output:
mv file2_456_20180306_788988.csv /dest/dir
mv file1_123_20180306_976321.mrk /dest/dir
mv file1_123_20180306_976321.csv /dest/dir
mv file2_456_20180306_788988.mrk /dest/dir
When you're satisfied with the output, remove the echo before the mv command to actually move these files into the destination directory.
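If instead you want to keep the latest set in place and move only the older files (as the question describes), a sketch along the same lines: since the glob expands in ascending name order and the timestamps are fixed width, every entry seen before the last one for a given key is an older file (echo and /archive/dir are placeholders):
printf '%s\0' *_*_*_*.csv *_*_*_*.mrk |
awk -v RS='\0' -v ORS='\0' -F '[_.]' '
  NF {
    k = $1 SUBSEP $2 SUBSEP $3 SUBSEP $NF   # group by name, id, date and extension
    if (k in seen) print seen[k]            # the entry stored earlier is an older file
    seen[k] = $0                            # remember the newest entry seen so far
  }' |
xargs -0 -I {} echo mv '{}' /archive/dir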

Unix - read a file line by line, check if a string exists in another file, and do the required operation

I need some assistance on the below.
File1.txt
aaa:/path/to/aaa:777
bob:/path/to/bbb:700
ccc:/path/to/ccc:600
File2.txt
aaa:/path/to/aaa:700
bbb:/path/to/bbb:700
ccc:/path/to/ccc:644
I should iterate over File2.txt, and if aaa exists in File1.txt, then I should compare the file permissions. If the permission for aaa is the same in both files, then ignore it.
If they are different, then write them to Output.txt.
So in above case
Output.txt
aaa:/path/to/aaa:700
ccc:/path/to/ccc:644
How can I achieve this in a Unix shell script? Please suggest.
I agree with the comment from @Marc that you should try something before asking here.
However, the following constructions are difficult to find when you have never seen them before, so I'll give you something to study.
When you want to parse line by line, you can start with:
while IFS=: read -r file path mode; do
comparewith=$(grep "^${file}:${path}:" File2.txt | cut -d: -f3)
# compare and output
done < File1.txt
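A filled-in version of that skeleton (a sketch; the variable names are mine, and it writes the File2.txt line whenever the permission differs, which matches the expected Output.txt):
while IFS=: read -r file path mode; do
  other=$(grep "^${file}:${path}:" File2.txt | cut -d: -f3)   # permission recorded in File2.txt, if any
  if [ -n "$other" ] && [ "$other" != "$mode" ]; then
    printf '%s:%s:%s\n' "$file" "$path" "$other" >> Output.txt
  fi
done < File1.txt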
For large files this will become very slow.
You can first filter the lines you want to compare from File2.txt.
You want to grep strings like aaa:/path/to/aaa:, including the last :. With cut -d: -f1-2 you might be fine with your input file, but maybe it is better to remove the last three characters:
sed 's/...$//' File1.txt
You can let grep use the output as a file with expressions using <():
grep -f <(sed 's/...$//' File1.txt) File2.txt
Your example files don't show the situation where both files have identical lines (which you want to skip); you will need another process substitution to get that working:
grep -v -f File1.txt <(grep -f <(sed 's/...$//' File1.txt ) File2.txt )
Another solution, worth trying yourself, is using awk (see What is "NR==FNR" in awk? for accessing 2 files).
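A sketch of that awk approach (my own, using NR==FNR to load File1.txt first; it assumes the first two colon-separated fields identify a file and the third is the permission):
awk -F: 'NR==FNR { perm[$1 FS $2] = $3; next }        # first pass: remember the permission from File1.txt
         ($1 FS $2) in perm && perm[$1 FS $2] != $3   # second pass: print File2.txt lines whose permission differs
' File1.txt File2.txt > Output.txt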
comm - compare two sorted files line by line
According to the manual, comm -13 <file1> <file2> prints only the lines unique to <file2>:
$ ls
File1.txt File2.txt
$ cat File1.txt
aaa:/path/to/aaa:777
bbb:/path/to/bbb:700
ccc:/path/to/ccc:600
$ cat File2.txt
aaa:/path/to/aaa:700
bbb:/path/to/bbb:700
ccc:/path/to/ccc:644
$ comm -13 File1.txt File2.txt
aaa:/path/to/aaa:700
ccc:/path/to/ccc:644
$ # Nice!
But it doesn't check for lines in <file1> that are "similar" to corresponding lines of <file2>. I.e., it won't work as you want if File1.txt has the line BOB:/path/to/BOB:700 and File2.txt has BBB:/path/to/BBB:700, since it will print the latter (while you want it not to be printed).
It also won't do what you want if the strings bbb:/path/to/bbb:700 and bbb:/another/path/to/bbb:700 are supposed to be "identical".
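Also note that comm expects both inputs to be sorted; if yours are not, a sketch using process substitution:
comm -13 <(sort File1.txt) <(sort File2.txt)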

Concatenating text files in bash

I have many text files in one folder, each containing a single float value on one line, and I would like to concatenate them in bash in order: file_1.txt, file_2.txt, ..., file_N.txt. I would like to have them in one txt file, in order from 1 to N. Could someone please help me? Here is the code I have, but it just concatenates them in an arbitrary order. Thank you.
for file in *.txt
do
cat ${file} >> output.txt
done
As much as I recommend against parsing the output of ls, here we go.
ls has a "version sort" option that will sort numbered files like you want. See below for a demo.
To concatenate, you want:
ls -v file*.txt | xargs cat > output
$ touch file{1..20}.txt
$ ls
file1.txt file12.txt file15.txt file18.txt file20.txt file5.txt file8.txt
file10.txt file13.txt file16.txt file19.txt file3.txt file6.txt file9.txt
file11.txt file14.txt file17.txt file2.txt file4.txt file7.txt
$ ls -1
file1.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file2.txt
file20.txt
file3.txt
file4.txt
file5.txt
file6.txt
file7.txt
file8.txt
file9.txt
$ ls -1v
file1.txt
file2.txt
file3.txt
file4.txt
file5.txt
file6.txt
file7.txt
file8.txt
file9.txt
file10.txt
file11.txt
file12.txt
file13.txt
file14.txt
file15.txt
file16.txt
file17.txt
file18.txt
file19.txt
file20.txt
for file in *.txt
do
cat ${file} >> output.txt
done
This works for me, as well as:
for file in *.txt
do
cat $file >> output.txt
done
You don't need {}
But the simplest is still:
cat file*.txt > output.txt
So if you have more than 9 files, as suggested in the comments, you can do one of the following:
files=$(ls file*txt | sort -t"_" -k2g)
files=$(find . -name "file*txt" | sort -t "_" -k2g)
files=$(printf "%s\n" file_*.txt | sort -k1.6n) # Thanks to glenn jackman
and then:
cat $files
or
cat $(find . -name "file*txt" | sort -t "_" -k2g)
Best is still to number your files correctly: file_01.txt if you have fewer than 100 files, file_001.txt if fewer than 1000, and so on (see the renaming sketch after the example below).
Example:
ls file*txt
file_1.txt file_2.txt file_3.txt file_4.txt file_5.txt file_10.txt
They contain only their corresponding number.
$ cat $files
1
2
3
4
5
10
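If you do want to renumber existing files with zero padding, as suggested above, a sketch (assuming unpadded names of the form file_N.txt; mv -n refuses to overwrite existing files):
for f in file_*.txt; do
  n=${f#file_}                                 # strip the "file_" prefix
  n=${n%.txt}                                  # strip the ".txt" suffix, leaving the number
  mv -n "$f" "$(printf 'file_%03d.txt' "$n")"  # rename, e.g. file_7.txt -> file_007.txt
done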
Use this:
find . -type f -name "file*.txt" | sort -V | xargs cat -- >final_file
If the files are numbered, sorting doesn't happen in the natural way that we humans expect. For that, you have to use the -V option with the sort command.
As others have pointed out, if you have files file_1, file_2, file_3... file_123283, the internal BASH sorting of these files will put file_11 before file_2 because they're sorted by text and not numerically.
You can use sort to get the order you want. Assuming that your files are file_#...
cat $(ls -1 file_* | sort -t_ -k2,2n)
The ls -1 lists your files, one per line.
sort -t_ says to break the sorting fields down by underscores. This makes the second sorting field the numeric part of the file name.
-k2,2n says to sort by the second field numerically.
Then, you concatenate out all of the files together.
One issue is that you may end up filling up your command line buffer if you have a whole lot of files. Before cat can get the file names, the $(...) must first be expanded.
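If the expansion gets too long, a sketch that streams the sorted names through xargs instead (assuming the file names contain no spaces or newlines):
ls -1 file_* | sort -t_ -k2,2n | xargs cat > out.txt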
This works for me...
for i in $(seq 0 $N); do [[ -f file_$i.txt ]] && cat file_$i.txt; done > newfile
Or, more concisely
for i in $(seq 0 $N); do cat file_$i.txt 2> /dev/null ;done > newfile
You can use ls for listing files:
for file in `ls *.txt`
do
cat ${file} >> output
done
Some sort techniques are discussed here: Unix's 'ls' sort by name
Glenn Jackman's answer is a simple solution for GNU/Linux systems.
David W.'s answer is a portable alternative.
Both solutions work well for the specific case at hand, but not generally in that they'll break with filenames with embedded spaces or other metacharacters (characters that, when used unquoted, have special meaning to the shell).
Here are solutions that work with filenames with embedded spaces, etc.:
Preferable solution for systems where sort -z and xargs -0 are supported (e.g., Linux, OSX, *BSD):
printf "%s\0" file_*.txt | sort -z -t_ -k2,2n | xargs -0 cat > out.txt
Uses NUL (null character, 0x0) to separate the filenames and so safely preserves their boundaries.
This is the most robust solution, because it even handles filenames with embedded newlines correctly (although such filenames are very rare in practice). Unfortunately, sort -z and xargs -0 are not POSIX-compliant.
POSIX-compliant solution, using xargs -I:
printf "%s\n" file_*.txt | sort -t_ -k2,2n | xargs -I % cat % > out.txt
Processing is line-based, and due to use of -I, cat is invoked once per input filename, making this method slower than the one above.

Concatenating multiple text files into a single file in Bash

What is the quickest and most pragmatic way to combine all *.txt file in a directory into one large text file?
Currently I'm using windows with cygwin so I have access to BASH.
A Windows shell command would be nice too, but I doubt there is one.
This appends the output to all.txt
cat *.txt >> all.txt
This overwrites all.txt
cat *.txt > all.txt
Just remember, for all the solutions given so far, the shell decides the order in which the files are concatenated. For Bash, IIRC, that's alphabetical order. If the order is important, you should either name the files appropriately (01file.txt, 02file.txt, etc...) or specify each file in the order you want it concatenated.
$ cat file1 file2 file3 file4 file5 file6 > out.txt
The Windows shell command type can do this:
type *.txt > outputfile.txt
The type command also writes file names to stderr, which are not captured by the > redirect operator (but will show up on the console).
You can use Windows shell copy to concatenate files.
C:\> copy *.txt outputfile
From the help:
To append files, specify a single file for destination, but multiple files for source (using wildcards or file1+file2+file3 format).
Be careful, because none of these methods work with a large number of files. Personally, I used this line:
for i in $(ls | grep ".txt");do cat $i >> output.txt;done
EDIT: As someone said in the comments, you can replace $(ls | grep ".txt") with $(ls *.txt)
EDIT: thanks to @gnourf_gnourf's expertise, using a glob is the correct way to iterate over files in a directory. Consequently, blasphemous expressions like $(ls | grep ".txt") must be replaced by *.txt (see the article here).
Good Solution
for i in *.txt;do cat $i >> output.txt;done
How about this approach?
find . -type f -name '*.txt' -exec cat {} + >> output.txt
The most pragmatic way with the shell is the cat command. Other ways include:
awk '1' *.txt > all.txt
perl -ne 'print;' *.txt > all.txt
type [source folder]\*.[File extension] > [destination folder]\[file name].[File extension]
For Example:
type C:\*.txt > C:\1\all.txt
That will take all the txt files in the C:\ folder and save them in the C:\1 folder under the name all.txt
Or
type [source folder]\* > [destination folder]\[file name].[File extension]
For Example:
type C:\* > C:\1\all.txt
That will take all the files present in the folder and put their content in C:\1\all.txt
You can do it like this:
cat [directory_path]/**/*.[h,m] > test.txt
If you use brace expansion ({h,m}) to include the extensions of the files you want to match, there is a sequencing problem: all files of one extension are listed before any of the other.
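Note that in bash the ** pattern only recurses when the globstar option is enabled (bash 4+); a minimal sketch:
shopt -s globstar
cat ./**/*.txt > combined.out   # output name chosen so it doesn't match the *.txt glob itself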
The most upvoted answers will fail if the file list is too long.
A more portable solution would be using fd
fd -e txt -d 1 -X awk 1 > combined.txt
-d 1 limits the search to the current directory. If you omit this option then it will recursively find all .txt files from the current directory.
-X (otherwise known as --exec-batch) executes a command (awk 1 in this case) for all the search results at once.
Note: fd is not a "standard" Unix program, so you will likely need to install it.
When you run into the problem where it cats all.txt into all.txt, you can check whether all.txt already exists and, if it does, remove it first.
Like this:
[ -e all.txt ] && rm all.txt
All of that is nasty...
ls | grep '\.txt' | while read -r file; do cat "$file" >> ./output.txt; done
Easy stuff.
