Merge huge number of files into one file by reading the files in ascending order - bash

I want to merge a large number of files into a single file, and the merge should follow the ascending order of the file names. The command below works as intended, except that output.txt ends up with all the data on a single line, because each input file contains only one line of data with no trailing newline.
Is there any way to write each file's data to output.txt on its own line, rather than merging everything into a single line?
My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Actual output:
$ ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345
EDIT:
Since my input files won't contain any spaces or special characters like backslash or quotes, I decided to use the below command which is working for me as expected.
find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
Tried with file name containing a space and below are the results with 2 different commands.
Example:
$ cat 9999_xyz_1.json
abcdef
$ cat 9999_ xyz_2.json -- This File name contains a space
12345
$ cat 9999_xyz_3.json
Hello
Expected output.txt:
abcdef
12345
Hello
Command:
find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt
Output:
Successfully completed the merge as expected, but with an error at the end.
abcdef
12345
hello
awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)
Command:
Here I have used sort with the -zV options to avoid the error that occurred in the above command.
find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt
Output:
Command completed successfully but results are not as expected. Here the file name having space is treated as last file after the sort. The expectation is that the file name with space should be at second position after the sort.
abcdef
hello
12345

I would approach this with a for loop, and use echo to add the newline between each file:
for x in $(ls -v -1 -d "$PWD/"9999_xyz_*.json); do
    cat "$x"   # quote the name so cat receives it unchanged
    echo       # append the missing newline
done > output.txt
Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.
EDIT
You can optimize this a lot by using awk 1 as @oguzismail did in his answer:
ls -d -1 -v "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt
This solution finishes in 4 seconds on my machine with 12000 files, as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while xargs needs only a handful of awk processes, which is a lot more efficient.
Note: if you want to upvote this, make sure to upvote @oguzismail's answer too, since using awk 1 is his idea. His answer with printf and sort -V is safer, though, so you probably want to use that solution anyway.
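Why awk 1 helps where cat does not can be seen in a small sketch. The scratch directory and the file names a and b are made up for illustration. awk 1 prints every record followed by the output record separator (a newline by default), so a file that lacks a trailing newline still ends up on its own line:

```shell
#!/usr/bin/env bash
# Hypothetical scratch files, each holding one line with no trailing newline.
tmp=$(mktemp -d)
printf 'abcdef' > "$tmp/a"
printf '12345'  > "$tmp/b"

# cat copies bytes verbatim, so the contents of the two files run together.
with_cat=$(cat "$tmp/a" "$tmp/b")

# awk 1 re-emits each record terminated by a newline.
with_awk=$(awk 1 "$tmp/a" "$tmp/b")

printf 'cat:\n%s\n' "$with_cat"
printf 'awk 1:\n%s\n' "$with_awk"
rm -rf "$tmp"
```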

Don't parse the output of ls, use an array instead.
for fname in 9999_xyz_*.json; do
index="${fname##*_}"
index="${index%.json}"
files[index]="$fname"
done && awk 1 "${files[@]}" > output.txt
Another approach that relies on GNU extensions:
printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt
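A minimal sketch of the array approach on three made-up files (the scratch directory and the file contents are assumptions for illustration). The glob expands in string order, which is not numeric order, but an indexed bash array expands in ascending index order, so using the numeric suffix as the index gives the wanted sequence:

```shell
#!/usr/bin/env bash
# Hypothetical sample files; none of them ends with a newline.
tmp=$(mktemp -d)
printf 'abcdef' > "$tmp/9999_xyz_1.json"
printf '12345'  > "$tmp/9999_xyz_2.json"
printf 'Hello'  > "$tmp/9999_xyz_10.json"

files=()
for fname in "$tmp"/9999_xyz_*.json; do
    index=${fname##*_}      # strip everything up to the last underscore
    index=${index%.json}    # strip the extension, leaving the number
    files[index]=$fname     # the number becomes the array index
done

# "${files[@]}" expands in index order: 1, 2, 10.
merged=$(awk 1 "${files[@]}")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

This relies on every file having a unique numeric suffix; two names with the same number would overwrite each other in the array.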

Related

grep from 7 GB text file OR many smaller ones

I have about two thousand text files in folder.
I want to loop each one and search for specific word in line.
for file in "./*.txt";
do
cat $file | grep "banana"
done
I was wondering if join all text files into one file would be faster.
The whole directory has about 7 GB.
You're not actually looping, you're calling cat just once on the string ./*.txt, i.e., your script is equivalent to
cat ./*.txt | grep 'banana'
This is not equivalent to
grep 'banana' ./*.txt
though, as the output for the latter would prefix the filename for each match; you could use
grep -h 'banana' ./*.txt
to suppress filenames.
The problem you could run into is that ./*.txt expands to something that is longer than the maximum command line length allowed; to prevent that, you could do something like
printf '%s\0' ./*.txt | xargs -0 grep -h 'banana'
which is safe for file names containing blanks and shell metacharacters, and calls grep as few times as possible [1].
This can even be parallelized; to run 4 grep processes in parallel, each handling 5 files at a time:
printf '%s\0' ./*.txt | xargs -0 -L 5 -P 4 grep -h 'banana'
What I think you intended to run is this:
for file in ./*.txt; do
cat "$file" | grep "banana"
done
which would call cat/grep once per file.
[1] At first I thought that printf would run into trouble with command-line length limitations as well, but it seems that, as a shell built-in, it's exempt:
$ touch '%s\0' {1000000..10000000} > /dev/null
-bash: /usr/bin/touch: Argument list too long
$ printf '%s\0' {1000000..10000000} > /dev/null
$
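The effect of quoting the glob can be checked with a short sketch. The scratch directory, the two sample files, and the counter names are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf 'banana\n' > "$tmp/a.txt"
printf 'apple\nbanana split\n' > "$tmp/b.txt"

# Quoted glob: the loop body runs once, with the literal pattern as its value.
quoted_iterations=0
for file in "$tmp/*.txt"; do
    quoted_iterations=$((quoted_iterations + 1))
done

# Unquoted glob: the loop body runs once per matching file.
unquoted_iterations=0
for file in "$tmp"/*.txt; do
    unquoted_iterations=$((unquoted_iterations + 1))
done

# NUL-delimited names let xargs pass any file name to grep safely.
matches=$(printf '%s\0' "$tmp"/*.txt | xargs -0 grep -h 'banana')

printf 'quoted: %d, unquoted: %d\n' "$quoted_iterations" "$unquoted_iterations"
printf '%s\n' "$matches"
rm -rf "$tmp"
```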

How to extract codes using the grep command?

I have a file with below input lines.
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
I need to extract only the above highlighted text using the grep command.
I tried the below command and didn't get the proper result. Getting the extra 2 unwanted characters in the output. Please suggest if there is any other way to achieve this through grep command.
find ./ -type f -name <FileName> -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\\....' |
grep -o '...\\....' | sort | uniq
Current Output:
123.NNN
456.1 a
Expected output:
123.NNN
456.1
You can use another grep regular expression.
find ./ -type f -name f -exec cut -f 4 -d'|' {} + |
grep -o 'Category is not found for local configuration/code/...\.[^ ]*' |
grep -o '...\..*' | sort | uniq
. matches any character, [^ ]* matches any sequence of characters until the first space
Output:
123.NNN
456.1
Your regex specifies a fixed character width for strings of variable width. Based on your examples, something like
[0-9]\+\.[A-Z0-9]\+
would seem like a better regex. However, we could probably also simplify this by merging the cut and multiple grep commands into a single Awk script.
find etc etc -exec awk -F '|' '
$4 ~ /Category is not found for local configuration\/code\/[0-9]{3}\.[0-9A-Z]/ {
split($4, a, /\/code\//);
split(a[2], b); print b[1] }' {} + |
sort -u
The two split operations are just a cheap way to pick out the text between /code/ and the next whitespace character; we have already established by way of the regex match that the string after /code/ matches the pattern we're after.
Notice also how sort has a -u option which allows you to replace (trivial cases of) uniq.
The regex variant supported by Awk is slightly different from that supported by POSIX grep; so the backslashed \+ in grep's BRE dialect is plain + in the dialect called ERE which is [more or less] supported by Awk - and grep -E. If you have grep -P you can use a third variant which has a convenient feature:
find etc etc -exec grep -oP '^([^|]*[|]){3}[^|]*Category is not found for local configuration/code/\K[0-9]{3}\.[0-9A-Z]+' {} + |
sort -u
The \K says "match up through here, but forget everything before this" and so only prints the part after this token.
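The BRE/ERE difference can be tried on the sample lines from the question. This is a sketch under two assumptions: the grep in use supports -o, and it accepts \+ in BRE, which is a GNU extension rather than POSIX. The variable names are made up:

```shell
#!/usr/bin/env bash
# Sample input copied from the question.
input=$(cat <<'EOF'
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
EOF
)

# BRE dialect: "one or more" is written \+ (GNU extension).
bre_codes=$(printf '%s\n' "$input" | grep -o '[0-9]\+\.[A-Z0-9]\+' | sort -u)

# ERE dialect (grep -E, and roughly what Awk uses): plain +.
ere_codes=$(printf '%s\n' "$input" | grep -Eo '[0-9]+\.[A-Z0-9]+' | sort -u)

printf '%s\n' "$ere_codes"
```

Both variables end up holding the two distinct codes, 123.NNN and 456.1.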
With sed:
sed -E -n 's#.*code/(.*)\s+and.*#\1#p' file.txt | uniq
Output:
123.NNN
456.1
I'd use the -P option:
grep -oP '/code/\K\S+' file | sort -u
You want to extract the non-whitespace characters following /code/
An awk using match():
$ awk 'match($0,/[0-9]+\.[A-Z0-9]+/)&&++a[(b=substr($0,RSTART,RLENGTH))]==1{print b}' file
Output:
123.NNN
456.1
Pretty printed for slightly better readability:
$ awk '
match($0,/[0-9]+\.[A-Z0-9]+/) && ++a[(b=substr($0,RSTART,RLENGTH))]==1 {
print b
}' file
It's not possible just using grep. You should use AWK instead:
awk '{split($7, ar, "/"); print ar[3]}' FILE
Explanation:
The split function splits on a string, here $7, the 7th field, placing the result in an array ar, and using the string / as delimiter.
Then prints the 3rd field of the array.
Note:
I am assuming that all of your input looks like the samples you have given us, i.e.:
aaa|b|c|ddd is not found for local configuration/code/111.nnn and customer nnn
Where aaa and ddd will not contain whitespace.
I also assume you really do have a file FILE containing those lines. It's a bit unclear.
Input:
▶ cat FILE
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
Output:
▶ awk '{split($7, ar, "/"); print ar[3]}' FILE
123.NNN
123.NNN
456.1
Single sed can do the filtering.
(The pattern can be generalized further, as suggested by others, if that is an option. But be careful not to oversimplify it, or it may match unexpected input.)
sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' input.txt
To replace your exact command,
find ./ -type f -name <Filename> -exec cat {} \; | sed -nE 's#(\S+\s+){6}configuration/code/(\S+)\s.*#\2#p' | sort | uniq
Simple substitutions on individual lines is the job sed is best suited for. This will work using any sed in any shell on any UNIX box:
$ cat file
John|1|R|Category is not found for local configuration/code/123.NNN and customer 113
TOM|2|R|Category is not found for local configuration/code/123.NNN and customer 114
PETER|3|R|Category is not found for local configuration/code/456.1 and customer 115
$ sed -n 's:.*Category is not found for local configuration/code/\([^ ]*\).*:\1:p' file | sort -u
123.NNN
456.1

How to select (grep) many different patterns by bash via a pipe?

My task
I have a file A.txt with the following content.
aijdish uhuih
buh iiu hhuih
zhuh hiu
d uhiuhg ui
...
I want to select lines with these words aijdish, d, buh ...
I only know that I can:
cat A.txt | grep "aijdish" > temp.txt
cat A.txt | grep "d" >> temp.txt
cat A.txt | grep "buh" >> temp.txt
...
But I have several thousands of words need to select this time, how can I do this under bash?
Since you have many words to look for, I suggest putting the patterns into a file and using grep's -f option:
$ cat grep-pattern.txt
aijdish
buh
d
$ grep -f grep-pattern.txt inputfile
aijdish uhuih
buh iiu hhuih
d uhiuhg ui
But if you have words like d you might want to add the -w option to match only whole words and not parts of words.
grep -wf grep-pattern.txt inputfile
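The effect of -w is easiest to see on input where a short pattern also occurs inside longer words. In this sketch the last input line ("abuh drill") is made up for that purpose and is not part of the question's data; the scratch file names are also assumptions:

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
# The first four lines come from the question; the fifth is hypothetical.
printf '%s\n' 'aijdish uhuih' 'buh iiu hhuih' 'zhuh hiu' 'd uhiuhg ui' 'abuh drill' > "$tmp/A.txt"
printf '%s\n' 'aijdish' 'buh' 'd' > "$tmp/grep-pattern.txt"

# Without -w, "buh" matches inside "abuh" and "d" matches inside "drill".
plain_count=$(grep -c -f "$tmp/grep-pattern.txt" "$tmp/A.txt")

# With -w, a pattern only matches when it forms a whole word.
word_count=$(grep -c -w -f "$tmp/grep-pattern.txt" "$tmp/A.txt")

printf 'plain: %d lines, whole-word: %d lines\n' "$plain_count" "$word_count"
rm -rf "$tmp"
```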
$ grep -E "aijdish|d|buh" inputfile
aijdish uhuih
buh iiu hhuih
d uhiuhg ui
Store the words to be searched in a file (say a.txt) and then write a script for searching every line in a.txt and matching it in the required file

moving files that contain part of a line from a file

I have a file that on each line is a string of some numbers such as
1234
2345
...
I need to move files that contain that number in their name followed by other stuff to a directory examples being
1234_hello_other_stuff_2334.pdf
2345_more_stuff_3343.pdf
I tried using xargs to do this, but my bash scripting isn't the best. Can anyone share the proper command to accomplish what I want to do?
for i in $(cat numbers.txt); do
    mv "${i}"_* examples
done
or (look ma, no cat!)
while read -r i; do
    mv "${i}"_* examples
done < numbers.txt
You could use a for loop, but that could make for a really long command line. If you have 20000 lines in numbers.txt, you might hit shell limits. Instead, you could use a pipe:
cat numbers.txt | while read number; do
mv ${number}_*.pdf /path/to/examples/
done
or:
sed 's|.*|mv -v &_*.pdf /path/to/examples/|' numbers.txt | sh
You can leave off the | sh for testing. If there are other lines in the file and you only want to match lines with 4 digits, you could restrict your match:
sed -rn '/^[0-9]{4}$/s||mv -v &_*.pdf /path/to/examples/|p' numbers.txt | sh
cat numbers.txt | xargs -n1 -I % find . -name '%*.pdf' -exec mv {} /path/to \;
% is your number (-n1 means one at a time), and passing '%*.pdf' to find means it will match all files whose names begin with that number; -exec mv then moves each match to /path/to ({} is the actual file name).
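The loop-based answers can be tried safely in a scratch directory before touching real files. In this sketch the directory layout and the extra file 9999_unrelated.pdf are made up for illustration; the inner test guards against a number that matches no file, in which case the glob would otherwise be passed to mv literally:

```shell
#!/usr/bin/env bash
# Hypothetical scratch layout mirroring the question.
tmp=$(mktemp -d)
mkdir "$tmp/examples"
touch "$tmp/1234_hello_other_stuff_2334.pdf" \
      "$tmp/2345_more_stuff_3343.pdf" \
      "$tmp/9999_unrelated.pdf"
printf '1234\n2345\n' > "$tmp/numbers.txt"

while IFS= read -r number; do
    [ -n "$number" ] || continue            # skip blank lines
    for f in "$tmp/${number}"_*; do
        if [ -e "$f" ]; then                # glob stays literal when nothing matches
            mv -- "$f" "$tmp/examples/"
        fi
    done
done < "$tmp/numbers.txt"

moved=$(ls "$tmp/examples" | wc -l)
if [ -e "$tmp/9999_unrelated.pdf" ]; then unrelated_kept=yes; else unrelated_kept=no; fi
printf 'moved %d files, unrelated file kept: %s\n' "$((moved))" "$unrelated_kept"
rm -rf "$tmp"
```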

Linux commands to output part of input file's name and line count

What Linux commands would you use successively, for a bunch of files, to count the number of lines in each file and write a line to an output file that includes part of the corresponding input file's name? For example, if we were looking at file LOG_Yellow and it had 28 lines, the output file would have a line like this (Yellow and 28 are tab separated):
Yellow 28
wc -l [filenames] | grep -v " total$" | sed s/[prefix]//
The wc -l generates the output in almost the right format; grep -v removes the "total" line that wc generates for you; sed strips the junk you don't want from the filenames.
wc -l * | head --lines=-1 > output.txt
produces output like this:
linecount1 filename1
linecount2 filename2
I think you should be able to work from here to extend to your needs.
edit: since I haven't seen the rules for your name extraction, I still leave the full name. However, unlike other answers, I'd prefer to use head rather than grep, which not only should be slightly faster, but also avoids the case of filtering out files named total*.
edit2 (having read the comments): the following does the whole lot:
wc -l * | head --lines=-1 | sed s/LOG_// | awk '{print $2 "\t" $1}' > output.txt
wc -l *| grep -v " total"
send
28 Yellow
You can reverse it if you want (awk, if you don't have space in file names)
wc -l * | grep -E -v " total$" | sed s/[prefix]// |
awk '{print $2 " " $1}'
Short of writing the script for you:
'for' for looping through your files.
'echo -n' for printing the current file.
'wc -l' for finding out the line count.
And don't forget to redirect ('>' or '>>') your results to your output file.
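The outline above could be sketched as follows. The scratch directory and the second file LOG_Blue are made up for illustration, and printf is used in place of echo -n so that the tab separator is written explicitly:

```shell
#!/usr/bin/env bash
# Hypothetical log files: LOG_Yellow with 28 lines, LOG_Blue with 3.
tmp=$(mktemp -d)
seq 1 28 > "$tmp/LOG_Yellow"
seq 1 3  > "$tmp/LOG_Blue"

report=$(
    for f in "$tmp"/LOG_*; do
        name=${f##*/LOG_}                 # drop the directory and the LOG_ prefix
        count=$(( $(wc -l < "$f") ))      # arithmetic strips any padding from wc
        printf '%s\t%d\n' "$name" "$count"
    done
)

printf '%s\n' "$report" > "$tmp/output.txt"
cat "$tmp/output.txt"
rm -rf "$tmp"
```

Reading each file through a redirect (wc -l < "$f") keeps wc from printing the file name or a total line, so no filtering with grep or head is needed.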
