Unix: one-line bash command to merge 3 files together, extracting only the first line of each - bash

I am having a hard time with my syntax here:
I have 3 files with various content: file1, file2, file3 (100+ lines). I am trying to merge them together, but only the first line of each file should be merged. The point is to do it using one line of bash code:
sed -n 1p file1 file2 file3 returns only the first line of file1

You might want to try
head -n1 -q file1 file2 file3.
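For example, if the first lines of file1, file2 and file3 were alpha, beta and gamma (hypothetical contents), this would print:
$ head -n1 -q file1 file2 file3
alpha
beta
gamma
The -q matters: without it, head prints a ==> file <== header before each file's output when given multiple files.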

It's not clear whether by merge you mean concatenate or join.
In awk by joining (each first line in the files printed side by side):
$ awk 'FNR==1{printf "%s ",$0}' file1 file2 file3
1 2 3
In awk by concatenating (each first line in the files printed one after another):
$ awk 'FNR==1' file1 file2 file3
1
2
3

I suggest you use head as explained in themel's answer. However, if you insist on using sed, you cannot simply pass all the files to it, since they are implicitly concatenated and you lose track of where each file's first line is. So, if you really want to do it in sed, you need bash to help you out:
for f in file1 file2 file3; do sed -n 1p "$f"; done
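With GNU sed you can also skip the loop: the -s (--separate) option treats the input files as separate streams instead of one concatenated stream, so address 1 matches the first line of each file:
sed -sn 1p file1 file2 file3
(This is a GNU extension; the loop above is the portable route.)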

You can avoid calling external processes by using the read built-in command:
for f in file1 file2 file3; do read l < "$f"; echo "$l"; done > merged.txt
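A slightly hardened variant of the same idea (a sketch): IFS= and read -r keep leading/trailing whitespace and literal backslashes intact, and printf avoids echo's quirks with contents that look like options:
for f in file1 file2 file3; do IFS= read -r l < "$f"; printf '%s\n' "$l"; done > merged.txt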


How to compare two files in a bash script and find lines in one file not in another? [duplicate]

I have two large files (sets of filenames), roughly 30,000 lines in each. I am trying to find a fast way of finding lines in file1 that are not present in file2.
For example, if this is file1:
line1
line2
line3
And this is file2:
line1
line4
line5
Then my result/output should be:
line2
line3
This works:
grep -v -f file2 file1
But it is very, very slow when used on my large files.
I suspect there is a good way to do this using diff(), but the output should be just the lines, nothing else, and I cannot seem to find a switch for that.
Can anyone help me find a fast way of doing this, using bash and basic Linux binaries?
EDIT: To follow up on my own question, this is the best way I have found so far using diff():
diff file2 file1 | grep '^>' | sed 's/^>\ //'
Surely, there must be a better way?
The comm command (short for "common") may be useful: comm - compare two sorted files line by line.
#find lines only in file1
comm -23 file1 file2
#find lines only in file2
comm -13 file1 file2
#find lines common to both files
comm -12 file1 file2
The man page is actually quite readable for this.
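If your files are not already sorted, a sketch using process substitution to sort them on the fly:
comm -23 <(sort file1) <(sort file2)   # lines only in file1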
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:
diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files must be sorted for this to work. With bash (and zsh) you can sort on the fly with process substitution <( ):
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above, new and unchanged lines are suppressed, so only changed lines (i.e. the removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -v etc.) for less strict matching.
Explanation
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.
If you are familiar with unified diff format, you can partly recreate it with:
diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2
The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u
(note that it only outputs the differences; it lacks the ---, +++ and @@ lines at the top of each grouped change).
You can also use this to do other useful things like number each line with %dn.
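For instance, a sketch that prints each removed line prefixed by its line number (%dn is the decimal line number, %L the line itself):
diff --old-line-format="%dn: %L" --new-line-format="" --unchanged-line-format="" file1 file2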
The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.
# output lines in file1 that are not in file2
BEGIN { FS="" } # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; } # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
This stores the entire contents of file1, line by line, in the line-number-indexed array ll1[], and the entire contents of file2, line by line, in the line-content-indexed associative array ss2[]. After both files are read, we iterate over ll1 and use the in operator to determine whether each line of file1 is present in file2. (This will have different output from the diff method if there are duplicates.)
In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
BEGIN { FS="" }
(NR==FNR) {   # file1, index by lineno and string
    ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) {   # file2
    if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
    for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.
In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), repeated runs with chunks of file1 and reading file2 completely each time:
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of -, meaning stdin, on the gawk command line. These chunks are provided by split from file1, 20000 lines per invocation.
For users on non-GNU systems there is almost certainly a GNU coreutils package available; on OS X, for example, the Apple Xcode tools provide GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.
As konsolebox suggested, the poster's grep solution
grep -v -f file2 file1
actually works great (much faster) if you simply add the -F option to treat the patterns as fixed strings instead of regular expressions. I verified this on a pair of ~1000-line file lists I had to compare. With -F it took 0.031 s (real), while without it took 2.278 s (real), when redirecting the grep output to wc -l.
These tests also included the -x switch, which is a necessary part of the solution to ensure total accuracy in cases where file2 contains lines that match part of, but not all of, one or more lines in file1.
So a solution that does not require the inputs to be sorted, is fast, and is flexible (case sensitivity, etc.) is:
grep -F -x -v -f file2 file1
This doesn't work with all versions of grep; for example it fails on macOS, where a line in file1 is shown as not present in file2, even though it is, if it matches another line that is a substring of it. Alternatively, you can install GNU grep on macOS in order to use this solution.
If you're short of "fancy tools", e.g. in some minimal Linux distribution, there is a solution with just cat, sort and uniq:
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
Listing excludes.txt twice guarantees that every line from it occurs at least twice in the combined stream, so uniq --unique (which keeps only lines occurring exactly once) drops all excluded lines along with any matching lines from includes.txt. (Note that it also drops lines duplicated within includes.txt itself.)
Test:
seq 1 1 7 | sort --random-sort > includes.txt
seq 3 1 9 | sort --random-sort > excludes.txt
cat includes.txt excludes.txt excludes.txt | sort | uniq --unique
# Output:
1
2
This is also relatively fast, compared to grep.
What's the speed of sort and diff?
sort file1 -u > file1.sorted
sort file2 -u > file2.sorted
diff file1.sorted file2.sorted
Use combine from the moreutils package, a set-operations utility that supports not, and, or and xor operations:
combine file1 not file2
i.e. give me lines that are in file1 but not in file2,
or: give me lines in file1 minus lines in file2.
Note: combine sorts and finds unique lines in both files before performing any operation but diff does not. So you might find differences between output of diff and combine.
So in effect you are saying
Find distinct lines in file1 and file2 and then give me lines in file1 minus lines in file2
In my experience, it's much faster than other options
This seems quick to me:
comm -1 -3 <(sort file1.txt) <(sort file2.txt) > output.txt
$ join -v 1 -t '' file1 file2
line2
line3
The empty -t separator makes sure that join compares the whole line; otherwise, a space in a line would be treated as a field separator.
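Note that join, like comm, expects sorted input, so for unsorted files a sketch would be:
$ join -v 1 -t '' <(sort file1) <(sort file2)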
You can use Python:
python -c '
lines_to_remove = set()
with open("file2", "r") as f:
    for line in f.readlines():
        lines_to_remove.add(line.strip())
with open("file1", "r") as f:
    for line in f.readlines():
        if line.strip() not in lines_to_remove:
            print(line.strip())
'
Using fgrep, or adding the -F option to grep, could help. But for faster computation you could use Awk.
You could try one of these Awk methods:
http://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/#post4066219
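A sketch along the same lines (not necessarily the exact script linked): load file2 into an associative array, then print each line of file1 that is not a key:
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2 file1
Like the grep -Fxf method, this needs no sorting and preserves the order of file1.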
The way I usually do this is with the --suppress-common-lines flag, though note that this only works if you do it in side-by-side format.
diff -y --suppress-common-lines file1.txt file2.txt


Finding common lines in two files using bash

I am trying to compare two files and output a file which contains the names common to both.
File1
1990.A.BHT.s_fil 4.70
1991.H.BHT.s_fil 2.34
1992.O.BHT.s_fil 3.67
1993.C.BHT.s_fil -1.50
1994.I.BHT.s_fil -3.29
1995.K.BHT.s_fil -4.01
File2
1990.A.BHT_ScS.dat 1537 -2.21
1993.C.BHT_ScS.dat 1494 1.13
1994.I.BHT_ScS.dat 1545 0.15
1995.K.BHT_ScS.dat 1624 1.15
I want to compare the first parts of the names (e.g. 1990.A.BHT) in both files and output to file3 the common names with the values in the 2nd column of file1.
ex: file3 (output)
1990.A.BHT.s_fil 4.70
1993.C.BHT.s_fil -1.50
1994.I.BHT.s_fil -3.29
1995.K.BHT.s_fil -4.01
I used the following code, which uses the grep command:
while read line
do
    grep $line file1 >> file3
done < file2
and
grep -wf file1 file2 > file3
I sort the files before using this script.
But I get an empty file3. Can someone help me with this please?
You need to remove everything from _ScS.dat onwards in the lines of file2. Then you can use the results as patterns to match lines in file1:
grep -F -f <(sed 's/_ScS\.dat.*//' file2) file1 > file3
The -F option matches fixed strings rather than treating them as regular expressions.
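An alternative sketch in awk that builds the keys explicitly (assuming the suffixes are always _ScS.dat in file2 and .s_fil in file1, as in your sample data):
awk 'NR==FNR { k=$1; sub(/_ScS\.dat$/, "", k); keys[k]; next }
     { k=$1; sub(/\.s_fil$/, "", k); if (k in keys) print }' file2 file1 > file3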
In your example data, the lines appear to be in sorted order. If you can guarantee that they always are, comm -1 -2 file1 file2 would do the job. If they can be unsorted, do a
comm -1 -2 <(sort file1) <(sort file2)

Bash - reading two files and searching within files

I have two files, file1 and file2. I want to read each line from file1, and then check whether any of the lines in file2 are present in file1. I am using the following bash script, but it does not seem to be working. What should I change? (I am new to bash scripting.)
#!/bin/bash
while read line1
do
    echo $line1
    while read line2
    do
        if grep -Fxq "line2" "$1"
        then
            echo "found"
        fi
    done < "$2"
done < "$1"
Note: Both files are text files.
Use grep -f
grep -f file_with_search_words file_with_content
Note however that if file_with_search_words contains blank lines everything will be matched. But that can be easily avoided with:
grep -f <(sed '/^$/d' file_with_search_words) file_with_content
From the man page:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. If this option is used
multiple times or is combined with the -e (--regexp) option, search
for all patterns given. The empty file contains zero patterns, and
therefore matches nothing.
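Incidentally, the immediate bug in the posted script is grep -Fxq "line2": it searches for the literal string line2 rather than the variable "$line2". And the outer loop is unnecessary, since grep already scans all of file1 on each call. A corrected (though still slow, one grep per line) sketch:
while IFS= read -r line2; do
    if grep -Fxq "$line2" "$1"; then
        echo "found: $line2"
    fi
done < "$2"
grep -f does all of this in a single pass, which is why it is the better tool here.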
You may use the comm command, which compares two sorted files line by line.
This command shows the lines common to file1 and file2:
comm -12 file1 file2
The only problem with this command is that you have to sort the files first, like this:
sort file1 > file1sorted
sort file2 > file2sorted
comm -12 file1sorted file2sorted
http://www.computerhope.com/unix/ucomm.htm
File 1
Line 1
Line 3
Line 6
Line 9
File 2
Line 3
Line 6
awk 'NR==FNR{con[$0];next} $0 in con{print $0}' file1 file2
will give you
Line 3
Line 6
that is, the content of file2 which is also present in file1.
If you wish to ignore blank lines, you can achieve that with the one below:
awk 'NR==FNR{con[$0];next} !/^$/ && ($0 in con)' file1 file2

Difference between two files without sorting

I have the files file1 and file2, where file2 is a subset of file1. That means, if I iterate over file1, there are some lines that are in file2, and some that aren't, but there is no line in file2 that is not in file1. There may be several lines with the same content in a file. Now I want to get the difference between them, that is, all lines of file1 that aren't in file2.
According to this well received answer
diff(1) isn't the answer, comm(1) is.
(For whatever reason)
But as I understand it, for comm the files need to be sorted first. The problem: both files are ordered (not sorted!), and this order needs to be kept. So what I really want is to iterate over file1 and check, for every line, whether it is also in file2. If not, write it to file3. If the same content occurs more than once, it should be kept more than once!
Is there any way to do this with the command line?
Try this with GNU grep:
grep -vFf file2 file1 > file3
Update:
grep -vxFf file2 file1 > file3
I think you want to avoid sorting into temporary files. That is possible with process substitution:
diff <(sort file1) <(sort file2)
# or
comm <(sort file1) <(sort file2)
Edit: Using https://stackoverflow.com/a/4544925/3220113 I found another alternative (for text files with short lines):
diff -a --suppress-common-lines -y file2 file1 | sed 's/\s*>.//'
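One caveat with grep -vxFf file2 file1: it removes every occurrence of a line that appears anywhere in file2, so duplicate counts are not respected. Since you want repeated lines kept the right number of times, a small awk sketch that treats the files as multisets and preserves file1's order:
awk 'NR==FNR { cnt[$0]++; next }        # count each line of file2
     cnt[$0] > 0 { cnt[$0]--; next }    # consume one matching copy
     { print }' file2 file1 > file3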
