Compare two files, and keep only if first word of each line is the same - bash

Here are two files where I need to eliminate the data that they do not have in common:
a.txt:
hello world
tom tom
super hero
b.txt:
hello dolly 1
tom sawyer 2
miss sunshine 3
super man 4
I tried:
grep -f a.txt b.txt >> c.txt
And this:
awk '{print $1}' test1.txt
because I need to check only if the first word of the line exists in the two files (even if not at the same line number).
But then what is the best way to get the following output in the new file?
output in c.txt:
hello dolly 1
tom sawyer 2
super man 4

Use awk where you iterate over both files:
$ awk 'NR == FNR { a[$1] = 1; next } a[$1]' a.txt b.txt
hello dolly 1
tom sawyer 2
super man 4
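NR == FNR is true only while the first file is being read, so { a[$1] = 1; next } runs only on that file. For the second file, the bare pattern a[$1] prints every line whose first word was stored.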
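The same filter can be wrapped in a small function. This is only a sketch: the function name keep_common_first_word and the temporary sample files are made up for illustration. It uses the explicit `$1 in seen` test, which selects the same lines as a[$1] above but does not create empty array entries for words that are looked up and not found.

```shell
# Hypothetical helper: print the lines of file 2 whose first word
# also appears as the first word of some line in file 1.
keep_common_first_word() {
    awk 'NR == FNR { seen[$1]; next }   # first file: remember each first word
         $1 in seen                      # second file: print the line if its first word is known
        ' "$1" "$2"
}

# Sample data from the question, written to a temporary directory.
demo_dir=$(mktemp -d)
printf '%s\n' 'hello world' 'tom tom' 'super hero' > "$demo_dir/a.txt"
printf '%s\n' 'hello dolly 1' 'tom sawyer 2' 'miss sunshine 3' 'super man 4' > "$demo_dir/b.txt"

keep_common_first_word "$demo_dir/a.txt" "$demo_dir/b.txt"
```

Redirect the output to c.txt to get the file asked for in the question.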

Use sed to generate a sed script from the input, then use another sed to execute it.
sed 's=^=/^=;s= .*= /p=' a.txt | sed -nf- b.txt
The first sed turns your a.txt into
/^hello /p
/^tom /p
/^super /p
which prints (p) whenever a line contains hello, tom, or super at the beginning of line (^) followed by a space.

This combines grep, cut and sed with process substitution:
$ grep -f <(cut -d ' ' -f 1 a.txt | sed 's/^/^/;s/$/ /') b.txt
hello dolly 1
tom sawyer 2
super man 4
The output of the process substitution is this (piping to cat -A to show spaces):
$ cut -d ' ' -f 1 a.txt | sed 's/^/^/;s/$/ /' | cat -A
^hello $
^tom $
^super $
We then use this as input for grep -f, resulting in the above.
If your shell doesn't support process substitution, but your grep supports reading from stdin with the -f option (it should), you can use this instead:
$ cut -d ' ' -f 1 a.txt | sed 's/^/^/;s/$/ /' | grep -f - b.txt
hello dolly 1
tom sawyer 2
super man 4
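To see why the trailing space in the generated patterns matters, here is a small sketch with made-up helper names (first_word_patterns, filter_by_first_word) and sample data that adds a "tomato" line to b.txt. It assumes the first words contain no regex metacharacters and that every line in the second file has at least two words.

```shell
# Hypothetical helpers; the names are not from the answer above.
first_word_patterns() {
    # Turn "hello world" into the anchored pattern "^hello " (note the trailing space).
    cut -d ' ' -f 1 "$1" | sed 's/^/^/;s/$/ /'
}

filter_by_first_word() {
    # Feed the generated patterns to grep on standard input.
    first_word_patterns "$1" | grep -f - "$2"
}

demo_dir=$(mktemp -d)
printf '%s\n' 'hello world' 'tom tom' > "$demo_dir/a.txt"
printf '%s\n' 'hello dolly 1' 'tomato soup 2' 'tom sawyer 3' > "$demo_dir/b.txt"

filter_by_first_word "$demo_dir/a.txt" "$demo_dir/b.txt"
```

This prints the "hello dolly 1" and "tom sawyer 3" lines only. Without the trailing space, the pattern ^tom would also match "tomato soup 2".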

Related

How to insert a generated value by a loop while you open a file in bash

Lets say that I have:
cat FILENAME1.txt
Definition john
cat FILENAME2.txt
Definition mary
cat FILENAME3.txt
Definition gary
cat textfile.edited
text
text
text
I want to obtain an output like:
1 john text
2 mary text
3 gary text
I tried to use "stored" values from FILENAMES "generated" by a loop. I wrote this:
for file in $(ls *.txt); do
name=$(cat $file| grep -i Definition|awk '{$1="";print $0}')
#echo $name --> this command works as it gives the names
done
cat textfile.edited| awk '{printf "%s\t%s\n",NR,$0}'
which is very close to what I want to get:
1 text
2 text
3 text
My issue was coming through when I tried to add the "stored" value. I tried the following with no success.
cat textfile.edited| awk '{printf "%s\t%s\n",$name,NR,$0}'
cat textfile.edited| awk '{printf "%s\t%s\n",name,NR,$0}'
cat textfile.edited| awk -v name=$name '{printf "%s\t%s\n",NR,$0}'
Sorry if the terminology used is not the best, but I started scripting recently.
Thank you in advance!!!
One solution using paste and awk ...
We'll append a count to the lines in textfile.edited (so we can see which lines are matched by paste):
$ cat textfile.edited
text1
text2
text3
First we'll look at the paste component:
$ paste <(egrep -hi Definition FILENAME*.txt) textfile.edited
Definition john text1
Definition mary text2
Definition gary text3
From here awk can do the final slicing-n-dicing-n-numbering:
$ paste <(egrep -hi Definition FILENAME*.txt) textfile.edited | awk 'BEGIN {OFS="\t"} {print NR,$2,$3}'
1 john text1
2 mary text2
3 gary text3
NOTE: It's not clear (to me) if the requirement is for a space or tab between the 2nd and 3rd columns; above solution assumes a tab, while using a space would be doable via a (awk) printf call.
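A sketch of the printf variant mentioned in the note, with a tab after the line number and a single space between the name and the text. The function name number_names and the temporary directory are assumptions for illustration. grep's output is fed to paste on standard input (-) instead of through process substitution, so it also runs in a plain sh. It assumes each line of textfile.edited is a single word; longer lines would need $3 replaced by the rest of the record.

```shell
# Sample files modelled on the question, created in a temporary directory.
demo_dir=$(mktemp -d)
printf 'Definition john\n' > "$demo_dir/FILENAME1.txt"
printf 'Definition mary\n' > "$demo_dir/FILENAME2.txt"
printf 'Definition gary\n' > "$demo_dir/FILENAME3.txt"
printf '%s\n' text1 text2 text3 > "$demo_dir/textfile.edited"

# Hypothetical helper: $1 is the directory holding the files.
number_names() {
    grep -hi Definition "$1"/FILENAME*.txt |
        paste - "$1/textfile.edited" |
        awk '{ printf "%d\t%s %s\n", NR, $2, $3 }'   # number, tab, name, space, text
}

number_names "$demo_dir"
```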
You can do all with one awk command.
First file is the textfile.edited, other files are mentioned last.
awk 'NR==FNR {text[NR]=$0;next}
/^Definition/ {namenr++; names[namenr]=$2}
END { for (i=1;i<=namenr;i++) printf("%s %s %s\n", i, names[i], text[i]);}
' textfile.edited FILENAME*.txt
You can avoid awk with
paste -d' ' <(seq $(wc -l <textfile.edited)) \
<(sed -n 's/^Definition //p' FILE*) \
textfile.edited
Another version of the paste solution with a slightly careless grep -
$: paste -d' ' <( grep -ho '[^ ]*$' FILENAME?.txt ) textfile.edited
john text
mary text
gary text
Or, one more way to look at it...
$: a=( $(sed '/^Definition /s/.* //;' FILENAME[123].txt) )
$: echo "${a[@]}"
john mary gary
$: b=( $(<textfile.edited) )
$: echo "${b[@]}"
text text text
$: c=-1 # initialize so that the first pre-increment returns 0
$: while [[ -n "${a[++c]}" ]]; do echo "${a[c]} ${b[c]}"; done
john text
mary text
gary text
This will put all the values in memory before printing anything, so if the lists are really large it might not be your best bet. If they are fairly small, it's pretty efficient, and a single parallel index will keep them in order.
If the number of lines is not the same as the number of files, what did you want to do? As long as there aren't more files than lines, and any extra lines are ok to ignore, this still works. If there are more files than lines, then we need to know how you'd prefer to handle that.
A one-liner using GNU utilities:
paste -d ' ' <(cat -n FILENAME*.txt | sed 's/\sDefinition//') textfile.edited
Or,
paste -d ' ' <(cat -n FILENAME*.txt | sed 's/^\s*//;s/\sDefinition//') textfile.edited
if the leading white spaces are not desired.
Alternatively:
paste -d ' ' <(sed 's/^Definition\s//' FILENAME*.txt | cat -n) textfile.edited

How to merge in one file, two files in bash line by line [duplicate]

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure, it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
Here's a solution using awk:
awk '{print; if((getline < "file2") > 0) print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if((getline < "file2") > 0) print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is of greater than or equal length to file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if((getline < "file2") > 0) print; else print ""}' file1
or
awk '{print "1: "$0; if((getline < "file2") > 0) print "2: "$0; else print "2: "}' file1
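The commands above stop reading file2 once file1 runs out. If file2 may be the longer one, a possible extension is to drain the rest of it in an END block. This is a sketch only; the function name interleave and the awk variable other are made up here. The getline result is compared with 0 because getline returns -1 on a read error, which a bare if would treat as true.

```shell
# Hypothetical helper: interleave the lines of two files,
# printing leftover lines of whichever file is longer.
interleave() {
    awk -v other="$2" '
        { print }                                   # current line of the first file
        (getline line < other) > 0 { print line }   # next line of the second file, if any
        END { while ((getline line < other) > 0) print line }   # rest of the second file
    ' "$1"
}

demo_dir=$(mktemp -d)
printf '%s\n' line1.1 line1.2 > "$demo_dir/file1"
printf '%s\n' line2.1 line2.2 line2.3 > "$demo_dir/file2"

interleave "$demo_dir/file1" "$demo_dir/file2"
```

With these sample files the output ends with line2.3, which the one-liners above would drop.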
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 |sort -t. -k 2.1
Here it's specified that the separator is "." and that the sort key starts at the first character of the second field.

Combine two lines from different files when the same word is found in those lines

I'm new with bash, and I want to combine two lines from different files when the same word is found in those lines.
E.g.:
File 1:
organism 1
1 NC_001350
4 NC_001403
organism 2
1 NC_001461
1 NC_001499
File 2:
NC_001499 » Abelson murine leukemia virus
NC_001461 » Bovine viral diarrhea virus 1
NC_001403 » Fujinami sarcoma virus
NC_001350 » Saimiriine herpesvirus 2 complete genome
NC_022266 » Simian adenovirus 18
NC_028107 » Simian adenovirus 19 strain AA153
I wanted an output like:
File 3:
organism 1
1 NC_001350 » Saimiriine herpesvirus 2 complete genome
4 NC_001403 » Fujinami sarcoma virus
organism 2
1 NC_001461 » Bovine viral diarrhea virus 1
1 NC_001499 » Abelson murine leukemia virus
Is there any way to get anything like that output?
You can get something pretty similar to your desired output like this:
awk 'NR == FNR { a[$1] = $0; next }
{ print $1, ($2 in a ? a[$2] : $2) }' file2 file1
This reads in each line of file2 into an array a, using the first field as the key. Then for each line in file1 it prints the first field followed by the matching line in a if one is found, else the second field.
If the spacing is important, then it's a little more effort but totally possible.
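One way to keep the spacing is to leave each line of file1 untouched and only append the description. The sketch below assumes that file1 may be indented (the indentation in the sample is invented) and that the ID is always the second field; the function name annotate_ids is hypothetical.

```shell
# Hypothetical helper: $1 is the lookup file (file2), $2 is the file to annotate (file1).
annotate_ids() {
    awk 'NR == FNR {
             key = $1                        # ID, e.g. NC_001350
             sub(/^[ \t]*[^ \t]+/, "")       # drop the ID, keep " » description"
             desc[key] = $0
             next
         }
         { print $0 (($2 in desc) ? desc[$2] : "") }   # original line plus description, if any
        ' "$1" "$2"
}

demo_dir=$(mktemp -d)
printf '%s\n' 'organism 1' '  1 NC_001350' '  4 NC_001403' > "$demo_dir/file1"
printf '%s\n' 'NC_001403 » Fujinami sarcoma virus' \
              'NC_001350 » Saimiriine herpesvirus 2 complete genome' > "$demo_dir/file2"

annotate_ids "$demo_dir/file2" "$demo_dir/file1"
```

Lines such as "organism 1", whose second field is not a known ID, are printed unchanged.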
For a more Bash 4 ish solution:
declare -A descriptions
while read line; do
name=$(echo "$line" | cut -d '»' -f 1 | xargs echo)
description=$(echo "$line" | cut -d '»' -f 2)
descriptions[$name]=" »$description"
done < file2
while read line; do
name=$(echo "$line" | cut -d ' ' -f 2)
if [[ -n "$name" && -n "${descriptions[$name]}" ]]; then
echo "${line}${descriptions[$name]}"
else
echo "$line"
fi
done < file1
We could create a sed script from the second file and apply it to the first file. It is straightforward: we use the sed s command to construct another sed s command from each line and store the result in a variable for later use:
sc=$(sed -rn 's#^\s+(\w+)([^\w]+)(.*)$#s/\1/\1\2\3/g;#g; p;' file2 )
sed "$sc" file1
The first command looks so weird, because we use # in the outer sed s and we use the more common / in the inner sed s command as delimiters.
Run echo "$sc" to study the inner one. It just captures the parts of each line of file2 in different groups and then combines the captured strings into an s/find/replace/g; command, where
find is \1
replace is \1\2\3
You want to rebuild file2 into a sed-command file.
sed 's# \(\w\+\) \(.*\)#s/\1/\1 \2/#' File2
You can use process substitution to use the result without storing it in a temp file.
sed -f <(sed 's# \(\w\+\) \(.*\)#s/\1/\1 \2/#' File2) File1

Extract lines from a file in bash

I have a file like this
I would like to extract the lines consisting of 0s and 1s (all such lines in the file) into a separate file. The sequence does not have to start with a 0; it could also start with a 1. However, such a line always comes directly after a SITE: line. Moreover, I would like to extract the SITE: lines themselves into a separate file. Could somebody tell me how that is doable in bash?
Moreover, I would like to extract the SITE: lines themselves into a separate file.
That’s the easy part:
grep '^SITE:' infile > outfile.site
Extracting the line after that is slightly harder:
grep --after-context=1 '^SITE:' infile \
| grep '^[01]*$' \
> outfile.nr
--after-context (or -A) specifies how many lines after the matching line to print as well. We then use the second grep to print only that line, and not the actually matching line (nor the delimiter which grep puts between each matching entry when specifying an after-context).
Alternatively, you could use the following to match the numeric lines:
grep '^[01]*$' infile > outfile.nr
That’s much easier, but it will find all lines consisting solely of 0s and 1s, regardless of whether they come after a line which starts with SITE:.
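The difference between the two approaches shows up when a 0/1 line appears somewhere other than directly after a SITE: line. The sample input below is invented for illustration (the question's actual file is not shown), and the function names are hypothetical. The short option -A 1 is used in place of --after-context=1.

```shell
# Hypothetical helpers wrapping the two commands above.
bits_after_site() {
    # Only 0/1 lines that directly follow a SITE: line.
    grep -A 1 '^SITE:' "$1" | grep '^[01]*$'
}

all_bit_lines() {
    # Every line made up solely of 0s and 1s.
    grep '^[01]*$' "$1"
}

demo_dir=$(mktemp -d)
printf '%s\n' 'SITE: 0 0.1 0.2' '0010' 'comment' '0111' \
              'SITE: 1 0.3 0.4' '1100' > "$demo_dir/infile"

bits_after_site "$demo_dir/infile"
all_bit_lines "$demo_dir/infile"
```

The first function prints 0010 and 1100; the second also prints the stray 0111. Note that ^[01]*$ matches empty lines as well; ^[01][01]*$ would require at least one digit.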
You could try something like :
$ egrep -o "^(0|1)+$" test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
$ grep "^SITE:" test.txt > test3.txt
$ cat test3.txt
SITE: 0 0.000340988542 0.0357651018
SITE: 1 0.000529755514 0.00324293642
SITE: 2 0.000577745511 0.052214098
Another solution, using bash :
$ while read; do [[ $REPLY =~ ^(0|1)+$ ]] && echo "$REPLY"; done < test.txt > test2.txt
$ cat test2.txt
0000000000001010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
0000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000
0011010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
To remove the leading 0 characters from each line:
$ egrep "^(0|1)+$" test.txt | sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
1010000000000000010000000000000000000100000000000010000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000000000000
11010000000000001010000000000000001000010001000000001001001000011000000000000000101000101010101000
UPDATE : New file format provided in comments :
$ egrep "^SITE:" test.txt|egrep -o "(0|1)+$"|sed "s/^0\{1,\}//g" > test2.txt
$ cat test2.txt
100000000000000000000001000001000000000000000000000000000000000000
1010010010000000000111101000010000001001010111111100000000000010010001101010100011101011110011100
10000000000
$ egrep "^SITE:" test.txt|sed "s/[01\ ]\{1,\}$//g" > test3.txt
$ cat test3.txt
SITE: 967 0.189021866 0.0169990123
SITE: 968 0.189149593 0.246619149
SITE: 969 0.189172266 6.84752689e-05
Here's a simple awk solution that matches all lines starting with SITE: and outputs the respective next line:
awk '/^SITE:/ { if (getline) print }' infile > outfile
Simply omit the { ... } block part to extract all lines starting with SITE: themselves to a separate file:
awk '/^SITE:/' infile > outfile
If you wanted to combine both operations:
outfile1 and outfile2 are the names of the 2 output files, passed to awk as variables f1 and f2:
awk -v f1=outfile1 -v f2=outfile2 \
'/^SITE:/ { print > f1; if (getline) print > f2 }' infile

Is there a shell command to pick the n-th line?

Is there a shell command to pick the n-th line of a string ?
Example:
line1
line2
line3
pick line 2.
UPDATE: Thank you so far. With your help, I came up with this solution for a string:
Pick the 2nd line:
echo -e "1\n2\n3" | head -2 | tail -1
$ head -n N filename | tail -n 1
where N is your line number. But it's a little inefficient, launching 2 processes.
Alternatively sed can do this. To print the 4th line:
$ sed -n 4p filename
This forum answer details 3 different methods for sed
# print line number 52
sed -n '52p' # method 1
sed '52!d' # method 2
sed '52q;d' # method 3, efficient on large files
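Method 3 can be wrapped in a small function that takes the line number as an argument. This is a sketch with a made-up name (nth_line). It reads the named file, or standard input when no file is given, which covers the "n-th line of a string" case from the question.

```shell
# Hypothetical helper: nth_line N [FILE...]
# Prints line N and stops reading; prints nothing if the input is shorter.
nth_line() {
    case $1 in
        ''|*[!0-9]*|0*)
            echo "nth_line: line number must be a positive integer" >&2
            return 2 ;;
    esac
    n=$1
    shift
    sed "${n}q;d" "$@"   # delete every line except line n, then quit
}

# Pick the 2nd line of a multi-line string.
printf '%s\n' line1 line2 line3 | nth_line 2
```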
Using gawk:
gawk -v n=3 'n==NR { print; exit }' a.txt
head -4 a.txt | tail -1
This prints the 4th line of a.txt.
