Bash: tell if a file is included in another

I'm trying to compare the content of two files and tell if the content of one is totally included in another (meaning if one file has three lines, A, B and C, can I find those three lines, in that order, in the second file). I've looked at diff and grep but wasn't able to find the relevant option (if any).
Examples:
file1.txt    file2.txt   <= should return true (file2 is included in file1)
---------    ---------
abc          def
def          ghi
ghi
jkl

file1.txt    file2.txt   <= should return false (file2 is not included in file1)
---------    ---------
abc          abc
def          ghi
ghi
jkl
Any idea?

Using the answer from here
Use the following Python function:
def sublistExists(list1, list2):
    return ''.join(map(str, list2)) in ''.join(map(str, list1))
In action:
In [35]: a=[i.strip() for i in open("f1")]
In [36]: b=[i.strip() for i in open("f2")]
In [37]: c=[i.strip() for i in open("f3")]
In [38]: a
Out[38]: ['abc', 'def', 'ghi', 'jkl']
In [39]: b
Out[39]: ['def', 'ghi']
In [40]: c
Out[40]: ['abc', 'ghi']
In [41]: sublistExists(a, b)
Out[41]: True
In [42]: sublistExists(a, c)
Out[42]: False
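Note that joining the lines with no separator can produce false positives when line boundaries differ (for example ['ab', 'c'] versus ['a', 'bc']). A minimal variant of my own (not from the original answer) that joins with newlines and anchors on line boundaries avoids that:
def sublist_exists_lines(list1, list2):
    # Wrap each joined block in newlines so a match can only start
    # and end on whole-line boundaries.
    return '\n%s\n' % '\n'.join(list2) in '\n%s\n' % '\n'.join(list1)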

Assuming your file2.txt does not contain characters with special meaning for regular expressions, you can use:
grep "$(<file2.txt)" file1.txt
(Note that grep treats each line of a multi-line pattern as a separate pattern, so this checks that every line of file2.txt occurs somewhere in file1.txt, not that the lines occur contiguously and in order.)

This should work even if your file2.txt contains special characters:
cp file1.txt file_read.txt
while read -r a_line ; do
    # number of the first line in file_read.txt that matches a_line exactly
    first_line_found=$( fgrep -nx "${a_line}" file_read.txt 2>/dev/null | head -1 | cut -d: -f1 )
    if [ -z "$first_line_found" ]; then
        exit 1 # we couldn't find a_line in file_read.txt
    else
        # delete everything up to and including the matched line, then save
        { echo "1,${first_line_found}d" ; echo "w" ; } | ed -s file_read.txt
    fi
done < file2.txt
exit 0
(the "exit 0" is there for "readability" so one can see easily that it exits with 1 only if fgrep can't find a line in file1.txt. It's not needed)
(fgrep is a literral grep, searching for a string (not a regexp))
(I haven't tested the above, it's a general idea. I hope it does work though ^^)
the "-x" force it to match lines exactly, ie, no additionnal characters (ie : "to" can no longer match "toto". Only "toto" will match "toto" when adding -x)

Please try whether this awk "one-liner" ^_^ works for your real files; for the example files in your question, it worked (note that the three-argument match() used here requires GNU awk):
awk 'FNR==NR{a=a $0;next}{b=b $0}
     END{
       while(match(b,a,m)){
         if(m[0]==a) {print "included";exit}
         b=substr(b,RSTART+RLENGTH)
       }
       print "not included"
     }' file2 file1

Related

sed/awk between two patterns in a file: pattern 1 set by a variable from lines of a second file; pattern 2 designated by a specified character

I have two files. One file contains a pattern that I want to match in a second file. I want to use that pattern to print from the matching line (included) up to a specified character (not included), and then concatenate the results into a single output file.
For instance,
File_1:
a
c
d
and File_2:
>a
MEEL
>b
MLPK
>c
MEHL
>d
MLWL
>e
MTNH
I have been using variations of this loop:
while read $id;
do
sed -n "/>$id/,/>/{//!p;}" File_2;
done < File_1
hoping to obtain something like the following output:
>a
MEEL
>c
MEHL
>d
MLWL
But have had no such luck. I have played around with grep/fgrep, awk and sed and between the three cannot seem to get the right (or any) output. Would someone kindly point me in the right direction?
Try:
$ awk -F'>' 'FNR==NR{a[$1]; next} NF==2{f=$2 in a} f' file1 file2
>a
MEEL
>c
MEHL
>d
MLWL
How it works
-F'>'
This sets the field separator to >.
FNR==NR{a[$1]; next}
While reading in the first file, this creates a key in array a for every line in file1.
NF==2{f=$2 in a}
For every line in file 2 that has two fields, this sets variable f to true if the second field is a key in a or false if it is not.
f
If f is true, print the line.
A plain (GNU) sed solution. Files are read only once. It is assumed that the characters in File_1 needn't be quoted in a sed expression.
pat=$(sed ':a; $!{N;ba;}; y/\n/|/' File_1)
sed -E -n ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}" File_2
Explanation:
The first call to sed generates a regular expression to be used in the second call to sed and stores it in the variable pat. The aim is to avoid repeatedly reading the entire File_2 for each line of File_1. It just "slurps" File_1 and replaces the newline characters with | characters. So the sample File_1 becomes a string with the value a|c|d. The regular expression a|c|d matches if at least one of the alternatives (a, c, d for this example) matches (this is a GNU sed extension).
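For instance, with the sample File_1 from the question:
$ sed ':a; $!{N;ba;}; y/\n/|/' File_1
a|c|d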
The second sed expression, ":a; /^>($pat)/{:b; p; n; /^>/ba; bb}", could be converted to pseudo code like this:
begin:
read next line (from File_2) or quit on end-of-file
label_a:
if line begins with `>` followed by one of the alternatives in `pat` then
label_b:
print the line
read next line (from File_2) or quit on end-of-file
if line begins with `>` goto label_a else goto label_b
else goto begin
Let me try to explain why your approach does not work well:
You need to say while read id instead of while read $id.
The sed command />$id/,/>/{//!p;} will exclude the lines which start
with >.
Then you might want to say something like:
while read id; do
    sed -n "/^>$id/{N;p}" File_2
done < File_1
Output:
>a
MEEL
>c
MEHL
>d
MLWL
But the code above is inefficient because it reads File_2 as many times as the count of the id's in File_1.
Please try the elegant solution by John1024 instead.
If ed is available, and since the shell is involved:
#!/usr/bin/env bash
mapfile -t to_match < file1.txt
ed -s file2.txt <<-EOF
g/\(^>[${to_match[*]}]\)/;/^>/-1p
q
EOF
It will only run ed once, not once for every pattern line from file1 that matches. Say you have a through z in file1: ed will not run 26 times.
Requires bash4+ because of mapfile.
How it works
mapfile -t to_match < file1.txt
Saves the entry/value from file1 in an array named to_match
ed -s file2.txt points ed at file2; the -s flag means don't print info about the file (the same kind of info you get with wc file)
<<-EOF A here document, shell syntax.
g/\(^>[${to_match[*]}]\)/;/^>/-1p
g means search the whole file aka global.
( ) capture group; it needs escaping because ed only supports BRE (basic regular expressions).
^> If the line starts with a >; the ^ is an anchor which means the start of the line.
[ ] a bracket expression; it matches any one of the characters inside it, in this case the characters from the expanded array "${to_match[*]}"
; Include the next address/pattern
/^>/ Match a leading >
-1 go back one line after the pattern match.
p print whatever was matched by the pattern.
q quit ed

Sed insert file contents rather than file name

I have two files and would like to insert the contents of one file into the other, replacing a specified line.
File 1:
abc
def
ghi
jkl
File 2:
123
The following code is what I have.
file1=numbers.txt
file2=letters.txt
linenumber=3s
echo $file1
echo $file2
sed "$linenumber/.*/r $file1/" $file2
Which results in the output:
abc
def
r numbers.txt
jkl
The output I am hoping for is:
abc
def
123
jkl
I thought it could be an issue with bash variables but I still get the same output when I manually enter the information.
How am I misunderstanding sed and/or the read command?
Your script replaces the line with the string "r $file1". The replacement part of sed's s command is not re-interpreted as a command; it is taken literally.
You can:
linenumber=3
sed "$linenumber"' {
r '"$file1"'
d
}' "$file2"
Read line number 3, print file1 and then delete the line.
See here for a good explanation and reference.
Surely we can make that a one-liner:
sed -e "$linenumber"' { r '"$file1"$'\n''d; }' "$file2"
Live example at tutorialspoint.
I would use the c command as follows:
linenumber=3
sed "${linenumber}c $(< $file1)" "$file2"
This replaces the current line with the text that comes after c.
Your command didn't work because it expands to this:
sed "3s/.*/r numbers.txt/" letters.txt
and you can't use r like that. r has to be the command that is being run.

Why does a "while read" loop stop when grep is run with an empty argument?

The following code does not work as I would expect:
(the original purpose of the script is to make a relation between items of two files where the identifiers are not sorted in the same order, but my question raises rather a curiosity about basic shell functionalities)
#!/bin/sh
process_line() {
    id="$1"
    entry=$(grep $id index.txt) # the "grep" line
    if [ "$entry" = "" ]; then
        echo 00000 $id
    else
        echo $entry | awk '{print $2, $1;}'
    fi
}
cat << EOF > index.txt
xyz 33333
abc 11111
def 22222
EOF
cat << EOF | while read line ; do process_line "$line"; done
abc
def

xyz
EOF
The output is:
11111 abc
22222 def
00000
But I would expect:
11111 abc
22222 def
00000
33333 xyz
(the last line is missing in the actual output)
My investigations show that the "grep" line is the one that leads to the early interruption of the while loop. However I cannot see the causal relationship.
That's because in the third iteration, with the empty line, you call process_line with an empty id. This leads to grep index.txt, i.e. no file name. This grep reads from stdin and that will consume all the input you pipe into the while loop.
To see this in action, add set -x at the top of your script.
You can get the desired behaviour if you replace the empty id with a string guaranteed to be not found, such as
entry=$(grep "${id:-NoSuchString}" index.txt)
Changing the "process_line" function to the following might help...
process_line() {
    id=$1
    if [ "$id" = "" ]
    then
        echo "00000"
    else
        entry=$(grep "${id}" index.txt)
        echo "$entry" | awk '{ print $2, $1 }'
    fi
}
Explanation:
if the "id" passed in is empty then just output the default
move the grep to the else clause so it only executes when "id" has a value
solves the problem with the missing quotes around id in the grep statement
Another thing to consider is the case where "id" is non-empty but not found in the index.txt file. This could result in a blank output. Adding an if statement after the grep call to handle this case may be a good idea depending on what the overall intention is.
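For that last point, a minimal sketch of such a check (the exact fallback output is my own assumption, not from the original post):
entry=$(grep "${id}" index.txt)
if [ -z "$entry" ]; then
    echo "00000 $id"   # id was given but is not present in index.txt
else
    echo "$entry" | awk '{ print $2, $1 }'
fi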
Hope that helps

Bash/Sed - multiline sed operation printing lines out of order

I am having some trouble with using sed to edit a log file. I have built it into a function which is supposed to replace the text between two search strings with the output from another function. It is almost working correctly, but is printing the lines to the log file out of order. For the life of me I can't figure out why, and most adjustments I have made while trying to fix it have actually had less desirable results.
My sed function:
log_edit(){
"$3" > temp.txt
sed -i -n "/$1/{
:loop
n
/$2/!b loop
x
r temp.txt
G
s/$2/\n\n&/
}
p" "$FILE"
rm temp.txt
}
I am using the "=== text ===" dividers as my start and stop strings to pass along to the function, and using the same functions that built the log in the first place to fill the temporary text file.
The problem is occurring somewhere near/related to the 'G' command. Rather than appending the hold pattern line to the end of the string, it appears to be attaching it to the beginning of the string.
Original log sample/Desired output:
=== Metech ITAMS Log ===
Metech Recycling
ITAMS Hardware Report
Date: Thu Mar 2 08:01:38 PST 2017
Tech: SP
=== Manufacturer Information ===
# dmidecode 2.12
...
Unfortunately, the output I am getting looks like this:
=== Manufacturer Information ===
=== Metech ITAMS Log ===
Metech Recycling
ITAMS Hardware Report
Date: Fri Mar 3 09:39:02 PST 2017
Tech: SS
# dmidecode 2.12
...
Would someone be able to help me understand what I'm doing wrong, or propose a fix? This is my first question ever to SO, if more information is necessary I am happy to provide it. Thanks in advance.
Edit #1: As requested a snippet of the code that calls the function:
2)
    printf "\n"
    text_prompt "Please enter Tech initials: "
    set_tech_id
    text_prompt "Please enter Traveler ID: "
    set_travel_id
    mv "$FILE" "$TRAVEL_ID $TECH_INITIALS"
    FILE="$TRAVEL_ID $TECH_INITIALS"
    log_edit "=== Metech ITAMS Log ===" \
        "=== Manufacturer Information ===" "print_header"
    unset TECH_INITIALS
    unset TRAVEL_ID
    ;;
This is part of a menu function, and it would be overkill to include the whole thing; just be aware that there will be several calls to log_edit with different start/stop strings (though all follow the === === pattern), usually calling different functions to fill temp.txt.
Edit 2: For added clarity, I thought I should add the function being called with $3:
print_header(){ #Prints log header.
    print_div "Metech ITAMS Log"
    printf "Metech Recycling\nITAMS Hardware Report\nDate: $(date)\nTech: %s\n" \
        "$TECH_INITIALS"
}
and print_header calls print_div:
print_div(){ #Prints a divider. Required parameter: $1=Text for divider.
    printf "\n=== %s ===\n\n" "$1"
}
Edit 3: For question clarity, my issue is that the $2 string is being written to the log before the contents of temp.txt, rather than after.
Final Edit: A solution was found. I thought I would post the working code below in case it's helpful to others. A big part of my problem was a misunderstanding of how sed uses the 'r' command. There's another part of this solution, from the accepted answer, that I still don't understand: the substitute commands that add backslashes, which were key to making it work. I don't know why it works, but it does.
log_edit() { #Works!!
"$3" > temp.txt
sed -i -n '/^'"$1"'$/ {
:loop
n
/^'"$2"'$/!b loop
i\
'"$(sed 's/\\/\\&/g;s/$/\\/' -- "temp.txt")"'
#Blank line terminates i command.
}
p' "$FILE"
rm temp.txt
}
The r command copies out the file before the next read, not when it is evaluated, and does not modify pattern-space. However the file can be inserted into the script as part of an i command:
log_edit() {
sed -n '/^'"$1"'$/ {
p
:loop
n
/^'"$2"'$/!bloop
i\
'"$("$3" | sed 's/^[[:space:]]/\\&/;s/\\/\\&/g;s/$/\\/')"'
# The blank line above is part of the `i' command,
# and appends a newline to the inserted text.
}
p' "$FILE" > "$FILE.mod" && mv -f -- "$FILE.mod" "$FILE"
}
The command substitution "$("$3" | sed '...')" filters the output
of $3 for use with sed's i command. The i command prints
a series of lines, with all but the last ending with a \.
$ echo three | sed 'i\
> one\
> two
> '
one
two
three
Looks like just a few things out of order there. Try this:
log_edit(){
"$3" > text.tmp
sed -i -n "/$1/{
r text.tmp
:loop
N
/$2/!b loop
s/.*\n/\n\n/g
}
p
" "$FILE"
rm text.tmp
}
print_header(){ #Prints log header.
    print_div "Metech ITAMS Log"
    printf "Metech Recycling\nITAMS Hardware Report\nDate: $(date)\nTech:%s" "$TECH_INITIALS"
}
print_div(){ #Prints a divider. Required parameter: $1=Text for divider.
    printf "\n=== %s ===\n\n" "$1"
}
log_edit "=== Metech ITAMS Log ===" "=== Manufacturer Information ===" "print_header"
Try the csplit program, which can divide a file up into sections according to a pattern:
csplit $3 "/\($1\|$2\)/" "{*}"
This means take file $3, and break it into files xxNN (where NN starts at 00 and goes up) according to sections demarcated by an unlimited number ({*}) of patterns $1 OR $2 (two alternate patterns, separated by \| and grouped by escaped parentheses). The demarcation lines will remain in the output. You can then write ancillary code to delete the files you don't want. You can also change the name of the output filename and pattern.
# cat foo
a
b
#
c
d
%
e
f
#
g
h
%
i
j
# csplit foo '/\(#\|%\)/' '{*}'
4
6
6
6
6
# more xx0*
::::::::::::::
xx00
::::::::::::::
a
b
::::::::::::::
xx01
::::::::::::::
#
c
d
::::::::::::::
xx02
::::::::::::::
%
e
f
::::::::::::::
xx03
::::::::::::::
#
g
h
::::::::::::::
xx04
::::::::::::::
%
i
j
Note: You'll need to tweak it if your demarcation lines can repeat/occur out of order. This is very simple; the breaks occur at any point that one pattern OR the other is seen, regardless of order.
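As a sketch of that ancillary cleanup (my own addition, assuming you only want the pieces introduced by the first pattern):
# keep only the pieces whose first line matches "$1"; delete the rest
for f in xx*; do
    head -n 1 "$f" | grep -q "$1" || rm -f -- "$f"
done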

AWK between 2 patterns - first occurrence

I have this example of an ini file. I need to extract the names between the 2 patterns Name_Z1 and OBJ=Name_Z1 and put each of them on its own line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger script in bash and I am stuck on this. I would prefer awk to do the job.
I would greatly appreciate anyone who can help me. Thank you!
For Wintermute's solution: the [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
A (g)awk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/)                      - Increments x when Z1 is found. Also part of a
                                 condition, so x must exist to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after
                                 it up to the |. Stores the capture in a. Again a
                                 condition, so it must succeed to continue.
gsub(";","\n",a[1])            - Substitutes all the ; for newlines in the capture
                                 group a[1].
{print a[1];exit}              - If all conditions are met then print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
Awk solution would be
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field seperator as ;
/Name_Z1/{f++} matches the line with pattern /Name_Z1/ If matched increment {f++}
f==1 && /Names/{print $2,$3,$4} is same as if f == 1 and maches pattern Name with line if true, then print the the columns 2 3 and 4 (delimted by ;)
OFS="\n" sets the output filed seperator as \n new line
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in groups of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" ' # By setting RS to nothing, one record equals one block. Then FS is set to one line as a field
/^\[Name_Z1\]/ { # Search for block with [Name_Z1]
n=split($3,a,";") # Split field 3, the names and store number of fields in variable n
for (i=2;i<=n;i++) # Loop from second to last field
print a[i] # Print the fields
exit # Exits after first find
' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
    gsub("Names;","",$0);
    gsub(";","\n",$0);
    print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
    s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith
