Remove all occurrences of lines in file B from file A - bash

I have two files: A and B.
Contents of A:
http://example.com/1
http://example.com/2
http://example.com/3
http://example.com/4
http://example.com/5
http://example.com/6
http://example.com/7
http://example.com/8
http://example.com/9
http://example.com/4
Contents of B:
http://example.com/1
http://example.com/3
http://example.com/9
http://example.com/4
Now, I would like to remove all the occurrences of the lines in file B from file A.
I have tried following:
for LINK in $(sort -u B);do sed -i -e 's/"$LINK"//g' A; echo "Removed $LINK";done
But it didn't do anything at all.

grep will be simpler for this. (Your sed loop does nothing because the single quotes prevent $LINK from expanding, so sed searches for the literal string "$LINK"; and even with the quoting fixed, s/$LINK//g would only empty the matching lines rather than delete them.)
grep -vxFf B A
http://example.com/2
http://example.com/5
http://example.com/6
http://example.com/7
http://example.com/8
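The flags matter here: -v inverts the match, -x requires the whole line to match, -F treats the patterns as fixed strings (so the dots and slashes in the URLs are not regex metacharacters), and -f reads the patterns from a file. A minimal sketch, using a temporary directory and shortened sample files:

```shell
# Recreate a small version of the problem in a temp dir.
dir=$(mktemp -d)
printf '%s\n' http://example.com/1 http://example.com/2 http://example.com/3 \
    http://example.com/4 http://example.com/5 > "$dir/A"
printf '%s\n' http://example.com/1 http://example.com/3 > "$dir/B"

# -v invert, -x whole-line, -F fixed strings, -f read patterns from B
grep -vxFf "$dir/B" "$dir/A" > "$dir/result"
cat "$dir/result"
```

Because -F does literal comparison and -x anchors it to the whole line, a URL like http://example.com/1 cannot accidentally filter out http://example.com/10.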

Replace third line of nth file with nth line of a single file

Say I have hundreds of *.xml in /train/xml/, in the following format
# this is the content of /train/xml/RIGHT_NAME.xml
<annotation>
<path>/train/img/WRONG_NAME.jpg</path> # this is the WRONG_NAME
</annotation>
The file name WRONG_NAME in <path>...</path> should match that of the .xml file, so that it looks like this:
# this is the content of /train/xml/RIGHT_NAME.xml
<annotation>
<path>/train/img/RIGHT_NAME.jpg</path> # this is the **RIGHT_NAME**
</annotation>
One solution I can think of is to:
1. export all file names into a text file:
ls -1 *.xml > filenames.txt
which generates a file with the content:
RIGHT_NAME_0.xml
RIGHT_NAME_1.xml
...
2. then edit filenames.txt, so that it becomes:
# tab at beginning of each line
<path>/train/img/RIGHT_NAME_0.jpg</path>
<path>/train/img/RIGHT_NAME_1.jpg</path>
...
3. Then, replace the third line of nth .xml file with the nth line from filenames.txt.
Thus the question title.
I've hammered around with sed and awk but had no success. How should I do it (EDIT: on a macOS machine)? Also, is there a more elegant solution?
Thanks in advance for helping out!
---things I've tried (and didn't work out)---
# this replaces the fifth line with an empty string
for i in *.xml ; do perl -i.bak -pe 's/.*/$i/ if $.==5' RIGHT_NAME.xml ; done
# this appends the contents of filenames.txt after the third line
sed -i.bak -e '/\<path\>/r filenames.txt' RIGHT_NAME.xml
# also, trying to utilize the <path>...</path> pattern...
Untested:
for xml in *.xml; do
sed -E -i.bak '3s|[^/]*\.jpg|'"${xml%.xml}.jpg|" "$xml"
done
ed is another option, since it should be installed by default on a Mac.
#!/bin/sh
for file in ./*.xml; do
printf 'Processing %s\n' "$file"
f=${file%.*}; f=${f#*./}
printf '%s\n' H "g/<annotation>/;/<\/annotation>/\
s|^\([[:blank:]]*<path>.*/\)[^.]*\(.*</path>\)|\1${f}\2|" %p Q |
ed -s "$file" || break
done
Will give desired results even if you have
/foo/bar/baz/more/train/img/WRONG_NAME.jpg
Will only edit/parse the string inside the path tag which is inside the annotation tag.
Change Q to w if in-place editing is needed.
Remove the %p to silence the output.
Caveat:
ed is not an xml editor/parser.
Using GNU awk (which you can easily install on macOS if it's not already present on your system) for "inplace" editing, gensub(), and the 3rd arg to match():
$ cat tst.awk
match($0,"(^\\s*<path>.*/).*([.][^.]+</path>)",a) {
name = gensub("(.*/)?(.*)[.][^.]+$","\\2",1,FILENAME)
$0 = a[1] name a[2]
}
{ print }
$ head *.xml
==> RIGHT_NAME_1.xml <==
# this is the content of /train/xml/RIGHT_NAME_1.xml
<annotation>
<path>/train/img/WRONG_NAME.xml.jpg</path>
</annotation>
==> RIGHT_NAME_2.xml <==
# this is the content of /train/xml/RIGHT_NAME_2.xml
<annotation>
<path>/train/img/WRONG_NAME.xml.jpg</path>
</annotation>
$ awk -i inplace -f tst.awk *.xml
$ head *.xml
==> RIGHT_NAME_1.xml <==
# this is the content of /train/xml/RIGHT_NAME_1.xml
<annotation>
<path>/train/img/RIGHT_NAME_1.jpg</path>
</annotation>
==> RIGHT_NAME_2.xml <==
# this is the content of /train/xml/RIGHT_NAME_2.xml
<annotation>
<path>/train/img/RIGHT_NAME_2.jpg</path>
</annotation>
Just call it as awk -i inplace -f tst.awk /train/xml/* on your system. Note that the above replaces the name in the <path> tag content wherever it occurs on its own line, so it will work whether that's the 3rd line in any given file or some other line. If you REALLY only want to do this for the 3rd line, then just change match(... to FNR==3 && match(....
This might work for you (GNU sed & parallel):
parallel --dry sed -i '3s#[^/]*\.jpg#{/.}.jpg#' {} ::: /train/xml/*.xml
In parallel the {} represents the file name and its path whereas the {/.} represents the filename less the path and its extension.
Once the output from the above solution has been checked, the option --dry (short for --dry-run) can be removed.
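If you would rather avoid GNU-specific tools on macOS, the same rename can be sketched with plain sed and a temp file; the directory layout and file names below are made up to mirror the question:

```shell
dir=$(mktemp -d)
printf '%s\n' '<annotation>' \
    '<path>/train/img/WRONG_NAME.jpg</path>' \
    '</annotation>' > "$dir/RIGHT_NAME_1.xml"

for xml in "$dir"/*.xml; do
    stem=$(basename "$xml" .xml)   # e.g. RIGHT_NAME_1
    # Swap the basename inside the <path> tag for the stem; keep the directory part.
    sed "s|[^/>]*\.jpg</path>|${stem}.jpg</path>|" "$xml" > "$xml.tmp" &&
        mv "$xml.tmp" "$xml"
done
cat "$dir/RIGHT_NAME_1.xml"
```

Writing to a temp file and then moving it back is the portable stand-in for -i, which differs between BSD and GNU sed.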

Sed insert file contents rather than file name

I have two files and would like to insert the contents of one file into the other, replacing a specified line.
File 1:
abc
def
ghi
jkl
File 2:
123
The following code is what I have.
file1=numbers.txt
file2=letters.txt
linenumber=3s
echo $file1
echo $file2
sed "$linenumber/.*/r $file1/" $file2
Which results in the output:
abc
def
r numbers.txt
jkl
The output I am hoping for is:
abc
def
123
jkl
I thought it could be an issue with bash variables but I still get the same output when I manually enter the information.
How am I misunderstanding sed and/or the read command?
Your script replaces the line with the literal string "r numbers.txt": the replacement text of sed's s command is not re-interpreted as a command; it is taken literally.
You can:
linenumber=3
sed "$linenumber"' {
r '"$file1"'
d
}' "$file2"
At line 3, r queues the contents of file1 to be printed after the current line, and d then deletes the original line.
See here for a good explanation and reference.
Surely we can make that a one-liner:
sed -e "$linenumber"' { r '"$file1"$'\n''d; }' "$file2"
Live example at TutorialsPoint.
I would use the c command as follows:
linenumber=3
sed "${linenumber}c $(< $file1)" "$file2"
This replaces the addressed line with the text that comes after c. (This is fine here because file1 is a single line; embedded newlines from a multi-line file would need escaping.)
Your command didn't work because it expands to this:
sed "3s/.*/r numbers.txt/" letters.txt
and you can't use r like that. r has to be the command that is being run.
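A self-contained reproduction of the r-then-d approach (the file names match the question's variables):

```shell
dir=$(mktemp -d)
printf '%s\n' abc def ghi jkl > "$dir/letters.txt"
printf '%s\n' 123 > "$dir/numbers.txt"

linenumber=3
# At line 3: r queues numbers.txt to print at the end of this cycle,
# and d drops the original line itself.
sed "${linenumber}{r $dir/numbers.txt
d
}" "$dir/letters.txt" > "$dir/out.txt"
cat "$dir/out.txt"
```

The newline before d is required because r treats everything up to the end of the line as the file name.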

awk to add extracted prefix from file to filename

The awk below runs as is, but it renames fields within each matching file that matches $p (the prefix extracted from each text file name) instead of prepending $x (the prefix looked up in field $1 of rename) to each filename in the directory. Each $x should be followed by a _ and then the original filename. I can see from echo $p that the correct value for the lookup against $2 is extracted, but every file in the directory is left unchanged. Not every entry in rename will have a file in the directory, but each file will always have a match in $p. Maybe there is a better way, as I am not sure what I am doing wrong. Thank you :).
rename (tab-delimited)
00-0000 File-01
00-0001 File-02
00-0002 File-03
00-0003 File-04
file1
File-01_xxxx.txt
file2
File-02_yyyy.txt
desired output
00-0000_File-01-xxxx.txt
00-0001_File-02-yyyy.txt
bash
for file1 in /path/to/folders/*.txt
do
# Grab file prefix
bname=`basename $file1` # strip off the path
p="$(echo $bname|cut -d_ -f1,1)" # keep the text before the first underscore
echo $p
# add prefix to matching file
awk -v var="$p" '$2~var{x=$1}(NR=x){print $x"_",$bname}' $file1 rename OFS="\t" > tmp && mv tmp $file1
done
This script:
touch File-01-azer.txt
touch File-02-ytrf.txt
touch File-03-fdfd.txt
touch File-04-dfrd.txt
while read p f;
do
f=$(ls $f*)
mv ${f} "${p}_${f}"
done << EEE
00-0000 File-01
00-0001 File-02
00-0002 File-03
00-0003 File-04
EEE
ls -1
outputs :
00-0000_File-01-azer.txt
00-0001_File-02-ytrf.txt
00-0002_File-03-fdfd.txt
00-0003_File-04-dfrd.txt
You can use a file as input using done < rename_map.txt or cat rename_map.txt | while
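A slightly more defensive sketch of the same idea, using a glob instead of ls so unusual file names survive, and skipping map entries that have no matching file (rename_map.txt is an assumed name):

```shell
dir=$(mktemp -d)
touch "$dir/File-01_xxxx.txt" "$dir/File-02_yyyy.txt"
printf '00-0000\tFile-01\n00-0001\tFile-02\n00-0002\tFile-03\n' > "$dir/rename_map.txt"

while read -r prefix stem; do
    for f in "$dir/${stem}"_*.txt; do
        [ -e "$f" ] || continue                 # no file matches this stem
        mv "$f" "$dir/${prefix}_$(basename "$f")"
    done
done < "$dir/rename_map.txt"
ls -1 "$dir"
```

The [ -e "$f" ] guard handles the case mentioned in the question where a rename entry (here File-03) has no file in the directory.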

Bash Loop To Merge Sorted Files Using The Same Output File?

I'm currently working on a larger script, but I can't get this single function to work properly.
for f in app1/*; do
sort -u $f "temp.txt" > "temp.txt"
done
Directory app1 has a few text files in it. What I am trying to do is take each file one by one and merge it with temp.txt to build an updated sorted temp.txt file without duplicates.
Example:
temp.txt starts as an empty file.
app1/1.txt
a
b
c
d
app1/2.txt
d
e
f
End result at the end of the loop
temp.txt
a
b
c
d
e
f
The problem I'm running into is that the temp.txt file only has the data from the last file passed through the loop.
If all the files combined are not large, you can sort them at once:
sort -u *.txt > all
If the files are large and sorting must be done at one file level, you can do
sort -u $f all -o all
You have two problems: you are using the output file as input (as stated by others), and you overwrite the output file in each iteration. Consider the next incorrect fix:
for f in app1/*; do
sort -u $f "temp.txt" > "temp1.txt"
done
This code resets the output file for each f. Remember: when you redirect to a file inside a loop, append (>> "temp1.txt") rather than truncate.
The problem seems to be fixed with the ugly loop:
for f in app1/*; do
cp temp.txt fix1.txt
sort -u $f "fix1.txt" > "temp.txt"
done
The way you should do it is to redirect the output outside the loop. Since you start with an empty temp.txt, you have
for f in app1/*; do
sort -u $f
done > "fix2.txt"
sort -u "fix2.txt" > "temp.txt"
Or, as @Andrey suggests, you can use
for f in app1/*; do
sort -u $f
done | sort -u > "temp.txt"
or
sort -u app1/* > "temp.txt"
You may want to use sort's -o option instead of redirection; it writes the output file only after all the input has been read, so the input and output can safely be the same file:
sort -u $f "temp.txt" -o "temp.txt"
(Appending with >> would avoid the truncation, but temp.txt would then accumulate duplicate, unsorted copies on each pass.)
This may be another way to do it:
reut@reut-work-room:~/srt$ cat 1.txt
a
b
c
d
reut@reut-work-room:~/srt$ cat 2.txt
d
e
f
reut@reut-work-room:~/srt$ sort -u *.txt > out.txt
reut@reut-work-room:~/srt$ cat out.txt
a
b
c
d
e
f
The shell processes redirections before launching the command. So
sort foo bar > bar
will first truncate "bar" to zero bytes. Then the sort command has the "normal" foo file and a now empty bar file to work with.
ref: http://www.gnu.org/software/bash/manual/bashref.html#Redirections
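Given that behavior, sort's -o option is the safe way to reuse the output file as input inside the loop, since sort only opens the output for writing after reading everything; a sketch with the question's sample data:

```shell
dir=$(mktemp -d)
mkdir "$dir/app1"
printf '%s\n' a b c d > "$dir/app1/1.txt"
printf '%s\n' d e f > "$dir/app1/2.txt"

: > "$dir/temp.txt"                      # start with an empty temp.txt
for f in "$dir"/app1/*; do
    # -o lets temp.txt be both an input and the output of the same sort
    sort -u "$f" "$dir/temp.txt" -o "$dir/temp.txt"
done
cat "$dir/temp.txt"
```

POSIX explicitly permits the -o output file to be one of the input files, which is exactly what a shell redirection cannot guarantee.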

How to delete all lines containing certain words, except those containing certain words?

I have a file called file1.txt. I'd like to delete every line containing the words, "center of", "farm", or "middle of", etc. except lines which contain "①" or "city".
The list of deletions and exceptions is quite long.
The files are in UTF-8.
How can I delete every line containing at least one of these words, but not those lines which have some of the exceptions?
This might work for you:
sed -i '/center of\|farm\|middle of/{/①\|city/!d}' file1.txt
or
sed -i '/center of/ba
/farm/ba
/middle of/ba
b
:a
/①/b
/city/b
d' file1.txt
and if you have words.txt and exceptions.txt files, use this:
sed '/\*exceptions\*/{h;s/.*/:a/p;d}
x
/./{x;s|.*|/&/b|p;$!d;s/.*/d/;q}
x
s|.*|/&/ba|' words.txt - <<<"*exceptions*" exceptions.txt > file.sed
sed -i -f file.sed file1.txt
sed -r '/①|city/{p;d};/center of|farm|middle of/d' file1.txt
sed '/blacklist/{/whitelist/p;d}' file
Delete the blacklist, except it is in the whitelist:
echo -e "a b\nb c\nc d\nd e\ne f" | sed '/c\|d/{/a\|b/p;d}'
prints
a b
b c
e f
that is, every line that does not contain c or d, plus lines that do contain c or d only if they also contain a or b.
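The same blacklist/whitelist pattern applied to the question's own words (GNU sed for the \| alternation; the real lists would of course be longer):

```shell
dir=$(mktemp -d)
printf '%s\n' 'center of town ① ok' 'the farm' 'middle of the city' 'plain line' \
    > "$dir/file1.txt"
# On blacklisted lines, delete unless a whitelisted word also appears.
sed '/center of\|farm\|middle of/{/①\|city/!d}' "$dir/file1.txt" > "$dir/out.txt"
cat "$dir/out.txt"
```

Only "the farm" is removed: it matches the blacklist and carries neither ① nor city, while the other blacklisted lines are rescued by an exception word.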
