Echoing awk output to file to remove duplicates has strange output - bash

I made a small shell script to try to remove duplicate entries (lines) from a text file. When the script is run and the file has three identical lines, strange output occurs.
The shell script is run on an Ubuntu distribution.
The contents of my text file:
one
one
one
The script I am running to remove duplicates:
echo -e $(awk '!a[$0]++' /test/test.txt) > /test/test.txt
The awk command is intended to delete the duplicates, while the echo is intended to write the result back to the file.
Upon running my script, I receive the following output in the file:
one
one
It should also be noted that there is an additional newline after the second line, and a space at the start of the second line.

Writing to a file at the same time that you are reading from it usually leads to disaster.
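To see the general problem in isolation (your echo -e $(...) variant only appears to half-work because the command substitution is expanded before the redirection truncates the file):
$ printf 'one\none\none\n' > text
$ awk '!a[$0]++' text > text
$ cat text
$
The redirection truncates text before awk ever reads it, so the file ends up empty.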
If you have GNU awk, then use the -i inplace option:
$ cat text
one
one
one
$ gawk -i inplace '!a[$0]++' text
$ cat text
one
If you have BSD awk, then use:
awk '!a[$0]++' text >tmp && mv tmp text
Alternatively, if you have sponge installed:
awk '!a[$0]++' text | sponge text
sponge does not update the file until the pipeline has finished reading and processing it.
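sponge is part of the moreutils package; on Ubuntu (which the question mentions) it should be installable with:
$ sudo apt-get install moreutils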

Related

Crop Lines from multiple CSV files using bash

I have a directory of 40 or so CSVs. Each CSV file has an extra 10 lines at the top that I don't need. I'm new to bash commands, but I have found that I can use
tail -n +10 oldfile.csv > newfile.csv
to cut 10 lines from a file, one file at a time. How can I do this across all CSVs in the directory? I have tried doing this:
for filename in *foo*; do echo tail -n +10 \"$filename\" > \"${filename}\"; done
From what I've read, I thought this would pass in every CSV containing foo in its name, run the command, and leave the filename alone. Where am I going wrong?
You cannot use the same file as input and output.
With sed, you can edit the file in place with the -i flag:
for f in *.csv; do
sed -i '1,10d' "$f"
done
or as one-liner for the command line:
for f in *.csv; do sed -i '1,10d' "$f"; done
As a side note, your tail should be tail -n +11 to output from the 11th line to the end of the file.
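Putting both fixes together (drop the echo, write to a temporary file, then move it back over the original), the loop could look like this:
for filename in *.csv; do
    tail -n +11 "$filename" > "$filename.tmp" && mv "$filename.tmp" "$filename"
done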
Use a proper loop, as below. I am using the native ex editor (which Vim uses internally) for the in-place replacement, so you don't have to move the files back with mv or any other command.
for file in *.csv
do
ex -sc '1d10|x' "$file"
done
The command moves to the first line, deletes 10 lines starting there, then saves and closes the file.
As a command-line friendly one-liner:
for file in *.csv; do ex -sc '1d10|x' "$file"; done
The ex command is POSIX-compliant and works on all major platforms and distros.
In awk:
$ awk 'FNR>10{ print > "new-" FILENAME }' files*
Explained:
FNR>10: if the current record number within the current file is greater than 10, the condition is true and the record is printed.
print: well, output the record.
> "new-" FILENAME: redirect the output to a new file named after the original, for example new-file.
Edited to write the output to multiple files. The original, which just output to the screen, was awk 'FNR>10' files*
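If the trimmed new- copies should then replace the originals, a follow-up step along these lines would do it (not part of the original answer):
for f in *.csv; do mv "new-$f" "$f"; done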

How to combine multiple sed and awk commands?

I have a folder with about 2 million files in it. I need to run the following commands:
sed -i 's/<title>/<item><title>/g;s/rel="nofollow"//g;s/<\/a> •/]]><\/wp:meta_value><\/wp:postmeta><content:encoded><![CDATA[/g;s/By <a href="http:\/\/www.website.com\/authors.*itemprop="author">/<wp:postmeta><wp:meta_key><![CDATA[custom_author]]><\/wp:meta_key><wp:meta_value><![CDATA[/g' /home/testing/*
sed -i '$a]]></content:encoded><wp:status><![CDATA[draft]]></wp:status><wp:post_type><![CDATA[post]]></wp:post_type><dc:creator><![CDATA[Database]]></dc:creator></item>\' /home/testing/*
awk -i inplace 1 ORS=' ' /home/testing/*
The problem I'm having is that when I run the first command, it cycles through all 2 million files; then I move on to the second command, and so on. That means I'm opening files 6 million times in total.
I'd prefer that when each file is opened, all 3 commands are run on it and then it moves on to the next. Hopefully that makes sense.
You can do everything in one awk command as something like:
awk -i inplace -v ORS=' ' '{
gsub(/<title>/,"<item><title>")
gsub(/rel="nofollow"/,"")
gsub(/<\/a> •/,"]]><\/wp:meta_value><\/wp:postmeta><content:encoded><![CDATA[")
gsub(/By <a href="http:\/\/www.website.com\/authors.*itemprop="author">/,"<wp:postmeta><wp:meta_key><![CDATA[custom_author]]><\/wp:meta_key><wp:meta_value><![CDATA[")
print $0 "]]></content:encoded><wp:status><![CDATA[draft]]></wp:status><wp:post_type><![CDATA[post]]></wp:post_type><dc:creator><![CDATA[Database]]></dc:creator></item>"
}' /home/testing/*
but that doesn't mean it's necessarily the best way to do what you want.
The above relies on my correctly interpreting what your commands are doing and is obviously untested since you didn't provide any sample input and expected output. It also still relies on GNU awk for -i inplace like your original script did.
Assuming that your files are small enough for a single file to fit into memory as a whole (and assuming GNU sed, which your use of -i without an option-argument implies):
sed -i -e ':a;$!{N;ba}; s/.../.../g; ...; $a...' -e 's/\n/ /g' /home/testing/*
s/.../.../g; ...; and $a... in the command above represent your actual substitution and append commands.
:a;$!{N;ba}; reads each input file as a whole; the desired substitutions, the appending, and the replacement of all newlines with a single space each are then performed.[1]
This allows you to make do with a single sed command per input file.
[1] Your awk 1 ORS=' ' command actually creates output with a trailing space instead of a newline. By contrast, 's/\n/ /g' applied to the whole input file will only place a space between lines, and terminate the overall file with a newline (assuming the input file ended in one).
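A quick way to see the difference described in [1], using a hypothetical two-line file f and od -c to make the final bytes visible:
$ printf 'a\nb\n' > f
$ awk 1 ORS=' ' f | od -c                  # ends with a trailing space, no newline
$ sed ':a;$!{N;ba}; s/\n/ /g' f | od -c    # ends with \n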

how to remove <Technology> and </Technology> words from a file using shell script?

My text file contains 100 lines, and it is certain to contain the words Technology and /Technology. I want to remove the Technology and /Technology words from the file using a shell script.
sed -i.bak -e 's#/Technology##g' -e 's#Technology##g' my_text_file
This deletes the words and also makes a backup of the original file, just in case.
sed -i -e 's#/Technology##g' -e 's#Technology##g' my_text_file
This will not make a backup but just modify the original file
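Note that the order of the two expressions matters: the /Technology substitution must run first. If the plain Technology substitution ran first, it would eat part of /Technology and leave a stray slash behind:
$ echo 'Mine /Technology must go' | sed -e 's#Technology##g' -e 's#/Technology##g'
Mine / must go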
You can try this one if the words are actually tags with angle brackets, as in the title (note the capital T, to match the question):
sed -r 's/<\/?Technology>//g' a
Here is an awk version:
cat file
This Technology
More here
Mine /Technology must go
awk '{gsub(/\/*Technology/,"")}1' file
This
More here
Mine must go
By adding a trailing space to the regex, it will not leave an extra space in the output. (Note, though, that this version misses a Technology at the end of a line, as the first line of output below shows.)
awk '{gsub(/\/*Technology /,"")}1' file
This Technology
More here
Mine must go
To write back to the original file:
awk '{gsub(/\/*Technology /,"")}1' file > tmp && mv tmp file
If you have GNU awk 4.1+, you can edit the file in place:
gawk -i inplace '{gsub(/\/*Technology /,"")}1' file

Bash shell remove rows from a text file

I have a big domain-list file for a proxy filter. In another file I have some exceptions; I would like to remove from the filter file all rows that appear in the exceptions file. Is it possible with some "sed" operation?
Thanks.
You can generally use grep with the -v and -f options for this. In fact, you probably want to use fgrep or the -F flag as well, to ensure the strings are treated as fixed strings rather than regexes. Without that, for example, the first line of the infile file below would be removed despite not actually matching the fixed string.
-v reverses the sense, so that matching lines are thrown away rather than kept, and -f gets the patterns from a file rather than the command line.
For example:
pax> cat infile
http://wwwxdodgy.com/rest-of-url
http://www.dodgy.com/rest-of-url
ftp://this/one/is/good
https://www.bad.org/rest-of-url
pax> cat exceptions
http://www.dodgy.com
https://www.bad.org
pax> fgrep -v -f exceptions infile
ftp://this/one/is/good
It is easier to do this with grep:
grep -v -x -F -f /path/to/exclude /path/to/file
The extra -x flag matches whole lines only, so an exception removes a line only when it matches the entire line exactly.
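Note that -x changes the behavior relative to the fgrep answer above: with the sample files shown there, the exceptions are URL prefixes rather than whole lines, so with -x nothing is filtered out:
pax> grep -v -x -F -f exceptions infile
http://wwwxdodgy.com/rest-of-url
http://www.dodgy.com/rest-of-url
ftp://this/one/is/good
https://www.bad.org/rest-of-url
Drop -x if prefix matching is what you actually want.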

Extracting all lines from a file that are not commented out in a shell script

I'm trying to extract lines from certain files that do not begin with # (i.e. are not commented out). How would I run through a file, ignore everything with a # in front of it, and copy each line that does not start with a # into a different file?
Thanks
Simpler: grep -v '^[[:space:]]*#' input.txt > output.txt
This assumes that you're using a Unix/Linux shell with the usual Unix toolkit of commands, AND that you want to keep a copy of the original file.
cp file file.orig
mv file file.fix
sed '/^[ ]*#/d' file.fix > file
rm file.fix
Or, if you've got a nice shiny new GNU sed, that can all be summarized as
cp file file.orig
sed -i '/^[ ]*#/d' file
In both cases, the bracket expression in the sed command is meant to contain a space character and a literal tab character.
So it is saying: delete any line that begins with (optional space or tab characters followed by) a #, but print everything else.
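Since a literal tab is awkward to type at a prompt, the POSIX character class [[:blank:]] (space or tab) should work as a substitute:
sed -i '/^[[:blank:]]*#/d' file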
I hope this helps.
grep -v ^\# file > newfile
grep -v ^\# file | grep -v ^$ > newfile
Not fancy regexes, but I provide this method to junior admins as it helps with their understanding of pipes and redirection.
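Combining the two answers above, a version that skips indented comments as well as blank lines might look like this (same file names as above):
grep -v '^[[:space:]]*#' file | grep -v '^[[:space:]]*$' > newfile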
