How to combine multiple sed and awk commands? - bash

I have a folder with about 2 million files in it. I need to run the following commands:
sed -i 's/<title>/<item><title>/g;s/rel="nofollow"//g;s/<\/a> •/]]><\/wp:meta_value><\/wp:postmeta><content:encoded><![CDATA[/g;s/By <a href="http:\/\/www.website.com\/authors.*itemprop="author">/<wp:postmeta><wp:meta_key><![CDATA[custom_author]]><\/wp:meta_key><wp:meta_value><![CDATA[/g' /home/testing/*
sed -i '$a]]></content:encoded><wp:status><![CDATA[draft]]></wp:status><wp:post_type><![CDATA[post]]></wp:post_type><dc:creator><![CDATA[Database]]></dc:creator></item>\' /home/testing/*
awk -i inplace 1 ORS=' ' /home/testing/*
The problem is that the first command cycles through all 2 million files before I can move on to the second command, and so on, so in total I'm opening files 6 million times.
I'd prefer that when each file is opened, all 3 commands are run on it and then it moves on to the next. Hopefully that makes sense.

You can do everything in one awk command with something like:
awk -i inplace -v ORS=' ' '{
gsub(/<title>/,"<item><title>")
gsub(/rel="nofollow"/,"")
gsub(/<\/a> •/,"]]></wp:meta_value></wp:postmeta><content:encoded><![CDATA[")
gsub(/By <a href="http:\/\/www.website.com\/authors.*itemprop="author">/,"<wp:postmeta><wp:meta_key><![CDATA[custom_author]]></wp:meta_key><wp:meta_value><![CDATA[")
print $0 "]]></content:encoded><wp:status><![CDATA[draft]]></wp:status><wp:post_type><![CDATA[post]]></wp:post_type><dc:creator><![CDATA[Database]]></dc:creator></item>"
}' /home/testing/*
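but that doesn't mean it's necessarily the best way to do what you want.
The above relies on my correctly interpreting what your commands are doing and is untested, since you didn't provide any sample input and expected output. One difference to be aware of: the print appends the closing block after every input line, whereas your sed $a appends it once, after the last line of each file, so the two only agree for single-line files. It also still relies on GNU awk for -i inplace, as your original script did.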

Assuming that your files are small enough for a single file to fit into memory as a whole (and assuming GNU sed, which your use of -i without an option-argument implies):
sed -i -e ':a;$!{N;ba}; s/.../.../g; ...; $a...' -e 's/\n/ /g' /home/testing/*
s/.../.../g; ...; and $a... in the command above represent your actual substitution and append commands.
:a;$!{N;ba}; reads each input file as a whole, and then performs the desired substitutions, appending, and replacement of all newlines with a single space each.[1]
This allows you to make do with a single sed command per input file.
[1] Your awk 1 ORS=' ' command actually creates output with a trailing space instead of a newline. By contrast, 's/\n/ /g' applied to the whole input file will only place a space between lines, and terminate the overall file with a newline (assuming the input file ended in one).
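To see the read-the-whole-file loop in isolation, here is a minimal sketch on a throwaway file. The file name demo.txt and its contents are made up for the demo. The loop is split across several -e options and the result is written to stdout instead of using -i, which is just a conservative way of spelling the same idea so it does not depend on GNU-specific parsing.

```shell
# Scratch directory and a three-line sample file (names are hypothetical).
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmpdir/demo.txt"

# :a ... $!{N;ba} keeps appending the next line to the pattern space until the
# last line has been read, so the final s command sees the whole file at once
# and can turn every embedded newline into a space.
result=$(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/\n/ /g' "$tmpdir/demo.txt")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Any further substitutions would go between the loop and the final s/\n/ /g, exactly as in the one-liner above.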

Related

pass sed a long list of line numbers to remove from a file

I am trying to remove 500+ non-consecutive lines from a very large file with sed.
I have the line numbers stored in a list.txt file, but I can't use it in a for loop
for i in `cat list`; do echo 'sed -i -e ' \'"$i"d\'' huge_file.txt' ; done
because line numbers in the original file would change every time sed removes one and exits.
I should do:
sed -i -e '1d;2d;93572277d;93572278d; ......;nth ' huge_file.txt
Is there a way to pass that list to sed in a file?
You can try with awk:
awk -v s="2,3,..,n" 'BEGIN{n=split(s,t,",");for(i=1;i<=n;i++)d[t[i]]=1}
!d[NR]' huge.txt
You pass the comma-separated line numbers to awk with -v. In the BEGIN block awk splits them into an array; then, for each input line, it checks whether the line number is in the array and skips the line if it is.
Test it with a small file first. If it works as you expect, you can do:
awk -v '....' '....' huge.txt > tmp.txt && mv tmp.txt huge.txt
to write the change back to your original input file.
Update:
If you have the 500 line numbers in another file, say with one number per line, you can use:
awk 'NR==FNR{a[$0]=1;next}!a[FNR]' ln.txt huge.txt
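Here is a small, self-contained run of that two-file idiom, as a sketch only. The six-line huge.txt and the two numbers in ln.txt are invented sample data, and the array is named del here purely for readability.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' a b c d e f > "$tmpdir/huge.txt"   # stand-in for the big file
printf '%s\n' 2 5 > "$tmpdir/ln.txt"             # line numbers to delete

# While reading the first file (NR==FNR) remember each listed number.
# While reading the second file print only lines whose number was not listed.
kept=$(awk 'NR==FNR { del[$0] = 1; next } !(FNR in del)' \
        "$tmpdir/ln.txt" "$tmpdir/huge.txt")

printf '%s\n' "$kept"
rm -rf "$tmpdir"
```

Using (FNR in del) instead of del[FNR] avoids creating an array entry for every line of the big file, which may matter for a file with tens of millions of lines.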
If it's just for a one-off task, you can use the following GNU sed approach (assuming every number in list.txt, including the last one, is followed by a newline \n):
sed -i "$(sed -z 's/\n/d;/g' list.txt)" huge_file.txt

Echoing awk output to file to remove duplicates has strange output

I made a small shell script to try to remove duplicate entries (lines) from a text file. When the script is run and the file has three lines, all identical, a strange output occurs.
The shell script is run on an Ubuntu distribution.
The contents of my text file:
one
one
one
The script I am running to remove duplicates:
echo -e $(awk '!a[$0]++' /test/test.txt) > /test/test.txt
The awk is intended to delete duplicates, while the echo is intended to output it to a file.
Upon running my script, I receive the following output in the file:
one
one
It should also be noted that there is an additional newline after the second line, and a space at the start of the second line.
Writing to a file at the same time that you are reading from it usually leads to disaster.
If you have GNU awk, then use the -i inplace option:
$ cat text
one
one
one
$ gawk -i inplace '!a[$0]++' text
$ cat text
one
If you have BSD awk, then use:
awk '!a[$0]++' text >tmp && mv tmp text
Alternatively, if you have sponge installed:
awk '!a[$0]++' text | sponge text
sponge does not update the file until the pipeline has finished reading and processing it.
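As a minimal sketch of the temporary-file variant, assuming a scratch copy of the three-line file from the question (the paths here are made up and live under a mktemp directory):

```shell
tmpdir=$(mktemp -d)
file="$tmpdir/test.txt"
printf 'one\none\none\n' > "$file"

# Write the deduplicated lines to a temporary file first, and replace the
# original only if awk succeeded, so the input is never truncated mid-read.
tmp=$(mktemp "$tmpdir/dedupe.XXXXXX")
awk '!seen[$0]++' "$file" > "$tmp" && mv "$tmp" "$file"

deduped=$(cat "$file")
printf '%s\n' "$deduped"
rm -rf "$tmpdir"
```

Creating the temporary file in the same directory as the target keeps the final mv a rename on the same filesystem.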

Deleting first n rows and column x from multiple files using Bash script

I am aware that the "deleting n rows" and "deleting column x" questions have both been answered individually before. My current problem is that I'm writing my first bash script, and am having trouble making that script work the way I want it to.
file0001.csv (there are several hundred files like these in one folder)
Data number of lines 540
No.,Profile,Unit
1,1027.84,µm
2,1027.92,µm
3,1028,µm
4,1028.81,µm
Desired output
1,1027.84
2,1027.92
3,1028
4,1028.81
I am able to use sed and cut individually but for some reason the following bash script doesn't take cut into account. It also gives me an error "sed: can't read ls: No such file or directory", yet sed is successful and the output is saved to the original files.
sem2csv.sh
for files in 'ls *.csv' #list of all .csv files
do
sed '1,2d' -i $files | cut -f '1-2' -d ','
done
Actual output:
1,1027.84,µm
2,1027.92,µm
3,1028,µm
4,1028.81,µm
I know there may be awk one-liners but I would really like to understand why this particular bash script isn't running as intended. What am I missing?
The -i option of sed modifies the file in place. Your pipeline to cut receives no input because sed -i produces no output. Without this option, sed would write the results to standard output, instead of back to the file, and then your pipeline would work; but then you would have to take care of writing the results back to the original file yourself.
Moreover, single quotes inhibit expansion -- you are "looping" over the single literal string ls *.csv. The fact that you are not quoting it properly then causes the string to be subject to wildcard expansion inside the loop. So after variable interpolation, your sed command expands to
sed -i 1,2d ls *.csv
and then the shell expands *.csv because it is not quoted. (You should have been receiving a warning that there is no file named ls in the current directory, too.) You probably attempted to copy an example which used backticks (ASCII 96) instead of single quotes (ASCII 39) -- the difference is quite significant.
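A quick way to convince yourself of the quoting point is to count the iterations. This is only an illustration; the variable names count and literal are made up.

```shell
# Single quotes make 'ls *.csv' one literal word, so the loop body runs once.
count=0
for item in 'ls *.csv'; do
    count=$((count + 1))
    literal=$item          # holds the unexpanded text: ls *.csv
done
printf 'iterations: %s, value: %s\n' "$count" "$literal"
```

The loop variable only gets split and glob-expanded later, when it is used unquoted in the sed command.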
Anyway, the ls is useless -- the proper idiom is
for files in *.csv; do
sed '1,2d' "$files" ... # the double quotes here are important
done
Mixing sed and cut is usually not a good idea because you can express anything cut can do in terms of a simple sed script. So your entire script could be
for f in *.csv; do
sed -i -e '1,2d' -e 's/,[^,]*$//' "$f"
done
which says to remove the last comma and everything after it. (If your sed does not like multiple -e options, try with a semicolon separator: sed -i '1,2d;s/,[^,]*$//' "$f")
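For a dry run on disposable data, something like the following sketch may help. It assumes a scratch directory with a shortened copy of the sample file (with um in place of µm to keep the demo ASCII-only), and it writes through a temporary file instead of -i so it does not depend on which sed you have.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/file0001.csv" <<'EOF'
Data number of lines 540
No.,Profile,Unit
1,1027.84,um
2,1027.92,um
EOF

for f in "$tmpdir"/*.csv; do
    # Drop the two header lines, then strip the last comma-separated field.
    sed -e '1,2d' -e 's/,[^,]*$//' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done

trimmed=$(cat "$tmpdir/file0001.csv")
printf '%s\n' "$trimmed"
rm -rf "$tmpdir"
```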
You may use awk,
$ awk 'NR>2{sub(/,[^,]*$/,"",$0);print}' file
1,1027.84
2,1027.92
3,1028
4,1028.81
or
sed -i '1,2d;s/,[^,]*$//' file
1,2d; for deleting the first two lines.
s/,[^,]*$// removes the last comma and everything after it on each remaining line.

Using shell script to copy script from one file to another

Basically I want to copy several lines of code from a template file to a script file.
Is it even possible to use sed to copy a string full of symbols that interact with the script?
I used these lines:
$SWAP='sudo cat /home/kaarel/template'
sed -i -e "s/#pointer/${SWAP}/" "script.sh"
The output is:
./line-adder.sh: line 11: =sudo cat /home/kaarel/template: No such file or directory
No, it is not possible to do this robustly with sed. Just use awk:
awk -v swap="$SWAP" '{sub(/#pointer/,swap)}1' script.sh > tmp && mv tmp script.sh
With recent versions of GNU awk there's a -i inplace flag for inplace editing if that's something you care about.
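One caveat: sub() treats & in the replacement as "the matched text", so a template containing & would be mangled. Here's a more robust version that uses literal string operations instead, so no character in the search or replacement strings is special to the match (the one remaining caveat is that awk expands backslash escapes in -v assignments):
awk -v old="#pointer" -v new="$SWAP" 's=index($0,old){$0 = substr($0,1,s-1) new substr($0,s+length(old))} 1' script.sh > tmp && mv tmp script.sh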
e.g.:
$ echo "abc" | awk -v old="b" -v new="m&&n" 's=index($0,old){$0 = substr($0,1,s-1) new substr($0,s+length(old))} 1'
am&&nc
There are two issues with the line:
$SWAP='sudo cat /home/kaarel/template'
The first is that, before executing the line, bash performs variable expansion and replaces $SWAP with the current value of SWAP. That is not what you wanted. You wanted bash to assign a value to SWAP.
The second issue is that the right-hand side is enclosed in single-quotes which protect the string from expansion. You didn't want to protect the string from expansion: you wanted to execute it. To execute it, you can use back-quotes which may look similar but act very differently.
Back-quotes, however, are an ancient form of asking for command execution. The more modern form is $(...) which eliminates some problems that back-quotes had.
Putting it all together, use:
SWAP=$(sudo cat /home/kaarel/template)
sed -i -e "s/#pointer/${SWAP}/" "script.sh"
Be aware, though, that the sed command will have problems if the template file contains any sed-active characters, such as /, &, backslashes, or newlines (which a template of several lines necessarily has).
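If the placeholder sits on a line of its own, one way to sidestep the special characters entirely is to let awk read the template file itself. This is a different technique from the s command above, shown only as a sketch: the file contents are invented, and it assumes the line is exactly #pointer with nothing else on it.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/template" <<'EOF'
echo "a/b & c"
echo "x\ty"
EOF
cat > "$tmpdir/script.sh" <<'EOF'
#!/bin/sh
#pointer
exit 0
EOF

# Replace the whole "#pointer" line with the template's lines. awk reads the
# template via getline, so /, & and backslashes in it are copied unchanged.
awk -v tpl="$tmpdir/template" '
    $0 == "#pointer" {
        while ((getline line < tpl) > 0) print line
        close(tpl)
        next
    }
    { print }
' "$tmpdir/script.sh" > "$tmpdir/script.new" &&
    mv "$tmpdir/script.new" "$tmpdir/script.sh"

merged=$(cat "$tmpdir/script.sh")
printf '%s\n' "$merged"
rm -rf "$tmpdir"
```

Only the template's path goes through -v here, never its contents, which is why the escape processing of -v is not an issue in this variant.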

sed command in Linux: how to use sed to replace only the first n occurrences

I want to replace only the first four occurrences of LC-COUNT=1. How can I do that?
sed -i "s/LC-COUNT=1/LC-COUNT=$LC_COUNT/1,4" file.txt
Try this:
sed -e '0,/LC-COUNT=1/s//LC-COUNT=\$LC_COUNT/' file.txt > output.txt
Running it once replaces the first occurrence of LC-COUNT=1 with the literal text LC-COUNT=$LC_COUNT and writes the result to output.txt. Note: the $ character has to be escaped, as shown. (The 0,/regex/ address form is a GNU sed extension.)
To replace four occurrences you will have to run it four times, each time using the output of the previous run as the input file.
I don't think sed has a direct way to find and replace only the first N occurrences.
In vim you do the similar kind of thing like -
:%s/LC-COUNT=1/LC-COUNT=\$LC_COUNT/gc
The gc flags make vim ask for confirmation on each replacement, so you can accept the first four and then quit with q.
This is better suited for awk.
Consider this awk command:
awk -F= '$1=="LC-COUNT" && $2=="1" && c<4 {$2="=$LC_COUNT";c++}1' OFS= file
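To check the counting logic on sample data, here is a sketch that passes the shell value in with -v instead of writing the literal text $LC_COUNT. It assumes each occurrence is a line consisting of exactly LC-COUNT=1; the value 7 and the awk variable names new, max and count are made up for the demo.

```shell
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do echo 'LC-COUNT=1'; done > "$tmpdir/file.txt"

LC_COUNT=7    # hypothetical replacement value
replaced=$(awk -v new="$LC_COUNT" -v max=4 '
    # Rewrite a matching line only while fewer than max have been changed.
    $0 == "LC-COUNT=1" && count < max { $0 = "LC-COUNT=" new; count++ }
    { print }
' "$tmpdir/file.txt")

printf '%s\n' "$replaced"
rm -rf "$tmpdir"
```

The first four lines come out as LC-COUNT=7 and the remaining two are left alone.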
