Add prefix to every line in text in bash

Suppose there is a text file a.txt e.g.
aaa
bbb
ccc
ddd
I need to add a prefix (e.g. myprefix_) to every line in the file:
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
I can do that with awk:
awk '{print "myprefix_" $0}' a.txt
Now I wonder if there is another way to do that in shell.

With sed:
$ sed 's/^/myprefix_/' a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
This substitutes myprefix_ for the empty string that ^ matches at the beginning of every line; the match consumes nothing, so the effect is simply to prepend content to each line.
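If the prefix lives in a shell variable, you can also expand it into the sed command; a small variation (not in the original answer) that assumes the value contains no characters special to sed, such as /, & or backslashes:
$ prefix="myprefix_"
$ sed "s/^/$prefix/" a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd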
You can make your awk version shorter with:
$ awk '$0="myprefix_"$0' a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
or, passing the prefix in as an awk variable:
$ prefix="myprefix_"
$ awk -v prefix="$prefix" '$0=prefix$0' a.txt
myprefix_aaa
myprefix_bbb
myprefix_ccc
myprefix_ddd
It can also be done with nl, which prints each line number right-aligned in a 6-character field followed by the separator given with -s; cut -c7- then strips the number field, leaving only the separator (our prefix) and the original line:
$ nl -s "prefix_" a.txt | cut -c7-
prefix_aaa
prefix_bbb
prefix_ccc
prefix_ddd
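Note that nl numbers only non-blank lines by default, so blank lines would be left without the prefix; a variant (my addition) that numbers every line:
$ nl -ba -s "prefix_" a.txt | cut -c7-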
Finally, as John Zwinck explains, you can pair the file with an endless stream of the prefix produced by yes. Since yes never terminates and paste keeps going as long as either input has data, head must truncate the output to the file's line count. Note -d '\0', which POSIX defines as an empty delimiter, and wc -l < a.txt, which reads stdin so that wc prints only the number; this form works on Linux and OS X alike:
paste -d '\0' <(yes prefix_) a.txt | head -n $(wc -l < a.txt)

Pure bash:
while IFS= read -r line
do
echo "prefix_$line"
done < a.txt

For reference, regarding the speed of the awk, sed, and bash solutions to this question:
Generate an 800K input file in bash:
line="12345678901234567890123456789012345678901234567890123456789012345678901234567890"
rm -f a.txt
for i in {1..10000} ; do
echo "$line" >> a.txt
done
Then consider the bash script timeIt
if [ -e b.txt ] ; then
rm b.txt
fi
echo "Bash:"
time bashtest
rm b.txt
echo
echo "Awk:"
time awktest
rm b.txt
echo
echo "Sed:"
time sedtest
where bashtest is
while IFS= read -r line
do
echo "prefix_$line" >> b.txt
done < a.txt
awktest is:
awk '$0="myprefix_"$0' a.txt > b.txt
and sedtest is:
sed 's/^/myprefix_/' a.txt > b.txt
I got the following result on my machine:
Bash:
real 0m0.401s
user 0m0.340s
sys 0m0.048s
Awk:
real 0m0.009s
user 0m0.000s
sys 0m0.004s
Sed:
real 0m0.009s
user 0m0.000s
sys 0m0.004s
The bash solution is much slower, as expected: the shell loop runs a read and an echo (re-opening b.txt for append) once per line, while awk and sed stream the whole file in a single process.

You can also use the xargs utility (GNU xargs: -d "\n" makes every input line a single argument, even lines with trailing blanks, which would otherwise affect the -L handling):
xargs -d "\n" -L1 echo myprefix_ < file
Note that echo inserts a space between the prefix and the line.
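If you want the prefix glued directly to the line, a sketch using -I (still GNU xargs), which substitutes each whole input line for {} inside the argument:
xargs -d "\n" -I{} echo "myprefix_{}" < file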

Related

Estimate number of lines in a file and insert that value as first line

I have many files for which I have to count the number of lines and add that value as the first line. To get the count for one file, I used something like this:
wc -l 000600.txt | awk '{ print $1 }'
However, I have had no success doing it for all files and then adding the corresponding value as the first line of each one.
An example:
a.txt b.txt c.txt
>>print a
15
>> print b
22
>>print c
56
Then 15, 22 and 56 should be added as the first line of a.txt, b.txt and c.txt respectively.
I appreciate the help.
You can put a placeholder, for example LINENUM, in the first line of the file and then use the following script:
wc -l a.txt | awk '{print $1}' | xargs -I {} sed -i 's/LINENUM/LINENUM:{}/' a.txt
or, without a pre-existing placeholder, insert the count as a new first line:
wc -l a.txt | awk '{print $1}' | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' a.txt
This way you can add the line count as the first line of every *.txt file in the current directory. The group command below is also faster than the in-place editing commands above for large files. Do not change the spaces or semicolons in the grouping; they are required syntax.
for f in *.txt; do
{ wc -l < "$f"; cat "$f"; } > "${f}.tmp" && mv "${f}.tmp" "$f"
done
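For instance, applied to a two-line file (pets.txt is just an illustrative name; on BSD/macOS, wc pads the count with leading blanks):
printf 'dog\ncat\n' > pets.txt
{ wc -l < pets.txt; cat pets.txt; } > pets.txt.tmp && mv pets.txt.tmp pets.txt
cat pets.txt
# 2
# dog
# cat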
To iterate over all the files you can use this script:
for f in *; do if [ -f "$f" ]; then wc -l "$f" | awk '{print $1}' | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' "$f"; fi; done
This might work for you (GNU sed):
sed -i '1h;1!H;$!d;=;x' file1 file2 file3 etc ...
This slurps each file into the hold space and, on reaching the last line, prints that line's number (the file's line count) before printing the stored contents.
Alternative:
sed -i ':a;$!{N;ba};=' file?
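For readers unfamiliar with the hold space, here is the first one-liner restated with comments (GNU sed; same behavior, shown for a single file):
sed -i '
  # line 1: copy the line into the hold space
  1h
  # every later line: append to the hold space
  1!H
  # print nothing until the last line
  $!d
  # on the last line, print its line number, i.e. the line count
  =
  # swap in the hold space so the saved file contents follow the count
  x
' file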

Different ways of grepping for large amounts of data

So I have a huuuuge file and a big list of items that I want to grep out from that file.
For the sake of this example, let the files be denoted thus-
seq 1 10000 > file.txt #file.txt contains numbers from 1 to 10000
seq 1 5 10000 > list #list contains every fifth number from 1 to 10000
My question is: which is the best way to grep out the lines corresponding to 'list' from 'file.txt'?
I tried it in two ways-
time while read i ; do grep -w "$i" file.txt ; done < list > output
That command took - real 0m1.300s
time grep -wf list file.txt > output
This one was slower, clocking in at real 0m1.402s.
Is there a better (faster) way to do this? Is there a best way that I'm missing?
You're comparing apples and oranges.
This command greps words from list in file.txt:
time for i in `cat list`; do grep -w "$i" file.txt ; done > output
This command greps patterns from file.txt in list:
time grep -f file.txt list > output
You need to fix one file as the source of strings to match and the other file as the target data in which to match them; also use the same grep options, like -w or -F, in both tests.
It sounds like list is the source of patterns and file.txt is the target data file. Here are my timings for the adjusted original commands, plus one awk and two sed solutions; the sed solutions differ in whether the patterns are given as separate sed commands or as one extended regex.
timings
one grep
real 0m0.016s
user 0m0.001s
sys 0m0.001s
2000 output1
loop grep
real 0m10.120s
user 0m0.060s
sys 0m0.212s
2000 output2
awk
real 0m0.022s
user 0m0.007s
sys 0m0.000s
2000 output3
sed
real 0m4.260s
user 0m4.211s
sys 0m0.022s
2000 output4
sed -r
real 0m0.144s
user 0m0.085s
sys 0m0.047s
2000 output5
script
n=10000
seq 1 $n >file.txt
seq 1 5 $n >list
echo "one grep"
time grep -Fw -f list file.txt > output1
wc -l output1
echo "loop grep"
time for i in `cat list`; do grep -Fw "$i" file.txt ; done > output2
wc -l output2
echo "awk"
time awk 'ARGIND==1 {list[$1]; next} $1 in list' list file.txt >output3
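# note: ARGIND is gawk-specific; with a POSIX awk, use FNR==NR to test for the first file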
wc -l output3
echo "sed"
sed 's/^/\/^/;s/$/$\/p/' list >list.sed
time sed -n -f list.sed file.txt >output4
wc -l output4
echo "sed -r"
tr '\n' '|' <list|sed 's/^/\/^(/;s/|$/)$\/p/' >list.sedr
time sed -nr -f list.sedr file.txt >output5
wc -l output5
You can try awk:
awk 'NR==FNR{a[$1];next} $1 in a' file.txt list
On my system, awk is faster than grep with the sample data.
Test:
$ time grep -f file.txt list > out
real 0m1.231s
user 0m1.056s
sys 0m0.175s
$ time awk 'NR==FNR{a[$1];next} $1 in a' file.txt list > out1
real 0m0.068s
user 0m0.067s
sys 0m0.001s
Faster or not, you have a useless use of cat up there. Why not:
grep -f list file.txt # aren't the files meant to be the other way around?
Or use a slightly more customized awk:
awk 'NR==FNR{a[$1];next} $1 in a{print $1;next}' list file.txt
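For reference, here is the NR==FNR idiom used above spelled out with comments (a sketch; it assumes each line is a single field):
awk '
  NR==FNR { a[$1]; next }   # first file (list): remember each value as an array key
  $1 in a                   # second file (file.txt): print lines whose value was remembered
' list file.txt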

How to apply 'awk' for all files in folder?

I am new to awk, so please pardon my ignorance. I am using awk to extract tag values from a file. The following code works for a single file:
awk -F"<NAME>|</NAME>" '{print $2; exit;}' file.txt
but I am not sure how to run it for all files in a folder.
The file sample is as follows:
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
#!/bin/bash
STRING=ABC
DATE=$(date +%Y/%m/%d | tr '/' '-')
changedate(){
    for a in /root/Working/awk/*
    do
        for b in $(awk -F"<NAME>|</NAME>" '{print $2;}' "$a")
        do
            if [ "$b" == "$STRING" ]; then
                for c in $(awk -F"<DATE>|</DATE>" '{print $2;}' "$a")
                do
                    sed "s/$c/$DATE/g" "$a"
                done
            else
                echo "Strings are not a match"
            fi
        done
    done
}
changedate
When you run it:
root#revolt:~# cat /root/Working/awk/*
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>DEF</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>GHI</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>JKL</NAME><DATE>2015-12-11</DATE></BODY>
String in code is set to ABC
root#revolt:~# ./ANSWER
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-07-24</DATE></BODY>
Strings are not a match
Strings are not a match
Strings are not a match
String in code is set to DEF
root#revolt:~# ./ANSWER
Strings are not a match
<HEADER><H1></H1></HEADER><BODY><NAME>DEF</NAME><DATE>2015-07-24</DATE></BODY>
Strings are not a match
Strings are not a match
Alright. In this script you set STRING=ABC, or whatever your desired string is; you could also extend it to check against a list of strings.
The DATE variable produces today's date in year/month/day order; the tr command then replaces the forward slashes with hyphens to match the format in your files.
First we create a function called "changedate". Within this function we nest a few for loops. The first loop iterates over every file in /root/Working/awk/, setting each path to the variable a.
The next loop grabs, for each file, the value between the NAME tags and prints it. Notice we keep using "$a" as the file, because that is the path of the current file. An if statement then checks for your string: if it matches, another for loop substitutes the date in file a; if not, it echoes "Strings are not a match".
Lastly, we call the "changedate" function, which runs the entire looping sequence above.
To answer your question about running awk on multiple files somewhat generically, imagine we have these files:
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
$ cat file2.txt
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
$ cat file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
One thing you can do is simply supply awk with multiple files as with almost any command (like ls *.txt):
$ awk -F"<NAME>|</NAME>" '{print $2}' *.txt
XYZ
ABC
123
Awk just reads lines from each file in turn. As mentioned in the comments, be careful with exit, because it stops processing altogether after the first match:
$ awk -F"<NAME>|</NAME>" '{print $2; exit}' *.txt
XYZ
However, if for efficiency or some other reason you want to stop processing the current file and move on immediately to the next one, you can use the gawk-only nextfile statement:
$ # GAWK ONLY!
$ gawk -F"<NAME>|</NAME>" '{print $2; nextfile}' *.txt
XYZ
ABC
123
Sometimes the results on multiple files are not useful without knowing
which lines came from which file. For that you can use the built in FILENAME
variable:
$ awk -F"<NAME>|</NAME>" '{print FILENAME, $2}' *.txt
file1.txt XYZ
file2.txt ABC
file3.txt 123
Things get trickier when you want to modify the files you are working
on. Imagine you want to convert the name to lower case:
$ awk -F"<NAME>|</NAME>" '{print tolower($2)}' *.txt
xyz
abc
123
With traditional awk, the usual pattern is to save to a temp file and copy the temp file back over the original (obviously you want to be careful with this, keeping copies of the originals!):
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
$ awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' file1.txt > tmp && mv tmp file1.txt
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
To use this style on multiple files, it's probably easier to drop back to
the shell and run awk in a loop on single files:
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
$ for f in file*.txt; do
> awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' "$f" > tmp && mv tmp "$f"
> done
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>abc</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
Finally, with gawk you have the option of in-place editing (much like sed -i):
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
$ # GAWK ONLY!
$ gawk -v INPLACE_SUFFIX=.sav -i inplace -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' *.txt
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>abc</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
The recommended INPLACE_SUFFIX variable tells gawk to make backups of
each file with that extension:
$ cat file1.txt.sav file2.txt.sav file3.txt.sav
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>

Add a sentence in each line of a file with sed [duplicate]

I have a file with this content:
dog
cat
and I want to add a phrase (I have a) before every line, so the new content should be:
I have a dog
I have a cat
I tried with
sed -i 'I have a ' file
But I get something like this
I have a
dog
I have a
cat
What would be the correct use of sed? Is there another way of doing it?
You can use sed:
sed -i.bak 's/^/I have a /' file
cat file
I have a dog
I have a cat
Or awk:
awk '{print "I have a", $0}' file
I have a dog
I have a cat
With bash:
while read -r line
do
echo "I have a $line"
done < file > tmp_file && mv tmp_file file
Basically, read each line and echo it back with the leading phrase, redirecting to a temp file. If the command succeeds, the temp file is moved over the original.
With awk:
awk '{print "I have a", $0}' file > tmp_file && mv tmp_file file
With sed: check anubhava's answer for the best way; this can be another:
sed -i.bak -n 's/^/I have a /p' file
Note it is good practice to use -i.bak (or -i.whatever) to generate a backup file with the .bak extension.
Just for fun, you can also use paste:
$ paste -d" " <(printf "%.0sI have a\n" $(seq $(wc -l <file))) file
I have a dog
I have a cat
Explanation
paste -d" " file1 file2 prints the two files side by side, with a space as delimiter.
printf "%.0sI have a\n" $(seq $(wc -l <file)) prints "I have a" once for every line in file; the %.0s directive consumes one argument from seq while printing nothing.
$ wc -l <file
2
$ seq $(wc -l <file)
1
2
$ printf "%.0sI have a\n" $(seq $(wc -l <file))
I have a
I have a
sed -i 's/^/I have a /' your_file
(On BSD/macOS sed, -i requires an explicit suffix argument, e.g. sed -i '' 's/^/I have a /' your_file.)

bash: process substitution, paste and echo

I'm trying out process substitution and this is just a fun exercise.
I want to append the string "XXX" to all the values of 'ls':
paste -d ' ' <(ls -1) <(echo "XXX")
How come this does not work? XXX is not appended. However if I want to append the file name to itself such as
paste -d ' ' <(ls -1) <(ls -1)
it works.
I do not understand the behavior. Both echo and ls -1 write to stdout but echo's output isn't read by paste.
Try doing this, using a printf hack whose %.0s directive consumes each filename while printing nothing, so XXX is emitted once per file:
paste -d ' ' <(ls -1) <(printf "%.0sXXX\n" * )
Demo :
$ ls -1
filename1
filename10
filename2
filename3
filename4
filename5
filename6
filename7
filename8
filename9
Output :
filename1 XXX
filename10 XXX
filename2 XXX
filename3 XXX
filename4 XXX
filename5 XXX
filename6 XXX
filename7 XXX
filename8 XXX
filename9 XXX
If you just want to append XXX to each name, this one is simpler:
printf "%sXXX\n" *
If you want the XXX after every line of ls -1 output, you need a second command that outputs the string once per line. You are echoing it just once, so it gets paired with the first line of ls output only.
If you are searching for a tiny command line to achieve the task, you may use sed:
ls -1 | sed -n 's/\(^.*\)$/\1 XXX/p'
And here's a funny one, not using any external command except the legendary yes command!
while read -u 4 head && read -u 5 tail ; do echo "$head $tail"; done 4< <(ls -1) 5< <(yes XXX)
(I'm only posting this because it's funny and it's actually not 100% off topic since it uses file descriptors and process substitutions)
Regarding the suggestion in another answer here that you have to do:
for i in $( ls -1 ); do echo "$i XXXX"; done
Never use for i in $(command). See this answer for more details.
So, to answer the original question, you could simply use something like this:
for file in *; do echo "$file XXXX"; done
Another solution with awk:
ls -1|awk '{print $0" XXXX"}'
awk '{print $0" XXXX"}' <(ls -1) # with process substitution
Another solution with sed:
ls -1|sed "s/\(.*\)/\1 XXXX/g"
sed "s/\(.*\)/\1 XXXX/g" <(ls -1) # with process substitution
And some useless solutions, just for fun:
while read; do echo "$REPLY XXXX"; done <<< "$(ls -1)"
ls -1|while read; do echo "$REPLY XXXX"; done
It only works for the first line: paste pairs the first line of its first input with the first line of its second input, and echo produces only one line:
paste -d ' ' <(ls -1) <(echo "XXX")
... outputs:
/dir/file-a XXX
/dir/file-b
/dir/file-c
... you have to:
for i in $( ls -1 ); do echo "$i XXXX"; done
You can use xargs for the same effect:
ls -1 | xargs -I{} echo {} XXX
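With -I, the replacement string may appear anywhere inside an argument, so you can glue the suffix directly to the name and avoid echo's inserted space:
ls -1 | xargs -I{} echo {}XXX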
