How to output only text after a match with sed - bash

I am using sed to find a certain match in a text file and then put this value into a variable; my problem is that I only want the text after the match, not the entire line.
Ans=$(sed -n '/^'$1':/,/~/{/:/{p;n};/~/q;p}' $file.txt)
Text File
q1:answer1
~
q2:answer2
~
q3:answer3
~
Actual Output
q1:answer1
Expected Output
answer1

With grep:
Ans=$(grep -oP "^$1:\K.*" file)
or with perl if your grep version doesn't support the -P switch:
Ans=$(var=$1 perl -lne '/^$ENV{var}:\K.*/ and print $&' file)
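For example, with the sample file above saved as file and q1 as the search key:
$ grep -oP "^q1:\K.*" file
answer1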

In case a sed solution is needed - e.g., if answers could span multiple lines:
Ans=$(sed -r -n '/^'$1':(.*)/,/^(~)$/ { s//\1/; /^~$/q; p; }' file.txt)
(OSX users: use -E instead of -r).
Uses a backreference (\1) to replace the first matching line with its portion of interest; any other lines between the first matching one and the terminating ~ line are unaffected by the replacement (assuming they don't also start with $1:) and also printed.
Replace q with d if you don't want to quit after the first matching range.
By contrast, if the string of interest is limited to the line starting with $1:, there's no need to also match the ~ line, and the command can be simplified to:
Ans=$(sed -r -n '/^'$1':(.*)/ { s//\1/p; q; }' file.txt)
Remove q; if you don't want to quit after the first match.
However, the single-line case is more easily handled with a grep or awk solution - see @sputnick's and @anubhava's answers. If you wanted those to quit after the first match -- as in the snippets above and the code in the OP -- you'd need to add option -m 1 to the grep solution and ; exit to the awk solution (before the }).
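For reference, a sketch of those quit-after-first-match variants (assuming $1 holds the question key, as in the OP):
Ans=$(grep -m 1 -oP "^$1:\K.*" file)
Ans=$(awk -F':' -v s="$1" '$1 == s {print $2; exit}' file)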

Better use awk for this:
ans=$(awk -F':' -v s='q1' '$1 == s {print $2}' file)
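One caveat, assuming an answer can itself contain a colon: print $2 would drop everything after a second colon. A sketch that prints the whole remainder of the line instead:
ans=$(awk -v s='q1' 'index($0, s ":") == 1 {print substr($0, length(s) + 2)}' file)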

Related

Grep a line from a file and replace a substring and append the line to the original file in bash?

This is what I want to do.
for example my file contains many lines say :
ABC,2,4
DEF,5,6
GHI,8,9
I want to copy the second line, replace a substring EF (all occurrences) to make it XY, and add this line back to the file so the file looks like this:
ABC,2,4
DEF,5,6
GHI,8,9
DXY,5,6
How can I achieve this in bash?
EDIT: I want to do this in general and not necessarily for the second line. I want to grep EF, and do the substitution in whatever line is returned.
Here's a simple Awk script.
awk -F, -v pat="EF" -v rep="XY" 'BEGIN { OFS=FS }
$1 ~ pat { x = $1; sub(pat, rep, x); y = $0; sub($1, x, y); a[++n] = y }
1
END { for(i=1; i<=n; i++) print a[i] }' file
The -F , says to use comma as the input field separator (internal variable FS) and in the BEGIN block, we also set that as the output field separator (OFS).
If the first field matches the pattern, we copy the first field into x, substitute pat with rep in x, then in y (a copy of the whole line) substitute the first field with the new result, and append y to the array a; $0 itself is left untouched, so the original line still prints.
1 is a shorthand to say "print the current input line".
Finally, in the END block, we output the values we have collected into a.
This could be somewhat simplified by hardcoding the pattern and the replacement, but I figured it's more useful to make it modular so that you can plug in whatever values you need.
While this all could be done in native Bash, it tends to get a bit tortured; spending the 30 minutes or so that it takes to get a basic understanding of Awk will be well worth your time. Perhaps tangentially see also while read loop extremely slow compared to cat, why? which explains part of the rationale for preferring to use an external tool like Awk over a pure Bash solution.
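To illustrate that modularity, here is the same script with a different (hypothetical) pattern and replacement plugged in, run against the sample file:
$ awk -F, -v pat="HI" -v rep="JK" 'BEGIN { OFS=FS }
$1 ~ pat { x = $1; sub(pat, rep, x); y = $0; sub($1, x, y); a[++n] = y }
1
END { for(i=1; i<=n; i++) print a[i] }' file
ABC,2,4
DEF,5,6
GHI,8,9
GJK,8,9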
You can use the sed command:
sed '
/EF/H # copy all matching lines
${ # on the last line
p # print it
g # paste the copied lines
s/EF/XY/g # replace all occurrences
s/^\n// # get rid of the extra newline
}'
As a one-liner:
sed '/EF/H;${p;g;s/EF/XY/g;s/^\n//}' file.csv
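With the sample data from the question as file.csv, this produces:
$ sed '/EF/H;${p;g;s/EF/XY/g;s/^\n//}' file.csv
ABC,2,4
DEF,5,6
GHI,8,9
DXY,5,6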
If ed is available/acceptable, something like:
#!/bin/sh
ed -s file.txt <<-'EOF'
$kx
g/^.*EF.*,.*/t'x
'x+;$s/EF/XY/
,p
Q
EOF
Or as a one-liner:
printf '%s\n' '$kx' "g/^.*EF.*,.*/t'x" "'x+;\$s/EF/XY/" ,p Q | ed -s file.txt
Change Q to w if in-place editing is needed.
Remove the ,p to silence the output.
Using BASH:
#!/bin/bash
src="${1:-f.dat}"
rep="${2:-XY}"
declare -a new_lines
while read -r line ; do
if [[ "$line" == *EF* ]] ; then
new_lines+=("${line/EF/${rep}}")
fi
done <"$src"
printf "%s\n" "${new_lines[#]}" >> "$src"
Contents of f.dat before:
ABC,2,4
DEF,5,6
GHI,8,9
Contents of f.dat after:
ABC,2,4
DEF,5,6
GHI,8,9
DXY,5,6
Following on from the great answer by @tripleee, you can create a variation that uses a single call to sub() by outputting all records before the substitution is made, then adding the updated record to the array that is output by the END rule, e.g.
awk -F, '1; /EF/ {sub(/EF/,"XY"); a[++n]=$0} END {for(i=1;i<=n;i++) print a[i]}' file
Example Use/Output
Here is an expanded input, based on your answer to my comment below the question that all occurrences of EF should be replaced with XY in all records, e.g.
$ cat file
ABC,2,4
DEF,5,6
GHI,8,9
EFZ,3,7
Use and output:
$ awk -F, '1; /EF/ {sub(/EF/,"XY"); a[++n]=$0} END {for(i=1;i<=n;i++) print a[i]}' file
ABC,2,4
DEF,5,6
GHI,8,9
EFZ,3,7
DXY,5,6
XYZ,3,7
Let me know if you have questions.

Replace every 4th occurrence of char "_" with "#" in multiple files

I am trying to replace every 4th occurrence of "_" with "#" in multiple files with bash.
E.g.
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo..
would become
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo...
perl -pe 's{_}{++$n % 4 ? $& : "#"}ge' *.txt
I have tried this perl command, but the problem is that the count carries over from the previous file instead of restarting at 0 for each new file, so in some files even the first _ gets replaced.
I have tried:
awk '{for(i=1; i<=NF; i++) if($i=="_") if(++count%4==0) $i="#"}1' *.txt
but this also does not work.
Using sed, I cannot find a way to keep replacing every 4th occurrence, as there is a different number of _ in each file. Some files have 20 _, some have 200 _. Therefore, I can't specify a range.
I am really lost what to do, can anybody help?
You just need to reset the counter in the perl one using eof to tell when it's done reading each file:
perl -pe 's{_}{++$n % 4 ? "_" : "#"}ge; $n = 0 if eof' *.txt
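A quick check, assuming a.txt holds the sample line from the question and is passed twice; note that the count restarts in the second copy:
$ perl -pe 's{_}{++$n % 4 ? "_" : "#"}ge; $n = 0 if eof' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo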
This MAY be what you want, using GNU awk for RT:
$ awk -v RS='_' '{ORS=(FNR%4 ? RT : "#")} 1' file
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo..
It only reads each _-separated string into memory one at a time, so it should work no matter how large your input file is, assuming there are _s in it.
It assumes you want to replace every 4th _ across the whole file as opposed to within individual lines.
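Since the question involves multiple files, this combines naturally with GNU awk's inplace editing (a sketch, assuming gawk 4.1 or later; FNR resets per file, so the count restarts in each file):
awk -i inplace -v RS='_' '{ORS=(FNR%4 ? RT : "#")} 1' *.txt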
A simple sed would handle this:
s='foo_foo_foo_foo_foo_foo_foo_foo_foo_foo'
sed -E 's/(([^_]+_){3}[^_]+)_/\1#/g' <<< "$s"
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
Explanation:
(: Start capture group #1
([^_]+_){3}: Match 1+ of non-_ characters followed by a _. Repeat this group 3 times to match 3 such words separated by _
[^_]+: Match 1+ of non-_ characters
): End capture group #1
_: Match a _
Replacement is \1# to replace 4th _ with a #
With GNU sed:
sed -nsE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
-n suppresses the automatic printing, -s processes each file separately, -E uses extended regular expressions.
The script is a loop between label a (:a) and the branch-to-label-a command (ba). Each iteration appends the next line of input to the pattern space (N). This way, after the last line has been read, the pattern space contains the whole file(*). During the last iteration, when the last line has been read ($), a substitute command (s) replaces every 4th _ in the pattern space by a # (s/(([^_]*_){3}[^_]*)_/\1#/g) and prints (p) the result.
Once you are satisfied with the result, you can change the options:
sed -i -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, or:
sed -i.bkp -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, but keep a *.txt.bkp backup of each file.
(*) Note that if you have very large files this could cause memory overflows.
With your shown samples, please try the following awk program. It creates an awk variable named fieldNum, assigned 4 since a # is needed after every 4th _; adjust it to your needs.
awk -v fieldNum="4" '
BEGIN{ FS=OFS="_" }
{
    val=""
    for(i=1;i<=NF;i++){
        # append "#" after every fieldNum-th field, but never at the end of the line
        val=val $i (i%fieldNum==0 && i<NF ? "#" : (i<NF ? OFS : ""))
    }
    print val
}
' Input_file
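A quick check with the sample line saved as Input_file (same program, collapsed onto one line):
$ awk -v fieldNum="4" 'BEGIN{FS=OFS="_"} {val=""; for(i=1;i<=NF;i++) val=val $i (i%fieldNum==0 && i<NF ? "#" : (i<NF ? OFS : "")); print val}' Input_file
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo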
With GNU awk
$ cat ip.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
123_45678_90
_
$ awk -v RS='(_[^_]+){3}_' -v ORS= '{sub(/_$/, "#", RT); print $0 RT}' ip.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
123_45678_90
#
-v RS='(_[^_]+){3}_' set input record separator to cover sequence of four _ (text matched by this separator will be available via RT)
-v ORS= empty output record separator
sub(/_$/, "#", RT) change last _ to #
Use -i inplace for inplace editing.
If the count should reset for each line:
perl -pe's/(?:_[^_]*){3}\K_/\#/g'
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
If the count shouldn't reset for each line, but should reset for each file:
perl -0777pe's/(?:_[^_]*){3}\K_/\#/g'
The -0777 causes the whole file to be treated as one line, which makes the count work properly across lines.
But since a fresh match is performed for each file, the count is reset between files.
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -0777pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
To avoid reading the entire file at once, you could continue using the line-by-line approach, but with the following added:
$n = 0 if eof;
Note that eof is not the same thing as eof()! See perldoc -f eof.

How to ignore case when using awk or sed [duplicate]

sed -i '/first/i This line to be added'
In this case, how do I ignore case while searching for the pattern first?
You can use the following:
sed 's/[Ff][Ii][Rr][Ss][Tt]/last/g' file
Otherwise, you can use the I flag:
sed 's/first/last/Ig' file
From man sed:
I
i
The I modifier to regular-expression matching is a GNU extension which
makes sed match regexp in a case-insensitive manner.
Test
$ cat file
first
FiRst
FIRST
fir3st
$ sed 's/[Ff][Ii][Rr][Ss][Tt]/last/g' file
last
last
last
fir3st
$ sed 's/first/last/Ig' file
last
last
last
fir3st
GNU sed
sed '/first/Ii This line to be added' file
You can try
sed 's/first/somethingelse/gI'
If you want to save some typing, try awk. I don't think sed has that option
awk -v IGNORECASE="1" '/first/{your logic}' file
For versions of awk that don't understand the IGNORECASE special variable, you can use something like this:
awk 'toupper($0) ~ /PATTERN/ { print "string to insert" } 1' file
Convert each line to uppercase before testing whether it matches the pattern and if it does, print the string. 1 is the shortest true condition, so awk does the default thing: { print }.
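For instance, with the test file shown in the earlier answer, and the insertion text from the question:
$ awk 'toupper($0) ~ /FIRST/ { print "This line to be added" } 1' file
This line to be added
first
This line to be added
FiRst
This line to be added
FIRST
fir3st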
To use a variable, you could go with this:
awk -v var="$foo" 'BEGIN { pattern = toupper(var) } toupper($0) ~ pattern { print "string to insert" } 1' file
This passes the shell variable $foo and transforms it to uppercase before the file is processed.
Slightly shorter with bash would be to use -v pattern="${foo^^}" and skip the BEGIN block.
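A minimal sketch of that shorter variant, assuming the shell variable foo holds the search term:
foo=first
awk -v pattern="${foo^^}" 'toupper($0) ~ pattern { print "string to insert" } 1' file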
Use the following; \b matches a word boundary:
sed 's/\bfirst\b/This line to be added/Ig' file

Grep to exclude comment like # and -- with trailing spaces and within line

I am trying to grep for a word inside files that use # and -- as comments. The command that I used is
grep "^[^#]" -H -R -I "pathtofile" | grep "^[^--]" | grep -in ${1} | awk -F : ' { print $2 } ' | uniq
which should print the names of the files containing a specific word. However, if there is a line like this
--test_specific_word_test test
The command above does not skip that line. The same applies when the comment is inline with the code, like var=1 --comment.
Should I use sed to delete the comment lines first, or can this be done with grep alone?
The catch is that I have a significant number of files to search, my GNU grep is version 2.0, and I can't upgrade it because I don't have permission.
The command you've provided uses grep three times. You can skip commented lines with a single grep command:
grep -v "^ *\(--\|#\)" "pathtofile"
To print the filenames containing word1 use cut like so:
grep -Hv "^ *\(--\|#\)" filenames | grep "word1" | cut -d: -f1
To skip inline comments use sed:
sed "s/\(.*\)\(--\|#\).*/\1/g" inputfile
Sample input:
word1
word2
-word3 # inline comment
#comment1
--comment2
#comment3
output:
word1
word2
-word3
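To tie the two steps together, stripping comments first and only then searching, here is a sketch that prints the names of files whose non-comment text contains word1 (the word and path are placeholders; the \| alternation requires GNU sed):
for f in pathtofile/*; do
    sed 's/\(--\|#\).*//' "$f" | grep -q "word1" && printf '%s\n' "$f"
done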
If in fact you are attempting to parse a programming language's source files, you are probably better off using a proper parser. Here is an attempt at refactoring your code into an Awk script, with several guesses as to what exactly the script should actually do.
find pathtofile -type f -exec awk -v word="$1" -F : '
# this doesn't reimplement grep -I though
{ sub("(#|--).*", "") } # remove comments
tolower($0) ~ tolower(word) && !($2 in a) { print FILENAME ":" FNR ":" $2; a[$2] }' {} +
This has the obvious flaw that if the programming language allows for # or -- in quoted strings and doesn't regard those as comments, the script will do the wrong thing.
There are no word boundaries in your script, so I didn't put any in mine either. This means if word="dog" then it will print any string which contains the three adjacent letters d-o-g in this order, even in substring matches like "doggone" or "endogenous". If that's not what you want, you can add word boundary markers: with GNU Awk, you can say BEGIN { word = "\\<" word "\\>" } at the beginning of the script.
The technique to add the key to an array and only print the key if it wasn't already in the array is a common way to implement uniq. This will fail if find returns so many files that it will end up running more than one instance of awk -- this will be controlled by the value of ARG_MAX of your kernel.

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" occurs on them between { and }.
The following script successfully does this task:
while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading sed commands from a file, but I could not find documentation explaining how to use it.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It converts file.txt directly to have the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
    b[j++] = $0
}
END {
    printf "" > FILENAME
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME
}
' words.txt file.txt
If the files are expected to get so large that awk may not be able to handle them, we can only redirect to stdout rather than modify the file directly:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
1
' words.txt file.txt
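Since this variant writes to stdout, the usual way to update the original file (a sketch) is to redirect to a temporary file and move it into place:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next}1' words.txt file.txt > file.tmp && mv file.tmp file.txt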
You can use grep to match against a file of patterns like this:
grep -vf words.txt file.txt
I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the words.txt file as a list of matching patterns.
The -v option inverts the matching, i.e. keeps the lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily addressed, for example by adding the braces to the pattern file (or to a temporary file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just modifies the words.txt file on the fly and uses it as a word file for grep.
In pure native bash (4.x):
#!/usr/bin/env bash
# ^-- MUST be run by bash 4.x, NOT /bin/sh
readarray -t words <words.txt             # read words into array
IFS='|'                                   # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]"        # form a regex matching all words
while read -r; do                         # for each line in file...
    if ! [[ $REPLY =~ $words_re ]]; then  # ...check whether it matches...
        printf '%s\n' "$REPLY"            # ...and print it if not.
    fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to use stdin
Update
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
    p=1
    for (i in a) {
        if ($0 ~ "{.*" i ".*}") { p=0; break }
    }
}p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt. A flag p controls whether the line is printed: if the line matches one of the patterns, p is set to false. When the trailing p evaluates to true, the default action occurs, which is to print the line.
