How to append to lines in a file that do not contain a specific pattern using shell script

I have a flat file as follows:
11|aaa
11|bbb|NO|xxx
11|ccc
11|ddd|NO|yyy
For lines that do not contain |NO|, I would like to add the string |YES| at the end. So my file should look like:
11|aaa|YES|
11|bbb|NO|xxx
11|ccc|YES|
11|ddd|NO|yyy
I am using AIX, where the sed -i option for in-place replacements is not available. Hence, I'm currently using the following code to do this:
#Get the lines that do not contain |NO|
LINES=`grep -v "|NO|" file`
for i in $LINES
do
sed "/$i/{s/$/|YES|/;}" file > temp
mv temp file
done
The above works; however, as my file contains over 40,000 lines, it takes about 3 hours to run. I believe it is taking so much time because it has to search the whole file for each line and rewrite it to a temp file. Is there a faster way to achieve this?

This will be quick:
sed '/|NO|/!s/$/|YES|/' filename
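Since AIX sed has no -i, the usual workaround is to redirect to a temp file and move it back over the original (a sketch, using the question's file names):
sed '/|NO|/!s/$/|YES|/' file > temp && mv temp file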

If temp.txt is your file, try:
awk '$0 !~ /NO/ {print $0 "|YES|"} $0 ~ /NO/ {print}' temp.txt
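A more compact equivalent, appending only when the line does not match (a sketch; [|] matches a literal pipe):
awk '!/[|]NO[|]/{$0=$0 "|YES|"}1' temp.txt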

Simple with awk. Put the code below into a script and run it with awk -f script file > temp
/\|NO\|/ { print; next; } # just print anything which contains |NO| and read next line
{ print $0 "|YES|"; } # For any other line (no pattern), print the line + |YES|
I'm not sure about awk regexps; if it doesn't work, try to remove the two \ in the first pattern.
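One portable way to sidestep that uncertainty is a bracket expression, which always matches a literal | in awk:
/[|]NO[|]/ { print; next; } # same as above, without backslash escapes
{ print $0 "|YES|"; }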

Related

awk/bash append headers in many csv files

I would like to transform the header of many csv files automatically using awk and bash scripts.
Currently, I am using the following code block, which is working fine:
for FILE in *.csv
do
awk 'FNR>1{print $0}' "$FILE" | awk 'NR == 1{print "aaa,bbb,ccc,ddd,eee,fff,ggg,hhh,iii,jjj,kkk,lll,mmm,nnn,...,zzz"}1' > "OUT_$FILE"
done
What these commands do is: first remove the old header from $FILE, then prepend a new comma-separated (very long) header aaa,bbb,ccc,ddd,eee,fff,ggg,hhh,iii,jjj,kkk,lll,mmm,nnn,...,zzz, and save the output to OUT_$FILE.
Currently, I am copying the part aaa,bbb,ccc,ddd,eee,fff,ggg,hhh,iii,jjj,kkk,lll,mmm,nnn,...,zzz manually from another csv file and pasting it into this field to replace the header from $FILE. While this works, it is getting tedious, repetitive and time-consuming with many csv files.
Instead of copying the header manually, I am trying to extract the header from another csv file new_headers.csv and save it to a new variable $NEWHEAD.
NEWHEAD=$(awk 'NR==1{print $0}' new_headers.csv)
While I can view the extracted header $NEWHEAD, I am not sure how to merge this command into the previous workflow to prepend the header to $FILE.
I will certainly appreciate any suggestions to resolve this problem. Thank you :)
With GNU awk for "inplace" editing:
awk -i inplace 'NR==1{hdr=$0} {print (FNR>1 ? $0 : hdr)}' new_headers.csv *.csv
Passing new_headers.csv first lets NR==1 capture its header before the other files are processed; new_headers.csv itself is rewritten unchanged.
newheader=$(head -n 1 new_headers.csv)
for file in *.csv
do
{
printf '%s\n' "$newheader"
tail -n +2 "$file"
} > OUT_"$file"
done
Notes:
head -n 1 outputs the first line of a file.
tail -n +2 outputs all the lines but the first.
{ } groups the commands, so that you can redirect their output as a whole.
You can read the header inside the awk script, like this:
awk '
BEGIN{
do {
h = (h) ? (h "\n" line) : line
} while ((getline line <"new_headers.csv") > 0)
}
...
'
and h contains the new header (the whole of new_headers.csv, if it has several lines).
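If only the first line of new_headers.csv is needed, here is a minimal complete sketch of the same idea (input.csv stands for any one of your csv files):
awk '
BEGIN { getline hdr < "new_headers.csv" } # read just the first line as the new header
FNR == 1 { print hdr; next }              # swap it in for the old header
{ print }
' input.csv > OUT_input.csv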
$ awk 'NR==FNR {header=$0; next}
{print (FNR==1?header:$0) > (FILENAME".updated")}' new_headers.csv other files...
This captures the first record of the header file and replaces the first line of each of the remaining files; updated files will have the suffix ".updated".
Caveat emptor: not tested.

Prepend text to specific line numbers with variables

I have spent hours trying to solve this. There are a bunch of answers on how to prepend to all lines, or to specific lines, but not with variable text at a variable line number.
while [ $FirstVariable -lt $NextVariable ]; do
#sed -i "$FirstVariables/.*/$FirstVariableText/" "$PWD/Inprocess/$InprocessFile"
cat "$PWD/Inprocess/$InprocessFile" | awk 'NR==${FirstVariable}{print "$FirstVariableText"}1' > "$PWD/Inprocess/Temp$InprocessFile"
FirstVariable=$[$FirstVariable+1]
done
Essentially I am looking for a particular string delimiter, figuring out where the next one is, and prepending the first result onto the following lines... Note that I have already figured out the logic; I am just having issues prepending the line with the variables.
Example:
This >
Line1:
1
2
3
Line2:
1
2
3
Would turn into >
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
You can do all of that using the awk one-liner below.
Assuming your pattern starts with Line, the following script can be used:
$ awk '{if ($1 ~ /Line/ ){var=$1; print $0} else {if ($1 != "") print var $1}}' "$PWD/Inprocess/$InprocessFile"
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
Here is how the above script works:
If the first field of a record contains the word Line, it is copied into the awk variable var and the record is printed as-is. For each following record, if it is not empty, var is prepended to it before printing, producing the desired result.
If you need to pass variables dynamically from the shell to awk, you can use the -v option, like below:
awk -v var1="$FirstVariable" -v var2="$FirstVariableText" 'NR==var1{print var2}1' "$PWD/Inprocess/$InprocessFile" > "$PWD/Inprocess/Temp$InprocessFile"
The way you addressed the problem, you parse everything with both bash and awk: you use bash to extract a line, and then awk to manipulate that one line, rereading the file on every iteration. The whole thing can actually be done with a single awk script:
awk '/^Line/{str=$1; print; next}{print (NF ? str $0 : "")}' inputfile > outputfile
or
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}{gsub(FS,OFS $1)}1' inputfile > outputfile
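The second form uses awk's paragraph mode and assumes the blocks are separated by blank lines; here is the same one-liner expanded with comments:
awk '
BEGIN { RS=""; ORS="\n\n"; FS=OFS="\n" } # records are blank-line-separated blocks, fields are lines
{ gsub(FS, OFS $1) }                     # insert the block label ($1) after every newline in the block
1                                        # print the modified block
' inputfile > outputfile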

Use sed/awk to replace text in multiple lines at once

I have a very large (~60MB) text file in which I want to replace specific blocks of lines with a predefined text. The starting line number of every block (3 lines) is known, and the numbers are stored in a file:
...
11
30
42
58
...
I know that I can use the following (where X holds the block's starting line number) in order to replace a block:
sed -i "${X},$((X+3))s/.*/REPLACEMENT/" filename.txt
However, executing this command in a for loop like:
for line in $(cat linenumbers.txt); do
eline=$((line+3))
sed -i "${line},${eline}s/.*/REPLACEMENT/" filename.txt
done
is very slow and takes a lot of time (> 10 minutes), and I have hundreds of files in which I have to replace blocks.
Is there any other way to instruct sed to do that in one pass?
awk to the rescue!
$ awk 'NR==FNR {start[$1]; next}
FNR in start {c=3}
c&&c-- {print "replacement"; next}1' indices file
This is a one-pass process; you can save the output into a new file and overwrite the original one if you want.
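For example (a sketch, using the question's file names):
awk 'NR==FNR {start[$1]; next}
FNR in start {c=3}
c&&c-- {print "replacement"; next}1' linenumbers.txt filename.txt > tmp && mv tmp filename.txt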
Similar to @karakfa's answer but a different interpretation of your requirements (hint: an actual example with input and output would have cleared up the confusion):
awk '
NR==FNR { start[$1]; next }
FNR in start { print "replacement"; c=3 }
c&&c-- { next }
{ print }
' indices file
This might work for you (GNU sed):
sed 's/.*/&,+3cREPLACEMENT/' lineNumbersFile | sed -f - file
Convert the line numbers file into a sed script and run it against the data file.
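Given the sample line numbers above, the first sed turns the numbers file into a sed script like:
11,+3cREPLACEMENT
30,+3cREPLACEMENT
42,+3cREPLACEMENT
58,+3cREPLACEMENT
which the second sed reads from stdin (-f -) and applies to the data file. Note that the addr,+N address form and reading the script from - are GNU sed features.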

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" is found on both of them between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading commands from a file, but I could not find a manual explaining how to use it.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It rewrites file.txt in place to produce the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {                       # first file: words.txt
    a["{[^{}]*" $0 "[^{}]*}"]++   # pattern: the word between { and }
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next                  # drop lines matching any pattern
    b[j++] = $0                   # buffer the lines to keep
}
END {
    printf "" > FILENAME          # truncate file.txt
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME     # write the kept lines back
}
' words.txt file.txt
If the files are expected to get so large that awk may not be able to buffer all the kept lines, we can only redirect to stdout; we cannot modify the file directly:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
}
1
' words.txt file.txt
You can use grep to match one file against patterns from another like this:
grep -vf words.txt file.txt
I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the words.txt file as matching patterns.
The -v option inverts the matching, i.e. keeps the lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily addressed, for example by adding the brackets to the pattern file (or to a temporary pattern file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just transforms the words from words.txt on the fly and feeds the result to grep as a pattern file on stdin.
In pure native bash (4.x):
#!/usr/bin/env bash
# ^-- MUST be run with bash (4.x, for readarray), NOT /bin/sh
readarray -t words <words.txt # read words into array
IFS='|' # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]" # form a regex matching all words
while read -r; do # for each line in file...
if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
printf '%s\n' "$REPLY" # ...and print it if not.
fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to read the patterns from stdin.
Update:
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
p=1
for (i in a) {
if ($0 ~ "{.*" i ".*}") { p=0; break }
}
}p' words.txt file.txt
The first block builds an array containing each line of words.txt. The second block runs for every line of file.txt; the flag p controls whether the line is printed. If the line matches one of the patterns, p is set to false (0). When the trailing p pattern evaluates to true, the default action occurs, which is to print the line.

Bash - extract file name and extension from a string

Here is grep command:
grep "%SWFPATH%/plugins/" filename
And its output:
set(hotspot[hs_bg_%2].url,%SWFPATH%/plugins/textfield.swf);
set(hotspot[hs_%2].url,%SWFPATH%/plugins/textfield.swf);
url="%SWFPATH%/plugins/textfield.swf"
url="%SWFPATH%/plugins/scrollarea.swf"
alturl="%SWFPATH%/plugins/scrollarea.js"
url="%SWFPATH%/plugins/textfield.swf"
I'd like to generate a file containing the names of all the files in the plugins/ directory that are mentioned in a certain file.
Basically, I need to extract the file name and the extension from every line.
I can manage to delete any duplicates, but I can't figure out how to extract the information that I need.
This would be the content of the file that I would like to get:
textfield.swf
scrollarea.swf
scrollarea.js
Thanks!!!
PS: The thread "Extract filename and extension in bash (14 answers)" explains how to get the filename and extension from a variable. What I'm trying to achieve is extracting these from a file, which is completely different.
Using GNU awk (the three-argument form of match() is a gawk extension):
grep "%SWFPATH%/plugins/" filename | \
awk '{ match($0, /plugins\/([^\/[:space:]]+)\.([[:alnum:]]+)/,submatch);
print "filename:"submatch[1];
print "extension:"submatch[2];
}'
Some explanation:
The match function takes every line processed by awk (indicated by $0) and looks for matches of the regex. Submatches (the parts of the string that match the parts of the regex between parentheses) are saved in the array submatch. print is as straightforward as it looks: it just prints stuff.
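For the first line of the sample grep output, for instance, this prints:
filename:textfield
extension:swf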
For this specific problem
awk '/\/plugins\// {sub(/.*\//, ""); sub(/(\);|")?$/, "");
arr[$0] = $0} END {for (i in arr) print arr[i]}' filename
Use awk to extract the last path component and then sed to clean up the trailing ); and " characters:
awk -F/ '{print $NF}' filename | sed -e 's/);$//' -e 's/"$//'
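Combined with the original grep, and with sort -u added to drop duplicates (a sketch):
grep "%SWFPATH%/plugins/" filename | awk -F/ '{print $NF}' | sed -e 's/);$//' -e 's/"$//' | sort -u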
