Sort alphabetically lines between 2 patterns in Bash

I'd like to alphabetically sort lines between 2 patterns in a Bash shell script.
Given the following input file:
aaa
bbb
PATTERN1
foo
bar
baz
qux
PATTERN2
ccc
ddd
I expect as output:
aaa
bbb
PATTERN1
bar
baz
foo
qux
PATTERN2
ccc
ddd
Preferred tool is an AWK "one-liner". Sed and other solutions also accepted. It would be nice if an explanation is included.

This is a perfect case to use asort() to sort an array in GNU awk:
gawk '/PATTERN1/ {f=1; delete a}
      /PATTERN2/ {f=0; n=asort(a); for (i=1;i<=n;i++) print a[i]}
      !f
      f{a[$0]=$0}' file
This uses similar logic to How to select lines between two marker patterns which may occur multiple times with awk/sed, with the addition that it:
Prints lines outside this range
Stores lines within this range
And when the range is over, sorts and prints them.
Detailed explanation:
/PATTERN1/ {f=1; delete a} when finding a line matching PATTERN1, sets the flag on and clears the array of stored lines.
/PATTERN2/ {f=0; n=asort(a); for (i=1;i<=n;i++) print a[i]} when finding a line matching PATTERN2, sets the flag off. Also, sorts the array a[] containing all the lines in the range and prints them.
!f if the flag is off (that is, outside the range), evaluates as True so that the line is printed.
f{a[$0]=$0} if the flag is on, stores the line in the array a[] so that it can be used later on.
Test
▶ gawk '/PATTERN1/ {f=1; delete a} /PATTERN2/ {f=0; n=asort(a); for (i=1;i<=n;i++) print a[i]} !f; f{a[$0]=$0}' FILE
aaa
bbb
PATTERN1
bar
baz
foo
qux
PATTERN2
ccc
ddd
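Note that asort() is specific to GNU awk. A minimal portable sketch of the same idea (my assumption, not part of the answer above) buffers the in-range block in an external sort process instead; fflush() is needed so the surrounding output stays in order, and is available in all common awks:
awk '/PATTERN1/ {print; fflush(); f=1; next}      # print marker, flush pending output
     /PATTERN2/ {f=0; close("sort"); print; next} # close() drains sort to stdout first
     f {print | "sort"; next}                     # in-range lines are fed to sort
     {print}' file                                # out-of-range lines pass through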

You can use sed with head and tail:
{
  sed '1,/^PATTERN1$/!d' FILE
  sed '/^PATTERN1$/,/^PATTERN2$/!d' FILE | head -n-1 | tail -n+2 | sort
  sed '/^PATTERN2$/,$!d' FILE
} > output
The first sed prints everything from the first line through PATTERN1.
The second sed extracts the lines from PATTERN1 through PATTERN2; head and tail then drop the two marker lines, and sort sorts what remains.
The third sed prints everything from PATTERN2 to the end of the file.
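Note that head -n-1 is a GNU extension (BSD/macOS head rejects negative counts). A hedged portable variant of the middle command strips the two marker lines with sed instead:
{
  sed '1,/^PATTERN1$/!d' FILE
  # sed '1d;$d' drops the PATTERN1 and PATTERN2 marker lines (portable head/tail replacement)
  sed '/^PATTERN1$/,/^PATTERN2$/!d' FILE | sed '1d;$d' | sort
  sed '/^PATTERN2$/,$!d' FILE
} > output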

More complicated, but may ease the memory load of storing lots of lines (your cfg file would have to be pretty huge for this to matter, but nevertheless...). Using GNU awk and a sort coprocess:
gawk -v p=1 '
  /^PATTERN2/ {              # when we see the 2nd marker:
    # close the "write" end of the pipe to sort. Then sort will know it
    # has all the data and it can begin sorting
    close("sort", "to")
    # then sort will print out the sorted results, so read and print that
    while (("sort" |& getline line) > 0) print line
    # and turn the boolean back to true
    p=1
  }
  p  {print}                 # if p is true, print the line
  !p {print |& "sort"}       # if p is false, send the line to `sort`
  /^PATTERN1/ {p=0}          # when we see the first marker, turn off printing
' FILE

It's a little unconventional but using Vim:
vim -c 'exe "normal /PATTERN1\<cr>jV/PATTERN2\<cr>k: ! sort\<cr>" | wq!' FILE
Where \<cr> is a carriage return, entered as CTRL-v then CTRL-M.
Further explanation:
Using vim normal mode,
/PATTERN1\<cr> - search for the first pattern
j - go to the next line
V - enter visual mode
/PATTERN2\<cr> - search for the second pattern
k - go back one line
: ! sort\<cr> - sort the visual text you just selected
wq! - save and exit

Obviously this is inferior to the GNU AWK solution, but all the same, this is a GNU sed solution:
sed '
  /PATTERN1/,/PATTERN2/ {
    /PATTERN1/b            # branch/break if /PATTERN1/. This line is printed
    /PATTERN2/ {           # if /PATTERN2/,
      x                    # swap hold and pattern spaces
      s/^\n//              # delete the leading newline. The first H puts it there
      s/.*/sort <<< "&"/e  # sort the pattern space by calling Unix sort
      p                    # print the sorted pattern space
      x                    # swap hold and pattern space again to retrieve PATTERN2
      p                    # print it also
    }
    H                      # Append the pattern space to the hold space
    d                      # delete this line for now - it will be printed in the block above
  }
' FILE
Note that I rely on the e command, a GNU extension.
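For readers unfamiliar with it: the e flag makes GNU sed execute the substituted pattern space as a shell command and replace it with the command's output. A tiny standalone illustration (my example, not from the answer above):
# GNU sed only: the pattern space becomes a command, which is run and replaced by its output
echo 'echo hello | tr a-z A-Z' | sed 's/.*/&/e'
# prints HELLO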
Testing:
▶ gsed '
/PATTERN1/,/PATTERN2/ {
/PATTERN1/b
/PATTERN2/ {
x
s/^\n//; s/.*/sort <<< "&"/ep
x
p
}
H
d
}
' FILE
aaa
bbb
PATTERN1
bar
baz
foo
qux
PATTERN2
ccc
ddd

Here is a small, easy-to-understand shell script for sorting lines between two patterns:
#!/bin/sh
in_file=$1
out_file=$2
temp_file_for_sort="$out_file.temp.for_sort"
curr_state=0
in_between_count=0
rm -f "$out_file"
while IFS='' read -r line; do
    if [ "$curr_state" -eq 0 ]; then
        # write this line to output
        echo "$line" >> "$out_file"
        is_start_line=$(echo "$line" | grep "^PATTERN_START$")
        if [ -z "$is_start_line" ]; then
            continue
        else
            rm -f "$temp_file_for_sort"
            in_between_count=0
            curr_state=1
        fi
    else
        is_end_line=$(echo "$line" | grep "^PATTERN_END$")
        if [ -z "$is_end_line" ]; then
            # line inside block - to be sorted
            echo "$line" >> "$temp_file_for_sort"
            in_between_count=$(( in_between_count + 1 ))
        else
            # end of block
            curr_state=0
            if [ "$in_between_count" -ne 0 ]; then
                sort -o "$temp_file_for_sort" "$temp_file_for_sort"
                cat "$temp_file_for_sort" >> "$out_file"
                rm -f "$temp_file_for_sort"
            fi
            echo "$line" >> "$out_file"
        fi
    fi
done < "$in_file"
# if something remains (missing end marker)
if [ -f "$temp_file_for_sort" ]; then
    cat "$temp_file_for_sort" >> "$out_file"
fi
rm -f "$temp_file_for_sort"
Usage: <script_path> <input_file> <output_file>.
Pattern is hardcoded in file, can be changed as required (or taken as argument). Also, it creates a temporary file to sort intermediate data (<output_file>.temp.for_sort)
Algorithm:
Start with state = 0 and read the file line by line.
In state 0, the line is written to the output file, and if PATTERN_START is encountered, the state is set to 1.
In state 1, if the line is not PATTERN_END, the line is written to the temporary file.
In state 1, if the line is PATTERN_END, the temporary file is sorted and its contents appended to the output file (the temporary file is then removed), after which PATTERN_END is written to the output file and the state changes back to 0.
Finally, if something is left in the temporary file (the case where PATTERN_END is missing), its contents are written to the output file.
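For example, saved as sort_between.sh (a hypothetical name), the script would be invoked like this:
# hypothetical file names; the PATTERN_START/PATTERN_END markers are hardcoded in the script
sh sort_between.sh input.cfg output.cfg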

Along the lines of the solution proposed by @choroba, using GNU sed (depends on the Q command):
{
  sed -n '1,/PATTERN1/p' FILE
  sed '1,/PATTERN1/d; /PATTERN2/Q' FILE | sort
  sed -n '/PATTERN2/,$p' FILE
}
Explanation:
The p command prints the lines in the ranges 1 to /PATTERN1/ and /PATTERN2/ to $ ($ is the end of file), inclusive, in '1,/PATTERN1/p' and '/PATTERN2/,$p' respectively.
The -n option disables the default behaviour of printing all lines; it is useful in conjunction with p.
In the middle line, the d command deletes lines 1 through /PATTERN1/, and Q (quit without printing, GNU sed only) exits on the first line matching /PATTERN2/. The lines that survive are the ones to be sorted, and are thus fed into sort.
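Since Q is the only GNU-specific part here, a portable sketch of the middle command (my variant, not from the answer) deletes both boundary ranges instead:
# POSIX alternative to Q: drop lines 1../PATTERN1/ and /PATTERN2/..$, sort what is left
sed '1,/PATTERN1/d; /PATTERN2/,$d' FILE | sort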

This can also be done with non-GNU awk and the system sort command, making it work on both macOS and Linux.
awk -v SP='PATTERN1' -v EP='PATTERN2' -v cmd=sort '{
  if (match($0, SP) > 0) { flag=1 }
  else if (match($0, EP) > 0) {
    for (j=0; j<length(a); j++) { print a[j] | cmd }
    close(cmd); delete a; i=0; flag=0
  }
  else if (flag==1) { a[i++]=$0; next }
  print $0
}' FILE
Output:
aaa
bbb
PATTERN1
bar
baz
foo
qux
PATTERN2
ccc
ddd
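One small caveat worth hedging on: length(a) on an array argument is an extension that most but not all awks accept. Since i already counts the stored lines, an equivalent variant (my assumption, behavior otherwise identical) can bound the loop with it instead:
# variant of the above, using the counter i rather than length(a) for the loop bound
awk -v SP='PATTERN1' -v EP='PATTERN2' -v cmd=sort '{
  if (match($0, SP) > 0) { flag=1 }
  else if (match($0, EP) > 0) {
    for (j=0; j<i; j++) { print a[j] | cmd }
    close(cmd); delete a; i=0; flag=0
  }
  else if (flag==1) { a[i++]=$0; next }
  print $0
}' FILE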

Related

Bash: Separating a file by blank lines and assigning to a list

So i have a file for example
a
b
c
d
I'd like to make a list of the lines with data out of this. The empty line would be the separator. So the above file's list would be
First element = a
Second element = b
c
Third element = d
Replace blank lines with a comma, then remove the newline characters:
cat <file> | sed 's/^$/, /' | tr -d '\n'
The following awk would do:
awk 'BEGIN{RS="";ORS=",";FS="\n";OFS=""}($1=$1)' file
This adds an extra , at the end. You can get rid of that in the following way:
awk 'BEGIN{RS="";ORS=",";FS="\n";OFS=""}
{$1=$1;s=s $0 ORS}END{sub(ORS"$","",s); print s}' file
But notice what happened: by making this slight modification to eliminate the last ORS (i.e. the trailing comma), you now have to store the full result in memory. So you could just as well do it the more boring, less elegant way and store the full file in memory:
awk '{s=s $0}END{gsub(/\n\n/,",",s);gsub(/\n/,"",s); print s}' file
The following sed does exactly the same. Store the full file in memory and process it.
sed ':a;N;$!ba;s/\n\n/,/g;s/\n//g' <file>
There is, however, a way to play it a bit more cleverly with awk.
awk 'BEGIN{RS=OFS="";FS="\n"}{$1=$1; print (NR>1?",":"")$0}' file
It depends on what you need to do with that data.
With perl, you have a one-liner:
$ perl -00 -lnE 'say "element $. = $_"' file.txt
element 1 = a
element 2 = b
c
element 3 = d
But clearly you need to process the elements in some way, and I suspect Perl is not your cup of tea.
With bash you could do:
elements=()
n=0
while IFS= read -r line; do
    [[ $line ]] && elements[n]+="$line"$'\n' || ((n++))
done < file.txt
# strip the trailing newline from each element
elements=("${elements[@]/%$'\n'/}")
# and show what's in the array
declare -p elements
declare -a elements='([0]="a" [1]="b
c" [2]="d")'
$ awk -v RS= '{print "Element " NR " = " $0}' file
Element 1 = a
Element 2 = b
c
Element 3 = d
If you really want to say First Element instead of Element 1 then enjoy the exercise :-).
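One hedged way to take up that exercise, assuming a fixed list of ordinal names is enough:
# RS= enables paragraph mode: one blank-line-separated block per record;
# ord[NR] picks the ordinal word (extend the list as needed)
awk -v RS= 'BEGIN{split("First Second Third Fourth Fifth",ord," ")}
            {print ord[NR] " Element = " $0}' file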

Using grep -f to find the patterns themselves that have matches

I'm trying to give grep a pattern file (through -f) , but I want to learn which patterns are matching something in the search file
For example, given 1.txt:
a/(.*)
b/(.*)
b/c/(.*)
b/foo/(.*)
d/(.*)
e/(.*)
and 2.txt:
a/
a/foo/bar/
b/foo/
d/foo/
The patterns from 1.txt that match something in 2.txt (omitting the (.*) suffix) are as follows:
a/
b/
b/foo/
d/
How can I "find the list of patterns that have a match"?
EDIT: I'm only looking for a prefix match but I think the question is interesting enough for general pattern matching.
EDIT: Since a for-loop based solution is given, I should say I'm not looking at calling grep 10000 times. :) The working solution I already have (listed below) is pretty slow:
for line in "${file1_arr[@]}"; do
    if ! grep -qE "^$line(.*)\$" 2.txt; then
        echo "$line"
    fi
done
Ideally I'm looking for a single grep call or so with less overhead.
In awk:
$ awk 'NR==FNR{a[$0]=FNR;next}{for(i in a)if($0 ~ i)print i,$0}' 1.txt 2.txt
a/(.*) a/
a/(.*) a/foo/bar
b/(.*) b/foo
d/(.*) d/foo
Explained:
$ awk '                  # yes
NR==FNR {                # process first file
    a[$0]=FNR            # hash regex, store record number just in case
    next                 # process next record
}
{                        # process second file
    for(i in a)          # loop every entry in 1.txt
        if($0 ~ i)       # if regex matches record
            print i,$0   # print all matching regex and record
}' 1.txt 2.txt
Edit: To output each regex just once (as shown here in the expected output) you could delete the regex from a once it has been used; that way it won't get matched and output more than once:
$ awk '
NR==FNR { a[$0]; next }
{
    for(i in a)
        if($0 ~ i) {
            print i
            delete a[i]   # deleted regex won't get matched again
        }
}' 1.txt 2.txt
vendor/cloud.google.com/go/compute/metadata/(.*)$
vendor/cloud.google.com/go/compute/(.*)$
vendor/cloud.google.com/go/(.*)$
vendor/cloud.google.com/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/dns/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/(.*)$
vendor/github.com/Azure/(.*)$
vendor/github.com/(.*)$
Also, my test showed about a 60 % reduction in runtime (on a mini laptop, from 1:16 down to 29 s) with this modification for GNU awk (using the data you provided in the comments, file1.txt and file2.txt):
$ awk '
BEGIN {
    FS="."                     # . splits the url
}
NR==FNR { a[$1][$0]; next }    # we index on the first part of url
{
    for(i in a[$1])            # search space decreased
        if($0 ~ i) {
            print i
            delete a[$1][i]
        }
}' file1.txt file2.txt
The speedup decreases the search space by using the start of the strings up to the first period as the key for the hash, ie:
FS="." # split at first .
...
a[vendor/github][vendor/github.com/Azure/(.*)$] # example of a hash
...
for(i in a[$1]) # search space decreased
Now it does not have to search the whole hash for a matching regex. More feasible would probably be to use FS="/" ; a[$1 FS $2], but this was just a quick test.
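A hedged sketch of that FS="/" variant (my untested assumption, again requiring GNU awk arrays of arrays, and only sensible for the vendor/... style data above):
# key on the first two path components instead of everything before the first dot
gawk 'BEGIN { FS="/" }
      NR==FNR { a[$1 FS $2][$0]; next }
      {
          for (i in a[$1 FS $2])
              if ($0 ~ i) { print i; delete a[$1 FS $2][i] }
      }' file1.txt file2.txt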
The following script:
#!/usr/bin/env bash
lines=$(wc -l < 1.txt)
for (( i=1; i<=$lines; i++ )); do
    line=$(sed -n "$i"p 1.txt)
    line=$(sed "s/\/(.*)$//" <<< "$line")
    grep -E "$line" 2.txt 1>/dev/null && echo "$line"
done
prints lines in 1.txt that matched in 2.txt:
a
b
b/foo
d
comments:
# gets a single line from 1.txt
line=$(sed -n "$i"p 1.txt)
# removes trailing pattern /(.*) from $line variable
line=$(sed "s/\/(.*)$//" <<< "$line")
# if $line matches in 2.txt, print $line
grep -E "$line" 2.txt 1>/dev/null && echo "$line"
I tried the awk and sed based solutions, and I realized I can do this much faster using bash's builtin regexp engine if I read both files in memory.
Here's basically it.
text="$(cat 2.txt)"                 # read 2.txt
while read -r line; do              # for each 'line' from 1.txt
    re=[^\b]*${line}                # prepend ^ or \b to the pattern
    if [[ "$text" =~ $re ]]; then   # match the pattern to 2.txt
        echo "${line}"              # if there's a match, print the pattern
    fi
done < 1.txt
Since this doesn't spawn any extra processes and just does it in-memory, I suspect this is quite efficient. My benchmarks with the files I linked under James' answer show 8-9 seconds for this.
I don't see a solution with grep, but sed is an alternative to awk.
With sed I would like to see patterns like b/foo/.* in 1.txt, but I will show a solution based on the (.*).
The purpose of the first command is to construct sed commands that replace the input line with the pattern's prefix whenever the line matches the regular expression. The different output lines must look like
sed -rn 's#b/c/(.*)#b/c/#p' 2.txt
and this can be done with
# Use subprocess
sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt
# resulting in
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt) 2.txt| sort -u
The solution is a bit difficult to read, caused by the layout of 1.txt, where I would have preferred lines like b/foo/.*.
The above commands will have 2 bugs:
When the match is on a part of the line, the non-matched part will be shown in the output. This can be fixed by also matching the garbage:
# Use lines like 's#.*b/foo(.*)#b/foo#p'
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\1#p/' 1.txt) 2.txt| sort -u
The second bug is that strings in 2.txt that have two matches will be matched only once (the first match will edit the line in the stream).
This can be fixed by adding a unique marker (I will use \a) to the matching lines and repeating the input lines in the output (with \n&).
The output can be viewed by looking for the \a markers.
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\\a\1\\n\&#p/' 1.txt) 2.txt|
sed -rn '/\a/ s/.(.*)/\1/p' | sort -u
EDIT:
The work-around with a marker and restoring the original input is not needed when you follow a different approach.
In sed you can print something to stdout without changing the stream.
One possibility (slow for this situation) is using
sed '/something/ eecho "something" '
Another possibility is using the x command (which eXchanges the pattern space with the hold buffer). You actually want to have a sed script with commands like
\%a/% {h;s%.*%a/%p;x}
\%b/% {h;s%.*%b/%p;x}
\%b/c/% {h;s%.*%b/c/%p;x}
\%b/foo/% {h;s%.*%b/foo/%p;x}
\%d/% {h;s%.*%d/%p;x}
\%e/% {h;s%.*%e/%p;x}
Using above method the sed solution simplifies into
sed -nf <(
sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt
) 2.txt | sort -u
When the file 1.txt is not changed often, you might want to preprocess that file.
sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt > /tmp/sed.in
sed -nf /tmp/sed.in 2.txt | sort -u

Search file A for a list of strings located in file B and append the value associated with that string to the end of the line in file A

This is a bit complicated, well I think it is..
I have two files, File A and file B
File A contains delay information for a pin and is in the following format
AD22 15484
AB22 9485
AD23 10945
File B contains a component declaration that needs this information added to it and is in the format:
'DXN_0':
PIN_NUMBER='(AD22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
'DXP_0':
PIN_NUMBER='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AD23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
'VREFN_0':
PIN_NUMBER='(AB22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
So what I am trying to achieve is the following output
'DXN_0':
PIN_NUMBER='(AD22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='15484';
'DXP_0':
PIN_NUMBER='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AD23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='10945';
'VREFN_0':
PIN_NUMBER='(AB22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='9485';
There is no order to the pin numbers in file A or B
So I'm assuming the following needs to happen
open file A, read first line
search file B for first string field in the line just read
once found in file B at the end of the line add the text "\nPIN_DELAY='"
add the second string filed of the line read from file A
add the following text at the end "';"
repeat by opening file A, read the second line
I'm assuming it will be a combination of sed and awk commands and I'm currently trying to work it out but think this is beyond my knowledge. Many thanks in advance as I know it's complicated..
FILE2=`cat file2`
FILE1=`cat file1`
TMPFILE=`mktemp XXXXXXXX.tmp`
FLAG=0
for line in $FILE1; do
    echo $line >> $TMPFILE
    for line2 in $FILE2; do
        if [ $FLAG == 1 ]; then
            echo -e "PIN_DELAY='$(echo $line2 | awk -F " " '{print $1}')'" >> $TMPFILE
            FLAG=0
        elif [ "`echo $line | grep $(echo $line2 | awk -F " " '{print $1}')`" != "" ]; then
            FLAG=1
        fi
    done
done
mv $TMPFILE file1
Works for me. You can also add a trap to remove the tmp file if the user sends SIGINT.
awk to the rescue...
$ awk -vq="'" 'NR==FNR{a[$1]=$2;next} {print; for(k in a) if(match($0,k)) {print "PIN_DELAY=" q a[k] q ";"; next}}' keys data
'DXN_0':
PIN_NUMBER='(AD22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='15484';
'DXP_0':
PIN_NUMBER='(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,AD23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='10945';
'VREFN_0':
PIN_NUMBER='(AB22,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)';
PIN_DELAY='9485';
Explanation: scan the first file for key/value pairs. For each line in the second data file, print the line, and for any matching key print the value of the key in the requested format. Single quotes in awk are a little tricky; setting a q variable is one way of handling it.
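An alternative to the q variable, for what it's worth, is the octal escape \047, which awk expands to a single quote inside string literals:
# same program, producing the single quotes with \047 instead of a q variable
awk 'NR==FNR{a[$1]=$2;next} {print; for(k in a) if(match($0,k)) {print "PIN_DELAY=\047" a[k] "\047;"; next}}' keys data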
FINAL script for my application. A big thank you to all who helped!
#!/usr/bin/sh
# script created by Adam with a LOT of help from users on stackoverflow
# must pass $1 file (package file from Xilinx)
# must pass $2 file (chips.prt file from the PCB design office)
# remove these temp files, throws error if not present tho, whoops!!
rm DELAYS.txt CHIP.txt OUTPUT.txt
# BELOW::create temp files for the code thanks to Glastis#stackoverflow https://stackoverflow.com/users/5101968/glastis I now know how to do this
DELAYS=`mktemp DELAYS.txt`
CHIP=`mktemp CHIP.txt`
OUTPUT=`mktemp OUTPUT.txt`
# BELOW::grep input file 1 (pkg file from Xilinx) for lines containing a delay in the form of n.n and use TAIL to remove something (can't remember), sed to remove blanks and replace with single space, sed to remove space before \n, use awk to print columns 3,9,10 and feed into awk again to calculate delay provided by fedorqui#stackoverflow https://stackoverflow.com/users/1983854/fedorqui
# In awk, NF refers to the number of fields on the current line. Since $n refers to the field number n, with $(NF-1) we refer to the penultimate field.
# {...}1 do stuff and then print the resulting line. 1 evaluates as True and anything True triggers awk to perform its default action, which is to print the current line.
# $(NF-1) + $NF)/2 * 141 perform the calculation: `(penultimate + last) / 2 * 141
# {$(NF-1)=sprintf( ... ) assign the result of the previous calculation to the penultimate field. Using sprintf with %.0f we make sure the rounding is performed, as described above.
# {...; NF--} once the calculation is done, we have its result in the penultimate field. To remove the last column, we just say "hey, decrease the number of fields" so that the last one gets "removed".
grep -E -0 '[0-9]\.[0-9]' $1 | tail -n +2 | sed -e 's/[[:blank:]]\+/ /g' -e 's/\s\n/\n/g' | awk '{print ","$3",",$9,$10}' | awk '{$(NF-1)=sprintf("%.0f", ($(NF-1) + $NF)/2 * 169); NF--}1' >> $DELAYS
# remove blanks in part file and add additional commas (,) so that the following awk command works properly
cat $2 | sed -e "s/[[:blank:]]\+//" -e "s/(/(,/g" -e 's/)/,)/g' >> $CHIP
# this awk command is provided by karakfa#stackoverflow https://stackoverflow.com/users/1435869/karakfa Explanation: scan the first file for key/value pairs. For each line in the second data file print the line, for any matching key print value of the key in the requested format. Single quotes in awk is little tricky, setting a q variable is one way of handling it. https://stackoverflow.com/questions/32458680/search-file-a-for-a-list-of-strings-located-in-file-b-and-append-the-value-assoc
awk -vq="'" 'NR==FNR{a[$1]=$2;next} {print; for(k in a) if(match($0,k)) {print "PIN_DELAY=" q a[k] q ";"; next}}' $DELAYS $CHIP >> $OUTPUT
# remove the additional commas (,) added in earlier before ) and after ( and you are done..
cat $OUTPUT | sed -e 's/(,/(/g' -e 's/,)/)/g' >> chipsd.prt

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" is found on those two lines and the match is between { and }.
The following script successfully does this task:
while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading commands from a file, but I cannot find any manual explaining how to use it.
How can I improve the speed of the script?
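For reference, sed -f script.sed reads its editing commands from script.sed instead of the command line, so one d command per word can be generated up front and applied in a single pass. A hedged sketch of that approach (hypothetical file name script.sed; it inherits the same caveat as the answers below about regex metacharacters in words.txt):
# build one /{.*word.*}/d command per word, then run them all in a single sed invocation
sed 's|.*|/{.*&.*}/d|' words.txt > script.sed
sed -f script.sed file.txt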
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It rewrites file.txt in place to produce the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
    b[j++] = $0
}
END {
    printf "" > FILENAME
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME
}
' words.txt file.txt
If the files are expected to get so large that awk may not be able to handle them, we can only redirect the result to stdout; we may not be able to modify the file directly:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
1
' words.txt file.txt
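The usual pattern is then to redirect to a new file and rename it over the original, e.g. (a sketch using the condensed form of the program above):
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next}1' \
    words.txt file.txt > file.txt.new && mv file.txt.new file.txt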
You can use grep to match the two files like this:
grep -vf words.txt file.txt
I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the words.txt file as the set of matching patterns.
The -v option inverts the matching, i.e. it keeps the lines that do not match any of the patterns.
It doesn't handle the {} constraint, but that is easily fixed, for example by adding the brackets to the pattern file (or to a temporary file created at runtime).
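For instance, with bash process substitution the wrapped pattern file can be created at runtime without a real temporary file (a sketch along the lines just described):
# wrap each word as {.*word.*} on the fly and hand the result to grep -f
grep -v -f <(sed 's/.*/{.*&.*}/' words.txt) file.txt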
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This just transforms the contents of words.txt on the fly and uses the result as the pattern file for grep.
In pure native bash (4.x):
#!/bin/env bash4
# ^-- MUST start with a /bin/bash shebang, NOT /bin/sh
readarray -t words <words.txt # read words into array
IFS='|' # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]" # form a regex matching all words
while read -r; do # for each line in file...
if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
printf '%s\n' "$REPLY" # ...and print it if not.
fi
done <file.txt
Native bash is somewhat slower than awk, but this is still a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to read the patterns from stdin.
update
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
    p=1
    for (i in a) {
        if ($0 ~ "{.*" i ".*}") { p=0; break }
    }
}
p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt. A flag p controls whether the line is printed. If the line matches one of the patterns, p is set to false. When the bare p after the last block evaluates to true, the default action occurs, which is to print the line.

Split one file into multiple files based on pattern

I have a binary file which I convert into a regular file using hexdump and a few awk and sed commands. The output file looks something like this -
$cat temp
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e5820000000000000000000
000000087d3f513000000000000000000000000000000000001001001010f000000000026
58783100b354c52658783100b43d3d0000ad6413400103231665f301010b9130194899f2f
fffffffffff02007c00dc015800a040402802f1d5b2b8ca5674504f433031000000000004
6363070000000000000000000000000065450000b4fb6b4000393d3d1116cdcc57e58287d
3f55285a1084b
The temp file has a few eye-catchers (3d3d) which don't repeat that often. They kinda denote the start of a new binary record. I need to split the file based on those eye-catchers.
My desired output is to have multiple files (based on the number of eye-catchers in my temp file).
So my output would look something like this -
$cat temp1
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e582000000000000000
0000000000087d3f513000000000000000000000000000000000001001001010f00000000
002658783100b354c52658783100b4
$cat temp2
3d3d0000ad6413400103231665f301010b9130194899f2ffffffffffff02007c00dc0
15800a040402802f1d5b2b8ca5674504f4330310000000000046363070000000000000000
000000000065450000b4fb6b400039
$cat temp3
3d3d1116cdcc57e58287d3f55285a1084b
The RS variable in awk is nice for this, allowing you to define the record separator. Thus, you just need to capture each record in its own temp file. The simplest version is:
cat temp |
awk -v RS="3d3d" '{ print $0 > "temp" NR }'
The sample text starts with the eye-catcher 3d3d, so temp1 will be an empty file. Further, the eye-catcher itself won't be at the start of the temp files, as it is in the temp files shown in the question. Finally, if there are a lot of records, you could run into the system limit on open files. Handling these minor complications brings it closer to what you want and makes it safer:
cat temp |
awk -v RS="3d3d" 'NR > 1 { print RS $0 > "temp" (NR-1); close("temp" (NR-1)) }'
#!/usr/bin/perl
undef $/;                          # slurp the whole input at once
$_ = <>;
$n = 0;
for $match (split(/(?=3d3d)/)) {   # split before each 3d3d, keeping the marker
    open(O, '>temp' . ++$n);       # temp1, temp2, ...
    print O $match;
    close(O);
}
This might work:
# sed 's/3d3d/\n&/2g' temp | split -dl1 - temp
# ls
temp temp00 temp01 temp02
# cat temp00
3d3d01f87347545002f1d5b2be4ee4d700010100018000cc57e5820000000000000000000000000087d3f513000000000000000000000000000000000001001001010f000000000026 58783100b354c52658783100b4
# cat temp01
3d3d0000ad6413400103231665f301010b9130194899f2ffffffffffff02007c00dc015800a040402802f1d5b2b8ca5674504f4330310000000000046363070000000000000000000000000065450000b4fb6b400039
# cat temp02
3d3d1116cdcc57e58287d3f55285a1084b
EDIT:
If there are newlines in the source file you can remove them first by using tr -d '\n' <temp and then pipe the output through the above sed command. If however you wish to preserve them then:
sed 's/3d3d/\n&/g;s/^\n\(3d3d\)/\1/' temp |csplit -zf temp - '/^3d3d/' {*}
Should do the trick
Mac OS X answer
Where that nice awk -v RS="pattern" trick doesn't work. Here's what I got working:
Given this example concatted.txt
filename=foo bar
foo bar line1
foo bar line2
filename=baz qux
baz qux line1
baz qux line2
use this command (remove comments to prevent it from failing)
# cat: useless use of cat ^__^;
# tr: replace all newlines with delimiter1 (which must not be in concatted.txt) so we have one line of all the next
# sed: replace file start pattern with delimiter2 (which must not be in concatted.txt) so we know where to split out each file
# tr: replace delimiter2 with NULL character since sed can't do it
# xargs: split giant single-line input on NULL character and pass 1 line (= 1 file) at a time to echo into the pipe
# sed: get all but last line (same as head -n -1) because there's an extra since concatted-file.txt ends in a NULL character.
# awk: does a bunch of stuff as the final command. Remember it's getting a single line to work with.
# {replace all delimiter1s in file with newlines (in place)}
# {match regex (sets RSTART and RLENGTH) then set filename to regex match (might end at delimiter1). Note in this case the number 9 is the length of "filename=" and the 2 removes the "§" }
# {write file to filename and close the file (to avoid "too many files open" error)}
cat ../concatted-file.txt \
| tr '\n' '§' \
| sed 's/filename=/∂filename=/g' \
| tr '∂' '\0' \
| xargs -t -0 -n1 echo \
| sed \$d \
| awk '{match($0, /filename=[^§]+§/)} {filename=substr($0, RSTART+9, RLENGTH-9-2)".txt"} {gsub(/§/, "\n", $0)} {print $0 > filename; close(filename)}'
results in these two files named foo bar.txt and baz qux.txt respectively:
filename=foo bar
foo bar line1
foo bar line2
filename=baz qux
baz qux line1
baz qux line2
Hope this helps!
It depends whether it's all on a single line in your temp file or not. But assuming it is a single line, you can go with:
sed 's/\(.\)\(3d3d\)/\1#\2/g' FILE | awk -F "#" '{ for (i=1; i<=NF; i++) { print $i > "temp" i } }'
The first sed inserts a # as a field/record separator, then awk splits on # and prints every "field" to its own file.
If the input file is already split on 3d3d then you can go with:
awk '/^3d3d/ { i++ } { print > "temp" i }' temp
HTH
