Using sed to find-and-replace in a text file using strings from another text file - bash

I have two files as follows. The first is sample.txt:
new haven co-op toronto on $1245
joe schmo co-op powell river bc $4444
The second is locations.txt:
toronto
powell river
on
bc
We'd like to use sed to produce a marked-up sample-new.txt that adds ; before and after each of these strings, so that the final lines would appear like:
new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444
Is this possible using bash? The actual files are much longer (thousands of lines in each case) but as a one-time job we're not too concerned about processing time.
--- edited to add ---
My original approach was something like this:
cat locations.txt | xargs -i sed 's/{}/;/' sample.txt
But it only ran the script once per pattern, as opposed to the methods you've proposed here.

Using awk:
awk 'NR==FNR{a[NR]=$0; next;} {for(i in a)gsub("\\<"a[i]"\\>",";"a[i]";"); print} ' locations.txt sample.txt
Using awk+sed
sed -f <(awk '{print "s|\\<"$0"\\>|;"$0";|g"}' locations.txt) sample.txt
Same using pure sed:
sed -f <(sed 's/.*/s|\\<&\\>|\;&\;|g/' locations.txt) sample.txt
(After you show your coding attempts, I will add the explanation of why this works.)
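For reference, here is the stream that both generator variants feed to sed -f, given the sample locations.txt (the \< and \> are GNU sed word-boundary anchors):
s|\<toronto\>|;toronto;|g
s|\<powell river\>|;powell river;|g
s|\<on\>|;on;|g
s|\<bc\>|;bc;|g
Because every substitution is anchored on word boundaries, the short entry "on" cannot match inside "toronto".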

Just to complete your set of options, you can do this in pure bash, slowly:
#!/usr/bin/env bash
readarray -t places < locations.txt
while IFS= read -r line; do
    for place in "${places[@]}"; do
        line="${line/ $place / ;$place; }"
    done
    echo "$line"
done < sample.txt
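Run against the sample files from the question, this prints:
new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444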
Note that this likely won't work as expected if you include places that are inside other places, for example "niagara on the lake" which is in "on":
foo bar co-op ;niagara ;on; the lake; on $1
Instead, you might want to do more targeted pattern matching, which will be much easier in awk:
#!/usr/bin/awk -f
# Collect the location list into the index of an array
NR==FNR {
    places[$0]
    next
}
# Now step through the input file
{
    # Handle two-letter provinces
    if ($(NF-1) in places) {
        $(NF-1) = ";" $(NF-1) ";"
    }
    # Step through the remaining places doing substitutions as we find matches
    for (place in places) {
        if (length(place) > 2 && index($0, place)) {
            sub(place, ";" place ";")
        }
    }
}
# Print every line
1
This works for me using the data in your question:
$ cat places
toronto
powell river
niagara on the lake
on
bc
$ ./tst places input
new haven co-op ;toronto; ;on; $1245
joe schmo co-op ;powell river; ;bc; $4444
foo bar co-op ;niagara on the lake; ;on; $1
You may have a problem if your places file contains an actual non-province comprising two letters. I'm not sure if such things exist in Canada, but if they do, you'll either need to tweak such lines manually, or make the script more complex by handling provinces separately from cities.

Related

How to print all matching names given as arguments?

I want to write a script that, for any name given as an argument, prints the list of paths
to the home directories of people with that name.
I am new at scripting. Is there any simple way to do this with awk or the egrep command?
Example:
$ show names jakub anna (as an argument)
/home/users/jakubo
/home/students/j_luczka
/home/students/kubeusz
/home/students/jakub5z
/home/students/qwertinx
/home/users/lazinska
/home/students/annalaz
Here is my friend's code, but I have to write it in a different way, and it has to be simple like this code:
#!/bin/bash
for name in "$@"
do
awk -v n="$name" -F ':' 'BEGIN{IGNORECASE=1};$5~n{print $6}' /etc/passwd | while read line
do
echo $line
done
done
It's possible to use a simple awk script to look for matching names.
The list of names can be passed as a space-separated list to awk, which will construct (in the BEGIN section) a combined pattern (e.g. '(names|jakub|anna)'). The pattern is used for testing the name column ($5) of the passwd file.
#! /bin/sh
awk -v "L=$*" -F: '
BEGIN {
name_pat = "(" gensub(" ", "|", "g", L) ")"
}
$5 ~ name_pat { print $6 }
' /etc/passwd
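Note that gensub is a GNU awk extension. If gawk isn't available, a minimal portable sketch builds the same pattern with plain gsub, which modifies L in place:
#! /bin/sh
awk -v "L=$*" -F: '
BEGIN {
    gsub(" ", "|", L)
    name_pat = "(" L ")"
}
$5 ~ name_pat { print $6 }
' /etc/passwd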
Since at present the question as a whole is unclear, this is more of a long comment, and only a partial answer.
There is one easy simplification, since the sample code includes:
... | while read line
do
echo $line
done
All of the code shown above after and including the | is needless and does nothing (like a UUoC), and should therefore be removed. (Strictly speaking, echo $line with an unquoted $line would collapse repeated spaces, but that's not relevant to the task at hand, so we can say the code above does nothing useful.)
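With the needless pipeline removed, the friend's loop reduces to:
#!/bin/bash
for name in "$@"
do
    awk -v n="$name" -F ':' 'BEGIN{IGNORECASE=1};$5~n{print $6}' /etc/passwd
done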

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier,product) pairs, and I must show the number of products from every supplier along with their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to a file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
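If you need the exact two-lines-per-supplier layout from the question, a small awk reshaping pass (a sketch) can be appended; note that datamash's unique sorts the product names alphabetically:
datamash -s -t ':' groupby 1 count 2 unique 2 < file |
awk -F: '{ gsub(/,/," ",$3); print $1": "$2; print $1": "$3 }'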
The following pipeline should do the job:
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with common supplier
awk counts the items for supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space;
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *; if the next line begins with the same supplier (the \1 backreference), the two lines are merged into one;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to, but not including, the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the awk-only code I was referring to:
< your_input_file awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}' your_input_file

Grep list (file) from another file

I'm new to bash and am trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (I tried comma-separated and tab-delimited):
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is something like:
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from File1.txt.
The code I tried was:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then a list of strings without pattern names, e.g.:
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to, and I have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a while loop which reads one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep "$pat" base.csv
done <File1.txt >output.txt
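With the sample File1.txt and base.csv this produces:
ABC
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF
line 3 .."himk,n,hn.ujj., BDF"
GHJ
(GHJ prints as a bare header because nothing in base.csv matches it.)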
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
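As a quick sketch of the regex improvement: grep's -w option, supported by GNU and BSD grep, restricts each pattern to whole-word matches, so ABC no longer matches inside 123DEABCXYZ:
grep -w -f File1.txt base.csv
The Awk script could then look like this; note that it writes one output file per pattern: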
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
    if($0 ~ a[j]) {
        print FILENAME ":" FNR ":" $0 >> ("output." a[j])
        next } }' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt              # cycle through all files containing pattern lists
do
    while read -r q         # read file line by line
    do
        echo "$q" >>"output.${i}"
        grep "$q" base.csv >>"output.${i}"    # note: -f expects a file of patterns, not a pattern
        echo >>"output.${i}"                  # blank separator line
    done < "${i}"
done
Here is one that separates the words of file2 (with split; comma-separated, with quotes and spaces stripped off) into an array (word[]) and stores the record names (line 1 etc.) with each word, comma-separated:
awk '
NR==FNR {
    n=split($0,tmp,/[" ]*(,|$)[" ]*/)    # split words
    for(i=2;i<=n;i++)                    # after first
        if(tmp[i]!="")                   # non-empties
            word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1]    # hash rownames
    record[tmp[1]]=$0                    # store records
    next
}
($1 in word) {                     # word found
    n=split(word[$1],tmp,",")      # get record names
    print $1 ":"                   # output word
    for(i=1;i<=n;i++)              # and records
        print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (it found the names of the pattern blocks, e.g. ABC, BDF, but no lines).
I gave up for a while and then eventually tried another way.
While the base goal was to cycle through pattern list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:
for i in *.txt    # cycle through files w/ patterns
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1"    # greps all patterns from current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2"     # cuts columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names, as I initially requested.
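For reference, a sketch of the pipelined version, which skips the intermediate .out1 file:
for i in *.txt
do
    grep -F -f "$i" bigfile.csv | cut -f 2,3,4,7 >> "${i}.out2"
done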

grep patterns from file, print the pattern instead of the matched string

I want to grep with patterns from a file containing regexes.
When a pattern matches, it prints the matched strings but not the pattern.
How can I get the pattern instead of the matched strings?
pattern.txt
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
Donut Gorilla Chocolate
Chocolate (English|Fall) apple gorilla
gorilla chocolate (apple|ball)
(ball|donut) apple
strings.txt
apple ball Donut
donut ball chocolate
donut Ball Chocolate
apple donut
chocolate ball Apple
This is the grep command:
grep -Eix -f pattern.txt strings.txt
This command prints matched strings from strings.txt
apple ball Donut
donut ball chocolate
donut Ball Chocolate
But I want to know which patterns from pattern.txt were used to match:
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
The lines in pattern.txt can be lower case or upper case, with or without regex, and with any number of words and regex elements. There is no other kind of regex than brackets and pipes.
I don't want to use a loop that reads pattern.txt line by line and greps for each pattern, as that's slow.
Is there a way to print the matching pattern, or its line number in the pattern file, with the grep command? Or can some command other than grep do the job without being too slow?
Using grep I have no idea but with GNU awk:
$ awk '
BEGIN { IGNORECASE = 1 }          # for case insensitivity
NR==FNR {                         # process pattern file
    a[$0]                         # hash the entries to a
    next                          # process next line
}
{                                 # process strings file
    for(i in a)                   # loop all pattern file entries
        if($0 ~ "^" i "$") {      # if there is a match (see comments)
            print i               # output the matching pattern file entry
            # delete a[i]         # uncomment to delete matched patterns from a
            # next                # uncomment to end searching after first match
        }
}' pattern strings
outputs (with the question's original sample files):
D (A|B) C
For each line in strings, the script loops over every pattern line to see if there is more than one match. With case-sensitive matching there would be only one match; the IGNORECASE = 1 in the BEGIN block battles that (a GNU awk feature).
Also, if you want each matched pattern file entry to be output only once, you could delete it from a after the first match: add delete a[i] after the print. That might give you some performance advantage too.
EDIT: Since the OP changed the input files, I'm adding solutions for the changed input files too.
awk '
FNR==NR{
    a[toupper($1),toupper($NF)]
    b[toupper($2)]
    next
}
{
    val=toupper($2)
    gsub(/\)|\(|\|/," ",val)
    num=split(val,array," ")
    for(i=1;i<=num;i++){
        if(array[i] in b){
            flag=1
            break
        }
    }
}
flag && ((toupper($1),toupper($NF)) in a){
    print
    flag=""
}' string pattern
Output will be as follows.
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
Solution 1st: a generic solution for when your input file named pattern has more than 2 values in its 2nd field, e.g. (B|C|D|E); the following may help you here.
awk '
FNR==NR{
    a[$1,$NF]
    b[toupper($2)]
    next
}
{
    val=$2
    gsub(/\)|\(|\|/," ",val)
    num=split(val,array," ")
    for(i=1;i<=num;i++){
        if(array[i] in b){
            flag=1
            break
        }
    }
}
flag && (($1,$NF) in a){
    print
    flag=""
}' string pattern
Solution 2nd: You could try the following, but strictly assuming that your input files follow the shown sample patterns only (where I assume your input file named pattern has only 2 values in its 2nd field):
awk '
FNR==NR{
    a[$1,$NF]
    b[toupper($2)]
    next
}
{
    val=$2
    gsub(/\)|\(|\|/," ",val)
    split(val,array," ")
}
((array[1] in b) || (array[2] in b)) && (($1,$NF) in a)
' string pattern
Output will be as follows.
A (B|C) D
D (A|B) C
Maybe switch the paradigm?
while read -r pat
do grep -Eix "$pat" strings.txt >"$pat" &
done <pattern.txt
That's going to make ugly filenames, but you'd have clear lists per set. You could scrub the filenames first if you prefer. Maybe (assuming the patterns resolve to uniqueness this easily...)
while read -r pat
do grep -Eix "$pat" strings.txt >"${pat//[^A-Z]/}" &
done <pattern.txt
It ought to be reasonably quick, and is relatively simple to implement.
Hope that helps.
You could try with bash built-ins:
$ cat foo.sh
#!/usr/bin/env bash
# case insensitive
shopt -s nocasematch
# associative array of patterns
declare -A patterns=()
while read -r p; do
    patterns["$p"]=1
done < pattern.txt
# read strings, test remaining patterns,
# if match print pattern and remove it from array
while read -r s; do
    for p in "${!patterns[@]}"; do
        if [[ $s =~ ^$p$ ]]; then
            printf "%s\n" "$p"
            unset patterns["$p"]
        fi
    done
done < strings.txt
$ ./foo.sh
Apple (Ball|chocolate|fall) Donut
donut (apple|ball) Chocolate
Not sure about the performance but as there are no child processes, it should be much faster than invoking grep for each pattern.
Of course, if you have millions of patterns, storing them in an associative array could exhaust your available memory.

Grabbing values from one file (via awk) and using them in another (via sed)

I am using gawk to grab some values, but not all values, from a file. I have another file that's a template; I will replace certain pieces of it to generate a file specific to the values I grab. I would like to use sed to substitute the fields of interest in the template:
the dog NAME , likes to ACTION in water when he's bored
Another file, f1, has the name of the dog and the action:
Maxs,swim
StoneCold,digs
Thor,leaps
So I can grab these values and store them in an associative array... what I can't do, or see, is how to get these into my sed script.
So a simple sed script could be like this:
s/NAME/ value from f1
s/ACTION/ value from f1
So my output for the template would be:
the dog Maxs , likes to swim in water when he's bored
So if I ran a bash file, the command would look something like this; here is what I have attempted:
gawk -f f1 animalNameAction | sed -f (is there a way to put something here) template | cat
gawk -f f1 animalNameAction > PulledValues| sed -f PulledValues template | cat
but none of this has worked. So I am left wondering how this could be done.
You can do this using awk itself.
I assume the template can span multiple lines, so in the FNR==NR{} block I saved the entire file (template) contents in the variable t,
and in the other block I replaced NAME and ACTION with the first and second fields from the comma-separated file.
Here is an example:
$ cat template
the dog NAME , likes to ACTION in water when he's bored
$ cat file
Maxs,swim
StoneCold,digs
Thor,leaps
$ awk 'FNR==NR{ t = (t ? t RS :"") $0; next}{ s=t; gsub(/NAME/,$1,s); gsub(/ACTION/,$2,s); print s}' template FS=',' file
the dog Maxs , likes to swim in water when he's bored
the dog StoneCold , likes to digs in water when he's bored
the dog Thor , likes to leaps in water when he's bored
More readable:
awk 'FNR==NR{
    t = (t ? t RS : "") $0
    next
}
{
    s=t
    gsub(/NAME/,$1,s)
    gsub(/ACTION/,$2,s)
    print s
}
' template FS=',' file
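And if you do want to hand the values to sed after all, a minimal bash sketch can drive it, making one pass over the template per (name, action) pair; this assumes the values contain no characters special to sed, such as / or &:
while IFS=, read -r name action; do
    sed -e "s/NAME/$name/" -e "s/ACTION/$action/" template
done < file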
