color the search result when the input is a paragraph, not a single line - bash

I am trying to print the contents of the whole file with the search string highlighted.
For a simple file where a record is a single line, I can do this easily using:
grep --color=auto "mysearchpattern" inputfile
Here my records are in the form of paragraphs, not single lines. Example:
CREATE TABLE mytable ( id SERIAL,
name varchar(20),
cost int );
CREATE TABLE notmytable ( id SERIAL,
name varchar(20),
cost int );
If I use grep for the keyword "notmytable", it gives me colored output, but only that line is printed.
grep --color=auto 'notmytable' inputfile
CREATE TABLE notmytable ( id SERIAL, # <-- "notmytable" is in red but it's not the whole query
I need something like this :
CREATE TABLE notmytable ( id SERIAL, # <---"notmytable" is in red
name varchar(20),
cost int );
I can print the desired paragraph with awk or perl, but how do I color it?
awk -v RS=';' -v ORS=';\n' '/notmytable/' inputfile
CREATE TABLE notmytable ( id SERIAL,
name varchar(20),
cost int );
OR perl :
perl -00lne 'print $_ if /notmytable/' inputfile
CREATE TABLE notmytable ( id SERIAL,
name varchar(20),
cost int );

perl "-MTerm::ANSIColor qw(:constants)" -00lnE'
next if not /notmytable/;
for (split "\n") { /notmytable/ ? say RED $_, RESET : say }
' input
The :constants tag provides RED and such. There are other ways; see Term::ANSIColor.
Note that there has to be some duplicate searching, since we need to first identify the paragraph, but then print only that line in color while printing others normally.
If only the pattern needs to be colored, double parsing isn't needed (and it's far easier and nicer):
perl -MTerm::ANSIColor -00lnE'say if s/(notmytable)/colored($1,"red")/eg' input

If a string matching the desired regexp is found, surround it with the appropriate characters to change its color and print the record containing it:
$ awk -v RS=';' -v ORS=';\n' 'gsub(/notmytable/,"<RED>&</RED>")' file
CREATE TABLE <RED>notmytable</RED> ( id SERIAL,
name varchar(20),
cost int );
Just change <RED> and </RED> to the escape sequences for that color, e.g.
awk -v RS=';' -v ORS=';\n' 'gsub(/notmytable/,"\033[31m&\033[0m")' file
or if you don't want to hard-code those color values you could do:
awk -v RS=';' -v ORS=';\n' -v red="$(tput setaf 1)" -v nrm="$(tput sgr0)" 'gsub(/notmytable/,red"&"nrm)' file
BTW if you have blank lines between all records you'll probably find using -v RS= -v ORS='\n\n' works better for you than -v RS=';' -v ORS=';\n'.
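For example, if the records really are blank-line separated, the same coloring one-liner becomes (a sketch under that assumption):
awk -v RS= -v ORS='\n\n' 'gsub(/notmytable/,"\033[31m&\033[0m")' file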

Combine your awk or perl solution with grep:
perl -00lne 'print $_ if /notmytable/' input|grep -C1000 notmytable
The -C1000 option makes grep keep 1000 lines of surrounding context around the matching line, effectively turning grep into a mere colorizer rather than a line selector.
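One caveat: --color=auto disables coloring when grep's output is piped onward, so if you want to page the result you'll need --color=always plus a pager that passes escape sequences through, e.g.:
perl -00lne 'print $_ if /notmytable/' input | grep --color=always -C1000 notmytable | less -R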
You can wrap it into a bash function:
function paragrep() { perl -00lne 'print $_ if /'"$1"'/' "$2" | grep -C1000 "$1"; }
Usage example:
$ paragrep notmytable input
CREATE TABLE notmytable ( id SERIAL, # <---"notmytable" is in red
name varchar(20),
cost int );
$

Regex for printing pattern from string

I have a file with the below content. I need to separate the content into 2 files.
o/p1 should have everything within the first parentheses (), with the backticks (`) removed and only columns 1 and 2 printed.
o/p2 should have LOCATION with its value.
$ cat dt.txt
CREATE EXTERNAL TABLE `rte.fteff_ft`(
`dt` date,
`wk_id` int,
`yq_id` int(10,00),
`te_ind` string,
`yw_dt` date,
`em_dt` date comment dfdsf sdfsdf)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://dfdf/data/ffff/ODE/TdddfT/'
TBLPROPERTIES (
'last_modified_by'='asdas',
'last_modified_time'='1639551681',
'numFiles'='1',
'totalSize'='2848434',
'transient_lastDdlTime'='1639551681')
I need output from the above in two files.
o/p1: a.txt
dt date,
wk_id int,
yq_id int(10,00),
te_ind string,
yw_dt date,
em_dt date
o/p2: b.txt
LOCATION
'hdfs://dfdf/data/ffff/ODE/TdddfT/'
First, use sed to run a couple of commands that operate on the range of lines between 'CREATE EXTERNAL' and 'ROW FORMAT DELIMITED', not including those lines themselves. Then replace grave accent marks (backticks) with nothing, then keep only the first 2 words.
sed -E '/CREATE EXTERNAL/,/ROW FORMAT DELIMITED/!d;//d;s/`//g; s/(([^ ]+ ){2}).*/\1/' dt.txt > a.txt
EDIT: To remove the commas at the end of the lines, add another command, s/,$//. Make sure to anchor the comma to the end of the line, or else you'll also match the comma inside the int declaration.
sed -E '/CREATE EXTERNAL/,/ROW FORMAT DELIMITED/!d;//d;s/`//g;s/,$//; s/(([^ ]+ ){2}).*/\1/' dt.txt > a.txt
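Running the edited command on the sample dt.txt should give roughly this (the comment on the em_dt line is trimmed by the two-word filter):
$ sed -E '/CREATE EXTERNAL/,/ROW FORMAT DELIMITED/!d;//d;s/`//g;s/,$//; s/(([^ ]+ ){2}).*/\1/' dt.txt
dt date
wk_id int
yq_id int(10,00)
te_ind string
yw_dt date
em_dt date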
Second, use the -A option to grep to match the word 'LOCATION' on a line by itself plus the following 1 line.
grep -A 1 '^LOCATION$' dt.txt > b.txt

Command line: retrieving specific column from CSV file

I have a CSV file called articles.csv with headers as follows:
article_id, article_title, article_shares, article_date.
The first row of data in the file is found with $ cat articles.csv | sed "1 d", and this returns: "895", "Trump, Clinton, America. Who will win, who will lose?", "100", "01/05/2016".
I want to return the fourth column of data (the date of the article) so I use the following code:
$ cat articles.csv | sed "1 d" | cut -d , -f 4
However, I don't get the date; I get America. Who will win. How do I get the output of the fourth column, regardless of the fact that some columns have commas in them?
A quick and dirty solution:
... | awk -F'",' '{print $4}'
A slow but clean solution:
... | ruby -ne $'require "csv"; print CSV.parse($_)[0][3]'
Note: CSV format should not have spaces between fields, so change your record to:
"895","Trump, Clinton, America. Who will win, who will lose?","100","01/05/2016"

Want to sort a file based on another file in unix shell

I have 2 files refer.txt and parse.txt
refer.txt contains the following
julie,remo,rob,whitney,james
parse.txt contains
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0
Now my output.txt should list the files in parse.txt based on the order specified in refer.txt
ex of output.txt should be:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
i have tried the following code:
sort -nru refer.txt parse.txt
but no luck.
Please assist me. TIA.
You can do that using GNU awk:
awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
{s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt
Output:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
Explanation:
-F/ # Use field separator as /
-v RS=',|\n' # Use record separator as comma or newline
NR == FNR { # While processing parse.txt
a[$1]=(a[$1])?a[$1] ","$0:$0 # create an array with 1st field as key and value as all the
# records with keys julie, remo, rob etc.
}
{ # while processing the second file refer.txt
s = (s)?s "," a[$1]:a[$1] # aggregate all values by reading key from 2nd file
}
END {print s } # print all the values
In pure native bash (4.x):
# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt
# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[@]}"; do
key=${value%%/*}
if [[ ${kv[$key]} ]]; then
kv[$key]+=",$value" # already exists, comma-separate
else
kv[$key]="$value"
fi
done
# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[@]}"; do
out+=( "${kv[$value]}" )
done
# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt
If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
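A quick guard you could put at the top of the script makes that failure mode explicit (a hypothetical addition, using bash's built-in BASH_VERSINFO array):
if (( BASH_VERSINFO[0] < 4 )); then
echo "this script requires bash 4.x (associative arrays)" >&2
exit 1
fi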
tr , "\n" refer.txt | cat -n >person_id.txt # 'cut -n' not posix, use sed and paste
cat person_id.txt | while read person_id person_key
do
print "$person_id" > $person_key
done
tr , "\n" parse.txt | sed 's/(^[^\/]*)(\/.*)$/\1 \1\2/' >person_data.txt
cat person_data.txt | while read foreign_key person_data
do
person_id="$(<$foreign_key)"
print "$person_id" " " "$person_data" >>merge.txt
done
sort merge.txt >output.txt
A textbook data processing approach: a person id table and a person data table, merged on a common key field, which is the first name of the person:
[person_key] [person_id]
- person id table, a unique sortable 'id' for each person (line number in this instance, since that is the desired sort order), and key for each person (their first name)
[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'
[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested
The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
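A minimal sketch of that idea in isolation (hypothetical names, just to show the technique):
mkdir -p kvstore
printf '%s\n' "1" >kvstore/julie # kv[julie]=1
cat kvstore/julie # look up kv[julie] -> prints 1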
This actually adds a step to the otherwise simple but not very efficient approach of grepping parse.txt for each value in refer.txt; which is more efficient, I'm not sure.
NB: The above code is a rough sketch and has not been tested.
NBB: On reflection, a better way of doing this would probably be to use the filesystem to create a random access file from parse.txt (essentially an index), and then treat refer.txt as a batch job: for each name read in from refer.txt in turn, print that person's data from the parse.txt random access file:
# 1) index the data file (parse.txt) on the required field
mkdir -p ./person_data
tr , "\n" <parse.txt | while read data
do
key="$(printf '%s\n' "$data" | sed 's/\/.*$//')" # alt. `cut -d'/' -f1` ??
printf '%s\n' "$data" >>./person_data/"$key"
done
# 2) run the batch job
tr , "\n" <refer.txt | while read key
do
cat ./person_data/"$key"
done
However, having said that, using egrep is probably just as rigorous a solution, at least for small datasets, and I would most certainly use that approach given the specific question posed. (Or maybe not! The above could well prove faster as well as more robust.)
Command
while read line; do
grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -
Key points
For both files, commas are translated to newlines using the tr command (without actually changing the files themselves). This is useful because while read and grep work under the assumption that records are separated by newlines rather than commas.
while read will read in every name from refer.txt (i.e. julie, remo, etc.) and then use grep to retrieve lines from parse.txt containing that name.
The ^ in the regex ensures matching is only performed at the start of the string and not in the middle (thanks to @CharlesDuffy's comment below), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
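A quick illustration of that whole-word behavior (hypothetical inputs, just to show the matching):
$ echo "throb/hello/1.0" | grep -w "^rob" # no output
$ echo "rob/hello/4.0" | grep -w "^rob"
rob/hello/4.0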
The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.
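For reference, paste -s -d , on its own joins stdin lines with commas:
$ printf '%s\n' a b c | paste -s -d , -
a,b,c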

replace multiple lines identifying end character

I have the below code
CREATE TABLE Table1(
column1 double NOT NULL,
column2 varchar(60) NULL,
column3 varchar(60) NULL,
column4 double NOT NULL,
CONSTRAINT Index1 PRIMARY KEY CLUSTERED
(
column2 ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON PRIMARY
) ON PRIMARY
GO
GO
and I want to replace
CONSTRAINT Index1 PRIMARY KEY CLUSTERED
(
column2 ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON PRIMARY
) ON PRIMARY
GO
with
)
You can't assume GO is the last thing in the file. After GO there can be another table script.
How can I do that with a single sed or awk command?
Update:
You can use the following sed command to replace even the last , before the CONSTRAINT block:
sed -r '/,/{N;/CONSTRAINT/{:a;N;/GO/!ba;s/([^,]+).*/\1\n)/};/CONSTRAINT/!n}' input.sql
Let me explain it as a multiline script:
# Search for a comma
/,/ {
# If a comma was found, slurp in the next line
# and append it to the current line in the pattern buffer
N
# If the pattern buffer does not contain the word CONSTRAINT
# print the pattern buffer and go on with the next line of input
# meaning start searching for a comma
/CONSTRAINT/! n
# If the pattern CONSTRAINT was found we loop until we find the
# word GO
/CONSTRAINT/ {
# Define a start label for the loop
:a
# Append the next line of input to the pattern buffer
N
# If GO is still not found in the pattern buffer
# step to the start label of the loop
/GO/! ba
# The loop was exited meaning the pattern GO was found.
# We keep the first line of the pattern buffer - without
# the comma at the end and replace everything else by a )
s/([^,]+).*/\1\n)/
}
}
You can save the above multiline script in a file and execute it using
sed -rf script.sed input.sql
You can use the following sed command:
sed '/CONSTRAINT/{:a;N;/GO/!ba;s/.*/)/}' input.sql
The pattern searches for a line containing /CONSTRAINT/. If the pattern is found, a block of commands wrapped between { } is started. In the block we first define a label a through :a. Then we get the next line of input through N and append it to the pattern buffer. Unless we find the pattern /GO/ we continue at label a using the branch command b. If the pattern /GO/ is found we simply replace the buffer with a ).
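On the sample input above, that should produce the following (note that, unlike the first answer, the trailing comma on the last column line is left in place):
$ sed '/CONSTRAINT/{:a;N;/GO/!ba;s/.*/)/}' input.sql
CREATE TABLE Table1(
column1 double NOT NULL,
column2 varchar(60) NULL,
column3 varchar(60) NULL,
column4 double NOT NULL,
)
GO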
An alternative can be using a range like FredPhil suggested:
sed '/CONSTRAINT/,/GO/{s/GO/)/;te;d;:e}'
This may look scary but it is not difficult to grasp with a bit of explanation:
SED_DELIM=$(echo -en "\001")
START=' CONSTRAINT Index1 PRIMARY KEY CLUSTERED'
END='GO'
sed -n $'\x5c'"${SED_DELIM}${START}${SED_DELIM},"$'\x5c'"${SED_DELIM}${END}${SED_DELIM}{s${SED_DELIM}GO${SED_DELIM})${SED_DELIM};t a;d;:a;};p" test2.txt
The sed has the following form you may be more familiar with:
sed /regex1/,/regex2/{commands}
First it uses the non-printable SOH character (\001) as the delimiter
Sets the START and END tags for sed multiline match
Then performs the sed command:
-n do not print by default
$'\x5c' is a Bash string literal that corresponds to backslash \
The backslashes are necessary to escape the non-printable delimiter on the multiline range match.
{s${SED_DELIM}GO${SED_DELIM})${SED_DELIM};t a;d;:a;};p:
s${SED_DELIM}GO${SED_DELIM})${SED_DELIM} replace the line that matches GO with )
t a; if there is a successful substitution in the prior statement then branch to the :a label
d if there is no substitution then delete the line
p print whatever the result is after the commands
I didn't see their answers prior to posting this - this answer is the same as FredPhil's/hek2mgl's - except that in this manner you have a mechanism to be more dynamic on the LHS, since you can change the delimiter to a character that is much less likely to appear in the dataset.
With GNU awk for multi-char RS and assuming you want to get rid of the comma before the "CONSTRAINT":
$ cat tst.awk
BEGIN{ RS="^$"; ORS="" }
{
gsub(/\<GO\>/,"\034")
gsub(/,\s*CONSTRAINT[^\034]+\034/,")")
gsub(/\034/,"GO")
print
}
$ gawk -f tst.awk file
CREATE TABLE Table1(
column1 double NOT NULL,
column2 varchar(60) NULL,
column3 varchar(60) NULL,
column4 double NOT NULL)
GO
The above works by replacing every stand-alone "GO" with a control char that's unlikely to appear in your input (in this case I used the same value as the default SUBSEP) so we can use that char in a negated character list in the middle gsub() to create a regexp that ends with the first "GO" after "CONSTRAINT". This is one way to do "non-greedy" matching in awk.
If there is no char that you KNOW cannot appear in your input, you can create one like this:
$ cat tst.awk
BEGIN{ RS="^$"; ORS="" }
{
gsub(/a/,"aA"); gsub(/b/,"aB"); gsub(/\<GO\>/,"b")
gsub(/,\s*CONSTRAINT[^b]+b/,")")
gsub(/b/,"GO"); gsub(/aB/,"b"); gsub(/aA/,"a")
print
}
$
$ gawk -f tst.awk file
CREATE TABLE Table1(
column1 double NOT NULL,
column2 varchar(60) NULL,
column3 varchar(60) NULL,
column4 double NOT NULL)
GO
The above initially converts all "a"s to "aA" and "b"s to "aB" so that
there are no longer any "b"s in the record, and
since all original "a"s now have an "A" after them, the only occurrences of
"aB" represent where "bs" were originally located
and that means that we can now convert all "GO"s to "b"s just like we converted them to "\034" in the first script above. Then we do the main gsub() and then unroll our initial gsub()s.
This idea of gsub()ing to create chars that cannot previously exist, using those chars, then unrolling the initial gsub()s is an extremely useful idiom to learn and remember, e.g. see https://stackoverflow.com/a/13062682/1745001 for another application.
To see it working one step at a time:
$ cat file
foo bar Hello World World able bodies
$ awk '{gsub(/a/,"aA")}1' file
foo baAr Hello World World aAble bodies
$ awk '{gsub(/a/,"aA"); gsub(/b/,"aB")}1' file
foo aBaAr Hello World World aAaBle aBodies
$ awk '{gsub(/a/,"aA"); gsub(/b/,"aB"); gsub(/World/,"b")}1' file
foo aBaAr Hello b b aAaBle aBodies
$ awk '{gsub(/a/,"aA"); gsub(/b/,"aB"); gsub(/World/,"b"); gsub(/Hello[^b]+b/,"We Are The")}1' file
foo aBaAr We Are The b aAaBle aBodies
$ awk '{gsub(/a/,"aA"); gsub(/b/,"aB"); gsub(/World/,"b"); gsub(/Hello[^b]+b/,"We Are The"); gsub(/b/,"World")}1' file
foo aBaAr We Are The World aAaBle aBodies
$ awk '{gsub(/a/,"aA"); gsub(/b/,"aB"); gsub(/World/,"b"); gsub(/Hello[^b]+b/,"We Are The"); gsub(/b/,"World"); gsub(/aB/,"b")}1' file
foo baAr We Are The World aAble bodies
$ awk '{gsub(/a/,"aA"); gsub(/b/,"aB"); gsub(/World/,"b"); gsub(/Hello[^b]+b/,"We Are The"); gsub(/b/,"World"); gsub(/aB/,"b"); ; gsub(/aA/,"a")}1' file
foo bar We Are The World able bodies

Bash script replace two fields in a text file using variables

This should be a simple fix but I cannot wrap my head around it at the moment.
I have a comma-delimited file called my_course that contains a list of courses with some information about them.
I need to get user input about the last two fields and change them accordingly.
Each line is constructed like:
CourseNumber,CourseTitle,CreditHours,Status,Grade
Example file:
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
I get the user input for 3 things: Course Number, Status (0 or 1), and Grade (A,B,C,N/A)
So far I have tried matching the line containing the course number and changing the last two fields. I haven't been able to figure out how to modify the last two fields using sed, so I'm using this horrible jumble of awk and sed:
temporary=$(awk -v status=$status -v grade=$grade '
BEGIN { FS="," }; $(NF)=""; $(NF-1)="";
/'$cNum'/ {printf $0","status","grade;}' my_course)
sed -i "s/CSC$cNum.*/$temporary/g" my_course
The issue that I'm running into here is that the number of fields in the course title can range from 1 to 4, so I can't just easily print the first n fields. I've tried removing the last two fields and appending the new values for status and grade, but that isn't working for me.
Note: I have already done checks to ensure that the user inputs valid data.
Use a simple awk-script:
BEGIN {
FS=","
OFS=FS
}
$0 ~ course {
$(NF-1)=status
$NF=grade
} {print}
and on the command line, set three variables for course, status, and grade.
in action:
$ cat input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC3210" -vstatus="1" -vgrade="A" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,1,A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC1010" -vstatus="1" -vgrade="B" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,1,B
It doesn't matter how many commas you have in the course name as long as you look only at the last two fields:
sed -i "/CSC$cNum/ s/.,[^,]*\$/$status,$grade/" my_course
The trick is to use $ in the pattern to match the end of the line. It is escaped as \$ because inside double quotes a bare $$ would be expanded by the shell (to its process ID).
And don't bother building the "temporary" line - apply the substitution only to the line that matches the course number.
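For example, with the sample file and cNum=3210, status=1, grade=A (dropping -i while testing so the file isn't modified in place):
$ sed "/CSC$cNum/ s/.,[^,]*\$/$status,$grade/" my_course
CSC3210,COMPUTER ORG & PROGRAMMING,3,1,A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A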
