Replace string with pattern keeping the string intact - shell

I would like to replace a string with either sed or awk, where it identifies a pattern as described below:
Example: it looks for a word starting with "XX" and ending with "XX", and replaces that word with the same word wrapped in "${hf:" at the start and "}" at the end.
INPUT
CREATE TABLE XX_DB_XX.test_XX_YYYYMMDD_XX
AS
SELECT id
FROM XX_R_DB_XX.usr_XX_YYYYMMDD_XX
WHERE year = XX_YYYY_XX
AND month = XX_MM_XX
AND day = XX_DD_XX;
OUTPUT
CREATE TABLE ${hf:XX_DB_XX}.test_${hf:XX_YYYYMMDD_XX}
AS
SELECT id
FROM ${hf:XX_R_DB_XX}.usr_${hf:XX_YYYYMMDD_XX}
WHERE year = ${hf:XX_YYYY_XX}
AND month = ${hf:XX_MM_XX}
AND day = ${hf:XX_DD_XX};
I tried the pattern matching below, but the issue is that in the output I want the $A replaced with the corresponding "XX_(*)_XX" string from the input file.
cat test.hql | gawk '{ print gensub(/XX_+[A-Z,_]+_XX/, "${hiveconf:$A}", 1)
}' | gawk '{ print gensub(/XX_+[A-Z]+_XX/, "${hiveconf:$A}", 1) }'
OUTPUT: what I received is below; the $A needs to be replaced with the actual matched string. How can this be done?
CREATE TABLE ${hiveconf:$A}.test_${hiveconf:$A}
AS
SELECT id
FROM ${hiveconf:$A}.usr_${hiveconf:$A}
WHERE year = ${hiveconf:$A}
AND month = ${hiveconf:$A}
AND day = ${hiveconf:$A};

The following awk may help you with this. Note that the character class must include the underscore, or names like XX_R_DB_XX will not match; the & in the replacement stands for the entire matched text.
awk '{gsub(/XX_[a-zA-Z_]+_XX/,"${hf:&}")} 1' Input_file

That's what sed exists for,
sed 's/XX[[:alnum:]_]*XX/${hf:&}/g' file
[[:alnum:]_] stands for an alphanumeric character or underscore; the appended * means zero or more occurrences of it in the regular expression.

Or you could do
sed 's/\(XX[^X]*XX\)/${hf:\1}/g' file
in cases where there may be non-alphanumeric characters between the XXs as well.
First an XX is matched, after which everything up to the next XX is consumed.
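For the record, the asker's gensub attempt can also be made to work by back-referencing the matched text instead of writing a literal $A. A minimal sketch, assuming GNU awk (in the replacement string, \\0 stands for the entire match, and "g" replaces every occurrence):
gawk '{ print gensub(/XX_[A-Za-z_]+_XX/, "${hf:\\0}", "g") }' test.hql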

Need regex to remove a character in a datetime string in csv file

I have a csv file which has the following string:
"2016-10-25T14:07:49.298-07:00"
which I would like to replace with:
"2016-10-25", "14:07:49"
I matched the original string with a regular expression:
([0-9]{4}-[0-9]{2}-[0-9]{2})[T]([0-9]{2}\:[0-9]{2}\:[0-9]{2})\.[0-9]{3}-07\:00
but I need some help with the replacement part.
With awk, assuming T and . are unique:
$ echo '"2016-10-25T14:07:49.298-07:00"' | awk -F'[T.]' '{print $1 "\", \"" $2 "\""}'
"2016-10-25", "14:07:49"
-F'[T.]' assigns T or . as the field separator
Then the first and second fields are printed with the required formatting
With sed:
sed -E 's/^([^T]+)T([^.]+).*/\1", "\2"/'
^([^T]+) matches the portion up to T and puts it in captured group 1
T matches T literally
([^.]+) matches up to the next . and puts that in captured group 2
.* matches the rest
in the replacement, the captured groups are used with the proper formatting to get the desired output: \1", "\2"
Example:
$ sed -E 's/^([^T]+)T([^.]+).*/\1", "\2"/' <<<'"2016-10-25T14:07:49.298-07:00"'
"2016-10-25", "14:07:49"

Replace till first occurrence of delimiter using sed

I am trying to write a sed command which replaces the string only up to the first occurrence of the delimiter. For example, I have the following lines in a file where '~' is the delimiter:
ab c1~10/20/2010 00:00:00 ~1234~10.02~530.55
ab c2~10/10/2010T00:00Z:~12346~11.03~531
abc3~10/10/2010 00:00:00 00-000~122~12~532.44
abc4~10/11/2010~110~13~533
I want to convert all the dates (second column) to this format: "2010-10-10T00:00:00Z". As you can see, the dates can be in different formats; the content after "MM/dd/yyyy" does not matter to me, so I want to ignore it and replace it with "T00:00:00Z". I have written the following command to do so:
SEPAR="\([ \/._-]\)\{1\}";
sed -i "s/\(0[1-9]\|1[012]\)$SEPAR\([123][0]\|[012][1-9]\|3[1]\)$SEPAR\(\(10\|20\)[0-9][0-9]\).*~/\5\-\1\-\3T00:00:00Z~/g" $file_name;
But it replaces everything up to the last column; for example, it generates the following output (please note that two columns are missing):
ab c1~2010-10-20T00:00:00Z~530.55
ab c2~2010-10-10T00:00:00Z~531
abc3~2010-10-10T00:00:00Z~532.44
abc4~2010-10-11T00:00:00Z~533
And my expected output is :
ab c1~2010-10-20T00:00:00Z~1234~10.02~530.55
ab c2~2010-10-10T00:00:00Z~12346~11.03~531
abc3~2010-10-10T00:00:00Z~122~12~532.44
abc4~2010-10-11T00:00:00Z~110~13~533
Please help me fix the last part, ".*~", which is greedily consuming everything up to the last ~.
You can use awk for this:
awk 'BEGIN{FS=OFS="~"} {
    sub(/[T ].*/, "", $2)
    split($2, a, /\//)
    $2 = a[3] "-" a[1] "-" a[2] "T00:00:00Z"
} 1' file
ab c1~2010-10-20T00:00:00Z~1234~10.02~530.55
ab c2~2010-10-10T00:00:00Z~12346~11.03~531
abc3~2010-10-10T00:00:00Z~122~12~532.44
abc4~2010-10-11T00:00:00Z~110~13~533
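To answer the literal sed question as well: the culprit is the greedy .*~, which runs to the last ~ on the line; replacing it with the negated class [^~]*~ stops at the first one. A sketch along those lines (GNU sed for the \| alternation; the group around the separator is dropped, which renumbers the backreferences):
SEPAR="[ /._-]"
sed "s/\(0[1-9]\|1[012]\)$SEPAR\([123]0\|[012][1-9]\|31\)$SEPAR\(\(10\|20\)[0-9][0-9]\)[^~]*~/\3-\1-\2T00:00:00Z~/" "$file_name"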

Analyze a control table by Shell Script

A shell script is analysing a control table to get the right parameters for its processing.
Currently, it is simple: grep points it to the correct line, and awk {print $n} picks out the right columns.
Columns are separated by spaces only. No special rules, just values separated by spaces.
All is fine and working, and the users like it.
That is, as long as none of the columns is left empty. It is OK to leave the last column empty, but if somebody does not fill in a column in the middle, it confuses the awk {print $n} logic.
Of course, one could ask the users to fill in every entry, or one could define the column delimiter as ";".
In case something is skipped, one could then write ";;". However, I would prefer not to change the table style.
So the question is:
How can I reliably analyze a table that has blanks in column values? The table is like this:
ApplikationService ServerName    PortNumber ControlValue_1 ControlValue_2
Read               chavez.com    3599       john           doe
Write                            3345       johnny         walker
Update             curiosity.org            jerry
What might be of some help: if a value is set in a column, it sits (more or less precisely) under its column header.
Cheers,
Tarik
You don't say what your desired output is but this shows you the right approach:
$ cat tst.awk
NR==1 {
    print
    while ( match($0,/[^[:space:]]+[[:space:]]*/) ) {
        width[++i] = RLENGTH
        $0 = substr($0,RSTART+RLENGTH)
    }
    next
}
{
    i = 0
    while ( (fld = substr($0,1,width[++i])) != "" ) {
        gsub(/^ +| +$/,"",fld)
        printf "%-*s", width[i], (fld == "" ? "[empty]" : fld)
        $0 = substr($0,width[i]+1)
    }
    print ""
}
$
$ awk -f tst.awk file
ApplikationService ServerName    PortNumber ControlValue_1 ControlValue_2
Read               chavez.com    3599       john           doe
Write              [empty]       3345       johnny         walker
Update             curiosity.org [empty]    jerry          [empty]
It uses the width of each field in the title line to determine the width of every field in every line of the file, then replaces empty fields with the string "[empty]" and left-aligns every field, just to pretty it up a bit.
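An alternative sketch in the same spirit, assuming GNU awk (FIELDWIDTHS is gawk-specific): derive the widths from the header once, after which an empty column comes back as an empty field instead of shifting the $n numbering. The layout and the [empty] marker are as above; fields are printed space-separated here.
gawk '
NR == 1 {                                 # derive FIELDWIDTHS from the header columns
    hdr = $0
    while (match(hdr, /[^[:space:]]+[[:space:]]*/)) {
        fw = fw RLENGTH " "
        ncols++
        hdr = substr(hdr, RSTART + RLENGTH)
    }
    FIELDWIDTHS = fw                      # fixed-width splitting applies from the next record on
    print
    next
}
{
    out = ""
    for (i = 1; i <= ncols; i++) {        # $i beyond NF is simply empty on short lines
        f = $i
        gsub(/^ +| +$/, "", f)            # trim the padding from each slice
        out = out (i > 1 ? " " : "") (f == "" ? "[empty]" : f)
    }
    print out
}' file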

AWK between 2 patterns - first occurence

I have this example of an ini file. I need to extract the names between the two patterns [Name_Z1] and OBJ=Name_Z1 and put each of them on its own line.
The problem is that there is more than one occurrence of Name_Z1 and OBJ=Name_Z1, and I only need the first occurrence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a larger script in bash and I am stuck on this. I would prefer awk to do the job.
My great appreciation to whoever can help me. Thank you!
For Wintermute's solution: the [Name_Z1] part actually looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: in the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; at the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
A (g)awk way that doesn't need a set number of fields (although I have assumed that contains; will always be on the line you need the names from):
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - increments x when Z1 is found; it is also part of a condition, so x must be non-zero to continue.
match($0,/contains;([^|]+)/,a) - matches contains; and then captures everything after it up to the |, storing the capture in a; again a condition, so it must succeed to continue.
gsub(";","\n",a[1]) - substitutes all the ; for newlines in the capture group a[1].
{print a[1];exit} - if all conditions are met, print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
Awk solution would be
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field seperator as ;
/Name_Z1/{f++} matches the line with pattern /Name_Z1/ If matched increment {f++}
f==1 && /Names/{print $2,$3,$4} is same as if f == 1 and maches pattern Name with line if true, then print the the columns 2 3 and 4 (delimted by ;)
OFS="\n" sets the output filed seperator as \n new line
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
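If the field positions vary, the hard-coded i<15 range can be avoided by scanning for the contains marker and stopping at PerceivedSeverity. A sketch under the same -F"[;|]" splitting (marker names taken from the updated data):
awk -F"[;|]" '/Z1/{f++} f==1 { for (i=1; i<=NF; i++) if ($i == "contains") { for (j=i+1; j<=NF && $j != "PerceivedSeverity"; j++) print $j; exit } }' input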
Here is a more generic solution for data in groups of blocks. Note that RS= enables paragraph mode, so it assumes the blocks are separated by blank lines.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" ' # By setting RS to nothing, one record equals one block. Then FS is set to one line as a field
/^\[Name_Z1\]/ { # Search for block with [Name_Z1]
n=split($3,a,";") # Split field 3, the names and store number of fields in variable n
for (i=2;i<=n;i++) # Loop from second to last field
print a[i] # Print the fields
exit # Exits after first find
' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names/ {
    gsub("Names;","",$0)
    gsub(";","\n",$0)
    print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
    s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means that, between the first pair of those lines, s will be equal to one. That's where the other condition comes in: when s is one and you find a line starting with Names, you do two substitutions.
The first gets rid of the Names; at the front; the second replaces all remaining ; characters with newlines. Then you print the result.
The output for your given test data is, as expected:
Jhon
Alex
Smith

speed up my awk command? Answer must be awk :)

I have some awk code that is running really slow. The format of my file is tab delimited 5 column ASCII. I am operating on column 5 to get a count of appropriate characters to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5, I want to count neither it nor the character that follows it.
Then I want to find out how many characters are ">", "<" or "*" and remove them from the count. I'm guessing that using a gsub and 3 splits is less than ideal, especially since column 5 can occasionally be a very, very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?
Well, as I see it, your gsub pattern will not work, as the / was not closed. Anyway, if I get it correctly, you want the character count of $5 without some characters, so I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
You should list your skippable characters between [ and ], and do not start the list with ^ (a leading ^ negates the class)!
Do you need to use awk, or will this work instead?
cut -f 5 < "$file" | sed 's/\^.//g' | tr -d '<>*\n' | wc -c
Translation:
Extract the 5th field from the tab-delimited $file.
Delete each ^ together with the character that follows it.
Delete the characters <, >, * and any newlines.
Count the remaining characters.
Here's a guess:
awk '
BEGIN { FS = OFS = "\t" }
{
    str = $5
    gsub(/\^.|[><*]/, "", str)
    l = length(str)
}
'
This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
if you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
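One caveat with the one-liners above: the file is tab-delimited, and assigning to $4 rebuilds the record with the default output separator (a space). A sketch that keeps the tabs intact by setting FS and OFS:
awk 'BEGIN{FS=OFS="\t"} /[><*^]/{t=$5; gsub(/\^.|[><*]/,"",t); $4=length(t)} 1' file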
