Replace till first occurrence of delimiter using sed - bash

I am trying to write a sed command which replace the string till only first occurrence of the delimiter. For example, I have following lines in a file where '~' is delimiter:
ab c1~10/20/2010 00:00:00 ~1234~10.02~530.55
ab c2~10/10/2010T00:00Z:~12346~11.03~531
abc3~10/10/2010 00:00:00 00-000~122~12~532.44
abc4~10/11/2010~110~13~533
I want to replace all dates (second column) to "2010-10-10T00:00:00Z" this format. As you can see dates can be in different formats, content after "MM/dd/yyyy" does not matter to me, I want to ignore that and replace that with "T00:00:00Z". I have written following command to do so :
SEPAR="\([ \/._-]\)\{1\}";
sed -i "s/\(0[1-9]\|1[012]\)$SEPAR\([123][0]\|[012][1-9]\|3[1]\)$SEPAR\(\(10\|20\)[0-9][0-9]\).*~/\5\-\1\-\3T00:00:00Z~/g" $file_name;
But it replaces everything till the last column, for example it generates the following output (please note two columns are missing):
ab c1~2010-10-20T00:00:00Z~530.55
ab c2~2010-10-10T00:00:00Z~531
abc3~2010-10-10T00:00:00Z~532.44
abc4~2010-10-11T00:00:00Z~533
And my expected output is :
ab c1~2010-10-20T00:00:00Z~1234~10.02~530.55
ab c2~2010-10-10T00:00:00Z~12346~11.03~531
abc3~2010-10-10T00:00:00Z~122~12~532.44
abc4~2010-10-11T00:00:00Z~110~13~533
Please help me writing the last part ".*~" which is replacing everything.

You can use awk for this:
awk 'BEGIN{FS=OFS="~"} {
sub(/[T ].*/, "", $2)
split($2, a, /\//)
$2 = a[3] "-" a[1] "-" a[2] "T00:00:00Z"
} 1' file
ab c1~2010-10-20T00:00:00Z~1234~10.02~530.55
ab c2~2010-10-10T00:00:00Z~12346~11.03~531
abc3~2010-10-10T00:00:00Z~122~12~532.44
abc4~2010-10-11T00:00:00Z~110~13~533

Related

How Can I Use Sort or another bash cmd To Get 1 line from all the lines if 1st 2nd and 3rd Field are The same

I have a file named file.txt
$cat file.txt
1./abc/cde/go/ftg133333.jpg
2./abc/cde/go/ftg24555.jpg
3./abc/cde/go/ftg133333.gif
4./abt/cte/come/ftg24555.jpg
5./abc/cde/go/ftg133333.jpg
6./abc/cde/go/ftg24555.pdf
MY GOAL: To get only one line from lines who's first, second and third PATH are the same and have the same file EXTENSION.
Note each PATH is separated by forward slash "/". Eg in the first line of the list, the first PATH is abc, second PATH is cde and third PATH is go.
File EXTENSION is .jpg, .gif,.pdf... always at the end of the line.
HERE IS WHAT I TRIED
sort -u -t '/' -k1 -k2 -k3
My thoughts
Using / as a delimiter gives me 4 fields in each line. Sorting them with "-u" will remove all but 1 line with unique First, Second and 3rd field/PATH. But obviously, I didn't take into account the EXTENSION(jpg,pdf,gif) in this case.
MY QUESTION
I need a way to grep only 1 of the lines if the first, second and third field are same and have the same EXTENSION using "/" as delimiter to divide it into fields. I want to output it to a another file, say file2.txt.
In the file2.txt, how do I add a word say "KALI" before the extension in each line, so it will look something like /abc/cde/go/ftg13333KALI.jpg using line 1 as an example in file.txt above.
Desired Output
/abc/cde/go/ftg133333KALI.jpg
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
COMMENT
Line 1,2 & 5 have the same 1st,2nd and 3rd field, with same file extension
".jpg" so only line 1 should be in the output.
Line 3 is in the output even though it has same 1st,2nd and 3rd field with
1,2 and 5, because the extension is different ".gif".
Line 4 has different 1st, 2nd and 3rd field, hence it in output.
Line 6 is in the output even though it has same 1st,2nd and 3rd field with
1,2 and 5, because the extension is different ".pdf".
$ awk '{ # using awk
n=split($0,a,/\//) # split by / to get all path components
m=split(a[n],b,".") # split last by . to get the extension
}
m>1 && !seen[a[2],a[3],a[4],b[m]]++ { # if ext exists and is unique with 3 1st dirs
for(i=2;i<=n;i++) # loop component parts and print
printf "/%s%s",a[i],(i==n?ORS:"")
}' file
Output:
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
I split by / separately from .s in case there are .s in dir names.
Missed the KALI part:
$ awk '{
n=split($0,a,/\//)
m=split(a[n],b,".")
}
m>1&&!seen[a[2],a[3],a[4],b[m]]++ {
for(i=2;i<n;i++)
printf "/%s",a[i]
for(i=1;i<=m;i++)
printf "%s%s",(i==1?"/":(i==m?"KALI.":".")),b[i]
print ""
}' file
Output:
/abc/cde/go/ftg133333KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg24555KALI.pdf
Using awk:
$ awk -F/ '{ split($5, ext, "\\.")
if (!(($2,$3,$4,ext[2]) in files)) files[$2,$3,$4,ext[2]]=$0
}
END { for (f in files) {
sub("\\.", "KALI.", files[f])
print files[f]
}}' input.txt
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
/abc/cde/go/ftg133333KALI.jpg
another awk
$ awk -F'[./]' '!a[$2,$3,$4,$NF]++' file
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
assumes . doesn't exist in directory names (not necessarily true in general).

Replace string with pattern keeping the string intact

Would to replace the string either with sed or awk, where it identifies a patter as mentioned below :
Example: It looks for word starting with "XX" and ending with "XX" and replace the word with concatinating with "${hf:" at the start of "XX" and "}" at the end of "XX"
INPUT
CREATE TABLE XX_DB_XX.test_XX_YYYYMMDD_XX
AS
SELECT id
FROM XX_R_DB_XX.usr_XX_YYYYMMDD_XX
WHERE year = XX_YYYY_XX
AND month = XX_MM_XX
AND day = XX_DD_XX;
OUTPUT
CREATE TABLE ${hf:XX_DB_XX}.test_${hf:XX_YYYYMMDD_XX}
AS
SELECT id
FROM ${hf:XX_R_DB_XX}.usr_${hf:XX_YYYYMMDD_XX}
WHERE year = ${hf:XX_YYYY_XX}
AND month = ${hf:XX_MM_XX}
AND day = ${hf:XX_DD_XX};
Tried to replace the pattern matchin, but the issue is , in the output I want replace the $A , to the subsquet "XX_(*)_XX" string available over the input file.
cat test.hql | gawk '{ print gensub(/XX_+[A-Z,_]+_XX/, "${hiveconf:$A}", 1)
}' | gawk '{ print gensub(/XX_+[A-Z]+_XX/, "${hiveconf:$A}", 1) }'
OUTPUT: -> That I recived needs to be updated w.r.t the string available, So how can this be done:
CREATE TABLE ${hiveconf:$A}.test_${hiveconf:$A}
AS
SELECT id
FROM ${hiveconf:$A}.usr_${hiveconf:$A}
WHERE year = ${hiveconf:$A}
AND month = ${hiveconf:$A}
AND day = ${hiveconf:$A};
Following awk may help you on same.
awk '{gsub(/XX_[a-zA-Z]+_XX/,"${hf:&}")} 1' Input_file
That's what sed exists for,
sed 's/XX[[:alnum:]_]*XX/${hf:&}/g' file
[[:alnum:]_] stands for Alpha numeric or underscore. The append * means zero or more times of it in regular expression.
Or you could do
sed 's/\(XX[^'XX']*XX\)/${hf:\1}/g'
in cases where there may be non alphanumeric characters as well in between the XXs.
First an XX is matched after which waits till an XX is found.

How to replace last n characters in the kth occurence of a line containing a certain substring using sed or awk?

Suppose I have a file that resembles the following format:
\\ Random other lines \\
...
27861NA+ NA+89122 13.480 11.554 10.082
27862NA+ NA+89123 2.166 5.896 10.108
27863NA+ NA+89124 8.289 6.843 3.090
27864NA+ NA+89125 12.972 5.936 4.498
27865CL- CL-89126 13.914 2.125 12.915
27866CL- CL-89127 12.050 13.907 3.559
...
\\ Random other lines \\
I am trying to find a way of replacing the last 24 characters of each line with a string that I have prepared, for the first 3 instances of lines in the file that contain the string "NA+".
For example, my output would ideally look like:
\\ Random other lines \\
...
27861NA+ NA+89122 my first string hello
27862NA+ NA+89123 my second string foo
27863NA+ NA+89124 my final string bar $$
27864NA+ NA+89125 12.972 5.936 4.498
27865CL- CL-89126 13.914 2.125 12.915
27866CL- CL-89127 12.050 13.907 3.559
...
\\ Random other lines \\
So far, I have found a sed command that will remove the last 24 characters from every line in the file:
sed 's/.\{24\}$//' myfile.txt
And also an awk command that will return the kth line that contains the desired substring:
awk '/NA+/{i++}i==1' myfile.txt
Does anyone have an idea about how I could replace the last 24 characters in the 1st, 2nd, and 3rd lines of my file that each contain a certain substring?
With single awk:
awk -v str="my string" '!f && /NA\+/{ f=1; n=NR+3 }n && n>NR{ $4=$5=""; $3=str }1' myfile.txt
string="my first string hello"
awk -v string="$string" '{ if ( $0 ~ "NA" ) {cnt++} if (cnt < 4 ) { print substr($0,1,length($0)-23)string } else { print }}' NA
Using awk, set a string and pass it awk with -v. Search for strings containing NA and the increment the variable cnt. When cnt is less that 4, print the everything but the last 23 characters adding the string passed to the end. Otherwise print the line.
This might work for you (GNU sed):
sed '/NA+/{x;s/\n/&/3;x;ta;H;s/.\{24\}$/some string/;b;:a;n;ba}' file
This uses the hold space (HS) to keep a count of the number of lines the script has seen of the required string (NA+). Once it has seen n (in this case n=3) such lines it just prints the remainder of the file.

Display column from empty column (fixed width and space delimited) in bash

I have log file (in txt) with the following text
UNIT PHYS STATE LOCATION INFO
TCSM-1098 SE-NH -
ETPE-5-0 1403 SE-OU BCSU-1 ACTV FLTY
ETIP-6 1402 SE-NH -
They r delimited by space...
How am I acquired the output like below?
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
Thank in advance
This is what I've tried so far
cat file.txt | awk 'BEGIN { FS = "[[:space:]][[:space:]]+" } {print $1,$2,$3,$4}' | sed 's/ /|/g'
It produces output like this
|UNIT|PHYS|STATE|LOCATION|INFO|
|TCSM-1098|SE-NH|-|
|ETPE-5-0|1403|SE-OU|BCSU-1|ACTV|FLTY
|ETIP-6|1402|SE-NH|-|
The column isn't excatly like what I hope for
It seems it's not delimited but fixed-width format.
$ perl -ple '
$_ = join "|",
map {s/^\s+|\s+$//g;$_}
unpack ("a11 a5 a6 a22 a30",$_);
' <file.txt
how it works
-p switch : loop over input lines (default var: $_) and print it
-l switch : chomp line ending (\n) and add it to output
-e : inline command
unpack function : takes defined format and input line and returns an array
map function : apply block to each element of array: regex to remove heading trailing spaces
join function : takes delimiter and array and gives string
$_ = : affects the string to default var for output
Perl to the rescue!
perl -wE 'my #lengths;
$_ = <>;
push #lengths, length $1 while /(\S+\s*)/g;
$lengths[-1] = "*";
my $f;
say join "|",
map s/^\s+|\s+$//gr,
unpack "A" . join("A", #lengths), $_
while (!$f++ or $_ = <>);' -- infile
The format is not whitespace separated, it's a fixed-width.
The #lengths array will be populated by the widths of the columns taken from the first line of the input. The last column width is replaced with *, as its width can't be deduced from the header.
Then, an unpack template is created from the lengths that's used to parse the file.
$f is just a flag that makes it possible to apply the template to the header line itself.
With GNU awk for FIELDWITDHS to handle fixed-width fields:
awk -v FIELDWIDTHS='11 5 6 22 99' -v OFS='|' '{$1=$1; gsub(/ *\| */,"|"); sub(/ +$/,"")}1' file
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
I think it's pretty clear and self-explanatory but let me know if you have any questions.
Manually, in awk:
$ awk 'BEGIN{split("11 5 6 23 99", cols); }
{s=0;
for (i in cols) {
field = substr($0, s, cols[i]);
s += cols[i];
sub(/^ */, "", field);
sub(/ *$/, "", field);
printf "%s|", field;
};
printf "\n" } ' file
UNIT|PHYS|STATE|LOCATION|INFO|
TCSM-1098||SE-NH||-|
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY|
ETIP-6|1402|SE-NH||-|
The widths of the columns are set in the BEGIN block, then for each line we take substrings of the line of the required length. s counts the starting position of the current column, the sub() calls remove leading and trailing spaces. The code as such prints a trailing | on each line, but that can be worked around by making the first or last column a special case.
Note that the last field is not like in your output, it's hard to tell where the split between ACTV and FLTY should be. Is that fixed width too, or is the space a separator there?

AWK between 2 patterns - first occurence

I am having this example of ini file. I need to extract the names between 2 patterns Name_Z1 and OBJ=Name_Z1 and put them each on a line.
The problem is that there are more than one occurences with Name_Z1 and OBJ=Name_Z1 and i only need first occurence.
[Name_Z5]
random;text
Names;Jesus;Tom;Miguel
random;text
OBJ=Name_Z5
[Name_Z1]
random;text
Names;Jhon;Alex;Smith
random;text
OBJ=Name_Z1
[Name_Z2]
random;text
Names;Chris;Mara;Iordana
random;text
OBJ=Name_Z2
[Name_Z1_Phone]
random;text
Names;Bill;Stan;Mike
random;text
OBJ=Name_Z1_Phone
My desired output would be:
Jhon
Alex
Smith
I am currently writing a more ample script in bash and i am stuck on this. I prefer awk to do the job.
My greatly appreciation for who can help me. Thank you!
For Wintermute solution: The [Name_Z1] part looks like this:
[CAB_Z1]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;AIRE;ALIMENTA;BATER;CONVERTIDOR;DISTRIBUCION;FUEGO;HURTO;MAINS;MALLO;MAYOR;MENOR;PANEL;TEMP
NAME=CAB_Z1
And the [Name_Z1_Phone] part looks like this:
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
The fix should be somewhere around the "|PerceivedSeverity"
Expected Output:
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
This should work:
sed -n '/^\[Name_Z1/,/^OBJ=Name_Z1/ { /^Names/ { s/^Names;//; s/;/\n/g; p; q } }' foo.txt
Explanation: Written readably, the code is
/^\[Name_Z1/,/^OBJ=Name_Z1/ {
/^Names/ {
s/^Names;//
s/;/\n/g
p
q
}
}
This means: In the pattern range /^\[Name_Z1/,/^OBJ=Name_Z1/, for all lines that match the pattern /^Names/, remove the Names; in the beginning, then replace all remaining ; with newlines, print the whole thing, and then quit. Since it immediately quits, it will only handle the first such line in the first such pattern range.
EDIT: The update made things a bit more complicated. I suggest
sed -n '/^\[CAB_Z1/,/^NAME=CAB_Z1/ { /^FilterAttr=/ { s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/; s/;/\n/g; p; q } }' foo.txt
The main difference is that instead of removing ^Names from a line, the substitution
s/^.*contains;\(.*\)|PerceivedSeverity.*$/\1/;
is applied. This isolates the part between contains; and |PerceivedSeverity before continuing as before. It assumes that there is only one such part in the line. If the match is ambiguous, it will pick the one that appears last in the line.
An (g)awk way that doesn't need a set number of fields(although i have assumed that contains; will always be on the line you need the names from.
(g)awk '(x+=/Z1/)&&match($0,/contains;([^|]+)/,a)&&gsub(";","\n",a[1]){print a[1];exit}' f
Explanation
(x+=/Z1/) - Increments x when Z1 is found. Also part of a
condition so x must exist to continue.
match($0,/contains;([^|]+)/,a) - Matches contains; and then captures everything after
up to the |. Stores the capture in a. Again a
condition so must succeed to continue.
gsub(";","\n",a[1]) - Substitutes all the ; for newlines in the capture
group a[1].
{print a[1];exit}' - If all conditions are met then print a[1] and exit.
This way should work in (m)awk
awk '(x+=/Z1/)&&/contains/{split($0,a,"|");y=split(a[2],b,";");for(i=3;i<=y;i++)
print b[i];exit}' file
sed -n '/\[Name_Z1\]/,/OBJ=Name_Z1$/ s/Names;//p' file.txt | tr ';' '\n'
That is sed -n to avoid printing anything not explicitly requested. Start from Name_Z1 and finish at OBJ=Name_Z1. Remove Names; and print the rest of the line where it occurs. Finally, replace semicolons with newlines.
Awk solution would be
$ awk -F";" '/Name_Z1/{f=1} f && /Names/{print $2,$3,$4} /OBJ=Name_Z1/{exit}' OFS="\n" input
Jhon
Alex
Smith
OR
$ awk -F";" '/Name_Z1/{f++} f==1 && /Names/{print $2,$3,$4}' OFS="\n" input
Jhon
Alex
Smith
-F";" sets the field seperator as ;
/Name_Z1/{f++} matches the line with pattern /Name_Z1/ If matched increment {f++}
f==1 && /Names/{print $2,$3,$4} is same as if f == 1 and maches pattern Name with line if true, then print the the columns 2 3 and 4 (delimted by ;)
OFS="\n" sets the output filed seperator as \n new line
EDIT
$ awk -F"[;|]" '/Z1/{f++} f==1 && NF>1{for (i=5; i<15; i++)print $i}' input
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
Here is a more generic solution for data in group of blocks.
This awk does not need the end tag, just the start.
awk -vRS= -F"\n" '/^\[Name_Z1\]/ {n=split($3,a,";");for (i=2;i<=n;i++) print a[i];exit}' file
Jhon
Alex
Smith
How it works:
awk -vRS= -F"\n" ' # By setting RS to nothing, one record equals one block. Then FS is set to one line as a field
/^\[Name_Z1\]/ { # Search for block with [Name_Z1]
n=split($3,a,";") # Split field 3, the names and store number of fields in variable n
for (i=2;i<=n;i++) # Loop from second to last field
print a[i] # Print the fields
exit # Exits after first find
' file
With updated data
cat file
data
[CAB_Z1_FUEGO]
READ_ONLY=false
FilterAttr=CeaseTime;blank|ObjectOfReference;contains;511047;512044;513008;593026;598326;CL5518;CL5521;CL5538;CL5612;CL5620|PerceivedSeverity;=;Critical;Major;Minor|ProbableCause;!=;HOUSE ALARM;IO DEVICE|ProblemText;contains;FUEGO
NAME=CAB_Z1_FUEGO
data
awk -vRS= -F"\n" '/^\[CAB_Z1_FUEGO\]/ {split($3,a,"|");n=split(a[2],b,";");for (i=3;i<=n;i++) print b[i]}' file
511047
512044
513008
593026
598326
CL5518
CL5521
CL5538
CL5612
CL5620
The following awk script will do what you want:
awk 's==1&&/^Names/{gsub("Names;","",$0);gsub(";","\n",$0);print}/^\[Name_Z1\]$/||/^OBJ=Name_Z1$/{s++}' inputFileName
In more detail:
s==1 && /^Names;/ {
gsub ("Names;","",$0);
gsub(";","\n",$0);
print
}
/^\[Name_Z1\]$/ || /^OBJ=Name_Z1$/ {
s++
}
The state s starts with a value of zero and is incremented whenever you find one of the two lines:
[Name_Z1]
OBJ=Name_Z1
That means, between the first set of those lines, s will be equal to one. That's where the other condition comes in. When s is one and you find a line starting with Names;, you do two substitutions.
The first is to get rid of the Names; at the front, the second is to replace all ; semi-colon characters with a newline. Then you print it out.
The output for your given test data is, as expected:
Jhon
Alex
Smith

Resources