Sed command to replace numbers between space and : - shell

I have a file with records like the one below:
FIRST 1: SECOND 2: THREE 4: FIVE 255: SIX 255
I want to remove the values between the space and the colon, to get:
FIRST:SECOND:THREE:FIVE:SIX
My attempt was:
awk -F '[[:space:]]*,:*' '{$1=$1}1' OFS=, file

Tried on GNU awk:
awk -F' [0-9]*(: *|$)' -vOFS=':' '{print $1,$2,$3,$4,$5}' file
Tried on GNU sed:
sed -E 's/\s+[0-9]+(:|$)\s*/\1/g' file
Explanation of the awk command:
The field separator regex is: a space, followed by [0-9]* (a run of digits), followed by either a literal colon plus optional spaces (: *) or the end of the line ($). Everything between such matches, i.e. FIRST, SECOND and so on, becomes a field ($1, $2, ...), because the -F option sets the field separator (FS) and the fields are always whatever lies between separators. The output still needs to look nice, i.e. have a separator of its own, so that is the colon set via the awk variable definition -vOFS=':'.
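Both produce the desired output; a quick check, assuming the sample record above is saved as file:
$ awk -F' [0-9]*(: *|$)' -vOFS=':' '{print $1,$2,$3,$4,$5}' file
FIRST:SECOND:THREE:FIVE:SIX
$ sed -E 's/\s+[0-9]+(:|$)\s*/\1/g' file
FIRST:SECOND:THREE:FIVE:SIX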

You can also add [[:digit:]] with a trailing asterisk to the separator, and leave OFS empty (nothing after OFS=):
$ awk -F '[[:space:]][[:digit:]]*' '{$1=$1}1' OFS= file
FIRST:SECOND:THREE:FIVE:SIX
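To see why an empty OFS works here, note that the colons survive the splitting as fields of their own; a quick inspection of the fields (same sample file assumed):
$ awk -F '[[:space:]][[:digit:]]*' '{for (i = 1; i <= NF; i++) printf "[%s]", $i; print ""}' file
[FIRST][:][SECOND][:][THREE][:][FIVE][:][SIX][]
The final [] is the empty field left behind by the trailing " 255", so joining all fields with an empty OFS yields exactly the desired output.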

To get the output we want in idiomatic awk, we make the input field separator (with -F) contain all the stuff we want to eliminate (anchored with :), and make the output field separator (OFS) what we want it replaced with. The catch is that this won't eliminate the space and numbers at the end of the line, and for that we need to do something more. GNU's implementation of awk will allow us to use a regular expression for the input record separator (RS), but we could just do a simple sub() with POSIX-compliant awk as well. Finally, force recalculation via $1=$1... the side effects of this pattern/statement are that the record is rebuilt, doing the FS/OFS substitution for us, and that non-blank lines take the default action -- which is to print.
gawk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: -v RS='[[:space:]]*[[:digit:]]*\n' '$1=$1' file
Or:
awk -F '[[:space:]]*[[:digit:]]*:[[:space:]]*' -v OFS=: '{ sub(/[[:space:]]*[[:digit:]]*$/, "") } $1=$1' file
A sed implementation is fun but probably slower (because current versions of awk have better regex implementations).
sed 's/[[:space:]]*[[:digit:]]*:[[:space:]]/:/g; s/[[:space:]]*[[:digit:]]*[[:space:]]*$//' file
Or if POSIX character classes are not available...
sed 's/[\t ]*[0-9]*:[\t ]/:/g; s/[\t ]*[0-9]*[\t ]*$//' file
Something tells me that your "FIRST, SECOND, THIRD..." might be more complicated, and might contain digits... in this case, you might want to experiment with replacing * with + for awk or with \+ for sed.
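For instance, with GNU sed (which supports \+ in basic regular expressions), the last command above might become something like:
sed 's/[[:space:]]\+[[:digit:]]\+:[[:space:]]\+/:/g; s/[[:space:]]\+[[:digit:]]\+[[:space:]]*$//' file
Requiring at least one whitespace character and one digit before the colon would keep a hypothetical name like PHASE2 intact.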

"grep" a csv file including multi-lines fields?

file.csv:
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
I want to grep the "XA100" entry like this:
grep XA100 file.csv
to obtain this result:
XA100;"this is
the multi-line"
but grep returns only one line:
XA100;"this is
file.csv contains 3 entries.
The "XA100" entry contains a multi-line field.
And grep doesn't seem to be the right tool to "grep" a CSV file including multi-line fields.
Do you know a way to do the job?
Edit: the real-world file contains many columns. The search term can be in any column (not at the beginning of the line, nor at the beginning of a field). All fields are enclosed in ". Any field can span multiple lines, from 1 to any number, and this cannot be predicted.
Give this line a try:
awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' file
I extended your example a bit:
kent$ cat f
XA90;"standard"
XA100;"this is
the
multi-
line"
XA110;"other standard"
kent$ awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' f
XA100;"this is
the
multi-
line"
In the comments you mention that in the real-world file each line starts with ". I assume they also end with ", and present you this:
Test file:
$ cat file
"single line"
"multi-
lined"
Code and outputs:
$ awk 'BEGIN{RS=ORS="\"\n"} /single/' file
"single line"
$ awk 'BEGIN{RS=ORS="\"\n"} /m/' file
"multi-
lined"
You can also parametrize the search:
$ awk -v s="multi" 'BEGIN{RS=ORS="\"\n"} match($0,s)' file
"multi-
lined"
Try:
Solution 1:
awk -v RS="XA" 'NR==3{gsub(/$\n$/,"");print RS $0}' Input_file
This makes the record separator the string XA (a multi-character RS needs GNU awk), selects the 3rd record, globally substitutes the $\n$ (intended to strip the extra newline at the end of the record) with nothing, and then prints the record separator followed by the current record.
Solution 2:
awk '/XA100/{print; while((getline) > 0 && $0 !~ /^XA/) print}' Input_file
This looks for the string XA100 and prints the current line, then uses getline in a while loop to read and print the following lines until one starts with XA; checking getline's return value ensures the loop also stops at end of file if the matched entry is the last one.
If this file was exported from MS Excel or similar, then lines end with \r\n while the newlines inside quotes are just \ns, so all you need is:
$ awk -v RS='\r\n' '/XA100/' file
XA100;"this is
the multi-line"
The above uses GNU awk for multi-char RS. On some platforms, e.g. cygwin, you'll have to add -v BINMODE=3 so gawk sees the \rs rather than them getting stripped by underlying C primitives.
Otherwise, it's extremely hard to parse CSV files in general without a real CSV parser (which awk currently doesn't have but is in the works for GNU awk) but you could do this (again with GNU awk for multi-char RS):
$ cat file
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
$ awk -v RS="\"[^\"]*\"" -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
XA90;"standard"
XA100;"this is the multi-line"
XA110;"other standard"
This replaces all newlines within quotes with blank chars so that you can then process the result as a regular one-line-per-record file.
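Combining that flattening with an ordinary per-line search then answers the original question (a sketch, assuming GNU awk; single quotes are used here to simplify the RS quoting):
$ awk -v RS='"[^"]*"' -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file | grep XA100
XA100;"this is the multi-line"
Note that the multi-line field comes back joined with blanks rather than in its original form.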
Using PS's response, this works for the small example:
sed 's/^X/\n&/' file.csv | awk -v RS= '/XA100/ {print}'
For my real-world CSV file, with many columns, with the search term possibly anywhere, with an unknown number of multi-line fields, with " characters escaped as "", with continuation lines beginning with ", and with all fields enclosed in ", this works. Note the exclusion of a second " character in the sed part:
sed 's/^"[^"]/\n&/' file.csv | awk -v RS= '/RESEARCH_TERM/ {print}'
That works because the first column of an entry cannot start with "": the first column always looks like "XXXXXXXXX", where X is any character but ".
Thank you all for so many responses; other solutions may work depending on the CSV file format you use.

Shell Script Replace a Specified Column with sed

I have an example dataset separated by semicolons, as below:
123;IZMIR;ZMIR;123
abc;ANKAR;aaa;999
AAA;ZMIR;ZMIR;bob
BBB;ANKR;RRRR;ABC
I would like to replace values in a specified column. Let's say I want to change "ZMIR" to "IZMIR", but only in the third column; the values in the second column must stay the same.
Desired output is;
123;IZMIR;IZMIR;123
abc;ANKAR;aaa;999
AAA;ZMIR;IZMIR;bob
BBB;ANKR;RRRR;ABC
I tried;
sed 's/;ZMIR;/;IZMIR;/' file.txt
The problem is that it changes all matching values in the file, not just the ones in the 3rd column.
I also tried;
awk -F";" '{gsub("ZMIR",";IZMIR;",$2)}1'
Here it targets the column, but it somehow adds spaces:
123 I;IZMIR; ZMIR 123
abc;ANKAR;aaa;999
AAA ;IZMIR; ZMIR bob
BBB;ANKR;RRRR;ABC
sed doesn't know about columns, awk does (but in awk they're called "fields"):
awk 'BEGIN{FS=OFS=";"} $3=="ZMIR"{$3="IZMIR"} 1' file
Note that since the above is doing a literal string search and replace, you don't have to worry about regexp or backreference metacharacters in the search or replacement strings, unlike in a sed solution (see https://stackoverflow.com/a/29626460/1745001).
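Running it against the sample data (assuming it is saved as file):
$ awk 'BEGIN{FS=OFS=";"} $3=="ZMIR"{$3="IZMIR"} 1' file
123;IZMIR;IZMIR;123
abc;ANKAR;aaa;999
AAA;ZMIR;IZMIR;bob
BBB;ANKR;RRRR;ABC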
wrt what you tried previously with awk:
awk -F";" '{gsub("ZMIR",";IZMIR;",$2)}1'
That says: find "ZMIR" in the 2nd semicolon-separated field, replace it with ";IZMIR;", and also change every existing ";" on the line to a blank character (assigning to a field makes awk rebuild the record using OFS, which defaults to a blank).
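If you prefer the gsub() style, a minimal repair of that attempt is to set OFS yourself, target the 3rd field, and anchor the pattern so only a whole-field match is replaced (a sketch along the same lines, not a different algorithm):
awk 'BEGIN{FS=OFS=";"} {gsub(/^ZMIR$/,"IZMIR",$3)} 1' file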
To learn awk, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
If you know exactly which line the word to replace is on and which occurrence of it you want, you could use sed with something like:
sed '3 s/ZMIR/IZMIR/2'
The 3 at the beginning selects the third line, and the 2 at the end selects the second occurrence. However, the awk solution is the better one; this is just so you know how it works in sed ;)
This might work for you (GNU sed):
sed -r 's/[^;]+/\n&\n/3;s/\nZMIR\n/IZMIR/;s/\n//g' file
Surround the required field by unique markers then replace the required string (plus markers) by the replacement string. Finally remove the unique markers.
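You can watch the markers appear by running just the first substitution on its own; the first input line becomes three lines, with the third field isolated between the newline markers:
$ sed -r 's/[^;]+/\n&\n/3' file | head -3
123;IZMIR;
ZMIR
;123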
Perl on Command Line
Input
123;IZMIR;ZMIR;123
abc;ANKAR;aaa;999
AAA;ZMIR;ZMIR;bob
BBB;ANKR;RRRR;ABC
$. == 1 means the first row, so the substitution is applied to that row only (the second row would be $. == 2).
$F[0] means the first column, and the substitution is applied to that column only (the fourth column would be $F[3]).
-a -F\; means that the delimiter is ;.
What you want:
perl -a -F\; -pe 's/$F[0]/***/ if $. == 1' your-file
output
***;IZMIR;ZMIR;123
abc;ANKAR;aaa;999
AAA;ZMIR;ZMIR;bob
BBB;ANKR;RRRR;ABC
for row == 2 and column == 2
perl -a -F\; -pe 's/$F[1]/***/ if $. == 2' your-file
123;IZMIR;ZMIR;123
abc;***;aaa;999
AAA;ZMIR;ZMIR;bob
BBB;ANKR;RRRR;ABC
Also without -a -F
perl -pe 's/123/***/ if $. == 1' your-file
output
***;IZMIR;ZMIR;123
abc;ANKAR;aaa;999
AAA;ZMIR;ZMIR;bob
BBB;ANKR;RRRR;ABC
If you want to edit the file you can add the -i option, which means edit in place; it simply finds, replaces and saves in the same file:
perl -i -a -F\; and so on
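For the actual question (replace ZMIR only when it is the whole third column, on every row), a sketch in the same spirit that rebuilds each line from @F:
perl -F';' -lane '$F[2] =~ s/^ZMIR$/IZMIR/; print join ";", @F' your-file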
You need to include some absolute references in the line:
^ for the beginning of the line
an unambiguous separator pattern
^.*ZMIR and [^;]*;ZMIR match different things: the first takes everything before ZMIR (sed takes the longest possible match), while the second is restricted to a single field before ZMIR.
Specific:
sed 's/^\([^;]*;[^;]*;\)ZMIR;/\1IZMIR;/' YourFile
Generic, where Old and New are shell variables (remember, these are used as regex values, so regex rules apply, e.g. some characters need escaping):
#Old='ZMIR'
#New='IZMIR'
sed 's/^\(\([^;]*;\)\{2\}\)'${Old}';/\1'${New}';/' YourFile
In this simple case sed is an alternative, but awk is better for a complex or long line.

How to extract specific string in a file using awk, sed or other methods in bash?

I have a file with the following text (multiple lines with different values):
TokenRange(start_token:8050285221437500528,end_token:8051783269940793406,...
I want to extract the value of start_token and end_token. I tried awk and cut, but I am not able to figure out the best way to extract the targeted values.
Something like:
cat filename| get the values of start_token and end_token
grep -oP '(?<=token:)\d+' filename
Explanation:
-o: print only part that matches, not complete line
-P: use Perl regex engine (for look-around)
(?<=token:): positive look-behind – zero-width pattern
\d+: one or more digits
Result:
8050285221437500528
8051783269940793406
A (potentially more efficient) variant of this, as pointed out by hek2mgl in his comment, uses \K, the variable-width look-behind:
grep -oP 'token:\K\d+'
\K keeps everything that has been matched to the left of it, but does not include it in the match (see perlre).
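If the two numbers are then needed in a script, one way to capture them is with two reads (a bash sketch, assuming exactly one TokenRange line in filename):
{ read -r start_token; read -r end_token; } < <(grep -oP 'token:\K\d+' filename)
echo "start=$start_token end=$end_token"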
Using awk:
awk -F '[(:,]' '{print $3, $5}' file
8050285221437500528 8051783269940793406
The first value is start_token and the second is end_token.
A sed version:
sed -e '/^TokenRange(/!d' -e 's/.*:\([0-9]*\),.*:\([0-9]*\),.*/\1 \2/' YourFile

Using a multi-character field separator in awk on Solaris

I wish to use a string (Birch) as a field delimiter in awk to print the second field. I am trying the following command:
cat tmp.log|awk -FBirch '{ print $2}'
The output below gets printed:
irch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Desired output:
2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Contents of tmp.log file.
-bash-3.2# cat tmp.log
Dec 05 13:49:23 [x.x.x.x.180.100] business-log-dev/int [TEST][0x80000001][business-log][info] mpgw(Test): trans(8497187)[request][10.x.x.x]:
Birch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Am I doing something wrong?
OS: Solaris10
Shell: Bash
I tried the command below, suggested in one of the answers. I am getting the desired output, but with an extra empty line at the top. How can this be eliminated from the output?
-bash-3.2# /usr/xpg4/bin/awk -FBirch '{print $2}' tmp.log
2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Originally, I suggested putting quotes around "Birch" (-F'Birch') but actually, I don't think that should make any difference.
I'm not at all experienced working with Solaris but you may want to also try using nawk ("new awk") instead of awk.
nawk -FBirch '{print $2}' file
If this works, you may want to consider creating an alias so that you always use the newer version of awk with more features.
You may also want to try using the version of awk in the /usr/xpg4/bin directory, which is a POSIX-compliant implementation and so should support a multi-character FS:
/usr/xpg4/bin/awk -FBirch '{print $2}' file
If you only want to print lines which have more than one field, you can add a condition:
/usr/xpg4/bin/awk -FBirch 'NF>1{print $2}' file
This only prints the second field when there is more than one field.
From the man page of the default awk on Solaris, /usr/bin/awk:
-Fc Uses the character c as the field separator
(FS) character. See the discussion of FS
below.
As you can see, Solaris /usr/bin/awk takes only a single character as the field separator.
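That single-character limit explains the mangled output in the question: only the B of Birch was used as the separator, leaving everything after it as $2. You can reproduce this with any awk by forcing a one-character separator:
$ echo 'Birch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success' | awk -FB '{print $2}'
irch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success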
Also in the man page is split
split(s, a, fs)
Split the string s into array elements a[1], a[2], ...
a[n], and returns n. The separation is done with the
regular expression fs or with the field separator FS if
fs is not given.
As you can see, split() takes a regular expression as the separator, so we can use:
awk 'split($0,a,"Birch")>1{print a[2]}' file
This prints the second field of lines split by Birch; the >1 check skips lines that do not contain the separator, so no empty lines are printed for them.

How do I print a field from a pipe-separated file?

I have a file with fields separated by pipe characters and I want to print only the second field. This attempt fails:
$ cat file | awk -F| '{print $2}'
awk: syntax error near line 1
awk: bailing out near line 1
bash: {print $2}: command not found
Is there a way to do this?
Or just use one command:
cut -d '|' -f FIELDNUMBER
The key point here is that the pipe character (|) must be escaped to the shell. Use "\|" or "'|'" to protect it from shell interpretation and allow it to be passed to awk on the command line.
Reading the comments, I see that the original poster presents a simplified version of the original problem, which involved filtering the file before selecting and printing the fields. A pass through grep was used and the result piped into awk for field selection. That accounts for the wholly unnecessary cat file that appears in the question (it replaces the grep <pattern> file).
Fine, that will work. However, awk is largely a pattern matching tool on its own, and can be trusted to find and work on the matching lines without needing to invoke grep. Use something like:
awk -F\| '/<pattern>/{print $2;}{next;}' file
The /<pattern>/ bit tells awk to perform the action that follows on lines that match <pattern>.
The lost-looking {next;} is a default action skipping to the next line in the input. It does not seem to be necessary, but I have this habit from long ago...
The pipe character needs to be escaped so that the shell doesn't interpret it. A simple solution:
$ awk -F\| '{print $2}' file
Another choice would be to quote the character:
$ awk -F'|' '{print $2}' file
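Either way the pipe character reaches awk intact; a quick check on a hypothetical one-line input:
$ printf 'first|second|third\n' | awk -F'|' '{print $2}'
second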
Another way using awk
awk 'BEGIN { FS = "|" } ; { print $2 }'
Note that as written, this command names no input file, so it reads from standard input and prints nothing when run on its own. You should either pipe the file in with 'cat file' or simply list the file after the awk program.
