Replace string after first semicolon while retaining the string after that - shell

I have a result file, values separated by ; as below:
137;AJP14028.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
and I want to change the second value (AJP14028.1_VP35) to only AJP14028, without the ".1_VP35" at the back. So the result will be:
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Any idea on how to do this? I am trying to solve this using either sed or awk but I am not really familiar with them yet.

With that input, and focusing on the second field, you can use awk:
$ awk 'BEGIN{FS=OFS=";"} {split($2, arr, /\.1/); $2=arr[1]} 1' file
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Explanation:
BEGIN{FS=OFS=";"} sets FS and OFS to ";". This splits the input on the ; character and sets the output field separator to that same character.
split($2, arr, /\.1/) splits the second field on a literal .1 and places the pieces in the array arr.
$2=arr[1] is an awk idiom that resets the second field, $2, to the trimmed value. As a side effect, the whole record, $0, is rebuilt using the output field separator, OFS.
1 at the end is another awkism -- print the current record.
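The same field-specific edit can also be written with sub() acting on $2 alone. This is only a minimal sketch, assuming that everything from the first dot of the second field onward should be dropped; the function name strip_second_field is made up for illustration.

```shell
# Hypothetical helper: trim the second ;-separated field at its first dot.
strip_second_field() {
  awk 'BEGIN { FS = OFS = ";" }
       { sub(/\..*/, "", $2) }   # edit only field 2; other fields keep their dots
       1' "$@"
}

printf '%s\n' '137;AJP14028.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200' | strip_second_field
```

Because the substitution is applied to $2 only, decimal values such as 0.79200 in later fields are left alone.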
If you just have the fixed string .1_VP35 to remove (and you do not care whether it is field specific), you can just use sed:
sed 's/\.1_VP35//' file

awk '{sub(/\.1_VP35/,"")}1' file
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E

sed -r 's/(^[^.]*)(\.[^;]*)(.*)/\1\3/' inputfile
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Here, back-references divide the input line into three groups, each delimited by (). They are later referred to as \1, \2, and \3.
The first group matches from the start of the line up to the first dot.
The second group matches from the first dot up to the first semicolon after it.
The third group matches the rest of the line.

This might work for you (GNU sed):
sed 's/\(;[^.]*\)[^;]*/\1/' file
Capture the first ; and everything after it that is not a . in a back-reference, then drop the characters that follow, up to but not including the next ;.
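The sed commands above locate the suffix by the first dot on the line, which is fine for the sample because the first field is an integer. If the first field could itself contain a dot, the pattern can be anchored to the second field instead. A minimal sketch under that assumption (trim_field2 is a made-up name):

```shell
# Hypothetical helper: keep field 1 untouched and cut field 2 at its first dot.
trim_field2() {
  # \1 = field 1, the ';', and the part of field 2 before any dot;
  # the trailing [^;]* is the rest of field 2, which is dropped.
  sed 's/^\([^;]*;[^.;]*\)[^;]*/\1/' "$@"
}

printf '%s\n' '1.5;AJP14028.1_VP35;HLA-A*02:01;0.79200' | trim_field2
```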

Related

Search for Double Quotes (") in the file and copy the whole line in different file

I have a requirement to read through all the files and look for <double quotes> (") and copy the whole line to a different file. The challenge is here that to identify the whole line when there is a new character in the line.
The file format is like this - values are separated with delimiter |*| and end with |##|.
In the attached (image), the highlighted in green should go to new file, Logic would be check for " and if it finds read line starting from (line after |##| to until next |##| )
10338|*|BVL-O-G-01020-R4|*||*|BVL|*||*|Y|*|Y|*||*|CFC6E82284990A7AE040800AA5644B19|*|jmorlan|*|2011.12.21 15:52:01|##|
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
10342|*|BVL-O-4-01020-R7|*||*|DVL|*||*|Y|*|Y|*||*|RRFC6E82284990A7AE040800AA5644B19|*|sppa|*|2011.12.21 15:52:01|##|
Assuming you mean that the sections between |##| should be treated as lines, the next question is: does your file contain any real newlines? If not, grep is probably not going to be very efficient, as it works on a line-by-line basis. If any real newlines are supposed to be considered part of the text, then grep is definitely going to be unhappy.
If you really want to do it in 1 go in grep:
grep -Eoz '(^|\|##\|)([^|]|\|[^#]|\|#[^#]|\|##[^|])*"([^|]|\|[^#]|\|#[^#]|\|##[^|])*(\|##\||$)'
This looks for any sequence that starts with |##| (or the start of the file), is followed by some characters, a quote, and some more characters, and then ends with |##| (or the end of the file). With -z, grep treats the NUL character rather than the newline as the line terminator, so newlines in the file are handled as ordinary characters.
The complex "any characters" ([^|]|\|[^#]|\|#[^#]|\|##[^|])* expression is because grep is greedy. It basically looks for repeating sequences that are not |##|. Perhaps turning off greed is good, but that will depend on the power of the regexp engine in your version of grep.
But it is much easier, and probably faster, to use sed to break up the records and inject NUL "line breaks":
sed 's/|##|/\x00/g' | grep -z '"'
This simply replaces your end-of-record pattern |##| with the NUL character, then asks grep to find a quote while treating NUL as the end of line. The pipes are left unescaped because, in GNU sed's basic regular expressions, \| means alternation while a bare | is a literal pipe.
This answer provides two kinds of solution: a POSIX awk version and two GNU awk versions.
POSIX awk
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
GNU awk 1
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
GNU awk 2
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf "%s%s", $0, RT}' inputfile > outputfile
On the sample data provided in the question, all provided solutions give the following output:
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
Note: if the file comes from a Windows machine, you may be suffering from the carriage-return problem. Please run dos2unix on the file before using it with these tools.
How does this work? (POSIX)
Using a POSIX version of awk we can do
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
The idea is to build a record r by appending every line to r. If the current line ends with "|##|", then we check if the record r contains a <double quote> ". If this is the case, we print the record r and reset the record r to an empty string. If it does not contain the <double quote>, we just reset it.
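The accumulate-and-test idea can be wrapped in a small function. The sketch below is a variation and not the answer's exact command: it additionally strips a trailing carriage return inside awk, on the assumption that the input may have Windows line endings and dos2unix is not at hand. The name quoted_records is made up for illustration.

```shell
# Hypothetical helper: print every |##|-terminated record that contains a double quote.
quoted_records() {
  awk '
    { sub(/\r$/, "") }                       # drop a trailing CR, if present
    { r = (r == "" ? $0 : r "\n" $0) }       # append this line to the current record
    /\|##\|$/ { if (r ~ /"/) print r; r = "" }   # record complete: print if quoted, then reset
  ' "$@"
}

printf 'a|*|b|##|\r\nc|*|x "y\r\nz|*|w|##|\r\nd|##|\r\n' | quoted_records
```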
How does this work? (GNU)
Using GNU awk you can do this directly using the record separator RS
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
The idea here is that the file contains various records. The OP clearly stated that the information of a record is split in fields separated by |*|, but more importantly, the records themselves are separated by |##|. So in the presented example of the OP, the first record is line1 while the second record is spread over line 2 and line 3.
In awk, you can define a record separator by means of the variable RS. In its default state, RS is the <newline> character \n, which makes each line a separate record that can be referenced by $0. In POSIX, the record separator can only be a single character, while in GNU awk it can be a regular expression (see the addendum below).
Since the record separator of the OP is the string "|##|", optionally followed by a <newline> character \n, we need to define RS="\\|##\\|\n?". Why so complicated?
The <pipe> | symbol is the OR operation (alternation operator) in a regular expression, so we need to escape it. But since string literals that are used as regular expressions are parsed twice, we also need to escape it twice. So | → \\|.
The \n? is there because the actual record separator seems to be the string "|##|\n", but some records, especially the last one, may not have a trailing newline character.
When you print records, using the print statement it automatically appends the output record separator ORS after each line. By default this is again a <newline> character \n. Since the record separator RS is not a part of the record $0 you need to update the value ORS to ORS="|##|\n". This time, not a regex, so you do not need to escape at all.
The statement /"/ is a shorthand for /"/{print $0} which means If the current record $0 contains a <double quote> ", then print the current record $0 followed by the output record separator ORS.
Note: since we already use GNU awk, we can reduce the whole thing even further to:
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf "%s%s", $0, RT}' inputfile > outputfile
This makes use of the matched record separator RT, which holds the text matched by RS. By replacing the print statement with a printf statement, we no longer need ORS and just append RT to the record $0 ourselves. The explicit "%s%s" format keeps any % characters in the data from being interpreted as format specifiers.
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual

Align numbers using only sed

I need to align decimal numbers with the "," symbol using only the sed command. The "," should go in the 5th position. For example:
183,7
2346,7
7,999
Should turn into:
183,7
2346,7
7,999
The maximum amount of numbers before the comma is 4. I have tried using this to remove spaces:
sed 's/ //g' input.txt > nospaces.txt
And then I thought about adding spaces depending on the number of digits before the comma, but I don't know how to do this using only sed.
Any help would be appreciated.
Assuming that there is only one number on each line; that there are at most four digits before the ,, and that there is always a ,:
sed 's/[^0-9,]*\([0-9]\+,[0-9]*\).*/    \1/;s/.*\(.....,.*\)/\1/;'
The first s gets rid of everything other than the (first) number on the line, and puts four spaces before it. The second one deletes everything before the fifth character prior to the ,, leaving just enough spaces to right justify the number.
The second s command might mangle input lines which didn't match the first s command. If it is possible that the input contains such lines, you can add a conditional branch to avoid executing the second substitution if the first one failed. With GNU sed, this is trivial:
sed 's/[^0-9,]*\([0-9]\+,[0-9]*\).*/    \1/;T;s/.*\(.....,.*\)/\1/;'
T jumps to the end of the commands if the previous s failed. POSIX standard sed only has a conditional branch on success, so you need to use this circuitous construction:
sed 's/[^0-9,]*\([0-9]\+,[0-9]*\).*/    \1/;ta;b;:a;s/.*\(.....,.*\)/\1/;'
where ta (conditional branch to a on success) is used to skip over a b (unconditional branch to end). :a is the label referred to by the t command.
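The pad-then-trim idea can be tried on the question's three numbers. This is a sketch and not the answer's exact command: it writes the digit run as [0-9][0-9]* instead of the GNU-only \+, and, like the answer, it keeps five characters before the comma. The function name align is made up.

```shell
# Hypothetical helper: right-justify the integer part so five characters precede the comma.
align() {
  sed -e 's/^[^0-9,]*\([0-9][0-9]*,[0-9]*\).*$/    \1/' \
      -e 's/^.*\(.....,.*\)$/\1/' "$@"
  # 1st expression: isolate the number and pad it with four spaces.
  # 2nd expression: keep only the last five characters before the comma.
}

printf '%s\n' '183,7' ' 2346,7' '7,999' | align
```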
If you change your mind, here is an awk solution:
$ awk -F, 'NF{printf "%5d,%-d\n", $1,$2} !NF' file
  183,7
 2346,7
    7,999
It sets the delimiter to a comma and handles both parts as separate fields.
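One caveat with formatting the fractional part as a number: %d drops leading zeros, so an input of 7,05 would be printed as 7,5. A minimal sketch that keeps the fraction as text instead, assuming at most one comma per line (the function name align_awk is made up):

```shell
# Hypothetical helper: pad the integer part to width 5, copy the fraction verbatim.
align_awk() {
  awk -F, 'NF { printf "%5d,%s\n", $1, $2; next }   # number line: pad and print
           1' "$@"                                  # empty line: pass through
}

printf '%s\n' '183,7' '7,05' | align_awk
```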
Try with this:
gawk -F, '{ if($0=="") print ; else printf "%5d,%-d\n", $1, $2 }' input.txt
If you are using GNU sed, you could do as below
sed -r 's/([0-9]+),([0-9]+)/printf "%5s,%d" \1 \2/e' input.txt

Make changes to a file (sed, awk)

I am trying to clean up the next file:
1. 10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
2. ¼?Amµxðïej£„7‹ìËÏð‡.4 --
3. 10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~£ˆD]†CÀ,£ÑÉ»In&Ry+/jÑ%A¡ã ÷d_#C÷—NÏÕÞ
3. Ü‚úè"åD\’c\ûñ7x°yFæï --
Note that the numbers are not an actual part of the file. They are just a reference for the line number. The length of a line depends on the encoded message (that is why the 3 is repeated: it is basically one line). There are thousands of records but they follow the same pattern. Each record ends with a (--).
Basically what I am trying to achieve is to just get the IPs side by side.
For example:
10.160.120.10 000BBBBBA4FB
My first step would be to delete everything between the first (;) and the fourth (~) since that pattern is the same for each record.
Which leads me to this.
sed 's/;.*~//'
However, this particular command would delete everything until the last (~) and not the fourth.
If it successfully removes everything between the first (;) and the fourth (~), it would get me something like this:
0.165.65.113 0008B9A4F3~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
¼?Amµxðïej£„7‹ìËÏð‡.4 --
And then I guess I could delete everything after the first (~) so I can get the desired output.
Am I following the right procedure? Should I achieve this with sed or awk? Any suggestions are appreciated!
Instead of trying to remove stuff, why don't you just keep the stuff you want?
sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
#                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^
#                    IP Address                     12 Hex digits
Explanation:
\1 \3 means insert everything that matched the first and the third set of parentheses of the search term.
^[^0-9]* matches all non-digits from the beginning of the line.
([0-9]{1,3}\.){3}[0-9]{1,3} matches an IP address. The whole term is in parentheses because we want to keep it. The inner (...) could be referenced as \2 in the replacement term, but we don't need that.
[0-9A-F]{12} is simply 12 hexadecimal digits (upper case; use [0-9a-fA-F] if you expect lower case as well).
Assuming your data structure is always the same, use several field separators at once with a bracket expression that includes ";" and "~". Be careful not to leave whitespace alone as the separator, as it is by default, because that returns a different field 3 (and 6).
awk -F '[[:blank:]]*[;~][[:blank:]]*' '/--$/ {print $1 " " $7}' YourFile
Assuming only space characters (no tabs) surround the separators and that each data line contains Data:
awk -F ' *[;~] *' '/--$/ {print $1 " " $7}' YourFile
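In the question's sample a record can wrap, so that only its last physical line ends in --. The sketch below therefore selects lines by the ;Data-- marker instead, assuming (as in the sample) that the first physical line of every record contains it. It also uses [ \t] rather than [[:blank:]] for awk implementations without POSIX character classes, and LC_ALL=C so the binary payload is treated as plain bytes. The name ip_and_id is made up.

```shell
# Hypothetical helper: print the first IP and the 12-hex-digit ID of each record.
ip_and_id() {
  # Fields are split on ';' or '~' with optional surrounding blanks,
  # so $1 is the first IP and $7 is the ID after the fourth '~'.
  LC_ALL=C awk -F '[ \t]*[;~][ \t]*' '/;Data--/ { print $1, $7 }' "$@"
}

printf '%s\n' \
  '10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~junk~' \
  'more junk --' | ip_and_id
```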

sed print more than one matches in a line

I have a file, including some strings and variables, like:
${cat.mouse.dog}
bird://localhost:${xfire.port}/${plfservice.url}
bird://localhost:${xfire.port}/${spkservice.synch.url}
bird://localhost:${xfire.port}/${spkservice.asynch.request.url}
${soabp.protocol}://${hpc.reward113.host}:${hpc.reward113.port}
${configtool.store.folder}/config/hpctemplates.htb
I want to print all the strings between "{}". In some lines there are more than one such string and in this case they should remain in the same line. The output should be:
cat.mouse.dog
xfire.port plfservice.url
xfire.port spkservice.synch.url
xfire.port spkservice.asynch.request.url
soabp.protocol hpc.reward113.host hpc.reward113.port
configtool.store.folder
I tried the following:
sed -n 's/.*{//;s/}.*//p' filename
but it printed only the last occurrence of each line. How can I get all the occurrences, remaining in the same line, as in the original file?
This might work for you (GNU sed):
sed -n 's/${/\n/g;T;s/[^\n]*\n\([^}]*\)}[^\n]*/\1 /g;s/ $//p' file
Replace all ${ by newlines and, if there are none, move on as there is nothing to process. If there are newlines, then globally remove the non-newline characters to the left of each newline and the non-newline characters to the right of the next }. To finish off, remove the extra space introduced by the right-hand side of the global substitution.
If you're not against awk, you can try the following:
awk -v RS='{|}' -v ORS=' ' '/\n/{printf "\n"} (NR+1)%2' file
The record separator RS is set to either { or }. This splits the wanted pattern from the rest.
The script then displays 1 record out of 2 with the statement (NR+1)%2.
In order to keep the lines as expected, the output record separator is set to a space with ORS=' ', and every time a newline is encountered the statement /\n/{printf "\n"} inserts one.
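For awk implementations where a regular-expression RS is not available, the same extraction can be sketched with a match() loop. This is an alternative technique, not either answer's method; the function name extract_vars is made up, and lines without any ${...} are skipped.

```shell
# Hypothetical helper: print the names inside every ${...} on a line, space-separated.
extract_vars() {
  awk '{
    out = ""
    # [$][{] and [}] are bracket expressions, so the braces need no escaping.
    while (match($0, /[$][{][^}]*[}]/)) {
      name = substr($0, RSTART + 2, RLENGTH - 3)   # strip the leading ${ and trailing }
      out = (out == "" ? name : out " " name)
      $0 = substr($0, RSTART + RLENGTH)            # continue after this match
    }
    if (out != "") print out
  }' "$@"
}

printf '%s\n' '${cat.mouse.dog}' 'bird://localhost:${xfire.port}/${plfservice.url}' | extract_vars
```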

How to make awk ignore the field delimiter inside double quotes? [duplicate]

I need to delete 2 columns in a comma-separated values file.
Consider the following line in the csv file:
"abc#xyz.com,www.example.com",field2,field3,field4
"def#xyz.com",field2,field3,field4
Now, the result I want at the end:
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
I used the following command:
awk 'BEGIN{FS=OFS=","}{print $1,$4}'
But the embedded comma which is inside quotes is creating a problem, Following is the result I am getting:
"abc#xyz.com,field3
"def#xyz.com",field4
Now my question is how do I make awk ignore the "," which are inside the double quotes?
From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content):
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
and see What's the most robust way to efficiently parse CSV using awk? for more generally parsing CSVs that include newlines, etc. within fields.
This is not a bash/awk solution, but I recommend CSVKit, which can be installed by pip install csvkit. It provides a collection of command line tools to work specifically with CSV, including csvcut, which does exactly what you ask for:
csvcut --columns=1,4 <<EOF
"abc#xyz.com,www.example.com",field2,field3,field4
"def#xyz.com",field2,field3,field4
EOF
Output:
"abc#xyz.com,www.example.com",field4
def#xyz.com,field4
It strips the unnecessary quotes, which I suppose shouldn't be a problem.
Read the docs of CSVKit here on RTD. ThoughtBot has a nice little blog post introducing this tool, which is where I learnt about CSVKit.
In your sample input file, it is the first field and only the first field, that is quoted. If this is true in general, then consider the following as a method for deleting the second and third columns:
$ awk -F, '{for (i=1;i<=NF;i++){printf "%s%s",(i>1)?",":"",$i; if ($i ~ /"$/)i=i+2};print""}' file
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
As mentioned in the comments, awk does not natively understand quoted separators. This solution works around that by looking for the first field that ends with a quote. It then skips the two fields that follow.
The Details
for (i=1;i<=NF;i++)
This starts a for over each field i.
printf "%s%s",(i>1)?",":"",$i
This prints field i. If it is not the first field, the field is preceded by a comma.
if ($i ~ /"$/)i=i+2
If the current field ends with a double-quote, this then increments the field counter by 2. This is how we skip over fields 2 and 3.
print""
After we are done with the for loop, this prints a newline.
This awk (GNU awk is needed for the three-argument match) should work regardless of where the quoted field is, and it works on escaped quotes as well.
awk '{while(match($0,/"[^"]+",|([^,]+(,|$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
Input
"abc#xyz.com,www.example.com",field2,field3,field4
"def#xyz.com",field2,field3,field4
field1,"abc#xyz.com,www.example.com",field3,field4
Output
"abc#xyz.com,www.example.com",field4
"def#xyz.com",field4
field1,field4
It even works on
field1,"field,2","but this field has ""escaped"\" quotes",field4
which even the mighty FPAT variable fails on!
Explanation
while(match($0,/"[^"]+",|([^,]+(,|$))/,a))
Starts a while loop that continues as long as the match succeeds (i.e. there is still a field).
match finds the first occurrence of the regex, which matches one field, and stores it in the array a.
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]
Sets $0 to begin at the end of matched field and adds the matched field to the corresponding array position in b.
print b[1] b[4];x=0}
Prints the fields you want from b and sets x back to zero for the next line.
Flaws
Will fail if field contains both escaped quotes and a comma
Edit
Updated to support empty fields
awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
$0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
print b[1] b[4];x=0}' file
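The quote-aware splitting these answers aim for can also be sketched as a character-by-character scan in POSIX awk. This is an illustrative alternative, not any answer's method: it assumes fields never span lines, it treats a doubled "" as two quote toggles (which leaves the in-quotes state unchanged), and it does not handle the \" variant. The function name csv_cols_1_4 is made up.

```shell
# Hypothetical helper: print columns 1 and 4 of a CSV line, honoring double quotes.
csv_cols_1_4() {
  awk '{
    split("", f)                       # clear fields left over from the previous line
    n = 0; field = ""; inq = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (c == "\"") { inq = !inq; field = field c }             # toggle quoted state, keep the quote
      else if (c == "," && !inq) { f[++n] = field; field = "" }  # unquoted comma ends a field
      else field = field c
    }
    f[++n] = field                     # last field has no trailing comma
    print f[1] "," f[4]
  }' "$@"
}

printf '%s\n' '"abc#xyz.com,www.example.com",field2,field3,field4' | csv_cols_1_4
```

For anything beyond simple cases, a dedicated CSV tool is likely the more robust choice.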
