How to separate a line with " " delimiter but, excluding string encapsulated in the single quotes? - bash

This is my first post ever so please forgive me if I missed any details.
PROBLEM STATEMENT:
I have a bunch of these lines in the file. The fields are separated by space.
'Temp.200.200B.Y2K & K-102 & P-503B.SP' (tp9012ga-bt102-734b-pqm4-kjk94kj10), PASSED, 2023-02-12T06:39:48Z, 2023-02-12T07:25:48.044Z, 1440] took 99ms including network delay.
I would like to keep what's in the single quotes and also break these into fields with " " delimiter. The desired output is below.
'Temp.200.200B.Y2K & K-102 & P-503B.SP' (tp9012ga-bt102-734b-pqm4-kjk94kj10), 2023-02-12T06:39:48Z, 2023-02-12T07:25:48.044Z, 99
Keep in mind that the characters inside the single quotes vary widely, but they are always enclosed in single quotes.
I have tried cut with a space delimiter but, it also considers spaces in the string inside of the single quotes.
cut -d\' -f1-6
Also, if you notice my desired output, I also wanted to remove some fields and some characters such as 'ms' from 99ms.

How to separate a line with " " delimiter but, excluding string
encapsulated in the single quotes?
I would use GNU AWK for this task in the following way. Consider this simple example; let the content of file.txt be
fields without quotes
'quoted field' 'another quoted field' 'yet another field'
mixed 'quoted field' unquoted
then
awk 'BEGIN{FPAT="\047[^\047]*\047|[^ ]*"}{print "1st field is",$1; print "2nd field is",$2; print "3rd field is",$3}' file.txt
gives output
1st field is fields
2nd field is without
3rd field is quotes
1st field is 'quoted field'
2nd field is 'another quoted field'
3rd field is 'yet another field'
1st field is mixed
2nd field is 'quoted field'
3rd field is unquoted
Explanation: I use FPAT to tell GNU AWK what constitutes a field, namely a single quote (since ' delimits the program text, I write \047, which is the ASCII code of that character in octal) followed by zero or more non-quote characters followed by a single quote, OR (|) zero or more non-space characters. Disclaimer: this solution assumes that the ' characters are perfectly balanced and that a quoted field never contains a ' that does not terminate it.
(tested in GNU Awk 5.0.1)
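FPAT is a GNU AWK extension, so for other awks a different route is needed. The following is only a sketch of one portable alternative, not the method above: it splits on the single quote itself and then whitespace-splits the remainder. It assumes exactly one quoted string per line, located at the start, as in the question's sample; the function name parse_line is made up for illustration.

```shell
#!/bin/sh
# Sketch: assumes exactly one quoted string per line, at the start of the line.
parse_line() {
  awk -F"'" '{
    quoted = $2               # text between the first pair of single quotes
    split($3, f, " ")         # whitespace-split everything after the closing quote
    ms = f[7]                 # e.g. "99ms"
    sub(/ms$/, "", ms)        # drop the unit, leaving "99"
    # \047 is the octal escape for a single quote
    print "\047" quoted "\047", f[1], f[3], f[4], ms
  }'
}

line="'Temp.200.200B.Y2K & K-102 & P-503B.SP' (tp9012ga-bt102-734b-pqm4-kjk94kj10), PASSED, 2023-02-12T06:39:48Z, 2023-02-12T07:25:48.044Z, 1440] took 99ms including network delay."
result=$(printf '%s\n' "$line" | parse_line)
printf '%s\n' "$result"
```

On the sample line this prints the desired output from the question. The field positions (1, 3, 4, 7) are tied to that exact layout and would need adjusting for other line formats.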

This might work for you (GNU sed):
sed -E 's/'\''[^'\'']*'\''|\S+/&\n/g
s/.*/echo "&"|sed -n "1,2p;4,5p;8s#ms##p"/e
s/\n//g' file
Prepend newlines to space delimiters.
Using the evaluation within the substitution command, run a second invocation of sed and treat each field as a line.
Remove or amend the lines (fields).
Remove the inserted newlines.

By looking at the problem statement and the desired output, you may need to go for , as delimiter along with a combination of awk and sed.
I will simply echo your PROBLEM STATEMENT string in this case to show you how it can be done.
I am assuming the line format is the same in your file (no issues with characters inside the quote changing vastly except for ,)
echo "'Temp.200.200B.Y2K & K-102 & P-503B.SP' (tp9012ga-bt102-734b-pqm4-kjk94kj10), PASSED, 2023-02-12T06:39:48Z, 2023-02-12T07:25:48.044Z, 1440] took 99ms including network delay." | awk -F "," '{print $1,","$3","$4","$5}' | sed -e 's/ms .*//g' -e 's/[0-9]*] took //g'
The Output:
'Temp.200.200B.Y2K & K-102 & P-503B.SP' (tp9012ga-bt102-734b-pqm4-kjk94kj10) , 2023-02-12T06:39:48Z, 2023-02-12T07:25:48.044Z, 99
EDIT:
@Ed Morton - I tried your approach and you are right. It can be done using awk only as well. The command is given below.
echo "'Temp.200.200B.Y2K & K-102 & P-503B.SP' (tp9012ga-bt102-734b-pqm4-kjk94kj10), PASSED, 2023-02-12T06:39:48Z, 2023-02-12T07:25:48.044Z, 1440] took 99ms including network delay." | awk -F "," '{ gsub("[0-9]*] took ","",$5); gsub("ms .*","",$5); print $1,","$3","$4","$5}'

Related

AWK match exact string inside square brackets

I have a file similar to the below-illustrated data.
https://www.test.example.com [503]
https://www.tst.example.com [403]
https://www.tt.example.com [302]
I want to fetch lines that match with the second column. For example, lines matching [403] should print only https://www.tst.example.com.
I tried escaping the square brackets with the below command, which gave me a warning.
$ awk -F "$2 == '\[403]\'" file.txt
awk: warning: escape sequence `\[' treated as plain `['
awk: warning: escape sequence `\'' treated as plain `''
You are mixing regular expressions and plain strings. [ is a regex special character, but you are not using a regex here, just a literal string comparison. You don't need any escaping at all (though you might want to reverse the usage of single and double quotes for simplicity, unless you are actually using Windows).
awk '$2 == "[403]"' file.txt
In basically all the Unix shells, the double quotes you used don't protect dollar signs, so $2 would be substituted by the shell, probably with nothing, or else with some unrelated string (whatever got passed in as the second command-line argument to the shell).
The -F option, if present, requires an argument; but based on your example data, the default field separator - any sequence of whitespace - should work fine. If you want to force it to e.g. a single space, try -F '[ ]' (a plain -F ' ' just selects the default whitespace splitting again).
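The quoting point can be checked directly in a shell. This is a small sketch: the temporary file is a stand-in for file.txt, and the positional parameters are set artificially just to show what the shell does to $2 inside double quotes.

```shell
#!/bin/sh
# Hypothetical sample file mirroring the data shown in the question.
tmp=$(mktemp)
printf '%s\n' \
  'https://www.test.example.com [503]' \
  'https://www.tst.example.com [403]' \
  'https://www.tt.example.com [302]' > "$tmp"

set -- first second            # give the shell a $2 of its own
double_quoted="$2 == x"        # the shell substitutes $2 before awk ever runs
single_quoted='$2 == x'        # passed through untouched; awk sees its own $2

printf 'double quotes: %s\n' "$double_quoted"
printf 'single quotes: %s\n' "$single_quoted"

# Literal string comparison: no regex, so no escaping is needed.
urls=$(awk '$2 == "[403]" {print $1}' "$tmp")
printf '%s\n' "$urls"
rm -f "$tmp"
```

The {print $1} action is added here because the question asked for the URL only; without it awk prints the whole matching line.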
Could you please try following, written and tested with shown samples in GNU awk.
awk -F'([[:space:]]*)?\\[|\\]([[:space:]]*)?' '$2=="403"{print $1}' Input_file
Explanation: Setting the field separator to either optional spaces followed by [ OR ] followed by optional spaces, for all lines. Then checking whether the 2nd field is 403 and, if so, printing the first field as per OP's request.
The following will do what you want, with the benefit of allowing you to pass the desired code as an argument, rather than having it hardcoded into the awk script.
awk -v http_code=403 '$2 == "["http_code"]"' file.txt

Shell scripting cut -d " " -f4 file.txt command

I have a file with words separated by only single space.
I want to read 4th word from each line of file using command:
cut -d " " -f4 file.txt
It works fine, but I don't understand its property.
If a line contains 4 or more words then it prints the 4th word.
If a line contains only 1 word then it prints that word.
If a line contains 2 or 3 words then it prints nothing.
I want to know that how it is working.
From man cut:
-f, --fields=LIST
select only these fields; also print any line that contains no delimiter character, unless the -s option is specified
If a line contains 1 word, then it does not contain the delimiter and therefore cut prints the whole line (which is exactly that one word).
Other cases are obvious: the line contains at least one delimiter, therefore it prints the fourth word, if available.
If you add the -s parameter, it will print the fourth word only if available (and thus ignore lines with one word without delimiter).
By default, cut expects each input line to contain the delimiter (space in the OP example). Lines that do not contain the delimiter are printed as-is.
The default behavior can be changed with -s, which suppresses lines that do not contain the delimiter (the one-word case), so that only the 4th column is ever printed. Use
cut -s -d " " -f4 file.txt
As to why this is the default behavior: there is no clear answer. Probably, this behavior was meant to allow some lines to be excluded from the filtering. The early Unix systems had a lot of semi-structured files, where this functionality could have been used to process man pages, nroff pages and similar.
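The three cases from the question can be reproduced with a tiny input. This is a sketch with made-up sample lines (one word, three words, five words):

```shell
#!/bin/sh
# Made-up sample: a one-word line, a three-word line and a five-word line.
sample() {
  printf '%s\n' 'one' 'one two three' 'one two three four five'
}

# Without -s: the delimiter-free line passes through intact,
# the three-word line yields an empty line, the last line yields "four".
without_s=$(sample | cut -d ' ' -f4)

# With -s: the delimiter-free line is dropped; the other two behave as before.
with_s=$(sample | cut -s -d ' ' -f4)

printf 'without -s:\n%s\n' "$without_s"
printf 'with -s:\n%s\n' "$with_s"
```

Note that even with -s the three-word line still produces an empty output line, because it contains the delimiter but has no 4th field.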
From the man page:
-f list
Cut based on a list of fields, assumed to be separated in the file by
a delimiter character (see -d). Each selected field shall be output.
Output fields shall be separated by a single occurrence of the field
delimiter character. Lines with no field delimiters shall be passed
through intact, unless -s is specified. It shall not be an error to
select fields not present in the input line.
-s, --only-delimited do not print lines not containing delimiters
See also: https://unix.stackexchange.com/questions/157677/does-cut-return-any-fields-if-separator-does-not-exist

Search for Double Quotes (") in the file and copy the whole line in different file

I have a requirement to read through all the files, look for <double quotes> ("), and copy the whole line to a different file. The challenge here is to identify the whole line when there is a newline character within the line.
The file format is like this - values are separated with delimiter |*| and end with |##|.
In the attached (image), the highlighted in green should go to new file, Logic would be check for " and if it finds read line starting from (line after |##| to until next |##| )
10338|*|BVL-O-G-01020-R4|*||*|BVL|*||*|Y|*|Y|*||*|CFC6E82284990A7AE040800AA5644B19|*|jmorlan|*|2011.12.21 15:52:01|##|
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
10342|*|BVL-O-4-01020-R7|*||*|DVL|*||*|Y|*|Y|*||*|RRFC6E82284990A7AE040800AA5644B19|*|sppa|*|2011.12.21 15:52:01|##|
Assuming you mean that the sections between |##| should be considered as lines, the next question is: does your file contain any real newlines? If not, grep is probably not going to be very efficient as it works on a line-by-line basis. If any real newlines are supposed to be considered part of the text, then definitely, grep is going to be unhappy.
If you really want to do it in 1 go in grep:
grep -Eoz '(^|\|##\|)([^|]|\|[^#]|\|#[^#]|\|##[^|])*"([^|]|\|[^#]|\|#[^#]|\|##[^|])*(\|##\||$)'
This is looking for any sequence that starts with |##| (or is the start of the file) is followed by some characters, a quote, and some more characters, then ends with |##| (or end of file). By using -z grep will ignore any newlines in the file.
The complex "any characters" ([^|]|\|[^#]|\|#[^#]|\|##[^|])* expression is because grep is greedy. It basically looks for repeating sequences that are not |##|. Perhaps turning off greed is good, but that will depend on the power of the regexp engine in your version of grep.
But much easier, and probably faster, to use sed to break up the records and inject "NULL" line-breaks:
sed 's/|##|/\x00/g' | grep -z '"'
This is simply replacing your end of line pattern |##| with the null character, then asking grep to find quote while treating null character as end of line.
This answer provides two solutions a Gnu Awk solution and a POSIX version.
POSIX awk
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
GNU awk 1
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
GNU awk 2
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf "%s%s", $0, RT}' inputfile > outputfile
On the sample data provided in the question, all provided solutions give the following output:
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
Note: it is possible that you are suffering from the carriage return problem if the file comes from a Windows machine. Please run dos2unix on the file before using it with these tools.
How does this work? (POSIX)
Using a POSIX version of awk we can do
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
The idea is to build a record r by appending every line to r. If the current line ends with "|##|", then we check if the record r contains a <double quote> ". If this is the case, we print the record r and reset the record r to an empty string. If it does not contain the <double quote>, we just reset it.
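The accumulate-and-flush idea can be tried on a small input. This is a sketch with invented three-record data, not the OP's file; the function name extract_quoted_records is hypothetical. It also tests r against the empty string explicitly rather than relying on its truthiness, a small deviation from the command above.

```shell
#!/bin/sh
# Sketch: print every |##|-terminated record that contains a double quote.
extract_quoted_records() {
  awk '{ r = (r == "" ? $0 : r "\n" $0) }           # append the line to the pending record
       /\|##\|$/ { if (r ~ /"/) print r; r = "" }   # record complete: print if quoted, then reset
      ' "$@"
}

# Invented sample: record 2 spans two physical lines and contains a quote.
result=$(printf '%s\n' \
  '1|*|plain|##|' \
  '2|*|first half' \
  'has a " quote|*|x|##|' \
  '3|*|plain again|##|' | extract_quoted_records)
printf '%s\n' "$result"
```

Only the two physical lines of record 2 are printed; records 1 and 3 are discarded when their terminating |##| is seen.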
How does this work? (GNU)
Using GNU awk you can do this directly using the record separator RS
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
The idea here is that the file contains various records. The OP clearly stated that the information of a record is split in fields separated by |*|, but more importantly, the records themselves are separated by |##|. So in the presented example of the OP, the first record is line1 while the second record is spread over line 2 and line 3.
In awk, you can define a record separator by means of the variable RS. In its default state, RS is the <newline> character \n which makes each line a separate record which can be referenced by $0. In POSIX, the record separator can only be a single character which separates the records, while in Gnu awk, this can be a regular expression (see addendum below).
Since the record separator of the OP is the string "|##|", optionally followed by a <newline> character \n, we need to define RS="\\|##\\|\n?". Why so complicated?
the <pipe> | symbol is the OR operation (alternation operator) in a regular expression, so we need to escape it. But since string literals that are used as regular expressions are parsed twice, we also need to escape it twice. So | → \\| (see here)
the \n? is because it seems that the actual record separator is the string "|##|\n", but maybe some records do not have a newline character, especially the last record.
When you print records, using the print statement it automatically appends the output record separator ORS after each line. By default this is again a <newline> character \n. Since the record separator RS is not a part of the record $0 you need to update the value ORS to ORS="|##|\n". This time, not a regex, so you do not need to escape at all.
The statement /"/ is a shorthand for /"/{print $0} which means If the current record $0 contains a <double quote> ", then print the current record $0 followed by the output record separator ORS.
Note: since we actually already use Gnu awk, we can actually reduce the whole thing even further to:
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf "%s%s", $0, RT}' inputfile > outputfile
Which makes use of the matched record separator RT, which holds the text matched by RS. By replacing the print statement with a printf statement, we do not need ORS anymore and just manually append RT to the record $0. Passing $0 as an argument rather than as the format string keeps any % in the data from being interpreted.
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual

Make changes to a file (sed, awk)

I am trying to clean up the next file:
1. 10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
2. ¼?Amµxðïej£„7‹ìËÏð‡.4 --
3. 10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~£ˆD]†CÀ,£ÑÉ»In&Ry+/jÑ%A¡ã ÷d_#C÷—NÏÕÞ
3. Ü‚úè"åD\’c\ûñ7x°yFæï --
Note that the numbers are not an actual part of the file. They are just a reference for the line number. The size of the line depends on the encoded message (that is why the 3 is repeated, because it is basically one line). There are thousands of records but they follow the same pattern. Each record ends with a (--).
Basically what I am trying to achieve is to just get the IPs side by side.
For example:
10.160.120.10 000BBBBBA4FB
My first step would be to delete everything between the first (;) and the fourth (~) since that pattern is the same for each record.
Which leads me to this.
sed 's/;.*~//'
However, this particular command would delete everything until the last (~) and not the fourth.
If it successfully removed everything between the first (;) and the fourth (~), it would get me something like this:
0.165.65.113 0008B9A4F3~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
¼?Amµxðïej£„7‹ìËÏð‡.4 --
And then I guess I could delete everything after the first (~) so I can get the desired output.
Am I following the right procedure? Should I achieve this with sed or awk? Any suggestions are appreciated!
Instead of trying to remove stuff, why don't you just keep the stuff you want?
sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
#                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^
#                    IP Address                     12 Hex digits
Explanation:
\1 \3 means output everything that matched the first and the third set of parentheses of the search term.
^[^0-9]* matches all non-digits from the beginning of the line
([0-9]{1,3}\.){3}[0-9]{1,3} matches an IP address. The whole term is in parentheses because we want to keep it. The inner (...) could be referenced as \2 in the replacement term, but we don't need that.
[0-9A-F]{12} is simply 12 hexadecimal digits (upper case; use [0-9a-fA-F] if you expect lower case as well)
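A quick way to try the expression is to feed it one record. This is a sketch: the record below is modelled on the question's first line with the binary payload replaced by the ASCII placeholder "payload", and -E is used in place of -r (both select extended regular expressions in GNU sed). The function name is made up.

```shell
#!/bin/sh
# Sketch: keep only the leading IP address and the 12-digit hex field.
extract_ip_and_hex() {
  sed -E -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
}

# Simplified record: the encoded message is replaced by a placeholder.
record='10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~payload --'
result=$(printf '%s\n' "$record" | extract_ip_and_hex)
printf '%s\n' "$result"
```

One caveat under this approach: if the encoded message itself happens to contain a run of 12 upper-case hex digits, the greedy .* will pick that later run instead.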
Assuming your data structure is always the same,
use several field separators at once with a character class containing ";" and "~". Be careful: do not rely on space alone as the separator (the default), because that yields a different field 3 (and 6).
awk -F '[[:blank:]]*[;~][[:blank:]]*' '/--$/ {print $1 " " $7}' YourFile
Assuming there are only space characters and no tabs around the separators, and that the data lines have Data
awk -F ' *[;~] *' '/--$/ {print $1 " " $7}' YourFile
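The field numbering can be checked on a couple of records. This sketch assumes each record sits on a single line ending in "--", which is what the /--$/ filter requires; the binary payload is replaced by the placeholder "payload" and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: split on ";" or "~" with optional surrounding spaces,
# then print field 1 (the IP) and field 7 (the 12-digit hex value).
ip_and_id() {
  awk -F ' *[;~] *' '/--$/ {print $1 " " $7}' "$@"
}

# Simplified single-line records modelled on the question's data.
result=$(printf '%s\n' \
  '10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~payload --' \
  '10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~payload --' \
  | ip_and_id)
printf '%s\n' "$result"
```

With this separator, fields 3 ("Data-- 1155") and 6 (the date and time) keep their embedded spaces, which is why the hex value lands in field 7.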

unterminated address regex while using sed

I am trying to use the sed command to find and print the number that appears between "\MP2=" and "\" in a portion of a line that appears like this in a large .log file
\MP2=-193.0977448\
I am using the command below and getting the following error:
sed "/\MP2=/,/\/p" input.log
sed: -e expression #1, char 12: unterminated address regex
Advice on how to alter this would be greatly appreciated!
Superficially, you just need to double up the backslashes (and it's generally best to use single quotes around the sed program):
sed '/\\MP2=/,/\\/p' input.log
Why? The double-backslash is necessary to tell sed to look for one backslash. The shell also interprets backslashes inside double quoted strings, which complicates things (you'd need to write 4 backslashes to ensure sed sees 2 and interprets it as 'look for 1 backslash') — using single quoted strings avoids that problem.
However, the /pat1/,/pat2/ notation refers to two separate lines. It looks like you really want:
sed -n '/\\MP2=.*\\/p' input.log
The -n suppresses the default printing (probably a good idea on the first alternative too), and the pattern looks for a single line containing \MP2= followed eventually by a backslash.
If you want to print just the number (as the question says), then you need to work a little harder. You need to match everything on the line, but capture just the 'number' and remove everything except the number before printing what's left (which is just the number):
sed -n '/.*\\MP2=\([^\]*\)\\.*/ s//\1/p' input.log
You don't need the double backslash in the [^\] (negated) character class, though it does no harm.
If the starting and ending pattern are on the same line, you need a substitution. The range expression /r1/,/r2/ is true from (an entire) line which matches r1, through to the next entire line which matches r2.
You want this instead;
sed -n 's/.*\\MP2=\([^\\]*\)\\.*/\1/p' file
This extracts just the match, by replacing the entire line with just the match (the escaped parentheses create a group which you can refer back to in the substitution; this is called a back reference. Some sed dialects don't want backslashes before the grouping parentheses.)
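To see the substitution in action without the real log, it can be run on a single line. This is a sketch: the \HF= and \RMSD= entries around the MP2 value are invented context, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: replace the whole line with the text captured between "\MP2=" and the next backslash.
extract_mp2() {
  sed -n 's/.*\\MP2=\([^\\]*\)\\.*/\1/p'
}

# Invented log fragment; backslashes stay literal inside single quotes.
log_line='\HF=-192.5521\MP2=-193.0977448\RMSD=1.2e-09\'
result=$(printf '%s\n' "$log_line" | extract_mp2)
printf '%s\n' "$result"
```

printf with %s is used instead of echo so that the backslashes in the sample are passed to sed unchanged.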
awk is a better tool for this:
awk -F= '$1=="MP2" {print $2}' RS='\\' input.log
Set the record separator to \ and the field separator to '=', and it's pretty trivial.

Resources