Shell scripting cut -d " " -f4 file.txt command - shell

I have a file with words separated by a single space.
I want to read the 4th word from each line of the file using the command:
cut -d " " -f4 file.txt
It works fine, but I don't understand its behavior.
If a line contains 4 or more words, it prints the 4th word.
If a line contains only 1 word, it prints that word.
If a line contains 2 or 3 words, it prints nothing.
I want to know how this works.

From man cut:
-f, --fields=LIST
select only these fields; also print any line that contains no delimiter character, unless the -s option is specified
If a line contains 1 word, then it does not contain the delimiter and therefore cut prints the whole line (which is exactly that one word).
Other cases are obvious: the line contains at least one delimiter, therefore it prints the fourth word, if available.
If you add the -s option, it prints the fourth field only for lines that contain the delimiter (and thus skips the one-word lines entirely).
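For example, with a throwaway test file containing 1, 2 and 4 words per line (a quick sketch; the file name test.txt and its contents are made up):
printf 'one\none two\none two three four\n' > test.txt
cut -d " " -f4 test.txt      # prints: one, an empty line, four
cut -s -d " " -f4 test.txt   # prints: an empty line, four (the 1-word line is suppressed)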

By default, cut expects each input line to contain the delimiter (a space in the OP's example). Lines that do not contain the delimiter are printed as-is.
This default behavior can be changed with -s, which suppresses lines that do not contain the delimiter (the one-word case), so only properly delimited lines are processed. Use
cut -s -d " " -f4 file.txt
As to why this is the default behavior, there is no clear answer. Probably it was meant to allow some lines to be excluded from the filtering. Early Unix systems had a lot of semi-structured files, where this functionality could have been used to process man pages, nroff sources and the like.
From the man page:
-f list
Cut based on a list of fields, assumed to be separated in the file by
a delimiter character (see -d). Each selected field shall be output.
Output fields shall be separated by a single occurrence of the field
delimiter character. Lines with no field delimiters shall be passed
through intact, unless -s is specified. It shall not be an error to
select fields not present in the input line.
-s, --only-delimited do not print lines not containing delimiters
See also: https://unix.stackexchange.com/questions/157677/does-cut-return-any-fields-if-separator-does-not-exist

Related

Search for double quotes (") in the file and copy the whole line to a different file

I have a requirement to read through all the files, look for double quotes ("), and copy the whole line to a different file. The challenge here is identifying the whole line when it contains a newline character.
The file format is like this: values are separated by the delimiter |*| and each record ends with |##|.
In the attached image, the records highlighted in green should go to the new file. The logic would be: check for " and, if it is found, read the record starting from the line after the previous |##| up to the next |##|.
10338|*|BVL-O-G-01020-R4|*||*|BVL|*||*|Y|*|Y|*||*|CFC6E82284990A7AE040800AA5644B19|*|jmorlan|*|2011.12.21 15:52:01|##|
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
10342|*|BVL-O-4-01020-R7|*||*|DVL|*||*|Y|*|Y|*||*|RRFC6E82284990A7AE040800AA5644B19|*|sppa|*|2011.12.21 15:52:01|##|
Assuming you mean that the sections between |##| markers should be treated as single lines, the next question is: does your file contain any real newlines? If not, grep is probably not going to be very efficient, as it works on a line-by-line basis. If real newlines are supposed to be considered part of the text, then grep is definitely going to be unhappy.
If you really want to do it in 1 go in grep:
grep -Eoz '(^|\|##\|)([^|]|\|[^#]|\|#[^#]|\|##[^|])*"([^|]|\|[^#]|\|#[^#]|\|##[^|])*(\|##\||$)'
This looks for any sequence that starts with |##| (or the start of the file), is followed by some characters, a quote, and some more characters, and then ends with |##| (or the end of the file). With -z, grep ignores any newlines in the file.
The complex "any characters" expression ([^|]|\|[^#]|\|#[^#]|\|##[^|])* is needed because grep is greedy: it essentially matches repeating sequences that are not |##|. Turning off greediness might help, but that depends on the power of the regexp engine in your version of grep.
But it is much easier, and probably faster, to use sed to break up the records by injecting NUL "line breaks":
sed 's/|##|/\x00/g' | grep -z '"'
This simply replaces your end-of-record pattern |##| with the NUL character, then asks grep to find a quote while treating the NUL character as the line terminator.
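If you want the matching records back on ordinary lines afterwards, you can translate the NUL bytes back into newlines; a sketch assuming GNU sed/grep and a placeholder file name yourfile:
sed 's/|##|/\x00/g' yourfile | grep -z '"' | tr '\0' '\n'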
This answer provides two approaches: a POSIX awk solution and a GNU awk solution (in two variants).
POSIX awk
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
GNU awk 1
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
GNU awk 2
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf $0 RT}' inputfile > outputfile
On the sample data provided in the question, all provided solutions give the following output:
10358|*|BI-MED-CDMA-MCS-90-118-EXAM|*|Exam for 001-MCS-90-118:
Planning, Conducting and Reporting Post Marketing Surveillance "Studies and Safety Reporting from Non Trial Activities |*|GLOBAL_MEDICAL|*||*|Y|*|N|*||*|CFC6E822849A0A7AE040800AA5644B19|*|finke|*|2012.04.30 04:23:27|##|
Note: it is possible that you are suffering from the carriage-return problem if the file comes from a Windows machine. Please run dos2unix on the file before using it with these tools.
How does this work? (POSIX)
Using a POSIX version of awk we can do
awk '{r=r ? r "\n" $0 : $0}
/\|##\|$/ { if (r ~ /"/) print r; r=""}' inputfile > outputfile
The idea is to build a record r by appending every line to r. If the current line ends with "|##|", then we check if the record r contains a <double quote> ". If this is the case, we print the record r and reset the record r to an empty string. If it does not contain the <double quote>, we just reset it.
How does this work? (GNU)
Using GNU awk you can do this directly using the record separator RS
awk 'BEGIN{RS="\\|##\\|\n?";ORS="|##|\n"}/"/' inputfile > outputfile
The idea here is that the file contains various records. The OP clearly stated that the information of a record is split into fields separated by |*|, but more importantly, the records themselves are separated by |##|. So in the OP's example, the first record is line 1, while the second record is spread over lines 2 and 3.
In awk, you can define a record separator by means of the variable RS. By default, RS is the <newline> character \n, which makes each line a separate record that can be referenced as $0. In POSIX, the record separator can only be a single character, while in GNU awk it can be a regular expression (see the addendum below).
Since the record separator of the OP is the string "|##|", optionally followed by a <newline> character \n, we need to define RS="\\|##\\|\n?". Why so complicated?
the <pipe> | symbol is the OR operator (alternation) in a regular expression, so we need to escape it. And since string literals that are used as regular expressions are parsed twice, we need to escape it twice. So | becomes \\| (see here)
the \n? is there because the actual record separator seems to be the string "|##|\n", but some records might not have a trailing newline character, especially the last one.
When you print records with the print statement, awk automatically appends the output record separator ORS after each record. By default this is again a <newline> character \n. Since the record separator RS is not part of the record $0, you need to set ORS="|##|\n". This time it is not a regex, so no escaping is needed.
The statement /"/ is a shorthand for /"/{print $0} which means If the current record $0 contains a <double quote> ", then print the current record $0 followed by the output record separator ORS.
Note: since we are already using GNU awk, we can reduce the whole thing even further to:
awk 'BEGIN{RS="\\|##\\|\n?"}/"/{printf $0 RT}' inputfile > outputfile
This makes use of the matched record separator RT, which holds the text matched by RS. By replacing the print statement with printf, we no longer need ORS and just manually append RT to the record $0.
RS: The input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines. If it is a regexp, records are separated by matches of the regexp in the input text.
The ability for RS to be a regular expression is a gawk extension. In most other AWK implementations, or if gawk is in compatibility mode (see Options), just the first character of RS’s value is used.
ORS: The output record separator. It is output at the end of every print statement. Its default value is "\n", the newline character.
RT: (GNU AWK specific) The input text that matched the text denoted by RS, the record separator. It is set every time a record is read.
source: GNU AWK manual
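A minimal illustration of the regex RS and the RT variable (assumes GNU awk; the sample input is made up):
printf 'one|##|two "quoted"|##|three|##|' | gawk 'BEGIN{RS="\\|##\\|\n?"}{print NR ": [" $0 "] RT=" RT}'
# 1: [one] RT=|##|
# 2: [two "quoted"] RT=|##|
# 3: [three] RT=|##|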

Make changes to a file (sed, awk)

I am trying to clean up the following file:
1. 10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
2. ¼?Amµxðïej£„7‹ìËÏð‡.4 --
3. 10.160.120.11 ; 140.10.10.10 ;Data-- 1155~00120~xtl~12/01/2016 03:00:54~2B3BB1EB1BBB~£ˆD]†CÀ,£ÑÉ»In&Ry+/jÑ%A¡ã ÷d_#C÷—NÏÕÞ
3. Ü‚úè"åD\’c\ûñ7x°yFæï --
Note that the numbers are not an actual part of the file; they are just references to the line numbers. The length of a line depends on the encoded message (that is why the 3 is repeated: it is essentially one line). There are thousands of records, but they all follow the same pattern. Each record ends with --.
Basically what I am trying to achieve is to just get the IPs side by side.
For example:
10.160.120.10 000BBBBBA4FB
My first step would be to delete everything between the first (;) and the fourth (~) since that pattern is the same for each record.
Which leads me to this.
sed 's/;.*~//'
However, this particular command deletes everything until the last (~), not the fourth.
If it successfully removed everything between the first (;) and the fourth (~), it would give me something like this:
0.165.65.113 0008B9A4F3~ÍežG5„È&gÈe#Ÿ#•Œ‘„¦åEI²6frÞõ+ã:®*ÓÓÂ"ða5»V$è~
¼?Amµxðïej£„7‹ìËÏð‡.4 --
And then I guess I could delete everything after the first (~) so I can get the desired output.
Am I following the right procedure? Should I do this with sed or awk? Any suggestions are appreciated!
Instead of trying to remove stuff, why don't you just keep the stuff you want?
sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^
# IP Address 12 Hex digits
Explanation:
\1 \3 means: output whatever matched the first and the third set of parentheses in the search term.
^[^0-9]* matches all non-digits from the beginning of the line
([0-9]{1,3}\.){3}[0-9]{1,3} matches an IP address. The whole term is in parentheses because we want to keep it. The inner (...) could be referenced as \2 in the replacement term, but we don't need that.
[0-9A-F]{12} is simply 12 hexadecimal digits (upper case; use [0-9a-fA-F]{12} if you expect lower case as well)
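A quick sanity check of the command on a simplified sample line (the binary tail of the real data is replaced by the placeholder rest):
echo '10.160.120.10 ; 140.0.0.40 ;Data-- 1155~00120~xtl~12/01/2016 03:00:24~000BBBBBA4FB~rest' | sed -r -n 's/^[^0-9]*(([0-9]{1,3}\.){3}[0-9]{1,3}).*([0-9A-F]{12}).*$/\1 \3/p'
# 10.160.120.10 000BBBBBA4FB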
Assuming your data structure is the same:
Use several field separators at once with a character class including ";" and "~". Be careful not to use a space alone as the separator, as awk does by default; that would return different field numbers (3 and 6).
awk -F '[[:blank:]]*[;~][[:blank:]]*' '/--$/ {print $1 " " $7}' YourFile
Assuming there are only space characters (no tabs) as separators and the data lines contain Data:
awk -F ' *[;~] *' '/--$/ {print $1 " " $7}' YourFile

grep: keep lines by number in specific column

I know how to do it with awk. For example, to keep the lines which contain the number 3 in the second column: $ awk '$2 == 3'
But how do I do the same with only grep?
And what about the first column?
grep is not great for this; awk is better. But assuming your columns are separated by spaces, you want
grep -E '^[^ ]+ +3( |$)'
Explanation: find something that has a start of line, followed by one or more non-space characters (first column), then one or more space characters (column separator), then the number 3, then either a space (because there's another column) or end of line (if there's no other column).
(Updated to fix syntax after testing.)
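To illustrate with a few made-up space-separated lines (including the first-column variant the question also asks about):
printf 'a 3 x\nb 33 y\nc 3\n3 b z\n' | grep -E '^[^ ]+ +3( |$)'   # prints "a 3 x" and "c 3"
printf 'a 3 x\n3 b z\n30 c w\n' | grep -E '^3( |$)'               # first column: prints "3 b z"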
Here is the longer explanation for my mysterious command grep -P '^[^\t]*\t3\t' your_file from the comments:
I assumed that the column delimiter is a tab. grep without -P would require some strange workarounds to match a tab directly (see e.g. here). The -P option makes it possible to just write \t without any problems. If, for example, your delimiter is ; then you can replace the \t with ; and you don't need the -P option.
Having said that, let's explain the idea behind the regular expression. You said you want to match a 3 in the second column:
^ means: at the beginning of the line
[^\t]* means: zero or more (*) occurrences of something that is not a tab ([^\t]; here the ^ means "not a")
followed by tab
followed by 3
followed by tab
Now we have effectively expressed the idea that we need a 3 as the content of the second column (\t3\t) and we are not interested in the precise content of the first column. The ^[^\t]*\t is only necessary to express the idea "what follows is in the second column".
If you want to match something in the fourth column, you could use this to "skip" the first three columns and match a 4 in the fourth column:
^([^\t]*\t){3}4 (note the parentheses and the {3}).
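For example, with a made-up tab-separated input (printf generates real tabs):
printf 'a\tb\tc\t4\te\nq\tr\ts\t5\tt\n' | grep -P '^([^\t]*\t){3}4'
# prints only the first line, the one with 4 in its fourth column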
As you can see, there are many details here, and awk is much more elegant and easier.
You can read up on this in the grep documentation, and you will also need to study regular expressions a bit, e.g. start here.

use sed to merge lines and add comma

I found several related questions, but none of them fits what I need, and since I am a real beginner, I can't figure it out.
I have a text file with entries like this, separated by a blank line:
example entry &with/ special characters
next line (any characters)
next %*entry
more words
I would like the output to merge the lines, put a comma between them, and delete the empty lines. I.e., the example should look like this:
example entry &with/ special characters, next line (any characters)
next %*entry, more words
I would prefer sed, because I know it a little bit, but am also happy about any other solution on the linux command line.
Improved per Kent's elegant suggestion:
awk 'BEGIN{RS="";FS="\n";OFS=","}{$1=$1}7' file
which allows any number of lines per block, rather than the 2 rigid lines per block I had. Thank you, Kent. Note: The 7 is Kent's trademark... any non-zero expression will cause awk to print the entire record, and he likes 7.
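For instance, on a made-up input with a three-line block and a two-line block:
printf 'one\ntwo\nthree\n\nfour\nfive\n' | awk 'BEGIN{RS="";FS="\n";OFS=","}{$1=$1}7'
# one,two,three
# four,five
The $1=$1 assignment forces awk to rebuild $0 using OFS, which is what turns the embedded newlines into commas.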
You can do this with awk:
awk 'BEGIN{RS="";FS="\n";OFS=","}{print $1,$2}' file
That sets the record separator to blank lines, the field separator to newlines and the output field separator to a comma.
Output:
example entry &with/ special characters,next line (any characters)
next %*entry,more words
A simple sed command:
sed ':a;N;$!ba;s/\n/, /g;s/, , /\n/g' file
:a;N;$!ba;s/\n/, /g -> According to this answer, this part replaces all the newlines with ", " (comma and space).
So after running only the first substitution, the output would be
example entry &with/ special characters, next line (any characters), , next %*entry, more words
s/, , /\n/g -> Replacing ", , " with a newline in the above output gives you the desired result.
example entry &with/ special characters, next line (any characters)
next %*entry, more words
This might work for you (GNU sed):
sed ':a;$!N;/.\n./s/\n/, /;ta;/^[^\n]/P;D' file
Append the next line to the current line and, if there are characters on either side of the newline, substitute the newline with a comma and a space, then repeat. Eventually an empty line or the end of the file is reached; then only print the line if it is not empty.
Another, slightly more sophisticated version (allowing for whitespace in the "empty" lines) would be:
sed ':a;$!N;/^\s*$/M!s/\n/, /;ta;/\`\s*$/M!P;D' file
sed -n '1h;1!H
$ {x
s/\([^[:cntrl:]]\)\n\([^[:cntrl:]]\)/\1, \2/g
s/\(\n\)\n\{1,\}/\1/g
p
}' YourFile
This changes everything after loading the whole file into the buffer. It could also be done "on the fly" while reading the file, deciding per line whether it is empty or not.
use -e on GNU sed

How to extract from a file text between tokens using bash scripts

I was reading this question: Extract lines between 2 tokens in a text file using bash
because I have a very similar problem...
I have to extract (and save into a variable before printing) text from this XML file:
<!--more labels up this line-->
<ExtraDataItem name="GUI/LastVMSelected" value="14cd3204-4774-46b8-be89-cc834efcba89"/>
<!--more labels and text down this line-->
I only need to get the value (obviously without the quotes and without 'value='), but first I think it has to search for "GUI/LastVMSelected" to get to that line, because there could be a similar value field in other lines, and the value of that label is the one I want.
If they are on the same line (as they seem to be from your example), it's even easier. Just:
sed -ne '/name="GUI\/LastVMSelected"/s/.*value="\([^"]*\)".*/\1/p'
Explanation:
-n: Suppress default print
/name="GUI\/LastVMSelected"/: only lines matching this pattern
s/.*value="\([^"]*\)".*/\1/p
substitute everything, capturing the parenthesized part (the value of value)
and print the result
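Since the goal is to save the value into a variable before printing, you can wrap that sed in a command substitution (a sketch; the file name myconfig.xml is just a placeholder):
value=$(sed -ne '/name="GUI\/LastVMSelected"/s/.*value="\([^"]*\)".*/\1/p' myconfig.xml)
echo "$value"   # 14cd3204-4774-46b8-be89-cc834efcba89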
I'm assuming that you're extracting from an XML document. If that is the case, have a look at the XMLStarlet command-line tools for processing XML. There's some documentation for querying XML docs here.
Use this:
for f in `grep "GUI/LastVMSelected" filename.txt | cut -d " " -f3`; do echo ${f:7:36}; done
grep gets you only the lines you need
cut splits the lines using some separator, and returns the Nth result of the split
-d " " sets the separator to space
-f3 returns the third result (1-based indexing)
${f:7:36} extracts the substring starting at index 7 that is 36 characters long. This gets rid of the leading value=" and trailing slash, etc.
Obviously if the order of the fields changes, this will break, but if you're just after something quick and dirty that works, this should be it.
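To see what the ${f:7:36} expansion does in isolation, using the value field from the sample line:
f='value="14cd3204-4774-46b8-be89-cc834efcba89"/>'
echo "${f:7:36}"   # 14cd3204-4774-46b8-be89-cc834efcba89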
Using my answer from the question you linked:
sed -n '/<!--more labels up this line-->/{:a;n;/<!--more labels and text down this line-->/b;\|GUI/LastVMSelected|s/.*value="\([^"]*\)".*/\1/p;ba}' inputfile
Explanation:
-n - don't do an implicit print
/<!-- this is token 1 -->/{ - if the starting marker is found, then
:a - label "a"
n - read the next line
/<!-- this is token 2 -->/b - if it's the ending marker, branch out of the loop and resume normal processing
\|GUI/LastVMSelected| - if the line matches the string
s/.*value="\([^"]*\)".*/\1/p - print just the string after 'value="' and before the next quote
ba - branch to label "a"
} end if
