awk: remove multiple tabs between fields and output a line where each field is separated by a single tab - bash

I have a file whose 11th line should in theory have 1011 columns, yet it looks like there is more than one tab between each of its fields. More specifically,
If I use
awk '{print NF}' file
then I can see that the 11th line has the same number of fields as all the rest (except for the first ten lines, which have a different format. That's expected).
But if I use
awk 'BEGIN{FS="\t"} {print NF}' file
I can see that the 11th line has 2001 fields. Based on that, I suspect some of its fields are separated by more than one tab.
I'd like to have each field separated by 1 tab only, so I tried
awk 'BEGIN{OFS="\t"} {print}' file > file.modified
However, this doesn't solve the problem as
awk 'BEGIN{FS="\t"} {print NF}' file.modified
still indicates that the 11th line has 2001 fields.
Can anyone point out a way to achieve my goal? Thanks a lot! I have put the first 100 lines of my file at the following Google Drive link:
https://drive.google.com/file/d/1qOjzjUnJKJpc4VpDxwKPBcqMS7MUuyKy/view?usp=sharing

To squeeze multiple tabs to one tab, you could use tr:
tr -s '\t' <file >file.modified
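To verify that the squeeze worked, you could repeat your own field-count check on the result; it should now report the same number of fields as your default-FS count did:
awk 'BEGIN{FS="\t"} NR==11{print NF}' file.modified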

This might help with GNU awk:
awk 'BEGIN{FS="\t+"; OFS="\t"} {$1=$1; print}' file
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
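The $1=$1 assignment is what forces awk to rebuild the record with OFS; without it, print would emit the original line with the repeated tabs intact. A quick illustration on a made-up two-field line (cat -A is used here only to make the tabs visible as ^I; it is a GNU coreutils option):
$ printf 'a\t\t\tb\n' | awk 'BEGIN{FS="\t+"; OFS="\t"} {$1=$1; print}' | cat -A
a^Ib$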

Related

Is there a way to treat a single column of integers as an array in order to extract certain digits?

I am trying to treat a series of integers as an array in order to extract the "columns" of interest.
My data after extracting a column of integers looks something like:
01010101010
10101010101
00100111100
10111100000
01011000100
If I'm only interested in the 1st, 4th, and 11th integers, I'd like the output to look like this:
010
101
000
110
010
This problem is hard to describe in words, so I'm sorry for the lack of clarity. I've tried a number of suggestions, but many things such as awk's substr() are unable to skip positions (such as the 1st, 4th, and 11th positions here).
You can use the cut command:
cut -c 1,4,11 file
-c selects characters by position.
or using (gnu) awk:
awk '{print $1 $4 $11}' FS= file
FS is the field separator, which is set to the empty string in order to capture every single character.
With GNU awk, which can use an empty string as the field separator, you could do:
awk -F '' '{print $1, $4, $11}' OFS='' infile
Could you please try the following awk too:
awk '{print substr($0,1,1) substr($0,4,1) substr($0,11,1)}' Input_file
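If the list of wanted positions grows, a sketch like the following could loop over them with substr (the pos variable and its comma-separated value are just an illustration, not a standard option):
awk -v pos="1,4,11" '
BEGIN { n = split(pos, p, ",") }      # parse the wanted character positions
{
    out = ""
    for (i = 1; i <= n; i++)          # pick one character per position
        out = out substr($0, p[i], 1)
    print out
}' Input_file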

"grep" a csv file including multi-lines fields?

file.csv:
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
I want to grep the "XA100" entry like this:
grep XA100 file.csv
to obtain this result:
XA100;"this is
the multi-line"
but grep returns only one line:
XA100;"this is
file.csv contains 3 entries.
The "XA100" entry contain a multi-line field.
And grep doesn't seem to be the right tool to "grep" a CSV file that includes multi-line fields.
Do you know a way to do the job?
Edit: the real-world file contains many columns. The searched-for term can be in any column (not necessarily at the beginning of a line or of a field). All fields are enclosed in double quotes ("). Any field can span multiple lines, from one line to any number, and this cannot be predicted.
Give this line a try:
awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' file
I extended your example a bit:
kent$ cat f
XA90;"standard"
XA100;"this is
the
multi-
line"
XA110;"other standard"
kent$ awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' f
XA100;"this is
the
multi-
line"
In the comments you mention: in the real-world file, each line starts with ". I assume they also end with ", and present you this:
Test file:
$ cat file
"single line"
"multi-
lined"
Code and outputs:
$ awk 'BEGIN{RS=ORS="\"\n"} /single/' file
"single line"
$ awk 'BEGIN{RS=ORS="\"\n"} /m/' file
"multi-
lined"
You can also parametrize the search:
$ awk -v s="multi" 'BEGIN{RS=ORS="\"\n"} match($0,s)' file
"multi-
lined"
try:
Solution 1:
awk -v RS="XA" 'NR==3{gsub(/$\n$/,"");print RS $0}' Input_file
This sets the record separator to the string XA (multi-character RS needs GNU awk), selects the 3rd record, globally substitutes the trailing newline with nothing (to remove the extra blank line at the end of the record), and then prints the record separator followed by the current record.
Solution 2:
awk '/XA100/{print;getline;while($0 !~ /^XA/){print;getline}}' Input_file
This looks for the string XA100, prints the current line, and uses getline to move to the next line; the while loop then keeps printing lines until it reaches a line starting with XA.
If this file was exported from MS-Excel or similar, then lines end with \r\n while the newlines inside quotes are just \ns, so all you need is:
$ awk -v RS='\r\n' '/XA100/' file
XA100;"this is
the multi-line"
The above uses GNU awk for multi-char RS. On some platforms, e.g. cygwin, you'll have to add -v BINMODE=3 so gawk sees the \rs rather than them getting stripped by underlying C primitives.
Otherwise, it's extremely hard to parse CSV files in general without a real CSV parser (which awk currently doesn't have but is in the works for GNU awk) but you could do this (again with GNU awk for multi-char RS):
$ cat file
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
$ awk -v RS="\"[^\"]*\"" -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
XA90;"standard"
XA100;"this is the multi-line"
XA110;"other standard"
This replaces all newlines within quotes with blanks and then lets you process the input as a regular one-line-per-record file.
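If the goal is still just to find the XA100 entry, one possibility (a sketch with the same GNU awk requirement) is to pipe that flattened output straight into grep:
$ awk -v RS="\"[^\"]*\"" -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file | grep XA100
XA100;"this is the multi-line"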
Using PS response, this works for the small example:
sed 's/^X/\n&/' file.csv | awk -v RS= '/XA100/ {print}'
For my real-world CSV file, with many columns, with the searched-for term anywhere, with an unknown number of multi-line fields, with " characters escaped as "", with continuation lines beginning with ", and with all fields enclosed in ", this works. Note the exclusion of a second " character in the sed part:
sed 's/^"[^"]/\n&/' file.csv | awk -v RS= '/RESEARCH_TERM/ {print}'
Because the first column of any entry cannot start with "". The first column always looks like "XXXXXXXXX", where X is any character but ".
Thank you all for so many responses; other solutions may also work, depending on the CSV file format you use.

Using a multi-character field separator in awk on Solaris

I wish to use a string (Birch) as the field delimiter in awk to print the second field. I am trying the following command:
cat tmp.log|awk -FBirch '{ print $2}'
The following output is printed:
irch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Desired output:
2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Contents of tmp.log file.
-bash-3.2# cat tmp.log
Dec 05 13:49:23 [x.x.x.x.180.100] business-log-dev/int [TEST][0x80000001][business-log][info] mpgw(Test): trans(8497187)[request][10.x.x.x]:
Birch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Am I doing something wrong?
OS: Solaris10
Shell: Bash
I tried the command suggested in one of the answers below. I am getting the desired output, but with an extra empty line at the top. How can this be eliminated from the output?
-bash-3.2# /usr/xpg4/bin/awk -FBirch '{print $2}' tmp.log
2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Originally, I suggested putting quotes around "Birch" (-F'Birch') but actually, I don't think that should make any difference.
I'm not at all experienced working with Solaris but you may want to also try using nawk ("new awk") instead of awk.
nawk -FBirch '{print $2}' file
If this works, you may want to consider creating an alias so that you always use the newer version of awk with more features.
You may also want to try using the version of awk in the /usr/xpg4/bin directory, which is a POSIX compliant implementation so should support multi-character FS:
/usr/xpg4/bin/awk -FBirch '{print $2}' file
If you only want to print lines which have more than one field, you can add a condition:
/usr/xpg4/bin/awk -FBirch 'NF>1{print $2}' file
This only prints the second field when there is more than one field.
From the man page of the default awk on solaris usr/bin/awk
-Fc Uses the character c as the field separator
(FS) character. See the discussion of FS
below.
As you can see, Solaris awk only takes a single character as the field separator.
Also in the man page is split
split(s, a, fs)
Split the string s into array elements a[1], a[2], ...
a[n], and returns n. The separation is done with the
regular expression fs or with the field separator FS if
fs is not given.
As you can see, split takes a regular expression as the separator, so we can use:
awk 'split($0,a,"Birch")>1{print a[2]}' file
to print the second field as split by Birch. (The >1 test makes the action run only on lines that actually contain Birch, which also avoids printing an empty line for lines without it.)

Remove first columns then leave remaining line untouched in awk

I am trying to use awk to remove the first three fields in a text file. Removing the first three fields is easy, but the rest of the line gets messed up by awk: the delimiters are changed from tab to space.
Here is what I have tried:
head pivot.threeb.tsv | awk 'BEGIN {IFS="\t"} {$1=$2=$3=""; print }'
The first three columns are properly removed. The problem is that the output ends up with the tabs between columns $4, $5, $6, etc. converted to spaces.
Update: The other question for which this was marked as a duplicate was created later than this one: look at the dates.
First, as Ed commented, you have to use FS (not IFS) as the field separator in awk.
Tabs become spaces in your output because you didn't define OFS.
awk 'BEGIN{FS=OFS="\t"}{$1=$2=$3="";print}' file
This will remove the first 3 fields and leave the rest of the text "untouched" (you will see the 3 leading tabs); the tabs in the output are also kept.
awk 'BEGIN{FS=OFS="\t"}{print $4,$5,$6}' file
will output without leading spaces/tabs, but if you have 500 columns you have to do it in a loop, use the sub function (a sketch follows below), or consider other tools, cut for example.
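As a sketch of that sub-based approach (it assumes the fields themselves contain no tabs and that your awk supports the {3} interval, as GNU awk does):
awk '{ sub(/^([^\t]*\t){3}/, "") } 1' file
This deletes the first three fields together with their trailing tabs and leaves the rest of the line, original tabs included, untouched.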
Actually this can be done in a very simple cut command like this:
cut -f4- inFile
If you don't want the field separation altered then use sed to remove the first 3 columns instead:
sed -r 's/(\S+\s+){3}//' file
To store the changes back to the file you can use the -i option:
sed -ri 's/(\S+\s+){3}//' file
awk 'BEGIN{FS="\t"} {for (i=4; i<NF; i++) printf "%s\t", $i; print $NF}' file
(Using "%s\t" as the printf format keeps the output tab-separated and avoids treating the field itself as a format string.)

use awk to extract row records contain specific word

Say my input file is tab-delimited. How do I identify whether $0 contains the word "hello", case-insensitively?
here is a hello whateverColumn2
nonono nonono whateverItIs
here HeLLo again mockColumn2
Thanks a lot!
Given your lines in file data.txt:
awk -F"\t" '/hello/ {print $0}' data.txt
will print the lines containing a lowercase hello, which with your sample data is just:
here is a hello whateverColumn2
(The HeLLo line is not matched by this case-sensitive pattern; see the update below.)
The -F"\t" sets tab as the field separator for the input lines.
Update (based on request in comments below by OP):
To make this case-insensitive use the IGNORECASE flag:
awk -F"\t" 'BEGIN{IGNORECASE=1} /hello/ {print $0}' data.txt
Note that the IGNORECASE variable is a GNU extension and may not be available in other versions of AWK.
Alternatively, an example using match. In order to make this case-insensitive, the input is converted into lower case:
awk -F"\t" '{if (match(tolower($0), "hello")) print $0}' data.txt
Since match can take regular expressions, the conversion to lowercase may not be necessary with the right regular expression.
Tested with GNU Awk 3.1.6 under Linux
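One such "right regular expression" is simply a bracket-expression spelling of the word, which needs no GNU extensions (a sketch):
awk -F"\t" '/[Hh][Ee][Ll][Ll][Oo]/' data.txt
With the sample data this matches both the hello and the HeLLo lines.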
