Remove first columns then leave remaining line untouched in awk - shell

I am trying to use awk to remove the first three fields in a text file. Removing the first three fields is easy, but awk messes up the rest of the line: the delimiters are changed from tab to space.
Here is what I have tried:
head pivot.threeb.tsv | awk 'BEGIN {IFS="\t"} {$1=$2=$3=""; print }'
The first three columns are properly removed. The problem is that the tabs between columns $4, $5, $6, etc. end up converted to spaces in the output.
Update: the other question this was marked as a duplicate of was created later than this one; look at the dates.

First, as Ed commented, you have to use FS as the field separator in awk; awk has no IFS variable (that's a shell variable), so assigning IFS just creates an ordinary, unused variable.
Tab becomes space in your output because you didn't define OFS, and the default output field separator is a single space.
awk 'BEGIN{FS=OFS="\t"}{$1=$2=$3="";print}' file
This will remove the first 3 fields and leave the rest of the text "untouched" (you will see the 3 leading tabs). The tabs between the remaining fields are also kept in the output.
awk 'BEGIN{FS=OFS="\t"}{print $4,$5,$6}' file
will output the fields without leading spaces/tabs. But if you have 500 columns you would have to do it in a loop, use the sub function (sketched below), or consider other tools, cut for example.
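A minimal sketch of the sub approach, assuming the file is named file: it deletes the first three fields together with their tabs in one substitution, so no leading tabs remain (the {3} interval needs a POSIX-compliant awk such as gawk):
awk 'BEGIN{FS=OFS="\t"} {sub(/^([^\t]*\t){3}/,""); print}' file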

Actually this can be done with a very simple cut command, since cut's default field delimiter is already the tab:
cut -f4- inFile

If you don't want the field separators altered, then use sed to remove the first 3 columns instead:
sed -r 's/(\S+\s+){3}//' file
To store the changes back to the file you can use the -i option:
sed -ri 's/(\S+\s+){3}//' file
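For example, on tab-separated input (this relies on GNU sed for -r, \S, and \s):
$ printf 'a\tb\tc\td\te\n' | sed -r 's/(\S+\s+){3}//'
d	e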

awk 'BEGIN{FS=OFS="\t"} {for (i=4; i<NF; i++) printf "%s%s", $i, OFS; print $NF}'
(printf "%s", $i rather than printf $i keeps data containing % from being treated as a format string, and FS=OFS="\t" preserves the tabs.)

How to add a space after a comma if it does not exist within the 6th column in a csv file?

Ubuntu 16.04
Bash 4.3.3
I also need a way to add a space after a comma if one does not exist, but only within the 6th column. I had to comment out that line (it appears commented in the script below) because it placed a space after all commas in the CSV file.
Wrong: "This is 6th column,Hey guys,Red White & Blue,I know it,Right On"
Perfect: "This is 6th column, Hey guys, Red White & Blue, I know it, Right On"
I could almost see awk printing out the 6th column then having sed do the rest:
awk '{ print $6 }' "$feed" | sed 's/|/,/g; s/,/, /g; s/,\s\+/, /g'
This is what I have so far:
for feed in *; do
    sed -r -i 's/([^,]{0,10})[^,]*/\1/5' "$feed"
    sed -i '
        s/<b>//g; s/*//g;
        s/\([0-9]\)""/\1inch/g;
        # s/|/,/g; s/,/, /g; s/,\s\+/, /g;
        s/"one","drive"/"onetext","drive"/;
        s/"comments"/"description"/;
        s/"features"/"optiontext"/;
    ' "$feed"
done
The substitution s/|/,/g; s/,/, /g; s/,\s\+/, /g; works, but it is global rather than restricted to one column.
It sounds like all you need is this (using GNU awk for FPAT):
awk 'BEGIN{FPAT="[^,]*|\"[^\"]+\""; OFS=","} {gsub(/, ?/,", ",$6)} 1'
e.g.:
$ cat file
1,2,3,4,5,"This is 6th column,Hey guys,Red White & Blue,I know it,Right On",7,8
$ awk 'BEGIN{FPAT="[^,]*|\"[^\"]+\""; OFS=","} {gsub(/, ?/,", ",$6)} 1' file
1,2,3,4,5,"This is 6th column, Hey guys, Red White & Blue, I know it, Right On",7,8
It actually looks like your whole shell script, including multiple calls to GNU sed, could be done far more efficiently in just one call to GNU awk, with no need for a surrounding shell loop, e.g. (untested):
awk -i inplace '
BEGIN{FPAT="[^,]*|\"[^\"]+\""; OFS=","}
{
$0 = gensub(/([^,]{0,10})[^,]*/,"\\1",5)
$0 = gensub(/([0-9])""/,"\\1inch","g")
sub(/"one","drive"/,"\"onetext\",\"drive\"")
sub(/"comments"/,"\"description\"")
sub(/"features"/,"\"optiontext\"")
gsub(/, ?/,", ",$6)
}
' *
This might work for you (GNU sed):
sed -r 's/[^,"]*("[^"]*")*/\n&\n/6;h;s/, ?/, /g;G;s/.*\n(.*)\n.*\n(.*)\n.*\n/\2\1/' file
Surround the 6th field with newlines. Make a copy of the line. Replace every comma followed by an optional space with a comma followed by a space. Append the original line, then use pattern matching to splice the amended 6th field back into the original line, discarding the rest of the amended copy.

Shell script copying all columns of text file instead of specified ones

I am trying to copy 3 columns from one text file and paste them into a new text file. However, whenever I execute this script, all of the columns in the original text file get copied. Here is the code I used:
cut -f 1,2,6 PROFILES.1.0.profile > compiledfile.txt
paste compiledfile.txt > myNewFile
Any suggestions as to what I'm doing wrong? Also, is there a simpler way to do this? Thanks!
Let's suppose that the input is comma-separated:
$ cat File
1,2,3,4,5,6,7
a,b,c,d,e,f,g
We can extract columns 1, 2, and 6 using cut:
$ cut -d, -f 1,2,6 File
1,2,6
a,b,f
Note the use of the -d, option to specify that the column separator is a comma.
By default, cut uses a tab as the column separator. If the separator in your file is anything else, you must use the -d option.
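A plausible cause of the symptom above: if PROFILES.1.0.profile is, say, space-separated, cut finds no tab, treats each whole line as field 1, and passes it through unchanged:
$ printf '1 2 3 4 5 6 7\n' | cut -f 1,2,6
1 2 3 4 5 6 7
$ printf '1 2 3 4 5 6 7\n' | cut -d' ' -f 1,2,6
1 2 6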
Using awk
awk -vFS=your_delimiter_here -vOFS=your_delimiter_here '{print $1,$2,$6}' PROFILES.1.0.profile > compiledfile.txt
should do it.
For comma separated fields the solution would be
awk -vFS=, -vOFS=, '{print $1,$2,$6}' PROFILES.1.0.profile > compiledfile.txt
FS is an awk built-in variable which stands for field separator.
Similarly, OFS stands for output field separator.
The handy -v option lets you assign a value to an awk variable.
You could use awk to do this:
awk -F "delimiter" '
{
print $1,$2 ,$3 #Where $1,$2 and so are column numbers
}' filename > newfile

How do I write an awk print command in a loop?

I would like to write a loop creating various output files with the first column of each input file, respectively.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" {print $1}' $i > $i.transcript_ids.txt
done
As an example if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a separate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other name as long as it is 5 different names and I can still link them back to the original files.
I understand that there is a problem with the double usage of $ in both the awk command and the loop, but I don't know how to change that.
Is it possible to write a command like this in a loop?
This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could also consider using cut instead of awk '{print $1}', as sketched below. Note that using ls as you did is less satisfactory than using globbing directly; it runs afoul of file names with oddball characters (spaces, tabs, etc.) in the name.
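A sketch of that cut variant, relying on cut's default tab delimiter (using the same loop variables as above):
cut -f1 "$file" > "$base.transcript_ids.txt"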
You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else (as well as use a different pattern match for the input files).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why):
ram,doc,doc,
shaym,eng,eng,
I am using cut command
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shaym,eng
That means cut can only print a field once. I need to print the same field twice or n times.
Why do I need this ? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
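Applied to the sample file:
$ sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/' a.csv
ram,doc,doc
shaym,eng,eng
(To get the trailing comma from the requested output, append -e 's/$/,/'.)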
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
Using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
Using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of adding all the columns to awk, I just used (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'
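For example:
$ echo 'a,b,c' | awk -F , -v OFS=, '$2=$2","$2'
a,b,b,c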

awk to change the record separator (RS) to every 2 lines

I am wondering how to use Awk to process every 2 lines of data instead of every one. By default the record separator (RS) is set to every new line, how can I change this to every 2 lines.
It depends on what you want to achieve, but one way is to use the getline instruction. For each line, read the next one and save it in a variable. You will then have the first line in $0 and the second one in even_line:
getline even_line
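A minimal sketch of that idea, assuming an input file named file; checking getline's return value keeps a trailing odd line from being lost:
awk '{ if ((getline even_line) > 0) print $0, even_line; else print $0 }' file
Each pair of input lines comes out joined on a single output line.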
Divide&Conquer: do it in two steps:
use awk to introduce blank line
to separate each two-line record: NR%2==0 {print ""}
pipe to another awk process and
set record separator to blank line: BEGIN {RS=""}
Advantage: In the second awk process you have all fields of the two lines accessible as $1 to $NF.
awk '{print}; NR%2==0 {print ""}' data | \
awk 'BEGIN {RS=""}; {$1=$1;print}'
Note:
$1=$1 is used here to force a rebuild of $0 (the whole record).
This guarantees that the two-line record is printed on one line.
Once you modify a field in your program when you process the two-line records, this is no longer required.
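For example:
$ printf '%s\n' 'a b' 'c d' 'e f' 'g h' | awk '{print}; NR%2==0 {print ""}' | awk 'BEGIN {RS=""}; {$1=$1;print}'
a b c d
e f g h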
If you want to merge lines, use the paste utility:
$ printf "%s\n" one two three four five
one
two
three
four
five
$ printf "%s\n" one two three four five | paste -d " " - -
one two
three four
five
This is a bit hackish, but it's a literal answer to your question:
awk 'BEGIN {RS = "[^\n]*\n[^\n]*\n"} {$0 = RT; print $1, $NF}' inputfile
Set the record separator to a regex which matches two lines (this requires GNU awk, where RS may be a regex and RT holds the text that matched it). Each record is then the text before the separator, which here is empty, so set $0 to the record terminator RT; assigning to $0 performs field splitting on FS. The print statement is just a demonstration placeholder.
Note that $0 will contain two newlines, but the fields will not contain any newlines.
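For example, with GNU awk:
$ printf '%s\n' 'a b' 'c d' 'e f' 'g h' | awk 'BEGIN {RS = "[^\n]*\n[^\n]*\n"} {$0 = RT; print $1, $NF}'
a d
e h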
