Padding columns of a CSV - bash

I have a CSV file containing a large number of comma-separated lines of data. I want to pad each line to the maximum line length and then print NO in a new column.
file.csv:
1,2,3,4,
1,4,7,8,9,10,11,13
1,2,
1,1,2,4,5,6,7,8,9,10,11
abc,def,ghi,jkl
Expected result:
1,2,3,4,,,,,,,,,,,,,,,,NO
1,4,7,8,9,10,11,13,,,,,NO
1,2,,,,,,,,,,,,,,,,,,,,NO
1,1,2,4,5,6,7,8,9,10,1,NO
abc,def,ghi,jkl,,,,,,,,NO
cat file | cat > file.csv
echo "N0" >> file.csv
Output obtained:
1,2,3,4,NO
1,4,7,8,9,10,11,13,NO
1,2,NO
1,1,2,4,5,6,7,8,9,10,11,NO
abc,def,ghi,jkl,NO

You need to read the file twice: once to get the maximum number of columns, and once to print the output:
awk -F, 'NR==FNR{if(m<=NF)m=NF;next} # Runs only on first iteration
{printf "%s",$0;for(i=0;i<=(m-NF);i++)printf ",";print "NO"}' file file
filename twice -----^
Output (12 columns in each row):
1,2,3,4,,,,,,,,NO
1,4,7,8,9,10,11,13,,,,NO
1,2,,,,,,,,,,NO
1,1,2,4,5,6,7,8,9,10,11,NO
abc,def,ghi,jkl,,,,,,,,NO
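The two passes can also be split into two explicit commands, passing the maximum field count in with -v; this is a sketch equivalent to listing the file twice, using the sample data from the question:

```shell
# Build the sample input from the question
cat > file.csv <<'EOF'
1,2,3,4,
1,4,7,8,9,10,11,13
1,2,
1,1,2,4,5,6,7,8,9,10,11
abc,def,ghi,jkl
EOF

m=$(awk -F, 'NF > m { m = NF } END { print m }' file.csv)  # pass 1: max field count
awk -F, -v m="$m" '{                                       # pass 2: pad and append NO
  printf "%s", $0
  for (i = 0; i <= m - NF; i++) printf ","
  print "NO"
}' file.csv
```

This produces the same 12-column output as the two-file awk invocation above.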

It's hard to imagine why you'd want to pad the lines with commas, so here's what I think you really want, which is to make every line have the same number of fields:
$ awk 'BEGIN{FS=OFS=","} NR==FNR{m=(m>NF?m:NF);next} {$(m+1)="NO"} 1' file file
1,2,3,4,,,,,,,,NO
1,4,7,8,9,10,11,13,,,,NO
1,2,,,,,,,,,,NO
1,1,2,4,5,6,7,8,9,10,11,NO
abc,def,ghi,jkl,,,,,,,,NO
And here's what you said you want anyway (padding by character length rather than field count):
$ awk '{n=length()} NR==FNR{m=(m>n?m:n);next} {p=sprintf("%*s",m-n+1,""); gsub(/ /,",",p); $0=$0 p "NO"} 1' file file
1,2,3,4,,,,,,,,,,,,,,,,,NO
1,4,7,8,9,10,11,13,,,,,,NO
1,2,,,,,,,,,,,,,,,,,,,,,NO
1,1,2,4,5,6,7,8,9,10,11,NO
abc,def,ghi,jkl,,,,,,,,,NO

awk -F, 'BEGIN{m=0}
{if(NF>m)m=NF;ar[NR]=$0;ars[NR]=NF;}
END{for(i=1;i<=NR;i++)
{for(j=ars[i];j<m;j++){ar[i]=ar[i]","}ar[i]=ar[i]"NO";
print ar[i]}}' <<<'1,2,3,4,
1,4,7,8,9,10,11,13
1,2,
1,1,2,4,5,6,7,8,9,10,11,12
abc,def,ghi,jkl
a,b'
Output:
1,2,3,4,,,,,,,,NO
1,4,7,8,9,10,11,13,,,,NO
1,2,,,,,,,,,,NO
1,1,2,4,5,6,7,8,9,10,11,12NO
abc,def,ghi,jkl,,,,,,,,NO
a,b,,,,,,,,,,NO
If the lines must all have the same length:
awk -F, 'BEGIN{m=0}
{if(length($0)>m)m=length($0);ar[NR]=$0;ars[NR]=length($0);}
END{for(i=1;i<=NR;i++)
{for(j=ars[i];j<m;j++)
{ar[i]=ar[i]","}ar[i]=ar[i]"NO";
print ar[i]}}' <<<'1,2,3,4,
1,4,7,8,9,10,11,13
1,2,
1,1,2,4,5,6,7,8,9,10,11,12
abc,def,ghi,jkl
a,b'
Output:
1,2,3,4,,,,,,,,,,,,,,,,,,,NO
1,4,7,8,9,10,11,13,,,,,,,,NO
1,2,,,,,,,,,,,,,,,,,,,,,,,NO
1,1,2,4,5,6,7,8,9,10,11,12NO
abc,def,ghi,jkl,,,,,,,,,,,NO
a,b,,,,,,,,,,,,,,,,,,,,,,,NO
If you also want a comma after the max-length line (before the NO), run the for loop until m+1.
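The m+1 variant can be sketched on a tiny made-up sample (same length-based padding as the second script above):

```shell
# Looping to m+1 adds one more comma per line, so even the longest
# line gets a comma before NO:
awk -F, '{ if (length($0) > m) m = length($0)
           ar[NR] = $0; ars[NR] = length($0) }
     END { for (i = 1; i <= NR; i++) {
             for (j = ars[i]; j < m + 1; j++) ar[i] = ar[i] ","
             ar[i] = ar[i] "NO"
             print ar[i] } }' <<<'1,2,3
12,345'
```

Here m is 6, so the 5-character line gets two commas and the 6-character line gets one.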

Related

Searching for a string between two characters

I need to find two numbers from lines which look like this
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for the colon to find the lines with numbers. Every line has different numbers.
I need to find the two numbers after ":", which are separated by "-", then subtract the first number from the second and print the result on the screen for each line.
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For the example I showed above, my result would be:
4899
because 458800 - 453901 = 4899.
I can't figure it out on my own and would appreciate some help.
With awk: separate each row into multiple columns using the : and - separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print the result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899
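For comparison, the same subtraction can be done in pure bash with parameter expansion; this is just a sketch using the sample line from the question (the file name file.txt is an assumption):

```shell
# Sample data: one line with coordinates, one without a colon
printf '>Chr14:453901-458800\nsome line without a colon\n' > file.txt

while IFS=':' read -r _ range; do
  [[ $range ]] || continue                # skip lines without a ":"
  echo $(( ${range#*-} - ${range%-*} ))   # second number minus first
done < file.txt
```

This prints 4899 for the sample line and nothing for lines without a colon.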

Bash - Need to get records with line numbers if found duplicates based on first two column values

Need help in Bash to print duplicate records in a file along with their corresponding line numbers.
Duplicates are to be identified based on the combination of the first two columns.
Example:
111|abc|scientist
222|ghu|developer
222|thu|doctor
222|ghu|engineer
I need output as below, as the duplicates are based on the combination of the first 2 columns, along with the line number:
2, 222|ghu|developer
4, 222|ghu|engineer
Assuming your input is in file input.txt, try:
awk -F '|' '{if (t[$1$2]) {print t[$1$2]; print NR", "$0} t[$1$2] = NR", "$0}' input.txt
Improving the readability:
awk -F '|' '{
    if (t[$1$2]) {
        print t[$1$2]
        print NR", "$0
    }
    t[$1$2] = NR", "$0
}' input.txt
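A variant of the same idea (a sketch, not the original answer): keying on $1 FS $2 instead of the bare concatenation $1$2 avoids treating rows like 11|1abc and 111|abc as duplicates, and each duplicate is printed exactly once even when a key occurs more than twice:

```shell
# Sample input from the question
cat > input.txt <<'EOF'
111|abc|scientist
222|ghu|developer
222|thu|doctor
222|ghu|engineer
EOF

awk -F '|' '{
  key = $1 FS $2                 # keep the separator inside the key
  if (key in first) {
    if (!(key in printed)) { print first[key]; printed[key] = 1 }
    print NR ", " $0             # each later duplicate, once
  } else
    first[key] = NR ", " $0      # remember the first occurrence
}' input.txt
```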

Bash file conversion

I have a .csv file and I want to convert it to a .txt file through a bash script.
The input looks like a normal .csv file.
I want to obtain a .txt file with this format:
velocity
List<vector>
300 // number of points
(
(U0 U1 U2)
(U0 U1 U2)
...
...
...
)
Many thanks for your help
This is a little ugly, but I think it does the trick:
wc -l < Inlet.csv | cat - Inlet.csv | awk -F, 'BEGIN{printf "velocity\nList<vector>\n"} NR==1{printf "%s\n(\n",$1} NR>2{print "("$1" "$2" "$3")"}END{print ")"}'
This does the following:
1. Gets the record count from wc and pipes it to...
2. cat, which concatenates the number with the file (number first) and pipes it to...
3. awk, which splits fields by comma: -F,
4. Prints the fixed text at the top of your file: BEGIN{printf "velocity\nList<vector>\n"}
5. On the first record (the count), NR==1, prints it followed by a newline, an opening parenthesis, and another newline: {printf "%s\n(\n",$1}
6. Past the header (record number greater than 2), NR>2, prints the first three fields separated by spaces and surrounded by parentheses: {print "("$1" "$2" "$3")"}
7. Finally, at the end of processing, prints a closing parenthesis to match the one opened in step 5: END{print ")"}
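The one-liner can also be written as a commented script; the sample Inlet.csv below is made up just to make the sketch runnable (the real file's layout is unknown beyond having one header line and three value columns):

```shell
# Hypothetical input: a header line followed by U0,U1,U2 rows
cat > Inlet.csv <<'EOF'
U0,U1,U2
1,2,3
4,5,6
EOF

wc -l < Inlet.csv | cat - Inlet.csv |
awk -F, '
  BEGIN { printf "velocity\nList<vector>\n" }   # fixed header text
  NR==1 { printf "%s\n(\n", $1 }                # the line count piped in by wc
  NR>2  { print "(" $1 " " $2 " " $3 ")" }      # skip the CSV header line
  END   { print ")" }                           # close the list
'
```

Note that the count from wc includes the CSV header line, so it is one more than the number of data rows.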

Bash: Converting 4 tab-delimited columns to 4 interleaved lines (FASTQ file)

I need to convert a 4-column file to 4 lines per entry. The file is tab-delimited.
The file at current is arranged in the following format, with each line representing one record/sequence (with millions of such lines):
#SRR1012345.1 NCAATATCGTGG #4=DDFFFHDHH HWI-ST823:136:C24YTACXX
#SRR1012346.1 GATTACAGATCT #4=DDFFFHDHH HWI-ST823:136:C22YTAGXX
I need to rearrange this such that the four columns are presented as 4 lines:
#SRR1012345.1
NCAATATCGTGG
#4=DDFFFHDHH
HWI-ST823:136:C24YTACXX
#SRR1012346.1
GATTACAGATCT
#4=DDFFFHDHH
HWI-ST823:136:C22YTAGXX
What would be the best way to go about doing this, preferably with a bash one-liner? Thank you for your assistance!
You can use tr:
< file tr '\t' '\n' > newfile
awk is also very clear here:
awk '{print $1; print $2; print $3; print $4}' file
$ awk -v OFS='\n' '{$1=$1}1' file
#SRR1012345.1
NCAATATCGTGG
#4=DDFFFHDHH
HWI-ST823:136:C24YTACXX
#SRR1012346.1
GATTACAGATCT
#4=DDFFFHDHH
HWI-ST823:136:C22YTAGXX

How to print the line number where a string appears in a file?

I have a specific word, and I would like to find out what line number in my file that word appears on.
This is happening in a C shell (csh) script.
I've been trying to play around with awk to find the line number, but so far I haven't been able to. I want to assign that line number to a variable as well.
Using grep
To look for word in file and print the line number, use the -n option to grep:
grep -n 'word' file
This prints both the line number and the line on which it matches.
Using awk
This prints the line number of each line on which the word word appears in the file:
awk '/word/{print NR}' file
This will print both the line number and the line on which word appears:
awk '/word/{print NR, $0}' file
You can replace word with any regular expression that you like.
How it works:
/word/
This selects lines containing word.
{print NR}
For the selected lines, this prints the line number (NR means Number of the Record). You can change this to print any information that you are interested in. Thus, {print NR, $0} would print the line number followed by the line itself, $0.
Assigning the line number to a variable
Use command substitution:
n=$(awk '/word/{print NR}' file)
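If the word may occur on several lines and you only want the first match, you can exit after printing so the variable holds a single line number; the file contents here are placeholders for illustration:

```shell
# Sketch: stop at the first match ("word" and the sample file are placeholders)
printf 'foo\nword here\nbar\nword again\n' > file
n=$(awk '/word/{print NR; exit}' file)
echo "$n"    # prints 2, the first matching line
```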
Using shell variables as the pattern
Suppose that the regex that we are looking for is in the shell variable url:
awk -v x="$url" '$0~x {print NR}' file
And:
n=$(awk -v x="$url" '$0~x {print NR}' file)
Sed
You can use the sed command
sed -n '/pattern/=' file
Explanation
The -n suppresses normal output so it doesn't print the actual lines. sed first matches /pattern/, and then the = command prints the line number. Note that this will print the numbers of all lines that contain the pattern.
Use the NR Variable
Given a file containing:
foo
bar
baz
use the built-in NR variable to find the line number. For example:
$ awk '/bar/ { print NR }' /tmp/foo
2
To find the line numbers for which the first column matches RRBS:
awk '$1 ~ /RRBS/ {print NR}' ../../bak/bak.db