Replace specific commas in a csv file - bash

I have a file like this:
gene_id,transcript_id(s),length,effective_length,expected_count,TPM,FPKM,id
ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4,2.23231E3,2.05961E3,2493,2.112E1,1.788E1,00065a62-5e18-4223-a884-12fca053a109
ENSG00000001084.10,ENST00000229416.10,ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.5,ENST00000514004.5,ENST00000514373.2,ENST00000514933.1,ENST00000515580.1,ENST00000616923.4,3.09456E3,2.92186E3,3111,1.858E1,1.573E1,00065a62-5e18-4223-a884-12fca053a109
The problem is that the file should have been tab-delimited rather than comma-delimited, because the values starting with ENST (i.e. transcript_id(s)) belong together in one column.
The number of ENST IDs is different in each line.
Each ENST ID has the same pattern: it starts with ENST, followed by 11 digits, a period, and then 1-3 digits: ^ENST[0-9]{11}[.][0-9]{1,3}.
I want to convert all the commas between ENST IDs to a : or any other character so that I can read this as a csv file. Any help would be much appreciated. Thanks!

I imagine something as simple as
sed 's|,ENST|:ENST|g;s|:|,|' < /path/to/your/file
should work. No reason to over-complicate.
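To see why the two substitutions are enough, here is a small sketch applied to a shortened version of the first sample row (only two ENST IDs kept). The first expression turns every ",ENST" into ":ENST", which also glues the first ENST ID to the gene_id; the second expression turns the first ":" on the line back into a comma. The function name join_transcripts is made up for illustration, and the sketch assumes that no ":" appears earlier in a line and that every data row has at least one ENST ID.

```shell
# Hypothetical helper wrapping the sed command from the answer.
join_transcripts() {
  # 1) every ",ENST" becomes ":ENST"
  # 2) the first ":" on the line (gene_id:ENST...) goes back to ","
  sed 's|,ENST|:ENST|g;s|:|,|'
}

# Shortened sample row from the question (two ENST IDs instead of five).
printf '%s\n' 'ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,2.23231E3,2.05961E3,2493,2.112E1,1.788E1,00065a62-5e18-4223-a884-12fca053a109' |
  join_transcripts
# prints:
# ENSG00000000003.14,ENST00000373020.8:ENST00000494424.1,2.23231E3,2.05961E3,2493,2.112E1,1.788E1,00065a62-5e18-4223-a884-12fca053a109
```

The header line contains neither ",ENST" nor ":", so it passes through unchanged.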

Related

replace a pattern with n number of spaces

I am new to shell scripting and would appreciate any help with the problem below. I have tried sed and awk but have been unable to find a solution.
Problem: I have a fixed-width file with amount fields that need to be replaced with spaces or a special character such as $, and the record length has to be maintained. The length of the amount fields can vary.
For example, if sample_file.txt has a record length of 10 and there are two amount fields starting at positions 2 and 6, with lengths 3 and 5, as below:
a234b67890
It has to be modified as:
a$$$b$$$$$
This is for unix server.
Edit:
Also the records can have numeric characters at other positions which shouldn't be updated. So considering the previous example, the updated input is:
a234b678901234567890
And new output should be:
a$$$b$$$$$1234567890
Try using
inp=a234b67890
echo "$inp" | sed 's/[0-9]/$/g'
# gives a$$$b$$$$$
Because sed replaces each digit with exactly one character, the record length is preserved. Note that this masks every digit in the record, so it only fits the first example, where digits appear solely in the amount fields.
Hope this helps.
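For the edited example, where digits outside the amount fields must be left alone, a position-based replacement is one option. The sketch below is not from the original answer: mask_fields is a hypothetical helper that takes a list of start:length pairs (1-based, as in the question) and a fill character, and it assumes every record is at least as long as the last field it is asked to mask.

```shell
# Hypothetical helper: mask fixed-width fields given as "start:len,start:len".
mask_fields() {
  awk -v spec="$1" -v ch="$2" '
    BEGIN { n = split(spec, parts, ",") }
    {
      line = $0
      for (i = 1; i <= n; i++) {
        split(parts[i], se, ":")          # se[1] = start, se[2] = length
        start = se[1]; len = se[2]
        fill = ""
        for (j = 0; j < len; j++) fill = fill ch
        # keep text before the field, insert the fill, keep text after it
        line = substr(line, 1, start - 1) fill substr(line, start + len)
      }
      print line
    }'
}

printf '%s\n' 'a234b678901234567890' | mask_fields '2:3,6:5' '$'
# prints: a$$$b$$$$$1234567890
```

Each field is replaced by the same number of characters it occupied, so the record length does not change.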

Using AWK to preserve lines based on a single line field being repeated/duplicate in a CSV file

Would someone help me write a Bash script that keeps only the unique lines, based solely on duplicate values in a single field (the first field)?
If I have data like this:
123456,23423,Smith,John,Jacob,Main St.,,Houston,78003
654321,54524,Smith,Jenny,,Main St.,,Houston,78003
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
123456,324324,Bryant,Kobe,,Special St.,,New York,2311
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
654321,43234,Smith,Jimbo,,Main St.,,Houston,78003
I would like to keep only the lines whose first field does not match the first field of any other line
(the result would be a file with the contents below, based on the sample above):
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
What would the bash/awk approach be? Thanks in advance.
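One possible approach, offered as a sketch rather than a definitive answer, is to let awk read the file twice: the first pass counts how often each first field occurs, and the second pass prints only the lines whose first field was seen exactly once. The function name unique_by_first_field and the file name people.csv are made up for illustration, and the sketch assumes plain comma-separated fields with no quoted commas.

```shell
# Hypothetical helper: print lines whose first comma-separated field is unique.
unique_by_first_field() {
  # The same file is passed twice. While NR == FNR awk is still on the first
  # pass and only counts; on the second pass it prints lines with count 1.
  awk -F, 'NR == FNR { count[$1]++; next } count[$1] == 1' "$1" "$1"
}

# Usage (people.csv is an assumed file name):
# unique_by_first_field people.csv > unique.csv
```

Because the file is read twice, this works on a regular file but not on a pipe; the original line order is preserved.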

how to remove force quotes for only one column in csv file

I'm generating some CSV output using Ruby's built-in CSV. Everything works fine, but the customer wants the price field in the output to appear without double quotes.
So the output looks like this:
"10789852616","Studentska-trgovina","27.80","EUR",
The customer wants it to look like this:
"10789852616","Studentska-trgovina",27.80,"EUR",
Try calling .to_f on the price: it returns the result of interpreting the leading characters in str as a floating-point number. Extraneous characters past the end of a valid number are ignored. If there is no valid number at the start of str, 0.0 is returned.

return line of strings between two strings in a ruby variable

I would like to extract a line of strings but am having difficulties using the correct RegEx. Any help would be appreciated.
String to extract: KSEA 122053Z 21008KT 10SM FEW020 SCT250 17/08 A3044 RMK AO2 SLP313 T01720083 50005
For some reason Stack Overflow won't let me cut and paste the XML data here, since it includes "<>" characters. Basically, I am trying to extract the data between "raw_text" ... "/raw_text" from XML that will always be formatted like the following: http://www.aviationweather.gov/adds/dataserver_current/httpparam?dataSource=metars&requestType=retrieve&format=xml&hoursBeforeNow=3&mostRecent=true&stationString=PHNL%20KSEA
However, the Station name, in this case "KSEA" will not always be the same. It will change based on user input into a search variable.
Thanks In advance
If I can assume that every string you want starts with KSEA, then the answer would be:
.*(KSEA.*?)KSEA.*
The ? after .* makes the quantifier lazy, so it matches as little as possible.

Bash for truncation

I have to make changes to a document that has two columns separated by a tab (\t), with each record separated by a newline (\n). The entries in the document look like this:
/something/random/2345.txt
My aim is to remove the rest of the string and keep just the number, 2345 in this case. I used
sed 's/something/random//g' file.csv
but I do not know how to escape the /, because sed syntax uses / too. Also, not all records have the same words, so I would be looking for a regex of the type
/*/*.*
But each entry has a number as a part of the record and I would like to extract that.
Also there are a few records which do not contain any number, I would like to delete those records along with the corresponding entry in the next column for that record.
The file is in CSV format.
You can escape the forward slash with a backslash, or you can use a different character than forward slash to delimit your expression. Observe:
echo foobar | sed sIfooIcrowI
> crowbar
Of course, you probably shouldn't use an alphabetic character for the delimiter. I'm just using it here to make the point that pretty much any normal character can be substituted for the slash.
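You could strip the non-digit characters from the beginning of each entry and keep only the first run of digits in the first column (this relies on GNU sed understanding \t inside a bracket expression):
sed 's/^[^0-9]*\([0-9][0-9]*\)[^\t]*/\1/'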
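The question also asks to drop records that contain no number, which a single sed substitution does not do. As an alternative sketch (not part of the original answers), awk can split on tabs, look for digits in the file-name part of the first column, and print only the records where a number was found. The function name extract_number is made up for illustration; the sketch assumes the wanted number is the first run of digits after the last slash of the first column.

```shell
# Hypothetical helper: reduce column 1 to its number, drop records without one.
extract_number() {
  awk -F'\t' -v OFS='\t' '
    {
      name = $1
      sub(/.*\//, "", name)                 # drop the directory part
      if (match(name, /[0-9]+/)) {          # first run of digits, if any
        $1 = substr(name, RSTART, RLENGTH)  # replace column 1 with the number
        print                               # second column is kept as-is
      }
    }'
}

printf '/something/random/2345.txt\tfoo\n/no/number/here.txt\tbar\n' | extract_number
# prints: 2345<TAB>foo
```

Records whose first column has no digits in the file name are skipped entirely, together with their second column.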
