Select columns that have the same string in the column header - shell

I have a data file with a lot of columns. It was generated from files with the same format. The header is like this:
gene strand coord expression SRR1234 gene strand coord expression SRR1235 gene strand coord expression SRR1236
I would like to extract the "gene" column and the columns whose headers match "SRR*", in the shell.
Does anyone have experience with this?

cols=$(head -n 1 datafilename | sed -e "s/\s\+/\n/g" | nl -w1 | grep 'SRR*\|gene' | cut -f 1)
cut -f"${cols//$'\n'/,}" datafilename
How?
First, we read just the first row with head, then we change the whitespace (tabs, in this case) to newlines with sed, then we print out the columns with numbers next to them with nl. It'll look like this:
1 gene
2 strand
...
After that, we keep only the lines with the column names you care about via grep, and then we keep only the first field with cut, leaving just the numbers that were in front of the column names we want. Now we have a newline-separated list of the column numbers we care about, so we use parameter expansion to substitute a comma for each newline, turning it into a comma-separated list, and we pass that to cut to display only those columns.
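If the parameter expansion step looks opaque, here is a minimal, self-contained illustration of that substitution (the column numbers are made up):
cols=$'1\n5\n10'              # newline-separated list, as produced by the pipeline above
echo "${cols//$'\n'/,}"       # replaces every newline with a comma: prints 1,5,10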

Related

Remove duplicated entries in a table based on the first column (which consists of two values separated by a colon)

I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split each line on either a space or : as the field separator and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ builds a hash map keyed on the value of $2. We increment the counter every time a value is seen in $2, so when the same value appears again the negation condition ! becomes false and prevents that line from being printed a second time.
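If the idiom is new to you, a more explicit but equivalent way to write the same condition is:
awk -F'[: ]' '{ if (!(unique[$2]++)) print }' file   # print a line only the first time its second field is seen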
Defining a regex with the -F flag might not be supported by all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to de-duplicate the file based on the word after the :, but to de-duplicate based on the entire first column, just do
awk '!unique[$1]++' file
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters. Here the first 7 characters are digits; if any letters appear, add -i to the command to make the comparison case-insensitive.

Sort first column while merging second column

I am looking for a solution to the following problem. I have a text file with geneIDs in the first column and single GOterms in the second. Because each gene has multiple annotated GOterms, identical geneIDs occur multiple times (with different GOterms in the second column). I only want unique geneIDs, with the GOterms merged:
I have:
TRINITY_DN10151_c0_g1 GO:0004175
TRINITY_DN10151_c0_g1 GO:0004252
TRINITY_DN10151_c0_g1 GO:0006508
TRINITY_DN10151_c0_g1 GO:0008233
TRINITY_DN102626_c42_g1 GO:0005198
TRINITY_DN102626_c42_g1 GO:0042302
TRINITY_DN102626_c58_g1 GO:0004175
I want:
TRINITY_DN10151_c0_g1 GO:0004175-GO:0004252-GO:0006508-GO:0008233
TRINITY_DN102626_c42_g1 GO:0005198-GO:0042302
etc..
Furthermore, it is important (and I have truly no idea how to solve this) that each GO term combination occurs in only one form. So if two genes have the same GO term combination (A, B and C) in column 2, they should both get A-B-C, and not one A-B-C and the other A-C-B.
I have tried to use sort and uniq, but in the end I was only deleting rows.
Can someone help me with a unix solution?
You can do it with a rather cryptic sed command. (Every sed command is trivial or cryptic.)
sort filename | sed -e :a -e '$!N;s/^\([^ ]* \) *\(.*\)\n\1 */\1\2-/;ta' -e 'P;D'
Loosely translated, this says "Append the next line to this one and replace the newline and second gene name with a hyphen, as long as the two gene names are the same".
And the sort is to keep the GOterm order consistent across genes.
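If the sed loop is hard to read, a rough awk sketch of the same merge (again sorting first, so equal gene IDs are adjacent) could look like this:
sort filename | awk '
    $1 == prev { line = line "-" $2; next }             # same gene: append this GO term
    { if (NR > 1) print line; line = $0; prev = $1 }    # new gene: print the finished previous line
    END { print line }                                  # flush the last gene
'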

How to extract specific rows based on row number from a file

I am working on an RNA-Seq data set consisting of around 24000 rows (genes) and 1100 columns (samples), which is tab-separated. For the analysis, I need to choose a specific gene set. Is there a method to extract rows based on row number? That would be easier for me than working with the gene names.
Below is an example of the data (4X4) -
gene     Sample1    Sample2    Sample3
A1BG     5658       5897       6064
AURKA    3656       3484       3415
AURKB    9479       10542      9895
From this, say for example, I want rows 1, 3 and 4, which do not follow a specific pattern.
I have also asked on biostars.org.
You may use a for loop to build the sed options, like below:
var=-n
for i in 1 3,4                  # put your space-separated line numbers/ranges here
do
    var="${var} -e ${i}p"
done
sed $var filename               # $var is deliberately unquoted so it splits into separate options
With the ranges above, var expands to "-n -e 1p -e 3,4p", so the last line effectively runs sed -n -e 1p -e 3,4p filename.
Note: in any case, the requirement mentioned here would still be a pain, as it involves a lot of typing.
Say you have a file, or a program, that generates a list of the line numbers you want. You can edit that list with sed to turn it into a sed script that prints those lines, and pass the result to a second invocation of sed.
In concrete terms, say you have a file called lines that says which lines you want (or it could equally be a program that generates the lines on its stdout):
1
3
4
You can make that into a sed script like this:
sed 's/$/p/' lines
1p
3p
4p
Now you can pass that to another sed as the commands to execute:
sed -n -f <(sed 's/$/p/' lines) FileYouWantLinesFrom
This has the advantage of being independent of the maximum length of the argument list you can pass to a command, because the sed commands come from a pseudo-file, i.e. they are not passed as arguments.
If you don't like/use bash and process substitution, you can do the same like this:
sed 's/$/p/' lines | sed -n -f /dev/stdin FileYouWantLinesFrom
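For comparison, a rough awk sketch that does the same thing (read the wanted line numbers from lines, then print only those lines of the data file) might look like:
awk 'NR == FNR { want[$1]; next } FNR in want' lines FileYouWantLinesFrom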

grep: keep lines by number in specific column

I know how to do it with awk. For example, to keep lines which contain the number 3 in the second column: $ awk '$2 == 3'
But how to do the same with only grep?
What about for the first column?
Grep is not great for this; awk is better. But assuming your columns are separated by spaces, you want
grep -E '^[^ ]+ +3( |$)'
Explanation: find something that has a start of line, followed by one or more non-space characters (first column), then one or more space characters (column separator), then the number 3, then either a space (because there's another column) or end of line (if there's no other column).
(Updated to fix syntax after testing.)
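As a quick sanity check, here is a tiny made-up sample run through that pattern:
printf 'a 3 x\nb 33 y\nc 3\n' | grep -E '^[^ ]+ +3( |$)'
# prints "a 3 x" and "c 3"; "b 33 y" is skipped because 33 is not exactly 3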
Here is the longer explanation for my mysterious command grep -P '^[^\t]*\t3\t' your_file from the comments:
I assumed that the column delimiter is a tab. grep without -P would require some strange workarounds to match a tab directly (see e.g. here). The -P makes it possible to just write \t without any problems. If, for example, your delimiter is ; then you could replace the \t with ; and you don't need the -P option.
Having said that, let's explain the idea behind the regular expression. You said you want to match a 3 in the second column:
^ means: at the beginning of the line
[^\t]* means: zero or more (*) occurrences of something that is not a tab ([^\t]; here the ^ means "not")
followed by tab
followed by 3
followed by tab
Now we have effectively expressed the idea that we need a 3 as the content of the second column (\t3\t) and we are not interested in the precise content of the first column. The ^[^\t]*\t is only necessary to express the idea "what follows is in the second column".
If you want to match something in the fourth column, you could use this to "skip" the first three column and match a 4 in the fourth column:
^([^\t]*\t){3}4 (note the parentheses and the {3}).
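A tiny made-up example of that fourth-column variant (note that, as written, the pattern only anchors the start of the fourth column, so a value like 42 would also match unless you add a trailing \t or end-of-line check):
printf 'a\tb\tc\t4\te\nf\tg\th\t7\ti\n' | grep -P '^([^\t]*\t){3}4'
# prints only the first line, whose fourth column starts with 4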
As you can see, there are many details to get right, and awk is much more elegant and easier.
You can read up on this in the grep documentation, and then you will need to study regular expressions a bit, e.g. start here.

Change field name and edit a csv file

I have a csv file I am looking at in bash that I am trying to manipulate. There are several things that I am trying to edit. The structure is like so, where the first row contains the column (field) headers:
cat,dog,hippopotamus,zebra
1,,3,2
three species, five species,only one,multiple
at,home, at, home, wild, wild
How can I edit the field (column) names in the csv?
head -1 test.csv
shows what the field (column) names are, but it still includes the commas, and this doesn't allow changing the field names at all.
The other part is that I only want to edit titles that are longer than 8 characters, in which case I will just keep the first 8 characters. I'm guessing I would use some sort of loop based on string length, but since I don't know how to edit even a single column's field name, I'm not sure how to do this. In the scenario above, that means changing hippopotamus to hippopot.
How can I replace empty cells in the csv to NA or NULL?
sed -i 's/ /NULL/g'
I thought this would work, but it doesn't.
Some of the cells have commas within them, messing with the , delimiter. I used the code below and it seems to work, but is there a better/safer way to do this?
sed -i "s/, /_/g"
In a similar situation, if multiple columns contain strings that sometimes have spaces within them, but I only want to remove the spaces in one of the columns while leaving the other columns alone, how can I achieve this?
sed -i 's/ //g' test.csv
Sed will allow a line number as a command prefix, to only work on a single line (or a range of numbers, to work on lines in that range). Try something like this.
sed -e '1s/cat/Feline/' test.csv > test2.csv
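For the 8-character truncation asked about above, one possible sketch (assuming the header row itself contains no quoted commas) uses awk to shorten every field of the first line:
awk 'BEGIN { FS = OFS = "," } NR == 1 { for (i = 1; i <= NF; i++) $i = substr($i, 1, 8) } 1' test.csv
# the header becomes cat,dog,hippopot,zebra; every other line passes through unchanged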
CSV files will store an empty field as either a comma at start of line, a comma at end of line, or a comma followed by another comma:
Field1,Field2,Field3
,"<-- empty field1",field3
field1,,"<-- empty field2"
field1,"empty field3-->",
You can use the following sed commands to fix these:
sed -e 's/^,/NA,/;s/,$/,NA/' -e ':loop' -e 's/,,/,NA,/g;tloop' test.csv
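With the sample lines above, that command should produce something like:
Field1,Field2,Field3
NA,"<-- empty field1",field3
field1,NA,"<-- empty field2"
field1,"empty field3-->",NA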
Your solution appears good. Be aware, however, that CSV should have quotes around any string containing a comma, and that's legit. It's also the point where sed stops being a good tool for manipulating CSV files. ;-) One suggestion would be to replace "interior" commas with "%2C", which is the URL encoding of a comma. That's pretty distinctive, and at least somewhat standard.
sed numbers groups starting from the left-most paren. If your groups match multiple times, you can only get the last match contents, but if an outer group contains the multi-match, the outer group is still valid. (I assume here that you have already replaced the "interior" commas with something else.)
sed -e ':loop' -e 's/^\(\([^,]*,\)\{3\}\)\([^ ,]*\) /\1\3/;tloop' test.csv
This will remove the first space in column 4, then loop. It will stop when it finds the comma that ends the column, or end-of-line.
Note that the first part, called \1, is general. You can replace the 3 with whatever field, minus one, and that will get you to the start of the field. The actual work is in the second part, \3, where you can do what you like. (Note that \2 is included within \1, and not particularly useful.)
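As a hypothetical illustration, the same loop aimed at the second field (the repeat count is the field number minus one) would be:
sed -e ':loop' -e 's/^\(\([^,]*,\)\{1\}\)\([^ ,]*\) /\1\3/;tloop' test.csv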
