Does anyone know a command that can get me the nth column of a tab-delimited file when the items in the columns contain spaces? I tried awk and cut, but I think they are treating the spaces inside the items as field delimiters and so are giving me incorrect values. I double-checked by manually counting columns, and I think this is the case.
You can set tab as a delimiter in the cut command like this:
cut -d$'\t' -f2 file.txt
Input (tab separated columns that contain spaces):
first item second item third item
123 456 789 987 654 321 741 852 933
Output (when selecting the 2nd column):
second item
987 654 321
As you can see, the spaces didn't interfere with the column separation.
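For what it's worth, tab is already cut's default delimiter, so plain cut -f2 file.txt behaves the same. If you would rather use awk, you can set its field separator to a tab explicitly; a minimal sketch for the same file:
awk -F'\t' '{print $2}' file.txt
With -F'\t', awk splits only on tabs, so the spaces inside the items are preserved.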
I am new to bash and I'm not sure how to do the following:
I have a text file in the format:
123
John

234
Sally

456
Lucy
...
I want to output it to a csv file in the form:
123,John
234,Sally
456,Lucy
...
A good job for sed:
sed '/[0-9]/{N;s/\n/,/}' txtfile
It matches lines containing a digit and, for each match, appends the next line (that's the N) and replaces the newline between the two with a comma.
If you also want to get rid of the blank lines in-between,
sed '/[0-9]/{N;s/\n/,/;n;d}' txtfile
Notice that if your file is as regular as the sample you gave, you don't even need the address regex: sed 'N;s/\n/,/;n;d' txtfile would suffice.
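If you prefer not to use sed, here is an alternative sketch with paste, which joins consecutive pairs of lines; grep . first strips any blank lines:
grep . txtfile | paste -d, - -
paste reads its standard input once for each - operand per output row, so every two non-blank lines become one comma-separated line.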
I have a data file with a lot of columns. It was generated from files with the same format. The header is like this:
gene strand coord exression SRR1234 gene strand coord exression SRR1235 gene strand coord exression SRR1236
I would like to extract the "gene" columns and the columns starting with "SRR" in the shell.
Does anyone have experience with this?
cols=$(head -n 1 datafilename | sed -e "s/\s\+/\n/g" | nl -w1 | grep 'SRR*\|gene' | cut -f 1)
cut -f"${cols//$'\n'/,}" datafilename
How?
First, we read just the first row with head; then we turn the whitespace (tabs, in this case) into newlines with sed; then we number the resulting lines with nl. It'll look like this:
1 gene
2 strand
...
After that, we use grep to keep only the lines with the items you care about, then cut to keep only the first field, leaving just the numbers that stood in front of the column names we want. That gives us a newline-separated list of column numbers, so we use parameter expansion to turn it into a comma-separated list, which we pass to cut to display only those columns.
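To make that concrete: with the sample header above, grep keeps the gene columns (1, 6, 11) and the SRR columns (5, 10, 15), so cols becomes that newline-separated list of numbers and the second command expands to:
cut -f1,5,6,10,11,15 datafilename
(Note that 'SRR*' in the grep pattern actually means "SR followed by zero or more R", but it still matches only the SRR columns in this header.)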
I have a large data file containing over a thousand entries. I would like to sort them but maintain the original line numbers. For instance,
1:100
2:120
3:10
4:59
Here the first number is the line number (not saved in the data file), separated by a colon from the actual value. I would like to sort by the value while keeping each line number bound to its original line, with an output of:
2:120
1:100
4:59
3:10
If possible, I would like to do this without creating another file, and numbering them by hand is not an option for the data size I'm using.
Given a file test.dat:
100
120
10
59
... the command:
$ cat -n test.dat | sort --key=2 -nr
2 120
1 100
4 59
3 10
... gives the output that you seem to be looking for (though with cat -n's leading padding and tab-delimited fields, which is easily fixed if necessary):
$ cat -n test.dat | sort --key=2 -nr | sed -e 's/^ *//' -e 's/\t/:/'
2:120
1:100
4:59
3:10
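If you want the colon-separated form directly, a sketch that numbers and sorts in one pipeline:
awk '{print NR":"$0}' test.dat | sort -t: -k2,2nr
awk prefixes each line with its line number and a colon, and sort treats the colon as the field separator, sorting numerically and in reverse on the second field.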
I have a file with 200 lines and I want to remove 5 long lines (each contains special characters).
$ cat abc
............
comments[asci?_203] part of jobs where to delete
5 similar lines
.....
I tried sed to remove these 5 lines, using line numbers (nl) on the file, but it did not work.
Thanks
Have you tried removing the lines with awk? This is untested with special characters, but it might work:
awk '{if (length($0)<55) print $0}' < abc
Replace 55 with the maximum line length you want to keep
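In awk, a bare condition with no action prints the line by default, so an equivalent, slightly shorter form is:
awk 'length($0) < 55' abc
Alternatively, if the 5 lines all share a distinctive fixed string (like the comments[ in your sample), grep -vF treats the pattern as a literal string, so the special characters don't need escaping:
grep -vF 'comments[' abc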
Below is a tab-delimited file named file.txt, sorted on the basis of column one:
barbie 325 social activist
david 214 IT professional
david 457 mathematician
david 458 biologist
john 85 engineer
john 98 doctor
peter 100 statistician
I want to run the uniq command on the basis of column one, using options like the -t and -k of the sort command:
uniq -d (-t$'\t' -k1,1) file.txt [the syntax in brackets is incorrect, but I want to run it in a similar way]
This seems like it should be quite easy, but I am unable to find my way. The output I want is:
david 214 IT professional
john 85 engineer
Help me, thanks in advance :)
Debian uniq used to have this option, but it was removed for compatibility reasons. You can create your own AWK or Perl script easily. This prints only the lines with the first occurrence of the first field:
awk -F '\t' '!x[$1]++' file.txt
x[$1] is an associative array indexed by the contents of the first field ($1); it gets incremented for each line, but it is also used as the condition that decides whether or not the current line should be printed; with the negation, it is true only if this field value has not been encountered before. (Reminder: the general form of an AWK script is zero or more of condition { action }, and both parts are optional; if { action } is missing, the default action is to print the current line. If the condition is missing, the action is taken unconditionally.)
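Note that this also prints the barbie and peter lines, since it keeps the first occurrence of every key. If you want uniq -d semantics exactly (only keys that occur more than once, as in your expected output), a two-pass sketch that reads the file twice:
awk -F '\t' 'NR==FNR {count[$1]++; next} count[$1] > 1 && !seen[$1]++' file.txt file.txt
The first pass counts how often each first field occurs; the second pass prints the first line of each field value whose count is greater than one.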