Every 4 rows leave following 6 lines removed and so until the end of the file.
No rows deleted can be written in another file
file to remove the lines
34
511
6977
511
0
22
20
8569
15
23
6466
390
1
54
9140
-100
0
12
10
5308
19
12
9240
442
1
46
433
55
file after removing lines
34
511
6977
511
6466
390
1
54
19
12
9240
442
The basis for this is the each_with_index function:
lines.each_with_index do |line, i|
case (i % 10)
when 0..3
puts line
end
end
You can adapt that code to put the output somewhere else, like an additional array or what have you.
Related
I'm trying to find collinearity between a group of genes from two different species using MCScanX. But I don't know what I could be possibly doing wrong anymore. I've checked both input files countless times (.gff and .blast), and they seem to be in line with what the manual says.
Like, for the first species, I've downloaded the gff file from figshare. I already had the fasta file containing only the proteins of interest (that I also got from figshare), so gene ids matched. Then, I downloaded both the gff and the protein fasta file from coffee genome hub. I used the coffee proteins fasta file as the reference genome in rBLAST to align the first specie's genes against it. After blasting (and keeping only the first five best alignments with e-values greater than 1e-10), I filtered both gff files so they only contained genes that matched those in the blast file, and then concatenated them. So the final files look like this:
View (test.blast) #just imagine they're tab separated values
sp1.id1 sp2.id1 44.186 43 20 1 369 411 206 244 0.013 37.4sp1.id1 sp2.id2 25.203 123 80 4 301 413 542 662 0.00029 43.5sp1.id1 sp2.id3 27.843 255 130 15 97 333 458 676 1.75e-05 47.8sp1.id1 sp2.id4 26.667 105 65 3 301 396 329 430 0.004 39.7sp1.id1 sp2.id5 27.103 107 71 3 301 402 356 460 0.000217 43.5sp1.id2 sp2.id6 27.368 95 58 2 40 132 54 139 0.41 32sp1.id2 sp2.id7 27.5 120 82 3 23 138 770 888 0.042 35sp1.id2 sp2.id8 38.596 57 35 0 21 77 126 182 0.000217 42sp1.id2 sp2.id9 36.17 94 56 2 39 129 633 725 1.01e-05 46.6sp1.id2 sp2.id10 37.288 59 34 2 75 133 345 400 0.000105 43.1sp1.id3 sp2.id11 33.846 65 42 1 449 512 360 424 0.038 37.4sp1.id3 sp2.id12 40 50 16 2 676 725 672 707 6.7 30sp1.id3 sp2.id13 31.707 41 25 1 370 410 113 150 2.3 30.4sp1.id3 sp2.id14 31.081 74 45 1 483 550 1 74 3.3 30sp1.id3 sp2.id15 35.938 64 39 1 377 438 150 213 0.000185 43.5
View (test.gff) #just imagine they're tab separated values
ex0 sp2.id1 78543527 78548673ex0 sp2.id2 97152108 97154783ex1 sp2.id3 16555894 16557150ex2 sp2.id4 3166320 3168862ex3 sp2.id5 7206652 7209129ex4 sp2.id6 5079355 5084496ex5 sp2.id7 27162800 27167939ex6 sp2.id8 5584698 5589330ex6 sp2.id9 7085405 7087405ex7 sp2.id10 1105021 1109131ex8 sp2.id11 24426286 24430072ex9 sp2.id12 2734060 2737246ex9 sp2.id13 179361 183499ex10 sp2.id14 893983 899296ex11 sp2.id15 23731978 23733073ts1 sp1.id1 5444897 5448367ts2 sp1.id2 28930274 28935578ts3 sp1.id3 10716894 10721909
So I moved both files to the test folder inside MCScanX directory and ran MCScan (using Ubuntu 20.04.5 LTS, the WSL feature) with:
../MCScanX ./test
I've also tried
../MCScanX -b 2 ./test
(since "-b 2" is the parameter for inter-species patterns of syntenic blocks)
but all I ever get is
255 matches imported (17 discarded)85 pairwise comparisons0 alignments generated
What am I missing????
I should be getting a test.synteny file that, as per the manual's example, looks like this:
## Alignment 0: score=9171.0 e_value=0 N=187 at1&at1 plus
0- 0: AT1G17240 AT1G72300 0
0- 1: AT1G17290 AT1G72330 0
...
0-185: AT1G22330 AT1G78260 1e-63
0-186: AT1G22340 AT1G78270 3e-174
##Alignment 1: score=5084.0 e_value=5.6e-251 N=106 at1&at1 plus
I have a tab file with two columns like that
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
and the desired output should be
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 14 27
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 6 209 299 305
2 10 13 23 40 47 58 87 10 23 40 58
I would like to change the numbers in 2nd column for random numbers in 1st column resulting in an output in 2nd column with the same number of numbers. I mean e.g. if there are four numbers in 2nd column for x row, the output must have four random numbers from 1st column for this row, and so on...
I'm try to create two arrays by AWK and split and replace every number in 2nd column for numbers in 1st column but not in a randomly way. I have seen the rand() function but I don't know exactly how joint these two things in a script. Is it possible to do in BASH environment or are there other better ways to do it in BASH environment? Thanks in advance
awk to the rescue!
$ awk -F'\t' 'function shuf(a,n)
{for(i=1;i<n;i++)
{j=i+int(rand()*(n+1-i));
t=a[i]; a[i]=a[j]; a[j]=t}}
function join(a,n,x,s)
{for(i=1;i<=n;i++) {x=x s a[i]; s=" "}
return x}
BEGIN{srand()}
{an=split($1,a," ");
shuf(a,an);
bn=split($2,b," ");
delete m; delete c; j=0;
for(i=1;i<=bn;i++) m[b[i]];
# pull elements from a upto required sample size,
# not intersecting with the previous sample set
for(i=1;i<=an && j<bn;i++) if(!(a[i] in m)) c[++j]=a[i];
cn=asort(c);
print $1 FS join(c,cn)}' file
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 85 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 20 205 294 295
2 10 13 23 40 47 58 87 10 13 47 87
shuffle (standard algorithm) the input array, sample required number of elements, additional requirement is no intersection with the existing sample set. Helper structure map to keep existing sample set and used for in tests. The rest should be easy to read.
Assuming that there is a tab delimiting the two columns, and each column is a space delimited list:
awk 'BEGIN{srand()}
{n=split($1,a," ");
m=split($2,b," ");
printf "%s\t",$1;
for (i=1;i<=m;i++)
printf "%d%c", a[int(rand() * n) +1], (i == m) ? "\n" : " "
}' FS=\\t input
Try this:
# This can be an external file of course
# Note COL1 and COL2 seprated by hard TAB
cat <<EOF > d1.txt
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 6 94
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 205 284 307 406
2 10 13 40 47 58 2 13 40 87
EOF
# Loop to read each line, not econvert TAB to:, though could have used IFS
cat d1.txt | sed 's/ /:/' | while read LINE
do
# Get the 1st column data
COL1=$( echo ${LINE} | cut -d':' -f1 )
# Get col1 number of items
NUM_COL1=$( echo ${COL1} | wc -w )
# Get col2 number of items
NUM_COL2=$( echo ${LINE} | cut -d':' -f2 | wc -w )
# Now split col1 items into an array
read -r -a COL1_NUMS <<< "${COL1}"
COL2=" "
# THis loop runs once for each COL2 item
COUNT=0
while [ ${COUNT} -lt ${NUM_COL2} ]
do
# Generate a random number to use as teh random index for COL1
COL1_IDX=${RANDOM}
let "COL1_IDX %= ${NUM_COL1}"
NEW_NUM=${COL1_NUMS[${COL1_IDX}]}
# Check for duplicate
DUP_FOUND=$( echo "${COL2}" | grep ${NEW_NUM} )
if [ -z "${DUP_FOUND}" ]
then
# Not a duplicate, increment loop conter and do next one
let "COUNT = COUNT + 1 "
# Add the random COL1 item to COL2
COL2="${COL2} ${COL1_NUMS[${COL1_IDX}]}"
fi
done
# Sort COL2
COL2=$( echo ${COL2} | tr ' ' '\012' | sort -n | tr '\012' ' ' )
# Print
echo ${COL1} :: ${COL2}
done
Output:
5 6 14 22 23 25 27 84 85 88 89 94 95 98 100 :: 88 95
6 8 17 20 193 205 209 284 294 295 299 304 305 307 406 :: 20 299 304 305
2 10 13 40 47 58 :: 2 10 40 58
I have a big file look like the example:
chr1:16872433-16872504 54 112622
chr1:16872433-16872504 55 112110
chr1:16872433-16872504 56 110996
chr1:16872433-16872504 57 110306
chr1:16861773-16861845 20 38808
chr1:16861773-16861845 21 39768
chr1:16861773-16861845 22 40344
chr1:16861773-16861845 23 40637
chr1:16861773-16861845 24 41311
chr2:7990338-7990408 8 0
chr2:7990338-7990408 9 0
chr2:7990338-7990408 10 0
chr2:7990338-7990408 11 0
chr2:7990338-7990408 12 0
I want to extract every part starting with "chr1:16872433-16872504" and make a new .txt file.
how can I do that in bash? I tried grep command but I do not know how to make it conditional.
grep -E 'chr1:16872433-16872504' your.txt > new.txt
gives you the following output
chr1:16872433-16872504 54 112622
chr1:16872433-16872504 55 112110
chr1:16872433-16872504 56 110996
chr1:16872433-16872504 57 110306
as per your requirement ["chr1:16872433-16872504"]
I have a very bulky file about 1M lines like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How to do this with sed?
I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
this will only display every row from file.txt that has a column $2 greater than 100.
Use the following approach:
sed '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' -E /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - will match either digits from 0 to 99 OR a digits with leading zero (ex. 012, 00982)
d - delete the pattern space;
-E(--regexp-extended) - Use extended regular expressions rather than basic regular expressions
To remove matched lines in place use -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
Suppose I have made a text file in the following format:
1 4 4
2 3 4
2 431 431
2 473 473
4 44 44
10 36 36
20 34 34
10 5 5
5 5 2
100 63 63
110 112 112
60 1327 1327
70 75 75
80 27 27
60 14 14
150 16 16
200 129 129
Now I want to make a distance of a tab key between two different column values as follows:
1 4 4
2 3 4
2 431 431
2 473 473
4 44 44
10 36 36
20 34 34
10 5 5
5 5 2
100 63 63
110 112 112
60 1327 1327
70 75 75
80 27 27
60 14 14
150 16 16
200 129 129
Is there any way of doing this at a time using any text editor or any other way?Also, if I want to delete an entire column at a time, how will I do that?
You may use a regular expression that will match and capture digits, then will match 1 or more spaces, and then will match and capture digits again, then just replace the spaces with a tab. In Notepad++, use:
Find What: (\d+) +(\d+)
Replace With: $1\t$2
Details:
(\d+) - Group 1 (later referred to with $1 backreference from the replacement pattern): one or more digits
+ - one or more spaces
(\d+) - Group 2 (later referred to with $2 backreference from the replacement pattern): one or more digits