print all the lines according to the matching string - python-2.6

I have a file named new which contains the following data:
1111
2012-5-12
new
p0
2222
2012-10-12
old
p1
3333
2012-15-12
new
p0
4444
2012-5-11
new
p1
5555
2011-5-12
old
p0
In this file each record has four lines: id, date, status and value (e.g. 1111, 2012-5-12, new, p0).
I have to print the data of all the ids whose status is "new".
And my output should be like this:
1111
2012-5-12
new
p0
3333
2012-15-12
new
p0
4444
2012-5-11
new
p1
I tried the following code:
f1 = open('new', 'r')
output = open('new1', 'w')
lines = f1.readlines()
n = 0
for i, line in enumerate(lines):
    if n > 3:
        output.close()
        file1 = open('new1', 'r')
        file2 = open('new2', 'w')
        lines = file1.readlines()
        status = lines[2].strip()
        if status == 'new':
            for line in lines:
                file2.write(line)
        output = open('new1', 'w')
        output.write(line)
        n = 1
    else:
        output.write(line)
        n = n + 1
new2 and new1 end up with the following contents:
(new2)=======
p0
2012-5-11
new
p1
0
(new1)===========
p1
2011-5-12
old
p0
The program takes the first 4 lines and writes them to new1. Then it checks whether status == "new"; if so, it should write all four lines to new2, otherwise it should read the next four lines, and so on up to the end of the file.
Problem: I am not getting the right data in new2. It should contain:
1111
2012-5-12
new
p0
3333
2012-15-12
new
p0
4444
2012-5-11
new
p1

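A simple approach: read all the lines at once, group them in fours, and keep only the groups whose third line is "new":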
with open('new', 'r') as f:
    lines = f.readlines()
data = [lines[4 * i:4 * i + 4] for i in range(len(lines) / 4)]
# readlines() keeps the trailing newline, so strip before comparing
new_data = [d for d in data if d[2].strip() == 'new']
with open('new1', 'w') as f:
    for d in new_data:
        f.writelines(d)  # the lines already end with '\n'
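If the input is too large to read at once, the same grouping can be done streaming; a minimal sketch using itertools.islice (with the two with statements nested, since Python 2.6 cannot combine them in one statement):

from itertools import islice

with open('new', 'r') as f:
    with open('new1', 'w') as out:
        while True:
            record = list(islice(f, 4))  # read one 4-line record
            if not record:
                break
            if record[2].strip() == 'new':
                out.writelines(record)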

Related

File processing: Combining multiple files with different number of columns and rows

I have multiple tab-delimited files where only the first two columns are in common. I'm trying to combine them into one tab-delimited file.
Example: let's say we have 3 files (file1, file2, file3) that we want to combine into file4.
(row and column names are just for demonstration purposes and are not included in any of the files)
Input files =>
File1: 2 rows(r1,r2), 3 columns(c1,c2,c3)
c1 c2 c3
r1 a b c
r2 d e f
File2: 3 rows(r3,r4,r5), 3 columns(c1,c2,c4)
c1 c2 c4
r3 1 2 3
r4 4 5 6
r5 7 8 9
File3: 1 row(r6), 4 columns(c1, c2, c5, c6)
c1 c2 c5 c6
r6 w x y z
Output file =>
For all 3 files, the first two columns (c1, c2) have the same names.
File4:
c1 c2 c3 c4 c5 c6
r1 a b c - - -
r2 d e f - - -
r3 1 2 - 3 - -
r4 4 5 - 6 - -
r5 7 8 - 9 - -
r6 w x - - y z
What I'm trying to do is: for each of the files, add the needed empty columns so that all files have the same number of columns, then reorder the columns with "awk", then use "cat" to stack them vertically. But I don't know whether this is the best way or there is a more efficient one.
Thanks,
The following awk script essentially does the task. It builds up a matrix of entries indexed by the row and column names.
awk '(FNR==1) {
    for (i = 1; i <= NF; ++i) {
        if (!($i in columns)) { column_order[++cn] = $i; columns[$i] }
        c[i+1] = $i
    }
    next
}
!($1 in rows) { row_order[++rn] = $1; rows[$1] }
{ for (i = 2; i <= NF; ++i) entry[$1, c[i]] = $i }
END {
    s = ""; for (j = 1; j <= cn; ++j) s = s OFS column_order[j]; print s
    for (i = 1; i <= rn; ++i) {
        row_name = row_order[i]
        s = row_name
        for (j = 1; j <= cn; ++j) {
            col_name = column_order[j]
            s = s OFS ((row_name, col_name) in entry ? entry[row_name, col_name] : "-")
        }
        print s
    }
}' file1 file2 file3 file4 ... filen
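For comparison, the same idea as a Python sketch (an assumption-laden sketch, not the answer's code: like the awk above, it treats each file's first line as the column names, splits fields on whitespace, and takes the input files as command-line arguments):

import sys

entries = {}        # (row name, column name) -> value
row_order = []
col_order = []

for path in sys.argv[1:]:
    f = open(path)
    header = f.readline().split()
    for name in header:
        if name not in col_order:
            col_order.append(name)
    for line in f:
        fields = line.split()
        row = fields[0]
        if row not in row_order:
            row_order.append(row)
        for name, value in zip(header, fields[1:]):
            entries[(row, name)] = value
    f.close()

print '\t' + '\t'.join(col_order)
for row in row_order:
    print row + '\t' + '\t'.join(entries.get((row, col), '-') for col in col_order)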

Sorting large text file in python

Sort the content of a file based on the second field, e.g.
Input file:
Jervie,12,M
Jaimy,11,F
Tony,23,M
Janey,11,F
Output file:
Jaimy,11,F
Janey,11,F
Jervie,12,M
Tony,23,M
We need to use an external sort.
The input file can be of size 4 GB; RAM is 1 GB.
I used the following, but it does not work, as it treats all the content as ints. Also, I have a doubt about the buffer size in each pass of the external sort: how do I decide on that?
This sorts a file with integers only:
import heapq
import tempfile
from itertools import islice, imap

file = open("i2.txt", "r")
temp_files = []
e = []
while True:
    temp_file = tempfile.TemporaryFile()
    e = list(islice(file, 2))
    if not e:
        break
    e.sort(key=lambda line: int(line.split()[0]))
    temp_file.writelines(e)
    temp_files.append(temp_file)
    temp_file.flush()
    temp_file.seek(0)
file.close()
with open('o.txt', 'w') as out:
    out.writelines(imap('{}\n'.format, heapq.merge(*(imap(int, f) for f in temp_files))))
out.close()
I am able to create temporary files sorted on the second field, but how do I merge them based on that?
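One way: heapq.merge in Python 2 has no key argument, so decorate each line with its key first, merge the decorated streams, and write the lines back out. A minimal sketch, assuming the temp files are sorted and rewound, and comma-separated lines as in the sample:

import heapq

def keyed(f):
    # pair each line with its numeric second field so heapq.merge
    # compares on that key
    for line in f:
        yield (int(line.split(',')[1]), line)

with open('o.txt', 'w') as out:
    for key, line in heapq.merge(*[keyed(f) for f in temp_files]):
        out.write(line)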
I did it with the following code.
Divide the big file into smaller files: here it is assumed that at most 4 lines can be held in memory, so I initially split the file into chunks of 4 lines, sort each chunk, and write it to a temp file. Then I read these files in pairs, 2 lines from each, and merge them. Corner cases are not handled, but this should be a starting point for others.
import tempfile
from itertools import islice, imap

f = open("i1.txt", "r")
temp_files = []
e = []
while True:
    temp_file = tempfile.NamedTemporaryFile()
    e = list(islice(f, 4))
    if not e:
        temp_file.close()
        break
    e.sort(key=lambda line: int(line.split()[1]))
    temp_file.writelines(e)
    temp_files.append(temp_file)
    temp_file.flush()
    temp_file.seek(0)
f.close()
aux = []
z = 0
while len(temp_files) != 1:
    while z < len(temp_files) - 1:
        tem = tempfile.NamedTemporaryFile()
        t1 = temp_files[z]
        t2 = temp_files[z + 1]
        t1.seek(0)
        t2.seek(0)
        n = 2
        e1 = None
        e2 = None
        while True:
            if not e1:
                e1 = list(islice(t1, 2))
            if not e2:
                e2 = list(islice(t2, 2))
            if not e1 and not e2:
                break
            elif e1 and not e2:
                tem.writelines(imap('{}'.format, e1))
                e1 = None
                continue
            elif not e1 and e2:
                tem.writelines(imap('{}'.format, e2))
                e2 = None
                continue
            i = 0
            j = 0
            while i < len(e1) and j < len(e2):
                l1 = e1[i]
                l2 = e2[j]
                if int(l1.split()[1]) == int(l2.split()[1]):
                    tem.writelines(imap('{}'.format, [l1, l2]))
                    i += 1
                    j += 1
                elif int(l1.split()[1]) < int(l2.split()[1]):
                    tem.writelines(imap('{}'.format, [l1]))
                    i += 1
                else:
                    tem.writelines(imap('{}'.format, [l2]))
                    j += 1
            if i >= len(e1):
                e1 = None
            else:
                e1 = e1[i:]
            if j >= len(e2):
                e2 = None
            else:
                e2 = e2[j:]
        z += 2
        aux.append(tem)
        t1.close()
        t2.close()
        tem.flush()
        tem.seek(0)
    temp_files = aux
    z = 0
    aux = []
with open("o.txt", 'w') as out:
    out.writelines(imap('{}'.format, temp_files[0]))
Try using out-of-core processing with Blaze (http://blaze.readthedocs.io/en/latest/ooc.html).

Conditional Filter in GROUP BY in Pig

I have the following dataset in which I need to merge multiple rows into one if they have the same key. At the same time, I need to pick among the multiple tuples that get grouped.
1 N1 1 10
1 N1 2 15
2 N1 1 10
3 N1 1 10
3 N1 2 15
4 N2 1 10
5 N3 1 10
5 N3 2 20
For example
A = LOAD 'data.txt' AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
DUMP G;
((1,N1),{(1,N1,1,10),(1,N1,2,15)})
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,1,10),(3,N1,2,15)})
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,1,10),(5,N3,2,20)})
Now, if there are multiple tuples in the collected bag, I want to keep only those which have f3 == 2. Here is the final data I want:
((1,N1),{(1,N1,2,15)}) -- f3==1 is removed from this bag
((2,N1),{(2,N1,1,10)})
((3,N1),{(3,N1,2,15)}) -- f3==1 is removed from this bag
((4,N2),{(4,N2,1,10)})
((5,N3),{(5,N3,2,20)})
Any idea how to achieve this?
I did it my own way, as specified in the comment above. Here is how I did it.
A = LOAD 'group.txt' USING PigStorage(',') AS (f1:int, f2:chararray, f3:int, f4:int);
G = GROUP A BY (f1, f2);
CNT = FOREACH G GENERATE group, COUNT($1) AS cnt, $1;
SPLIT CNT INTO
    CNT1 IF (cnt > 1),
    CNT2 IF (cnt == 1);
M1 = FOREACH CNT1 {
    row = FILTER $2 BY (f3 == 2);
    GENERATE FLATTEN(row);
};
M2 = FOREACH CNT2 GENERATE FLATTEN($2);
O = UNION M1, M2;
DUMP O;
(2,N1,1,10)
(4,N2,1,10)
(1,N1,2,15)
(3,N1,2,15)
(5,N3,2,20)

Vectorize matrix operation in R

I have an R x C matrix filled up to the k-th row and empty below it. What I need to do is fill the remaining rows. To do this, I have a function that takes 2 entire rows as arguments, processes them, and outputs 2 fresh rows (these outputs will fill the empty rows of the matrix, in batches of 2). I have a fixed matrix containing all 'pairs' of rows to be processed, but my for loop is not helping performance:
# the processRows function:
processRows = function(r1, r2)
{
    # just change a little bit the two rows and return it in a compact way
    nr1 = r1 * 0.1
    nr2 = -r2 * 0.1
    matrix(c(nr1, nr2), ncol = 2)
}
# M is the matrix
# nrow(M) and k are even, so nLeft is even
M = matrix(1:48, ncol = 3)
# half to fill (can be more or less, but k is always even)
k = nrow(M)/2
# simulate empty rows to be filled
M[-(1:k), ] = 0
cat('before fill')
print(M)
# number of empty rows to fill
nLeft = nrow(M) - k
nextRow = k + 1
# each row in idxList represents a 'pair' of rows to be processed
# any pairwise combination of non-empty rows could happen
# make it reproducible
set.seed(1)
idxList = matrix(sample(1:k, k), ncol = 2, byrow = TRUE)
for (i in 1:(nLeft / 2))
{
    row1 = M[idxList[i, 1], ]
    row2 = M[idxList[i, 2], ]
    # the two columns in 'results' will become 2 rows in M
    results = processRows(row1, row2)
    # fill the matrix
    M[nextRow, ] = results[, 1]
    nextRow = nextRow + 1
    M[nextRow, ] = results[, 2]
    nextRow = nextRow + 1
}
cat('after fill')
print(M)
Okay, here is your code first. We run this so that we have a copy of the "true" matrix, the one we hope to reproduce faster.
#### Original Code (aka Gold Standard) ####
M = matrix(1:48, ncol = 3)
k = nrow(M)/2
M[-(1:k), ] = 0
nLeft = nrow(M) - k
nextRow = k + 1
idxList = matrix(1:k, ncol = 2)
for (i in 1:(nLeft / 2))
{
    row1 = M[idxList[i, 1], ]
    row2 = M[idxList[i, 2], ]
    results = matrix(c(2*row1, 3*row2), ncol = 2)
    M[nextRow, ] = results[, 1]
    nextRow = nextRow + 1
    M[nextRow, ] = results[, 2]
    nextRow = nextRow + 1
}
Now here is the vectorized code. The basic idea: suppose you have 4 rows to process. Rather than passing them as vectors one at a time, do it all at once. That is:
(1:3) * 2
(1:3) * 2
(1:3) * 2
(1:3) * 2
is the same (but slower) as:
c(1:3, 1:3, 1:3, 1:3) * 2
So first, we will use your same setup code, then create the rows to be processed as two long vectors (where all 4 original rows are just strung together, as in my simple example above). Then we take those results and transform them into matrices with the appropriate dimensions. The last trick is to assign the results back in just two steps. You can assign to multiple rows of a matrix at once, so we use seq() to get the odd and even row indices to assign the first and second columns of the results to, respectively.
#### Vectorized Code (testing) ####
M2 = matrix(1:48, ncol = 3)
k2 = nrow(M2)/2
M2[-(1:k2), ] = 0
nLeft2 = nrow(M2) - k2
nextRow2 = k2 + 1
idxList2 = matrix(1:k2, ncol = 2)
## create two long vectors of all rows to be processed
row12 <- as.vector(t(M2[idxList2[, 1],]))
row22 <- as.vector(t(M2[idxList2[, 2],]))
## get all results
results2 = matrix(c(2*row12, 3*row22), ncol = 2)
## add results back
M2[seq(nextRow2, nextRow2 + nLeft2-1, by = 2), ] <- matrix(results2[,1], nLeft2/2, byrow=TRUE)
M2[seq(nextRow2+1, nextRow2 + nLeft2, by = 2), ] <- matrix(results2[,2], nLeft2/2, byrow=TRUE)
## check that vectorized code matches your examples
all.equal(M, M2)
Which on my machine gives:
> all.equal(M, M2)
[1] TRUE

How to write a bash script in Ubuntu to normalize the index of a text comparison

I have an input which is the result of a text comparison. It is in a very simple format: 3 columns with position, original text, and new text.
But some of the records look like this:
4 ATCG ATCGC
10 1234 123
How do I write a short script to normalize it to
7 G GC
12 34 3
Probably the whole original text and the whole new text are like below, respectively:
ACCATCGGA1234
ACCATCGCGA123
"Normalize" means "trying to move the position in the first column to the position that changes gonna occur", or "we would remove the common prefix ATG, add its length 3 to the first field; similarly on line 2 the prefix we remove is length 2"
This script
awk '
BEGIN {OFS = "\t"}
function common_prefix_length(str1, str2, max_len, idx) {
    idx = 1
    if (length(str1) < length(str2))
        max_len = length(str1)
    else
        max_len = length(str2)
    while (substr(str1, idx, 1) == substr(str2, idx, 1) && idx < max_len)
        idx++
    return idx - 1
}
{
    len = common_prefix_length($2, $3)
    print $1 + len, substr($2, len + 1), substr($3, len + 1)
}
' << END
4 ATCG ATCGC
10 1234 123
END
outputs
7 G GC
12 34 3
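For comparison, a small Python sketch of the same normalization (an assumption: whitespace-separated fields; like the awk function above, it always leaves at least one character of the shorter string, so the common-prefix length is capped at min(length) - 1):

import os.path

def normalize(pos, old, new):
    # common-prefix length, capped so at least one character of the
    # shorter string remains (mirrors the awk function's idx < max_len)
    k = min(len(os.path.commonprefix([old, new])),
            min(len(old), len(new)) - 1)
    return pos + k, old[k:], new[k:]

for line in open('diff.txt'):  # 'diff.txt' is a hypothetical input file
    pos, old, new = line.split()
    new_pos, o, n = normalize(int(pos), old, new)
    print new_pos, o, n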
