neo4j-shell on windows does not accept square brackets - windows

I have installed neo4j on windows 10 and try to copy my LOAD-CSV-Commands into the neo4j-shell. I open the Command-Prompt of neo4j and start the neo4j-shell.
The import of the nodes works fine but when i try to copy the following line into the neo4j-shell:
LOAD CSV WITH HEADERS FROM "file:///C:/neo4j/CC/autor-streitschrift.csv"
AS line FIELDTERMINATOR '|'
// WITH line LIMIT 50
MATCH (from {id:line.Autor}), (to {id:line.Streitschrift}) create from-[:AUTOR_VON]->to
RETURN count(*);
... the square-brackets are not coppied:
neo4j-sh (?)$ LOAD CSV WITH HEADERS FROM "file:///C:/neo4j/CC/autor-streitschrift.csv"
> AS line FIELDTERMINATOR '|'
> // WITH line LIMIT 50
> MATCH (from {id:line.Autor}), (to {id:line.Streitschrift}) create from-AUTOR_VON->to
> RETURN count(*);
62 ms
WARNING: Invalid input 'A': expected whitespace, [ or '-' (line 4, column 72 (offset: 195))
"MATCH (from {id:line.Autor}), (to {id:line.Streitschrift}) create from-AUTOR_VON->to"
^
neo4j-sh (?)$
I have reproduced this on a windows7-machine.
Thanks in advance for any hints,
Andreas

I realize it's awfully late, but in my case I was able to enter square brackets using Alt+91 and Alt+93. That's on Windows 7 using a german keyboard layout.

Related

python regex specific blocks of text from large text file

I'm new to python and this site so thank-you in advance for your... understanding. This is my first attempt at a python script.
I'm having what I think is a performance issue trying to solve this problem which is causing me to not get any data back.
This code works on a small text file of a couple pages but when I try to use it on my 35MB real data text file it just hits the CPU and hasn't returned any data (>24 hours now).
Here's a snippet of the real data from the 35MB text file:
D)dddld
d00d90d
dd
ddd
vsddfgsdfgsf
dfsdfdsf
aAAAAAa
221546
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
sadfsdfsdf
sdfff
ff
f
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
aaa
aaa
aaaaa
What I'm trying to copy into a new file:
29806916295
Meowing
fs:/mod/umbapp/umb/sentbox/221546.pdu
2013:10:4:22:11:31:4
29806916295
What's your cat doing?
fs:/mod/umbapp/umb/sentbox/10955.pdu
2013:10:4:22:10:15:4
My Python code is:
import re
with open('testdata.txt') as myfile:
content = myfile.read()
text = re.search(r'\d{11}.*\n.*\n.*(\d{4})\D+(\d{2})\D+(\d{1})\D+(\d{2})\D+(\d{2})\D+\d{2}\D+\d{1}', content, re.DOTALL).group()
with open("result.txt", "w") as myfile2:
myfile2.write(text)
Regex isn't the fastest way to search a string. You also compounded the problem by having a very big string (35MB). Reading an entire file into memory is generally not recommended because you may run into memory issues.
Judging from your regex pattern, it seems like you want to capture 4-line groups that start with an 11-digit string and end with some time-line string. Try this code:
import re
start_pattern = re.compile(r'^\d{11}$')
end_pattern = re.compile(r'^\d{4}\D+\d{2}\D+\d{1}\D+\d{2}\D+\d{2}\D+\d{2}\D+\d{1}$')
capturing = 0
capture = ''
with open('output.txt', 'w') as output_file:
with open('input.txt', 'r') as input_file:
for line in input_file:
if capturing > 0 and capturing <= 4:
capturing += 1
capture += line
elif start_pattern.match(line):
capturing = 1
capture = line
if capturing == 4:
if end_pattern.match(line):
output_file.write(capture + '\n')
else:
capturing = 0
It iterates over the input file, line by line. If it finds a line matching the start_pattern, it will read in 3 more. If the 4th line matches the end_pattern, it will write the whole group to the output file.

Python: Can I grab the specific lines from a large file faster?

I have two large files. One of them is an info file(about 270MB and 16,000,000 lines) like this:
1101:10003:17729
1101:10003:19979
1101:10003:23319
1101:10003:24972
1101:10003:2539
1101:10003:28242
1101:10003:28804
The other is a standard FASTQ format(about 27G and 280,000,000 lines) like this:
#ST-E00126:65:H3VJ2CCXX:7:1101:1416:1801 1:N:0:5
NTGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
#ST-E00126:65:H3VJ2CCXX:7:1101:10003:75641:N:0:5
TAAGATAGATAGCCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGCTCGTTATGG
+
AAAFFKKKKKKKKKFKKKKKKKFKKKKAFKKKKKAF7AAFFKFAAFFFKKF7FF<FKK
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifie. For each sequence,this part of the Line 1 is unique.
1101:1416:1801 and 1101:10003:75641
And I want to grab the Line 1 and the next three lines from the FASTQ file according to the info file. Here is my code:
import gzip
import re
count = 0
with open('info_path') as info, open('grab_path','w') as grab:
for i in info:
sample = i.strip()
with gzip.open('fq_path') as fq:
for j in fq:
count += 1
if count%4 == 1:
line = j.strip()
m = re.search(sample,j)
if m != None:
grab.writelines(line+'\n'+fq.next()+fq.next()+fq.next())
count = 0
break
And it works, but because both of these two files have millions of lines, it's inefficient(running one day only get 20,000 lines).
UPDATE at July 6th:
I find that the info file can be read into the memory(thank #tobias_k for reminding me), so I creat a dictionary that the keys are info lines and the values are all 0. After that, I read the FASTQ file every 4 line, use the identifier part as the key,if the value is 0 then return the 4 lines. Here is my code:
import gzip
dic = {}
with open('info_path') as info:
for i in info:
sample = i.strip()
dic[sample] = 0
with gzip.open('fq_path') as fq, open('grap_path',"w") as grab:
for j in fq:
if j[:10] == '#ST-E00126':
line = j.split(':')
match = line[4] +':'+line[5]+':'+line[6][:-2]
if dic.get(match) == 0:
grab.writelines(j+fq.next()+fq.next()+fq.next())
This way is much faster, it takes 20mins to get all the matched lines(about 64,000,000 lines). And I have thought about sorting the FASTQ file first by external sort. Splitting the file that can be read into the memory is ok, my trouble is how to keep the next three lines following the indentifier line while sorting. The Google's answer is to linear these four lines first, but it will take 40mins to do so.
Anyway thanks for your help.
You can sort both files by the identifier (the 1101:1416:1801) part. Even if files do not fit into memory, you can use external sorting.
After this, you can apply a simple merge-like strategy: read both files together and do the matching in the meantime. Something like this (pseudocode):
entry1 = readFromFile1()
entry2 = readFromFile2()
while (none of the files ended)
if (entry1.id == entry2.id)
record match
else if (entry1.id < entry2.id)
entry1 = readFromFile1()
else
entry2 = readFromFile2()
This way entry1.id and entry2.id are always close to each other and you will not miss any matches. At the same time, this approach requires iterating over each file once.

Sublime Text - Goto line and column

Currently, the Go to line shortcut (CTRL+G in windows/linux) only allows to navigate to a specific line.
It would be nice to optionally allow the column number to be specified after comma, e.g.
:30,11 to go to line 30, column 11
Is there any plugin or custom script to achieve this?
Update 3
This is now part of Sublime Text 3 starting in build number 3080:
Goto Anything supports :line:col syntax in addition to :line
For example, you can use :30:11 to go to line 30, column 11.
Update 1 - outdated
I just realized you've tagged this as sublime-text-3 and I'm using 2. It may work for you, but I haven't tested in 3.
Update 2 - outdated
Added some sanity checks and some modifications to GotoRowCol.py
Created github repo sublimetext2-GotoRowCol
Forked and submitted a pull request to commit addition to package_control_channel
Edit 3: all requirements of the package_control repo have been met. this package is now available in the package repository in the application ( install -> GotoRowCol to install ).
I too would like this feature. There's probably a better way to distribute this but I haven't really invested a lot of time into it. I read through some plugin dev tutorial really quick, and used some other plugin code to patch this thing together.
Select the menu option Tools -> New Plugin
A new example template will open up. Paste this into the template:
import sublime, sublime_plugin
class PromptGotoRowColCommand(sublime_plugin.WindowCommand):
def run(self, automatic = True):
self.window.show_input_panel(
'Enter a row and a column',
'1 1',
self.gotoRowCol,
None,
None
)
pass
def gotoRowCol(self, text):
try:
(row, col) = map(str, text.split(" "))
if self.window.active_view():
self.window.active_view().run_command(
"goto_row_col",
{"row": row, "col": col}
)
except ValueError:
pass
class GotoRowColCommand(sublime_plugin.TextCommand):
def run(self, edit, row, col):
print("INFO: Input: " + str({"row": row, "col": col}))
# rows and columns are zero based, so subtract 1
# convert text to int
(row, col) = (int(row) - 1, int(col) - 1)
if row > -1 and col > -1:
# col may be greater than the row length
col = min(col, len(self.view.substr(self.view.full_line(self.view.text_point(row, 0))))-1)
print("INFO: Calculated: " + str({"row": row, "col": col})) # r1.01 (->)
self.view.sel().clear()
self.view.sel().add(sublime.Region(self.view.text_point(row, col)))
self.view.show(self.view.text_point(row, col))
else:
print("ERROR: row or col are less than zero") # r1.01 (->)
Save the file. When the "Save As" dialog opens, it should be in the the Sublime Text 2\Packages\User\ directory. Navigate up one level to and create the folder Sublime Text 2\Packages\GotoRowCol\ and save the file with the name GotoRowCol.py.
Create a new file in the same directory Sublime Text 2\Packages\GotoRowCol\GotoRowCol.sublime-commands and open GotoRowCol.sublime-commands in sublime text. Paste this into the file:
[
{
"caption": "GotoRowCol",
"command": "prompt_goto_row_col"
}
]
Save the file. This should register the GotoRowCol plugin in the sublime text system. To use it, hit ctrl + shift + p then type GotoRowCol and hit ENTER. A prompt will show up at the bottom of the sublime text window with two number prepopulated, the first one is the row you want to go to, the second one is the column. Enter the values you desire, then hit ENTER.
I know this is a complex operation, but it's what I have right now and is working for me.

error in writing to a file

I have written a python script that calls unix sort using subprocess module. I am trying to sort a table based on two columns(2 and 6). Here is what I have done
sort_bt=open("sort_blast.txt",'w+')
sort_file_cmd="sort -k2,2 -k6,6n {0}".format(tab.name)
subprocess.call(sort_file_cmd,stdout=sort_bt,shell=True)
The output file however contains an incomplete line which produces an error when I parse the table but when I checked the entry in the input file given to sort the line looks perfect. I guess there is some problem when sort tries to write the result to the file specified but I am not sure how to solve it though.
The line looks like this in the input file
gi|191252805|ref|NM_001128633.1| Homo sapiens RIMS binding protein 3C (RIMBP3C), mRNA gnl|BL_ORD_ID|4614 gi|124487059|ref|NP_001074857.1| RIMS-binding protein 2 [Mus musculus] 103 2877 3176 846 941 1.0102e-07 138.0
In output file however only gi|19125 is printed. How do I solve this?
Any help will be appreciated.
Ram
Using subprocess to call an external sorting tool seems quite silly considering that python has a built in method for sorting items.
Looking at your sample data, it appears to be structured data, with a | delimiter. Here's how you could open that file, and iterate over the results in python in a sorted manner:
def custom_sorter(first, second):
""" A Custom Sort function which compares items
based on the value in the 2nd and 6th columns. """
# First, we break the line into a list
first_items, second_items = first.split(u'|'), second.split(u'|') # Split on the pipe character.
if len(first_items) >= 6 and len(second_items) >= 6:
# We have enough items to compare
if (first_items[1], first_items[5]) > (second_items[1], second_items[5]):
return 1
elif (first_items[1], first_items[5]) < (second_items[1], second_items[5]):
return -1
else: # They are the same
return 0 # Order doesn't matter then
else:
return 0
with open(src_file_path, 'r') as src_file:
data = src_file.read() # Read in the src file all at once. Hope the file isn't too big!
with open(dst_sorted_file_path, 'w+') as dst_sorted_file:
for line in sorted(data.splitlines(), cmp = custom_sorter): # Sort the data on the fly
dst_sorted_file.write(line) # Write the line to the dst_file.
FYI, this code may need some jiggling. I didn't test it too well.
What you see is probably the result of trying to write to the file from multiple processes simultaneously.
To emulate: sort -k2,2 -k6,6n ${tabname} > sort_blast.txt command in Python:
from subprocess import check_call
with open("sort_blast.txt",'wb') as output_file:
check_call("sort -k2,2 -k6,6n".split() + [tab.name], stdout=output_file)
You can write it in pure Python e.g., for a small input file:
def custom_key(line):
fields = line.split() # split line on any whitespace
return fields[1], float(fields[5]) # Python uses zero-based indexing
with open(tab.name) as input_file, open("sort_blast.txt", 'w') as output_file:
L = input_file.read().splitlines() # read from the input file
L.sort(key=custom_key) # sort it
output_file.write("\n".join(L)) # write to the output file
If you need to sort a file that does not fit in memory; see Sorting text file by using Python

How to substitute the first line and split the first string on the following line:

I am just new to scripting and I need some help. I have something like a bazillion files that look like this.
Assign F2 Height
3IleN 2.34025e+07
4PheN 2.05028e+07
6LysN 1.43672e+07
7ThrN 1.49120e+07
8LeuN 1.30838e+07
9ThrN 1.44298e+07
And i want it to look like this + save it in another file with the same name as the previous file however, with a "MOD" written at the beginning.
Number AA Height
3 IleN 6.20756e+07
4 PheN 5.26499e+07
7 ThrN 3.00216e+07
8 LeuN 3.26377e+07
9 ThrN 4.03901e+07
10 GlyN 2.73659e+07
12 ThrN 3.16319e+07
13 IleN 5.94604e+07
If you could please describe and explain the parameters used, that would be of great help.
Thanks!
The following should work for you:
sed 's/^\([0-9]*\)/\1 /' filename

Resources