I have two sets of codes to count the number of sentences in one text file. The two options generate different results and Option 2(Stanza) is very slow. Is Option 2(Stanza) more accurate? How should I speedup Option 2(Stanza)? Thanks a lot!
Option 1 (Regular expression): The following codes takes 2 seconds and the output is 1444.
import requests
from bs4 import BeautifulSoup
import re
sentence_regex = re.compile(r"\b[A-Z](?:[^\.!?]|\.\d)*[\.!?]")
def identify_sentences(input_text:str):
"""Returns all sentences in the input text"""
sentences = re.findall(sentence_regex, input_text)
return sentences
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
sentences=identify_sentences(text)
len(sentences)
Option 2(Stanza): The following codes takes 6 minutes and the output is 2481.
import requests
from bs4 import BeautifulSoup
import stanza
nlp=stanza.Pipeline(lang='en', processors='tokenize, pos, ner')
r=requests.get("https://www.sec.gov/Archives/edgar/data/861439/0000912057-94-000263.txt", headers={"User-Agent": "b2g"})
content=r.content.decode('utf8')
soup=BeautifulSoup(content, "html5lib")
text=soup.text
doc=nlp(text)
sentences=doc.sentences
len(sentences)
Two answers:
If all you're wanting to do is to split text into sentences, then your pipeline should be simply nlp=stanza.Pipeline(lang='en', processors='tokenize') and that will be much faster than the pipeline you show that also runs a part-of-speech tagger and named entity recognizer.
But, yes, running Stanza is way slower than simply doing matching against a single regex. There should be many places where it works differently and better, because exclamation marks, question marks, and especially periods often occur in the middle of English sentences (e.g., here!). You'll have to decide for yourself whether the better accuracy is worth it to you.
Related
I'm using Gensim with Fasttext Word vectors for return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun", in english) return a series of words with a dot in it (like sole., sole.Ma, ecc...). Where is the problem? Why most_similar return this meaningless word?
EDIT
I tried with english word vector and the word "sun" return this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)]
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
(I'm not sure that will help with these results, but if choosing to use FastText, you may be interesting it using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.
Following code is used to preprocess text with a custom lemmatizer function:
%%time
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.utils import simple_preprocess, lemmatize
from gensim.parsing.preprocessing import STOPWORDS
STOPWORDS = list(STOPWORDS)
def preprocessor(s):
result = []
for token in lemmatize(s, stopwords=STOPWORDS, min_length=2):
result.append(token.decode('utf-8').split('/')[0])
return result
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
%%time
X_train, X_test, y_train, y_test = train_test_split([preprocessor(x) for x in data.text],
data.label, test_size=0.2, random_state=0)
#10.8 seconds
Question:
Can the speed of the lemmatization process be improved?
On a large corpus of about 80,000 documents, it currently takes about two hours. The lemmatize() function seems to be the main bottleneck, as a gensim function such as simple_preprocess is quite fast.
Thanks for your help!
You may want to refactor your code to make it easier to time each portion separately. lemmatize() might be part of your bottleneck, but other significant contributors might also be: (1) composing large documents, one-token-at-a-time, via list .append(); (2) the utf-8 decoding.
Separately, the gensim lemmatize() relies on the parse() function from the Pattern library; you could try an alternative lemmatization utility, like those in NLTK or Spacy.
Finally, as lemmatization may be an inherently costly operation, and it might be the case that the same source data gets processed many times in your pipeline, you might want to engineer your process so that the results are re-written to disk, then re-used on subsequent runs – rather than always done "in-line".
The dataset that I have is separated on different files grouped on samples that know each other, i.e., they were created on similar conditions on a similar time.
The balance of the train-test dataset is important so the samples have to be on train or test, but cannot be separated. So KFold it is not simple to use on my scikit-learn code.
Right now, I am using something similar to LOO making something like:
train ~> cat ./dataset/!(1.txt)
test ~> cat ./dataset/1.txt
Which is not confortable and not very useful if I want to make folds on test of several files and make a "real" CV.
How would be possible to make a good CV to check real overfitting?
Looking to this answer, I've realized that pandas can concatenate dataframes. I checked that the process is 15-20% slower than cat command-line but makes able to do folds as I was expecting.
Anyway, I am quite sure that there should be any other better way than this one:
import glob
import numpy as np
import pandas as pd
from sklearn.cross_validation import KFold
allFiles = glob.glob("./dataset/*.txt")
kf = KFold(len(allFiles), n_folds=3, shuffle=True)
for train_files, cv_files in kf:
dataTrain = pd.concat((pd.read_csv(allFiles[idTrain], header=None) for idTrain in train_files))
dataTest = pd.concat((pd.read_csv(allFiles[idTest], header=None) for idTest in cv_files))
I have written a Python 2.7 script that reads a CSV file and then does some standard deviation calculations . It works absolutely fine however it is very very slow. A CSV I tried with 100 million lines took around 28 hours to complete. I did some googling and it appears that maybe using the pandas module might makes this quicker .
I have posted part of the code below, since i am a pretty novice when it comes to python , i am unsure if using pandas would actually help at all and if it did would the function need to be completely re-written.
Just some context for the CSV file, it has 3 columns, first column is an IP address, second is a url and the third is a timestamp.
def parseCsvToDict(filepath):
with open(csv_file_path) as f:
ip_dict = dict()
csv_data = csv.reader(f)
f.next() # skip header line
for row in csv_data:
if len(row) == 3: #Some lines in the csv have more/less than the 3 fields they should have so this is a cheat to get the script working ignoring an wrong data
current_ip, URI, current_timestamp = row
epoch_time = convert_time(current_timestamp) # convert each time to epoch
if current_ip not in ip_dict.keys():
ip_dict[current_ip] = dict()
if URI not in ip_dict[current_ip].keys():
ip_dict[current_ip][URI] = list()
ip_dict[current_ip][URI].append(epoch_time)
return(ip_dict)
Once the above function has finished the data is parsed to another function that calculates the standard deviation for each IP/URL pair (using numpy.std).
Do you think that using pandas may increase the speed and would it require a complete rewrite or is it easy to modify the above code?
The following should work:
import pandas as pd
colnames = ["current_IP", "URI", "current_timestamp", "dummy"]
df = pd.read_csv(filepath, names=colnames)
# Remove incomplete and redundant rows:
df = df[~df.current_timestamp.isnull() & df.dummy.isnull()]
Notice this assumes you have enough RAM. In your code, you are already assuming you have enough memory for the dictionary, but the latter may be significatively smaller than the memory used by the above, for two reasons.
If it is because most lines are dropped, then just parse the csv by chunks: arguments skiprows and nrows are your friends, and then pd.concat
If it is because IPs/URLs are repeated, then you will want to transform IPs and URLs from normal columns to indices: parse by chunks as above, and on each chunk do
indexed = df.set_index(["current_IP", "URI"]).sort_index()
I expect this will indeed give you a performance boost.
EDIT: ... including a performance boost to the calculation of the standard deviation (hint: df.groupby())
I will not be able to give you an exact solution, but here are a couple of ideas.
Based on your data, you read 100000000. / 28 / 60 / 60 approximately 1000 lines per second. Not really slow, but I believe that just reading such a big file can cause a problem.
So take a look at this performance comparison of how to read a huge file. Basically a guy suggests that doing this:
file = open("sample.txt")
while 1:
lines = file.readlines(100000)
if not lines:
break
for line in lines:
pass # do something
can give you like 3x read boost. I also suggest you to try defaultdict instead of your if k in dict create [] otherwise append.
And last, not related to python: working in data-analysis, I have found an amazing tool for working with csv/json. It is csvkit, which allows to manipulate csv data with ease.
In addition to what Salvador Dali said in his answer: If you want to keep as much of the current code of your script, you may find that PyPy can speed up your program:
“If you want your code to run faster, you should probably just use PyPy.” — Guido van Rossum (creator of Python)
Still using bloody OpenOffice Writer to customize my sale_order.rml report.
In my sale order I have 6 order lines with 6 different lead time to delivery. I need to show the maximum out of the six values.
After many attempt I have abandoned using the reduce function as it works erratically or not at all most of the time. I have never seen anything like this.
So I thought I'd give a try using max encapsulating a loop such as:
[[ max(repeatIn(so.order_line.delay,'d')) ]]
My maximum lead time being 20, I would expect to see 20 (yes well that would be too easy, wouldn't it!).
It returns
{'d': 20.0}
At least it contains the value I am after.
But; if I try and manipulate this result, it disappears altogether.
I have tried:
int(re.findall(r'[0-9]+', max(repeatIn(so.order_line.delay,'d')))[0])
which works great from the python window, but returns absolutely nothing in OpenERP.
I import the re from my sale_order.py file, which I have recompiled into sale_order.pyo:
import time
import re
from datetime import datetime, timedelta
from report import report_sxw
class order(report_sxw.rml_parse):
def __init__(self, cr, uid, name, context=None):
super(order, self).__init__(cr, uid, name, context=context)
self.localcontext.update({
'time': time,
'datetime': datetime,
'timedelta': timedelta,
're': re,
})
I have of course restarted the server many times. My test install sits on windows.
So can anyone tell me what I am doing wrong, because I can make it work from Python but not from OpenOffice Writer!
Thanks for your help!
EDIT 1:
The format
{'d': 20.0}
is, according to python, a dictionary. Still in Python, to extract the integer from a dictionary it is possible to do it like so:
>>> dict={'d': 20.0}
>>> print(dict['d'])
20.0
But how can I transpose this to OpenERP writer???
I have manage to get the result I wanted by importing functools and declaring the reduce function within the parameters of the sale_order.py file.
I then simply used a combination of reduce and max function and it works exactly as expected.
The correct syntax is as follow:
repeatIn(objects,'o')
reduce(lambda x, y: max(x, y.delay), o.order_line, 0)
Nothing else is required.
Enjoy!