R: tm Textmining package: Doc-Level metadata generation is slow - performance

I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files).
This for-loop works, but it is very slow.
Performance seems to degrade as a function f ~ 1/n_docs.
for (i in seq(from= 1, to=length(corpus), by=1)){
    if(opts$options$verbose == TRUE || i %% 50 == 0){
        print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
    }
    DublinCore(corpus[[i]], "title") = csv[[i,10]]
    DublinCore(corpus[[i]], "Publisher" ) = csv[[i,16]] #institutions
}
This may do something to the corpus variable but I don't know what.
But when I put it inside a tm_map() (similar to lapply() function), it runs much faster, but the changes are not made persistent:
i = 0
corpus = tm_map(corpus, function(x){
    i <<- i + 1
    if(opts$options$verbose == TRUE){
        print(paste(i, " ", substr(x, 1, 140), sep = " "))
    }
    meta(x, tag = "Heading") = csv[[i,10]]
    meta(x, tag = "publisher" ) = csv[[i,16]]
})
Variable corpus has empty metadata fields after exiting the tm_map function. It should be filled. I have a few other things to do with the collection.
The R documentation for the meta() function says this:
meta(crude[[1]], tag = "Topics")
meta(crude[[1]], tag = "Comment") <- "A short comment."
meta(crude[[1]], tag = "Topics") <- NULL
DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
DublinCore(crude[[1]], tag = "Format") <- "XML"
meta(crude, type = "corpus")
meta(crude, "labels") <- 21:40
I tried many of these calls (with var "corpus" instead of "crude"), but they do not seem to work.
Someone else once seemed to have had the same problem with a similar data set (forum post from 2009, no response)

Here's a bit of benchmarking...
With the for loop :
library(microbenchmark)
expr.for <- function() {
    for (i in seq(from= 1, to=length(corpus), by=1)){
        DublinCore(corpus[[i]], "title") = LETTERS[round(runif(26))]
        DublinCore(corpus[[i]], "Publisher" ) = LETTERS[round(runif(26))]
    }
}
microbenchmark(expr.for())
# Unit: milliseconds
# expr min lq median uq max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398
With tm_map :
corpus <- crude
expr.map <- function() {
    tm_map(corpus, function(x) {
        meta(x, "title") = LETTERS[round(runif(26))]
        meta(x, "Publisher" ) = LETTERS[round(runif(26))]
    })
}
microbenchmark(expr.map())
# Unit: milliseconds
# expr min lq median uq max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482
So the tm_map version, as you noticed, seems to be about 4 times faster.
In your question you say that the changes in the tm_map version are not persistent. That is because you don't return x at the end of your anonymous function. The end of the function should be:
    meta(x, tag = "Heading") = csv[[i,10]]
    meta(x, tag = "publisher" ) = csv[[i,16]]
    x  # last expression: return the modified document
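The same pitfall can be shown outside R. The sketch below is only a Python analogy, not tm code: `Doc`, `tag_no_return`, and `tag_and_return` are hypothetical names. It illustrates that a mapping function which builds a modified copy but does not return it leaves the mapped collection without the change.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a corpus document with one metadata field."""
    text: str
    heading: str = ""


def tag_no_return(doc, heading):
    # Builds a tagged copy, then drops it: the function returns None.
    replace(doc, heading=heading)


def tag_and_return(doc, heading):
    # Returns the tagged copy, so the caller can store it.
    return replace(doc, heading=heading)


docs = [Doc("first text"), Doc("second text")]
headings = ["Title A", "Title B"]

lost = [tag_no_return(d, h) for d, h in zip(docs, headings)]
kept = [tag_and_return(d, h) for d, h in zip(docs, headings)]

print(lost)                       # the tagged copies were never returned
print([d.heading for d in kept])  # headings survive the mapping
```

Pairing each document with its metadata row up front (here with `zip`) also avoids relying on a counter that is mutated from inside the mapped function.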


Ticker labels after plot

I would really appreciate some help adding a ticker label after my plot from the code below. Below is an extract of the main code - I've only used the lines for 1 ticker but in reality, I'll have 32 symbols in total so I've omitted the unnecessary code duplication for the other tickers for the purposes of this query.
Any improvements on the rest of the code would also be appreciated. I'm using ticker.new because 1) I don't need the inputs and 2) the plot only seemed to display properly when the session was listed as extended (even though the chart was already set to extended).
indicator('NASDAQ Trend', overlay=true)
s01 = ticker.new("NASDAQ", "AAL", session.extended)
screener_func() =>
    MA1 = ta.ema(close, 5)
    MA2 = ta.ema(close, 8)
    MA3 = ta.ema(close, 13)
    MA4 = ta.ema(close, 21)
    MA5 = ta.ema(close, 34)
    MA_Stack_Up = (MA1 > MA2) and (MA2 > MA3) and (MA3 > MA4) and (MA4 > MA5)
    Uptrend = MA_Stack_Up
    Reversal = ((MA1 < MA2) and (MA2 > MA3)) or ((MA1 > MA2) and (MA2 < MA3))
    Bar_Color = Uptrend ? color.new(color.green, 25) : Reversal ? color.new(color.yellow, 25) : color.new(color.red, 25)
    [Bar_Color]
// Security call
[TS01]= request.security(s01, timeframe.period, screener_func())
// PLOTS //
l_width = 3
shape = plot.style_circles
plot(1, color=TS01, style=shape, linewidth=l_width)
L1= label.new(bar_index, 1, text=s01, style=label.style_none, textcolor=color.new(color.white, 0), size=size.small)
My problem is the resulting label is ={"session":"extended","symbol":"NASDAQ:AAL"}. Ideally, the label should be just AAL
You can write a function that extracts the part you are interested in from that string, as long as your string ALWAYS has the same format ={"session":"extended","symbol":"NASDAQ:AAL"}.
Step 1: Split the string using : as a delimiter. You will then have 4 substrings.
={"session" (index: 0)
"extended","symbol" (index: 1)
"NASDAQ (index: 2)
AAL"} <-- This is what you want (index: 3)
Step 2: Remove the last two chars. Since the last two chars will always be "}, return a string until last two chars.
getName(_str) =>
    string[] _pair = str.split(_str, ":")
    string[] _chars = str.split(array.get(_pair, 3), "") // Index 3
    int _len = array.size(_chars) - 2 // Don't get the last two chars
    string[] _substr = array.new_string(0)
    _substr := array.slice(_chars, 0, _len)
    string _return = array.join(_substr, "")
Then call this function when you create a label:
L1= label.new(bar_index, 1, text=getName(s01), style=label.style_none, textcolor=color.new(color.white, 0), size=size.small)
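For readers who want to check the two steps outside Pine Script, here is a minimal Python sketch of the same split-and-trim logic. The name `get_name` is hypothetical, and the sketch assumes the input always has exactly the shape shown in the question.

```python
def get_name(ticker_id: str) -> str:
    """Return the bare symbol from a string shaped like the one in the question."""
    parts = ticker_id.split(":")  # step 1: four substrings
    last = parts[3]               # the fourth one still ends in a quote and a brace
    return last[:-2]              # step 2: drop those last two characters


example = '={"session":"extended","symbol":"NASDAQ:AAL"}'
print(get_name(example))
```

If the exchange prefix or the session field ever contained an extra colon, index 3 would no longer be the symbol, which is why the fixed-format assumption matters.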
No sense in doing all those string gymnastics when the ticker is available. Try this:
indicator('NASDAQ Trend', overlay=true)
// hey look – it's now foldable....
screener_func() =>
    MA1 = ta.ema(close, 5)
    MA2 = ta.ema(close, 8)
    MA3 = ta.ema(close, 13)
    MA4 = ta.ema(close, 21)
    MA5 = ta.ema(close, 34)
    // CONDITIONS - no sense in allocating extra variables when hidden in function
    Uptrend = (MA1 > MA2) and (MA2 > MA3) and (MA3 > MA4) and (MA4 > MA5)
    Reversal = ((MA1 < MA2) and (MA2 > MA3)) or ((MA1 > MA2) and (MA2 < MA3))
    // DETERMINE COLOR CODING - just return the results
    // vs assigning to a variable and then return the variable...
    [Uptrend ? color.new(color.green, 25) : Reversal ? color.new(color.yellow, 25) : color.new(color.red, 25)]
// }
// Define tickers and set up labels on last bar
// Note: request security limits the max number of tickers to 40
var tickers = array.from("AA","AAL","AAPL")
// use the index of the symbol in the ticker array as price level to plot...
if barstate.islast
    for item in tickers
        label.new(bar_index+ 5, array.indexof(tickers,item), item, style=label.style_none, textcolor=color.new(color.white, 0), size=size.small)
// Since the ticker symbol must be known at compile time aka simple string,
// we have to call each symbol separately - what a pain...
// style variations here in case we want to change
var aStyle = plot.style_circles
var aLinewidth = 3
// AA <-- Yep this is definitely not NASDAQ material...
temp1 = ticker.new("NYSE", "AA", session.extended)
[temp2] = request.security(temp1, timeframe.period, screener_func())
plot(0, color= temp2, style=aStyle, linewidth=aLinewidth)
// AAL
temp3 = ticker.new("NASDAQ", "AAL", session.extended)
[temp4] = request.security(temp3, timeframe.period, screener_func())
plot(1, color= temp4, style=aStyle, linewidth=aLinewidth)
temp5 = ticker.new("NASDAQ", "AAPL", session.extended)
[temp6] = request.security(temp5, timeframe.period, screener_func())
plot(2, color= temp6, style=aStyle, linewidth=aLinewidth)

Fastest way to search for a row in a large Google Sheet using/in Google Apps Script

GAS is quite powerful and you could write a full fledged web-app using a Google Sheet as the DB back-end. There are many reasons not to do this but I figure in some cases it is okay.
I think the biggest issue will be performance issues when looking for rows based on some criteria in a sheet with a lot of rows. I know there are many ways to "query" a sheet but I can't find reliable information on which is the fastest.
One of the complexities is that many people can edit a sheet which means there are a variable number of situations you'd have to account for. For the sake of simplicity, I want to assume the sheet:
Is locked down so only one person can see it
The first column has the row number (=row())
The most basic query is finding a row where a specific column equals some value.
Which method would be the fastest?
I have a sheet with ~19k rows and ~38 columns, filled with all sorts of unsorted real-world data. That is almost 700k cells, so I figured it would be a good sheet to time a few methods and see which is the fastest.
method 1: get sheet as a 2D array then go through each row
method 2: get sheet as a 2D array, sort it, then using a binary search algorithm to find the row
method 3: make a UrlFetch call to Google visualization query and don't provide last row
method 4: make a UrlFetch call to Google visualization query and provide last row
Here are my query functions.
function method1(spreadsheetID, sheetName, columnIndex, query)
{
    // get the sheet values excluding header,
    var rowValues = SpreadsheetApp.openById(spreadsheetID).getSheetByName(sheetName).getSheetValues(2, 1, -1, -1);
    // loop through each row
    for(var i = 0, numRows = rowValues.length; i < numRows; ++i)
    {
        // return it if found
        if(rowValues[i][columnIndex] == query) return rowValues[i]
    }
    return false;
}
function method2(spreadsheetID, sheetName, columnIndex, query)
{
    // get the sheet values excluding header
    var rowValues = SpreadsheetApp.openById(spreadsheetID).getSheetByName(sheetName).getSheetValues(2, 1, -1, -1);
    // sort it
    rowValues.sort(function(a, b){
        if(a[columnIndex] < b[columnIndex]) return -1;
        if(a[columnIndex] > b[columnIndex]) return 1;
        return 0;
    });
    // search using binary search
    var foundRow = matrixBinarySearch(rowValues, columnIndex, query, 0, rowValues.length - 1);
    // return if found
    if(foundRow != -1)
    {
        return rowValues[foundRow];
    }
    return false;
}
function method3(spreadsheetID, sheetName, queryColumnLetterStart, queryColumnLetterEnd, queryColumnLetterSearch, query)
{
    // SQL like query
    var myQuery = "SELECT * WHERE " + queryColumnLetterSearch + " = '" + query + "'";
    // the query URL
    // don't provide last row in range selection
    var qvizURL = 'https://docs.google.com/spreadsheets/d/' + spreadsheetID + '/gviz/tq?tqx=out:json&headers=1&sheet=' + sheetName + '&range=' + queryColumnLetterStart + ":" + queryColumnLetterEnd + '&tq=' + encodeURIComponent(myQuery);
    // fetch the data
    var ret = UrlFetchApp.fetch(qvizURL, {headers: {Authorization: 'Bearer ' + ScriptApp.getOAuthToken()}}).getContentText();
    // strip the JSONP wrapper from the return string
    return JSON.parse(ret.replace("/*O_o*/", "").replace("google.visualization.Query.setResponse(", "").slice(0, -2));
}
function method4(spreadsheetID, sheetName, queryColumnLetterStart, queryColumnLetterEnd, queryColumnLetterSearch, query)
{
    // find the last row in the sheet
    var lastRow = SpreadsheetApp.openById(spreadsheetID).getSheetByName(sheetName).getLastRow();
    // SQL like query
    var myQuery = "SELECT * WHERE " + queryColumnLetterSearch + " = '" + query + "'";
    // the query URL
    var qvizURL = 'https://docs.google.com/spreadsheets/d/' + spreadsheetID + '/gviz/tq?tqx=out:json&headers=1&sheet=' + sheetName + '&range=' + queryColumnLetterStart + "1:" + queryColumnLetterEnd + lastRow + '&tq=' + encodeURIComponent(myQuery);
    // fetch the data
    var ret = UrlFetchApp.fetch(qvizURL, {headers: {Authorization: 'Bearer ' + ScriptApp.getOAuthToken()}}).getContentText();
    // strip the JSONP wrapper from the return string
    return JSON.parse(ret.replace("/*O_o*/", "").replace("google.visualization.Query.setResponse(", "").slice(0, -2));
}
My binary search algorithm:
function matrixBinarySearch(matrix, columnIndex, query, firstIndex, lastIndex)
{
    // find the value using binary search
    // https://www.w3resource.com/javascript-exercises/javascript-array-exercise-18.php
    // first make sure the query string is valid
    // if it is less than the smallest value
    // or larger than the largest value
    // it is not valid
    if(query < matrix[firstIndex][columnIndex] || query > matrix[lastIndex][columnIndex]) return -1;
    // if it's the first row
    if(query == matrix[firstIndex][columnIndex]) return firstIndex;
    // if it's the last row
    if(query == matrix[lastIndex][columnIndex]) return lastIndex;
    // now start doing binary search
    var middleIndex = Math.floor((lastIndex + firstIndex)/2);
    while(matrix[middleIndex][columnIndex] != query && firstIndex < lastIndex)
    {
        if(query < matrix[middleIndex][columnIndex])
        {
            lastIndex = middleIndex - 1;
        }
        else if(query > matrix[middleIndex][columnIndex])
        {
            firstIndex = middleIndex + 1;
        }
        middleIndex = Math.floor((lastIndex + firstIndex)/2);
    }
    return matrix[middleIndex][columnIndex] == query ? middleIndex : -1;
}
This is the function I used to test them all:
// each time this function is called it will try one method
// the first time it is called it will try method1
// then method2, then method3, then method4
// after it does method4 it will start back at method1
// we will use script properties to save which method is next
// we also want to use the same query string for each batch so we'll save that in script properties too
function testIt()
{
    // get the sheet where we're saving run times
    var runTimesSheet = SpreadsheetApp.openById("...").getSheetByName("times");
    // we want to see true speed tests and don't want server side caching so we make a copy of our data sheet
    // make a copy of our data sheet and get its ID
    var tempSheetID = SpreadsheetApp.openById("...").copy("temp sheet").getId();
    // get script properties
    var scriptProperties = PropertiesService.getScriptProperties();
    // the counter
    var searchCounter = Number(scriptProperties.getProperty("searchCounter"));
    // index of search list we want to query for
    var searchListIndex = Number(scriptProperties.getProperty("searchListIndex"));
    // if we're at 0 then we need to get the index of the query string
    if(searchCounter == 0)
    {
        searchListIndex = Math.floor(Math.random() * searchList.length);
        scriptProperties.setProperty("searchListIndex", searchListIndex);
    }
    // query string
    var query = searchList[searchListIndex];
    // save relevant data
    var timerRow = ["method" + (searchCounter + 1), searchListIndex, query, 0, "", "", "", ""];
    // run the appropriate method
    switch(searchCounter)
    {
        case 0:
            // start time
            var start = (new Date()).getTime();
            // run the query
            var ret = method1(tempSheetID, "Extract", 1, query);
            // end time
            timerRow[3] = ((new Date()).getTime() - start) / 1000;
            // if we found the row save its values in the timer output so we can confirm it was found
            timerRow[4] = ret[0];
            timerRow[5] = ret[1];
            timerRow[6] = ret[2];
            timerRow[7] = ret[3];
            break;
        case 1:
            var start = (new Date()).getTime();
            var ret = method2(tempSheetID, "Extract", 1, query);
            timerRow[3] = ((new Date()).getTime() - start) / 1000;
            timerRow[4] = ret[0];
            timerRow[5] = ret[1];
            timerRow[6] = ret[2];
            timerRow[7] = ret[3];
            break;
        case 2:
            var start = (new Date()).getTime();
            var ret = method3(tempSheetID, "Extract", "A", "AL", "B", query);
            timerRow[3] = ((new Date()).getTime() - start) / 1000;
            timerRow[4] = ret.table.rows[0].c[0].v;
            timerRow[5] = ret.table.rows[0].c[1].v;
            timerRow[6] = ret.table.rows[0].c[2].v;
            timerRow[7] = ret.table.rows[0].c[3].v;
            break;
        case 3:
            var start = (new Date()).getTime();
            var ret = method4(tempSheetID, "Extract", "A", "AL", "B", query);
            timerRow[3] = ((new Date()).getTime() - start) / 1000;
            timerRow[4] = ret.table.rows[0].c[0].v;
            timerRow[5] = ret.table.rows[0].c[1].v;
            timerRow[6] = ret.table.rows[0].c[2].v;
            timerRow[7] = ret.table.rows[0].c[3].v;
            break;
    }
    // delete the temp file
    DriveApp.getFileById(tempSheetID).setTrashed(true);
    // save run times
    runTimesSheet.appendRow(timerRow);
    // start back at 0 if we're at the end
    if(++searchCounter == 4) searchCounter = 0;
    // save the search counter
    scriptProperties.setProperty("searchCounter", searchCounter);
}
I have a global variable searchList that is an array of various query strings -- some are in the sheet, some are not.
I ran testIt on a trigger that fires every minute. After 152 iterations I had 38 batches. Looking at the results, this is what I see for each method:
| Method | Minimum Seconds | Maximum Seconds | Average Seconds |
| method1 | 8.24 | 36.94 | 11.86 |
| method2 | 9.93 | 23.38 | 14.09 |
| method3 | 1.92 | 5.48 | 3.06 |
| method4 | 2.20 | 11.14 | 3.36 |
So it appears that, at least for my data set, using the Google visualization query is the fastest.
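One thing worth noting about the in-memory methods: method2 sorts the whole array on every call before it searches, so a single lookup does more work (a sort, O(n log n)) than method1's plain scan (O(n)). Binary search only pays off when the sorted data is reused across many lookups. The sketch below illustrates that trade-off in Python rather than Apps Script; the function names and the tiny `rows` sample are hypothetical, and it says nothing about the time spent reading the sheet.

```python
from bisect import bisect_left


def linear_find(rows, column_index, query):
    """Scan every row once: O(n) per lookup, no preparation needed."""
    for row in rows:
        if row[column_index] == query:
            return row
    return None


def build_sorted_index(rows, column_index):
    """Sort once (O(n log n)) and keep the keys in a parallel list."""
    ordered = sorted(rows, key=lambda row: row[column_index])
    keys = [row[column_index] for row in ordered]
    return ordered, keys


def binary_find(ordered, keys, query):
    """Binary search over the prepared index: O(log n) per lookup."""
    pos = bisect_left(keys, query)
    if pos < len(keys) and keys[pos] == query:
        return ordered[pos]
    return None


rows = [[3, "cherry"], [1, "apple"], [2, "banana"]]
ordered, keys = build_sorted_index(rows, 1)  # done once, reused below
print(linear_find(rows, 1, "banana"))
print(binary_find(ordered, keys, "banana"))
print(binary_find(ordered, keys, "durian"))
```

If many queries hit the same data within one execution, building the index once and calling `binary_find` repeatedly is where the sorted approach would be expected to win.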

nltk similarity performance issue?

nltk has a nice word-to-word similarity function that measures similarity by how close the terms are to their common hypernym. Although that similarity function is not applicable when the two terms have different POS tags, it is still great.
However, I found that it is very slow: it was 10 times slower than plain term matching. Is there any way to make the nltk similarity function faster?
I have tested with this code below:
from nltk import stem, RegexpStemmer
from nltk.corpus import wordnet, stopwords
from nltk.tag import pos_tag
import time

file1 = open('./tester.csv', 'r')

def similarityCal(word1, word2):
    # compare the first synset of each word, if both words have one
    synset1 = wordnet.synsets(word1)
    synset2 = wordnet.synsets(word2)
    if len(synset1) != 0 and len(synset2) != 0:
        wordFromList1 = synset1[0]
        wordFromList2 = synset2[0]
        return wordFromList1.wup_similarity(wordFromList2)
    return 0

start_time = time.time()
file1lines = file1.readlines()
stopwords = stopwords.words('english')
previousLine = ""
currentLine = ""
cntOri = 0
cntExp = 0
for line1 in file1lines:
    currentLine = line1.lower().strip()
    if previousLine == "":
        previousLine = currentLine
    for tag1 in pos_tag(currentLine.split(" ")):
        tmpStr1 = tag1[0]
        if tmpStr1 not in stopwords and len(tmpStr1) > 1:
            # plain term matching
            if tmpStr1 in previousLine:
                print("termMatching word", tmpStr1)
                cntOri = cntOri + 1
            # similarity matching between nouns or between verbs
            for tag2 in pos_tag(previousLine.split(" ")):
                tmpStr2 = tag2[0]
                if (tag1[1].startswith("NN") and tag2[1].startswith("NN")) or (tag1[1].startswith("VB") and tag2[1].startswith("VB")):
                    value = similarityCal(tmpStr1, tmpStr2)
                    if type(value) is float and value > 0.8:
                        print(tmpStr1, " similar to " , tmpStr2 , " ", value)
                        cntExp = cntExp + 1
    previousLine = currentLine
end_time = time.time()
print ("time taken : ",end_time - start_time, " // ", cntOri, " | ", cntExp)
To compare the performance, I simply commented out the similarity function call.
And I have used samples from this site:
Any ideas?
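One idea that might help, offered as a sketch rather than a measured fix: the script can ask for the similarity of the same word pair many times, and each request repeats the WordNet lookups. Memoizing the pair function avoids the repeats. In the sketch below, `make_cached_similarity` and `slow_similarity` are hypothetical names, and `slow_similarity` is only a stand-in so the example runs without WordNet installed.

```python
from functools import lru_cache


def make_cached_similarity(similarity):
    """Wrap a two-word similarity function so repeated pairs are computed once."""
    @lru_cache(maxsize=None)
    def cached(word1, word2):
        return similarity(word1, word2)
    return cached


# Stand-in for similarityCal; it records every real computation.
calls = []


def slow_similarity(word1, word2):
    calls.append((word1, word2))
    return 1.0 if word1 == word2 else 0.0


cached_similarity = make_cached_similarity(slow_similarity)
for pair in [("dog", "cat"), ("dog", "cat"), ("dog", "dog")]:
    cached_similarity(*pair)

print(len(calls))  # how many times the underlying function actually ran
```

In the script above this would amount to `similarityCal = make_cached_similarity(similarityCal)`. How much it helps depends on how often word pairs repeat in the data. Separately, `pos_tag(previousLine.split(" "))` is evaluated inside the `tag1` loop, so tagging the previous line once per line instead of once per word would remove further repeated work.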

Using par map to increase performance

The code below runs a comparison of users and writes the results to a file. I've removed some code to make it as concise as possible, but speed is also an issue in this code:
import scala.collection.JavaConversions._
object writedata {
  def getDistance(str1: String, str2: String) = {
    val zipped = str1.zip(str2)
    val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
    val p = zipped.count(_ == ('1', '1')).toFloat * 2
    val q = zipped.count(_ == ('1', '0')).toFloat * 2
    val r = zipped.count(_ == ('0', '1')).toFloat * 2
    val s = zipped.count(_ == ('0', '0')).toFloat * 2
    (q + r) / (p + q + r)
  } //> getDistance: (str1: String, str2: String)Float
  case class UserObj(id: String, nCoordinate: String)
  val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
  for (a <- 1 to 100) {
    userList.add(new UserObj("2", "101010"))
  }
  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
    try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B
  def appendToFile(fileName: String, textData: String) =
    using(new java.io.FileWriter(fileName, true)) {
      fileWriter =>
        using(new java.io.PrintWriter(fileWriter)) {
          printWriter => printWriter.println(textData)
        }
    } //> appendToFile: (fileName: String, textData: String)Unit
  var counter = 0; //> counter : Int = 0
  for (xUser <- userList.par) {
    userList.par.map(yUser => {
      if (!xUser.id.isEmpty && !yUser.id.isEmpty)
        synchronized {
          appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
        }
    })
  }
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 100, but in the code I'm working on the size is 8000, which translates to 64'000'000 comparisons. I'm using a synchronized block so that multiple threads do not write to the same file at the same time. A performance improvement I'm considering is populating a separate collection within the inner loop ( userList.par.map(yUser => {) and then writing that collection out to a separate file.
Are there other methods I can use to improve performance, so that I can handle a List that contains 8000 items instead of the 100 in the example above?
I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel since the only thing you are doing is writing to a file.
One thing that you should do is to move the getDistance(...) computation before the synchronized call to appendToFile, otherwise your parallelized code ends up being sequential.
Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.
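The common thread in these suggestions is to keep the distance computation away from the file access. As a rough illustration (in Python rather than Scala, with hypothetical function names, and leaving out the parallel step), the sketch below computes every pairwise distance first and only then produces text that can be written with a single call. It mirrors the question's formula (q + r) / (p + q + r) and assumes each pair has at least one position where a '1' appears.

```python
from itertools import product


def get_distance(str1: str, str2: str) -> float:
    """(q + r) / (p + q + r) over aligned characters, as in the question."""
    pairs = list(zip(str1, str2))
    p = pairs.count(("1", "1"))
    q = pairs.count(("1", "0"))
    r = pairs.count(("0", "1"))
    return (q + r) / (p + q + r)


def all_distances(coordinates):
    """Compute every pairwise distance first; no file access happens here."""
    return [get_distance(a, b) for a, b in product(coordinates, repeat=2)]


def to_text(distances):
    """Join the results so they can be written with a single call."""
    return "\n".join(str(d) for d in distances) + "\n"


coords = ["101010", "110000"]
distances = all_distances(coords)
text = to_text(distances)
print(text, end="")
```

For 8000 users the full result list may be too large to hold comfortably in memory, so writing in chunks (for example one outer-loop row at a time) is a likely middle ground.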

Accessing position information in a scala combinatorparser kills performance

I wrote a new combinator for my parser in scala.
It's a variation of the ^^ combinator, which passes position information on.
But accessing the position information of the input element really costs performance.
In my case, parsing a big example needs around 3 seconds without position information; with it, parsing needs over 30 seconds.
I wrote a runnable example where the runtime is about 50% more when accessing the position.
Why is that? How can I get a better runtime?
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.Parsers
import scala.util.matching.Regex
import scala.language.implicitConversions
object FooParser extends RegexParsers with Parsers {
  var withPosInfo = false
  def b: Parser[String] = regexB("""[a-z]+""".r) ^^# { case (b, x) => b + " ::" + x.toString }
  def regexB(p: Regex): BParser[String] = new BParser(regex(p))
  class BParser[T](p: Parser[T]) {
    def ^^#[U](f: ((Int, Int), T) => U): Parser[U] = Parser { in =>
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      val inwo = in.drop(start - offset)
      p(inwo) match {
        case Success(t, in1) =>
          var a = 3
          var b = 4
          if (withPosInfo) { // takes a lot of time
            a = inwo.pos.line
            b = inwo.pos.column
          }
          Success(f((a, b), t), in1)
        case ns: NoSuccess => ns
      }
    }
  }
  def main(args: Array[String]) = {
    val r = "foo"*50000000
    var now = System.nanoTime
    parseAll(b, r)
    var us = (System.nanoTime - now) / 1000
    println("without: %d us".format(us))
    withPosInfo = true
    now = System.nanoTime
    parseAll(b, r)
    us = (System.nanoTime - now) / 1000
    println("with : %d us".format(us))
  }
}
without: 2952496 us
with : 4591070 us
Unfortunately, I don't think you can use the same approach. The problem is that line numbers end up implemented by scala.util.parsing.input.OffsetPosition which builds a list of every line break every time it is created. So if it ends up with string input it will parse the entire thing on every call to pos (twice in your example). See the code for CharSequenceReader and OffsetPosition for more details.
There is one quick thing you can do to speed this up:
val ip = inwo.pos
a = ip.line
b = ip.column
to at least avoid creating pos twice. But that still leaves you with a lot of redundant work. I'm afraid to really solve the problem you'll have to build the index as in OffsetPosition yourself, just once, and then keep referring to it.
You could also file a bug report / make an enhancement request. This is not a very good way to implement the feature.
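To make the "build the index yourself, just once" idea concrete, here is a small sketch of such an index. It is written in Python rather than Scala and is not the library's implementation; `LineIndex` and `position` are hypothetical names. The line starts are collected in one pass, and each later lookup is a binary search instead of a rescan of the input.

```python
from bisect import bisect_right


class LineIndex:
    """Precomputed line starts for one input string (hypothetical helper)."""

    def __init__(self, source: str):
        # Offset 0 starts line 1; every '\n' starts a new line right after it.
        self.starts = [0]
        for offset, char in enumerate(source):
            if char == "\n":
                self.starts.append(offset + 1)

    def position(self, offset: int):
        """Return 1-based (line, column) for a 0-based character offset."""
        line = bisect_right(self.starts, offset)  # count of line starts <= offset
        column = offset - self.starts[line - 1] + 1
        return line, column


index = LineIndex("ab\ncd\n\ne")  # built once for the whole input
print(index.position(0))
print(index.position(4))
print(index.position(7))
```

In the parser, the equivalent would be to create the index once per input and have the `^^#` combinator look up `in.offset` in it, rather than asking `pos` to rebuild the line table on every call.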
