repetitive treatment on spark streaming - spark-streaming

I'm new to Spark, and I want to apply some processing to files as they stream in.
I have CSV files which arrive non-stop.
Example CSV file:
world world
count world
world earth
count world
and I want to apply two treatments to them.
The first treatment should produce a result like this:
(world,2,2) // "world" appears twice in the first column, with 2 distinct values (world, earth) in the second column, therefore (2,2)
(count,2,1) // "count" appears twice in the first column, with 1 distinct value (world, world) in the second column, therefore (2,1)
The second treatment divides those two counts, and I want to get that result after each hour. In our example:
(world,1) // 1 = 2/2
(count,2) // 2 = 2/1
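To pin down the two expected outputs, here is a minimal sketch of the same aggregation on a plain Scala collection (illustration only, not the streaming code):
// Illustration only: the four example rows as (column1, column2) pairs.
val rows = Seq(("world", "world"), ("count", "world"), ("world", "earth"), ("count", "world"))

// First treatment: (word, occurrences in column 1, distinct values in column 2).
val firstTreatment = rows.groupBy(_._1).map { case (word, pairs) =>
  (word, pairs.size, pairs.map(_._2).distinct.size)
}
// e.g. (world,2,2), (count,2,1)

// Second treatment: ratio of the two counts, wanted once per hour.
val secondTreatment = firstTreatment.map { case (word, total, distinct) =>
  (word, total.toFloat / distinct)
}
// e.g. (world,1.0), (count,2.0)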
This is my code:
val conf = new SparkConf()
.setAppName("File Count")
.setMaster("local[2]")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(10))
val file = ssc.textFileStream("hdfs://192.168.1.31:8020/user/sparkStreaming/input")
var result = file.map(x => (x.split(" ")(0)+";"+x.split(" ")(1), 1)).reduceByKey((x,y) => x+y)
val window = result.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(60), Seconds(20))
val result1 = window.map(x => x.toString )
val result2 = result1.map(line => line.split(";")(0)+","+line.split(",")(1))
val result3 = result2.map(line => line.substring(1, line.length-1))
val result4 = result3.map(line => (line.split(",")(0),line.split(",")(1).toInt ) )
val result5 = result4.reduceByKey((x,y) => x+y )
val result6 = result3.map(line => (line.split(",")(0), 1 ))
val result7 = result6.reduceByKey((x,y) => x+y )
val result8 = result7.join(result5) // (world,(2,2))
val finalResult = result8.mapValues(x => x._1.toFloat / x._2 ) // (world,1), I want this result after every one hour
ssc.start()
ssc.awaitTermination()
Thanks in Advance!!!
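For reference, a minimal sketch of how the two treatments could be computed per window, under the assumption that a one-hour tumbling window is what is wanted; this reuses the file DStream above and has not been tested against the asker's setup:
import org.apache.spark.streaming.Minutes

val pairs = file.map { line =>
  val cols = line.split(" ")
  (cols(0), cols(1))
}

// One-hour tumbling window: window length == slide interval.
pairs.window(Minutes(60), Minutes(60)).foreachRDD { rdd =>
  // First treatment: (word, total occurrences, distinct second-column values).
  val first = rdd.groupByKey().map { case (word, values) =>
    (word, values.size, values.toSeq.distinct.size)
  }
  // Second treatment: ratio of the two counts, e.g. (world,1.0), (count,2.0).
  val second = first.map { case (word, total, distinct) => (word, total.toFloat / distinct) }
  second.collect().foreach(println) // collect only for printing in this small example
}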

Related

How to convert an RDD (read in from a directory of text files) into a DataFrame in Apache Spark in Scala?

I'm developing a Scala feature-extraction app using Apache Spark TF-IDF. I need to read in from a directory of text files. I'm trying to convert an RDD to a DataFrame, but I'm getting the error "value toDF() is not a member of org.apache.spark.rdd.RDD[streamedRDD]". This is what I have right now ...
I have Spark 2.2.1 and Scala 2.11.11. Thanks in advance.
Code:
// Creating the Spark context that will interface with Spark
val conf = new SparkConf()
.setMaster("local")
.setAppName("TextClassification")
val sc = new SparkContext(conf)
// Load documents (one per line)
val data = sc.wholeTextFiles("C:/Users/*")
val text = data.map{case(filepath,text) => text}
val id = data.map{case(filepath, text) => text.split("#").takeRight(1)(0)}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class dataStreamed(id: String, input: String)
val tweetsDF = data
.map{case (filepath, text) =>
val id = text.split("#").takeRight(1)(0)
val input = text.split(":").takeRight(2)(0)
dataStreamed(id, input)}
.as[dataStreamed]
.toDF()
.cache()
// -------------------- TF-IDF --------------------
// From spark.apache.org
// URL http://spark.apache.org/docs/latest/ml-features.html#tf-idf
val tokenizer = new Tokenizer().setInputCol("input").setOutputCol("words")
val wordsData = tokenizer.transform(tweetsDF)
val hashingTF = new HashingTF()
.setInputCol("words")
.setOutputCol("rawFeatures")
val tf = hashingTF.transform(wordsData).cache() // Hashed words
// Compute for the TFxIDF
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val tfidf = idf.fit(tf)
Data: (Text files like these in a folder is what I need read in)
https://www.dropbox.com/s/cw3okhaosu7i1md/cars.txt?dl=0
https://www.dropbox.com/s/29tgqg7ifpxzwwz/Italy.txt?dl=0
The problem here is that the map is applied to an RDD, so its result is an RDD as well, not a Dataset/DataFrame that you can assign to tweetsDF. It should be:
case class dataStreamed(id: String, input: String)
def test() = {
val sparkConf = new SparkConf().setAppName("TextClassification").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val sqlContext = spark.sqlContext
import sqlContext.implicits._
// Load documents (one per line)
val data = spark.sparkContext.wholeTextFiles("C:\\tmp\\stackoverflow\\*")
val dataset = spark.createDataset(data)
val tweetsDF = dataset
.map{case (id : String, input : String) =>
val file = id.split("#").takeRight(1)(0)
val content = input.split(":").takeRight(2)(0)
dataStreamed(file, content)}
.as[dataStreamed]
tweetsDF.printSchema()
tweetsDF.show(10)
}
First, data will be an RDD[(String, String)]; then I create a new Dataset with spark.createDataset in order to be able to use map properly together with the case class. Please note that you must define the dataStreamed class outside of your method (test in this case).
Good luck
We can do this with a couple of commands/functions:
Invoke the Spark/Scala shell; you can set driver-memory, executor-memory, executor-cores, etc. to suit your job:
spark-shell
Read the text file from HDFS:
val text_rdd = sc.textFile("path/to/file/on/hdfs")
Convert the text RDD to a DataFrame (the required implicits are already in scope in spark-shell):
val text_df = text_rdd.toDF
Save in plain text format in HDFS:
text_df.write.text("path/to/hdfs")
Save in a splittable compressed format in HDFS:
text_df.coalesce(1).write.parquet("path/to/hdfs")
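Putting those steps together in one rough sketch (the paths are placeholders; outside spark-shell you need a SparkSession and its implicits for toDF):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TextToDF").getOrCreate()
import spark.implicits._                      // needed for .toDF on an RDD

val text_rdd = spark.sparkContext.textFile("hdfs:///path/to/file/on/hdfs")
val text_df = text_rdd.toDF("value")          // single string column named "value"

text_df.write.text("hdfs:///path/to/plain/output")                   // plain text
text_df.coalesce(1).write.parquet("hdfs:///path/to/parquet/output")  // Parquet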

How to split an akka ByteString at a two-byte delimiter

I'm creating a WebSocket protocol of my own, and I thought I would have a text key/value header part, ending at two consecutive newlines, followed by a binary tail.
It turns out that splitting a ByteString in two (at the two newlines) is really tedious. There is no built-in .split method, for one, and no .indexOf for finding a binary fingerprint.
What would you use for this? Is there an easier way for me to build such a protocol?
References:
akka ByteString
Using akka-http 10.1.0-RC1, akka 2.5.8
One approach would be to first create sliding pairs from an indexedSeq of the ByteString, then split the ByteString using the identified indexes of the delimiter-pair, as in the following example:
import akka.util.ByteString
val bs = ByteString("aa\nbb\n\nxyz")
// bs: akka.util.ByteString = ByteString(97, 97, 10, 98, 98, 10, 10, 120, 121, 122)
val delimiter = 10
// Create sliding pairs from indexedSeq of the ByteString
val slidingList = bs.zipWithIndex.sliding(2).toList
// slidingList: List[scala.collection.immutable.IndexedSeq[(Byte, Int)]] = List(
// Vector((97,0), (97,1)), Vector((97,1), (10,2)), Vector((10,2), (98,3)),
// Vector((98,3), (98,4)), Vector((98,4), (10,5)), Vector((10,5), (10,6)),
// Vector((10,6), (120,7)), Vector((120,7), (121,8)), Vector((121,8), (122,9))
// )
// Get indexes of the delimiter-pair
val dIndex = slidingList.filter{
case Vector(x, y) => x._1 == delimiter && y._1 == delimiter
}.flatMap{
case Vector(x, y) => Seq(x._2, y._2)
}
// Split the ByteString list
val (bs1, bs2) = ( bs.splitAt(dIndex(0))._1, bs.splitAt(dIndex(1))._2.tail )
// bs1: akka.util.ByteString = ByteString(97, 97, 10, 98, 98)
// bs2: akka.util.ByteString = ByteString(120, 121, 122)
I came up with this. Haven't tested it in practice yet.
import scala.annotation.tailrec
@tailrec
def peelMsg(bs: ByteString, accHeaderLines: Seq[String]): Tuple2[Seq[String],ByteString] = {
val (a: ByteString, tail: ByteString) = bs.span(_ != '\n')
val b: ByteString = tail.drop(1)
if (a.isEmpty) { // end marker - empty line
Tuple2(accHeaderLines,b)
} else {
val acc: Seq[String] = accHeaderLines :+ a.utf8String // append
peelMsg(b,acc)
}
}
val (headerLines: Seq[String], value: ByteString) = peelMsg(bs,Seq.empty)
My code, for now:
// Find the index of the (first) double-newline
//
val n: Int = {
val bsLen: Int = bs.length
val tmp: Int = bs.zipWithIndex.find{
case ('\n',i) if i<bsLen-1 && bs(i+1)=='\n' => true
case _ => false
}.map(_._2).getOrElse{
throw new RuntimeException("No delimiter found")
}
tmp
}
val (bs1: ByteString, bs2: ByteString) = bs.splitAt(n) // headers, \n\n<binary>
Influenced by @leo-c's answer, but using a normal .find instead of the sliding window. I realised that since ByteString allows random access, I can combine a streaming search with that condition.
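Since ByteString is an IndexedSeq[Byte], the inherited indexOfSlice can also find the two-byte delimiter directly. A small untested sketch along those lines (the helper name is made up):
import akka.util.ByteString

// Hypothetical helper: split at the first "\n\n", dropping the delimiter itself.
def splitAtDoubleNewline(bs: ByteString): Option[(ByteString, ByteString)] = {
  val delimiter = ByteString("\n\n")
  bs.indexOfSlice(delimiter) match {
    case -1 => None                                  // no header terminator found
    case i  => Some((bs.take(i), bs.drop(i + delimiter.length)))
  }
}

// splitAtDoubleNewline(ByteString("aa\nbb\n\nxyz"))
// => Some((ByteString(97, 97, 10, 98, 98), ByteString(120, 121, 122)))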

Spark LinearRegressionSummary "normal" summary

According to LinearRegressionSummary (Spark 2.1.0 JavaDoc), p-values are only available for the "normal" solver.
This value is only available when using the "normal" solver.
What the hell is the "normal" solver?
I'm doing this:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
.
.
.
val (trainingData, testData): (DataFrame, DataFrame) =
com.acme.pta.accuracy.Util.splitData(output, testProportion)
.
.
.
val lr =
new org.apache.spark.ml.regression.LinearRegression()
.setSolver("normal").setMaxIter(maxIter)
val pipeline = new Pipeline()
.setStages(Array(lr))
val paramGrid = new ParamGridBuilder()
.addGrid(lr.elasticNetParam, Array(0.2, 0.4, 0.8, 0.9))
.addGrid(lr.regParam, Array(0.6, 0.3, 0.1, 0.01))
.build()
val cv = new CrossValidator()
.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(numFolds) // Use 3+ in practice
val cvModel: CrossValidatorModel = cv.fit(trainingData)
val pipelineModel: PipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel: LinearRegressionModel =
pipelineModel.stages(0).asInstanceOf[LinearRegressionModel]
val modelSummary = lrModel.summary
Holder.log.info("lrModel.summary: " + modelSummary)
try {
Holder.log.info("feature p values: ")
// Exception occurs on line below.
val featuresAndPValues = features.zip(lrModel.summary.pValues)
featuresAndPValues.foreach(
(featureAndPValue: (String, Double)) =>
Holder.log.info(
"feature: " + featureAndPValue._1 + ": " + featureAndPValue._2))
} catch {
case _: java.lang.UnsupportedOperationException
=> Holder.log.error("Cannot compute p-values")
}
I am still getting the UnsupportedOperationException.
The exception message is:
No p-value available for this LinearRegressionModel
Is there something else I need to be doing? I'm using
"org.apache.spark" %% "spark-mllib" % "2.1.1"
Is pValues supported in that version?
Updated
tl;dr
Solution 1
In the standard LinearRegression, pValues and the other "normal" statistics are only present when one of the parameters elasticNetParam or regParam is zero. So you can change your grid to
.addGrid( lr.elasticNetParam, Array( 0.0 ) )
or
.addGrid( lr.regParam, Array( 0.0 ) )
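For example, keeping the asker's regParam values but pinning the elastic-net mixing term to zero (pure ridge), which per the above keeps the Cholesky path and therefore the pValues:
val paramGrid = new ParamGridBuilder()
  .addGrid( lr.elasticNetParam, Array( 0.0 ) )          // ridge only
  .addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
  .build()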
Solution 2
Make a custom version of LinearRegression which explicitly uses:
the "normal" solver for regression,
the Cholesky solver for WeightedLeastSquares.
I made this class as an extension to the ml.regression package.
package org.apache.spark.ml.regression
import scala.collection.mutable
import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.optim.WeightedLeastSquares
import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
import org.apache.spark.ml.util._
import org.apache.spark.mllib.linalg.VectorImplicits._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._
class CholeskyLinearRegression ( override val uid: String )
extends Regressor[ Vector, CholeskyLinearRegression, LinearRegressionModel ]
with LinearRegressionParams with DefaultParamsWritable with Logging {
import CholeskyLinearRegression._
def this() = this(Identifiable.randomUID("linReg"))
def setRegParam(value: Double): this.type = set(regParam, value)
setDefault(regParam -> 0.0)
def setFitIntercept(value: Boolean): this.type = set(fitIntercept, value)
setDefault(fitIntercept -> true)
def setStandardization(value: Boolean): this.type = set(standardization, value)
setDefault(standardization -> true)
def setElasticNetParam(value: Double): this.type = set(elasticNetParam, value)
setDefault(elasticNetParam -> 0.0)
def setMaxIter(value: Int): this.type = set(maxIter, value)
setDefault(maxIter -> 100)
def setTol(value: Double): this.type = set(tol, value)
setDefault(tol -> 1E-6)
def setWeightCol(value: String): this.type = set(weightCol, value)
def setSolver(value: String): this.type = set(solver, value)
setDefault(solver -> Auto)
def setAggregationDepth(value: Int): this.type = set(aggregationDepth, value)
setDefault(aggregationDepth -> 2)
override protected def train(dataset: Dataset[_]): LinearRegressionModel = {
// Extract the number of features before deciding optimization solver.
val numFeatures = dataset.select(col($(featuresCol))).first().getAs[Vector](0).size
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
val instances: RDD[Instance] =
dataset
.select( col( $(labelCol) ), w, col( $(featuresCol) ) )
.rdd.map {
case Row(label: Double, weight: Double, features: Vector) =>
Instance(label, weight, features)
}
// if (($(solver) == Auto &&
// numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) == Normal) {
// For low dimensional data, WeightedLeastSquares is more efficient since the
// training algorithm only requires one pass through the data. (SPARK-10668)
val optimizer = new WeightedLeastSquares(
$(fitIntercept),
$(regParam),
elasticNetParam = $(elasticNetParam),
$(standardization),
true,
solverType = WeightedLeastSquares.Cholesky,
maxIter = $(maxIter),
tol = $(tol)
)
val model = optimizer.fit(instances)
val lrModel = copyValues(new LinearRegressionModel(uid, model.coefficients, model.intercept))
val (summaryModel, predictionColName) = lrModel.findSummaryModelAndPredictionCol()
val trainingSummary = new LinearRegressionTrainingSummary(
summaryModel.transform(dataset),
predictionColName,
$(labelCol),
$(featuresCol),
summaryModel,
model.diagInvAtWA.toArray,
model.objectiveHistory
)
lrModel
.setSummary( Some( trainingSummary ) )
lrModel
}
override def copy(extra: ParamMap): CholeskyLinearRegression = defaultCopy(extra)
}
object CholeskyLinearRegression
extends DefaultParamsReadable[CholeskyLinearRegression] {
override def load(path: String): CholeskyLinearRegression = super.load(path)
val MAX_FEATURES_FOR_NORMAL_SOLVER: Int = WeightedLeastSquares.MAX_NUM_FEATURES
/** String name for "auto". */
private[regression] val Auto = "auto"
/** String name for "normal". */
private[regression] val Normal = "normal"
/** String name for "l-bfgs". */
private[regression] val LBFGS = "l-bfgs"
/** Set of solvers that LinearRegression supports. */
private[regression] val supportedSolvers = Array(Auto, Normal, LBFGS)
}
All you have to do is paste it into a separate file in the project and change LinearRegression to CholeskyLinearRegression in your code.
val lr = new CholeskyLinearRegression() // new LinearRegression()
.setSolver( "normal" )
.setMaxIter( maxIter )
It works with non-zero params and gives pValues. Tested on the following param grid:
val paramGrid = new ParamGridBuilder()
.addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
.addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
.build()
Full investigation
I initially thought that the main issue was the model not being fully preserved: the trained model is not preserved after fitting in CrossValidator, which is understandable because of memory consumption. There is an ongoing debate on how it should be resolved (issue in JIRA).
You can see in the commented section that I tried to extract parameters from the best model in order to run it again. Then I found out that the model summary is OK; it's just that for some parameters diagInvAtWA has a length of 1 and is basically a zero.
For ridge regression or Tikhonov regularization (elasticNetParam = 0) and any regParam, pValues and the other "normal" statistics can be computed, but for the Lasso method and anything in between (elastic net) they cannot. The same goes for regParam = 0: with any elasticNetParam, pValues were computed.
Why is that?
LinearRegression uses the WeightedLeastSquares optimizer for the "normal" solver, with solverType = WeightedLeastSquares.Auto. This optimizer has two options for the solver: QuasiNewton or Cholesky. The former is selected only when both regParam and elasticNetParam are non-zero.
val solver = if (
( solverType == WeightedLeastSquares.Auto &&
elasticNetParam != 0.0 &&
regParam != 0.0 ) ||
( solverType == WeightedLeastSquares.QuasiNewton ) ) {
...
new QuasiNewtonSolver(fitIntercept, maxIter, tol, effectiveL1RegFun)
} else {
new CholeskySolver
}
So with your parameter grid the QuasiNewtonSolver will always be used, because there are no combinations of regParam and elasticNetParam where one of them is zero.
We know that in order to get pValues and other "normal" statistics such as the t-statistic or the standard errors of the coefficients, the diagonal of the matrix (A^T * W * A)^-1 (diagInvAtWA) must not be a vector with only a single zero. This condition is set in the definition of pValues.
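In other words, a paraphrase of that guard (not the literal Spark source):
// pValues (and tValues, coefficientStandardErrors) can be produced only when
// diagInvAtWA is not the degenerate single-element zero vector.
def normalStatsAvailable(diagInvAtWA: Array[Double]): Boolean =
  !(diagInvAtWA.length == 1 && diagInvAtWA(0) == 0.0)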
diagInvAtWA is a vector of the diagonal elements of the packed upper triangular matrix (solution.aaInv).
val diagInvAtWA = solution.aaInv.map { inv => ...
For the Cholesky solver it is calculated, but for QuasiNewton it is not. The second parameter of NormalEquationSolution is this matrix.
You could technically make your own version of LinearRegression that forces the Cholesky solver, which is what Solution 2 above does.
Reproduction
In this example I used data sample_linear_regression_data.txt from here.
Full code of reproduction
import org.apache.spark._
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.{RegressionEvaluator, BinaryClassificationEvaluator}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.{LinearRegressionModel, LinearRegression}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.ml.param.ParamMap
object Main {
def main( args: Array[ String ] ): Unit = {
val spark =
SparkSession
.builder()
.appName( "SO" )
.master( "local[*]" )
.config( "spark.driver.host", "localhost" )
.getOrCreate()
import spark.implicits._
val data =
spark
.read
.format( "libsvm" )
.load( "./sample_linear_regression_data.txt" )
val Array( training, test ) =
data
.randomSplit( Array( 0.9, 0.1 ), seed = 12345 )
val maxIter = 10;
val lr = new LinearRegression()
.setSolver( "normal" )
.setMaxIter( maxIter )
val paramGrid = new ParamGridBuilder()
// .addGrid( lr.elasticNetParam, Array( 0.2, 0.4, 0.8, 0.9 ) )
.addGrid( lr.elasticNetParam, Array( 0.0 ) )
.addGrid( lr.regParam, Array( 0.6, 0.3, 0.1, 0.01 ) )
.build()
val pipeline = new Pipeline()
.setStages( Array( lr ) )
val cv = new CrossValidator()
.setEstimator( pipeline )
.setEvaluator( new RegressionEvaluator )
.setEstimatorParamMaps( paramGrid )
.setNumFolds( 2 ) // Use 3+ in practice
val cvModel =
cv
.fit( training )
val pipelineModel: PipelineModel =
cvModel
.bestModel
.asInstanceOf[ PipelineModel ]
val lrModel: LinearRegressionModel =
pipelineModel
.stages( 0 )
.asInstanceOf[ LinearRegressionModel ]
// Technically there is a way to use exact ParamMap
// to build a new LR but for the simplicity I'll
// get and set them explicitly
// lrModel.params.foreach( ( param ) => {
// println( param )
// } )
// val bestLr = new LinearRegression()
// .setSolver( "normal" )
// .setMaxIter( maxIter )
// .setRegParam( lrModel.getRegParam )
// .setElasticNetParam( lrModel.getElasticNetParam )
// val bestLrModel = bestLr.fit( training )
val modelSummary =
lrModel
.summary
println( "lrModel pValues: " + modelSummary.pValues.mkString( ", " ) )
spark.stop()
}
}
Original
There are three solver algorithms available:
l-bfgs - the Limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm, which is a limited-memory quasi-Newton optimization method.
normal - using the Normal Equation as an analytical solution to the linear regression problem. It is basically a weighted least squares approach or a reweighted least squares approach.
auto - the solver algorithm is selected automatically. The Normal Equations solver will be used when possible, but it will automatically fall back to iterative optimization methods when needed.
The coefficientStandardErrors, tValues and pValues are only available when using the "normal" solver, because they are all based on diagInvAtWA - the diagonal of the matrix (A^T * W * A)^-1.
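As a usage sketch (assuming lrModel was fitted with a parameter combination that takes the Cholesky path, as described above), the three statistics can be read off the training summary:
val s = lrModel.summary
println( "std. errors: " + s.coefficientStandardErrors.mkString( ", " ) )
println( "t-values:    " + s.tValues.mkString( ", " ) )
println( "p-values:    " + s.pValues.mkString( ", " ) )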

Scala/functional/without libs - check if a string is a permutation of another

How could you check whether one string is a permutation of another using Scala/functional programming, without complex pre-built functions like sorted()?
I'm a Python dev, and what I think trips me up the most is that you can't just iterate through a dictionary of character counts, comparing it to another dictionary of character counts, and then exit when there isn't a match; you can't just call break.
Assume this is the starting point, based on your description:
val a = "aaacddba"
val b = "aabaacdd"
def counts(s: String) = s.groupBy(identity).mapValues(_.size)
val aCounts = counts(a)
val bCounts = counts(b)
This is the simplest way:
aCounts == bCounts // true
This is precisely what you described:
def isPerm(aCounts: Map[Char,Int], bCounts: Map[Char,Int]): Boolean = {
if (aCounts.size != bCounts.size)
return false
for ((k,v) <- aCounts) {
if (bCounts.getOrElse(k, 0) != v)
return false
}
return true
}
This is your method, but more Scala-ish. (It also breaks as soon as a mismatch is found, because of how forall is implemented):
(aCounts.size == bCounts.size) &&
aCounts.forall { case (k,v) => bCounts.getOrElse(k, 0) == v }
(Also, Scala does have break.)
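For completeness, a minimal sketch of that break (scala.util.control.Breaks) applied to the count-map comparison; the helper name here is made up:
import scala.util.control.Breaks._

def isPermWithBreak(aCounts: Map[Char, Int], bCounts: Map[Char, Int]): Boolean = {
  if (aCounts.size != bCounts.size) return false
  var matches = true
  breakable {
    for ((k, v) <- aCounts) {
      if (bCounts.getOrElse(k, 0) != v) {
        matches = false
        break()               // exits the breakable block immediately
      }
    }
  }
  matches
}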
Also, also: you should read the answer to this question.
Another option is a recursive function, which will also 'break' immediately once a mismatch is detected:
import scala.annotation.tailrec
@tailrec
def isPerm1(a: String, b: String): Boolean = {
if (a.length == b.length) {
a.headOption match {
case Some(c) =>
val i = b.indexOf(c)
if (i >= 0) {
isPerm1(a.tail, b.substring(0, i) + b.substring(i + 1))
} else {
false
}
case None => true
}
} else {
false
}
}
Out of my own curiosity I also created two more versions which use a char-count map for matching:
def isPerm2(a: String, b: String): Boolean = {
val cntsA = a.groupBy(identity).mapValues(_.size)
val cntsB = b.groupBy(identity).mapValues(_.size)
cntsA == cntsB
}
and
def isPerm3(a: String, b: String): Boolean = {
val cntsA = a.groupBy(identity).mapValues(_.size)
val cntsB = b.groupBy(identity).mapValues(_.size)
(cntsA == cntsB) && cntsA.forall { case (k, v) => cntsB.getOrElse(k, 0) == v }
}
and roughly compared their performance with:
def time[R](block: => R): R = {
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
// Match
time((1 to 10000).foreach(_ => isPerm1("apple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm2("apple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm3("apple"*100,"elppa"*100)))
// Mismatch
time((1 to 10000).foreach(_ => isPerm1("xpple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm2("xpple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm3("xpple"*100,"elppa"*100)))
and the result is:
Match cases
isPerm1 = 2337999406ns
isPerm2 = 383375133ns
isPerm3 = 382514833ns
Mismatch cases
isPerm1 = 29573489ns
isPerm2 = 381622225ns
isPerm3 = 417863227ns
As can be expected, the char-count map speeds up the positive cases but can slow down the negative cases (the overhead of building the char-count map).

R: tm Textmining package: Doc-Level metadata generation is slow

I have a list of documents to process, and for each record I want to attach some metadata to the document "member" inside the "corpus" data structure that tm, the R package, generates (from reading in text files).
This for-loop works, but it is very slow.
Performance seems to degrade as a function f ~ 1/n_docs.
for (i in seq(from= 1, to=length(corpus), by=1)){
if(opts$options$verbose == TRUE || i %% 50 == 0){
print(paste(i, " ", substr(corpus[[i]], 1, 140), sep = " "))
}
DublinCore(corpus[[i]], "title") = csv[[i,10]]
DublinCore(corpus[[i]], "Publisher" ) = csv[[i,16]] #institutions
}
This may do something to the corpus variable, but I don't know what.
When I put it inside tm_map() (similar to the lapply() function), it runs much faster, but the changes are not made persistent:
i = 0
corpus = tm_map(corpus, function(x){
i <<- i + 1
if(opts$options$verbose == TRUE){
print(paste(i, " ", substr(x, 1, 140), sep = " "))
}
meta(x, tag = "Heading") = csv[[i,10]]
meta(x, tag = "publisher" ) = csv[[i,16]]
})
The corpus variable has empty metadata fields after exiting the tm_map function. They should be filled. I have a few other things to do with the collection.
The R documentation for the meta() function says this:
Examples:
data("crude")
meta(crude[[1]])
DublinCore(crude[[1]])
meta(crude[[1]], tag = "Topics")
meta(crude[[1]], tag = "Comment") <- "A short comment."
meta(crude[[1]], tag = "Topics") <- NULL
DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous"
DublinCore(crude[[1]], tag = "Format") <- "XML"
DublinCore(crude[[1]])
meta(crude[[1]])
meta(crude)
meta(crude, type = "corpus")
meta(crude, "labels") <- 21:40
meta(crude)
I tried many of these calls (with the variable "corpus" instead of "crude"), but they do not seem to work.
Someone else seems to have had the same problem with a similar data set (forum post from 2009, no response).
Here's a bit of benchmarking...
With the for loop :
expr.for <- function() {
for (i in seq(from= 1, to=length(corpus), by=1)){
DublinCore(corpus[[i]], "title") = LETTERS[round(runif(26))]
DublinCore(corpus[[i]], "Publisher" ) = LETTERS[round(runif(26))]
}
}
microbenchmark(expr.for())
# Unit: milliseconds
# expr min lq median uq max
# 1 expr.for() 21.50504 22.40111 23.56246 23.90446 70.12398
With tm_map :
corpus <- crude
expr.map <- function() {
tm_map(corpus, function(x) {
meta(x, "title") = LETTERS[round(runif(26))]
meta(x, "Publisher" ) = LETTERS[round(runif(26))]
x
})
}
microbenchmark(expr.map())
# Unit: milliseconds
# expr min lq median uq max
# 1 expr.map() 5.575842 5.700616 5.796284 5.886589 8.753482
So the tm_map version, as you noticed, seems to be about 4 times faster.
In your question you say that the changes in the tm_map version are not persistent; that is because you don't return x at the end of your anonymous function. In the end it should be:
meta(x, tag = "Heading") = csv[[i,10]]
meta(x, tag = "publisher" ) = csv[[i,16]]
x
