Pyspark Streaming transform error - spark-streaming

Hi, I am new to PySpark Streaming.
numbers0 = sc.parallelize([1,2,3,4,5])
numbers1 = sc.parallelize([2,3,4,5,6])
numbers2 = sc.parallelize([3,4,5,6,7])
stream0 = ssc.queueStream([numbers0, numbers1, numbers2])
stream0.pprint()
ssc.start()
ssc.awaitTermination(20)
ssc.stop()
This works fine but as soon as I do the following I get an error:
stream1 = stream0.transform(lambda x: x.mean())
stream1.pprint()
ssc.start()
ssc.awaitTermination(20)
ssc.stop()
What I want is a stream that consists only of the mean of my previous stream.
Does anyone know what I must do?

The error you are getting when calling transform is because it requires an RDD-to-RDD function, as stated in Spark's documentation for the transform operation. Calling mean on an RDD returns a number, not a new RDD, hence the error.
Now, from what I understand, you want to calculate the mean of each RDD that makes up the DStream. The DStream is created with queueStream, and since the named parameter oneAtATime is left at its default, your program will consume one RDD at every batch interval.
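For contrast, a transform that does return an RDD works fine. A minimal hedged sketch, reusing stream0 from the question (the doubling is just an arbitrary per-element operation):
# transform expects an RDD-to-RDD function, so returning rdd.map(...) is valid
doubled = stream0.transform(lambda rdd: rdd.map(lambda x: x * 2))
doubled.pprint()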
To calculate the mean for each RDD, you would normally do this inside a foreachRDD output operation, like this:
# Create stream0 as you do in your example
def calculate_mean(rdd):
    mean_value = rdd.mean()
    # do other stuff with mean_value, like saving it to a database or just printing it

stream0.foreachRDD(calculate_mean)
# Start and stop the Streaming Context
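For completeness, here is a minimal end-to-end sketch of the foreachRDD approach, assuming local mode; the app name and the 1-second batch interval are illustrative choices, not from the question:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "MeanOfEachRDD")  # assumed local setup
ssc = StreamingContext(sc, 1)  # 1-second batch interval

numbers0 = sc.parallelize([1, 2, 3, 4, 5])
numbers1 = sc.parallelize([2, 3, 4, 5, 6])
numbers2 = sc.parallelize([3, 4, 5, 6, 7])
stream0 = ssc.queueStream([numbers0, numbers1, numbers2])

def calculate_mean(rdd):
    # rdd.mean() raises on an empty RDD, so guard just in case
    if not rdd.isEmpty():
        print(rdd.mean())  # or save the mean to a database instead of printing

stream0.foreachRDD(calculate_mean)

ssc.start()
ssc.awaitTermination(20)
ssc.stop()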

Related

How to iterate over a list of values returned from ops to jobs in Dagster

I am new to the Dagster world and working on the ops and jobs concepts.
My requirement is to read a list of data from config_schema, pass it to an @op function, and return the same list to the job.
The code is shown below:
@op(config_schema={"table_name": list})
def read_tableNames(context):
    lst = context.op_config['table_name']
    return lst

@job
def write_db():
    tableNames_frozenList = read_tableNames()
    print(f'-------------->', type(tableNames_frozenList))
    print(f'-------------->{tableNames_frozenList}')
When the @op function receives the list, it shows up as a frozenlist type, but when I try to return it to the job it gets converted into the <class 'dagster._core.definitions.composition.InvokedNodeOutputHandle'> data type.
My requirement is to fetch the list of data, iterate over it, and perform some operations on the individual items of the list using @ops.
Please help me understand this.
Thanks in advance!!!
When using ops / graphs / jobs in Dagster, it's very important to understand that the code defined within a @graph or @job definition is only executed when your code is loaded by Dagster, NOT when the graph is actually executing. The code within a @graph or @job definition is essentially a compilation step that only serves to define the dependencies between ops; there shouldn't be any general-purpose Python code inside those definitions. Whatever operations you want to perform on data flowing through your job should take place within the @op definitions. So if you wanted to print the values of the list that is input via the config schema, you might do something like:
@op(config_schema={"table_name": list})
def read_tableNames(context):
    lst = context.op_config['table_name']
    context.log.info(f'--------------> {type(lst)}')
    context.log.info(f'--------------> {lst}')
Here's an example using two ops to do this data flow:
@op(config_schema={"table_name": list})
def read_tableNames(context):
    lst = context.op_config['table_name']
    return lst

@op
def print_tableNames(context, table_names):
    context.log.info(f'--------------> {type(table_names)}')

@job
def simple_flow():
    print_tableNames(read_tableNames())
Have a look at some of the Dagster tutorials for more examples.
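To actually iterate over the list and run an op per item (the stated requirement), one common Dagster pattern is dynamic outputs. The following is a hedged sketch, assuming Dagster's DynamicOut / DynamicOutput API; the op names and the table names in the run config are made up for illustration:
from dagster import op, job, DynamicOut, DynamicOutput

@op(config_schema={"table_name": list}, out=DynamicOut())
def emit_table_names(context):
    # yield one dynamic output per table name so the downstream op fans out
    for idx, name in enumerate(context.op_config["table_name"]):
        yield DynamicOutput(name, mapping_key=f"table_{idx}")

@op
def process_table(context, table_name):
    # stand-in for whatever per-table work you need
    context.log.info(f"processing {table_name}")

@job
def fan_out_flow():
    emit_table_names().map(process_table)

# Launch in-process with an illustrative run config
result = fan_out_flow.execute_in_process(
    run_config={"ops": {"emit_table_names": {"config": {"table_name": ["users", "orders"]}}}}
)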

How can I extract, edit and replot a data matrix in Abaqus?

Good afternoon,
We've been working on an animal model (a skull), applying a series of forces and evaluating the resulting stresses in Abaqus. We got some of those beautiful and colourful (blue-to-red) contour plots. Now we'd like to obtain a similar image, but coloured by a new matrix, which will be the result of some mathematical transformations.
So, how can I extract the data matrix used to set those colour patterns (I guess with X-, Y-, Z-, and von Mises values or so), apply my transformation, and replot the data to get a new (comparable) figure with the new values?
Thanks a lot and have a great day!
I've never done it myself, but I know that this is possible. You can start with the documentation (e.g. here and here).
After experimenting in the GUI, you can check the corresponding Python code, which should be automatically recorded in the abaqus.rpy file in your working directory (or in C:\temp). Working it through, you could get something like:
myodb = session.openOdb('my_fem.odb')  # or session.odbs['my_fem.odb'] if it is already loaded into the session
# Define a temporary step for accessing your transformed output
tempStep = myodb.Step(name='TempStep', description='', domain=TIME, timePeriod=1.0)
# Define a temporary frame to store your transformed output
tempFrame = tempStep.Frame(frameId=0, frameValue=0.0, description='TempFrame')
# Define a new field output
s1f2_S = myodb.steps['Step-1'].frames[2].fieldOutputs['S']  # Stress tensor at the second frame of the 'Step-1' step
s1f1_S = myodb.steps['Step-1'].frames[1].fieldOutputs['S']  # Stress tensor at the first frame of the 'Step-1' step
tmpField = s1f2_S - s1f1_S
userField = tempFrame.FieldOutput(
    name='Field-1', description='s1f2_S - s1f1_S', field=tmpField
)
Now, to display your new field output using Python, you can do the following:
session.viewports['Viewport: 1'].odbDisplay.setFrame(
    step='TempStep', frame=0
)
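To then plot the new field as contours, something along these lines might work. This is a hedged sketch written from memory of the Abaqus Scripting Reference (setPrimaryVariable and plotState), not tested code:
from abaqusConstants import INTEGRATION_POINT, CONTOURS_ON_DEF

vp = session.viewports['Viewport: 1']
# Select the user-defined field as the primary variable to contour
vp.odbDisplay.setPrimaryVariable(field=userField, outputPosition=INTEGRATION_POINT)
# Switch the viewport to a contour plot
vp.odbDisplay.display.setValues(plotState=(CONTOURS_ON_DEF,))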
For more information on the methods and objects used, you can consult the documentation ("Abaqus Scripting Reference Guide"):
Step(): Odb commands -> OdbStep object -> Step();
Frame(): Odb commands -> OdbFrame object -> Frame();
FieldOutput object: Odb commands -> FieldOutput object;

How can I make a dataset/dataframe from a compressed (.zip) local file in Apache Spark

I have large compressed (.zip) files, around 10 GB each. I need to read the content of the files inside the zip without unzipping them, and I want to apply transformations.
System.setProperty("HADOOP_USER_NAME", user)
println("Creating SparkConf")
val conf = new SparkConf().setAppName("DFS Read Write Test")
println("Creating SparkContext")
val sc = new SparkContext(conf)
var textFile = sc.textFile(filePath)
println("Count...."+textFile.count())
var df = textFile.map(some code)
When I pass any .txt, .log, .md, etc. file, the above works fine. But when I pass .zip files, it gives a count of zero.
Why is it giving a count of zero?
Please suggest the correct way of doing this, if I am going about it totally wrong.
You have to perform this task like this; it's a different operation than simply loading the other kinds of files that Spark supports.
val rdd = sc.newAPIHadoopFile(
  "file.zip",
  classOf[ZipFileInputFormat],  // a custom/third-party InputFormat for zip archives, not part of stock Hadoop
  classOf[Text],
  classOf[Text],
  new Job().getConfiguration()
)
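For reference, the same thing can be done in PySpark without a custom InputFormat by reading the archives as binary blobs and unpacking them with Python's zipfile module. A hedged sketch (the path is illustrative, and note that binaryFiles loads each whole archive into memory as a single record, which may be an issue for 10 GB zips):
import io
import zipfile

def zip_to_lines(path_and_bytes):
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        # emit every line of every entry inside the archive
        for name in zf.namelist():
            for line in zf.open(name).read().decode("utf-8").splitlines():
                yield line

lines = sc.binaryFiles("hdfs:///data/*.zip").flatMap(zip_to_lines)
print(lines.count())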

Getting a field value from a pipe outside the pipe's scope in Hadoop Cascading

Regarding the above subject, is there any way to get the value of a field from a pipe and use that value outside the pipe's scope in Hadoop Cascading? The data uses '|' as the delimiter:
first_name|description
Binod|nothing
Rohit|nothing
Ramesh|abc
From the above pipe I need to get a value from the description field, whether that is 'nothing' or 'abc'.
Hadoop Cascading is built around the idea of modelling a real-world scenario as data flowing between pipes, executed in parallel on a Hadoop MapReduce system.
The surrounding Java program does not execute in lockstep with the Cascading flow (from creating the source tap to the sink tap): Cascading runs those as two different, independent JVM instances, so they are unable to share their values.
Following code and its output shows brief hints:
System.out.println("Before Debugging");
m_eligPipe = new Each(m_eligPipe, new Fields("first_name"), new Debug("On Middle", true));
System.out.println("After Debugging");
Expected output:
Before Debugging
On Middle: ['first_name']
On Middle: ['Binod']
On Middle: ['Rohit']
On Middle: ['Ramesh']
After Debugging
Actual output:
Before Debugging
After Debugging
...
...
On Middle: ['first_name']
On Middle: ['Binod']
On Middle: ['Rohit']
On Middle: ['Ramesh']
I don't understand what you are trying to say. Do you mean to extract the value of the field ${description} outside the scope of the pipe? If possible, something like this in pseudo code:
str = get value of description in inputPipe (which is in the scope of the job rather than a function or buffer)
I assume this is what you want: you have a pipe with one field, which is the concatenation of ${first_name} and ${description}, and you want the output to be a pipe whose field is ${description}.
If so, this is what I'd do: implement a function that extracts the description and have your flow execute it.
Your function (let's call it ExtractDescriptionFunction) should override the operate method with something like this:
@Override
public void operate(FlowProcess flowProcess, FunctionCall<Tuple> functionCall) {
    TupleEntry arguments = functionCall.getArguments();
    String concatenation = arguments.getString("input_field_name"); // replace with your actual input field name
    String[] values = concatenation.split("\\|"); // you might want to add a data sanity check here
    String description = values[1];
    functionCall.getOutputCollector().add(new Tuple(description));
}
Then, in your flow definition, add this:
Pipe outputPipe = new Each(inputPipe, new ExtractDescriptionFunction());
Hope this helps.

Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)

Help with implementation best practices is needed.
The operating environment is as follows:
Log data files arrive irregularly.
The size of a log data file ranges from 3.9 KB to 8.5 MB; the average is about 1 MB.
The number of records in a data file ranges from 13 lines to 22,000 lines; the average is about 2,700 lines.
Data files must be post-processed before aggregation.
The post-processing algorithm can be changed.
Post-processed files are managed separately from the original data files, since the post-processing algorithm might be changed.
Daily aggregation is performed. All post-processed data files must be filtered record-by-record, and aggregations (average, max, min, ...) are calculated.
Since the aggregation is fine-grained, the number of records after aggregation is not so small. It can be about half of the number of original records.
At some point, the number of post-processed files can be about 200,000.
A data file should be able to be deleted individually.
In a test, I tried to process 160,000 post-processed files in Spark, starting with sc.textFile() with a glob path, and it failed with an OutOfMemory exception on the driver process.
What is the best practice for handling this kind of data?
Should I use HBase instead of plain files to save the post-processed data?
I wrote my own loader. It solved our problem with small files in HDFS. It uses Hadoop's CombineFileInputFormat.
In our case it reduced the number of mappers from 100,000 to approximately 3,000 and made the job significantly faster.
https://github.com/RetailRocket/SparkMultiTool
Example:
import ru.retailrocket.spark.multitool.Loaders
val sessions = Loaders.combineTextFile(sc, "file:///test/*")
// or val sessions = Loaders.combineTextFile(sc, conf.weblogs(), size = 256, delim = "\n")
// where size is split size in Megabytes, delim - line break character
println(sessions.count())
I'm pretty sure the reason you're getting OOM is that you are handling so many small files. What you want is to combine the input files so you don't get so many partitions. I try to limit my jobs to about 10k partitions.
After textFile, you can use .coalesce(10000, false) ... not 100% sure that will work though, because it's been a while since I've done it; please let me know. So try:
sc.textFile(path).coalesce(10000, false)
You can use this approach.
First, you can get a Buffer/List of S3 paths (the same works for HDFS or local paths).
If you're trying it with Amazon S3, then:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and ListObjects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)

  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Creating a Scala List for the same
  files.asScala
}
Now pass this list to the following piece of code (note: sc is a SparkContext):
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileDf = sc.textFile(file)
  if (df != null) {
    df = df.union(fileDf)
  } else {
    df = fileDf
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
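As a side note, textFile also accepts a comma-separated list of paths, so the explicit union loop can be avoided. A hedged PySpark sketch, assuming paths holds the file paths collected above as Python strings:
# Join the individual paths into one comma-separated string and coalesce to a sane partition count
unified = sc.textFile(",".join(paths)).coalesce(10000)
print(unified.count())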
