Trying to get spark streaming to read data stream from website, what is the socket? - hadoop

I am trying to get this data, http://stream.meetup.com/2/rsvps, into a Spark stream.
The entries are JSON objects; I know the lines will come in as strings, and I just want to get it working before I try parsing the JSON.
I am not sure what to put as the port; I assume that is the problem.
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("Spark Streaming");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("http://stream.meetup.com/2/rsvps", 80);
lines.print();
jssc.start();
jssc.awaitTermination();
Here is my error
java.net.UnknownHostException: http://stream.meetup.com/2/rsvps
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)

socketTextStream isn't designed to work as an HTTP client; it connects to a raw TCP socket given a hostname and a port, which is why passing a URL fails with UnknownHostException. As you noticed, you will need to create a custom receiver. One potential place to start is the receiver created as part of the meetup streaming data source (see https://github.com/actions/meetup-stream/blob/master/src/main/scala/receiver/MeetupReceiver.scala ).
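For illustration, socketTextStream expects the host and port of a plain TCP server, e.g. a local test server started with nc -lk 9999 (the host and port here are the usual Spark quick-start example, not anything from the question):
JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999); // raw TCP, not HTTP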

Here is a custom UrlReceiver that follows the Spark documentation on custom receivers (imports added for completeness; org.apache.spark.Logging was public in the Spark 1.x line this answer targets):
import java.io.{BufferedReader, InputStreamReader}
import java.net.{URL, URLConnection}

import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class UrlReceiver(urlStr: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) with Logging {
  override def onStart() = {
    // read the HTTP response on a background thread so onStart doesn't block
    new Thread("Url Receiver") {
      override def run() = {
        val urlConnection: URLConnection = new URL(urlStr).openConnection
        val bufferedReader: BufferedReader = new BufferedReader(
          new InputStreamReader(urlConnection.getInputStream)
        )
        var msg = bufferedReader.readLine
        while (msg != null) {
          if (!msg.isEmpty) {
            store(msg) // hand each non-empty line to Spark
          }
          msg = bufferedReader.readLine
        }
        bufferedReader.close()
      }
    }.start()
  }

  override def onStop() = {
    // nothing to do
  }
}
Then use it like this (note that receiverStream lives on the StreamingContext, not the SparkContext):
val lines = ssc.receiverStream(new UrlReceiver("http://stream.meetup.com/2/rsvps"))
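Since the question itself is in Java, here is a rough, untested Java equivalent of the same receiver, assuming the same Spark 1.x receiver API the Scala version uses:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class UrlReceiver extends Receiver<String> {
    private final String urlStr;

    public UrlReceiver(String urlStr) {
        super(StorageLevel.MEMORY_AND_DISK_2());
        this.urlStr = urlStr;
    }

    @Override
    public void onStart() {
        // read the HTTP response on a background thread so onStart doesn't block
        new Thread("Url Receiver") {
            @Override
            public void run() {
                try {
                    URLConnection conn = new URL(urlStr).openConnection();
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(conn.getInputStream()));
                    String msg;
                    while ((msg = in.readLine()) != null) {
                        if (!msg.isEmpty()) {
                            store(msg); // hand each non-empty line to Spark
                        }
                    }
                    in.close();
                } catch (Exception e) {
                    // let Spark decide whether to re-establish the connection
                    restart("Error receiving data from " + urlStr, e);
                }
            }
        }.start();
    }

    @Override
    public void onStop() {
        // the reading thread exits when the stream ends; nothing to clean up
    }
}

Wire it into the original snippet in place of socketTextStream:
JavaReceiverInputDStream<String> lines = jssc.receiverStream(new UrlReceiver("http://stream.meetup.com/2/rsvps"));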

Related

TPL Performance With Apache Kafka

Hello,
I'm unable to get any improved performance with TPL DataFlow and I'm wondering if I'm using it incorrectly.
The application below does the following:
Pulls message from a Kafka topic
Parses this message into a Foo object with ParseData()
Serializes this Foo into JSON
Then publishes the JSON to a new Kafka topic.
Some single threaded stats:
ParseData can parse strings into Foo at 100 msg/sec (single threaded test)
SerializeMessage can do 200 Foos/sec (single threaded test)
Consuming Kafka messages (skipping all the parsing/serializing) can handle over 2000 msgs/sec
Based on this, I hoped to leverage TPL to improve throughput; ideally it would approach the Kafka limit of 2000 msgs/sec.
However, I'm not seeing any improvement in throughput, even though I'm running the application on a machine with 12 physical cores (24 with HT). When I print the queue size of each block, the transformBlock's is always around 1000 while the others are under 10, which leads me to believe that the transformBlock isn't leveraging the multi-core system.
Have I set up TPL DataFlow to leverage parallelism correctly?
app = new App();
await app.Start(new[]{"consume-topic"}, cancelSource);

// App class
async Task Start(IEnumerable<string> topics, CancellationTokenSource cancelSource) {
    transformBlock = new TransformBlock<string, Foo>(TransformKafkaMessage,
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 8,
            BoundedCapacity = 1000,
            SingleProducerConstrained = true,
        });
    serializeBlock = new TransformBlock<Foo, string>(SerializeMessage,
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4,
            BoundedCapacity = 1000,
            SingleProducerConstrained = true,
        });
    publishBlock = new ActionBlock<string>(PublishJson, // string: matches serializeBlock's output and PublishJson's parameter
        new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 1,
            BoundedCapacity = 1000,
            SingleProducerConstrained = true
        });

    // Set up the pipeline
    transformBlock.LinkTo(serializeBlock);
    serializeBlock.LinkTo(publishBlock);

    // Start Kafka listener loop
    consumer.Subscribe(topics);
    while (true) {
        var result = consumer.Consume(cancelSource.Token);
        await ProcessMessage<Ignore, string>(result);
    }
}

// Send the content of the Kafka message to the transform block
async Task ProcessMessage<TKey, TValue>(ConsumeResult<TKey, string> msg) {
    var result = await transformBlock.SendAsync(msg.Value);
}

// Convert the raw string data into an object
Foo TransformKafkaMessage(string data) {
    // Note: this ParseData() function can process about 100 items per sec
    // in local single-threaded testing
    Foo foo = ParseData(data);
    return foo;
}

// Serialize the new Foo into JSON
string SerializeMessage(Foo foo) {
    // The serializer can process about 200 Foos/sec (single-threaded test)
    var json = foo.Serialize();
    return json;
}

// Publish the new message back to Kafka
void PublishJson(string json) {
    // Create a Confluent.Kafka Message
    var kafkaMessage = new Message<Null, string> {
        Value = json
    };
    producer.Produce("produce-topic", kafkaMessage);
}

AWS Lambda logging through Serilog UDP sink and logstash silently fails

We have a .NET Core 2.1 AWS Lambda that I'm trying to hook into our existing logging system.
I'm trying to log through Serilog using a UDP sink to our logstash instance, for ingestion into our ElasticSearch logging database that is hosted on a private VPC. Running locally through a console, it logs fine, both to the console itself and through UDP into Elastic. However, when it runs as a lambda, it only logs to the console (i.e. CloudWatch) and doesn't output anything indicating that anything is wrong. Possibly because UDP is stateless?
NuGet packages and versions:
Serilog 2.7.1
Serilog.Sinks.Udp 5.0.1
Here is the logging code we're using:
public static void Configure(string udpHost, int udpPort, string environment)
{
    var udpFormatter = new JsonFormatter(renderMessage: true);
    var loggerConfig = new LoggerConfiguration()
        .Enrich.FromLogContext()
        .MinimumLevel.Information()
        .Enrich.WithProperty("applicationName", Assembly.GetExecutingAssembly().GetName().Name)
        .Enrich.WithProperty("applicationVersion", Assembly.GetExecutingAssembly().GetName().Version.ToString())
        .Enrich.WithProperty("tags", environment);

    loggerConfig
        .WriteTo.Console(outputTemplate: "[{Level:u}]: {Message}{NewLine}{Exception}")
        .WriteTo.Udp(udpHost, udpPort, udpFormatter);

    var logger = loggerConfig.CreateLogger();
    Serilog.Log.Logger = logger;
    Serilog.Debugging.SelfLog.Enable(Console.Error);
}
// this is output in the console from the lambda, but doesn't appear in the Database from the lambda
// when run locally, appears in both
Serilog.Log.Logger.Information("Hello from Serilog!");
...
// at end of lambda
Serilog.Log.CloseAndFlush();
And here is our UDP input on logstash:
udp {
  port => 5000
  tags => [ 'systest', 'serilog-nested' ]
  codec => json
}
Does anyone know how I might go about resolving this, or even just how to see what specifically is wrong, so that I can start to find a solution?
Things tried so far include:
Pinging logstash from the lambda: impossible, a lambda doesn't have ICMP
Various attempts to get the UDP sink to output errors, as seen above; even putting in a completely fake address yields no error, though
Adding the lambda to a VPC where I know logging is possible from
Sleeping at the end of the lambda, so that the logs have time to go through before the lambda exits
Checking the logstash logs to see if anything looks odd; it doesn't really, and the fact that local runs get through fine makes me think it's not that
Using UDP directly; it doesn't seem to reach the server, and I'm not sure if that's a connectivity issue or just UDP itself from a lambda
Lots of cursing and swearing
In line with my comment above, you can create a log subscription and stream to ES like so. I'm aware that this is Node.js, so it's not quite the right answer, but you might be able to figure it out from here:
/* eslint-disable */
// Eslint disabled as this is adapted AWS code.
const zlib = require('zlib')
const { Client } = require('@elastic/elasticsearch')
const elasticsearch = new Client({ ES_CLUSTER_DETAILS })

/**
 * This is an example function to stream CloudWatch logs to ElasticSearch.
 * @param event
 * @param context
 * @param callback
 */
export default (event, context, callback) => {
  context.callbackWaitsForEmptyEventLoop = true
  const payload = new Buffer(event.awslogs.data, 'base64')
  zlib.gunzip(payload, (err, result) => {
    if (err) {
      return callback(err)
    }
    const logObject = JSON.parse(result.toString('utf8'))
    const elasticsearchBulkData = transform(logObject)
    if (!elasticsearchBulkData) {
      // control messages carry no log events, so there is nothing to index
      return callback(null, 'success')
    }
    const params = { body: [] }
    params.body.push(elasticsearchBulkData)
    elasticsearch.bulk(params, (err, resp) => {
      if (err) {
        return callback(err)
      }
      callback(null, 'success')
    })
  })
}

function transform(payload) {
  if (payload.messageType === 'CONTROL_MESSAGE') {
    return null
  }
  let bulkRequestBody = ''
  payload.logEvents.forEach((logEvent) => {
    const timestamp = new Date(1 * logEvent.timestamp)
    // index name format: cwl-YYYY.MM.DD
    const indexName = [
      `cwl-${process.env.NODE_ENV}-${timestamp.getUTCFullYear()}`, // year
      (`0${timestamp.getUTCMonth() + 1}`).slice(-2), // month
      (`0${timestamp.getUTCDate()}`).slice(-2), // day
    ].join('.')
    const source = buildSource(logEvent.message, logEvent.extractedFields)
    source['@id'] = logEvent.id
    source['@timestamp'] = new Date(1 * logEvent.timestamp).toISOString()
    source['@message'] = logEvent.message
    source['@owner'] = payload.owner
    source['@log_group'] = payload.logGroup
    source['@log_stream'] = payload.logStream
    const action = { index: {} }
    action.index._index = indexName
    action.index._type = 'lambdaLogs'
    action.index._id = logEvent.id
    bulkRequestBody += `${[
      JSON.stringify(action),
      JSON.stringify(source),
    ].join('\n')}\n`
  })
  return bulkRequestBody
}

function buildSource(message, extractedFields) {
  if (extractedFields) {
    const source = {}
    for (const key in extractedFields) {
      if (extractedFields.hasOwnProperty(key) && extractedFields[key]) {
        const value = extractedFields[key]
        if (isNumeric(value)) {
          source[key] = 1 * value
          continue
        }
        const jsonSubString = extractJson(value)
        if (jsonSubString !== null) {
          source[`$${key}`] = JSON.parse(jsonSubString)
        }
        source[key] = value
      }
    }
    return source
  }
  const jsonSubString = extractJson(message)
  if (jsonSubString !== null) {
    return JSON.parse(jsonSubString)
  }
  return {}
}

function extractJson(message) {
  const jsonStart = message.indexOf('{')
  if (jsonStart < 0) return null
  const jsonSubString = message.substring(jsonStart)
  return isValidJson(jsonSubString) ? jsonSubString : null
}

function isValidJson(message) {
  try {
    JSON.parse(message)
  } catch (e) { return false }
  return true
}

function isNumeric(n) {
  return !isNaN(parseFloat(n)) && isFinite(n)
}
One of my colleagues helped me get most of the way there, and then I managed to figure out the last bit.
I updated Serilog.Sinks.Udp to 6.0.0
I updated the UDP setup code to use the AddressFamily.InterNetwork specifier, which I don't believe was available in 5.0.1.
I removed enriching our log messages with "tags", since I believe its presence on the UDP endpoint somehow caused some kind of clash; I've seen it stop logging without a trace before.
And voila!
Here's the new logging setup code:
loggerConfig
    .WriteTo.Udp(udpHost, udpPort, AddressFamily.InterNetwork, udpFormatter)
    .WriteTo.Console(outputTemplate: "[{Level:u}]: {Message}{NewLine}{Exception}");

Breaking on exception: String expected

When I run my code I get:
Breaking on exception: String expected
What I am trying to do is connect to my server using a websocket. However, it seems that whether my server is online or not, the client still crashes.
My code:
import 'dart:html';
WebSocket serverConn;
int connectionAttempts;
TextAreaElement inputField = querySelector("#inputField");
String key;
void submitMessage(Event e) {
if (serverConn.readyState == WebSocket.OPEN) {
querySelector("#chatLog").text = inputField.value;
inputField.value = "";
}
}
void recreateConnection(Event e) {
connectionAttempts++;
if (connectionAttempts <= 5) {
inputField.value = "Connection failed, reconnecting. Attempt" + connectionAttempts.toString() + "out of 5";
serverConn = new WebSocket("ws://127.0.0.1:8887");
serverConn.onClose.listen(recreateConnection);
serverConn.onError.listen(recreateConnection);
} else {
inputField.value = "Connections ran out, please refresh site";
}
}
void connected(Event e) {
serverConn.sendString(key);
if (serverConn.readyState == WebSocket.OPEN) {
inputField.value = "CONNECTED!";
inputField.readOnly = false;
}
}
void main() {
serverConn = new WebSocket("ws://127.0.0.1:8887");
serverConn.onClose.listen(recreateConnection);
serverConn.onError.listen(recreateConnection);
serverConn.onOpen.listen(connected);
//querySelector("#inputField").onInput.listen(submitMessage);
querySelector("#sendInput").onClick.listen(submitMessage);
}
My Dart Editor says nothing about where the problem comes from nor does it give any warning until run-time.
You need to initialize int connectionAttempts; with a valid value;
connectionAttempts++; fails with an exception while the field is still null.
You also need an onMessage handler to receive messages, for example:
serverConn.onMessage.listen((MessageEvent e) {
  // handle the incoming data in e.data
});
recreateConnection should register an onOpen handler as well. After serverConn = new WebSocket(...) is assigned, the listeners registered in main() will no longer fire, because they are attached to the old WebSocket instance.
If you register a listener where only one single event is expected, you can use first instead of listen:
serverConn.onOpen.first.then(connected);
According to @JAre's comment:
Try to use a hardcoded string
querySelector("#chatLog").text = 'someValue';
to ensure this is not the culprit.

xuggler: no video in encoded 3gp file

I am trying to encode videos into 3gp format using Xuggler. I somehow got it to work, in the sense that the program stopped throwing errors and exceptions, but the new file that is created does not have any video. Since there is no error or exception for me to work with, I have hit a wall.
EDIT: Note that the audio is working as it should.
This is the code for the main function where the listeners are configured:
IMediaReader reader = ToolFactory.makeReader("/home/hp/mms/b.flv");
IMediaWriter writer = ToolFactory.makeWriter("/home/hp/mms/xuggle/a_converted.3gp", reader);
IMediaDebugListener debugListener = ToolFactory.makeDebugListener();
writer.addListener(debugListener);
ConvertVideo convertor = new ConvertVideo(new File("/home/hp/mms/b.flv"), new File("/home/hp/mms/xuggle/a_converted.3gp"));
// convertor.addListener(writer);
reader.addListener(writer);
writer.addListener(convertor);
while (reader.readPacket() == null)
;
And this is the code for the converter that I wrote:
public ConvertVideo(File inputFile, File outputFile)
{
    this.outputFile = outputFile;
    reader = ToolFactory.makeReader(inputFile.getAbsolutePath());
    reader.addListener(this);
}

private IVideoResampler videoResampler = null;
private IAudioResampler audioResampler = null;

@Override
public void onAddStream(IAddStreamEvent event)
{
    if (writer == null)
    {
        writer = ToolFactory.makeWriter(outputFile.getAbsolutePath(), reader);
    }
    int streamIndex = event.getStreamIndex();
    IStreamCoder streamCoder = event.getSource().getContainer().getStream(streamIndex).getStreamCoder();
    if (streamCoder.getCodecType() == ICodec.Type.CODEC_TYPE_AUDIO)
    {
        streamCoder.setFlag(IStreamCoder.Flags.FLAG_QSCALE, false);
        writer.addAudioStream(streamIndex, 0, 1, 8000);
    }
    else if (streamCoder.getCodecType() == ICodec.Type.CODEC_TYPE_VIDEO)
    {
        streamCoder.setFlag(IStreamCoder.Flags.FLAG_QSCALE, false);
        streamCoder.setCodec(ICodec.findEncodingCodecByName("h263"));
        writer.addVideoStream(streamIndex, 0, VIDEO_WIDTH, VIDEO_HEIGHT);
    }
    super.onAddStream(event);
}

@Override
public void onVideoPicture(IVideoPictureEvent event)
{
    IVideoPicture pic = event.getPicture();
    if (videoResampler == null)
    {
        videoResampler = IVideoResampler.make(VIDEO_WIDTH, VIDEO_HEIGHT, pic.getPixelType(), pic.getWidth(), pic.getHeight(), pic.getPixelType());
    }
    IVideoPicture out = IVideoPicture.make(pic.getPixelType(), VIDEO_WIDTH, VIDEO_HEIGHT);
    videoResampler.resample(out, pic);
    IVideoPictureEvent asc = new VideoPictureEvent(event.getSource(), out, event.getStreamIndex());
    super.onVideoPicture(asc);
    out.delete();
}

@Override
public void onAudioSamples(IAudioSamplesEvent event)
{
    IAudioSamples samples = event.getAudioSamples();
    if (audioResampler == null)
    {
        audioResampler = IAudioResampler.make(1, samples.getChannels(), 8000, samples.getSampleRate());
    }
    if (event.getAudioSamples().getNumSamples() > 0)
    {
        IAudioSamples out = IAudioSamples.make(samples.getNumSamples(), samples.getChannels());
        audioResampler.resample(out, samples, samples.getNumSamples());
        AudioSamplesEvent asc = new AudioSamplesEvent(event.getSource(), out, event.getStreamIndex());
        super.onAudioSamples(asc);
        out.delete();
    }
}
I just can't seem to figure out where the problem is. I would be thankful if someone would please point me in the right direction.
EDIT: If I look at the properties of my newly encoded video, its audio properties are set but its video properties are not, i.e. in the video properties, dimensions = 0 x 0, frame rate = N/A, and codec = h.263. The problem here is the 0 x 0 dimension.
Well, I found the answer, or not exactly the answer but a way to do what I was doing. Right now I am not quite sure why my code was not working, but here you can find the solution that worked for me. The author there makes a separate resizer class and adds it as a listener to the reader; it has the onVideoPicture method overridden. Then he makes another class, MyVideoListener, overrides the onAddStream method, and adds it as a listener to the writer. He then links the two parts by adding the writer as a listener to the resizer. Works like a charm.
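For reference, here is a minimal, untested sketch of that arrangement. The class names Resizer and MyVideoListener mirror the description above, the calls are the same mediatool APIs used in the question's code, and the 176x144 target size is an assumption (h263 only accepts a few fixed frame sizes, QCIF among them):

import com.xuggle.mediatool.IMediaReader;
import com.xuggle.mediatool.IMediaWriter;
import com.xuggle.mediatool.MediaListenerAdapter;
import com.xuggle.mediatool.MediaToolAdapter;
import com.xuggle.mediatool.ToolFactory;
import com.xuggle.mediatool.event.IAddStreamEvent;
import com.xuggle.mediatool.event.IVideoPictureEvent;
import com.xuggle.mediatool.event.VideoPictureEvent;
import com.xuggle.xuggler.ICodec;
import com.xuggle.xuggler.IStreamCoder;
import com.xuggle.xuggler.IVideoPicture;
import com.xuggle.xuggler.IVideoResampler;

public class ThreeGpTranscoder {
    static final int WIDTH = 176;  // QCIF, a size h263 accepts
    static final int HEIGHT = 144;

    // Resizer sits between the reader and the writer and resamples each video
    // frame; audio events pass through MediaToolAdapter untouched.
    static class Resizer extends MediaToolAdapter {
        private IVideoResampler resampler = null;

        @Override
        public void onVideoPicture(IVideoPictureEvent event) {
            IVideoPicture pic = event.getPicture();
            if (resampler == null) {
                resampler = IVideoResampler.make(WIDTH, HEIGHT, pic.getPixelType(),
                        pic.getWidth(), pic.getHeight(), pic.getPixelType());
            }
            IVideoPicture out = IVideoPicture.make(pic.getPixelType(), WIDTH, HEIGHT);
            resampler.resample(out, pic);
            // forward the resized picture downstream instead of the original
            super.onVideoPicture(new VideoPictureEvent(event.getSource(), out, event.getStreamIndex()));
            out.delete();
        }
    }

    // MyVideoListener fixes up the writer's video stream so its declared codec
    // and dimensions match the frames the resizer will deliver.
    static class MyVideoListener extends MediaListenerAdapter {
        @Override
        public void onAddStream(IAddStreamEvent event) {
            int streamIndex = event.getStreamIndex();
            IStreamCoder coder = event.getSource().getContainer()
                    .getStream(streamIndex).getStreamCoder();
            if (coder.getCodecType() == ICodec.Type.CODEC_TYPE_VIDEO) {
                coder.setCodec(ICodec.findEncodingCodecByName("h263"));
                coder.setWidth(WIDTH);
                coder.setHeight(HEIGHT);
            }
        }
    }

    public static void main(String[] args) {
        IMediaReader reader = ToolFactory.makeReader("/home/hp/mms/b.flv");
        IMediaWriter writer = ToolFactory.makeWriter("/home/hp/mms/xuggle/a_converted.3gp", reader);
        writer.addListener(new MyVideoListener());
        Resizer resizer = new Resizer();
        reader.addListener(resizer);  // reader -> resizer
        resizer.addListener(writer);  // resizer -> writer
        while (reader.readPacket() == null)
            ;
    }
}

The key difference from the question's ConvertVideo is that resizing and stream setup live in two separate listeners, so the writer only ever sees frames that already match the dimensions declared on the output stream.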

How do I close an OracleConnection in .NET

Say I have these two objects:
OracleConnection connection = new OracleConnection(connectionString);
OracleCommand command = new OracleCommand(sql, connection);
To close the connection to Oracle, do I have to call command.Dispose(), connection.Dispose(), or both?
Is this good enough:
using (connection)
{
    OracleDataReader reader = command.ExecuteReader();
    // whatever...
}
using (OracleConnection connection = new OracleConnection(connectionString))
{
    using (OracleCommand command = new OracleCommand(sql, connection))
    {
        using (OracleDataReader reader = command.ExecuteReader())
        {
        }
    }
}
If it implements IDisposable, and if you create it, then put it in a using block.
Both answers are pretty much on target. You always want to call .Dispose() on any IDisposable object. By wrapping it in a "using" you tell the compiler to always implement a try/finally block for you.
One point of note: if you want to avoid the nesting, you can write the same code like this:
using (OracleConnection connection = new OracleConnection(connectionString))
using (OracleCommand command = new OracleCommand(sql, connection))
using (OracleDataReader reader = command.ExecuteReader())
{
    // do something here
}
This is good enough. The using statement wraps disposal in a finally block, so even if an exception is thrown you are safe; it's my preferred way to dispose of resources.
using (OracleConnection connection = new OracleConnection(connectionString))
{
    // Create a command object
    using (OracleCommand command = new OracleCommand(sql, connection))
    {
        using (OracleDataReader reader = command.ExecuteReader())
        {
        }
    }
    // whatever...
}
I think that by using "using", you are asking the compiler to inject a try...finally block, and in the finally block it will dispose of the disposable object for you.
using will ensure your connection is closed. You could also pass CommandBehavior.CloseConnection to your command's ExecuteReader method to close it before Dispose is called.
