How to Read Files in Flink FlatMapFunction - hadoop

I am building a Flink pipeline and, based on live input data, need to read records from archive files in a RichFlatMapFunction (e.g. each day I want to read files from the previous day and week). What is the best way to do that?
I could use the Hadoop APIs directly, so that is what I'm trying next.
That would be something like this:
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}
class LoadHistory(
var basePath: String,
var pathTemplate: String,
) extends RichFlatMapFunction[(TypeAlias.GridId, TypeAlias.Timestamp), ArchiveRecord] {
// see
// https://programmerall.com/article/34422316834/
// https://stackoverflow.com/questions/37085528/hadoop-with-binary-files
// https://data-flair.training/blogs/hdfs-data-read-operation
val fileSystem = FileSystem.get(new Configuration())
def formatPath(pathTemplate: String, gridId: TypeAlias.GridId, archiveDate: TypeAlias.Timestamp): String = ???
override def flatMap(value: (TypeAlias.GridId, TypeAlias.Timestamp), out: Collector[ArchiveRecord]): Unit = {
val pathStr = formatPath(pathTemplate, value._1, value._2)
val path = new Path(pathStr)
if (!fileSystem.exists(path)) {
return
}
val in: FSDataInputStream = fileSystem.open(path)
if (pathStr.endsWith(".protobuf")) {
// TODO read file
} else {
assert(pathStr.endsWith(".lz4"))
// TODO read file
}
}
}
I'm new to Hadoop, so I figure I'll need to configure it before reading data from cloud storage (e.g. replace new Configuration() with something meaningful). I know Flink uses Hadoop to read files internally, so I am wondering whether I can access the configuration or the configured HadoopFileSystem object that Flink is using at runtime.
Previously I tried starting a Flink batch job inside the FlatMapFunction (ending with env.collect), but it seems to have resulted in thread-locking (job 2 won't start until job 1 is done).

I dug into the Flink source code a little and found a way to get an initialized org.apache.flink.core.fs.FileSystem object from an org.apache.flink.core.fs.Path. That can then be used to read the files:
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.core.fs.{FSDataInputStream, FileSystem, Path}
import org.apache.flink.util.Collector
class LoadHistory(
var basePath: String,
var pathTemplate: String,
) extends RichFlatMapFunction[(TypeAlias.GridId, TypeAlias.Timestamp), ArchiveRecord] {
val fileSystem = new Path(basePath).getFileSystem()
def formatPath(gridId: TypeAlias.GridId, archiveDate: TypeAlias.Timestamp): String = ???
override def flatMap(value: (TypeAlias.GridId, TypeAlias.Timestamp), out: Collector[ArchiveRecord]): Unit = {
val pathStr = formatPath(value._1, value._2)
val path = new Path(pathStr)
if (!fileSystem.exists(path)) {
return
}
val in: FSDataInputStream = fileSystem.open(path)
if (pathStr.endsWith(".protobuf")) {
// TODO read file
} else {
assert(pathStr.endsWith(".lz4"))
// TODO read file
}
}
}
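How the two "TODO read file" branches get filled in depends entirely on the archive format. Purely as an illustration, helpers like the following could be added to the class above, assuming the .protobuf files hold length-delimited messages written with writeDelimitedTo, that ArchiveRecord is a protobuf-generated Java class, and that the .lz4 files use the LZ4 frame format readable by Apache Commons Compress; decodeLz4Records is a placeholder, not a real API:
import org.apache.commons.compress.compressors.lz4.FramedLZ4CompressorInputStream

def readProtobuf(in: FSDataInputStream, out: Collector[ArchiveRecord]): Unit = {
  try {
    // parseDelimitedFrom returns null once the stream is exhausted
    var record = ArchiveRecord.parseDelimitedFrom(in)
    while (record != null) {
      out.collect(record)
      record = ArchiveRecord.parseDelimitedFrom(in)
    }
  } finally {
    in.close()
  }
}

def readLz4(in: FSDataInputStream, out: Collector[ArchiveRecord]): Unit = {
  // Wrap the Flink input stream in an LZ4 frame decompressor, then decode records
  // from the decompressed bytes.
  val lz4In = new FramedLZ4CompressorInputStream(in)
  try {
    decodeLz4Records(lz4In).foreach(out.collect)
  } finally {
    lz4In.close()
  }
}

// Placeholder: depends entirely on how the .lz4 archives were actually written
def decodeLz4Records(in: java.io.InputStream): Iterator[ArchiveRecord] = ???
The flatMap .protobuf branch would then call readProtobuf(in, out), and the .lz4 branch readLz4(in, out).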

Related

Capturing ElasticsearchSink Exceptions in Flink

I've recently been encountering some issues in the logs of my Flink job that handles writing to an Elasticsearch index. I was hoping to leverage some of the metrics that Flink exposes (or piggyback on them) to update metric counters when I encounter specific kinds of errors.
val builder = ElasticsearchSink.Builder(...)
builder.setFailureHandler { actionRequest, throwable, _, _ ->
// Log error here (and update metrics via metricGroup.counter(...)
}
return builder.build()
Currently, I don't have any "context" when the callback for the setFailureHandler occurs, and while I can log it, ideally I'd like to expose a metric to track how frequently this is occurring:
builder.setFailureHandler { actionRequest, throwable, _, _ ->
elasticExceptionsCounter.inc()
}
One additional wrinkle here is that my specific scenario relies on dynamically creating and handling these sinks via a router like the following:
class DynamicElasticsearchSink<ElementT, RouteT, SinkT : ElasticsearchSinkBase<ElementT, out AutoCloseable>>(
private val sinkRouter: ElasticsearchSinkRouter<ElementT, RouteT, SinkT>
) : RichSinkFunction<ElementT>(), CheckpointedFunction {
// Store a reference to all of the current routes
private val sinkRoutes: MutableMap<RouteT, SinkT> = ConcurrentHashMap()
private lateinit var configuration: Configuration
override fun open(parameters: Configuration) {
configuration = parameters
}
override fun invoke(value: ElementT, context: SinkFunction.Context) {
val route = sinkRouter.getRoute(value)
var sink = sinkRoutes[route]
if (sink == null) {
// Build a new sink for this key and cache it for later use based on incoming records
sink = sinkRouter.createSink(route, value)
sink.runtimeContext = runtimeContext
sink.open(configuration)
sinkRoutes[route] = sink
}
sink.invoke(value, context)
}
// Omitted for brevity
}
and the sinkRouter.createSink() looks like the following:
override fun createSink(cacheKey: String, element: JsonObject): ElasticsearchSink<JsonObject> {
return buildSinkFromRoute(element)
}
private fun buildSinkFromRoute(element: JsonObject): ElasticsearchSink<JsonObject> {
val builder = ElasticsearchSink.Builder(
buildHostsFromElement(element),
ElasticsearchRoutingFunction()
)
// Various configuration omitted for brevity
builder.setFailureHandler { actionRequest, throwable, _, _ ->
// Here's where I'd like to capture the failures and record them as metrics
}
return builder.build()
}
Is there a way to support this currently, or what options are available for handling this?
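One possible direction (a rough Scala sketch; the same Flink APIs are reachable from Kotlin, and everything here apart from the ActionRequestFailureHandler and RequestIndexer interfaces is an illustrative assumption): register a Counter on the wrapper's runtime context in open(), e.g. getRuntimeContext.getMetricGroup.counter("elasticsearchSinkFailures"), and pass it into createSink so each dynamically built sink's failure handler can close over it:
import org.apache.flink.metrics.Counter
import org.apache.flink.streaming.connectors.elasticsearch.{ActionRequestFailureHandler, RequestIndexer}
import org.elasticsearch.action.ActionRequest

// Increments a Flink metric for every failed action. Closing over a live Counter is
// workable here only because the routed sinks are built at runtime inside invoke(),
// so this handler is never serialized as part of the job graph.
class CountingFailureHandler(counter: Counter) extends ActionRequestFailureHandler {
  override def onFailure(action: ActionRequest,
                         failure: Throwable,
                         restStatusCode: Int,
                         indexer: RequestIndexer): Unit = {
    counter.inc()
    // Re-add the request via `indexer` or rethrow here, depending on the kind of failure.
  }
}
buildSinkFromRoute would then call builder.setFailureHandler(new CountingFailureHandler(elasticExceptionsCounter)) instead of the bare lambda.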

In Spring Webflux how to go from an `OutputStream` to a `Flux<DataBuffer>`?

I'm building a tarball dynamically, and would like to stream it back directly, which should be 100% possible with a .tar.gz.
The code below is the closest I could get to a DataBuffer, through lots of googling. Basically, I need something that implements an OutputStream and provides, or publishes, to a Flux<DataBuffer> so that I can return it from my method and have streaming output, instead of buffering the entire tarball in RAM (which I'm pretty sure is what is happening here). I'm using Apache Commons Compress, which has a wonderful API, but it's all OutputStream based.
I suppose another way to do it would be to directly write to the response, but I don't think that would be properly reactive? Not sure how to get an OutputStream out of some sort of Response object either.
This is Kotlin, by the way, on Spring Boot 2.0.
@GetMapping("/cookbook.tar.gz", "/cookbook")
fun getCookbook(): Mono<DefaultDataBuffer> {
log.info("Creating tarball of cookbooks: ${soloConfig.cookbookPaths}")
val transformation = Mono.just(soloConfig.cookbookPaths.stream()
.toList()
.flatMap {
Files.walk(Paths.get(it)).map(Path::toFile).toList()
})
.map { files ->
//Will make one giant databuffer... but oh well? TODO: maybe use some kind of chunking.
val buffer = DefaultDataBufferFactory().allocateBuffer()
val outputBufferStream = buffer.asOutputStream()
//Transform my list of stuff into an archiveOutputStream
TarArchiveOutputStream(GzipCompressorOutputStream(outputBufferStream)).use { taos ->
taos.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU)
log.info("files to compress: ${files}")
for (file in files) {
if (file.isFile) {
val entry = "cookbooks/" + file.name
log.info("Adding ${entry} to tarball")
taos.putArchiveEntry(TarArchiveEntry(file, entry))
FileInputStream(file).use { fis ->
fis.copyTo(taos) //Copy that stuff!
}
taos.closeArchiveEntry()
}
}
}
buffer
}
return transformation
}
I puzzled through this, and have an effective solution. You implement an OutputStream and take those bytes and publish them into a stream. Be sure to override close, and send an onComplete. Works great!
@RestController
class SoloController(
val soloConfig: SoloConfig
) {
val log = KotlinLogging.logger { }
@GetMapping("/cookbooks.tar.gz", "/cookbooks")
fun streamCookbook(serverHttpResponse: ServerHttpResponse): Flux<DataBuffer> {
log.info("Creating tarball of cookbooks: ${soloConfig.cookbookPaths}")
val publishingOutputStream = PublishingOutputStream(serverHttpResponse.bufferFactory())
//Needs to set up cookbook path as a parent directory, and then do `cookbooks/$cookbook_path/<all files>` for each cookbook path given
Flux.just(soloConfig.cookbookPaths.stream().toList())
.doOnNext { paths ->
//Transform my list of stuff into an archiveOutputStream
TarArchiveOutputStream(GzipCompressorOutputStream(publishingOutputStream)).use { taos ->
taos.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU)
paths.forEach { cookbookDir ->
if (Paths.get(cookbookDir).toFile().isDirectory) {
val cookbookDirFile = Paths.get(cookbookDir).toFile()
val directoryName = cookbookDirFile.name
val entryStart = "cookbooks/${directoryName}"
val files = Files.walk(cookbookDirFile.toPath()).map(Path::toFile).toList()
log.info("${files.size} files to compress")
for (file in files) {
if (file.isFile) {
val relativePath = file.toRelativeString(cookbookDirFile)
val entry = "$entryStart/$relativePath"
taos.putArchiveEntry(TarArchiveEntry(file, entry))
FileInputStream(file).use { fis ->
fis.copyTo(taos) //Copy that stuff!
}
taos.closeArchiveEntry()
}
}
}
}
}
}
.subscribeOn(Schedulers.parallel())
.doOnComplete {
publishingOutputStream.close()
}
.subscribe()
return publishingOutputStream.publisher
}
class PublishingOutputStream(bufferFactory: DataBufferFactory) : OutputStream() {
val publisher: UnicastProcessor<DataBuffer> = UnicastProcessor.create(Queues.unbounded<DataBuffer>().get())
private val bufferPublisher: UnicastProcessor<Byte> = UnicastProcessor.create(Queues.unbounded<Byte>().get())
init {
bufferPublisher
.bufferTimeout(4096, Duration.ofMillis(100))
.doOnNext { intList ->
val buffer = bufferFactory.allocateBuffer(intList.size)
buffer.write(intList.toByteArray())
publisher.onNext(buffer)
}
.doOnComplete {
publisher.onComplete()
}
.subscribeOn(Schedulers.newSingle("publisherThread"))
.subscribe()
}
override fun write(b: Int) {
bufferPublisher.onNext(b.toByte())
}
override fun close() {
bufferPublisher.onComplete() //which should trigger the clean up of the whole thing
}
}
}

Why does a zmq ( inproc:// )-connection's order matter, unlike for ( tcp:// )?

When launching a zmq server and client, in any random order, communicating over the tcp:// transport-class, they are smart enough to connect/reconnect regardless of the order.
However, when trying to run the same over the inproc:// transport-class, I see that it works only if the client starts after the server. How can we avoid this?
MCVE code:
Here are some Kotlin MCVE code examples to reproduce the claim (this is a modified version of the well-known weather example).
server.kt - run this to run the server standalone
package sandbox.zmq
import org.zeromq.ZMQ
import org.zeromq.ZMQ.Context
import sandbox.util.Util.sout
import java.util.*
fun main(args: Array<String>) {
server(
context = ZMQ.context(1),
// publishTo = "tcp://localhost:5556"
publishTo = "tcp://localhost:5557"
)
}
fun server(context: Context, publishTo: String) {
val publisher = context.socket(ZMQ.PUB)
publisher.bind(publishTo)
// Initialize random number generator
val srandom = Random(System.currentTimeMillis())
while (!Thread.currentThread().isInterrupted) {
// Get values that will fool the boss
val zipcode: Int
val temperature: Int
val relhumidity: Int
zipcode = 10000 + srandom.nextInt(10)
temperature = srandom.nextInt(215) - 80 + 1
relhumidity = srandom.nextInt(50) + 10 + 1
// Send message to all subscribers
val update = String.format("%05d %d %d", zipcode, temperature, relhumidity)
println("server >> $update")
publisher.send(update, 0)
Thread.sleep(500)
}
publisher.close()
context.term()
}
client.kt - run this for the client standalone
package sandbox.zmq
import org.zeromq.ZMQ
import org.zeromq.ZMQ.Context
import java.util.*
fun main(args: Array<String>) {
client(
context = ZMQ.context(1),
readFrom = "tcp://localhost:5557"
)
}
fun client(context: Context, readFrom: String) {
// Socket to talk to server
println("Collecting updates from weather server")
val subscriber = context.socket(ZMQ.SUB)
// subscriber.connect("tcp://localhost:");
subscriber.connect(readFrom)
// Subscribe to zipcode, default is NYC, 10001
subscriber.subscribe("".toByteArray())
// Process 100 updates
var update_nbr: Int
var total_temp: Long = 0
update_nbr = 0
while (update_nbr < 10000) {
// Use trim to remove the tailing '0' character
val string = subscriber.recvStr(0).trim { it <= ' ' }
println("client << $string")
val sscanf = StringTokenizer(string, " ")
val zipcode = Integer.valueOf(sscanf.nextToken())
val temperature = Integer.valueOf(sscanf.nextToken())
val relhumidity = Integer.valueOf(sscanf.nextToken())
total_temp += temperature.toLong()
update_nbr++
}
subscriber.close()
}
inproc.kt - run this and modify which sample is called for the inproc:// scenarios
package sandbox.zmq
import org.zeromq.ZMQ
import kotlin.concurrent.thread
fun main(args: Array<String>) {
// clientFirst()
clientLast()
}
fun println(string: String) {
System.out.println("${Thread.currentThread().name} : $string")
}
fun clientFirst() {
val context = ZMQ.context(1)
val client = thread {
client(
context = context,
readFrom = "inproc://backend"
)
}
// use this to maintain order
Thread.sleep(10)
val server = thread {
server(
context = context,
publishTo = "inproc://backend"
)
}
readLine()
client.interrupt()
server.interrupt()
}
fun clientLast() {
val context = ZMQ.context(1)
val server = thread {
server(
context = context,
publishTo = "inproc://backend"
)
}
// use this to maintain order
Thread.sleep(10)
val client = thread {
client(
context = context,
readFrom = "inproc://backend"
)
}
readLine()
client.interrupt()
server.interrupt()
}
Why does the zmq inproc:// connection order matter, unlike for tcp://?
Well, this is a by-design behaviour
Given that the native ZeroMQ API has always warned about this by-design behaviour, the issue is not a problem, but an intended property.
Plus one additional property also has to be met:
The name [ meaning an_endpoint_name in .connect("inproc://<_an_endpoint_name_>") ] must have been previously created by assigning it to at least one socket within the same ØMQ context as the socket being connected.
Newer versions of the native ZeroMQ API ( post 4.0 ), if indeed deployed under one's respective language binding / wrapper, may allow one to relax the former of these requirements:
Since version 4.0 the order of zmq_bind() and zmq_connect() does not matter just like for the tcp transport type.
How can we avoid this?
Well, a much harder part ...
If such an easy way is not already available on top of the ZeroMQ native API v4.2+, one may roll up one's sleeves and either re-factor the pre-4.x language wrapper / binding so as to get the engine there, or maybe test whether Martin SUSTRIK's second lovely child, nanomsg, could fit the scene for achieving this.
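Purely as an illustration of that by-design rule (a Scala sketch; the JeroMQ calls are the same ones used in the Kotlin MCVE, the endpoint name mirrors inproc.kt, and the payload string is made up): the bind can be guaranteed to happen before the connect with a latch rather than a Thread.sleep, which is all the inproc:// transport requires on pre-4.x engines:
import java.util.concurrent.CountDownLatch
import org.zeromq.ZMQ

object InprocOrderDemo {
  def main(args: Array[String]): Unit = {
    val context = ZMQ.context(1)
    val bound = new CountDownLatch(1)

    val server = new Thread(() => {
      val publisher = context.socket(ZMQ.PUB)
      try {
        publisher.bind("inproc://backend") // the endpoint must exist before any connect()
        bound.countDown()                  // signal only after bind() has returned
        while (!Thread.currentThread().isInterrupted) {
          publisher.send("10001 21 45", 0)
          Thread.sleep(500)
        }
      } catch {
        case _: InterruptedException => // shutting down
      } finally {
        publisher.close()
      }
    })

    val client = new Thread(() => {
      bound.await() // wait for the bind instead of relying on sleep-based ordering
      val subscriber = context.socket(ZMQ.SUB)
      subscriber.connect("inproc://backend")
      subscriber.subscribe("".getBytes)
      println("client << " + subscriber.recvStr(0))
      subscriber.close()
    })

    server.start()
    client.start()
    client.join()
    server.interrupt()
    server.join()
    context.term()
  }
}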

cocoalumberjack log to one file

I am developing a Mac application that needs to log to a folder where another application is already logging, so I need to create only one file in that folder. When file rolling occurs, the entire contents of that log folder get deleted. This is the code I am using. I don't want the contents of the log folder deleted; is it possible to use only one file with a constant name? Please help me.
// Configure CocoaLumberjack
DDLog.addLogger(DDASLLogger.sharedInstance())
DDLog.addLogger(DDTTYLogger.sharedInstance())
// Initialize File Logger
let manager : BaseLogFileManager = BaseLogFileManager(logsDirectory:K.LogFileDir)
let fileLogger: DDFileLogger = DDFileLogger(logFileManager: manager) // File Logger
fileLogger.maximumFileSize = 1024*1024*20
fileLogger.doNotReuseLogFiles = false
fileLogger.logFileManager.maximumNumberOfLogFiles = 1
DDLog.addLogger(fileLogger)
class BaseLogFileManager : DDLogFileManagerDefault
{
override var newLogFileName: String! { get {
return K.LogFileName
}}
override func isLogFile(fileName: String!) -> Bool
{
return true
}
}
A workaround is to disable rolling: don't assign a maximum file size or rollingFrequency, and check the size yourself using NSFileManager. If the file size is greater than a specific limit, remove the file and create a new one.
// Configure CocoaLumberjack
DDLog.addLogger(DDASLLogger.sharedInstance())
DDLog.addLogger(DDTTYLogger.sharedInstance())
// Initialize File Logger
let manager : BaseLogFileManager = BaseLogFileManager(logsDirectory:K.LogFileDir)
let fileLogger: DDFileLogger = DDFileLogger(logFileManager: manager) // File Logger
do {
let attr : NSDictionary? = try NSFileManager.defaultManager().attributesOfItemAtPath(K.LogFileDir+"/"+K.LogFileName)
if let _attr = attr {
if _attr.fileSize() > 1024*1024*10
{
NSFileManager.defaultManager().createFileAtPath(K.LogFileDir+"/"+K.LogFileName, contents: NSData(), attributes: nil)
}
}
} catch {
print("Error: \(error)")
}
fileLogger.doNotReuseLogFiles = false
fileLogger.logFileManager.maximumNumberOfLogFiles = 1
DDLog.addLogger(fileLogger)

Akka HTTP REST API for producing to Kafka Performance

I'm building an API with Akka that should produce to a Kafka bus. I have been load testing the application using Gatling and noticed that when more than 1000 users are created, the API starts to struggle. On average, about 170 requests per second are handled, which seems very low to me.
The API's main entry point is this:
import akka.actor.{Props, ActorSystem}
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.pattern.ask
import akka.http.scaladsl.server.Directives
import akka.http.scaladsl.unmarshalling.Unmarshaller
import akka.stream.ActorMaterializer
import com.typesafe.config.{Config, ConfigFactory}
import play.api.libs.json.{JsObject, Json}
import scala.concurrent.{Future, ExecutionContext}
import akka.http.scaladsl.server.Directives._
import akka.util.Timeout
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
case class PostMsg(msg:JsObject)
case object PostSuccess
case class PostFailure(msg:String)
class Msgapi(conf:Config) {
implicit val um:Unmarshaller[HttpEntity, JsObject] = {
Unmarshaller.byteStringUnmarshaller.mapWithCharset { (data, charset) =>
Json.parse(data.toArray).asInstanceOf[JsObject]
}
}
implicit val system = ActorSystem("MsgApi")
implicit val timeout = Timeout(5 seconds)
implicit val materializer = ActorMaterializer()
val router = system.actorOf(Props(new RouterActor(conf)))
val route = {
path("msg") {
post {
entity(as[JsObject]) {obj =>
if(!obj.keys.contains("key1") || !obj.keys.contains("key2") || !obj.keys.contains("key3")){
complete{
HttpResponse(status=StatusCodes.BadRequest, entity="Invalid json provided. Required fields: key1, key2, key3 \n")
}
} else {
onSuccess(router ? PostMsg(obj)){
case PostSuccess => {
complete{
Future{
HttpResponse(status = StatusCodes.OK, entity = "Post success")
}
}
}
case PostFailure(msg) =>{
complete{
Future{
HttpResponse(status = StatusCodes.InternalServerError, entity=msg)
}
}
}
case _ => {
complete{
Future{
HttpResponse(status = StatusCodes.InternalServerError, entity = "Unknown Server error occurred.")
}
}
}
}
}
}
}
}
}
def run():Unit = {
Http().bindAndHandle(route, interface = conf.getString("http.host"), port = conf.getInt("http.port"))
}
}
object RunMsgapi {
def main(Args: Array[String]):Unit = {
val conf = ConfigFactory.load()
val api = new Msgapi(conf)
api.run()
}
}
The router actor is as follows:
import akka.actor.{ActorSystem, Props, Actor}
import akka.http.scaladsl.server.RequestContext
import akka.routing.{Router, SmallestMailboxRoutingLogic, ActorRefRoutee}
import com.typesafe.config.Config
import play.api.libs.json.JsObject
class RouterActor(conf:Config) extends Actor{
val router = {
val routees = Vector.tabulate(conf.getInt("kafka.producer-number"))(n => {
val r = context.system.actorOf(Props(new KafkaProducerActor(conf, n )))
ActorRefRoutee(r)
})
Router(SmallestMailboxRoutingLogic(), routees)
}
def receive = {
case PostMsg(msg) => {
router.route(PostMsg(msg), sender())
}
}
}
And finally, the kafka producer actor:
import akka.actor.Actor
import java.util.Properties
import com.typesafe.config.Config
import kafka.message.NoCompressionCodec
import kafka.utils.Logging
import org.apache.kafka.clients.producer._
import play.api.libs.json.JsObject
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import ExecutionContext.Implicits.global
import scala.util.{Failure, Success}
class KafkaProducerActor(conf:Config, id:Int) extends Actor with Logging {
var topic: String = conf.getString("kafka.topic")
val codec = NoCompressionCodec.codec
val props = new Properties()
props.put("bootstrap.servers", conf.getString("kafka.bootstrap-servers"))
props.put("acks", conf.getString("kafka.acks"))
props.put("retries", conf.getString("kafka.retries"))
props.put("batch.size", conf.getString("kafka.batch-size"))
props.put("linger.ms", conf.getString("kafka.linger-ms"))
props.put("buffer.memory", conf.getString("kafka.buffer-memory"))
props.put("key.serializer", conf.getString("kafka.key-serializer"))
props.put("value.serializer", conf.getString("kafka.value-serializer"))
val producer = new KafkaProducer[String, String](props)
def receive = {
case PostMsg(msg) => {
// push the msg to Kafka
try{
val res = Future{
producer.send(new ProducerRecord[String, String](topic, msg.toString()))
}
val result = Await.result(res, 1 second).get()
sender ! PostSuccess
} catch{
case e: Exception => {
println(e.printStackTrace())
sender ! PostFailure("Kafka push error")
}
}
}
}
}
The idea being that in application.conf I can easily specify how many producers there should be, allowing better horizontal scaling.
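For reference, an application.conf matching the keys read by this code might look roughly like the following; every value here is only an illustrative placeholder:
http {
  host = "0.0.0.0"
  port = 8080
}

kafka {
  producer-number = 4
  topic = "msg"
  bootstrap-servers = "localhost:9092"
  acks = "1"
  retries = "0"
  batch-size = "16384"
  linger-ms = "1"
  buffer-memory = "33554432"
  key-serializer = "org.apache.kafka.common.serialization.StringSerializer"
  value-serializer = "org.apache.kafka.common.serialization.StringSerializer"
}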
Now, however, it seems that the api or router is actually the bottleneck. As a test, I disabled the Kafka producing code, and replaced it with a simple: sender ! PostSuccess. With 3000 users in Gatling, I still had 6% of requests failing due to timeouts, which seems like a very long time to me.
The Gatling test I am executing is the following:
import io.gatling.core.Predef._ // 2
import io.gatling.http.Predef._
import scala.concurrent.duration._
class BasicSimulation extends Simulation { // 3
val httpConf = http // 4
.baseURL("http://localhost:8080") // 5
.acceptHeader("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8") // 6
.doNotTrackHeader("1")
.acceptLanguageHeader("en-US,en;q=0.5")
.acceptEncodingHeader("gzip, deflate")
.userAgentHeader("Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0")
.header("Content-Type", "application/json")
val scn = scenario("MsgLoadTest")
.repeat(100)(
pace(2 seconds)
.exec(http("request_1")
.post("/msg").body(StringBody("""{ "key1":"something", "key2": "somethingElse", "key3":2222}""")).asJSON)
)
setUp( // 11
scn.inject(rampUsers(3000) over (5 seconds)) // 12
).protocols(httpConf) // 13
}
update
Following some pointers from cmbaxter, I tried some things (see discussion in comments), and profiled the application using visualvm during the gatling load test. I don't quite know how to interpret these results though. It seems that a lot of time is spent in the ThreadPoolExecutor, but this might be ok?
Two screenshots from the profiling are below:
To exclude the Kafka producer, I removed the logic from the Actor. I was still getting performance issues. So, as a final test, I reworked the API to simply give a direct answer when a POST came in:
val route = {
path("msg") {
post {
entity(as[String]) { obj =>
complete(
HttpResponse(status = StatusCodes.OK, entity = "OK")
)
}
}
}
}
and I implemented the same route in Spray, to compare performance. The results were clear. Akka HTTP (at least in this current test setup) does not come close to Spray's performance. Perhaps there is some tweaking that can be done for Akka HTTP? I have attached two screenshots of response time graphs for 3000 concurrent users in Gatling, making a post request.
Akka HTTP
Spray
I would eliminate the KafkaProducerActor and router completely and call a Scala wrapped version of producer.send directly. Why create a possible bottleneck if not necessary? I could very well imagine the global execution context or the actor system becoming a bottleneck in your current setup.
Something like this should do the trick:
import org.apache.kafka.clients.producer.{BufferExhaustedException, Callback, KafkaProducer, ProducerRecord, RecordMetadata}
import org.apache.kafka.common.KafkaException
import scala.concurrent.{Future, Promise}

class KafkaScalaProducer(val producer: KafkaProducer[String, String]) {
  def send(topic: String, msg: String): Future[RecordMetadata] = {
    val promise = Promise[RecordMetadata]()
    try {
      producer.send(new ProducerRecord[String, String](topic, msg), new Callback {
        override def onCompletion(md: RecordMetadata, e: java.lang.Exception): Unit = {
          // The callback reports success when the exception is null
          if (e == null) promise.success(md)
          else promise.failure(e)
        }
      })
    } catch {
      case e: BufferExhaustedException => promise.failure(e)
      case e: KafkaException => promise.failure(e)
    }
    promise.future
  }

  def close(): Unit = producer.close()
}
(note: I have not actually tried this code. It should be interpreted as pseudo-code)
I would then simply transform the result of the future into an HttpResponse.
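Something like the following, as a sketch only: it reuses the JsObject unmarshaller from the question and assumes a kafkaProducer instance of the wrapper above plus a topic value taken from configuration.
import scala.util.{Failure, Success}

// kafkaProducer and topic are assumed to be constructed elsewhere (e.g. in Msgapi's constructor)
val route =
  path("msg") {
    post {
      entity(as[JsObject]) { obj =>
        // Complete the request directly from the producer's Future, with no actor round-trip
        onComplete(kafkaProducer.send(topic, obj.toString())) {
          case Success(_) => complete(HttpResponse(status = StatusCodes.OK, entity = "Post success"))
          case Failure(e) => complete(HttpResponse(status = StatusCodes.InternalServerError, entity = e.getMessage))
        }
      }
    }
  }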
After that it's a question of tweaking configuration. Your bottleneck is now either the Kafka Producer or Akka Http.
