How to split an akka ByteString at a two byte delimiter - websocket

I'm creating a WebSocket protocol of my own, and thought to have a text key/value header part, ending at two consecutive newlines, followed by a binary tail.
Turns out, splitting a ByteString in half (at the two newlines) is really tedious. There is no built-in .split method, for one. And no .indexOf for finding a binary fingerprint.
What would you use for this? Is there an easier way for me to build such a protocol?
References:
akka ByteString
Using akka-http 10.1.0-RC1, akka 2.5.8

One approach would be to first create sliding pairs from an indexedSeq of the ByteString, then split the ByteString using the identified indexes of the delimiter-pair, as in the following example:
import akka.util.ByteString
val bs = ByteString("aa\nbb\n\nxyz")
// bs: akka.util.ByteString = ByteString(97, 97, 10, 98, 98, 10, 10, 120, 121, 122)
val delimiter = 10
// Create sliding pairs from indexedSeq of the ByteString
val slidingList = bs.zipWithIndex.sliding(2).toList
// slidingList: List[scala.collection.immutable.IndexedSeq[(Byte, Int)]] = List(
// Vector((97,0), (97,1)), Vector((97,1), (10,2)), Vector((10,2), (98,3)),
// Vector((98,3), (98,4)), Vector((98,4), (10,5)), Vector((10,5), (10,6)),
// Vector((10,6), (120,7)), Vector((120,7), (121,8)), Vector((121,8), (122,9))
// )
// Get indexes of the delimiter-pair
val dIndex = slidingList.filter {
  case Vector(x, y) => x._1 == delimiter && y._1 == delimiter
}.flatMap {
  case Vector(x, y) => Seq(x._2, y._2)
}
// Split the ByteString list
val (bs1, bs2) = ( bs.splitAt(dIndex(0))._1, bs.splitAt(dIndex(1))._2.tail )
// bs1: akka.util.ByteString = ByteString(97, 97, 10, 98, 98)
// bs2: akka.util.ByteString = ByteString(120, 121, 122)

I came up with this. Haven't tested it in practice yet.
import scala.annotation.tailrec
@tailrec
def peelMsg(bs: ByteString, accHeaderLines: Seq[String]): (Seq[String], ByteString) = {
  val (a: ByteString, tail: ByteString) = bs.span(_ != '\n')
  val b: ByteString = tail.drop(1)
  if (a.isEmpty) { // end marker - empty line
    (accHeaderLines, b)
  } else {
    val acc: Seq[String] = accHeaderLines :+ a.utf8String // append
    peelMsg(b, acc)
  }
}
val (headerLines: Seq[String], value: ByteString) = peelMsg(bs, Seq.empty)

My code, for now:
// Find the index of the (first) double-newline
//
val n: Int = {
  val bsLen: Int = bs.length
  val tmp: Int = bs.zipWithIndex.find {
    case ('\n', i) if i < bsLen - 1 && bs(i + 1) == '\n' => true
    case _ => false
  }.map(_._2).getOrElse {
    throw new RuntimeException("No delimiter found")
  }
  tmp
}
val (bs1: ByteString, bs2: ByteString) = bs.splitAt(n) // headers, \n\n<binary>
Influenced by @leo-c's answer, but using a plain .find instead of the sliding window. I realised that since ByteString allows random access, I can combine a streaming search with that condition.
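For reference: since ByteString extends IndexedSeq[Byte], the standard collections method indexOfSlice is also available, which gives a shorter split. A sketch (untested against the wire format, and assuming the akka 2.5.x ByteString keeps the full IndexedSeq API):
import akka.util.ByteString
val bs = ByteString("aa\nbb\n\nxyz")
val delim = ByteString("\n\n")
// indexOfSlice comes from the regular Scala collections API that ByteString inherits
val i = bs.indexOfSlice(delim)
val (header, binary) =
  if (i >= 0) (bs.take(i), bs.drop(i + delim.length))
  else throw new RuntimeException("No delimiter found")
// header: ByteString(97, 97, 10, 98, 98), binary: ByteString(120, 121, 122)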

Related

How to convert a range to a delimited string in Java 8+

How can you convert a range in Java (using either java.util.stream.LongStream or java.util.stream.IntStream) to a delimited string?
I have tried:
String str = LongStream.range(16, 30)
.boxed()
.map(String::valueOf)
.collect(Collectors.joining(","));
System.out.println(str);
This prints:
16,17,18,19,20,21,22,23,24,25,26,27,28,29
The same can be used with IntStream. Is there a more convenient conversion of a range to a delimited string?
Seriously, just for the fun of it. Using guava:
String result = ContiguousSet.create(
Range.closedOpen(16, 31), DiscreteDomain.integers())
.asList()
.toString();
Or
String result = String.join(",",
IntStream.rangeClosed(16, 30).mapToObj(String::valueOf).toArray(String[]::new));
Or:
String result = String.join(",",
() -> IntStream.rangeClosed(16, 31).mapToObj(x -> (CharSequence) String.valueOf(x)).iterator());
Or (seems like I got carried away a bit with this):
String result = IntStream.rangeClosed(16, 31)
        .boxed()
        .collect(Collector.of(
                () -> new Object() {
                    StringBuilder sb = new StringBuilder();
                },
                (obj, i) -> obj.sb.append(i).append(","),
                (left, right) -> {
                    left.sb.append(right.sb.toString());
                    return left;
                },
                x -> {
                    x.sb.setLength(x.sb.length() - 1);
                    return x.sb.toString();
                }));
And after Holger's good points, here is even a simpler version:
StringBuilder sb = IntStream.range(16, 30)
        .collect(StringBuilder::new,
                (builder, i) -> builder.append(i).append(", "),
                StringBuilder::append);
if (sb.length() != 0) {
    sb.setLength(sb.length() - 2);
}
String result = sb.toString();
With IntStream.mapToObj:
String s = IntStream.range(16, 30)
.mapToObj(String::valueOf)
.collect(Collectors.joining(","));
If your sense of convenience implies less code, there is no need to explicitly map each Long to a String; they can simply be mapped during the collecting process. Here's the code to do so:
List<String> str = LongStream.range(16, 30)
.boxed()
.collect(Collectors.mapping(l -> String.valueOf(l), Collectors.toList()));
System.out.println(str.toString());
The result is:
[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
In response to the comment about whether boxing can be avoided: with this approach it has to happen anyway, because range() returns a LongStream, which has to be converted to a stream of Long before the collector runs.
As @Jubobs noted, mapToObj() effectively returns an object stream straight from the LongStream.

Scala/functional/without libs - check if string permutation of other

How could you check to see if one string is a permutation of another using Scala/functional programming, without complex pre-built functions like sorted()?
I'm a Python dev, and what I think trips me up the most is that you can't just iterate through a dictionary of character counts, comparing it to another dictionary of character counts, and exit as soon as there isn't a match, because you can't just call break.
Assume this is the starting point, based on your description:
val a = "aaacddba"
val b = "aabaacdd"
def counts(s: String) = s.groupBy(identity).mapValues(_.size)
val aCounts = counts(a)
val bCounts = counts(b)
This is the simplest way:
aCounts == bCounts // true
This is precisely what you described:
def isPerm(aCounts: Map[Char, Int], bCounts: Map[Char, Int]): Boolean = {
  if (aCounts.size != bCounts.size)
    return false
  for ((k, v) <- aCounts) {
    if (bCounts.getOrElse(k, 0) != v)
      return false
  }
  return true
}
This is your method, but more scala-ish. (It also breaks as soon as a mismatch is found, because of how forall is implemented):
(aCounts.size == bCounts.size) &&
aCounts.forall { case (k,v) => bCounts.getOrElse(k, 0) == v }
(Also, Scala does have break.)
Also, also: you should read the answer to this question.
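To make the parenthetical above concrete: break in Scala lives in scala.util.control.Breaks rather than being a keyword. A minimal sketch of the early-exit loop the question describes (isPermBreak is a made-up name, not from the answers above):
import scala.util.control.Breaks._
def isPermBreak(aCounts: Map[Char, Int], bCounts: Map[Char, Int]): Boolean = {
  if (aCounts.size != bCounts.size) return false
  var ok = true
  breakable {
    for ((k, v) <- aCounts) {
      if (bCounts.getOrElse(k, 0) != v) { ok = false; break() }
    }
  }
  ok
}
Under the hood, breakable/break use a control-flow exception, so the forall version above is usually the more idiomatic choice.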
Another option is a recursive function, which will also 'break' immediately once a mismatch is detected:
import scala.annotation.tailrec
@tailrec
def isPerm1(a: String, b: String): Boolean = {
  if (a.length == b.length) {
    a.headOption match {
      case Some(c) =>
        val i = b.indexOf(c)
        if (i >= 0) {
          isPerm1(a.tail, b.substring(0, i) + b.substring(i + 1))
        } else {
          false
        }
      case None => true
    }
  } else {
    false
  }
}
Out of my own curiosity I also created two more versions which use a char-counts map for matching:
def isPerm2(a: String, b: String): Boolean = {
  val cntsA = a.groupBy(identity).mapValues(_.size)
  val cntsB = b.groupBy(identity).mapValues(_.size)
  cntsA == cntsB
}
and
def isPerm3(a: String, b: String): Boolean = {
  val cntsA = a.groupBy(identity).mapValues(_.size)
  val cntsB = b.groupBy(identity).mapValues(_.size)
  (cntsA.size == cntsB.size) && cntsA.forall { case (k, v) => cntsB.getOrElse(k, 0) == v }
}
and roughly compare their performance by:
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block // call-by-name
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}
// Match
time((1 to 10000).foreach(_ => isPerm1("apple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm2("apple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm3("apple"*100,"elppa"*100)))
// Mismatch
time((1 to 10000).foreach(_ => isPerm1("xpple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm2("xpple"*100,"elppa"*100)))
time((1 to 10000).foreach(_ => isPerm3("xpple"*100,"elppa"*100)))
and the result is:
Match cases
isPerm1 = 2337999406ns
isPerm2 = 383375133ns
isPerm3 = 382514833ns
Mismatch cases
isPerm1 = 29573489ns
isPerm2 = 381622225ns
isPerm3 = 417863227ns
As can be expected, the char counts map speeds up positive cases but can slow down negative cases (overhead on building the char counts map).

Efficient Sequence conversion to String

I have a string sequence Seq[String] which represents stdin input lines.
Those lines map to a model entity, but it is not guaranteed that 1 line = 1 entity instance.
Each entity is delimited with a special string that will not occur anywhere else in the input.
My solution was something like:
val entities = lines.mkString.split(myDelimiter).map(parseEntity)
The parseEntity implementation is not relevant; it takes a String and maps it to a case class which represents the model entity.
The problem is that with a given input, I get an OutOfMemoryException on the lines.mkString. Would a fold/foldLeft/foldRight be more efficient? Or do you have a better alternative?
You can solve this using akka streams and delimiter framing. See this section of the documentation for the basic approach.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Framing, Source}
import akka.util.ByteString
val example = (0 until 100).mkString("delimiter").grouped(8).toIndexedSeq
val framing = Framing.delimiter(ByteString("delimiter"), 1000)
implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()
Source(example)
  .map(ByteString.apply)
  .via(framing)
  .map(_.utf8String)
  .runForeach(println)
The conversion to and from ByteString is a bit annoying, but Framing.delimiter is only defined for ByteString.
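If you want a collection of parsed entities back instead of printing them, a sketch under the same setup (Entity and parseEntity are hypothetical stand-ins for the asker's model; example, framing and the implicit materializer come from the snippet above):
import akka.stream.scaladsl.Sink
import scala.concurrent.Future
case class Entity(raw: String)
def parseEntity(s: String): Entity = Entity(s)
val entitiesF: Future[Seq[Entity]] =
  Source(example)
    .map(ByteString.apply)
    .via(framing)
    .map(_.utf8String)
    .map(parseEntity)
    .runWith(Sink.seq)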
If you are fine with a more pure functional approach, fs2 will also offer primitives to solve this problem.
Something that worked for me if you are reading from a stream (your mileage may vary). Slightly modified version of Scala LineIterator:
import scala.collection.AbstractIterator

class EntityIterator(val iter: BufferedIterator[Char]) extends AbstractIterator[String] with Iterator[String] {
  private[this] val sb = new StringBuilder
  // Reads one char into the buffer; returns false at the delimiter or end of input
  def getc() = iter.hasNext && {
    val ch = iter.next()
    if (ch == '\n') false // Replace with your delimiter here
    else {
      sb append ch
      true
    }
  }
  def hasNext = iter.hasNext
  def next() = {
    sb.clear()
    while (getc()) {}
    sb.toString
  }
}

val entities =
  new EntityIterator(scala.io.Source.fromInputStream(...).iter.buffered)
entities.map(...)

Accessing position information in a scala combinator parser kills performance

I wrote a new combinator for my parser in Scala.
It's a variation of the ^^ combinator, which passes position information on.
But accessing the position information of the input element really costs performance.
In my case, parsing a big example needs around 3 seconds without position information; with it, it needs over 30 seconds.
I wrote a runnable example where the runtime is about 50% higher when accessing the position.
Why is that? How can I get a better runtime?
Example:
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.Parsers
import scala.util.matching.Regex
import scala.language.implicitConversions
object FooParser extends RegexParsers with Parsers {
  var withPosInfo = false

  def b: Parser[String] = regexB("""[a-z]+""".r) ^^# { case (b, x) => b + " ::" + x.toString }

  def regexB(p: Regex): BParser[String] = new BParser(regex(p))

  class BParser[T](p: Parser[T]) {
    def ^^#[U](f: ((Int, Int), T) => U): Parser[U] = Parser { in =>
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      val inwo = in.drop(start - offset)
      p(inwo) match {
        case Success(t, in1) =>
          var a = 3
          var b = 4
          if (withPosInfo) { // takes a lot of time
            a = inwo.pos.line
            b = inwo.pos.column
          }
          Success(f((a, b), t), in1)
        case ns: NoSuccess => ns
      }
    }
  }

  def main(args: Array[String]) = {
    val r = "foo" * 50000000
    var now = System.nanoTime
    parseAll(b, r)
    var us = (System.nanoTime - now) / 1000
    println("without: %d us".format(us))
    withPosInfo = true
    now = System.nanoTime
    parseAll(b, r)
    us = (System.nanoTime - now) / 1000
    println("with : %d us".format(us))
  }
}
Output:
without: 2952496 us
with : 4591070 us
Unfortunately, I don't think you can use the same approach. The problem is that line numbers end up implemented by scala.util.parsing.input.OffsetPosition which builds a list of every line break every time it is created. So if it ends up with string input it will parse the entire thing on every call to pos (twice in your example). See the code for CharSequenceReader and OffsetPosition for more details.
There is one quick thing you can do to speed this up:
val ip = inwo.pos
a = ip.line
b = ip.column
to at least avoid creating pos twice. But that still leaves you with a lot of redundant work. I'm afraid that to really solve the problem you'll have to build the index as in OffsetPosition yourself, just once, and then keep referring to it.
You could also file a bug report / make an enhancement request. This is not a very good way to implement the feature.
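A minimal sketch of that build-the-index-once idea (the class and method names here are made up for illustration, not part of the parser library): scan the input once for line breaks, then answer position queries with a binary search instead of rescanning on every call.
class LineIndex(source: CharSequence) {
  // Offsets at which each line starts, computed once
  private val lineStarts: Array[Int] = {
    val starts = scala.collection.mutable.ArrayBuffer(0)
    var i = 0
    while (i < source.length) {
      if (source.charAt(i) == '\n') starts += i + 1
      i += 1
    }
    starts.toArray
  }
  // 1-based (line, column) for a character offset
  def position(offset: Int): (Int, Int) = {
    var lo = 0
    var hi = lineStarts.length - 1
    while (lo < hi) { // find the last line start <= offset
      val mid = (lo + hi + 1) / 2
      if (lineStarts(mid) <= offset) lo = mid else hi = mid - 1
    }
    (lo + 1, offset - lineStarts(lo) + 1)
  }
}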

Extending Seq.sortBy in Scala

Say I have a list of names.
case class Name(val first: String, val last: String)
val names = Name("c", "B") :: Name("b", "a") :: Name("a", "B") :: Nil
If I now want to sort that list by last name (and if that is not enough, by first name), it is easily done.
names.sortBy(n => (n.last, n.first))
// List[Name] = List(Name(a,B), Name(c,B), Name(b,a))
But what if I'd like to sort this list based on some other collation for strings?
Unfortunately, the following does not work:
val o = new Ordering[String]{ def compare(x: String, y: String) = collator.compare(x, y) }
names.sortBy(n => (n.last, n.first))(o)
// error: type mismatch;
// found : java.lang.Object with Ordering[String]
// required: Ordering[(String, String)]
// names.sortBy(n => (n.last, n.first))(o)
Is there any way that allows me to change the ordering without having to write an explicit sortWith with multiple if-else branches to deal with all cases?
Well, this almost does the trick:
names.sorted(o.on((n: Name) => n.last + n.first))
On the other hand, you can do this as well:
implicit val o = new Ordering[String]{ def compare(x: String, y: String) = collator.compare(x, y) }
names.sortBy(n => (n.last, n.first))
This locally defined implicit will take precedence over the one defined on the Ordering object.
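For completeness, a self-contained sketch of that approach, assuming the collator in the question is a java.text.Collator (any Ordering[String] would work the same way):
import java.text.Collator
import java.util.Locale
// Assumption: the question's collator is a java.text.Collator; Locale.GERMAN is arbitrary
val collator = Collator.getInstance(Locale.GERMAN)
implicit val byCollator: Ordering[String] = new Ordering[String] {
  def compare(x: String, y: String) = collator.compare(x, y)
}
names.sortBy(n => (n.last, n.first)) // the implicit Ordering.Tuple2 now uses byCollator for both fields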
One solution is to extend the otherwise implicitly used Tuple2 ordering. Unfortunately, this means writing out Tuple2 in the code.
names.sortBy(n => (n.last, n.first))(Ordering.Tuple2(o, o))
I'm not 100% sure what methods you think collator should have.
But you have the most flexibility if you define the ordering on the case class:
val o = new Ordering[Name] {
  def compare(a: Name, b: Name) =
    3 * math.signum(collator.compare(a.last, b.last)) +
      math.signum(collator.compare(a.first, b.first))
}
names.sorted(o)
but you can also provide an implicit conversion from a string ordering to a name ordering:
def ostring2oname(os: Ordering[String]) = new Ordering[Name] {
  def compare(a: Name, b: Name) =
    3 * math.signum(os.compare(a.last, b.last)) + math.signum(os.compare(a.first, b.first))
}
and then you can use any String ordering to sort Names:
def oo = new Ordering[String] {
def compare(x: String, y: String) = x.length compare y.length
}
val morenames = List("rat","fish","octopus")
scala> morenames.sorted(oo)
res1: List[java.lang.String] = List(rat, fish, octopus)
Edit: A handy trick, in case it wasn't apparent, is that if you want to order by N things and you're already using compare, you can just multiply each thing by 3^k (with the first-to-order being multiplied by the largest power of 3) and add.
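A quick sanity check of that trick for two keys, each reduced with math.signum first so it stays in -1/0/1 (as in the compare methods above):
// The sign of 3*primary + secondary agrees with "primary first, then secondary as tiebreak"
val signs = for (p <- -1 to 1; s <- -1 to 1) yield (p, s)
signs.foreach { case (p, s) =>
  val combined = math.signum(3 * p + s)
  val cascaded = if (p != 0) p else s
  assert(combined == cascaded)
}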
If your comparisons are very time-consuming, you can easily add a cascading compare:
class CascadeCompare(i: Int) {
  def tiebreak(j: => Int) = if (i != 0) i else j
}
implicit def break_ties(i: Int) = new CascadeCompare(i)
and then
def ostring2oname(os: Ordering[String]) = new Ordering[Name] {
  def compare(a: Name, b: Name) =
    os.compare(a.last, b.last) tiebreak os.compare(a.first, b.first)
}
(Just be careful to nest them as x tiebreak (y tiebreak (z tiebreak w)) so you don't do the implicit conversion a bunch of times in a row.)
(If you really need fast compares, then you should write it all out by hand, or pack the orderings in an array and use a while loop. I'll assume you're not that desperate for performance.)

Resources