How to store large values (10 MB) in a key-value store database?

Consider a system that mostly stores small values (1-1000 bytes), but in some cases needs to store a large value (~10 MB) containing a list of hashes (think of a list of public keys for EdDSA signatures).
In Ethereum, storage is done via a Merkle Patricia trie, but I don't think that is necessary here. Does anyone know of a good way to handle such large values with a key-value store database?

Here is some code to store large binary objects (a.k.a. blobs) with FoundationDB:
from uuid import uuid4
from uuid import UUID
from collections import namedtuple
from hashlib import blake2b as hasher
from more_itertools import sliced
import found
class BStoreException(found.BaseFoundException):
    pass


BSTORE_SUFFIX_HASH = [b'\x01']
BSTORE_SUFFIX_BLOB = [b'\x02']

BStore = namedtuple('BStore', ('name', 'prefix_hash', 'prefix_blob',))


def make(name, prefix):
    prefix = list(prefix)
    out = BStore(name, tuple(prefix + BSTORE_SUFFIX_HASH), tuple(prefix + BSTORE_SUFFIX_BLOB))
    return out


async def get_or_create(tx, bstore, blob):
    hash = hasher(blob).digest()
    key = found.pack((bstore.prefix_hash, hash))
    maybe_uid = await found.get(tx, key)
    if maybe_uid is not None:
        return UUID(bytes=maybe_uid)
    # Otherwise create the hash entry and store the blob with a new uid
    # TODO: Use a counter and implement a garbage collector, and implement
    # bstore.delete
    uid = uuid4()
    found.set(tx, key, uid.bytes)
    for index, slice in enumerate(sliced(blob, found.MAX_SIZE_VALUE)):
        found.set(tx, found.pack((bstore.prefix_blob, uid, index)), bytes(slice))
    return uid


async def get(tx, bstore, uid):
    key = found.pack((bstore.prefix_blob, uid))
    out = b''
    async for _, value in found.query(tx, key, found.next_prefix(key)):
        out += value
    if out == b'':
        raise BStoreException('BLOB should be in database: uid={}'.format(uid))
    return out
Taken from https://github.com/amirouche/asyncio-foundationdb/blob/main/found/bstore.py
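For context, the snippet above boils down to content-addressed, chunked storage: hash the blob to deduplicate it, then split it into chunks small enough for the store's value-size limit and keep them under (prefix, uid, index) keys. Here is a minimal, dependency-free sketch of that idea (a plain dict stands in for the key-value store; the chunk size and key layout are illustrative assumptions, not the library's API):
from hashlib import blake2b
from uuid import uuid4

CHUNK_SIZE = 100_000  # FoundationDB caps a single value at 100,000 bytes

def put_blob(store, blob):
    """Store blob once (deduplicated by hash); return the uid its chunks live under."""
    digest = blake2b(blob).digest()
    if ('hash', digest) in store:
        return store[('hash', digest)]
    uid = uuid4()
    store[('hash', digest)] = uid
    for number, start in enumerate(range(0, len(blob), CHUNK_SIZE)):
        store[('blob', uid, number)] = blob[start:start + CHUNK_SIZE]
    return uid

def get_blob(store, uid):
    """Reassemble the blob by reading its chunk keys in order."""
    chunks = []
    number = 0
    while ('blob', uid, number) in store:
        chunks.append(store[('blob', uid, number)])
        number += 1
    return b''.join(chunks)

blobs = {}
payload = b'\x00' * (10 * 1024 * 1024)  # a 10 MB value
uid = put_blob(blobs, payload)
assert get_blob(blobs, uid) == payload
Because FoundationDB range reads return keys in order, reassembling the chunks with a single range query (as in the get function above) stays efficient even for multi-megabyte values.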

Related

SqlAlchemy query filter for Enum class property of Column(Enum(...))

I have a declarative class that has an Enum column, and the Enum has a property that returns True/False based on the specific enumerated name or value. It would simplify life if I could do a query with a filter based on this property, such as the following (see implementation below):
session.query(MyTable).filter(MyTable.letter.is_vowel)
Using the straightforward attempt at an expression shown below fails with:
AttributeError: Neither 'InstrumentedAttribute' object nor 'Comparator' object associated with MyTable.letter has an attribute 'is_vowel'
The below implementation is too simple to allow for construction of the necessary query. Is there a way to do this? I thought maybe something in a Comparator might work, or maybe there's something more sophisticated that would do it?
import enum

from sqlalchemy import (
    Column,
    Enum,
    Integer,
)
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.hybrid import hybrid_property

MyDeclarativeBase = declarative_base()


class Letter(enum.Enum):
    A = 1
    B = 2
    C = 3
    D = 4
    E = 5
    # and so on...

    @property
    def is_vowel(self):
        return self.name in 'AEIOU'


class MyTable(MyDeclarativeBase):
    __tablename__ = 'my_table'

    id = Column(Integer, primary_key=True, autoincrement=True)
    letter = Column(Enum(Letter), nullable=False)

    @hybrid_property
    def is_vowel(self):
        """Return True if the row's letter is a vowel."""
        return self.letter.is_vowel

    @is_vowel.expression
    def is_vowel(cls):
        return cls.letter.is_vowel
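One approach that should work (a sketch, not taken from the original post) is to keep the instance-level property but give the hybrid a SQL-level expression the database can evaluate: since the Python property cannot run inside the database, expand it into an IN clause over the enum members that satisfy it. Reusing the imports and the Letter enum from above:
class MyTable(MyDeclarativeBase):
    __tablename__ = 'my_table'

    id = Column(Integer, primary_key=True, autoincrement=True)
    letter = Column(Enum(Letter), nullable=False)

    @hybrid_property
    def is_vowel(self):
        """Instance level: delegate to the enum member's property."""
        return self.letter.is_vowel

    @is_vowel.expression
    def is_vowel(cls):
        # SQL level: expand the property into `letter IN ('A', 'E', ...)`.
        return cls.letter.in_([letter for letter in Letter if letter.is_vowel])

# The filter is then written against the hybrid itself:
# session.query(MyTable).filter(MyTable.is_vowel)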

Use generic types in Python to refactor code

I have been trying several ways to refactor the following code as these classes are recurring in my app:
class CreateRecord(Mutation):
    record = Field(lambda: Unit)

    class Arguments:
        input = CreateInput(required=True)

    def mutate(self, info, input):
        data = input_to_dictionary(input)
        data['createdAt'] = datetime.utcnow()
        # data['createdBy'] = <user> # TODO: <user> input
        record = UnitModel(**data)
        db_session.add(record)
        db_session.commit()
        return CreateRecord(record=record)


class UpdateRecord(Mutation):
    record = Field(lambda: Unit)

    class Arguments:
        input = UpdateInput(required=True)

    def mutate(self, info, input):
        data = input_to_dictionary(input)
        data['updatedAt'] = datetime.utcnow()
        # data['updatedBy'] = <user> # TODO: <user> input
        record = db_session.query(UnitModel).filter_by(id=data['id'])
        record.update(data)
        db_session.commit()
        record = db_session.query(UnitModel).filter_by(id=data['id']).first()
        return UpdateRecord(record=record)


class DeleteRecord(Mutation):
    record = Field(lambda: Unit)

    class Arguments:
        input = DeleteInput(required=True)

    def mutate(self, info, input):
        data = input_to_dictionary(input)
        data['deletedAt'] = datetime.utcnow()
        # data['deletedBy'] = <user> # TODO: <user> input
        data['isDeleted'] = True
        record = db_session.query(UnitModel).filter_by(id=data['id'])
        record.update(data)
        db_session.commit()
        record = db_session.query(UnitModel).filter_by(id=data['id']).first()
        return DeleteRecord(record=record)
I was thinking of using generic types but I'm kind of stuck on how to implement them. I've tried creating a master class whose mutate method checks whether the action is a create, update, or delete, but I'd still like to try generic types before settling on that.
Any help is highly appreciated. TIA.
I've solved this particular problem for myself with a mixin class that includes the following method:
from graphene.utils.str_converters import to_snake_case


class MutationResponseMixin(object):

    @classmethod
    def get_operation_type(cls):
        """
        Determine the CRUD type from the mutation class name.

        Uses the mutation's class name to determine the correct operation
        (create / update / delete).
        """
        return to_snake_case(cls.__name__).split('_')[0]
This allows me to include a mutation method in the mixin that is shared by the create, update, and delete mutations and takes conditional action based on the value of get_operation_type.
I also needed a way to determine the base record from the mixin's mutation (which in your case would be UnitModel), so in my case I ended up declaring it explicitly as an attribute of each mutation class, as sketched below.
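For illustration, here is a rough sketch of such a shared mutation (assumptions: each concrete mutation sets a hypothetical MODEL attribute, graphene accepts mutate defined as a classmethod, and input_to_dictionary / db_session are the helpers from the question):
from datetime import datetime

from graphene.utils.str_converters import to_snake_case


class MutationResponseMixin(object):
    MODEL = None  # hypothetical: set on each concrete mutation, e.g. MODEL = UnitModel

    @classmethod
    def get_operation_type(cls):
        """Derive 'create' / 'update' / 'delete' from the mutation class name."""
        return to_snake_case(cls.__name__).split('_')[0]

    @classmethod
    def mutate(cls, root, info, input):
        data = input_to_dictionary(input)
        operation = cls.get_operation_type()
        if operation == 'create':
            data['createdAt'] = datetime.utcnow()
            record = cls.MODEL(**data)
            db_session.add(record)
            db_session.commit()
        else:
            data['updatedAt' if operation == 'update' else 'deletedAt'] = datetime.utcnow()
            if operation == 'delete':
                data['isDeleted'] = True
            query = db_session.query(cls.MODEL).filter_by(id=data['id'])
            query.update(data)
            db_session.commit()
            record = query.first()
        return cls(record=record)


# Each concrete mutation then only declares what differs:
# class CreateRecord(MutationResponseMixin, Mutation):
#     MODEL = UnitModel
#     record = Field(lambda: Unit)
#
#     class Arguments:
#         input = CreateInput(required=True)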

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacing (differences between consecutive numbers) between the values, averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder().appName("test").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized however, that this was not a good idea. The data in the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM eig where generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive. Any Spark related solution is OK.
The number of values can be ~10000, and possibly more later. I would like to plot this distribution together with an already known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double]
  } else {
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this:
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))

Fast alternative to SortedMap.mapValues

The following code demonstrates the problem I have (in the real case SortedMap keys are Joda DateTime, and maps contain several thousands of elements).
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import scala.collection.immutable.SortedMap

object Main extends App {
  val s = SortedMap(1 -> "A", 2 -> "B", 3 -> "C")

  def f(s: String) = s

  val sMap = s.map(kv => kv._1 -> f(kv._2)) // slow: rebuilds Map, as keys could change
  val sMapValues = s.mapValues(f)           // fast, but creates a view only

  val so = new ByteArrayOutputStream
  val oos = new ObjectOutputStream(so)
  oos.writeObject(s)          // works
  oos.writeObject(sMap)       // works
  oos.writeObject(sMapValues) // does not work - view only
  oos.close()
  so.close()
}
The problem is that while mapValues performs well for SortedMap, the result is not a real collection but a view, and as such it cannot be serialized. The simple solution of mapping both keys and values works but is slow, because the tree representation is rebuilt: map does not know that the keys are unchanged.
Is there any fast alternative to SortedMap.mapValues, which outputs a serializable result?
Try transform:
val sMapValues = s.transform((_,v) => f(v))
Although both the key and the value are passed to the transformation lambda, the result is applied only to the value; the key is unchanged.

Writing to hadoop distributed file system multiple times with Spark

I've created a Spark job that reads a text file from my HDFS every day and extracts unique keys from each line in the file. There are roughly 50,000 keys in each text file. The same data is then filtered by the extracted keys and saved to HDFS.
I want to create a directory structure in my HDFS of the form hdfs://.../date/key containing the filtered data. The problem is that writing to HDFS takes a very, very long time because there are so many keys.
The way it's written right now:
val inputData = sparkContext.textFile("hdfs://...", 2)
val keys = extractKey(inputData)       // keys is an array of approx 50000 unique strings
val cleanedData = cleanData(inputData) // cleanedData is an RDD of strings
keys.map(key => {
  val filteredData = cleanedData.filter(line => line.contains(key))
  filteredData.repartition(1).saveAsTextFile("hdfs://.../date/key")
})
Is there a way to make this faster? I've thought about repartitioning the data into the number of keys extracted but then I can't save in the format hdfs://.../date/key. I've also tried groupByKey but I can't save the values because they aren't RDDs.
Any help is appreciated :)
import java.io.{BufferedWriter, OutputStreamWriter}

import scala.collection.mutable

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.Partitioner

def writeLines(iterator: Iterator[(String, String)]) = {
  val writers = new mutable.HashMap[String, BufferedWriter] // (key, writer) map
  try {
    while (iterator.hasNext) {
      val (key, line) = iterator.next()
      // reuse the writer for this key, or open one lazily the first time the key is seen
      val writer = writers.getOrElseUpdate(key, {
        val path = args(1) + key // assumes the output root directory is passed as args(1)
        val outputStream = FileSystem.get(new Configuration()).create(new Path(path))
        new BufferedWriter(new OutputStreamWriter(outputStream))
      })
      writer.write(line)
      writer.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}

val inputData = sc.textFile("hdfs://...")
val keyValue = inputData.map(line => (keyOf(line), line)) // keyOf: however you derive the key from a line
val partitions = keyValue.partitionBy(new MyPartitioner(10))
partitions.foreachPartition(writeLines)

class MyPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    // make sure lines with the same key end up in the same partition
    (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
I think the approach should be similar to Write to multiple outputs by key Spark - one Spark job. The number of partitions has nothing to do with the number of directories. To implement it, you may need to override generateFileNameForKeyValue with a customized version that saves each key to a different directory.
Regarding scalability, it is not an issue with Spark but with HDFS. No matter how you implement it, as long as the requirements don't change, it is unavoidable. But I think HDFS is probably OK with 50,000 file handles.
You are specifying just 2 partitions for the input and 1 partition for the output. One effect of this is to severely limit the parallelism of these operations. Why are these needed?
Instead of computing 50,000 filtered RDDs, which is really slow too, how about just grouping by the key directly? I get that you want to output them into different directories, but that is really what causes the bottleneck here. Is there perhaps another way to architect this that simply lets you read the (key, value) results?
