How to persist a Spark MLlib model in a database?

I have a MultilayerPerceptronClassificationModel set up and trained (in the same way as in this tutorial), and now I want to persist it so that I can reuse the neural network the next time I need to classify data. The model has save and load methods for persisting it to, and restoring it from, a file. But is there a way to save, and later load, the model in a database (in my case, Cassandra)?

OK, I found an answer myself. I'm not sure it is the best solution, but it works fine for me.
MultilayerPerceptronClassificationModel (and, as far as I can see, every model in the MLlib package) implements the Serializable interface, so it can be serialized to, and deserialized from, a byte array.
Let's create a table for storing the model in Cassandra:
CREATE TABLE models (
    uid TEXT,
    name TEXT,
    model BLOB,
    PRIMARY KEY (uid)
);
Now we can write the model to the DB:
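import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import com.datastax.spark.connector._ // provides saveToCassandra and SomeColumns
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel

// `sc` is the active SparkContext.
def saveModel(model: MultilayerPerceptronClassificationModel): Unit = {
  // Serialize the model to a byte array with standard Java serialization.
  val baos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(baos)
  oos.writeObject(model)
  oos.flush()
  oos.close()
  // Store the bytes as a single row keyed by the model's uid.
  sc.parallelize(Seq((model.uid, "my-neural-network-model", baos.toByteArray)))
    .saveToCassandra("mykeyspace", "models", SomeColumns("uid", "name", "model"))
}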
and read the model back:
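import java.io.{ByteArrayInputStream, ObjectInputStream}
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext

def loadModel(): MultilayerPerceptronClassificationModel = {
  sc.cassandraTable("mykeyspace", "models").map { r =>
    // Copy only the readable bytes; array() can expose the whole backing buffer.
    val buf = r.getBytes("model")
    val bytes = new Array[Byte](buf.remaining())
    buf.get(bytes)
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject.asInstanceOf[MultilayerPerceptronClassificationModel]
    finally ois.close()
  }.first() // takes the first row; filter by uid if the table holds several models
}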


Recursive linq expressions to get non NULL parent value?

I wrote a simple recursive function to climb up the tree of a table that has ID and PARENTID columns.
But when I run it, I get this error:
System.InvalidOperationException: 'The instance of entity type 'InternalOrg' cannot be tracked because another instance with the same key value for {'Id'} is already being tracked. When attaching existing entities, ensure that only one entity instance with a given key value is attached.
Is there another way to do this, or can it be done in a single LINQ expression?
private InternalOrgDto GetInternalOrgDto(DepartmentChildDto dcDto)
{
    if (dcDto.InternalOrgId != null)
    {
        InternalOrg io = _internalOrgRepo.Get(Convert.ToInt32(dcDto.InternalOrgId));
        InternalOrgDto ioDto = new InternalOrgDto
        {
            Id = io.Id,
            Abbreviation = io.Abbreviation,
            Code = io.Code,
            Description = io.Description
        };
        return ioDto;
    }
    else
    {
        //manually get parent department
        Department parentDepartment = _departmentRepo.Get(Convert.ToInt32(dcDto.ParentDepartmentId));
        DepartmentChildDto parentDepartmenDto = ObjectMapper.Map<DepartmentChildDto>(parentDepartment);
        return GetInternalOrgDto(parentDepartmenDto);
    }
}
Is there a way to get a top-level parent from a given child via LINQ? Not that I am aware of. You can do it recursively, similar to what you have done, though I would recommend simplifying the query so that you avoid loading entire entities until you have what you want. I'm guessing from your code that only top-level parent departments would have an InternalOrg? Otherwise, this method recurses up the parents until it finds one. It could be sped up a bit like this:
private InternalOrgDto GetInternalOrgDto(DepartmentChildDto dcDto)
{
    // Resolve the ID first, walking up the parents only if needed.
    var internalOrgId = dcDto.InternalOrgId
        ?? FindInternalOrgId(dcDto.ParentDepartmentId)
        ?? throw new InternalOrgNotFoundException();
    // Project straight to the DTO so that no entity is loaded or tracked.
    InternalOrgDto ioDto = _context.InternalOrganizations
        .Where(x => x.Id == internalOrgId)
        .Select(x => new InternalOrgDto
        {
            Id = x.Id,
            Abbreviation = x.Abbreviation,
            Code = x.Code,
            Description = x.Description
        }).Single();
    return ioDto;
}
private int? FindInternalOrgId(int? departmentId)
{
    if (!departmentId.HasValue)
        return null;
    // Fetch only the two columns needed to decide whether to keep climbing.
    var details = _context.Departments
        .Where(x => x.DepartmentId == departmentId.Value)
        .Select(x => new
        {
            x.InternalOrgId,
            x.ParentDepartmentId
        }).Single();
    if (details.InternalOrgId.HasValue)
        return details.InternalOrgId;
    return FindInternalOrgId(details.ParentDepartmentId);
}
The key consideration here is to avoid repository methods that return entities or sets of entities, especially when you don't need everything about an entity. By leveraging the IQueryable that EF provides through LINQ, we can project down to just the data we need rather than returning every field. The database server can serve such queries better through indexing, which also helps avoid things like locks. If you are using repositories to enforce low-level domain rules or to enable unit testing, the repositories can expose IQueryable<TEntity> rather than IEnumerable<TEntity> or even TEntity, to enable projection and other EF LINQ goodness.
Another option, for hierarchical data where the relationships are important and you want to quickly find all entities related to a parent or get to a specific level, is to store a breadcrumb with each record and update it whenever that item is moved. The benefit is that these kinds of checks become trivial; the risk is that anything that can modify the relationships could leave the breadcrumb trail in an invalid state.
For example, if Department 22 belongs to Department 8, which belongs to Department 2, which is a top-level department, then 22's breadcrumb trail would be "2,8". If the breadcrumb is empty (and there is no parent ID), we have a top-level entity. We can parse the breadcrumb with a simple string.Split() operation, which avoids the recursive trips to the DB entirely. You may, however, want a maintenance job running behind the scenes that periodically inspects recently modified data to ensure the breadcrumb trails are accurate and alerts you if any are broken (by faulty code or otherwise).
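The breadcrumb lookup can be sketched in a few lines. This is a language-neutral illustration in Python rather than C#; the function names are hypothetical, and it assumes the trail is stored root first as a comma-separated string, as in the "2,8" example.

```python
from typing import List, Optional


def parse_breadcrumb(breadcrumb: str) -> List[int]:
    """Split a comma-separated trail such as "2,8" into ancestor IDs, root first."""
    return [int(part) for part in breadcrumb.split(",") if part.strip()]


def top_level_id(own_id: int, breadcrumb: str) -> int:
    """Return the top-level ancestor, or the record's own ID if the trail is empty."""
    ancestors = parse_breadcrumb(breadcrumb)
    return ancestors[0] if ancestors else own_id


def direct_parent_id(breadcrumb: str) -> Optional[int]:
    """Return the immediate parent (last entry), or None for a top-level record."""
    ancestors = parse_breadcrumb(breadcrumb)
    return ancestors[-1] if ancestors else None
```

With Department 22 and the trail "2,8", `top_level_id(22, "2,8")` gives 2 and `direct_parent_id("2,8")` gives 8, without any further database round trips.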

Keep an object as a value in Redis using its HashTable

I am new to Redis, and I'm trying to write a simple project that collects information from a SQL database and caches it in Redis. As I'm more comfortable with C#, I've chosen StackExchange.Redis to do that.
Let's say I have a table in my DB with a schema like this: Persons(ID, Name, Address, Age, BirthDate).
I have a Person class in my project with corresponding fields.
I also have a function GetPersonByID(ID) that queries Redis. If a key for the ID doesn't exist, it calls another function, GetPersonByID_SQL(ID), which executes a SQL query and creates a Person object from the result; the object is then added to Redis (as a hash) and returned. If the key does exist in Redis, the function reads the data from Redis, creates a Person object, maps the values to the corresponding fields, and returns that object.
Here is the code showing how I do this:
public static Person GetPersonByID(string ID)
{
    redis = ConnectionMultiplexer.Connect("127.0.0.1");
    IDatabase db = redis.GetDatabase();
    string key = ID; // the person's ID doubles as the Redis key
    Person p;
    if (!db.KeyExists(key))
    {
        p = Person.GetPersonByID_SQL(ID);
        db.HashSet(key, "Name", p.Name);
        db.HashSet(key, "Address", p.Address);
        db.HashSet(key, "Age", p.Age);
        db.HashSet(key, "BirthDate", p.BirthDate);
    }
    else
    {
        HashEntry[] values = db.HashGetAll(key);
        p = new Person();
        for (int i = 0; i < values.Length; i++)
        {
            HashEntry hashEntry = values[i];
            switch ((string)hashEntry.Name)
            {
                case "Name": p.Name = hashEntry.Value; break;
                case "Address": p.Address = hashEntry.Value; break;
                case "Age": p.Age = hashEntry.Value; break;
                case "BirthDate": p.BirthDate = hashEntry.Value; break;
            }
        }
    }
    return p;
}
My question is: is there any way to automatically bind the value stored in Redis (in my case, a hash) to my existing object?
No, this is not a feature of StackExchange.Redis: values can only be stored as the most basic types (strings and numbers). You're expected to do the conversions to more complex values yourself.
So you can either store your Person object as multiple hash fields (as you've done in your code), or store it as a single serialized string against a key (using the STRING data structure instead) and deserialize it when you retrieve the value.
If you always want to retrieve every value of your Person data structure, I would recommend the second option. You then need only a single command for each get/set:
// SET
_redis.StringSet(key, serializedPerson);
// GET
string serializedPerson = _redis.StringGet(key);
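The second option, combined with the "check the cache, fall back to SQL" flow from the question, can be sketched as follows. This is a language-neutral illustration in Python, not StackExchange.Redis code: a plain dict stands in for Redis string keys, JSON is assumed as the serialization format, and `load_from_sql` and the `person:` key prefix are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict


@dataclass
class Person:
    id: str
    name: str
    address: str
    age: int
    birth_date: str  # ISO date kept as text so it serializes to JSON directly


def get_person_by_id(
    person_id: str,
    cache: Dict[str, str],                   # stand-in for Redis string keys
    load_from_sql: Callable[[str], Person],  # hypothetical database loader
) -> Person:
    key = f"person:{person_id}"
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: one read, then deserialize the whole object.
        return Person(**json.loads(cached))
    # Cache miss: load from the database and store one serialized value.
    person = load_from_sql(person_id)
    cache[key] = json.dumps(asdict(person))
    return person
```

The trade-off against the hash layout is that individual fields can no longer be read or updated on their own; every access moves the whole object.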

Entity Framework Core - Upsert entities from other database encounters tracking problems

I have a flat file from a different database. I import it and map it to my application's entities. Because the flat file does not contain IDs, I cannot be sure that the entries I handle are not duplicates of what has already been added to my database earlier, or to my context during the current run.
The error message I get is:
The instance of entity type 'Car' cannot be tracked because another
instance with the same key value for {'Make', 'Model'} is already
being tracked. When attaching existing entities, ensure that only one
entity instance with a given key value is attached. Consider using
'DbContextOptionsBuilder.EnableSensitiveDataLogging' to see the
conflicting key values.
An example:
Data rows from flatfile
Volvo V70 Steve
Volvo V70 John
Having mapped these rows, I try to put them in the DB:
foreach (var row in flatFileRows)
{
    Car existingCar = null;
    if (dbContext.Cars.Any(c => c.Make == row.Make && c.Model == row.Model))
    {
        existingCar = dbContext.Cars
            .SingleOrDefault(c => c.Make == row.Make && c.Model == row.Model);
    }
    //I also do the same for existingDriver
    var car = existingCar != null
        ? existingCar
        : new Car()
        {
            Make = row.Make,
            Model = row.Model,
            Drivers = new List<Driver>()
        };
    var driver = new Driver()
    {
        CarId = existingCar != null ? existingCar.Id : 0,
        Name = row.Name
    };
    car.Drivers.Add(driver);
    dbContext.Cars.Update(car); //Second time we hit this the error is thrown
}
dbContext.SaveChanges();
Make and Model are set as the key in the schema because I don't want duplicate entries for the same car model.
The above example is simplified.
What I want is to check whether I have already put a car with these attributes in the DB, and then build on that entity according to my schema. I don't need to track any entries, disconnected or otherwise, because I just need to populate the database.
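The error message indicates that a second Car instance with the same (Make, Model) key reaches the context while the first is still tracked. One way to read the requirement is therefore "keep exactly one object per key during the import". The following is only a language-neutral sketch of that idea in Python, not EF Core code: a dict stands in for the set of cars already known, and `merge_rows` and the field names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]       # (make, model)
Row = Tuple[str, str, str]  # (make, model, driver name)


def merge_rows(rows: Iterable[Row], cars: Optional[Dict[Key, dict]] = None) -> Dict[Key, dict]:
    """Group drivers under a single car object per (make, model) key."""
    if cars is None:
        cars = {}  # in a real import this could be pre-filled from the database
    for make, model, driver in rows:
        # Reuse the object already seen for this key instead of creating a second one.
        car = cars.setdefault((make, model), {"make": make, "model": model, "drivers": []})
        car["drivers"].append(driver)
    return cars
```

For the two sample rows this yields one Volvo V70 entry with both Steve and John attached, so each car would be handed to the persistence layer only once.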

findByPropertyAndRelation not giving me the expected entity

I'm importing historical football (or soccer, if you're from the US) data into a Neo4j database using a Spring Boot application (2.1.6.RELEASE) with the spring-boot-starter-data-neo4j dependency and a standalone, locally running Neo4j 3.5.6 database server.
But for some reason, searching for an entity by a simple property and an attached, referenced entity does not work, although the relationship is present in the database.
This is the part of the model that is currently giving me a headache:
@NodeEntity(label = "Season")
open class Season(
    @Id
    @GeneratedValue
    var id: Long? = null,
    @Index(unique = true)
    var name: String,
    var seasonNumber: Long,
    @Relationship(type = "IN_LEAGUE", direction = Relationship.OUTGOING)
    var league: League?,
    var start: LocalDate,
    var end: LocalDate
)
@NodeEntity(label = "League")
open class League(
    @Id
    @GeneratedValue
    var id: Long? = null,
    @Index(unique = true)
    var name: String,
    @Relationship(type = "BELONGS_TO", direction = Relationship.OUTGOING)
    var country: Country?
)
(I left out the Country class, as I'm pretty sure that it is not part of the problem)
To allow running the import more than once, I want to check whether the corresponding entity is already present in the database and only import newer ones. So I added the following method to SeasonRepository:
interface SeasonRepository : CrudRepository<Season, Long> {
    fun findBySeasonNumberAndLeague(number: Long, league: League): Season?
}
But it is giving me a null result instead of the existing entity on consecutive runs, hence I get duplicates in my database.
I would have expected spring-data-neo4j to reduce the passed League to its Id and then have a generated query that looks somewhat like this:
MATCH (s:Season)-[:IN_LEAGUE]->(l:League) WHERE id(l) = {leagueId} AND s.seasonNumber = {seasonNumber} WITH s MATCH (s)-[r]->(o) RETURN s,r,o
but when I turn on finer logging on the neo4j package I see this output in the log file:
MATCH (n:`Season`) WHERE n.`seasonNumber` = { `seasonNumber_0` } AND n.`league` = { `league_1` } WITH n RETURN n,[ [ (n)-[r_i1:`IN_LEAGUE`]->(l1:`League`) | [ r_i1, l1 ] ] ], ID(n) with params {league_1={id=30228, name=1. Bundesliga, country={id=29773, name=Deutschland}}, seasonNumber_0=1}
So for some reason, Spring Data seems to think that the league property is a simple/primitive property and not a full relationship that needs to be resolved by its ID (n.league = {league_1}).
I only got it to work by passing the ID of the league and providing a custom query using the @Query annotation, but I thought it would work with spring-data-neo4j out of the box.
Any help appreciated. Let me know if you need more details.
Spring Data Neo4j does not support objects as parameters at the moment. It is possible to query for properties on related entities/nodes e.g. findBySeasonNumberAndLeagueName if this is a suitable solution.

Better way to handle this code

I am working on an MVC3 application with NHibernate and SQL Server. I have written an ordinary, reusable method. Please see the code below and let me know a better way to handle it. I have observed that this piece of code takes a long time to execute.
private void GetParentCompany(IEnumerable<Company> companiesList)
{
    foreach (var company in companiesList)
    {
        long? dunsUltimateParent = company.DUNSUltimateParent;
        Company ultimateParent = _companyService.GetCompanyByDUNS(Convert.ToInt64(dunsUltimateParent));
        if (ultimateParent != null)
        {
            company.UltimateParentName = ultimateParent.CompanyName;
            company.UltimateCompanyId = ultimateParent.CompanyId;
            company.UltimateParentDuns = ultimateParent.DUNS;
        }
    }
}
Adding an index to your company.DUNS column might help. However, consider introducing a many-to-one relationship from company to (parent) company.
Add an UltimateParent property of type Company to the Company class. The fields UltimateParentName and UltimateParentDuns would then be redundant, and you could simply read company.UltimateParent.CompanyName, for example. The mapping of UltimateParent can be done using 'References' in Fluent NHibernate:
References(x => x.UltimateParent);
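Separately from the mapping change, note that the loop in the question makes one GetCompanyByDUNS call per company. A different technique, not part of the answer above, is to collect the distinct parent DUNS numbers and resolve them in a single bulk lookup. The following is a language-neutral sketch in Python rather than NHibernate code; `fetch_by_duns` and the dictionary field names are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Set


def attach_ultimate_parents(
    companies: List[dict],
    fetch_by_duns: Callable[[Set[int]], Dict[int, dict]],  # hypothetical bulk loader
) -> None:
    """Fill in parent fields with one bulk lookup instead of one lookup per company."""
    wanted = {
        c["duns_ultimate_parent"]
        for c in companies
        if c.get("duns_ultimate_parent") is not None
    }
    # One round trip for all distinct parents; skipped when nothing is wanted.
    parents = fetch_by_duns(wanted) if wanted else {}
    for c in companies:
        parent = parents.get(c.get("duns_ultimate_parent"))
        if parent is not None:
            c["ultimate_parent_name"] = parent["company_name"]
            c["ultimate_company_id"] = parent["company_id"]
            c["ultimate_parent_duns"] = parent["duns"]
```

Whether this helps in practice depends on how many companies share a parent and on how the bulk query is written, so it is worth measuring alongside the index suggestion.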
