Apache Mahout - Read preference value from String - hadoop

I'm in a situation where I have a dataset that consists of the classic UserID, ItemID and preference values; however, they are all strings.
I have managed to read the UserID and ItemID strings by overriding the readItemIDFromString() and readUserIDFromString() methods in the FileDataModel class (which is part of the Mahout library). However, there doesn't seem to be any support for converting preference values, if I am not mistaken.
If anyone has some input to what an approach to this problem could be I would greatly appreciate it.
To illustrate what I mean, here is an example of my UserID string "Conversion":
@Override
protected long readUserIDFromString(String value) {
    if (memIdMigtr == null) {
        memIdMigtr = new ItemMemIDMigrator();
    }
    long retValue = memIdMigtr.toLongID(value);
    if (null == memIdMigtr.toStringID(retValue)) {
        try {
            memIdMigtr.singleInit(value);
        } catch (TasteException e) {
            e.printStackTrace();
        }
    }
    return retValue;
}

String getUserIDAsString(long userId) {
    return memIdMigtr.toStringID(userId);
}
And the implementation of the AbstractIDMigrator:
public class ItemMemIDMigrator extends AbstractIDMigrator {

    private FastByIDMap<String> longToString;

    public ItemMemIDMigrator() {
        this.longToString = new FastByIDMap<String>(10000);
    }

    public void storeMapping(long longID, String stringID) {
        longToString.put(longID, stringID);
    }

    public void singleInit(String stringID) throws TasteException {
        storeMapping(toLongID(stringID), stringID);
    }

    public String toStringID(long longID) {
        return longToString.get(longID);
    }
}

Mahout is deprecating the old recommenders based on Hadoop. We have a much more modern offering based on a new algorithm called Correlated Cross-Occurrence (CCO). It is built on Spark for 10x greater speed and gives real-time query results when combined with a query server.
This method ingests strings for user-id and item-id and produces results with the same ids, so you don't need to manage those anymore. You really should have a look at the new system; I'm not sure how long the old one will be supported.
Mahout docs here: http://mahout.apache.org/users/algorithms/recommender-overview.html and here: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
The entire system described (SDK, input storage, model training and real-time queries) is part of the Apache PredictionIO project; docs for PredictionIO and the "Universal Recommender" are here: http://predictionio.incubator.apache.org/ and here: http://actionml.com/docs/ur

Related

SpringBatch write to different entities

I have processed a "wrapperObject" (AimResponse in this case).
Depending on the property "type", I map to a Document or a SourceSpace object.
Then I need to persist these entities. I found an example similar to this one:
@Override
public void write(List<? extends List<AimResponse>> list) throws Exception {
    List<SourceSpace> sourceSpaces = new ArrayList<>();
    List<Document> documents = new ArrayList<>();
    for (List<AimResponse> item : list) {
        for (AimResponse i : item) {
            if (i.getType().indexOf("folder") >= 0) {
                SourceSpace sourceSpace = Mapper.aimResponseToSourceSpace(i);
                sourceSpace.setStatus(Status.FOUND.name());
                sourceSpaces.add(sourceSpace);
            } else if (i.getType().indexOf("document") >= 0) {
                Document document = Mapper.aimResponseToDocument(i);
                document.setStatus(Status.FOUND.name());
                documents.add(document);
            }
        }
    }
    if (!CollectionUtils.isEmpty(sourceSpaces)) {
        sourceSpaceWriter.write(sourceSpaces);
    }
    if (!CollectionUtils.isEmpty(documents)) {
        documentWriter.write(documents);
    }
}
In this example I'm not able to instantiate JdbcBatchItemWriter, but in any case I think it would be better if the processor could split the items into two different lists and call two different writers, each with its own type. I guess that's not possible.
Any help is appreciated.
ClassifierCompositeItemWriter is what you are looking for. It allows you to classify items according to a given criteria and call the corresponding writer.
In your case, you can classify items based on their type (i.getType()) and use a writer for each type. You can find an example of how to use that writer here.
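For illustration, a minimal sketch of how such a composite writer could be wired (not the linked example; the folderWriter and documentWriter delegates here are hypothetical ItemWriter<AimResponse> beans that would do the mapping to SourceSpace/Document themselves):
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.ClassifierCompositeItemWriter;
import org.springframework.classify.Classifier;

public ClassifierCompositeItemWriter<AimResponse> compositeWriter(
        final ItemWriter<AimResponse> folderWriter,
        final ItemWriter<AimResponse> documentWriter) {
    ClassifierCompositeItemWriter<AimResponse> writer = new ClassifierCompositeItemWriter<>();
    writer.setClassifier(new Classifier<AimResponse, ItemWriter<? super AimResponse>>() {
        @Override
        public ItemWriter<? super AimResponse> classify(AimResponse item) {
            // Route "folder" items to one delegate, everything else to the other.
            return item.getType().contains("folder") ? folderWriter : documentWriter;
        }
    });
    return writer;
}
Note that if the delegates are also ItemStreams (e.g. file-based writers), they need to be registered as streams on the step so they are opened and closed properly.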

How to retrieve data by property in Couchbase Lite?

My documents have a property docType that separates them based on their purpose, in this specific case template or audit. However, when I do the following:
document.getProperty("docType").equals("template");
document.getProperty("docType").equals("audit");
The results are always the same: every time it returns all stored documents, without filtering them by docType.
Below, you can check the query function.
public static Query getData(Database database, final String type) {
    View view = database.getView("data");
    if (view.getMap() == null) {
        view.setMap(new Mapper() {
            @Override
            public void map(Map<String, Object> document, Emitter emitter) {
                if (String.valueOf(document.get("docType")).equals(type)) {
                    emitter.emit(document.get("_id"), null);
                }
            }
        }, "4");
    }
    return view.createQuery();
}
Any hint?
This is not a valid way to do it. Your view function must be pure (it cannot reference external state such as "type"). Once that is created you can then query it for what you want by setting start and end keys, or just a set of keys in general to filter on.
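For illustration, a rough sketch along those lines (Couchbase Lite 1.x Java API; the view name "byDocType" is made up): the map function emits docType as the key with no captured state, and the filtering happens at query time via setKeys.
import com.couchbase.lite.Database;
import com.couchbase.lite.Emitter;
import com.couchbase.lite.Mapper;
import com.couchbase.lite.Query;
import com.couchbase.lite.View;
import java.util.Arrays;
import java.util.Map;

public static Query getDataByType(Database database, String type) {
    View view = database.getView("byDocType");
    if (view.getMap() == null) {
        // Pure map function: no external state, just emit the docType as the key.
        view.setMap(new Mapper() {
            @Override
            public void map(Map<String, Object> document, Emitter emitter) {
                Object docType = document.get("docType");
                if (docType != null) {
                    emitter.emit(docType, null);
                }
            }
        }, "1");
    }
    Query query = view.createQuery();
    // Filter at query time instead of baking the type into the map function.
    query.setKeys(Arrays.<Object>asList(type));
    return query;
}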

Querying single database row using rxjava2

I am using rxjava2 for the first time on an Android project, and am doing SQL queries on a background thread.
However, I am having trouble figuring out the best way to do a simple SQL query and handle the case where the record may or may not exist. Here is the code I am using:
public Observable<Record> createRecordObservable(int id) {
    Callable<Record> callback = new Callable<Record>() {
        @Override
        public Record call() throws Exception {
            // do the actual sql stuff, e.g.
            // select * from Record where id = ?
            return record;
        }
    };
    return Observable.fromCallable(callback).subscribeOn(Schedulers.computation());
}
This works well when there is a record present. But in the case of a non-existent record matching the id, it treats it like an error. Apparently this is because rxjava2 doesn't allow the Callable to return a null.
Obviously I don't really want this. An error should only occur if the database failed or something, whereas an empty result is perfectly valid. I read somewhere that one possible solution is wrapping Record in a Java 8 Optional, but my project is not on Java 8, and anyway that solution seems a bit ugly.
This is surely such a common, everyday task that I'm sure there must be a simple and easy solution, but I couldn't find one so far. What is the recommended pattern to use here?
Your use case seems appropriate for RxJava2's new Observable type Maybe, which emits 1 or 0 items.
Maybe.fromCallable will treat a returned null as no item emitted.
You can see this discussion regarding nulls with RxJava2; I guess there aren't many choices besides using something like Optional in the other cases where you do need null/empty values.
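As a tiny standalone illustration of that null handling (RxJava 2; lambdas are used here only for brevity, they are not required):
import io.reactivex.Maybe;

public class MaybeNullDemo {
    public static void main(String[] args) {
        // The Callable returns null, which Maybe.fromCallable maps to an empty Maybe:
        // only onComplete fires; onSuccess and onError are never called.
        Maybe.fromCallable(() -> null)
             .subscribe(
                 item -> System.out.println("onSuccess: " + item),
                 error -> System.out.println("onError: " + error),
                 () -> System.out.println("onComplete (empty result)"));
    }
}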
Thanks to @yosriz, I have it working with Maybe. Since I can't put code in comments, I'll post a complete answer here:
Instead of Observable, use Maybe like this:
public Maybe<Record> lookupRecord(int id) {
    Callable<Record> callback = new Callable<Record>() {
        @Override
        public Record call() throws Exception {
            // do the actual sql stuff, e.g.
            // select * from Record where id = ?
            return record;
        }
    };
    return Maybe.fromCallable(callback).subscribeOn(Schedulers.computation());
}
The good thing is the returned record is allowed to be null. To detect which situation occurred in the subscriber, the code is like this:
lookupRecord(id)
    .observeOn(AndroidSchedulers.mainThread())
    .subscribe(new Consumer<Record>() {
        @Override
        public void accept(Record r) {
            // record was loaded OK
        }
    }, new Consumer<Throwable>() {
        @Override
        public void accept(Throwable throwable) {
            // there was an error
        }
    }, new Action() {
        @Override
        public void run() {
            // there was an empty result
        }
    });

Using org.xmlunit.diff.NodeFilters in XMLUnit DiffBuilder

I am using XMLUnit in JUnit to compare the results of tests. The problem is that one element in my XML gets the CURRENT TIMESTAMP as the tests run, so when it is compared with the expected output the results never match.
To overcome this, I read about using org.xmlunit.diff.NodeFilters, but I could not find any examples of how to implement it. The code snippet I have is as below:
final org.xmlunit.diff.Diff documentDiff = DiffBuilder
        .compare(sourcExp)
        .withTest(sourceActual)
        .ignoreComments()
        .ignoreWhitespace()
        //.withNodeFilter(Node.ELEMENT_NODE)
        .build();
return documentDiff.hasDifferences();
My problem is, how do I implement the NodeFilter? What parameter should be passed, and how should it be passed? There are no samples on this. The withNodeFilter method takes a Predicate<Node> as the IN parameter. What does Predicate<Node> mean?
Predicate is a functional interface with a single test method that, in the case of the NodeFilter, receives a DOM Node as argument and returns a boolean (javadoc of Predicate).
An implementation of Predicate<Node> can be used to filter nodes for the difference engine, and only those Nodes for which the Predicate returns true will be compared (javadoc of setNodeFilter, User-Guide).
Assuming your element containing the timestamp was called timestamp you'd use something like
.withNodeFilter(new Predicate<Node>() {
    @Override
    public boolean test(Node n) {
        return !(n instanceof Element &&
                 "timestamp".equals(Nodes.getQName(n).getLocalPart()));
    }
})
or using lambdas
.withNodeFilter(n -> !(n instanceof Element &&
        "timestamp".equals(Nodes.getQName(n).getLocalPart())))
This uses XMLUnit's org.xmlunit.util.Nodes to get the element name more easily.
The below code worked for me (note that this uses the legacy XMLUnit 1.x DifferenceListener API rather than DiffBuilder):
public final class IgnoreNamedElementsDifferenceListener implements DifferenceListener {

    private Set<String> blackList = new HashSet<String>();

    public IgnoreNamedElementsDifferenceListener(String... elementNames) {
        for (String name : elementNames) {
            blackList.add(name);
        }
    }

    public int differenceFound(Difference difference) {
        if (difference.getId() == DifferenceConstants.TEXT_VALUE_ID) {
            if (blackList.contains(difference.getControlNodeDetail().getNode()
                    .getParentNode().getNodeName())) {
                return DifferenceListener.RETURN_IGNORE_DIFFERENCE_NODES_IDENTICAL;
            }
        }
        return DifferenceListener.RETURN_ACCEPT_DIFFERENCE;
    }

    public void skippedComparison(Node node, Node node1) {
    }
}

Using eager loading with specification pattern

I've implemented the specification pattern with Linq as outlined here https://www.packtpub.com/article/nhibernate-3-using-linq-specifications-data-access-layer
I now want to add the ability to eager load and am unsure about the best way to go about it.
The generic repository class in the linked example:
public IEnumerable<T> FindAll(Specification<T> specification)
{
    var query = GetQuery(specification);
    return Transact(() => query.ToList());
}

public T FindOne(Specification<T> specification)
{
    var query = GetQuery(specification);
    return Transact(() => query.SingleOrDefault());
}

private IQueryable<T> GetQuery(Specification<T> specification)
{
    return session.Query<T>()
        .Where(specification.IsSatisfiedBy());
}
And the specification implementation:
public class MoviesDirectedBy : Specification<Movie>
{
    private readonly string _director;

    public MoviesDirectedBy(string director)
    {
        _director = director;
    }

    public override Expression<Func<Movie, bool>> IsSatisfiedBy()
    {
        return m => m.Director == _director;
    }
}
This is working well; I now want to add the ability to eager load. I understand NHibernate eager loading can be done by using Fetch on the query.
What I am looking for is whether to encapsulate the eager loading logic within the specification or to pass it into the repository, and also the Linq/expression tree syntax required to achieve this (i.e. an example of how it would be done).
A possible solution would be to extend the Specification class to add:
public virtual IEnumerable<Expression<Func<T, object>>> FetchRelated
{
    get
    {
        return Enumerable.Empty<Expression<Func<T, object>>>();
    }
}
And change GetQuery to something like:
return specification.FetchRelated.Aggregate(
    session.Query<T>().Where(specification.IsSatisfiedBy()),
    (current, related) => current.Fetch(related));
Now all you have to do is override FetchRelated when needed
public override IEnumerable<Expression<Func<Movie, object>>> FetchRelated
{
    get
    {
        return new Expression<Func<Movie, object>>[]
        {
            m => m.RelatedEntity1,
            m => m.RelatedEntity2
        };
    }
}
An important limitation of this implementation I just wrote is that you can only fetch entities that are directly related to the root entity.
An improvement would be to support arbitrary levels (using ThenFetch), which would require some changes in the way we work with generics (I used object to allow combining different entity types easily).
You wouldn't want to put the Fetch() call into the specification, because it's not needed. Specification is just for limiting the data that can then be shared across many different parts of your code, but those other parts could have drastically different needs in what data they want to present to the user, which is why at those points you would add your Fetch statements.
