I have a fairly simple Core Data model:
There are a number of "Trips" that can contain multiple images. Additionally, a single image can be in multiple trips.
My issue is that I want to maintain the order in which images were added to a Trip. Normally, I would add a field to the ImageData object that maintains its index. However, because an Image can be in any number of Trips, this doesn't make sense.
What is a clean way to maintain an index in this case?
You seek the ordered relationship checkbox in Xcode 4.
There is a performance cost to maintaining the order of the relationship, so if your dataset is large (>10k objects) you might want to consider the traditional sort-attribute + sort-descriptor approach.
For your specific case you would probably end up with an intermediate join entity between Image and Trip, i.e. ImageTripOrder, which has a one-to-one with Trip as well as a one-to-one with Image.
The old way of handling this is to provide a linking entity that encodes the order.
Trip{
//...various attributes
images<-->>TripToImage.trip
}
Image{
//...various attributes
trips<-->>TripToImage.image
}
TripToImage{
trip<<-->Trip.images
image<<-->Image.trips
previous<-->TripToImage.next
next<-->TripToImage.previous
}
The linking entity extends the modeling of the relationship. You can create an arbitrary order of trips or images.
However, such ordering can usually be rendered superfluous by better model design. In this case, if images have specific dates and locations, how can they be part of more than one trip? Are you modeling a time-traveling agency? ;-)
Core Data should simulate/model the real-world objects, events and conditions that your app deals with. Real-world images and trips have attributes that uniquely describe them in space and time. If your entities accurately capture all that data, then ordering them based on the attributes becomes trivial.
Usually, you only need to use a linking entity when you have to model an arbitrary order such as a user's favorite ranking or other subjective ordering.
Update:
It takes images of places you'd like to see and put them in a list
for a trip. I.e. not images that you have personally taken.
Your current model is cumbersome because you're not modeling your problem closely enough.
You're really creating an itinerary, which is a series of locations and times, but you're trying to model that information indirectly by smooshing the Trip and Image objects together in an elaborate relationship.
Instead, you need an entity that models an itinerary specifically.
ItineraryStop{
name:string
arrivalTime:date
departureTime:date
location<<-->Location.stops
image<<-->Image.stop
trip<<-->Trip.stop
previousStop<-->ItineraryStop.nextStop
nextStop<-->ItineraryStop.previousStop
}
Now everything just falls into place. To get all your stops in order for a particular trip, you can fetch on the trip and sort on arrivalTime, you can sort any particular Trip object's stops, or, if there are no dates, you can walk the previousStop/nextStop relationships.
In any case, your model now more closely resembles the real-world objects, events or conditions that your app deals with.
I was wondering if it is possible to have more than one time series identifier column in the model? Let's assume I'd like to create a forecast at a product and store level (which the documentation suggests should be possible).
If I select product as the series identifier, the only options I have left for store is either a covariate or an attribute and neither is applicable in this scenario.
Would concatenating product and store and using the individual product and store code values for that concatenated ID as attributes be a solution? It doesn't feel right, but I can't see any other option - am I missing something?
Note: I understand that this feature of Vertex AI is currently in preview and that because of that the options may be limited.
There isn't an alternative way to assign 2 or more Time Series Identifiers in the Forecasting Model on Vertex AI. The Forecasting model is in the Preview product launch stage, as you are aware, and as a consequence the options are limited. Please refer to this doc for more information about the best practices for data preparation to train the forecasting model.
As a workaround, the two columns can be concatenated and assigned a Time Series Identifier on that concatenated column, as you have mentioned in the question. This way, the concatenated column carries more contextual information into the training of the model.
Just to follow up on Vishal's (correct) answer in case someone is looking this up in the future.
Yes, concatenating is the only option for now as there can only be one time series identifier (I would hope this changes in the future). Having said that, I've experimented with adding the individual identifiers in the data as categorical attributes and it actually works pretty well. This way I have forecasts generated at a product/store level, but I can aggregate all forecasts for individual products, and the results are not far off from the models trained on aggregated data (obviously that would depend on the demand classification and the selected optimisation method, amongst other factors).
Also, an interesting observation: when you include things like product descriptions, you can classify them either as categorical or text. I wasn't able to find in the documentation whether the model would only use unigrams (which is what the column statistics in the console would suggest) or a number of n-grams, but it is definitely something you would want to experiment with on your data. My dataset actually showed better accuracy when the categorical classification was used, which is a bit counter-intuitive as it feels like redundant information, although it's hard to tell as the documentation isn't very detailed. It is likely to be specific to my data set, so as I said, make sure you experiment with yours.
I am new to Hadoop, MapReduce, and Big Data and am trying to evaluate its viability for a specific use case that is extremely interesting to the project that I am working on. I am not sure, however, if what I would like to accomplish is A) possible or B) recommended with the MapReduce model.
We essentially have a significant volume of widgets (known structure of data) and pricing models (codified in JAR files) and what we want to be able to do is to execute every combination of widget and pricing model to determine the outcomes of the pricing across the permutations of the models. The pricing models themselves will examine each widget and determine pricing based on the decision tree within the model.
This makes sense to me from a parallel-processing-on-commodity-infrastructure perspective, but from a technical perspective I do not know whether it's possible to execute external models within the MR jobs, and from a practical perspective whether or not I am trying to force a use case into the technology.
The question therefore becomes: is it possible; does it make sense to implement in this fashion; and if not, what other options / patterns are more suited to this scenario?
EDIT
The volume and variety will grow over time. Assume for the sake of discussion here that we have a terabyte of widgets and tens of pricing models currently. We would then expect to grow into multiple terabytes and hundreds of pricing models, and the execution of the permutations would happen frequently as widgets change and/or are added and as new categories of pricing models are introduced.
You certainly need a scalable, parallelizable solution, and Hadoop can be that. You just have to massage your solution a bit so it fits into the Hadoop world.
First, you'll need to make the models and widgets implement common interfaces (speaking very abstractly here) so that you can apply an arbitrary model to an arbitrary widget without having to know anything about the actual implementation or representation.
Second, you'll have to be able to reference both models and widgets by id. That will let you build objects (Writables) that hold the id of a model and the id of a widget and would thus represent one "cell" in the cross product of widgets and models. You distribute these instances across multiple servers and, in doing so, distribute the application of models to widgets across those servers. These objects (call the class ModelApply) would hold the results of a specific model-to-widget application and can be processed in the usual way with Hadoop to report on the best applications.
Third, and this is the tricky part, you need to compute the actual cross product of models to widgets. You say the number of models (and therefore model ids) will number at most in the hundreds. This means that you could load that list of ids into memory in a mapper and map that list to widget ids. Each call to the mapper's map() method would pass in a widget id and would write out one instance of ModelApply for each model. A rough sketch of these ideas follows.
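To make the "common interface" and "cross product cell" ideas concrete, here is a rough sketch in plain C++ (the actual Hadoop job would use Java Writables; the names PricingModel, ModelApply and applyAllModels are made up purely for illustration):

#include <cstdint>
#include <map>
#include <vector>

// Known structure of a widget; the fields are placeholders.
struct Widget {
    std::int64_t id;
    // ... pricing-relevant attributes ...
};

// Common interface every codified pricing model implements,
// so any model can be applied to any widget.
class PricingModel {
public:
    virtual ~PricingModel() = default;
    virtual double price(const Widget& w) const = 0;
};

// One "cell" of the cross product: which model was applied to
// which widget, and what came out. In Hadoop this would be a Writable.
struct ModelApply {
    std::int64_t modelId;
    std::int64_t widgetId;
    double       result;
};

// The mapper-side idea: the (small) set of models is held in memory,
// and each incoming widget is expanded into one ModelApply per model.
std::vector<ModelApply> applyAllModels(
        const Widget& w,
        const std::map<std::int64_t, const PricingModel*>& models) {
    std::vector<ModelApply> out;
    for (const auto& [modelId, model] : models) {
        out.push_back({modelId, w.id, model->price(w)});
    }
    return out;
}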
I'll leave it at that for now.
The data structures that we use in applications often contain a great
deal of information of various types, and certain pieces of
information may belong to multiple independent data structures. For
example, a file of personnel data may contain records with names,
addresses, and various other pieces of information about employees;
and each record may need to belong to one data structure for searching
for particular employees, to another data structure for answering
statistical queries, and so forth.
Despite this diversity and complexity, a large class of computing
applications involve generic manipulation of data objects, and need
access to the information associated with them for a limited number of
specific reasons. Many of the manipulations that are required are a
natural outgrowth of basic computational procedures, so they are
needed in a broad variety of applications.
The text above is from Robert Sedgewick's Algorithms in C++, in the context of abstract data types.
My question is: what does the author mean by the first paragraph of the text above?
Data structures are combinations of data storage and algorithms that work on those organisations of data to provide implementations of certain operations (searching, indexing, sorting, updating, adding, etc) with particular constraints. These are the building blocks (in a black box sense) of information representation in software. At the most basic level, these are things like queues, stacks, lists, hash maps/associative containers, heaps, trees etc.
Different data structures have different tradeoffs. You have to use the right one in the right situation. This is key.
In this light, you can use multiple (or "compound") data structures in parallel that allow different ways of querying and operating on the same logical data, so that they compensate for each other's tradeoffs (strengths/weaknesses: e.g. one might be presorted, another might be good at tracking changes but be more costly to delete entries from, etc.), usually at the cost of some extra overhead, since those data structures will need to be kept synchronised with each other.
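A contrived C++ sketch of that idea (not from the book): the same employee records kept in one container that preserves insertion order, plus a second structure for fast lookup by name; every insert has to update both, which is the synchronisation overhead mentioned above.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Employee {
    std::string name;
    std::string address;
    double      salary;
};

// Primary storage: preserves the order records were added.
std::vector<Employee> records;

// Secondary structure: fast lookup by name, pointing into the primary storage.
std::multimap<std::string, std::size_t> byName;

void addEmployee(const Employee& e) {
    records.push_back(e);                         // update structure 1
    byName.insert({e.name, records.size() - 1});  // keep structure 2 in sync
}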
It would help if one knew what the conclusion of all this is, but from what I gather:
Employee record:
Name Address Phone-Number Salary Bank-Account Department Superior
As you can see, the employee database has information for each employee that by itself is "subdivided" into chunks of more-or-less independent pieces: the contact information for an employee has little or nothing to do with the department he works in, or the salary he gets.
EDIT: As such, depending on what kind of stuff needs to be done, different parts of this larger record need to be looked at, possibly in different fashion. If you want to know how much salary you're paying in total you'll need to do different things than for looking up the phone number of an employee.
An object may be a part of another object/structure, and the association is not unique; one object may be a part of multiple different structures, depending on context.
Say, there's a corporate employee John. His "employee record" will appear in the list of members of his team, in the salaries list, in the security clearances list, parking places assignment etc.
Not all data contained within his "employee record" will be needed in all of these contexts. His expertise fields are not necessary for parking place allotment, and his marital status should not play a role in meeting room assignment - separate subsystems and larger structures his entry is a part of don't require all of the data contained within his entry, just specific parts of it.
I am looking for a way to search in an efficient way for data in a huge multi-dimensional matrix.
My application contains data that is characterized by multiple dimensions. Imagine keeping data about all sales in a company (my application is totally different, but this is just to demonstrate the problem). Every sale is characterized by:
the product that is being sold
the customer that bought the product
the day on which it has been sold
the employee that sold the product
the payment method
the quantity sold
I have millions of sales, done on thousands of products, by hundreds of employees, on lots of days.
I need a fast way to calculate e.g.:
the total quantity sold by an employee on a certain day
the total quantity bought by a customer
the total quantity of a product paid by credit card
...
I need to store the data in the most detailed way, and I could use a map where the key is the combination of all dimensions, like this:
#include <map>
// Domain types defined elsewhere in the application.
class Product; class Customer; class Day; class Employee; class Payment;
using quantity = long long;  // whatever numeric type the quantities use

struct Combination
{
    Product  *product;
    Customer *customer;
    Day      *day;
    Employee *employee;
    Payment  *payment;
    // std::map needs a strict weak ordering on its key type.
    bool operator<(const Combination &other) const;
};

std::map<Combination, quantity> data;
But since I don't know beforehand which queries are performed, I need multiple combination classes (where the data members are in different order) or maps with different comparison functions (using a different sequence to sort on).
Possibly, the problem could be simplified by giving each product, customer, ... a number instead of a pointer to it, but even then I end up with lots of memory.
Are there any data structures that could help in handling this kind of efficient searches?
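For example, one of those extra maps could be a secondary index keyed on (employee, day), which would answer "the total quantity sold by an employee on a certain day" directly. A rough sketch, using made-up numeric ids as suggested above:

#include <cstdint>
#include <map>
#include <utility>

// Numeric ids instead of pointers, as suggested above.
using EmployeeId = std::int32_t;
using DayId      = std::int32_t;
using Quantity   = std::int64_t;

// Secondary index: total quantity per (employee, day).
std::map<std::pair<EmployeeId, DayId>, Quantity> byEmployeeAndDay;

// Updated alongside the detailed map whenever a sale is recorded.
void recordSale(EmployeeId e, DayId d, Quantity q) {
    byEmployeeAndDay[{e, d}] += q;
}

// "Total quantity sold by an employee on a certain day" is then a single lookup.
Quantity soldBy(EmployeeId e, DayId d) {
    auto it = byEmployeeAndDay.find({e, d});
    return it == byEmployeeAndDay.end() ? 0 : it->second;
}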
EDIT:
Just to clarify some things: On disk my data is stored in a database, so I'm not looking for ways to change this.
The problem is that to perform my complex mathematical calculations, I have all this data in memory, and I need an efficient way to search this data in memory.
Could an in-memory database help? Maybe, but I fear that an in-memory database might have a serious impact on memory consumption and on performance, so I'm looking for better alternatives.
EDIT (2):
Some more clarifications: my application will perform simulations on the data, and in the end the user is free to save this data or not into my database. So the data itself changes the whole time. While performing these simulations, and the data changes, I need to query the data as explained before.
So again, simply querying the database is not an option. I really need (complex?) in-memory data structures.
EDIT: to replace earlier answer.
Can you imagine any other possible choice besides running qsort() on that giant array of structs? There's just no other way that I can see. Maybe you can sort it just once at time zero and keep it sorted as you do dynamic insertions/deletions of entries, as sketched below.
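In C++ terms, "sort it once and keep it sorted" could simply mean swapping the qsort()'d array for an ordered container whose comparator encodes the ordering you need. A sketch with a simplified record (not the asker's exact Combination):

#include <cstdint>
#include <set>

struct Sale {
    std::int32_t employeeId;
    std::int32_t dayId;
    std::int64_t quantity;
};

// The comparator encodes the one ordering you care about.
struct ByEmployeeThenDay {
    bool operator()(const Sale& a, const Sale& b) const {
        if (a.employeeId != b.employeeId) return a.employeeId < b.employeeId;
        return a.dayId < b.dayId;
    }
};

// Stays sorted through dynamic insertions and deletions.
std::multiset<Sale, ByEmployeeThenDay> sales;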
Using a database (in-memory or not) to work with your data seems like the right way to do this.
If you don't want to do that, you don't have to implement lots of combination classes; just use a collection that can hold any of the objects.
I've been working on a project to data-mine a large amount of short texts and categorize these based on a pre-existing large list of category names. To do this I had to figure out how to first create a good text corpus from the data in order to have reference documents for the categorization and then to get the quality of the categorization up to an acceptable level. This part I am finished with (luckily categorizing text is something that a lot of people have done a lot of research into).
Now my next problem, I'm trying to figure out a good way of linking the various categories to each other computationally. That is to say, to figure out how to recognize that "cars" and "chevrolet" are related in some way. So far I've tried utilizing the N-Gram categorization methods described by, among others, Cavnar and Trenkle for comparing the various reference documents I've created for each category. Unfortunately it seems the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and those are the best relations, overall it's around 30-35% which is miserably low.
I've tried a couple of other approaches as well but I've been unable to get much higher than 40% relevant links (an example of a non-relevant relation would be the category "trucks" being strongly related to the category "makeup" or the category "diapers" while weakly (or not at all) related to "chevy").
Now, I've tried looking for better methods for doing this but it just seems like I can't find any (yet I know others have done better than I have). Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either don't give enough relations at all or contain way too high a percentage of junk relations.
Obviously, the best way of doing that matching is highly dependent on your taxonomy, the nature of your "reference documents", and the expected relationships you'd like created.
However, based on the information provided, I'd suggest the following:
Start by building a word-based (rather than letter-based) unigram or bigram model for each of your categories, based on the reference documents. If there are only a few of these for each category (it seems you might have only one), you could use a semi-supervised approach and also throw in the automatically categorized documents for each category. A relatively simple tool for building the model might be the CMU SLM toolkit.
Calculate the mutual information (infogain) of each term or phrase in your model with relation to the other categories. If your categories are similar, you might need to use only neighboring categories to get a meaningful result. This step would give the best separating terms higher scores.
Correlate the categories to each other based on the top-infogain terms or phrases. This could be done either by using Euclidean or cosine distance between the category models, or by using somewhat more elaborate techniques, like graph-based algorithms or hierarchical clustering.
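As a concrete illustration of that last step, here is a minimal C++ sketch of cosine similarity between two categories represented as term-to-weight maps (the weights would come from whatever model/infogain scoring you end up with; nothing here is tied to a particular toolkit):

#include <cmath>
#include <string>
#include <unordered_map>

using TermWeights = std::unordered_map<std::string, double>;

// Cosine similarity between two categories represented as term -> weight maps
// (e.g. the top-infogain terms of each category model). Returns a value in [0, 1]
// for non-negative weights; higher means more closely related categories.
double cosineSimilarity(const TermWeights& a, const TermWeights& b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (const auto& [term, wa] : a) {
        normA += wa * wa;
        auto it = b.find(term);
        if (it != b.end()) dot += wa * it->second;
    }
    for (const auto& [term, wb] : b) normB += wb * wb;
    if (normA == 0.0 || normB == 0.0) return 0.0;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}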