I have a recurring concern about keeping a data model consistent in the face of future changes while preserving backwards compatibility.
Assume an application that cycles through periods (one per year) and a model in which part of the data is organized hierarchically, where that hierarchy may or may not change between periods. Some implementations simply split each cycle into a separate database, but then there is the problem of making data interoperate across cycles. How can we keep this hierarchy and its changes for each cycle without having to store the entire hierarchy every cycle? It does not necessarily change much, let alone change entirely, but the possibility is there.
Example:
In an academic information system, we have a hierarchy of subjects in each knowledge area:
Mathematics
    Algebra
    Trigonometry
    Arithmetic
Social Sciences
    History
    Geography
Based on this hierarchy we store each student's grades for the 2010 period. In the following period, 2011, the hierarchy changes:
Mathematics
    Trigonometry
    Arithmetic
Algebra    /* here's a change */
    Algebra
Social Sciences
    History
    Geography
or
Mathematics
    Trigonometry
    Arithmetic
    /* here's another change: no more Algebra */
Social Sciences
    History
    Geography
The system keeps working and continues to store students' grades for the 2011 period. Now a student needs their grades from a previous period, but the hierarchy has changed. How can the system retrieve the previous hierarchy?
How can I solve this problem?
Here is a modeling suggestion: a subject entity should have attributes
subject_id (unique primary key)
name
superordinate_subject_id (if empty, you have a top node in your hierarchy)
lifetime (from_year, to_year; when to_year is empty, it is the currently active subject)
Subjects with the same name should not have overlapping lifetimes. Every time you change an active subject in the hierarchy, make a copy of the subject and set the lifetime fields accordingly. As long as the hierarchy does not change, you have nothing to change in your data.
To match your example:
subject Mathematics: lifetime from_year=2010, to_year=NULL
    Algebra: lifetime from_year=2010, to_year=2010
    Trigonometry: lifetime from_year=2010, to_year=NULL
    Arithmetic: lifetime from_year=2010, to_year=NULL
subject Algebra: lifetime from_year=2011, to_year=NULL
    Algebra: lifetime from_year=2011, to_year=NULL
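If it helps, here is a minimal in-memory Python sketch of this versioning idea (field names mirror the attributes above; the helper names and storage are my own assumptions, not a prescribed implementation):

# A minimal in-memory sketch of lifetime-based subject versioning.
# Field names follow the suggested attributes; persistence is up to you.
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class Subject:
    subject_id: int
    name: str
    superordinate_subject_id: Optional[int]  # None means a top node in the hierarchy
    from_year: int
    to_year: Optional[int]                   # None means currently active

subjects: List[Subject] = []

def new_id() -> int:
    return max((s.subject_id for s in subjects), default=0) + 1

def move_subject(old: Subject, new_parent_id: Optional[int], year: int) -> Subject:
    """Retire the old version at year - 1 and create a copy at its new position."""
    old.to_year = year - 1
    new = Subject(new_id(), old.name, new_parent_id, from_year=year, to_year=None)
    subjects.append(new)
    return new

def hierarchy_for_year(year: int) -> List[Subject]:
    """All subject versions that were active in the given period."""
    return [s for s in subjects
            if s.from_year <= year and (s.to_year is None or year <= s.to_year)]

Grade records keep pointing at the subject_id of the version that was active when the grade was recorded, so hierarchy_for_year(2010) gives back the hierarchy exactly as it was in 2010.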
Another option is to have a single "year" field in your subject instead of a lifetime; that may be a much simpler solution, better suited to the case where you want to store a different grade per student per subject per year. But it would mean storing the entire hierarchy each cycle, which is what you wanted to avoid.
Don't mix up the identity of each subject with its position in the hierarchy.
If I got a B+ in Algebra in 2010, that data looks like:
ClassID StudentID Grade
100 100001 B+
The ID of the 'Algebra' class shouldn't change when your categories change.
I have a dataset like the one below. The datetime column is the index, and type is a column containing a sequence; for example, R,C,D,D,D,R,R is a sequence.
start_time           type
2019-12-14 09:00:00  RCDDDRR
2019-12-14 10:00:00  CCRD
2019-12-14 11:00:00  DDRRCC
2019-12-14 12:00:00  ?
I want to predict what the next sequence will be at time 12:00:00. Which is the best algorithm for predicting the next sequence?
I know that we can use a Markov chain to predict the most probable sequence. However, are there any better algorithms?
Thanks
You can use k-NN or SVM for prediction, but first of all you have to restructure the data and define features for the training set. You can also use another method based on deep learning; I think this link can help you:
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
LSTMs have an edge over conventional feed-forward neural networks and RNNs in many ways. This is because of their ability to selectively remember patterns for long durations of time.
LSTMs, on the other hand, make small modifications to the information through multiplications and additions. In LSTMs, the information flows through a mechanism known as cell states. This way, LSTMs can selectively remember or forget things. The information at a particular cell state has three different dependencies.
Let’s take the example of predicting stock prices for a particular stock. The stock price of today will depend upon:
The trend that the stock has been following in the previous days, maybe a downtrend or an uptrend.
The price of the stock on the previous day, because many traders compare the stock’s previous day price before buying it.
The factors that can affect the price of the stock for today. This can be a new company policy that is being criticized widely, or a drop in the company’s profit, or maybe an unexpected change in the senior leadership of the company.
These dependencies can be generalized to any problem as:
The previous cell state (i.e., the information that was present in the memory after the previous time step).
The previous hidden state (this is the same as the output of the previous cell).
The input at the current time step (i.e., the new information that is being fed in at that moment).
Maybe these links and methods could help you:
https://www.bioinf.jku.at/publications/older/2604.pdf
https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
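To make this concrete for the sequences in the question, here is a minimal Keras sketch (my own illustration, not code from the links above): a character-level LSTM trained on the hourly event strings that predicts the most likely next event. The windowing scheme, layer sizes, and training settings are just assumptions.

# A minimal character-level LSTM sketch (assumes TensorFlow/Keras is installed).
# The hourly strings are treated as one continuous stream for simplicity.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

EVENTS = "RCD"                              # event alphabet from the question
char_to_idx = {c: i for i, c in enumerate(EVENTS)}

def one_hot(seq, length):
    """One-hot encode the last `length` characters of a sequence string."""
    x = np.zeros((length, len(EVENTS)))
    for t, c in enumerate(seq[-length:]):
        x[t, char_to_idx[c]] = 1.0
    return x

# Build (window of past events, next event) training pairs.
history = ["RCDDDRR", "CCRD", "DDRRCC"]
stream = "".join(history)
window = 5
X = np.array([one_hot(stream[i:i + window], window)
              for i in range(len(stream) - window)])
y = np.array([one_hot(stream[i + window], 1)[0]
              for i in range(len(stream) - window)])

model = Sequential([
    LSTM(32, input_shape=(window, len(EVENTS))),
    Dense(len(EVENTS), activation="softmax"),  # probability of each next event
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=50, verbose=0)

# Most likely next event after the last observed window.
probs = model.predict(one_hot(stream, window)[np.newaxis])[0]
print(EVENTS[int(np.argmax(probs))])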
I want to look at the change in the percentage of income saved (measured in dollars) using repeated-measures data for older adults. I want to compare three groups based on their father's occupation while they were growing up.
I want to estimate different slopes and intercepts for the three groups. For example: group A is subjects whose fathers were blue-collar; group B is subjects whose fathers were white-collar; group C is subjects whose fathers were other types of workers. I have repeated measures from 10 annual surveys in which subjects reported how much of their income they saved. I want to see whether the intercept and trajectory differ across the three groups.
This is what I came up with, but I am not sure whether it captures the intended verbal explanation.
m1
xtmixed savings time || subjects:, var
m1 looks at the impact of time on savings and estimates intercepts and slopes for each subject.
m2
xtmixed savings time##fathers_occupation || subjects:, var
Does m2 examine the differences in the intercept and slope for the three groups? Or do I need to add fathers_occupation to the right side of || as well?
You don't need it in the random part unless you think the effect of fathers_occupation on savings differs across subjects. m2 estimates a different intercept for each subject, and the time##fathers_occupation term estimates a separate time slope for each fathers_occupation group.
I don't know your data, but I think it's very unlikely that the effect of fathers_occupation differs across individual subjects, so I think you can rely on the simpler random-intercept model with the interaction term, as specified by m2.
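If it helps to see the same distinction outside Stata, here is a rough Python/statsmodels sketch (a different tool, shown purely to illustrate the random-intercept vs. random-slope choice; the data file and variable names are hypothetical):

# A rough statsmodels sketch; "savings_long.csv" and column names are
# hypothetical placeholders for long-format data (one row per subject per year).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("savings_long.csv")

# Analogue of m2: fixed time##fathers_occupation interaction plus a random
# intercept per subject (like `|| subjects:` with nothing after the colon).
m2 = smf.mixedlm("savings ~ time * C(fathers_occupation)", df,
                 groups=df["subject"]).fit()

# Random intercept AND a random slope for time per subject, the closest
# analogue to putting a variable after `|| subjects:`; as noted above, you
# would only add a variable there if you believed its effect varied by subject.
m2_rs = smf.mixedlm("savings ~ time * C(fathers_occupation)", df,
                    groups=df["subject"], re_formula="~time").fit()

print(m2.summary())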
I am building a finance cube and trying to understand the best practice while designing my main fact table.
What do you think will be a better solution:
Have one column in the fact (amount) and have an additional field which will indicate the type of financial transaction (costs, income, tax, refund, etc).
TransType Amount Date
Costs 10 Aug-1
Income 15 Aug-1
Refunds 5 Aug-2
Costs 5 Aug-2
"Pivot" the table to create several columns according to the type of the transaction.
Costs Income Refund Date
10 15 NULL Aug-1
5 NULL 5 Aug-2
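Just to make the relationship between the two layouts concrete, here is a quick pandas sketch (pandas is only for illustration, not part of the cube itself): option 2 is simply option 1 pivoted on TransType.

# A quick pandas illustration (not SSAS): option 2 is option 1 pivoted on TransType.
import pandas as pd

narrow = pd.DataFrame({
    "TransType": ["Costs", "Income", "Refunds", "Costs"],
    "Amount":    [10, 15, 5, 5],
    "Date":      ["Aug-1", "Aug-1", "Aug-2", "Aug-2"],
})

pivoted = narrow.pivot_table(index="Date", columns="TransType",
                             values="Amount", aggfunc="sum")
print(pivoted)  # one column per type, NaN where a type has no rows for that date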
Of course, the cube will follow whichever option is selected: either several real measures, or several calculated measures, each based on one main measure sliced by a member of a "Transaction Type" dimension.
(In general, all transaction types have the same number of rows.)
Thank you in advance.
Oren.
For a finance-related cube, I believe it is much better to use account dimension functionality.
By using an account dimension, you can add or remove accounts without changing the structure of your model. Also, if you use an account dimension, the time balance (aggregate function) functionality of the cube can help you a lot.
However, the SSAS account dimension has its own problems as well. For example, if you assign a time balance to a formula or a hierarchical parent, it is silently ignored, and that is not documented as far as I know. So be ready to fix the calculations in the calculation script.
You can also use custom rollup member functionality to load your financial formulas.
In our case, we have 6000+ accounts, and the formulas can change outside our control.
So having custom rollup member functionality helps a lot.
You need to be careful with solve orders (ratios, etc.), but that is usual for any complicated financial cube.
When there are no ratings, a common scenario is to use implicit feedback (items bought, pageviews, clicks, ...) to generate recommendations. I'm using a model-based approach, and I'm wondering how to deal with multiple identical feedback events.
As an example, let's imagine that consumers buy items more than once. Should I treat the number of feedback events (pageviews, items bought, ...) as a rating, or compute a custom value?
To model implicit feedback, we usually have a mapping procedure to map implicit user feedback into explicit ratings. I guess in most domains, repeated user actions on the same item indicate that the user's preference for the item is increasing.
This is certainly true if the domain is music or video recommendation. In a shopping site, such a behavior might indicate the item is consumed periodically, e.g., diapers or printer ink.
One way I am aware of to model this repeated implicit feedback is to create a numeric rating-mapping function: as the number of times k the implicit feedback occurs increases, the mapped rating should increase. At k = 1 you have a minimal positive-feedback rating, for example 0.6; as k increases, the rating approaches 1. Of course, you don't need to map to [0,1]; you can use integer ratings 0,1,2,3,4,5.
To give you a concrete example of the mapping, here is what was done in a music recommendation domain. In short, they used per-user statistics of the items to define the mapping function.
We assume that the more times the user has listened to an artist the more the user likes that particular artist. Note that user's listening habits usually present a power law distribution, meaning that a few artists have lots of plays in the users profile, while the rest of the artists have significantly less play counts. Therefore, we compute the complementary cumulative distribution of artist plays in the users' profile. Artists located in the top 80-100% of the distribution are assigned a score of 5, while artists in the 60-80% range are assigned a score of 4.
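A small Python sketch of that percentile-style mapping, assuming a per-user dictionary of artist play counts (the data, thresholds, and helper name are illustrative assumptions, not the paper's code):

# A sketch of the percentile-based mapping described in the quote.
import numpy as np

def playcounts_to_ratings(play_counts):
    """Map one user's artist play counts to 1-5 ratings by percentile rank."""
    artists = list(play_counts)
    counts = np.array([play_counts[a] for a in artists], dtype=float)
    ranks = counts.argsort().argsort() / max(len(counts) - 1, 1)  # 0..1 within profile
    ratings = np.digitize(ranks, [0.2, 0.4, 0.6, 0.8]) + 1        # top 20% -> 5, etc.
    return dict(zip(artists, ratings.tolist()))

print(playcounts_to_ratings({"artist_a": 500, "artist_b": 40, "artist_c": 3}))
# {'artist_a': 5, 'artist_b': 3, 'artist_c': 1}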
Another way I have seen in the literature is to create another variable besides a binary rating variable; they call it a confidence level. See here for details.
Probably not that helpful for OP any longer, but it might be for others in the same boat.
Evaluating Various Implicit Factors in E-commerce
Modelling User Preferences from Implicit Preference Indicators via Compensational Aggregations
If anyone knows more papers/methods, please share as I'm currently looking for state of the art approaches to this problem. Thanks in advance.
You typically use a sum of clicks, or some weighted sum of events, as a "score" for each user-item pair in implicit feedback systems. It's not a rating, and that's more than a semantic distinction. You won't get good results if you feed these values into a process that expects rating-like values and tries to minimize a squared-error loss.
You treat 3 clicks as adding 3 times the value of 1 click to the user-item interaction strength. Other events, like a purchase, might be weighted much more highly than a click. But in the end it also adds to a sum.
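A minimal sketch of that weighted-sum idea; the event types and weights here are made-up values you would tune for your own domain:

# A minimal sketch of a weighted event sum as an interaction "score".
EVENT_WEIGHTS = {"click": 1.0, "add_to_cart": 3.0, "purchase": 10.0}

def interaction_strength(events):
    """events: list of event-type strings observed for one user-item pair."""
    return sum(EVENT_WEIGHTS.get(e, 0.0) for e in events)

print(interaction_strength(["click", "click", "click", "purchase"]))  # 13.0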
I read this problem in a book (an interview question) and wanted to discuss it in detail here. Kindly throw some light on it.
The problem is as follows:
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid-1990s: it decided to release "anonymized" data on state employees showing every single hospital visit they had.
The goal was to help researchers. The state spent time removing identifiers such as name, address, and social security number. The Governor of Massachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student saw significant pitfalls in this approach. She requested a copy of the data and, by collating the data in multiple columns, was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem with some kind of example set, and see what it actually takes to do this anonymization. I hope the question is clear.
I don't have an experienced person who can help me with this kind of problem. Kindly don't vote to close this question, as I would be stuck if that happened.
Thanks, and if any more explanation of the question is required, please ask.
I just copy-pasted part of your text and stumbled upon this. It helps in understanding your problem:
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need a sample database, and you need to ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep only the sex (M/F) in the database, then there is no way to find out who is who, because there are only two possible values: M and F.
But if you add the birthdate, then the total number of possible combinations becomes roughly 2*365*80 ≈ 58,000 (I chose 80 years). Even if your database contains 500,000 people, there is a chance that one of them (say, a male born on 03/03/1985) is the ONLY one with that combination, and thus you can recognize him.
This is only a simplistic approach that relies on combinatorics. If you want something more sophisticated, look into correlated information and PCA.
Edit: Let's give an example. Suppose I'm working with medical data. If I keep only:
The sex: 2 possibilities (M, F)
The blood group: 4 possibilities (O, A, B, AB)
The rhesus factor: 2 possibilities (+, -)
The state they live in: 50 possibilities (in the USA)
The month of birth: 12 possibilities (it affects the death rate of babies)
Their age category: 10 possibilities (0-9 years old, 10-19 years old, ..., 90+)
That leads to a total of 2*4*2*50*12*10 = 96,000 categories. Thus, if your database contains 200,000,000 entries (a rough approximation of the number of inhabitants of the USA in your database), there is NO WAY you can identify someone.
This also assumes that you do not give out any further information: no ZIP code, etc. With only the 6 attributes given, you can compute some nice statistics (do people born in December live longer?), but no identification is possible because 96,000 is far smaller than 200,000,000.
However, if you only have the database of the city you live in, which has, say, 200,000 inhabitants, then you cannot guarantee anonymization, because 200,000 is "not much bigger" than 96,000. ("Not much bigger" is a truly complex scientific term that requires knowledge of probability :P )
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own example by taking a dataset, "anonymizing" it, and then trying to re-identify the records.
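If you do try that, a quick way to measure how anonymous the result actually is: group the records by the quasi-identifier columns and look at the smallest group size, which is your k. A minimal pandas sketch with hypothetical column names and rows:

# A minimal pandas k-anonymity check; the columns and rows are hypothetical.
import pandas as pd

def smallest_group(df, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values.
    If this is 1, at least one record is uniquely identifiable from those columns."""
    return df.groupby(quasi_identifiers).size().min()

df = pd.DataFrame({
    "sex":        ["M", "F", "M", "M"],
    "zip":        ["02138", "02139", "02138", "02139"],
    "birth_year": [1945, 1972, 1945, 1980],
    "diagnosis":  ["flu", "asthma", "flu", "flu"],
})
print(smallest_group(df, ["sex", "zip", "birth_year"]))  # 1 -> not even 2-anonymous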
Here is a very detailed discussion of the de-identification/anonymization problem and potential tools and techniques for solving it:
https://www.infoway-inforoute.ca/index.php/component/docman/doc_download/624-tools-for-de-identification-of-personal-health-information
The jurisdiction for the document above is within the rules of the Canadian public health system, but they are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.