Generate Price Data from 3 variables and data - algorithm

I'm trying to come up with an algorithm that generates a price from 3 variables, and I need a way of deriving it from some data.
For instance, I'm trying to come up with the price for a used car. The 3 variables would be:
The make of the car (e.g. Honda Civic)
The year of the car (e.g. 2006)
Kilometers driven (e.g. 200,000 km)
I would feed it data extracted from a listing site. Each listing has the same three fields as above, plus the listing price.
The user can then pick the make, year, and kilometers driven, and it will generate an average price based on that data.
Any ideas at all would be helpful! I'm building this in PHP with a MySQL database.
Thanks so much!

If you are looking for something simple that is based only on the available data, plain SQL will suffice. You need to GROUP BY, use AVG and filter with WHERE.
If you are looking for something fancier and want to make predictions from limited data or incomplete queries, you should have a look at things like regression trees.
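As a rough sketch of the plain SQL approach, the query below averages the listing price for one make and year within a band of kilometers. The table name `listings`, its column names, the sample rows and the 25,000 km band are all assumptions made for illustration, and SQLite stands in for MySQL only so the example runs on its own; the SELECT statement itself is standard SQL.

```python
import sqlite3


def average_price(conn, make, year, km, km_band=25_000):
    """Average listing price for one make and year, within +/- km_band kilometers."""
    avg_price, n = conn.execute(
        """
        SELECT AVG(price), COUNT(*)
        FROM listings
        WHERE make = ? AND year = ? AND km BETWEEN ? AND ?
        """,
        (make, year, km - km_band, km + km_band),
    ).fetchone()
    # avg_price is None when no listing matches the filter.
    return avg_price, n


# Hypothetical sample data, not taken from any real listing site.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (make TEXT, year INTEGER, km INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        ("Honda Civic", 2006, 190_000, 4000.0),
        ("Honda Civic", 2006, 210_000, 3600.0),
        ("Honda Civic", 2006, 60_000, 7500.0),
        ("Honda Civic", 2010, 200_000, 6000.0),
    ],
)

print(average_price(conn, "Honda Civic", 2006, 200_000))
```

Returning the match count next to the average lets the caller see when the estimate rests on very few listings. Adding GROUP BY make, year to a similar query would return the averages for every combination at once instead of one lookup at a time.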

How to create visualization using ratio of fields

I have a data set similar to the table below (simplified for brevity).
I need to calculate the total spend per conversion per team for every month. Being able to plot this as a time-based line chart would be a nice extra. The total spend is the sum of Phone Expenditure, Travel Allowance and Misc. Allowance; this can be a calculated field.
I cannot add a calculated field for the ratio, because for some salespeople the number of conversions can be 0 for a given month. So averaging the per-person ratio over the team is not an option. How can I go about this?
Thanks for help and suggestions in advance!
I've discussed the question with Harish offline. I've learned that he is trying to calculate a ratio per group, not per row.
To perform calculations per group, users can add calculated fields inside a QuickSight analysis and use level-aware aggregation expressions. (Note that level-aware aggregations can only be used in an analysis, not in the data prep view.) Here is a link to the documentation about level-aware aggregations if you want to learn more about this area: https://docs.aws.amazon.com/quicksight/latest/user/level-aware-aggregations.html
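To make the difference between a per-row and a per-group ratio concrete, here is a small sketch in plain Python. It is not QuickSight syntax; the field names and the sample rows are assumptions made for illustration. Spend and conversions are summed per team and month first, and only then divided, so a salesperson with 0 conversions no longer causes a division by zero.

```python
from collections import defaultdict


def spend_per_conversion(rows):
    """Return {(team, month): total spend / total conversions}, computed per group."""
    spend = defaultdict(float)
    conversions = defaultdict(int)
    for r in rows:
        key = (r["team"], r["month"])
        spend[key] += r["phone"] + r["travel"] + r["misc"]
        conversions[key] += r["conversions"]
    # A group with no conversions at all has no defined ratio.
    return {k: (spend[k] / conversions[k] if conversions[k] else None) for k in spend}


# Hypothetical rows: one row per salesperson per month.
rows = [
    {"team": "A", "month": "Jan", "phone": 100, "travel": 50, "misc": 50, "conversions": 0},
    {"team": "A", "month": "Jan", "phone": 100, "travel": 0, "misc": 0, "conversions": 3},
    {"team": "B", "month": "Jan", "phone": 60, "travel": 0, "misc": 0, "conversions": 0},
]

result = spend_per_conversion(rows)
print(result)
```

Team A spends 300 in total for 3 conversions, so its ratio is 100.0 even though one of its members converted nothing that month.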

Get a list of user matching different terms with a specified ratio

Let's say I have the following simple document structure.
{
username: string,
hobby: string
}
I want to get, in one request, a list of users in which 80% have football as their hobby, 10% rugby, 5% volley and 5% tennis.
Is this possible? How can you achieve that?
If so, is it also possible to ask for a percentage of users with a random hobby value?
Thanks a lot,
Julien
No. Elasticsearch does not give partially calculated results.
Another problem is that the numbers might not match the exact percentages (in any database).
For example, say you have 4 users in total, one with each hobby you specified. Here you cannot build the desired list with the exact percentages, and there are endless combinations like this one.
A further suggestion: if you have exactly this structure, consider a relational (SQL) database.
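The point about inexact percentages can be illustrated with a short sketch. It splits a list size into whole-number quotas per hobby using the largest-remainder method, then caps each quota by the number of users actually available. The function names are hypothetical, and this is client-side arithmetic only, not an Elasticsearch feature.

```python
def quotas(total, percents):
    """Split `total` into integer counts proportional to integer percentages."""
    counts = {k: total * p // 100 for k, p in percents.items()}
    remainders = {k: total * p % 100 for k, p in percents.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining slots to the hobbies with the largest remainders.
    for k in sorted(percents, key=lambda k: remainders[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts


def achievable(wanted, available):
    """Cap each quota by the number of users that exist for that hobby."""
    return {k: min(n, available.get(k, 0)) for k, n in wanted.items()}


shares = {"football": 80, "rugby": 10, "volley": 5, "tennis": 5}

print(quotas(20, shares))  # a list of 20 splits cleanly
print(quotas(4, shares))   # a list of 4 cannot honor 5% or 10%

# The case from the answer: 4 users, one per hobby.
one_each = {"football": 1, "rugby": 1, "volley": 1, "tennis": 1}
print(achievable(quotas(4, shares), one_each))
```

With one user per hobby, the football quota of 3 can only be filled with 1 user, so the requested mix is out of reach whatever the database.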

Cassandra Modeling for filter and range queries

I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.
I want to be able to write queries like these:
get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds
or
get all users who are men, are 6'0" tall, are ages 31-37 and have black hair
How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.
EDIT:
Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?
IF I UNDERSTAND THINGS CORRECTLY
If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. In the best case, that is 2/3 billion records on each of three nodes. In the second case, with 3 hair colors x 2 sexes, I will be putting 2/6 billion records on each of six nodes in the best case, with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?
If you are trying to model your data inside Cassandra, the general rule is that you need to make a table per query. There are also significant restrictions on what you can filter your query by. If you want to understand some of the restrictions, I suggest you take a look at this post:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
or my long answer here:
cassandra - how to perform table query?
All of the above only applies if you are running fixed queries that are known ahead of time. If instead you are looking to perform some sort of analysis on your data (it sounds like you might be), then I would look at using Spark in conjunction with Cassandra. This gives you a fast tool for in-memory processing of your data. If you use DataStax (Community or Enterprise), Spark also has a connector that makes reading and writing data to and from Cassandra easy.
Edited with Additional Information
Based on the query "get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds", you would need to build a table like the following:
CREATE TABLE user_by_haircolor_weight_height (
    haircolor text,
    weight float,
    height_in int,
    user varchar,
    -- user is part of the key so that two users with the same hair color,
    -- weight and height do not overwrite each other
    PRIMARY KEY ((haircolor), weight, height_in, user)
);
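You could then query this by:
SELECT * FROM user_by_haircolor_weight_height WHERE haircolor='red' AND weight>100 AND height_in>=61 AND height_in<=72;
The bounds are inclusive because 5'1" is 61 inches and 6'0" is 72 inches. Note that CQL normally accepts a range restriction only on the last clustering column restricted in a query, so combining the range on weight with the range on height_in is likely to require ALLOW FILTERING, or applying one of the two ranges on the client.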
For the query "get all users who are men, are 6'0" tall, are ages 31-37 and have black hair", you would need to build a similar table with a
PRIMARY KEY ((haircolor, sex), height_in, age)
In the end, if what you are trying to do is run either ad-hoc analytics or a fixed set of analytics on the data stored in your Cassandra table (i.e. work that can tolerate a bit more latency than a straight CQL query), then I suggest you look at using Spark. If you need something closer to real time to handle ad-hoc queries, you can look at using Solr to perform Lucene-powered searches on your table.
My recommendation is:
1) Keep a main table with a proper partition key, so that the millions of records are spread across the cluster. Don't use a clustering column here that would push a partition past the row size limit of 2 GB.
2) Depending on the query pattern, create additional tables (acting as indexes) to hold inverted index data, as many as you need, because writes are cheap.
3) Use multiple queries to get what you need.
4) As a last option, use the DSE Solr search capability.
Just to reiterate the end of the conversation:
"Your understanding is correct, and you are correct in stating that partition keys should be more unique than that. Each partition has a maximum size of 2GB, but the practical limit is lower. In practice you would want your data partitioned into far smaller chunks than the table above. Given the ad-hoc nature of the queries in your example, I do not think you would be able to do this practically by data modelling alone. I would suggest looking at using a Solr index on a table. This would give you a robust search capability. If you use DataStax, you are even able to query this via CQL."
Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.
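The partition sizes discussed in this thread can be checked with a few lines of arithmetic. This is only a sketch: the helper name is hypothetical, and it assumes the cardinalities given in the question (3 hair colors, 2 sexes) and a perfectly even spread of users across values, which is the best case.

```python
def rows_per_partition(total_rows, *cardinalities):
    """Best-case rows per partition when the partition key columns
    have the given numbers of distinct values, evenly distributed."""
    partitions = 1
    for c in cardinalities:
        partitions *= c
    return partitions, total_rows // partitions


users = 2_000_000_000

# Partition key (haircolor): 3 partitions.
print(rows_per_partition(users, 3))
# Partition key (haircolor, sex): 3 x 2 = 6 partitions.
print(rows_per_partition(users, 3, 2))
```

Even in the best case each partition would hold hundreds of millions of rows, which is why the answers above point to a search index rather than data modelling alone.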

Reducing data with data stage

I've been asked to reduce an existing data model using DataStage ETL.
It's more of an exercise and a way to get to know this program, which I'm very new to.
Of course, the data must be reduced following some functional rules.
Table : MEMBERSHIP (..,A,B,C) # where A,B,C are different attributes (our filters)
Reducing data from ~700k rows to 7k rows or so.
I was thinking about keeping the same percentages as in the data source.
Therefore, if the source has 70% of A, 20% of B and 10% of C, the reduced version would have pretty much the same percentages.
I'm looking for the best way to do so and the built-in tools to use (maybe the Aggregator stage?).
Is there any way to do some scripting similar to PL with DataStage?
I hope I've been clear enough. If you have any advice I'd be very grateful.
Thanks to all of you.
~Whitoo
DataStage does not do percentage-wise reductions.
What you can do is use a Transformer stage or a Filter stage to filter the source data based on certain conditions. But the conditions have to be very specific (for example, select only those records which have A = [somevalue] or A not= [somevalue]).
DataStage PX has the sample stage that allows you to specify what percent of data you want it to sample: http://datastage4you.blogspot.com/2014/01/sample-stage-in-datastage.html.
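For reference, the idea of reducing the data while keeping the source proportions is stratified sampling. The sketch below shows it in plain Python, not in DataStage; the function name, the tuple layout and the generated rows are assumptions made for illustration. Each group is sampled separately at the same fraction, so a 70/20/10 split in the source stays 70/20/10 in the sample.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group so proportions are preserved."""
    rng = random.Random(seed)  # fixed seed only to make the run repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Rounding may shift a very small group's count by one.
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical source: 700 rows of A, 200 of B, 100 of C.
rows = (
    [("A", i) for i in range(700)]
    + [("B", i) for i in range(200)]
    + [("C", i) for i in range(100)]
)

sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.01)
print(Counter(r[0] for r in sample))
```

Sampling 1% here mirrors the reduction from about 700k rows to about 7k rows described in the question.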

SSAS Performance: Multiple measures+no Dim vs one measure+DimType

I am building a finance cube and trying to understand the best practice while designing my main fact table.
What do you think will be a better solution:
Have one column in the fact (amount) and have an additional field which will indicate the type of financial transaction (costs, income, tax, refund, etc).
TransType  Amount  Date
Costs      10      Aug-1
Income     15      Aug-1
Refunds    5       Aug-2
Costs      5       Aug-2
"Pivot" the table to create several columns according to the type of the transaction.
Costs  Income  Refund  Date
10     15      NULL    Aug-1
5      NULL    5       Aug-2
Of course, the cube will follow whichever option is selected: several real measures, versus several calculated measures that are each based on one main measure sliced by a member of a "Transaction Type" dimension.
(In general, all transaction types have the same number of rows.)
Thank you in advance.
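To show that the two layouts carry the same information, here is a small sketch that pivots the single-measure rows into one record per date. It is plain Python rather than anything SSAS-specific, the function name is hypothetical, and the rows are the sample values from the first table above.

```python
from collections import defaultdict


def pivot(rows):
    """Turn (trans_type, amount, date) rows into {date: {trans_type: amount}}."""
    wide = defaultdict(dict)
    for trans_type, amount, date in rows:
        # Sum amounts in case a type appears more than once on the same date.
        wide[date][trans_type] = wide[date].get(trans_type, 0) + amount
    return dict(wide)


long_rows = [
    ("Costs", 10, "Aug-1"),
    ("Income", 15, "Aug-1"),
    ("Refunds", 5, "Aug-2"),
    ("Costs", 5, "Aug-2"),
]

wide = pivot(long_rows)
print(wide)

# A type missing for a date corresponds to NULL in the pivoted table.
print(wide["Aug-2"].get("Income"))

# Total costs agree between the two layouts.
total_costs_long = sum(a for t, a, _ in long_rows if t == "Costs")
total_costs_wide = sum(day.get("Costs", 0) for day in wide.values())
```

The long layout accepts a new transaction type as just another row value, while the wide layout needs a new column for it.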
Oren.
For a finance related cube, I believe it is much better to use account dimension functionality.
By using an account dimension, you can add or remove accounts without changing the structure of your model. Also, with an account dimension, the cube's time balance (aggregate function) functionality can help you a lot.
However, the SSAS account dimension has its own problems as well. For example, if you assign a time balance to a formula or a hierarchical parent, it is silently ignored, and that is not documented as far as I know. So be ready to fix the calculations in the calculation script.
You can also use custom rollup member functionality to load your financial formulas.
In our case, we have 6000+ accounts, and the formulas can change without our control.
So having custom rollup member functionality helps a lot.
You need to be careful with solve orders (for ratios, etc.), but that is usual for any complicated financial cube.
