Determine housing supply by MSA using census survey data

I am looking for a good census table for determining housing supply by MSA. Most of the tables I have come across report on households. What I would like to do is determine the supply of total housing units (preferably by unit type) and measure vacancy rates by MSA to view alongside demographics.
Is there a good census table (ACS, CPS, etc.) I can refer to for determining housing supply? Thanks.

For those interested: the CPS publishes a Housing Vacancies set of tables that report some good data. Through those I found that DP04, the "Selected Housing Characteristics" profile from the ACS 5-year estimates, is a good top-level report to use for some of these descriptive statistics.
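For anyone who wants to script this rather than browse data.census.gov, here is a minimal sketch (not an official recipe) of pulling DP04 housing counts by MSA from the Census API. The variable codes used below (DP04_0001E for total housing units, DP04_0003E for vacant units) are my reading of the DP04 variable list, so verify them before relying on the numbers.

```python
# Minimal sketch: pull DP04 housing counts by MSA from the ACS 5-year profile API.
# DP04_0001E (total housing units) and DP04_0003E (vacant units) are assumed
# variable codes -- check them against the profile's variable list first.
import requests

url = "https://api.census.gov/data/2021/acs/acs5/profile"
params = {
    "get": "NAME,DP04_0001E,DP04_0003E",
    "for": "metropolitan statistical area/micropolitan statistical area:*",
    # "key": "YOUR_CENSUS_API_KEY",  # optional for light usage
}
rows = requests.get(url, params=params).json()
header, data = rows[0], rows[1:]
print(header)
print(data[:3])  # first few MSAs: name, total units, vacant units, geo id
```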

Related

Should I use multiple fact tables for each grain or just aggregate from lowest grain?

Fairly new to data warehouse design and star schemas. We have designed a fact table that stores various measures about memberships. Our grain is daily, and some of the measures in this table are things like qty sold new, qty sold renewing, qty active, and qty cancelled.
My question is this: the business will want to see the measures at other grains such as monthly, quarterly, and yearly. Would the typical approach here just be to aggregate the day-level data for whatever time period is needed, or would you recommend creating separate fact tables for the "key" time periods in our business requirements, e.g. monthly, quarterly, yearly? I have read some mixed information on this, which is mainly why I'm seeking others' views.
Some of the information I read had people embedding a hierarchy in the fact table to designate different grains, identified via a "level" type column. That was advised against by quite a few people and didn't seem good to me either. Those advising against it were suggesting separate fact tables per grain, but to be honest I don't see why we wouldn't just aggregate from the daily entries we have. What benefits would we get from a fact table for each grain, other than perhaps some slight performance improvements?
Each data mart will have its own "perspective", which may require an aggregated fact grain.
Star schema modeling is a "top-down" process, where you start from a set of questions or use cases and build a schema that makes those questions easy to answer, not a "bottom-up" process where you start with the source data and figure out the schema design from there.
You may end up with multiple data marts that share the same granular fact table but need to aggregate it in different ways, either for performance or to have a grain at which to calculate and store a measure that only makes sense at the aggregated level.
E.g.
SalesFact(store, day, customer, product, quantity, price, cost)
and
StoreSalesFact(store, week, revenue, payroll_expense, last_year_revenue)
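To the aggregation question itself, here is a minimal sketch (pandas, with hypothetical column names) of rolling the daily membership fact up to a monthly grain on the fly:

```python
# Roll the daily membership fact up to a monthly grain at query time.
# Column names (date_key, qty_sold_new, ...) are hypothetical.
import pandas as pd

daily_fact = pd.DataFrame({
    "date_key": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "qty_sold_new": [12, 7, 9],
    "qty_sold_renewing": [30, 22, 41],
    "qty_cancelled": [3, 1, 4],
})

# Additive measures sum cleanly to any coarser time grain.
monthly = (
    daily_fact
    .groupby(daily_fact["date_key"].dt.to_period("M"))
    [["qty_sold_new", "qty_sold_renewing", "qty_cancelled"]]
    .sum()
)
print(monthly)
```

Note that a semi-additive measure like qty active cannot simply be summed across days (you would want something like the period-end value), which is exactly the kind of measure that can justify a separate fact table at the aggregated grain.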

Matchmaking Algorithm

I'm trying to figure out how to best match a borrower to a lender for a real estate transaction. Let's say there's a network of 1,000 lenders on a platform. A borrower would log in and be asked to provide the following:
Personal Information and Track Record (how many projects they have done, credit score, net worth etc.)
Loan Information (loan size, type, leverage etc.)
Project Information (number of units, floors, location, building type etc.)
On the other side, a lender would provide criteria on which they would agree to lend on. For example, a lender agrees to lend to a borrower if:
They have done more than 5 projects
Credit Score > 700
Net Worth > Loan Amount
$500,000 < Loan Amount < $5,000,000
Leverage < 75%
Building Size > 10 Units
Location = CA, AZ, NY, CO
etc...
I want to create a system that matches a lender to a borrower based on the information the borrower provided and the criteria the lender provided. Ideally, the system would assign 1,000 scores to the borrower, one representing the "matchmaking" score for each lender on the platform. A borrower who meets more of a lender's requirements would get a higher score, since the match should be better. What machine learning algorithm would be best suited to generate such a score? Or would this problem be better solved using combinatorial optimization?
Thanks!
If you don't have the system yet, you are unlikely to have good data for machine learning.
So write a few custom rules and start collecting data. Once you have data, do something like building a logistic regression to estimate the probability of acceptance. Once the model is good enough to beat your home-grown rules in an A/B test, switch to the machine learning model.
But you can't invoke the magic of machine learning until you have data to learn from.
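As a concrete version of the "custom rules" starting point, here is a minimal sketch that scores each lender by the fraction of their criteria the borrower satisfies. The field names and the criteria themselves are hypothetical, mirroring the example in the question.

```python
# Score each lender as the fraction of their criteria the borrower meets.
# Borrower fields and lender criteria below are hypothetical examples.
from typing import Callable, Dict, List

Borrower = Dict[str, object]
Criterion = Callable[[Borrower], bool]

def match_score(borrower: Borrower, criteria: List[Criterion]) -> float:
    """Fraction of the lender's criteria that the borrower meets."""
    if not criteria:
        return 0.0
    met = sum(1 for passes in criteria if passes(borrower))
    return met / len(criteria)

# One lender's criteria, mirroring the example in the question.
lender_criteria: List[Criterion] = [
    lambda b: b["projects_completed"] > 5,
    lambda b: b["credit_score"] > 700,
    lambda b: b["net_worth"] > b["loan_amount"],
    lambda b: 500_000 < b["loan_amount"] < 5_000_000,
    lambda b: b["leverage"] < 0.75,
    lambda b: b["building_units"] > 10,
    lambda b: b["state"] in {"CA", "AZ", "NY", "CO"},
]

borrower = {
    "projects_completed": 8, "credit_score": 720, "net_worth": 2_000_000,
    "loan_amount": 1_500_000, "leverage": 0.70, "building_units": 24, "state": "CA",
}

print(match_score(borrower, lender_criteria))  # 1.0 when every rule passes
```

Once real accept/decline outcomes accumulate, the same borrower fields can become features for a logistic regression (e.g. scikit-learn's LogisticRegression) that replaces the hand-written rules, as suggested above.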

Difference between secondary analysis and data mining

I am trying to extract some consumption patterns of certain demographic groups from large multidimensional datasets built for other purposes. I am using clustering and regression analysis with manual methods (SPSS). Is this considered secondary analysis or data mining? I understand the difference between statistical analysis and data mining, but this case seems to be somewhere in between... Thanks
"Secondary analysis" means that the data was collected for "primary" research project A, but then was analyzed again for "secondary" project B with a very different objective that was not originally planned. Maybe much later maybe by different people. Fairly common in medicine if you want to avoid the cost of doing the experiments yourself, and someone else has published suitable data.
A theoretical example:
Research group A does a clinical trial on drug B, and measures body mass and insulin levels.
Data is published, for both the study group (with drug B) and the control group (without drug B).
... ten years later ...
Research group C wants to know if there is a correlation between body mass and insulin levels. They do not care about drug B, so they only look at the control group. They join the data with the data of many other groups instead of doing their own experiments.
This is not a "meta" study, because they disregard any results with respect to drug B. They do not use the results of group A, only their data, for a different purpose. Since this is secondary use of the data, it is called "secondary analysis".
The analysis could be as simple as computing a correlation: something usually not considered "data mining" (you do not search, nor use advanced statistics) but traditional statistical hypothesis testing.
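For illustration, a minimal sketch of that secondary-analysis step, with hypothetical file and column names: pool the published control groups and compute a plain correlation.

```python
# Pool published control-group data and compute an ordinary Pearson correlation.
# File names and column names are hypothetical.
import pandas as pd

# Each CSV holds one published study's control-group measurements.
frames = [pd.read_csv(f) for f in ["study_a_controls.csv", "study_b_controls.csv"]]
pooled = pd.concat(frames, ignore_index=True)

# Classical statistics, not data mining.
print(pooled["body_mass"].corr(pooled["insulin_level"]))
```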

Is a fact table in normalized or de-normalized form?

I did a bit of research on fact tables, specifically whether they are normalized or de-normalized.
I came across some findings that left me confused.
According to Kimball:
Dimensional models combine normalized and denormalized table structures. The dimension tables of descriptive information are highly denormalized with detailed and hierarchical roll-up attributes in the same table. Meanwhile, the fact tables with performance metrics are typically normalized. While we advise against a fully normalized with snowflaked dimension attributes in separate tables (creating blizzard-like conditions for the business user), a single denormalized big wide table containing both metrics and descriptions in the same table is also ill-advised.
The other finding, which I also think is OK, is by fazalhp at GeekInterview:
The main funda of DW is de-normalizing the data for faster access by the reporting tool...so if ur building a DW ..90% it has to be de-normalized and off course the fact table has to be de normalized...
So my question is: are fact tables normalized or de-normalized? Whichever it is, how and why?
From the point of view of relational database design theory, dimension tables are usually in 2NF and fact tables anywhere between 2NF and 6NF.
However, dimensional modelling is a methodology unto itself, tailored to:
one use case, namely reporting
mostly one basic type (pattern) of query
one main user category -- business analyst, or similar
row-store RDBMSs like Oracle, SQL Server, Postgres ...
one independently controlled load/update process (ETL); all other clients are read-only
There are other DW design methodologies out there, like
Inmon's -- data structure driven
Data Vault -- data structure driven
Anchor modelling -- schema evolution driven
The main thing is not to mix up database design theory with a specific design methodology. You may look at a certain methodology through the perspective of database design theory, but you have to study each methodology separately.
Most people working with a data warehouse are familiar with transactional RDBMSs and apply various levels of normalization, so those concepts are used to describe working with a star schema. What the authors are really doing is trying to get you to unlearn all those normalization habits. This can get confusing because there is a tendency to focus on what "not" to do.
The fact table(s) will probably be the most normalized, since they usually contain just numerical values along with various IDs for linking to dimensions. The key with fact tables is how granular you need to get with your data. An example for purchases could be specific line items by product in an order, or data aggregated at a daily, weekly, or monthly level.
My suggestion is to keep searching and studying how to design a warehouse based on your needs. Don't aim for high levels of normal forms. Think more about the reports you want to generate and the analysis capabilities you want to give your users.
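To illustrate the shape being described, here is a minimal sketch with hypothetical tables: a narrow fact table of keys and measures next to a deliberately denormalized dimension, plus the usual join-then-aggregate query.

```python
# A narrow, fairly normalized fact table (keys plus numeric measures) joined to
# a deliberately denormalized dimension carrying descriptive roll-up attributes.
# Table and column names are hypothetical.
import pandas as pd

sales_fact = pd.DataFrame({
    "date_key": [20240105, 20240105, 20240212],
    "product_key": [1, 2, 1],
    "quantity": [3, 1, 5],
    "revenue": [29.97, 15.00, 49.95],
})

# Denormalized dimension: brand and category live in the same table as product,
# instead of being snowflaked out into separate tables.
product_dim = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Widget", "Gadget"],
    "brand": ["Acme", "Acme"],
    "category": ["Hardware", "Hardware"],
})

# Typical star-schema query: join once, then group by a descriptive attribute.
report = (
    sales_fact.merge(product_dim, on="product_key")
    .groupby("category")[["quantity", "revenue"]]
    .sum()
)
print(report)
```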

abstract data types in algorithms

The data structures that we use in applications often contain a great
deal of information of various types, and certain pieces of
information may belong to multiple independent data structures. For
example, a file of personnel data may contain records with names,
addresses, and various other pieces of information about employees;
and each record may need to belong to one data structure for searching
for particular employees, to another data structure for answering
statistical queries, and so forth.
Despite this diversity and complexity, a large class of computing
applications involve generic manipulation of data objects, and need
access to the information associated with them for a limited number of
specific reasons. Many of the manipulations that are required are a
natural outgrowth of basic computational procedures, so they are
needed in a broad variety of applications.
The above text appears in the context of abstract data types in Robert Sedgewick's Algorithms in C++.
My question is: what does the author mean by the first paragraph of the above text?
Data structures are combinations of data storage and algorithms that work on those organisations of data to provide implementations of certain operations (searching, indexing, sorting, updating, adding, etc) with particular constraints. These are the building blocks (in a black box sense) of information representation in software. At the most basic level, these are things like queues, stacks, lists, hash maps/associative containers, heaps, trees etc.
Different data structures have different tradeoffs. You have to use the right one in the right situation. This is key.
In this light, you can use multiple (or "compound") data structures in parallel that allow different ways of querying and operating on the same logical data, each covering the others' tradeoffs (strengths/weaknesses: e.g. one might be presorted, another might be good at tracking changes but more costly to delete entries from), usually at the cost of some extra overhead, since those data structures will need to be kept synchronised with each other.
It would help if one knew what the conclusion of all this is, but from what I gather:
Employee record:
Name, Address, Phone-Number, Salary, Bank-Account, Department, Superior
As you can see, the employee database has information for each employee that by itself is "subdivided" into chunks of more-or-less independent pieces: the contact information for an employee has little or nothing to do with the department he works in, or the salary he gets.
EDIT: As such, depending on what needs to be done, different parts of this larger record need to be looked at, possibly in different ways. If you want to know how much salary you're paying in total, you'll need to do different things than if you're looking up an employee's phone number.
An object may be a part of another object/structure, and the association is not unique; one object may be a part of multiple different structures, depending on context.
Say, there's a corporate employee John. His "employee record" will appear in the list of members of his team, in the salaries list, in the security clearances list, parking places assignment etc.
Not all data contained within his "employee record" will be needed in all of these contexts. His expertise fields are not necessary for parking place allotment, and his marital status should not play a role in meeting room assignment; the separate subsystems and larger structures his entry is part of don't require all of the data contained within it, just specific parts.
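To make that concrete, here is a minimal sketch with hypothetical fields: one employee record participating in two independent structures, each touching only the parts it needs.

```python
# One employee record shared by several independent structures, each of which
# uses only part of it. Fields and names are hypothetical.
from dataclasses import dataclass

@dataclass
class Employee:
    name: str
    address: str
    phone: str
    salary: float
    department: str

employees = [
    Employee("John", "1 Main St", "555-0100", 90_000, "Engineering"),
    Employee("Mary", "2 Oak Ave", "555-0101", 95_000, "Sales"),
]

# Structure 1: lookup by name (for "find a particular employee" queries).
by_name = {e.name: e for e in employees}

# Structure 2: grouping by department (for statistical queries).
payroll_by_dept: dict[str, float] = {}
for e in employees:
    payroll_by_dept[e.department] = payroll_by_dept.get(e.department, 0.0) + e.salary

print(by_name["John"].phone)  # uses contact info only
print(payroll_by_dept)        # uses salary/department only
```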
