Hadoop use cases with valid loan data

I have some important data sets for loan acquisitions in a TXT file.
Note: data are available to me for Q1-Q4 of the last 3 years.
Please also find the field description of each column of the Acquisitions file in the image file.
1) Can you please help me generate some valid, logical business use cases that I can implement using HDFS and MapReduce Java programming?
I ask because most of the sample use cases relate to word count and weather data analysis.
To get the data file, just sign in.
Data file
Link: loanperformancedata.fanniemae.com/lppub-docs/acquisition-sample-file.txt
100009503314|CORRESPONDENT|WELLS FARGO BANK, N.A.|3.75|320000|360|12/2011|02/2012|67|67|1|32|798|NO|PURCHASE|PUD|1|PRINCIPAL|CA|949||FRM
100010175842|RETAIL|OTHER|3.875|255000|360|02/2012|04/2012|73|73|1|49|778|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|OH|432||FRM
100013227768|BROKER|FLAGSTAR CAPITAL MARKETS CORPORATION|3.875|415000|360|12/2011|03/2012|46|46|2|21|780|NO|NO CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|223||FRM
100016880542|RETAIL|WELLS FARGO BANK, N.A.|4.25|417000|360|11/2011|01/2012|90|90|2|40|794|NO|PURCHASE|SF|1|PRINCIPAL|CA|956|25|FRM
2) Column description of the data
link - loanperformancedata.fanniemae.com/lppub-docs/lppub_file_layout.pdf
LOAN IDENTIFIER
CHANNEL
SELLER NAME
ORIGINAL INTEREST RATE
ORIGINAL UNPAID PRINCIPAL BALANCE (UPB)
ORIGINAL LOAN TERM
ORIGINATION DATE
FIRST PAYMENT DATE
ORIGINAL LOAN-TO-VALUE (LTV)
ORIGINAL COMBINED LOAN-TO-VALUE
NUMBER OF BORROWERS
DEBT-TO-INCOME RATIO
CREDIT SCORE
FIRST-TIME HOME BUYER INDICATOR
LOAN PURPOSE
PROPERTY TYPE
NUMBER OF UNITS
OCCUPANCY STATUS
PROPERTY STATE
ZIP (3-DIGIT)
MORTGAGE INSURANCE PERCENTAGE
PRODUCT TYPE
Link: loanperformancedata.fanniemae.com/lppub-docs/lppub_glossary.pdf
Please help me build some valid business use cases and the Java programs to implement them.
Most of the sample data for Hadoop is for the weather and word count examples :(

You can do simple filtering and aggregation to identify the state with the maximum number of loans and the minimum credit score. That may give insight into issues with approving loans in places where the default rate is much higher.
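As one concrete starting point, here is a minimal sketch of the map/reduce logic for that use case, written as plain Java (no Hadoop dependency) so the parsing and aggregation can be tested locally before being ported into a Mapper/Reducer. The field positions (12 = credit score, 18 = property state) follow the layout above; the class and method names are just illustrative.

```java
import java.util.*;

public class StateLoanStats {
    static final int CREDIT_SCORE = 12; // per the acquisition file layout
    static final int STATE = 18;

    // Map phase: parse a pipe-delimited record into (state, creditScore).
    static Map.Entry<String, Integer> mapRecord(String line) {
        String[] f = line.split("\\|", -1);
        if (f.length < 22 || f[CREDIT_SCORE].isEmpty()) return null; // skip malformed rows
        return new AbstractMap.SimpleEntry<>(f[STATE], Integer.parseInt(f[CREDIT_SCORE]));
    }

    // Reduce phase: per state, count loans and track the minimum credit score.
    static Map<String, int[]> reduce(List<String> lines) {
        Map<String, int[]> stats = new HashMap<>(); // state -> {count, minScore}
        for (String line : lines) {
            Map.Entry<String, Integer> kv = mapRecord(line);
            if (kv == null) continue;
            stats.merge(kv.getKey(), new int[]{1, kv.getValue()},
                    (a, b) -> new int[]{a[0] + b[0], Math.min(a[1], b[1])});
        }
        return stats;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "100009503314|CORRESPONDENT|WELLS FARGO BANK, N.A.|3.75|320000|360|12/2011|02/2012|67|67|1|32|798|NO|PURCHASE|PUD|1|PRINCIPAL|CA|949||FRM",
            "100016880542|RETAIL|WELLS FARGO BANK, N.A.|4.25|417000|360|11/2011|01/2012|90|90|2|40|794|NO|PURCHASE|SF|1|PRINCIPAL|CA|956|25|FRM");
        for (Map.Entry<String, int[]> e : reduce(sample).entrySet())
            System.out.println(e.getKey() + " count=" + e.getValue()[0] + " minScore=" + e.getValue()[1]);
    }
}
```

In a real job, mapRecord would live in the Mapper (emitting the state as the key) and the count/min logic in the Reducer; the same skeleton extends to other groupings such as seller, channel, or loan purpose.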

Related

PBCS: Custom rules to aggregate Period members

I have an input account (never share) in which the user types a parameter for each month. I want to aggregate it into the upper members of the Period dimension; for example, on YearTotal, the value should be the weighted average between two other accounts representing the cost and the quantity.
With the account properties I can roll up my account as an addition or as a simple average between months; obviously, in both cases I get wrong data.
Does anyone know a solution to my question?
Thanks a lot,
Daniele
Not sure exactly what you are asking, but I assume the following in my answer:
data entry by the user on account "Parameter" (from the context, I think it is a price)
data entry by the user on level-0 Period, i.e. the months
you want Essbase to show the Parameter value as typed in at the month level (Jan .. Dec)
you want Essbase to show Costs / Quantity for Q1/Q2/Q3/Q4 and for YearTotal
the Account and Period dimensions are dense
You did not specify whether you are also reporting on YTD values and how you have implemented this in Essbase. I assume you do; the preferred solution depends on that implementation, so I take the "safe" solution here:
solution 1
This is the most straightforward solution:
Implement a "parameter_inp" account on which the user keys in the data. Set the account to "never consolidate".
Create a new "parameter" account, dynamic calc, and give it the formula "Costs/Quantity;".
Refer to "parameter" in your reports, and to "parameter_inp" for user entry.
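As a sketch (member names assumed from the description above; adjust to your outline), the member formula on the dynamic-calc "parameter" account would simply be:

```
/* "parameter": dynamic calc. At Quarter and YearTotal the ratio is */
/* evaluated on the already-aggregated Costs and Quantity, which    */
/* yields the weighted average automatically.                       */
"Costs" / "Quantity";
```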
solution 2 - alternative
If you have a lot of these parameters, you'll end up with a system that is unpleasant for data entry and reporting for the end users. To solve it with data entry and reporting on the same "parameter" account, you need to tune your implementation of the Quarter and YearTotal calculations, including the YTD calculation. I see no way of getting this correct if you are using DTS.
This is the way to go forward:
Make use of a new dimension called "View", data entry on PER (= periodic), additional dynamic calc member "YTD", density: dense, place it after Period (so Account, Period, View)
Add a UDA to the "parameter", for example "WA"
Set custom dynamic calculations on the Quarter and YearTotal levels, something like: IF (#ISUDA("WA")) THEN ... ELSEIF <check on FLOW/BALANCE> ... (logic for the regular aggregation of FLOW and BALANCE items, hereby overriding Essbase's native time logic)
Set custom dynamic calculations for YTD (overriding DTS), and make an exception for UDA "WA"

Use case for NER libraries

Hi, I am exploring NER libraries to parse through some financial documents: company filings, prospectuses, etc.
These documents contain information like the company name, some keywords, and a value associated with them.
I would like to tag and extract these as 3 different entities.
So say, for instance, I have a phrase or sentence that reads:
ABC corp submitted the following on 1/1/2017 ...We are offering $300,000,000 aggregate principal amount of Floating Rate Notes due 2014 (the “2014 Floating Rate Notes”), $400,000,000 aggregate principal amount of 2.100% Notes due 2014 (the “2014 Fixed Rate Notes”), $400,000,000 aggregate principal amount of 3.100% Notes due 2016 (the “2016 Notes”), and $400,000,000 aggregate principal amount of 4.625% Notes due 2021 (the “2021 Notes”).
I would like to tag ABC corp as the organization,
the aggregate principal amount as the keyword, and
$400,000,000 as the number value.
I tried running some samples through http://corenlp.run/ and it works great for the amounts, the keywords, and the dates; however, the organization name is not always tagged. Is this the standard use case for NER? Any idea why that might be happening for the organization name?
Yes, the NER model should tag organizations in text. Note that the model was trained on sentences that are different from your data, so performance will drop. Also, the model does not have 100% recall, so it will make mistakes from time to time.

Column slicer in Power BI report?

I have a requirement to build my report with local languages. I have three description columns in my table and need to show one at a time based on user input.
Example:
CustName | Product | English_Description | Swedish_Description
My table has 5 million records, so I can't un-pivot the description columns. If I do, my table becomes 10 million records, which is not feasible.
Some sample data would be useful. However, you could do a disconnected (or parameter) table for the language selection:
Language
--------
English
Swedish
This table wouldn't be related to anything else, but you could then use measures for your product descriptions such as:
Multi-lingual Description =
IF (
    SELECTEDVALUE ( 'Disconnected Table'[Language] ) = "Swedish",
    MAX ( [Swedish_Description] ),
    MAX ( [English_Description] )
)
With this logic, if no language is picked, the English description will be used. You could use different behavior too (for example use HASONEVALUE to ensure a single value is selected, and display an error message if not).
MAX in the measure is there because a measure has to aggregate; however, as long as your table has one product per row, the MAX of the description will be the description you expect. Having more than one product per row doesn't make sense, so this should be an acceptable limitation. Again, to make your measure more robust, you could build in logic using HASONEVALUE to show BLANK() or an error message if there is more than one product (e.g. for subtotals).
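For instance, a more defensive variant of the measure (same assumed table and column names as above) could look like:

```dax
Multi-lingual Description (strict) =
IF (
    HASONEVALUE ( 'Disconnected Table'[Language] ),
    IF (
        VALUES ( 'Disconnected Table'[Language] ) = "Swedish",
        MAX ( [Swedish_Description] ),
        MAX ( [English_Description] )
    ),
    "Please select a single language"
)
```

Here the bare VALUES comparison is safe because HASONEVALUE has already guaranteed a single selected language.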
More information:
HASONEVALUE: https://msdn.microsoft.com/en-us/library/gg492190.aspx
Disconnected Tables: http://www.daxpatterns.com/parameter-table/

Creating DAX peer measure

The scenario:
We are an insurance brokerage company. Our fact table is the claim metrics current table. This table has unique rows for multiple claim SIDs, so COUNTROWS(claim current) gives the correct count of unique claims. The table also has ClientSID and IndustrySID. The relation between client and industry is that one industry can have multiple clients, while one client can belong to only one industry.
Now, let us consider a fact called ClaimLagDays, which is present in the table at the granularity of ClaimSID.
One requirement is that we need to find the "peer" SUM(ClaimLagDays). For a particular client, this is calculated as:
SUM(ClaimLagDays) for the industry of the client being filtered, minus SUM(ClaimLagDays) for this particular client. Let's call this measure A.
Similarly, we need to calculate the "peer" claim count, which is the claim count for the industry of the client being filtered, minus the claim count for this particular client.
Let's call this measure B.
In the final calculation, we need to divide A by B to get the "peer" average lag days.
So basically, the hard part is this: find the industry of the particular client being filtered for, then apply that filter to the fact table (claim metrics current) to find the total claim count/other metric for this industry only. Then, of course, subtract the client figure from the industry figure to get the "peer" measure. This has to be done for each row, keeping intact any other filters which might be applied in the slicers (date/business unit, etc.).
There are also a couple of other, static filters which need to be considered, present in other tables, such as Claim Type (= Indemnity/Medical) and Claim Status (= Closed).
My solution:
For measure B
I tried creating a calculated column, as:
Claim Count_WC_MO_Industry =
COUNTROWS (
    FILTER (
        FILTER (
            'Claim Metrics Current',
            RELATED ( 'Claim WC'[WC Claim Type] ) = "Medical"
                && RELATED ( 'Coverage'[Coverage Code] ) = "WC"
                && RELATED ( 'Claim Status'[Status Code] ) = "CL"
        ),
        EARLIER ( 'Claim Metrics Current'[IndustrySID] ) = 'Claim Metrics Current'[IndustrySID]
    )
)
Then I created the measure
Claim Count - WC MO Peer :=
CALCULATE ( SUM ( [Claim Count_WC_MO_Industry] ) / [Claim - Count] ) - [Claim - Count WC MO]
{I did a SUM because the tabular model doesn't directly allow me to use a calculated column as a measure without an aggregation. That also wouldn't make any sense, since the tabular model wouldn't understand which row to take.}
The second part of the above measure is obviously, the claim count of the particular client, with the above-mentioned filters.
Problem with my solution:
The figures are all wrong. I am not getting a client-wise or year-wise separation of the industry counts or the peer counts; I am only getting a sum of all the industry counts in the measure.
My suspicion is that this is happening because of the SUM. However, I don't really have a choice, do I, as I can't use a calculated column as a measure without some aggregation...
Please let me know if you think the information provided here is not sufficient and if you'd like me to furnish some data (dummy). I would be glad to help.
So, assuming that you are filtering for the specific client via a front end, it sounds like you just want:
ClientLagDays :=
CALCULATE (
    SUM ( 'Claim Metrics Current'[Lag Days] ),
    Static Filters Here
)
Just your base measure of appropriate client lag days, including your static filters.
IndustryLagDays :=
CALCULATE (
    [ClientLagDays],
    ALL ( 'Claim Metrics Current'[Client] ),
    VALUES ( 'Claim Metrics Current'[IndustrySID] )
)
This removes the filter on client but retains the filter on Industry to get the industry-wide total of lag days.
PeerLagDays:=[IndustryLagDays]-[ClientLagDays]
Straightforward enough.
And then repeat for claim counts, and then take [PeerLagDays] / [PeerClaimCount] for your [Average Peer Lag Days].
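A sketch of the claim-count half and the final ratio (measure names assumed to mirror those above; DIVIDE guards against a client being the only one in its industry):

```dax
ClientClaimCount :=
CALCULATE (
    COUNTROWS ( 'Claim Metrics Current' ),
    Static Filters Here
)

IndustryClaimCount :=
CALCULATE (
    [ClientClaimCount],
    ALL ( 'Claim Metrics Current'[Client] ),
    VALUES ( 'Claim Metrics Current'[IndustrySID] )
)

PeerClaimCount := [IndustryClaimCount] - [ClientClaimCount]

Average Peer Lag Days := DIVIDE ( [PeerLagDays], [PeerClaimCount] )
```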

Business Objects XI Web Intelligence Aggregation Issue (11.5.8.897)

I've got a multiple-tabbed report. On one tab I have the details listed, and on another I have a summary-table (cross-reference) type of aggregation based on the same dimensions used in the detail report. I've created a calculated field that takes the product of two measures, and I've saved this as a variable. When I try to aggregate that variable on the summary report, BOWI is not calculating correctly. Example:
QTY * PRICE = LineTotal
2 * 3 = 6
4 * 3 = 12
TotalOrder = $18
Calculates correctly on the detail report.
When I put this on the aggregate report it is doing the following:
Sum QTY * Sum Price = Total, in other words it is doing
6 * 6 = $36.
My totals on the aggregate are highly inflated. Firstly, in what world does that order of precedence make sense? And secondly, how can I tell BOWI to sum the TotalOrders instead of breaking the total back up into its components, summing those, and then multiplying?
Is it a bug?
Further Information
The detail report is Sectioned by Year, Region, State -> Detail lines
The summary report is dimensioned by Year, Region, State
The (QTY * PRICE) component is saved as a variable and utilized in both places.
Am I missing the secret handshake somewhere? Is it that calculated fields/variables can't be aggregated in the report and the aggregation needs to be done in the Universe?
I haven't worked with WEBI for a while; I mainly develop DESKI reports. However, what you describe sounds similar to the aggregation that occurs in DESKI. If a measure is set to sum aggregation, it will add together all the measure values that relate to the dimensions added to the report.
For example, if, like my detail reports, you have the columns order number, qty, and price, then the aggregation will sum both qty and price at the order level, which is correct. Moving to the summary table, however, this causes incorrect data.
To remove the aggregation you can change it from SUM to NONE; I can't recall off the top of my head how to do this in WEBI (if you can't find out, I can check my course materials, which are at work).
Alternatively, it is best to ensure the dimensions on your summary report are suitable for the data presented.
If you need any further information please let me know.
Matt
