Hi, I am exploring NER libraries to parse financial documents such as company filings and prospectuses.
These documents contain information such as the company name, some keywords, and a value associated with each keyword.
I would like to tag and extract these as 3 different entities.
For instance, say I have a sentence that reads:
ABC corp submitted the following on 1/1/2017 ...We are offering $300,000,000 aggregate principal amount of Floating Rate Notes due 2014 (the “2014 Floating Rate Notes”), $400,000,000 aggregate principal amount of 2.100% Notes due 2014 (the “2014 Fixed Rate Notes”), $400,000,000 aggregate principal amount of 3.100% Notes due 2016 (the “2016 Notes”), and $400,000,000 aggregate principal amount of 4.625% Notes due 2021 (the “2021 Notes”).
I would like to tag ABC corp as the organization,
"aggregate principal amount" as the keyword, and
$400,000,000 as the numeric value.
I tried running some samples through http://corenlp.run/ and it works great for the amounts, the keywords and the dates. However, the organization name is not always tagged. Is this a standard use case for NER, and any idea why the organization name might be missed?
Yes the NER model should tag organizations in text. Note that the model was trained on sentences that are different from your data, so performance will drop. Also, the model does not have 100% recall so it will make mistakes from time to time.
I am writing a thesis on Airbnb's presence in Ireland and its effect on house prices. I've downloaded data from InsideAirbnb (.CSV), which describes each Airbnb host and house on a monthly basis. Each host has a unique host_id, each house has a unique house_id, and each host can have multiple house_id's.
Because these are monthly statistics, the same users are recorded each month, which causes duplicates when the tables are merged. The duplicates have exactly the same column values except for the date (written in the format mmm yyyy) and the Row_ID.
I'm not sure how to handle this data as obviously it is inaccurate due to the duplicated data. Is there a way to group the data based on the date, or should I have an array of date values in a single column for each? Any suggestions would be greatly appreciated.
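To make the second option concrete (one row per listing, with the dates collected in a single column), here is a minimal sketch in plain Python. The sample rows and the price column are made up for illustration; only host_id, house_id, Row_ID and the date column come from the description above.

```python
# Hypothetical sample rows; "price" is an assumed column, the values are invented.
rows = [
    {"Row_ID": 1, "host_id": 10, "house_id": 100, "price": 80, "date": "Jan 2019"},
    {"Row_ID": 2, "host_id": 10, "house_id": 100, "price": 80, "date": "Feb 2019"},
    {"Row_ID": 3, "host_id": 10, "house_id": 101, "price": 120, "date": "Jan 2019"},
]


def collapse_months(rows, ignore=("Row_ID", "date")):
    """Merge rows that are identical apart from the ignored columns,
    collecting the dates of the merged rows into one list."""
    merged = {}
    for row in rows:
        # Rows count as duplicates when every non-ignored column matches.
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        if key not in merged:
            record = {k: v for k, v in row.items() if k not in ignore}
            record["dates"] = []
            merged[key] = record
        merged[key]["dates"].append(row["date"])
    return list(merged.values())


listings = collapse_months(rows)
for listing in listings:
    print(listing)
```

Note that a listing whose values change between months (a new price, say) stays on separate rows here, which may or may not be what the analysis needs.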
I have an input account (never share) in which the user types a parameter for each month. On the aggregate members of the Period dimension, for example YearTotal, I want the value to be the weighted average derived from two other accounts that represent the cost and the quantity.
With the account properties I can roll up my account either by addition or as a simple average between months, and obviously I get wrong data in both cases.
Does anyone know a solution to my question?
Thanks a lot,
Daniele
Not sure exactly what you are asking. But I assume the following in my answer:
data entry for user on account Parameter (from the context, I think it is a price)
data entry for user on level0 Period, i.e. the months
you want Essbase to show the Parameter value as typed in at the month level (Jan .. Dec)
you want Essbase to show Costs / Quantity for Q1/2/3/4 and the YearTotal
the Account and Period dimensions are both dense
You did not specify if you are also reporting on YTD values and how you have implemented this in Essbase. I assume you do, but the preferred solution depends on how you have implemented this, so I take the "safe" solution here:
solution 1
This is the most straightforward solution:
Implement a "parameter_inp" account on which the user keys in the data. Set the account to "never consolidate".
Create a new "parameter" account, dynamic calc, and give it the formula "Costs/Quantity;".
Refer to "parameter" in your reports, and to "parameter_inp" for user entry
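To see why the "Costs/Quantity" formula gives the right number at quarter level while a sum or simple average of the monthly parameter does not, here is a small worked example in Python. The monthly figures are invented, and the sketch assumes the value keyed in each month equals that month's cost divided by its quantity.

```python
# Hypothetical monthly figures for one quarter; the numbers are made up.
costs = {"Jan": 100.0, "Feb": 300.0, "Mar": 200.0}
quantity = {"Jan": 10.0, "Feb": 20.0, "Mar": 40.0}

# Month level: the parameter as keyed in (assumed equal to cost / quantity).
parameter_inp = {month: costs[month] / quantity[month] for month in costs}

# The two roll-ups available through account properties alone.
summed = sum(parameter_inp.values())
simple_average = summed / len(parameter_inp)

# The "Costs/Quantity" formula evaluated on the aggregated accounts,
# which is the quantity-weighted average of the monthly parameter.
weighted_average = sum(costs.values()) / sum(quantity.values())

print("sum:", summed)
print("simple average:", simple_average)
print("weighted average:", round(weighted_average, 4))
```

The simple average treats March (40 units at 5.0) the same as January (10 units at 10.0), which is why it overstates the quarter value.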
solution 2 - alternative
If you have a lot of these parameters, you'll end up with a system making it unpleasant for data entry and reporting for the end-users. To solve it using data entry and reporting on the same "parameter" account, you need to tune your implementation for Quarter and YearTotal calculation, including the YTD calculation. I see no way of getting this correct if you are using DTS.
This is the way to go forward:
Make use of a new dimension called "View", data entry on PER (= periodic), additional dynamic calc member "YTD", density: dense, place it after Period (so Account, Period, View)
Add a UDA to the "parameter", for example "WA"
Set custom dynamic calculations on Quarter and YearTotal level, something like: IF (@ISUDA(Account, "WA")) ... ELSEIF (<check on FLOW/BALANCE>) ... ENDIF, with the logic for regular aggregation of FLOW and BALANCE items in the other branches (thereby overriding Essbase's native time logic)
Set custom dynamic calculations for YTD (overriding DTS), and make an exception for UDA "WA"
For a certain program I have some type-keywords values like this:
Program  Type    Keyword
PIM      Kind    Additional
PIM      Period  Education
PIM      Phase   Specialized
PIM      Skills  Professional
The type is a fixed value, but the keyword depends on the Program and the Type. I want to transpose this result in analytics by making 4 columns, one per type. The result has to look like this:
Program  Kind        Period     Phase        Skills
PIM      Additional  Education  Specialized  Professional
I have tried by editing the column formula and putting this formula:
CASE WHEN "Type"='Partial period' THEN "Keyword" END
and so on for each type, but it doesn't give me the result I want: all the new columns are empty.
I also tried a pivot table, but the keyword isn't a measure, so I don't think this will work.
Can someone help?
This simply doesn't make sense in an analytical way. You have no fact, nothing you measure. So no chance of using FILTER...USING... for example.
Don't forget you're not in Excel or a drawing tool. You're in an analytics tool which tries to make sense out of data and not "show data in a weird way".
You have to model things nicely either in the data source itself or be clever in the construction of your RPD.
It's doable in the RPD but it will be quite static and if the list of values changes you will have to adapt it.
tl;dr - garbage data, garbage result
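If the reshaping is done in the data source rather than in the analytics layer, the transformation itself is small. Below is a rough Python sketch of it, using the rows from the question. The function name and the fixed TYPES list are illustrative only, and the hard-coded list is exactly the "static" part that would need adapting if the types change.

```python
# Rows as they appear in the question: (Program, Type, Keyword).
rows = [
    ("PIM", "Kind", "Additional"),
    ("PIM", "Period", "Education"),
    ("PIM", "Phase", "Specialized"),
    ("PIM", "Skills", "Professional"),
]

# Fixed list of types; each one becomes a column (hypothetical constant).
TYPES = ["Kind", "Period", "Phase", "Skills"]


def pivot(rows, types=TYPES):
    """Return one row per program: [program, keyword for each type]."""
    by_program = {}
    for program, type_, keyword in rows:
        # Start every program with an empty slot (None) per type.
        columns = by_program.setdefault(program, dict.fromkeys(types))
        columns[type_] = keyword
    return [[program] + [columns[t] for t in types]
            for program, columns in by_program.items()]


print(["Program"] + TYPES)
for line in pivot(rows):
    print(line)
```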
I have some important data sets for loan acquisitions in a TXT file.
Note: data is available to me for Q1-Q4 of the last 3 years.
Also, please find the field description of each column of the Acquisitions file in the image file.
1) Can you please help me come up with some valid, logical business use cases that I can implement using HDFS and MapReduce in Java?
Most of the sample use cases I can find are about word count and weather data analysis.
To get the data file, just sign in.
Data file:
Link: loanperformancedata.fanniemae.com/lppub-docs/acquisition-sample-file.txt
100009503314|CORRESPONDENT|WELLS FARGO BANK, N.A.|3.75|320000|360|12/2011|02/2012|67|67|1|32|798|NO|PURCHASE|PUD|1|PRINCIPAL|CA|949||FRM
100010175842|RETAIL|OTHER|3.875|255000|360|02/2012|04/2012|73|73|1|49|778|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|OH|432||FRM
100013227768|BROKER|FLAGSTAR CAPITAL MARKETS CORPORATION|3.875|415000|360|12/2011|03/2012|46|46|2|21|780|NO|NO CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|223||FRM
100016880542|RETAIL|WELLS FARGO BANK, N.A.|4.25|417000|360|11/2011|012012|90|90|2|40|794|NO|PURCHASE|SF|1|PRINCIPAL|CA|956|25|FRM
2) Column description of the data
Link: loanperformancedata.fanniemae.com/lppub-docs/lppub_file_layout.pdf
LOAN IDENTIFIER
CHANNEL
SELLER NAME
ORIGINAL INTEREST RATE
ORIGINAL UNPAID PRINCIPAL BALANCE (UPB)
ORIGINAL LOAN TERM
ORIGINATION DATE
FIRST PAYMENT DATE
ORIGINAL LOAN-TO-VALUE (LTV)
ORIGINAL COMBINED LOAN-TO-VALUE (CLTV)
NUMBER OF BORROWERS
DEBT-TO-INCOME RATIO
CREDIT SCORE
FIRST-TIME HOME BUYER INDICATOR
LOAN PURPOSE
PROPERTY TYPE
NUMBER OF UNITS
OCCUPANCY STATUS
PROPERTY STATE
ZIP (3-DIGIT)
MORTGAGE INSURANCE PERCENTAGE
PRODUCT TYPE
Glossary link: loanperformancedata.fanniemae.com/lppub-docs/lppub_glossary.pdf
Please help me build some valid business use cases and a Java program to implement them.
Most of the Hadoop examples out there are weather and word count examples :(
You can do simple filtering and aggregation, for example counting the loans and finding the minimum credit score per state, to identify the states with the most loans or the lowest scores. That may give some insight into problems in loan approval in places where the default rate is much higher.
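The question asks for Java MapReduce, so treat the following as a plain Python sketch of the map and reduce steps only, not as Hadoop code. It runs on the four sample rows quoted in the question. The column positions are taken from the column list above (credit score is the 13th field, property state the 19th), and the function names are made up.

```python
from collections import defaultdict

# The four sample rows from the question, copied unchanged.
SAMPLE = """\
100009503314|CORRESPONDENT|WELLS FARGO BANK, N.A.|3.75|320000|360|12/2011|02/2012|67|67|1|32|798|NO|PURCHASE|PUD|1|PRINCIPAL|CA|949||FRM
100010175842|RETAIL|OTHER|3.875|255000|360|02/2012|04/2012|73|73|1|49|778|NO|NO CASH-OUT REFINANCE|SF|1|PRINCIPAL|OH|432||FRM
100013227768|BROKER|FLAGSTAR CAPITAL MARKETS CORPORATION|3.875|415000|360|12/2011|03/2012|46|46|2|21|780|NO|NO CASH-OUT REFINANCE|PUD|1|PRINCIPAL|VA|223||FRM
100016880542|RETAIL|WELLS FARGO BANK, N.A.|4.25|417000|360|11/2011|012012|90|90|2|40|794|NO|PURCHASE|SF|1|PRINCIPAL|CA|956|25|FRM
"""

CREDIT_SCORE, STATE = 12, 18  # zero-based positions in the pipe-delimited row


def map_line(line):
    """Map step: emit (state, credit score or None) for one loan."""
    fields = line.split("|")
    score = fields[CREDIT_SCORE]
    return fields[STATE], (int(score) if score else None)


def reduce_by_state(pairs):
    """Reduce step: loan count and minimum credit score per state."""
    stats = defaultdict(lambda: {"loans": 0, "min_score": None})
    for state, score in pairs:
        entry = stats[state]
        entry["loans"] += 1
        # Loans with a missing score are counted but ignored for the minimum.
        if score is not None and (entry["min_score"] is None
                                  or score < entry["min_score"]):
            entry["min_score"] = score
    return dict(stats)


stats = reduce_by_state(map_line(line) for line in SAMPLE.splitlines())
for state, entry in sorted(stats.items()):
    print(state, entry)
```

In a real job the map step would run per input split and the framework would group the pairs by state before the reduce step; here both run in one process.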
I am working on a project that has some inputs like task type and frequency.
For example:
If task type = Daily and frequency = 2, then create 5 tasks, one every alternate day.
If task type = Daily and frequency = 3, then create 5 tasks, on today, the 3rd day, the 6th day and so on.
If task type = Weekly and frequency = 2, then create 2 tasks, one every alternate week.
Moreover, I have a calendar table, so I need to check for working days: if a date falls on a weekend, the task should be generated on the next working day. I have a calendar_Holidays table as well, so those dates need to be checked and skipped too.
Can I use a design pattern for this problem? Can somebody show me how?
You might be interested in a paper about recurring events in calendars by Martin Fowler. It describes a few very interesting techniques and patterns to use when dealing with scheduling events.
You are trying to apply design patterns too early in your implementation. Design patterns help when you have identified important classes and want to adjust their relationships, for example to reduce coupling and enable extension.
Here you have not yet identified any classes, and I would say that you haven't even got your requirements completely clear. On which days will weekly tasks be generated? What will you do for alternate days when you have a long weekend? In the UK we could have Friday as a public holiday, Saturday and Sunday as the weekend, and Monday as a public holiday: what's your rule then? Can you have monthly events? What other intervals? Again, in the UK we pay council tax monthly, but not in February and March; do you have cases like that?
So I'd recommend first getting the corner cases of your requirements very clear. Then produce a natural OO design, and then look to see which patterns may help.
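Once the rules are pinned down, the date generation itself is short. Here is a minimal Python sketch under assumptions that the question does not confirm: "frequency" means one task every N days (or weeks) counted from the start date, weekends are Saturday and Sunday, a date that lands on a weekend or holiday moves forward to the next working day, and the underlying schedule does not shift when that happens. The function names are made up, and the holiday set stands in for the calendar_Holidays table.

```python
from datetime import date, timedelta


def next_working_day(day, holidays):
    """Move forward until the day is neither a weekend nor a holiday."""
    while day.weekday() >= 5 or day in holidays:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day


def task_dates(start, task_type, frequency, count, holidays=frozenset()):
    """Return `count` task dates, one every `frequency` days or weeks."""
    step_days = {"Daily": 1, "Weekly": 7}[task_type] * frequency
    dates = []
    for i in range(count):
        nominal = start + timedelta(days=i * step_days)
        dates.append(next_working_day(nominal, holidays))
    return dates


# 1 January 2024 is a Monday; here it is also treated as a holiday.
holidays = {date(2024, 1, 1)}
daily = task_dates(date(2024, 1, 1), "Daily", 2, 5, holidays)
weekly = task_dates(date(2024, 1, 1), "Weekly", 2, 2)
print(daily)
print(weekly)
```

With a frequency of 1 this sketch would place Saturday's, Sunday's and Monday's tasks all on the same Monday, which is one of the corner cases the requirements need to settle.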