Updating Records in FileNet using CSV

I have a CSV file with an unknown number of rows:
id,name,title,salary,time
123,abc,manager,10000,12:30
456,xyz,s manager,15000,13:45
789,tuv,junior,5000,09:15
123,abc,manager,10000,14:15
123,abc,manager,10000,15:35
Notice that above, I have 3 duplicate records with id=123 and salary=10000.
In FileNet I have below records:
id,name,title,salary,status,sequence,time
123,abc,manager,10000,success,1,0
123,abc,manager,10000,failure,2,0
123,abc,manager,10000,failure,3,0
789,tuv,junior,5000,failure,1,0
Notice that above I have 3 duplicates, one with success and 2 with failure statuses.
My requirement is to compare the rows of my CSV file one by one, in order, against the FileNet records ordered by sequence. If a row matches a record in FileNet on id and salary (the lookup is done using the id and salary fields), I need to update that record's time from the CSV row and set its status to success.
E.g. the row (123,abc,manager,10000,12:30) in the CSV file above matches the record (123,abc,manager,10000,failure,2,0) in FileNet.
The end result in FileNet should be:
id,name,title,salary,status,sequence,time
123,abc,manager,10000,success,1,0
123,abc,manager,10000,success,2,12:30
123,abc,manager,10000,success,3,14:15
789,tuv,junior,5000,success,1,09:15
Note:
The first CSV row (123,abc,manager,10000,12:30) should update the FileNet record (123,abc,manager,10000,failure,2,0).
The fourth row (123,abc,manager,10000,14:15) should update the FileNet record (123,abc,manager,10000,failure,3,0).
The third row (789,tuv,junior,5000,09:15) should update the FileNet record (789,tuv,junior,5000,failure,1,0).
Also note that the last row (123,abc,manager,10000,15:35) in the CSV file will not update any record in FileNet, because the records are updated sequentially and no unmatched record remains for that id and salary.
I hope the requirement is clear. Please help as I'm a FileNet newbie.

You should implement your algorithm in Java, reading the CSV and using the P8 Content Engine Java API to do the comparisons and updates to FileNet.
If you get stuck using the FileNet Java API, this may help.
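To give you a starting point, below is a rough, untested sketch of that approach: read the CSV row by row, and for each row query for the lowest-sequence matching record that is still marked failure, then flip it to success and stamp the CSV time on it. The class name EmployeeRecord, the property names EmpId, Salary, Status, SequenceNo and RecordTime, the CE URI, the object store name and the file path are all assumptions; substitute the symbolic names from your own data model. It also assumes the records are stored as documents of a custom class, with Salary as a numeric property and the time stored as a string.

import com.filenet.api.collection.IndependentObjectSet;
import com.filenet.api.constants.RefreshMode;
import com.filenet.api.core.Connection;
import com.filenet.api.core.Document;
import com.filenet.api.core.Domain;
import com.filenet.api.core.Factory;
import com.filenet.api.core.ObjectStore;
import com.filenet.api.query.SearchSQL;
import com.filenet.api.query.SearchScope;
import com.filenet.api.util.UserContext;

import javax.security.auth.Subject;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Iterator;

public class CsvRecordUpdater {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- replace with your own CE URI, credentials and object store name.
        Connection conn = Factory.Connection.getConnection("http://ceserver:9080/wsi/FNCEWS40MTOM/");
        Subject subject = UserContext.createSubject(conn, "ceUser", "cePassword", null);
        UserContext.get().pushSubject(subject);
        try {
            Domain domain = Factory.Domain.fetchInstance(conn, null, null);
            ObjectStore os = Factory.ObjectStore.fetchInstance(domain, "TargetOS", null);

            try (BufferedReader reader = new BufferedReader(new FileReader("records.csv"))) {
                String line = reader.readLine();                 // skip the CSV header line
                while ((line = reader.readLine()) != null) {
                    // Naive split -- use a real CSV parser if fields can contain commas or quotes.
                    String[] cols = line.split(",");
                    updateFirstPendingRecord(os, cols[0].trim(), cols[3].trim(), cols[4].trim());
                }
            }
        } finally {
            UserContext.get().popSubject();
        }
    }

    // Finds the lowest-sequence record with the given id and salary that is still marked
    // 'failure', and updates its status and time. CSV rows with no pending match are skipped.
    private static void updateFirstPendingRecord(ObjectStore os, String id, String salary, String time) {
        String query = "SELECT Id FROM EmployeeRecord"
                + " WHERE EmpId = '" + id + "' AND Salary = " + salary
                + " AND Status = 'failure' ORDER BY SequenceNo";
        SearchScope scope = new SearchScope(os);
        IndependentObjectSet results = scope.fetchObjects(new SearchSQL(query), null, null, Boolean.FALSE);
        Iterator<?> it = results.iterator();
        if (it.hasNext()) {
            Document doc = (Document) it.next();
            doc.getProperties().putValue("Status", "success");
            doc.getProperties().putValue("RecordTime", time);
            doc.save(RefreshMode.NO_REFRESH);
        }
    }
}

Because each matched record is set to success immediately, the next CSV row with the same id and salary naturally falls through to the next sequence number, and a surplus row (like the last 123/10000 one) simply finds nothing left to update.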

Related

Performance in Elasticsearch

I am just getting started with Elasticsearch.
I have two cases of related data in a relational database, and in both cases I want to find the records from the first (master) table as quickly as possible.
Case 1: tables bound 1:n (example: Invoice - Invoice items)
Should I save the data to Elasticsearch as all rows from the slave table, or as one document per master_id with all the slave data grouped into a single string?
Case 2: tables bound n:1 (example: Invoice - Customer)
Should I save the data as in case 1 into an independent index, or add further columns to the previous index?
The problem is that sometimes I only need to search for records that contain a specific invoice item, sometimes a specific customer, and sometimes both an invoice item and a customer.
Should I create one index containing all the data, or all 3 variants?
Another question: is it possible to somehow speed up the search in Elasticsearch when the stored data is e.g. only an EAN (13-digit number) rather than plain text?
Thanks,
Jaroslav
You should denormalize and use a single index for all your data (invoices, items and customers) for the best performance. Although Elasticsearch supports joins and parent-child relationships, their performance is nowhere near what you get when all the data is part of a single index, and a quick benchmark test on your data will prove it easily.
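For illustration, here is a minimal sketch of indexing one denormalized invoice document with the Elasticsearch high-level REST client for Java (assuming a 7.x client); the index name "invoices" and all field names are made up for the example.

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.List;
import java.util.Map;

public class DenormalizedInvoiceIndexer {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // One document per invoice: the customer fields and the invoice items are
            // embedded directly, so a single query can filter on any combination of them.
            Map<String, Object> invoice = Map.of(
                    "invoiceNumber", "2018-0001",
                    "customerName", "Example Customer",
                    "items", List.of(
                            Map.of("ean", "8594001234567", "description", "Widget A"),
                            Map.of("ean", "8594007654321", "description", "Widget B")));

            IndexRequest request = new IndexRequest("invoices")
                    .id("2018-0001")
                    .source(invoice);
            client.index(request, RequestOptions.DEFAULT);
        }
    }
}

Regarding the EAN question: mapping the EAN as a keyword field instead of analysed text keeps it as a single exact term, which makes exact-match filters on it cheap.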

Delete last record from flat file in blob to Azure data warehouse

I have some pipe-delimited flat files in Blob storage, and in each file I have a header and a footer record with the filename, date of extract and the number of records. I am using an ADF pipeline with PolyBase to load into Azure DWH. I could skip the header record but am unable to skip the footer. The only way I could think of is creating a staging table with all varchar columns, loading into staging, and then converting the data types when loading into the main tables. But that is not working, as the footer has a different number of columns from the data. Is there any easier way to do this? Please advise.
PolyBase does not have an explicit option for removing footer rows, but it does have a set of rejection options which you could potentially take advantage of. If you set your REJECT_TYPE as VALUE (rather than PERCENTAGE) and your REJECT_VALUE as 1, you are telling PolyBase to reject one row only. If your footer is in a different format to the main data rows, it will be rejected but your query should not fail.
CREATE EXTERNAL TABLE yourTable
(
    ...
)
WITH (
    LOCATION = '...',
    DATA_SOURCE = yourDataSource,
    FILE_FORMAT = yourFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
);
Please post a simple, anonymised example of your file with headers, columns and footers if you need further assistance.
Update: Check this blog post for information on tracking rejected rows:
https://azure.microsoft.com/en-us/blog/load-confidently-with-sql-data-warehouse-polybase-rejected-row-location/

What happens when two updates for the same record come in one file while loading into the DB using Informatica

Suppose I have a table xyz:
id name add city act_flg start_dtm end_dtm
1 amit abc,z pune Y 21012018 null
and this table is loaded from a file Using Informatica using SCD2.
Suppose there is one file that contains two records with id=2, i.e.:
2 vipul abc,z mumbai
2 vipul asdf bangalore
So how will these be loaded into the DB?
It depends on how you're doing the SCD type 2. If you are using a lookup with a static cache, both records will be inserted with the end date as null.
The best approach in this scenario is to use a dynamic lookup cache and read your source data in such a way that the latest record is read last. This will ensure one record is expired with an end date and only one active record (i.e. end date is null) exists per id.
There are two possibilities, depending on what you mean. If you mean that you're pulling data from different source systems which sometimes have the same ids, then it's easy: just stamp both the natural key (i.e. the id) and a source-system value on the dimension row, along with the arbitrary surrogate key which is unique to your target table (this is a data warehousing basic, so read Kimball).
If you mean that you are somehow tracing real-time changes to a single record in the source system and writing those changes to the input files of your ETL job, then you need to agree with your client whether they're happy for you to aggregate the changes based on the timestamp of the change and just pick the most recent one, or to create 2 records, one with its expiry datetime set and the other still open (which is the standard SCD approach; again, read Kimball). A sketch of the first option follows below.
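For clarity, here is a minimal Java sketch of that first option, collapsing multiple changes to the most recent one per id by timestamp. The Change record and its fields are invented for the example; inside an Informatica mapping the equivalent is typically done with the dynamic lookup already mentioned, or by sorting on the key and timestamp before aggregating.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LatestChangePerId {

    // Minimal stand-in for one source row; field names are invented for the example.
    record Change(String id, String name, String city, long changeTimestamp) {}

    // Keeps only the most recent change per id, preserving first-seen id order.
    static List<Change> latestPerId(List<Change> changes) {
        Map<String, Change> latest = new LinkedHashMap<>();
        for (Change c : changes) {
            latest.merge(c.id(), c,
                    (old, neu) -> neu.changeTimestamp() >= old.changeTimestamp() ? neu : old);
        }
        return List.copyOf(latest.values());
    }

    public static void main(String[] args) {
        List<Change> input = List.of(
                new Change("2", "vipul", "mumbai", 1L),
                new Change("2", "vipul", "bangalore", 2L));
        System.out.println(latestPerId(input));   // only the bangalore row survives
    }
}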

Calculate age in Dynamics CRM

So there are a couple of similar questions, but all are using javascript, which isn't ideal as it requires the record to be opened / saved.
So, how can you calculate age based on birthdate? There are 200,000 records this would need to be done on, and it's using CRM 2015, so it can involve calculated fields as well.
It's going to be reported on in the background, so we can't use JavaScript.
Workflows are a possibility, but running them on 200,000 records daily isn't exactly elegant!
Any other suggestions?
I've come across this requirement a number of times.
I've solved it by writing a Scribe or SSIS job which runs nightly and updates the Contact.Age field.
In order to not update every Contact record with the calculated age (as most ages won't have changed), I've used one of the following:
For on-premise CRM (where I have SQL access to the database), I wrote a query to return:
contactid
contact age
contact DoB
calculated age (calculated column from DoB and getdate)
The Scribe or SSIS job would only update records where Contact.Age != CalculatedAge
For hosted CRM (where I don't have SQL access to the DB):
Add a field called 'Next birthday'
The Scribe/SSIS job would search for records where NextBirthday is null or prior to today. It would update the Age and NextBirthday field.
Both of these methods mean that if the nightly job doesn't run for whatever reason, then when it's next run it will catch up on any records that are now out of date.
http://blogs.msdn.com/b/crm/archive/2009/02/27/creating-a-birthday-contact-list.aspx has an example using a pre-plugin to populate the birth month, year, and day fields. This could be adapted to instead perform the calculation to populate an age field. That being said, this would only work for new records or records that are changed.
If you wanted to do this via workflow, you'd have to have a workflow assembly to perform the calculation to populate an age field. As an alternative that doesn't require any code, you could create an Advanced Find query for all birthdays in a certain time frame, e.g. "Birthday on or before 2/17/1975" (this should limit the number of records returned and reduce it from the total of 200,000). Include the birthday and a new Age field in the columns shown. I simply created the age field as a text field with a size of 5 characters, since I only intend to store in it the number of years old someone is. Export the contacts to Excel, marking the options "Static worksheet with records from all pages in the current view" and "Make this data available for re-importing by including required column headings". Make sure to include the Owner column in order to prevent reassigning all these records to yourself when reimporting them.
Then in Excel, create a formula like the following in the Age column, assuming the birthdate field is in column L: "=DATEDIF($L2,NOW(),"y")". This fills the age field with how many years old someone is as of the current date. Note that you might have to perform this calculation in a separate column and copy in just the values, to ensure that Excel does not change the data type of the Age column; otherwise you will not be able to import that data back into Microsoft CRM. Fill that formula down so all records are updated, and save the file. Then in Microsoft CRM, import these records by pointing to the updated XML file (Excel 2003 XML format). Your only restrictions are the size of the import file (CRM limits this to 8 MB per file) and the 10,000-record limit on the export, which is another reason to break up the records you are exporting for reimport.
If you do update these via a workflow, you can update more than 250 at a time using a solution like the CRM 2013 Bulk Workflow tool for XRM Toolbox (http://www.zero2ten.com/blog/crm-2013-bulk-workflow-tool-for-xrmtoolbox/), which allows you to select a group of records using FetchXML as the criteria for the records to apply the workflow to. Note that this may take some time to process if you run it for all 200,000 records at once.
Ideally, my preference would be to have a plugin or JavaScript, but I can see with your requirements that you would need to have this run either daily or on a monthly basis (although I would not run it for all 200,000, since everyone's age does not change each day). Just choose the records that have birthdays in a particular month or on a particular date to run the workflows on, or to export and reimport, since that will be much less intensive for server processing and will complete much faster than having to update all 200,000 at a time.

Hive: How to have a derived column that stores the sentiment value from a sentiment analysis API

Here's the scenario:
Say you have a Hive table that stores Twitter data.
Say it has 5 columns, one column being the text data.
Now how do you add a 6th column that stores the sentiment value from the sentiment analysis of the Twitter text data? I plan to use a sentiment analysis API like Sentiment140 or Viralheat.
I would appreciate any tips on how to implement the "derived" column in Hive.
Thanks.
Unfortunately, while the Hive API lets you add a new column to your table (using ALTER TABLE foo ADD COLUMNS (bar binary)), those new columns will be NULL and cannot be populated. The only way to add data to these columns is to clear the table's rows and load data from a new file, this new file having that new column's data.
To answer your question: You can't, in Hive. To do what you propose, you would have to have a file with 6 columns, the 6th already containing the sentiment analysis data. This could then be loaded into your HDFS, and queried using Hive.
EDIT: Just tried an example where I exported the table as a .csv after adding the new column (see above), and popped that into M$ Excel where I was able to perform functions on the table values. After adding functions, I just saved and uploaded the .csv, and rebuilt the table from it. Not sure if this is helpful to you specifically (since it's not likely that sentiment analysis can be done in Excel), but may be of use to anyone else just wanting to have computed columns in Hive.
References:
https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-DDLOperations
http://comments.gmane.org/gmane.comp.java.hadoop.hive.user/6665
You can do this in two steps without a separate table. Steps:
Alter the original table to add the required column
Do an "overwrite table select" of all columns + your computed column from the original table into the original table.
Caveat: This has not been tested on a clustered installation.
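If it helps, those two steps can also be driven from Java over Hive JDBC. In the sketch below the table name tweets, its column list and the sentiment(text) UDF are assumptions; you would either register your own UDF that calls the sentiment API, or select the precomputed sentiment values from a staging table instead.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddSentimentColumn {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Step 1: add the derived column to the existing table (its values start out NULL).
            stmt.execute("ALTER TABLE tweets ADD COLUMNS (sentiment STRING)");

            // Step 2: rewrite the table, filling the new column. sentiment(text) stands in
            // for whatever UDF or staging-table join actually supplies the value.
            stmt.execute("INSERT OVERWRITE TABLE tweets "
                    + "SELECT id, user_name, created_at, lang, text, sentiment(text) FROM tweets");
        }
    }
}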
