Bulk Movement Jobs in FileNet 5.2.1

I have a requirement to move documents from one storage area to another, and I am planning to use Bulk Movement Jobs under Sweep Jobs in FileNet P8 v5.2.1.
My filter criterion is obviously (and only) the storage area id, as I want to target a specific storage area and move its content to another storage area (kind of like archiving) without altering the security, relationships, containment, document class, etc.
When I run the job, although I have around 100,000 objects in the storage area I am targeting, the examined objects field of the job shows 500M objects, and it took around 15 hours to move the objects. The DBA analyzed this situation and told me that, even though I have all the necessary indexes created on the DocVersion table (as per the FileNet documentation), the job still does a full table scan.
1. Why would something like this happen?
2. What additional indexes can be used, and how would they be helpful?
3. Is there a better way to do this with less time consumption?

This answers only questions 2 and 3.
For indexes, you can use this documentation: https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.0/com.ibm.p8.performance.doc/p8ppt237.htm
You can improve the performance of your jobs if you split the documents into batches via the "Policy controlled batch size" option (as I remember) on the "Sweeps subsystem" tab in the Domain settings.
Use Time Slot management
https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.1/com.ibm.p8.ce.admin.tasks.doc/p8pcc179.htm?lang=ru
and Filter Timelimit option
https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.1/com.ibm.p8.ce.admin.tasks.doc/p8pcc203.htm?lang=ru
In short, you just split all your documents into portions and process them in separate time slots and threads.

Related

Is Elasticsearch optimized for inserts?

I develop for a relatively large online store with a PHP backend, and it uses Elasticsearch for some things (like text search, logging, etc.).
Now, I'd like to start storing all kinds of information about user activity in ES. For instance, every page view (user enters a product page, a category page, etc.).
Is ES optimized for such a heavy load of continuous inserts, or should I consider some alternatives, like for instance having some sort of a buffer layer where I store all of my immediate inserts in memory, and then every minute or so, insert them into ES in bulk?
What is the industry standard? Or am I worrying in vain and ES is optimized for that?
Thanks.
Elasticsearch, when properly sized to handle your load, is definitely a valid alternative for such a use case.
You might decide, however, to store that streaming data into another cluster which is different from your production cluster, so as to not impact the health of the production cluster too much.
There are a lot of variables involved in arriving at the correct decision, and we don't have enough information here, but it's definitely a valid way to go.
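To illustrate the buffering idea from the question, here is a minimal sketch using the official Python client (the same pattern applies to the PHP client via the _bulk API). The index name "user-activity", the field names and the flush threshold are all assumptions, not anything prescribed by Elasticsearch:

# Minimal sketch of buffered bulk indexing with the official Python client.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

buffer = []  # in-memory buffer of pending activity events

def track_page_view(user_id, page):
    buffer.append({
        "_index": "user-activity",          # assumed index name
        "_source": {
            "user_id": user_id,
            "page": page,
            "@timestamp": datetime.now(timezone.utc).isoformat(),
        },
    })
    if len(buffer) >= 1000:                 # flush threshold; tune to your load
        flush()

def flush():
    if buffer:
        helpers.bulk(es, buffer)            # one _bulk request instead of N single inserts
        buffer.clear()

Whether you flush on a size threshold, a timer, or both is a tuning decision; the point is that one _bulk request carrying hundreds or thousands of documents is far cheaper than the same number of individual index requests.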

ELK indexes design according to development goal

I used Elastic in the past to analyze logs, but I don't have any experience with Elastic "architecture". I have an application that I deployed to multiple machines (200+). I want to connect to each machine and gather metadata like logs, metrics, DB stats and so on.
With that data I want to be able to:
Find problems in each machine and notify about them (finding problems requires joining data between different sources; for example, finding an exception in log1 requires me to go check the DB)
Analyze common issues across all machines and implement an ML model that will be able to predict issues.
I need to create indexes, and I thought about 2 options:
Create one index per machine, so all the data related to each machine will be available in its own index.
Create an index per data source. For example, all DB logs from all machines will be available in one dedicated index; another index will contain only data related to machine metrics (CPU/RAM usage, etc.).
What would be the best way to create those indexes?
Ok, now that I got a better understanding of your needs, here's my suggestion:
I strongly recommend not creating an index per machine. I don't know much about your use case(s), but I assume you want to search the data either in Kibana or by implementing search requests in your application.
Let's say you are interested in the RAM usage of every machine. You would need to execute 200 search requests against Elasticsearch, since the data (RAM usage) is spread over 200 indices (of course one could create aliases, but these would have to be updated for every new machine). Furthermore, you wouldn't be able to do basic aggregations like "which machine has the highest RAM usage?" in a convenient way. In my opinion there are plenty more disadvantages, like index management, shard allocation, etc.
So what's a better solution?
As you have already suggested, you should create an index per data source. With that, your indices have a dedicated "purpose", e.g. one index stores database data, another stores system metrics, and so on. Referring to my examples above, you would only need to execute one search request to determine a) the RAM usage of every machine and b) which machine has the highest RAM usage. However, this requires that every document contain a field that references the particular host, like so:
PUT metrics/_doc/1
{
  "system": {
    "ram": {
      "usage": "45%",
      "free": "55%"
    }
  },
  "host": {
    "name": "YOUR HOSTNAME",
    "ip": "192.168.17.100"
  }
}
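As a rough illustration of the single-request aggregation mentioned above, here is a sketch with a recent elasticsearch-py client (8.x). It assumes an index named "metrics", that host.name is aggregatable (a keyword field, or host.name.keyword under dynamic mapping), and that RAM usage is indexed as a numeric field (e.g. system.ram.usage_pct) rather than the "45%" string shown in the example document:

# Sketch: one request answering "which machine has the highest RAM usage?"
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="metrics",
    size=0,  # we only care about the aggregation, not the hits
    aggs={
        "by_host": {
            "terms": {"field": "host.name", "size": 200},
            "aggs": {"max_ram": {"max": {"field": "system.ram.usage_pct"}}},
        }
    },
)
for bucket in resp["aggregations"]["by_host"]["buckets"]:
    print(bucket["key"], bucket["max_ram"]["value"])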
In addition to that, I recommend using daily indices (a small sketch follows the list below). So instead of creating one huge index for the system metrics, you would create an index for every day, like metrics-2020.01.01, metrics-2020.01.02 and so on. This approach has the following advantages:
your indices will be much smaller in size, making them better to manage and (re-)allocate.
after some time period, you can roughly estimate the data size and be able to define the number of shards much better. With only one huge index, you would constantly need to update the number of shards in order to handle your requests in a fast way.
furthermore, you can search your data on a day-by-day basis in a convenient way.
you are able to set up ILM policies to automate the maintenance of your indices, e.g. delete metrics indices that are older than X days.
...
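As a small, hypothetical sketch of the daily-index naming, again with a recent Python client; the field values are only illustrative:

# Sketch: write each metrics document into a daily index (metrics-YYYY.MM.dd).
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

now = datetime.now(timezone.utc)
index_name = f"metrics-{now:%Y.%m.%d}"   # e.g. metrics-2020.01.01

es.index(index=index_name, document={
    "@timestamp": now.isoformat(),
    "system": {"ram": {"usage_pct": 45}},
    "host": {"name": "YOUR HOSTNAME", "ip": "192.168.17.100"},
})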
I hope this helps!

Hadoop vs Cassandra: Which is better for the following scenario?

There is a situation in our systems in which the user can view and "close" a report. After they close it, the report is moved to a temporary table inside the database where it is kept for 24 hours, and then moved to an archives table (where the report is stored for the next 7 years). At any point during the 7 years, a user can "reopen" the report and work on it. The problem is that the archives storage is getting large, and finding/reopening reports tends to be time-consuming. I also need to get statistics on the archives from time to time (e.g. report dates, clients, average length "opened", etc.). I want to use a big data approach but I am not sure whether to use Hadoop, Cassandra, or something else. Can someone provide me with some guidelines on how to get started and decide what to use?
If your archive is large and you'd like to get reports from it, you won't be able to use just Cassandra, as it has no easy means of aggregating the data. You'll end up collocating Hadoop and Cassandra on the same nodes.
From my experience, an archive (write once, read many) is not the best use case for Cassandra if you have a lot of writes (we've tried it as a backend for a backup system). Depending on your compaction strategy, you'll pay either in space or in IOPS for it. Added changes are propagated through the SSTable hierarchies, resulting in many more writes than the original change.
It is not possible to answer your question in full without knowing other variables: how much hardware (servers and their RAM/CPU/HDD/SSD) are you going to allocate? What is the size of each 'report' entry? How many reads/writes do you usually serve daily? How large is your archive storage now?
Cassandra might work fine. Keep two tables, reports and reports_archive. Define the schema using a TTL of 24 hours and 7 years:
CREATE TABLE reports (
...
) WITH default_time_to_live = 86400;
CREATE TABLE reports_archive (
...
) WITH default_time_to_live = 220752000; -- 7 years = 86400 * 365 * 7 (CQL options take literal values)
Use the new Time Window Compaction Strategy (TWCS) to minimize write amplification. It could be advantageous to store the report metadata and report binary data in separate tables.
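As a sketch only: with the Python cassandra-driver, enabling TWCS on the archive table could look like the following. The keyspace name reports_ks and the 7-day window are assumptions; the window size should be tuned to your write and expiry pattern.

# Sketch: enabling TWCS on the archive table via the Python cassandra-driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("reports_ks")   # assumed keyspace name

session.execute("""
    ALTER TABLE reports_archive
    WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': 7
    }
""")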
For roll-up analytics, use Spark with Cassandra. You don't mention the size of your data, but roughly speaking 1-3 TB per Cassandra node should work fine. Using RF=3 you'll need at least three nodes.
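A hypothetical roll-up with Spark reading straight from Cassandra might look like this; it assumes the spark-cassandra-connector is on the classpath (e.g. via --packages) and uses placeholder keyspace, table and column names:

# Sketch: roll-up analytics over the archive with Spark + spark-cassandra-connector.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("report-archive-stats")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

reports = (spark.read
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="reports_ks", table="reports_archive")
           .load())

# e.g. number of reports and average days-open per client (placeholder columns)
stats = (reports.groupBy("client_id")
         .agg(F.count(F.lit(1)).alias("report_count"),
              F.avg("days_open").alias("avg_days_open")))
stats.show()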

What is the capacity of a BluePrism Internal Work Queue?

I am working in Blue Prism Robotic Process Automation and trying to load an Excel sheet with more than 100k records (it might go upwards of 300k in some cases).
I am trying to load the internal work queue of Blue Prism, but I get an error as quoted below:
'Load Data Into Queue' ERROR: Internal : Exception of type 'System.OutOfMemoryException' was thrown.
Is there a way to avoid this problem, in the way where I can free up more memory?
I plan to process records one by one from the queue and put them into new Excel sheets by category. Loading all that data into a collection and looping over it may be memory-consuming, so I am trying to find a more efficient way.
I welcome any and all help/tips.
Thanks!
Basic Solution:
Break up the number of Excel rows you are pulling into your Collection data item at any one time. The threshold for this will depend on your resource's system memory and architecture, as well as the structure and size of the data in the Excel Worksheet. I've been able to move 50k 10-column rows from Excel to a Collection and then into the Blue Prism queue very quickly.
You can set this up by specifying the Excel Worksheet range to pull into the Collection data item, and then shift that range each time the Collection has been successfully added to the queue.
After each successful addition to the queue and/or before you shift the range and/or at a predefined count limit you can then run a Clean Up or Garbage Collection action to free up memory.
You can do all of this with the provided Excel VBO and an additional Clean Up object.
Keep in mind:
Even breaking it up, looping over a Collection this large to amend the data will be extremely expensive and slow. The most efficient way to make changes to the data will be at the Excel Workbook level or when it is already in the Blue Prism queue.
Best Bet: esqew's alternative solution is the most elegant and probably your best bet.
Jarrick hit it on the nose in that Work Queue items should provide the bot with information on what it is to be working on and a space for Control Room feedback, but not the actual work data to be manipulated.
In this case you would want to just use the item's Worksheet row number and/or some unique identifier from a single Worksheet column as the queue item data, so that the bot can provide Control Room feedback on the status of the item. If this information is predictable enough in format, there should be no need to move any data from the Excel Worksheet to a Collection and then into a Work Queue; rather, simply build the queue based on that data predictability.
Conversely, you can also have the bot build the queue "as it happens": once it grabs a single row of data from the Excel Worksheet to work on, it can also add a queue item with the row number of that data. This will then enable Control Room feedback and tracking. However, this would, in almost every case, be a bad practice, as it would not prevent a row from being worked multiple times unless the bot checked the queue first, at which point you've negated the speed gains you were looking to achieve by cutting out the initial queue building in the first place. It would also be impossible to scale the process for multiple bots to work the Excel Worksheet data efficiently.
This is a common issue for RPA, especially when working with large Excel files. As far as I know, there are no 100% solutions, only methods that reduce the symptoms. I have run into this problem several times and these are the ways I would try to handle it:
Set stage logging to Disabled or Errors Only.
Don't log parameters on action stages (especially ones that work with the Excel files).
Run the Garbage Collection process.
See if it is possible to avoid reading Excel files into BP collections and use OLEDB to query the file instead.
See if it is possible to increase the RAM on the machines.
If they’re using the 32-bit version of the app, then it doesn’t really matter how much memory you feed it, Blue Prism will cap out at 2 GB.
This may be because of the BP Server, as memory is shared between Processes and the Work Queue. A better option is to use two bots and multiple queues to avoid the memory error.
If you're using Excel documents or CSV files, you can use the OLEDB object to connect and query against them as if they were a database. You can use SQL syntax to limit the number of rows that are returned at a time and paginate through them until you've reached the end of the document.
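To make the pagination idea concrete outside of Blue Prism itself, here is a rough Python sketch against the ACE OLEDB provider (Windows-only, and the provider must be installed; the file path, sheet name and column span A:J are placeholders). The Blue Prism OLEDB object works with the same kind of connection string and SQL:

# Sketch: pull a bounded range of Excel rows per query via ACE OLEDB.
import adodbapi

conn_str = (
    "Provider=Microsoft.ACE.OLEDB.12.0;"
    r"Data Source=C:\data\records.xlsx;"              # placeholder path
    "Extended Properties='Excel 12.0 Xml;HDR=NO';"
)

conn = adodbapi.connect(conn_str)
cur = conn.cursor()

# Rows 2..5001 only (row 1 holds the headers); shift the range for the next page.
cur.execute("SELECT * FROM [Sheet1$A2:J5001]")
page = cur.fetchall()

cur.close()
conn.close()

Shifting the range (A5002:J10001, and so on) on each query gives you the pagination, since the ACE SQL dialect has no OFFSET clause.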
For starters, you are making incorrect use of the Work Queue in Blue Prism. The Work Queue should not be used to store this type and amount of data. (please read the BP documentation on Work Queues thoroughly).
Solving the issue at hand, namely the misuse, requires two changes:
Only store references in your Item Data which point to the Excel file containing the data.
If you're consulting this much data many times, perhaps convert the file into a CSV and write a VBO that queries the data directly in the CSV.
The first change is not just a recommendation; as your project progresses and IT Architecture and InfoSec come into play, it will become mandatory.
As for the CSV VBO, take a look at C#; it will make your life a lot easier than loading all this data into BP (time-consuming, unreliable, ...).

high volume data storage and processing

I am building a new application where I am expecting a high volume of geolocation data, something like a moving object sending geo coordinates every 5 seconds. This data needs to be stored in some database so that it can be used for tracking the moving object on a map at any time. I am expecting about 250 coordinates per moving object per route, each object can run about 50 routes a day, and I have 900 such objects to track. That brings it to about 11.25 million geo coordinates to store per day. I have to store at least about one week of data in my database.
This data will basically be used for simple queries like "find all the geo coordinates for a particular object and a particular route", so the queries are not very complicated and this data will not be used for any analysis purposes.
So, my question is: should I just go with a normal Oracle database like 12c distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance. Each query has to respond within 1 second.
Since you know the volume of data (roughly 11 million coordinates per day), you can easily simulate your whole scenario in an Oracle DB and test it well beforehand.
My suggestion is to go for day-level partitions with subpartitions on the object and route, so that all your business SQL always hits the right partitions.
You might also need to clear out older days' data, or create some sort of aggregation over past days and then delete the raw data.
It is well doable in 12c.
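As a very rough sketch of that partitioning layout, run here through python-oracledb (table, column and connection details are placeholders), interval partitioning creates the daily partitions automatically and hash subpartitioning on the object/route spreads each day's data:

# Sketch: day-level interval partitions with hash subpartitions on object/route.
import oracledb

ddl = """
CREATE TABLE geo_points (
    object_id    NUMBER        NOT NULL,
    route_id     NUMBER        NOT NULL,
    captured_at  TIMESTAMP     NOT NULL,
    latitude     NUMBER(9, 6),
    longitude    NUMBER(9, 6)
)
PARTITION BY RANGE (captured_at) INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
SUBPARTITION BY HASH (object_id, route_id) SUBPARTITIONS 16
(PARTITION p0 VALUES LESS THAN (TIMESTAMP '2020-01-01 00:00:00'))
"""

with oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)

Queries should then include the timestamp (and ideally the object id) in the WHERE clause so Oracle can prune down to the relevant partitions and subpartitions.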
