Create an xts spread object with bid/ask and last data from xts objects - xts

I have 3 CSV files with last/bid/ask (with volumes) tick data for each of two instruments (6 files in total), and I want to create a spread from these files to be used for backtesting in quantstrat.
I can get them into xts objects, and I have looked into fn_SpreadBuilder, but any suggestions on the way forward would be welcome.
My strategy needs to act on the bid/ask of the spread.
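The bid/ask logic of a two-leg spread can be sketched outside R as well. Below is a plain-Python illustration (not quantstrat or fn_SpreadBuilder; function and data names are hypothetical, and timestamps are assumed to already align across the two legs): to sell the spread you sell leg 1 at its bid and buy leg 2 at its ask, and vice versa to buy it.

```python
# Hypothetical sketch (plain Python, not quantstrat/R): building spread
# bid/ask quotes from two legs' quotes, assuming timestamps already align.

def spread_quotes(leg1, leg2):
    """leg1/leg2: dicts mapping timestamp -> (bid, ask).
    Returns timestamp -> (bid, ask) for the spread leg1 - leg2."""
    out = {}
    for ts in sorted(set(leg1) & set(leg2)):  # only overlapping ticks
        b1, a1 = leg1[ts]
        b2, a2 = leg2[ts]
        # Selling the spread = sell leg1 at its bid, buy leg2 at its ask.
        # Buying the spread  = buy leg1 at its ask, sell leg2 at its bid.
        out[ts] = (b1 - a2, a1 - b2)
    return out

leg1 = {"09:00:01": (100.0, 100.5), "09:00:02": (100.2, 100.7)}
leg2 = {"09:00:01": (98.0, 98.4),   "09:00:02": (98.1, 98.5)}
print(spread_quotes(leg1, leg2))
```

The same alignment-and-combine step is essentially what fn_SpreadBuilder does over xts objects, with merge handling the timestamp alignment.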


Load data from S3 to sort and allow timeline analysis

I'm currently trying to find the best architecture approach for my use case:
I have two completely separate S3 buckets which contain data stored in JSON format. Data is partitioned by year/month/day prefixes, and inside a particular day I can find e.g. hundreds of files for that date
(example: s3://mybucket/2018/12/31/file1,
s3://mybucket/2018/12/31/file2, s3://mybucket/2018/12/31/file..n)
Unfortunately, within a single day's prefix, the JSONs in those tens or hundreds of files are not ordered by exact timestamp - so following this example:
s3://mybucket/2018/12/31/
I can find:
file1 - which contains JSON about object "A" with timestamp "2018-12-31 18:00"
file100 - which contains JSON about object "A" with timestamp "2018-12-31 04:00"
Even worse, I have the same scenario with my second bucket.
What do I want to do with this data?
Gather my events from both buckets, grouped by the object's "ID" and sorted by timestamp, so I can visualize them on a timeline as the last step (which tools, and how, is out of scope).
My doubts are more about how to do it:
In cost efficient way
Cloud native (in AWS)
With smallest possible maintenance
What I was thinking of:
Not sure about this, but: load every new file that arrives on S3 into DynamoDB (using a Lambda trigger). AFAIK, creating the table with my ID as the hash key and the timestamp as the range key should work for me, correct?
That way every new row inserted is partitioned by its ID and already kept in the correct order - but I'm not an expert.
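The hash-key/range-key behaviour described above can be simulated in a few lines of plain Python (illustrative only; a real table would be created via boto3's create_table with ID as the HASH key and Timestamp as the RANGE key):

```python
import bisect
from collections import defaultdict

# Pure-Python sketch of the DynamoDB key layout described above:
# hash key = object ID, range key = timestamp. Each item lands in its
# ID's "partition" and is kept sorted by timestamp on insert, so a later
# query by ID returns events already time-ordered.

table = defaultdict(list)  # ID -> sorted list of (timestamp, item)

def put_item(obj_id, timestamp, item):
    bisect.insort(table[obj_id], (timestamp, item))

def query_by_id(obj_id):
    """Analogue of a DynamoDB Query on the hash key: time-ordered events."""
    return [item for _, item in table[obj_id]]

# Files can arrive in any order...
put_item("A", "2018-12-31 18:00", {"Event_type": "sold", "Price": 12})
put_item("A", "2018-12-31 04:00", {"Event_type": "listed", "Price": 10})
print(query_by_id("A"))  # ...but come back sorted by timestamp
```

So yes, the out-of-order timestamps inside the S3 files would not matter: the range key keeps each ID's events sorted regardless of arrival order.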
Use Logstash to load data from S3 into Elasticsearch - again, AFAIK everything in ES can be indexed, so also sorted. Timelion would probably let me do the fancy analysis I need. But again... not sure if ES will perform as I want: the price, the data volume is big, etc.
??? No other ideas
To help explain my need and show a bit of the data structure, I prepared this: :)
[example of workflow]
Volume of data?
Around +/- 200,000 events - each event is a JSON with 4 fields (ID, Event_type, Timestamp, Price)
To summarize:
I need to put the data somewhere efficiently, minimizing cost, sorted, so that a front end at the next step can present how events change over time - filtered by a particular "ID".
Thanks, and I appreciate any good advice, best practices, or solutions I can rely on! :)
@John Rotenstein - you are right, I absolutely forgot to add those details. Basically I don't need any SQL functionality, as the data will not be updated. The only scenario is that a new event for a particular ID arrives, so only new incremental data. Based on that, the only operation I will perform on this dataset is "Select". That's why I would prefer speed and instant answers. People will look at this mostly per "ID" - so using filtering. Data arrives on S3 every 15 minutes (new files).
@Athar Khan - thanks for the good suggestion!
As far as I understand this, I would choose the second option: Elasticsearch, with Logstash loading the data from S3, and Kibana as the tool to investigate, search, sort and visualise.
Having a Lambda push data from S3 to DynamoDB would probably work, but might be less efficient and cost more, as you are running a compute process on each event while pushing to Dynamo in small/single-item batches. Logstash, on the other hand, would read the files one by one and process them all. It also depends on how often you plan to load fresh data to S3, but both solutions should fit.
The fact that the timestamps are not ordered within the files is not a problem in Elasticsearch: you can index them in any order, and you will still be able to visualise and search them in Kibana in time-sorted order.
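To make the last point concrete, here is the shape of an Elasticsearch request body that filters by object ID and lets ES return the hits time-sorted (the index and field names below are assumptions based on the fields the question lists, not a confirmed mapping):

```python
# Illustrative Elasticsearch request body: filter events by object ID
# and have ES return them time-sorted, regardless of ingest order.
# Field names (ID, Timestamp) mirror the event fields in the question;
# the index name "events" is hypothetical.

def events_query(object_id, size=1000):
    return {
        "size": size,
        "query": {"term": {"ID": object_id}},
        "sort": [{"Timestamp": {"order": "asc"}}],
    }

body = events_query("A")
print(body)
# With the elasticsearch-py client this would be sent roughly as:
#   es.search(index="events", body=events_query("A"))
```

Kibana builds equivalent queries for you behind the scenes, which is why the on-disk ordering in S3 stops mattering once the data is indexed.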

What is the capacity of a BluePrism Internal Work Queue?

I am working in Blue Prism Robotic Process Automation and trying to load an Excel sheet with more than 100k records (it might go upwards of 300k in some cases).
I am trying to load the internal work queue of Blue Prism, but I get the error quoted below:
'Load Data Into Queue' ERROR: Internal : Exception of type 'System.OutOfMemoryException' was thrown.
Is there a way to avoid this problem, perhaps by freeing up more memory?
I plan to process records one by one from the queue and put them into new Excel sheets categorically. Loading all that data into a collection and looping over it may be memory-consuming, so I am trying to find a more efficient way.
I welcome any and all help/tips.
Thanks!
Basic Solution:
Break up the number of Excel rows you are pulling into your Collection data item at any one time. The thresholds for this will depend on your resource's system memory and architecture, as well as the structure and size of the data in the Excel Worksheet. I've been able to move 50k 10-column rows from Excel to a Collection and then into the Blue Prism queue very quickly.
You can set this up by specifying the Excel Worksheet range to pull into the Collection data item, and then shift that range each time the Collection has been successfully added to the queue.
After each successful addition to the queue and/or before you shift the range and/or at a predefined count limit you can then run a Clean Up or Garbage Collection action to free up memory.
You can do all of this with the provided Excel VBO and an additional Clean Up object.
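The range-shifting loop described above can be sketched as follows (this is illustrative plain Python, not Blue Prism code; CHUNK_ROWS is a hypothetical threshold you would tune to your resource's memory):

```python
# Plain-Python sketch of the range-shifting chunk logic: pull a bounded
# worksheet range into the Collection, add it to the queue, clean up,
# then shift the range. CHUNK_ROWS is a hypothetical tuning value.

CHUNK_ROWS = 5000

def chunk_ranges(total_rows, first_data_row=2, chunk=CHUNK_ROWS):
    """Yield (start_row, end_row) worksheet ranges covering every data
    row, assuming row 1 holds the headers."""
    start = first_data_row
    last = first_data_row + total_rows - 1
    while start <= last:
        end = min(start + chunk - 1, last)
        yield start, end
        start = end + 1

for start, end in chunk_ranges(total_rows=12000):
    # In Blue Prism terms, per iteration:
    #  1. Excel VBO: read rows start..end into the Collection data item
    #  2. Add the Collection to the work queue
    #  3. Run the Clean Up / garbage collection action, then shift the range
    print(start, end)
```

Each iteration keeps only one bounded Collection in memory, which is what avoids the OutOfMemoryException.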
Keep in mind:
Even breaking it up, looping over a Collection this large to amend the data will be extremely expensive and slow. The most efficient way to make changes to the data will be at the Excel Workbook level or when it is already in the Blue Prism queue.
Best Bet: esqew's alternative solution is the most elegant and probably your best bet.
Jarrick hit it on the nose in that Work Queue items should provide the bot with information on what they are to be working on and a Control Room feedback space, but not the actual work data to be implemented/manipulated.
In this case you would want to use just the item's Worksheet row number and/or some unique identifier from a single Worksheet column as the queue item data, so that the bot can provide Control Room feedback on the status of the item. If this information is predictable enough in format, there should be no need to move any data from the Excel Worksheet to a Collection and then into a Work Queue; rather, simply build the queue based on that predictability.
Conversely, you can also have the bot build the queue "as it happens": once it grabs a single row of data from the Excel Worksheet to work it, it can also add a queue item with that row's number, which enables Control Room feedback and tracking. However, this would in almost every case be bad practice, as it would not prevent a row from being worked multiple times unless the bot checked the queue first, at which point you've negated the speed gains you were looking for in cutting out the initial queue building. It would also make it impossible to scale the process for multiple bots to work the Excel Worksheet data efficiently.
This is a common issue in RPA, especially when working with large Excel files. As far as I know there is no 100% solution, only methods to reduce the symptoms. I have run into this problem several times, and these are the ways I would try to handle it:
Disable stage logging, or set it to Errors only.
Don't log parameters on action stages (especially ones that work with the Excel files).
Run the Garbage Collection process.
See if it is possible to avoid reading Excel files into BP collections and use OLEDB to query the file.
See if it is possible to increase the RAM on the machines.
If they’re using the 32-bit version of the app, then it doesn’t really matter how much memory you feed it, Blue Prism will cap out at 2 GB.
This may be because of the BP Server, as memory is shared between Processes and the Work Queue. A better option is to use two bots and multiple queues to avoid the memory error.
If you're using Excel documents or CSV files, you can use the OLEDB object to connect and query against it as if it were a database. You can use the SQL syntax to limit the amount of rows that are returned at a time and paginate through them until you've reached the end of the document.
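OLEDB itself is a Windows/.NET interface, but the pagination pattern it enables is easy to illustrate with the standard-library sqlite3 module instead (the file contents and page size below are made up): load the CSV once, then fetch fixed-size pages with LIMIT/OFFSET rather than one huge collection.

```python
import csv
import io
import sqlite3

# Sketch of the paginate-with-SQL idea using stdlib sqlite3 in place of
# OLEDB (the SQL pattern is the same): query the data in fixed-size
# pages instead of loading every row at once. Sample data is made up.

csv_data = "id,name\n" + "\n".join(f"{i},row{i}" for i in range(1, 11))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sheet (id INTEGER, name TEXT)")
reader = csv.DictReader(io.StringIO(csv_data))
conn.executemany("INSERT INTO sheet VALUES (?, ?)",
                 ((r["id"], r["name"]) for r in reader))

PAGE = 4
offset = 0
pages = []
while True:
    rows = conn.execute(
        "SELECT id, name FROM sheet ORDER BY id LIMIT ? OFFSET ?",
        (PAGE, offset)).fetchall()
    if not rows:
        break          # reached the end of the document
    pages.append(rows)  # process this page, then move on
    offset += PAGE
print([len(p) for p in pages])
```

Only one page of rows is ever held in memory at a time, which is exactly the property you want when Blue Prism is choking on a full-sheet load.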
For starters, you are making incorrect use of the Work Queue in Blue Prism. The Work Queue should not be used to store this type and amount of data (please read the BP documentation on Work Queues thoroughly).
Solving the issue at hand (the misuse) requires two changes:
Only store references in your Item Data, pointing to the Excel file containing the data.
If you're consulting this much data many times, perhaps convert the file into a CSV and write a VBO that queries the data directly in the CSV.
The first change is not just a recommendation: as your project progresses and IT Architecture and InfoSec come into play, it will be mandatory.
As for the CSV VBO, take a look at C#; it will make your life a lot easier than loading all this data into BP (time consuming, unreliable, ...).

high volume data storage and processing

I am building a new application where I expect a high volume of geolocation data: something like a moving object sending geo coordinates every 5 seconds. This data needs to be stored in a database so that it can be used for tracking the moving object on a map at any time. I am expecting about 250 coordinates per moving object per route, each object can run about 50 routes a day, and I have 900 such objects to track. That comes to about 11.25 million geo coordinates to store per day. I have to store at least one week of data in my database.
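A quick back-of-envelope check of those figures:

```python
# Back-of-envelope check of the daily and weekly volume described above.
coords_per_route = 250
routes_per_day = 50
objects = 900

per_day = coords_per_route * routes_per_day * objects
per_week = per_day * 7
print(per_day, per_week)  # about 11.25 million rows/day, ~78.75 million/week
```

So the one-week retention window holds on the order of 80 million rows, which is the number any storage choice should be sized and benchmarked against.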
This data will basically be used for simple queries, like finding all the geocoordinates for a particular object and a particular route. So the queries are not very complicated, and this data will not be used for any analysis.
So, my question is: should I just go with a normal Oracle database like 12c distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance: each query has to respond within 1 second.
Since you know the volume of data (about 11.25 million rows per day), you can easily simulate your whole scenario in Oracle DB and test it well beforehand.
My suggestion is to go for day-level partitions with two sub-partitions, by object and by route. All your business SQL has to hit the right partitions, always.
You will also probably need to clear out older days' data, or create some sort of aggregation from past days and then delete the raw data.
It's well doable in 12c.

Bulk Movement Jobs in FileNet 5.2.1

I have a requirement to move documents from one storage area to another, and I am planning to use Bulk Movement Jobs under Sweep Jobs in FileNet P8 v5.2.1.
My filter criterion is obviously (and only) the storage area id, as I want to target a specific storage area and move its content to another storage area (kind of like archiving) without altering the security, relationship containment, document class, etc.
When I run the job, though I have around 100,000 objects in the storage area I am targeting, the examined objects field of the job shows 500M objects, and it took around 15 hrs to move the objects. The DBAs analyzed this situation and tell me that although I have all the necessary indexes created on the DocVersion table (as per the FileNet documentation), the job still does a full table scan.
Why would something like this happen?
What additional indexes can be used and how would that be helpful?
Is there a better way to do this with less time consumption?
Answering only questions 2 and 3:
For indexes, you can use this documentation: https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.0/com.ibm.p8.performance.doc/p8ppt237.htm
You can improve the performance of your jobs if you split all the documents via the "Policy controlled batch size" option (as I remember) on the "Sweeps subsystem" tab in the Domain settings.
Use Time Slot management
https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.1/com.ibm.p8.ce.admin.tasks.doc/p8pcc179.htm?lang=ru
and Filter Timelimit option
https://www-01.ibm.com/support/knowledgecenter/SSNW2F_5.2.1/com.ibm.p8.ce.admin.tasks.doc/p8pcc203.htm?lang=ru
In short, you just split all your documents into portions and process them at separate times and in separate threads.

Widget with search facility in wxwidgets

I was wondering if there are any nice widgets in wxWidgets with search ability. I mean searching for data in large tables, like data in wxGrid.
Thanks in advance
It would be slow and inefficient in several ways to store all of a large dataset in wxGrid, and then to search through wxGrid.
It would be better to keep the dataset in a database and use the database engine to search through it. wxGrid need only store the data that is visible in the GUI.
Here is some high level pseudo code of how this works.
Load data into store. The store should probably be a database, but it might be a vector or any STL container. Even a text file could be made to work!
Set current viewable row to a sensible value. I am assuming your data is arranged in rows
Load rows including and around current into wxGrid.
User enters search term
Send search request to data store, which returns row containing target
Set current row to row containing target
Load rows including and around current into wxGrid.
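The data-store side of the pseudo code above can be sketched with stdlib sqlite3 (the wxGrid/GUI steps are omitted, and all names here are hypothetical): the database does the searching, and the grid only ever receives the small window of rows around the current one.

```python
import sqlite3

# Sketch of the pseudo code's data-store side (GUI parts omitted).
# The wxGrid would display only the window returned by visible_rows().

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (row INTEGER PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO data VALUES (?, ?)",
                 [(i, f"item {i}") for i in range(1, 1001)])

WINDOW = 5  # rows shown above/below the current row in the grid

def visible_rows(current):
    """Rows to load into wxGrid: the current row and its neighbours."""
    return conn.execute(
        "SELECT row, text FROM data WHERE row BETWEEN ? AND ? ORDER BY row",
        (current - WINDOW, current + WINDOW)).fetchall()

def search(term):
    """Let the database engine find the target; return its row number."""
    hit = conn.execute("SELECT row FROM data WHERE text = ?",
                       (term,)).fetchone()
    return hit[0] if hit else None

current = search("item 500")     # user enters a search term
print(visible_rows(current)[0])  # grid reloads around the hit
```

In wxPython the same idea is usually implemented with a virtual grid (wxGrid in virtual mode via a custom table base), where the grid asks the store for cell values on demand instead of holding the whole dataset.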