IBPY get correct historical volume data - algorithmic-trading

I am trying to get historical data through IbPy.
I do receive data, but the volume is so low that it is useless.
I would like to know how to get correct historical volume figures.
I'm executing the following code:
from ib.opt import Connection, message
from ib.ext.Contract import Contract
from ib.ext.Order import Order
from time import sleep, strftime
def historical_data_handler(msg):
    print(msg)
connection = Connection.create(port=7496, clientId=999)
connection.register(historical_data_handler, message.historicalData)
connection.connect()
req = Contract()
req.m_secType = "STK"
req.m_symbol = "TSLA"
req.m_currency = "USD"
req.m_exchange = "AMEX"
endtime = strftime('%Y%m%d %H:%M:%S')
connection.reqHistoricalData(1,req,endtime,"1 D","1 hour","TRADES",1,1)
sleep(5)
connection.disconnect()
and this is the output:
<historicalData reqId=1, date=20181123 16:30:00, open=333.21, high=333.33, low=331.04, close=332.92, volume=22, count=21, WAP=332.233, hasGaps=False>
<historicalData reqId=1, date=20181123 16:30:00, open=333.21, high=333.33, low=331.04, close=332.92, volume=22, count=21, WAP=332.233, hasGaps=False>
<historicalData reqId=1, date=20181123 17:00:00, open=332.93, high=334.2, low=327.0, close=328.2, volume=42, count=39, WAP=329.755, hasGaps=False>
<historicalData reqId=1, date=20181123 17:00:00, open=332.93, high=334.2, low=327.0, close=328.2, volume=42, count=39, WAP=329.755, hasGaps=False>
<historicalData reqId=1, date=20181123 18:00:00, open=329.0, high=330.37, low=327.96, close=327.96, volume=17, count=17, WAP=329.375, hasGaps=False>
<historicalData reqId=1, date=20181123 18:00:00, open=329.0, high=330.37, low=327.96, close=327.96, volume=17, count=17, WAP=329.375, hasGaps=False>
<historicalData reqId=1, date=20181123 19:00:00, open=328.5, high=328.6, low=326.07, close=326.07, volume=25, count=25, WAP=327.498, hasGaps=False>
<historicalData reqId=1, date=20181123 19:00:00, open=328.5, high=328.6, low=326.07, close=326.07, volume=25, count=25, WAP=327.498, hasGaps=False>
The data arrives, but the volumes of each row are impossibly low (~22 for hourly bars).
On their website:
https://interactivebrokers.github.io/tws-api/historical_bars.html#hd_what_to_show
it is stated that:
Note: IB's historical data feed is filtered for some types of trades
which generally occur away from the NBBO such as combos, block trades,
and derivatives. For that reason the historical data volume will be
lower than an unfiltered historical data feed
However, the retrieved volumes are so low they are useless.
I suppose I'm not the first to need historical volume data, and there is probably a way to get it.
Can you please tell me how?
Thanks!

Volume is reported in hundreds (lots of 100 shares). For example, 22 means 2,200 shares.
For reference, see the Trader Workstation API Reference Guide.
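If you just need approximate share counts, you can scale the reported value in the handler yourself. A minimal sketch, assuming the message fields shown in the output above and skipping the end-of-request bar whose date starts with "finished-":
def historical_data_handler(msg):
    # The final historicalData message only marks the end of the request;
    # its date starts with "finished-" and its values are -1, so skip it.
    if str(msg.date).startswith('finished'):
        return
    # For US stocks, IB reports volume in lots of 100 shares,
    # so multiply by 100 for an approximate share count.
    shares = msg.volume * 100
    print(msg.date, msg.open, msg.high, msg.low, msg.close, shares)
With the bars shown above, the first line would then report 2,200 shares instead of 22.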

Related

Reducing Latency for GPT-J

I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way:
config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", pad_token='<|endoftext|>', eos_token='<|endoftext|>', truncation_side='left')
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_cache=True,
    gradient_checkpointing=True,
)
model.to('cuda')
prompt = tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
prompt = {key: value.to('cuda') for key, value in prompt.items()}
out = model.generate(**prompt,
                     n=1,
                     min_length=16,
                     max_new_tokens=75,
                     do_sample=True,
                     top_k=35,
                     top_p=0.9,
                     batch_size=512,
                     temperature=0.75,
                     no_repeat_ngram_size=4,
                     clean_up_tokenization_spaces=True,
                     use_cache=True,
                     pad_token_id=tokenizer.eos_token_id
                     )
res = tokenizer.decode(out[0])
As input to the model I'm using 2048 tokens, and I produce 75 tokens as output. The latency is around 4-5 seconds. In the following blog post, I've read that latency can be improved by using pipelines and that tokenization can be a bottleneck.
Can the tokenization be improved for my code and would using a pipeline reduce the latency? Are there any other things I can do to reduce the latency?
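For reference, a pipeline-based variant of the same generation call might look roughly like the sketch below; it reuses the model and tokenizer loaded above, and the generation parameters mirror the question's. Whether it actually reduces latency is exactly what would need profiling.
from transformers import pipeline

# Wrap the already-loaded model and tokenizer in a text-generation pipeline
# on the first GPU; the pipeline then handles tokenization internally.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

out = generator(text,
                do_sample=True,
                min_length=16,
                max_new_tokens=75,
                top_k=35,
                top_p=0.9,
                temperature=0.75,
                no_repeat_ngram_size=4,
                return_full_text=False)
res = out[0]["generated_text"]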

Can the GA4 API fetch data for requests made with a combination of minute, region, and sessions?

Problem
With UA, I was able to get the number of sessions per region per minute (a combination of minute, region, and sessions), but is this not possible with GA4?
If not, is there any plan to support this in the future?
Detail
I ran GA4 Query Explorer with date, hour, minute, region in Dimensions and sessions in Metrics.
But I got an incompatibility error.
What I tried
I have checked with the GA4 Dimensions & Metrics Explorer and confirmed that the combination of minute and region is not possible.
(Updated 2022/05/16 15:35) Checked by code execution
I ran it with Ruby.
require "google/analytics/data/v1beta/analytics_data"
require 'pp'
require 'json'
ENV['GOOGLE_APPLICATION_CREDENTIALS'] = '' # service account file path
client = ::Google::Analytics::Data::V1beta::AnalyticsData::Client.new
LIMIT_SIZE = 1000
offset = 0
loop do
  request = Google::Analytics::Data::V1beta::RunReportRequest.new(
    property: "properties/xxxxxxxxx",
    date_ranges: [
      { start_date: '2022-04-01', end_date: '2022-04-30' }
    ],
    dimensions: %w(date hour minute region).map { |d| { name: d } },
    metrics: %w(sessions).map { |m| { name: m } },
    keep_empty_rows: false,
    offset: offset,
    limit: LIMIT_SIZE
  )
  ret = client.run_report(request)
  dimension_headers = ret.dimension_headers.map(&:name)
  metric_headers = ret.metric_headers.map(&:name)
  puts (dimension_headers + metric_headers).join(',')
  ret.rows.each do |row|
    puts (row.dimension_values.map(&:value) + row.metric_values.map(&:value)).join(',')
  end
  offset += LIMIT_SIZE
  break if ret.row_count <= offset
end
The result was an error.
3:The dimensions and metrics are incompatible.. debug_error_string:{"created":"#1652681913.393028000","description":"Error received from peer ipv4:172.217.175.234:443","file":"src/core/lib/surface/call.cc","file_line":953,"grpc_message":"The dimensions and metrics are incompatible.","grpc_status":3}
There is an error in your code: make sure you use the actual API dimension name and not the UI name. The correct name of that dimension is dateHourMinute, not "Date hour and minute":
dimensions: %w(dateHourMinute).map { |d| { name: d } },
The Query Explorer returns results for this request just fine.
Limited use of the region dimension
As for region: as the error message states, the dimensions and metrics are incompatible. The issue is that dateHourMinute cannot be used together with region. Switch to date or dateHour.
At the time of writing this is a beta API. I have sent a message to Google to find out whether this is working as intended or whether it may change.
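For illustration, the same compatible request through Google's official Python client (the google-analytics-data package) would look roughly like this sketch; the property ID is a placeholder:
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account file,
# as in the Ruby snippet above.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/XXXXXXXXX",  # placeholder property ID
    date_ranges=[DateRange(start_date="2022-04-01", end_date="2022-04-30")],
    # dateHourMinute is the API name; it is not compatible with region,
    # so region is omitted here.
    dimensions=[Dimension(name="dateHourMinute")],
    metrics=[Metric(name="sessions")],
)
response = client.run_report(request)
for row in response.rows:
    print(row.dimension_values[0].value, row.metric_values[0].value)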

Ruby paging over API response dataset causes memory spike

I'm experiencing an issue with a large memory spike when I page through a dataset returned by an API. The API is returning ~150k records, I'm requesting 10k records at a time and paging through 15 pages of data. The data is an array of hashes, each hash containing 25 keys with ~50-character string values. This process kills my 512mb Heroku dyno.
I have a method used for paging an API response dataset.
def all_pages value_key = 'values', &block
  response = {}
  values = []
  current_page = 1
  total_pages = 1
  offset = 0
  begin
    response = yield offset
    # The following seems to be the culprit
    values += response[value_key] if response.key? value_key
    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages
  values
end
I call this method like so:
all_pages("items") do |current_page|
  get "#{data_uri}/data", query: { offset: current_page, limit: 10000 }
end
I know it's the concatenation of the arrays that is causing the issue, as removing that line allows the process to run with no memory issues. What am I doing wrong? The whole dataset is probably no larger than 20MB; how is that consuming all the dyno memory? What can I do to improve the efficiency here?
Update
Response looks like this: {"totalResults":208904,"offset":0,"count":1,"hasMore":true,"limit":"10000","items":[...]}
Update 2
Running with memory reporting shows the following:
[HTTParty] [2014-08-13 13:11:22 -0700] 200 "GET 29259/data" -
Memory 171072KB
[HTTParty] [2014-08-13 13:11:26 -0700] 200 "GET 29259/data" -
Memory 211960KB
... removed for brevity ...
[HTTParty] [2014-08-13 13:12:28 -0700] 200 "GET 29259/data" -
Memory 875760KB
[HTTParty] [2014-08-13 13:12:33 -0700] 200 "GET 29259/data" -
Errno::ENOMEM: Cannot allocate memory - ps ax -o pid,rss | grep -E "^[[:space:]]*23137"
Update 3
I can recreate the issue with the basic script below. The script is hard coded to only pull 100k records and already consumes over 512MB of memory on my local VM.
#! /usr/bin/ruby
require 'uri'
require 'net/http'
require 'json'
uri = URI.parse("https://someapi.com/data")
offset = 0
values = []
begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.set_debug_output($stdout)
  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Content-Type", "application/json")
  request.add_field("Accept", "application/json")
  response = http.request(request)
  json_response = JSON.parse(response.body)
  values << json_response['items']
  offset += 10000
end while offset < 100_000
values
Update 4
I've made a couple of improvements which seem to help but not completely alleviate the issue.
1) Using symbolize_keys turned out to consume less memory. This is because the keys of each hash are the same, and it's cheaper to symbolize them than to parse them as separate Strings.
2) Switching to ruby-yajl for JSON parsing consumes significantly less memory as well.
Memory consumption of processing 200k records:
JSON.parse(response.body): 861080KB (Before completely running out of memory)
JSON.parse(response.body, symbolize_keys: true): 573580KB
Yajl::Parser.parse(response.body): 357236KB
Yajl::Parser.parse(response.body, symbolize_keys: true): 264576KB
This is still an issue though.
Why does a dataset that's no more than 20MB take that much memory to process?
What is the "right way" to process large datasets like this?
What does one do when the dataset becomes 10x larger? 100x larger?
I will buy a beer for anyone who can thoroughly answer these three questions!
Thanks a lot in advance.
You've identified the problem as the use of += with your array, so the likely solution is to add the data without creating a new array each time. Note that push and << would append each page's array as a single nested element; to add the page's elements in place, use concat:
values.concat(response[value_key]) if response.key? value_key
You should only use += if you actually want a new array. It doesn't appear that you do; you just want all the elements accumulated in a single array, and concat does that without allocating a new array on every page.

Measure estimated completion time of ruby script

I've been running a lot of scripts lately that iterate over 10k - 300k objects, and I'm thinking of writing some code that estimates the completion time of the script (they take 20-180 minutes). I've got to imagine though that there's something out there that does this already. Is there?
To Clarify (edit):
Were I to write code to do this, it would work by measuring how long it takes to perform "the operation" on a single object, multiplying that amount of time by the number of objects left, and adding it to the current time.
Granted, this would only work in situations where you have a script involving a single loop that takes up 99% of the script's total run time, and in which you could reasonably expect to calculate a semi-accurate average for each iteration of that loop. This is true of the scripts for which I'd like to estimate completion time.
Have a look at the ruby-progressbar gem: https://github.com/jfelchner/ruby-progressbar
It generates a nice progressbar and estimates completion time (ETA):
example task: 67% |oooooooooooooooooooooo | ETA: 00:01:15
You can measure the time of each method within your script at a granular level and then sum the components, as described here.
You let your process run, and after a set number of iterations you measure the elapsed time. You then use that value as an estimate for the time left. This ensures that the time is always dynamically estimated according to the current task.
This example is extra verbose, like a code double whopper with triple cheese:
# Some variables for this test
iterations = 1000
probe_at = (iterations * 0.1).to_i
time_total = 0
#======================================
iterations.times do |i|
  time_start = Time.now
  # you could yield here if this were a function
  5000.times do # <tedious task simulation>
    Math.sqrt(rand(200000))
  end # <end of tedious task simulation>
  time_total += time_taken = Time.now - time_start
  if i == probe_at
    iteration_cost = (time_total / probe_at)
    time_left = iteration_cost * (iterations - probe_at)
    puts "Time taken (ACTUAL): #{time_total} | iteration: #{i}"
    puts "Time left (ESTIMATE): #{time_left} | iteration: #{i}"
    puts "Estimated total: #{time_total + time_left} | iteration: #{i}"
  end
  if i == 999
    puts "Time taken (ACTUAL): #{time_total} | iteration: #{i}"
  end
end
You could easily rewrite this into a class or a method.
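A rough sketch of the same probe-then-extrapolate idea as a small reusable class, written here in Python; the names and the do_work task are illustrative, not part of the original:
import time

class EtaProbe:
    """Rough ETA estimator: time the first probe_at iterations,
    then extrapolate the remaining time from their average cost."""

    def __init__(self, total_iterations, probe_fraction=0.1):
        self.total = total_iterations
        self.probe_at = max(1, int(total_iterations * probe_fraction))
        self.started = time.monotonic()

    def estimate(self, completed):
        # Average cost of the iterations done so far, applied to what's left.
        elapsed = time.monotonic() - self.started
        per_iteration = elapsed / completed
        return per_iteration * (self.total - completed)

# Usage sketch:
# probe = EtaProbe(1000)
# for i in range(1, 1001):
#     do_work()                      # hypothetical per-item task
#     if i == probe.probe_at:
#         print(f"estimated time left: {probe.estimate(i):.1f}s")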

Calculate time remaining with different length of variables

I will have to admit the title of this question sucks... I couldn't get the best description out. Let me see if I can give an example.
I have about 2,700 customers on whose servers my software was installed at one time; 1,500 or so still have it. Basically, what I have going on is an auto-diagnostics process to help weed out people who have uninstalled, or who have problems with the software that we can assist with. Currently we have a cURL request fetching their website and looking for a header our software returns.
We have 8 different statuses that are returned
GREEN - Everything works (usually pretty quick 0.5 - 2 seconds)
RED - Software not found (usually the longest from 5 - 15 seconds)
BLUE - Software found but not activated (usually from 3 - 9 seconds)
YELLOW - Server IP mismatch (usually from 1 - 3 seconds)
ORANGE - Server IP mismatch and wrong software type (usually 5 - 10 seconds)
PURPLE - Activation key incorrect (usually within 2 seconds)
BLACK - Domain returns 404 - No longer exists (usually within a second)
UNK - Connection failed (usually due to our load balancer -- VERY rare) (never encountered this yet)
Now basically what happens is a cronJob will start the process by pulling the domain and product type. It will then cURL the domain and start cycling through the status colors above.
While this is happening, we have an AJAX page that returns the results so we can keep an eye on the status. The major problem is that the Time Remaining is so volatile that it does not give a good estimate. Here is the current math:
# Number of accounts between NOW and when started
$completedAccounts = floor($parseData[2]*($parseData[1]/100));
# Number of seconds between NOW and when started
$completedTime = strtotime("now") - strtotime("$hour:$minute:$second");
# Avg number of seconds per account
$avgPerCompleted = $completedTime / $completedAccounts;
# Total number of remaining accounts to be scanned
$remainingAccounts = $parseData[2] - $completedAccounts;
# The total of seconds remaining for all of the remaining accounts
$remainingSeconds = $remainingAccounts * $avgPerCompleted;
$remainingTime = format_time($remainingSeconds, ":");
I could keep a count of all of the green, red, blue, etc. results and average how long each color takes, then use that for the average time, although I don't believe that would give much better results.
With times that vary this much, any suggestions would be appreciated.
Thanks,
Jeff
OK, I believe I have figured it out. I had to create a class so I could calculate a simple linear regression over the timing data.
function calc() {
    $n = count($this->mDatas);
    $vSumXX = $vSumXY = $vSumX = $vSumY = 0;
    //var_dump($this->mDatas);
    $vCnt = 0; // for time-series, start at t=0
    foreach ($this->mDatas AS $vOne) {
        if (is_array($vOne)) { // x,y pair
            list($x,$y) = $vOne;
        } else { // time-series
            $x = $vCnt; $y = $vOne;
        } // fi
        $vSumXY += $x*$y;
        $vSumXX += $x*$x;
        $vSumX += $x;
        $vSumY += $y;
        $vCnt++;
    } // rof
    $vTop = ($n*$vSumXY - $vSumX*$vSumY);
    $vBottom = ($n*$vSumXX - $vSumX*$vSumX);
    $a = $vBottom != 0 ? $vTop/$vBottom : 0;
    $b = ($vSumY - $a*$vSumX)/$n;
    //var_dump($a,$b);
    return array($a,$b);
}
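For reference, calc() is computing the standard least-squares slope and intercept; in the code's notation, $vTop/$vBottom and the line after it correspond to
a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
b = \frac{\sum y_i - a \sum x_i}{n}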
I take each account and build an array of how long each one takes. That array is then run through this calculation to build the x and y time series. Finally, I run it through the prediction function.
/** given x, return the prediction y */
function calcpredict($x) {
    list($a,$b) = $this->calc();
    $y = $a*$x+$b;
    return $y;
}
I put static values in so you could see the results:
$eachTime = array(7,1,.5,12,11,6,3,.24,.12,.28,2,1,14,8,4,1,.15,1,12,3,8,4,5,8,.3,.2,.4,.6,4,5);
$forecastProcess = new Linear($eachTime);
$forecastTime = $forecastProcess->calcpredict(5);
This overall system gives me about a 0.003 difference in 10 accounts and about a 2.6 difference in 2,700 accounts. Next will be to calculate the accuracy.
Thanks for trying guys and gals
