GitHub API limits the commits array length to 250

When calling the GitHub API with Octokit to get all the ahead commits of a fork repository by comparing each pair of branches, if the fork is more than 300 commits ahead, Octokit only returns a maximum of 250 commits. I have searched the entire Internet but still have no clue. Does anyone know how to bypass this limit? Any help would be appreciated.

You might consider adding octokit/plugin-paginate-rest.js in order to add pagination to your results.
The per_page parameter usually defaults to 30 and can be set to up to 100, which helps retrieve a large amount of data without hitting the rate limits too soon.
An optional mapFunction can be passed to map each page response to a new value, usually an array with only the data you need. This can help to reduce memory usage, as only the relevant data has to be kept in memory until the pagination is complete.
const issueTitles = await octokit.paginate(
  "GET /repos/{owner}/{repo}/issues",
  {
    owner: "octocat",
    repo: "hello-world",
    since: "2010-10-01",
    per_page: 100,
  },
  (response) => response.data.map((issue) => issue.title)
);
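The question itself is about the compare endpoint, whose commits array is capped at 250; the documented workaround is to page through the list-commits endpoint for the head branch instead and, if you only want the ahead commits, stop at the merge-base SHA that the compare response reports. Purely as an illustration of that pagination idea (the thread uses Octokit, so this Python sketch with requests and placeholder owner, repo and branch names is only a sketch of how you might wire it up):

import requests

OWNER, REPO, BRANCH = "some-owner", "some-fork", "main"  # placeholders
session = requests.Session()
# session.headers["Authorization"] = "token <personal-access-token>"  # optional, raises the rate limit

commits = []
url = f"https://api.github.com/repos/{OWNER}/{REPO}/commits"
params = {"sha": BRANCH, "per_page": 100}

while url:
    resp = session.get(url, params=params)
    resp.raise_for_status()
    commits.extend(resp.json())
    # requests parses the Link header; follow the "next" page until it disappears
    url = resp.links.get("next", {}).get("url")
    params = None  # the next-page URL already carries its query string

print(len(commits))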

Related

Elasticsearch giving cached result even after 5-6 seconds

My system calls Elasticsearch. After updating a document I would like to fetch the same document again. While doing so, Elasticsearch sometimes returns cached results (results from before the update), even after retrying the get 5-6 seconds later.
I have used refresh: 'wait_for' while updating the document. Can anyone suggest a workaround for this? I would like to fetch the latest revision of the updated document. My query to fetch is:
body: {
  query: {
    terms: {
      _id: [
        idsToFetch
      ]
    }
  }
}
First, you can check what refresh interval is set for your index (it defaults to 1 second). In that case refresh: 'wait_for' should return within 1 second at most, but as explained in the official ES documentation:
If the refresh interval is set to -1, disabling the automatic refreshes, then requests with refresh=wait_for will wait indefinitely until some action causes a refresh. Conversely, setting index.refresh_interval to something shorter than the default like 200ms will make refresh=wait_for come back faster, but it'll still generate inefficient segments.
You can check what refresh_interval is set for the index using https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-get-settings.html; please note it only appears in the result if it's not set to its default value.
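For example, a minimal sketch with the Python client (assuming a 7.x client and a hypothetical index name my-index; similar calls exist in the other official clients):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Shows index.refresh_interval only if it was changed from the 1s default
print(es.indices.get_settings(index="my-index", name="index.refresh_interval"))

# Block the update call until the change is visible to search
es.update(
    index="my-index",
    id="42",
    body={"doc": {"status": "updated"}},
    refresh="wait_for",
)

# A search issued after the call above should see the new revision
print(es.search(index="my-index", body={"query": {"terms": {"_id": ["42"]}}}))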
Let me know if you face any issues or have more questions.

409 error when using streaming_bulk() - certain that each document is only included once

I am attempting to upload a large number of documents - about 7 million.
I have created actions for each document to be added and split them up into about 260 files, about 30K documents each.
Here is the format of the actions:
a = someDocument  # a document with nested fields
esActionFromFile = [{
    '_index': 'mt-interval-test-9',
    '_type': 'doc',
    '_id': 5641254,
    '_source': a,
    '_op_type': 'create'
}]
I have tried using helpers.bulk, helpers.parallel_bulk, and helpers.streaming_bulk and have had partial success using helpers.bulk and helpers.streaming_bulk.
Each time I run a test, I delete, and then recreate the index using:
# Refresh Index
es.indices.delete(index=index, ignore=[400, 404])
es.indices.create(index=index, body=mappings_request_body)
When I am partially successful - many documents are loaded, but eventually I get a 409 version conflict error.
I am aware that there can be version conflicts created when there has not been sufficient time for ES to process the deletion of individual documents after doing a delete by query.
At first, I thought that something similar was happening here. However, I realized that I am often getting the errors from files the first time they have ever been processed (i.e. even if the deletion was causing issues, this particular file had never been loaded, so there couldn't be a conflict).
The _id value I am using is the primary key from the original database from which I am extracting the data, so I am certain the values are unique. Furthermore, I have checked whether there was unintentional duplication of records in my actions arrays, or in the files I created them from, and there are no duplicates.
I am at a loss to explain why this is happening, and struggling to find a solution to upload my data.
Any assistance would be greatly appreciated!
There should be information attached to the 409 response that should tell you exactly what's going wrong and which document caused it.
Another thing that could cause this would be a retry: when elasticsearch-py cannot connect to the cluster, it will resend the request to a different node. In some complex scenarios it can happen that a request is thus sent twice. This is especially true if you have enabled the retry_on_timeout option.
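To get at that attached information (and to rule the retry scenario out), one option is to iterate over the per-item results instead of letting the helper raise. A minimal sketch, assuming elasticsearch-py and a local cluster; the file layout and the load_actions helper are placeholders, not part of the original setup:

import json
from elasticsearch import Elasticsearch, helpers

# max_retries=0 rules out the "request sent twice" scenario described above
es = Elasticsearch(["http://localhost:9200"], retry_on_timeout=False, max_retries=0)

def load_actions(path):
    # placeholder: read one prepared file of action dicts, one JSON object per line
    with open(path) as f:
        for line in f:
            yield json.loads(line)

for ok, item in helpers.streaming_bulk(es, load_actions("actions_001.jsonl"),
                                       raise_on_error=False):
    if not ok:
        # item maps the op type ('create') to the full error body,
        # including the offending _id and the 409 reason
        print(item)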

How to get all the videos of a YouTube channel with the Yt gem?

I want to use the Yt gem to get all the videos of a channel. I configured the gem with my YouTube Data API key.
Unfortunately when I use it it returns a maximum of ~1000 videos, even for channels having more than 1000 videos. Yt::Channel#video_count returns the correct number of videos.
channel = Yt::Channel.new id: "UCGwuxdEeCf0TIA2RbPOj-8g"
channel.video_count # => 1845
channel.videos.map(&:id).size # => 949
The YouTube API can't be set to return more than 50 items per request, so I guess Yt automatically performs several requests, going through each next page of results, to be able to return more than 50 results.
For some reason, though, it does not go through all the result pages. I don't see a way in Yt to control how it goes through the pages of results. In particular, I could not find a way to force it to fetch a single page of results and access the returned nextPageToken value in order to perform a new request with that value.
Any idea?
Looking into the gem's /spec folder, you can see a test for your use case.
describe 'when the channel has more than 500 videos' do
  let(:id) { 'UC0v-tlzsn0QZwJnkiaUSJVQ' }
  specify 'the estimated and actual number of videos can be retrieved' do
    # @note: in principle, the following three counters should match, but
    #   in reality +video_count+ and +size+ are only approximations.
    expect(channel.video_count).to be > 500
    expect(channel.videos.size).to be > 500
  end
end
I did some tests and what I have noticed is that video_count is the number displayed on YouTube next to the channel's name. This value is not accurate; I am not really sure what it represents.
If you do channel.videos.size, the number is not accurate either, because the videos collection can contain some empty(?) records.
If you do channel.videos.map(&:id).size, the returned value should be correct. By correct I mean it should equal the number of videos listed at:
https://www.youtube.com/channel/:channel_id/videos

Solr performance with commitWithin does not make sense

I am running a very simple performance experiment where I post 2000 documents to my application, which in turn persists them to a relational DB and sends them to Solr for indexing (synchronously, in the same request).
I am testing 3 use cases:
No indexing at all - ~45 sec to post 2000 documents
Indexing included - commit after each add. ~8 minutes (!) to post and index 2000 documents
Indexing included - commitWithin 1ms. ~55 seconds (!) to post and index 2000 documents
The 3rd result does not make any sense; I would expect the behavior to be similar to the one in point 2. At first I thought that the documents were not really committed, but I could actually see them being added by executing some queries during the experiment (via the Solr web UI).
I am worried that I am missing something very big. Is it possible that committing after each add will degrade performance by a factor of 400?!
The code I use for point 2:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc);
solrConnection.commit();
Whereas the code for point 3 is:
SolrInputDocument doc = // get doc
SolrServer solrConnection = // get connection
solrConnection.add(doc, 1); // According to the API documentation, there is no need to call an explicit commit after this
According to this wiki:
https://wiki.apache.org/solr/NearRealtimeSearch
the commitWithin is a soft commit by default. Soft commits are very efficient in terms of making the added documents immediately searchable. But! They are not on disk yet. That means the documents are being committed into RAM. In this setup you would use the updateLog to make the Solr instance crash-tolerant.
What you do in point 2 is a hard commit, i.e. flushing the added documents to disk. Doing this after each document add is very expensive. So instead, post a bunch of documents and issue a single hard commit, or even have your autoCommit set to some reasonable value, like 10 minutes or 1 hour (depending on your users' expectations).
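The snippets above use the SolrJ client; purely as an illustration of the batch-then-commit idea in another client, here is a sketch with pysolr (the core URL, field names and document contents are placeholders, not taken from the original setup):

import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=30)

# Placeholder documents standing in for the 2000 posted by the application
docs = [{"id": str(i), "title_t": "document %d" % i} for i in range(2000)]

# One batch with commitWithin: Solr promises a commit within 1000 ms
# instead of paying the cost of a hard commit per document.
solr.add(docs, commitWithin=1000)

# Alternatively: add without committing and issue a single hard commit at the end
# solr.add(docs, commit=False)
# solr.commit()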

Zend Framework Cache

I'm trying to make an Ajax autocomplete search box that of course uses SQL (minimum 3 characters), and I have an SQL view of the relevant fields already set up and indexed in the DB. The CPU still spikes when searching, which I expected, as it's running a query for every character. I want to use the Zend shm cache to speed up results and reduce CPU usage. The results are stored in an array which is to be cached like this:
while ($row = db2_fetch_row($stmt)) {
    $fSearch[trim($row[0]) . trim($row[1])] = array(/*array built here*/);
}
if (zend_shm_cache_store('fSearch', $fSearch, 10 * 60) === false) {
    error_log('Failed to store search cache!');
}
Of course there's actual data inside the array instead of comments; I just shortened the code for simplicity. Rows 0 & 1 form the PK, and this has been tested to be working properly. It's the zend_shm_cache_store call that fails, because the error log gets flooded with 'Failed to store search cache!'. I read that zend_shm_cache_store can store any array that can be serialized - how can I tell if my data is serialized or can be serialized? Are there any other potential causes? I did make a test page that only stored a string, and that was successful, so I know caching is on.
Solved: the cache size was too small for the array. I increased the cache size and it worked fine. Sorry for the trouble.
