C# NEST Bulk api failing with System.IO.IOException [duplicate] - elasticsearch

This question already has an answer here:
Elasticsearch bulk insert with NEST returns es_rejected_execution_exception
(1 answer)
Closed 5 years ago.
I am trying to bulk insert data from SQL into an Elasticsearch index. Below is the code I am using; the total number of records is around 1.5 million. I think it has something to do with the connection settings, but I am not able to figure it out. Can someone please help with this code or suggest a better way to do it?
public void InsertReceipts()
{
    IEnumerable<Receipts> receipts = GetFromDB(); // get receipts from SQL DB
    const string index = "receipts";
    var config = ConfigurationManager.AppSettings["ElasticSearchUri"];
    var node = new Uri(config);
    var settings = new ConnectionSettings(node).RequestTimeout(TimeSpan.FromMinutes(30));
    var client = new ElasticClient(settings);
    var bulkIndexer = new BulkDescriptor();

    foreach (var receiptBatch in receipts.Batch(20000)) // using MoreLinq for Batch
    {
        Parallel.ForEach(receiptBatch, (receipt) =>
        {
            bulkIndexer.Index<OfficeReceipt>(i => i
                .Document(receipt)
                .Id(receipt.TransactionGuid)
                .Index(index));
        });

        var response = client.Bulk(bulkIndexer);
        if (!response.IsValid)
        {
            _logger.LogError(response.ServerError.ToString());
        }
        bulkIndexer = new BulkDescriptor();
    }
}
The code works fine but takes around 10 minutes to complete. When I try to increase the batch size, it fails with the error below:
Invalid NEST response built from a unsuccessful low level call on
POST: /_bulk
Invalid Bulk items: OriginalException: System.Net.WebException: The
underlying connection was closed: An unexpected error occurred on a
send. ---> System.IO.IOException: Unable to write data to the
transport connection: An existing connection was forcibly closed by
the remote host. ---> System.Net.Sockets.SocketException: An existing
connection was forcibly closed by the remote host

A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
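For reference, a minimal reworking of the loop from the question along those lines might look like the sketch below. It reuses the variables from the question's method; the 2,000-document batch size is an assumed value chosen to land in the 1,000-5,000 range, and the descriptor is filled serially rather than from Parallel.ForEach, since a shared BulkDescriptor is not designed for concurrent mutation:

foreach (var receiptBatch in receipts.Batch(2000)) // smaller batches, MoreLinq Batch
{
    var bulkIndexer = new BulkDescriptor();
    foreach (var receipt in receiptBatch)
    {
        bulkIndexer.Index<OfficeReceipt>(i => i
            .Document(receipt)
            .Id(receipt.TransactionGuid)
            .Index(index));
    }

    var response = client.Bulk(bulkIndexer); // one HTTP request per batch
    if (!response.IsValid)
    {
        _logger.LogError(response.ServerError?.ToString());
    }
}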

I had a similar problem. My problem was solved by adding the following code before the ElasticClient connection is established:
System.Net.ServicePointManager.Expect100Continue = false;
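For context, the flag just needs to be set before the client sends its first request; a small sketch reusing the setup from the question:

// Disable the "Expect: 100-continue" handshake before the first request goes out.
System.Net.ServicePointManager.Expect100Continue = false;

var node = new Uri(ConfigurationManager.AppSettings["ElasticSearchUri"]);
var settings = new ConnectionSettings(node).RequestTimeout(TimeSpan.FromMinutes(30));
var client = new ElasticClient(settings);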

Related

Databricks reading duplicate rows coming from Event Hubs

I have been trying to read some data in Azure Databricks coming from Event Hubs. On the first read it reads the data correctly, but when I send the same data or some different data minutes or hours later, it reads both my previous records and the new ones. I only want to read the new stream, and I am stuck on how to read only the new records and not the duplicates.
The code I am using is written below. Please note I am using an Azure Free Tier account along with Databricks Community Edition.
connectionString = "Endpoint=sb://ingestionlayer1.servicebus.windows.net/;SharedAccessKeyName=userpolicy2;SharedAccessKey=...=;EntityPath=dataingestion1"
ehConf = {}
ehConf['eventhubs.connectionString'] = connectionString
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString)
ehConf['eventhubs.consumerGroup'] = "$default"
import json
startingEventPosition = {
"offset": -1,
"seqNo": -1,
"enqueuedTime": None,
"isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)
# Read events from the Event Hub
df_new = spark.readStream.format("eventhubs").options(**ehConf).load()
read_stream = df_new.withColumn("body", df_new["body"].cast("string"))
#display(read_stream)
# Writing to Data lake
saveloc = "/mnt/streamingdatastorage2/datastorage2/Bronze/Table1"
read_stream2.writeStream.format("delta").option("checkpointLocation", f"{saveloc}/_checkpoint").start(saveloc)
I would really appreciate it if anyone could help me out with this!

Jdbi transaction - multiple methods - Resources should be closed

Suppose I want to run two SQL queries in a transaction. I have code like the below:
jdbi.useHandle(handle -> handle.useTransaction(h -> {
    var id = handle.createUpdate("some query")
        .executeAndReturnGeneratedKeys()
        .mapTo(Long.class)
        .findOne()
        .orElseThrow(() -> new IllegalStateException("No id"));
    handle.createUpdate("INSERT INTO SOMETABLE (id) " +
            "VALUES (:id , xxx);")
        .bind("id", id)
        .execute();
}));
Now as the complexity grows I want to extract each update into its own method:
jdbi.useHandle(handle -> handle.useTransaction(h -> {
    var id = someQuery1(h);
    someQuery2(id, h);
}));
...with someQuery1 looking like:
private Long someQuery1(Handle handle) {
    return handle.createUpdate("some query")
        .executeAndReturnGeneratedKeys()
        .mapTo(Long.class)
        .findOne()
        .orElseThrow(() -> new IllegalStateException("No id"));
}
Now when I refactor to the latter, I get a SonarQube blocker bug on the handle.createUpdate call in someQuery1 stating:
Resources should be closed
Connections, streams, files, and other classes that implement the Closeable interface or its super-interface, AutoCloseable, need to be closed after use...
I was under the impression that, because I'm using jdbi.useHandle (and passing the same handle to the called methods), a callback would be used and the handle immediately released upon return. As per the Jdbi docs:
Both withHandle and useHandle open a temporary handle, call your
callback, and immediately release the handle when your callback
returns.
Any help / suggestions appreciated.
TIA
SonarQube doesn't know any specifics of the Jdbi implementation and simply triggers on an AutoCloseable/Closeable not being closed. Just suppress the Sonar issue and/or file a feature request with the SonarQube team to improve this behavior.

Query all in ElasticSearch using Nest v. 2.1

var settings = new ConnectionSettings(Constants.ElasticSearch.Node);
var client = new ElasticClient(settings);
var response = client.Search<DtoTypes.Customer.SearchResult>(s =>
s.From(0)
.Size(100000)
.Query(q => q.MatchAll()));
It works when the size is smaller, but I want to retrieve all documents in an index that has over 100k documents. There must be a configuration setting I'm missing to get around a limit. I've also tried Take() instead of Size().
The debug info returned is:
"Invalid NEST response built from a unsuccesful low level call on
POST: /_search\r\n# Audit trail of this API call:\r\n - BadResponse:
Node: http://127.0.0.1:9200/ Took: 00:00:00.2964038\r\n# ServerError:
ServerError: 500Type: search_phase_execution_exception Reason: \"all
shards failed\"\r\n# OriginalException: System.Net.WebException: The
remote server returned an error: (500) Internal Server Error.\r\n at
System.Net.HttpWebRequest.GetResponse()\r\n at
Elasticsearch.Net.HttpConnection.Request[TReturn](RequestData
requestData) in
C:\users\russ\source\elasticsearch-net\src\Elasticsearch.Net\Connection\HttpConnection.cs:line
138\r\n# Request:\r\n\r\n#
Response:\r\n\r\n"
Elasticsearch has a soft limit on the number of results it allows to return. If you want more than 10,000 results in one go, you should use the scan and scroll functionality (see the sketch after the references) :)
From the Elasticsearch documentation:
"Note that from + size can not be more than the
index.max_result_window index setting which defaults to 10,000. See
the Scroll API for more efficient ways to do deep scrolling."
Reference:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
https://nest.azurewebsites.net/nest/search/scroll.html
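A rough sketch of what that scrolling loop can look like with NEST 2.x, reusing the types from the question (the 1,000-document page size and the "2m" scroll timeout are arbitrary choices for illustration, not requirements):

// (assumes using Nest; using System.Collections.Generic; using System.Linq;)
var settings = new ConnectionSettings(Constants.ElasticSearch.Node);
var client = new ElasticClient(settings);
var allDocs = new List<DtoTypes.Customer.SearchResult>();

// Open a scroll context and fetch the first page.
var response = client.Search<DtoTypes.Customer.SearchResult>(s => s
    .Size(1000)        // documents per page
    .Scroll("2m")      // keep the scroll context alive between pages
    .Query(q => q.MatchAll()));

// Keep pulling pages with the scroll id until no documents come back.
while (response.IsValid && response.Documents.Any())
{
    allDocs.AddRange(response.Documents);
    response = client.Scroll<DtoTypes.Customer.SearchResult>("2m", response.ScrollId);
}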

Chrome.serial onReceive is slow

Hello, I am using the Chrome serial API to receive data at intervals of 50-100ms.
Here is how I connect to the device:
serial.connect("COM180", {bitrate: 115200,ctsFlowControl:true}, onDeviceEvent);
chrome.serial.onReceive.addListener(parseDataFromDevice);
My parseDataFromDevice is sometimes called at the specified interval of 50-100ms, but sometimes it fails to keep up and delivers messages every 200ms.
I am using this to convert the received data to a string:
var ab2str = function(buf) {
    return String.fromCharCode.apply(null, new Uint8Array(buf));
};
Any idea why this is happening and why it is so slow?

BigQuery: 403 User Rate Limit Exceeded but error not shown in joblist

I'm receiving a 403 User Rate Limit Exceeded error when making queries, but I'm sure I'm not exceeding the limit.
In the past I've reached the rate limit doing inserts and it was reflected in the job list as:
[errorResult] => Array
(
[reason] => rateLimitExceeded
[message] => Exceeded rate limits: too many imports for this project
)
But in this case the job list doesn't reflect the query (neither an error nor a done state), and studying the job list I haven't reached the limits or come close to them (no more than 4 concurrent queries, each processing 692297 bytes).
I have billing active, and I've made only 2.5K queries in the last 28 days.
Edit: The user limit is set to 500.0 requests/second/user.
Edit: Error code received:
User Rate Limit Exceeded User Rate Limit Exceeded
Error 403
Edit: the code I use to create the query jobs and get the results:
function query_data($project,$dataset,$query,$jobid=null){
    $jobc = new JobConfigurationQuery();
    $query_object = new QueryRequest();
    $dataset_object = new DatasetReference();
    $dataset_object->setProjectId($project);
    $dataset_object->setDatasetId($dataset);
    $query_object->setQuery($query);
    $query_object->setDefaultDataset($dataset_object);
    $query_object->setMaxResults(16000);
    $query_object->setKind('bigquery#queryRequest');
    $query_object->setTimeoutMs(0);

    $ok = false;
    $sleep = 1;
    while(!$ok){
        try{
            $response_data = $this->bq->jobs->query($project, $query_object);
            $ok = true;
        }catch(Exception $e){ // sleep when the BQ API is not available
            sleep($sleep);
            $sleep += rand(0,60);
        }
    }

    try{
        $response = $this->bq->jobs->getQueryResults($project, $response_data['jobReference']['jobId']);
    }catch(Exception $e){
        // do nothing, it is retried automatically below
    }

    $tries = 0;
    while(!$response['jobComplete']&&$tries<10){
        sleep(rand(5,10));
        try{
            $response = $this->bq->jobs->getQueryResults($project, $response_data['jobReference']['jobId']);
        }catch(Exception $e){
            // do nothing, it is retried automatically
        }
        $tries++;
    }

    $result=array();
    foreach($response['rows'] as $k => $row){
        $tmp_row=array();
        foreach($row['f'] as $field => $value){
            $tmp_row[$response['schema']['fields'][$field]['name']] = $value['v'];
        }
        $result[]=$tmp_row;
        unset($response['rows'][$k]);
    }
    return $result;
}
Are there any other rate limits, or is it a bug?
Thanks!
You get this error when trying to import CSV files, right?
It could be one of these reasons:
Import Requests
Rate limit: 2 imports per minute
Daily limit: 1,000 import requests per day (including failures)
Maximum number of files to import per request: 500
Maximum import size per file: 4GB
Maximum import size per job: 100GB
The query() call is, in fact, limited by the 20-concurrent-query limit. The 500 requests/second/user limit in the developer console is somewhat misleading; it is just the total number of API calls (get, list, etc.) that can be made.
Are you saying that your query is failing immediately and never shows up in the job list?
Do you have the full error that is being returned? I.e. does the 403 message contain any additional information?
thanks
I've solved the problem by using only one server to make the requests.
Looking at what I was doing differently in the nightly cron jobs (which never fail), the only difference was that I was using a single client on one server instead of different clients on 4 different servers.
Now I have only one script on one server that manages the same number of queries, and it never gets the User Rate Limit Exceeded error.
I think there is a bug when managing many clients or many active IPs at a time, although the total number of threads never exceeds 20.
