Debugging data structure errors in BigQuery

BigQuery often ends bq load with an ambiguous error:
Waiting on <jobid> ... (68s) Current status: DONE
BigQuery error in load operation: Error processing job
'<jobid>': Too many errors
encountered. Limit is: {1}.
When I do bq --format=prettyjson show -j <jobid> to find out what's wrong, I get:
"status": {
"errorResult": {
"message": "Too many errors encountered. Limit is: {1}.",
"reason": "invalid"
},
"errors": [
{
"message": "Too many errors encountered. Limit is: {1}.",
"reason": "invalid"
}
],
"state": "DONE"
},
This usually indicates that bq dislikes something about the data structure.
But how can I find out what is wrong? Which row or column does bq exit on with an error?
Update
Apparently, bq sometimes returns "Failure details" that say which column and line caused an error, but I couldn't reliably reproduce them; they seem to appear arbitrarily for the same instance, data, and command.
I found a few options in bq help load to let the data pass through:
--[no]autodetect: Enable auto detection of schema and options for formats that are not self
describing like CSV and JSON.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized values in CSV or
JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job fails.
(default: '0')
(an integer)
These options allow bad records to be dropped, but many rows may be lost, and I couldn't find where bq reports the number of dropped rows.
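If you load the data programmatically instead, the per-record failure details are easier to get at. Here is a hedged sketch using the google-cloud-bigquery Python client; the dataset, table, and file names are placeholders, and the options mirror the bq flags above:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,        # same as --autodetect
    max_bad_records=100,    # same as --max_bad_records
)

with open("data.csv", "rb") as f:            # placeholder file name
    job = client.load_table_from_file(
        f, "my_dataset.my_table", job_config=job_config
    )

try:
    job.result()                             # wait for the load to finish
except Exception:
    pass                                     # errors stay available on the job object

# Per-record failure details (when BigQuery reports them) appear here,
# including the offending line/field where available.
for err in job.errors or []:
    print(err)

print("rows loaded:", job.output_rows)       # compare with the source row count
The individual entries correspond to the errors array in the job resource shown above, so iterating over them is one way to see how many records were rejected and why.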

Related

How to use the slice function for JSON format in Ruby?

{
"Message": "Action completed. Completed the Request. One or more of the subsequent operations did not succeed. Please check the logs.",
"Details": null
}
From the above JSON response I want to print only "Action completed. Completed the Request".
Here is another approach using the slice function mentioned in the question. I've used a regexp match to extract the part of the string you're interested in.
require "json"
data = JSON.parse('{
"Message": "Action completed. Completed the Request. One or more of the subsequent operations did not succeed. Please check the logs.",
"Details": null
}')
# Non-greedy match: everything up to and including the second period.
res = data.inject('') { |r, (k, v)| r = v.slice(/^.+?\..*?\./) if k == "Message"; r }
# => "Action completed. Completed the Request."
As @andredurao said, there are multiple ways to achieve this, and slice is not the only one.
There are many different ways to do this...
One of them is:
split the message into sentences, using the dots as separators (split("."))
take the first two items from the split ([0..1])
join them back with a dot (join("."))
require "json"
payload = '{
"Message": "Action completed. Completed the Request. One or more of the subsequent operations did not succeed. Please check the logs.",
"Details": null
}'
data = JSON.parse(payload)
puts data["Message"].split(".")[0..1].join(".") # "Action completed. Completed the Request"

Elasticsearch not able to return data above a 10,000 offset; I am not allowed to make index-level changes and can't use the Scroll API

I am running the ES query step by step with different offsets and limits, for example 100 to 149, then 150 to 199, then 200 to 249, and so on.
When offset + limit exceeds 10,000, I get the exception below:
{
"error": {
"root_cause": [
{
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "xyz",
"node": "123",
"reason": {
"type": "query_phase_execution_exception",
"reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter."
}
}
]
},
"status": 500
}
I know we can solve this by increasing max_result_window; I tried it and it helped. I increased it to 15,000 and then 30,000, but I am not allowed to make index-level changes, so I changed it back to the default of 10,000.
How can I solve this problem? This query is hit by an API call.
There are two approaches that worked for me:
increasing the max_result_window
using a filter
a. on the unique id of the data records
b. on a time frame
The first approach was applied as follows:
PUT /index/_settings
{ "max_result_window" : 10000 }
This worked and solved my problem, but the number of records is dynamic and growing fast, so it is not a good idea to keep increasing this window. Also, in my case the index is shared, so the change would affect every user and group on that index. So we moved on to the second approach.
Second approach
Part 1: First I applied a filter on the last-update timestamp; if the record count was greater than 10K, I divided the time frame in half and kept doing so until the count dropped below 10K.
Part 2: As the same data is also available in OLTP, I got the complete list of unique identifiers and sorted it. Then I applied a filter on that identifier and fetched data only in ranges of 10K. Once 10K records were fetched using pagination, I changed the filter and moved to the next batch of 10K.
Part 3: I applied sorting on the last-updated timestamp and started fetching data using pagination. Once the record count reached 10K, I took the timestamp of the 9,999th record, applied a greater-than filter on it, and fetched the next 10K records.
All of the mentioned solutions helped me, but I selected Part 3 of the second approach, as it is easy to implement and quickly gives sorted data (a rough sketch follows below).
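Here is a minimal sketch of the Part 3 approach, assuming the elasticsearch Python client, an index named xyz, and a last_updated timestamp field (both names are placeholders, not the exact setup from the question):
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

PAGE_SIZE = 1000      # stay well under the 10,000 result window
last_seen = None      # timestamp of the last record of the previous batch

while True:
    query = {"match_all": {}} if last_seen is None else {
        "range": {"last_updated": {"gt": last_seen}}
    }
    resp = es.search(
        index="xyz",
        body={
            "query": query,
            "sort": [{"last_updated": "asc"}],
            "size": PAGE_SIZE,
        },
    )
    hits = resp["hits"]["hits"]
    if not hits:
        break
    # ... process the batch here ...
    last_seen = hits[-1]["_source"]["last_updated"]
Note that a plain greater-than filter can skip records that share the timestamp of the last document in a batch; sorting on a second tie-breaker field (or using search_after) avoids that.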
Consider the scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This is also suggested in the manual.
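For completeness, if the scroll restriction could ever be lifted, the Python client's scan helper wraps the scroll API and streams every matching document without hitting the 10,000-result window; here is a rough sketch (index name is a placeholder):
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])

# scan() wraps the scroll API and keeps fetching batches until the
# whole result set has been streamed.
for hit in scan(es, index="xyz", query={"query": {"match_all": {}}}):
    print(hit["_id"])     # replace with your own handling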

Message order with Kafka Connect Elasticsearch Connector

We are having problems enforcing the order in which messages from a Kafka topic are sent to Elasticsearch using the Kafka Connect Elasticsearch Connector. In the topic the messages are in the right order with the correct offsets, but if two messages with the same ID are created in quick succession, they are intermittently sent to Elasticsearch in the wrong order. This causes Elasticsearch to end up with the data from the second-to-last message rather than the last one. If we add an artificial delay of a second or two between the two messages in the topic, the problem disappears.
The documentation here states:
Document-level update ordering is ensured by using the partition-level
Kafka offset as the document version, and using version_mode=external.
However, I can't find any documentation about this version_mode setting anywhere, or whether it's something we need to set ourselves somewhere.
In the log files from the Kafka Connect system we can see the two messages (for the same ID) being processed in the wrong order, a few milliseconds apart. It might be significant that it looks like these are processed in different threads. Also note that there is only one partition for this topic, so all messages are in the same partition.
Below is the log snippet, slightly edited for clarity. The messages in the Kafka topic are populated by Debezium, which I don't think is relevant to the problem, but handily happens to include a timestamp value. This shows that the messages are processed in the wrong order (though they're in the correct order in the Kafka topic, populated by Debezium):
[2019-01-17 09:10:05,671] DEBUG http-outgoing-1 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE SECOND UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER SECOND UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716205205
}
" (org.apache.http.wire)
...
[2019-01-17 09:10:05,696] DEBUG http-outgoing-2 >> "
{
"op": "u",
"before": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM BEFORE FIRST UPDATE >> ...
},
"after": {
"id": "ac025cb2-1a37-11e9-9c89-7945a1bd7dd1",
... << DATA FROM AFTER FIRST UPDATE >> ...
},
"source": { ... },
"ts_ms": 1547716204190
}
" (org.apache.http.wire)
Does anyone know how to force this connector to maintain message order for a given document ID when sending the messages to Elasticsearch?
The problem was that our Elasticsearch connector had the key.ignore configuration set to true.
We spotted this line in the Github source for the connector (in DataConverter.java):
final Long version = ignoreKey ? null : record.kafkaOffset();
This meant that, with key.ignore=true, the indexing operations that were being generated and sent to Elasticsearch were effectively "versionless" ... basically, the last set of data that Elasticsearch received for a document would replace any previous data, even if it was "old data".
From looking at the log files, the connector seems to have several consumer threads reading the source topic, then passing the transformed messages to Elasticsearch, but the order that they are passed to Elasticsearch is not necessarily the same as the topic order.
Using key.ignore=false, each Elasticsearch message now contains a version value equal to the Kafka record offset, and Elasticsearch refuses to update the index data for a document if it has already received data for a later "version".
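To illustrate the external-versioning behaviour the connector relies on, here is a small sketch using the Python Elasticsearch client (this is not part of the connector; the index name, document id, and values are made up):
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# First write arrives with version 42 (think: the Kafka record offset).
es.index(index="test-idx", id="ac025cb2", body={"field": "newer data"},
         version=42, version_type="external")

# A late write with a lower version (an earlier offset) is rejected with
# a version-conflict error instead of overwriting the newer document.
try:
    es.index(index="test-idx", id="ac025cb2", body={"field": "older data"},
             version=41, version_type="external")
except Exception as e:
    print("rejected:", e)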
That wasn't the only change needed: we still had to apply a transform to the Debezium message from the Kafka topic to get the key into a plain-text format that Elasticsearch was happy with:
"transforms": "ExtractKey",
"transforms.ExtractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.ExtractKey.field": "id"

Field [] used in expression does not exist in mappings

The feature I am trying to implement is a metric in Kibana that displays the number of "unvalidated" users.
I send a log when a user registers, then another log when the user is validated.
So the count I want is the difference between the number of registered users and the number of validated users.
In Kibana I cannot do such a math operation, so I found a workaround:
I added a "scripted field" named "unvalidated" which is equal to 1 when a user registers and -1 when a user validates his account.
The sum of the "unvalidated" field should then be the number of unvalidated users.
This is the script I defined in my scripted field:
doc['ctxt_code'].value == 1 ? 1 : doc['ctxt_code'].value == 2 ? -1 : 0
with:
ctxt_code 1 as the register log
ctxt_code 2 as the validated log
This setup works well when all my logs have a "ctxt_code", but when a log without this field is pushed, Kibana throws the following error:
Field [ctxt_code] used in expression does not exist in mappings
I can't understand this error, because Kibana says:
If a field is sparse (only some documents contain a value), documents missing the field will have a value of 0
which is the case here.
Does anyone have a clue?
It's OK to have logs without the ctxt_code field... but you have to have a mapping for this field in your indices. I see you're querying multiple indices with logstash-*, so you are probably hitting one that does not have it.
You can include a mapping for your field in all indices. Just go into Sense and use this:
PUT logstash-*/_mappings/[your_mapping_name]
{
"properties": {
"ctxt_code": {
"type": "short",
"index": "not_analyzed"
}
}
}
Any other numeric type (including dates) works for "type"; "index": "not_analyzed" is used because this only works for non-analyzed fields.
If you prefer, you can do it from the command line: curl -XPUT 'http://[elastic_server]/logstash-*/_mappings/[your_mapping_name]' -d '{ ... same JSON ... }'

Google Custom Search API number of results

How is it possible to get more than 10 results with the Google Custom Search API? I think it just takes results from the first page... when I ask for more than 10 I get this error:
Here is the request:
https://www.googleapis.com/customsearch/v1?q=Montenegro&cx=002715630024689775911%3Ajczmrpp_vpo&num=10&key={YOUR_API_KEY}
where num=10 is the number of results.
400 Bad Request
{
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Invalid Value"
}
],
"code": 400,
"message": "Invalid Value"
}
}
Well, it is not possible to get more than 10 results per request from the Google Custom Search API.
https://developers.google.com/custom-search/v1/using_rest#query-params
As you can see, the only valid values for the num parameter are between 1 and 10 inclusive.
To get more results you should make multiple calls, increasing the value of the 'start' parameter by 10 on each call. That should do it.
For the first page of results, use
https://www.googleapis.com/customsearch/v1?q=Montenegro&cx=002715630024689775911%3Ajczmrpp_vpo&num=10&start=1&key={YOUR_API_KEY}
This query asks Google to provide 10 results starting from position 1. You cannot ask Google for more than 10 results at a time, so you have to query again for 10 results starting from 11: in the next query, keep num=10 and set start=11. You can then collect all the results by stepping the start value (a rough sketch follows below).
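Here is a minimal sketch of that pagination loop in Python, assuming the requests library; the API key is a placeholder, and the cx value is the one from the question:
import requests

API_KEY = "YOUR_API_KEY"                      # placeholder
CX = "002715630024689775911:jczmrpp_vpo"      # cx from the request above

results = []
for start in range(1, 92, 10):                # the API serves at most about 100 results in total
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"q": "Montenegro", "cx": CX, "key": API_KEY,
                "num": 10, "start": start},
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        break                                 # no more results
    results.extend(items)

print(len(results), "results fetched")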

Resources