HBase scan returning data out of range - hadoop

I was doing a scan with a start row key and a stop row key in the HBase shell, but the output I am receiving is outside the range I passed. Please refer to the HBase query:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.SubstringComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'TableName',{ LIMIT => 2 , STARTROW => '000|9223370554721275807', STOPROW => '101|9223370554727575807', FILTER => SingleColumnValueFilter.new(Bytes.toBytes('col_family'), Bytes.toBytes('col_qualifier'), CompareFilter::CompareOp.valueOf('EQUAL'), Bytes.toBytes('Some Value')), COLUMNS => 'col_family:col_qualifier', REVERSED => false}
But the output I received is outside this range:
016|9223370554960173487
021|9223370555154148992
Please let me know whether my scan query is correct, or what could be the root cause of this? Any help will be really appreciated.
Thanks

If you put the four row keys mentioned in your question in a file and sort them, the result will be:
000|9223370554721275807
016|9223370554960173487
021|9223370555154148992
101|9223370554727575807
Thus the values you received are not outside the range of your scan: HBase orders row keys lexicographically as byte strings, not numerically.
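That ordering is just an ordinary string sort, which a quick illustrative snippet (not part of the original answer) makes explicit:

# HBase compares row keys lexicographically (byte by byte), not numerically.
row_keys = [
    '000|9223370554721275807',
    '101|9223370554727575807',
    '016|9223370554960173487',
    '021|9223370555154148992',
]
for key in sorted(row_keys):
    print(key)
# 000|9223370554721275807
# 016|9223370554960173487
# 021|9223370555154148992
# 101|9223370554727575807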

Related

AWS Lambda Python boto3 reading from a DynamoDB table with multiple attributes in KeyConditionExpression

basicSongsTable has 'artist' as the partition key and 'song' as the sort key.
I am able to read using Query if I have one artist, but I want to read two artists with the following code. It gives a vague error saying "errorMessage": "Syntax error in module 'lambda_function': positional argument follows keyword argument (lambda_function.py, line 17)".
import boto3
import pprint
from pprint import pprint

dynamodbclient = boto3.client('dynamodb')

def lambda_handler(event, context):
    response = dynamodbclient.query(
        TableName='basicSongsTable',
        KeyConditionExpression='artist = :varartistname1', 'artist =:varartistname2',
        ExpressionAttributeValues={
            ':varartistname1': {'S': 'basam'},
            ':varartistname2': {'S': 'sree'}
        }
    )
    pprint(response['Items'])
If I give only one KeyConditionExpression, it works:
        KeyConditionExpression='artist = :varartistname1',
        ExpressionAttributeValues={
            ':varartistname1': {'S': 'basam'}
        }
As per documentation:
KeyConditionExpression (string) --
The condition that specifies the key values for items to be retrieved
by the Query action.
The condition must perform an equality test on a single partition key
value.
What you are trying to do is perform an equality test on multiple partition key values, which doesn't work.
To get data for both artists, you will have to either run two queries or do a scan (which I do not recommend).
For other options, I would recommend taking a look at this answer and its pros and cons.
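The two-query approach could look roughly like this with the low-level boto3 client (a minimal sketch, not from the original answer, reusing the table and attribute names from the question):

import boto3
from pprint import pprint

dynamodbclient = boto3.client('dynamodb')

def lambda_handler(event, context):
    items = []
    # One Query call per partition key value; KeyConditionExpression only
    # supports an equality test against a single partition key value.
    for artist in ('basam', 'sree'):
        response = dynamodbclient.query(
            TableName='basicSongsTable',
            KeyConditionExpression='artist = :varartistname',
            ExpressionAttributeValues={':varartistname': {'S': artist}}
        )
        items.extend(response['Items'])
    pprint(items)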

How to get specific rows in Hbase?

My row keys in HBase look like this:
a1s1
a1s2
a1s3
a2s1
a3s1
a3s2
...
I want to get only this data:
a1s1
a2s1
a3s1
But when I run this query: scan 't1', {STARTROW=>'a1s1', ENDROW=>'a4s1'}
it gives me:
a1s1
a1s2
a1s3
a2s1
a3s1
But I don't want to get a1s2 and a1s3. How can I do this?
You should combine STARTROW/ENDROW with an additional RowFilter using a RegexStringComparator. If you use only the start/stop row range, HBase compares the row keys lexicographically, character by character (the row key is not numeric), so every key falling between the bounds is returned. In the HBase shell you can try this:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.RegexStringComparator
scan 't1', {STARTROW => 'a1s1', ENDROW => 'a4s1', FILTER => org.apache.hadoop.hbase.filter.RowFilter.new(CompareFilter::CompareOp.valueOf('EQUAL'),RegexStringComparator.new("s1$"))}
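For reference, roughly the same scan can be issued from Python through the HBase Thrift gateway with the happybase library (a hedged sketch, not part of the original answers; the Thrift host name is a placeholder):

import happybase

# Connect to an HBase Thrift server (hypothetical host name).
connection = happybase.Connection('hbase-thrift-host')
table = connection.table('t1')

# Same idea as the shell command above: restrict the row range and keep only
# keys ending in "s1" via a RowFilter with a RegexStringComparator.
for key, data in table.scan(row_start=b'a1s1',
                            row_stop=b'a4s1',
                            filter="RowFilter(=, 'regexstring:s1$')"):
    print(key, data)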
I assume you want to get the row keys starting with "a" and ending with "s1".
So you can use either of the following:
scan 't1', { ENDROW=>'s1'}
Or
scan 't1', {STARTROW=>'a', ENDROW=>'s1'}
Another option is using a regex string:
scan 't1', {FILTER => "RowFilter(=, 'regexstring:s1$')"}

Stored procedure/functions in SparkSql

Is there any way to achieve SQL features like stored procedures or functions in Spark SQL?
I'm aware of HPL/SQL and of coprocessors in HBase, but I want to know whether anything similar is available in Spark.
You may consider using a user-defined function (UDF) together with the built-in functions.
A quick example:
val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text")
val upper: String => String = _.toUpperCase

import org.apache.spark.sql.functions.udf
val upperUDF = udf(upper)

// Apply the UDF to change the source dataset
dataset.withColumn("upper", upperUDF('text)).show
Result:
+---+-----+-----+
| id| text|upper|
+---+-----+-----+
|  0|hello|HELLO|
|  1|world|WORLD|
+---+-----+-----+
We cannot create stored procedures or functions in Spark SQL. The best workaround is to create a temporary table (much like a CTE) and reuse it in later queries, or to create a UDF in Spark.
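For completeness, here is a small hedged sketch in PySpark (not from the original answers) showing how a UDF can be registered and then called from plain Spark SQL, which is the closest analogue to a scalar SQL function:

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# Register a Python function under the SQL name "upper_udf".
spark.udf.register("upper_udf",
                   lambda s: s.upper() if s is not None else None,
                   StringType())

df = spark.createDataFrame([(0, "hello"), (1, "world")], ["id", "text"])
df.createOrReplaceTempView("dataset")

# The registered UDF is now callable from SQL like an ordinary function.
spark.sql("SELECT id, text, upper_udf(text) AS upper FROM dataset").show()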

Transform a org.apache.spark.rdd.RDD[String] into Parallelized collections

I have a CSV file in HDFS with a collection of products like:
[56]
[85,66,73]
[57]
[8,16]
[25,96,22,17]
[83,61]
I'm trying to apply the Association Rules algorithm in my code. For that I need to run this:
scala> val data = sc.textFile("/user/cloudera/data")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/data MapPartitionsRDD[294] at textFile at <console>:38
scala> val distData = sc.parallelize(data)
But when I submit this I'm getting this error:
<console>:40: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: Seq[?]
Error occurred in an application involving default arguments.
val distData = sc.parallelize(data)
How can I transform an RDD[String] into a Seq collection?
Many thanks!
What you are facing is simple, and the error tells you exactly what is wrong:
sc.parallelize expects a local Seq() object, but you are passing it an RDD[String].
The RDD is already parallelized: the textFile method distributes the file's lines across your cluster.
You can check the method description here:
https://spark.apache.org/docs/latest/programming-guide.html
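Translated to PySpark for illustration (a hedged sketch, not from the original answer; the HDFS path is taken from the question), the point is that textFile() already gives you a distributed RDD you can transform directly, with no further parallelize() call:

from pyspark import SparkContext

sc = SparkContext(appName="parse-products")

data = sc.textFile("/user/cloudera/data")  # already a distributed RDD of lines
# Strip the surrounding brackets and split each line into a list of product ids.
transactions = data.map(lambda line: line.strip("[]").split(","))
print(transactions.take(3))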

BigQuery - Check if table already exists

I have a dataset in BigQuery. This dataset contains multiple tables.
I am doing the following steps programmatically using the BigQuery API:
1. Querying the tables in the dataset - since my response is too large, I am enabling the allowLargeResults parameter and diverting my response to a destination table.
2. I am then exporting the data from the destination table to a GCS bucket.
Requirements:
Suppose my process fails at Step 2, I would like to re-run this step.
But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.
If it exists, I would like to re-run step 2.
If it does not exist, I would like to do foo.
How can I do this?
Thanks in advance.
Alex F's solution works on v0.27, but will not work on later versions. In order to migrate to v0.28+, the below solution will work.
from google.cloud import bigquery

project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'

client = bigquery.Client(project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)

def if_tbl_exists(client, table_ref):
    from google.cloud.exceptions import NotFound
    try:
        client.get_table(table_ref)
        return True
    except NotFound:
        return False

if_tbl_exists(client, table_ref)
Here is a python snippet that will tell whether a table exists (deleting it in the process--careful!):
def doesTableExist(project_id, dataset_id, table_id):
    bq.tables().delete(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id).execute()
    return False
Alternately, if you'd prefer not to delete the table in the process, you could try:
def doesTableExist(project_id, dataset_id, table_id):
    try:
        bq.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except HttpError as err:
        if err.resp.status != 404:
            raise
        return False
If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just do the job anyway, and use WRITE_TRUNCATE as a write disposition.
Another approach can be to create a predictable job id, and retry the job with that id. If the job already exists, the job already ran (you might want to double check to make sure the job didn't fail, however).
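Both ideas could look roughly like this with the newer google-cloud-bigquery client (a hedged sketch, not from the original answer; the project/dataset/table names and the job id are placeholders):

from google.cloud import bigquery
from google.api_core.exceptions import Conflict

client = bigquery.Client()

job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string('my_project.my_dataset.xyz')
# Overwrite the destination table if it already exists.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

job_id = 'export-step-1-run-001'  # a deterministic id makes the submission idempotent
try:
    job = client.query('SELECT 1 AS x', job_config=job_config, job_id=job_id)
    job.result()  # wait for the query to finish
except Conflict:
    # A job with this id was already submitted; fetch it and check its state instead.
    job = client.get_job(job_id)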
Enjoy:
def doesTableExist(bigquery, project_id, dataset_id, table_id):
    try:
        bigquery.tables().get(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id).execute()
        return True
    except Exception as err:
        if err.resp.status != 404:
            raise
        return False
(Note: the exception handling in this snippet has been edited.)
You can now use exists() to check whether a dataset or a table exists; see the BigQuery exists() documentation.
Recently BigQuery introduced so-called scripting statements, which can be quite a game changer for some flows.
Check them out here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now, for example, to check whether a table exists you can use something like this:
sql = """
BEGIN
IF EXISTS(SELECT 1 from `YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE) THEN
SELECT 'table_found';
END IF;
EXCEPTION WHEN ERROR THEN
# you can print your own message like above or return error message
# however google says not to rely on error message structure as it may change
select ##error.message;
END;
"""
With my_bigquery being an instance of class google.cloud.bigquery.Client (already authenticated and associated with a project):
my_bigquery.dataset(dataset_name).table(table_name).exists() # returns boolean
It does an API call to test for the existence of the table via a GET request
Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists
It works for me using 0.27 of the Google Bigquery Python module
Inline SQL Alternative
tarheel's answer is probably the most correct at this point in time,
but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.
It's not the fastest, because it always has to run a query, and BigQuery has overhead for small queries.
A trick I've seen previously is to query information_schema for a (table) object, and union that with a fake query that ensures a record is always returned even if the object doesn't exist. There's also a LIMIT 1 and an ordering to ensure the single record returned represents the table, if it does exist. See the SQL in the code below.
In spite of the documentation's claim that BigQuery standard SQL is ISO compliant, it doesn't support information_schema, but it does have __TABLES_SUMMARY__.
A dataset is required because you can't query __TABLES_SUMMARY__ without specifying one.
The dataset is not a parameter in the SQL because you can't parameterize object names without SQL injection issues (apart from the magical _TABLE_SUFFIX; see https://cloud.google.com/bigquery/docs/querying-wildcard-tables).
#!/usr/bin/env python
"""
Inline SQL way to check a table exists in Bigquery
e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name>'))
True
print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery


def table_exists(dataset_name, table_name):
    client = bigquery.Client()
    query = """
        SELECT table_exists FROM
        (
            SELECT true as table_exists, 1 as ordering
            FROM __TABLES_SUMMARY__ WHERE table_id = @table_name
            UNION ALL
            SELECT false as table_exists, 2 as ordering
        ) ORDER BY ordering LIMIT 1"""
    query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]

    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    if dataset_name is not None:
        dataset_ref = client.dataset(dataset_name)
        job_config.default_dataset = dataset_ref

    query_job = client.query(
        query,
        job_config=job_config
    )

    results = query_job.result()
    for row in results:
        # There is only one row because of LIMIT 1 in the SQL
        return row.table_exists
