How to cache and persist the temporary tables in Spark SQL? - caching

I have working code for reading a text file and using as a registered temporary table in memory. I would like to load a set of these tables using a script or a module import and then query them interactively. If if put this code into a script and a function, which is the object I should return? The sc context? The table? The HadoopRDD?
file = "/file.tsv"
lines = sc.textFile(file)
parts = lines.map(lambda l: l.split("\t")).filter(lambda line:len(line)==7)
active_sessions = parts.map(lambda p: Row(
session=p[0]
, user_id=p[1]
, created=p[2]
, updated=p[3]
, id=p[4]
, deleted=p[5]
, resource_id=p[6]))
schemaTable = sqlContext.inferSchema(active_sessions)
schemaTable.registerTempTable("active_sessions")
sqlContext.cacheTable("active_sessions")

I had the same issue and ended up returning:
return sqlContext.table("active_sessions")
I had registered it as a table rather than a temptable though, but it works with temptables as well.

Related

python-docx table styles not accepting styles

I am trying to make a table in Word with python-docx, but after creating the table and saving the file, the table doesn't have any lines / separators. So I tried to use the table.style option, but I just can't get any style to work, except Normal Table (which is the default).
This is the code I use to create the table:
import docx
file = docx.Document()
table = file.add_table(6, 4)
fRow = table.rows[0]
fRow[0].text = "some headline"
...
table.style = "<stylename here>"
file.save("test.docx")
All of the styles I tried are from this website:
https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html#built-in-styles
I am using Python 3.10.0b4 on Windows 11.
This is the code I used to accomplish this:
table = document.add_table(rows=1, cols=2, style='Table Grid')
for your line, believe it would be
table = file.add_table(6, 4, style='Table Grid')

Moving data to HBASE using Pig

I tried moving 851 data in my hbase for that i created hbase using below command
create 'customers', 'customers_data'
i moved the files using pig script. My pig script is
STOCK_A = LOAD '/user/cloudera/xxx' USING PigStorage('|');
data = FILTER STOCK_A BY ( $0 matches '.*MH.*');
MH_DATA = FOREACH data GENERATE $1, $3, $4;
STORE MH_DATA into 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
i got 851 data using my pig command. My data is
(aman,george,22)
(aman,george,22)
(aman,george,22)
.
.
.
.
.
851
but when i try to put this data in hbase using below command
PIG_CLASSPATH=/usr/lib/hbase/hbase.jar:/usr/lib/zookeeper/zookeeper-3.4.5-cdh4.4.0.jar /usr/bin/pig /home/cloudera/remot/pighl7
data that is getting stored in HBASE is
ROW COLUMN+CELL
\xB5~\x5C& column=customers_data:firstname, timestamp=1478700582076, value=george
\xB5~\x5C& column=customers_data:lastname, timestamp=1478700582076, value=22
I cant find my 851 records as well as the third parameter. I don't know what i am doing wrong.
Please help
I think you have missed giving alias in the generate statement (for safer side i have casted your tuples into chararray)
also at the end give name for you store relation
TRY:
MH_DATA = FOREACH data GENERATE (chararray)$1 AS firstname , (chararray)$3 AS lastname, (chararray)$4 AS age;
STORE_IN_HBASE = STORE MH_DATA into 'hbase://customers' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('customers_data:firstname, customers_data:lastname, customers_data:age');
for more information follow this link:
https://pig.apache.org/docs/r0.14.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
After doing a lot of research and trail and error when i changed the row key from name to timestamp i solved my problem, As i am using using row key which is having same name as of others it always updates it.

BigQuery - Check if table already exists

I have a dataset in BigQuery. This dataset contains multiple tables.
I am doing the following steps programmatically using the BigQuery API:
Querying the tables in the dataset - Since my response is too large, I am enabling allowLargeResults parameter and diverting my response to a destination table.
I am then exporting the data from the destination table to a GCS bucket.
Requirements:
Suppose my process fails at Step 2, I would like to re-run this step.
But before I re-run, I would like to check/verify that the specific destination table named 'xyz' already exists in the dataset.
If it exists, I would like to re-run step 2.
If it does not exist, I would like to do foo.
How can I do this?
Thanks in advance.
Alex F's solution works on v0.27, but will not work on later versions. In order to migrate to v0.28+, the below solution will work.
from google.cloud import bigquery
project_nm = 'gc_project_nm'
dataset_nm = 'ds_nm'
table_nm = 'tbl_nm'
client = bigquery.Client(project_nm)
dataset = client.dataset(dataset_nm)
table_ref = dataset.table(table_nm)
def if_tbl_exists(client, table_ref):
from google.cloud.exceptions import NotFound
try:
client.get_table(table_ref)
return True
except NotFound:
return False
if_tbl_exists(client, table_ref)
Here is a python snippet that will tell whether a table exists (deleting it in the process--careful!):
def doesTableExist(project_id, dataset_id, table_id):
bq.tables().delete(
projectId=project_id,
datasetId=dataset_id,
tableId=table_id).execute()
return False
Alternately, if you'd prefer not deleting the table in the process, you could try:
def doesTableExist(project_id, dataset_id, table_id):
try:
bq.tables().get(
projectId=project_id,
datasetId=dataset_id,
tableId=table_id).execute()
return True
except HttpError, err
if err.resp.status <> 404:
raise
return False
If you want to know where bq came from, you can call build_bq_client from here: http://code.google.com/p/bigquery-e2e/source/browse/samples/ch12/auth.py
In general, if you're using this to test whether you should run a job that will modify the table, it can be a good idea to just do the job anyway, and use WRITE_TRUNCATE as a write disposition.
Another approach can be to create a predictable job id, and retry the job with that id. If the job already exists, the job already ran (you might want to double check to make sure the job didn't fail, however).
Enjoy:
def doesTableExist(bigquery, project_id, dataset_id, table_id):
try:
bigquery.tables().get(
projectId=project_id,
datasetId=dataset_id,
tableId=table_id).execute()
return True
except Exception as err:
if err.resp.status != 404:
raise
return False
There is an edit in exception.
you can use exists() now to check if dataset exists same with table
BigQuery exist documentation
recently big query introduced so called scripting statements that can be quite a game changer for some flows.
check them out here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting
Now for example to check if table exists you can use something like this:
sql = """
BEGIN
IF EXISTS(SELECT 1 from `YOUR_PROJECT.YOUR_DATASET.YOUR_TABLE) THEN
SELECT 'table_found';
END IF;
EXCEPTION WHEN ERROR THEN
# you can print your own message like above or return error message
# however google says not to rely on error message structure as it may change
select ##error.message;
END;
"""
With my_bigquery being an instance of class google.cloud.bigquery.Client (already authentified and associated to a project):
my_bigquery.dataset(dataset_name).table(table_name).exists() # returns boolean
It does an API call to test for the existence of the table via a GET request
Source: https://googlecloudplatform.github.io/google-cloud-python/0.24.0/bigquery-table.html#google.cloud.bigquery.table.Table.exists
It works for me using 0.27 of the Google Bigquery Python module
Inline SQL Alternative
tarheel's answer is probably the most correct at this point in time
but I was considering the comment from Ivan above that "404 could also mean the resource is not there for a bunch of reasons", so here is a solution that should always successfully run a metadata query and return a result.
It's not the fastest, because it always has to run the query, bigquery has overhead for small queries
A trick I've seen previously is to query information_schema for a (table) object, and union that to a fake query that ensures a record is always returned even if the the object doesn't. There's also a LIMIT 1 and an ordering to ensure the single record returned represents the table, if it does exist. See the SQL in the code below.
In spite of doc claims that Bigquery standard SQL is ISO compliant, they don't support information_schema, but they do have __table_summary__
dataset is required because you can't query __table_summary__ without specifying dataset
dataset is not a parameter in the SQL because you can't parameterize object names without sql injection issues (apart from with the magical _TABLE_SUFFIX, see https://cloud.google.com/bigquery/docs/querying-wildcard-tables )
#!/usr/bin/env python
"""
Inline SQL way to check a table exists in Bigquery
e.g.
print(table_exists(dataset_name='<dataset_goes_here>', table_name='<real_table_name'))
True
print(table_exists(dataset_name='<dataset_goes_here>', table_name='imaginary_table_name'))
False
"""
from __future__ import print_function
from google.cloud import bigquery
def table_exists(dataset_name, table_name):
client = bigquery.Client()
query = """
SELECT table_exists FROM
(
SELECT true as table_exists, 1 as ordering
FROM __TABLES_SUMMARY__ WHERE table_id = #table_name
UNION ALL
SELECT false as table_exists, 2 as ordering
) ORDER by ordering LIMIT 1"""
query_params = [bigquery.ScalarQueryParameter('table_name', 'STRING', table_name)]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
if dataset_name is not None:
dataset_ref = client.dataset(dataset_name)
job_config.default_dataset = dataset_ref
query_job = client.query(
query,
job_config=job_config
)
results = query_job.result()
for row in results:
# There is only one row because LIMIT 1 in the SQL
return row.table_exists

Hive Columnar Loader in HDP2.0

I am using HDP 2.0 and running a simple Pig Script.
I have registered the below jars and I am then executing the below code (updated the schema) -
register /usr/lib/pig/piggybank.jar;
register /usr/lib/hive/lib/hive-common-0.11.0.2.0.5.0-67.jar;
register /usr/lib/hive/lib/hive-exec-0.11.0.2.0.5.0-67.jar;
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The problem is , Though the value for F is available in the Hive table, the result always writes 0 records into the output. But it is able to load all the records into A.
Basically the Filter function is not working. My Hive table is not partitioned. I beleive that the problem could be in HiveColumarLoade but not able to figure out what it is.
Please let me know if you are aware of a solution. I am struggling a lot with this.
Thanks a lot for the help!!!
Based on the pig 0.12 documentation HiveColumnarLoader appears to require an intermediate relation before you can filter on a non-partition value. Given that id is not a partition that appears to be your problem.
try this:
A = LOAD '/apps/hive/warehouse/test.db/hivetables' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('id int, name string,age
int,create_dt string,timestamp string,accno int');
B = FOREACH GENERATE A.id, A.name, A.age, A.create_dt, A.timestamp, A.accno;
F = FILTER A BY (id == 85986249 );
STORE F INTO '/user/test/Pigout' USING PigStorage();
The documentation all seems to say that for processing the actual values you need intermediate relation B.

how to export and import BLOB data type in oracle

how to export and import BLOB data type in oracle using any tool. i want to give that as release
Answering since it has a decent view count even with it being 5 year old question..
Since this question was asked 5 years ago there's a new tool named SQLcl ( http://www.oracle.com/technetwork/developer-tools/sqlcl/overview/index.html)
We factored out the scripting engine out of SQLDEV into cmd line. SQLDev and this are based on java which allows usage of nashorn/javascript engine for client scripting. Here's a short example that is a select of 3 columns. ID just the table PK , name the name of the file to create, and content the BLOB to extract from the db.
The script command triggers this scripting. I placed this code below into a file named blob2file.sql
All this adds up to zero plsql, zero directories instead just some sql scripts with javascript mixed in.
script
// issue the sql
// bind if needed but not in this case
var binds = {}
var ret = util.executeReturnList('select id,name,content from images',binds);
// loop the results
for (i = 0; i < ret.length; i++) {
// debug messages
ctx.write( ret[i].ID + "\t" + ret[i].NAME+ "\n");
// get the blob stream
var blobStream = ret[i].CONTENT.getBinaryStream(1);
// get the path/file handle to write to
// replace as need to write file to another location
var path = java.nio.file.FileSystems.getDefault().getPath(ret[i].NAME);
// dump the file stream to the file
java.nio.file.Files.copy(blobStream,path);
}
/
The result is my table emptied into files ( I only had 1 row ). Just run as any plain sql script.
SQL> #blob2file.sql
1 eclipse.png
blob2file.sql eclipse.png
SQL>

Resources