I am trying out a very simple example in HBase. Following is how I create table and put data:
create 'newdb3','data'
put 'newdb3','row1','data:name','Thexxx Beatles'
put 'newdb3','row2','data:name','The Beatles'
put 'newdb3','row3','data:name','Beatles'
put 'newdb3','row4','data:name','Thexxx'
put 'newdb3','row1','data:duration',400
put 'newdb3','row2','data:duration',300
put 'newdb3','row3','data:duration',200
put 'newdb3','row4','data:duration',100
scan 'newdb3', {COLUMNS => 'data:name', FILTER => "SingleColumnValueFilter('data','duration', > ,'binaryprefix:200')"}
But the result always contains all 4 rows. I tried the number with and without quotes, and using hex values. I also tried 'binary' instead of 'binaryprefix'. How do I store and compare integers in HBase?
Does this produce the expected output?
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'newdb3', { FILTER => SingleColumnValueFilter.new(Bytes.toBytes('data'), \
  Bytes.toBytes('duration'), \
  CompareFilter::CompareOp.valueOf('GREATER'), \
  BinaryComparator.new(Bytes.toBytes('200'))) }
NOTE: This does a binary comparison, so for numbers it will only work correctly if they are 0-padded to a fixed width.
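For completeness, here is a minimal sketch of the 0-padding idea from Python. It assumes the happybase client and a local HBase Thrift server, neither of which is part of the original question; durations are written as fixed-width strings so that the byte-wise comparison used by the filter matches numeric order.

import happybase

# Assumption: an HBase Thrift server is running on localhost.
connection = happybase.Connection('localhost')
table = connection.table('newdb3')

# Store durations as fixed-width, zero-padded strings so that the
# binary (byte-wise) comparison matches numeric order.
durations = {'row1': 400, 'row2': 300, 'row3': 200, 'row4': 100}
for row, duration in durations.items():
    table.put(row.encode(), {b'data:duration': f'{duration:08d}'.encode()})

# Only rows whose zero-padded duration is greater than '00000200' pass the filter.
filter_str = "SingleColumnValueFilter('data', 'duration', >, 'binary:00000200')"
for key, data in table.scan(filter=filter_str):
    print(key, data)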
I am using xpath in PySpark to extract values from XML that is stored as a column in a table.
The below works fine:
entity_id = "D8"
dfquestionstep = df_source_xml.selectExpr(
    "disclosure_entity_id",
    f'xpath(xml, "*/entities/entity[@type=\'TI\']/entity[@type=\'UNDERWRITING\']/entity[@type=\'DISCLOSURES\']/entity[@id=\'{entity_id}\']/entity[@type=\'DECISION_PATH\']/entity[@type=\'QUESTION_STEP\']/@id") QUESTION_STEP_ID'
)
PROBLEM
Now I want to pass disclosure_entity_id, which is a column in the dataframe with values like D8, D9 etc., in place of entity_id, i.e. entity[@id=disclosure_entity_id].
But all I get is [] as the result when I execute it like this, i.e. xpath fails to find anything.
Is there a way to pass the DF column directly as an argument to xpath like above?
Some testdata:
data = [
    ['a', '<x><a>a1</a><b>b1</b><c>c1</c></x>'],
    ['b', '<x><a>a2</a><b>b2</b><c>c2</c></x>'],
    ['c', '<x><a>a3</a><b>b3</b><c>c3</c></x>'],
]
df = spark.createDataFrame(data, ['col','data'])
Attempt 1:
Creating a column that holds an XPath expression is straightforward:
from pyspark.sql import functions as f
df.withColumn('my_path', f.concat(f.lit('//'), f.col('col'))) \
.selectExpr('xpath(data, my_path)').show()
But unfortunately the code above only yields the error message:
AnalysisException: cannot resolve 'xpath(`data`, `my_path`)' due to data type mismatch:
path should be a string literal; line 1 pos 0;
The path parameter of the xpath function has to be a constant string. This string is parsed before Spark even looks at the data.
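To illustrate the constraint with the test df above: the same call works as soon as the path is a literal, because Spark can parse it up front (a minimal sketch; the alias a_values is just an illustrative name).

# Works: the XPath is a string literal, known before execution.
df.selectExpr("xpath(data, '//a/text()') as a_values").show(truncate=False)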
Attempt 2:
Another option is to use a udf and process the XPath expression with standard Python functions inside the udf:
import xml.etree.ElementTree as ET
from pyspark.sql import types as T
def find_val(col, data):
    result = ET.fromstring(data).find(f'.//{col}')
    if result is not None:
        return result.text

find_val_udf = f.udf(find_val, returnType=T.StringType())
df.select('col', 'data', find_val_udf('col', 'data')).show(truncate=False)
Output:
+---+----------------------------------+-------------------+
|col|data |find_val(col, data)|
+---+----------------------------------+-------------------+
|a |<x><a>a1</a><b>b1</b><c>c1</c></x>|a1 |
|b |<x><a>a2</a><b>b2</b><c>c2</c></x>|b2 |
|c |<x><a>a3</a><b>b3</b><c>c3</c></x>|c3 |
+---+----------------------------------+-------------------+
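If you need a real attribute predicate per row, as in the original question's entity[@id=...], the same udf pattern can build the expression from the column value. This is only a sketch: the column names disclosure_entity_id and xml are taken from the question and assumed to exist in df_source_xml, and ElementTree only supports a limited subset of XPath (simple [@attr='value'] predicates do work).

import xml.etree.ElementTree as ET
from pyspark.sql import functions as f, types as T

def find_by_id(entity_id, xml_str):
    # Build the path per row; ElementTree handles [@id='...'] predicates.
    node = ET.fromstring(xml_str).find(f".//entity[@id='{entity_id}']")
    return node.text if node is not None else None

find_by_id_udf = f.udf(find_by_id, returnType=T.StringType())
# Hypothetical usage with the question's column names:
# df_source_xml.select(find_by_id_udf('disclosure_entity_id', 'xml')).show()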
I have a variable with values like 5600/06. This is a string variable but it got wrongly imported in date format. How can I change it back to string format?
Thanks in advance...
Presumably you're using proc import to import this data?
For this sort of "custom" import, you must instead copy the data step that proc import generates (it's in the log!) and adjust the relevant field. For example, if SAS produces:
data result;
infile "<your file>";
input good_variable $ bad_variable date9.;
run;
you must change the import to:
data result;
infile "<your file>";
input good_variable :$100. bad_variable :$100.;
run;
I need help with implementation best practices.
The operating environment is as follows:
Log data files arrive irregularly.
The size of a log data file ranges from 3.9KB to 8.5MB; the average is about 1MB.
The number of records in a data file ranges from 13 to 22,000 lines; the average is about 2,700 lines.
Data files must be post-processed before aggregation.
The post-processing algorithm can change.
Post-processed files are managed separately from the original data files, since the post-processing algorithm might change.
Daily aggregation is performed. All post-processed data files must be filtered record-by-record, and aggregations (average, max, min, …) are calculated.
Since aggregation is fine-grained, the number of records after aggregation is not that small. It can be about half the number of the original records.
At some point, the number of post-processed files can reach about 200,000.
A data file should be able to be deleted individually.
In a test, I tried to process 160,000 post-processed files with Spark, starting with sc.textFile() and a glob path, and it failed with an OutOfMemory exception in the driver process.
What is the best practice to handle this kind of data?
Should I use HBase instead of plain files to save post-processed data?
I wrote my own loader. It solved our problem with small files in HDFS. It uses Hadoop CombineFileInputFormat.
In our case it reduced the number of mappers from 100,000 to approximately 3,000 and made the job significantly faster.
https://github.com/RetailRocket/SparkMultiTool
Example:
import ru.retailrocket.spark.multitool.Loaders
val sessions = Loaders.combineTextFile(sc, "file:///test/*")
// or val sessions = Loaders.combineTextFile(sc, conf.weblogs(), size = 256, delim = "\n")
// where size is split size in Megabytes, delim - line break character
println(sessions.count())
I'm pretty sure the reason you're getting OOM is that you're handling so many small files. What you want is to combine the input files so you don't get so many partitions. I try to limit my jobs to about 10k partitions.
After textFile, you can use .coalesce(10000, false)... not 100% sure that will work though because it's been a while since I've done it, please let me know. So try:
sc.textFile(path).coalesce(10000, false)
You can use this approach.
First, get a Buffer/List of S3 paths (the same works for HDFS or local paths).
If you're trying it with Amazon S3, then:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest
def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and list-objects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)

  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to a list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Returning a Scala list
  files.asScala
}
Now pass this list object to the following piece of code. Note: sc is a SparkContext here (sc.textFile is a SparkContext method).
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileRdd = sc.textFile(file)
  if (df != null) {
    df = df.union(fileRdd)
  } else {
    df = fileRdd
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single big RDD:
val files = sc.textFile(filename, 1).repartition(1)
Repartitioning always works :D
I'm trying to store a set of records like these:
2342514224232 | some text here whatever
2342514224234| some more text here whatever
....
into separate files in the output folder like this:
output/2342514224232
output/2342514224234
The value of idstr should be the file name, and the text should be inside the file. Here's my Pig code:
REGISTER /home/bytebiscuit/pig-0.11.1/contrib/piggybank/java/piggybank.jar;
A = LOAD 'cleantweets.csv' using PigStorage(',') AS (idstr:chararray, createdat:chararray, text:chararray,followers:int,friends:int,language:chararray,city:chararray,country:chararray,lat:chararray,lon:chararray);
B = FOREACH A GENERATE idstr, text, language, country;
C = FILTER B BY (country == 'United States' OR country == 'United Kingdom') AND language == 'en';
texts = FOREACH C GENERATE idstr,text;
STORE texts INTO 'output/query_results_one' USING org.apache.pig.piggybank.storage.MultiStorage('output/query_results_one', '0');
Running this pig script gives me the following error:
<file pigquery1.pig, line 12, column 0> pig script failed to validate: java.lang.RuntimeException: could not instantiate 'org.apache.pig.piggybank.storage.MultiStorage' with arguments '[output/query_results_one, idstr]'
Any help is really appreciated!
Try this option:
MultiStorage('output/query_results_one', '0', 'none', ',');
The four arguments are the output parent path, the index of the field to split on ('0' here, i.e. idstr), the compression setting, and the field delimiter.
In case anybody stumbles across this post like I did, the problem for me was that my pig script looked like:
DEFINE MultiStorage org.apache.pig.piggybank.storage.MultiStorage();
...
STORE stuff INTO 's3:/...' USING MultiStorage('s3:/...','0','none',',');
The DEFINE statement was wrong because it did not specify the constructor arguments. Dropping the DEFINE statement and putting the fully qualified name directly in the STORE, as below, fixed my problem.
STORE stuff INTO 's3:/...' USING org.apache.pig.piggybank.storage.MultiStorage('s3:/...','0','none',',');
I would like to import a CSV file into Python with a FileChooser. Then, using rpy2, I can perform statistical analyses with R, which I know much better than Python. Below is a piece of my code:
import pygtk
pygtk.require("2.0")
import gtk
from rpy2.robjects.vectors import DataFrame
def get_open_filename(self):
    filename = None
    chooser = gtk.FileChooserDialog("Open File...", self.window,
                                    gtk.FILE_CHOOSER_ACTION_OPEN,
                                    (gtk.STOCK_CANCEL, gtk.RESPONSE_CANCEL,
                                     gtk.STOCK_OPEN, gtk.RESPONSE_OK))
    response = chooser.run()
    if response == gtk.RESPONSE_OK:
        don = DataFrame.from_csvfile(chooser.get_filename())
        print(don)
    chooser.destroy()
    return filename
When running the code, don is printed. But the question is: don contains two columns, X and Y, which I can't access to perform analyses. Thanks for your kind help.
Did you check the documentation about extracting elements from a DataFrame?
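For example, individual columns can be pulled out of an rpy2 DataFrame by name (a minimal sketch; don is the DataFrame from the code above, and 'X' and 'Y' are the column names mentioned in the question):

print(don.names)      # column names, e.g. ('X', 'Y')
x = don.rx2('X')      # extract the column named 'X', like don$X in R
y = don.rx2('Y')
print(list(x)[:5])    # rpy2 vectors can be converted to plain Python lists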