I am using HBase 0.98.4.
I want to retrieve data from an HBase table using a scanner with the Java API, where my
startRow is: username_uniqueId
stopRow is: username_uniqueId* (so anything can be appended here)
I set these parameters on the Scan object, but it is not fetching any data from the HBase table.
Basically, I want to fetch all records whose row key starts with a specific string.
For this I could use a PrefixFilter, but I came to know it hurts HBase performance because it scans the whole table, so I am avoiding it.
Does anyone have a better solution than using a PrefixFilter?
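For reference, a minimal sketch of the scan setup described above (the table name "mytable" is a placeholder; the rest is the standard 0.98 client API): the usual trick is to use the prefix itself as the start row and the prefix with its last byte incremented as the stop row, so the scanner only touches rows that begin with the prefix and no whole-table filter is involved.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScan {

    // Stop row = prefix with its last byte incremented (assumes the last byte is not 0xFF),
    // so the scan covers exactly the rows that start with the prefix.
    static byte[] stopRowForPrefix(byte[] prefix) {
        byte[] stop = java.util.Arrays.copyOf(prefix, prefix.length);
        stop[stop.length - 1]++;
        return stop;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");        // placeholder table name
        byte[] prefix = Bytes.toBytes("username_uniqueId");

        Scan scan = new Scan();
        scan.setStartRow(prefix);                          // first possible row with the prefix
        scan.setStopRow(stopRowForPrefix(prefix));         // exclusive upper bound
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}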
Related
Context:
I have data in a MySQL table with XML as one of the columns.
For example: the table application has 3 fields:
id (integer), details (xml), address (text)
(In the real case I have 10-12 fields here.)
Now we want to query the whole MySQL table, with all its fields, using Pig.
I transferred the data from MySQL into HDFS using Sqoop, with '\u0005' as the record delimiter and '`' as the column delimiter, writing to /x.xml.
Then I load the data from /x.xml into Pig using:
app = LOAD '/x.xml' USING PigStorage('\u0005') AS (id:int, details:chararray, address:chararray);
What is the best way to query such data?
Solutions I could currently think of:
Use a custom loader and extend LoadFunc to read the data (a rough sketch of such a loader is included at the end of this question).
If there is some way to load a particular column using an XPath loader while loading the rest normally, please suggest how this can be done.
All the examples I have seen that use XPath load the whole file with an XML loader.
For example:
A = LOAD 'xmls/hadoop_books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
Is Pig a good fit for querying this kind of data? Please suggest any alternative technologies that do this effectively.
The data is around 500 GB in size.
FYI, I am new to the Hadoop ecosystem and might be missing something trivial.
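A rough sketch of option 1 above, a custom loader extending LoadFunc (the class name XmlColumnLoader is hypothetical; only the '\u0005' delimiter comes from the question): it reads plain text lines, splits them on that delimiter, and returns each record as a tuple, so the XML column arrives as an ordinary chararray that can be parsed afterwards.

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class XmlColumnLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();       // the export is plain text, one record per line
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;                // end of this split
            }
            String line = reader.getCurrentValue().toString();
            // split on the same delimiter that PigStorage('\u0005') uses in the question
            String[] fields = line.split("\u0005");
            return tupleFactory.newTuple(Arrays.asList(fields));
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}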
Load a specific column:
Some other StackOverflow answers suggest preprocessing the data with awk (generating a new input that contains only the XML part).
A nicer workaround is to generate the specific data with an extra FOREACH on the xml column, like:
B = FOREACH app GENERATE details;
and store it so that it can be loaded with an XML loader.
Check the StreamingXMLLoader.
(You can also check Apache Drill; it may support this case out of the box.)
Or use a UDF for the XML processing, so in Pig you just hand over the relevant xml field (a sketch of such a UDF follows).
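A minimal sketch of that last option as a Pig EvalFunc written in Java (the UDF name XmlExtract is hypothetical, and it assumes each details value is a well-formed XML document): the UDF takes the xml chararray plus an XPath expression and returns the text of the first match.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.w3c.dom.Document;

// Hypothetical UDF: XmlExtract(xml, xpath) -> chararray
public class XmlExtract extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null;
        }
        String xml = (String) input.get(0);
        String expression = (String) input.get(1);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);   // text content of the first match
        } catch (Exception e) {
            throw new IOException("Failed to evaluate XPath on xml field", e);
        }
    }
}

From Pig it would then be used roughly like C = FOREACH app GENERATE id, XmlExtract(details, '/application/some/node'); after REGISTERing the jar that contains the UDF (the XPath expression here is only an example).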
I have a scenario I was working on with HBase. Initially I had to bulk-upload a CSV file to an HBase table, which I could do successfully using HBase bulk loading.
Now I want to update a particular field in the HBase table by comparing it against a new CSV file, and if the value has changed, maintain a flag that says the row key was updated. Any hints on how I can do this easily?
Any help is really appreciated.
Thanks
HBase maintains versions for each cell. As long as you have the row key, you have a handle on the row, and you can just use a Put to write the updated column. Internally HBase keeps the versions, so you also have access to the history of the updated values.
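A minimal sketch of such an update, with assumed table, family, and qualifier names (mytable, cf, field, updated): a single Put overwrites the field and sets a flag column on the same row, while the previous value stays available as an older cell version.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UpdateWithFlag {
    public static void updateField(String rowKey, String newValue) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // placeholder table name
        try {
            Put put = new Put(Bytes.toBytes(rowKey));
            // overwrite the field; the previous value is retained as an older cell version
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("field"), Bytes.toBytes(newValue));
            // flag column recording that this row key was updated
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("updated"), Bytes.toBytes("true"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}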
However, as I can see, you also need to do the comparison. So after bulk loading, the fastest way is to run a MapReduce job with HBase as both source and sink. Look at section 7.2.2 here.
The idea is to have MapReduce perform the scan, do the comparison in the map, and write the updated Put to the output. It's like a basic fetch, modify, and update sequence, but we use MapReduce's parallelism because we are dealing with a large amount of data.
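A rough sketch of that read-compare-write job, reusing the placeholder names from above; how the new CSV reaches the mappers (for example through the distributed cache) is only hinted at, not implemented.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class CompareAndUpdateJob {

    public static class CompareMapper extends TableMapper<ImmutableBytesWritable, Put> {
        // row key -> new value from the new CSV; how it is loaded is omitted in this sketch
        private Map<String, String> newValues = new HashMap<String, String>();

        @Override
        protected void setup(Context context) {
            // hypothetical: populate newValues here, e.g. from a CSV shipped via the distributed cache
        }

        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context context)
                throws IOException, InterruptedException {
            String rowKey = Bytes.toString(key.get());
            byte[] cur = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field"));
            String current = cur == null ? null : Bytes.toString(cur);
            String updated = newValues.get(rowKey);
            if (updated != null && !updated.equals(current)) {
                Put put = new Put(key.get());
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("field"), Bytes.toBytes(updated));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("updated"), Bytes.toBytes("true"));
                context.write(key, put);    // written straight back to the sink table
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "compare-and-update");
        job.setJarByClass(CompareAndUpdateJob.class);
        Scan scan = new Scan();
        scan.setCaching(500);               // bigger caching for full scans
        scan.setCacheBlocks(false);         // don't pollute the block cache from MapReduce
        TableMapReduceUtil.initTableMapperJob("mytable", scan, CompareMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("mytable", null, job);   // HBase as sink
        job.setNumReduceTasks(0);           // map-only: the Puts go directly to the table
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}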
I am wondering if there is a way to get previous versions of a particular row key in HBase, without having to write a MapReduce program, and average the values out. I was curious whether this is possible using Hive or Impala (or another similar tool) and how you would do it.
My table looks like this:
Composite key     | Value
(md5 + date + id) | (value)
I'd like to average all the values for a particular date and a substring of the id ("411"), across all versions.
Thanks ahead of time.
Impala uses the Hive metastore to map its logical notion of a table onto data physically stored in HDFS or HBase (for more details, see the Cloudera documentation).
To learn more about how to tell the Hive metastore about data stored in HBase, see the Hive documentation.
Unfortunately, as noted in the Hive documentation linked above:
there is currently no way to access the HBase timestamp attribute, and
queries always access data with the latest timestamp
There was some work done to add this feature against an older version of Hive in HIVE-2828, though unfortunately that work has not yet been merged into trunk.
So for your application you'll have to redesign your HBase schema to include a "version" column, tell the Hive metastore about this new column, and make your application aware of this column.
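A small sketch of what that write path could look like, with assumed names (table mytable, family d, qualifiers value and version): the version is stored as ordinary column data, and optionally appended to the row key, so that Hive or Impala can see every version instead of only the latest cell.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedWrite {
    public static void write(String md5, String date, String id, long version, double value)
            throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");      // placeholder table name
        try {
            // keep the existing composite row key, or append the version so every version stays a separate row
            byte[] rowKey = Bytes.toBytes(md5 + date + id + "_" + version);
            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(String.valueOf(value)));
            // explicit version column, queryable from Hive once it is added to the column mapping
            put.add(Bytes.toBytes("d"), Bytes.toBytes("version"), Bytes.toBytes(String.valueOf(version)));
            table.put(put);
        } finally {
            table.close();
        }
    }
}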
I have a question about loading data from Cassandra with Pig.
I am using Pig to load data from Cassandra with CqlStorage, like this:
data = LOAD 'cql://ks/cf' USING CqlStorage();
I want to load only some of the data by filtering. The columns I want to filter on are partition keys, and there is a bug affecting this (https://issues.apache.org/jira/browse/CASSANDRA-6151), so I cannot do it that way.
So I am planning to filter in Pig: b = FILTER data BY col1 == 'something';
My question is: does Pig load all the data from Cassandra and then filter, or does it push the filter condition down to CqlStorage so that only the required data is loaded from Cassandra?
I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
Out of all the data we have in Cassandra, we want to fetch data only for particular row keys, but we are unable to do so due to RandomPartitioner - there is an assertion in the code.
Can anyone please guide me on how to filter data by row key at the Cassandra level itself (I know the data is distributed using the hash of the row key)?
Would using secondary indexes (I am still trying to understand how they work) solve my problem, or is there some other way around it?
I want to use Cassandra MapReduce to calculate some KPIs on data that is stored in Cassandra continuously. Fetching the whole dataset from Cassandra every time seems like an overhead to me. The row key I'm using is like "(timestamp/60000)_otherid"; this CF contains references to the row keys of the actual data stored in another CF. So to calculate a KPI, I work on a particular minute, fetch the data from the other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns, not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end), false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
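A small sketch of that second option, assuming the Thrift-based ColumnFamilyInputFormat mapper signature (keys arrive as ByteBuffer, columns as a SortedMap of IColumn) and a hypothetical prefix check read from the job configuration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;
import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Filter in the map stage: rows whose keys fall outside the desired range are skipped
// before any real work is done.
public class FilteringMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, IntWritable> {

    private String wantedPrefix;   // e.g. "(timestamp/60000)_" for the minute being processed

    @Override
    protected void setup(Context context) {
        // hypothetical: the prefix for this run is passed in through the job configuration
        wantedPrefix = context.getConfiguration().get("kpi.row.prefix", "");
    }

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        String rowKey = ByteBufferUtil.string(key.duplicate());
        if (!rowKey.startsWith(wantedPrefix)) {
            return;   // outside the desired range: ignore this input key
        }
        // ... process the columns of the wanted row and emit results ...
        context.write(new Text(rowKey), new IntWritable(columns.size()));
    }
}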
I am unfamiliar with the Cassandra Hadoop integration, but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astyanax, etc.) and ask how to query by row keys with it.
Querying by the row key is a very common operation in Cassandra.
Essentially, if you still want to use RandomPartitioner and want the ability to do range slices, you will need to create a reverse index (a.k.a. inverted index). I have answered a similar question here that involved timestamps.
Being able to generate your row keys programmatically allows you to emulate a range slice on row keys. To do this, you must write your own InputFormat class and generate your splits manually.