I have a question about loading from Cassandra with Pig.
I am using Pig to load data from Cassandra with CqlStorage, like this:
data = LOAD 'cql://ks/cf' USING CqlStorage();
I want to load only a subset of the data by filtering, but the columns I want to filter on are partition keys, and there is a bug affecting this (https://issues.apache.org/jira/browse/CASSANDRA-6151), so I cannot filter on them at load time.
So I am planning to filter in Pig instead: b = FILTER data BY col1 == 'something';
My question is: does Pig load all the data from Cassandra and then filter it, or does it push the filter condition down to CqlStorage so that only the required data is read from Cassandra?
Related
I have a scenario where I need to extract data from multiple database tables, including the schema, combine them, and then write the combined data to an Excel file.
In NiFi the general strategy is to read from something like a fact table with ExecuteSQL or another SQL processor, then use LookupRecord to enrich the data from a lookup table. The catch in NiFi is that you can only handle one table at a time, so you'd need one LookupRecord for each enrichment table. You could then write to a CSV file that you can open in Excel. There might be extensions elsewhere that write directly to Excel, but I'm not aware of any in the standard NiFi distribution.
Context:
I have data in a MySQL table with XML in one column.
For example, the table application has 3 fields:
id (integer), details (xml), address (text)
(In the real case I have 10-12 fields.)
Now we want to query the whole table, with all its fields, using Pig.
I transferred the data from MySQL into HDFS using Sqoop, with
record delimiter '\u0005' and column delimiter '`', to /x.xml.
Then I load the data from x.xml into Pig using:
app = LOAD '/x.xml' USING PigStorage('\u0005') AS (id:int , details:chararray , address:chararray);
What is the best way to query such data?
Solutions I can currently think of:
Use a custom loader that extends LoadFunc to read the data (see the sketch after this list).
Load the XML column with an XPath-based loader while loading the rest normally, if there is some way to do that. Please suggest whether this can be done.
All the examples I have seen that use XPath rely on the XMLLoader when loading the file.
For example:
A = LOAD 'xmls/hadoop_books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
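For reference, option 1 would mean extending Pig's LoadFunc roughly as below. This is only a minimal sketch: the class name is hypothetical, and it assumes the Sqoop output described above ('\u0005' between records, '`' between columns), with every column, including the XML one, passed through as a chararray for later processing.

import java.io.IOException;

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical loader: reads records split on the configured record
// delimiter and breaks each record on the '`' column delimiter.
public class XmlColumnLoader extends LoadFunc {
    private RecordReader reader;
    private final TupleFactory tupleFactory = TupleFactory.getInstance();

    @Override
    public void setLocation(String location, Job job) throws IOException {
        // Tell the line reader to break records on the Sqoop record
        // delimiter instead of '\n'.
        job.getConfiguration().set("textinputformat.record.delimiter", "\u0005");
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null; // end of input
            }
            String record = reader.getCurrentValue().toString();
            // Split on the '`' column delimiter; the XML column could
            // also be parsed here instead of passed through as-is.
            String[] columns = record.split("`");
            Tuple tuple = tupleFactory.newTuple(columns.length);
            for (int i = 0; i < columns.length; i++) {
                tuple.set(i, columns[i]);
            }
            return tuple;
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}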
Is Pig a good fit for querying this kind of data? Please suggest any alternative technologies that handle it effectively.
The data is around 500 GB.
FYI, I am new to the Hadoop ecosystem and might be missing something trivial.
Load a specific column:
Some other StackOverflow answers suggest preprocessing the data with awk (generating a new input that contains only the XML part).
A nicer workaround is to generate the specific data with an extra FOREACH over the XML column, like:
B = FOREACH app GENERATE details;
and store it so that it can be loaded with an XML loader.
Check the StreamingXMLLoader.
(You can also check Apache Drill; it may support this case out of the box.)
Or use a UDF for the XML processing, so that in Pig you just hand over the relevant XML field.
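For the UDF route, a minimal sketch might look like the following (the class name and XPath expression are hypothetical; it parses the XML held in a chararray field and evaluates an XPath expression against it):

import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical UDF: evaluates an XPath expression against the XML in a
// chararray field and returns the result as a string.
public class ExtractXmlValue extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null;
        }
        String xml = (String) input.get(0);
        String expression = (String) input.get(1);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IOException("Failed to evaluate XPath on XML field", e);
        }
    }
}

In Pig you would then REGISTER the jar and hand over the XML field, e.g. B = FOREACH app GENERATE id, ExtractXmlValue(details, '/application/name');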
I am using HBase 0.98.4.
I want to retrieve data from an HBase table using a scanner with the Java API, where my
startRow is username_uniqueId
and my stopRow is username_uniqueId* (anything can be appended after the prefix).
I set these parameters on the Scan object, but it does not fetch any data from the table.
Basically I want to fetch all records whose row keys start with a specific string.
For this I could use a PrefixFilter, but I came to know it hurts HBase performance because it scans the whole table, so I am avoiding it.
Does anyone have a better solution, apart from using a PrefixFilter?
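A common approach (sketched below against the 0.98 client API, with a hypothetical table name) is to bound the scan yourself: set startRow to the prefix and stopRow to the prefix with its last byte incremented, so the scan only touches the matching key range. If your client version has it, Scan.setRowPrefixFilter(prefix) performs the same stop-row computation for you.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // hypothetical table name

        byte[] prefix = Bytes.toBytes("username_uniqueId");
        // Stop row = prefix with the last byte incremented, so the scan
        // covers exactly the keys starting with the prefix (this assumes
        // the last byte of the prefix is not 0xFF).
        byte[] stopRow = Arrays.copyOf(prefix, prefix.length);
        stopRow[stopRow.length - 1]++;

        Scan scan = new Scan();
        scan.setStartRow(prefix);
        scan.setStopRow(stopRow);

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}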
I have a setup with Hadoop, Cassandra, Pig, and MySQL.
My goal is to periodically read one month of data from Cassandra, process it, and write the result to MySQL.
What is the best practice here?
Should I load all the data and filter it down to one month in Pig, or filter while loading from Cassandra via CQL (using CqlStorage)?
The problem is:
if I filter while loading from Cassandra, Pig has a bug with the CQL where clause (https://issues.apache.org/jira/browse/CASSANDRA-6151);
or
the problem with the other solution, loading all the data and filtering in Pig, is that the data is very large: nearly 200 million records. Is it reasonable to load all of it, and if so, what about the performance and running time of the Pig script?
Is there a way to load records from HBase into a Pig relation based on the value of a particular column in HBase? Thank you.
If you look at the source code for the Pig HBase loader (HBaseStorage), you can see that it can filter on key range and timestamps, and it can get columns by prefix, but it cannot filter by column value.
You can write your own loader (even based on that code) and add the capability you need. Note that the performance of filtering on column values would not be great anyway, and filtering for that value in the mapper, while slower than filtering with an HBase filter, will not be that different (you'd basically be saving the interprocess communication from the region server to the Pig mapper).
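For comparison, pushing the predicate into an HBase-side filter with the plain Java client would look roughly like this (a sketch; the table, column family, qualifier, and value are hypothetical):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnValueScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable"); // hypothetical names below

        // Keep only rows whose cf:status column equals "ACTIVE".
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("ACTIVE"));
        // By default rows missing the column are passed through; drop them.
        filter.setFilterIfMissing(true);

        Scan scan = new Scan();
        scan.setFilter(filter);

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}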